Online Learning in Event-based Restricted Boltzmann Machines


University of Zürich and ETH Zürich
Master Thesis

Online Learning in Event-based Restricted Boltzmann Machines

Author: Daniel Neil
Supervisors: Michael Pfeiffer and Shih-Chii Liu

A thesis submitted in fulfilment of the requirements for the degree of MSc UZH ETH in Neural Systems and Computation, carried out in the Sensors Group, Institute of Neuroinformatics.

October 2013

Declaration of Authorship

I, Daniel Neil, declare that this thesis titled, "Online Learning in Event-based Restricted Boltzmann Machines", and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:
Date:

Abstract

Online Learning in Event-based Restricted Boltzmann Machines, by Daniel Neil

Restricted Boltzmann Machines (RBMs) constitute the main building blocks of Deep Belief Networks and other state-of-the-art machine learning tools. It has recently been shown how RBMs can be implemented in networks of spiking neurons, which is advantageous because the necessary repetitive updates can be performed in an efficient, asynchronous, and event-driven manner. However, as with every previously known method for training RBMs, the training of event-based RBMs was performed offline. Offline training fails to exploit the computational advantages of spiking networks, and does not capture the online learning characteristics of biological systems. This thesis introduces the first online method of training event-based RBMs, combining the standard RBM training method, contrastive divergence (CD), with biologically inspired spike-based learning. The new rule, which we call evtCD, offers sparse and asynchronous weight updates in spiking neural network implementations of RBMs, and is the first online training algorithm for this architecture. Moreover, the algorithm is shown to approximate the previous offline training process. Training performance was evaluated on the standard MNIST handwritten digit identification task, achieving 90.4% accuracy when combined with a linear decoder on the features extracted by a single event-based RBM. Finally, evtCD was applied to the real-time output of an event-based vision sensor and achieved 86.7% accuracy after only 60 seconds of training time and presentation of less than 2.5% of the standard training digits.

Acknowledgements

This thesis could not have been accomplished without the original effort of Peter O'Connor and his boundless persistence to make event-based Deep Belief Networks possible. Saee Paliwal has contributed her endless support and incomparable mathematical ability during the course of this work, and without our discussions I would have likely floundered in the great space of ideas. A very large thanks to Michael Pfeiffer for his expert knowledge and his key insights into the work, which critically moved the project forward at various times when it stalled. Finally, this document will hopefully be just one of many written by me at the Institute of Neuroinformatics, thanks largely to the encouragement, insights, and support of Shih-Chii Liu. I am deeply indebted to all of you for your help.

Contents

Declaration of Authorship
Abstract
Acknowledgements
Contents

1 Introduction

2 A Background in Restricted Boltzmann Machines and Deep Learning
  2.1 A Historical Introduction to Deep Learning
  2.2 Energy-Based Models
  2.3 Products of Experts
  2.4 RBMs and Contrastive Divergence
  2.5 Extensions to Standard Learning Rules

3 Derivation of evtCD
  3.1 Spiking Restricted Boltzmann Machines
  3.2 evtCD, an Online Learning Rule for Spiking Restricted Boltzmann Machines

4 Implementation of evtCD
  4.1 Algorithm Recipe for Software Implementation
  4.2 Supervised Training with evtCD

5 Test Methodology
  MNIST
  Time-stepped Training Methodology
  Extracting Spikes From Still Images
  Online Training Methodology
  A Software Implementation of Fixational Eye Movements
  Training Environment
  Java Reference Implementation
  Quantification of evtCD Training

6 Improving Training through Parameter Optimizations
  Baseline Training Demonstration
  Learning Rate
  Number of Input Events
  Batch Size
  Noise
  Temperature
  Persistent Contrastive Divergence
  Bounded Weights
  Inverse Decay
  Training as a Feature Extractor
  Online Training with Spike-Based Sensors

7 Conclusions and Future Work
  Conclusions
  Future Work

A Java Implementation
B Matlab Implementation
C EyeMove Implementation

Bibliography

Chapter 1

Introduction

Deep networks, specifically Deep Belief Networks [1, 2] and Deep Boltzmann Machines [3], are achieving state-of-the-art performance on classification tasks for images, videos, audio, and text [4-10]. Importantly, Restricted Boltzmann Machines (RBMs) [11-13] underlie both of these approaches, and recent work [14-17] has strongly pushed to investigate the possibility of implementing RBMs on networks of spiking neurons. The reason for this is two-fold. First, fast and efficient silicon architectures [18-21] designed specifically to accelerate spiking neural networks are emerging, and RBMs composed of spiking neurons would pair progress in machine learning with novel and powerful computing architectures. Additionally, it has been shown that the accuracy of networks on classification tasks is guaranteed to improve with larger size or more layers [1, 14], implying that higher accuracy could be achieved simply by scaling up the network. Scale is therefore an important factor, and these neural network accelerators are all designed to run large networks of spiking neurons faster than a general-purpose computer. Second, RBMs implemented with spiking neurons offer a fundamental advantage over traditional RBM approaches when used in very large networks. As [14] points out, scale is not the dominant factor in processing time for the brain, unlike standard computational approaches, because the processing units are both parallel and event-driven. The brain adapts its processing speed to the rate of input, so the computational effort is proportional to the number of events. This so-called event-driven computational approach is a hallmark of neuromorphic designs that seek inspiration from the brain to build event-based, asynchronous, and often low-power silicon systems [22-25]. The advantages of event-driven computation are well studied [16, 26, 27], and certain types of spiking neuron models can be implemented entirely in event-driven systems [28, 29]. Additionally, current event-driven neuromorphic sensors produce sparse outputs for vision [24, 27, 30] and audition [31].

By designing algorithms that can run on spiking neural networks, it is possible to construct a complete hardware system using event-driven computation alone. However, the pre-existing implementations of RBMs composed of spiking neural networks [14-16] all use an offline and synchronous training algorithm. No training algorithm has yet been discovered for online training of RBMs composed of spiking neurons. The main question this thesis is concerned with is the following: can RBMs composed of spiking neurons be trained online? Specifically, there are three subgoals:

1. Derive a rule for online learning of an RBM composed of spiking neural networks;
2. Design an event-driven, asynchronous implementation of this rule to achieve high performance in scalable systems;
3. Demonstrate this training rule's effectiveness on a common benchmark task.

In this thesis, I will introduce evtCD, an online learning rule inspired by biological rules of spike-timing-dependent plasticity (STDP) [32-34] that trains an RBM composed of leaky integrate-and-fire (LIF) spiking neurons. This training algorithm can be run in an entirely event-driven way by combining updates from STDP with insights from contrastive divergence (CD) learning, the standard method of learning for RBMs. Beyond the description of this new algorithm, I demonstrate a proof sketch of how this learning rule approximates a previously demonstrated offline learning rule introduced in [14], which in turn arises from contrastive divergence learning. Finally, this thesis contributes a real-time implementation of evtCD to demonstrate the efficiency of this learning rule. An event-based image sensor [24], which produces image events instead of image frames, generates spike trains which are used to run the evtCD training algorithm on populations of spiking neurons. When applied to the commonly used MNIST handwritten digit identification task [35], the receptive fields of the neurons learn digit parts as they do in the standard frame-based training. After sixty seconds of real-time learning, the network learns features that perform better than an optimal linear decoder on the raw digits, achieving a classification accuracy of 86.7%.

This thesis is structured as follows. In Chapter 2, the history of deep learning is introduced, and the derivations of previous methods are analyzed for their applicability to spiking neural networks. Chapter 3 introduces evtCD and links the algorithm to the previously successful rate-based offline learning algorithm of [14]. In Chapter 4, an algorithm recipe is given for implementing evtCD in software, and supervised learning with the evtCD algorithm is explained. Following that, Chapter 5 explains the testing methodology and setup for obtaining results that analyze evtCD training. Chapter 6 then studies the behaviour of the algorithm under various parameters and extensions, and demonstrates rapid real-time learning of the MNIST handwritten digit dataset. Finally, Chapter 7 concludes by introducing ideas for future work.

Chapter 2

A Background in Restricted Boltzmann Machines and Deep Learning

This chapter will introduce the prior work on RBMs and deep learning to lay the foundations for the introduction of the evtCD learning rule. In Section 2.1, a historical introduction will give an overview of the history and intuition of training RBMs. Subsequently, Sections 2.2 through 2.4 will focus on reproducing the mathematical derivations of RBM learning rules to understand the assumptions in them. Finally, standard extensions to the contrastive divergence learning rule will be briefly explained in Section 2.5, as they form the basis for many of the investigations found in Chapter 6.

2.1 A Historical Introduction to Deep Learning

The origins of deep learning begin with Boltzmann machines [36]. Introduced in 1986, Boltzmann machines are undirected bipartite probabilistic generative models. Though they are composed of two layers ("bipartite"), the connections between these layers are bidirectional (hence "undirected"), allowing the system either to pull external states into its internal representation or to generate data from that internal representation. These machines are probabilistic in that they encode the probabilities of the types of inputs on which they are trained, and are modeled on a physical analogy to distributions of matter that probabilistically settle into low-energy configurations. Finally, Boltzmann machines are generative models, meaning they are capable of producing data that look like the inputs on which they have been trained. If, for example, a network is

trained on handwritten digits, a Boltzmann machine will, after training, produce digit-like patterns on the visible part of the system when allowed to sample freely from the distribution specified by the weights in the system. Section 2.2 addresses their mathematical formulation. Boltzmann machines are trained using a computationally intensive process in which the machines are annealed into low-energy states, and these states are used to guide a training algorithm to model the joint probabilities of the inputs presented to them (Equation 2.20). These machines are typically discussed as having a visible state and a hidden state, shown in Figure 2.1, in which the visible state corresponds to the data that is fed into the machine and the hidden state corresponds to some abstracted representation hidden from the outside world. The goal of the Boltzmann machine is to use the energy dynamics of the system to learn arbitrary distributions of the input data. The relationships that specify the distribution are mapped through the connection weights (Equation 2.2), which force the hidden units to represent the input distributions and cause the low-energy states to correspond to probable configurations of the system [36]. Ultimately, this distribution matching is a very important task for learning [37]. It is a form of unsupervised learning in which the goal of the system is to build a probability distribution that can arbitrarily closely approximate an input distribution, learning only from samples from this unknown input distribution. This is an important goal because it means the system can perform inference on that model, calculate likelihoods of a given input, and produce samples similar to those it has been trained upon [13, 36, 37]. As will be shown later in this work, it also allows a form of supervised learning if the labels are presented as part of the joint distribution the system needs to learn [2, 13]. RBMs, which were introduced under the name Harmoniums by [11], are a slight modification of the Boltzmann machine in which intra-layer connections have been removed to make units in the same layer conditionally independent, as seen in Figure 2.1. As mentioned before, both Boltzmann and Restricted Boltzmann machines can be trained in an unsupervised fashion, which means that no training labels are necessary for the system to learn the joint distribution of its inputs. Unfortunately, training a Boltzmann machine through simulated annealing takes considerable computational time because the system must be allowed to stabilize to an equilibrium. This is necessary to obtain a single sample from the data and model distributions, which are used to calculate the gradient of learning for minimizing the difference between these two distributions (see Equation 2.20 in Section 2.2), and thus is the limiting factor for training on a dataset. If every weight update takes a significant amount of time to calculate (i.e., using Equation 2.20), training becomes impractical; this cost hindered the adoption of Boltzmann machines [13, 37].

[Figure 2.1: Diagrammatic view of a Boltzmann machine and a Restricted Boltzmann machine. Note that the Restricted Boltzmann machine lacks intra-layer connections. Figure taken from [14].]

The Restricted Boltzmann machine addresses one significant issue of training the Boltzmann machine: the difficulty of obtaining a true sample from it. By removing intra-layer connections, inference becomes tractable within this model, as recognized by [12], and Gibbs sampling can be used to infer likely states of the model instead of annealing the system into equilibrium [12, 13]. After their initial introduction in the mid-1980s, progress on RBMs was sporadic and largely stalled in favor of training supervised shallow architectures with backpropagation, as in [38]. Moreover, shallow architectures are already universal, meaning that a shallow architecture could theoretically approximate any function given enough units. Although deep networks can be exponentially more efficient than a single layer [37], no effective training algorithms had yet been created that could train them. Unfortunately, standard error-gradient techniques like backpropagation assign excessive error updates to the final layer in a deep architecture. The error used in learning effectively disappears after being propagated back through all layers [1, 37], causing very little training signal to reach the initial layers. This results in over-learning of the top layer and under-learning in lower layers, using the resources of a deep network inefficiently, and ultimately performing worse than a single well-trained layer [37, 39]. An alternative approach, championed by Yann LeCun (for example, [37, 40]), was to use convolutional networks. These networks have templates mapped across the

input space, combined with transformation layers. This effectively decreases the number of independent weights and allows for more efficient training, as well as making the system more robust to certain transformations; depending on the architecture, it can be made more robust to both translation and scaling. Unfortunately, this approach imposes certain assumptions about the inputs (translation invariance, for example), and ultimately gives up the descriptive power of large numbers of weights in favor of simpler, more effective training. Then, in 2002, work was published which began shifting progress back towards deep networks [2, 13]. Data was becoming very plentiful, but, unfortunately, most of that data was unlabeled; a simple Internet search could easily yield troves of data, but it is not structured in ways that allow computers to easily learn from it [13]. To take advantage of this available information, unsupervised learning is a very powerful tool, because it allows a computer to pre-learn from large volumes of unlabeled data, learning about the differences between the classes that it sees. Then, when presented with labels, it can fine-tune its learning with a final supervised step. RBMs, however, took far too long to train using simulated annealing, but in the 2002 paper [13] Hinton discovered a very effective approximation that works well in practice and enabled the rise of deep networks [37]. One way to obtain a sample from the RBM's model distribution is to begin a Markov chain starting from a current data sample. This Markov chain can alternately draw samples from the hidden layer given the visible, and the visible layer given the hidden (see Figure 2.2). This process is equivalent to a Markov Chain Monte Carlo (MCMC) sampling method known as Gibbs sampling, and Gibbs sampling in this context converges to the stationary distribution specified by the RBM regardless of the starting point. Informally, this means that no matter how the system is initialized, this process of repeatedly generating samples from the system causes those samples to gradually become closer to the distribution specified by the system. The first sample may be random data, but by the time the energy dynamics have adjusted the activations many, many times, the samples begin to be the types of samples specified by the RBM.

[Figure 2.2: The Markov chain used in contrastive divergence. Gibbs steps are taken to create samples closer to the equilibrium distribution than the original data sample.]

Gibbs sampling is designed to generate approximate samples from a distribution where direct sampling is difficult. The key insight that underlies Gibbs sampling is that sampling from a conditional distribution may be easier than obtaining pure unbiased samples from the joint distribution. The process is surprisingly simple: for all joint variables in the distribution, hold all but one fixed, and draw a sample of that variable conditioned on the others. In the next step, use the updated sample value for that variable to draw a sample for a different variable. Intuitively, the process walks the samples towards the distribution specified by the parameters, and given infinitely many steps will yield true samples from that distribution. In practice, this process cannot be repeated indefinitely, hence the use of simulated annealing to try to stabilize the system with an energy-cooling process. Unfortunately, it cannot be known a priori how many steps are necessary to obtain a sufficiently faithful sample from the model distribution, and in practice this number can be quite large and also quite variable. However, the key insight of the [13] paper is that a single step is sufficient to learn effectively. It is not obvious that this should be the case, and the next section deals with the mathematics that explain why this assumption is valid. Intuitively, this single-step sampling method, known as CD-1 (see Equation 2.33), works because a sample drawn from a Gibbs step is closer to the model distribution specified by the RBM than the input data. Even though this model sample is still highly correlated with the current input, it still contains information about the gradient that would decrease the difference between the model distribution and the data distribution, and this single sample can be used to approximate a sample from the model distribution. A step then taken to minimize the difference between the model sample and the data sample will make the model distribution more like the data distribution. In practice, a single Gibbs step is quick to calculate and very effective [2, 13, 37]. The second key insight, which appeared a few years later in [2], was to use RBMs as building blocks in training deep networks. Each layer in a deep network can be trained as an RBM, and the whole system can be composed out of stacked RBMs. When training, the first layer is trained in an unsupervised way with the visible layer as the input layer and the layer above as the hidden layer. Then, to train the next layer, the learned weights from the first layer are fixed, and its hidden layer becomes the visible layer of the next RBM. This process continues until the whole network is trained in a greedy layer-wise fashion [39]. These deep networks will only be very sparingly addressed in this thesis, although there is much more room to explore. Now, the mathematics of these systems should be examined to firmly ground the theory on which the rest of this thesis is based.
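
To make the alternating sampling procedure concrete, the following is a minimal Java sketch of one Gibbs step in an RBM with sigmoid units. All names and the tiny API here are illustrative assumptions for exposition, not taken from the reference implementation in the appendices.

```java
import java.util.Random;

/** Minimal sketch of alternating Gibbs sampling in an RBM with sigmoid
 *  units. W is indexed [visible][hidden]; b and c are the visible and
 *  hidden biases, matching the energy function of Section 2.2. */
public class GibbsStep {
    static final Random rng = new Random(42);

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    /** Sample hidden units given visible units: p(h_j = 1 | v). */
    static int[] sampleHidden(int[] v, double[][] W, double[] c) {
        int[] h = new int[c.length];
        for (int j = 0; j < c.length; j++) {
            double act = c[j];
            for (int i = 0; i < v.length; i++) act += W[i][j] * v[i];
            h[j] = rng.nextDouble() < sigmoid(act) ? 1 : 0;
        }
        return h;
    }

    /** Sample visible units given hidden units: p(v_i = 1 | h). */
    static int[] sampleVisible(int[] h, double[][] W, double[] b) {
        int[] v = new int[b.length];
        for (int i = 0; i < b.length; i++) {
            double act = b[i];
            for (int j = 0; j < h.length; j++) act += W[i][j] * h[j];
            v[i] = rng.nextDouble() < sigmoid(act) ? 1 : 0;
        }
        return v;
    }
}
```

Calling sampleHidden and then sampleVisible once constitutes one full Gibbs step; repeating the pair walks the chain towards the RBM's stationary distribution.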

2.2 Energy-Based Models

We begin with a description of energy-based models, a class which contains both the original Boltzmann machine and the Restricted Boltzmann machine. This proof follows the outline given in the Appendix of [36] and updates the notation to be consistent with the formulae given elsewhere in this manuscript. In analogy to a physical system, we begin by defining the probability of a visible state in the Boltzmann machine as exponential in the energy:

$$P(v) = \sum_h P(v, h) = \frac{\sum_h e^{-E(v,h)}}{\sum_{x,h} e^{-E(x,h)}} \qquad (2.1)$$

Here, the vector v denotes the current visible state of the system, recalling that the Boltzmann machine and RBM are both bipartite machines composed of a visible state v and a hidden state h. This equivalence marginalizes out the hidden state vector h given some visible vector v, and additionally relates the energy function to the probability of a state v. The normalizing factor in the denominator on the right of Equation 2.1, called the partition function by analogy to physics, simply rescales the energy of the current visible state by the sum over all possible system states. Note that, to avoid confusion, x is used for the visible vector in the partition function summation rather than v. With the symbol ⊤ indicating the transpose operation, b an energy bias on the visible states, c an energy bias on the hidden states, and a weight matrix W providing an energy description of the joint activations of the hidden and visible states, the energy of a system configuration can be defined as:

$$E(v, h) = -b^\top v - c^\top h - h^\top W v \qquad (2.2)$$

Or, equivalently, as the sum over i visible neurons and j hidden neurons:

$$E(v, h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} w_{ij} v_i h_j \qquad (2.3)$$

The goal of this derivation will be to determine a method of updating the weights W to force the system to achieve certain co-activations of v and h, and the tool used will be minimizing the KL divergence. The KL divergence is a directional measure of similarity between distributions, and in this case we want to minimize the difference between the distribution of the visible data the machine is presented with and the distribution of data it generates.
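
As a concrete check of Equation 2.3, consider a toy configuration with two visible units and one hidden unit; all numbers here are invented purely for illustration:

$$v = (1, 0), \quad h = (1), \quad b = (0.1, -0.2), \quad c = (0.3), \quad w_{11} = 0.5, \quad w_{21} = -0.4$$

$$E(v, h) = -(0.1 \cdot 1 + (-0.2) \cdot 0) - (0.3 \cdot 1) - (0.5 \cdot 1 \cdot 1 + (-0.4) \cdot 0 \cdot 1) = -0.9$$

By Equation 2.1, lower energies correspond to higher probabilities, so configurations whose co-active units share large positive weights are the ones the machine considers likely.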

Ideally, if we can find the gradient that minimizes the difference between the distribution of input states forced by the outside world and the states the system settles into according to its energy function, the system will eventually match the distribution of the input states specified by the data. Let P⁺(v, h) denote the probability of the input data state (v, h), and P⁻(v, h) denote the model probability of a state (v, h).

$$D_{KL}(P^+ \,\|\, P^-) = \sum_{v,h} P^+(v, h) \log \frac{P^+(v, h)}{P^-(v, h)} \qquad (2.4)$$

Noting that P⁺(v, h) is the true probability of the data and is independent of the parameters, it is a constant when taking the gradient with respect to w_ij:

$$\frac{\partial D_{KL}(P^+ \| P^-)}{\partial w_{ij}} = -\sum_{v,h} \frac{P^+(v, h)}{P^-(v, h)} \frac{\partial P^-(v, h)}{\partial w_{ij}} \qquad (2.5)$$

Now we seek the term ∂P⁻(v, h)/∂w_ij, which we can calculate given the probability-energy relation in Equation 2.1 and the energy-weight relation in Equation 2.3. We pre-compute the partial derivative of the numerator in Equation 2.1 by using Equation 2.3:

$$\frac{\partial e^{-E(v,h)}}{\partial w_{ij}} = v_i h_j \, e^{-E(v,h)} \qquad (2.6)$$

This term is then used in taking the full derivative ∂P⁻(v)/∂w_ij:

$$\frac{\partial P^-(v)}{\partial w_{ij}} = \frac{\sum_h v_i h_j \, e^{-E(v,h)}}{\sum_{x,h} e^{-E(x,h)}} - \frac{\left( \sum_h e^{-E(v,h)} \right) \sum_{x,h} x_i h_j \, e^{-E(x,h)}}{\left( \sum_{x,h} e^{-E(x,h)} \right)^2} \qquad (2.7)$$

which can be simplified using the marginal probabilities in Equation 2.1:

$$\frac{\partial P^-(v)}{\partial w_{ij}} = \sum_h v_i h_j \, P^-(v, h) - P^-(v) \sum_{x,h} x_i h_j \, P^-(x, h) \qquad (2.8)$$

The expression found in Equation 2.8 is the term sought for the KL divergence, and can be substituted:

Writing G for the KL divergence of Equation 2.4, the substitution gives:

$$-\frac{\partial G}{\partial w_{ij}} = \sum_{v,h} \frac{P^+(v)}{P^-(v)} \, v_i h_j \, P^-(v, h) - \sum_v \frac{P^+(v)}{P^-(v)} \, P^-(v) \sum_{x,h} x_i h_j \, P^-(x, h) \qquad (2.9)$$

By the rules of conditional probability:

$$P^+(v, h) = P^+(h|v) \, P^+(v) \qquad (2.10)$$

$$P^-(v, h) = P^-(h|v) \, P^-(v) \qquad (2.11)$$

And since the hidden states are chosen according to the same parameterized model, regardless of whether the visible states are given by the environment or chosen from the model distribution:

$$P^-(h|v) = P^+(h|v) \qquad (2.12)$$

Using these facts to prepare a term and simplify:

$$P^-(v, h) \, \frac{P^+(v)}{P^-(v)} = P^-(h|v) \, P^-(v) \, \frac{P^+(v)}{P^-(v)} \qquad (2.13)$$

$$= P^-(h|v) \, P^+(v) = P^+(h|v) \, P^+(v) \qquad (2.14)$$

$$= P^+(v, h) \qquad (2.15)$$

Of course, since we are dealing with probabilities:

$$\sum_v P^+(v) = 1 \qquad (2.16)$$

$$\sum_v P^-(v) = 1 \qquad (2.17)$$

Substituting all this into Equation 2.9, reproduced here, simplifies the equation:

$$-\frac{\partial G}{\partial w_{ij}} = \sum_{v,h} \frac{P^+(v)}{P^-(v)} \, v_i h_j \, P^-(v, h) - \sum_v \frac{P^+(v)}{P^-(v)} \, P^-(v) \sum_{x,h} x_i h_j \, P^-(x, h) \qquad (2.18)$$

$$= \sum_{v,h} v_i h_j \, P^+(v, h) - \sum_{v,h} v_i h_j \, P^-(v, h) \qquad (2.19)$$

This is, in fact, just the difference between the expectations under the data distribution and the model distribution. Therefore, a weight update that decreases the distance between the model distribution P⁻ and the data distribution P⁺ is proportional to the difference in expectations of the binary product v_i h_j between the data and model distributions:

$$\Delta w_{ij} \propto \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \qquad (2.20)$$

Unfortunately, at this point training requires obtaining the expectation from the model distribution, a difficult problem that took decades to work around. However, this formulation appears many times in subsequent derivations and is an important result to build on. Ultimately, contrastive divergence, the Siegert approach, and evtCD all work by generating an estimate of the correlations between visible and hidden layers, and then minimizing the difference between the correlations the model generates and the correlations the data produces, so this is an important derivation to understand.

2.3 Products of Experts

By a very different route, a similar formulation was described in 2002 with the invention of products of experts (POE) [13], yielding the contrastive divergence learning rule. The derivation of the POE given here, following [13], will incorporate lessons from Gibbs sampling that are valid on POEs to yield a training rule that uniquely works for RBMs. Imagine a system composed of n experts, each of which independently computes the probability of some data vector d out of all possible data vectors c. These experts then multiply their probabilities together to combine their opinions on the likelihood of the data. The intuition for this model is that the product of experts can constrain the system so that the final result is a sharper distribution than that of any individual model, unlike mixture models, where each expert is summed with the other experts. In mixture models, the final distribution will be at least as broad as the tightest individual distribution. In contrast, a POE system can have individual experts that specialize in different dimensions, each of which constrains different attributes, to yield a very sharp posterior.

For problems that factorize well into component decisions, this is very advantageous; for example, a system designed to detect human beings can be composed of a torso detector, a head detector, and a limb detector, whose individual contributions can be combined to yield an overall human detector made of factorized experts. Many real-world problems factorize into attributes and provide the motivation for POE systems. Moreover, RBMs are POE models (other types of experts are also allowed), but Boltzmann machines are not POEs. The POE can be formulated as follows:

$$p(d \,|\, \theta_1 \ldots \theta_n) = \frac{\prod_m p_m(d \,|\, \theta_m)}{\sum_c \prod_m p_m(c \,|\, \theta_m)} \qquad (2.21)$$

To train the POE, one possible goal is to maximize the probability of the data. Equivalently, the derivative of the log likelihood of the data d can be calculated with respect to the parameters θ_m. Since the models are independent, with their probabilities multiplied together, the derivative of the log probability with respect to the model parameters θ_m can be found as follows:

$$\frac{\partial \log p(d \,|\, \theta_1 \ldots \theta_n)}{\partial \theta_m} = \frac{\partial \log p_m(d \,|\, \theta_m)}{\partial \theta_m} - \sum_c p(c \,|\, \theta_1 \ldots \theta_n) \frac{\partial \log p_m(c \,|\, \theta_m)}{\partial \theta_m} \qquad (2.22)$$

The first term, the derivative of the log probability of the data sample with respect to the parameters θ_m, is controllable by design. If an expert model is chosen for which it is possible to find the derivative of the probability of the data with respect to the parameters, then this is straightforward to calculate. Commonly, the expert used is a sigmoid:

$$p_m = \frac{1}{1 + e^{-(b + \sum_j x_j w_j)}} \qquad (2.23)$$

which yields a probability p_m based on the inputs x_j given, respectively, the bias and weight parameters θ_m = {b, w}. It is clear, in this case, that the first term in Equation 2.22 is easily calculable; for example, with respect to a weight w_j:

$$\frac{\partial \log p_m(d \,|\, \theta_1 \ldots \theta_n)}{\partial w_j} = \frac{x_j}{1 + e^{\,b + \sum_k x_k w_k}} \qquad (2.24)$$
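
For completeness, the reconstructed Equation 2.24 follows from the chain rule. Writing z = b + Σ_j x_j w_j for the sigmoid's argument, a short worked step (under the assumption that the derivative is taken with respect to a weight w_j):

$$\log p_m = -\log\left(1 + e^{-z}\right), \qquad \frac{\partial \log p_m}{\partial z} = \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{z}} = 1 - p_m$$

$$\frac{\partial z}{\partial w_j} = x_j \quad\Rightarrow\quad \frac{\partial \log p_m}{\partial w_j} = (1 - p_m)\, x_j$$

The gradient is therefore large when the expert assigns the data low probability, and vanishes as p_m approaches 1.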

Returning to Equation 2.22: though its first term is easily calculable, the second term is more challenging. The second term combines the probability of the product of experts with the log probability of a single expert, summed over all data vectors c. For any given data vector c, it is straightforward to calculate this term because it is the product of two values: the POE probability p(c | θ_1...θ_n) and the derivative of the log likelihood as in Equation 2.24. With sufficient time to evaluate all data vectors c this would not be a problem, but the size of the possible state space likely precludes enumerating all possible vectors. For example, for a vector of k binary neurons, the number of combinations is exponential in k; there will be 2^k possible vectors, a number which grows very rapidly beyond available computational power. However, this second term in Equation 2.22 can alternately be viewed as the expected derivative of the log probability of an expert under the model distribution. This means that if accurate samples can be drawn from the model distribution, the expectation can be approximated. To obtain samples from the model distribution, Gibbs sampling can be used to draw samples from a POE, unlike for Boltzmann machines. Here, the architecture of the POE (and of RBMs) becomes very important: with no intra-layer connections, all units are conditionally independent of the other units in their layer. Since this is true, a Gibbs sampling chain can be run in which all hidden units are updated in parallel while visible units remain fixed, then all visible units are updated in parallel while hidden units remain fixed. Once this MCMC chain converges to the equilibrium distribution, the expectation can be calculated from the samples produced. However, there is a faster way, which Hinton introduces in [13]. Imagine that the data is produced by some true distribution Q⁰. For a concrete example, learning the joint probabilities of pixels being on together is a common task for a visual classifier, because the machine learns the relationships between pixels in an image. Now, the POE has some model distribution Q^∞, which should approximate the true distribution Q⁰. This naming convention was chosen to mimic the behavior of a Markov chain beginning with the true data distribution Q⁰ and ending at the model's equilibrium distribution Q^∞ after infinitely many steps. The idea of contrastive divergence is to minimize the difference between the model distribution Q^∞ and the true distribution Q⁰. Once again, a way to measure the difference between these two distributions is the Kullback-Leibler divergence, which can be calculated for Q⁰ and Q^∞ as:

$$D_{KL}(Q^0 \,\|\, Q^\infty) = \sum_d p_{Q^0}(d) \log p_{Q^0}(d) - \sum_d p_{Q^0}(d) \log p_{Q^\infty}(d) \qquad (2.25)$$

$$= \sum_d Q^0_d \log Q^0_d - \sum_d Q^0_d \log Q^\infty_d \qquad (2.26)$$

Note that the first term in Equation 2.26 depends entirely on the data and is not affected by the parameters of the model that specify Q^∞. This means that during training, no update of the parameters will affect this constant, so it can safely be ignored. Unfortunately, calculating the second term is intractable, as it is the expectation of the model's log probability of the data, taken under the true data distribution. However, note that the value Q^∞_d is just the probability of the POE given some data and parameters, p(d | θ_1...θ_n), and the term is again an expectation. Revisiting Equation 2.22 with this understanding yields:

$$\left\langle \frac{\partial \log Q^\infty_d}{\partial \theta_m} \right\rangle_{Q^0} = \left\langle \frac{\partial \log p_m(d \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(c \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{Q^\infty} \qquad (2.27)$$

Here, finally, the key insight of contrastive divergence is applied. Imagine that instead of minimizing the KL divergence D_KL(Q⁰ ∥ Q^∞), the difference between the KL divergences D_KL(Q⁰ ∥ Q^∞) and D_KL(Q¹ ∥ Q^∞) is minimized, where Q¹ refers to the distribution after one full Gibbs step of sampling. Since Q¹ is closer to the equilibrium distribution than Q⁰, D_KL(Q⁰ ∥ Q^∞) exceeds D_KL(Q¹ ∥ Q^∞) unless Q⁰ = Q¹, which implies Q⁰ = Q^∞; in that case it will be no worse, and learning is already perfectly done. This contrastive divergence will never be negative. Most importantly, however, the intractable expectation ⟨∂ log p_m(c | θ_m)/∂θ_m⟩ over Q^∞ cancels out:

$$-\frac{\partial}{\partial \theta_m} \left( D_{KL}(Q^0 \| Q^\infty) - D_{KL}(Q^1 \| Q^\infty) \right) = \left\langle \frac{\partial \log p_m(d^0 | \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(d^1 | \theta_m)}{\partial \theta_m} \right\rangle_{Q^1} + \frac{\partial Q^1}{\partial \theta_m} \frac{\partial D_{KL}(Q^1 \| Q^\infty)}{\partial Q^1}$$

In this equation, the terms d⁰ and d¹ are introduced; they refer to the starting data and the data after one Gibbs sampling step, respectively. In [13], Hinton showed that the final term, (∂Q¹/∂θ_m)(∂D_KL(Q¹ ∥ Q^∞)/∂Q¹), is small and rarely opposes the direction of learning. By ignoring this term, a very tractable learning rule was developed for POEs in general:

$$\Delta \theta_m \propto \left\langle \frac{\partial \log p_m(d^0 \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(d^1 \,|\, \theta_m)}{\partial \theta_m} \right\rangle_{Q^1} \qquad (2.28)$$

This discovery is relevant because this contrastive divergence rule holds for systems composed of independent, probabilistic nodes that specify a probability through the product of their independent activations. For a network of spiking neurons encoding the joint probabilities of certain features being on, the difference in the expectations for the data and a quickly derived model sample may be sufficient to train the neural network.

2.4 RBMs and Contrastive Divergence

Finally, contrastive divergence on RBMs will be examined here, because of the similarities between RBMs and the evtCD learning architecture. This section continues the derivation given in [13]. Contrastive divergence in RBMs draws from both the Boltzmann machine's KL divergence minimization and the POE's contrastive divergence form to yield a particularly easy-to-compute result. Beginning again with the POE log-likelihood given in Equation 2.22:

$$\frac{\partial \log p(d \,|\, \theta_1 \ldots \theta_n)}{\partial \theta_m} = \frac{\partial \log p_m(d \,|\, \theta_m)}{\partial \theta_m} - \sum_c p(c \,|\, \theta_1 \ldots \theta_n) \frac{\partial \log p_m(c \,|\, \theta_m)}{\partial \theta_m} \qquad (2.29)$$

Imagining a Boltzmann machine with a single hidden node j, note that θ_m = w_j, and the term ∂ log p_m(d | θ_m)/∂θ_m can be obtained from the Boltzmann derivation given above:

$$\frac{\partial \log p_m(d \,|\, w_j)}{\partial w_{ij}} = \langle s_i s_j \rangle_d - \langle s_i s_j \rangle_{Q^\infty(j)} \qquad (2.30)$$

The second term can be calculated similarly:

$$\sum_c p(c \,|\, w) \frac{\partial \log p_j(c \,|\, w_j)}{\partial w_{ij}} = \langle s_i s_j \rangle_{Q^\infty} - \langle s_i s_j \rangle_{Q^\infty(j)} \qquad (2.31)$$

Subtracting these two equations from each other per the derivation, and then taking the expectation over the whole dataset, yields:

$$\left\langle \frac{\partial \log Q^\infty_d}{\partial w_{ij}} \right\rangle_{Q^0} = -\frac{\partial \left( Q^0 \| Q^\infty \right)}{\partial w_{ij}} = \langle s_i s_j \rangle_{Q^0} - \langle s_i s_j \rangle_{Q^\infty} \qquad (2.32)$$

Finally, applying the contrastive divergence approximation gives:

$$\Delta w_{ij} \propto -\frac{\partial}{\partial w_{ij}} \left( (Q^0 \| Q^\infty) - (Q^1 \| Q^\infty) \right) = \langle s_i s_j \rangle_{Q^0} - \langle s_i s_j \rangle_{Q^1} \qquad (2.33)$$

This is the final form of contrastive divergence in RBMs. Next, we will examine the intuition behind some subsequent improvements to this rule.
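
Equation 2.33 is simple enough to state in a few lines of code. The following Java sketch performs one CD-1 weight update, reusing the illustrative GibbsStep helpers from the sketch in Section 2.1; as in practice, each expectation is estimated from a single sample.

```java
/** Sketch of one CD-1 update (Equation 2.33). v0 is a data vector;
 *  W, b, c are the RBM parameters; eta is the learning rate. */
public class Cd1 {
    static void update(int[] v0, double[][] W, double[] b, double[] c, double eta) {
        int[] h0 = GibbsStep.sampleHidden(v0, W, c);   // data-phase hidden state (Q^0)
        int[] v1 = GibbsStep.sampleVisible(h0, W, b);  // one Gibbs step down ...
        int[] h1 = GibbsStep.sampleHidden(v1, W, c);   // ... and back up gives a Q^1 sample
        for (int i = 0; i < v0.length; i++)
            for (int j = 0; j < c.length; j++)
                // <s_i s_j>_{Q^0} - <s_i s_j>_{Q^1}, single-sample estimate
                W[i][j] += eta * (v0[i] * h0[j] - v1[i] * h1[j]);
    }
}
```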

2.5 Extensions to Standard Learning Rules

This standard learning rule has been modified over time to yield practical improvements in learning, as well as performance optimizations to speed it up. This section will examine the most common of these extensions and informally discuss why they work.

The first extension, introduced in the early papers, is the idea of batch training: processing a set of samples in parallel to obtain a better gradient estimate. In [13], Hinton introduces batch training to combat the effects of sample variance. In training, the goal is to find a gradient using a sample to model a distribution, but often the variance of the sample can be so large as to swamp the gradient of the true distribution. By averaging the gradient over several parallel samples, the inter-sample variance is reduced relative to the learning signal of the true distribution. Moreover, the parallel processing enables vector-operation speedups, which are a common method of optimization on modern computers. This idea is more fully explored in Chapter 6.

A common extension to gradient learning rules is momentum. This is straightforward to implement, and acts by adding a fraction of the previous update to the current update. When the training is far from converging to the solution, it quickly accelerates towards the goal by taking larger steps, combining the previous step and the current step together. The downside is that too much momentum can cause the learning to overshoot the goal distribution when the learning procedure is closer to the end.

Decay is also commonly applied to networks to help regularize the weights. By slowly decaying values, overlearned solutions with very strong weights decay back towards equilibrium, regularizing the system. If the decay is too fast, the learning procedure will perform worse, but the right amount of decay makes the system more stable to new data points and decreases overlearning. A minimal sketch of the momentum and decay updates is given below.

Persistent contrastive divergence is a significant contribution of [41], in which the equilibrium distribution is approached faster over time by continuing the Gibbs process with every new data point, moving the sample points closer to the equilibrium distribution. Interestingly, in this case the data and the model are mixed only through the weights; the Gibbs chain that samples from the model distribution is initialized at the first point and slowly steps towards equilibrium. Of course, the equilibrium distribution specified by the weights is also changing slowly, but if the learning rate is low enough, then the process should yield samples more closely tied to the equilibrium distribution. The comparison between regular CD (called CD-1, because each model sample comes from a distribution one step closer to the equilibrium distribution) and persistent contrastive divergence is visualized in Figure 2.3. This idea is explored for neural networks in Chapter 6.
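
The following Java sketch shows how momentum and weight decay wrap around a CD gradient; the gradient dW could come from, e.g., the CD-1 sketch above, and all hyperparameter names and values are illustrative assumptions.

```java
/** Sketch of momentum plus weight decay applied to a CD gradient dW. */
public class MomentumDecay {
    static void apply(double[][] W, double[][] velocity, double[][] dW,
                      double eta, double momentum, double decay) {
        for (int i = 0; i < W.length; i++)
            for (int j = 0; j < W[0].length; j++) {
                // keep a fraction of the previous step, add the new gradient,
                // and pull the weight gently back towards zero
                velocity[i][j] = momentum * velocity[i][j]
                               + eta * (dW[i][j] - decay * W[i][j]);
                W[i][j] += velocity[i][j];
            }
    }
}
```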

Finally, sparsity and selectivity have been incorporated into the learning procedure in [42]. This innovation was originally inspired by biological evidence indicating that neural receptive fields tend to be sparse, i.e., a neuron responds to a very specific set of stimuli, and selective, i.e., a neuron responds rarely over time. By incorporating a cost function and biasing the activations, neurons in an RBM can be forced to learn stimuli not covered by other neurons, and to choose discriminative features by choosing receptive fields that occur only rarely. This scatters the receptive fields in a more dispersed way over the problem space and leads to improved generalization and test set performance.

[Figure 2.3: Comparison of CD-1 and PCD. (a) The CD-1 form of contrastive divergence: each training iteration takes a single Gibbs step from the data (Q⁰ to Q¹). (b) Persistent contrastive divergence: the chain continues across training iterations (Q⁰ to Q¹, then Q², then Q³).]

Chapter 3

Derivation of evtCD

This chapter introduces an online, event-based training rule for spiking RBMs. In Section 3.1, previous work on training spiking RBMs is examined. Subsequently, the evtCD algorithm that forms the core contribution of this thesis is introduced in Section 3.2.

3.1 Spiking Restricted Boltzmann Machines

Variations of training spiking RBMs have been tried, generally without much success. In 1999, Hinton and Brown [43] investigated using sigmoids that spike over time (eschewing more biological neuron models), beginning an investigation into sequence learning. However, this work was largely not continued; in the next appearance, Teh and Hinton [44] applied contrastive divergence to continuously-valued face images using rates instead of binary spike events. Purely binary representations do not work well with the real-valued intensities of the face, so the authors proposed encoding a single intensity with multiple binary features to allow discretized variations in intensity. Alternatively, they viewed this as a rate-based code of binary events, and called their system RBMrate as a result. This was subsequently applied to time-series data in [45], which then moved further from the rate-based approximation of spiking to diffusion networks of continuous values. A significant advance was made when O'Connor [46] proposed, in his thesis work, using a rate-based model of a neuron to encode the state values in contrastive divergence. In this way, networks of spiking LIF neurons can be trained offline using a rule very similar to standard contrastive divergence. However, instead of using binary activations adopted from sigmoid probabilities, the layer activations were taken to be the continuously-valued neuron spike rates given by the Siegert formula. Once trained, the weights are

transferred from the rate-based model to a network of spiking LIF neurons, and the network of spiking neurons behaves according to the energy functions of an RBM. The framework introduced there forms the starting point for the work in this thesis. The actual function that transforms inputs and weights into a rate is called the Siegert function, from [47], and it has a difficult analytical form. For completeness, and because of the relevance of this method, it is given here, but the reader is encouraged to look to other works such as [48] for a more in-depth analysis of this function. Given excitatory input rates ρ_e and inhibitory input rates ρ_i, the following auxiliary variables can be calculated:

$$\mu_Q = \tau \left( \sum w_e \rho_e + \sum w_i \rho_i \right) \qquad (3.1)$$

$$\sigma_Q^2 = \frac{\tau}{2} \left( \sum w_e^2 \rho_e + \sum w_i^2 \rho_i \right) \qquad (3.2)$$

$$\Upsilon = V_{rest} + \mu_Q \qquad (3.3)$$

$$\Gamma = \sigma_Q \qquad (3.4)$$

$$k = \sqrt{\tau_{syn} / \tau} \qquad (3.5)$$

$$\gamma = |\zeta(1/2)| \qquad (3.6)$$

where τ_syn is the synaptic time constant, τ is the membrane time constant, and ζ is the Riemann zeta function. With these auxiliary variables, the average firing rate ρ_out of the neuron with resting potential V_rest and reset potential V_reset can be computed as [49]:

$$\rho_{out} = \left( t_{ref} + \frac{\tau}{\Gamma} \sqrt{\frac{\pi}{2}} \int_{V_{reset} + k\gamma\Gamma}^{V_{th} + k\gamma\Gamma} \exp\left[ \frac{(u - \Upsilon)^2}{2\Gamma^2} \right] \left[ 1 + \mathrm{erf}\left( \frac{u - \Upsilon}{\Gamma\sqrt{2}} \right) \right] du \right)^{-1} \qquad (3.7)$$

Let the function r_j = φ(r_i, w, θ_sgrt)/r_max denote the resulting firing rates r_j returned by the Siegert function φ with input rates r_i, weights w, and parameter set θ_sgrt. The Siegert-computed rate is then normalized by the maximum firing rate r_max = 1/t_ref. Since a neuron is unable to spike faster than 1/t_ref spikes per second, this normalization maps the output of the Siegert function to the range [0, 1]. Now, for visible unit rates r_i and hidden unit rates r_j, the contrastive divergence rule becomes:

$$\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.8)$$
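
Because Equation 3.7 has no closed form, a small numerical sketch may help. The following Java method evaluates the reconstructed formula above with a trapezoidal integral and a standard polynomial erf approximation (Abramowitz and Stegun 7.1.26). The structure, constants, and step count are illustrative assumptions rather than the thesis' Matlab code, and no care is taken here to prevent overflow of the exponential for extreme parameter values.

```java
/** Numerical sketch of the Siegert rate (Equation 3.7); parameter names
 *  follow the text above. */
public class Siegert {
    // erf approximation, Abramowitz & Stegun 7.1.26 (|error| < 1.5e-7)
    static double erf(double x) {
        double s = Math.signum(x), ax = Math.abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * ax);
        double y = 1.0 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                  - 0.284496736) * t + 0.254829592) * t) * Math.exp(-ax * ax);
        return s * y;
    }

    static double rate(double[] wE, double[] rhoE, double[] wI, double[] rhoI,
                       double tau, double tauSyn, double tRef,
                       double vRest, double vReset, double vTh) {
        double muQ = 0, s2Q = 0;
        for (int n = 0; n < wE.length; n++) { muQ += wE[n] * rhoE[n]; s2Q += wE[n] * wE[n] * rhoE[n]; }
        for (int n = 0; n < wI.length; n++) { muQ += wI[n] * rhoI[n]; s2Q += wI[n] * wI[n] * rhoI[n]; }
        muQ *= tau;                                // Eq. 3.1
        s2Q *= tau / 2.0;                          // Eq. 3.2
        double upsilon = vRest + muQ;              // Eq. 3.3
        double bigGamma = Math.sqrt(s2Q);          // Eq. 3.4
        double k = Math.sqrt(tauSyn / tau);        // Eq. 3.5
        double zetaHalf = 1.4603545088095868;      // |zeta(1/2)|, Eq. 3.6
        double lo = vReset + k * zetaHalf * bigGamma;
        double hi = vTh + k * zetaHalf * bigGamma;
        int steps = 1000;
        double du = (hi - lo) / steps, integral = 0;
        for (int n = 0; n <= steps; n++) {         // trapezoidal rule
            double u = lo + n * du;
            double z = (u - upsilon) / bigGamma;
            double f = Math.exp(z * z / 2.0) * (1.0 + erf(z / Math.sqrt(2.0)));
            integral += (n == 0 || n == steps) ? 0.5 * f : f;
        }
        integral *= du;
        return 1.0 / (tRef + (tau / bigGamma) * Math.sqrt(Math.PI / 2.0) * integral);
    }
}
```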

This works quite well in practice, yielding networks that can achieve accuracies greater than 95% on the MNIST handwritten digit benchmark task [14]. The receptive fields resemble the digit parts that constitute handwritten digits, as can be seen in Figure 3.1.

[Figure 3.1: A visualization of the receptive fields of hidden layer neurons trained using the Siegert method. Each square represents the weights of the visible layer connected to that hidden layer neuron. Note here that the fields factorize handwritten digits into digit parts. Figure from [14].]

Knowing that Equation 3.8 performs well for training rate-based approximations of LIF neurons, a link can be sought between a spike-based rule and this rate-based rule.

3.2 evtCD, an Online Learning Rule for Spiking Restricted Boltzmann Machines

In addition to the possible performance improvement of replacing evaluations of the complex Siegert function with a simple update rule applied upon spiking, there is a biological justification for examining spike-timing-dependent plasticity (STDP). In the brain, neurons change their connection strengths in response to the relative time between the input neuron firing and the postsynaptic neuron firing (for a review of this so-called spike-timing-dependent plasticity, see [32, 33] and Figure 3.2). Many variations of STDP rules exist to capture the varieties of learning that spiking neurons display. Since a rate-based rule can describe the net result of a large number of spikes, could it be possible to design a spike-based rule that, in the rate-based limit, approximates the Siegert update rule given in Equation 3.8? We begin by separating the problem into identical subproblems specified by Δw⁺_ij and Δw⁻_ij:

[Figure 3.2: Visualization of spike-timing-dependent plasticity. This figure is taken from [50], a seminal experiment measuring the effect of spike timing on changes in synaptic strength. Shown is the percent change of synaptic strength as a result of the relative timing of presynaptic and postsynaptic spikes. In this experiment, presynaptic spikes arriving before the postsynaptic neuron fires result in synaptic strengthening (causal reinforcement), and presynaptic spikes arriving after the postsynaptic neuron fires result in synaptic weakening (acausal depression). This differs from the STDP rule used in evtCD, which only responds to causal spikes.]

$$\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.9)$$

$$\Delta w_{ij} = \Delta w^+_{ij} - \Delta w^-_{ij} \qquad (3.10)$$

$$\Delta w^+_{ij} = \eta \, \langle r_i r_j \rangle_{Q^0} \qquad (3.11)$$

$$\Delta w^-_{ij} = \eta \, \langle r_i r_j \rangle_{Q^1} \qquad (3.12)$$

In this weight update rule, there are four different populations of neurons, each with its individual rates: for Q⁰, which we will call the data layers, there are the r_i rates representing the visible layer and the r_j rates representing the hidden layer. Similarly, for Q¹, which we will call the model layers, there are the r_i rates representing the visible layer and the r_j rates representing the hidden layer. Since this problem decomposes cleanly into the rates of four different populations of neurons, those in the visible and hidden layers of the data and model distributions, we begin by proposing a four-layer

architecture (Figure 3.3). Unlike contrastive divergence, a spike-based learning rule requires populations to track states. In contrastive divergence, the sigmoids do not have continuity through time, so the notion of populations is not necessary; the state of a layer is a random sample, drawn according to the probability function of its given inputs. Here, however, networks of spiking LIF neurons maintain the states, so the four states of the network are represented by four physically distinct populations: the data visible, data hidden, model visible, and model hidden populations. Since the data and model distributions share a weight matrix, they must be matched in size, but the visible and hidden layers can be sized according to problem constraints. As in the standard RBM, there are no intra-layer connections, the weights propagate activations forward between visible and hidden layers, and the hidden-to-visible weights are the transpose of the forward weights.

[Figure 3.3: Architecture of a network used for event-based STDP-like updates. The evtCD algorithm relies on a network of four neural layers, each encoding a different set of rates. The arrows indicate the direction of information flow in the network. Importantly, the weight matrix W is shared between the data and the model layers and determines the connection strength between the visible and the hidden layers. The weight transpose W′ connects the data hidden layer back to the model visible layer.]

If the inputs are assumed to be Poisson-distributed with a rate specified by r, then the expected number of spikes per unit time is r. We begin by proposing the following STDP-inspired weight update rule:

$$\Delta w^+_{ij} = \begin{cases} \eta & \text{if } h^+_i = 1 \text{ and } v^+_j = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.13)$$

$$\Delta w^-_{ij} = \begin{cases} \eta & \text{if } h^-_i = 1 \text{ and } v^-_j = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.14)$$

Since two samples are unlikely ever to be 1 at exactly the same point in continuous time, we define a windowing period t_win over which this statement can be valid (see Figure 3.4). The expected number of spikes in each window is then the rate scaled by the ratio of the window length to the unit time over which the rate is defined:

$$E[h_i]_{t_{win}} = \langle h_i \rangle_{t_{win}} = t_{win} \, r_{h,i} \qquad (3.15)$$

However, the update rule needs to be a causal model, since it is designed to operate in real time, and the system is not able to learn about inputs which have not yet occurred. The windowing on the rates for the Poisson distribution is just an average rate over an arbitrarily chosen, constant time period, so we choose here to have the period end at the current time t and begin at t - t_win. In the limit, then, the expected number of h_i events produced over a time period is given by the rate r_{h,i}. Therefore, the update rule is equivalent:

$$\Delta w_{ij} \propto \langle h_i \rangle_{t_{win}} \langle v_j \rangle_{t_{win}} \Big|_{Q^0} - \langle h_i \rangle_{t_{win}} \langle v_j \rangle_{t_{win}} \Big|_{Q^1} \qquad (3.16)$$

$$\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.17)$$

where the spike states h_i and v_j indicate the presence or absence of a spike, and the time window t_win denotes the time over which the expectation is carried out. Importantly, this is a spike-based rule and exceptionally sparse in its computation. Since both h_i and v_j must be 1 in order for learning to occur, and the time window is defined such that the spike h_i occurs at the end of the window, a possible weight update needs to be calculated only when the hidden layer spikes. At that point, a neuron can check which of its inputs has spiked in the previous t_win and either potentiate or depress its weight as specified by the rule. This rule, called the evtCD learning rule, is shown in Figure 3.4, and four examples are shown in Figure 3.5. Importantly, the evtCD rule only needs to be evaluated when a hidden neuron spikes. Since it is a product of two binary events, the product v_j h_i is only 1 when both v_j and h_i are active. If either one is not active, then the system does not update the weights, and the connecting weight w_ij remains fixed. This is a key feature that allows very sparse event-based computation. A minimal sketch of this spike-triggered update is given below. Next, we examine how to implement this architecture in practice.
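
The following Java sketch shows the spike-triggered form of Equations 3.13 and 3.14: it is invoked only when a hidden neuron fires, and scans the last spike times of its visible-layer inputs. All names are illustrative assumptions, not the thesis' reference code.

```java
/** Sketch of the evtCD update evaluated only when hidden neuron j spikes
 *  at time t. lastVisibleSpike holds the most recent spike time of each
 *  visible neuron in the same (data or model) pathway. */
public class EvtCdRule {
    static void onHiddenSpike(int j, double t, double[] lastVisibleSpike,
                              double[][] w, double tWin, double eta,
                              boolean dataPathway) {
        for (int i = 0; i < lastVisibleSpike.length; i++) {
            double dt = t - lastVisibleSpike[i];
            // only causal visible spikes inside [t - tWin, t] trigger a change
            if (dt >= 0 && dt <= tWin)
                w[i][j] += dataPathway ? eta : -eta;  // data: potentiate; model: depress
        }
    }
}
```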

[Figure 3.4: The evtCD rule derived for this work. The learning rule is divided into two halves, a weight-potentiating rule and a weight-depressing rule. Unlike most learning rules, spikes from one set of populations (the data layers) only potentiate the weight matrix, and spikes from another set of populations (the model layers) only depress the weight matrix. In both cases, the weight change will only occur if a hidden layer spike ("post") occurs after a visible layer spike ("pre"). In all other cases, the weight remains fixed.]

[Figure 3.5: Four examples of the applied evtCD rule. This diagram is divided into two halves, like the evtCD learning rule. Spikes on the left side occur in the data layers, and spikes on the right side occur in the model layers. The gray box preceding a hidden layer spike represents the time window t_win, and spikes are vertical jumps with time on the horizontal axis. The upper left quadrant represents a weight increase for the w_ij connecting these two neurons, because the visible layer spikes before the hidden layer and within the time window t_win. The lower left produces no result because the visible layer spike occurred outside the hidden layer spike's window. On the right, the model distribution behaves identically, but its weight update results in a decrease of w_ij rather than an increase, since these spikes take place in the model layers.]

Chapter 4

Implementation of evtCD

This chapter describes the software implementation of the evtCD learning rule. Now that the rule has been stated mathematically in Equations 3.13 and 3.14, as well as diagrammatically in Figure 3.4, Section 4.1 presents a method for implementing the algorithm. In Section 4.2, a method of supervised training using the primarily unsupervised evtCD algorithm is introduced, using an idea from earlier work on RBMs [13] that allows labels to guide the training of networks of spiking neurons.

4.1 Algorithm Recipe for Software Implementation

The evtCD algorithm is straightforward to implement in a software simulation. The process is as follows:

1. Begin by initializing auxiliary variables:

   (a) Membrane{1:4}, the membrane potentials of the four neuron layers;
   (b) Last_Spiked{1:4}, the last time each neuron spiked, by layer, used to determine whether a neuron is a possible cause of a spike;
   (c) Refrac_End{1:4}, the time when the refractory period for a given neuron in a given layer last ended, used to determine whether a neuron is currently refractory;
   (d) last_update[1:4], the last time a neuron layer was updated with an incoming spike, used in calculating membrane potential decay;
   (e) Thr{1:2}, the thresholds for the visible and hidden neuron layers (shared between the data and the model);
   (f) W, the shared weight matrix.

The matrix Last_Spiked that stores the time of the previous spike should be initialized to a large negative value, so as not to artificially potentiate weights from the initial spikes. Membrane, Refrac_End, and last_update can be initialized to zero. Thr, the spike threshold, is typically initialized to a value of 1 for all neurons. Finally, W is initialized from the uniform distribution on [0, 1], because these uniformly excitatory weights cause more initial spikes for training than a Gaussian centered around zero.

2. Begin processing input spikes. For an event-based implementation, a priority queue of spikes is preferable, keyed on the time and the layer; every insertion and extraction is then O(log(n)). Every spike should be a triple of (time, address, layer). This data structure is a convenience to accomplish a task biology does very simply: delaying spikes between their generation and arrival. Here, the priority queue keeps the spikes sorted by time and layer, and ensures that the first spike processed is the one that happens first; biology takes care of this problem automatically.

3. For each input spike:
(a) Decay the membrane potential of the receiving layer by e^{-Δt/τ}, calculating Δt from the spike time and the last update time for the receiving layer.
(b) If the receiving neuron's refractory end period refrac_end is less than the current time, add an impulse corresponding to the weight w_{i,j} for receiving neuron i from the input spiking neuron j. Since the visible-to-hidden weights are W and the hidden-to-visible weights are the transpose W^T, index into the weight matrix appropriately based on the layer.
(c) If desired, add noise to the neuron membrane potentials.
(d) Examine the updated neurons, comparing their membrane potentials to their thresholds.
(e) For every neuron that exceeds its threshold:
i. Record a new refractory end period: Refrac_End{layer}[i] = spike_time + t_ref.
ii. Reset the membrane potential: Membrane{layer}[i] = 0.
iii. Record this time as the last time the neuron spiked: Last_Spiked{layer}[i] = spike_time.
iv. Adjust the threshold: lower it if this spike comes from the data distribution (making spikes more likely in the model distribution), or raise it if this spike originates from the model distribution.

v. If this layer is a hidden layer, an STDP weight update can be performed. If this layer is a data-distribution layer (layer 0 or 1), then each weight w_ij corresponding to an input neuron j whose Last_Spiked{layer-1}[j] falls within the time window t_win should be potentiated. If the current layer instead belongs to the model distribution, weights of inputs that spiked within the previous window are depressed. In either case, if there was no spike from a preceding neuron, its weight is unaffected.
vi. Add new spikes to the spike queue so downstream neurons receive them.

Note that in its simplest form, the exponential decay can be handled by bit-shifting. The integration of input currents is a sum, learning requires only a lookup and a comparison, and updating a weight with a new value is only another addition. Even if the only available operations are bit-shifting, addition, and subtraction, the rule remains implementable, making it ideal for low-compute architectures [21].

4.2 Supervised Training with evtCD

To measure the efficacy of the evtCD algorithm, it is necessary to objectively assess its accuracy. The exact methods of performance measurement are explained in Chapter 5, but it is worth discussing here how supervised training can occur within the unsupervised process of distribution matching. The idea is quite old and goes back to the early days of training unsupervised learners: by making the label part of the input distribution, the system is forced to learn the relationship of the input distribution to the labels and to cluster these elements together [1, 13]. See Figure 4.1 for an example; by concatenating the top label layer with the input layer, a new RBM can be made that trains in a supervised way when it performs distribution matching. After training this RBM as if it were a normal RBM, it can be unrolled back to its original configuration to perform classification. By correctly separating the weights and biases of the joint layer, the original three-layer architecture can be reconstructed. Then, passing activations through the three layers, input to hidden to label, classifies an example according to the weights of the system. A minimal sketch of the label concatenation is shown below.
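As a minimal Matlab sketch of this concatenation (assuming, as in Algorithm 1 of Section 5.2.1, that train_x stores one digit per column as pixel intensities and that y holds the digit labels 0 through 9):

% Build a joint visible layer of 784 pixel units plus 10 label units.
labels_onehot = full(sparse(double(y) + 1, 1:numel(y), 1, 10, numel(y)));
joint_x = [train_x; labels_onehot];   % (784+10) x trials joint visible data
% Spikes drawn from joint_x include label spikes, so distribution matching
% learns the joint distribution of pixels and labels; after training, the
% weight rows split back into W1 (pixels) and W2 (labels) as in Figure 4.1.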

Figure 4.1: Method of supervised learning using unsupervised learning. (Diagram: a stack of visual input layer, hidden layer via W1, and label layer via W2 is unrolled into a joint visible layer that concatenates the label and visual input layers, with weights {W1, W2}.) By concatenating the label and input layers, and learning the joint representation of input and label, the system is forced to learn to cluster the labels with the data.

Chapter 5

Test Methodology

This chapter describes the setup used to train and evaluate the evtCD networks. In Section 5.1, the dataset used to assess the performance of the training algorithm is introduced. Following that, two implementations are described, each serving a different aim: Section 5.2 describes the methodology of the time-stepped simulation, and Section 5.3 discusses the methodology for using the network for online training of a spiking RBM.

5.1 MNIST

The MNIST handwritten digit dataset, compiled by [35], is an extremely popular dataset in machine learning and is often used as a benchmark task for new learning algorithms. The challenge is straightforward: given 60,000 training digits, each a 28 by 28 pixel image with the digit in the center, correctly identify the handwritten digits of the 10,000-digit test set. A human achieves about 99.6% accuracy on this dataset, and the best algorithms in the world achieve equivalent performance [1, 35, 51]. Often, to boost performance, transformations such as rotations, translations, and deformations are applied to the images to enlarge the training set and increase robustness [14, 37]. These transformations were not performed here.

Several attributes make this dataset attractive for initial investigations of a new machine learning rule. First, its modest dimensionality of 784 input dimensions is large enough to challenge simple algorithms, yet small relative to a modern computer's processing power. Second, the data can be represented as binary activations without losing its identifying characteristics: a pen mark can be characterized as either present or absent, and learning rules (including evtCD and CD [13]) initially only supported binary data. Standard RBMs have since been extended to represent real-valued inputs [37], but that extension has not yet been investigated for evtCD training. Third, the weights of the neurons are interpretable, as they represent receptive fields over digits; by visually inspecting the weights of hidden neurons, it is possible to tell whether the learning rule has produced proper receptive fields that decompose the input into component parts. Finally, members of the machine learning field are very familiar with this dataset, so conclusions about an algorithm's strengths and weaknesses can be clearly drawn on this common benchmark.

Figure 5.1: Six digits from the MNIST corpus. Across the top row are examples of easily classified digits, and the bottom row contains digits 1, 2, and 8 that posed difficulty for the evtCD algorithm.

Performance on the MNIST benchmark task can be found in the results chapter of this work, specifically in Figure 6.22. It is worth pointing out an important benchmark here, however: a least-squares (optimal) linear regression achieves 86.03% classification accuracy when trained on the full 60,000-digit training set and all pixels. Ideally, the evtCD training algorithm would surpass this level of accuracy, given the nonlinear transformations that the spiking RBM performs.

5.2 Time-stepped Training Methodology

For experimental and debugging reasons, the Matlab implementation was time-stepped. The task of the time-stepped implementation is to provide a platform for easily studying the parameters of the system and for finding settings that optimize the overall operation of the evtCD algorithm.

The time-stepped testing methodology consists of the following steps:

1. Load the MNIST handwritten digit database of 60,000 training digits and 10,000 test digits.
2. Establish parameters for an evtCD training simulation, and initialize the network architecture for supervised training.
3. Draw spikes from each of the 60,000 digits in the training set, and pass these spikes as samples from the data-visible layer.
4. Train the network according to the evtCD algorithm, and dispose of spikes emitted from the model hidden layer.
5. At specific timepoints during the training, as shown in Section 6.1, save network snapshots for offline analysis.

The frame-based MNIST database was transformed into spike trains as described in Section 5.2.1. Every digit was presented for 1/10th of a second of simulated time, with each "on" pixel emitting 10 spikes on average. The likelihood of a spike being emitted from a pixel was proportional to its intensity, as explained in the same section.

In addition, the labels were used as part of the input. This supervised training method is described in Section 4.2 and shown in Figure 4.1: the label layer and the input layer were concatenated into a single layer, and the joint distribution of pixels and labels was learned. This allows an objective metric of training performance: classification accuracy after one epoch on the MNIST handwritten digit classification task. To determine the network's choice for a presented digit, the output-layer neuron with the most spikes was selected (a minimal sketch of this readout follows below). The architecture of the network is the one illustrated in Figure 6.1, with 784 neurons in the first layer, corresponding to the 28*28 pixel images of MNIST [35], 100 neurons in the hidden layer, and 10 neurons in the output layer corresponding to the 10 labels.

This manuscript also contains the full source of a Matlab implementation of time-stepped training, which can be found in Appendix B. Training begins by extracting spikes from MNIST images, a process detailed in the next section.
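A minimal sketch of this readout, with label_counts standing for the per-presentation spike counts of the 10 label-layer neurons (a hypothetical name, not taken from the Appendix B source):

% Winner-take-all readout over the label layer for one digit presentation.
[~, winner] = max(label_counts);   % label_counts: 10x1 vector of spike counts
predicted_digit = winner - 1;      % label neurons 1..10 encode digits 0..9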

Figure 5.2: Drawing an increasing number of spikes from the MNIST handwritten digit database. Shown here are 10, 50, 100, 500, and 5000 spikes drawn from a sample "4" digit using Algorithm 1 of Section 5.2.1.

5.2.1 Extracting Spikes From Still Images

This technique predates this thesis work, going back at least to [14], but the specification has not been fully described in print before. Given a sample image, the spike train should converge, as the number of spikes increases, to a rate encoding of the image. The absolute spike rate should be a fixed parameter, but the relative spike rates should emphasize bright pixels over dark pixels. This can be accomplished by drawing each spike from the image with probability proportional to the pixel's intensity. The Matlab function randsample efficiently addresses this task, generating spike trains such as those in Figure 5.2. This conversion from fixed image to spike train is used in this work whenever spike trains are needed from frames.

The timing of each spike is randomly generated. Since a given number of spikes are emitted in a given amount of time, a spike time is drawn uniformly at random for each spike, and the rates average out correctly over the presentation period. Algorithm 1 lists the source of this procedure, which efficiently generates spike trains from data vectors.

function [addr, times] = drawspikes(train_x, opts)
    % train_x: pixels x trials matrix of pixel intensities
    % opts.numspikes: spikes to draw per trial; opts.timespan: presentation length
    trials = size(train_x, 2);
    addr = zeros(opts.numspikes, trials);
    times = zeros(opts.numspikes, trials);
    for trial = 1:trials
        % Assign addresses: sample pixels with probability proportional to intensity
        addr(:, trial) = randsample(numel(train_x(:, trial)), opts.numspikes, ...
            true, train_x(:, trial));
        % Assign times: uniform random times over the presentation, sorted
        times(:, trial) = sort(rand(size(addr(:, trial)))) .* opts.timespan;
    end
end

Algorithm 1: Drawing spike trains from data vectors.
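A hypothetical usage, matching the presentation statistics of Section 5.2 (100 ms presentations; the total spike count per digit here is illustrative, since the text fixes only the average of 10 spikes per "on" pixel):

% Draw spike trains for a batch of digits; train_x is pixels x trials.
opts.numspikes = 2000;     % illustrative total spikes per presentation
opts.timespan  = 0.1;      % 100 ms of simulated time per digit
[addr, times] = drawspikes(train_x, opts);
% addr(k, n) is the pixel index of the k-th spike of digit n; times(k, n)
% is its (sorted) spike time within the presentation window.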

5.3 Online Training Methodology

Before discussing the online training methodology, it is necessary to describe how the input to the online training is generated. To test the real-time nature of this system, an image sensor that produces spiking output is required, such as the DVS spiking vision sensor [24]. Since the DVS produces spikes only in response to temporal contrast changes, it does not spike in static scenes. Therefore, either the scene or the sensor must be moved in intelligent ways to produce spikes in response to static images. Since such a movement model is necessary, it makes sense to adopt the model biology uses to solve the same problem.

5.3.1 A Software Implementation of Fixational Eye Movements

The implementation described here follows the model published by Engbert et al. [52] for the fixational movements of eyes, including fixational microsaccades. The model does not include the large saccadic eye movements that can be driven by top-down or bottom-up attention; it is designed solely to emulate the movement of an eye focusing on a particular point, and the small movements it undergoes to prevent saturation of the photoreceptors. Though the implementation described here moves the image of the world while keeping the eye fixed, rather than moving an eye while keeping the world fixed, it stimulates the DVS camera in a biologically realistic way. The model is composed of three factors:

1. A self-avoiding random walk designed to mimic the small tremors of the eye muscles;
2. An energy well designed to pull the focus of the eye back to the center;
3. Occasional small-amplitude, rapid movements of the eye, known as microsaccades, to reach a new location.

The first component is the self-avoiding random walk. Informally, this is modeled as a walk across a surface, where the direction chosen is toward the lowest neighbor. After a position is visited, its energy level rises and the position becomes less desirable, slowly decaying back to its starting level. When that spot is encountered later, it may have decayed back to being a desirable choice, or it may still be less desirable than its neighbors. This makes the random walk self-avoiding, which is a better model of biology because the muscles do not suddenly reverse direction and return to the exact spot from which they arrived [52].

Figure 5.3: Fixational eye movements. The background color represents the energy well that pulls the eye movements back toward the center, the red circle indicates the current fixation point, and the black line connects the 40 previously chosen points. This figure was generated by the implementation in Appendix C of the model introduced in [52].

Second, a model of fixational eye movements needs a factor that focuses the eye toward the center of the region of interest. Here, an energy well is defined quadratically over the surface of interest, with its minimum at the center of the image, and is combined with the energy state defined by the self-avoiding random walk: choosing the next position amounts to finding the minimum of the sum of the energy well and the walk surface, and stepping to that position [52].

Third, the eye makes small movements around the center of the image known as microsaccades. In the model proposed in [52], a microsaccade is triggered when the local energy of the current position is too high: when the level exceeds a threshold, the focus jumps from the current position to the global energy minimum. However, to model the effects of the muscles on the eye, a cost is added that encourages the movement to occur predominantly along a major axis, either horizontal or vertical, rather than along a biologically unrealistic diagonal.

Figure 5.3 shows a visualization of the movement output as well as the energy landscape used to generate it. The fully described algorithm can be found in Appendix C. The default parameters were initialized as in [52] and are listed in Table 5.1.

Parameter   Description                                                     Value
lambda      Slope of the potential                                          1
sinkeps     Relaxation rate
chi         Vertical/horizontal constraint for stabilizing
            microsaccade direction                                          2*lambda
hc          Critical value for triggering microsaccades                     7.9

Table 5.1: Parameters for the eye movement model, adapted from [52].

This code generates offsets that shift an image on the screen in a biologically realistic way to stimulate the DVS; the result is the visual input displayed on the screen in front of the sensor.

5.3.2 Training Environment

The training environment for the jAER implementation consists of three components: the screen that displays the data, the DVS that receives the spike-based representation, and the computer running the jAER environment. The completed setup can be seen in Figure 5.4. In this environment, training consists of rapidly displaying images from the Matlab environment and performing online learning within Java.

5.3.3 Java Reference Implementation

The Java reference implementation relies on several environmental factors as well as constants to appropriately build receptive fields. The secondary visualizer displays digits at around 26 digits per second to the observing DVS. These spikes form the input from the visible layer of the evtCD algorithm, triggering learning in the hidden neuron layers with the parameters shown in Table 5.2.

Figure 5.4: Photograph of the training environment.

Parameter    Description                                                     Value
t_win        Window width of evtCD learning                                  0.005 s
tau_ref      Refractory period                                               0.001 s
inv_decay    Inverse decay offset to slowly aid learning                     1e-5
eta          Learning rate                                                   1e
tau_recon    Reconstruction time constant for visualizing reconstructions    s
tau          Membrane time constant                                          0.100 s
thresh_eta   Threshold learning rate                                         0

Table 5.2: Parameters for online training using the Java reference implementation.

Chapter 6

Quantification of evtCD Training

Following the description of an implementation of the evtCD training rule and of the training methodology, this chapter evaluates the performance of the algorithm against other training methods and standard training paradigms. In Section 6.1, an examination of parameters is performed to map out the space of well-performing learning parameters for the network. In Section 6.2, the training algorithm is combined with a linear decoder to achieve 90.3% accuracy on the MNIST handwritten digit identification task [35]. Finally, Section 6.3 demonstrates the real-time nature of learning: the system was rapidly trained to form receptive fields for identifying digits, achieving 86.7% accuracy after training on 2.5% of the available data presented over 60 seconds.

6.1 Improving Training through Parameter Optimizations

A major aim of this work is to give an intuitive understanding of how neuron and learning parameters affect the evtCD training algorithm, and to propose insights for future work. The default setup for training, called the baseline parameters, can be found in Table 6.1; the methodology for this training was described in Section 5.2. The following alterations to the standard training paradigm are investigated:

1. Learning rate: how large should a standard weight update be?
2. Number of input events: how do different quantities of input spikes affect learning, and can the system degrade gracefully with less input?
3. Batch size: can training be parallelized for performance reasons without sacrificing accuracy?

4. Noise temperature: what role do membrane potential noise and stochasticity play in training?
5. Persistent contrastive divergence: does this powerful tool from regular CD also aid the evtCD training algorithm?
6. Weight limiting: can limiting the range of the weights result in better training?
7. Inverse decay: can a constant potentiation of the weights help the training process?

These parameters and extensions are examined in turn.

Figure 6.1: The architecture of the trained networks (input layer, hidden layer, label layer), to scale. The first layer is 784 neurons (trained on the 28*28 pixel digits), the hidden layer is 100 neurons, and the label layer is 10 neurons.

6.1.1 Baseline Training Demonstration

The parameter evaluations for the evtCD rule were compared against a trial run with a default, accurate, and reasonably fast set of parameters, referred to here as the baseline training parameters. This section introduces the typical behaviour of the network under these parameters.

Parameter     Description                                              Value
temperatures  Variance of the noise                                    0.01, 0.01, 0.01, 0.01
epochs        Number of times the entire data set is presented
              for training                                             1
eta           Learning rate                                            5e-3
momentum      Momentum of weight updates                               0
decay         Decay of weights                                         0
t_win         STDP rule window width                                   0.030 s
t_refrac      Refractory period of a neuron
tau           Membrane time constant
inv_decay     Inverse decay offset to slowly aid learning              0.001 * eta
batchsize     Number of parallel training samples                      10
thr           Threshold of a neuron                                    1

Table 6.1: Parameters for the baseline Matlab-based implementation.

As can be seen in Figure 6.2, the time evolution of the weights follows a very stereotyped pattern: after about 10,000 digit presentations, the random initialization of the weights causes each receptive field to begin navigating towards nearby local minima [53]. This happens when a hidden neuron successfully finds a component of the input space that helps to factorize the image into parts, as discussed in [13, 37]. The handwritten digits examined here factorize into digit parts, such as a vertical element or a curve, and over successive presentations the hidden neurons develop receptive fields corresponding to these factored elements.

The baseline parameters produce the rapidly increasing accuracy curve shown in Figure 6.3, eventually peaking at 79.13% classification accuracy. If training for longer than one epoch were desired, it would be beneficial to decrease the learning rate, as the accuracy plateaus too early; reaching minimum error within a single epoch is a clear sign that the learning rate is too fast [54].

(a) Weights after 10,000 input digits. (b) Weights after 30,000 input digits. (c) Weights after 60,000 input digits.

Figure 6.2: The weights of four example hidden-layer neurons over an increasing number of training examples. Brightness encodes the weight value, and each neuron can be seen becoming tuned to a particular set of digit regions.

Figure 6.3: Baseline accuracy of the evtCD training algorithm over one epoch of 60,000 training digits (accuracy in percent versus digits presented, in thousands), eventually peaking at 81.46% classification accuracy. The overall record for a network of this size trained with evtCD is 81.5%, so this accuracy lies close to the peak accuracy achieved so far.

6.1.2 Learning Rate

The most basic training parameter is the learning rate, eta. It determines the size of a weight update when a hidden-layer neuron spikes, and controls how quickly the system changes its weights to approximate the input distribution. The learning rate that yields peak performance is smaller than for typical sigmoid networks, on the order of 10^-3, whereas traditional CD can train with an eta value of 1. With the default threshold value used for evtCD training, a single weight update with eta = 1 could push a weight past the threshold value of its neuron. In an ideal case the system would recover and raise its threshold to compensate, but it is possible for the weights to move the network into a non-spiking regime from which it can never recover (unlike a sigmoidal network). For this reason, a smaller learning rate is preferable. As can be seen in Figure 6.5, a learning rate that is too large learns quickly but reaches its peak performance early and is then unable to improve, overshooting the learning target [37]. Conversely, an insufficient learning rate requires more learning iterations to reach its saturation level.

Figure 6.4: Effect of the learning rate on receptive field formation. The horizontal axis is the number of presented digits, in thousands, and the vertical axis is the learning rate (1e-5, 1e-4, 1e-3, the baseline of 5e-3, 1e-2, 1e-1). Note that slower learning rates take longer to develop receptive fields, while fast ones learn improper features and then stop learning.

Figure 6.5: Accuracy evolution for different learning rates (1e-5 through 1e-1). The baseline parameter of 5e-3 was chosen to balance the rapidity of the 1e-2 learning rate with the more careful 1e-3 learning rate.

6.1.3 Number of Input Events

The number of input events controls the amount of coincidence information available for training with the evtCD algorithm. Raising the number of input events makes a neuron more likely to encounter joint activations of its inputs and to develop a receptive field for those regions of the input. Lowering the number of input events, on the other hand, makes it possible that spikes never overlap, so a hidden neuron never uncovers joint probabilities to encode. Because of this, it is important to set the membrane time constant tau and the STDP window t_win in relation to the spike rate. In these experiments, tau and t_win were scaled proportionally relative to the baseline input rate of 10 spikes per pixel per digit presentation (which corresponds to a spike rate of 100 Hz for a maximally "on" pixel). No learning occurs if the spike rate drops too low: the exponential decay relaxes a neuron's membrane potential back to its resting voltage before another spike arrives, so it is necessary to lengthen these windows to allow a fair comparison (a minimal sketch of this scaling follows below).

The spike rate shown on the vertical axis of Figure 6.6 and in the legend of Figure 6.7 is the expected number of spikes an "on" pixel emits over the presentation of a digit (100 ms). Note that the accuracy reaches a peak around the chosen baseline value of 10 spikes per pixel per digit presentation, and falls off with either fewer or more input events. There appear to be two modes of peak performance in Figure 6.7: one around 100 Hz, and a second at the much higher input rate of 250 Hz. Since the system is flexible in the number of input events, and the number of events dominates the training time, the lower mode of 100 Hz is the better choice for these initial investigations. Qualitatively, Figure 6.6 suggests that as the spike rates increase, receptive field specialization increases as well. The receptive fields appear more detailed as the number of coincident input spikes grows, allowing more selectivity in the types of inputs that drive them.
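The exact scaling is not spelled out beyond proportionality; the following is a minimal Matlab sketch of one consistent choice, keeping the expected number of spikes per window fixed (tau_base stands in for the unlisted baseline membrane time constant):

% Scale the temporal windows inversely with the input rate (assumed form).
r_base = 10;                     % baseline spikes per "on" pixel per digit
scale  = r_base / r;             % r: tested input rate, in the same units
t_win  = 0.030 * scale;          % STDP window from Table 6.1, stretched
tau    = tau_base * scale;       % membrane time constant scaled alike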

Figure 6.6: Effect of the input rate on receptive field formation. The horizontal axis is the number of presented digits, in thousands, and the vertical axis is the input rate in spikes per "on" pixel per digit (4, 8, the baseline of 10, and higher). Note that with increasing input rates, the specificity of the features appears to increase.

Figure 6.7: Accuracy evolution for different input rates. The accuracy increases with the input rate and reaches a peak at 100 Hz (10 spikes per 100 ms), with a second mode at the much higher rate of 250 Hz.

6.1.4 Batch Size

In the standard contrastive divergence model, processing a batch of data in parallel is an important step for two reasons [13, 55]. First, parallel processing is often more efficient on modern computers, since batch processing lets the learning algorithm capitalize on heavily optimized matrix operations [37, 54]. Second, it decreases the variance of a single learning sample: a single-sample learner tends to be pushed out of regions of high variance even when the time-averaged gradient there is zero. In [13], Hinton offers an analogy: when a thin sheet of metal is vibrated, sand particles (following gradient descent) scattered over the surface settle into the regions between the oscillating peaks, avoiding the regions of high variance even though the time-averaged mean is zero everywhere. Averaging a large number of parallel training iterations therefore results in better learning of the true gradient [37, 54].

Batch training in this network is implemented as batchsize parallel networks updating a common weight structure once per millisecond. Since the implementation is time-stepped, all weight updates happen in parallel based on the activity over the past 1 ms timestep. Each parallel network adds a vote on the direction of the gradient, and the average direction is taken with a normalized weight update; a minimal sketch follows below. The collective update has the same learning rate as a single step in the baseline training, but incorporates more evidence about the correct gradient direction. Because of this, a batched weight update can achieve equal accuracy with fewer weight updates, and with enough parallel updates the estimate should eventually exceed training on single samples as described above [13].

The parallelization comes at effectively zero computational cost in the time-stepped implementation for the small levels of parallelization shown here: the execution time with batches of 10 is the same as with batches of 1, so a batchsize of 10 was chosen as the optimal parameter. This additionally coincides with previous suggestions of batch sizes equal to the number of classes in the data [54]. The receptive fields form in approximately the same number of presentations, and the accuracy suggests that each weight update is more reliable than in the single-sample case. Finally, the slow increase in accuracy for larger batch sizes appears promising for future investigations, as it could allow more accurate learning given more training time.
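A minimal sketch of this averaged update, with dW_k standing for the weight change proposed by the k-th parallel network over the past millisecond (hypothetical names, not the Appendix B source):

% Once per 1 ms timestep: average the per-network gradient votes.
dW = zeros(size(W));
for k = 1:opts.batchsize
    dW = dW + dW_k{k};            % accumulate each network's proposed change
end
W = W + dW / opts.batchsize;      % normalized so the effective step size
                                  % matches a batchsize of 1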

Figure 6.8: Effect of batch size on receptive field formation. The horizontal axis is the number of presented digits, in thousands, and the vertical axis is the number of parallel training networks used to estimate the learning gradient (from 2 through the baseline of 10 and beyond). Batches of size 10 and size 1 develop receptive fields equally fast, and the performance advantages of larger batch sizes made 10 the preferable choice for a baseline parameter.

Figure 6.9: Accuracy evolution with increasingly parallel estimates of the gradient. Accuracy is slightly improved at moderate batch sizes, while learning proceeds much more slowly on a per-digit basis for larger batches.

6.1.5 Noise Temperature

In the evtCD training algorithm, noise can have beneficial effects as well as the expected detrimental ones, for two reasons: first, noise helps to regularize the weights [37], and second, noise helps to ensure that the neurons keep firing. As mentioned before, the evtCD learning rule only functions when neurons emit spikes, and the noise term helps drive neurons to spike. Moreover, in the evtCD training algorithm, samples are obtained by propagating the activations of one layer to the next, and significant activation can be lost along the way. Without a term like a bias to encourage more spiking, the activation can decay away; this is explored more fully in Section 6.1.7.

The noise term was added in two different ways. The first was to draw Gaussian noise with mean zero and variance as indicated on the vertical axis of Figure 6.10, and to perturb the membrane potential of each neuron by this amount once per timestep (one millisecond). Note that the threshold was fixed at 1, so the variances on these plots can produce many erroneous spikes. The second method is a more biologically grounded one known as the Ornstein-Uhlenbeck process [34, 56]: a low-pass filtered Gaussian with a time constant of 25 ms [34]. This prevents the noise from rapidly fluctuating the membrane potential, instead providing a random offset that moves much more slowly. A minimal sketch of both noise models follows below.

Interestingly, purely Gaussian noise appears to help the neurons learn more quickly than in the absence of noise, due to their increased activity. The low-pass filtered noise results in more precisely defined receptive fields, which echoes observations made earlier in this chapter.
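A minimal Matlab sketch of both noise models under the 1 ms timestep; the exact discretization of the Ornstein-Uhlenbeck process is an assumption, as only the 25 ms time constant is given:

dt = 1e-3;  tau_n = 0.025;  sigma = sqrt(temperature);
% (a) White Gaussian noise: an independent perturbation every timestep.
membrane = membrane + sigma * randn(size(membrane));
% (b) Ornstein-Uhlenbeck noise: a slowly drifting offset, updated each
% timestep and added to the membrane potential at threshold comparison.
ou = ou - (dt / tau_n) * ou + sigma * sqrt(2 * dt / tau_n) * randn(size(ou));
effective_membrane = membrane + ou;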

(a) Receptive fields of neurons trained with Gaussian noise. (b) Receptive fields of neurons trained according to the Ornstein-Uhlenbeck process [34, 56].

Figure 6.10: Comparison of receptive fields of neurons trained with evtCD under various noise variances (vertical axis: noise variance from 0.1 to 10, with a baseline of 0.01; horizontal axis: digits presented, in thousands). Note that the introduction of a small amount of noise helps accelerate learning, and the system is able to develop features that ignore the noise.

(a) Accuracy of neurons trained with Gaussian noise. (b) Accuracy of neurons trained according to the Ornstein-Uhlenbeck process [34, 56].

Figure 6.11: Comparison of accuracy for neurons trained with evtCD under various noise variances (baseline 0.01). Surprisingly, the system is very stable in the presence of noise, and accuracy remains largely unaffected until quite significant noise is introduced.

Persistent Contrastive Divergence

Persistent contrastive divergence (PCD), as originally introduced in [41] and shown in Figure 2.3, creates a persistent Markov chain that is driven entirely separately from the input. The model-distribution samples are generated separately from the data distribution, and each data point moves the Markov chain closer to the equilibrium distribution. This ignores the fact that the model, specified by the system weights, changes slightly with each weight update, so the equilibrium distribution does not remain fixed; given a small enough learning rate, however, the system gathers samples closer to the equilibrium distribution than the normal CD-1 algorithm can.

When training spiking neural networks with evtCD, however, the model distribution becomes a recurrent visible-hidden network. The only mixing with the input data comes from the weight matrix shared between the data and model distributions. Since the model-distribution sampling process runs independently with no external input, its activity is driven almost entirely by noise, and it is responsible for setting up a persistent recurrent network that maintains activity and produces digit-like samples. A minimal sketch of the distinction follows below. A true demonstration of the power of this balanced sampling approach is that the weights of the system coerce random membrane-potential noise into persistent activation corresponding to a real digit. This process is demonstrated in Figure 6.14, which visualizes digit reconstructions arising from the model layers sampling under persistent contrastive divergence. These confabulations are clearly distinct from real digits, but considering that the network is small (100 hidden neurons) and trained for a single epoch, it does a remarkable job of creating digit-like patterns. There are clear digit parts, tending to be centrally located and continuous, and many of them are feasible approximations of digits.

Overall, however, the claims of faster training for persistent contrastive divergence in CD do not seem to hold for evtCD. At all time points, the baseline accuracy outperforms the PCD-trained network, as shown in Figure 6.13. The weights can be qualitatively assessed in Figure 6.12.
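A minimal sketch of the distinction as it might look in the time-stepped loop; the flag and state names are hypothetical:

% With PCD the model chain keeps its state between digit presentations;
% with CD-1 it is re-seeded from the data-driven pass for every digit.
if ~opts.persistent_cd
    membrane_model_v(:) = 0;            % reset model-layer membranes
    membrane_model_h(:) = 0;
    last_spiked_model_v(:) = -Inf;      % forget the chain's spike history
end
% Under PCD, the shared weight matrix W is the only coupling between the
% freely running model chain and the data distribution.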

Figure 6.12: Effect of persistent contrastive divergence on receptive field formation. The horizontal axis is the number of presented digits, in thousands, and the vertical axis indicates the presence or absence of persistent contrastive divergence during learning. Features appear to require more time to emerge with PCD, given the lack of direct input to the model layers.

Figure 6.13: Accuracy evolution with and without persistent contrastive divergence. The baseline parameters consistently outperform PCD.

Figure 6.14: Demonstration of nine digit reconstructions from activation on the model visible layer. Since the network samples freely, the units are not driven by external input but rather sample from noise, guided by the energy function specified by the network weights.

6.1.6 Bounded Weights

One major difference between the networks modeled with evtCD and true biological networks is the large value range and high numerical precision available to digital simulations. The default double-precision implementation allows membrane voltages in excess of 1000 volts and vanishingly small weight updates. This section examines the possibility of capping the weights, to determine whether all of that range is necessary and whether losing precision might actually improve performance. Figures 6.15 and 6.16 demonstrate the effects of capping the weights to the range [-0.25, 0.25]: this largely has no negative effects, but it qualitatively alters the features that are selected.

Weight capping also affects the initial distribution of the weights. It has been suggested that weight initialization plays a very important role in learning, and that properly initialized weights can save significant computational effort and have drastic effects on the eventual accuracy [53, 54]. By initializing the weights closer to the extrema, training decreases weights to carve out features rather than sharpening weights that are already present. Interestingly, depriving the weights of much of their precision has little effect on the overall system. This could be a fruitful avenue for future exploration, as low-precision weights are necessary for some platform implementations [18, 21], and the full implications of different initialization regimes should be evaluated for possible performance improvements.

Figure 6.15: Effect of bounding the weight magnitude on receptive field formation. The horizontal axis is the number of presented digits, in thousands, and the vertical axis indicates whether the weights were bounded to [-0.25, 0.25].

Figure 6.16: Accuracy evolution without and with weight bounding. Bounding the weights may yield a tangible improvement over the baseline accuracy, or is at the very least not necessarily detrimental.

6.1.7 Inverse Decay

Finally, in an evtCD-specific optimization, a slow potentiation of all weights was added as a possible extension. Since these networks learn only when neurons spike, a constant positive learning offset can improve learning by steadily forcing all neurons to spike at least rarely. At every timestep, all weights in the weight matrix increase by a constant positive offset (here set to 0.01*eta), tending to make the downstream neurons spike. If a spike turns out to be fallacious, the weight penalty punishes it, strongly depressing the weight by -eta; if the spike was correct, it is reinforced. In either case, the features become more selective. This tends to speed up learning by causing more initial activity and by forcing neurons to adopt receptive fields rather than remain generic in their selectivity, as can be seen in Figure 6.17. A minimal per-timestep sketch is shown below.

Like momentum, this parameter should be applied early in training to encourage appropriate initialization, and then decreased over time. Consistent weight inflation proves detrimental to the overall accuracy once the weights are close to their equilibrium values, since it causes undesirable shifts in the energy distribution. Because the constant positive increase tends to cause added spiking, it also tends to shift the receptive fields over time to respond to novel stimuli instead of approaching an equilibrium.
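A minimal per-timestep Matlab sketch combining the inverse decay with the optional weight bounding of Section 6.1.6; the names are illustrative:

% Once per 1 ms timestep: constant potentiation nudges silent neurons
% toward spiking; evtCD itself then depresses any fallacious spikes.
W = W + inv_decay * eta;            % e.g. inv_decay = 0.01 in this section
W = min(max(W, -0.25), 0.25);       % optional bounding (Section 6.1.6)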

Figure 6.17: Effect of inverse decay on receptive field formation. The horizontal axis is the number of presented digits, in thousands, and the vertical axis is the ratio of the constant per-timestep weight increase to eta (from 0, through the baseline of 0.001, and higher); that is, the baseline increase per timestep (1 ms) is 1/1000 * eta. Note that the receptive fields without inverse decay remain largely undifferentiated due to the lack of spiking, so a small amount is beneficial.

Figure 6.18: Accuracy evolution with and without inverse decay. Though inverse decay helps the network develop receptive fields, too much of it decreases the eventual accuracy of the system and lengthens training time.

6.2 Training as a Feature Extractor

Figure 6.19: Architectures of the evtCD-trained network, the linear regression network, and the combination network. (Panels: the evtCD network stacks an input layer, a hidden layer, and a label layer, with W trained by evtCD throughout; the linear system maps the input layer directly to the label layer with W trained by an optimal linear decoder; the combination network applies a linearly decoded label layer to the hidden layer of an evtCD-trained RBM.)

Besides supervised training, evtCD can also be used to train networks that extract features in a purely unsupervised way. The common technique of unsupervised learning examines the input, extracts joint correlations, and clusters the data. This process can be used to learn receptive fields that reduce the dimensionality of the data (for example, as in [2, 57]) while preserving relevant information. If desired, the output of the reduced layer can then be trained with a more traditional classification technique [13].

To begin, evtCD was used to train the spiking RBM in a purely unsupervised way to establish relevant receptive fields for the data. Then, the activations of the network in response to the MNIST training set (60,000 digits) were recorded. This process yields a new training set of reduced dimensionality, of size trials by hidden-layer size, and a linear regressor was trained on it (a minimal sketch of this readout follows Figure 6.21 below). The architecture can be seen in Figure 6.19. The combination network, composed of the evtCD network and the linear regression network, recorded the highest performance of the architectures tried here, achieving 90.03% accuracy. It is a powerful result that the evtCD unsupervised learning method can reduce the dimensionality of the data from 784 pixels (28*28) to 225 and still achieve a better score than a linear regression alone.

The confusion matrices in Figure 6.20 indicate the digits that challenge the learning algorithm. The evtCD algorithm, when used for supervised training, had the most difficulty with the digit 5: the tested network often selected 8, 3, or 6 as alternative candidates when presented with a 5. The linear regression generally confused the same digits, but made fewer mistakes overall. On the other hand, a few mistakes appear in the linear classification result but not in the evtCD result; for example, the linear classifier had more difficulty identifying 5s that were actually 8s, and 1s that were actually 4s. Qualitatively, the confusion matrix of the combination network appears to be the intersection of the mistakes of the individual networks.

(a) Confusion matrix of an RBM trained with the evtCD learning algorithm. (b) Confusion matrix of MNIST digits using linear classification on the pixels. (c) Confusion matrix of the combination approach, using linear regression on the output of an RBM trained with evtCD.

Figure 6.20: Confusion matrices for classification using evtCD learning, optimal linear regression, and the combination of evtCD and linear regression. The vertical axis is the correct digit, and the horizontal axis is the chosen digit; color indicates the accuracy, in percent, of guessing the chosen digit. Note the common difficulty of distinguishing 4s from 9s and 5s from 3s, for example.

Additionally, after training, an advantage of the evtCD-trained networks is a level of interpretability in the weights that a pure linear classifier lacks. In Figure 6.21, the final receptive fields for the digits 0 through 9 are shown, with 0 on the upper left and 4 on the upper right. For the linear network, the values shown are the linear relationship of each label index to the input pixels, and can be thought of as receptive fields. These features are generally not intuitive, though a 1-like receptive field can be made out for the digit 1, and a dim representation of a 6 appears for the digit 6 (Figure 6.21a). However, by linearly weighting the receptive fields of the evtCD-trained network, much more intuitive features appear: though noisy, all of the weighted receptive fields in Figure 6.21b suggest the form of the digit they are supposed to represent. Finally, a comparison of the performance of the techniques appears in Figure 6.22. Though the learning algorithm has much to improve before achieving state-of-the-art accuracy, it nonetheless surpasses the optimal linear methods and achieves impressive accuracy for a single training epoch executing on spiking neurons.

(a) Linear decoder weights. (b) STDP-trained linear classifier weights.

Figure 6.21: Examination of the weights of a linear classifier built on top of the dimensionality-reducing STDP-trained system. (a) The purely linear classifier weights do not form particularly intuitive representations of their sensitivities (with the exception of the 1). (b) The combination network weights the receptive fields of the RBM to produce much more representative versions of their digits, though noisier and blurred.
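A minimal sketch of the combination readout of Section 6.2, assuming H and Htest hold per-digit hidden-layer spike counts (trials by hidden-layer size) and Y the one-hot training labels; the names are placeholders:

% Least-squares linear decoder on the RBM's hidden-layer spike counts.
Hb   = [H, ones(size(H, 1), 1)];            % append a bias column
Wlin = pinv(Hb) * Y;                        % optimal linear readout weights
scores = [Htest, ones(size(Htest, 1), 1)] * Wlin;
[~, pred] = max(scores, [], 2);             % predicted class = argmax label unit
predicted_digits = pred - 1;                % columns 1..10 encode digits 0..9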

Figure 6.22: Accuracy of learning on the MNIST dataset for evtCD, linear regression, the combined linear + evtCD method, the Siegert approach, and the state of the art [2, 35, 46]. The combined linear and evtCD method presented in Section 6.2 achieves 90.3% accuracy. As a supervised learning algorithm, evtCD peaks at 81.46% identification accuracy.

6.3 Online Training with Spike-Based Sensors

To demonstrate the rapidity with which these networks can be trained, evtCD was used to train a network online with the spiking DVS image sensor [58]. The input rate to the network is limited by the refresh rate of the display used for training; in this case, 30 FPS was the maximum digit presentation speed at a 60 Hz refresh rate with a blank frame between digits. Over 58 seconds, 1500 digits were presented to the DVS, which produced the events used as input to the evtCD algorithm. These digits comprise 2.5% of the typical MNIST training set. The algorithm trained a 14*14 = 196 neuron hidden layer in a purely unsupervised way, developing receptive fields corresponding to the digit inputs. The weights of this hidden layer can be found in Figure 6.24; though clearly less ordered than the full-epoch training examples shown in Section 6.1, the receptive fields display the qualitative features expected of a system trained on handwritten digits.

After the training examples are presented, the network weights are saved and a linear classifier is trained on the spiking output of the network in response to the digits, as in Section 6.2. The final performance of this system again exceeds that of a pure linear classifier operating on the full MNIST training set, achieving 86.7% classification accuracy. This is a promising result after processing such a small percentage of the training data.

Figure 6.23: Screenshot of the Java-based implementation. On the left is the weight matrix, with currently updating neurons framed in blue. The original input to the system is shown in red in the upper right, and next to it, in blue, is the live reconstruction of that digit performed by the model layer.

Figure 6.24: Weights of the network learned by evtCD from 1500 digits presented over 58 seconds. Qualitatively, these features correspond to those seen earlier in Section 6.2, factorizing the input into digit parts.


More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Probabilistic Models in Theoretical Neuroscience

Probabilistic Models in Theoretical Neuroscience Probabilistic Models in Theoretical Neuroscience visible unit Boltzmann machine semi-restricted Boltzmann machine restricted Boltzmann machine hidden unit Neural models of probabilistic sampling: introduction

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Contrastive Divergence

Contrastive Divergence Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer Science TU Graz June 9, 2010 Contents 1 Theory 2 Argument 3 Contrastive

More information

Does the Wake-sleep Algorithm Produce Good Density Estimators?

Does the Wake-sleep Algorithm Produce Good Density Estimators? Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto

More information

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber

Representational Power of Restricted Boltzmann Machines and Deep Belief Networks. Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Representational Power of Restricted Boltzmann Machines and Deep Belief Networks Nicolas Le Roux and Yoshua Bengio Presented by Colin Graber Introduction Representational abilities of functions with some

More information

STDP Learning of Image Patches with Convolutional Spiking Neural Networks

STDP Learning of Image Patches with Convolutional Spiking Neural Networks STDP Learning of Image Patches with Convolutional Spiking Neural Networks Daniel J. Saunders, Hava T. Siegelmann, Robert Kozma College of Information and Computer Sciences University of Massachusetts Amherst

More information

Chapter 9: The Perceptron

Chapter 9: The Perceptron Chapter 9: The Perceptron 9.1 INTRODUCTION At this point in the book, we have completed all of the exercises that we are going to do with the James program. These exercises have shown that distributed

More information

An efficient way to learn deep generative models

An efficient way to learn deep generative models An efficient way to learn deep generative models Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of Toronto Joint work with: Ruslan Salakhutdinov, Yee-Whye

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Deep Neural Networks

Deep Neural Networks Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

More information

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

Hopfield Networks and Boltzmann Machines. Christian Borgelt Artificial Neural Networks and Deep Learning 296

Hopfield Networks and Boltzmann Machines. Christian Borgelt Artificial Neural Networks and Deep Learning 296 Hopfield Networks and Boltzmann Machines Christian Borgelt Artificial Neural Networks and Deep Learning 296 Hopfield Networks A Hopfield network is a neural network with a graph G = (U,C) that satisfies

More information

TUTORIAL PART 1 Unsupervised Learning

TUTORIAL PART 1 Unsupervised Learning TUTORIAL PART 1 Unsupervised Learning Marc'Aurelio Ranzato Department of Computer Science Univ. of Toronto ranzato@cs.toronto.edu Co-organizers: Honglak Lee, Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Approximate inference in Energy-Based Models

Approximate inference in Energy-Based Models CSC 2535: 2013 Lecture 3b Approximate inference in Energy-Based Models Geoffrey Hinton Two types of density model Stochastic generative model using directed acyclic graph (e.g. Bayes Net) Energy-based

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight)

Deep Learning. What Is Deep Learning? The Rise of Deep Learning. Long History (in Hind Sight) CSCE 636 Neural Networks Instructor: Yoonsuck Choe Deep Learning What Is Deep Learning? Learning higher level abstractions/representations from data. Motivation: how the brain represents sensory information

More information

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen Neural Networks - I Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Neural Networks 1 /

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Latent Variable Models

Latent Variable Models Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Notes on Back Propagation in 4 Lines

Notes on Back Propagation in 4 Lines Notes on Back Propagation in 4 Lines Lili Mou moull12@sei.pku.edu.cn March, 2015 Congratulations! You are reading the clearest explanation of forward and backward propagation I have ever seen. In this

More information

RegML 2018 Class 8 Deep learning

RegML 2018 Class 8 Deep learning RegML 2018 Class 8 Deep learning Lorenzo Rosasco UNIGE-MIT-IIT June 18, 2018 Supervised vs unsupervised learning? So far we have been thinking of learning schemes made in two steps f(x) = w, Φ(x) F, x

More information

Gaussian Cardinality Restricted Boltzmann Machines

Gaussian Cardinality Restricted Boltzmann Machines Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Gaussian Cardinality Restricted Boltzmann Machines Cheng Wan, Xiaoming Jin, Guiguang Ding and Dou Shen School of Software, Tsinghua

More information

Deep Generative Models. (Unsupervised Learning)

Deep Generative Models. (Unsupervised Learning) Deep Generative Models (Unsupervised Learning) CEng 783 Deep Learning Fall 2017 Emre Akbaş Reminders Next week: project progress demos in class Describe your problem/goal What you have done so far What

More information

A Practical Guide to Training Restricted Boltzmann Machines

A Practical Guide to Training Restricted Boltzmann Machines Department of Computer Science 6 King s College Rd, Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 Copyright c Geoffrey Hinton 2010. August 2, 2010 UTML

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Christopher Tosh Department of Computer Science and Engineering University of California, San Diego ctosh@cs.ucsd.edu Abstract The

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets Neural Networks for Machine Learning Lecture 11a Hopfield Nets Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Hopfield Nets A Hopfield net is composed of binary threshold

More information

Restricted Boltzmann Machines

Restricted Boltzmann Machines Restricted Boltzmann Machines http://deeplearning4.org/rbm-mnist-tutorial.html Slides from Hugo Larochelle, Geoffrey Hinton, and Yoshua Bengio CSC321: Intro to Machine Learning and Neural Networks, Winter

More information

Learning in Markov Random Fields An Empirical Study

Learning in Markov Random Fields An Empirical Study Learning in Markov Random Fields An Empirical Study Sridevi Parise, Max Welling University of California, Irvine sparise,welling@ics.uci.edu Abstract Learning the parameters of an undirected graphical

More information

Neural Networks. Henrik I. Christensen. Computer Science and Engineering University of California, San Diego

Neural Networks. Henrik I. Christensen. Computer Science and Engineering University of California, San Diego Neural Networks Henrik I. Christensen Computer Science and Engineering University of California, San Diego http://www.hichristensen.net Henrik I. Christensen (UCSD) Neural Networks 1 / 39 Introduction

More information

Neural Networks. Mark van Rossum. January 15, School of Informatics, University of Edinburgh 1 / 28

Neural Networks. Mark van Rossum. January 15, School of Informatics, University of Edinburgh 1 / 28 1 / 28 Neural Networks Mark van Rossum School of Informatics, University of Edinburgh January 15, 2018 2 / 28 Goals: Understand how (recurrent) networks behave Find a way to teach networks to do a certain

More information

Learning Deep Architectures

Learning Deep Architectures Learning Deep Architectures Yoshua Bengio, U. Montreal CIFAR NCAP Summer School 2009 August 6th, 2009, Montreal Main reference: Learning Deep Architectures for AI, Y. Bengio, to appear in Foundations and

More information

Self Supervised Boosting

Self Supervised Boosting Self Supervised Boosting Max Welling, Richard S. Zemel, and Geoffrey E. Hinton Department of omputer Science University of Toronto 1 King s ollege Road Toronto, M5S 3G5 anada Abstract Boosting algorithms

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Basic Principles of Unsupervised and Unsupervised

Basic Principles of Unsupervised and Unsupervised Basic Principles of Unsupervised and Unsupervised Learning Toward Deep Learning Shun ichi Amari (RIKEN Brain Science Institute) collaborators: R. Karakida, M. Okada (U. Tokyo) Deep Learning Self Organization

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory Announcements Be making progress on your projects! Three Types of Learning Unsupervised Supervised Reinforcement

More information

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines

Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines Parallel Tempering is Efficient for Learning Restricted Boltzmann Machines KyungHyun Cho, Tapani Raiko, Alexander Ilin Abstract A new interest towards restricted Boltzmann machines (RBMs) has risen due

More information

Autoencoders and Score Matching. Based Models. Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas

Autoencoders and Score Matching. Based Models. Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas On for Energy Based Models Kevin Swersky Marc Aurelio Ranzato David Buchman Benjamin M. Marlin Nando de Freitas Toronto Machine Learning Group Meeting, 2011 Motivation Models Learning Goal: Unsupervised

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Appendix A.1 Derivation of Nesterov s Accelerated Gradient as a Momentum Method

Appendix A.1 Derivation of Nesterov s Accelerated Gradient as a Momentum Method for all t is su cient to obtain the same theoretical guarantees. This method for choosing the learning rate assumes that f is not noisy, and will result in too-large learning rates if the objective is

More information

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13

Energy Based Models. Stefano Ermon, Aditya Grover. Stanford University. Lecture 13 Energy Based Models Stefano Ermon, Aditya Grover Stanford University Lecture 13 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 13 1 / 21 Summary Story so far Representation: Latent

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) Human Brain Neurons Input-Output Transformation Input Spikes Output Spike Spike (= a brief pulse) (Excitatory Post-Synaptic Potential)

More information

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS LAST TIME Intro to cudnn Deep neural nets using cublas and cudnn TODAY Building a better model for image classification Overfitting

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Fast classification using sparsely active spiking networks. Hesham Mostafa Institute of neural computation, UCSD

Fast classification using sparsely active spiking networks. Hesham Mostafa Institute of neural computation, UCSD Fast classification using sparsely active spiking networks Hesham Mostafa Institute of neural computation, UCSD Artificial networks vs. spiking networks backpropagation output layer Multi-layer networks

More information

8. Lecture Neural Networks

8. Lecture Neural Networks Soft Control (AT 3, RMA) 8. Lecture Neural Networks Learning Process Contents of the 8 th lecture 1. Introduction of Soft Control: Definition and Limitations, Basics of Intelligent" Systems 2. Knowledge

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

7.1 Basis for Boltzmann machine. 7. Boltzmann machines

7.1 Basis for Boltzmann machine. 7. Boltzmann machines 7. Boltzmann machines this section we will become acquainted with classical Boltzmann machines which can be seen obsolete being rarely applied in neurocomputing. It is interesting, after all, because is

More information

Unsupervised Learning of Hierarchical Models. in collaboration with Josh Susskind and Vlad Mnih

Unsupervised Learning of Hierarchical Models. in collaboration with Josh Susskind and Vlad Mnih Unsupervised Learning of Hierarchical Models Marc'Aurelio Ranzato Geoff Hinton in collaboration with Josh Susskind and Vlad Mnih Advanced Machine Learning, 9 March 2011 Example: facial expression recognition

More information

Notes on Markov Networks

Notes on Markov Networks Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information