arxiv: v5 [cs.lg] 28 Mar 2017

Size: px

Start display at page:

Download "arxiv: v5 [cs.lg] 28 Mar 2017"

Bennett Strickland
6 years ago
Views:

1 Equilibrium Propagation: Briging the Gap Between Energy-Base Moels an Backpropagation Benjamin Scellier an Yoshua Bengio * Université e Montréal, Montreal Institute for Learning Algorithms March 3, 217 arxiv: v5 [cs.lg] 28 Mar 217 Abstract We introuce Equilibrium Propagation, a learning framework for energy-base moels. It involves only one kin of neural computation, performe in both the first phase when the preiction is mae an the secon phase of training after the target or preiction error is reveale. Although this algorithm computes the graient of an objective function just like Backpropagation, it oes not nee a special computation or circuit for the secon phase, where errors are implicitly propagate. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning an Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the graient of a well efine objective function. Because the objective function is efine in terms of local perturbations, the secon phase of Equilibrium Propagation correspons to only nuging the preiction fixe point, or stationary istribution towars a configuration that reuces preiction error. In the case of a recurrent multi-layer supervise network, the output units are slightly nuge towars their target in the secon phase, an the perturbation introuce at the output layer propagates backwar in the hien layers. We show that the signal back-propagate uring this secon phase correspons to the propagation of error erivatives an encoes the graient of the objective function, when the synaptic upate correspons to a stanar form of spike-timing epenent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation coul be implemente by brains, since leaky integrator neural computation performs both inference an error back-propagation in our moel. The only local ifference between the two phases is whether synaptic changes are allowe or not. We also show experimentally that multi-layer recurrently connecte networks with 1, 2 an 3 hien layers can be traine by Equilibrium Propagation on the permutation-invariant MNIST task. 1 Introuction The Backpropagation algorithm to train neural networks is consiere to be biologically implausible. Among other reasons, one major reason is that Backpropagation requires a special computational circuit an a special kin of computation in the secon phase of training. Here we introuce a new learning framework calle Equilibrium Propagation, which requires only one computational circuit an one type of computation for both phases of training. Just like Backpropagation applies to any ifferentiable computational graph an not just a regular multi-layer neural network, Equilibrium Propagation applies to a whole class of energy base moels the prototype of which is the continuous Hopfiel moel. In section 2, we revisit the continuous Hopfiel moel Hopfiel, 1984 an introuce Equilibrium Propagation as a new framework to train it. The moel is riven by an energy function whose minima correspon to preferre states of the moel. At preiction time, inputs are clampe an the network relaxes to a fixe point, corresponing to a local minimum of the energy function. The preiction is then rea out on the output units. This correspons to the first phase of the algorithm. In the secon phase of the training framework, when the target values for output units are observe, the outputs are nuge towars their targets an the network relaxes to a new but nearby fixe point which correspons to slightly smaller preiction error. The learning rule, which is prove to perform graient escent on the square error, is a kin of contrastive Hebbian learning rule in which we learn make more probable the secon-phase fixe point by reucing its energy an unlearn make less probable the first-phase fixe point by increasing its energy. However, our learning rule is not the usual contrastive Hebbian learning rule an it also iffers from Boltzmann machine learning rules, as iscusse in sections 4.1 an 4.2. During the secon phase, the perturbation cause at the outputs propagates across hien layers in the network. Because the propagation goes from outputs backwar in the network, it is better thought of as a back-propagation. It is *Y.B. is also a Senior Fellow of CIFAR 1

2 shown by Bengio an Fischer 215; Bengio et al. 217 that the early change of neural activities in the secon phase correspons to the propagation of error erivatives with respect to neural activities. Our contribution in this paper is to go beyon the early change of neural activities an to show that the secon phase also implements the back-propagation of error erivatives with respect to the synaptic weights, an that this upate correspons to a form of spike-timing epenent plasticity, using the results of Bengio et al In section 3, we present the general formulation of Equilibrium Propagation: a new machine learning framework for energy-base moels. This framework applies to a whole class of energy base moels, which is not limite to the continuous Hopfiel moel but encompasses arbitrary ynamics whose fixe points or stationary istributions correspon to minima of an energy function. In section 4, we compare our algorithm to the existing learning algorithms for energy-base moels. The recurrent back-propagation algorithm introuce by Pinea 1987; Almeia 1987 optimizes the same objective function as ours but it involves a ifferent kin of neural computation in the secon phase of training, which is not satisfying from a biological perspective. The contrastive Hebbian learning rule for continuous Hopfiel nets escribe by Movellan 199 suffers from theoretical problems: learning may eteriorate when the free phase an clampe phase lan in ifferent moes of the energy function. The Contrastive Divergence algorithm Hinton, 22 has theoretical issues too: the CD 1 upate rule may cycle inefinitely Sutskever an Tieleman, 21. The equivalence of back-propagation an contrastive Hebbian learning was shown by Xie an Seung 23 but at the cost of extra assumptions: their moel requires infinitesimal feeback weights an exponentially growing learning rates for remote layers. Equilibrium Propagation solves all these theoretical issues at once. Our algorithm computes the graient of a soun objective function that correspons to local perturbations. It can be realize with leaky integrator neural computation which performs both inference in the first phase an back-propagation of error erivatives in the secon phase. Furthermore, we o not nee extra hypotheses such as those require by Xie an Seung 23. Note that algorithms relate to ours base on infinitesimal perturbations at the outputs were also propose by O Reilly 1996; Hertz et al Finally, we show experimentally in section 5 that our moel is trainable. We train recurrent neural networks with 1, 2 an 3 hien layers on MNIST an we achieve.% training error. The generalization error lies between 2% an 3% epening on the architecture. The coe for the moel is available 1 for replicating an extening the experiments. 2 The Continuous Hopfiel Moel Revisite: Equilibrium Propagation as a More Biologically Plausible Backpropagation In this section we revisit the continuous Hopfiel moel Hopfiel, 1984 an introuce Equilibrium Propagation, a novel learning algorithm to train it. Equilibrium Propagation is similar in spirit to Backpropagation in the sense that it involves the propagation of a signal from output units to input units backwar in the network, uring the secon phase of training. Unlike Backpropagation, Equilibrium Propagation requires only one kin of neural computations for both phases of training, making it more biologically plausible than Backpropagation. However, several points still nee to be eluciate from a biological perspective. Perhaps the most important of them is that the moel escribe in this section requires symmetric weights, a question iscusse at the en of this paper. 2.1 A Kin of Hopfiel Energy Previous work Hinton an Sejnowski, 1986; Friston an Stephan, 27; Berkes et al., 211 has hypothesize that, given a state of sensory information, neurons are collectively performing inference: they are moving towars configurations that better explain the observe sensory ata. We can think of the neurons configuration as an explanation or interpretation for the observe sensory ata. In the energy-base moel presente here, that means that the units of the network graually move towars lower energy configurations that are more probable, given the sensory input an accoring to the current "moel of the worl" associate with the parameters of the moel. We enote by u the set of units of the network 2, an by θ = W, b the set of free parameters to be learne, which inclues the synaptic weights W ij an the neuron biases b i. The units are continuous-value an woul correspon to average voltage potential across time, spikes, an possibly neurons in the same minicolumn. Finally, ρ is a nonlinear activation function such that ρu i represents the firing rate of unit i. We consier the following energy function E, a kin of Hopfiel energy, first stuie by Bengio an Fischer 215; Bengio et al. 215a,b, 217: Eu := 1 u 2 i 1 W ijρu iρu j 2 2 i i j i For reasons of convenience, we use the same symbol u to mean both the set of units an the value of those units b iρu i. 1 2

However, to make the connection to backpropagation more obvious, we will consier more specifically a layere architecture with no skip-layer connections an no lateral connections within a layer Figure

3 Note that the network is recurrently connecte with symmetric connections, that is W ij = W ji. The algorithm presente here is applicable to any architecture so long as connections are symmetric, even a fully connecte network. However, to make the connection to backpropagation more obvious, we will consier more specifically a layere architecture with no skip-layer connections an no lateral connections within a layer Figure 1. In the supervise setting stuie here, the units of the network are split in three sets: the inputs x which are always clampe just like in other moels such as the conitional Boltzmann machine, the hien units h which may themselves be split in several layers an the output units y. We use the notation y for the targets, which shoul not be confuse with the outputs y. The set of all units in the network is u = {x, h, y}. As usual in the supervise learning scenario, the output units y aim to replicate their targets y. The iscrepancy between the output units y an the targets y is measure by the quaratic cost function C := 1 2 y y 2. 2 The cost function C also acts as an external potential energy for the output units, which can rive them towars their target. A novelty in our work, with respect to previously stuie energy-base moels, is that we introuce the total energy function F, which takes the form F := E + βc, 3 where β is a real-value scalar that controls whether the output y is pushe towars the target y or not, an by how much. We call β the influence parameter or clamping factor. The total energy F is the sum of two potential energies: the internal potential E that moels the interactions within the network, an the external potential βc that moels how the targets influence the output units. Contrary to Boltzmann Machines where the visible units are either free or fully clampe, here the real-value parameter β allows the output units to be weakly clampe. 2.2 The Neuronal Dynamics Figure 1: The input units x are always clampe. The state variable s inclues the hien units h an output units y. The targets are enote by y. The network is recurrently connecte with symmetric connections. Left. Equilibrium Propagation applies to any architecture, even a fully connecte network. Right. The connection with Backpropagation is more obvious when the network has a layere architecture. 3

4 We enote the state variable of the network by s = {h, y} which oes not inclue the input units x since they are always clampe. We assume that the time evolution of the state variable s is governe by the graient ynamics s t = F. 4 Unlike more conventional artificial neural networks, the moel stuie here is a continuous-time ynamical system escribe by the ifferential equation of motion Eq. 4. The total energy of the system ecreases as time progresses θ, β, x an y being fixe since F t = F s t = s 2 t. 5 The energy stops ecreasing when the network has reache a fixe point: F t = s t = F =. 6 The ifferential equation of motion Eq. 4 can be seen as a sum of two forces that act on the temporal erivative of s: s t = E β C. 7 The internal force inuce by the internal potential the Hopfiel energy, Eq. 1 on the i-th unit is E = ρ s i W ijρu j + b i s i, 8 i j i while the external force inuce by the external potential Eq. 2 on h i an y i is respectively β C h i = an β C y i = βy i y i. 9 The form of Eq. 8 is reminiscent of a leaky integrator neuron moel, in which neurons are seen as performing leaky temporal integration of their past inputs. Note that the hypothesis of symmetric connections W ij = W ji was use to erive Eq. 8. As iscusse in Bengio an Fischer 215, the factor ρ s i woul suggest that when a neuron is saturate firing at the maximal or minimal rate so that ρ s i, its state is not sensitive to external inputs, while the leaky term rives it out of the saturation regime, towars its resting value s i =. The form of Eq. 9 suggests that when β =, the output units are not sensitive to the outsie worl y. In this case we say that the network is in the free phase or first phase. On the contrary, when β >, the external force rives the output unit y i towars the target y i. In this case, we say that the network is in the weakly clampe phase or secon phase. Finally, a more likely ynamics woul inclue some form of noise. The notion of fixe point is then replace by that of stationary istribution. In Appenix C, we present a stochastic framework that naturally extens the analysis presente here. 2.3 Free Phase, Weakly Clampe Phase an Backpropagation of Errors In the first phase of training, the inputs are clampe an β = the output units are free. We call this phase the free phase an the state towars which the network converges is the free fixe point u. The preiction is rea out on the output units y at the fixe point. In the secon phase which we call weakly clampe phase, the influence parameter β is change to a small positive value β > an the novel term βc ae to the energy function Eq. 3 inuces a new external force that acts on the output units Eq. 9. This force moels the observation of y: it nuges the output units from their free fixe point value in the irection of their target. Since this force only acts on the output units, the hien units are initially at equilibrium at the beginning of the weakly clampe phase, but the perturbation cause at the output units will propagate in the hien units as time progresses. When the architecture is a multi-layere net Figure 1. Right, the perturbation at the output layer propagates backwars across the hien layers of the network. This propagation is thus better thought of as a backpropagation. The net eventually settles to a new fixe point corresponing to the new positive value of β which we call weakly clampe fixe point an enote by u β. Remarkably, the perturbation that is back-propagate uring the secon phase correspons to the propagation of error erivatives. It was first shown by Bengio an Fischer 215 that, starting from the free fixe point, the early changes of neural activities uring the weakly clampe phase cause by the output units moving towars their target 4

5 approximate some kin of error erivatives with respect to the layers activities. They consiere a regular multi-layer neural network with no skip-layer connections an no lateral connections within a layer. In this paper, we show that the weakly clampe phase also implements the back-propagation of error erivatives with respect to the synaptic weights. In the limit β, the upate rule W ij 1 ρ u β i ρ u β j ρ u i ρ u j 1 β gives rise to stochastic graient escent on J := 1 2 y y 2, where y is the state of the output units at the free fixe point. We will state an prove this theorem in a more general setting in section 3. In particular, this result hols for any architecture an not just a layere architecture Figure 1 like the one consiere by Bengio an Fischer 215. The learning rule Eq. 1 is a kin of contrastive Hebbian learning rule, somewhat similar to the one stuie by Movellan 199 an the Boltzmann machine learning rule. The ifferences with these algorithms will be iscusse in section 4. We call our learning algorithm Equilibrium Propagation. In this algorithm, leaky integrator neural computation as escribe in section 2.2, performs both inference in the free phase an error back-propagation in the weakly clampe phase. 2.4 Connection to Spike-Timing Depenent Plasticity Spike-Timing Depenent Plasticity STDP is believe to be a prominent form of synaptic change in neurons Markram an Sakmann, 1995; Gerstner et al., 1996, an see Markram et al. 212 for a review. The STDP observations relate the expecte change in synaptic weights to the timing ifference between postsynaptic spikes an presynaptic spikes. This is the result of experimental observations in biological neurons, but its role as part of a learning algorithm remains a topic where more exploration is neee. Here is an attempt in this irection. Experimental results by Bengio et al. 215a show that if the temporal erivative of the synaptic weight W ij satisfies W ij t ρu i uj t, 11 then one recovers the experimental observations by Bi an Poo 21 in biological neurons. Xie an Seung 2 have stuie the learning rule W ij ρu ρuj i. 12 t t Note that the two rules Eq. 11 an Eq. 12 are the same up to a factor ρ u j. An avantage of Eq. 12 is that it leas to a more natural view of the upate rule in the case of tie weights stuie here W ij = W ji. Inee, the upate shoul take into account the pressures from both the i to j an j to i synapses, so that the total upate uner constraint is W ij t ρu i ρuj t + ρu j ρui t = ρuiρuj. 13 t By integrating Eq. 13 on the path from the free fixe point u to the weakly clampe fixe point u β uring the secon phase, we get Wij W ij t = t t ρuiρujt = ρu iρu j = ρ u β i ρ u β j ρ u i ρ u j, 14 which is the same learning rule as Eq. 1 up to a factor 1/β. Therefore the upate rule Eq. 1 can be interprete as a continuous-time integration of Eq. 12, in the case of symmetric weights, on the path from u to u β uring the secon phase. We propose two possible interpretations for the synaptic plasticity in our moel. First view. In the first phase, a anti-hebbian upate occurs at the free fixe point W ij ρ u i ρ u j. In the secon phase, a Hebbian upate occurs at the weakly-clampe fixe point W ij +ρ u β i ρ u β j. Secon view. In the first phase, no synaptic upate occurs: W ij =. In the secon phase, when the neurons state move from the free fixe point u to the weakly-clampe fixe point u β, the synaptic weights follow the "tie version" of the continuous-time upate rule W ij ρui ρu j + ρu t t t j ρu i. t 5

6 3 A Machine Learning Framework for Energy Base Moels In this section we generalize the setting presente in section 2. We lay own the basis for a new machine learning framework for energy-base moels, in which Equilibrium Propagation plays a role analog to Backpropagation in computational graphs to compute the graient of an objective function. Just like the Multi Layer Perceptron is the prototype of computational graphs in which Backpropagation is applicable, the continuous Hopfiel moel presente in section 2 appears to be the prototype of moels which can be traine with Equilibrium Propagation. In our new machine learning framework, the central object is the total energy function F : all quantities of interest fixe points, cost function, objective function, graient formula can be efine or formulate irectly in terms of F. Besies, in our framework, the preiction or fixe point is efine implicitly in terms of the ata point an the parameters of the moel, rather than explicitly like in a computational graph. This implicit efinition makes applications on igital harware such as GPUs less practical as it nees long inference phases involving numerical optimization of the energy function. But we expect that this framework coul be very efficient if implemente by analog circuits, as suggeste by Hertz et al The framework presente in this section is eterministic, but a natural extension to the stochastic case is presente in Appenix C. 3.1 Training Objective In this section, we present the general framework while making sure to be consistent with the notations an terminology introuce in section 2. We enote by s the state variable of the network, v the state of the external worl i.e. the ata point being processe, an θ the set of free parameters to be learne. The variables s, v an θ are real-value vectors. The state variable s spontaneously moves towars low-energy configurations of an energy function Eθ, v, s. Besies that, a cost function Cθ, v, s measures how goo is a state is. The goal is to make low-energy configurations have low cost value. For fixe θ an v, we enote by s θ,v a local minimum of E, also calle fixe point, which correspons to the preiction from the moel: s θ,v arg min Eθ, v, s. 15 s Here we use the notation arg min to refer to the set of local minima. The objective function that we want to optimize is Jθ, v := C θ, v, s θ,v. 16 Note the istinction between the cost function C an the objective function J: the cost function is efine for any state s, whereas the objective function is the cost associate to the fixe point s θ,v. Now that the objective function has been introuce, we efine the training objective for a single ata point v as fin arg min θ Jθ, v. 17 A formula to compute the graient of J will be given in section 3.3 Theorem 1. Equivalently, the training objective can be reformulate as the following constraine optimization problem: fin subject to arg min Cθ, v, s 18 θ,s E θ, v, s =, 19 where the constraint E θ, v, s = is the fixe point conition. For completeness, a solution to this constraine optimization problem is given in Appenix B as well. Of course, both formulations of the training objective lea to the same graient upate on θ. Note that, since the cost Cθ, v, s may epen on θ, it can inclue a regularization term of the form λ Ω θ, where Ω θ is a L 1 or L 2 norm penalty for example. In section 2 we ha s = {h, y} for the state variable, v = {x, y} for the state of the outsie worl, θ = W, b for the set of learne parameters, an the energy function E an cost function C were of the form E θ, v, s = E θ, x, h, y an C θ, v, s = C y, y. 3.2 Total Energy Function Following section 2, we introuce the total energy function F θ, v, β, s := Eθ, v, s + β Cθ, v, s, 2 6

7 where β is a real-value scalar calle influence parameter. Then we exten the notion of fixe point for any value of β. The fixe point or energy minimum, enote by s β θ,v, is characterize by F θ, v, β, s β θ,v = 21 an 2 F θ, v, β, s β 2 θ,v is a symmetric positive efinite matrix. Uner mil regularity conitions on F, the implicit function theorem ensures that, for fixe v, the funtion θ, β s β θ,v is ifferentiable. 3.3 The Learning Algorithm: Equilibrium Propagation Theorem 1 Deterministic version. The graient of the objective function with respect to θ is given by the formula J 1 F θ, v = lim θ, v, β, s β θ,v F θ, v,, s θ,v, 22 β β or equivalently J C θ, v = θ, v, s 1 E θ,v + lim θ, v, s β θ,v E θ, v, s θ,v. 23 β β Theorem 1 will be prove in Appenix A. Note that the parameter β in Theorem 1 nee not be positive We only nee β. Using the terminology introuce in section 2, we call s θ,v the free fixe point, an s β θ,v the nuge fixe point or weakly-clampe fixe point in the case β >. Moreover, we call a free phase resp. nuge phase, or weakly-clampe phase a proceure that yiels a free fixe point resp. nuge fixe point, or weakly-clampe fixe point by minimizing the energy function F with respect to s, for β = resp. β. Theorem 1 suggests the following two-phase training proceure. Given a ata point v: 1. Run a free phase until the system settles to a free fixe point s θ,v an collect F θ, v,, s θ,v. 2. Run a nuge phase for some β such that β is "small", until the system settles to a nuge fixe point s β θ,v an collect F θ, v, β, s β θ,v. 3. Upate the parameter θ accoring to θ 1 β F θ, v, β, s β θ,v F θ, v,, s θ,v. 24 Consier the case β >. Starting from the free fixe point s θ,v which correspons to the preiction, a small change of the parameter β from the value β = to a value β > causes slight moifications in the interactions in the network. This small perturbation makes the network settle to a nearby weakly-clampe fixe point s β θ,v. Simultaneously, a kin of contrastive upate rule for θ is happening, in which the energy of the free fixe point is increase an the energy of the weakly-clampe fixe point is ecrease. We call this learning algorithm Equilibrium Propagation. Note that in the setting introuce in section 2.1 the total energy function Eq. 3 is such that F W ij = ρu iρu j, in agreement with Eq. 1. In the weakly clampe phase, the novel term 1 2 β y y 2 ae to the energy E with β > slightly attracts the output state y to the target y. Clearly, the weakly clampe fixe point is better than the free fixe point in terms of preiction error. The following proposition generalizes this property to the general setting. Proposition 2 Deterministic version. The erivative of the function β C θ, v, s β θ,v at β = is non-positive. Proposition 2 will also be prove in Appenix A. This proposition shows that, unless the free fixe point s θ,v is alreay optimal in terms of cost value, for β > small enough, the weakly-clampe fixe point s β θ,v achieves lower cost value than the free fixe point. Thus, a small perturbation ue to a small increment of β woul nuge the system towars a state that reuces the cost value. This property shes light on the upate rule Theorem 1, which can be seen as a kin of contrastive learning rule somehow similar to the Boltzmann machine learning rule where we learn make more probable the slightly better state s β θ,v by reucing its energy an unlearn make less probable the slightly worse state s θ,v by increasing its energy. However, our learning rule is ifferent from the Boltzmann machine learning rule an the contrastive Hebbian learning rule. The ifferences between these algorithms will be iscusse in section

The graient of the objective function is also compute analytically thanks to the Backpropagation algorithm a.k.a automatic ifferentiation. Right.

8 Figure 2: Comparison between the traitional framework for Deep Learning an our framework. Left. In the traitional framework, the state of the network f θ v an the objective function Jθ, v are explicit functions of θ an v an are compute analytically. The graient of the objective function is also compute analytically thanks to the Backpropagation algorithm a.k.a automatic ifferentiation. Right. In our framework, the free fixe point s θ,v is an implicit function of θ an v an is compute numerically. The nuge fixe point s β θ,v an the graient of the objective function are also compute numerically, following our learning algorithm: Equilibrium Propagation. 3.4 Another View of the Framework In sections 3.1 an 3.2 as well as in section 2 we first efine the energy function E an the cost function C, an then we introuce the total energy F := E + βc. Here we propose an alternative view of the framework, where we reverse the orer in which things are efine. Given a total energy function F which moels all interactions within the network as well as the action of the external worl on the network, we can efine all quantities of interest in terms of F. Inee, we can efine the energy function E an the cost function C as Eθ, v, s := F θ, v,, s an Cθ, v, s := F θ, v,, s, 26 where F an F are evaluate with the argument β set to. Obviously the fixe points s θ,v an s β θ,v are irectly efine in terms of F, an so is the objective function Jθ, v := C θ, v, sθ,v. The learning algorithm Theorem 1 is also formulate in terms of F. 3 From this perspective, F contains all the information about the moel an can be seen as the central object of the framework. For instance, the cost C represents the marginal variation of the total energy F ue to a change of β. 3 The proof presente in Appenix A will show that E, C an F nee not satisfy Eq. 2 but only Eq

9 As a comparison, in the traitional framework for Deep Learning, a moel is represente by a ifferentiable computational graph in which each noe is efine as a function of its parents. The set of functions that efine the noes fully specifies the moel. The last noe of the computational graph represents the cost to be optimize, while the other noes represent the state of the layers of the network, as well as other intermeiate computations. In the framework for machine learning propose here the framework suite for Equilibrium Propagation, the analog of the set of functions that efine the noes in the computational graph is the total energy function F. 3.5 Backpropagation Vs Equilibrium Propagation In the traitional framework for Deep Learning Figure 2, left, each noe in the computational graph is an explicit ifferentiable function of its parents. The state of the network ŝ = f θ v an the objective function Jθ, v = C θ, v, f θ v are compute analytically, as functions of θ an v, in the forwar pass. The Backpropagation algorithm a.k.a automatic ifferentiation enables to compute the error erivatives analytically too, in the backwar pass. Therefore, the state of the network ŝ = f θ v forwar pass an the graient of the objective function J θ, v backwar pass can be compute efficiently an exactly. 4 In the framework for machine learning that we propose here Figure 2, right, the free fixe point ŝ = s θ,v is an implicit function of θ an v, characterize by E θ, v, s θ,v =. The free fixe point is compute numerically, in the free phase first phase. Similarly the nuge fixe point s β θ,v is an implicit function of θ, v an β, an is compute numerically in the nuge phase secon phase. Equilibrium Propagation estimates for the particular value of β chosen in the secon phase the graient of the objective function J θ, v base on these two fixe points. The requirement for numerical optimization in the first an secon phases make computations inefficient an approximate. The experiments in section 5 will show that the free phase is fairly long when performe with a iscrete-time computer simulation. However, we expect that the full potential of the propose framework coul be exploite on analog harware instea of igital harware, as suggeste by Hertz et al Relate Work In section 2.3 we have iscusse the relationship between Equilibrium Propagation an Backpropagation. In the weakly clampe phase, the change of the influence parameter β creates a perturbation at the output layer which propagates backwars in the hien layers. The error erivatives an the graient of the objective function are encoe by this perturbation. In this section we iscuss the connection between our work an other algorihms, starting with Contrastive Hebbian Learning. Equilibrium Propagation offers a new perspective on the relationship between Backpropagation in feeforwar nets an Contrastive Hebbian Learning in Hopfiel nets an Boltzmann machines Table 1. Backprop Equilibrium Prop Contrastive Hebbian Learning Almeia-Pineia First Phase Forwar Pass Free Phase Free Phase or Negative Phase Free Phase Secon Phase Backwar Pass Weakly Clampe Phase Clampe Phase or Positive Phase Recurrent Backprop Table 1: Corresponence of the phases for ifferent learning algorithms: Back-propagation, Equilibrium Propagation our algorithm, Contrastive Hebbian Learning an Boltzmann Machine Learning an Almeia-Pineia s Recurrent Back-Propagation 4.1 Link to Contrastive Hebbian Learning Despite the similarity between our learning rule an the Contrastive Hebbian Learning rule CHL for the continuous Hopfiel moel, there are important ifferences. First, recall that our learning rule is 1 W ij lim ρ u β i ρ u β j ρ u i ρ u j, 27 β β where u is the free fixe point an u β is the weakly clampe fixe point. The Contrastive Hebbian Learning rule is W ij ρ u i ρ u j ρ u i ρ u j, 28 4 Here we are not consiering numerical stability issues ue to the encoing of real numbers with finite precision. 9

10 where u is the fully clampe fixe point i.e. fixe point with fully clampe outputs. We choose the notation u for the fully clampe fixe point because it correspons to β + with the notations of our moel. Inee Eq. 9 shows that in the limit β +, the output unit y i moves infinitely fast towars y i, so y i is immeiately clampe to y i an is no longer sensitive to the internal force Eq. 8. Another way to see it is by consiering Eq. 3: as β +, the only value of y that gives finite energy is y. The objective functions that these two algorithms optimize also iffer. Recalling the form of the Hopfiel energy Eq. 1 an the cost function Eq. 2, Equilibrium Propagation computes the graient of J = 1 2 y y 2, 29 where y is the output state at the free phase fixe point u, while CHL computes the graient of J CHL = E u E u. 3 The objective function for CHL has theoretical problems: it may take negative values if the clampe phase an free phase stabilize in ifferent moes of the energy function, in which case the weight upate is inconsistent an learning usually eteriorates, as pointe out by Movellan 199. Our objective function oes not suffer from this problem, because it is efine in terms of local perturbations, an the implicit function theorem guarantees that the weakly clampe fixe point will be close to the free fixe point thus in the same moe of the energy function. We can also reformulate the learning rules an objective functions of these algorithms using the notations of the general setting section 3. For Equilibrium Propagation we have 1 θ lim β β F θ, v, β, s β θ,v F θ, v,, s θ,v an Jθ, v = F θ, v,, s θ,v. 31 As for Contrastive Hebbian Learning, one has F θ θ, v,, s F θ,v θ, v,, s θ,v an J CHLθ, v = F θ, v,, s θ,v F θ, v,, s θ,v, 32 where β = an β = are the values of β corresponing to free an fully clampe outputs respectively. Our learning algorithm is also more flexible because we are free to choose the cost function C as well as the energy funtion E, whereas the contrastive function that CHL optimizes is fully etermine by the energy function E. 4.2 Link to Boltzmann Machine Learning Again, the log-likelihoo that the Boltzmann machine optimizes is etermine by the Hopfiel energy E, whereas we have the freeom to choose the cost function in the framework for Equilibrium Propagation. As iscusse in Section 2.3, the secon phase of Equilibrium Propagation going from the free fixe point to the weakly clampe fixe point can be seen as a brief backpropagation phase with weakly clampe target outputs. By contrast, in the positive phase of the Boltzmann machine, the target is fully clampe, so the correct version of the Boltzmann machine learning rule requires two separate an inepenent phases Markov chains, making an analogy with backprop less obvious. Our algorithm is also similar in spirit to the CD algorithm Contrastive Divergence for Boltzmann machines. In our moel, we start from a free fixe point which requires a long relaxation in the free phase an then we run a short weakly clampe phase. In the CD algorithm, one starts from a positive equilibrium sample with the visible units clampe which requires a long positive phase Markov chain in the case of a general Boltzmann machine an then one runs a short negative phase. But there is an important ifference: our algorithm computes the correct graient of our objective function in the limit β, whereas the CD algorithm computes a biase estimator of the graient of the log-likelihoo. The CD 1 upate rule is provably not the graient of any objective function an may cycle inefinitely in some pathological cases Sutskever an Tieleman, 21. Finally, in the supervise setting presente in Section 2, a more subtle ifference with the Boltzmann machine is that the output state y in our moel is best thought of as being part of the latent state variable s. If we were to make an analogy with the Boltzmann machine, the visible units of the Boltzmann machine woul be v = {x, y}, while the hien units woul be s = {h, y}. In the Boltzmann machine, the state of the external worl is inferre irectly on the visible units because it is a probabilistic generative moel that maximizes the log-likelyhoo of the ata, whereas in our moel we make the choice to integrate in s special latent variables y that aim to match the target y. 1

11 4.3 Link to Recurrent Back-Propagation Directly connecte to our moel is the work by Pinea 1987; Almeia 1987 on recurrent back-propagation. They consier the same objective function as ours, but formulate the problem as a constraine optimization problem. In Appenix B we erive another proof for the learning rule Theorem 1 with the Lagrangian formalism for constraine optimization problems. The beginning of this proof is in essence the same as the one propose by Pinea 1987; Almeia 1987, but there is a major ifference when it comes to solving Eq. 75 for the costate variable λ. The metho propose by Pinea 1987; Almeia 1987 is to use Eq. 75 to compute λ by a fixe point iteration in a linearize form of the recurrent network. The computation of λ correspons to their secon phase, which they call recurrent back-propagation. However, this secon phase oes not follow the same kin of ynamics as the first phase the free phase because it uses a linearization of the neural activation rather than the fully non-linear activation. 5 From a biological plausibility point of view, having to use a ifferent kin of harware an computation for the two phases is not satisfying. By contrast, like the continuous Hopfiel net an the Boltzmann machine, our moel involves only one kin of neural computations for both phases. 4.4 The Moel by Xie & Seung Previous work on the back-propagation interpretation of contrastive Hebbian learning was one by Xie an Seung 23. The moel by Xie an Seung 23 is a moifie version of the Hopfiel moel. They consier the case of a layere MLP-like network, but their moel can be extene to a more general connectivity, as shown here. In essence, using the notations of our moel section 2, the energy function that they consier is E X&S u := 1 γ i u 2 i γ j W ijρu iρu j 2 i<j i i γ i b iρu i. 33 The ifference with Eq. 1 is that they introuce a parameter γ, assume to be small, that scales the strength of the connections. Their upate rule is the contrastive Hebbian learning rule which, for this particular energy function, takes the form EX&S W ij u E X&S u = γ j ρ u i ρ u j ρ u i ρ u j 34 W ij W ij for every pair of inices i, j such that i < j. Here u an u are the fully clampe fixe point an free fixe point respectively. Xie an Seung 23 show that in the regime γ this contrastive Hebbian learning rule is equivalent to back-propagation. At the free fixe point u, one has E X&S i u = for every unit s 6 i, which yiels, after iviing by γ i an rearranging the terms s i = ρ s i W ijρ u j + γ j i W ijρ u j + bi. 35 j<i In the limit γ, one gets s i ρ s i j<i Wijρu j + b i, so that the network almost behaves like a feeforwar net in this regime. As a comparison, recall that in our moel section 2 the energy function is the learning rule is 1 W ij lim β β j>i Eu := 1 u 2 i W ijρu iρu j 2 i i<j i E W ij u β b iρu i, 36 E u 1 = lim ρ u β i ρ u β j ρ u i ρ u j, 37 W ij β β an at the free fixe point, we have E i u = for every unit s i, which gives s i = ρ s i W ijρ u j + bi. 38 j i Here are the main ifferences between our moel an theirs. In our moel, the feeforwar an feeback connections are both strong. In their moel, the feeback weights are tiny compare to the feeforwar weights, which makes the 5 Reccurent Back-propagation correspons to Back-propagation Through Time BPTT when the network converges an remains at the fixe point for a large number of time steps. 6 Recall that in our notations, the state variable s oes not inclue the clampe inputs x, whereas u inclues x. 11

12 recurrent computations look almost feeforwar. In our secon phase, the outputs are weakly clampe. In their secon phase, they are fully clampe. The theory of our moel requires a unique learning rate for the weights, while in their moel the upate rule for W ij with i < j is scale by a factor γ j see Eq. 34. Since γ is small, the learning rates for the weights vary on many orers of magnitue in their moel. Intuitively, these multiple learning rates are require to compensate for the small feeback weights. 5 Implementation of the Moel an Experimental Results In this section, we provie experimental evience that our moel escribe in section 2 is trainable, by testing it on the classification task of MNIST igits LeCun an Cortes, The MNIST ataset of hanwritten igits consists of 6, training examples an 1, test examples. Each example x in the ataset is a gray-scale image of 28 by 28 pixels an comes with a label y {, 1,..., 9}. We use the same notation y for the one-hot encoing of the target, which is a 1-imensional vector. Recall that our moel is a recurrently connecte neural network with symmetric connections. Here, we train multilayere networks with 1, 2 an 3 hien layers, with no skip-layer connections an no lateral connections within layers. Although we believe that analog harware woul be more suite for our moel, here we propose an implementation on igital harware a GPU. We achieve.% training error. The generalization error lies between 2% an 3% epening on the architecture Figure 3. For each training example x, y in the ataset, training procees as follows: 1. Clamp x. 2. Run the free phase until the hien an output units settle to the free fixe point, an collect ρ u i ρ u j for every pair of units i, j. 3. Run the weakly clampe phase with a "small" β > until the hien an output units settle to the weakly clampe fixe point, an collect ρ u β i ρ u β j. 4. Upate each synapse W ij accoring to W ij 1 β ρ u β i ρ u β j ρ u i ρ u j. 39 The preiction is mae at the free fixe point u at the en of the first phase relaxation. The preicte value y pre is the inex of the output unit whose activation is maximal among the 1 output units: y pre := arg max i y i. 4 Note that no constraint is impose on the activations of the units of the output layer in our moel, unlike more traitional neural networks where a softmax output layer is use to constrain them to sum up to 1. Recall that the objective function that we minimize is the square of the ifference between our preiction an the one-hot encoing of the target value: J = 1 2 y y Finite Differences Implementation of the ifferential equation of motion. First we clamp x. Then the obvious way to implement Eq. 4 is to iscretize time into short time lapses of uration ɛ an to upate each hien an output unit s i accoring to s i s i ɛ F i θ, v, β, s. 42 This is simply one step of graient escent on the total energy F, with step size ɛ. For our experiments, we choose the har sigmoi activation function ρs i = s i 1, where enotes the max an the min. For this choice of ρ, since ρ s i = for s i <, it follows from Eq. 8 an Eq. 9 that if h i < then F h i θ, v, β, s = h i >. This force prevents the hien unit h i from going in the range of negative values. The same is true for the output units. Similarly, s i cannot reach values above 1. As a consequence s i must remain in the omain s i 1. Therefore, rather than the stanar graient escent Eq. 42, we will use a slightly ifferent upate rule for the state variable s: s i s i ɛ F θ, v, β, s i 12

13 Figure 3: Training an valiation error for neural networks with 1 hien layer of 5 units top left, 2 hien layers of 5 units top right, an 3 hien layers of 5 units bottom. The training error eventually ecreases to.% in all three cases. This little implementation etail turns out to be very important: if the i-th hien unit was in some state h i <, then Eq. 42 woul give the upate rule h i 1 ɛh i, which woul imply again h i < at the next time step assuming ɛ < 1. As a consequence h i woul remain in the negative range forever. Choice of the step size ɛ. We fin experimentally that the choice of ɛ has little influence as long as < ɛ < 1. What matters more is the total uration of the relaxation t = n iter ɛ where n iter is the number of iterations. In our experiments we choose ɛ =.5 to keep n iter = t/ɛ as small as possible so as to avoi extra unnecessary computations. Duration of the free phase relaxation. We fin experimentally that the number of iterations require in the free phase to reach the free fixe point is large an grows fast as the number of layers increases Table 2, which consierably slows own training. More experimental an theoretical investigation woul be neee to analyze the number of iterations require, but we leave that for future work. Duration of the weakly clampe phase. During the weakly clampe phase, we observe that the relaxation to the weakly clampe fixe point is not necessary. We only nee to initiate the movement of the units, an for that we use the following heuristic. Notice that the time constant of the integration process in the leaky integrator equation Eq. 8 is τ = 1. This time constant represents the time neee for a signal to propagate from a layer to the next one with "significant amplitue". So the time neee for the error signals to back-propagate in the network is Nτ = N, where N is the number of layers hiens an output of the network. Thus, we choose to perform N/ɛ iterations with step size ɛ = Implementation Details an Experimental Results To tackle the problem of the long free phase relaxation an spee-up the simulations, we use persistent particles for the latent variables to re-use the previous fixe point configuration for a particular example as a starting point for the next free phase relaxation on that example. This means that for each training example in the ataset, we store the state of the hien layers at the en of the free phase, an we use this to initialize the state of the network at the next epoch. This metho is similar in spirit to the PCD algorithm Persistent Contrastive Divergence for sampling from other energy-base moels like the Boltzmann machine Tieleman, 28. We fin that it helps regularize the network if we choose the sign of β at ranom in the secon phase. Note that the 13

14 weight upates remain consistent thanks to the factor 1/β in the upate rule W ij 1 ρ β Inee, the left-erivative an the right-erivative of the function β ρ u β i ρ u β j u β i ρ u β j ρ u i ρ u j. at the point β = coincie. Although the theory presente in this paper requires a unique learning rate for all synaptic weights, in our experiments we nee to choose ifferent learning rates for the weight matrices of ifferent layers to make the algorithm work. We o not have a clear explanation for this fact yet, but we believe that this is ue to the finite precision with which we approach the fixe points. Inee, the theory requires to be exactly at the fixe points, but in practice we minimize the energy function by numerical optimization, using Eq. 43. The precision with which we approach the fixe points epens on hyperparameters such as the step size ɛ an the number of iterations n iter. Let us enote by h, h 1,, h N the layers of the network where h = x an h N = y an by W k the weight matrix between the layers h k 1 an h k. We choose the learning rate α k for W k so that the quantities W k W k for k = 1,, N are approximately the same in average over training examples, where W k represents the weight change of W k after seeing a minibatch. The hyperparameters chosen for each moel are shown in Table 2 an the results are shown in Figure 3. We initialize the weights accoring to the Glorot-Bengio initialization Glorot an Bengio, 21. For efficiency of the experiments, we use minibatches of 2 training examples. Architecture Iterations Iterations ɛ β α 1 α 2 α 3 α 4 first phase secon phase Table 2: Hyperparameters. The learning rate ɛ is use for iterative inference Eq. 43. β is the value of the clamping factor in the secon phase. α k is the learning rate for upating the parameters in layer k. 6 Discussion, Looking Forwar From a biological perspective, a troubling issue in the Hopfiel moel is the requirement of symmetric weights between the units. Note that the units in our moel nee not correspon exactly to actual neurons in the brain it coul be groups of neurons in a cortical microcircuit, for example. It remains to be shown how a form of symmetry coul arise from the learning proceure itself for example from autoencoer-like unsupervise learning or if a ifferent formulation coul eliminate the symmetry requirement. Encouraging cues come from the observation that enoising autoencoers without tie weights often en up learning symmetric weights Vincent et al., 21. Another encouraging piece of evience, also linke to autoencoers, is the theoretical result from Arora et al. 215, showing that the symmetric solution minimizes the autoencoer reconstruction error between two successive layers of rectifying ReLU units, suggesting that symmetry may arise as the result of an aitional objective function making successive layers form an autoencoer. Also, Lillicrap et al. 214 show that the backpropagation algorithm for feeforwar nets also works when the feeback weights are ranom, an that in this case the feeforwar weight ten to align with the feeback weights. Another practical issue is that we woul like to reuce the negative impact of a lengthy relaxation to a fixe point, especially in the free phase. A possibility is explore by Bengio et al. 216 an was initially iscusse by Salakhutinov an Hinton 29 in the context of a stack of RBMs: by making each layer a goo autoencoer, it is possible to make this iterative inference converge quickly after an initial feeforwar phase, because the feeback paths agree with the states alreay compute in the feeforwar phase. Regaring synaptic plasticity, the propose upate formula can be contraste with theoretical synaptic learning rules which are base on the Hebbian prouct of pre- an postsynaptic activity, such as the BCM rule Bienenstock et al., 1982; Intrator an Cooper, The upate propose here is particular in that it involves the temporal erivative of the postsynaptic activity, rather than the actual level of postsynaptic activity. Whereas our work focuses on a rate moel of neurons, see Felman 212 for an overview of synaptic plasticity that goes beyon spike timing an firing rate, incluing synaptic cooperativity nearby synapses on the same enritic subtree an epolarization ue to multiple consecutive pairings or spatial integration across nearby locations on the enrite, as well as the effect of the synapse s istance to the soma. In aition, it woul be interesting to stuy upate rules which epen on the statistics of triplets or quaruplets of spikes timings, as in Froemke an Dan 22; Gjorgjievaa et al These effects are not consiere here but future work shoul consier them. Another question is that of time-varying input. Although this work makes back-propagation more plausible for the case of a static input, the brain is a recurrent network with time-varying inputs, an back-propagation through time seems even less plausible than static back-propagation. An encouraging irection is that propose by Ollivier et al. 215; 14

arxiv: v4 [cs.lg] 26 Sep 2016

arxiv: v4 [cs.lg] 26 Sep 2016 Equilibrium Propagation: Briging the Gap Between Energy-Base Moels an Backpropagation Benjamin Scellier an Yoshua Bengio Université e Montréal, Montreal Institute for Learning Algorithms September 27,