Bayesian synaptic plasticity makes predictions about plasticity experiments in vivo


Laurence Aitchison and Peter E. Latham

October 13, 2014

arXiv preprint, q-bio.NC, 10 Oct 2014

Abstract

Humans and other animals learn by updating synaptic weights in the brain. Rapid learning allows animals to adapt quickly to changes in their environment, giving them a large selective advantage. As brains have been evolving for several hundred million years, we might expect biological learning rules to be close to optimal, exploiting all locally available information in order to learn as rapidly as possible. However, no previously proposed learning rules are optimal in this sense. We therefore use Bayes' theorem to derive optimal learning rules for supervised, unsupervised and reinforcement learning. As expected, these rules prove to be significantly more effective than the best classical learning rules. Our learning rules make two predictions about the results of plasticity experiments in active networks. First, we predict that learning rates should vary across time, increasing when fewer inputs are active. Second, we predict that learning rates should vary across synapses, being higher for synapses whose presynaptic cells have a lower average firing rate. Finally, our methods are extremely flexible, allowing the derivation of optimal learning rules based solely on the information that is assumed, or known, to be available to the synapse. This flexibility should allow for the derivation of optimal learning rules for progressively more complex and realistic synaptic and neural models, allowing us to connect theory with complex biological reality.

1 Introduction

Synapses connect neurons, allowing an action potential in the presynaptic neuron to influence the membrane potential of the postsynaptic neuron. The size of this membrane potential change is known as the postsynaptic potential (PSP), or alternatively the synaptic weight. Synaptic weights are updated based on information available locally at the synapse, for instance pre- and postsynaptic activity. Theoretical neuroscientists have noted that there are three types of learning, supervised learning, reinforcement learning and unsupervised learning, corresponding to three essential cognitive tasks [1, 2]. First, supervised learning involves making predictions (e.g., about future events) from a set of inputs [3]. For instance, a simple supervised learning task is to predict a single scalar output from a vector of inputs. To learn how to accomplish this task, the system repeatedly receives an input, makes a prediction, and is given an error signal

saying whether the prediction was too high or too low. It is believed that the cerebellum uses supervised learning to make short-timescale predictions (e.g., predicting air puffs from aural cues [4]) and to assist in fine motor control [5]. Second, reinforcement learning involves learning behavioural responses that maximize reward (e.g., food reward) [6]. For instance, a simple reinforcement learning task is to map an input vector to an action represented by a single scalar value. To learn how to accomplish this task, the system repeatedly receives an input, chooses an action, and is told the reward for that action. It is believed that the basal ganglia use reinforcement learning to perform the critical task of action selection [2]. Finally, unsupervised learning involves finding structure in the inputs, perhaps by finding a direction in the input space with an interesting non-Gaussian distribution [7]. It is believed that the cortex performs unsupervised learning, allowing us to understand and utilize the incredible volume of sensory data entering the brain [8].

Local learning rules have been proposed for supervised [2], reinforcement [9] and unsupervised [10, 11, 12] learning. These rules work well when given a large amount of data whose statistical properties do not change over time. However, being non-Bayesian, these rules do not exploit all locally available information in order to learn as rapidly as possible, which becomes problematic when there is less data available, or when the data changes over time. We therefore derived Bayesian synaptic learning rules. These learning rules do use all locally available information (up to approximations), are more effective than optimized classical learning rules, and make predictions about the results of plasticity experiments in active networks.

2 Results

We begin by writing down probabilistic generative models for the information received by a neuron. Then we derive Bayesian learning rules using the statistical structure given by the generative model. We demonstrate, in simulation, that our learning rules are far more effective than classical rules. Finally, our learning rules give strong predictions about the results of plasticity experiments under realistic patterns of network activity.

2.1 The model

The neuron's output, $y$, is a weighted sum of its inputs, $x_i$, plus noise,

$$y = \sum_i x_i w_i + \gamma_y \eta, \tag{1}$$

where $\eta$ is standard Gaussian noise, $P(\eta) = N(\eta; 0, 1)$, and every future mention of $\eta$ is independently drawn standard Gaussian noise. In practice, PSPs, or equivalently synaptic weights, $w_i$, are noisy, in large part due to stochastic vesicle release [13]. We model these noisy PSPs using

$$w_i = m_i + \sqrt{k}\, m_i\, \eta_i, \tag{2}$$

where $m_i$ is the mean PSP and $k m_i^2$ is the variance. We chose this particular relationship between the mean and variance of PSPs in order to match experimental data [14].
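As a concrete illustration, the generative model in Equations 1 and 2 can be simulated in a few lines. The sketch below is ours rather than the authors' code, and all parameter values (number of inputs, $k$, $\gamma_y$, and the input statistics) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000         # number of inputs (illustrative)
k = 0.1          # PSP noise scale, so Var[w_i] = k * m_i^2 (assumed value)
gamma_y = 1.0    # standard deviation of the output noise (assumed value)

m = rng.lognormal(mean=-0.5, sigma=0.5, size=n)   # mean PSPs (assumed log-normal)
x = rng.poisson(lam=0.01, size=n).astype(float)   # spike counts in one 10 ms step

# Equation 2: noisy PSPs with mean m_i and variance k * m_i^2.
w = m + np.sqrt(k) * m * rng.standard_normal(n)

# Equation 1: the output is a weighted sum of inputs plus Gaussian noise.
y = x @ w + gamma_y * rng.standard_normal()
```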

As $y$ is not constrained to be positive, we can think of $y$ as either the membrane potential, or the difference between the actual firing rate and some baseline firing rate.

For supervised and reinforcement learning, the number of spikes, $x_i$, is drawn from a Poisson distribution, whose rates are chosen to mimic the distribution of rates in cortex (see Methods). These neurons receive a feedback signal, $f$, which is a deterministic function of a noisy error signal,

$$\delta = y_{\text{opt}} - y + \gamma_f \eta. \tag{3}$$

Here, $y_{\text{opt}}$ is the optimal output, which is defined in terms of the optimal weights, $w_{\text{opt},j}$,

$$y_{\text{opt}} = \sum_j w_{\text{opt},j} x_j. \tag{4}$$

Note that the neuron's eventual goal will be to set its average weights, $m_i$, as close as possible to the optimal weights, $w_{\text{opt},i}$, so that the cell's outputs, $y$, are as close as possible to the optimal outputs, $y_{\text{opt}}$.

For supervised learning with continuous feedback, the feedback, $f$, is the noisy error signal,

$$f = \delta. \tag{5}$$

For supervised learning with binary feedback, $f = 1$ if the noisy error signal is above a threshold, $\theta$, and $f = -1$ otherwise,

$$f(y, y_{\text{opt}}) = \begin{cases} 1 & \text{if } \theta < \delta \\ -1 & \text{otherwise.} \end{cases} \tag{6}$$

Finally, for reinforcement learning the feedback, or reward, reports the magnitude of the noisy error signal, but not its direction,

$$f = |\delta|. \tag{7}$$
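For concreteness, the three feedback signals (Equations 3, 5, 6 and 7) can be written as a single function of $y$ and $y_{\text{opt}}$. This is a minimal sketch under assumed values of $\gamma_f$ and $\theta$, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def feedback(y, y_opt, kind, gamma_f=0.5, theta=0.0):
    """Feedback signals of Equations 3-7; gamma_f and theta are illustrative."""
    delta = y_opt - y + gamma_f * rng.standard_normal()   # Eq. 3: noisy error
    if kind == "continuous":       # Eq. 5: supervised, continuous feedback
        return delta
    if kind == "binary":           # Eq. 6: supervised, binary feedback
        return 1 if delta > theta else -1
    if kind == "reinforcement":    # Eq. 7: magnitude only, no direction
        return abs(delta)
    raise ValueError(kind)
```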

In supervised and reinforcement learning, there was a feedback signal that depends on $w_{\text{opt}}$, so the goal was to find $w_{\text{opt}}$ from the feedback signal. In contrast, for unsupervised learning, there is no feedback signal. Instead, there is interesting statistical structure in the inputs, $x$, that depends on $w_{\text{opt}}$, so the goal is to find $w_{\text{opt}}$ solely from the statistics of the inputs. For a Bayesian algorithm to find such interesting structure, and to generate data on which to test our model neuron, we must first write down a generative model describing how to generate inputs, $x$, with the desired interesting structure. We chose to have the inputs, $x$, be Gaussian in every direction except $w_{\text{opt}}$. Along $w_{\text{opt}}$, we set the distribution over $x$ to be Laplacian. In order to enforce that structure, we generated $y_{\text{opt}}$ on every time step from

$$P(y_{\text{opt}}) = \text{Laplacian}(y_{\text{opt}}; 0, b), \tag{8}$$

where $b$ is chosen so that the variance along $w_{\text{opt}}$ is similar to the variance in other directions. In order to generate a set of inputs, $x$, that is consistent with both $y_{\text{opt}}$ and the firing rate distribution, we sample $x$ from

$$P(x \mid y_{\text{opt}}) \propto N(x; 0, \Lambda)\, \delta_D\!\Big(\sum_j w_{\text{opt},j} x_j - y_{\text{opt}}\Big), \tag{9}$$

where $\delta_D$ is the Dirac delta function, the diagonal elements of $\Lambda$ are chosen to match the variance of a Poisson process with the appropriate rate, and the off-diagonal elements are set to 0,

$$\Lambda_{ij} = \delta_{ij}\, \nu_i. \tag{10}$$

With these inputs any $x_i$ can be positive on one time step and negative on another, so there is no distinction between excitatory and inhibitory cells. However, it is possible to interpret $x_i$ not as the inputs themselves, but as the difference between the actual input and the mean input. Using this interpretation, we could set the mean input to be large and positive for excitatory cells, and large and negative for inhibitory cells, giving distinct excitatory and inhibitory populations.

Finally, we do not expect the optimal weights to remain the same over all time, but to change slowly as the surrounding neural circuits and the statistics of the environment change. While these changes may actually be quite systematic, to a single synapse deep in the brain they will appear to be random. We formalise these random changes by assuming that the logarithm of the optimal weights, $\lambda_{\text{opt},i} = \log w_{\text{opt},i}$, follows a discretised Ornstein-Uhlenbeck process,

$$\lambda_{\text{opt},i}(t+1) = \left(1 - \frac{1}{\tau}\right)\left(\lambda_{\text{opt},i}(t) - \mu_{\text{prior}}\right) + \mu_{\text{prior}} + \sqrt{\frac{2\sigma_{\text{prior}}^2}{\tau}}\, \eta_i, \tag{11}$$

where $t$, and the timescale $\tau$, are written in terms of time steps. We chose this particular noise process for three reasons. First, $w_{\text{opt},i} = e^{\lambda_{\text{opt},i}}$ takes only positive values, so excitatory weights cannot become inhibitory, and vice versa. Second, spine sizes obey this stochastic process [15], and while synaptic weights are not spine sizes, they are correlated [16]. Third, this noise process gives a log-normal stationary distribution of weights, as is observed [14].
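The drift in Equation 11 is easy to simulate, and doing so confirms that the weights stay positive and that the stationary distribution of $\lambda_{\text{opt},i}$ is Gaussian. The parameter values below are placeholders; the paper fits $\mu_{\text{prior}}$ and $\sigma_{\text{prior}}^2$ to measured weights [14].

```python
import numpy as np

rng = np.random.default_rng(2)

tau = 100_000          # drift time constant, in time steps (illustrative)
mu_prior = -0.5        # placeholder prior parameters
sigma_prior = 0.5

def drift_log_weights(lam, n_steps):
    """Equation 11: discretised Ornstein-Uhlenbeck process on log-weights."""
    for _ in range(n_steps):
        lam = ((1 - 1 / tau) * (lam - mu_prior) + mu_prior
               + np.sqrt(2 * sigma_prior**2 / tau) * rng.standard_normal(lam.shape))
    return lam

lam = rng.normal(mu_prior, sigma_prior, size=1000)   # start at stationarity
w_opt = np.exp(drift_log_weights(lam, 1000))         # weights stay positive
```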

2.2 Inference

The synapse's goal is to set its mean weight, $m_i(t+1)$, as close as possible to the optimal weight, $w_{\text{opt},i}(t+1)$. Of course, the synapse does not know the optimal weight, so the best it can do is to set $m_i(t+1)$ equal to the expected value of the optimal weight given all past data,

$$m_i(t+1) = E\left[w_{\text{opt},i}(t+1) \mid D_i(t)\right]. \tag{12}$$

Here, $D_i(t) = \{d_i(t), d_i(t-1), \ldots, d_i(1)\}$ is all past data, and $d_i(t)$ is the data at time step $t$. The data at time $t$, denoted $d_i(t)$, consists of the presynaptic activity, $x_i$, the postsynaptic activity, $y$, the actual PSP, $w_i$, a feedback signal, $f$ (if performing supervised or reinforcement learning), and information about the membrane time constant. Knowledge of the membrane time constant is not essential for learning, but proves to be useful for setting the learning rate on short timescales. In particular, the learning rate will turn out to depend on

$$s_{\text{like}} = \gamma_y^2 + \gamma_f^2 + \sum_j x_j^2\left(s_j + k m_j^2\right), \tag{13}$$

where

$$s_i = \text{Var}\left[w_{\text{opt},i} \mid D_i(t-1)\right]. \tag{14}$$

Notice that $s_{\text{like}}$, like the membrane time constant, is a measure of the overall level of input, regardless of whether those inputs are excitatory or inhibitory. So, while $s_{\text{like}}$ and the membrane time constant are not exactly the same, the synapse should be able to estimate $s_{\text{like}}$ reasonably accurately from the membrane time constant, especially if the cell has many inputs. We therefore assume that $s_{\text{like}}$ is known by the synapse, even if, in reality, the cell has some uncertainty about the value of $s_{\text{like}}$.

To compute $P(w_{\text{opt},i}(t+1) \mid D_i(t))$, and hence $E[w_{\text{opt},i}(t+1) \mid D_i(t)]$, the neuron takes the distribution at the previous time step, $P(w_{\text{opt},i}(t) \mid D_i(t-1))$, uses Bayes' theorem to incorporate new data, then takes into account the random changes in optimal weights across time. To take random changes into account, the neuron marginalises out the weights at the previous time step,

$$P(w_{\text{opt},i}(t+1) \mid D_i) = \int dw_{\text{opt},i}\, P(w_{\text{opt},i}(t+1) \mid w_{\text{opt},i})\, P(w_{\text{opt},i} \mid D_i). \tag{15}$$

Here and in future, all quantities without an explicitly specified time are evaluated at time step $t$, so for instance $w_{\text{opt},i} = w_{\text{opt},i}(t)$. To compute $P(w_{\text{opt},i} \mid D_i)$, the neuron must take its distribution at the previous time step, $P(w_{\text{opt},i} \mid D_i(t-1))$, and use Bayes' theorem to incorporate the data received at time step $t$,

$$P(w_{\text{opt},i} \mid D_i) \propto P(d_i \mid w_{\text{opt},i})\, P(w_{\text{opt},i} \mid D_i(t-1)). \tag{16}$$

The resulting inferred distributions can be far too complex to deal with computationally, and certainly too complex for a synapse to work with, so we need to specify a family of approximate distributions. We chose to approximate the distribution over the log-weight, $\lambda_{\text{opt},i} = \log w_{\text{opt},i}$, using a Gaussian, with mean $\mu_i$ and variance $\sigma_i^2$,

$$P(\lambda_{\text{opt},i} \mid D(t-1)) = N(\lambda_{\text{opt},i}; \mu_i, \sigma_i^2). \tag{17}$$

This approximation has two advantages. First, $w_{\text{opt},i} = e^{\lambda_{\text{opt},i}}$ takes only positive values, leading to learning rules that cannot, for instance, take an excitatory synapse and turn it inhibitory. Second, if the synapse is not given any data, then the dynamics (Equation 11) imply that the distribution over $\lambda_{\text{opt},i}$ will be Gaussian, which can be captured perfectly by this approximating distribution.

In order to set the parameters, $\mu_i$ and $\sigma_i^2$, of the approximate distribution, we do not need the full distribution, $P(\lambda_{\text{opt},i}(t+1) \mid D_i)$, but only its mean and variance, which substantially simplifies the learning rules. In particular, for supervised and reinforcement learning the update rules are (see Methods for a full derivation)

$$\mu_i(t+1) = E[\lambda_{\text{opt},i}(t+1) \mid D_i] = \left(1 - \frac{1}{\tau}\right)\left(\mu_i + \left(E[\delta \mid f, s_{\text{like}}, x_i, w_i] + x_i(w_i - m_i)\right)\frac{x_i m_i \sigma_i^2}{s_{\text{like},i}} - \mu_{\text{prior}}\right) + \mu_{\text{prior}}, \tag{18a}$$

$$\sigma_i^2(t+1) = \text{Var}[\lambda_{\text{opt},i}(t+1) \mid D_i] = \left(1 - \frac{1}{\tau}\right)^2\left(\sigma_i^2 + \frac{\text{Var}[\delta \mid f, s_{\text{like}}, x_i, w_i] - s_{\text{like},i}}{s_{\text{like},i}^2}\, \sigma_i^4\, x_i^2 m_i^2\right) + \frac{2\sigma_{\text{prior}}^2}{\tau}, \tag{18b}$$

where

$$s_{\text{like},i} = s_{\text{like}} - x_i^2 k m_i^2, \tag{19}$$

and the exact forms of $E[\delta \mid f, s_{\text{like}}, x_i, w_i]$ and $\text{Var}[\delta \mid f, s_{\text{like}}, x_i, w_i]$ are given in the Methods.
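A minimal sketch of the resulting update, for the simplest case of supervised learning with continuous feedback, where the posterior over $\delta$ is a point mass at the observed feedback, so $E[\delta \mid f] = f$ and $\text{Var}[\delta \mid f] = 0$ (see Methods). All parameter values are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def bayesian_update(mu, sigma2, x, w, f, k=0.1, gamma_y=1.0, gamma_f=0.5,
                    tau=100_000, mu_prior=-0.5, sigma_prior=0.5):
    """One step of Equations 18a-b for supervised learning with continuous
    feedback (E[delta|f] = f, Var[delta|f] = 0). mu, sigma2, x, w are
    per-synapse arrays; all keyword parameters are illustrative."""
    m = np.exp(mu + sigma2 / 2)               # mean weight, log-normal posterior
    s = m**2 * sigma2                         # weight variance, s_i ~ m_i^2 sigma_i^2
    s_like = gamma_y**2 + gamma_f**2 + np.sum(x**2 * (s + k * m**2))   # Eq. 13
    s_like_i = s_like - x**2 * k * m**2                                # Eq. 19
    err = f + x * (w - m)                     # E[delta|f] + x_i (w_i - m_i)
    mu_post = mu + err * x * m * sigma2 / s_like_i                     # Eq. 38a
    sigma2_post = sigma2 - sigma2**2 * x**2 * m**2 / s_like_i          # Eq. 38b
    # Propagate through the drift (Equation 39 / Equation 11).
    mu_new = (1 - 1 / tau) * (mu_post - mu_prior) + mu_prior
    sigma2_new = (1 - 1 / tau)**2 * sigma2_post + 2 * sigma_prior**2 / tau
    return mu_new, sigma2_new
```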

[Figure 1: four panels, A to D, each plotting weight (mV) against time (min); the legend distinguishes the optimal weight, the inferred weight, and error bars.]

Figure 1: Bayesian learning rules track the true weight, and estimate uncertainty. The blue line is the true weight, the red line represents the mean of the inferred distribution, $e^{\mu_i}$, and the red area represents two standard deviation error bars, $e^{\mu_i \pm 2\sigma_i}$. The time course is 3 time constants. A: Supervised learning, continuous feedback. B: Supervised learning, binary feedback. C: Reinforcement learning. D: Unsupervised learning.

While these rules may look complex, the update rule for the mean is similar to the original delta rule [2]. In particular, the change in weight is given by the product of three terms. First, $x_i$ ensures that the synapse only gets updated if the input was on. Second, there is an error signal, $E[\delta \mid f, s_{\text{like}}, x_i, w_i] + x_i(w_i - m_i)$, stating whether this particular synaptic weight should increase or decrease. Third, there is an adaptive learning rate, $m_i \sigma_i^2 / s_{\text{like},i}$. The learning rule for $\sigma_i^2$ is similar, though more difficult to interpret.

2.3 Simulations

Simulating each of these rules gives good results: the inferred log-weight, $\mu_i$, tracks the optimal log-weight, $\lambda_{\text{opt},i}$, with $\lambda_{\text{opt},i}$ straying outside of the two standard deviation error bars about 5% of the time (Figure 1). We used 1000 inputs, and a time step of 10 ms. For supervised learning we used a time constant, $\tau$, of 1,000 s, or around 15 minutes, and for reinforcement and unsupervised learning we used a time constant of 10,000 s, or around 3 hours. The firing rate distribution was chosen to be intermediate between the relatively narrow ranges found by some [17] and the extremely broad ranges found by others [18]. In particular, we use a log-normal distribution, with median at 1 Hz, and with 95% of firing rates between 0.1 Hz and 10 Hz (a short numerical check of this parameterisation follows below). The parameters of the dynamics, $\mu_{\text{prior}}$ and $\sigma_{\text{prior}}^2$, were chosen by fitting a log-normal distribution to measured synaptic weights [14].

We found that our Bayesian learning rules are more effective than standard classical learning rules with any learning rate (Figure 2).
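The quoted firing-rate distribution pins down the log-normal parameters: a median of 1 Hz gives a log-mean of 0, and placing 95% of rates between 0.1 and 10 Hz gives a log-standard-deviation of $\ln(10)/1.96$. A short check, under the assumption that the 95% range is the central interval:

```python
import numpy as np

rng = np.random.default_rng(3)

# Median 1 Hz => mean of log-rate is log(1) = 0.
# 95% of rates in [0.1, 10] Hz => 1.96 * sigma_log = log(10).
sigma_log = np.log(10) / 1.96

rates = rng.lognormal(mean=0.0, sigma=sigma_log, size=100_000)   # Hz
print(np.median(rates))                        # ~1.0
print(np.mean((rates > 0.1) & (rates < 10)))   # ~0.95
```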

We evaluated performance using the mean square error (MSE) between $w_{\text{opt},i}$ and $m_i$. More details, including the classical learning rules used for each type of learning, are given in the Methods.

[Figure 2: four panels, A to D, each plotting the ratio of classical MSE to Bayesian MSE against the learning rate.]

Figure 2: Bayesian learning rules have a lower mean square error than classical learning rules. The black line is the mean square error for the classical learning rule with a given learning rate (x-axis), divided by the mean square error for the Bayesian learning rule, which is constant as it is parameter-free. The red line is 1. A: Supervised learning, continuous feedback. B: Supervised learning, binary feedback. C: Reinforcement learning. D: Unsupervised learning.

2.4 Predictions

Our learning rules make two predictions, both of which are intuitively sensible and experimentally testable. First, the learning rate should vary across time, falling when many inputs are active, and rising when few inputs are active; in our equations, this occurs because of the $1/s_{\text{like},i}$ term in Equation 18a. We can see that this prediction is intuitively sensible by considering a time step upon which only one input fires. In this case, the synaptic weight associated with the input that fired is entirely responsible for any error, so that synaptic weight should be updated drastically. In contrast, if many cells fire, then it is not clear which synaptic weights are responsible for an error, so it is sensible to make only small updates to the weights. In our learning rules this occurs because $s_{\text{like}}$, which appears in the denominator of the learning rate in Equation 18a, increases as the number of active inputs increases (Equation 13). We therefore predict fast learning during plasticity protocols in silent networks, when only the stimulated cells are active, and slow learning in active networks. There is some circumstantial evidence in favour of this prediction, in that weights can be changed within a few minutes during plasticity experiments in silent slices [19], whereas weight changes in vivo, in active networks, occur over much longer timescales [15]. However, a direct test of this prediction remains necessary. A simple numerical illustration of this effect is given below.
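To make the first prediction concrete, one can evaluate the adaptive learning rate $m_i \sigma_i^2 / s_{\text{like},i}$ for a stimulated synapse while varying the number of co-active inputs. The sketch below is our illustration with assumed parameter values; the learning rate falls as the number of active inputs grows.

```python
import numpy as np

k, sigma2, m = 0.1, 0.05, 1.0        # illustrative: PSP noise, posterior variance,
gamma_y, gamma_f = 1.0, 0.5          # mean weight (mV), and noise terms

for n_active in [1, 10, 100, 1000]:
    x = np.zeros(1000)
    x[:n_active] = 1.0                              # n_active inputs spike once
    s = m**2 * sigma2                               # equal uncertainty everywhere
    s_like = gamma_y**2 + gamma_f**2 + np.sum(x**2 * (s + k * m**2))   # Eq. 13
    s_like_i = s_like - 1.0 * k * m**2              # Eq. 19 for a spiking synapse
    print(n_active, m * sigma2 / s_like_i)          # learning rate falls
```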

[Figure 3: four panels, A to D, each plotting $s/m^2$ against the presynaptic firing rate $\nu$.]

Figure 3: Uncertainty, and hence learning rate, increases as firing rate decreases. A: Supervised learning, continuous feedback. B: Supervised learning, binary feedback. C: Reinforcement learning. D: Unsupervised learning.

Second, learning rates should vary across synapses, being lower for synapses whose presynaptic cells have a higher average firing rate. We can see that this is intuitively sensible by comparing a synapse whose presynaptic cell fires rarely to a synapse whose presynaptic cell fires frequently. If the presynaptic cell fires rarely, the synapse is given only a few opportunities to update the weight, so should make large updates. In contrast, if the presynaptic cell fires rapidly, then the synapse is given many opportunities to update the weight, so should update the weight only a little for each presynaptic spike. In our learning rules, this occurs because lower firing rates give the cell fewer opportunities to learn, leading to higher uncertainty and hence higher learning rates. More formally, we show in the Methods that

$$\frac{\delta m_i}{m_i} \propto \frac{s_i}{m_i^2} \propto \frac{1}{\sqrt{\nu_i}}. \tag{20}$$

The first relationship comes from manipulating the learning rules for the mean (Equation 18a), and the second comes from mean field calculations detailed in the Methods, and confirmed in simulations (Figure 3).

3 Discussion

We derived Bayes-optimal learning rules, which allow a synapse to exploit all locally available information in order to learn as rapidly as possible. These rules allow synapses to learn far more rapidly than optimized classical methods when performing supervised, unsupervised and reinforcement learning tasks. Furthermore, our Bayes-optimal learning rules suggest that learning rates should vary across time and across synapses. In particular, learning rates should be higher at times when fewer inputs into a cell are active, and learning rates should be higher for synapses whose presynaptic cells fire more infrequently. Both of these predictions can, in principle, be tested, something that we hope will occur in the next generation of plasticity experiments.

Out of necessity, we have made a variety of specific choices about the information synapses have available to them, and about the statistical properties of a neuron's inputs. However, our framework is unique in providing the machinery to derive a learning rule for any combination of locally available information and input statistics. There are therefore two important directions for future work: considering more complex and realistic locally available information, and considering more complex and realistic input statistics.

There are many directions to explore concerning the information available to a synapse. First, the synapse might have access to more information than we assumed: perhaps it might have some information about the activity of the surrounding synapses, allowing the synapse to learn something about the correlations between its weight and surrounding weights. Second, the synapse might have access to less information than we assumed: perhaps the postsynaptic activity does not backpropagate effectively to the synapse, or the synapse is not aware of a presynaptic spike because of a vesicle release failure. Finally, with a detailed characterisation of the electrical properties of dendrites, it may be possible to find a learning rule that is based only on the membrane potential and calcium signals at the synapse. In each of these cases the resulting learning rules may well make interesting and testable predictions.

Our choices about the statistics of the inputs were intended to be a first step, so there are again many interesting directions to explore. Neural networks exhibit a variety of different global behaviours, from oscillations with changing power and frequency [20] to up-down states [21]. Furthermore, there are debates about the structure of neural activity: whether it is correlated [22], or whether it exists on a low-dimensional manifold [23]. When considering learning in realistic neural networks, it will be important for neurons and synapses to take into account both the structure and state of neural activity. The flexibility of our approach provides a means to connect relatively abstract theory to complex biological reality, an approach that, we hope, will significantly increase our understanding of synaptic plasticity.

Acknowledgements

We would like to acknowledge Jesper Sjöström for valuable feedback on this manuscript. This work was supported by the Gatsby Charitable Foundation.

References

[1] K. Doya, "Complementary roles of basal ganglia and cerebellum in learning and motor control," Current Opinion in Neurobiology, vol. 10, no. 6, 2000.

[2] P. Dayan and L. F. Abbott, Theoretical Neuroscience. Cambridge, MA: MIT Press, 2001.

[3] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, "The elements of statistical learning: data mining, inference and prediction," The Mathematical Intelligencer, vol. 27, no. 2, 2005.

[4] J. J. Kim and R. F. Thompson, "Cerebellar circuits and synaptic mechanisms involved in classical eyeblink conditioning," Trends in Neurosciences, vol. 20, no. 4, 1997.

[5] E. R. Kandel, J. H. Schwartz, T. M. Jessell, et al., Principles of Neural Science. New York: McGraw-Hill, 2000.

[6] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. MIT Press, 1998.

[7] N. Intrator and L. N. Cooper, "Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions," Neural Networks, vol. 5, no. 1, pp. 3-17, 1992.

[8] K. Koch, J. McLean, R. Segev, M. A. Freed, M. J. Berry II, V. Balasubramanian, and P. Sterling, "How much the eye tells the brain," Current Biology, vol. 16, 2006.

[9] H. Sprekeler, G. Hennequin, and W. Gerstner, "Code-specific policy gradient rules for spiking neurons," in Advances in Neural Information Processing Systems, 2009.

[10] D. O. Hebb, The Organization of Behavior: A Neuropsychological Approach. John Wiley & Sons, 1949.

[11] R. P. Rao and D. H. Ballard, "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects," Nature Neuroscience, vol. 2, no. 1, 1999.

[12] J. Zylberberg, J. T. Murphy, and M. R. DeWeese, "A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields," PLoS Computational Biology, vol. 7, no. 10, p. e1002250, 2011.

[13] T. Branco and K. Staras, "The probability of neurotransmitter release: variability and feedback control at single synapses," Nature Reviews Neuroscience, vol. 10, no. 5, 2009.

[14] S. Song, P. J. Sjöström, M. Reigl, S. Nelson, and D. B. Chklovskii, "Highly nonrandom features of synaptic connectivity in local cortical circuits," PLoS Biology, vol. 3, no. 3, p. e68, 2005.

[15] Y. Loewenstein, A. Kuras, and S. Rumpel, "Multiplicative dynamics underlie the emergence of the log-normal distribution of spine sizes in the neocortex in vivo," The Journal of Neuroscience, vol. 31, June 2011.

[16] M. Matsuzaki, N. Honkura, G. C. Ellis-Davies, and H. Kasai, "Structural basis of long-term potentiation in single dendritic spines," Nature, vol. 429, no. 6993, 2004.

[17] D. H. O'Connor, S. P. Peron, D. Huber, and K. Svoboda, "Neural activity in barrel cortex underlying vibrissa-based object localization in mice," Neuron, vol. 67, no. 6, 2010.

[18] K. Mizuseki and G. Buzsáki, "Preconfigured, skewed distribution of firing rates in the hippocampus and entorhinal cortex," Cell Reports, vol. 4, no. 5, 2013.

[19] P. J. Sjöström, G. G. Turrigiano, and S. B. Nelson, "Rate, timing, and cooperativity jointly determine cortical synaptic plasticity," Neuron, vol. 32, no. 6, 2001.

[20] S. Ray and J. H. Maunsell, "Differences in gamma frequencies across visual cortex restrict their possible use in computation," Neuron, vol. 67, no. 5, 2010.

[21] A. Destexhe, S. W. Hughes, M. Rudolph, and V. Crunelli, "Are corticothalamic up states fragments of wakefulness?," Trends in Neurosciences, vol. 30, no. 7, 2007.

[22] A. S. Ecker, P. Berens, G. A. Keliris, M. Bethge, N. K. Logothetis, and A. S. Tolias, "Decorrelated neuronal firing in cortical microcircuits," Science, vol. 327, no. 5965, 2010.

[23] M. Y. Byron, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani, "Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity," in Advances in Neural Information Processing Systems, 2009.

[24] S. J. Julier and J. K. Uhlmann, "A new extension of the Kalman filter to nonlinear systems," in Int. Symp. Aerospace/Defense Sensing, Simulation and Controls, vol. 3, Orlando, FL, 1997.

4 Methods

4.1 Bayesian update rules for supervised and reinforcement learning

Here we show how to update the approximate posterior distribution over $\lambda_{\text{opt},i}$ given new data. We begin by rewriting Equations 15 and 16 in terms of $\lambda_{\text{opt},i}$ rather than $w_{\text{opt},i}$,

$$P(\lambda_{\text{opt},i}(t+1) \mid D_i) = \int d\lambda_{\text{opt},i}\, P(\lambda_{\text{opt},i}(t+1) \mid \lambda_{\text{opt},i})\, P(\lambda_{\text{opt},i} \mid D_i), \tag{21}$$

$$P(\lambda_{\text{opt},i} \mid D_i) \propto P(d_i \mid \lambda_{\text{opt},i})\, P(\lambda_{\text{opt},i} \mid D_i(t-1)). \tag{22}$$

Our approach is to approximate $P(\lambda_{\text{opt},i} \mid D_i)$ by a Gaussian, in which case Equation 21 is a Gaussian integral that can be computed straightforwardly. However, computing a Gaussian approximation to $P(\lambda_{\text{opt},i} \mid D_i)$ is non-trivial. We begin by writing out the posterior in full,

$$P(\lambda_{\text{opt},i} \mid D_i) = P(\lambda_{\text{opt},i} \mid f, y, s_{\text{like}}, x_i, w_i, D_i(t-1)). \tag{23}$$

As $f$ is a deterministic function of $\delta$, we can introduce and integrate out $\delta$,

$$P(\lambda_{\text{opt},i} \mid D_i) = \int d\delta\, P(\lambda_{\text{opt},i} \mid \delta, y, s_{\text{like}}, x_i, w_i, D_i(t-1))\, P(\delta \mid f, s_{\text{like}}, x_i, w_i). \tag{24}$$

This step converts one hard inference problem, inferring $\lambda_{\text{opt},i}$ from the feedback signal, into two easier inference problems: inferring $\delta$ from the feedback signal, and inferring $\lambda_{\text{opt},i}$ from $\delta$. The process of inferring $\lambda_{\text{opt},i}$ from $\delta$ does not depend on the details of the feedback signal, so we consider this problem first. We then consider the posterior over $\delta$ induced by various types of feedback signal.

In order to find the distribution over $\lambda_{\text{opt},i}$ given $\delta$, we use Bayes' theorem,

$$P(\lambda_{\text{opt},i} \mid \delta, y, s_{\text{like}}, x_i, w_i, D_i(t-1)) \propto P(\delta \mid y, s_{\text{like}}, x_i, w_i, \lambda_{\text{opt},i})\, P(\lambda_{\text{opt},i} \mid D_i(t-1)). \tag{25}$$

The prior, $P(\lambda_{\text{opt},i} \mid D_i(t-1))$, is given by the approximating distribution (Equation 17), but the correct form for the likelihood, $P(\delta \mid y, s_{\text{like}}, x_i, w_i, \lambda_{\text{opt},i})$, is less clear. In order to find this likelihood, we note that if all the inputs were known then we could write

$$P(y \mid x, w_i, \lambda_{\text{opt},i}) = N\Big(y;\; m_{\text{cell}} + x_i(w_i - m_i),\; \sum_{j \neq i} k m_j^2 x_j^2 + \gamma_y^2\Big), \tag{26}$$

and similarly,

$$P(y_{\text{opt}} \mid x, w_i, \lambda_{\text{opt},i}) = N\Big(y_{\text{opt}};\; m_{\text{cell}} + x_i\left(e^{\lambda_{\text{opt},i}} - m_i\right),\; \sum_{j \neq i} s_j x_j^2\Big), \tag{27}$$

where

$$m_{\text{cell}} = \sum_j m_j x_j. \tag{28}$$

Therefore, $\delta$, as defined by Equation 3, does not depend on $m_{\text{cell}}$,

$$P(\delta \mid x, w_i, \lambda_{\text{opt},i}) = N\Big(\delta;\; x_i\left(e^{\lambda_{\text{opt},i}} - m_i\right) - x_i(w_i - m_i),\; s_{\text{like}} - \left(k m_i^2 + s_i\right) x_i^2\Big). \tag{29}$$

In fact, the distribution over $\delta$ depends on $x$ only through $x_i$ and $s_{\text{like}}$,

$$P(\delta \mid x, w_i, \lambda_{\text{opt},i}) = P(\delta \mid s_{\text{like}}, x_i, w_i, \lambda_{\text{opt},i}) \tag{30}$$

$$= N\Big(\delta;\; x_i\left(e^{\lambda_{\text{opt},i}} - m_i\right) - x_i(w_i - m_i),\; s_{\text{like}} - \left(k m_i^2 + s_i\right) x_i^2\Big), \tag{31}$$

where $s_{\text{like},i}$ is given by Equation 19.

Now that we have found the likelihood (Equation 31), we must think about how to deal with the nonlinearity in the mean of the likelihood. We can linearise the problematic term using

$$e^{\lambda_{\text{opt},i}} \approx m_i\left(1 + \lambda_{\text{opt},i} - \mu_i\right). \tag{32}$$

While this approximation does not have exactly the right behaviour around $\lambda_{\text{opt},i} = \mu_i$, it does have the correct expectation and expected gradient with respect to $\lambda_{\text{opt},i}$, and approximately the correct variance, and these are the properties that are important to ensure the stability and accuracy of the resulting learning rule. This approximation, in combination with Equation 31, gives

$$P(\lambda_{\text{opt},i} \mid \delta, y, s_{\text{like}}, x_i, w_i, D_i(t-1)) \tag{33}$$

$$\propto \exp\left(-\frac{\left(\delta - x_i m_i\left(\lambda_{\text{opt},i} - \mu_i\right) + x_i(w_i - m_i)\right)^2}{2\left(s_{\text{like},i} - x_i^2 s_i\right)}\right) \exp\left(-\frac{\left(\lambda_{\text{opt},i} - \mu_i\right)^2}{2\sigma_i^2}\right). \tag{34}$$

Rewriting the first term in the product as a Gaussian over $\lambda_{\text{opt},i}$, using the fact that $s_i \approx m_i^2 \sigma_i^2$, gives

$$\propto \exp\left(-\frac{\left(\lambda_{\text{opt},i} - \mu_i - \frac{\delta + x_i(w_i - m_i)}{x_i m_i}\right)^2}{2\left(\frac{s_{\text{like},i}}{x_i^2 m_i^2} - \sigma_i^2\right)}\right) \exp\left(-\frac{\left(\lambda_{\text{opt},i} - \mu_i\right)^2}{2\sigma_i^2}\right). \tag{35}$$

This distribution can be written as a Gaussian over $\lambda_{\text{opt},i}$, with mean

$$\mu_{\lambda_{\text{opt},i}}(\delta) = \mu_i + \left(\delta + x_i(w_i - m_i)\right)\frac{x_i m_i \sigma_i^2}{s_{\text{like},i}} \tag{36}$$

and variance

$$\sigma^2_{\lambda_{\text{opt},i}}(\delta) = \sigma_i^2\left(1 - \frac{\sigma_i^2 x_i^2 m_i^2}{s_{\text{like},i}}\right). \tag{37}$$
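Equations 36 and 37 translate directly into code. The sketch below is ours, with made-up values standing in for the quantities a synapse would actually track.

```python
import numpy as np

def lambda_posterior_given_delta(mu, sigma2, delta, x, w, m, s_like_i):
    """Mean and variance of the Gaussian over lambda_opt given delta
    (Equations 36 and 37)."""
    mean = mu + (delta + x * (w - m)) * x * m * sigma2 / s_like_i
    var = sigma2 * (1 - sigma2 * x**2 * m**2 / s_like_i)
    return mean, var

# Illustrative numbers: one active input, modest weight uncertainty.
mean, var = lambda_posterior_given_delta(
    mu=0.0, sigma2=0.05, delta=0.3, x=1.0, w=1.1, m=1.0, s_like_i=0.5)
```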

13 to λ opt,i, and approximately the correct variance properties that are important to ensure the stability and accuracy of the resulting learning rule. This approximation, in combination with Equation 31, gives P λ opt,i δ, y, s like, x i, w i, D i t 1 33 exp δ x i m i λ opt,i µ i w i m i exp s like,i x i s i λ opt,i µ i σi. 34 Rewriting the first term in the product as a Gaussian over λ opt,i, using the fact that s i m i σ i, gives, λ opt,i µ i + δ+xiwi mi x im i = exp s exp λ opt,i µ i like,i σi σi. 35 x i m i This distribution can be written as a Gaussian over λ opt,i, with mean, µ λopt,i δ = µ i + δ + x i w i m i x im i σ i s like,i 36 and variance, σ λ opt,i δ = σ i 1 σ i x i m i s like,i. 37 In order to find the next approximate distribution over λ opt,i Equation 17, we only need to know the mean and variance of P λ opt,i f, y, x i, w i, not the full distribution. Therefore, instead of integrating out δ, as prescribed by Equation 4, we use the law of totat expectation and variance to compute the mean and variance of λ opt,i directly, µ post,i = µ i + E [δ f, y, x i, w i ] + x i w i m i x im i σ i σ post,i = σ i + s like,i 38a s like Var [δ f, y, x i, w i ] σi x i m i s like,i s. 38b like,i Finally, computing the Gaussian integral in Equation 1, and thereby taking into account the random changes in the weights across time is straightforward. We simply propagate the posterior distribution through the dynamics Equation 11, µ i t + 1 = σ i t + 1 = 1 1 µ post,i, 39a τ 1 1 σ τ post,i + σ prior. 39b τ Substituting Equation 38 into Equation 39 gives Equation

Finally, we must find the mean and variance of the posterior distribution over $\delta$. We use Bayes' theorem,

$$P(\delta \mid f, s_{\text{like}}, x_i, w_i) \propto P(f \mid \delta)\, P(\delta \mid s_{\text{like}}, x_i, w_i). \tag{40}$$

The prior is given by integrating out $\lambda_{\text{opt},i}$ in Equation 31,

$$P(\delta \mid s_{\text{like}}, x_i, w_i) = N\left(\delta;\; -x_i(w_i - m_i),\; s_{\text{like},i}\right), \tag{41}$$

but the likelihood, $P(f \mid \delta)$, is specific to the feedback signal, and hence to the type of learning, as described below.

4.1.1 Supervised learning: continuous feedback

For supervised learning with continuous feedback, the likelihood is a delta function,

$$P(f \mid \delta) = \delta_D(\delta - f), \tag{42}$$

so the posterior over $\delta$ (Equation 40) is also a delta function located at $f$.

4.1.2 Supervised learning: binary feedback

For supervised learning with binary feedback, the likelihood is a step function,

$$P(f \mid \delta) = \begin{cases} \Theta(\delta - \theta) & \text{if } f = 1 \\ \Theta(\theta - \delta) & \text{if } f = -1, \end{cases} \tag{43}$$

so the posterior over $\delta$ (Equation 40) is a truncated Gaussian, whose mean and variance can be computed using library functions.

4.1.3 Policy search

For reinforcement learning, the likelihood is a pair of delta functions,

$$P(f \mid \delta) = \Theta(f)\left[\delta_D(\delta - f) + \delta_D(\delta + f)\right], \tag{44}$$

so the posterior over $\delta$ (Equation 40) is a pair of delta functions with different weights, whose mean and variance can again be computed straightforwardly.
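The posterior moments of $\delta$ for each feedback type can be computed with standard routines, as the text notes. The sketch below is our illustration of Equations 42 to 44, using scipy's truncated normal for the binary case and explicit mixture weights for the reinforcement case; it is not the authors' code.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def delta_posterior(f, kind, mu0, s0, theta=0.0):
    """Mean and variance of P(delta | f) with prior N(mu0, s0) (Eq. 41)."""
    sd = np.sqrt(s0)
    if kind == "continuous":                  # Eq. 42: point mass at f
        return f, 0.0
    if kind == "binary":                      # Eq. 43: truncated Gaussian
        lo, hi = ((theta, np.inf) if f == 1 else (-np.inf, theta))
        a, b = (lo - mu0) / sd, (hi - mu0) / sd   # standardised bounds
        return (truncnorm.mean(a, b, loc=mu0, scale=sd),
                truncnorm.var(a, b, loc=mu0, scale=sd))
    if kind == "reinforcement":               # Eq. 44: two weighted point masses
        p_plus = norm.pdf(f, mu0, sd)         # prior density at delta = +f
        p_minus = norm.pdf(-f, mu0, sd)       # prior density at delta = -f
        z = p_plus + p_minus
        p_plus, p_minus = p_plus / z, p_minus / z
        mean = (p_plus - p_minus) * f
        var = f**2 - mean**2                  # both components sit at |delta| = f
        return mean, var
    raise ValueError(kind)
```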

4.2 Unsupervised learning

Unsupervised learning is considerably more difficult than supervised or reinforcement learning. We proceed by first describing how to generate inputs from the distribution in Equation 9, and then by describing how to perform inference.

4.2.1 Generating x

We can rewrite the distribution in Equation 9 using the fact that a Gaussian with zero variance is a delta function,

$$P(x \mid w_{\text{opt}}, y_{\text{opt}}) \propto \lim_{L \to 0} N(x; b, \Lambda)\, N\Big(y_{\text{opt}};\; \sum_j w_{\text{opt},j} x_j,\; L\Big). \tag{45}$$

This has the same form as a standard linear Gaussian inference problem, where the first term, $N(x; b, \Lambda)$, is the prior, and the second term is the likelihood. The solution to such problems is well known,

$$P(x \mid w_{\text{opt}}, y_{\text{opt}}) = N\left(x;\; \mu_{x|y_{\text{opt}}},\; \Sigma\right), \tag{46}$$

where

$$\Sigma = \lim_{L \to 0}\left(\Lambda - \frac{\Lambda w_{\text{opt}} w_{\text{opt}}^T \Lambda}{L + w_{\text{opt}}^T \Lambda w_{\text{opt}}}\right), \tag{47}$$

$$\mu_{x|y_{\text{opt}}} = \lim_{L \to 0} \Sigma\left(L^{-1} y_{\text{opt}} w_{\text{opt}} + \Lambda^{-1} b\right). \tag{48}$$

We need to further manipulate $\mu_{x|y_{\text{opt}}}$ before its limit becomes obvious,

$$\mu_{x|y_{\text{opt}}} = \lim_{L \to 0}\left(\Lambda - \frac{\Lambda w_{\text{opt}} w_{\text{opt}}^T \Lambda}{L + w_{\text{opt}}^T \Lambda w_{\text{opt}}}\right)\left(L^{-1} y_{\text{opt}} w_{\text{opt}} + \Lambda^{-1} b\right) \tag{49}$$

$$= b + \frac{y_{\text{opt}} - w_{\text{opt}}^T b}{w_{\text{opt}}^T \Lambda w_{\text{opt}}}\, \Lambda w_{\text{opt}}. \tag{50}$$

We can therefore sample $x$ given $y_{\text{opt}}$ by taking

$$x = x_{\text{prior}} + \frac{y_{\text{opt}} - w_{\text{opt}}^T x_{\text{prior}}}{w_{\text{opt}}^T \Lambda w_{\text{opt}}}\, \Lambda w_{\text{opt}}, \tag{51}$$

where

$$P(x_{\text{prior}}) = N(x_{\text{prior}}; b, \Lambda), \tag{52}$$

which can be confirmed by computing the moments of $x$ that result from this process.
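Equations 51 and 52 give a direct sampling procedure. A sketch with illustrative rate and weight distributions; the assertion checks that the sampled inputs satisfy the constraint exactly.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_x(y_opt, w_opt, nu, b):
    """Equation 51: sample inputs consistent with a given y_opt by projecting
    a draw from the Gaussian prior onto the constraint w_opt . x = y_opt."""
    Lam_w = nu * w_opt                                          # Lambda is diagonal (Eq. 10)
    x_prior = b + np.sqrt(nu) * rng.standard_normal(nu.shape)   # Eq. 52
    return x_prior + (y_opt - w_opt @ x_prior) / (w_opt @ Lam_w) * Lam_w

n = 1000
nu = rng.lognormal(0.0, 1.0, size=n)       # illustrative firing rates
w_opt = rng.lognormal(-0.5, 0.5, size=n)   # illustrative optimal weights
x = sample_x(y_opt=2.0, w_opt=w_opt, nu=nu, b=np.zeros(n))
assert np.isclose(w_opt @ x, 2.0)          # the constraint holds exactly
```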

4.2.2 Inference

Now, we wish to compute the posterior, $P(\lambda_{\text{opt},i} \mid y, s_{\text{like}}, x_i)$. While we could apply Bayes' theorem directly, it turns out to be easier to introduce and integrate out another variable, as we did for supervised and reinforcement learning. In this case, the relevant variable is $y_{\text{opt}}$,

$$P(\lambda_{\text{opt},i} \mid y, s_{\text{like}}, x_i) = \int dy_{\text{opt}}\, P(\lambda_{\text{opt},i} \mid y_{\text{opt}}, y, s_{\text{like}}, x_i)\, P(y_{\text{opt}} \mid y, s_{\text{like}}, x_i). \tag{53}$$

The two distributions in the above expression are given by Bayes' theorem,

$$P(\lambda_{\text{opt},i} \mid y_{\text{opt}}, y, s_{\text{like}}, x_i) \propto P(\lambda_{\text{opt},i})\, P(x_i \mid y_{\text{opt}}, s_{\text{like}}, \lambda_{\text{opt},i})\, P(y \mid y_{\text{opt}}, s_{\text{like}}, \lambda_{\text{opt},i}, x_i), \tag{54}$$

$$P(y_{\text{opt}} \mid y, x_i) \propto P(y_{\text{opt}})\, P(x_i \mid y_{\text{opt}}, s_{\text{like}})\, P(y \mid y_{\text{opt}}, s_{\text{like}}, x_i). \tag{55}$$

We must compute all the distributions in Equations 54 and 55. First, the distribution over $x_i$ given $y_{\text{opt}}$ and $\lambda_{\text{opt},i}$ is given by Equations 50 and 47 with $b = 0$,

$$P(x_i \mid y_{\text{opt}}, \lambda_{\text{opt},i}) \approx N\left(x_i;\; \frac{y_{\text{opt}}\, \nu_i w_{\text{opt},i}}{n\langle\nu w_{\text{opt}}^2\rangle},\; \nu_i - \frac{\nu_i^2\left(m_i^2 + s_i\right)}{n\langle\nu w_{\text{opt}}^2\rangle}\right). \tag{56}$$

The distribution over $x_i$ given $y_{\text{opt}}$ is similar, but $e^{\lambda_{\text{opt},i}}$ is replaced by its expected value,

$$P(x_i \mid y_{\text{opt}}) \approx N\left(x_i;\; \frac{y_{\text{opt}}\, \nu_i m_i}{n\langle\nu w_{\text{opt}}^2\rangle},\; \nu_i - \frac{\nu_i^2\left(m_i^2 + s_i\right)}{n\langle\nu w_{\text{opt}}^2\rangle}\right). \tag{57}$$

Note that in both of the above expressions, we have approximated $w_{\text{opt}}^T \Lambda w_{\text{opt}}$ using its expected value,

$$w_{\text{opt}}^T \Lambda w_{\text{opt}} \approx n\langle\nu w_{\text{opt}}^2\rangle = n\langle\nu\rangle\, e^{2\mu_{\text{prior}} + 2\sigma_{\text{prior}}^2}. \tag{58}$$

Second, the distribution over $y$ given $y_{\text{opt}}$ is again given by Equations 26 and 27,

$$P(y \mid y_{\text{opt}}, w_i, \lambda_{\text{opt},i}) = N\left(y_{\text{opt}};\; y + x_i\left(e^{\lambda_{\text{opt},i}} - m_i\right) - x_i(w_i - m_i),\; s_{\text{like},i}\right), \tag{59}$$

$$P(y \mid y_{\text{opt}}, \lambda_{\text{opt},i}, x_i) = N\left(y_{\text{opt}};\; y + x_i\left(e^{\lambda_{\text{opt},i}} - m_i\right),\; s_{\text{like},i}\right). \tag{60}$$

In order to actually compute the mean and variance of $\lambda_{\text{opt},i}$, we perform many of the algebraic steps in code, which both simplifies the required derivations and reduces the probability of errors. First, we compute the mean and variance of the distribution over $y_{\text{opt}}$ by combining, in code, the two Gaussians in Equations 57 and 60. We combine the resulting Gaussian with the Laplacian prior in Equation 8 using generic routines. In particular, a generic version of this problem can be written

$$P(x) \propto \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\exp\left(-\frac{|x - m|}{b}\right) \tag{61}$$

$$\propto \begin{cases} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2} + \frac{x - m}{b}\right) & \text{if } x < m \\ \exp\left(-\frac{(x - \mu)^2}{2\sigma^2} - \frac{x - m}{b}\right) & \text{otherwise.} \end{cases} \tag{62}$$

Writing this distribution as truncated Gaussians and incorporating constant factors $A$ and $B$, we get

$$P(x) = \begin{cases} A \exp\left(-\frac{\left(x - \left(\mu + \sigma^2/b\right)\right)^2}{2\sigma^2}\right) & \text{if } x < m \\ B \exp\left(-\frac{\left(x - \left(\mu - \sigma^2/b\right)\right)^2}{2\sigma^2}\right) & \text{otherwise.} \end{cases} \tag{63}$$

To solve for $A$ and $B$, we exploit the fact that at $x = m$ the probability density is continuous, and the distribution must normalize. The resulting distribution is a mixture of two truncated Gaussians, whose mean and variance can be computed using standard routines.
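Since the text performs these algebraic steps in code, it is worth seeing what such a routine looks like. The sketch below is our reconstruction of the generic Gaussian-times-Laplacian computation (Equations 61 to 63), returning the mean and variance of the resulting mixture of two truncated Gaussians; it is not the authors' code.

```python
import numpy as np
from scipy.stats import truncnorm, norm

def gaussian_times_laplacian(mu, sigma2, m, b):
    """Moments of P(x) ~ N(x; mu, sigma2) * Laplacian(x; m, b), Eqs. 61-63:
    a two-component mixture of truncated Gaussians with shifted means."""
    sd = np.sqrt(sigma2)
    mu_lo, mu_hi = mu + sigma2 / b, mu - sigma2 / b   # Eq. 63 shifted means
    # Mixture weights: each piece's prefactor (from completing the square)
    # times the mass of its truncated Gaussian.
    log_c_lo = (mu_lo**2 - mu**2) / (2 * sigma2) - m / b
    log_c_hi = (mu_hi**2 - mu**2) / (2 * sigma2) + m / b
    z_lo = np.exp(log_c_lo) * norm.cdf((m - mu_lo) / sd)
    z_hi = np.exp(log_c_hi) * norm.sf((m - mu_hi) / sd)
    p_lo = z_lo / (z_lo + z_hi)
    # Truncated-Gaussian moments for each piece (below and above x = m).
    a_lo, b_lo = -np.inf, (m - mu_lo) / sd
    a_hi, b_hi = (m - mu_hi) / sd, np.inf
    m_lo = truncnorm.mean(a_lo, b_lo, loc=mu_lo, scale=sd)
    m_hi = truncnorm.mean(a_hi, b_hi, loc=mu_hi, scale=sd)
    v_lo = truncnorm.var(a_lo, b_lo, loc=mu_lo, scale=sd)
    v_hi = truncnorm.var(a_hi, b_hi, loc=mu_hi, scale=sd)
    mean = p_lo * m_lo + (1 - p_lo) * m_hi
    var = (p_lo * (v_lo + m_lo**2) + (1 - p_lo) * (v_hi + m_hi**2)) - mean**2
    return mean, var
```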

Finally, we compute the distribution over $\lambda_{\text{opt},i}$ for a given $y_{\text{opt}}$ using a similar approach, combining Equation 56, Equation 59, and the approximation, Equation 32, in code. Finally, while we could sample multiple values of $y_{\text{opt}}$, and compute the resulting mean and variance of $\lambda_{\text{opt},i}$, this is slow and gives a stochastic learning rule. Instead, an accurate, deterministic estimate is given by a sigma-point approximation [24], which involves sampling $y_{\text{opt}}$ at only the mean plus the standard deviation, and the mean minus the standard deviation.

4.3 Classical learning rules

In order to make the comparisons in Figure 2, we need to specify classical learning rules for each type of learning. Each classical rule has a learning rate, $\alpha$, which is varied in Figure 2. For supervised learning with continuous feedback, we use the delta rule,

$$m_i(t+1) = m_i(t) + \alpha f x_i(t). \tag{64}$$

For supervised learning with binary feedback, we use something similar to the delta rule, where the updates have been calibrated to ensure that increases balance decreases,

$$m_i(t+1) = m_i(t) + \alpha\left[\mathbb{1}(f = 1)\, P(f = -1) - \mathbb{1}(f = -1)\, P(f = 1)\right], \tag{65}$$

where $\mathbb{1}$ is the indicator function, which is 1 if the condition is true, and 0 if the condition is false. The probability $P(f = 1)$ is computed by taking an average over past observations, and is necessary to ensure that there is no bias caused by having a threshold that is far from 0. For reinforcement learning, we use a standard policy gradient method,

$$m_i(t+1) = m_i(t) - \alpha\left(f - E[f]\right)\left(w_i - m_i\right), \tag{66}$$

where the expected loss, $E[f]$, is computed over past trials. For unsupervised learning, we use a standard Hebbian-like learning rule, inspired by our equations for inference,

$$m_i(t+1) = m_i(t) + \alpha\, x_i\left(\left(\frac{1}{a} - 1\right) y - \frac{\operatorname{sign}(y)}{b}\right). \tag{67}$$

Note that if $x_i$ is positive and $y$ is large and positive, then the weight is increased, as $a$ is typically smaller than 1, whereas if $y$ is small, then the weight may fall.
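Two of the classical baselines in code, as a point of comparison with the Bayesian update above. The learning rate $\alpha$ is the free parameter swept in Figure 2; the default below is only a placeholder.

```python
def delta_rule(m, f, x, alpha=0.01):
    """Equation 64: the classical delta rule, continuous feedback."""
    return m + alpha * f * x

def policy_gradient(m, f, f_bar, w, alpha=0.01):
    """Equation 66: policy gradient against a running baseline f_bar ~ E[f]."""
    return m - alpha * (f - f_bar) * (w - m)
```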

4.4 Mean field

We wish to find relationships between the measured quantities, $m_i$ and $\sigma_i^2$. To do so, we look at Equation 18b in steady state, i.e., where $\sigma_i^2(t+1) = \sigma_i^2(t)$. To simplify our derivations, we neglect the reduction in the variance caused by the slow drift towards the mean (the $(1 - 1/\tau)$ term), as this only becomes important when there is very little incoming information. Although $s_{\text{like},i}$, $\text{Var}[\delta \mid f, s_{\text{like}}, x_i, w_i]$, and $x_i$ vary rapidly over short time periods, we expect their variations to be largely uncorrelated, so we can take averages separately, which allows us to write

$$\frac{2\sigma_{\text{prior}}^2}{\tau} = \left\langle \frac{s_{\text{like},i} - \text{Var}[\delta \mid f, s_{\text{like}}, x_i, w_i]}{s_{\text{like},i}^2} \right\rangle \sigma_i^4\, \nu_i\, m_i^2. \tag{68}$$

By rearranging and taking a square root, we get

$$\frac{s_i}{m_i^2} \propto \frac{1}{\sqrt{\nu_i}}. \tag{69}$$

4.5 Predictions

To derive Equation 20, we ask how a weight should change during an LTP protocol. As $\mu_i$ is the mean of the log-weight,

$$\frac{\delta m_i}{m_i} \approx \delta\mu_i. \tag{70}$$

The learning rules (Equation 18a) give

$$\frac{\delta m_i}{m_i} \propto \sigma_i^2. \tag{71}$$

As $s_i \approx \sigma_i^2 m_i^2$, we can write

$$\frac{\delta m_i}{m_i} \propto \frac{s_i}{m_i^2}. \tag{72}$$

Finally, substituting the mean field result in Equation 69 gives Equation 20.
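As a rough numerical check of the claimed scaling, one can iterate Equation 18b to its fixed point with $x_i^2$ replaced by its average, $\nu\,\mathrm{d}t$, and with $\text{Var}[\delta \mid f]$ set to zero whenever the input spikes. This is a crude mean-field caricature with assumed constants, not the paper's simulation, but it recovers an approximately constant $\sigma_i^2\sqrt{\nu_i}$ across rates.

```python
import numpy as np

tau, sigma_prior = 10_000.0, 0.5     # illustrative drift parameters
m, s_like_i, dt = 1.0, 1.0, 0.01     # fixed mean weight, likelihood variance,
                                     # and a 10 ms time step

def steady_state_sigma2(nu, n_steps=200_000):
    """Mean-field iteration of Equation 18b: x_i^2 is replaced by its
    average nu * dt, and Var[delta | f] = 0 on spikes."""
    s2 = sigma_prior**2
    for _ in range(n_steps):
        s2 = ((1 - 1 / tau)**2 * (s2 - s2**2 * (nu * dt) * m**2 / s_like_i)
              + 2 * sigma_prior**2 / tau)
    return s2

for nu in [1.0, 10.0, 100.0]:   # presynaptic rates in Hz
    print(nu, steady_state_sigma2(nu) * np.sqrt(nu))   # approximately constant
```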


More information

Abstract. Author Summary

Abstract. Author Summary 1 Self-organization of microcircuits in networks of spiking neurons with plastic synapses Gabriel Koch Ocker 1,3, Ashok Litwin-Kumar 2,3,4, Brent Doiron 2,3 1: Department of Neuroscience, University of

More information

RESEARCH STATEMENT. Nora Youngs, University of Nebraska - Lincoln

RESEARCH STATEMENT. Nora Youngs, University of Nebraska - Lincoln RESEARCH STATEMENT Nora Youngs, University of Nebraska - Lincoln 1. Introduction Understanding how the brain encodes information is a major part of neuroscience research. In the field of neural coding,

More information

neural networks Balázs B Ujfalussy 17 october, 2016 idegrendszeri modellezés 2016 október 17.

neural networks Balázs B Ujfalussy 17 october, 2016 idegrendszeri modellezés 2016 október 17. neural networks Balázs B Ujfalussy 17 october, 2016 Hierarchy of the nervous system behaviour idegrendszeri modellezés 1m CNS 10 cm systems 1 cm maps 1 mm networks 100 μm neurons 1 μm synapses 10 nm molecules

More information

Fast and exact simulation methods applied on a broad range of neuron models

Fast and exact simulation methods applied on a broad range of neuron models Fast and exact simulation methods applied on a broad range of neuron models Michiel D Haene michiel.dhaene@ugent.be Benjamin Schrauwen benjamin.schrauwen@ugent.be Ghent University, Electronics and Information

More information

Novel VLSI Implementation for Triplet-based Spike-Timing Dependent Plasticity

Novel VLSI Implementation for Triplet-based Spike-Timing Dependent Plasticity Novel LSI Implementation for Triplet-based Spike-Timing Dependent Plasticity Mostafa Rahimi Azghadi, Omid Kavehei, Said Al-Sarawi, Nicolangelo Iannella, and Derek Abbott Centre for Biomedical Engineering,

More information

THE retina in general consists of three layers: photoreceptors

THE retina in general consists of three layers: photoreceptors CS229 MACHINE LEARNING, STANFORD UNIVERSITY, DECEMBER 2016 1 Models of Neuron Coding in Retinal Ganglion Cells and Clustering by Receptive Field Kevin Fegelis, SUID: 005996192, Claire Hebert, SUID: 006122438,

More information

The Spike Response Model: A Framework to Predict Neuronal Spike Trains

The Spike Response Model: A Framework to Predict Neuronal Spike Trains The Spike Response Model: A Framework to Predict Neuronal Spike Trains Renaud Jolivet, Timothy J. Lewis 2, and Wulfram Gerstner Laboratory of Computational Neuroscience, Swiss Federal Institute of Technology

More information

Tuning tuning curves. So far: Receptive fields Representation of stimuli Population vectors. Today: Contrast enhancment, cortical processing

Tuning tuning curves. So far: Receptive fields Representation of stimuli Population vectors. Today: Contrast enhancment, cortical processing Tuning tuning curves So far: Receptive fields Representation of stimuli Population vectors Today: Contrast enhancment, cortical processing Firing frequency N 3 s max (N 1 ) = 40 o N4 N 1 N N 5 2 s max

More information

High-dimensional geometry of cortical population activity. Marius Pachitariu University College London

High-dimensional geometry of cortical population activity. Marius Pachitariu University College London High-dimensional geometry of cortical population activity Marius Pachitariu University College London Part I: introduction to the brave new world of large-scale neuroscience Part II: large-scale data preprocessing

More information

Fundamentals of Computational Neuroscience 2e

Fundamentals of Computational Neuroscience 2e Fundamentals of Computational Neuroscience 2e Thomas Trappenberg March 21, 2009 Chapter 9: Modular networks, motor control, and reinforcement learning Mixture of experts Expert 1 Input Expert 2 Integration

More information

Statistical models for neural encoding

Statistical models for neural encoding Statistical models for neural encoding Part 1: discrete-time models Liam Paninski Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/ liam liam@gatsby.ucl.ac.uk

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary discussion 1: Most excitatory and suppressive stimuli for model neurons The model allows us to determine, for each model neuron, the set of most excitatory and suppresive features. First,

More information

Finding a Basis for the Neural State

Finding a Basis for the Neural State Finding a Basis for the Neural State Chris Cueva ccueva@stanford.edu I. INTRODUCTION How is information represented in the brain? For example, consider arm movement. Neurons in dorsal premotor cortex (PMd)

More information

Neural networks: Unsupervised learning

Neural networks: Unsupervised learning Neural networks: Unsupervised learning 1 Previously The supervised learning paradigm: given example inputs x and target outputs t learning the mapping between them the trained network is supposed to give

More information

Artificial Neural Network and Fuzzy Logic

Artificial Neural Network and Fuzzy Logic Artificial Neural Network and Fuzzy Logic 1 Syllabus 2 Syllabus 3 Books 1. Artificial Neural Networks by B. Yagnanarayan, PHI - (Cover Topologies part of unit 1 and All part of Unit 2) 2. Neural Networks

More information

Introduction. Previous work has shown that AER can also be used to construct largescale networks with arbitrary, configurable synaptic connectivity.

Introduction. Previous work has shown that AER can also be used to construct largescale networks with arbitrary, configurable synaptic connectivity. Introduction The goal of neuromorphic engineering is to design and implement microelectronic systems that emulate the structure and function of the brain. Address-event representation (AER) is a communication

More information

Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity

Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity Sander M. Bohte a Michael C. Mozer b a CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands b Dept. of Computer

More information

1 Balanced networks: Trading speed for noise

1 Balanced networks: Trading speed for noise Physics 178/278 - David leinfeld - Winter 2017 (Corrected yet incomplete notes) 1 Balanced networks: Trading speed for noise 1.1 Scaling of neuronal inputs An interesting observation is that the subthresold

More information

Scalable Inference for Neuronal Connectivity from Calcium Imaging

Scalable Inference for Neuronal Connectivity from Calcium Imaging Scalable Inference for Neuronal Connectivity from Calcium Imaging Alyson K. Fletcher Sundeep Rangan Abstract Fluorescent calcium imaging provides a potentially powerful tool for inferring connectivity

More information

Application of density estimation methods to quantal analysis

Application of density estimation methods to quantal analysis Application of density estimation methods to quantal analysis Koichi Yoshioka Tokyo Medical and Dental University Summary There has been controversy for the quantal nature of neurotransmission of mammalian

More information

Computer Vision Group Prof. Daniel Cremers. 3. Regression

Computer Vision Group Prof. Daniel Cremers. 3. Regression Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the

More information

Deep learning in the brain. Deep learning summer school Montreal 2017

Deep learning in the brain. Deep learning summer school Montreal 2017 Deep learning in the brain Deep learning summer school Montreal 207 . Why deep learning is not just for AI The recent success of deep learning in artificial intelligence (AI) means that most people associate

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

6 Reinforcement Learning

6 Reinforcement Learning 6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,

More information

Statistics Random Variables

Statistics Random Variables 1 Statistics Statistics are used in a variety of ways in neuroscience. Perhaps the most familiar example is trying to decide whether some experimental results are reliable, using tests such as the t-test.

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Reducing the Variability of Neural Responses: A Computational Theory of Spike-Timing-Dependent Plasticity

Reducing the Variability of Neural Responses: A Computational Theory of Spike-Timing-Dependent Plasticity LETTER Communicated by Gal Chechik Reducing the Variability of Neural Responses: A Computational Theory of Spike-Timing-Dependent Plasticity Sander M. Bohte sbohte@cwi.nl Netherlands Centre for Mathematics

More information

Ranking Neurons for Mining Structure-Activity Relations in Biological Neural Networks: NeuronRank

Ranking Neurons for Mining Structure-Activity Relations in Biological Neural Networks: NeuronRank Ranking Neurons for Mining Structure-Activity Relations in Biological Neural Networks: NeuronRank Tayfun Gürel a,b,1, Luc De Raedt a,b, Stefan Rotter a,c a Bernstein Center for Computational Neuroscience,

More information

Dynamical Constraints on Computing with Spike Timing in the Cortex

Dynamical Constraints on Computing with Spike Timing in the Cortex Appears in Advances in Neural Information Processing Systems, 15 (NIPS 00) Dynamical Constraints on Computing with Spike Timing in the Cortex Arunava Banerjee and Alexandre Pouget Department of Brain and

More information

Autoassociative recall as probabilistic inference

Autoassociative recall as probabilistic inference Supplementary Note Autoassociative recall as probabilistic inference We first specify the task for autoassociative recall in a normative manner 1,2. This specification leads to a natural account of the

More information

Correlations strike back (again): the case of associative memory retrieval

Correlations strike back (again): the case of associative memory retrieval Correlations strike back (again): the case of associative memory retrieval Cristina Savin 1 cs664@cam.ac.uk Peter Dayan 2 dayan@gatsby.ucl.ac.uk Máté Lengyel 1 m.lengyel@eng.cam.ac.uk 1 Computational &

More information

Gaussian Process Approximations of Stochastic Differential Equations

Gaussian Process Approximations of Stochastic Differential Equations Gaussian Process Approximations of Stochastic Differential Equations Cédric Archambeau Dan Cawford Manfred Opper John Shawe-Taylor May, 2006 1 Introduction Some of the most complex models routinely run

More information

Leo Kadanoff and 2d XY Models with Symmetry-Breaking Fields. renormalization group study of higher order gradients, cosines and vortices

Leo Kadanoff and 2d XY Models with Symmetry-Breaking Fields. renormalization group study of higher order gradients, cosines and vortices Leo Kadanoff and d XY Models with Symmetry-Breaking Fields renormalization group study of higher order gradients, cosines and vortices Leo Kadanoff and Random Matrix Theory Non-Hermitian Localization in

More information

Neurophysiology. Danil Hammoudi.MD

Neurophysiology. Danil Hammoudi.MD Neurophysiology Danil Hammoudi.MD ACTION POTENTIAL An action potential is a wave of electrical discharge that travels along the membrane of a cell. Action potentials are an essential feature of animal

More information

Mathematical Models of Dynamic Behavior of Individual Neural Networks of Central Nervous System

Mathematical Models of Dynamic Behavior of Individual Neural Networks of Central Nervous System Mathematical Models of Dynamic Behavior of Individual Neural Networks of Central Nervous System Dimitra Despoina Pagania 1, Adam Adamopoulos 1,2 and Spiridon D. Likothanassis 1 1 Pattern Recognition Laboratory,

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Introduction. Stochastic Processes. Will Penny. Stochastic Differential Equations. Stochastic Chain Rule. Expectations.

Introduction. Stochastic Processes. Will Penny. Stochastic Differential Equations. Stochastic Chain Rule. Expectations. 19th May 2011 Chain Introduction We will Show the relation between stochastic differential equations, Gaussian processes and methods This gives us a formal way of deriving equations for the activity of

More information

Balance of Electric and Diffusion Forces

Balance of Electric and Diffusion Forces Balance of Electric and Diffusion Forces Ions flow into and out of the neuron under the forces of electricity and concentration gradients (diffusion). The net result is a electric potential difference

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Introduction to Neural Networks U. Minn. Psy 5038 Spring, 1999 Daniel Kersten. Lecture 2a. The Neuron - overview of structure. From Anderson (1995)

Introduction to Neural Networks U. Minn. Psy 5038 Spring, 1999 Daniel Kersten. Lecture 2a. The Neuron - overview of structure. From Anderson (1995) Introduction to Neural Networks U. Minn. Psy 5038 Spring, 1999 Daniel Kersten Lecture 2a The Neuron - overview of structure From Anderson (1995) 2 Lect_2a_Mathematica.nb Basic Structure Information flow:

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information