DO HEBBIAN SYNAPSES ESTIMATE ENTROPY?

Deniz Erdogmus, Jose C. Principe, Kenneth E. Hild II
CNEL, Electrical & Computer Engineering Dept., University of Florida, Gainesville, FL 32611
[deniz,principe,hild]@cnel.ufl.edu

Abstract. Hebbian learning is one of the mainstays of biologically inspired neural processing. Hebb's rule is biologically plausible, and it has been extensively utilized in both computational neuroscience and in unsupervised training of neural systems. In these fields, Hebbian learning became synonymous with correlation learning. But it is known that correlation is a second order statistic of the data, so it is sub-optimal when the goal is to extract as much information as possible from the sensory data stream. In this paper, we demonstrate how information learning can be implemented using Hebb's rule. Thus the paper brings a new understanding of how neural systems could, through Hebb's rule, extract information theoretic quantities rather than merely correlation.

INTRODUCTION

Hebb's rule states: "When an axon of cell A is near enough to excite cell B or repeatedly or consistently takes part in firing it, some growth or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased" [1]. This principle has been translated in the neural networks literature as: in Hebbian learning, the weight connecting a neuron to another is incremented proportionally to the product of the input to the neuron and its output [2]. This formulation can then be shown to maximize the correlation between the input and the output of the neuron whose weights are updated, through the use of Widrow's stochastic gradient on a correlation-based cost function [2,3]. Correlation has thus been declared the basis of learning, and Hebb's law has largely shaped our understanding of the operation of the brain and the process of learning. However, we know that correlation describes simple second order statistics of the data, and as such it is sub-optimal when the goal is to process information. Our brains must exploit much more than simply correlation from the neural activity in order to achieve their level of performance in information processing.

From an abstract perspective, the learning-from-examples scenario starts with a data set, which globally conveys information about a real world event, and the goal is to capture that information in the weights of a learning machine. Information Theory (IT) should be the mathematical infrastructure used to quantify this change in state, because it is the best possible approach to deal with the manipulation of information [4]. Shannon, in a classical 1948 paper, laid down the foundations of IT [5]. IT has had a tremendous impact on the design of efficient and reliable communication systems [6,7] because it is able to answer two key questions: what is the best possible (minimal) code for our data, and what is the maximal amount of information that can be transferred through a particular channel. In spite of its practical origins, IT is a deep mathematical theory concerned with the very essence of the communication process. IT has also impacted statistical mechanics by providing a clearer understanding of the nature of entropy, as was illustrated by Jaynes [8].

Due to its emphasis on principles, information theory has also played a role in explaining biological information processing. Barlow enunciated the principle of redundancy reduction [9], building on earlier work by Attneave on visual perception [10], and utilized it to train, by unsupervised learning, the weights of a linear neural network. He also proposed factorial or minimum entropy coding [11]. The development of sparse codes has recently been advanced by many researchers [12,13]. Atick [14] and Atick and Redlich [15] demonstrated that feature extraction could be accomplished from noisy inputs by maximizing the mutual information between the input and the output, which is a form of redundancy reduction. The recent interest in sparse representations was triggered by Field [16] and by a subsequent paper in Nature, which demonstrated the emergence of simple-cell receptive fields when learning a sparse code for natural images [13]. Probably the most insightful application of information theory to brain theory is Linsker's principle of maximum information preservation, also called the Infomax principle [17]. Linsker enunciated a self-organizing principle for neuronal assemblies based on the simple idea that the role of a neuronal assembly is to transfer to its output as much information as possible about its input. He formulated this principle, in analogy to the channel capacity theorem, as the maximization of the mutual information between the input and the output. This is a profound idea that shines new light on the organizational principles of the brain.

One of the difficulties of all these studies is that there is an abyss between the sophisticated computation required to estimate Shannon's entropy and mutual information and the reality of the local computation of the Hebbian synapse. We will show in this paper, however, that if a kernel-based nonparametric estimator for Shannon's entropy is utilized, then the Hebbian synapse can in fact estimate entropy over time instead of simply estimating correlation. We also demonstrate that the same goal can be reached through the unconventional Renyi's entropy.

The structure of this paper is the following. First we briefly explain, and provide references for, our nonparametric approach to estimate entropy directly from samples, for both Shannon's and Renyi's entropy definitions. Then we derive the stochastic information gradient (SIG) to adapt the parameters of a linear, or nonlinear, system so as to maximize or minimize entropy [18]. While exploring the mathematical properties of the SIG algorithm, we establish the similarity between a special case of the SIG algorithm and Hebbian and anti-Hebbian terms, which raises the question of whether information processing is possible through Hebbian learning. Finally we present a simple example to demonstrate the convergence of the algorithm to the desired solutions.

SIG ALGORITHM FOR ADALINE

In order to arrive at the SIG algorithm, we start with the derivation of the nonparametric entropy estimator that employs Parzen windowing. First consider Shannon's entropy, which is given by

H_S(Y) = E[-\log f_Y(Y)]    (1)

Suppose we utilize the stochastic value of this quantity at time k, such that the expectation is dropped and the argument of this operator is evaluated at the most recent sample:

H_S(y_k) \approx -\log f_Y(y_k)    (2)

Since in practice the analytical pdf of the random variable is not available, we employ the nonparametric Parzen window estimator over the last L samples of the signal [21]:

\hat{H}_S(y_k) = -\log \frac{1}{L} \sum_{i=1}^{L} \kappa_\sigma(y_k - y_{k-i})    (3)

Assuming that the samples of y are generated by an ADALINE structure, i.e. y_k = w^T x_k, where x_k is the input vector and w is the weight vector, the stochastic gradient estimate of Shannon's entropy is found (in terms of a column vector) as

\frac{\partial \hat{H}_S(y_k)}{\partial w} = -\frac{\sum_{i=1}^{L} \kappa_\sigma'(y_k - y_{k-i})\,(x_k - x_{k-i})}{\sum_{i=1}^{L} \kappa_\sigma(y_k - y_{k-i})}    (4)

This expression is called the stochastic information gradient (SIG) [18]. The SIG can also be derived using Renyi's entropy, a parametric family of entropy functions. For a random variable Y it is defined as [19]

H_\alpha(Y) = \frac{1}{1-\alpha} \log E[f_Y^{\alpha-1}(Y)]    (5)

We named the argument of the log in (5) the information potential, not arbitrarily, but because it shares the properties of physical potential fields once the formulation is complete [20]. The expectation in the information potential can be dropped to obtain a stochastic estimate of this quantity. Notice that minimization or maximization of the entropy is equivalent to minimization or maximization of the information potential, depending on the value of α. Using Parzen windowing over the last L samples, at time k we obtain the stochastic quantity

\hat{V}(y_k) = \left( \frac{1}{L} \sum_{i=1}^{L} \kappa_\sigma(y_k - y_{k-i}) \right)^{\alpha-1}    (6)

Then, for an ADALINE, the stochastic gradient of Renyi's entropy becomes

\frac{\partial \hat{H}_\alpha(y_k)}{\partial w} = \frac{1}{1-\alpha} \frac{\partial \hat{V}(y_k)/\partial w}{\hat{V}(y_k)} = -\frac{\sum_{i=1}^{L} \kappa_\sigma'(y_k - y_{k-i})\,(x_k - x_{k-i})}{\sum_{i=1}^{L} \kappa_\sigma(y_k - y_{k-i})}    (7)

which is identical to (4), since the exponent α-1 in (6) cancels the 1/(1-α) factor in (5). In (3), (4) and (7), κ_σ(·) is called the kernel function of Parzen windowing, usually selected to be a symmetric, continuous and differentiable pdf whose width is determined by the parameter σ. In [22], we established the link between the kernel function and convolution smoothing for global optimization, and we also demonstrated the link between the minimum error-entropy (MEE) criterion and the minimum mean-square-error (MMSE) criterion in supervised learning. Note that, when maximizing or minimizing the output entropy of the ADALINE, just like in Oja's rule [23], we need to normalize the weights to keep them from growing without bound or decaying to zero.

Alternative SIG expressions for Shannon's and Renyi's definitions of entropy can also be obtained by utilizing a single sample from the past in the Parzen pdf estimate and using the window of L samples to approximate the expectation operator with a sample mean. As a result, assuming an ADALINE structure, we obtain the following alternative SIG expressions for Shannon's and Renyi's entropies, respectively:

\frac{\partial \hat{H}_S(y_k)}{\partial w} = -\frac{1}{L} \sum_{i=0}^{L-1} \frac{\kappa_\sigma'(y_{k-i} - y_{k-i-1})}{\kappa_\sigma(y_{k-i} - y_{k-i-1})}\,(x_{k-i} - x_{k-i-1})    (8)

\frac{\partial \hat{H}_\alpha(y_k)}{\partial w} = -\frac{\sum_{i=0}^{L-1} \kappa_\sigma^{\alpha-2}(y_{k-i} - y_{k-i-1})\,\kappa_\sigma'(y_{k-i} - y_{k-i-1})\,(x_{k-i} - x_{k-i-1})}{\sum_{i=0}^{L-1} \kappa_\sigma^{\alpha-1}(y_{k-i} - y_{k-i-1})}    (9)

The SIG given in (4) is, in effect, very similar to Widrow's stochastic gradient in LMS; we assume that only the instantaneous increments in the output samples, i.e. y_k - y_{k-i}, are available at the k-th iteration of adaptation. Moreover, it is trivial to see that the expected value of this gradient is the actual gradient of Shannon's entropy for the output pdf estimated with Parzen windowing. For a small kernel size and a large number of samples, this estimate is very close to the true pdf. An interesting special case of (4), (8), and (9) is obtained for L = 1:

\frac{\partial \hat{H}(y_k)}{\partial w} = -\frac{\kappa_\sigma'(y_k - y_{k-1})}{\kappa_\sigma(y_k - y_{k-1})}\,(x_k - x_{k-1})    (10)

The convergence properties of this algorithm for entropy minimization were studied in detail in [24]. Extensions to nonlinear systems that are differentiable with respect to their weights are also possible [18].
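To make the computation concrete, here is a brief Python sketch (not part of the paper; the array layout, function names, and parameter values are illustrative assumptions). It evaluates the SIG of (4) with a Gaussian kernel and checks numerically that, for a window of L = 1, it reduces to the increment-product form derived in the next section.

```python
# Illustrative sketch (not from the paper): SIG of eq. (4) for an ADALINE with a
# Gaussian Parzen kernel, plus a numerical check of the L = 1 special case.
import numpy as np

def gaussian_kernel(xi, sigma):
    return np.exp(-xi**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def gaussian_kernel_deriv(xi, sigma):
    # kappa_sigma'(xi) = -(xi / sigma^2) * kappa_sigma(xi); this is eq. (11) below
    return -(xi / sigma**2) * gaussian_kernel(xi, sigma)

def sig_shannon(x_hist, y_hist, sigma):
    """Stochastic information gradient of eq. (4).

    x_hist: (L+1, d) array of inputs, most recent x_k in row 0.
    y_hist: (L+1,) array of outputs, most recent y_k in position 0.
    Returns dH_hat/dw as a length-d vector.
    """
    dy = y_hist[0] - y_hist[1:]              # y_k - y_{k-i}, i = 1..L
    dx = x_hist[0] - x_hist[1:]              # x_k - x_{k-i}
    return -(gaussian_kernel_deriv(dy, sigma) @ dx) / gaussian_kernel(dy, sigma).sum()

# L = 1 check: the SIG should equal (y_k - y_{k-1})(x_k - x_{k-1}) / sigma^2.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))              # rows: x_k, x_{k-1}
w = rng.standard_normal(3)
y = x @ w
sigma = 0.5
print(np.allclose(sig_shannon(x, y, sigma),
                  (y[0] - y[1]) * (x[0] - x[1]) / sigma**2))   # True
```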

RELATIONSHIP BETWEEN SIG AND HEBB'S RULE

In general, we prefer differentiable and symmetric kernels in Parzen windowing; differentiability is required to guarantee proper evaluation of the gradient during adaptation, and symmetry is preferred to prevent biasing the mean of the estimated pdf. Suppose Gaussian kernels are employed. In that case, σ naturally becomes the standard deviation of the kernel, and we have

\kappa_\sigma'(\xi) = -\frac{\xi}{\sigma^2} \kappa_\sigma(\xi)    (11)

Substituting (11) in (10), we obtain the explicit expression

\frac{\partial \hat{H}(y_k)}{\partial w} = \frac{1}{\sigma^2} (y_k - y_{k-1}) (x_k - x_{k-1})    (12)

We clearly see from (12) that employing Hebbian and anti-Hebbian terms involving the two most recent values of the input and the output results in a stochastic estimate of the information gradient. When interpreted under the viewpoint of classical Hebbian learning, (12) states that it is possible to maximize the entropy by applying the Hebbian rule to the instantaneous increments of the input and output signals. Thus the neuron implements information learning rather than merely correlation learning.

Now consider the general case, where any differentiable symmetric kernel function may be used in Parzen windowing. Since only the Gaussian pdf satisfies the differential equation in (11), it is the only kernel choice that results in the special case given by (12), i.e. a learning algorithm that is Hebbian on the increments in the classical sense. In general, since we use symmetric and differentiable kernels that are pdfs themselves, we get the following update rule

\frac{\partial \hat{H}(y_k)}{\partial w} = f(y_k - y_{k-1}) (x_k - x_{k-1})    (13)

where f(\xi) = -\kappa_\sigma'(\xi)/\kappa_\sigma(\xi), which satisfies the condition sign f(\xi) = sign \xi. Thus, the update applied to the weight vector is still in the same direction as it would be in classical Hebbian learning; however, it is scaled nonlinearly depending on the value of the increment that occurred in the output of the neuron. This is in fact consistent with Hebb's principle, and it is a good example demonstrating that the product of the output and the input signals is not the only possible weight update that implements Hebb's principle.

CASE STUDIES

In this section, we present two case studies in which an ADALINE is trained to yield maximum entropy at the output subject to a unit-norm constraint on the weight vector. The weights are updated using SIG. In both examples, training is carried out using two different kernel function choices, Gaussian and Cauchy. Explicitly, these kernel functions are given by

\kappa_{G,\sigma}(\xi) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\xi^2/(2\sigma^2)}, \qquad \kappa_{C,\sigma}(\xi) = \frac{1}{\pi\sigma} \frac{1}{1 + \xi^2/\sigma^2}    (14)

The weights are normalized to unit length after each update, and the information about the previous output values is modified using this normalized weight vector. In the first example, samples from a 2-dimensional joint Gaussian distribution are generated and the ADALINE is trained to maximize the entropy at the output. The step sizes are chosen as 10^-5 and 10^-3 for the Gaussian and Cauchy kernels, respectively, and the same kernel size is used for both. Fig. 1a shows the samples and the directions deduced. Fig. 1b shows the convergence of the weight vector angle to the ideal solution, which corresponds to the first principal component in this case, where the data distribution is Gaussian. The choices of kernel size and step size are important issues that affect convergence time and misadjustment.

Figure 1: (a) Data samples and the derived directions. (b) Convergence of the angles.

In the second example, samples of a 2-dimensional random vector are generated such that the x-coordinate is uniform, the y-coordinate is Gaussian, and the sample covariance matrix is identity. In this case, PCA is unable to deduce a maximal variance direction, since the variance along every direction is the same. On the other hand, using the maximum (minimum) entropy approach, we can extract the direction along which the data exhibits the most (least) uncertainty. Fig. 2a shows the estimated entropy versus the direction of the weight vector for both kernels, and Fig. 2b shows the convergence of the weights from 5 different initial conditions.

Figure 2: (a) Entropy vs. direction. (b) Convergence from different initial conditions.
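As a rough illustration of these case studies (this is not the authors' code; the data covariance, step sizes, kernel sizes, and epoch count are made-up values), the following Python sketch trains a 2-input ADALINE with the L = 1 SIG update of (13), renormalizes the weights after every step as described above, and compares the learned direction with the first principal component of Gaussian data, as in the first example.

```python
# Illustrative sketch of the first case study (parameters are assumptions, not
# the paper's): maximize the ADALINE output entropy with the L = 1 SIG update
# of eq. (13) under a unit-norm weight constraint, for the two kernels of (14).
import numpy as np

def f_gauss(dy, sigma):
    # -kappa'/kappa for the Gaussian kernel: dy / sigma^2 (eq. (12))
    return dy / sigma**2

def f_cauchy(dy, sigma):
    # -kappa'/kappa for the Cauchy kernel of eq. (14)
    return (2.0 * dy / sigma**2) / (1.0 + (dy / sigma)**2)

def train_max_entropy(x, f, sigma, eta, epochs):
    w = np.array([1.0, 0.0])
    for _ in range(epochs):
        y_prev, x_prev = x[0] @ w, x[0]
        for x_k in x[1:]:
            y_k = x_k @ w
            w = w + eta * f(y_k - y_prev, sigma) * (x_k - x_prev)  # entropy ascent
            w = w / np.linalg.norm(w)        # unit-norm constraint
            y_prev, x_prev = x_k @ w, x_k    # past output recomputed with new w
    return w

rng = np.random.default_rng(1)
C = np.array([[3.0, 1.0], [1.0, 1.0]])       # anisotropic 2-D Gaussian data
x = rng.multivariate_normal([0.0, 0.0], C, size=1000)

w_g = train_max_entropy(x, f_gauss, sigma=1.0, eta=1e-3, epochs=20)
w_c = train_max_entropy(x, f_cauchy, sigma=1.0, eta=1e-2, epochs=20)

# For Gaussian data the maximum-entropy direction is the first principal component.
_, vecs = np.linalg.eigh(np.cov(x.T))
pc1 = vecs[:, -1]
print(abs(w_g @ pc1), abs(w_c @ pc1))        # both should be close to 1
```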

We remark that, in this example, as the number of samples increases, the estimated distributions will converge, for small kernel sizes, to the actual distributions, which lie between uniform and Gaussian. Hence, the asymptotically ideal solution for the angle will be π/2, since the Gaussian distribution has the largest entropy among distributions of fixed variance [7].

The third case study we present is a Monte Carlo analysis of the misadjustment of the stochastic gradient given in (12). In each run of this analysis, we use samples from a 2-dimensional jointly Gaussian random vector, as was done in the first example. The eigenspread λ_max/λ_min of the covariance matrix is one of the controlled variables. The kernel size is scaled up or down with the standard deviation of the maximum entropy projection (which could be estimated from the samples) as σ = σ_0 √λ_max, where σ_0 is a base value of the kernel size selected for unit-variance samples. This latter variable is also part of the controlled parameters of the Monte Carlo simulations. Specifically, we control the ratio η/σ_0, where η is the learning rate of the steepest ascent algorithm. We perform this series of Monte Carlo simulations varying η/σ_0 and the eigenspread λ_max/λ_min over a grid of values. For each pairwise combination of these parameters, a set of experiments was performed, starting from random initial conditions and using the last samples of each run to compute the MSE between the estimated maximum entropy direction and the actual solution found using the unbiased sample covariance estimate of the whole training set. The MSE values obtained over the runs are then averaged to produce the results summarized in Fig. 3. These results are presented in terms of root-mean-square (RMS) values of the error, in degrees, between the estimated direction and the true direction (notice the log scale of the RMS axes in all three subplots). Both cross-section plots and the two-argument, one-output mesh plot of the RMS error versus learning rate and eigenspread are provided for convenience.

Figure 3: (a) RMS error in degrees versus η/σ_0 for different eigenspread values. (b) RMS error in degrees versus √(λ_max/λ_min) for different learning rates. (c) RMS error in degrees versus η/σ_0 and √(λ_max/λ_min).
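A minimal sketch of this Monte Carlo protocol is given below; it is not the authors' code, and the parameter grid, number of samples, number of runs, and averaging window are placeholder assumptions. It estimates the RMS angular error of the L = 1 update of (12) as a function of η/σ_0 and the eigenspread of 2-D Gaussian data.

```python
# Illustrative Monte Carlo sketch (grid, sample counts and run counts are
# assumptions, not the paper's): RMS angle error of the eq. (12) update versus
# eta/sigma_0 and the eigenspread of 2-D Gaussian data.
import numpy as np

def run_once(rng, eigenspread, eta_over_sigma0, n=1000, sigma0=0.25, tail=300):
    C = np.diag([eigenspread, 1.0])              # lambda_max along the x-axis
    x = rng.multivariate_normal([0.0, 0.0], C, size=n)
    sigma = sigma0 * np.sqrt(eigenspread)        # kernel size scaled with projection std
    eta = eta_over_sigma0 * sigma0
    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    y_prev, x_prev = x[0] @ w, x[0]
    errors = []
    for k, x_k in enumerate(x[1:], start=1):
        y_k = x_k @ w
        w = w + eta * (y_k - y_prev) * (x_k - x_prev) / sigma**2   # eq. (12), ascent
        w /= np.linalg.norm(w)
        y_prev, x_prev = x_k @ w, x_k
        if k >= n - tail:                        # keep only the tail of the run
            err = np.degrees(np.arctan2(w[1], w[0]))   # true direction is 0 degrees
            errors.append((err + 90.0) % 180.0 - 90.0) # fold sign/period ambiguity
    return np.sqrt(np.mean(np.square(errors)))

rng = np.random.default_rng(2)
for es in (2.0, 10.0, 100.0):
    for ratio in (0.01, 0.1, 1.0):
        rms = np.mean([run_once(rng, es, ratio) for _ in range(10)])
        print(f"eigenspread={es:6.1f}  eta/sigma0={ratio:4.2f}  RMS error={rms:6.2f} deg")
```

Sweeping such a grid should reproduce the qualitative trend discussed next: larger learning rates increase the error, while larger eigenspreads decrease it.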

From the results shown in Fig. 3, we observe that the misadjustment increases as the learning rate increases and as the eigenspread decreases, as expected. The effect of the learning rate is intuitively obvious from our knowledge of the effect of the step size on the misadjustment of the well-known LMS algorithm. The intuition behind the observed effect of the eigenspread is that, with increasing eigenspread, the entropy of the projection onto the desired direction increases, causing the differences between consecutive samples of the input vector and of the output value to become larger, thus emphasizing the distinction between the maximum-entropy direction and all other directions. Therefore, it becomes easier for the algorithm to determine and converge to the optimal solution sought.

CONCLUSIONS

In this paper we presented a nonparametric entropy estimator whose stochastic gradient leads to a local, sample-wise expression involving the product of a nonlinear function of the differences between consecutive output samples and the difference between the corresponding input samples. The derived stochastic entropy gradient expressions (all three of them) can be utilized in the information theoretic solution of many engineering problems where training of adaptive systems according to information theoretic criteria is required. These stochastic information gradient (SIG) expressions allow the designer to manipulate the information at the output of a learning system on a sample-by-sample basis; therefore, they are extremely useful for fast, on-line information theoretic learning. For the special case of single-sample windows, this stochastic gradient reduces to a combination of Hebbian and anti-Hebbian learning among pairs of consecutive samples, acting on their instantaneous increments in value instead of on their instantaneous values, as in the classical sense. We have investigated the ability of this simplified SIG algorithm to determine maximum-entropy directions in different sets of random variables and demonstrated that it successfully forces the linear adaptive system to converge to the desired solution. A Monte Carlo analysis of the effects of the learning rate and the eigenspread of the data on the misadjustment of the algorithm in determining the maximum-entropy direction was also conducted. This analysis verified the expectation that the misadjustment increases with increasing learning rate and decreasing eigenspread.

There are two main conclusions from this study that we would like to address. First, the relationship between entropy and the Hebbian update principle encourages us to study the realism of this learning model in neuronal interactions in the brain; experiments could be set up to see if, in fact, synaptic strength depends only on the level of excitability of the neuron or if, as the formula predicts, the efficacy of the synapse is also modulated by the intervals between action potentials (verbal communication with several neuroscience experts revealed the possibility of a similar process in neuronal synapses). If our prediction is correct, then we have a very gratifying result that increases even further our admiration for biological processes. In fact, Hebbian learning could adapt the synapses with information gradients, and therefore neural assemblies would manipulate entropy, which from the point of view of information processing provides all that can be known about the data streams. The second conclusion concerns the machine learning community. Even if our predictions for computational neuroscience are incorrect, Renyi's entropy estimator is a viable alternative to help us leave the local minima created by the second order methods so pervasive in both adaptive filter theory and artificial neural network cost functions. SIG clearly shows that Hebbian type rules estimate far more than second order statistics. It is quite extraordinary that interactions between consecutive samples are able to estimate higher order statistics. Upon employing our nonparametric estimator for entropy and seeking an on-line implementation, we began to realize how powerful local computation in space-time can really be. As a result, it is imperative that researchers move to a different paradigm, in which Hebbian learning is no longer a synonym for correlation learning.

Finally, we would like to address the applicability of the presented relationship to the field of information theoretic learning. In general, mutual information, not entropy alone, is employed as the adaptation criterion, since it is scale invariant and exhibits other desirable properties. It is easy to write mutual information in terms of the marginal and joint entropies of the signals involved; therefore a link between mutual information and Hebbian learning might as well be established following this lead. On the other hand, there are applications where a modified entropy criterion might be useful, such as blind deconvolution. By adding the log of the variance of the signal to its entropy, a scale invariant objective function can be constructed, which would exhibit Hebbian update rules for both the entropy and the variance terms in the stochastic updates. In summary, there is still much to be researched about the links between information extraction in biological adaptive systems and their application to learning engineering systems.

Acknowledgments: This work is partially supported by the grants NSF-ECS and ONR-N.

REFERENCES

[1] D.O. Hebb, The Organization of Behavior: A Neuropsychological Theory, Wiley, New York, 1949.
[2] S. Haykin, Neural Networks, Prentice-Hall, New Jersey, 1999.
[3] B. Widrow, S.D. Stearns, Adaptive Signal Processing, Prentice-Hall, New Jersey, 1985.
[4] R. Fano, Transmission of Information, MIT Press, Massachusetts, 1961.
[5] C.E. Shannon, A mathematical theory of communication, Bell Sys. Tech. J., vol. 27, pp. 379-423, 623-656, 1948.
[6] R. Blahut, Principles and Practice of Information Theory, Addison Wesley, Massachusetts, 1987.
[7] T. Cover, J. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[8] E. Jaynes, Information theory and statistical mechanics, Phys. Rev., vol. 106, pp. 620-630, 1957.
[9] H. Barlow, P. Foldiak, Adaptation and Decorrelation in the Cortex, in The Computing Neuron, Durbin, Miall & Mitchison (eds.), Addison Wesley, Massachusetts, 1989.
[10] F. Attneave, Informational Aspects of Visual Perception, Psych. Rev., vol. 61, pp. 183-193, 1954.
[11] H. Barlow, T. Kaushal, G. Mitchison, Finding Minimum Entropy Codes, Neural Computation, vol. 1, no. 3, pp. 412-423, 1989.
[12] R. Zemel, P. Dayan, A. Pouget, Probabilistic Interpretation of Population Codes, Neural Computation, vol. 10, pp. 403-430, 1998.
[13] B. Olshausen, D.J. Field, Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images, Nature, vol. 381, pp. 607-609, 1996.
[14] J. Atick, Could Information Theory Provide an Ecological Theory of Sensory Processing?, Network, vol. 3, pp. 213-251, 1992.
[15] J. Atick, A. Redlich, Convergent Algorithms for Sensory Receptive Field Development, Neural Computation, vol. 5, pp. 45-60, 1993.
[16] D.J. Field, What is the Goal of Sensory Coding?, Neural Computation, vol. 6, pp. 559-601, 1994.
[17] R. Linsker, An Application of the Principle of Maximum Information Preservation to Linear Systems, in Advances in Neural Information Processing Systems, D.S. Touretzky (ed.), Morgan Kaufmann, San Francisco, 1988.
[18] D. Erdogmus, J.C. Principe, An On-Line Adaptation Algorithm for Adaptive System Training with Minimum Error Entropy: Stochastic Information Gradient, in Proc. Independent Component Analysis (ICA'01), 2001.
[19] A. Renyi, Probability Theory, American Elsevier Pub. Co., New York, 1970.
[20] J.C. Principe, D. Xu, J. Fisher, Information Theoretic Learning, in Unsupervised Adaptive Filtering, vol. I, S. Haykin (ed.), Wiley, New York, 2000.
[21] E. Parzen, On Estimation of a Probability Density Function and Mode, in Time Series Analysis Papers, Holden-Day, Inc., California, 1967.
[22] D. Erdogmus, J.C. Principe, Generalized Information Potential Criterion for Adaptive System Training, to appear in IEEE Trans. on Neural Networks, 2002.
[23] E. Oja, Subspace Methods of Pattern Recognition, Research Studies Press, England, 1983.
[24] D. Erdogmus, J.C. Principe, Convergence Analysis of the Information Potential Criterion in ADALINE Training, in Proc. Neural Networks for Signal Processing XI, 2001.
