Information maximization in a network of linear neurons

Holger Arnold

May 30, 2005

1 Introduction

It has been known since the work of Hubel and Wiesel [3] that many cells in the early visual areas of mammals have characteristic response properties. For example, there are cells sensitive to light-dark contrast and cells sensitive to edges of a certain orientation. In a series of articles [4, 5, 6], Linsker showed experimentally that similar response properties can be developed in a simple feed-forward network of linear neurons using a Hebbian learning rule [2]. The structures that emerged on the different layers of Linsker's network were spatial-opponent (also called center-surround) cells, orientation-selective cells, and orientation columns. Interestingly, these structures emerged without any structured input to the network. This is an important point, because cells with such characteristics have been found in monkeys even before birth, i.e., before they could have experienced any visual input.¹

¹ Just as a side remark: at the time of Hubel and Wiesel's work, the response properties of neurons were believed to change only on a time scale much slower than that of the activity dynamics. Recently, however, it has been found that they can also change on a rather fast time scale. Treating the receptive fields as time-invariant for the process at hand should therefore be seen as a simplification.

Linsker showed in [7, 8, 9] that the self-organization process leading to these receptive field structures is consistent with the principle of maximum information preservation, or infomax principle for short. In the context of neural networks, the infomax principle states that the transfer function from one neural layer to the next should preserve as much information as possible. If we call the state of the first layer the stimulus and the state of the second layer the response, then the infomax principle states that the response should contain as much information about the stimulus as possible.

This text is intended as an overview of a part of Linsker's work. The analysis makes use of some concepts from information theory. The book by Cover and Thomas [1] is an extensive and expensive reference on that subject; Shannon's original paper [10] is also very good reading and is freely available.

2 Optimal receptive fields for linear neurons

We consider a two-layered linear feed-forward network. Let X = (X_1, ..., X_N) be a random vector denoting the state of the input or stimulus layer, and let Y = (Y_1, ..., Y_M) be a random vector denoting the state of the output or response layer. The mutual information between X and Y is defined as

    I(X;Y) = H(Y) - H(Y|X).    (1)

In Shannon's terms, I(X;Y) is the information rate of the channel between X and Y. The higher the value of I(X;Y), the more information the response contains about the stimulus.
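
As a quick numerical illustration of definition (1), the following sketch (not part of Linsker's analysis) computes I(X;Y) = H(Y) - H(Y|X) for a small discrete channel; the stimulus distribution and channel matrix are arbitrary values chosen only for this example.

    import numpy as np

    # Toy illustration of I(X;Y) = H(Y) - H(Y|X) for a discrete channel.
    # p_x and p_y_given_x are arbitrary illustrative choices.
    p_x = np.array([0.5, 0.5])                      # stimulus distribution
    p_y_given_x = np.array([[0.9, 0.1],             # row i: distribution of Y given X = i
                            [0.2, 0.8]])

    p_xy = p_x[:, None] * p_y_given_x               # joint distribution p(x, y)
    p_y = p_xy.sum(axis=0)                          # marginal response distribution

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))              # entropy in bits

    H_y = entropy(p_y)
    H_y_given_x = np.sum(p_x * np.array([entropy(row) for row in p_y_given_x]))
    print("I(X;Y) =", H_y - H_y_given_x, "bits")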

H(Y) is the information entropy of Y; it is defined as

    H(Y) = -∫ p_Y(y) log p_Y(y) dy,    (2)

where p_Y is the probability density function of Y and the integral is taken over all possible values of Y. We assume that the random variables X_1, ..., X_N and Y_1, ..., Y_M take real values. H(Y|X) is the conditional entropy of Y given X; it is defined as

    H(Y|X) = ∫ p_X(x) H(Y|x) dx    (3)
           = -∫ p_X(x) ∫ p_{Y|x}(y|x) log p_{Y|x}(y|x) dy dx.

2.1 A model with additive noise

The output of every unit in our network is a weighted linear summation of its inputs. From a biological point of view, this is of course not a completely realistic assumption, but it simplifies our analysis, and it is Linsker's assumption anyway. We assume that the stimuli are distributed according to a multivariate normal distribution with mean zero:

    p_X(x) = (2π)^{-N/2} (det C_X)^{-1/2} exp(-(1/2) x^t C_X^{-1} x).    (4)

Here C_X = (c^X_{ij})_{ij}, with c^X_{ij} = ∫ p_X(x) x_i x_j dx and x = (x_1, ..., x_N), is the positive definite covariance matrix of the stimulus distribution; x^t denotes the transpose of x.

In biological networks, noise plays a considerable role and should therefore not be neglected. As a first attempt, we assume that the components of the response vector are disturbed by independent additive noise of constant variance. In this case, the response can be written as

    Y = W X + R    (5)

for some weight matrix W = (w_{ij})_{ij}. The element w_{ij} is the strength of the connection from unit j in the stimulus layer to unit i in the response layer. R = (R_1, ..., R_M) denotes the noise vector. We assume that R_1, ..., R_M are i.i.d. random variables distributed according to a normal distribution with mean zero and common variance ρ:

    p_{R_i}(r_i) = (2πρ)^{-1/2} exp(-r_i^2 / (2ρ)),   i = 1, ..., M.    (6)

Because the elements of R are independent, R is distributed according to a multivariate normal distribution with mean zero and diagonal covariance matrix C_R = (c^R_{ij})_{ij}, where c^R_{ij} = δ_{ij} ρ, δ being the Kronecker delta. Y is a sum of independent normally distributed random vectors and is therefore normally distributed as well; its probability density function is

    p_Y(y) = (2π)^{-M/2} (det C_Y)^{-1/2} exp(-(1/2) y^t C_Y^{-1} y)    (7)

with positive definite covariance matrix C_Y = W C_X W^t + C_R and mean zero. To maximize the mutual information I(X;Y), we need to compute the entropies H(Y) and H(Y|X). The entropy of the normally distributed random vector Y is

    H(Y) = log((2πe)^{M/2} (det C_Y)^{1/2}).    (8)

See [10] for a derivation of this result.
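
As a sanity check on this model, the sketch below (illustrative only; N, M, W, C_X, and ρ are arbitrary values) samples Y = WX + R as in equation (5), compares the empirical covariance of Y with W C_X W^t + C_R, and evaluates H(Y) from equation (8).

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, rho = 6, 4, 0.1                           # layer sizes and noise variance (arbitrary)

    A = rng.standard_normal((N, N))
    C_X = A @ A.T + 0.5 * np.eye(N)                 # some positive definite stimulus covariance
    W = rng.standard_normal((M, N))                 # arbitrary weight matrix
    C_R = rho * np.eye(M)

    # Sample the model Y = W X + R, equation (5).
    n_samples = 200_000
    X = rng.multivariate_normal(np.zeros(N), C_X, size=n_samples)
    R = rng.multivariate_normal(np.zeros(M), C_R, size=n_samples)
    Y = X @ W.T + R

    C_Y_theory = W @ C_X @ W.T + C_R                # covariance stated below equation (7)
    C_Y_empirical = np.cov(Y, rowvar=False)
    print("max |C_Y_emp - C_Y_theory|:", np.abs(C_Y_empirical - C_Y_theory).max())

    # Differential entropy of the Gaussian response, equation (8), in nats.
    H_Y = 0.5 * np.log((2 * np.pi * np.e) ** M * np.linalg.det(C_Y_theory))
    print("H(Y) =", H_Y, "nats")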

Note that when a stimulus value x is fixed, the covariance matrix of the conditional response vector Y|x is equal to that of the noise vector R. Therefore, the conditional entropy H(Y|X) can be expressed as

    H(Y|X) = ∫ p_X(x) H(Y|x) dx
           = ∫ p_X(x) log((2πe)^{M/2} (det C_R)^{1/2}) dx    (9)
           = log((2πe)^{M/2} (det C_R)^{1/2})
           = log((2πe)^{M/2} ρ^{M/2}),

where the last line follows because C_R is a diagonal matrix with M-fold eigenvalue ρ. By inserting equations (8) and (9) into equation (1), we obtain

    I(X;Y) = log(ρ^{-M/2} (det C_Y)^{1/2}) = (1/2) log det(ρ^{-1} C_Y).    (10)

Note that I(X;Y) becomes infinite as ρ goes to 0. This effect is a result of our use of continuous probability distributions, which can have infinite entropy. Maximizing I(X;Y) at a given noise level ρ is equivalent to maximizing the determinant of the covariance matrix of the response distribution, which is equal to the product of its eigenvalues. The elements of C_Y = (c^Y_{ij})_{ij} are given by

    c^Y_{ij} = Σ_{l=1}^N Σ_{m=1}^N w_{il} c^X_{lm} w_{jm} + δ_{ij} ρ.    (11)

Without bounding the elements of the weight matrix W, I(X;Y) could become arbitrarily large, and the effect of noise could be made arbitrarily small. We can avoid this by normalizing the connections to each response unit using the constraints

    Σ_{j=1}^N w_{ij}^2 = 1   for all i = 1, ..., M.    (12)

In the next section, we shall see how we can get rid of this normalization condition by modifying our noise model.
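
As a numerical aside (all concrete values below are illustrative assumptions), the following sketch evaluates equation (10) for a random positive definite C_X and a weight matrix whose rows are normalized as in equation (12), and then shows that simply scaling the weights up drives I(X;Y) without bound, which is exactly why such a constraint is needed.

    import numpy as np

    rng = np.random.default_rng(1)
    N = M = 8
    rho = 0.05                                      # noise variance (arbitrary)

    A = rng.standard_normal((N, N))
    C_X = A @ A.T + np.eye(N)                       # positive definite stimulus covariance

    def mutual_information(W):
        # Equation (10): I(X;Y) = (1/2) log det(rho^-1 C_Y) with C_Y = W C_X W^t + rho I.
        C_Y = W @ C_X @ W.T + rho * np.eye(M)
        return 0.5 * np.linalg.slogdet(C_Y / rho)[1]

    W = rng.standard_normal((M, N))
    W_unit = W / np.linalg.norm(W, axis=1, keepdims=True)   # constraint (12): unit-norm rows

    print("I with normalized rows:", mutual_information(W_unit))
    for scale in (1.0, 10.0, 100.0):                # scaling W inflates I(X;Y) without bound
        print("scale", scale, "->", mutual_information(scale * W_unit))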

To get more analytical results, we need to introduce a few more assumptions. We assume that stimulus and response layer have the same size: N = M (this assumption can be generalized to N = kM for k ∈ ℕ). Let the units of both layers be uniformly arranged along two rings (the analysis can be extended to higher dimensions, leading to n-tori, and to an infinite set of units). From this arrangement, we obtain a natural displacement function. Let d_{ij} denote the displacement from unit i to unit j on the ring. In our case, we simply define d_{ij} = [j - i] with [k] = k mod N. The displacement can be computed between units on the same layer, as well as between units on different layers, by projecting one ring onto the other. Now we assume that the covariance c^X_{ij} between two elements i and j of a random vector from the stimulus distribution X is a function of the displacement d_{ij} from i to j: c^X_{ij} = c_X(d_{ij}). Note that due to the ring structure of the units, c_X is a periodic function. Because C_X is a symmetric matrix, c_X must satisfy c_X(d_{ij}) = c_X(d_{ji}). C_X is a cyclic matrix; therefore, its eigenvalues λ_1, ..., λ_N are given as the components of the discrete Fourier transform of the sequence c_X(0), ..., c_X(N-1):

    λ_k = F_k(c_X) = Σ_{j=0}^{N-1} c_X(j) ω_k^j,   k = 1, ..., N,    (13)

with ω_k = e^{2πik/N}. Further, we assume that the strength w_{ij} of the connection from unit j in the stimulus layer to unit i in the response layer is also a periodic function of the displacement d_{ij} from i to j: w_{ij} = w(d_{ij}). This means that every unit in the response layer develops the same receptive field structure. Under this condition, C_Y is a cyclic matrix as well. Its eigenvalues μ_1, ..., μ_N are the components of the discrete Fourier transform of the sequence c_Y(0), ..., c_Y(N-1):

    μ_k = F_k(c_Y) = Σ_{j=0}^{N-1} c_Y(j) ω_k^j,   k = 1, ..., N.    (14)

With equation (11) and the definitions of the functions c_X and w, we get

    μ_k = F_k( Σ_{m=0}^{N-1} w(m) Σ_{l=0}^{N-1} w(l) c_X(j + m - l) ) + ρ,    (15)

where F_k acts on the expression as a function of j. Let (f * g)(k) denote the convolution of the functions f and g at point k. Then, we can express the last equation as

    μ_k = F_k( Σ_{m=0}^{N-1} w(m) (w * c_X)(j + m) ) + ρ,    (16)

and, with w_r(k) = w(-k), as

    μ_k = F_k(w_r * w * c_X) + ρ = F_k(w_r) F_k(w) F_k(c_X) + ρ.    (17)

Note that F_k(w_r) is the complex conjugate of F_k(w). With z_k = |F_k(w)|^2 and equation (13), we have

    μ_k = λ_k z_k + ρ.    (18)

By inserting the eigenvalues of C_Y into equation (10), we obtain

    I(X;Y) = (1/2) Σ_{k=1}^N log(1 + λ_k z_k / ρ).    (19)

To maximize I(X;Y), we use the method of Lagrange multipliers. The constraints in equation (12) can be equivalently expressed, via Parseval's theorem, as Σ_{k=1}^N z_k = N. We define the Lagrange function L as

    L(z_1, ..., z_N, α) = I(X;Y) + α (N - Σ_{k=1}^N z_k),    (20)

where α is the Lagrange multiplier. The partial derivatives of L with respect to the z_k are

    ∂L/∂z_k = λ_k / (2(ρ + λ_k z_k)) - α,   k = 1, ..., N.    (21)
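
Before solving for the optimum, the circulant structure can be checked numerically. The sketch below builds C_X and W from N-periodic profiles c_X and w on a ring (the specific profiles and ρ are illustrative assumptions), verifies equation (18) by comparing λ_k z_k + ρ with the directly computed eigenvalues of C_Y, and evaluates I(X;Y) both via equation (10) and via equation (19).

    import numpy as np

    N = 8
    rho = 0.1                                       # noise variance (arbitrary)

    # Symmetric, N-periodic covariance and weight profiles on the ring (illustrative values).
    c_X = np.array([1.0, 0.6, 0.2, 0.05, 0.0, 0.05, 0.2, 0.6])
    w = np.array([0.8, 0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.4])
    w = w / np.linalg.norm(w)                       # enforce constraint (12): sum_j w(j)^2 = 1

    def circulant(first_row):
        # C[i, j] = first_row[(j - i) mod N], i.e. entries depend only on the displacement d_ij.
        idx = (np.arange(N)[None, :] - np.arange(N)[:, None]) % N
        return first_row[idx]

    C_X, W = circulant(c_X), circulant(w)
    C_Y = W @ C_X @ W.T + rho * np.eye(N)

    lam = np.fft.fft(c_X).real                      # eigenvalues lambda_k, equation (13)
    z = np.abs(np.fft.fft(w)) ** 2                  # z_k = |F_k(w)|^2
    mu_pred = lam * z + rho                         # equation (18)

    mu_direct = np.linalg.eigvalsh(C_Y)
    print("eigenvalue check:", np.allclose(np.sort(mu_pred), np.sort(mu_direct)))

    I_eq10 = 0.5 * np.linalg.slogdet(C_Y / rho)[1]          # equation (10)
    I_eq19 = 0.5 * np.sum(np.log(1 + lam * z / rho))        # equation (19)
    print("I via (10):", I_eq10, "  I via (19):", I_eq19)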

Solving ∂L/∂z_k = 0 gives the stationary points

    z_k^0 = 1/(2α) - ρ/λ_k,   k = 1, ..., N.    (22)

The second partial derivatives of I(X;Y) with respect to z_k and z_l are

    ∂^2 I / (∂z_k ∂z_l) = -λ_k^2 / (2(λ_k z_k + ρ)^2)   if k = l,   and 0 otherwise,    (23)

and the quadratic form Q_{H_I}(x) = x^t H_I(z_1, ..., z_N) x of the Hessian matrix H_I(z_1, ..., z_N) = (h_{kl})_{kl} of I(X;Y) is

    Q_{H_I}(x) = Σ_{k=1}^N Σ_{l=1}^N h_{kl} x_k x_l = -Σ_{k=1}^N (λ_k^2 / (2(λ_k z_k + ρ)^2)) x_k^2.    (24)

Because C_X and C_Y are positive definite, Q_{H_I}(x) < 0 for all x ≠ 0, and H_I(z_1, ..., z_N) is negative definite at all points. Thus, I(X;Y) is a strictly concave function, and its maximum with respect to z_1, ..., z_N under the weight constraints is uniquely determined by equation (22). Note that z must be non-negative at all points. Because I(X;Y) is strictly concave and ∂I(X;Y)/∂z_k = λ_k / (2(ρ + λ_k z_k)) > 0 for all k, the values ẑ_1, ..., ẑ_N that maximize I(X;Y) are given by

    ẑ_k = max{0, 1/(2α) - ρ/λ_k},   k = 1, ..., N,    (25)

where α is chosen so that Σ_{k=1}^N ẑ_k = N. What we actually want, however, is to determine the weights ŵ(0), ..., ŵ(N-1) that maximize I(X;Y). From the definition of z, we can see that I(X;Y) is invariant under weight transformations that do not change the absolute values of their Fourier transform. For example, I(X;Y) is unaffected by rotations of the weights and by a change of their sign. Consequently, further constraints on the weights are necessary to uniquely determine their optimal values.
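
Equation (25) is a water-filling rule: Fourier modes with large λ_k receive signal power, while modes that are too weak relative to the noise are switched off. The sketch below (the values of λ_k and ρ are illustrative assumptions) chooses the water level 1/(2α) by bisection so that Σ_k ẑ_k = N and then evaluates equation (19) at the optimum.

    import numpy as np

    # Water-filling sketch for equation (25): choose alpha so that sum_k z_hat_k = N.
    # lam plays the role of the stimulus eigenvalues lambda_k; values are illustrative.
    lam = np.array([4.0, 2.5, 1.2, 0.6, 0.3, 0.1, 0.05, 0.02])
    N = len(lam)
    rho = 0.2

    def total_z(level):                 # level = 1 / (2 alpha), the "water level"
        return np.sum(np.maximum(0.0, level - rho / lam))

    lo, hi = 0.0, rho / lam.min() + N   # bracket: total_z(lo) = 0, total_z(hi) >= N
    for _ in range(100):                # bisection on the water level
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total_z(mid) < N else (lo, mid)

    level = 0.5 * (lo + hi)
    z_hat = np.maximum(0.0, level - rho / lam)
    I_opt = 0.5 * np.sum(np.log(1 + lam * z_hat / rho))    # equation (19) at the optimum
    print("z_hat:", np.round(z_hat, 3), " sum:", z_hat.sum())
    print("optimal I(X;Y):", I_opt, "nats")

With the values above, the weakest modes end up with ẑ_k = 0, i.e., the optimal code simply ignores stimulus directions whose variance is small compared to the noise.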

2.2 A model with independent noise on each input

The model developed in the previous section has two problems. The first is the rather arbitrary normalization constraint on the weights. The second is the under-determinedness of the weights that maximize the mutual information I(X;Y). In this section, we shall see that we can eliminate both problems by modifying our noise model. So far, we have assumed that the response units are disturbed by independent noise of constant variance ρ. Now, we assume that the signal of each input unit is disturbed by independent noise and that the noise variance can have a different value for each connection. Thus, we have

    Y_i = Σ_{j=1}^N w_{ij} (X_j + R_{ij}),   i = 1, ..., M,    (26)

where R_{ij} is a random variable distributed according to a normal distribution with mean zero and variance ρ_{ij}. Then, Y still has a multivariate normal distribution with mean zero and covariance matrix C_Y = W C_X W^t + C_R. The elements of C_R = (c^R_{ij})_{ij} are given by

    c^R_{ij} = δ_{ij} Σ_{k=1}^N w_{ik}^2 ρ_{ik}.    (27)

As in our first model, C_Y is a cyclic matrix whose eigenvalues μ_1, ..., μ_N are the components of the discrete Fourier transform of the sequence c_Y(0), ..., c_Y(N-1).

2.2.1 All connections have the same noise level

With ρ_{ij} = ρ for all i, j, we have

    μ_k = λ_k z_k + ρ Σ_{j=0}^{N-1} w(j)^2,    (28)

and

    I(X;Y) = (1/2) Σ_{k=1}^N log( 1 + λ_k z_k / (ρ Σ_{j=0}^{N-1} w(j)^2) ).    (29)

The sum over the squares of the weights acts as a regularization term, which penalizes large absolute weight values. In this model, the weight constraints from equation (12) are no longer necessary.

2.2.2 Noise level increases with connection length

To eliminate the under-determinedness of the optimal weights, we apply a biologically relevant principle: shorter connections should be favored over longer connections. We can achieve this by increasing the noise level with the connection length. Let g be an N-periodic function that is monotonically increasing from 0 to N/2 and monotonically decreasing from N/2 to N; in other words, g(d_{ij}) increases with the absolute distance between units i and j. Then, with ρ_{ij} = ρ_0 g(d_{ij}), we have

    μ_k = λ_k z_k + ρ_0 Σ_{j=0}^{N-1} g(j) w(j)^2,    (30)

and

    I(X;Y) = (1/2) Σ_{k=1}^N log( 1 + λ_k z_k / (ρ_0 Σ_{j=0}^{N-1} g(j) w(j)^2) ).    (31)

2.3 The effect of noise on the redundancy in the optimal responses

Let us study the network from a slightly different point of view. We consider only two response units and analyze how the noise level affects their optimal response. We use our first noise model from section 2.1. For M = 2, we get

    I(X;Y) = (1/2) log det C_Y - log ρ    (32)

with det C_Y = c^Y_{11} c^Y_{22} - (c^Y_{12})^2. Let v_i = c^Y_{ii} - ρ, i = 1, 2, be the variance of Y_i when no noise is present, and let r_{12} = c^Y_{12} / (v_1 v_2)^{1/2} be the correlation coefficient of Y_1 and Y_2 when no noise is present. Then we can express the determinant of C_Y as

    det C_Y = ρ^2 + ρ(v_1 + v_2) + v_1 v_2 (1 - r_{12}^2).    (33)
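
A small numerical check of this two-unit case (v_1, v_2, r_{12}, and ρ below are arbitrary illustrative values): the sketch compares equation (33) against the directly computed determinant and evaluates I(X;Y) from equation (32).

    import numpy as np

    # Check equations (32) and (33) for two response units; values are illustrative.
    rho = 0.3
    v1, v2, r12 = 2.0, 1.5, 0.6

    c11, c22 = v1 + rho, v2 + rho                   # noisy response variances
    c12 = r12 * np.sqrt(v1 * v2)                    # covariance of Y1 and Y2 (noise-free part)
    C_Y = np.array([[c11, c12],
                    [c12, c22]])

    det_direct = np.linalg.det(C_Y)
    det_eq33 = rho**2 + rho * (v1 + v2) + v1 * v2 * (1 - r12**2)
    print("determinant check:", np.isclose(det_direct, det_eq33))

    I = 0.5 * np.log(det_direct) - np.log(rho)      # equation (32)
    print("I(X;Y) =", I, "nats")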

References

[1] Thomas M. Cover and Joy Thomas. Elements of Information Theory. Wiley, 1991.

[2] Donald O. Hebb. The Organization of Behavior. Wiley, 1949.

[3] David H. Hubel and Torsten N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195(1):215–243, 1968.

[4] Ralph Linsker. From basic network principles to neural architecture: Emergence of orientation columns. Proceedings of the National Academy of Sciences USA, 83:8779–8783, 1986.

[5] Ralph Linsker. From basic network principles to neural architecture: Emergence of orientation-selective cells. Proceedings of the National Academy of Sciences USA, 83:8390–8394, 1986.

[6] Ralph Linsker. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proceedings of the National Academy of Sciences USA, 83:7508–7512, 1986.

[7] Ralph Linsker. Self-organization in a perceptual network. IEEE Computer, 21(3):105–117, 1988.

[8] Ralph Linsker. An application of the principle of maximum information preservation to linear systems. Advances in Neural Information Processing Systems, 1:186–194, 1989.

[9] Ralph Linsker. Deriving receptive fields using an optimal encoding criterion. Advances in Neural Information Processing Systems, 5:953–960, 1992.

[10] Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 1948.