Principles of Neural Information Theory

Principles of Neural Information Theory
Computational Neuroscience and the Efficient Coding Hypothesis
James V Stone

Title: Principles of Neural Information Theory
Author: James V Stone
© 2017 Sebtel Press. All rights reserved. No part of this book may be reproduced or transmitted in any form without written permission from the author. The author asserts his moral right to be identified as the author of this work in accordance with the Copyright, Designs and Patents Act.
First Edition. Typeset in LaTeX2e. First printing. File: main NeuralInfoTheory v34.tex. ISBN
Cover based on photograph of a Purkinje cell from mouse cerebellum injected with Lucifer Yellow. Courtesy of the National Center for Microscopy and Imaging Research, University of California.

For Teleri

Contents

1. All That We See: Introduction; Efficient Coding; In Search of General Principles; Information Theory; Neurons, Signals and Noise; An Overview of Chapters.

2. Information Theory: Introduction; Finding a Route, Bit by Bit; Information and Entropy; Entropy and Uncertainty; Maximum Entropy Distributions; Channel Capacity; Mutual Information; The Gaussian Channel; Bandwidth, Capacity and Power; Summary.

3. Measuring Neural Information: Introduction; Spikes; Why Spikes?; Neural Information; Gaussian Firing Rates; Information About What?; Does Timing Precision Matter?; Rate Codes and Timing Codes; Summary.

4. Pricing Neural Information: Introduction; Paying With Spikes; Paying With Hardware; Paying With Energy; Diminishing Returns in Flies; Paying With Reluctance; Optimal Axon Diameter; Optimal Distribution of Axon Diameters; Axon Diameter and Spike Speed; Optimal Mean Firing Rate; The Marginal Value Theorem; Optimal Distribution of Firing Rates; Optimal Synaptic Conductance; Summary.

5. Encoding Colour: Introduction; The Eye; How Aftereffects Occur; The Problem With Colour; A Neural Encoding Strategy; Encoding Colour; Why Aftereffects Occur; Measuring Mutual Information; Maximising Mutual Information; Principal Component Analysis; PCA and Mutual Information; Evidence for Colour Encoding; Summary.

6. Encoding Time: Introduction; Linear Models; Neurons and Wine Glasses; The LNP Model; Estimating LNP Parameters; The Predictive Coding Model; Estimating Predictive Coding Parameters; Predictive Coding and Information; Evidence for Predictive Coding; Summary.

7. Encoding Space: Introduction; Spatial Frequency; Do Ganglion Cells Decorrelate Images?; Are Receptive Field Structures Optimal?; Predictive Coding of Images; Evidence For Predictive Coding; Are Receptive Field Sizes Optimal?; Summary.

8. Encoding Visual Contrast: Introduction; Not Wasting Entropy; Anatomy of a Fly's Eye; Maximum Entropy Encoding; Evidence for Maximum Entropy Encoding; Summary.

9. Reading the Neural Code: Introduction; Bayes Rule; Bayesian Decoding; Encoding vs Decoding; Nonlinear Encoding and Linear Decoding; Linear Decodability; How Bayes Adds Bits; Bayesian Brains; Summary.

10. Not Even How, But Why: Introduction; The Efficient Coding Principle; Counter Arguments; Crossing the Neural Rubicon.

Further Reading
Appendices
A. Glossary
B. Mathematical Symbols
C. Correlation and Independence
D. A Vector Matrix Tutorial
E. Key Equations
References
Index

Preface

Some scientists consider the brain to be a collection of heuristics or hacks, which have accumulated over evolutionary history. Others think that the brain relies on a small number of fundamental principles, which can be used to explain the diverse systems within the brain. This book provides a rigorous account of how Shannon's mathematical theory of information can be used to test one such principle, which (for historical reasons) is called the efficient coding hypothesis.

From Words to Mathematics. The methods used to explore the efficient coding hypothesis lie in the realms of mathematical modelling. Mathematical models demand a precision unattainable with purely verbal accounts of brain function. With this precision comes an equally precise quantitative predictive power. In contrast, the predictions of purely verbal models can be vague, and this vagueness also makes them virtually indestructible, because predictive failures can often be explained away. No such luxury exists for mathematical models. In this respect, mathematical models are easy to test, and if they are weak models then they are easy to disprove. So, in the Darwinian world of mathematical modelling, survivors tend to be few, but those few tend to be supremely fit. This is not to suggest that purely verbal models are always inferior. Such models are a necessary first step in understanding. But continually refining a verbal model into ever more rarefied forms is not scientific progress. Eventually, a purely verbal model should evolve to the point where it can be reformulated as a mathematical model, with predictions that can be tested against measurable physical quantities. Happily, most branches of neuroscience reached this state of scientific maturity some time ago.

Signposts. The writing style adopted here in Principles of Neural Information Theory describes the raw science of neural information theory, unfettered by the conventions of standard textbooks. Accordingly, key concepts are introduced informally, before being

described mathematically; and each equation is accompanied by explanatory text. However, such an informal style can easily be misinterpreted as poor scholarship, because informal writing is often sloppy writing. But the way we write is usually only loosely related to the way we speak when giving a lecture, for example. A good lecture includes many asides and hints about what is, and is not, important. In contrast, scientific writing is usually formal, bereft of signposts about where the main theoretical structures are to be found, and how to avoid the many pitfalls which can mislead the unwary. So, unlike most textbooks, and like the best lectures, this book is intended to be both informal and rigorous, with prominent signposts as to where the main insights are to be found, and many warnings about where they are not.

Originality. Evidence is presented in the form of scientific papers and books, which are cited using a conventional format (e.g. Smith and Jones (1959)). When the sheer number of cited sources is large, numbered superscripts are used in order to prevent the text from becoming unreadable. Occasionally, facts are presented without evidence, either because they are reasonably self-evident, or because they can be found in standard texts. Consequently, the reader may wonder if the ideas being presented are derived from other scientists, or if they are unique to this book. In such cases, be reassured that none of the ideas in this book belong to the author. Indeed, like most books, this book represents a synthesis of ideas from many sources, but the general approach is inspired principally by these texts: Vision (1982) 76 by Marr, Spikes (1997) 95 by Rieke, Warland, de Ruyter van Steveninck and Bialek, Biophysics (2012) 17 by Bialek, and Principles of Neural Design (2015) 110 by Sterling and Laughlin.

What Is Not Included. A short book like this cannot cover all aspects of a subject in detail, and choosing what to leave out is as important as choosing what to include. In order to compensate for this necessity, pointers to material not included, or not covered in detail, can be found in the References and (annotated) Further Reading section.

Who Should Read This Book? The material presented in this book is intended to be accessible to readers with a basic scientific education at undergraduate level. In essence, this book is intended for those who wish to understand how the fundamental ingredients of inert matter, energy and information have been forged by evolution to produce a particularly efficient computational machine, the brain. Understanding how the elements of this triumvirate interact demands knowledge from a wide variety of academic disciplines, but principally from biology and mathematics. Accordingly, reading this book will require some patience from mathematicians to navigate the conceptual shift involved in re-interpreting the mathematics of signal processing in terms of neural computation, and it will require sustained effort from biologists to understand the mathematical material. Even though some of the mathematical material is fairly sophisticated, care has been taken to ensure it is explained from a reasonably low level. Additionally, key concepts are introduced using diagrams wherever possible, so that readers can gain insight into the basic ideas being presented.

PowerPoint Slides of Figures. Most of the figures used in this book can be downloaded from

Corrections. Please email corrections to j.v.stone@sheffield.ac.uk. A list of corrections can be found at

Acknowledgments. Thanks to all those involved in developing the freely available LaTeX2e software, which was used to typeset this book. Shashank Vatedka deserves a special mention for checking the mathematics in a final draft of this book. (NOT DONE YET) Thanks to Caroline Orr for meticulous copy-editing and proofreading. (NOT DONE YET). Thanks to Kevin Gurney, John de Pledge, Royston Sellman, and Steve Snow for interesting discussions on the role of information in biology. For reading draft versions of this book, I am very grateful to David Attwell, Ilona Box, Karl Friston, Raymond Lister, Stuart Wilson and...

Jim Stone, Sheffield, England, 2017.

To understand life, one has to understand not just the flow of energy, but also the flow of information. W Bialek, 2012.

Chapter 1
All That We See

When we see, we are not interpreting the pattern of light intensity that falls on our retina; we are interpreting the pattern of spikes that the million cells of our optic nerve send to the brain. Rieke, Warland, De Ruyter van Steveninck, and Bialek

1.1. Introduction

All that we see begins with an image formed on the eye's retina (Figure 1.1). Initially, this image is recorded by 126 million photoreceptors within the retina. The outputs of these photoreceptors are then repackaged or encoded, via a series of intermediate connections, into a sequence of digital pulses or spikes that travel through the one million neurons of the optic nerve which connect the eye to the brain. The fact that we see so well suggests that the retina must be extraordinarily accurate when it encodes the image into spikes, and the brain must be equally accurate when it decodes those spikes into all that we see (Figure 1.2). But the eye and brain are not only good at translating the world into spikes, and spikes into perception, they are also efficient at transmitting information from the eye to the brain whilst expending as little energy as possible. Precisely how efficient is the subject of this book.

1.2. Efficient Coding

Neurons communicate information, and that is pretty much all that they do. But neurons are energetically expensive to make, maintain, and run 70. Half of a child's total energy budget, and a fifth of an adult's budget, is required just to keep the brain ticking over 106 (Figure 1.3). For children and adults, half the total energy budget of the brain is used for neuronal information processing, and the rest is used for basic maintenance 9. The high cost of using neurons probably accounts for the fact that only 2-4% of them are active at any one time 71.

Given that neurons and spikes are so expensive, we should be unsurprised to find that when the visual data from the eye is encoded as a series of spikes, each neuron and each spike conveys as much information as possible. These considerations have given rise to the efficient coding hypothesis 8;14;17;37;53;95;122;123, an idea developed over many years by Horace Barlow (1959) 13. The efficient coding hypothesis is conventionally interpreted to mean that neurons encode sensory data to transmit as much information as possible. Even though it is not usually made explicit, if data are encoded efficiently as described above then this often implies that the amount of energy paid for information is as small as possible. In order to avoid confusion, we adopt a specific interpretation of the efficient coding hypothesis here: neurons transmit as much information as possible per Joule of energy expended, defined here as energy efficiency 11;52;62;64;82;87;88;110;121.

Figure 1.1. Cross section of eye (labels: lens, retina, optic nerve).

There are a number of different methods which collectively fall under the umbrella term efficient coding. However, to a first approximation, the results of applying these various methods tend to be quite similar 95, even though the methods themselves appear quite different. These methods include sparse coding 46, principal component analysis, independent component analysis 16;111, information maximisation (infomax) 73, predictive coding 92;109 and redundancy reduction 6. We will encounter most of these broadly similar methods throughout this book, but we place special emphasis on predictive coding because it is based on a single principle, and it has a wide range of applicability.

1.3. In Search of General Principles

The test of a theory is not just whether or not it accounts for a body of data, but also how complex the theory is in relation to the complexity of the data being explained. Clearly, if a theory is, in some sense, more convoluted than the phenomenon it explains then it is not much of a theory. As an extreme example, if each of the 86 billion neurons in the brain required its own unique theory then the resultant collective theory of brain function would be almost as complex as the brain itself. This is why we favour theories that explain a vast range of phenomena with the minimum of words or equations.

Figure 1.2. Encoding and decoding. A rapidly changing luminance (bold curve in b) is encoded as a neuronal spike train (a), which is decoded to yield an estimate of the luminance (thin curve in b).

A prime example of such a parsimonious theory is Newton's theory of gravitation, which explains (amongst other things) how a ball falls to Earth, how atmospheric pressure varies with height above the Earth, and how the Earth orbits the Sun. In essence, we favour theories which rely on a general principle to explain a diverse range of physical phenomena.

If we want to understand how the brain works then we need more than a theory which is expressed in mere words. For example, if the theory of gravitation were stated only in words then we could say that each planet has an approximately circular orbit, but we would have to use many words to prove precisely why each orbit must be elliptical, and to state exactly how elliptical each orbit is. In contrast, a few equations express these facts exactly, and without ambiguity. Thus, whereas words are required to provide theoretical context, mathematics imposes a degree of precision which is extremely difficult, if not impossible, to achieve with words alone. To quote one of the first great scientists,

The universe is written in this grand book, which stands continually open to our gaze, but it cannot be understood unless one first learns to comprehend the language in which it is written. It is written in the language of mathematics, without which it is humanly impossible to understand a single word of it. Galileo Galilei

In the spirit of Galileo's recommendation, a rigorous theory of information processing in the brain should begin with a quantitative definition of information (see Chapter 2).

Figure 1.3. a) The human brain weighs 1300 g, contains about 10^10 neurons, and consumes 12 Watts of power. The outer surface seen here is the neocortex. b) Each neuron (plus its support structures) therefore accounts for an average of 1200 pJ/s (1 pJ/s = 10^-12 J/s).
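As a quick check of the arithmetic implied by the caption (not a calculation given explicitly in the text): 12 W shared between about 10^10 neurons gives 12 / 10^10 = 1.2 × 10^-9 J/s, which is roughly 1200 pJ/s per neuron.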

1.4. Information Theory

Here in the 21st century, where we rely on computers for almost every aspect of our daily lives, it seems obvious that information is important. However, it would have been impossible for us to know just how important it is before Claude Shannon almost single-handedly created information theory in the 1940s. Since that time, it has become increasingly apparent that information, and the energy cost of each bit of information, imposes fundamental, unbreachable limits on the form and function of biological mechanisms. In this book, we concentrate on one particular function: information processing in the brain.

Claude Shannon's theory, first published in 1948, heralded a transformation in our understanding of information. Before 1948, information had been regarded as a kind of poorly defined miasmic fluid. But afterwards, it became apparent that information is a well-defined and, above all, measurable quantity. Shannon's theory of information provides a mathematical definition of information, and describes precisely how much information can be communicated between different elements of a system. This may not sound like much, but Shannon's theory underpins our understanding of how signals and noise are related, and why there are definite limits to the rate at which information can be processed within any system, whether man-made or biological.

1.5. Neurons, Signals and Noise

When a question is typed into a computer search engine, the results provide useful information, but this is buried in a sea of mostly useless data. In this internet age, it is easy for us to appreciate the difference between information and mere data, and we have learned to treat the information as useful signal and the rest as useless noise. This experience is now so commonplace that phrases like signal to noise ratio are becoming part of everyday language. Even though most people are unaware of the precise meaning of this phrase, they know intuitively that data comprise a combination of signal and noise.

This type of scenario is ubiquitous in the natural world, where the ability of eyes and ears to extract useful signals from noisy sensory data, and to encode those signals efficiently, is the key to survival 113. Indeed, the efficient coding hypothesis suggests that the evolution of sense organs, and of the brains that process data from those organs, is primarily driven by the need to minimise the energy expended for each bit of information acquired from the environment. And, because information theory tells us how to measure information precisely, it provides an objective benchmark against which the performance of neurons can be compared.

The maximum rate at which information can be transmitted through a neuron can be increased in a number of different ways. However, whichever way we (or evolution) choose to do this, doubling the information rate costs more than a doubling in neuronal hardware, and more than twice the amount of power (energy per second) 110. This is a universal phenomenon, which implies a diminishing information return on every additional micrometre of neuron diameter, and on every additional Joule of energy invested in transmitting spikes along a neuron.

Information theory does not place any conditions on what type of mechanism implements information processing; in other words, on exactly how it is to be achieved. However, unless there are unlimited amounts of power available, relatively little information will reach the brain without some form of encoding. In other words, information theory does not specify how any biological function, such as vision, is implemented, but it does set fundamental limits on what is achievable by any implementation, biological or otherwise. Once we acknowledge that these limits are unbreachable, and that they effectively extort such a high price, we should be unsurprised that Nature has managed to evolve brains which are exquisitely sensitive to the many trade-offs between time, neuronal hardware, energy and information. As we shall see, whenever such a trade-off is encountered, the brain seems to maximise the amount of information gained for each Joule of energy expended.
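The disproportionate cost of extra information can be illustrated with Shannon's formula for the capacity of a Gaussian channel, C = (1/2) log2(1 + S/N) bits (see Appendix E, equation E.24). The short Python sketch below is my own illustration rather than anything from the book: it treats "power" simply as the signal variance S, fixes the noise variance N at 1, and asks how much S must grow in order to double the capacity.

import numpy as np

def capacity(S, N=1.0):
    # Capacity of a Gaussian channel in bits per sample: C = 0.5 * log2(1 + S/N)
    return 0.5 * np.log2(1.0 + S / N)

S1 = 10.0                     # initial signal variance (noise variance N = 1)
C1 = capacity(S1)             # about 1.73 bits per sample
# Solve 0.5 * log2(1 + S2) = 2 * C1 for the signal variance S2 that doubles capacity
S2 = 2.0 ** (4.0 * C1) - 1.0  # here S2 = 120, i.e. 12 times the original signal power
print(C1, capacity(S2), S2 / S1)

Doubling the capacity requires squaring (1 + S/N), so the required signal power grows far faster than the information gained, which is the diminishing return described above.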

The distinction between a function and the mechanism which implements that function is a cornerstone of David Marr's (1982) 76 computational theory of vision. Marr's approach stresses the interplay of theory and practice, and is summarised in a single quotation: Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: It just cannot be done. Even though Marr did not address the role of information theory directly, his analytic approach has served as a source of inspiration, not only for the approach adopted in this book, but also for much of the progress made within computational neuroscience.

1.6. An Overview of Chapters

This section contains technical terms which are explained fully in the relevant chapters, and in the Glossary. To fully appreciate the importance of information theory for neural computation, some familiarity with the basic elements of information theory is required; these elements are presented in Chapter 2 (which can be skipped on a first reading of the book).

In Chapter 3, we consider how to apply information theory to the problem of measuring the amount of information in the output of a spiking neuron, and how much of this information (i.e. mutual information) is related to the neuron's input. This leads to an analysis of the nature of the neural code; specifically, whether it is a rate code or a spike timing code.

In Chapter 4, we discover that one of the consequences of information theory (specifically, Shannon's noisy coding theorem) is that the cost of information rises inexorably and disproportionately with information rate. This steep rise explains why parameters like axon diameter, the distribution of axon diameters, mean firing rate, and synaptic conductance appear to be tuned to minimise the cost of information.

In Chapter 5, we consider how the correlations between the outputs of photoreceptors sensitive to similar colours threaten to reduce information rates, and how this can be ameliorated by synaptic pre-processing in the retina.

This pre-processing explains not only how, but also why, there is a red-green aftereffect, but no red-blue aftereffect. A more formal account involves using principal component analysis to show which particular synaptic connection values maximise neuronal information throughput.

In Chapter 6, the lessons learned so far are applied to the problem of encoding time-varying, correlated visual inputs. We explore how a standard (LNP) neuron model can be used for efficient coding of the temporal structure of retinal images, and how a model based on predictive coding yields similar results to this standard model. We also show how predictive coding is a viable candidate for implementing efficient coding.

In Chapter 7, we consider how information theory predicts the receptive field structures of retinal ganglion cells across a range of luminance conditions. Evidence is presented that these receptive field structures are also obtained using predictive coding.

Once colour, spatial and temporal structure has been encoded by a neuron, the resultant signals must pass through the neuron's non-linear input/output (transfer) function. Accordingly, Chapter 8 is based on a classic paper by Simon Laughlin (1981) 68, which predicts the precise form that this transfer function should adopt in order to maximise information throughput; a form which matches the transfer function found in visual neurons.

Having considered how to encode sensory inputs, the problem of how to decode neuronal outputs is addressed in Chapter 9, where the importance of prior knowledge or experience is explored in the context of Bayes theorem. We also consider how often a neuron should produce a spike so that each spike conveys as much information as possible, and we find that the answer coincides with a vital property of efficient communication (i.e. linear decodability).

A defining property of the computational approach adopted in this text is that, within each chapter, we explore particular neuronal mechanisms, how they work, and (most importantly) why they work in the way they do. As a result, each chapter documents evidence that the design of biological mechanisms is determined largely by the need to process information efficiently.

Chapter 2
Information Theory

A basic idea in information theory is that information can be treated very much like a physical quantity, such as mass or energy. C Shannon

2.1. Introduction

Just as a bird cannot fly without obeying the laws of physics, so a brain cannot function without obeying the laws of information. And, just as the shape of a bird's wing is ultimately determined by the laws of physics, so the structure of a neuron is ultimately determined by the laws of information. In order to understand how these laws are related to neural computation, it is necessary to have some familiarity with Shannon's theory of information.

Every image formed on the retina, and every sound that reaches the ear, is sensory data, which contains information about some aspect of the world. For an owl, the sound of a mouse rustling a leaf may indicate a meal is below; for the mouse, a flickering shadow overhead may indicate it is about to become a meal. Precisely how much information is gained by a receiver from sensory data depends on three things. First, and self-evidently, it depends on the amount of information in the data. Second, it depends on the relative amounts of relevant information or signal, and irrelevant information or noise, in the data. Third, it depends on the ability of the receiver to separate the signal from the noise.
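To make the idea of a measurable quantity of information concrete, the definitions used throughout the book (see the Glossary: the information conveyed by an outcome with probability p is log2(1/p) bits, and entropy is the average information of a distribution) can be turned into a few lines of Python. This sketch is my own illustration, not code from the book.

import numpy as np

def surprisal(p):
    # Shannon information of an outcome with probability p, in bits: log2(1/p)
    return np.log2(1.0 / p)

def entropy(probs):
    # Average information (entropy) of a discrete distribution, in bits
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(probs * np.log2(1.0 / probs)))

print(surprisal(0.5))        # 1 bit: a fair coin flip
print(surprisal(1.0 / 8.0))  # 3 bits: one of 8 equally probable alternatives
print(entropy([0.5, 0.5]))   # 1 bit per flip for a fair coin
print(entropy([0.9, 0.1]))   # about 0.47 bits per flip for a biased coin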

Further Reading

Bialek (2012) 17. Biophysics. This recent text is a comprehensive and rigorous account of the physics which underpins biological processes, including computational neuroscience and morphogenesis. Bialek adopts the information-theoretic approach so successfully applied in his previous book (Spikes, see below). The writing style is fairly informal, but this book requires a high level of mathematical competence to be fully appreciated. However, it is so full of valuable insights that it will reward careful reading by less numerate readers. Highly recommended.

Dayan P and Abbott LF (2001) 31. Theoretical Neuroscience. This classic text is a comprehensive and rigorous account of computational neuroscience, which demands a high level of mathematical competence.

Doya K, Ishii S, Pouget A and Rao RPN (2007) 38. Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press. This book of edited chapters contains some superb introductions to Bayes rule in the context of vision and brain function, as well as more advanced material.

Gerstner W and Kistler WM (2002) 49. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Excellent text on models of the detailed dynamics of neurons.

Land MF and Nilsson DE (2002). Animal Eyes. New York: Oxford University Press. A landmark book by two of the most authoritative exponents of the intricacies of eye design.

Marr D (1982, reprinted 2010). Vision: A computational investigation into the human representation and processing of visual information. This classic book inspired a generation of vision scientists to explore the computational approach, succinctly summarised in this famous quotation: Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: It just cannot be done. Marr's book is included here because his computational framework underpins much of the work reported in this text.

Reza FM (1961) 93. An Introduction to Information Theory. Comprehensive and mathematically rigorous, with a reasonably geometric approach.

Rieke F, Warland D, van Steveninck RR and Bialek W (1997) 95. Spikes: Exploring the Neural Code. First modern text to formulate questions about the functions of single neurons in terms of information theory. Superbly written in a tutorial style, well argued, and fearless in its pursuit of solid answers to hard questions. Research papers by these authors are highly recommended for their clarity.

Eliasmith C (2004) 42. Neural Engineering. Technical account which stresses the importance of linearisation using opponent processing, and nonlinear encoding in the presence of linear decoding.

Sterling P and Laughlin S (2015) 110. Principles of Neural Design. Comprehensive and detailed account of how information theory constrains the design of neural systems. Sterling and Laughlin interpret physiological findings from a wide range of organisms in terms of a single unifying framework: Shannon's mathematical theory of information. A remarkable book, highly recommended.

Zhaoping L (2014) 123. Understanding Vision. Contemporary account of vision based mainly on the efficient coding hypothesis. Even though this book is technically demanding, the introductory chapters give a good overview of the approach.

Tutorial Material

Byrne JH, Neuroscience Online, McGovern Medical School, University of Texas.
Laughlin SB (2006). The Hungry Eye: Energy, Information and Retinal Function.
Lay DC (1997). Linear Algebra and its Applications.
Pierce JR (1980). An Introduction to Information Theory: Symbols, Signals and Noise.
Riley KF, Hobson MP and Bence SJ (2006). Mathematical Methods for Physics and Engineering.
Scholarpedia. This online encyclopedia includes excellent tutorials on computational neuroscience.
Smith S (2013). Digital Signal Processing: A Practical Guide for Engineers and Scientists. Freely available at
Stone JV (2012). Vision and Brain: How We See The World.
Stone JV (2013). Bayes Rule: A Tutorial Introduction.
Stone JV (2015). Information Theory: A Tutorial Introduction.

Appendix A
Glossary

ATP Adenosine triphosphate: molecule responsible for energy transfer.
autocorrelation function Given a sequence x = (x_1, ..., x_n), the autocorrelation function specifies how quickly the correlation between nearby values x_i and x_{i+d} diminishes with distance d.
average Given a variable x, the average, mean or expected value of a sample of n values of x is x̄ = (1/n) Σ_{j=1}^n x_j.
Bayes rule Given an observed value y_1 of the variable y, Bayes rule states that the posterior probability that the variable x has the value x_1 is p(x_1|y_1) = p(y_1|x_1) p(x_1) / p(y_1).
binary digit A binary digit can be either a 0 or a 1.
bit Information required to choose between two equally probable alternatives.
capacity The maximum rate at which a communication channel can communicate information from its input to its output.
central limit theorem As the number of samples from almost any distribution increases, the distribution of sample means becomes increasingly Gaussian.
coding capacity The maximum amount of information that could be conveyed by a neuron with a given firing rate.
conditional probability The probability that the value of one random variable x has the value x_1 given that the value of another random variable y has the value y_1, written as p(x_1|y_1).
conditional entropy Given two random variables x and y, the average uncertainty regarding the value of x when the value of y is known, H(x|y) = E[log(1/p(x|y))] bits.
convolution A filter can be used to change (blur or sharpen) a signal by convolving the signal with the filter, as defined in Section 6.4.
correlation See Appendix C.

cross-correlation function Given two sequences x = (x_1, ..., x_n) and y = (y_1, ..., y_n), the cross-correlation function specifies how quickly the correlation between x_i and y_{i+d} diminishes with d.
cumulative distribution function The cumulative area under the probability density function (pdf) of a variable.
decoding Translating neural outputs (e.g. firing rates) to inputs (e.g. stimulus values). Modelled using a decoding kernel v.
decoding function Maps neuronal outputs y to inputs x_MAP, where x_MAP is the most probable value of x given y.
decorrelated If two Gaussian variables are decorrelated (or, equivalently, uncorrelated) then each variable provides no information about the other variable. See Appendix C.
dynamic linearity If a neuron's response to a sinusoidal input is a sinusoidal output then it has dynamic linearity.
efficient coding hypothesis States that sensory data is encoded so as to maximise the amount of information transmitted. In this book, the efficient coding hypothesis is interpreted to mean that sensory data is encoded so as to maximise energy efficiency.
encoding Translating neural inputs (e.g. stimulus values) to outputs (e.g. firing rates). Modelled using an encoding kernel w.
encoding function Maps neuronal inputs x to outputs y_MAP, where y_MAP is the most probable value of y given x. Equal to the transfer function if the distribution of noise in y is symmetric.
energy efficiency The amount of information (i.e. coding capacity) or mutual information obtained per Joule of energy, ε.
entropy The entropy of a variable y is a measure of its variability. If y adopts m possible values with probabilities p(y_1), ..., p(y_m) then its entropy is H(y) = Σ_{i=1}^m p(y_i) log_2(1/p(y_i)) bits.
equivariant Having equal variance.
expected value See average.
filter See Wiener kernel.
Gaussian variable A variable with a Gaussian distribution of values.
iid If a variable x has values which are sampled independently from the same probability distribution then x is independent and identically distributed (iid).
independence If two variables x and y are independent then the value x provides no information about the value y, and vice versa.
information The information conveyed by a variable x which adopts the value x = x_1 with probability p(x_1) is log_2(1/p(x_1)) bits.
joint probability The probability that two or more variables simultaneously adopt specified values.

Joule One Joule is the energy required to raise 100 g by 1 metre.
kernel See filter.
linear See static linearity and dynamic linearity.
linear filter See kernel.
linear decodability If a continuous signal s, which has been encoded as a spike train y, can be reconstructed using a linear filter then it is linearly decodable.
static linearity If a system has a static linearity then the response to an input value x_t is proportional to x_t.
logarithm If x is expressed as a logarithm y with base a then y = log_a x is the power to which we have to raise a in order to obtain x.
mean See average.
micron One micron equals 10^-6 m, or one micrometre (1 µm).
mitochondrion Cell organelle responsible for generating ATP.
monotonic If y = f(x) is a monotonic function of x then a change in x always induces an increase or it always induces a decrease in y.
mutual information The reduction in uncertainty I(y, x) regarding the value of one variable x induced by knowing the value of another variable y. Mutual information is symmetric, so I(y, x) = I(x, y).
noise The random jitter that is part of a measured quantity.
natural statistics The statistical distribution of values of a physical parameter (e.g. contrast) observed in the natural world.
neocortex The most recent, in evolutionary terms, part of the mammalian brain; it is a 2-3 mm thick layer of cells comprising six layers of neurons.
orthogonal Perpendicular.
phase The phase of a signal can be considered as its left-right position. A sine and a cosine wave have a phase difference of 90 degrees.
power The rate at which energy is expended (Joules/s).
power spectrum A histogram of frequency versus the power at each frequency for a given signal is the power spectrum of that signal.
probability distribution Given a variable x which can adopt the values {x_1, ..., x_n}, the probability distribution is p(x) = {p(x_1), ..., p(x_n)}, where p(x_i) is the probability that x = x_i.
probability function (pf) A function p(x) of a discrete random variable x defines the probability of each value of x. The probability that x = x_1 is p(x = x_1) or, more succinctly, p(x_1).
pyramidal neuron Regarded as the computational engine within various brain regions, especially the neocortex. Each pyramidal neuron receives thousands of synaptic inputs, and has an axon which is about 4 cm in length.

random variable (RV) The concept of a random variable x can be understood from a simple example, like the throw of a die. Each physical outcome is a number x_i, which is the value of the random variable, so that x = x_i. The probability of each value is defined by a probability distribution p(x) = {p(x_1), p(x_2), ...}.
redundancy Given an ordered set of values of a variable (e.g. in an image or sound), if a value can be obtained from a knowledge of other values then it is redundant.
signal In this book, the word signal usually refers to a noiseless variable, whereas the word variable is used to refer to noisy and noiseless variables.
signal to noise ratio Given a variable y = x + η which is a mixture of a signal x with variance S and noise η with variance N, the signal to noise ratio of y is SNR = S/N (see the numerical sketch at the end of this glossary).
standard deviation The square root of the variance of a variable.
theorem A mathematical statement which has been proven to be true.
timing precision If firing rate is measured using a timing interval of Δt = 0.001 s then the timing precision is ι = 1/Δt = 1,000 s^-1.
transfer function The sigmoidal function E[y] = g(x) which maps neuronal inputs x to mean output values. Also known as a static nonlinearity in signal processing. See encoding function.
tuning function Defines how firing rate changes as a function of a physical parameter, such as luminance, contrast, or colour.
uncertainty In this text, uncertainty refers to the surprisal (i.e. log(1/p(y))) of a variable y.
uncorrelated See decorrelated.
variable A variable is like a container, usually for a number. Continuous variables (e.g. temperature) can adopt any value, whereas discrete variables (e.g. a die) adopt certain values only.
variance The variance of x is var(x) = E[(x − x̄)²], and is a measure of how spread out the values of a variable are.
vector A vector is an ordered list of m numbers x = (x_1, ..., x_m).
white By analogy with white light, a signal which contains an equal proportion of all frequencies is white, so it has a flat power spectrum. A Gaussian signal with uncorrelated values is white.
Wiener kernel Spatial or temporal filter, which has dynamic linearity in this text.
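As a numerical illustration of the signal to noise ratio, variance and noise entries above (a sketch of my own, not an example from the book), the following Python fragment builds a noisy variable y = x + η from a Gaussian signal x and Gaussian noise η, then estimates the SNR from samples.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
S, N = 4.0, 1.0                        # true signal and noise variances
x = rng.normal(0.0, np.sqrt(S), n)     # signal
eta = rng.normal(0.0, np.sqrt(N), n)   # noise
y = x + eta                            # noisy variable y = x + eta

snr_estimate = x.var() / eta.var()     # close to S/N = 4
print(snr_estimate, y.var())           # y.var() is close to S + N = 5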

Appendix B
Mathematical Symbols

∗ convolution operator. See pages 106 and 130.
Δ (upper case letter delta) represents a small increment.
Δt small timing interval used to measure firing rate.
∇ (nabla, also called del) represents a vector valued gradient.
ˆ (hat) used to indicate an estimated value. For example, v̂_x is an estimate of the variance v_x.
|x| indicates the absolute value of x (e.g. if x = −3 then |x| = 3).
≤ if x ≤ y then x is less than or equal to y.
≥ if x ≥ y then x is greater than or equal to y.
≈ means approximately equal to.
∼ if a random variable x has a distribution p(x) then this is written as x ∼ p(x).
∝ indicates proportional to.
E[·] the mean or expectation of a variable.
Σ (capital sigma) represents summation. For example, if we represent the n = 3 numbers 2, 5 and 7 as x_1 = 2, x_2 = 5, x_3 = 7 then their sum x_sum is

x_sum = Σ_{i=1}^n x_i = x_1 + x_2 + x_3 = 14.

The variable i is counted up from 1 to n, and, for each i, the term x_i adopts a new value and is added to a running total.

Π (capital pi) represents multiplication. For example, if we use the values defined above then the product of these n = 3 integers is

x_prod = Π_{i=1}^n x_i = x_1 x_2 x_3 = 2 × 5 × 7 = 70.   (B.1)

The variable i is counted up from 1 to n, and, for each i, the term x_i adopts a new value, and is multiplied by a running total.
ϵ (epsilon) coding efficiency, which is the mutual information between neuronal input and output expressed as a proportion of the coding capacity of the neuron's output.
ε (Latin letter epsilon) energy efficiency, which is the number of bits of information (coding capacity) or mutual information obtained per Joule of energy.
ι (iota) denotes timing precision, ι = 1/Δt s^-1.
λ (lambda) mean and variance of a Poisson distribution.
λ_s (lambda) space constant used in exponential decay of membrane voltage, the distance over which voltage decays by 67%.
η (eta) ganglion cell noise.
ξ (ksi) photoreceptor noise.
µ (mu) the mean of a variable.
ρ(x, y) (rho) the correlation between x and y.
σ (sigma) the standard deviation of a distribution.
corr(x, y) the estimated correlation between x and y.
cov(x, y) the estimated covariance between x and y.
C channel capacity, the maximum information that can be transmitted through a channel, usually measured in bits per second (bits/s).
C m × m covariance matrix of the output values of m cones.
C_t m × m temporal covariance matrix of the output values of m cones.
C_s m × m spatial covariance matrix of the output values of m cones.

C_η m × m covariance matrix of noise values.
e constant, equal to 2.718...
E the mean, average, or expected value of a variable x, written as E[x].
g encoding function, which transforms a signal s = (s_1, ..., s_k) into channel inputs x = (x_1, ..., x_n), so x = g(s).
H(x) entropy of x, which is the average Shannon information of the probability distribution p(x) of the random variable x.
H(x|y) conditional entropy of the conditional probability distribution p(x|y). This is the average uncertainty in the value of x after the value of y is observed.
H(y|x) conditional entropy of the conditional probability distribution p(y|x). This is the average uncertainty in the value of y after the value of x is observed.
H(x, y) entropy of the joint probability distribution p(x, y).
I(x, y) mutual information between x and y, the average number of bits provided by each value of y about the value of x, and vice versa.
ln x natural logarithm (log to the base e) of x.
log x logarithm of x. Logarithms use base 2 in this text, and the base is indicated with a subscript if the base is unclear (e.g. log_2 x).
m number of different possible messages or input values.
M number of bins in a histogram.
N noise variance in Shannon's fundamental equation for the capacity of a Gaussian channel, C = (1/2) log(1 + P/N).
n the number of observations in a data set (e.g. coin flip outcomes).
p(x) the probability distribution of the random variable x.
p(x_i) the probability (density) that the random variable x = x_i.
p(x, y) joint probability distribution of the random variables x and y.
p(x_i|y_i) the conditional probability that x = x_i given that y = y_i.
R the entropy of a spike train, such that R_max ≥ R ≥ I(x, y) ≥ R_info.

R_info estimate of the mutual information I(x, y) between neuronal input and output.
R_max neural coding capacity, the maximum rate at which neuronal information can be communicated.
s stimulus values, s = (s_1, ..., s_n).
Δt small interval of time; usually, Δt = 0.001 s for neurons.
v_x if x has mean µ then the variance of x is v_x = σ_x² = E[(µ − x)²].
w vector of m weights comprising a linear encoding kernel; w is usually an estimate of the optimal kernel w.
v vector of m weights comprising a linear decoding kernel; v is usually an estimate of the optimal kernel v.
x cone output values, x = (x_1, ..., x_n).
x vector variable; x_t is the value of the vector variable x at time t.
X vector variable x expressed as an m × n data matrix.
y neuron output values.
y model neuron output values.
y vector variable; y_t is the value of the vector variable y at time t.
z ganglion cell input values, z = (z_1, ..., z_n).
z vector variable; z_t is the value of the vector variable z at time t.

Appendix C
Correlation and Independence

Correlation and Covariance. The similarity between two variables x and y is measured in terms of a quantity called the correlation. If x has a value x_t at time t and y has a value y_t at the same time t then the correlation between x and y is defined as

ρ(x, y) = c_xy E[(x_t − x̄)(y_t − ȳ)],   (C.1)

where x̄ is the average value of x, ȳ is the average value of y, and c_xy is a constant which ensures that the correlation has a value between -1 and +1. Given a sample of n pairs of values, the correlation is estimated as

corr(x, y) = (c_xy / n) Σ_{t=1}^n (x_t − x̄)(y_t − ȳ).   (C.2)

We are not usually concerned with the value of the constant c_xy, but for completeness, it is defined as c_xy = 1/√(var(x) var(y)), where var(x) and var(y) are the variances of x and y, respectively. Variance is a measure of the spread in the values of a variable. For example, the variance in x is estimated as

var(x) = (1/n) Σ_{t=1}^n (x_t − x̄)².   (C.3)

In fact, it is conventional to use the un-normalised version of correlation, called the covariance, which is estimated as

cov(x, y) = (1/n) Σ_{t=1}^n (x_t − x̄)(y_t − ȳ).   (C.4)

Because covariance is proportional to correlation, these terms are often used interchangeably.

It will greatly simplify notation if we assume all variables have a mean of zero. For example, this allows us to express the covariance more succinctly as

cov(x, y) = (1/n) Σ_{t=1}^n x_t y_t   (C.5)
          = E[x_t y_t],   (C.6)

and the correlation as

corr(x, y) = c_xy E[x_t y_t].   (C.7)

Decorrelation and Independence. If two variables are independent then the value of one variable provides no information about the corresponding value of the other variable. More precisely, if two variables x and y are independent then their joint distribution p(x, y) is given by the product of the distributions p(x) and p(y),

p(x, y) = p(x) p(y).   (C.8)

Given this definition, we can also consider decorrelation as a special case of independence. For example, if two zero-mean variables x and y are independent then, for any non-zero values of α and β, the correlation between x_t^α and y_t^β is zero,

corr(x^α, y^β) = c_xy E[x_t^α y_t^β] = 0.   (C.9)

Figure C.1. (a) Joint probability density function p(x, y) for correlated Gaussian variables x and y. The probability density p(x, y) is indicated by the density of points on the ground plane at (x, y). The marginal distributions p(y) and p(x) are on the side axes. (b) Joint probability density function p(x, y) for independent Gaussian variables, for which p(x, y) = p(x)p(y).
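The estimators in equations C.2-C.4, and the link between independence and zero correlation, are easy to check numerically. The Python sketch below is my own illustration (the variables and coefficients are made up for the example): it applies the (1/n)-normalised formulas to an independent pair and to a correlated pair of zero-mean Gaussian variables.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def corr_and_cov(x, y):
    # Covariance (C.4) and correlation (C.2, with c_xy = 1/sqrt(var(x) var(y)))
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    corr = cov / np.sqrt(x.var() * y.var())
    return corr, cov

x = rng.normal(size=n)
y_indep = rng.normal(size=n)                  # independent of x
y_corr = 0.8 * x + 0.6 * rng.normal(size=n)   # correlated with x

print(corr_and_cov(x, y_indep))   # correlation and covariance both close to 0
print(corr_and_cov(x, y_corr))    # correlation close to 0.8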

Appendix D
A Vector Matrix Tutorial

The single key fact about vectors and matrices is that each vector represents a point located in space, and a matrix moves that point to a different location. Everything else is just details.

Vectors. A number, such as 1.234, is known as a scalar, and a vector is an ordered list of scalars. Here is an example of a vector with two components a and b: w = (a, b). Note that vectors are written in bold type. The vector w can be represented as a single point in a graph, where the location of this point is by convention a distance of a from the origin along the horizontal axis and a distance of b from the origin along the vertical axis.

Adding Vectors. The vector sum of two vectors is the addition of their corresponding elements. Consider the addition of two pairs of scalars (x_1, x_2) and (a, b):

((a + x_1), (b + x_2)).   (D.1)

Clearly, (x_1, x_2) and (a, b) can be written as vectors:

z = ((a + x_1), (b + x_2))   (D.2)
  = (x_1, x_2) + (a, b) = x + w.   (D.3)

Thus the sum of two vectors is another vector, which is known as the resultant of those two vectors.

Subtracting Vectors. Subtracting vectors is similarly implemented by the subtraction of corresponding elements, so that

z = x − w   (D.4)
  = ((x_1 − a), (x_2 − b)).   (D.5)

Multiplying Vectors. Consider the sum given by the multiplication of two pairs of scalars (x_1, x_2) and (a, b):

y = a x_1 + b x_2.   (D.6)

Clearly, (x_1, x_2) and (a, b) can be written as vectors

y = (x_1, x_2).(a, b) = x.w,   (D.7)

where equation (D.7) is to be interpreted as equation (D.6). This multiplication of corresponding vector elements is known as the inner, scalar or dot product, and is often denoted with a dot, as here.

Vector Length. First, as each vector represents a point in space it must have a distance from the origin, and this distance is known as the vector's length or modulus, denoted as |x| for a vector x. For a vector x = (x_1, x_2) with two components this distance is given by the length of the hypotenuse of a triangle with sides x_1 and x_2, so that

|x| = √(x_1² + x_2²).   (D.8)

Angle between Vectors. The angle θ between two vectors x and w is

cos θ = x.w / (|x| |w|).   (D.9)

Crucially, if θ = 90 degrees then the inner product is zero, because cos 90 = 0, irrespective of the lengths of the vectors. Vectors at 90 degrees to each other are known as orthogonal vectors.

Row and Column Vectors. Vectors come in two basic flavours, row vectors and column vectors. A simple notational device to transform a row vector (x_1, x_2) into a column vector (or vice versa) is the transpose operator, T:

(x_1, x_2)^T = [x_1; x_2],   (D.10)

where the semicolon indicates that x_1 and x_2 are stacked as a column. The reason for having row and column vectors is because it is often necessary to combine several vectors into a single matrix which is then used to multiply a single vector x, defined here as

x = (x_1, x_2)^T.   (D.11)

In such cases it is necessary to keep track of which vectors are row vectors and which are column vectors.

If we redefine w as a column vector, w = (a, b)^T, then the inner product w.x can be written as

y = w^T x   (D.12)
  = (a, b) [x_1; x_2]   (D.13)
  = x_1 w_1 + x_2 w_2.   (D.14)

Here, each element of the row vector w^T is multiplied by the corresponding element of the column x, and the results are summed. Writing the inner product in this way allows us to specify many pairs of such products as a vector-matrix product. If x is a vector variable such that x_1 and x_2 have been measured n times (e.g. at n consecutive time steps) then y is a variable with n values

(y_1, y_2, ..., y_n) = (a, b) [x_11, x_12, ..., x_1n; x_21, x_22, ..., x_2n].   (D.15)

Here, each (single element) column y_1t is given by the inner product of the corresponding column in x with the row vector w. This can now be rewritten succinctly as y = w^T x.

Vector Matrix Multiplication. If we reset the number of times x has been measured to N = 1 for now then we can consider the simple case of how two scalar values y_1 and y_2 are given by the inner products

y_1 = w_1^T x   (D.16)
y_2 = w_2^T x,   (D.17)

where w_1 = (a, b)^T and w_2 = (c, d)^T. If we consider the pair of values y_1 and y_2 as a vector y = (y_1, y_2)^T then we can rewrite equations (D.16) and (D.17) as

(y_1, y_2)^T = (w_1^T x, w_2^T x)^T.   (D.18)

If we combine the column vectors w_1 and w_2 then we can define a matrix W

W = (w_1, w_2)^T   (D.19)
  = [a, b; c, d].   (D.20)

We can now rewrite equation (D.18) as

(y_1, y_2)^T = [a, b; c, d] (x_1, x_2)^T.   (D.21)

This can be written more succinctly as y = W x. This defines the standard syntax for vector-matrix multiplication. Note that the column vector (x_1, x_2)^T is multiplied by the first row in W to obtain y_1 and is multiplied by the second row in W to obtain y_2. Just as the vector x represents a point on a plane, so the point y represents a (usually different) point on the plane. Thus the matrix W implements a linear geometric transformation of points from x to y.

If n > 1 then the tth column (y_1t, y_2t)^T in y is obtained as the product of the tth column (x_1t, x_2t)^T in x with the row vectors in W

[y_11, y_12, ..., y_1n; y_21, y_22, ..., y_2n] = [a, b; c, d] [x_11, x_12, ..., x_1n; x_21, x_22, ..., x_2n]
  = (w_1, w_2)^T (x_1, x_2)^T   (D.22)
  = W x.   (D.23)

Each (single element) column in y_1 is a scalar value which is obtained by taking the inner product of the corresponding column in x with the first row vector w_1^T in W. Similarly, each column in y_2 is obtained by taking the inner product of the corresponding column in x with the second row vector w_2^T in W.

Transpose of Vector-Matrix Product. It is useful to note that if y = W x then the transpose y^T of this vector-matrix product is

y^T = (W x)^T = x^T W^T,   (D.24)

where the transpose of a matrix is defined by

W^T = [a, b; c, d]^T = [a, c; b, d].   (D.25)

Matrix Inverse. By analogy with scalar algebra, if y = W x then x = W^(-1) y, where W^(-1) is the inverse of W. If the columns of a matrix are orthogonal then W^(-1) = W^T.
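The operations described in this appendix map directly onto NumPy, which may help readers who prefer to experiment numerically. The sketch below is my own illustration (the numerical values are arbitrary); it reproduces the inner product, vector length, vector-matrix product and matrix inverse used above.

import numpy as np

w = np.array([1.0, 2.0])              # w = (a, b) with a = 1, b = 2
x = np.array([3.0, 4.0])              # x = (x1, x2)

inner = w @ x                         # inner product w.x = a*x1 + b*x2 = 11
length = np.linalg.norm(x)            # |x| = sqrt(x1^2 + x2^2) = 5

W = np.array([[1.0, 2.0],             # W = [a, b; c, d]; rows are w1^T and w2^T
              [3.0, 5.0]])
y = W @ x                             # y = Wx, as in equation D.21
x_back = np.linalg.inv(W) @ y         # x = W^(-1) y recovers the original point

print(inner, length)                  # 11.0 5.0
print(y, x_back)                      # [11. 29.] [3. 4.]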

Appendix E
Key Equations

Logarithms use base 2 unless stated otherwise.

Entropy

H(y) = Σ_{i=1}^m p(y_i) log(1/p(y_i)) bits   (E.1)

H(y) = ∫_y p(y) log(1/p(y)) dy bits   (E.2)

Joint entropy

H(x, y) = Σ_{i=1}^m Σ_{j=1}^m p(x_i, y_j) log(1/p(x_i, y_j)) bits   (E.3)

H(x, y) = ∫_y ∫_x p(x, y) log(1/p(x, y)) dx dy bits   (E.4)

H(x, y) = I(x, y) + H(x|y) + H(y|x) bits   (E.5)

Conditional Entropy

H(x|y) = Σ_{i=1}^m Σ_{j=1}^m p(x_i, y_j) log(1/p(x_i|y_j)) bits   (E.6)

H(y|x) = Σ_{i=1}^m Σ_{j=1}^m p(x_i, y_j) log(1/p(y_j|x_i)) bits   (E.7)

H(x|y) = ∫_y ∫_x p(x, y) log(1/p(x|y)) dx dy bits   (E.8)

H(y|x) = ∫_y ∫_x p(x, y) log(1/p(y|x)) dx dy bits   (E.9)

H(x|y) = H(x, y) − H(y) bits   (E.10)

H(y|x) = H(x, y) − H(x) bits   (E.11)

from which we obtain the chain rule for entropy

H(x, y) = H(x) + H(y|x) bits   (E.12)
        = H(y) + H(x|y) bits   (E.13)

Marginalisation

p(x_i) = Σ_{j=1}^m p(x_i, y_j),   p(y_j) = Σ_{i=1}^m p(x_i, y_j)   (E.14)

p(x) = ∫_y p(x, y) dy,   p(y) = ∫_x p(x, y) dx   (E.15)

Mutual Information

I(x, y) = Σ_{i=1}^m Σ_{j=1}^m p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ] bits   (E.16)

I(x, y) = ∫_y ∫_x p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy bits   (E.17)

I(x, y) = H(x) + H(y) − H(x, y)   (E.18)
        = H(x) − H(x|y)   (E.19)
        = H(y) − H(y|x)   (E.20)
        = H(x, y) − [H(x|y) + H(y|x)] bits   (E.21)

If y = x + η, with x and y (possibly non-white) Gaussian variables with bandwidth W, where the Fourier components of the signal x_f and noise η_f have variance S_f and N_f (respectively), then

I(x, y) = ∫_{f=0}^W (1/2) log(1 + S_f/N_f) df bits,   (E.22)

where S_f/N_f is the signal to noise ratio at frequency f.

Channel Capacity

C = max_{p(x)} I(x, y) bits.   (E.23)

If the channel input x has variance S, the noise η has variance N, and both x and η are white and Gaussian then

C = (1/2) log_2(1 + S/N) bits,   (E.24)

where the ratio of variances S/N is the signal to noise ratio.
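The discrete identities above are easy to verify numerically. The Python sketch below is my own illustration (the 2 × 2 joint distribution is made up for the example): it computes the entropies and mutual information of two binary variables and checks equations E.14, E.18 and E.19.

import numpy as np

# A made-up joint distribution p(x, y) over two binary variables
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

def H(p):
    # Entropy in bits of the probabilities in array p (equations E.1 and E.3)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

p_x = p_xy.sum(axis=1)    # marginal p(x), equation E.14
p_y = p_xy.sum(axis=0)    # marginal p(y), equation E.14

I_xy = H(p_x) + H(p_y) - H(p_xy)    # mutual information, equation E.18
H_x_given_y = H(p_xy) - H(p_y)      # conditional entropy, equation E.10

print(H(p_x), H(p_y), H(p_xy))      # 1.0, 1.0, about 1.72 bits
print(I_xy, H(p_x) - H_x_given_y)   # both about 0.28 bits, confirming E.19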

References

[1] EH Adelson and JR Bergen. Spatiotemporal energy models for the perception of motion. JOSA A, 2(2).
[2] ED Adrian. The impulses produced by sensory nerve endings: Part I. Journal of Physiology, 61:49-72, 1926.
[3] L Aitchison and M Lengyel. The Hamiltonian brain: Efficient probabilistic inference with excitatory-inhibitory neural circuit dynamics. PLOS Computational Biology, 12(12), 2016.
[4] RM Alexander. Optima for Animals. Princeton University Press.
[5] JJ Atick, Z Li, and AN Redlich. Understanding retinal color coding from first principles. Neural Computation, 4(4).
[6] JJ Atick and AN Redlich. Towards a theory of early visual processing. Neural Computation, 2(3), 1990.
[7] JJ Atick and AN Redlich. What does the retina know about natural scenes? Neural Computation, 4(2), 1992.
[8] F Attneave. Some informational aspects of visual perception. Psychological Review, 1954.
[9] D Attwell and SB Laughlin. An energy budget for signaling in the grey matter of the brain. J Cerebral Blood Flow and Metabolism, 21(10).
[10] R Baddeley, LF Abbott, MCA Booth, F Sengpiel, T Freeman, EA Wakeman, and ET Rolls. Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc Royal Society London, B, 264, 1997.
[11] V Balasubramanian and MJ Berry. A test of metabolically efficient coding in the retina. Network: Computation in Neural Systems, 13(4).
[12] V Balasubramanian and P Sterling. Receptive fields and functional architecture in the retina. J Physiology, 587(12).
[13] HB Barlow. Sensory mechanisms, the reduction of redundancy, and intelligence. In DV Blake and AM Uttley, editors, Proc Symp Mechanization of Thought Processes (Vol 2), 1959.
[14] HB Barlow. Possible principles underlying the transformations of sensory messages. Sensory Communication.
[15] HB Barlow, R Fitzhugh, and SW Kuffler. Change of organization in the receptive fields of the cat's retina during dark adaptation. Journal of Physiology, 137, 1957.
[16] AJ Bell and TJ Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7.
[17] W Bialek. Biophysics: Searching for Principles. Princeton University Press, 2012.

[18] W Bialek, M DeWeese, F Rieke, and D Warland. Bits and brains: Information flow in the nervous system. Physica A: Statistical Mechanics and its Applications, 200(1-4), 1993.
[19] R Bogacz. A tutorial on the free-energy framework for modelling perception and learning. J. of Mathematical Psychology.
[20] BG Borghuis, CP Ratliff, RG Smith, P Sterling, and V Balasubramanian. Design of a neuronal array. J Neuroscience, 28(12).
[21] A Borst and FE Theunissen. Information theory and neural coding. Nature Neuroscience, 2(11), 1999.
[22] DH Brainard, P Longere, PB Delahunt, WT Freeman, JM Kraft, and B Xiao. Bayesian model of human color constancy. Journal of Vision, 6(11).
[23] W Brendel, R Bourdoukan, P Vertechi, CK Machens, and S Denéve. Learning to represent signals spike by spike. ArXiv e-prints.
[24] N Brenner, W Bialek, and R de Ruyter van Steveninck. Adaptive rescaling maximizes information transmission. Neuron, 26.
[25] G Buchsbaum and Gottschalk. Trichromacy, opponent colours coding and optimum colour information transmission in the retina. Proc Roy Soc London, B, 220(1218):89-113, 1983.
[26] J Burge and WS Geisler. Optimal disparity estimation in natural stereo images. Journal of Vision, 14(2):1, 2014.
[27] J Bussgang. Cross-correlation function of amplitude-distorted Gaussian signals. Lab. Elect., Mass. Inst. Technol., Cambridge, MA, USA, Tech. Rep. 216, 1952.
[28] EL Charnov. Optimal foraging, the marginal value theorem. Theoretical Population Biology, 9(2), 1976.
[29] P Dagum and M Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60(1).
[30] NB Davies, JR Krebs, and SA West. An Introduction to Behavioural Ecology. Wiley, 2012.
[31] P Dayan and LF Abbott. Theoretical Neuroscience. MIT Press, New York, NY, USA, 2001.
[32] MH DeGroot. Probability and Statistics, 2nd edition. Addison-Wesley, UK.
[33] A Dettner, S Münzberg, and T Tchumatchenko. Temporal pairwise spike correlations fully capture single-neuron information. Nature Communications, 7, 2016.
[34] MR DeWeese. Optimization principles for the neural code. PhD thesis, Princeton University.
[35] MR DeWeese and W Bialek. Information flow in sensory neurons. Il Nuovo Cimento D, 17(7-8), 1995.
[36] E Doi, JL Gauthier, GD Field, J Shlens, A Sher, M Greschner, TA Machado, LH Jepson, K Mathieson, DE Gunning, AM Litke, L Paninski, EJ Chichilnisky, and EP Simoncelli. Efficient coding of spatial information in the primate retina. J. Neuroscience, 32(46).
[37] E Doi and MS Lewicki. A simple model of optimal population coding for sensory systems. PLoS Comp Biol, 10(8).
[38] K Doya, S Ishii, A Pouget, and R Rao. The Bayesian Brain. MIT Press, MA, 2007.

41 References [39] KR Du y and DH Hubel. Receptive field properties of neurons in the primary visual cortex under photopic and scotopic lighting conditions. Vision Research, 47: ,2007. [40] TE Duncan. Evaluation of likelihood functions. Information and Control, 13(1):62 74,1968. [41] P Dyan and LF Abbott. Theoretical Neuroscience. MIT Press, [42] CEliasmith.Neural Engineering. MIT Press, [43] AL Fairhall, GD Lewen, W Bialek, and RR de Ruyter van Steveninck. E ciency and ambiguity in an adaptive neural code. Nature, 412(6849): , [44] LA Finelli, S Haney, M Bazhenov, M Stopfer, and TJ Sejnowski. Synaptic learning rules and sparse coding in a model sensory system. PLoS Comp Biol, [45] BJ Fischer and JL Pena. Owl s behavior and neural representation predicted by Bayesian inference. Nature Neuroscience, 14: , [46] P Foldiak and M Young. Sparse coding in the primate cortex. in The Handbook of Brain Theory and Neural Networks, ed. Arbib, MA, pages , [47] P Garrigan, CP Ratli, JM Klein, P Sterling, DH Brainard, and V Balasubramanian. Design of a trichromatic cone array. PLoS Comput Biol, 6(2):e ,2010. [48] JL Gauthier, GD Field, A Sher, M Greschner, J Shlens, AM Litke, and EJ Chichilnisky. Receptive fields in primate retina are coordinated to sample visual space more uniformly. Public Library of Science (PLoS) Biology, 7(4),2009. [49] W Gerstner and WM Kistler. Spiking neuron models: Single neurons, populations, plasticity. CUP,2002. [50] AR Girshick, MS Landy, and EP Simoncelli. Cardinal rules: visual orientation perception reflects knowledge of environmental statistics. Nature neuroscience, 14(7): ,2011. [51] D Guo, S Shamai, and S Verdú. Mutual information and minimum mean-square error in Gaussian channels. IEEE Transactions on Information Theory, 51(4): ,2005. [52] JJ Harris, R Jolivet, and D Attwell. Synaptic energy use and supply. Neuron, 75(5): ,2012. [53] JJ Harris, R Jolivet, E Engl, and D Attwell. Energy-e cient information transfer by visual pathway synapses. Current Biology, 25(24): , [54] HK Hartline. Visual receptors and retinal interaction. Nobel lecture, pages , [55] BY Hayden, JM Pearson, and ML Platt. Neuronal basis of sequential foraging decisions in a patchy environment. Nature neuroscience, 14(7): , [56] A Jarvstad, SK Rushton, PA Warren, and U Hahn. Knowing when to move on cognitive and perceptual decisions in time. Psychological science, 23(6): ,2012. [57] EC Johnson, DL Jones, and R Ratnam. A minimum-error, energyconstrained neural code is an instantaneous-rate code. Journal of computational neuroscience, 40(2): ,2016. [58] E Kaplan. The M, P, and K pathways of the primate visual system. In LM Chalupa and JS Werner, editors, The Visual Neurosciences, Volume 1, pages MIT Press,

42 References [59] DC Knill. Robust cue integration: A Bayesian model and evidence from cue-conflict studies with stereoscopic and figure cues to slant. Journal of Vision, 7:1 24,2007. [60] DC Knill and R Richards. Perception as Bayesian inference. Cambridge University Press, New York, NY, USA, [61] DC Knill and JA Saunders. Do humans optimally integrate stereo and texture information for judgments of surface slant? Vision Research, 43: , [62] K Koch, J McLean, M Berry, P Sterling, V Balasubramanian, and MA Freed. E ciency of information transmission by retinal ganglion cells. Current Biology, 14(17): ,2004. [63] K Koch, J McLean, R Segev, MA Freed, MJ Berry, V Balasubramanian, and P Sterling. How much the eye tells the brain. Current biology, 16(14): , [64] L Kostal, P Lansky, and MD McDonnell. Metabolic cost of neuronal information in an empirical stimulus-response model. Biological cybernetics, 107(3): ,2013. [65] L Kostal, P Lansky, and J-P Rospars. E cient olfactory coding in the pheromone receptor neuron of a moth. PLoS Computational Biology, 4(4), [66] SW Ku er. Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology, 16:37 68, [67] MF Land and DE Nilsson. Animal eyes. OUP,2002. [68] SB Laughlin. A simple coding procedure enhances a neuron s information capacity. ZNaturforsch, 36c: , [69] SB Laughlin. Matching coding to scenes to enhance e ciency. In O.J. Braddick and A.C. Sleigh, editors, Physical and biological processing of images, pages Springer, Berlin, [70] SB Laughlin, RR de Ruyter van Steveninck, and JC Anderson. The metabolic cost of neural information. Nature Neuro, 1(1):36 41, [71] P Lennie. The cost of cortical computation. Current Biology, 13: , [72] WB Levy and RA Baxter. Energy e cient neural codes. Neural computation, 8(3): ,1996. [73] R Linsker. Self-organization in perceptual network. Computer, pages , [74] BF Logan Jr. Information in the zero crossings of bandpass signals. AT&T Technical Journal, 56:487510,1977. [75] ZF Mainen and TJ Sejnowski. Reliability of spike timing in neocortical neurons. Science, 268(5216):1503, [76] D Marr. Vision: A computational investigation into the human representation and processing of visual information. 1982,2010. [77] M Meister and MJ Berry. The neural code of the retina. Neuron, 22: , [78] A Moujahid, A D Anjou, and M Graña. Energy demands of diverse spiking cells from the neocortex, hippocampus, and thalamus. Frontiers in computational neuroscience, 8:41,2014. [79] INemenman,GDLewen,WBialek,andRRdeRuytervanSteveninck. Neural coding of natural stimuli: Information at sub-millisecond resolution. PLoS Comp Biology, 4(3), [80] S Nirenberg, SM Carcieri, AL Jacobs, and PE Latham. Retinal ganglion cells act largely as independent encoders. Nature, 208

43 References 411(6838): , June [81] JE Niven, JC Anderson, and SB Laughlin. Fly photoreceptors demonstrate energy-information trade-o s in neural coding. PLoS Biology, 5(4), [82] JE Niven and SB Laughlin. Energy limitation as a selective pressure on the evolution of sensory systems. JExpBiol, 211: , [83] H. Nyquist. Certain topics in telegraph transmission theory. Proceedings of the IEEE, 90(2): ,1928. [84] BA Olshausen. 20 years of learning about vision: Questions answered, questions unanswered, and questions not yet asked. In 20 Years of Computational Neuroscience, pages [85] S Pajevic and PJ Basser. An optimum principle predicts the distribution of axon diameters in PLoS ONE, [86] L Paninski, JW Pillow, and EP Simoncelli. Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. Neural computation, 16(12): ,2004. [87] JA Perge, K Koch, R Miller, P Sterling, and V Balasubramanian. How the optic nerve allocates space, energy capacity, and information. JNeuroscience,29(24): ,2009. [88] JA Perge, JE Niven, E Mugnaini, V Balasubramanian, and P Sterling. Why do axons di er in caliber? JNeuroscience,32(2): ,2012. [89] E Persi, D Hansel, L Nowak, P Barone, and C van Vreeswijk. Powerlaw input-output transfer functions explain the contrast-response and tuning properties of neurons in visual cortex. PLoS Comput Biol, 7(2):e , [90] MD Plumbley. E cient information transfer and anti-hebbian neural networks. Neural Networks, 6: , [91] K Rajan and W Bialek. Maximally informative stimulus energies in the analysis of neural responses to natural signals. PloS one, 8(11):e71959, [92] RPN Rao and DH Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field e ects. Nature Neuroscience, 2:79 87,1999. [93] FM Reza. Information Theory. New York, McGraw-Hill, [94] F Rieke, DA Bodnar, and W Bialek. Naturalistic stimuli increase the rate and e ciency of information transmission by primary auditory a erents. Proc Roy Soc Lond, 262(1365): , [95] F Rieke, D Warland, RR de Ruyter van Steveninck, and W Bialek. Spikes: Exploring the Neural Code. MIT Press, Cambridge, MA, [96] KF Riley, MP Hobson, and SJ Bence. Mathematical methods for physics and engineering. Cambridge University Press, [97] TW Ruderman DL, Cronin and C Chiao. Statistics of cone responses to natural images: Implications for visual coding. Journal of the Optical Society of America, 15: ,1998. [98] AB Saul and AL Humphrey. Spatial and temporal response properties of lagged and nonlagged cells in cat lateral geniculate nucleus. Journal of neurophysiology, 64(1): ,1990. [99] B Sengupta, MB Stemmler, and KJ Friston. Information and e ciency in the nervous system - a synthesis. PLoS Computuational Biology, July [100] CE Shannon. A mathematical theory of communication. Bell System Technical Journal,27: ,

44 References [101] CE Shannon and W Weaver. The Mathematical Theory of Communication. University of Illinois Press, [102] T Sharpee, NC Rust, and W Bialek. Analyzing neural responses to natural signals: maximally informative dimensions. Neural computation, 16(2): ,2004. [103] TO Sharpee. Computational identification of receptive fields. Annual review of neuroscience, 36: ,2013. [104] EC Smith and MS Lewicki. E cient auditory coding. Nature, 439(7079): , [105] SSmith.Digital signal processing: A practical guide for engineers and scientists. Newnes, [106] L Sokolo. Circulation and energy metabolism of the brain. Basic neurochemistry, 2: ,1989. [107] SG Solomon and P Lennie. The machinery of colour vision. Nature Reviews Neuroscience, 8(4): ,2007. [108] WW Sprague, EA Cooper, I Tošić, and MS Banks. Stereopsis is adaptive for the natural environment. Science Advances, 1(4), [109] MV Srinivasan, SB Laughlin, and A Dubs. Predictive coding: A fresh view of inhibition in the retina. Proc Roy Soc London, B., 216(1205): , [110] P Sterling and S Laughlin. Principles of Neural Design. MIT Press, [111] JV Stone. Independent Component Analysis: A Tutorial Introduction. MIT Press, Boston, [112] JV Stone. Footprints sticking out of the sand (Part II): Children s Bayesian priors for lighting direction and convexity. Perception, 40(2): , [113] JV Stone. Vision and Brain. MIT Press, [114] JV Stone. Bayes Rule: A Tutorial Introduction to Bayesian Analysis. Sebtel Press, She eld, England, [115] JV Stone, IS Kerrigan, and J Porrill. Where is the light? Bayesian perceptual priors for lighting direction. Proceedings Royal Society London (B), 276: ,2009. [116] SP Strong, RR de Ruyter van Steveninck, W Bialek, and R Koberle. On the application of information theory to neural spike trains. In Pac Symp Biocomput, volume621,page32,1998. [117] P Tiesinga, JM Fellous, and TJ Sejnowski. Regulation of spike timing in visual cortical circuits. Nature reviews neuroscience, 9(2):97 107, [118] JH van Hateren. A theory of maximizing sensory information. Biological Cybernetics, 68:23 29,1992. [119] BT Vincent and RJ Baddeley. Synaptic energy e ciency in retinal processing. Vision research, 43(11): , [120] MJ Wainwright. Visual adaptation as optimal information transmission. Vision research, 39(23): , [121] Z Wang, X Wei, A Stocker, and D Lee. E cient neural codes under metabolic constraints. In NIPS2016, [122] X Wei and AA Stocker. Mutual information, Fisher information, and e cient coding. Neural Computation, 28(2): ,2016. [123] LZhaoping. Understanding Vision: Theory, Models, and Data. OUP Oxford,

Index

action potential, 28, 29
adenosine triphosphate (ATP), 47, 189
aftereffect, 65
autocorrelation function, 189
average, 189
axon, 28
band-pass filter, 139
bandwidth, 25, 26
Bayes rule, 16, 165, 189
binary digits vs bits, 12
bit, 10, 12, 189
Bussgang's theorem, 102, 115
calculus of variations, 54
capacity, 17, 189
central limit theorem, 36, 63, 189
chain rule for entropy, 204
channel
    Gaussian, 23
channel capacity, 10, 17, 18, 189
characteristic equation, 89
chromatic aberration, 67
coding capacity, 32, 43, 58, 189
coding efficiency, 38
colour aftereffect, 65
communication channel, 10
compound eye, 153
computational neuroscience, 6
conditional entropy, 21, 178, 189
conditional probability, 189
conduction velocity, 28, 54
cone, 66
contrast
    definition, 155
convolution operator, 104, 129
cornea, 153
correlation, 189, 197
correlation time, 176
covariance matrix, 88, 112
Cramer-Rao bound, 99
cricket, 39
cross-correlation function, 113, 190
cumulative distribution function, 163, 190
cut-off frequency, 140
dark noise, 73
data matrix, 78
decodability
    linear, 191
decoding, 190
decoding function, 190
decorrelated, 190
dendrite, 27
determinant, 89
die, 14
difference-of-Gaussian, 131
differential entropy, 23
Dirac delta function, 105
direct method, 37
dot product, 79, 105
dynamic linearity, 95, 190
efficient coding hypothesis, i, 2, 6, 49,
efficient coding model, 97
efficient coding principle, 185
empirical Bayes, 171, 180
encoding, 10, 158, 160, 190
    neural, 103
encoding function, 190
energy efficiency, 2, 51, 56, 118, 122, 190
entropy, 13, 14, 190
entropy vs information, 15
equivariant, 85, 190
expected value, 190
exponential decay, 30
exponential distribution, 17
fan diagram, 20
filter, 190
firing rate, 29
Fisher information, 22
fly's eye, 153
Fourier analysis, 26, 126
Fourier transform, 136
Galileo, 4
ganglion cell, 67, 92, 116, 122, 129
Gaussian distribution, 17, 36
Gaussian variable, 190
glial cells, 58
identity matrix, 89
iid, 190
independence, 190
information, 12, 15, 190
information theory, 5
information vs entropy, 15
inner product, 79, 105
Jensen's inequality, 86
Joule, 43, 191
kernel, 96, 127, 191
Lagrange multipliers, 60
large monopolar cells, 48, 155
least squares estimate, 99, 108, 170
linear, 191
    decodability, 176, 191
    decoding vs decodability, 177
    filter, 38, 96, 119, 142, 191
    function, 101
    superposition, 97, 101
    system, 97
    time invariant, 97
linearity
    dynamic, 95
    static, 95
LMC, 155
LNP neuron model, 103
Logan's theorem, 141
logarithm, 11, 12, 191
low-pass filter, 127, 140
LSE and information, 115
marginal value theorem, 56, 61
Marr, David, 7
matrix, 78, 79
maximum a posteriori, 168
maximum entropy, 17
maximum entropy efficiency, 58
maximum entropy model, 151, 164
maximum likelihood, 99, 168
measurement precision, 40
Mexican hat, 130
micron, 191
midget ganglion cell, 92
mimic model, 96
mitochondria, 47
mitochondrial volume, 50
mitochondrion, 191
model
    efficient coding, 97
    linear filter, 96
    mimic, 96
    predictive coding, 97
monotonic, 191
mutual information, 21,
    Gaussian channel, 23
myelin, 47
myelin sheath, 29
natural statistics, 164, 180, 191
neocortex, 191
neural arithmetic, 74
neural encoding, 103
neural superposition, 153
neuron, 1, 27
noise, 5, 191
noisy channel coding theorem, 22, 26
Nyquist rate, 25
off-centre cell, 130
ommatidia, 153
on-centre cell, 130
opponent process, 69, 92, 131
optic nerve, 44, 46
organelle, 47
orthogonal, 191
parasol ganglion cell, 92
phase, 136, 191
photopsin, 73
Poisson distribution, 36
power, 25, 191
power density, 52
power spectrum, 137, 191
predictive coding, 2, 97, 99, 116, 121, 141, 185
prior probability, 168
probability
    distribution, 14, 191
    function, 191
    mass function, 191
pyramidal cell, 27, 59
pyramidal neuron, 191
random variable, 13, 192
receptive field, 67
    on-centre, 130
receptors, 30
redundancy, 192
retinal ganglion cell, 67
reverse correlation, 106
rhabdom, 153
rhodopsin, 66, 73
rotation matrix, 82
sample rate, 25
scalar variable, 79
scale invariance, 148
Shannon, C, 5, 9
signal to noise ratio, 25, 192
source coding theorem, 19
space constant, 31
spatial frequency, 126
spike, 1, 28, 29
spike cost, 47
spike speed, 54
spike timing precision, 40
spike train, 32
spike-triggered average, 105
standard deviation, 192
static linearity, 95, 191
surprisal, 12
synapse, 30
theorem, 10, 192
timing precision, 40, 192
transfer function, 29, 103, 104, 115, 151, 152, 192
transpose, 79, 202
tuning function, 192
uncertainty, 13, 15, 192
uncorrelated, 192
uniform distribution, 17
variable, 192
    scalar, 79
    vector, 78
vector, 78, 192, 199
    variable, 78
white, 106, 137, 192
whitening filter, 137
Wiener kernel, 96, 104, 122,

Dr James Stone is an Honorary Reader in Vision and Computational Neuroscience at the University of Sheffield, England.

Seeing: The Computational Approach to Biological Vision, second edition. John P. Frisby and James V. Stone.
