Approximations of Shannon Mutual Information for Discrete Variables with Applications to Neural Population Coding
Article

Approximations of Shannon Mutual Information for Discrete Variables with Applications to Neural Population Coding

Wentao Huang 1,2,* and Kechen Zhang 2,*

1 Key Laboratory of Cognition and Intelligence and Information Science Academy of China Electronics Technology Group Corporation, Beijing, China
2 Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
* Correspondence: wnthuang@gmail.com (W.H.); kzhang4@jhmi.edu (K.Z.)

Abstract: Although Shannon mutual information has been widely used, its effective calculation is often difficult for many practical problems, including those in neural population coding. Asymptotic formulas based on Fisher information sometimes provide accurate approximations to the mutual information, but this approach is restricted to continuous variables because the calculation of Fisher information requires derivatives with respect to the encoded variables. In this paper, we consider information-theoretic bounds and approximations of the mutual information based on Kullback–Leibler divergence and Rényi divergence. We propose several information metrics to approximate Shannon mutual information in the context of neural population coding. While our asymptotic formulas all work for discrete variables, one of them has consistent performance and high accuracy regardless of whether the encoded variables are discrete or continuous. We performed numerical simulations and confirmed that our approximation formulas were highly accurate for approximating the mutual information between the stimuli and the responses of a large neural population. These approximation formulas may potentially bring convenience to the applications of information theory to many practical and theoretical problems.

Keywords: neural population coding; mutual information; Kullback–Leibler divergence; Rényi divergence; Chernoff divergence; approximation; discrete variables
© 2019 by the authors. Distributed under a Creative Commons CC BY license.

1. Introduction

Information theory is a powerful tool widely used in many disciplines, including, for example, neuroscience, machine learning, and communication technology [1–7]. As it is often notoriously difficult to effectively calculate Shannon mutual information in many practical applications [8], various approximation methods have been proposed to estimate the mutual information, such as those based on asymptotic expansion [9–13], k-nearest neighbors [14], and minimal spanning trees [15]. Recently, Safaai et al. proposed a copula method for the estimation of mutual information, which can be nonparametric and potentially robust [16]. Another approach to estimating the mutual information is to simplify the calculations by approximations based on information-theoretic bounds, such as the Cramér–Rao lower bound [17] and the van Trees Bayesian Cramér–Rao bound [18]. In this paper, we focus on mutual information estimation based on asymptotic approximations [19–24]. For the encoding of continuous variables, asymptotic relations between mutual information and Fisher information have been presented by several researchers [19–22]. Recently, Huang and Zhang [24] proposed an improved approximation formula, which remains accurate for high-dimensional variables. A significant advantage of this approach is that asymptotic approximations are sometimes very useful in analytical studies. For instance, asymptotic approximations allow us to prove that the
optimal neural population distribution that maximizes the mutual information between stimulus and response can be solved by convex optimization [24]. Unfortunately, this approach does not generalize to discrete variables, since the calculation of Fisher information requires partial derivatives of the likelihood function with respect to the encoded variables. For the encoding of discrete variables, Kang and Sompolinsky [23] presented an asymptotic relationship between mutual information and Chernoff information for statistically independent neurons in a large population. However, Chernoff information is still hard to calculate in many practical applications. Discrete stimuli or variables occur naturally in sensory coding. While some stimuli are continuous (e.g., the direction of movement and the pitch of a tone), others are discrete (e.g., the identities of faces and the words in human speech). For definiteness, in this paper, we frame our questions in the context of neural population coding; that is, we assume that the stimuli or input variables are encoded by the pattern of responses elicited from a large population of neurons. The concrete examples used in our numerical simulations were based on a Poisson spike model, where the response of each neuron is taken as the spike count within a given time window. While this simple Poisson model allowed us to consider a large neural population, it captured only the spike rate and not any temporal structure of the spike trains [25–28]. Nonetheless, our mathematical results are quite general and should be applicable to other input–output systems under suitable conditions to be discussed later. In the following, we first derive several upper and lower bounds on Shannon mutual information using Kullback–Leibler divergence and Rényi divergence. Next, we derive several new approximation formulas for Shannon mutual information in the limit of large population size.
These formulas are more convenient to calculate than the mutual information itself in our examples. Finally, we confirm the validity of our approximation formulas using the true mutual information as evaluated by Monte Carlo simulations.

2. Theory and Methods

2.1. Notations and Definitions

Suppose the input x is a K-dimensional vector, $x = (x_1, \ldots, x_K)^T$, which could be interpreted as the parameters that specify a stimulus for a sensory system, and the output r is an N-dimensional vector, $r = (r_1, \ldots, r_N)^T$, which could be interpreted as the responses of N neurons. We assume N is large, generally $N \gg K$. We denote random variables by upper-case letters, e.g., random variables X and R, in contrast to their vector values x and r. The mutual information between X and R is defined by

$$ I = I(X;R) = \left\langle \ln \frac{p(r|x)}{p(r)} \right\rangle_{r,x}, \quad (1) $$

where $x \in \mathcal{X} \subseteq \mathbb{R}^K$, $r \in \mathcal{R} \subseteq \mathbb{R}^N$, and $\langle \cdot \rangle_{r,x}$ denotes the expectation with respect to the probability density function $p(r,x)$. Similarly, in the following, we use $\langle \cdot \rangle_{r|x}$ and $\langle \cdot \rangle_{x}$ to denote expectations with respect to $p(r|x)$ and $p(x)$, respectively. If $p(x)$ and $p(r|x)$ are twice continuously differentiable for almost every $x \in \mathcal{X}$, then for large N we can use an asymptotic formula to approximate the true value of I with high accuracy [24]:

$$ I \approx I_G = \frac{1}{2} \left\langle \ln \det \frac{G(x)}{2\pi e} \right\rangle_x + H(X), \quad (2) $$

which is sometimes reduced to

$$ I \approx I_F = \frac{1}{2} \left\langle \ln \det \frac{J(x)}{2\pi e} \right\rangle_x + H(X), \quad (3) $$
where det denotes the matrix determinant, $H(X) = -\langle \ln p(x) \rangle_x$ is the stimulus entropy, and

$$ G(x) = J(x) + P(x), \quad (4) $$

$$ P(x) = -\frac{\partial^2 \ln p(x)}{\partial x \, \partial x^T}, \quad (5) $$

$$ J(x) = \left\langle -\frac{\partial^2 \ln p(r|x)}{\partial x \, \partial x^T} \right\rangle_{r|x} = \left\langle \frac{\partial \ln p(r|x)}{\partial x} \frac{\partial \ln p(r|x)}{\partial x^T} \right\rangle_{r|x}, \quad (6) $$

where $J(x)$ is the Fisher information matrix. We denote the Kullback–Leibler divergence as

$$ D(x \| \hat{x}) = \left\langle \ln \frac{p(r|x)}{p(r|\hat{x})} \right\rangle_{r|x}, \quad (7) $$

and denote the Rényi divergence [29] as

$$ D_\beta(x \| \hat{x}) = -\frac{1}{\beta} \ln \left\langle \left( \frac{p(r|\hat{x})}{p(r|x)} \right)^{\beta} \right\rangle_{r|x}. \quad (8) $$

Here, $\beta D_\beta(x\|\hat{x})$ is equivalent to the Chernoff divergence [30]. It is well known that $D_\beta(x\|\hat{x}) \to D(x\|\hat{x})$ in the limit $\beta \to 0$. We define

$$ I_u = -\left\langle \ln \left\langle \exp\left(-D(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x, \quad (9) $$

$$ I_e = -\left\langle \ln \left\langle \exp\left(-e^{-1} D(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x, \quad (10) $$

$$ I_{\beta,\alpha} = -\left\langle \ln \left\langle \exp\left(-\beta D_\beta(x\|\hat{x}) + (1-\alpha) \ln \frac{p(x)}{p(\hat{x})}\right) \right\rangle_{\hat{x}} \right\rangle_x, \quad (11) $$

where in $I_{\beta,\alpha}$ we have $\beta \in (0,1)$ and $\alpha \in [0,1]$, and assume $p(x) > 0$ for all $x \in \mathcal{X}$. In the following, we suppose x takes discrete values $x_m$, $m \in \mathcal{M} = \{1, 2, \ldots, M\}$, and $p(x_m) > 0$ for all m. Now, the definitions in Equations (9)–(11) become

$$ I_u = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}=1}^{M} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-D(x_m \| x_{\hat{m}})\right) + H(X), \quad (12) $$

$$ I_e = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}=1}^{M} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) + H(X), \quad (13) $$

$$ I_{\beta,\alpha} = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}=1}^{M} \left( \frac{p(x_{\hat{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\hat{m}})\right) + H(X). \quad (14) $$
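To make these definitions concrete, the quantities in Equations (1), (7), (8), (12) and (13) can all be evaluated directly for a small discrete system. The following sketch is ours, not the paper's: the function names and the toy two-stimulus channel are invented for illustration, and probability distributions are represented as arrays over discrete response states.

```python
import numpy as np

def mutual_information(p_x, p_r_given_x):
    """Exact I(X;R) of Equation (1) in nats, for discrete x and r.
    p_x has shape (M,); p_r_given_x has shape (M, n_response_states)."""
    p_r = p_x @ p_r_given_x                 # marginal p(r), a Bayes mixture
    joint = p_x[:, None] * p_r_given_x      # p(x, r)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.log(p_r_given_x) - np.log(p_r)[None, :]
        return float(np.where(joint > 0, joint * log_ratio, 0.0).sum())

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) of Equation (7), in nats."""
    m = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[m] * np.log(p[m] / q[m])))

def renyi(p, q, beta):
    """D_beta(p||q) = -(1/beta) ln <(q(r)/p(r))^beta>_p, cf. Equation (8);
    it approaches kl(p, q) in the limit beta -> 0."""
    m = p > 0
    return float(-np.log(np.sum(p[m] * (q[m] / p[m]) ** beta)) / beta)

def discrete_I_u_I_e(p_x, p_r_given_x):
    """Upper bound I_u (Equation (12)) and approximation I_e (Equation (13))
    computed from the matrix of pairwise KL divergences D(x_m || x_mhat)."""
    M = len(p_x)
    D = np.array([[kl(p_r_given_x[m], p_r_given_x[mh]) for mh in range(M)]
                  for m in range(M)])
    H = -float(np.sum(p_x * np.log(p_x)))   # stimulus entropy H(X)
    w = p_x[None, :] / p_x[:, None]         # ratios p(x_mhat) / p(x_m)
    I_u = -float(np.sum(p_x * np.log((w * np.exp(-D)).sum(axis=1)))) + H
    I_e = -float(np.sum(p_x * np.log((w * np.exp(-np.exp(-1.0) * D)).sum(axis=1)))) + H
    return I_u, I_e
```

For the toy channel $p(x) = (0.5, 0.5)$, $p(r|x_1) = (0.9, 0.1)$, $p(r|x_2) = (0.2, 0.8)$, the exact value $I \approx 0.275$ nats indeed lies below $I_u \approx 0.441$, while $I_e \approx 0.204$ gives a first rough approximation; the asymptotic results in this section concern the regime of large populations, where such discrepancies vanish.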
Furthermore, we define

$$ I_d = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) \right) + H(X), \quad (15) $$

$$ I_u^d = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-D(x_m \| x_{\hat{m}})\right) \right) + H(X), \quad (16) $$

$$ I_{\beta,\alpha}^d = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^\beta} \left( \frac{p(x_{\hat{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\hat{m}})\right) \right) + H(X), \quad (17) $$

$$ I_D = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) \right) + H(X), \quad (18) $$

where

$$ \check{\Omega}_m^\beta = \{ \hat{m} : \hat{m} = \arg\min_{\check{m} \notin \hat{\Omega}_m^\beta} D_\beta(x_m \| x_{\check{m}}) \}, \quad (19) $$

$$ \check{\Omega}_m^u = \{ \hat{m} : \hat{m} = \arg\min_{\check{m} \notin \hat{\Omega}_m^u} D(x_m \| x_{\check{m}}) \}, \quad (20) $$

$$ \hat{\Omega}_m^\beta = \{ \hat{m} : D_\beta(x_m \| x_{\hat{m}}) = 0 \}, \quad (21) $$

$$ \hat{\Omega}_m^u = \{ \hat{m} : D(x_m \| x_{\hat{m}}) = 0 \}, \quad (22) $$

$$ \Omega_m^\beta = \left( \check{\Omega}_m^\beta \cup \hat{\Omega}_m^\beta \right) \setminus \{m\}, \quad (23) $$

$$ \Omega_m^u = \left( \check{\Omega}_m^u \cup \hat{\Omega}_m^u \right) \setminus \{m\}. \quad (24) $$

Here, notice that, if x is uniformly distributed, then by definition I_d and I_D become identical. The elements in the set $\check{\Omega}_m^\beta$ are those that make $D_\beta(x_m \| x_{\check{m}})$ take its minimum value, excluding any element that satisfies the condition $D_\beta(x_m \| x_{\hat{m}}) = 0$. Similarly, the elements in the set $\check{\Omega}_m^u$ are those that minimize $D(x_m \| x_{\check{m}})$, excluding the ones that satisfy the condition $D(x_m \| x_{\hat{m}}) = 0$.

2.2. Theorems

In the following, we state several conclusions as theorems and prove them in the Appendix.

Theorem 1. The mutual information I is bounded as follows:

$$ I_{\beta,\alpha} \le I \le I_u. \quad (25) $$

Theorem 2. The following inequalities are satisfied:

$$ I_{\beta_1,1} \le I_e \le I_u, \quad (26) $$

where $I_{\beta_1,1}$ is a special case of $I_{\beta,\alpha}$ in Equation (11) with $\beta_1 = e^{-1}$ and $\alpha = 1$, so that

$$ I_{\beta_1,1} = -\left\langle \ln \left\langle \exp\left(-\beta_1 D_{\beta_1}(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x. \quad (27) $$

Theorem 3. If there exist $\gamma_1 > 0$ and $\gamma_2 > 0$ such that

$$ \beta D_\beta(x_m \| x_{m_1}) \ge \gamma_1 \ln N, \quad (28) $$

$$ D(x_m \| x_{m_2}) \ge \gamma_2 \ln N, \quad (29) $$
for discrete stimuli $x_m$, where $m \in \mathcal{M}$, $m_1 \notin \Omega_m^\beta$ and $m_2 \notin \Omega_m^u$, then we have the following asymptotic relationships:

$$ I_{\beta,\alpha} = I_{\beta,\alpha}^d + O(N^{-\gamma_1}) \le I \le I_u = I_u^d + O(N^{-\gamma_2}) \quad (30) $$

and

$$ I_e = I_d + O(N^{-\gamma_2/e}). \quad (31) $$

Theorem 4. Suppose $p(x)$ and $p(r|x)$ are twice continuously differentiable for $x \in \mathcal{X}$, $\|q'(x)\| < \infty$ and $\|q''(x)\| < \infty$, where $q(x) = \ln p(x)$ and $'$ and $''$ denote the partial derivatives $\partial/\partial x$ and $\partial^2/\partial x \, \partial x^T$, and $G_\gamma(x)$ is positive definite with $\|N G^{-1}(x)\| = O(1)$, where $\|\cdot\|$ denotes the matrix Frobenius norm, $\gamma = \beta(1-\beta)$ and $\beta \in (0,1)$, with

$$ G_\gamma(x) = \gamma J(x) + P(x). \quad (32) $$

If there exists an $\omega = \omega(x) > 0$ such that

$$ \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(x)} p(\hat{x}) \exp\left(-D(x\|\hat{x})\right) d\hat{x} = O(N^{-1}), \quad (33) $$

$$ \det(G_\gamma(x))^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(x)} p(\hat{x}) \exp\left(-\beta D_\beta(x\|\hat{x})\right) d\hat{x} = O(N^{-1}), \quad (34) $$

for all $x \in \mathcal{X}$ and $\varepsilon \in (0, \omega)$, where $\bar{\mathcal{X}}_\omega(x) = \mathcal{X} \setminus \mathcal{X}_\omega(x)$ is the complementary set of $\mathcal{X}_\omega(x) = \{ \hat{x} \in \mathbb{R}^K : (\hat{x}-x)^T G(x) (\hat{x}-x) < N\omega^2 \}$, then we have the following asymptotic relationships:

$$ I_{\beta,\alpha} \le I_\gamma + O(N^{-1}) \le I \le I_u = I_G + K/2 + O(N^{-1}), \quad (35) $$

$$ I_e = I_G + O(N^{-1}), \quad (36) $$

$$ I_{\beta,\alpha} = I_\gamma + O(N^{-1}), \quad (37) $$

where

$$ I_\gamma = \frac{1}{2} \int_{\mathcal{X}} p(x) \ln \det \frac{G_\gamma(x)}{2\pi} \, dx + H(X) \quad (38) $$

and $\gamma = \beta(1-\beta) = 1/4$ with $\beta = 1/2$ in Equation (37).

Remark 1. We see from Theorems 1 and 2 that the true mutual information I and the approximation I_e both lie between $I_{\beta_1,1}$ and $I_u$, which implies that their values may be close to each other. For a discrete variable x, Theorem 3 tells us that I_e and I_d are asymptotically equivalent, i.e., their difference vanishes in the limit of large N. For a continuous variable x, Theorem 4 tells us that I_e and I_G are asymptotically equivalent in the limit of large N, which means that I_e and I are also asymptotically equivalent, because I_G and I are known to be asymptotically equivalent [24].

Remark 2. To see how the condition in Equation (33) could be satisfied, consider the case where $D(x\|\hat{x})$ has only one extreme point at $\hat{x} = x$ for $\hat{x} \in \mathcal{X}_\omega(x)$ and there exists an $\eta > 0$ such that $N^{-1} D(x\|\hat{x}) \ge \eta$ for $\hat{x} \in \bar{\mathcal{X}}_\omega(x)$. Now, the condition in Equation (33) is satisfied because

$$ \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(x)} p(\hat{x}) \exp\left(-D(x\|\hat{x})\right) d\hat{x} \le \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(x)} p(\hat{x}) \exp\left(-\hat{\eta}_\varepsilon N\right) d\hat{x} = O\left(N^{K/2} e^{-\hat{\eta}_\varepsilon N}\right), \quad (39) $$
where by assumption we can find an $\hat{\eta}_\varepsilon > 0$ for any given $\varepsilon \in (0, \omega)$. The condition in Equation (34) can be satisfied in a similar way.

When $\beta = 1/2$, $\beta D_\beta(x\|\hat{x})$ is the Bhattacharyya distance [31]:

$$ \beta D_\beta(x\|\hat{x}) = -\ln \int_{\mathcal{R}} \sqrt{p(r|x)\, p(r|\hat{x})} \, dr, \quad (40) $$

and we have

$$ J(x) = \left. \frac{\partial^2 D(x\|\hat{x})}{\partial \hat{x}\, \partial \hat{x}^T} \right|_{\hat{x}=x} = 4 \left. \frac{\partial^2 \beta D_\beta(x\|\hat{x})}{\partial \hat{x}\, \partial \hat{x}^T} \right|_{\hat{x}=x} = 4 \left. \frac{\partial^2 H_l^2(x\|\hat{x})}{\partial \hat{x}\, \partial \hat{x}^T} \right|_{\hat{x}=x}, \quad (41) $$

where $H_l(x\|\hat{x})$ is the Hellinger distance [32] between $p(r|x)$ and $p(r|\hat{x})$:

$$ H_l^2(x\|\hat{x}) = \frac{1}{2} \int_{\mathcal{R}} \left( \sqrt{p(r|x)} - \sqrt{p(r|\hat{x})} \right)^2 dr. \quad (42) $$

By Jensen's inequality, for $\beta \in (0,1)$ we get

$$ D_\beta(x\|\hat{x}) \le D(x\|\hat{x}). \quad (43) $$

Denoting the Chernoff information [8] as

$$ C(x\|\hat{x}) = \max_{\beta \in (0,1)} \beta D_\beta(x\|\hat{x}) = \beta_m D_{\beta_m}(x\|\hat{x}), \quad (44) $$

where $\beta D_\beta(x\|\hat{x})$ achieves its maximum at $\beta_m$, we define

$$ h_c = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-C(x_m \| x_{\hat{m}})\right), \quad (45) $$

$$ h_d = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-\beta_m D_{\beta_m}(x_m \| x_{\hat{m}})\right). \quad (46) $$

By Theorem 4,

$$ \max_{\beta \in (0,1)} I_{\beta,\alpha} = I_\gamma + O(N^{-1}), \quad (47) $$

where $\gamma = \beta(1-\beta) = 1/4$ with $\beta = 1/2$, and

$$ I_\gamma = I_G - \frac{K}{2} \ln \frac{4}{e}. \quad (48) $$

If $\beta_m = 1/2$, then, by Equations (45)–(48), we have

$$ \max_{\beta \in (0,1)} I_{\beta,\alpha} + \frac{K}{2} \ln \frac{4}{e} + O(N^{-1}) \le I_e = I + O(N^{-1}) \le h_d + H(X) \le I_u. \quad (49) $$

Therefore, from Equations (45), (46) and (49), we can see that I_e and I are close to $h_c + H(X)$.

2.3. Approximations for Mutual Information

In this section, we use the relationships described above to find effective approximations to the true mutual information I in the case of large but finite N. First, Theorems 1 and 2 tell us that the true mutual information I and its approximation I_e lie between lower and upper bounds given by:
$I_{\beta,\alpha} \le I \le I_u$ and $I_{\beta_1,1} \le I_e \le I_u$. As a special case, I is also bounded by $I_{\beta_1,1} \le I \le I_u$. Furthermore, from Equations (2) and (36) we can obtain the following asymptotic equality under suitable conditions:

$$ I = I_e + O(N^{-1}). \quad (50) $$

Hence, for continuous stimuli, we have the following approximate relationship for large N:

$$ I \approx I_e \approx I_G. \quad (51) $$

For discrete stimuli, by Equation (31), for large but finite N we have

$$ I \approx I_e \approx I_d = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) \right) + H(X). \quad (52) $$

Consider the special case $p(x_{\hat{m}}) \approx p(x_m)$ for $\hat{m} \in \Omega_m^u$. With the help of Equation (18), substitution of $p(x_{\hat{m}}) \approx p(x_m)$ into Equation (52) yields

$$ I \approx I_D = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) \right) + H(X) \approx \tilde{I}_D = -\sum_{m=1}^{M} p(x_m) \sum_{\hat{m} \in \Omega_m^u} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) + H(X), \quad (53) $$

where $\tilde{I}_D \le I_D$ and the second approximation follows from the first-order Taylor expansion $\ln(1+z) \approx z$, assuming that the term $\sum_{\hat{m} \in \Omega_m^u} \exp(-e^{-1} D(x_m \| x_{\hat{m}}))$ is sufficiently small.

The theoretical discussion above suggests that I_e and I_d are effective approximations to the true mutual information I in the limit of large N. Moreover, we find that they are often good approximations of the mutual information I even for relatively small N, as illustrated in the following section.

3. Results of Numerical Simulations

Consider a Poisson model neuron whose responses (i.e., numbers of spikes within a given time window) follow a Poisson distribution [24]. The mean response of neuron n, with $n \in \{1, 2, \ldots, N\}$, is described by the tuning function $f(x; \theta_n)$, which takes the form of a Heaviside step function:

$$ f(x; \theta_n) = \begin{cases} A, & \text{if } x \ge \theta_n, \\ 0, & \text{if } x < \theta_n, \end{cases} \quad (54) $$

where the stimulus $x \in [-T, T]$ with T = 1, A = 1, and the centers $\theta_1, \theta_2, \ldots, \theta_N$ of the N neurons are uniformly spaced in the interval $[-T, T]$, namely, $\theta_n = (n-1)d - T$ with $d = 2T/(N-1)$ for $N \ge 2$, and $\theta_n = 0$ for $N = 1$. We suppose that the discrete stimulus x has M = 21 possible values that are evenly spaced from $-T$ to $T$, namely, $x \in \mathcal{X} = \{ x_m : x_m = 2(m-1)T/(M-1) - T, \; m = 1, 2, \ldots, M \}$.
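The population model just specified can be written down compactly. The following is our sketch (variable and function names are invented), using the text's values T = 1, A = 1, M = 21:

```python
import numpy as np

# Parameters as given in the text: stimulus range [-T, T], amplitude A,
# and M evenly spaced discrete stimulus values.
T, A, M = 1.0, 1.0, 21
x_grid = np.linspace(-T, T, M)          # x_m = 2(m-1)T/(M-1) - T

def centers(N):
    """Tuning-curve centers theta_n, uniformly spaced on [-T, T]."""
    return np.array([0.0]) if N == 1 else np.linspace(-T, T, N)

def tuning(x, theta):
    """Heaviside step tuning: mean spike count A if x >= theta_n, else 0."""
    return np.where(x >= theta, A, 0.0)

def sample_responses(x, theta, rng):
    """Poisson spike counts of all N neurons for one stimulus value x."""
    return rng.poisson(tuning(x, theta))
```

Each call to `sample_responses` draws one response vector r for a stimulus x, which is all that the Monte Carlo evaluation described in this section requires.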
Now, the Kullback–Leibler divergence can be written as

$$ D(x_m \| x_{\hat{m}}) = \sum_{n=1}^{N} \left[ f(x_m; \theta_n) \ln \frac{f(x_m; \theta_n)}{f(x_{\hat{m}}; \theta_n)} + f(x_{\hat{m}}; \theta_n) - f(x_m; \theta_n) \right]. \quad (55) $$
Thus, for each neuron n, the factor contributed to $\exp(-e^{-1} D(x_m \| x_{\hat{m}}))$ is 1 when $f(x_m; \theta_n) = f(x_{\hat{m}}; \theta_n)$, is $\exp(-e^{-1} A)$ when $f(x_m; \theta_n) = 0$ and $f(x_{\hat{m}}; \theta_n) = A$, and vanishes when $f(x_m; \theta_n) = A$ and $f(x_{\hat{m}}; \theta_n) = 0$. Therefore, in this case, we have

$$ I_e = I_d. \quad (56) $$

More generally, this equality holds true whenever the tuning function has binary values.

In the first example, as illustrated in Figure 1, we suppose the stimulus has a uniform distribution, so that the probability is given by $p(x_m) = 1/M$. Figure 1a shows graphs of the input distribution p(x) and a representative tuning function $f(x; \theta)$ with the center $\theta = 0$. To assess the accuracy of the approximation formulas, we employed Monte Carlo (MC) simulation to evaluate the mutual information I [24]. In our MC simulation, we first sampled an input $x_j \in \mathcal{X}$ from the uniform distribution $p(x_j) = 1/M$, then generated the neural responses $r_j$ by the conditional distribution $p(r_j | x_j)$ based on the Poisson model, where $j = 1, 2, \ldots, j_{\max}$. The value of the mutual information by MC simulation was calculated by

$$ I_C = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln \frac{p(r_j | x_j)}{p(r_j)}, \quad (57) $$

where

$$ p(r_j) = \sum_{m=1}^{M} p(r_j | x_m) \, p(x_m). \quad (58) $$

To assess the precision of our MC simulation, we computed the standard deviation of repeated trials by bootstrapping:

$$ I_{std} = \left( \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} \left( I_C^i - \bar{I}_C \right)^2 \right)^{1/2}, \quad (59) $$

where

$$ I_C^i = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln \frac{p(r_{\Gamma_{j,i}} | x_{\Gamma_{j,i}})}{p(r_{\Gamma_{j,i}})}, \quad (60) $$

$$ \bar{I}_C = \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} I_C^i, \quad (61) $$

and $\Gamma_{j,i} \in \{1, 2, \ldots, j_{\max}\}$ is the (j, i)-th entry of the matrix $\Gamma \in \mathbb{N}^{j_{\max} \times i_{\max}}$ with samples taken randomly from the integer set $\{1, 2, \ldots, j_{\max}\}$ by a uniform distribution. Here, we set $j_{\max} = 5 \times 10^5$ and $i_{\max} = 100$. For different $N \in \{1, 2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000\}$, we compared $I_C$ with $I_e$, $I_d$ and $I_D$, as illustrated in Figure 1b–d. Here, we define the relative error of approximation, e.g., for $I_e$, as

$$ D_{I_e} = \frac{I_e - I_C}{I_C}, \quad (62) $$

and the relative standard deviation

$$ D_{I_{std}} = \frac{I_{std}}{I_C}. \quad (63) $$
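Combining Equation (13), the Poisson divergence of Equation (55), and the Monte Carlo estimator of Equations (57) and (58) gives a compact end-to-end check. The code below is our own unoptimized sketch, uses far fewer samples than the article's $j_{\max}$, and treats a zero rate as a Poisson distribution concentrated at zero counts:

```python
import math
import numpy as np

T, A, M = 1.0, 1.0, 21
x_grid = np.linspace(-T, T, M)
p_x = np.full(M, 1.0 / M)               # uniform stimulus distribution

def rates(N):
    """(M, N) matrix of mean counts: Heaviside tuning, centers on [-T, T]."""
    theta = np.array([0.0]) if N == 1 else np.linspace(-T, T, N)
    return np.where(x_grid[:, None] >= theta[None, :], A, 0.0)

def log_pmf(r, lam):
    """ln p(r|x) for independent Poisson neurons; rate 0 forces count 0."""
    total = 0.0
    for rn, fn in zip(r, lam):
        if fn > 0.0:
            total += rn * math.log(fn) - fn - math.lgamma(rn + 1.0)
        elif rn > 0:
            return -math.inf
    return total

def kl_matrix(lam):
    """Pairwise KL of Equation (55): D[m, mh] = sum_n f ln(f/f') + f' - f."""
    Mloc, N = lam.shape
    D = np.zeros((Mloc, Mloc))
    for m in range(Mloc):
        for mh in range(Mloc):
            d = 0.0
            for n in range(N):
                f, fh = lam[m, n], lam[mh, n]
                if f > 0.0 and fh == 0.0:
                    d = math.inf          # impossible response under x_mh
                    break
                if f > 0.0:
                    d += f * math.log(f / fh) + fh - f
                else:
                    d += fh
            D[m, mh] = d
    return D

def approx_I_e(N):
    """Equation (13): the approximation I_e from pairwise KL (nats)."""
    D = kl_matrix(rates(N))
    H = -float(np.sum(p_x * np.log(p_x)))
    inner = (p_x[None, :] / p_x[:, None]) * np.exp(-math.exp(-1.0) * D)
    return -float(np.sum(p_x * np.log(inner.sum(axis=1)))) + H

def monte_carlo_I(N, n_samples=6000, seed=0):
    """Equations (57)-(58): direct Monte Carlo estimate of I (nats)."""
    lam = rates(N)
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(n_samples):
        m = rng.integers(M)
        r = rng.poisson(lam[m])
        log_cond = log_pmf(r, lam[m])
        log_marg = np.logaddexp.reduce(
            [math.log(p_x[mh]) + log_pmf(r, lam[mh]) for mh in range(M)])
        acc += log_cond - log_marg
    return acc / n_samples
```

In our quick runs with N = 6 neurons, the Monte Carlo value comes out near 0.93 nats while $I_e \approx 1.05$ nats, the kind of small-N discrepancy that, as the figures in this section show, shrinks rapidly as the population grows.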
Figure 1. A comparison of the approximations I_e, I_d and I_D against I_C obtained by the Monte Carlo method for one-dimensional discrete stimuli. (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the Heaviside step tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

For the second example, we only changed the probability distribution of the stimulus $p(x_m)$ while keeping all other conditions unchanged. Now, $p(x_m)$ is a discrete sample from a Gaussian function:

$$ p(x_m) = Z^{-1} \exp\left( -\frac{x_m^2}{2\sigma^2} \right), \quad m = 1, 2, \ldots, M, \quad (64) $$

where Z is the normalization constant and $\sigma = T/2$. The results are illustrated in Figure 2.
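A quick check of Equation (64) (our code, with σ = T/2 = 0.5 on the 21-point grid): the discrete Gaussian-like prior has an entropy below the uniform prior's log₂ 21 ≈ 4.39 bits, which lowers the ceiling that the information curves can reach.

```python
import numpy as np

T, M, sigma = 1.0, 21, 0.5
x = np.linspace(-T, T, M)
w = np.exp(-x**2 / (2.0 * sigma**2))
p = w / w.sum()                     # Equation (64): p(x_m) = Z^{-1} exp(-x_m^2 / (2 sigma^2))
H_bits = -np.sum(p * np.log2(p))    # stimulus entropy H(X) in bits
```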
Figure 2. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 1 except that the stimulus distribution p(x) is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the Heaviside step tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

Next, we changed each tuning function $f(x; \theta_n)$ to a rectified linear function:

$$ f(x; \theta_n) = \max(0, \, x - \theta_n). \quad (65) $$

Figures 3 and 4 show the results under the same conditions as Figures 1 and 2 except for the shape of the tuning functions.
Figure 3. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 1 except for the shape of the tuning function (blue dashed lines in (a)). (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the rectified linear tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

Finally, we let the tuning function $f(x; \theta_n)$ have a random form:

$$ f(x; \theta_n) = \begin{cases} B, & \text{if } x \in \Theta_n = \{ \theta_n^1, \theta_n^2, \ldots, \theta_n^K \}, \\ 0, & \text{otherwise}, \end{cases} \quad (66) $$

where the stimulus $x \in \mathcal{X} = \{1, 2, \ldots, 999, 1000\}$, B = 1, and the values $\theta_n^1, \theta_n^2, \ldots, \theta_n^K$ are distinct and randomly selected from the set $\mathcal{X}$ with K = 100. In this example, we may regard $\mathcal{X}$ as a list of natural objects (stimuli), and there are a total of N sensory neurons, each of which responds only to K randomly selected objects. Figure 5 shows the results under the condition that p(x) is a uniform distribution. In Figure 6, we assume that p(x) is not flat but a half Gaussian given by Equation (64) with σ = 5.
Figure 4. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 3 except that the stimulus distribution p(x) is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the rectified linear tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

In all these examples, we found that the three formulas, namely, I_e, I_d and I_D, provided excellent approximations to the true values of the mutual information as evaluated by the Monte Carlo method. For example, in the examples in Figures 1 and 5, all three approximations were practically indistinguishable. In general, all these approximations were extremely accurate when N > 100. In all our simulations, the mutual information tended to increase with the population size N, eventually reaching a plateau for large enough N. The saturation of information for large N is due to the fact that it requires at most log₂ M bits of information to completely distinguish all M stimuli. It is impossible to gain more information than this maximum amount regardless of how many neurons are used in the population. In Figure 1, for instance, this maximum is log₂ 21 ≈ 4.39 bits, and in Figure 5, this maximum is log₂ 1000 ≈ 9.97 bits.
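The two saturation levels quoted here are straightforward to verify numerically (our arithmetic check):

```python
import numpy as np

# Mutual information can never exceed the stimulus entropy, which for a
# uniform prior over M stimuli is log2(M) bits.
cap_fig1 = np.log2(21)     # Figure 1: M = 21 stimuli
cap_fig5 = np.log2(1000)   # Figure 5: 1000 objects
```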
Figure 5. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is similar to that in Figure 1 except that the tuning function is random (blue dashed lines in (a)); see Equation (66). (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the random tuning function f(x; θ); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

For relatively small values of N, we found that I_D tended to be less accurate than I_e or I_d (see Figures 5 and 6). Our simulations also confirmed two analytical results. The first one is that I_d = I_D when the stimulus distribution is uniform; this result follows directly from the definitions of I_d and I_D and is confirmed by the simulations in Figures 1, 3 and 5. The second result is that I_d = I_e (Equation (56)) when the tuning function is binary, as confirmed by the simulations in Figures 1, 2, 5 and 6. When the tuning function allows many different values, I_e can be much more accurate than I_d and I_D, as shown by the simulations in Figures 3 and 4. To summarize, our best approximation formula is I_e, because it is more accurate than I_d and I_D, and, unlike I_d and I_D, it applies to both discrete and continuous stimuli (Equations (10) and (13)).
Figure 6. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 5 except that the stimulus distribution p(x) is not flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the random tuning function f(x; θ); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

4. Discussion

We have derived several asymptotic bounds and effective approximations of the mutual information for discrete variables and established several relationships among the different approximations. Our final approximation formulas involve only the Kullback–Leibler divergence, which is often easier to evaluate than the Shannon mutual information in practical applications. Although in this paper our theory is developed in the framework of neural population coding with concrete examples, our mathematical results are generic and should hold true in many related situations beyond the original context. We propose to approximate the mutual information with several asymptotic formulas, including I_e in Equation (10) or Equation (13), I_d in Equation (15) and I_D in Equation (18). Our numerical results show that the three approximations I_e, I_d and I_D were very accurate for large population size N, and sometimes even for relatively small N. Among the three approximations, I_D
tended to be the least accurate, although, as a special case of I_d, it is slightly easier to evaluate than I_d. For a comparison of I_e and I_d, we note that I_e is the universal formula, whereas I_d is restricted to discrete variables. The two formulas I_e and I_d become identical when the responses or the tuning functions have only two values. For more general tuning functions, the performance of I_e was better than that of I_d in our simulations. As mentioned before, an advantage of I_e is that it works not only for discrete stimuli but also for continuous stimuli. Theoretically speaking, the formula for I_e is well justified, and we have proven that it approaches the true mutual information I in the limit of a large population. In our numerical simulations, the performance of I_e was excellent and better than that of I_d and I_D. Overall, I_e is our most accurate and versatile approximation formula, although, in some cases, I_d and I_D are slightly more convenient to calculate. The numerical examples considered in this paper were based on an independent population of neurons whose responses have Poisson statistics. Although such models are widely used, they are appropriate only if the neural responses can be well characterized by the spike counts within a fixed time window. To study the temporal patterns of spike trains, one has to consider more complicated models. Estimation of mutual information from neural spike trains is a difficult computational problem [25–28]. In future work, it would be interesting to apply asymptotic formulas such as I_e to spike trains with small time bins, each containing either one spike or nothing.
A potential advantage of the asymptotic formula is that it might help reduce the bias caused by small samples in the calculation of the response marginal distribution $p(r) = \sum_x p(r|x) p(x)$ or the response entropy H(R), because here one only needs to calculate the Kullback–Leibler divergence $D(x\|\hat{x})$, which may have a smaller estimation error. Finding effective approximation methods for computing mutual information is a key step for many practical applications of information theory. Generally speaking, the Kullback–Leibler divergence (Equation (7)) is often easier to evaluate and approximate than either the Chernoff information (Equation (44)) or the Shannon mutual information (Equation (1)). In situations where this is indeed the case, our approximation formulas are potentially useful. Besides applications in numerical simulations, the availability of a set of approximation formulas may also provide helpful theoretical tools in future analytical studies of information coding and representations. As mentioned in the Introduction, various methods have been proposed to approximate the mutual information [9–16]. In future work, it would be useful to compare different methods rigorously under identical conditions in order to assess their relative merits. The approximation formulas developed in this paper are relatively easy to compute for practical problems. They are especially suitable for analytical purposes; for example, they could be used explicitly as objective functions for optimization or learning algorithms. Although the examples used in our simulations in this paper are parametric, it should be possible to extend the formulas to nonparametric problems, possibly with the help of the copula method to take advantage of its robustness in nonparametric estimations [16].

Author Contributions: W.H. developed and proved the theorems, programmed the numerical experiments and wrote the manuscript. K.Z. verified the proofs and revised the manuscript.
Funding: This research was supported by an NIH grant (R01 DC).

Conflicts of Interest: The authors declare no conflict of interest.
Appendix A. The Proofs

Appendix A.1. Proof of Theorem 1

By Jensen's inequality, we have

$$ I_{\beta,\alpha} = -\left\langle \ln \int_{\mathcal{X}} \left\langle \frac{p^\beta(r|\hat{x}) \, p^\alpha(\hat{x})}{p^\beta(r|x) \, p^\alpha(x)} \right\rangle_{r|x} d\hat{x} \right\rangle_x + H(X) \le -\left\langle \left\langle \ln \int_{\mathcal{X}} \frac{p^\beta(r|\hat{x}) \, p^\alpha(\hat{x})}{p^\beta(r|x) \, p^\alpha(x)} \, d\hat{x} \right\rangle_{r|x} \right\rangle_x + H(X) \quad (A1) $$

and

$$ -\left\langle \left\langle \ln \int_{\mathcal{X}} \frac{p^\beta(r|\hat{x}) \, p^\alpha(\hat{x})}{p^\beta(r|x) \, p^\alpha(x)} \, d\hat{x} \right\rangle_{r|x} \right\rangle_x + H(X) - I \le \ln \left\langle \frac{p(r) \, p^{\beta-1}(r|x) \, p^{\alpha-1}(x)}{\int_{\mathcal{X}} p^\beta(r|\hat{x}) \, p^\alpha(\hat{x}) \, d\hat{x}} \right\rangle_{r,x} = \ln \int_{\mathcal{R}} \frac{p(r) \int_{\mathcal{X}} p^\beta(r|x) \, p^\alpha(x) \, dx}{\int_{\mathcal{X}} p^\beta(r|\hat{x}) \, p^\alpha(\hat{x}) \, d\hat{x}} \, dr = 0. \quad (A2) $$

Combining Equations (A1) and (A2), we immediately get the lower bound in Equation (25). In this section, we use integrals for the variable x, although our argument is valid for both continuous and discrete variables. For discrete variables, we just need to replace each integral by a summation, and our argument remains valid without other modification. The same is true for the response variable r.

To prove the upper bound, let

$$ \Phi[q(\hat{x})] = \int_{\mathcal{R}} p(r|x) \int_{\mathcal{X}} q(\hat{x}) \ln \frac{p(r|x) \, q(\hat{x})}{p(r|\hat{x}) \, p(\hat{x})} \, d\hat{x} \, dr, \quad (A3) $$

where $q(\hat{x})$ satisfies

$$ \int_{\mathcal{X}} q(\hat{x}) \, d\hat{x} = 1, \qquad q(\hat{x}) \ge 0. \quad (A4) $$

By Jensen's inequality, we get

$$ \Phi[q(\hat{x})] \ge \int_{\mathcal{R}} p(r|x) \ln \frac{p(r|x)}{p(r)} \, dr. \quad (A5) $$

To find the function $q(\hat{x})$ that minimizes $\Phi[q(\hat{x})]$, we apply the variational principle as follows:

$$ \frac{\partial \tilde{\Phi}[q(\hat{x})]}{\partial q(\hat{x})} = \int_{\mathcal{R}} p(r|x) \ln \frac{p(r|x) \, q(\hat{x})}{p(r|\hat{x}) \, p(\hat{x})} \, dr - \lambda, \quad (A6) $$

where λ is the Lagrange multiplier and

$$ \tilde{\Phi}[q(\hat{x})] = \Phi[q(\hat{x})] + \lambda \left( \int_{\mathcal{X}} q(\hat{x}) \, d\hat{x} - 1 \right). \quad (A7) $$
Setting $\partial \tilde{\Phi}[q(\hat{x})] / \partial q(\hat{x}) = 0$ and using the constraint in Equation (A4), we find the optimal solution

$$ q^*(\hat{x}) = \frac{p(\hat{x}) \exp\left(-D(x\|\hat{x})\right)}{\int_{\mathcal{X}} p(\check{x}) \exp\left(-D(x\|\check{x})\right) d\check{x}}. \quad (A8) $$

Thus, the variational lower bound of $\Phi[q(\hat{x})]$ is given by

$$ \Phi[q^*(\hat{x})] = \min_{q(\hat{x})} \Phi[q(\hat{x})] = -\ln \int_{\mathcal{X}} p(\hat{x}) \exp\left(-D(x\|\hat{x})\right) d\hat{x}. \quad (A9) $$

Therefore, from Equations (1), (A5) and (A9), we get the upper bound in Equation (25). This completes the proof of Theorem 1.

Appendix A.2. Proof of Theorem 2

It follows from Equation (43) that

$$ I_{\beta_1,\alpha_1} = -\left\langle \ln \left\langle \exp\left(-\beta_1 D_{\beta_1}(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x \le -\left\langle \ln \left\langle \exp\left(-e^{-1} D(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x = I_e \le -\left\langle \ln \left\langle \exp\left(-D(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x = I_u, \quad (A10) $$

where $\beta_1 = e^{-1}$ and $\alpha_1 = 1$. We immediately get Equation (26). This completes the proof of Theorem 2.

Appendix A.3. Proof of Theorem 3

For the lower bound $I_{\beta,\alpha}$, we have

$$ I_{\beta,\alpha} = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\check{m}} \left( \frac{p(x_{\check{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\check{m}})\right) + H(X) = -\sum_{m=1}^{M} p(x_m) \ln\left(1 + d(x_m)\right) + H(X), \quad (A11) $$

where

$$ d(x_m) = \sum_{\check{m} \notin \{m\}} \left( \frac{p(x_{\check{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\check{m}})\right). \quad (A12) $$

Now, consider

$$ \ln\left(1 + d(x_m)\right) = \ln\left(1 + a(x_m) + b(x_m)\right) = \ln\left(1 + a(x_m)\right) + \ln\left(1 + \frac{b(x_m)}{1 + a(x_m)}\right) = \ln\left(1 + a(x_m)\right) + O(N^{-\gamma_1}), \quad (A13) $$

where

$$ a(x_m) = \sum_{\hat{m} \in \Omega_m^\beta} \left( \frac{p(x_{\hat{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\hat{m}})\right), \quad (A14a) $$

$$ b(x_m) = \sum_{\check{m} \notin \Omega_m^\beta \cup \{m\}} \left( \frac{p(x_{\check{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\check{m}})\right) \le N^{-\gamma_1} \sum_{\check{m} \notin \Omega_m^\beta} \left( \frac{p(x_{\check{m}})}{p(x_m)} \right)^{\alpha} = O(N^{-\gamma_1}). \quad (A14b) $$
Combining Equations (A11) and (A13) and Theorem 1, we get the lower bound in Equation (30). In a similar manner, we can get the upper bound in Equations (30) and (31). This completes the proof of Theorem 3.

Appendix A.4. Proof of Theorem 4

The upper bound $I_u$ for the mutual information I in Equation (25) can be written as

$$ I_u = -\int_{\mathcal{X}} \ln\left( \int_{\mathcal{X}} p(\hat{x}) \exp\left(-D(x\|\hat{x})\right) d\hat{x} \right) p(x) \, dx = -\int_{\mathcal{X}} \ln\left( \int_{\mathcal{X}} \exp\left( \left\langle L(r|\hat{x}) - L(r|x) \right\rangle_{r|x} \right) d\hat{x} \right) p(x) \, dx + H(X), \quad (A15) $$

where $L(r|\hat{x}) = \ln\left[ p(r|\hat{x}) \, p(\hat{x}) \right]$ and $L(r|x) = \ln\left[ p(r|x) \, p(x) \right]$.

Consider the Taylor expansion of $\langle L(r|\hat{x}) \rangle_{r|x}$ around x. Assuming that $L(r|\hat{x})$ is twice continuously differentiable for any $\hat{x} \in \mathcal{X}_\omega(x)$, we get

$$ \left\langle L(r|\hat{x}) - L(r|x) \right\rangle_{r|x} = y^T v_1 - \frac{1}{2} y^T y - \frac{1}{2} y^T G^{-1/2}(x) \left( G(\tilde{x}) - G(x) \right) G^{-1/2}(x) \, y, \quad (A16) $$

where

$$ y = G^{1/2}(x) \, (\hat{x} - x), \quad (A17) $$

$$ v_1 = G^{-1/2}(x) \, q'(x), \quad (A18) $$

and

$$ \tilde{x} = x + t(\hat{x} - x) \in \mathcal{X}_\omega(x), \quad t \in (0, 1). \quad (A19) $$

For later use, we also define

$$ v_0 = G^{-1/2}(x) \, l'(r|x), \quad (A20) $$

where

$$ l(r|x) = \ln p(r|x). \quad (A21) $$

Since $G(\tilde{x})$ is continuous and symmetric for $\tilde{x} \in \mathcal{X}$, for any $\epsilon \in (0,1)$, there is an $\varepsilon \in (0, \omega)$ such that

$$ \left| y^T G^{-1/2}(x) \left( G(\tilde{x}) - G(x) \right) G^{-1/2}(x) \, y \right| < \epsilon \, y^T y \quad (A22) $$

for all $y \in Y_\varepsilon$, where $Y_\varepsilon = \{ y \in \mathbb{R}^K : \|y\| < \varepsilon \sqrt{N} \}$. Then, we get

$$ \ln \int_{\mathcal{X}} \exp\left( \left\langle L(r|\hat{x}) - L(r|x) \right\rangle_{r|x} \right) d\hat{x} \ge -\frac{1}{2} \ln \det G(x) + \ln \int_{Y_\varepsilon} \exp\left( y^T v_1 - \frac{1+\epsilon}{2} \, y^T y \right) dy \quad (A23) $$

and, with Jensen's inequality,

$$ \ln \int_{Y_\varepsilon} \exp\left( y^T v_1 - \frac{1+\epsilon}{2} \, y^T y \right) dy \ge \ln \Psi_\varepsilon + \int_{\hat{Y}_\varepsilon} y^T v_1 \, \phi_\varepsilon(y) \, dy = \frac{K}{2} \ln \frac{2\pi}{1+\epsilon} + O\left(N^{K/2} e^{-N\delta}\right), \quad (A24) $$
where $\delta$ is a positive constant,

$$ \int_{\hat{Y}_{\varepsilon}} y^T v_1\, \phi_{\varepsilon}(y)\, dy = 0, \qquad \phi_{\varepsilon}(y) = \Psi_{\varepsilon}^{-1} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big), \qquad \Psi_{\varepsilon} = \int_{\hat{Y}_{\varepsilon}} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy, \quad (A25) $$

and

$$ \hat{Y}_{\varepsilon} = \Big\{ y \in \mathbb{R}^K : | y_k | < \varepsilon \sqrt{N/K},\; k = 1, 2, \ldots, K \Big\} \subseteq Y_{\varepsilon}. \quad (A26) $$

Now, we evaluate

$$ \Psi_{\varepsilon} = \int_{\mathbb{R}^K} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy - \int_{\mathbb{R}^K \setminus \hat{Y}_{\varepsilon}} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy = \Big( \frac{2\pi}{1 + \epsilon} \Big)^{K/2} - \int_{\mathbb{R}^K \setminus \hat{Y}_{\varepsilon}} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy. \quad (A27) $$

Performing integration by parts with $\int_{a}^{\infty} e^{-t^2/2}\, dt = \frac{e^{-a^2/2}}{a} - \int_{a}^{\infty} \frac{e^{-t^2/2}}{t^2}\, dt$, we find

$$ \int_{\mathbb{R}^K \setminus \hat{Y}_{\varepsilon}} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy = O\Big( N^{K/2} \exp\!\Big( -\frac{(1 + \epsilon)\, \varepsilon^2 N}{4K} \Big) \Big) = O\big( N^{K/2} e^{-N\delta} \big) \quad (A28) $$

for some constant $\delta > 0$. Combining Equations (A15), (A23) and (A24), we get

$$ I_u \leq \frac{1}{2} \Big\langle \ln \det\!\Big( \frac{1 + \epsilon}{2\pi} G(x) \Big) \Big\rangle_{x} + H(X) + O\big( N^{K/2} e^{-N\delta} \big). \quad (A29) $$

On the other hand, from Equation (A22) and the condition in Equation (33), we obtain

$$ \int_{X_{\varepsilon}(x)} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} \leq \det G(x)^{-1/2} \int_{\mathbb{R}^K} \exp\!\Big( y^T v_1 - \tfrac{1 - \epsilon}{2} y^T y \Big)\, dy = \det\!\Big( \frac{1 - \epsilon}{2\pi} G(x) \Big)^{-1/2} \exp\!\Big( \frac{v_1^T v_1}{2(1 - \epsilon)} \Big) \quad (A30) $$

and

$$ \int_{X} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} = \int_{X_{\varepsilon}(x)} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} + \int_{X \setminus X_{\varepsilon}(x)} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} \leq \det\!\Big( \frac{1 - \epsilon}{2\pi} G(x) \Big)^{-1/2} \exp\!\Big( \frac{v_1^T v_1}{2(1 - \epsilon)} \Big) + O\big( N^{-1} \big). \quad (A31) $$
It follows from Equations (A15) and (A31) that

$$ -\Big\langle \ln \int_{X} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} \Big\rangle_{x} \geq \frac{1}{2} \Big\langle \ln \det\!\Big( \frac{1 - \epsilon}{2\pi} G(x) \Big) \Big\rangle_{x} - \frac{1}{2(1 - \epsilon)} \big\langle v_1^T v_1 \big\rangle_{x} + O\big( N^{-1} \big). \quad (A32) $$

Note that

$$ \big\langle v_1^T v_1 \big\rangle_{x} = O\big( N^{-1} \big). \quad (A33) $$

Now, we have

$$ I_u \geq \frac{1}{2} \Big\langle \ln \det\!\Big( \frac{1 - \epsilon}{2\pi} G(x) \Big) \Big\rangle_{x} + H(X) + O\big( N^{-1} \big). \quad (A34) $$

Since $\epsilon$ is arbitrary, we can let it go to zero. Therefore, from Equations (25), (A29) and (A34), we obtain the upper bound in Equation (35).

The Taylor expansion of $h(\hat{x}, x) = \big( p(r|\hat{x}) / p(r|x) \big)^{\beta}$ around $x$ is

$$ h(\hat{x}, x) = 1 + \frac{\beta}{p(r|x)} \nabla_x p(r|x)^T (\hat{x} - x) + \frac{\beta}{2\, p(r|x)} (\hat{x} - x)^T \nabla_x^2 p(r|x)\, (\hat{x} - x) - \frac{\beta(1 - \beta)}{2} (\hat{x} - x)^T J(x)\, (\hat{x} - x) + o\big( \| \hat{x} - x \|^2 \big). \quad (A35) $$

In a manner similar to that described above, we obtain the asymptotic relationship in Equation (37):

$$ I_{\beta, \alpha} = I_{\gamma} + O\big( N^{-1} \big) = \frac{1}{2} \int_{X} p(x) \ln \det\!\Big( \frac{G_{\gamma}(x)}{2\pi} \Big)\, dx + H(X) + O\big( N^{-1} \big). \quad (A36) $$

Notice that $0 < \gamma = \beta(1 - \beta) \leq 1/4$, and the equality holds when $\beta = \bar{\beta} = 1/2$. Thus, we have

$$ \det G_{\gamma}(x) \leq \det G_{\bar{\gamma}}(x). \quad (A37) $$

Combining Equations (25), (A36) and (A37) yields the lower bound in Equation (35). The proof of Equation (36) is similar. This completes the proof of Theorem 4.

References

1. Borst, A.; Theunissen, F.E. Information theory and neural coding. Nat. Neurosci. 1999, 2.
2. Pouget, A.; Dayan, P.; Zemel, R. Information processing with population codes. Nat. Rev. Neurosci. 2000, 1.
3. Laughlin, S.B.; Sejnowski, T.J. Communication in neuronal networks. Science 2003, 301.
4. Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple neural spike train data analysis: State-of-the-art and future challenges. Nat. Neurosci. 2004, 7.
5. Bell, A.J.; Sejnowski, T.J. The independent components of natural scenes are edge filters. Vision Res. 1997, 37.
6. Huang, W.; Zhang, K. An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, April 2017; arXiv preprint.
7. Huang, W.; Huang, X.; Zhang, K. Information-theoretic interpretation of tuning curves for multiple motion directions. In Proceedings of the 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 2017.
8. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: New York, NY, USA, 2006.
9. Miller, G.A. Note on the bias of information estimates. Inf. Theory Psychol. Probl. Methods 1955, 2.
10. Carlton, A. On the bias of information estimates. Psychol. Bull. 1969, 71.
11. Treves, A.; Panzeri, S. The upward bias in measures of information derived from limited data samples. Neural Comput. 1995, 7.
12. Victor, J.D. Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Comput. 2000, 12.
13. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15.
14. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69.
15. Khan, S.; Bandyopadhyay, S.; Ganguly, A.R.; Saigal, S.; Erickson, D.J., III; Protopopescu, V.; Ostrouchov, G. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E 2007, 76.
16. Safaai, H.; Onken, A.; Harvey, C.D.; Panzeri, S. Information estimation using nonparametric copulas. Phys. Rev. E 2018, 98.
17. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37.
18. Van Trees, H.L.; Bell, K.L. Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking; John Wiley: Piscataway, NJ, USA, 2007.
19. Clarke, B.S.; Barron, A.R. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 1990, 36.
20. Rissanen, J.J. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory 1996, 42.
21. Brunel, N.; Nadal, J.P. Mutual information, Fisher information, and population coding. Neural Comput. 1998, 10.
22. Sompolinsky, H.; Yoon, H.; Kang, K.J.; Shamir, M. Population coding in neuronal systems with correlated noise. Phys. Rev. E 2001, 64.
23. Kang, K.; Sompolinsky, H. Mutual information of population codes and distance measures in probability space. Phys. Rev. Lett. 2001, 86.
24. Huang, W.; Zhang, K. Information-theoretic bounds and approximations in neural population coding. Neural Comput. 2018, 30.
25. Strong, S.P.; Koberle, R.; de Ruyter van Steveninck, R.R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80.
26. Nemenman, I.; Bialek, W.; de Ruyter van Steveninck, R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E 2004, 69.
27. Panzeri, S.; Senatore, R.; Montemurro, M.A.; Petersen, R.S. Correcting for the sampling bias problem in spike train information measures. J. Neurophysiol. 2007, 98.
28. Houghton, C. Calculating the Mutual Information Between Two Spike Trains. Neural Comput. 2019, 31.
29. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1961.
30. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23.
31. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35.
32. Beran, R. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 1977, 5.
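The Kullback–Leibler-divergence-based upper bound established in Theorem 1 is easy to check numerically in a small discrete example. The following sketch assumes a hypothetical two-stimulus, two-response channel (the prior, the crossover probability `e`, and all variable names are illustrative assumptions, not quantities from the text); it computes the exact Shannon mutual information $I$ and the discrete form of the bound, $I_u = -\sum_m p(x_m) \ln \sum_{\check{m}} p(x_{\check{m}}) \exp(-D(x_m \| x_{\check{m}}))$, and confirms $I \leq I_u$.

```python
import numpy as np

# Toy discrete channel (illustrative assumption): two stimuli, two responses.
# p_rx[i, j] = p(r = j | x = i); p_x is the stimulus prior.
e = 0.25  # crossover probability of the assumed channel
p_x = np.array([0.5, 0.5])
p_rx = np.array([[1.0 - e, e],
                 [e, 1.0 - e]])

# Marginal response distribution p(r) = sum_x p(x) p(r | x).
p_r = p_x @ p_rx

# Exact Shannon mutual information I(X; R) in nats.
I = sum(p_x[i] * p_rx[i, j] * np.log(p_rx[i, j] / p_r[j])
        for i in range(2) for j in range(2))

def kl(i, k):
    """KL divergence D(x_i || x_k) between the two response distributions."""
    return sum(p_rx[i, j] * np.log(p_rx[i, j] / p_rx[k, j]) for j in range(2))

# KL-divergence-based upper bound (discrete analogue of the variational bound):
# I_u = -sum_m p(x_m) ln sum_m' p(x_m') exp(-D(x_m || x_m')).
I_u = -sum(p_x[i] * np.log(sum(p_x[k] * np.exp(-kl(i, k)) for k in range(2)))
           for i in range(2))

print(f"I = {I:.6f} nats, I_u = {I_u:.6f} nats")
assert I <= I_u  # the upper bound holds
```

For this channel the bound holds with some slack ($I \approx 0.131$ nats versus $I_u \approx 0.237$ nats), and as the channel noise `e` goes to zero both quantities approach $\ln 2$, the entropy of the uniform binary prior.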
More informationNumerical Analysis: Interpolation Part 1
Numerical Analysis: Interpolation Part 1 Computer Science, Ben-Gurion University (slides based mostly on Prof. Ben-Shahar s notes) 2018/2019, Fall Semester BGU CS Interpolation (ver. 1.00) AY 2018/2019,
More informationA Modification of Linfoot s Informational Correlation Coefficient
Austrian Journal of Statistics April 07, Volume 46, 99 05. AJS http://www.ajs.or.at/ doi:0.773/ajs.v46i3-4.675 A Modification of Linfoot s Informational Correlation Coefficient Georgy Shevlyakov Peter
More information