Approximations of Shannon Mutual Information for Discrete Variables with Applications to Neural Population Coding


Wentao Huang 1,2,* and Kechen Zhang 2,*

1 Key Laboratory of Cognition and Intelligence and Information Science Academy of China Electronics Technology Group Corporation, Beijing 100086, China
2 Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
* Correspondence: wnthuang@gmail.com (W.H.); kzhang4@jhmi.edu (K.Z.)

Abstract: Although Shannon mutual information has been widely used, its effective calculation is often difficult for many practical problems, including those in neural population coding. Asymptotic formulas based on Fisher information sometimes provide accurate approximations to the mutual information, but this approach is restricted to continuous variables because the calculation of Fisher information requires derivatives with respect to the encoded variables. In this paper, we consider information-theoretic bounds and approximations of the mutual information based on Kullback-Leibler divergence and Rényi divergence. We propose several information metrics to approximate Shannon mutual information in the context of neural population coding. While our asymptotic formulas all work for discrete variables, one of them has consistent performance and high accuracy regardless of whether the encoded variables are discrete or continuous. We performed numerical simulations and confirmed that our approximation formulas were highly accurate for approximating the mutual information between the stimuli and the responses of a large neural population. These approximation formulas may potentially bring convenience to the applications of information theory to many practical and theoretical problems.

Keywords: neural population coding; mutual information; Kullback-Leibler divergence; Rényi divergence; Chernoff divergence; approximation; discrete variables

© 2019 by the authors. Distributed under a Creative Commons CC BY license.

1. Introduction

Information theory is a powerful tool widely used in many disciplines, including, for example, neuroscience, machine learning, and communication technology [1-7]. As it is often notoriously difficult to effectively calculate Shannon mutual information in many practical applications [8], various approximation methods have been proposed to estimate the mutual information, such as those based on asymptotic expansion [9-13], k-nearest neighbors [14], and minimal spanning trees [15]. Recently, Safaai et al. proposed a copula method for estimation of mutual information, which can be nonparametric and potentially robust [16]. Another approach for estimating the mutual information is to simplify the calculations by approximations based on information-theoretic bounds, such as the Cramér-Rao lower bound [17] and the van Trees Bayesian Cramér-Rao bound [18].

In this paper, we focus on mutual information estimation based on asymptotic approximations [19-24]. For encoding of continuous variables, asymptotic relations between mutual information and Fisher information have been presented by several researchers [19-22]. Recently, Huang and Zhang [24] proposed an improved approximation formula, which remains accurate for high-dimensional variables. A significant advantage of this approach is that asymptotic approximations are sometimes very useful in analytical studies. For instance, asymptotic approximations allow us to prove that the optimal neural population distribution that maximizes the mutual information between stimulus and response can be found by convex optimization [24].

Unfortunately, this approach does not generalize to discrete variables, since the calculation of Fisher information requires partial derivatives of the likelihood function with respect to the encoded variables. For encoding of discrete variables, Kang and Sompolinsky [23] presented an asymptotic relationship between mutual information and Chernoff information for statistically independent neurons in a large population. However, Chernoff information is still hard to calculate in many practical applications.

Discrete stimuli or variables occur naturally in sensory coding. While some stimuli are continuous (e.g., the direction of movement and the pitch of a tone), others are discrete (e.g., the identities of faces and the words in human speech). For definiteness, in this paper we frame our questions in the context of neural population coding; that is, we assume that the stimuli or the input variables are encoded by the pattern of responses elicited from a large population of neurons. The concrete examples used in our numerical simulations were based on a Poisson spike model, where the response of each neuron is taken as the spike count within a given time window. While this simple Poisson model allowed us to consider a large neural population, it captured only the spike rate and not any temporal structure of the spike trains [25-28]. Nonetheless, our mathematical results are quite general and should be applicable to other input-output systems under suitable conditions to be discussed later.

In the following, we first derive several upper and lower bounds on Shannon mutual information using Kullback-Leibler divergence and Rényi divergence. Next, we derive several new approximation formulas for Shannon mutual information in the limit of large population size. These formulas are more convenient to calculate than the mutual information in our examples. Finally, we confirm the validity of our approximation formulas using the true mutual information as evaluated by Monte Carlo simulations.

2. Theory and Methods

2.1. Notations and Definitions

Suppose the input x is a K-dimensional vector, x = (x_1, ..., x_K)^T, which could be interpreted as the parameters that specify a stimulus for a sensory system, and the output is an N-dimensional vector, r = (r_1, ..., r_N)^T, which could be interpreted as the responses of N neurons. We assume N is large, generally N >> K. We denote random variables by upper-case letters, e.g., random variables X and R, in contrast to their vector values x and r. The mutual information between X and R is defined by

I = I(X; R) = ⟨ ln[ p(r|x) / p(r) ] ⟩_{r,x},    (1)

where x ∈ X ⊆ R^K, r ∈ R ⊆ R^N, and ⟨·⟩_{r,x} denotes the expectation with respect to the probability density function p(r, x). Similarly, in the following we use ⟨·⟩_{r|x} and ⟨·⟩_x to denote expectations with respect to p(r|x) and p(x), respectively. If p(x) and p(r|x) are twice continuously differentiable for almost every x ∈ X, then for large N we can use an asymptotic formula to approximate the true value of I with high accuracy [24]:

I ≈ I_G = (1/2) ⟨ ln det( G(x) / (2πe) ) ⟩_x + H(X),    (2)

which is sometimes reduced to

I ≈ I_F = (1/2) ⟨ ln det( J(x) / (2πe) ) ⟩_x + H(X),    (3)

where det denotes the matrix determinant, H(X) = −⟨ln p(x)⟩_x is the stimulus entropy,

G(x) = J(x) + P(x),    (4)

P(x) = −∂² ln p(x) / ∂x ∂x^T,    (5)

and

J(x) = −⟨ ∂² ln p(r|x) / ∂x ∂x^T ⟩_{r|x} = ⟨ (∂ ln p(r|x)/∂x)(∂ ln p(r|x)/∂x)^T ⟩_{r|x}    (6)

is the Fisher information matrix.
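As a concrete illustration of Equation (3), the following Python sketch numerically evaluates I_F for a one-dimensional stimulus encoded by independent Poisson neurons with Gaussian tuning curves, for which the Fisher information is J(x) = Σ_n f_n′(x)²/f_n(x). All parameter values here are assumed for illustration only and are not taken from the paper's examples; with a uniform prior, P(x) = 0 and hence I_G = I_F.

```python
import numpy as np

# Assumed toy parameters (not from the paper): range, population size, peak rate, tuning width
T, N, A, w = 1.0, 200, 10.0, 0.2
centers = np.linspace(-T, T, N)          # preferred stimuli theta_n
xs = np.linspace(-T, T, 2001)            # grid used to average over the uniform prior p(x)

def fisher_info(x):
    """J(x) = sum_n f_n'(x)^2 / f_n(x) for independent Poisson neurons with Gaussian tuning."""
    f = A * np.exp(-(x - centers) ** 2 / (2 * w ** 2))
    fp = -(x - centers) / w ** 2 * f
    return np.sum(fp ** 2 / f)

J = np.array([fisher_info(x) for x in xs])
H = np.log(2 * T)                        # differential entropy of the uniform prior on [-T, T]
I_F = 0.5 * np.mean(np.log(J / (2 * np.pi * np.e))) + H   # Equation (3), in nats
print(I_F / np.log(2), "bits")
```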

We denote the Kullback-Leibler divergence as

D(x‖x̂) = ⟨ ln[ p(r|x) / p(r|x̂) ] ⟩_{r|x},    (7)

and denote the Rényi divergence [29] associated with order β as

D_β(x‖x̂) = −(1/β) ln ⟨ [ p(r|x̂) / p(r|x) ]^β ⟩_{r|x}.    (8)

Here, βD_β(x‖x̂) is equivalent to a Chernoff divergence [30]. It is well known that D_β(x‖x̂) → D(x‖x̂) in the limit β → 0. We define

I_u = −⟨ ln ⟨ exp( −D(x‖x̂) ) ⟩_x̂ ⟩_x,    (9)

I_e = −⟨ ln ⟨ exp( −e^{−1} D(x‖x̂) ) ⟩_x̂ ⟩_x,    (10)

I_{β,α} = −⟨ ln ⟨ exp( −βD_β(x‖x̂) + (1−α) ln[ p(x)/p(x̂) ] ) ⟩_x̂ ⟩_x,    (11)

where in I_{β,α} we have β ∈ (0, 1) and α ≥ 0, and assume p(x) > 0 for all x ∈ X.

In the following, we suppose x takes M discrete values x_m, m ∈ {1, 2, ..., M}, and p(x_m) > 0 for all m. Now, the definitions in Equations (9)-(11) become

I_u = −Σ_{m=1}^{M} p(x_m) ln[ Σ_{m̂=1}^{M} (p(x_m̂)/p(x_m)) exp( −D(x_m‖x_m̂) ) ] + H(X),    (12)

I_e = −Σ_{m=1}^{M} p(x_m) ln[ Σ_{m̂=1}^{M} (p(x_m̂)/p(x_m)) exp( −e^{−1} D(x_m‖x_m̂) ) ] + H(X),    (13)

I_{β,α} = −Σ_{m=1}^{M} p(x_m) ln[ Σ_{m̂=1}^{M} (p(x_m̂)/p(x_m))^α exp( −βD_β(x_m‖x_m̂) ) ] + H(X).    (14)
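For discrete stimuli, Equations (12)-(14) reduce to finite log-sum-exp expressions. The sketch below (our own helper, with names and arguments chosen for illustration) evaluates I_u, I_e and I_{β,α} from precomputed divergence matrices D[m, m̂] = D(x_m‖x_m̂) and, optionally, D_beta[m, m̂] = D_β(x_m‖x_m̂).

```python
import numpy as np
from scipy.special import logsumexp

def discrete_info_metrics(px, D, D_beta=None, beta=np.exp(-1.0), alpha=1.0):
    """Evaluate I_u, I_e and I_{beta,alpha} of Equations (12)-(14); results are in nats."""
    px = np.asarray(px, float)
    log_px = np.log(px)
    H = -np.sum(px * log_px)                      # stimulus entropy H(X)
    # log of p(x_mhat)/p(x_m); rows index m, columns index mhat
    log_ratio = log_px[None, :] - log_px[:, None]

    I_u = -np.sum(px * logsumexp(log_ratio - D, axis=1)) + H               # Equation (12)
    I_e = -np.sum(px * logsumexp(log_ratio - D / np.e, axis=1)) + H        # Equation (13)
    I_ba = None
    if D_beta is not None:                                                 # Equation (14)
        I_ba = -np.sum(px * logsumexp(alpha * log_ratio - beta * D_beta, axis=1)) + H
    return I_u, I_e, I_ba
```

Entries of D may be infinite (for example, for deterministic responses), in which case the corresponding terms simply drop out of the log-sum-exp.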

Furthermore, we define

I_d = −Σ_{m=1}^{M} p(x_m) ln[ 1 + Σ_{m̂ ∈ M_u(m)} (p(x_m̂)/p(x_m)) exp( −e^{−1} D(x_m‖x_m̂) ) ] + H(X),    (15)

I_u^d = −Σ_{m=1}^{M} p(x_m) ln[ 1 + Σ_{m̂ ∈ M_u(m)} (p(x_m̂)/p(x_m)) exp( −D(x_m‖x_m̂) ) ] + H(X),    (16)

I_{β,α}^d = −Σ_{m=1}^{M} p(x_m) ln[ 1 + Σ_{m̂ ∈ M_β(m)} (p(x_m̂)/p(x_m))^α exp( −βD_β(x_m‖x_m̂) ) ] + H(X),    (17)

I_D = −Σ_{m=1}^{M} p(x_m) ln[ 1 + Σ_{m̂ ∈ M_u(m)} exp( −e^{−1} D(x_m‖x_m̂) ) ] + H(X),    (18)

where

M̌_β(m) = { m̂ : m̂ = argmin_{m̌ ∉ M̂_β(m) ∪ {m}} D_β(x_m‖x_m̌) },    (19)

M̌_u(m) = { m̂ : m̂ = argmin_{m̌ ∉ M̂_u(m) ∪ {m}} D(x_m‖x_m̌) },    (20)

M̂_β(m) = { m̂ : D_β(x_m‖x_m̂) = 0 },    (21)

M̂_u(m) = { m̂ : D(x_m‖x_m̂) = 0 },    (22)

M_β(m) = ( M̌_β(m) ∪ M̂_β(m) ) \ {m},    (23)

M_u(m) = ( M̌_u(m) ∪ M̂_u(m) ) \ {m}.    (24)

Here, notice that if x is uniformly distributed, then by definition I_d and I_D become identical. The elements of the set M̌_β(m) are those that make D_β(x_m‖x_m̌) take its minimum value, excluding any element that satisfies the condition D_β(x_m‖x_m̂) = 0. Similarly, the elements of the set M̌_u(m) are those that minimize D(x_m‖x_m̌), excluding the ones that satisfy the condition D(x_m‖x_m̂) = 0.
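The index sets in Equations (19)-(24) are easy to build numerically by collecting, for each m, the candidates with zero divergence and the minimizers among the remaining ones. The following sketch (our own function, with a small tolerance standing in for exact zero tests) evaluates I_d and I_D of Equations (15) and (18) from a Kullback-Leibler matrix; it is an illustration, not code from the paper.

```python
import numpy as np

def I_d_and_I_D(px, D, tol=1e-12):
    """Evaluate I_d (Equation (15)) and I_D (Equation (18)); results are in nats."""
    px = np.asarray(px, float)
    M = len(px)
    H = -np.sum(px * np.log(px))
    I_d = I_D = H
    for m in range(M):
        d = np.asarray(D[m], float)
        others = [mh for mh in range(M) if mh != m]
        zero_set = [mh for mh in others if d[mh] <= tol]        # hat{M}_u(m), Equation (22)
        rest = [mh for mh in others if d[mh] > tol]
        if rest:
            d_min = min(d[mh] for mh in rest)
            min_set = [mh for mh in rest if d[mh] <= d_min + tol]   # check{M}_u(m), Equation (20)
        else:
            min_set = []
        idx = zero_set + min_set                                # M_u(m), Equation (24)
        w = np.exp(-d[idx] / np.e)
        I_d -= px[m] * np.log1p(np.sum(px[idx] / px[m] * w))    # Equation (15)
        I_D -= px[m] * np.log1p(np.sum(w))                      # Equation (18)
    return I_d, I_D
```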

2.2. Theorems

In the following, we state several conclusions as theorems and prove them in the Appendix.

Theorem 1. The mutual information I is bounded as follows:

I_{β,α} ≤ I ≤ I_u.    (25)

Theorem 2. The following inequalities are satisfied:

I_{β₁,1} ≤ I_e ≤ I_u,    (26)

where I_{β₁,1} is a special case of I_{β,α} in Equation (11) with β₁ = e^{−1} and α = 1, so that

I_{β₁,1} = −⟨ ln ⟨ exp( −β₁ D_{β₁}(x‖x̂) ) ⟩_x̂ ⟩_x.    (27)

Theorem 3. If there exist γ₁ > 0 and γ₂ > 0 such that

βD_β(x_m‖x_{m₁}) ≥ γ₁ ln N,    (28)

D(x_m‖x_{m₂}) ≥ γ₂ ln N,    (29)

for discrete stimuli x_m, where m₁ ∉ M_β(m) ∪ {m} and m₂ ∉ M_u(m) ∪ {m}, then we have the following asymptotic relationships:

I_{β,α} = I_{β,α}^d + O(N^{−γ₁}) ≤ I ≤ I_u = I_u^d + O(N^{−γ₂})    (30)

and

I_e = I_d + O(N^{−γ₂/e}).    (31)

Theorem 4. Suppose p(x) and p(r|x) are twice continuously differentiable for x ∈ X, ‖q′(x)‖ < ∞ and ‖q″(x)‖ < ∞, where q(x) = ln p(x) and ′ and ″ denote the partial derivatives ∂/∂x and ∂²/∂x∂x^T, and G_γ(x) is positive definite with ‖N G^{−1}(x)‖ = O(1), where ‖·‖ denotes the matrix Frobenius norm, γ = β(1−β), β ∈ (0, 1), and

G_γ(x) = γ J(x) + P(x).    (32)

If there exists an ω = ω(x) > 0 such that

det(G(x))^{−1/2} ∫_{X \ X_ε(x)} p(x̂) exp( −D(x‖x̂) ) dx̂ = O(N^{−1}),    (33)

det(G_γ(x))^{−1/2} ∫_{X \ X_ε(x)} p(x̂) exp( −βD_β(x‖x̂) ) dx̂ = O(N^{−1}),    (34)

for all x ∈ X and ε ∈ (0, ω), where X \ X_ω(x) is the complementary set of X_ω(x) = { x̃ ∈ R^K : (x̃ − x)^T G(x) (x̃ − x) < Nω² }, then we have the following asymptotic relationships:

I_{β̄,α} = I_γ̄ + O(N^{−1}) ≤ I ≤ I_u = I_G + K/2 + O(N^{−1}),    (35)

I_e = I_G + O(N^{−1}),    (36)

I_{β,α} = I_γ + O(N^{−1}),    (37)

where

I_γ = (1/2) ∫_X p(x) ln det( G_γ(x) / (2π) ) dx + H(X)    (38)

and γ̄ = β̄(1−β̄) = 1/4 with β̄ = 1/2.

Remark 1. We see from Theorems 1 and 2 that the true mutual information I and the approximation I_e both lie between I_{β₁,1} and I_u, which implies that their values may be close to each other. For a discrete variable x, Theorem 3 tells us that I_e and I_d are asymptotically equivalent, i.e., their difference vanishes in the limit of large N. For a continuous variable x, Theorem 4 tells us that I_e and I_G are asymptotically equivalent in the limit of large N, which means that I_e and I are also asymptotically equivalent because I_G and I are known to be asymptotically equivalent [24].

Remark 2. To see how the condition in Equation (33) could be satisfied, consider the case where D(x‖x̂) has only one extreme point at x̂ = x for x̂ ∈ X_ω(x) and there exists an η > 0 such that N^{−1} D(x‖x̂) ≥ η for x̂ ∉ X_ω(x). Now, the condition in Equation (33) is satisfied because

det(G(x))^{−1/2} ∫_{X \ X_ε(x)} p(x̂) exp( −D(x‖x̂) ) dx̂ ≤ det(G(x))^{−1/2} ∫_{X \ X_ε(x)} p(x̂) exp( −η̂_ε N ) dx̂ = O( N^{−K/2} e^{−η̂_ε N} ),    (39)

where by assumption we can find an η̂_ε > 0 for any given ε ∈ (0, ω). The condition in Equation (34) can be satisfied in a similar way.
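As a quick numerical sanity check of the bound chains in Theorems 1 and 2 (with α = 1), the following self-contained sketch evaluates I exactly on a truncated response grid for a toy population of two independent Poisson neurons and two equiprobable stimuli, and compares it with I_u, I_e and I_{β₁,1}. All rates are made-up numbers, and the closed-form Poisson divergences used here are standard identities rather than formulas quoted from the paper.

```python
import numpy as np
from itertools import product
from scipy.stats import poisson

px = np.array([0.5, 0.5])                      # two equiprobable stimuli (toy example)
F = np.array([[1.0, 3.0],
              [2.5, 0.5]])                     # F[m, n] = mean response of neuron n to stimulus m
H = -np.sum(px * np.log(px))

# Exact mutual information by explicit summation over a truncated response grid
counts = np.array(list(product(range(25), repeat=2)))
pr_x = np.array([poisson.pmf(counts, F[m]).prod(axis=1) for m in range(2)])
pr = px @ pr_x
I = sum(px[m] * np.sum(pr_x[m] * np.log(pr_x[m] / pr)) for m in range(2))

# Closed-form divergences between independent Poisson populations
def kl(fm, fh):
    return np.sum(fm * np.log(fm / fh) + fh - fm)                     # D(x_m || x_mhat)

def bdb(fm, fh, b):
    return np.sum((1 - b) * fm + b * fh - fm ** (1 - b) * fh ** b)    # beta * D_beta

D = np.array([[kl(F[m], F[h]) for h in range(2)] for m in range(2)])
beta1 = np.exp(-1.0)
BD = np.array([[bdb(F[m], F[h], beta1) for h in range(2)] for m in range(2)])

def bound(E):                                  # Equations (12)-(14) with alpha = 1
    ratio = px[None, :] / px[:, None]
    return H - np.sum(px * np.log((ratio * np.exp(-E)).sum(axis=1)))

I_u, I_e, I_b11 = bound(D), bound(D / np.e), bound(BD)
print(I_b11, I, I_e, I_u)    # expect I_b11 <= I <= I_u and I_b11 <= I_e <= I_u
```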

When β = 1/2, βD_β(x‖x̂) is the Bhattacharyya distance [31]:

βD_β(x‖x̂) = −ln ∫_R √( p(r|x) p(r|x̂) ) dr,    (40)

and we have

J(x) = [ ∂²D(x‖x̂) / ∂x̂ ∂x̂^T ]_{x̂=x} = 4 [ ∂²( βD_β(x‖x̂) ) / ∂x̂ ∂x̂^T ]_{x̂=x, β=1/2} = 4 [ ∂²H_l²(x‖x̂) / ∂x̂ ∂x̂^T ]_{x̂=x},    (41)

where H_l(x‖x̂) is the Hellinger distance [32] between p(r|x) and p(r|x̂):

H_l²(x‖x̂) = (1/2) ∫_R ( √(p(r|x)) − √(p(r|x̂)) )² dr.    (42)

By Jensen's inequality, for β ∈ (0, 1) we get

D_β(x‖x̂) ≤ D(x‖x̂).    (43)

Denoting the Chernoff information [8] as

C(x‖x̂) = max_{β ∈ (0,1)} βD_β(x‖x̂) = β_m D_{β_m}(x‖x̂),    (44)

where βD_β(x‖x̂) achieves its maximum at β = β_m, we have I_{β,α} − H(X) ≤ h_c, where

h_c = −Σ_{m=1}^{M} p(x_m) ln[ Σ_{m̂=1}^{M} (p(x_m̂)/p(x_m)) exp( −C(x_m‖x_m̂) ) ],    (45)

h_d = −Σ_{m=1}^{M} p(x_m) ln[ Σ_{m̂=1}^{M} (p(x_m̂)/p(x_m)) exp( −β_m D_β(x_m‖x_m̂) ) ].    (46)

By Theorem 4,

max_{β ∈ (0,1)} I_{β,α} = I_γ̄ + O(N^{−1}),    (47)

I_γ̄ = I_G − (K/2) ln(4/e).    (48)

If β_m = 1/2, then, by Equations (45)-(48), we have

max_{β ∈ (0,1)} I_{β,α} + (K/2) ln(4/e) + O(N^{−1}) ≤ I_e = I + O(N^{−1}) ≤ h_d + H(X) ≤ I_u.    (49)

Therefore, from Equations (45), (46) and (49), we can see that I_e and I are close to h_c + H(X).
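For independent Poisson neurons, the integral in Equation (40) and, more generally, βD_β have simple closed forms, which makes these divergences easy to tabulate. The helper below is ours (not from the paper) and relies on the standard identity −ln ∫ p(r|x)^{1−β} p(r|x̂)^β dr = Σ_n [ (1−β) f_n(x) + β f_n(x̂) − f_n(x)^{1−β} f_n(x̂)^β ] for Poisson rates, which reduces to the Bhattacharyya distance Σ_n ( √f_n(x) − √f_n(x̂) )²/2 at β = 1/2.

```python
import numpy as np

def poisson_beta_divergence(F, beta=0.5):
    """Return the matrix of beta * D_beta(x_m || x_mhat) for independent Poisson neurons.

    F : (M, N) array of mean responses, F[m, n] = f(x_m; theta_n).
    Intended for modest M, since an (M, M, N) intermediate array is formed.
    """
    a = F[:, None, :]                           # f_n(x_m)
    b = F[None, :, :]                           # f_n(x_mhat)
    return np.sum((1 - beta) * a + beta * b - a ** (1 - beta) * b ** beta, axis=2)
```

Dividing the result by β gives the D_β matrix needed in Equation (14), and maximizing βD_β over a grid of β values gives a numerical estimate of the Chernoff information in Equation (44).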

2.3. Approximations for Mutual Information

In this section, we use the relationships described above to find effective approximations to the true mutual information I in the case of large but finite N. First, Theorems 1 and 2 tell us that the true mutual information I and its approximation I_e lie between lower and upper bounds: I_{β,α} ≤ I ≤ I_u and I_{β₁,1} ≤ I_e ≤ I_u. As a special case, I is also bounded by I_{β₁,1} ≤ I ≤ I_u. Furthermore, from Equations (2) and (36) we can obtain the following asymptotic equality under suitable conditions:

I = I_e + O(N^{−1}).    (50)

Hence, for continuous stimuli, we have the following approximate relationship for large N:

I ≈ I_e ≈ I_G.    (51)

For discrete stimuli, by Equation (31), for large but finite N we have

I ≈ I_e ≈ I_d = −Σ_{m=1}^{M} p(x_m) ln[ 1 + Σ_{m̂ ∈ M_u(m)} (p(x_m̂)/p(x_m)) exp( −e^{−1} D(x_m‖x_m̂) ) ] + H(X).    (52)

Consider the special case p(x_m̂) ≈ p(x_m) for m̂ ∈ M_u(m). With the help of Equation (18), substitution of p(x_m̂) ≈ p(x_m) into Equation (52) yields

I ≈ I_D = −Σ_{m=1}^{M} p(x_m) ln[ 1 + Σ_{m̂ ∈ M_u(m)} exp( −e^{−1} D(x_m‖x_m̂) ) ] + H(X)
  ≈ Ĩ_D = −Σ_{m=1}^{M} p(x_m) Σ_{m̂ ∈ M_u(m)} exp( −e^{−1} D(x_m‖x_m̂) ) + H(X),    (53)

where Ĩ_D ≤ I_D and the second approximation follows from the first-order Taylor expansion ln(1 + z) ≈ z, assuming that the term Σ_{m̂ ∈ M_u(m)} exp( −e^{−1} D(x_m‖x_m̂) ) is sufficiently small.

The theoretical discussion above suggests that I_e and I_d are effective approximations to the true mutual information I in the limit of large N. Moreover, we find that they are often good approximations of the mutual information I even for relatively small N, as illustrated in the following section.

3. Results of Numerical Simulations

Consider Poisson model neurons whose responses (i.e., numbers of spikes within a given time window) follow a Poisson distribution [24]. The mean response of neuron n, with n ∈ {1, 2, ..., N}, is described by the tuning function f(x; θ_n), which takes the form of a Heaviside step function:

f(x; θ_n) = A if x ≥ θ_n, and 0 if x < θ_n,    (54)

where the stimulus x ∈ [−T, T] with T = 1, A = 10, and the centers θ_1, θ_2, ..., θ_N of the N neurons are uniformly spaced in the interval [−T, T], namely, θ_n = (n − 1)d − T with d = 2T/(N − 1) for N ≥ 2, and θ_n = 0 for N = 1. We suppose that the discrete stimulus x has M = 21 possible values that are evenly spaced from −T to T, namely, x ∈ X = { x_m : x_m = 2(m − 1)T/(M − 1) − T, m = 1, 2, ..., M }. Now, the Kullback-Leibler divergence can be written as

D(x_m‖x_m̂) = Σ_{n=1}^{N} [ f(x_m; θ_n) log( f(x_m; θ_n) / f(x_m̂; θ_n) ) + f(x_m̂; θ_n) − f(x_m; θ_n) ].    (55)
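Equation (55) is straightforward to tabulate. The sketch below builds the matrix D[m, m̂] = D(x_m‖x_m̂) for any mean-response table F[m, n] = f(x_m; θ_n), using the conventions 0·log(0/a) = 0 and a·log(a/0) = +∞ for a > 0; the Heaviside example of Equation (54) is constructed at the end, with the population size N = 100 chosen only as an example.

```python
import numpy as np

def poisson_kl_matrix(F):
    """Kullback-Leibler divergence matrix of Equation (55) for independent Poisson neurons."""
    M = F.shape[0]
    D = np.zeros((M, M))
    for m in range(M):
        for mh in range(M):
            fm, fh = F[m], F[mh]
            if np.any((fm > 0) & (fh == 0)):        # a * log(a / 0) = +inf for a > 0
                D[m, mh] = np.inf
                continue
            mask = fm > 0                           # 0 * log(0 / a) = 0
            D[m, mh] = np.sum(fm[mask] * np.log(fm[mask] / fh[mask])) + np.sum(fh - fm)
    return D

# Heaviside population of Equation (54): A = 10, T = 1, M = 21; N = 100 is an assumed example size
T, A, M, N = 1.0, 10.0, 21, 100
x = np.linspace(-T, T, M)
theta = np.linspace(-T, T, N)
F = A * (x[:, None] >= theta[None, :]).astype(float)
D = poisson_kl_matrix(F)
```

The resulting matrix can be passed directly to the I_e, I_d and I_D sketches given earlier.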

Thus, the factor contributed by neuron n to exp( −e^{−1} D(x_m‖x_m̂) ) equals 1 when f(x_m; θ_n) = f(x_m̂; θ_n), equals exp(−e^{−1}A) when f(x_m; θ_n) = 0 and f(x_m̂; θ_n) = A, and equals 0 when f(x_m; θ_n) = A and f(x_m̂; θ_n) = 0. Therefore, in this case, we have

I_e = I_d.    (56)

More generally, this equality holds true whenever the tuning function has binary values.

In the first example, as illustrated in Figure 1, we suppose the stimulus has a uniform distribution, so that the probability is given by p(x_m) = 1/M. Figure 1a shows graphs of the input distribution p(x) and a representative tuning function f(x; θ) with the center θ = 0. To assess the accuracy of the approximation formulas, we employed Monte Carlo (MC) simulation to evaluate the mutual information I [24]. In our MC simulation, we first sampled an input x_j ∈ X from the uniform distribution p(x_j) = 1/M, then generated the neural responses r_j from the conditional distribution p(r_j|x_j) based on the Poisson model, where j = 1, 2, ..., j_max. The value of the mutual information estimated by MC simulation was calculated as

I_C = (1/j_max) Σ_{j=1}^{j_max} ln[ p(r_j|x_j) / p(r_j) ],    (57)

where

p(r_j) = Σ_{m=1}^{M} p(r_j|x_m) p(x_m).    (58)

To assess the precision of our MC simulation, we computed the standard deviation of repeated trials by bootstrapping:

I_std = sqrt( (1/i_max) Σ_{i=1}^{i_max} ( I_C^i − Ī_C )² ),    (59)

where

I_C^i = (1/j_max) Σ_{j=1}^{j_max} ln[ p(r_{Γ_{j,i}} | x_{Γ_{j,i}}) / p(r_{Γ_{j,i}}) ],    (60)

Ī_C = (1/i_max) Σ_{i=1}^{i_max} I_C^i,    (61)

and Γ_{j,i} ∈ {1, 2, ..., j_max} is the (j, i)-th entry of the matrix Γ ∈ N^{j_max × i_max}, whose entries are sampled randomly from the integer set {1, 2, ..., j_max} with a uniform distribution. Here, we set j_max = 5 × 10⁵ and i_max = 100. For different N ∈ {1, 2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000}, we compared I_C with I_e, I_d and I_D, as illustrated in Figure 1b-d. Here, we define the relative error of an approximation, e.g., for I_e, as

D_{I_e} = ( I_e − I_C ) / I_C,    (62)

and the relative standard deviation as

D_{I_std} = I_std / I_C.    (63)
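The Monte Carlo estimator of Equations (57) and (58) can be sketched as follows for the independent Poisson model; the default sample size is smaller than the 5 × 10⁵ used in the text, purely to keep the illustration fast, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_info(F, px, j_max=100_000):
    """Monte Carlo estimate of I (Equations (57) and (58)), in nats, for independent Poisson neurons."""
    M = F.shape[0]
    log_px = np.log(px)
    total = 0.0
    for _ in range(j_max):
        m = rng.choice(M, p=px)                 # sample a stimulus x_j
        r = rng.poisson(F[m])                   # sample a response r_j ~ p(r | x_j)
        # log p(r_j | x_m') for every candidate stimulus, up to a constant that cancels below
        log_like = np.sum(r[None, :] * np.log(np.where(F > 0, F, 1.0)) - F
                          + np.where((F == 0) & (r[None, :] > 0), -np.inf, 0.0), axis=1)
        log_pr = np.logaddexp.reduce(log_like + log_px)        # log p(r_j), Equation (58)
        total += log_like[m] - log_pr                          # log [ p(r_j | x_j) / p(r_j) ]
    return total / j_max
```

Bootstrapping as in Equations (59)-(61) amounts to storing the j_max individual summands and recomputing their average over resampled index columns Γ.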

Figure 1. A comparison of approximations I_e, I_d and I_D against I_C obtained by Monte Carlo method for one-dimensional discrete stimuli. (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the Heaviside step tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

For the second example, we only changed the probability distribution of the stimulus p(x_m) while keeping all other conditions unchanged. Now, p(x_m) is a discrete sample from a Gaussian function:

p(x_m) = Z^{−1} exp( −x_m² / (2σ²) ), m = 1, 2, ..., M,    (64)

where Z is the normalization constant and σ = T/2. The results are illustrated in Figure 2.
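A minimal sketch of the discrete Gaussian-like prior in Equation (64), on the same stimulus grid as the first example; the explicit normalization plays the role of Z.

```python
import numpy as np

T, M = 1.0, 21
sigma = T / 2
x = np.linspace(-T, T, M)                 # the M evenly spaced stimulus values
px = np.exp(-x ** 2 / (2 * sigma ** 2))
px /= px.sum()                            # Z^(-1) normalization, Equation (64)
```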

Figure 2. A comparison of approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 1 except that the stimulus distribution p(x) is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the Heaviside step tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

Next, we changed each tuning function f(x; θ_n) to a rectified linear function:

f(x; θ_n) = max( 0, x − θ_n ).    (65)

Figures 3 and 4 show the results under the same conditions as Figures 1 and 2, respectively, except for the shape of the tuning functions.
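The rectified linear population of Equation (65) can be tabulated in the same way as the Heaviside example; the short sketch below builds the mean-response matrix F (with N = 100 again an assumed example size), which can then be passed to the Kullback-Leibler helper sketched earlier.

```python
import numpy as np

T, M, N = 1.0, 21, 100                    # N is an assumed example size
x = np.linspace(-T, T, M)
theta = np.linspace(-T, T, N)
F = np.maximum(0.0, x[:, None] - theta[None, :])   # F[m, n] = max(0, x_m - theta_n), Equation (65)
```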

Figure 3. A comparison of approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 1 except for the shape of the tuning function (blue dashed lines in (a)). (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the rectified linear tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

Finally, we let the tuning function f(x; θ_n) have a random form:

f(x; θ_n) = B if x ∈ Θ_n = { θ_n^1, θ_n^2, ..., θ_n^K }, and 0 otherwise,    (66)

where the stimulus x ∈ X = {1, 2, ..., 999, 1000}, B = 10, and the values θ_n^1, θ_n^2, ..., θ_n^K are distinct and randomly selected from the set X with K = 10. In this example, we may regard X as a list of natural objects (stimuli), and there are a total of N sensory neurons, each of which responds only to K randomly selected objects. Figure 5 shows the results under the condition that p(x) is a uniform distribution. In Figure 6, we assume that p(x) is not flat but a half-Gaussian given by Equation (64) with σ = 500.
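A minimal sketch of the random binary tuning functions in Equation (66): each neuron is assigned K distinct preferred objects drawn without replacement from X. The population size N and the random seed are assumed example values.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K, B = 1000, 200, 10, 10.0          # N and the seed are assumed; M, K, B as in the text
F = np.zeros((M, N))
for n in range(N):
    picks = rng.choice(M, size=K, replace=False)   # K distinct objects for neuron n, Equation (66)
    F[picks, n] = B
```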

Figure 4. A comparison of approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 3 except that the stimulus distribution p(x) is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the rectified linear tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

In all these examples, we found that the three formulas, namely, I_e, I_d and I_D, provided excellent approximations to the true values of the mutual information as evaluated by the Monte Carlo method. For example, in the examples in Figures 1 and 5, all three approximations were practically indistinguishable. In general, all these approximations were extremely accurate when N > 100. In all our simulations, the mutual information tended to increase with the population size N, eventually reaching a plateau for large enough N. The saturation of information for large N is due to the fact that it requires at most log₂ M bits of information to completely distinguish all M stimuli. It is impossible to gain more information than this maximum amount regardless of how many neurons are used in the population. In Figure 1, for instance, this maximum is log₂ 21 ≈ 4.39 bits, and in Figure 5, this maximum is log₂ 1000 ≈ 9.97 bits.

Figure 5. A comparison of approximations I_e, I_d and I_D against I_C. The situation is similar to that in Figure 1 except that the tuning function is random (blue dashed lines in (a)); see Equation (66). (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the random tuning function f(x; θ); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

For relatively small values of N, we found that I_D tended to be less accurate than I_e or I_d (see Figures 5 and 6). Our simulations also confirmed two analytical results. The first one is that I_d = I_D when the stimulus distribution is uniform; this result follows directly from the definitions of I_d and I_D and is confirmed by the simulations in Figures 1, 3, and 5. The second result is that I_d = I_e (Equation (56)) when the tuning function is binary, as confirmed by the simulations in Figures 1, 2, 5, and 6. When the tuning function allows many different values, I_e can be much more accurate than I_d and I_D, as shown by the simulations in Figures 3 and 4. To summarize, our best approximation formula is I_e because it is more accurate than I_d and I_D, and, unlike I_d and I_D, it applies to both discrete and continuous stimuli (Equations (10) and (13)).

Figure 6. A comparison of approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 5 except that the stimulus distribution p(x) is not flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the random tuning function f(x; θ); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

4. Discussion

We have derived several asymptotic bounds and effective approximations of the mutual information for discrete variables and established several relationships among different approximations. Our final approximation formulas involve only the Kullback-Leibler divergence, which is often easier to evaluate than Shannon mutual information in practical applications. Although in this paper our theory is developed in the framework of neural population coding with concrete examples, our mathematical results are generic and should hold true in many related situations beyond the original context.

We propose to approximate the mutual information with several asymptotic formulas, including I_e in Equation (10) or Equation (13), I_d in Equation (15) and I_D in Equation (18). Our numerical results show that the three approximations I_e, I_d and I_D were very accurate for large population size N, and sometimes even for relatively small N.

Among the three approximations, I_D tended to be the least accurate, although, as a special case of I_d, it is slightly easier to evaluate than I_d. For a comparison of I_e and I_d, we note that I_e is the universal formula, whereas I_d is restricted to discrete variables. The two formulas I_e and I_d become identical when the responses or the tuning functions have only two values. For more general tuning functions, the performance of I_e was better than that of I_d in our simulations. As mentioned before, an advantage of I_e is that it works not only for discrete stimuli but also for continuous stimuli. Theoretically speaking, the formula for I_e is well justified, and we have proven that it approaches the true mutual information I in the limit of a large population. In our numerical simulations, the performance of I_e was excellent and better than that of I_d and I_D. Overall, I_e is our most accurate and versatile approximation formula, although, in some cases, I_d and I_D are slightly more convenient to calculate.

The numerical examples considered in this paper were based on an independent population of neurons whose responses have Poisson statistics. Although such models are widely used, they are appropriate only if the neural responses can be well characterized by the spike counts within a fixed time window. To study the temporal patterns of spike trains, one has to consider more complicated models. Estimation of mutual information from neural spike trains is a difficult computational problem [25-28]. In future work, it would be interesting to apply asymptotic formulas such as I_e to spike trains with small time bins, each containing either one spike or nothing. A potential advantage of the asymptotic formula is that it might help reduce the bias caused by small samples in the calculation of the response marginal distribution p(r) = Σ_x p(r|x)p(x) or the response entropy H(R), because here one only needs to calculate the Kullback-Leibler divergence D(x‖x̂), which may have a smaller estimation error.

Finding effective approximation methods for computing mutual information is a key step for many practical applications of information theory. Generally speaking, the Kullback-Leibler divergence (Equation (7)) is often easier to evaluate and approximate than either Chernoff information (Equation (44)) or Shannon mutual information (Equation (1)). In situations where this is indeed the case, our approximation formulas are potentially useful. Besides applications in numerical simulations, the availability of a set of approximation formulas may also provide helpful theoretical tools in future analytical studies of information coding and representations. As mentioned in the Introduction, various methods have been proposed to approximate the mutual information [9-16]. In future work, it would be useful to compare different methods rigorously under identical conditions in order to assess their relative merits.

The approximation formulas developed in this paper are relatively easy to compute for practical problems. They are especially suitable for analytical purposes; for example, they could be used explicitly as objective functions for optimization or learning algorithms. Although the examples used in our simulations are parametric, it should be possible to extend the formulas to nonparametric problems, possibly with the help of the copula method to take advantage of its robustness in nonparametric estimation [16].

Author Contributions: W.H. developed and proved the theorems, programmed the numerical experiments and wrote the manuscript. K.Z. verified the proofs and revised the manuscript.

Funding: This research was supported by NIH grant R01 DC.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. The Proofs

Appendix A.1. Proof of Theorem 1

By Jensen's inequality, we have

I_{β,α} = −⟨ ln ⟨ ∫_X [ p^β(r|x̂) p^α(x̂) ] / [ p^β(r|x) p^α(x) ] dx̂ ⟩_{r|x} ⟩_x + H(X)
        ≤ ⟨ ⟨ −ln ∫_X [ p^β(r|x̂) p^α(x̂) ] / [ p^β(r|x) p^α(x) ] dx̂ ⟩_{r|x} ⟩_x + H(X)    (A1)

and

⟨ ⟨ −ln ∫_X [ p^β(r|x̂) p^α(x̂) ] / [ p^β(r|x) p^α(x) ] dx̂ ⟩_{r|x} ⟩_x + H(X) − I
  = ⟨ ⟨ ln [ p(r) p^{β−1}(r|x) p^{α−1}(x) / ∫_X p^β(r|x̂) p^α(x̂) dx̂ ] ⟩_{r|x} ⟩_x
  ≤ ln ∫_R p(r) [ ∫_X p^β(r|x) p^α(x) dx / ∫_X p^β(r|x̂) p^α(x̂) dx̂ ] dr = 0.    (A2)

Combining Equations (A1) and (A2), we immediately get the lower bound in Equation (25). In this section, we use integrals for the variable x, although our argument is valid for both continuous and discrete variables; for discrete variables, we just need to replace each integral by a summation, and the argument remains valid without other modification. The same is true for the response variable r.

To prove the upper bound, let

Φ[q(x̂)] = ∫_R p(r|x) ∫_X q(x̂) ln[ p(r|x) q(x̂) / ( p(r|x̂) p(x̂) ) ] dx̂ dr,    (A3)

where q(x̂) satisfies

∫_X q(x̂) dx̂ = 1, q(x̂) ≥ 0.    (A4)

By Jensen's inequality, we get

Φ[q(x̂)] ≥ ∫_R p(r|x) ln[ p(r|x) / p(r) ] dr.    (A5)

To find a function q(x̂) that minimizes Φ[q(x̂)], we apply the variational principle as follows:

∂Φ̃[q(x̂)] / ∂q(x̂) = ∫_R p(r|x) ln[ p(r|x) q(x̂) / ( p(r|x̂) p(x̂) ) ] dr − λ,    (A6)

where λ is the Lagrange multiplier and

Φ̃[q(x̂)] = Φ[q(x̂)] + λ ( ∫_X q(x̂) dx̂ − 1 ).    (A7)

Setting ∂Φ̃[q(x̂)]/∂q(x̂) = 0 and using the constraint in Equation (A4), we find the optimal solution

q*(x̂) = p(x̂) exp( −D(x‖x̂) ) / ∫_X p(x̌) exp( −D(x‖x̌) ) dx̌.    (A8)

Thus, the variational lower bound of Φ[q(x̂)] is given by

Φ[q*(x̂)] = min_{q(x̂)} Φ[q(x̂)] = −ln ∫_X p(x̂) exp( −D(x‖x̂) ) dx̂.    (A9)

Therefore, from Equations (1), (A5) and (A9), averaging over x, we get the upper bound in Equation (25). This completes the proof of Theorem 1.

Appendix A.2. Proof of Theorem 2

It follows from Equation (43) that

I_{β₁,α₁} = −⟨ ln ⟨ exp( −β₁ D_{β₁}(x‖x̂) + (1−α₁) ln[ p(x)/p(x̂) ] ) ⟩_x̂ ⟩_x
          ≤ −⟨ ln ⟨ exp( −e^{−1} D(x‖x̂) ) ⟩_x̂ ⟩_x = I_e
          ≤ −⟨ ln ⟨ exp( −D(x‖x̂) ) ⟩_x̂ ⟩_x = I_u,    (A10)

where β₁ = e^{−1} and α₁ = 1. We immediately get Equation (26). This completes the proof of Theorem 2.

Appendix A.3. Proof of Theorem 3

For the lower bound I_{β,α}, we have

I_{β,α} = −Σ_{m} p(x_m) ln[ Σ_{m̌} (p(x_m̌)/p(x_m))^α exp( −βD_β(x_m‖x_m̌) ) ] + H(X) = −Σ_{m} p(x_m) ln[ 1 + d(x_m) ] + H(X),    (A11)

where

d(x_m) = Σ_{m̌ ≠ m} (p(x_m̌)/p(x_m))^α exp( −βD_β(x_m‖x_m̌) ).    (A12)

Now, consider

ln[ 1 + d(x_m) ] = ln[ 1 + a(x_m) + b(x_m) ] = ln[ 1 + a(x_m) ] + ln[ 1 + b(x_m)/(1 + a(x_m)) ] = ln[ 1 + a(x_m) ] + O(N^{−γ₁}),    (A13)

where

a(x_m) = Σ_{m̂ ∈ M_β(m)} (p(x_m̂)/p(x_m))^α exp( −βD_β(x_m‖x_m̂) ),    (A14a)

b(x_m) = Σ_{m̌ ∉ M_β(m) ∪ {m}} (p(x_m̌)/p(x_m))^α exp( −βD_β(x_m‖x_m̌) ) ≤ N^{−γ₁} Σ_{m̌ ∉ M_β(m) ∪ {m}} (p(x_m̌)/p(x_m))^α = O(N^{−γ₁}).    (A14b)

Combining Equations (A11) and (A13) and Theorem 1, we get the lower bound in Equation (30). In a similar manner, we can get the upper bound in Equation (30) and the relation in Equation (31). This completes the proof of Theorem 3.

Appendix A.4. Proof of Theorem 4

The upper bound I_u for the mutual information I in Equation (25) can be written as

I_u = −∫_X p(x) ln[ ∫_X p(x̂) exp( −D(x‖x̂) ) dx̂ ] dx = −∫_X p(x) ln[ ∫_X exp( ⟨ L(r, x̂) − L(r, x) ⟩_{r|x} ) dx̂ ] dx + H(X),    (A15)

where L(r, x̂) = ln[ p(r|x̂) p(x̂) ] and L(r, x) = ln[ p(r|x) p(x) ].

Consider the Taylor expansion of L(r, x̂) around x. Assuming that L(r, x̂) is twice continuously differentiable for any x̂ ∈ X_ω(x), we get

⟨ L(r, x̂) − L(r, x) ⟩_{r|x} = y^T v₁ − (1/2) y^T y − (1/2) y^T G(x)^{−1/2} [ G(x̃) − G(x) ] G(x)^{−1/2} y,    (A16)

where

y = G(x)^{1/2} (x̂ − x),    (A17)

v₁ = G(x)^{−1/2} q′(x),    (A18)

and

x̃ = x + t (x̂ − x) ∈ X_ω(x), t ∈ (0, 1).    (A19)

For later use, we also define

v = G(x)^{−1/2} l′(r, x),    (A20)

where

l(r, x) = ln p(r|x).    (A21)

Since G(x) is continuous and symmetric for x ∈ X, for any ɛ ∈ (0, 1) there is an ε ∈ (0, ω) such that

| y^T G(x)^{−1/2} [ G(x̃) − G(x) ] G(x)^{−1/2} y | < ɛ ‖y‖²    (A22)

for all y ∈ Y_ε, where Y_ε = { y ∈ R^K : ‖y‖ < ε √N }. Then, we get

ln ∫_X exp( ⟨ L(r, x̂) − L(r, x) ⟩_{r|x} ) dx̂ ≥ −(1/2) ln det G(x) + ln ∫_{Y_ε} exp( y^T v₁ − (1/2)(1+ɛ) y^T y ) dy    (A23)

and, with Jensen's inequality,

ln ∫_{Y_ε} exp( y^T v₁ − (1/2)(1+ɛ) y^T y ) dy ≥ ln Ψ_ε + ∫_{Ŷ_ε} y^T v₁ φ_ε(y) dy = (K/2) ln[ 2π/(1+ɛ) ] + O( N^{K/2} e^{−Nδ} ),    (A24)

where δ is a positive constant,

∫_{Ŷ_ε} y^T v₁ φ_ε(y) dy = 0,  φ_ε(y) = Ψ_ε^{−1} exp( −(1/2)(1+ɛ) y^T y ),  Ψ_ε = ∫_{Ŷ_ε} exp( −(1/2)(1+ɛ) y^T y ) dy,    (A25)

and

Ŷ_ε = { y ∈ R^K : |y_k| < ε √(N/K), k = 1, 2, ..., K } ⊆ Y_ε.    (A26)

Now, we evaluate

Ψ_ε = ∫_{R^K} exp( −(1/2)(1+ɛ) y^T y ) dy − ∫_{R^K \ Ŷ_ε} exp( −(1/2)(1+ɛ) y^T y ) dy = [ 2π/(1+ɛ) ]^{K/2} − ∫_{R^K \ Ŷ_ε} exp( −(1/2)(1+ɛ) y^T y ) dy.    (A27)

Performing integration by parts on the Gaussian tail integral ∫_a^∞ e^{−t²/2} dt, we find

∫_{R^K \ Ŷ_ε} exp( −(1/2)(1+ɛ) y^T y ) dy = O( N^{K/2} e^{−Nδ} )    (A28)

for some constant δ > 0. Combining Equations (A15), (A23) and (A24), we get

I_u ≤ (1/2) ⟨ ln det[ (1+ɛ) G(x) / (2π) ] ⟩_x + H(X) + O( N^{K/2} e^{−Nδ} ).    (A29)

On the other hand, from Equation (A22) and the condition in Equation (33), we obtain

∫_{X_ε(x)} exp( ⟨ L(r, x̂) − L(r, x) ⟩_{r|x} ) dx̂ ≤ det(G(x))^{−1/2} ∫_{R^K} exp( y^T v₁ − (1/2)(1−ɛ) y^T y ) dy = det[ (1−ɛ) G(x) / (2π) ]^{−1/2} exp( (1/(2(1−ɛ))) v₁^T v₁ )    (A30)

and

∫_X exp( ⟨ L(r, x̂) − L(r, x) ⟩_{r|x} ) dx̂ = ∫_{X_ε(x)} exp( ⟨ L(r, x̂) − L(r, x) ⟩_{r|x} ) dx̂ + ∫_{X \ X_ε(x)} exp( ⟨ L(r, x̂) − L(r, x) ⟩_{r|x} ) dx̂
  ≤ det[ (1−ɛ) G(x) / (2π) ]^{−1/2} exp( (1/(2(1−ɛ))) v₁^T v₁ ) + O(N^{−1}).    (A31)

It follows from Equations (A15) and (A31) that

−⟨ ln ∫_X exp( ⟨ L(r, x̂) − L(r, x) ⟩_{r|x} ) dx̂ ⟩_x ≥ (1/2) ⟨ ln det[ (1−ɛ) G(x) / (2π) ] ⟩_x − (1/(2(1−ɛ))) ⟨ v₁^T v₁ ⟩_x + O(N^{−1}).    (A32)

Note that

⟨ v₁^T v₁ ⟩_x = O(N^{−1}).    (A33)

Now, we have

I_u ≥ (1/2) ⟨ ln det[ (1−ɛ) G(x) / (2π) ] ⟩_x + H(X) + O(N^{−1}).    (A34)

Since ɛ is arbitrary, we can let it go to zero. Therefore, from Equations (25), (A29) and (A34), we obtain the upper bound in Equation (35).

The Taylor expansion of h(x̂, x) = ⟨ ( p(r|x̂) / p(r|x) )^β ⟩_{r|x} around x̂ = x is

h(x̂, x) = 1 + β ⟨ (1/p(r|x)) ∂p(r|x)/∂x ⟩_{r|x}^T (x̂ − x) + (β/2) (x̂ − x)^T ⟨ (1/p(r|x)) ∂²p(r|x)/∂x∂x^T ⟩_{r|x} (x̂ − x)
  − (β(1−β)/2) (x̂ − x)^T ⟨ (1/p(r|x))² ( ∂p(r|x)/∂x )( ∂p(r|x)/∂x )^T ⟩_{r|x} (x̂ − x) + ⋯
  = 1 − (β(1−β)/2) (x̂ − x)^T J(x) (x̂ − x) + ⋯.    (A35)

In a similar manner as described above, we obtain the asymptotic relationship (37):

I_{β,α} = I_γ + O(N^{−1}) = (1/2) ∫_X p(x) ln det( G_γ(x) / (2π) ) dx + H(X) + O(N^{−1}).    (A36)

Notice that 0 < γ = β(1−β) ≤ 1/4, and the equality holds when β = β̄ = 1/2. Thus, we have

det G_γ(x) ≤ det G_γ̄(x).    (A37)

Combining Equations (25), (A36) and (A37) yields the lower bound in Equation (35). The proof of Equation (36) is similar. This completes the proof of Theorem 4.

References

1. Borst, A.; Theunissen, F.E. Information theory and neural coding. Nat. Neurosci. 1999, 2.
2. Pouget, A.; Dayan, P.; Zemel, R. Information processing with population codes. Nat. Rev. Neurosci. 2000, 1.
3. Laughlin, S.B.; Sejnowski, T.J. Communication in neuronal networks. Science 2003, 301.
4. Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple neural spike train data analysis: State-of-the-art and future challenges. Nat. Neurosci. 2004, 7.
5. Bell, A.J.; Sejnowski, T.J. The independent components of natural scenes are edge filters. Vision Res. 1997, 37.
6. Huang, W.; Zhang, K. An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, April 2017.

7. Huang, W.; Huang, X.; Zhang, K. Information-theoretic interpretation of tuning curves for multiple motion directions. In Proceedings of the 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 2017.
8. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: New York, NY, USA, 2006.
9. Miller, G.A. Note on the bias of information estimates. Inf. Theory Psychol. Probl. Methods 1955, 2.
10. Carlton, A. On the bias of information estimates. Psychol. Bull. 1969, 71.
11. Treves, A.; Panzeri, S. The upward bias in measures of information derived from limited data samples. Neural Comput. 1995, 7.
12. Victor, J.D. Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Comput. 2000, 12.
13. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15.
14. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69.
15. Khan, S.; Bandyopadhyay, S.; Ganguly, A.R.; Saigal, S.; Erickson, D.J., III; Protopopescu, V.; Ostrouchov, G. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E 2007, 76.
16. Safaai, H.; Onken, A.; Harvey, C.D.; Panzeri, S. Information estimation using nonparametric copulas. Phys. Rev. E 2018, 98.
17. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37.
18. Van Trees, H.L.; Bell, K.L. Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking; John Wiley: Piscataway, NJ, USA, 2007.
19. Clarke, B.S.; Barron, A.R. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 1990, 36.
20. Rissanen, J.J. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory 1996, 42.
21. Brunel, N.; Nadal, J.P. Mutual information, Fisher information, and population coding. Neural Comput. 1998, 10.
22. Sompolinsky, H.; Yoon, H.; Kang, K.J.; Shamir, M. Population coding in neuronal systems with correlated noise. Phys. Rev. E 2001, 64.
23. Kang, K.; Sompolinsky, H. Mutual information of population codes and distance measures in probability space. Phys. Rev. Lett. 2001, 86.
24. Huang, W.; Zhang, K. Information-theoretic bounds and approximations in neural population coding. Neural Comput. 2018, 30.
25. Strong, S.P.; Koberle, R.; de Ruyter van Steveninck, R.R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80.
26. Nemenman, I.; Bialek, W.; de Ruyter van Steveninck, R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E 2004, 69.
27. Panzeri, S.; Senatore, R.; Montemurro, M.A.; Petersen, R.S. Correcting for the sampling bias problem in spike train information measures. J. Neurophysiol. 2007, 98.
28. Houghton, C. Calculating the mutual information between two spike trains. Neural Comput. 2019, 31.
29. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1961.
30. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23.
31. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35.
32. Beran, R. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 1977, 5.


More information

What can spike train distances tell us about the neural code?

What can spike train distances tell us about the neural code? What can spike train distances tell us about the neural code? Daniel Chicharro a,b,1,, Thomas Kreuz b,c, Ralph G. Andrzejak a a Dept. of Information and Communication Technologies, Universitat Pompeu Fabra,

More information

Mathematical Tools for Neuroscience (NEU 314) Princeton University, Spring 2016 Jonathan Pillow. Homework 8: Logistic Regression & Information Theory

Mathematical Tools for Neuroscience (NEU 314) Princeton University, Spring 2016 Jonathan Pillow. Homework 8: Logistic Regression & Information Theory Mathematical Tools for Neuroscience (NEU 34) Princeton University, Spring 206 Jonathan Pillow Homework 8: Logistic Regression & Information Theory Due: Tuesday, April 26, 9:59am Optimization Toolbox One

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

THE retina in general consists of three layers: photoreceptors

THE retina in general consists of three layers: photoreceptors CS229 MACHINE LEARNING, STANFORD UNIVERSITY, DECEMBER 2016 1 Models of Neuron Coding in Retinal Ganglion Cells and Clustering by Receptive Field Kevin Fegelis, SUID: 005996192, Claire Hebert, SUID: 006122438,

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Kullback-Leibler Designs

Kullback-Leibler Designs Kullback-Leibler Designs Astrid JOURDAN Jessica FRANCO Contents Contents Introduction Kullback-Leibler divergence Estimation by a Monte-Carlo method Design comparison Conclusion 2 Introduction Computer

More information

Feature selection. c Victor Kitov August Summer school on Machine Learning in High Energy Physics in partnership with

Feature selection. c Victor Kitov August Summer school on Machine Learning in High Energy Physics in partnership with Feature selection c Victor Kitov v.v.kitov@yandex.ru Summer school on Machine Learning in High Energy Physics in partnership with August 2015 1/38 Feature selection Feature selection is a process of selecting

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB

More information

The Method of Types and Its Application to Information Hiding

The Method of Types and Its Application to Information Hiding The Method of Types and Its Application to Information Hiding Pierre Moulin University of Illinois at Urbana-Champaign www.ifp.uiuc.edu/ moulin/talks/eusipco05-slides.pdf EUSIPCO Antalya, September 7,

More information

On the Chi square and higher-order Chi distances for approximating f-divergences

On the Chi square and higher-order Chi distances for approximating f-divergences c 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc. 1/17 On the Chi square and higher-order Chi distances for approximating f-divergences Frank Nielsen 1 Richard Nock 2 www.informationgeometry.org

More information

Information Dynamics Foundations and Applications

Information Dynamics Foundations and Applications Gustavo Deco Bernd Schürmann Information Dynamics Foundations and Applications With 89 Illustrations Springer PREFACE vii CHAPTER 1 Introduction 1 CHAPTER 2 Dynamical Systems: An Overview 7 2.1 Deterministic

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Nishant Gurnani. GAN Reading Group. April 14th, / 107

Nishant Gurnani. GAN Reading Group. April 14th, / 107 Nishant Gurnani GAN Reading Group April 14th, 2017 1 / 107 Why are these Papers Important? 2 / 107 Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN,

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

How biased are maximum entropy models?

How biased are maximum entropy models? How biased are maximum entropy models? Jakob H. Macke Gatsby Computational Neuroscience Unit University College London, UK jakob@gatsby.ucl.ac.uk Iain Murray School of Informatics University of Edinburgh,

More information

THE information capacity is one of the most important

THE information capacity is one of the most important 256 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 1, JANUARY 1998 Capacity of Two-Layer Feedforward Neural Networks with Binary Weights Chuanyi Ji, Member, IEEE, Demetri Psaltis, Senior Member,

More information

Neural variability and Poisson statistics

Neural variability and Poisson statistics Neural variability and Poisson statistics January 15, 2014 1 Introduction We are in the process of deriving the Hodgkin-Huxley model. That model describes how an action potential is generated by ion specic

More information

Mismatched Estimation in Large Linear Systems

Mismatched Estimation in Large Linear Systems Mismatched Estimation in Large Linear Systems Yanting Ma, Dror Baron, Ahmad Beirami Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 7695, USA Department

More information

Homework 1 Due: Wednesday, September 28, 2016

Homework 1 Due: Wednesday, September 28, 2016 0-704 Information Processing and Learning Fall 06 Homework Due: Wednesday, September 8, 06 Notes: For positive integers k, [k] := {,..., k} denotes te set of te first k positive integers. Wen p and Y q

More information

AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET. Questions AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET

AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET. Questions AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET The Problem Identification of Linear and onlinear Dynamical Systems Theme : Curve Fitting Division of Automatic Control Linköping University Sweden Data from Gripen Questions How do the control surface

More information

On Learning µ-perceptron Networks with Binary Weights

On Learning µ-perceptron Networks with Binary Weights On Learning µ-perceptron Networks with Binary Weights Mostefa Golea Ottawa-Carleton Institute for Physics University of Ottawa Ottawa, Ont., Canada K1N 6N5 050287@acadvm1.uottawa.ca Mario Marchand Ottawa-Carleton

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#8:(November-08-2010) Cancer and Signals Outline 1 Bayesian Interpretation of Probabilities Information Theory Outline Bayesian

More information

Machine Learning Basics: Maximum Likelihood Estimation

Machine Learning Basics: Maximum Likelihood Estimation Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning

More information

Finding a Needle in a Haystack: Conditions for Reliable. Detection in the Presence of Clutter

Finding a Needle in a Haystack: Conditions for Reliable. Detection in the Presence of Clutter Finding a eedle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter Bruno Jedynak and Damianos Karakos October 23, 2006 Abstract We study conditions for the detection of an -length

More information

Estimation of Rényi Information Divergence via Pruned Minimal Spanning Trees 1

Estimation of Rényi Information Divergence via Pruned Minimal Spanning Trees 1 Estimation of Rényi Information Divergence via Pruned Minimal Spanning Trees Alfred Hero Dept. of EECS, The university of Michigan, Ann Arbor, MI 489-, USA Email: hero@eecs.umich.edu Olivier J.J. Michel

More information

Convexity/Concavity of Renyi Entropy and α-mutual Information

Convexity/Concavity of Renyi Entropy and α-mutual Information Convexity/Concavity of Renyi Entropy and -Mutual Information Siu-Wai Ho Institute for Telecommunications Research University of South Australia Adelaide, SA 5095, Australia Email: siuwai.ho@unisa.edu.au

More information

Quantitative Biology II Lecture 4: Variational Methods

Quantitative Biology II Lecture 4: Variational Methods 10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate

More information

Upper Bounds on the Capacity of Binary Intermittent Communication

Upper Bounds on the Capacity of Binary Intermittent Communication Upper Bounds on the Capacity of Binary Intermittent Communication Mostafa Khoshnevisan and J. Nicholas Laneman Department of Electrical Engineering University of Notre Dame Notre Dame, Indiana 46556 Email:{mhoshne,

More information

EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE FILTER

EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE FILTER EVALUATING SYMMETRIC INFORMATION GAP BETWEEN DYNAMICAL SYSTEMS USING PARTICLE FILTER Zhen Zhen 1, Jun Young Lee 2, and Abdus Saboor 3 1 Mingde College, Guizhou University, China zhenz2000@21cn.com 2 Department

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

MA22S3 Summary Sheet: Ordinary Differential Equations

MA22S3 Summary Sheet: Ordinary Differential Equations MA22S3 Summary Sheet: Ordinary Differential Equations December 14, 2017 Kreyszig s textbook is a suitable guide for this part of the module. Contents 1 Terminology 1 2 First order separable 2 2.1 Separable

More information

Tufts COMP 135: Introduction to Machine Learning

Tufts COMP 135: Introduction to Machine Learning Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)

More information

Efficient Coding. Odelia Schwartz 2017

Efficient Coding. Odelia Schwartz 2017 Efficient Coding Odelia Schwartz 2017 1 Levels of modeling Descriptive (what) Mechanistic (how) Interpretive (why) 2 Levels of modeling Fitting a receptive field model to experimental data (e.g., using

More information

The Robustness of Stochastic Switching Networks

The Robustness of Stochastic Switching Networks The Robustness of Stochastic Switching Networks Po-Ling Loh Department of Mathematics California Institute of Technology Pasadena, CA 95 Email: loh@caltech.edu Hongchao Zhou Department of Electrical Engineering

More information

Optimal Mean-Square Noise Benefits in Quantizer-Array Linear Estimation Ashok Patel and Bart Kosko

Optimal Mean-Square Noise Benefits in Quantizer-Array Linear Estimation Ashok Patel and Bart Kosko IEEE SIGNAL PROCESSING LETTERS, VOL. 17, NO. 12, DECEMBER 2010 1005 Optimal Mean-Square Noise Benefits in Quantizer-Array Linear Estimation Ashok Patel and Bart Kosko Abstract A new theorem shows that

More information

Numerical Analysis: Interpolation Part 1

Numerical Analysis: Interpolation Part 1 Numerical Analysis: Interpolation Part 1 Computer Science, Ben-Gurion University (slides based mostly on Prof. Ben-Shahar s notes) 2018/2019, Fall Semester BGU CS Interpolation (ver. 1.00) AY 2018/2019,

More information

A Modification of Linfoot s Informational Correlation Coefficient

A Modification of Linfoot s Informational Correlation Coefficient Austrian Journal of Statistics April 07, Volume 46, 99 05. AJS http://www.ajs.or.at/ doi:0.773/ajs.v46i3-4.675 A Modification of Linfoot s Informational Correlation Coefficient Georgy Shevlyakov Peter

More information