Approximations of Shannon Mutual Information for Discrete Variables with Applications to Neural Population Coding
Article

Approximations of Shannon Mutual Information for Discrete Variables with Applications to Neural Population Coding

Wentao Huang 1,2,* and Kechen Zhang 2,*

1 Key Laboratory of Cognition and Intelligence and Information Science Academy of China Electronics Technology Group Corporation, Beijing, China
2 Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
* Correspondence: wnthuang@gmail.com (W.H.); kzhang4@jhmi.edu (K.Z.)

Abstract: Although Shannon mutual information has been widely used, its effective calculation is often difficult for many practical problems, including those in neural population coding. Asymptotic formulas based on Fisher information sometimes provide accurate approximations to the mutual information, but this approach is restricted to continuous variables because the calculation of Fisher information requires derivatives with respect to the encoded variables. In this paper, we consider information-theoretic bounds and approximations of the mutual information based on Kullback–Leibler divergence and Rényi divergence. We propose several information metrics to approximate Shannon mutual information in the context of neural population coding. While our asymptotic formulas all work for discrete variables, one of them has consistent performance and high accuracy regardless of whether the encoded variables are discrete or continuous. We performed numerical simulations and confirmed that our approximation formulas were highly accurate for approximating the mutual information between the stimuli and the responses of a large neural population. These approximation formulas may potentially bring convenience to the applications of information theory to many practical and theoretical problems.

Keywords: neural population coding; mutual information; Kullback–Leibler divergence; Rényi divergence; Chernoff divergence; approximation; discrete variables
© 2019 by the authors. Distributed under a Creative Commons CC BY license.

1. Introduction

Information theory is a powerful tool widely used in many disciplines, including, for example, neuroscience, machine learning, and communication technology [1–7]. As it is often notoriously difficult to effectively calculate Shannon mutual information in many practical applications [8], various approximation methods have been proposed to estimate the mutual information, such as those based on asymptotic expansion [9–13], k-nearest neighbors [14], and minimal spanning trees [15]. Recently, Safaai et al. proposed a copula method for the estimation of mutual information, which can be nonparametric and potentially robust [16]. Another approach to estimating the mutual information is to simplify the calculations by approximations based on information-theoretic bounds, such as the Cramér–Rao lower bound [17] and the van Trees Bayesian Cramér–Rao bound [18]. In this paper, we focus on mutual information estimation based on asymptotic approximations [19–24]. For the encoding of continuous variables, asymptotic relations between mutual information and Fisher information have been presented by several researchers [19–22]. Recently, Huang and Zhang [24] proposed an improved approximation formula, which remains accurate for high-dimensional variables. A significant advantage of this approach is that asymptotic approximations are sometimes very useful in analytical studies. For instance, asymptotic approximations allow us to prove that the
optimal neural population distribution that maximizes the mutual information between stimulus and response can be solved by convex optimization [24]. Unfortunately, this approach does not generalize to discrete variables, since the calculation of Fisher information requires partial derivatives of the likelihood function with respect to the encoded variables. For the encoding of discrete variables, Kang and Sompolinsky [23] presented an asymptotic relationship between mutual information and Chernoff information for statistically independent neurons in a large population. However, Chernoff information is still hard to calculate in many practical applications. Discrete stimuli or variables occur naturally in sensory coding. While some stimuli are continuous (e.g., the direction of movement and the pitch of a tone), others are discrete (e.g., the identities of faces and the words in human speech). For definiteness, in this paper, we frame our questions in the context of neural population coding; that is, we assume that the stimuli or input variables are encoded by the pattern of responses elicited from a large population of neurons. The concrete examples used in our numerical simulations were based on a Poisson spike model, where the response of each neuron is taken as the spike count within a given time window. While this simple Poisson model allowed us to consider a large neural population, it captured only the spike rate and not any temporal structure of the spike trains [25–28]. Nonetheless, our mathematical results are quite general and should be applicable to other input–output systems under suitable conditions to be discussed later. In the following, we first derive several upper and lower bounds on Shannon mutual information using Kullback–Leibler divergence and Rényi divergence. Next, we derive several new approximation formulas for Shannon mutual information in the limit of large population size.
These formulas are more convenient to calculate than the mutual information itself in our examples. Finally, we confirm the validity of our approximation formulas using the true mutual information as evaluated by Monte Carlo simulations.

2. Theory and Methods

2.1. Notations and Definitions

Suppose the input x is a K-dimensional vector, $x = (x_1, \ldots, x_K)^T$, which could be interpreted as the parameters that specify a stimulus for a sensory system, and the output r is an N-dimensional vector, $r = (r_1, \ldots, r_N)^T$, which could be interpreted as the responses of N neurons. We assume N is large, generally $N \gg K$. We denote random variables by upper-case letters, e.g., random variables X and R, in contrast to their vector values x and r. The mutual information between X and R is defined by

$$ I = I(X;R) = \left\langle \ln \frac{p(r|x)}{p(r)} \right\rangle_{r,x}, \quad (1) $$

where $x \in \mathcal{X} \subseteq \mathbb{R}^K$, $r \in \mathcal{R} \subseteq \mathbb{R}^N$, and $\langle \cdot \rangle_{r,x}$ denotes the expectation with respect to the probability density function $p(r,x)$. Similarly, in the following, we use $\langle \cdot \rangle_{r|x}$ and $\langle \cdot \rangle_{x}$ to denote expectations with respect to $p(r|x)$ and $p(x)$, respectively. If $p(x)$ and $p(r|x)$ are twice continuously differentiable for almost every $x \in \mathcal{X}$, then for large N we can use an asymptotic formula to approximate the true value of I with high accuracy [24]:

$$ I \approx I_G = \frac{1}{2} \left\langle \ln \det \frac{G(x)}{2\pi e} \right\rangle_x + H(X), \quad (2) $$

which is sometimes reduced to

$$ I \approx I_F = \frac{1}{2} \left\langle \ln \det \frac{J(x)}{2\pi e} \right\rangle_x + H(X), \quad (3) $$
where det denotes the matrix determinant, $H(X) = -\langle \ln p(x) \rangle_x$ is the stimulus entropy, and

$$ G(x) = J(x) + P(x), \quad (4) $$

$$ P(x) = -\frac{\partial^2 \ln p(x)}{\partial x \, \partial x^T}, \quad (5) $$

$$ J(x) = \left\langle -\frac{\partial^2 \ln p(r|x)}{\partial x \, \partial x^T} \right\rangle_{r|x} = \left\langle \frac{\partial \ln p(r|x)}{\partial x} \frac{\partial \ln p(r|x)}{\partial x^T} \right\rangle_{r|x}, \quad (6) $$

where $J(x)$ is the Fisher information matrix. We denote the Kullback–Leibler divergence as

$$ D(x \| \hat{x}) = \left\langle \ln \frac{p(r|x)}{p(r|\hat{x})} \right\rangle_{r|x}, \quad (7) $$

and denote the Rényi divergence [29] as

$$ D_\beta(x \| \hat{x}) = -\frac{1}{\beta} \ln \left\langle \left( \frac{p(r|\hat{x})}{p(r|x)} \right)^{\beta} \right\rangle_{r|x}. \quad (8) $$

Here, $\beta D_\beta(x\|\hat{x})$ is equivalent to the Chernoff divergence [30]. It is well known that $D_\beta(x\|\hat{x}) \to D(x\|\hat{x})$ in the limit $\beta \to 0$. We define

$$ I_u = -\left\langle \ln \left\langle \exp\left(-D(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x, \quad (9) $$

$$ I_e = -\left\langle \ln \left\langle \exp\left(-e^{-1} D(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x, \quad (10) $$

$$ I_{\beta,\alpha} = -\left\langle \ln \left\langle \exp\left(-\beta D_\beta(x\|\hat{x}) + (1-\alpha) \ln \frac{p(x)}{p(\hat{x})}\right) \right\rangle_{\hat{x}} \right\rangle_x, \quad (11) $$

where in $I_{\beta,\alpha}$ we have $\beta \in (0,1)$ and $\alpha \in [0,1]$, and assume $p(x) > 0$ for all $x \in \mathcal{X}$. In the following, we suppose x takes discrete values $x_m$, $m \in \mathcal{M} = \{1, 2, \ldots, M\}$, and $p(x_m) > 0$ for all m. Now, the definitions in Equations (9)–(11) become

$$ I_u = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}=1}^{M} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-D(x_m \| x_{\hat{m}})\right) + H(X), \quad (12) $$

$$ I_e = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}=1}^{M} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) + H(X), \quad (13) $$

$$ I_{\beta,\alpha} = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}=1}^{M} \left( \frac{p(x_{\hat{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\hat{m}})\right) + H(X). \quad (14) $$
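To make these definitions concrete, the quantities in Equations (1), (7), (8), (12) and (13) can all be evaluated directly for a small discrete system. The following sketch is ours, not the paper's: the function names and the toy two-stimulus channel are invented for illustration, and probability distributions are represented as arrays over discrete response states.

```python
import numpy as np

def mutual_information(p_x, p_r_given_x):
    """Exact I(X;R) of Equation (1) in nats, for discrete x and r.
    p_x has shape (M,); p_r_given_x has shape (M, n_response_states)."""
    p_r = p_x @ p_r_given_x                 # marginal p(r), a Bayes mixture
    joint = p_x[:, None] * p_r_given_x      # p(x, r)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.log(p_r_given_x) - np.log(p_r)[None, :]
        return float(np.where(joint > 0, joint * log_ratio, 0.0).sum())

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) of Equation (7), in nats."""
    m = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[m] * np.log(p[m] / q[m])))

def renyi(p, q, beta):
    """D_beta(p||q) = -(1/beta) ln <(q(r)/p(r))^beta>_p, cf. Equation (8);
    it approaches kl(p, q) in the limit beta -> 0."""
    m = p > 0
    return float(-np.log(np.sum(p[m] * (q[m] / p[m]) ** beta)) / beta)

def discrete_I_u_I_e(p_x, p_r_given_x):
    """Upper bound I_u (Equation (12)) and approximation I_e (Equation (13))
    computed from the matrix of pairwise KL divergences D(x_m || x_mhat)."""
    M = len(p_x)
    D = np.array([[kl(p_r_given_x[m], p_r_given_x[mh]) for mh in range(M)]
                  for m in range(M)])
    H = -float(np.sum(p_x * np.log(p_x)))   # stimulus entropy H(X)
    w = p_x[None, :] / p_x[:, None]         # ratios p(x_mhat) / p(x_m)
    I_u = -float(np.sum(p_x * np.log((w * np.exp(-D)).sum(axis=1)))) + H
    I_e = -float(np.sum(p_x * np.log((w * np.exp(-np.exp(-1.0) * D)).sum(axis=1)))) + H
    return I_u, I_e
```

For the toy channel $p(x) = (0.5, 0.5)$, $p(r|x_1) = (0.9, 0.1)$, $p(r|x_2) = (0.2, 0.8)$, the exact value $I \approx 0.275$ nats indeed lies below $I_u \approx 0.441$, while $I_e \approx 0.204$ gives a first rough approximation; the asymptotic results in this section concern the regime of large populations, where such discrepancies vanish.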
Furthermore, we define

$$ I_d = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) \right) + H(X), \quad (15) $$

$$ I_u^d = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-D(x_m \| x_{\hat{m}})\right) \right) + H(X), \quad (16) $$

$$ I_{\beta,\alpha}^d = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^\beta} \left( \frac{p(x_{\hat{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\hat{m}})\right) \right) + H(X), \quad (17) $$

$$ I_D = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) \right) + H(X), \quad (18) $$

where

$$ \check{\Omega}_m^\beta = \{ \hat{m} : \hat{m} = \arg\min_{\check{m} \notin \hat{\Omega}_m^\beta} D_\beta(x_m \| x_{\check{m}}) \}, \quad (19) $$

$$ \check{\Omega}_m^u = \{ \hat{m} : \hat{m} = \arg\min_{\check{m} \notin \hat{\Omega}_m^u} D(x_m \| x_{\check{m}}) \}, \quad (20) $$

$$ \hat{\Omega}_m^\beta = \{ \hat{m} : D_\beta(x_m \| x_{\hat{m}}) = 0 \}, \quad (21) $$

$$ \hat{\Omega}_m^u = \{ \hat{m} : D(x_m \| x_{\hat{m}}) = 0 \}, \quad (22) $$

$$ \Omega_m^\beta = \left( \check{\Omega}_m^\beta \cup \hat{\Omega}_m^\beta \right) \setminus \{m\}, \quad (23) $$

$$ \Omega_m^u = \left( \check{\Omega}_m^u \cup \hat{\Omega}_m^u \right) \setminus \{m\}. \quad (24) $$

Here, notice that, if x is uniformly distributed, then by definition I_d and I_D become identical. The elements in the set $\check{\Omega}_m^\beta$ are those that make $D_\beta(x_m \| x_{\check{m}})$ take its minimum value, excluding any element that satisfies the condition $D_\beta(x_m \| x_{\hat{m}}) = 0$. Similarly, the elements in the set $\check{\Omega}_m^u$ are those that minimize $D(x_m \| x_{\check{m}})$, excluding the ones that satisfy the condition $D(x_m \| x_{\hat{m}}) = 0$.

2.2. Theorems

In the following, we state several conclusions as theorems and prove them in the Appendix.

Theorem 1. The mutual information I is bounded as follows:

$$ I_{\beta,\alpha} \le I \le I_u. \quad (25) $$

Theorem 2. The following inequalities are satisfied:

$$ I_{\beta_1,1} \le I_e \le I_u, \quad (26) $$

where $I_{\beta_1,1}$ is a special case of $I_{\beta,\alpha}$ in Equation (11) with $\beta_1 = e^{-1}$ and $\alpha = 1$, so that

$$ I_{\beta_1,1} = -\left\langle \ln \left\langle \exp\left(-\beta_1 D_{\beta_1}(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x. \quad (27) $$

Theorem 3. If there exist $\gamma_1 > 0$ and $\gamma_2 > 0$ such that

$$ \beta D_\beta(x_m \| x_{m_1}) \ge \gamma_1 \ln N, \quad (28) $$

$$ D(x_m \| x_{m_2}) \ge \gamma_2 \ln N, \quad (29) $$
for discrete stimuli $x_m$, where $m \in \mathcal{M}$, $m_1 \notin \Omega_m^\beta$ and $m_2 \notin \Omega_m^u$, then we have the following asymptotic relationships:

$$ I_{\beta,\alpha} = I_{\beta,\alpha}^d + O(N^{-\gamma_1}) \le I \le I_u = I_u^d + O(N^{-\gamma_2}) \quad (30) $$

and

$$ I_e = I_d + O(N^{-\gamma_2/e}). \quad (31) $$

Theorem 4. Suppose $p(x)$ and $p(r|x)$ are twice continuously differentiable for $x \in \mathcal{X}$, $\|q'(x)\| < \infty$ and $\|q''(x)\| < \infty$, where $q(x) = \ln p(x)$ and $'$ and $''$ denote the partial derivatives $\partial/\partial x$ and $\partial^2/\partial x \, \partial x^T$, and $G_\gamma(x)$ is positive definite with $\|N G^{-1}(x)\| = O(1)$, where $\|\cdot\|$ denotes the matrix Frobenius norm, $\gamma = \beta(1-\beta)$ and $\beta \in (0,1)$, with

$$ G_\gamma(x) = \gamma J(x) + P(x). \quad (32) $$

If there exists an $\omega = \omega(x) > 0$ such that

$$ \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(x)} p(\hat{x}) \exp\left(-D(x\|\hat{x})\right) d\hat{x} = O(N^{-1}), \quad (33) $$

$$ \det(G_\gamma(x))^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(x)} p(\hat{x}) \exp\left(-\beta D_\beta(x\|\hat{x})\right) d\hat{x} = O(N^{-1}), \quad (34) $$

for all $x \in \mathcal{X}$ and $\varepsilon \in (0, \omega)$, where $\bar{\mathcal{X}}_\omega(x) = \mathcal{X} \setminus \mathcal{X}_\omega(x)$ is the complementary set of $\mathcal{X}_\omega(x) = \{ \hat{x} \in \mathbb{R}^K : (\hat{x}-x)^T G(x) (\hat{x}-x) < N\omega^2 \}$, then we have the following asymptotic relationships:

$$ I_{\beta,\alpha} \le I_\gamma + O(N^{-1}) \le I \le I_u = I_G + K/2 + O(N^{-1}), \quad (35) $$

$$ I_e = I_G + O(N^{-1}), \quad (36) $$

$$ I_{\beta,\alpha} = I_\gamma + O(N^{-1}), \quad (37) $$

where

$$ I_\gamma = \frac{1}{2} \int_{\mathcal{X}} p(x) \ln \det \frac{G_\gamma(x)}{2\pi} \, dx + H(X) \quad (38) $$

and $\gamma = \beta(1-\beta) = 1/4$ with $\beta = 1/2$ in Equation (37).

Remark 1. We see from Theorems 1 and 2 that the true mutual information I and the approximation I_e both lie between $I_{\beta_1,1}$ and $I_u$, which implies that their values may be close to each other. For a discrete variable x, Theorem 3 tells us that I_e and I_d are asymptotically equivalent, i.e., their difference vanishes in the limit of large N. For a continuous variable x, Theorem 4 tells us that I_e and I_G are asymptotically equivalent in the limit of large N, which means that I_e and I are also asymptotically equivalent, because I_G and I are known to be asymptotically equivalent [24].

Remark 2. To see how the condition in Equation (33) could be satisfied, consider the case where $D(x\|\hat{x})$ has only one extreme point at $\hat{x} = x$ for $\hat{x} \in \mathcal{X}_\omega(x)$ and there exists an $\eta > 0$ such that $N^{-1} D(x\|\hat{x}) \ge \eta$ for $\hat{x} \in \bar{\mathcal{X}}_\omega(x)$. Now, the condition in Equation (33) is satisfied because

$$ \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(x)} p(\hat{x}) \exp\left(-D(x\|\hat{x})\right) d\hat{x} \le \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_\varepsilon(x)} p(\hat{x}) \exp\left(-\hat{\eta}_\varepsilon N\right) d\hat{x} = O\left(N^{K/2} e^{-\hat{\eta}_\varepsilon N}\right), \quad (39) $$
where by assumption we can find an $\hat{\eta}_\varepsilon > 0$ for any given $\varepsilon \in (0, \omega)$. The condition in Equation (34) can be satisfied in a similar way.

When $\beta = 1/2$, $\beta D_\beta(x\|\hat{x})$ is the Bhattacharyya distance [31]:

$$ \beta D_\beta(x\|\hat{x}) = -\ln \int_{\mathcal{R}} \sqrt{p(r|x)\, p(r|\hat{x})} \, dr, \quad (40) $$

and we have

$$ J(x) = \left. \frac{\partial^2 D(x\|\hat{x})}{\partial \hat{x}\, \partial \hat{x}^T} \right|_{\hat{x}=x} = 4 \left. \frac{\partial^2 \beta D_\beta(x\|\hat{x})}{\partial \hat{x}\, \partial \hat{x}^T} \right|_{\hat{x}=x} = 4 \left. \frac{\partial^2 H_l^2(x\|\hat{x})}{\partial \hat{x}\, \partial \hat{x}^T} \right|_{\hat{x}=x}, \quad (41) $$

where $H_l(x\|\hat{x})$ is the Hellinger distance [32] between $p(r|x)$ and $p(r|\hat{x})$:

$$ H_l^2(x\|\hat{x}) = \frac{1}{2} \int_{\mathcal{R}} \left( \sqrt{p(r|x)} - \sqrt{p(r|\hat{x})} \right)^2 dr. \quad (42) $$

By Jensen's inequality, for $\beta \in (0,1)$ we get

$$ D_\beta(x\|\hat{x}) \le D(x\|\hat{x}). \quad (43) $$

Denoting the Chernoff information [8] as

$$ C(x\|\hat{x}) = \max_{\beta \in (0,1)} \beta D_\beta(x\|\hat{x}) = \beta_m D_{\beta_m}(x\|\hat{x}), \quad (44) $$

where $\beta D_\beta(x\|\hat{x})$ achieves its maximum at $\beta_m$, we define

$$ h_c = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-C(x_m \| x_{\hat{m}})\right), \quad (45) $$

$$ h_d = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\hat{m}} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-\beta_m D_{\beta_m}(x_m \| x_{\hat{m}})\right). \quad (46) $$

By Theorem 4,

$$ \max_{\beta \in (0,1)} I_{\beta,\alpha} = I_\gamma + O(N^{-1}), \quad (47) $$

where $\gamma = \beta(1-\beta) = 1/4$ with $\beta = 1/2$, and

$$ I_\gamma = I_G - \frac{K}{2} \ln \frac{4}{e}. \quad (48) $$

If $\beta_m = 1/2$, then, by Equations (45)–(48), we have

$$ \max_{\beta \in (0,1)} I_{\beta,\alpha} + \frac{K}{2} \ln \frac{4}{e} + O(N^{-1}) \le I_e = I + O(N^{-1}) \le h_d + H(X) \le I_u. \quad (49) $$

Therefore, from Equations (45), (46) and (49), we can see that I_e and I are close to $h_c + H(X)$.

2.3. Approximations for Mutual Information

In this section, we use the relationships described above to find effective approximations to the true mutual information I in the case of large but finite N. First, Theorems 1 and 2 tell us that the true mutual information I and its approximation I_e lie between lower and upper bounds given by:
$I_{\beta,\alpha} \le I \le I_u$ and $I_{\beta_1,1} \le I_e \le I_u$. As a special case, I is also bounded by $I_{\beta_1,1} \le I \le I_u$. Furthermore, from Equations (2) and (36) we can obtain the following asymptotic equality under suitable conditions:

$$ I = I_e + O(N^{-1}). \quad (50) $$

Hence, for continuous stimuli, we have the following approximate relationship for large N:

$$ I \approx I_e \approx I_G. \quad (51) $$

For discrete stimuli, by Equation (31), for large but finite N we have

$$ I \approx I_e \approx I_d = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \frac{p(x_{\hat{m}})}{p(x_m)} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) \right) + H(X). \quad (52) $$

Consider the special case $p(x_{\hat{m}}) \approx p(x_m)$ for $\hat{m} \in \Omega_m^u$. With the help of Equation (18), substitution of $p(x_{\hat{m}}) \approx p(x_m)$ into Equation (52) yields

$$ I \approx I_D = -\sum_{m=1}^{M} p(x_m) \ln\left( 1 + \sum_{\hat{m} \in \Omega_m^u} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) \right) + H(X) \approx \tilde{I}_D = -\sum_{m=1}^{M} p(x_m) \sum_{\hat{m} \in \Omega_m^u} \exp\left(-e^{-1} D(x_m \| x_{\hat{m}})\right) + H(X), \quad (53) $$

where $\tilde{I}_D \le I_D$ and the second approximation follows from the first-order Taylor expansion $\ln(1+z) \approx z$, assuming that the term $\sum_{\hat{m} \in \Omega_m^u} \exp(-e^{-1} D(x_m \| x_{\hat{m}}))$ is sufficiently small.

The theoretical discussion above suggests that I_e and I_d are effective approximations to the true mutual information I in the limit of large N. Moreover, we find that they are often good approximations of the mutual information I even for relatively small N, as illustrated in the following section.

3. Results of Numerical Simulations

Consider a Poisson model neuron whose responses (i.e., numbers of spikes within a given time window) follow a Poisson distribution [24]. The mean response of neuron n, with $n \in \{1, 2, \ldots, N\}$, is described by the tuning function $f(x; \theta_n)$, which takes the form of a Heaviside step function:

$$ f(x; \theta_n) = \begin{cases} A, & \text{if } x \ge \theta_n, \\ 0, & \text{if } x < \theta_n, \end{cases} \quad (54) $$

where the stimulus $x \in [-T, T]$ with T = 1, A = 1, and the centers $\theta_1, \theta_2, \ldots, \theta_N$ of the N neurons are uniformly spaced in the interval $[-T, T]$, namely, $\theta_n = (n-1)d - T$ with $d = 2T/(N-1)$ for $N \ge 2$, and $\theta_n = 0$ for $N = 1$. We suppose that the discrete stimulus x has M = 21 possible values that are evenly spaced from $-T$ to $T$, namely, $x \in \mathcal{X} = \{ x_m : x_m = 2(m-1)T/(M-1) - T, \; m = 1, 2, \ldots, M \}$.
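The population model just specified can be written down compactly. The following is our sketch (variable and function names are invented), using the text's values T = 1, A = 1, M = 21:

```python
import numpy as np

# Parameters as given in the text: stimulus range [-T, T], amplitude A,
# and M evenly spaced discrete stimulus values.
T, A, M = 1.0, 1.0, 21
x_grid = np.linspace(-T, T, M)          # x_m = 2(m-1)T/(M-1) - T

def centers(N):
    """Tuning-curve centers theta_n, uniformly spaced on [-T, T]."""
    return np.array([0.0]) if N == 1 else np.linspace(-T, T, N)

def tuning(x, theta):
    """Heaviside step tuning: mean spike count A if x >= theta_n, else 0."""
    return np.where(x >= theta, A, 0.0)

def sample_responses(x, theta, rng):
    """Poisson spike counts of all N neurons for one stimulus value x."""
    return rng.poisson(tuning(x, theta))
```

Each call to `sample_responses` draws one response vector r for a stimulus x, which is all that the Monte Carlo evaluation described in this section requires.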
Now, the Kullback–Leibler divergence can be written as

$$ D(x_m \| x_{\hat{m}}) = \sum_{n=1}^{N} \left[ f(x_m; \theta_n) \ln \frac{f(x_m; \theta_n)}{f(x_{\hat{m}}; \theta_n)} + f(x_{\hat{m}}; \theta_n) - f(x_m; \theta_n) \right]. \quad (55) $$
Thus, for each neuron n, the factor contributed to $\exp(-e^{-1} D(x_m \| x_{\hat{m}}))$ is 1 when $f(x_m; \theta_n) = f(x_{\hat{m}}; \theta_n)$, is $\exp(-e^{-1} A)$ when $f(x_m; \theta_n) = 0$ and $f(x_{\hat{m}}; \theta_n) = A$, and vanishes when $f(x_m; \theta_n) = A$ and $f(x_{\hat{m}}; \theta_n) = 0$. Therefore, in this case, we have

$$ I_e = I_d. \quad (56) $$

More generally, this equality holds true whenever the tuning function has binary values.

In the first example, as illustrated in Figure 1, we suppose the stimulus has a uniform distribution, so that the probability is given by $p(x_m) = 1/M$. Figure 1a shows graphs of the input distribution p(x) and a representative tuning function $f(x; \theta)$ with the center $\theta = 0$. To assess the accuracy of the approximation formulas, we employed Monte Carlo (MC) simulation to evaluate the mutual information I [24]. In our MC simulation, we first sampled an input $x_j \in \mathcal{X}$ from the uniform distribution $p(x_j) = 1/M$, then generated the neural responses $r_j$ by the conditional distribution $p(r_j | x_j)$ based on the Poisson model, where $j = 1, 2, \ldots, j_{\max}$. The value of the mutual information by MC simulation was calculated by

$$ I_C = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln \frac{p(r_j | x_j)}{p(r_j)}, \quad (57) $$

where

$$ p(r_j) = \sum_{m=1}^{M} p(r_j | x_m) \, p(x_m). \quad (58) $$

To assess the precision of our MC simulation, we computed the standard deviation of repeated trials by bootstrapping:

$$ I_{std} = \left( \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} \left( I_C^i - \bar{I}_C \right)^2 \right)^{1/2}, \quad (59) $$

where

$$ I_C^i = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln \frac{p(r_{\Gamma_{j,i}} | x_{\Gamma_{j,i}})}{p(r_{\Gamma_{j,i}})}, \quad (60) $$

$$ \bar{I}_C = \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} I_C^i, \quad (61) $$

and $\Gamma_{j,i} \in \{1, 2, \ldots, j_{\max}\}$ is the (j, i)-th entry of the matrix $\Gamma \in \mathbb{N}^{j_{\max} \times i_{\max}}$ with samples taken randomly from the integer set $\{1, 2, \ldots, j_{\max}\}$ by a uniform distribution. Here, we set $j_{\max} = 5 \times 10^5$ and $i_{\max} = 100$. For different $N \in \{1, 2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000\}$, we compared $I_C$ with $I_e$, $I_d$ and $I_D$, as illustrated in Figure 1b–d. Here, we define the relative error of approximation, e.g., for $I_e$, as

$$ D_{I_e} = \frac{I_e - I_C}{I_C}, \quad (62) $$

and the relative standard deviation

$$ D_{I_{std}} = \frac{I_{std}}{I_C}. \quad (63) $$
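Combining Equation (13), the Poisson divergence of Equation (55), and the Monte Carlo estimator of Equations (57) and (58) gives a compact end-to-end check. The code below is our own unoptimized sketch, uses far fewer samples than the article's $j_{\max}$, and treats a zero rate as a Poisson distribution concentrated at zero counts:

```python
import math
import numpy as np

T, A, M = 1.0, 1.0, 21
x_grid = np.linspace(-T, T, M)
p_x = np.full(M, 1.0 / M)               # uniform stimulus distribution

def rates(N):
    """(M, N) matrix of mean counts: Heaviside tuning, centers on [-T, T]."""
    theta = np.array([0.0]) if N == 1 else np.linspace(-T, T, N)
    return np.where(x_grid[:, None] >= theta[None, :], A, 0.0)

def log_pmf(r, lam):
    """ln p(r|x) for independent Poisson neurons; rate 0 forces count 0."""
    total = 0.0
    for rn, fn in zip(r, lam):
        if fn > 0.0:
            total += rn * math.log(fn) - fn - math.lgamma(rn + 1.0)
        elif rn > 0:
            return -math.inf
    return total

def kl_matrix(lam):
    """Pairwise KL of Equation (55): D[m, mh] = sum_n f ln(f/f') + f' - f."""
    Mloc, N = lam.shape
    D = np.zeros((Mloc, Mloc))
    for m in range(Mloc):
        for mh in range(Mloc):
            d = 0.0
            for n in range(N):
                f, fh = lam[m, n], lam[mh, n]
                if f > 0.0 and fh == 0.0:
                    d = math.inf          # impossible response under x_mh
                    break
                if f > 0.0:
                    d += f * math.log(f / fh) + fh - f
                else:
                    d += fh
            D[m, mh] = d
    return D

def approx_I_e(N):
    """Equation (13): the approximation I_e from pairwise KL (nats)."""
    D = kl_matrix(rates(N))
    H = -float(np.sum(p_x * np.log(p_x)))
    inner = (p_x[None, :] / p_x[:, None]) * np.exp(-math.exp(-1.0) * D)
    return -float(np.sum(p_x * np.log(inner.sum(axis=1)))) + H

def monte_carlo_I(N, n_samples=6000, seed=0):
    """Equations (57)-(58): direct Monte Carlo estimate of I (nats)."""
    lam = rates(N)
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(n_samples):
        m = rng.integers(M)
        r = rng.poisson(lam[m])
        log_cond = log_pmf(r, lam[m])
        log_marg = np.logaddexp.reduce(
            [math.log(p_x[mh]) + log_pmf(r, lam[mh]) for mh in range(M)])
        acc += log_cond - log_marg
    return acc / n_samples
```

In our quick runs with N = 6 neurons, the Monte Carlo value comes out near 0.93 nats while $I_e \approx 1.05$ nats, the kind of small-N discrepancy that, as the figures in this section show, shrinks rapidly as the population grows.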
Figure 1. A comparison of the approximations I_e, I_d and I_D against I_C obtained by the Monte Carlo method for one-dimensional discrete stimuli. (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the Heaviside step tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

For the second example, we only changed the probability distribution of the stimulus $p(x_m)$ while keeping all other conditions unchanged. Now, $p(x_m)$ is a discrete sample from a Gaussian function:

$$ p(x_m) = Z^{-1} \exp\left( -\frac{x_m^2}{2\sigma^2} \right), \quad m = 1, 2, \ldots, M, \quad (64) $$

where Z is the normalization constant and $\sigma = T/2$. The results are illustrated in Figure 2.
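A quick check of Equation (64) (our code, with σ = T/2 = 0.5 on the 21-point grid): the discrete Gaussian-like prior has an entropy below the uniform prior's log₂ 21 ≈ 4.39 bits, which lowers the ceiling that the information curves can reach.

```python
import numpy as np

T, M, sigma = 1.0, 21, 0.5
x = np.linspace(-T, T, M)
w = np.exp(-x**2 / (2.0 * sigma**2))
p = w / w.sum()                     # Equation (64): p(x_m) = Z^{-1} exp(-x_m^2 / (2 sigma^2))
H_bits = -np.sum(p * np.log2(p))    # stimulus entropy H(X) in bits
```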
Figure 2. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 1 except that the stimulus distribution p(x) is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the Heaviside step tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

Next, we changed each tuning function $f(x; \theta_n)$ to a rectified linear function:

$$ f(x; \theta_n) = \max(0, \, x - \theta_n). \quad (65) $$

Figures 3 and 4 show the results under the same conditions as Figures 1 and 2 except for the shape of the tuning functions.
Figure 3. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 1 except for the shape of the tuning function (blue dashed lines in (a)). (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the rectified linear tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

Finally, we let the tuning function $f(x; \theta_n)$ have a random form:

$$ f(x; \theta_n) = \begin{cases} B, & \text{if } x \in \Theta_n = \{ \theta_n^1, \theta_n^2, \ldots, \theta_n^K \}, \\ 0, & \text{otherwise}, \end{cases} \quad (66) $$

where the stimulus $x \in \mathcal{X} = \{1, 2, \ldots, 999, 1000\}$, B = 1, and the values $\theta_n^1, \theta_n^2, \ldots, \theta_n^K$ are distinct and randomly selected from the set $\mathcal{X}$ with K = 100. In this example, we may regard $\mathcal{X}$ as a list of natural objects (stimuli), and there are a total of N sensory neurons, each of which responds only to K randomly selected objects. Figure 5 shows the results under the condition that p(x) is a uniform distribution. In Figure 6, we assume that p(x) is not flat but a half Gaussian given by Equation (64) with σ = 5.
Figure 4. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 3 except that the stimulus distribution p(x) is peaked rather than flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the rectified linear tuning function f(x; θ) with center θ = 0 (blue dashed lines); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

In all these examples, we found that the three formulas, namely, I_e, I_d and I_D, provided excellent approximations to the true values of the mutual information as evaluated by the Monte Carlo method. For example, in the examples in Figures 1 and 5, all three approximations were practically indistinguishable. In general, all these approximations were extremely accurate when N > 100. In all our simulations, the mutual information tended to increase with the population size N, eventually reaching a plateau for large enough N. The saturation of information for large N is due to the fact that it requires at most log₂ M bits of information to completely distinguish all M stimuli. It is impossible to gain more information than this maximum amount regardless of how many neurons are used in the population. In Figure 1, for instance, this maximum is log₂ 21 ≈ 4.39 bits, and in Figure 5, this maximum is log₂ 1000 ≈ 9.97 bits.
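The two saturation levels quoted here are straightforward to verify numerically (our arithmetic check):

```python
import numpy as np

# Mutual information can never exceed the stimulus entropy, which for a
# uniform prior over M stimuli is log2(M) bits.
cap_fig1 = np.log2(21)     # Figure 1: M = 21 stimuli
cap_fig5 = np.log2(1000)   # Figure 5: 1000 objects
```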
Figure 5. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is similar to that in Figure 1 except that the tuning function is random (blue dashed lines in (a)); see Equation (66). (a) Discrete uniform distribution of the stimulus p(x) (black dots) and the random tuning function f(x; θ); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

For relatively small values of N, we found that I_D tended to be less accurate than I_e or I_d (see Figures 5 and 6). Our simulations also confirmed two analytical results. The first one is that I_d = I_D when the stimulus distribution is uniform; this result follows directly from the definitions of I_d and I_D and is confirmed by the simulations in Figures 1, 3 and 5. The second result is that I_d = I_e (Equation (56)) when the tuning function is binary, as confirmed by the simulations in Figures 1, 2, 5 and 6. When the tuning function allows many different values, I_e can be much more accurate than I_d and I_D, as shown by the simulations in Figures 3 and 4. To summarize, our best approximation formula is I_e, because it is more accurate than I_d and I_D, and, unlike I_d and I_D, it applies to both discrete and continuous stimuli (Equations (10) and (13)).
Figure 6. A comparison of the approximations I_e, I_d and I_D against I_C. The situation is identical to that in Figure 5 except that the stimulus distribution p(x) is not flat (black dots in (a)). (a) Discrete Gaussian-like distribution of the stimulus p(x) (black dots) and the random tuning function f(x; θ); (b) The values of I_C, I_e, I_d and I_D depend on the population size or total number of neurons N; (c) The relative errors D_{I_e}, D_{I_d} and D_{I_D} for the results in (b); (d) The absolute values of the relative errors D_{I_e}, D_{I_d} and D_{I_D} as in (c), with error bars showing standard deviations of repeated trials.

4. Discussion

We have derived several asymptotic bounds and effective approximations of the mutual information for discrete variables and established several relationships among the different approximations. Our final approximation formulas involve only the Kullback–Leibler divergence, which is often easier to evaluate than the Shannon mutual information in practical applications. Although in this paper our theory is developed in the framework of neural population coding with concrete examples, our mathematical results are generic and should hold true in many related situations beyond the original context. We propose to approximate the mutual information with several asymptotic formulas, including I_e in Equation (10) or Equation (13), I_d in Equation (15) and I_D in Equation (18). Our numerical results show that the three approximations I_e, I_d and I_D were very accurate for large population size N, and sometimes even for relatively small N. Among the three approximations, I_D
tended to be the least accurate, although, as a special case of I_d, it is slightly easier to evaluate than I_d. For a comparison of I_e and I_d, we note that I_e is the universal formula, whereas I_d is restricted to discrete variables. The two formulas I_e and I_d become identical when the responses or the tuning functions have only two values. For more general tuning functions, the performance of I_e was better than that of I_d in our simulations. As mentioned before, an advantage of I_e is that it works not only for discrete stimuli but also for continuous stimuli. Theoretically speaking, the formula for I_e is well justified, and we have proven that it approaches the true mutual information I in the limit of a large population. In our numerical simulations, the performance of I_e was excellent and better than that of I_d and I_D. Overall, I_e is our most accurate and versatile approximation formula, although, in some cases, I_d and I_D are slightly more convenient to calculate. The numerical examples considered in this paper were based on an independent population of neurons whose responses have Poisson statistics. Although such models are widely used, they are appropriate only if the neural responses can be well characterized by the spike counts within a fixed time window. To study the temporal patterns of spike trains, one has to consider more complicated models. Estimation of mutual information from neural spike trains is a difficult computational problem [25–28]. In future work, it would be interesting to apply asymptotic formulas such as I_e to spike trains with small time bins, each containing either one spike or nothing.
A potential advantage of the asymptotic formula is that it might help reduce the bias caused by small samples in the calculation of the response marginal distribution $p(r) = \sum_x p(r|x) p(x)$ or the response entropy H(R), because here one only needs to calculate the Kullback–Leibler divergence $D(x\|\hat{x})$, which may have a smaller estimation error. Finding effective approximation methods for computing mutual information is a key step for many practical applications of information theory. Generally speaking, the Kullback–Leibler divergence (Equation (7)) is often easier to evaluate and approximate than either the Chernoff information (Equation (44)) or the Shannon mutual information (Equation (1)). In situations where this is indeed the case, our approximation formulas are potentially useful. Besides applications in numerical simulations, the availability of a set of approximation formulas may also provide helpful theoretical tools in future analytical studies of information coding and representations. As mentioned in the Introduction, various methods have been proposed to approximate the mutual information [9–16]. In future work, it would be useful to compare different methods rigorously under identical conditions in order to assess their relative merits. The approximation formulas developed in this paper are relatively easy to compute for practical problems. They are especially suitable for analytical purposes; for example, they could be used explicitly as objective functions for optimization or learning algorithms. Although the examples used in our simulations in this paper are parametric, it should be possible to extend the formulas to nonparametric problems, possibly with the help of the copula method to take advantage of its robustness in nonparametric estimations [16].

Author Contributions: W.H. developed and proved the theorems, programmed the numerical experiments and wrote the manuscript. K.Z. verified the proofs and revised the manuscript.
Funding: This research was supported by an NIH grant (R01 DC).

Conflicts of Interest: The authors declare no conflict of interest.
Appendix A. The Proofs

Appendix A.1. Proof of Theorem 1

By Jensen's inequality, we have

$$ I_{\beta,\alpha} = -\left\langle \ln \int_{\mathcal{X}} \left\langle \frac{p^\beta(r|\hat{x}) \, p^\alpha(\hat{x})}{p^\beta(r|x) \, p^\alpha(x)} \right\rangle_{r|x} d\hat{x} \right\rangle_x + H(X) \le -\left\langle \left\langle \ln \int_{\mathcal{X}} \frac{p^\beta(r|\hat{x}) \, p^\alpha(\hat{x})}{p^\beta(r|x) \, p^\alpha(x)} \, d\hat{x} \right\rangle_{r|x} \right\rangle_x + H(X) \quad (A1) $$

and

$$ -\left\langle \left\langle \ln \int_{\mathcal{X}} \frac{p^\beta(r|\hat{x}) \, p^\alpha(\hat{x})}{p^\beta(r|x) \, p^\alpha(x)} \, d\hat{x} \right\rangle_{r|x} \right\rangle_x + H(X) - I \le \ln \left\langle \frac{p(r) \, p^{\beta-1}(r|x) \, p^{\alpha-1}(x)}{\int_{\mathcal{X}} p^\beta(r|\hat{x}) \, p^\alpha(\hat{x}) \, d\hat{x}} \right\rangle_{r,x} = \ln \int_{\mathcal{R}} \frac{p(r) \int_{\mathcal{X}} p^\beta(r|x) \, p^\alpha(x) \, dx}{\int_{\mathcal{X}} p^\beta(r|\hat{x}) \, p^\alpha(\hat{x}) \, d\hat{x}} \, dr = 0. \quad (A2) $$

Combining Equations (A1) and (A2), we immediately get the lower bound in Equation (25). In this section, we use integrals for the variable x, although our argument is valid for both continuous and discrete variables. For discrete variables, we just need to replace each integral by a summation, and our argument remains valid without other modification. The same is true for the response variable r.

To prove the upper bound, let

$$ \Phi[q(\hat{x})] = \int_{\mathcal{R}} p(r|x) \int_{\mathcal{X}} q(\hat{x}) \ln \frac{p(r|x) \, q(\hat{x})}{p(r|\hat{x}) \, p(\hat{x})} \, d\hat{x} \, dr, \quad (A3) $$

where $q(\hat{x})$ satisfies

$$ \int_{\mathcal{X}} q(\hat{x}) \, d\hat{x} = 1, \qquad q(\hat{x}) \ge 0. \quad (A4) $$

By Jensen's inequality, we get

$$ \Phi[q(\hat{x})] \ge \int_{\mathcal{R}} p(r|x) \ln \frac{p(r|x)}{p(r)} \, dr. \quad (A5) $$

To find the function $q(\hat{x})$ that minimizes $\Phi[q(\hat{x})]$, we apply the variational principle as follows:

$$ \frac{\partial \tilde{\Phi}[q(\hat{x})]}{\partial q(\hat{x})} = \int_{\mathcal{R}} p(r|x) \ln \frac{p(r|x) \, q(\hat{x})}{p(r|\hat{x}) \, p(\hat{x})} \, dr - \lambda, \quad (A6) $$

where λ is the Lagrange multiplier and

$$ \tilde{\Phi}[q(\hat{x})] = \Phi[q(\hat{x})] + \lambda \left( \int_{\mathcal{X}} q(\hat{x}) \, d\hat{x} - 1 \right). \quad (A7) $$
Setting $\partial \tilde{\Phi}[q(\hat{x})] / \partial q(\hat{x}) = 0$ and using the constraint in Equation (A4), we find the optimal solution

$$ q^*(\hat{x}) = \frac{p(\hat{x}) \exp\left(-D(x\|\hat{x})\right)}{\int_{\mathcal{X}} p(\check{x}) \exp\left(-D(x\|\check{x})\right) d\check{x}}. \quad (A8) $$

Thus, the variational lower bound of $\Phi[q(\hat{x})]$ is given by

$$ \Phi[q^*(\hat{x})] = \min_{q(\hat{x})} \Phi[q(\hat{x})] = -\ln \int_{\mathcal{X}} p(\hat{x}) \exp\left(-D(x\|\hat{x})\right) d\hat{x}. \quad (A9) $$

Therefore, from Equations (1), (A5) and (A9), we get the upper bound in Equation (25). This completes the proof of Theorem 1.

Appendix A.2. Proof of Theorem 2

It follows from Equation (43) that

$$ I_{\beta_1,\alpha_1} = -\left\langle \ln \left\langle \exp\left(-\beta_1 D_{\beta_1}(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x \le -\left\langle \ln \left\langle \exp\left(-e^{-1} D(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x = I_e \le -\left\langle \ln \left\langle \exp\left(-D(x\|\hat{x})\right) \right\rangle_{\hat{x}} \right\rangle_x = I_u, \quad (A10) $$

where $\beta_1 = e^{-1}$ and $\alpha_1 = 1$. We immediately get Equation (26). This completes the proof of Theorem 2.

Appendix A.3. Proof of Theorem 3

For the lower bound $I_{\beta,\alpha}$, we have

$$ I_{\beta,\alpha} = -\sum_{m=1}^{M} p(x_m) \ln \sum_{\check{m}} \left( \frac{p(x_{\check{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\check{m}})\right) + H(X) = -\sum_{m=1}^{M} p(x_m) \ln\left(1 + d(x_m)\right) + H(X), \quad (A11) $$

where

$$ d(x_m) = \sum_{\check{m} \notin \{m\}} \left( \frac{p(x_{\check{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\check{m}})\right). \quad (A12) $$

Now, consider

$$ \ln\left(1 + d(x_m)\right) = \ln\left(1 + a(x_m) + b(x_m)\right) = \ln\left(1 + a(x_m)\right) + \ln\left(1 + \frac{b(x_m)}{1 + a(x_m)}\right) = \ln\left(1 + a(x_m)\right) + O(N^{-\gamma_1}), \quad (A13) $$

where

$$ a(x_m) = \sum_{\hat{m} \in \Omega_m^\beta} \left( \frac{p(x_{\hat{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\hat{m}})\right), \quad (A14a) $$

$$ b(x_m) = \sum_{\check{m} \notin \Omega_m^\beta \cup \{m\}} \left( \frac{p(x_{\check{m}})}{p(x_m)} \right)^{\alpha} \exp\left(-\beta D_\beta(x_m \| x_{\check{m}})\right) \le N^{-\gamma_1} \sum_{\check{m} \notin \Omega_m^\beta} \left( \frac{p(x_{\check{m}})}{p(x_m)} \right)^{\alpha} = O(N^{-\gamma_1}). \quad (A14b) $$
Combining Equations (A11) and (A13) and Theorem 1, we get the lower bound in Equation (30). In a similar manner, we can get the upper bound in Equations (30) and (31). This completes the proof of Theorem 3.

Appendix A.4. Proof of Theorem 4

The upper bound $I_u$ for the mutual information I in Equation (25) can be written as

$$ I_u = -\int_{\mathcal{X}} \ln\left( \int_{\mathcal{X}} p(\hat{x}) \exp\left(-D(x\|\hat{x})\right) d\hat{x} \right) p(x) \, dx = -\int_{\mathcal{X}} \ln\left( \int_{\mathcal{X}} \exp\left( \left\langle L(r|\hat{x}) - L(r|x) \right\rangle_{r|x} \right) d\hat{x} \right) p(x) \, dx + H(X), \quad (A15) $$

where $L(r|\hat{x}) = \ln\left[ p(r|\hat{x}) \, p(\hat{x}) \right]$ and $L(r|x) = \ln\left[ p(r|x) \, p(x) \right]$.

Consider the Taylor expansion of $\langle L(r|\hat{x}) \rangle_{r|x}$ around x. Assuming that $L(r|\hat{x})$ is twice continuously differentiable for any $\hat{x} \in \mathcal{X}_\omega(x)$, we get

$$ \left\langle L(r|\hat{x}) - L(r|x) \right\rangle_{r|x} = y^T v_1 - \frac{1}{2} y^T y - \frac{1}{2} y^T G^{-1/2}(x) \left( G(\tilde{x}) - G(x) \right) G^{-1/2}(x) \, y, \quad (A16) $$

where

$$ y = G^{1/2}(x) \, (\hat{x} - x), \quad (A17) $$

$$ v_1 = G^{-1/2}(x) \, q'(x), \quad (A18) $$

and

$$ \tilde{x} = x + t(\hat{x} - x) \in \mathcal{X}_\omega(x), \quad t \in (0, 1). \quad (A19) $$

For later use, we also define

$$ v_0 = G^{-1/2}(x) \, l'(r|x), \quad (A20) $$

where

$$ l(r|x) = \ln p(r|x). \quad (A21) $$

Since $G(\tilde{x})$ is continuous and symmetric for $\tilde{x} \in \mathcal{X}$, for any $\epsilon \in (0,1)$, there is an $\varepsilon \in (0, \omega)$ such that

$$ \left| y^T G^{-1/2}(x) \left( G(\tilde{x}) - G(x) \right) G^{-1/2}(x) \, y \right| < \epsilon \, y^T y \quad (A22) $$

for all $y \in Y_\varepsilon$, where $Y_\varepsilon = \{ y \in \mathbb{R}^K : \|y\| < \varepsilon \sqrt{N} \}$. Then, we get

$$ \ln \int_{\mathcal{X}} \exp\left( \left\langle L(r|\hat{x}) - L(r|x) \right\rangle_{r|x} \right) d\hat{x} \ge -\frac{1}{2} \ln \det G(x) + \ln \int_{Y_\varepsilon} \exp\left( y^T v_1 - \frac{1+\epsilon}{2} \, y^T y \right) dy \quad (A23) $$

and, with Jensen's inequality,

$$ \ln \int_{Y_\varepsilon} \exp\left( y^T v_1 - \frac{1+\epsilon}{2} \, y^T y \right) dy \ge \ln \Psi_\varepsilon + \int_{\hat{Y}_\varepsilon} y^T v_1 \, \phi_\varepsilon(y) \, dy = \frac{K}{2} \ln \frac{2\pi}{1+\epsilon} + O\left(N^{K/2} e^{-N\delta}\right), \quad (A24) $$
where $\delta$ is a positive constant,

$$ \int_{\hat{Y}_{\varepsilon}} y^T v_1\, \phi_{\varepsilon}(y)\, dy = 0, \qquad \phi_{\varepsilon}(y) = \Psi_{\varepsilon}^{-1} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big), \qquad \Psi_{\varepsilon} = \int_{\hat{Y}_{\varepsilon}} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy, \quad (A25) $$

and

$$ \hat{Y}_{\varepsilon} = \Big\{ y \in \mathbb{R}^K : | y_k | < \varepsilon \sqrt{N/K},\; k = 1, 2, \ldots, K \Big\} \subseteq Y_{\varepsilon}. \quad (A26) $$

Now, we evaluate

$$ \Psi_{\varepsilon} = \int_{\mathbb{R}^K} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy - \int_{\mathbb{R}^K \setminus \hat{Y}_{\varepsilon}} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy = \Big( \frac{2\pi}{1 + \epsilon} \Big)^{K/2} - \int_{\mathbb{R}^K \setminus \hat{Y}_{\varepsilon}} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy. \quad (A27) $$

Performing integration by parts with $\int_{a}^{\infty} e^{-t^2/2}\, dt = \frac{e^{-a^2/2}}{a} - \int_{a}^{\infty} \frac{e^{-t^2/2}}{t^2}\, dt$, we find

$$ \int_{\mathbb{R}^K \setminus \hat{Y}_{\varepsilon}} \exp\!\Big( -\tfrac{1 + \epsilon}{2} y^T y \Big)\, dy = O\Big( N^{K/2} \exp\!\Big( -\frac{(1 + \epsilon)\, \varepsilon^2 N}{4K} \Big) \Big) = O\big( N^{K/2} e^{-N\delta} \big) \quad (A28) $$

for some constant $\delta > 0$. Combining Equations (A15), (A23) and (A24), we get

$$ I_u \leq \frac{1}{2} \Big\langle \ln \det\!\Big( \frac{1 + \epsilon}{2\pi} G(x) \Big) \Big\rangle_{x} + H(X) + O\big( N^{K/2} e^{-N\delta} \big). \quad (A29) $$

On the other hand, from Equation (A22) and the condition in Equation (33), we obtain

$$ \int_{X_{\varepsilon}(x)} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} \leq \det G(x)^{-1/2} \int_{\mathbb{R}^K} \exp\!\Big( y^T v_1 - \tfrac{1 - \epsilon}{2} y^T y \Big)\, dy = \det\!\Big( \frac{1 - \epsilon}{2\pi} G(x) \Big)^{-1/2} \exp\!\Big( \frac{v_1^T v_1}{2(1 - \epsilon)} \Big) \quad (A30) $$

and

$$ \int_{X} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} = \int_{X_{\varepsilon}(x)} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} + \int_{X \setminus X_{\varepsilon}(x)} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} \leq \det\!\Big( \frac{1 - \epsilon}{2\pi} G(x) \Big)^{-1/2} \exp\!\Big( \frac{v_1^T v_1}{2(1 - \epsilon)} \Big) + O\big( N^{-1} \big). \quad (A31) $$
It follows from Equations (A15) and (A31) that

$$ -\Big\langle \ln \int_{X} \exp\!\big( \langle L_r(\hat{x}) - L_r(x) \rangle_{r|x} \big)\, d\hat{x} \Big\rangle_{x} \geq \frac{1}{2} \Big\langle \ln \det\!\Big( \frac{1 - \epsilon}{2\pi} G(x) \Big) \Big\rangle_{x} - \frac{1}{2(1 - \epsilon)} \big\langle v_1^T v_1 \big\rangle_{x} + O\big( N^{-1} \big). \quad (A32) $$

Note that

$$ \big\langle v_1^T v_1 \big\rangle_{x} = O\big( N^{-1} \big). \quad (A33) $$

Now, we have

$$ I_u \geq \frac{1}{2} \Big\langle \ln \det\!\Big( \frac{1 - \epsilon}{2\pi} G(x) \Big) \Big\rangle_{x} + H(X) + O\big( N^{-1} \big). \quad (A34) $$

Since $\epsilon$ is arbitrary, we can let it go to zero. Therefore, from Equations (25), (A29) and (A34), we obtain the upper bound in Equation (35).

The Taylor expansion of $h(\hat{x}, x) = \big( p(r|\hat{x}) / p(r|x) \big)^{\beta}$ around $x$ is

$$ h(\hat{x}, x) = 1 + \frac{\beta}{p(r|x)} \nabla_x p(r|x)^T (\hat{x} - x) + \frac{\beta}{2\, p(r|x)} (\hat{x} - x)^T \nabla_x^2 p(r|x)\, (\hat{x} - x) - \frac{\beta(1 - \beta)}{2} (\hat{x} - x)^T J(x)\, (\hat{x} - x) + o\big( \| \hat{x} - x \|^2 \big). \quad (A35) $$

In a manner similar to that described above, we obtain the asymptotic relationship in Equation (37):

$$ I_{\beta, \alpha} = I_{\gamma} + O\big( N^{-1} \big) = \frac{1}{2} \int_{X} p(x) \ln \det\!\Big( \frac{G_{\gamma}(x)}{2\pi} \Big)\, dx + H(X) + O\big( N^{-1} \big). \quad (A36) $$

Notice that $0 < \gamma = \beta(1 - \beta) \leq 1/4$, and the equality holds when $\beta = \bar{\beta} = 1/2$. Thus, we have

$$ \det G_{\gamma}(x) \leq \det G_{\bar{\gamma}}(x). \quad (A37) $$

Combining Equations (25), (A36) and (A37) yields the lower bound in Equation (35). The proof of Equation (36) is similar. This completes the proof of Theorem 4.

References

1. Borst, A.; Theunissen, F.E. Information theory and neural coding. Nat. Neurosci. 1999, 2.
2. Pouget, A.; Dayan, P.; Zemel, R. Information processing with population codes. Nat. Rev. Neurosci. 2000, 1.
3. Laughlin, S.B.; Sejnowski, T.J. Communication in neuronal networks. Science 2003, 301.
4. Brown, E.N.; Kass, R.E.; Mitra, P.P. Multiple neural spike train data analysis: State-of-the-art and future challenges. Nat. Neurosci. 2004, 7.
5. Bell, A.J.; Sejnowski, T.J. The independent components of natural scenes are edge filters. Vision Res. 1997, 37.
6. Huang, W.; Zhang, K. An Information-Theoretic Framework for Fast and Robust Unsupervised Learning via Neural Population Infomax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, April 2017; arXiv preprint.
7. Huang, W.; Huang, X.; Zhang, K. Information-theoretic interpretation of tuning curves for multiple motion directions. In Proceedings of the 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 2017.
8. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: New York, NY, USA, 2006.
9. Miller, G.A. Note on the bias of information estimates. Inf. Theory Psychol. Probl. Methods 1955, 2.
10. Carlton, A. On the bias of information estimates. Psychol. Bull. 1969, 71.
11. Treves, A.; Panzeri, S. The upward bias in measures of information derived from limited data samples. Neural Comput. 1995, 7.
12. Victor, J.D. Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Comput. 2000, 12.
13. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15.
14. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69.
15. Khan, S.; Bandyopadhyay, S.; Ganguly, A.R.; Saigal, S.; Erickson, D.J., III; Protopopescu, V.; Ostrouchov, G. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E 2007, 76.
16. Safaai, H.; Onken, A.; Harvey, C.D.; Panzeri, S. Information estimation using nonparametric copulas. Phys. Rev. E 2018, 98.
17. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37.
18. Van Trees, H.L.; Bell, K.L. Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking; John Wiley: Piscataway, NJ, USA, 2007.
19. Clarke, B.S.; Barron, A.R. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 1990, 36.
20. Rissanen, J.J. Fisher information and stochastic complexity. IEEE Trans. Inform. Theory 1996, 42.
21. Brunel, N.; Nadal, J.P. Mutual information, Fisher information, and population coding. Neural Comput. 1998, 10.
22. Sompolinsky, H.; Yoon, H.; Kang, K.J.; Shamir, M. Population coding in neuronal systems with correlated noise. Phys. Rev. E 2001, 64.
23. Kang, K.; Sompolinsky, H. Mutual information of population codes and distance measures in probability space. Phys. Rev. Lett. 2001, 86.
24. Huang, W.; Zhang, K. Information-theoretic bounds and approximations in neural population coding. Neural Comput. 2018, 30.
25. Strong, S.P.; Koberle, R.; de Ruyter van Steveninck, R.R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80.
26. Nemenman, I.; Bialek, W.; de Ruyter van Steveninck, R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E 2004, 69.
27. Panzeri, S.; Senatore, R.; Montemurro, M.A.; Petersen, R.S. Correcting for the sampling bias problem in spike train information measures. J. Neurophysiol. 2007, 98.
28. Houghton, C. Calculating the Mutual Information Between Two Spike Trains. Neural Comput. 2019, 31.
29. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1961.
30. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23.
31. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35.
32. Beran, R. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 1977, 5.
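The Kullback–Leibler-divergence-based upper bound established in Theorem 1 is easy to check numerically in a small discrete example. The following sketch assumes a hypothetical two-stimulus, two-response channel (the prior, the crossover probability `e`, and all variable names are illustrative assumptions, not quantities from the text); it computes the exact Shannon mutual information $I$ and the discrete form of the bound, $I_u = -\sum_m p(x_m) \ln \sum_{\check{m}} p(x_{\check{m}}) \exp(-D(x_m \| x_{\check{m}}))$, and confirms $I \leq I_u$.

```python
import numpy as np

# Toy discrete channel (illustrative assumption): two stimuli, two responses.
# p_rx[i, j] = p(r = j | x = i); p_x is the stimulus prior.
e = 0.25  # crossover probability of the assumed channel
p_x = np.array([0.5, 0.5])
p_rx = np.array([[1.0 - e, e],
                 [e, 1.0 - e]])

# Marginal response distribution p(r) = sum_x p(x) p(r | x).
p_r = p_x @ p_rx

# Exact Shannon mutual information I(X; R) in nats.
I = sum(p_x[i] * p_rx[i, j] * np.log(p_rx[i, j] / p_r[j])
        for i in range(2) for j in range(2))

def kl(i, k):
    """KL divergence D(x_i || x_k) between the two response distributions."""
    return sum(p_rx[i, j] * np.log(p_rx[i, j] / p_rx[k, j]) for j in range(2))

# KL-divergence-based upper bound (discrete analogue of the variational bound):
# I_u = -sum_m p(x_m) ln sum_m' p(x_m') exp(-D(x_m || x_m')).
I_u = -sum(p_x[i] * np.log(sum(p_x[k] * np.exp(-kl(i, k)) for k in range(2)))
           for i in range(2))

print(f"I = {I:.6f} nats, I_u = {I_u:.6f} nats")
assert I <= I_u  # the upper bound holds
```

For this channel the bound holds with some slack ($I \approx 0.131$ nats versus $I_u \approx 0.237$ nats), and as the channel noise `e` goes to zero both quantities approach $\ln 2$, the entropy of the uniform binary prior.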
More informationNumerical Analysis: Interpolation Part 1
Numerical Analysis: Interpolation Part 1 Computer Science, Ben-Gurion University (slides based mostly on Prof. Ben-Shahar s notes) 2018/2019, Fall Semester BGU CS Interpolation (ver. 1.00) AY 2018/2019,
More informationA Modification of Linfoot s Informational Correlation Coefficient
Austrian Journal of Statistics April 07, Volume 46, 99 05. AJS http://www.ajs.or.at/ doi:0.773/ajs.v46i3-4.675 A Modification of Linfoot s Informational Correlation Coefficient Georgy Shevlyakov Peter
More information