
To appear in Neural Computation

Information-Geometric Measure for Neural Spikes

Hiroyuki Nakahara, Shun-ichi Amari
Lab. for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Saitama, Japan

Abstract
The present study introduces information-geometric measures to analyze neural firing patterns by taking not only the second-order but also higher-order interactions among neurons into account. Information geometry provides useful tools and concepts for this purpose, including the orthogonality of coordinate parameters and the Pythagoras relation in the Kullback-Leibler divergence. Based on this orthogonality, we show a novel method to analyze spike firing patterns by decomposing the interactions of neurons of various orders. As a result, purely pairwise, triplewise, and higher-order interactions are singled out. We also demonstrate the benefits of our proposal by using real neural data, recorded in the prefrontal and parietal cortices of monkeys.

1 Introduction

One of the central challenges in neuroscience is to understand what and how information is carried by a population of neural firing (Georgopoulos et al., 1986; Abeles, 1991; Aertsen and Arndt, 1993; Singer and Gray, 1995; Deadwyler and Hampson, 1997; Parker and Newsome, 1998). As a first step towards this end, many experimental studies have shown that the mean firing rate of each single neuron can be significantly modulated by experimental conditions and thereby may carry information about these experimental conditions, that is, sensory and/or motor signals. Information conveyed by a population of firing neurons, however, may not only be a sum of mean firing rates. Other statistical structures embedded in the neural firing may also carry behavioral information. In particular, growing attention has been paid to the possibility that coincident firing, correlated firing, synchronization, or specific firing patterns may alter conveyed information and/or carry significant behavioral information, whether such a possibility is supported or discarded (Gerstein et al., 1989; Engel et al., 1992; Wilson and McNaughton, 1993; Zohary et al., 1994; Vaadia et al., 1995; Nicolelis et al., 1997; Riehle et al., 1997; Lisman, 1997; Zhang et al., 1998; Maynard et al., 1999; Nadasdy et al., 1999; Kudrimoti et al., 1999; Oram et al., 1999; Nawrot et al., 1999; Baker and Lemon, 2000; Reinagel and Reid, 2000; Steinmetz et al., 2000; Salinas and Sejnowski, 2001; Oram et al., 2001). For this purpose, it is important to develop a sound statistical method for analyzing neural data. An obvious first step is to investigate a significant coincident firing between two neurons, i.e., the pairwise correlation (Perkel

et al., 1967; Palm, 1981; Gerstein and Aertsen, 1985; Palm et al., 1988; Aertsen et al., 1989; Grun, 1996; Ito and Tsuji, 2000; Pauluis and Baker, 2000; Roy et al., 2000; Grun et al., 2002a; Grun et al., 2002b; Gutig et al., 2002). In general, however, it is not sufficient to test a pairwise correlation of neural firing, because there can be triplewise and higher correlations. For example, three variables (neurons) are not independent in general even when they are pairwise independent. We need to establish a systematic method of analysis which includes these higher-order correlations (Abeles and Gerstein, 1988; Abeles et al., 1993; Martignon et al., 1995; Grun, 1996; Tetko and Villa, 1992; Victor and Purpura, 1997; Prut et al., 1998; Del Prete and Martignon, 1998; MacLeod et al., 1998; Martignon et al., 2000; Bohte et al., 2000; Roy et al., 2000). We are mostly interested in methods able to address the following issues: (1) to analyze correlated firing of neurons, including higher-order interactions, and (2) further to connect such a technique with behavioral events, for which we use mutual information between firing and behavior (Tsukada et al., 1975; Optican and Richmond, 1987; Richmond et al., 1990; McClurkin et al., 1991; Bialek et al., 1991; Gawne and Richmond, 1993; Tovee et al., 1993; Abbott et al., 1996; Rolls et al., 1997; Richmond and Gawne, 1998; Kitazawa et al., 1998; Sugase et al., 1999; Panzeri et al., 1999a; Panzeri et al., 1999b; Brenner et al., 2000; Samengo et al., 2000; Panzeri and Schultz, 2001). To address these issues, the present study uses the orthogonality of the natural and expectation parameters in the exponential family of distributions and proposes methods useful for analyzing a population of neural

firing in a systematic manner, based on information geometry (Amari, 1985; Amari and Nagaoka, 2000) and the theory of hierarchical structure (Amari, 2001). By use of the orthogonal coordinates, we will show that both hypothesis testing of neural interaction and calculation of mutual information can be drastically simplified. An extended abstract previously appeared as (Nakahara and Amari, 2002). The present paper is organized as follows. In Section 2, we briefly give our perspective on the merits of using an information-geometric measure. In Section 3, we begin with an introductory description of information geometry, using two random binary variables, and treat the application of this two-variable case to the analysis of two neurons' firing. Section 4 discusses the interaction of three binary variables and shows how to extract the pure triplewise correlation, which is different from pairwise correlation. Section 5 gives a general theory of decomposition of correlations among n variables and discusses some approaches to overcome practical difficulties that arise in this case. Section 6 gives illustrative examples. Section 7 gives the Discussion.

2 Perspective

In this section, we state our perspective on the merits of using an information-geometric measure, briefly referring to a general case of n neurons. A detailed discussion of the general case is given in Section 5. We represent a neural firing pattern by a binary random vector variable so that the probability distribution of firing (of any number of neurons) can be exactly expanded by a log-linear model. Let $X = (X_1, \dots, X_n)$

be $n$ binary variables and let $p = p(\mathbf{x})$, $\mathbf{x} = (x_1, \dots, x_n)$, $x_i = 0, 1$, be its probability, where we assume $p(\mathbf{x}) > 0$ for all $\mathbf{x}$. Each $X_i$ indicates that the $i$-th neuron is silent ($X_i(\Delta t_i) = 0$) or has a spike ($X_i(\Delta t_i) = 1$) in a short time bin, which is denoted by $\Delta t_i$. In general, $\Delta t_i$ can be different for each neuron, but in the present paper we assume $\Delta t_i = \Delta t$ for $i = 1, \dots, n$ for simplicity and drop $\Delta t$ in the following notation (see Discussion). Each $p(\mathbf{x})$ is given by the $2^n$ probabilities
$$p_{i_1 \cdots i_n} = \mathrm{Prob}\{X_1 = i_1, \dots, X_n = i_n\}, \quad i_k = 0, 1, \quad \text{subject to} \quad \sum_{i_1, \dots, i_n} p_{i_1 \cdots i_n} = 1,$$
and hence the set of all the probability distributions $\{p(\mathbf{x})\}$ forms a $(2^n - 1)$-dimensional manifold $S_n$. One coordinate system of $S_n$ is given by the expectation parameters,
$$\eta_i = E[x_i] = \mathrm{Prob}\{x_i = 1\}, \quad i = 1, \dots, n,$$
$$\eta_{ij} = E[x_i x_j] = \mathrm{Prob}\{x_i = x_j = 1\}, \quad i < j, \quad \dots, \quad \eta_{12 \cdots n} = E[x_1 \cdots x_n] = \mathrm{Prob}\{x_1 = x_2 = \cdots = x_n = 1\},$$
which have $2^n - 1$ components. This coordinate system is called the $\eta$-coordinates and, in more general terms, defines the m-flat structure in $S_n$ (see Section 5). On the other hand, $\log p(\mathbf{x})$ can be exactly expanded as
$$\log p(\mathbf{x}) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{1 \cdots n} x_1 \cdots x_n - \psi,$$
where the indices of $\theta_{ijk}$, etc., satisfy $i < j < k$, etc., and $\psi$ is a normalization term, corresponding to $-\log p(x_1 = x_2 = \cdots = x_n = 0)$. All $\theta_{ijk}$, etc.,

together have $2^n - 1$ components and form another coordinate system, called the $\theta$-coordinates, corresponding to the e-flat structure in $S_n$ (see Section 5). Findings in information geometry assure us that e-flat and m-flat manifolds are dually flat: the $\theta$-coordinates and $\eta$-coordinates are dually orthogonal coordinates. The properties of the dual orthogonal coordinates remarkably simplify some apparently complicated issues. For example, the generalized Pythagoras theorem gives a decomposition of the Kullback-Leibler divergence by which we can inspect different contributions in the discrepancy of two probability distributions, or contributions of different-order interactions in neural firing. This is a global property of the dual orthogonal coordinates in the probability space. As a local property, the dual orthogonal coordinates give a simple form of the Fisher information metric, which is useful, for example, in hypothesis testing. The present study exploits these properties. In the next section, we start from the case of two neurons.

3 Pairwise Interaction, Mutual Information and Orthogonal Decomposition

3.1 Orthogonal coordinates

Let us begin with two binary random variables $X_1$ and $X_2$ whose joint probability $p(\mathbf{x})$, $\mathbf{x} = (x_1, x_2)$, is given by
$$p_{ij} = \mathrm{Prob}\{X_1 = i, X_2 = j\} > 0, \quad i, j = 0, 1.$$
Among the four probabilities $\{p_{00}, p_{01}, p_{10}, p_{11}\}$, only three are free, because

of the constraint $p_{00} + p_{01} + p_{10} + p_{11} = 1$. Thus, the set of all such distributions of $\mathbf{x}$ forms a three-dimensional manifold $S_2$, where the suffix 2 refers to the number of random variables in $\mathbf{x}$. Any three of the $p_{ij}$ can be used as a coordinate system of $S_2$, which we call the $P$-coordinates for later convenience. In the context of neural firing, the random variables $X_1$ and $X_2$ stand for two neurons, neuron 1 and neuron 2. $X_i = 1$ and $X_i = 0$ indicate whether neuron $i$ ($i = 1, 2$) has a spike or not in a short time bin. A distribution $p(\mathbf{x})$ can be decomposed into marginal and (pairwise) correlational components. The two quantities
$$\eta_i = \mathrm{Prob}\{x_i = 1\} = E[x_i], \quad i = 1, 2,$$
specify the marginal distributions of $x_i$, where $E$ denotes the expectation. Obviously, we have $\eta_1 = p_{10} + p_{11}$ and $\eta_2 = p_{01} + p_{11}$. Let us put
$$\eta_{12} = E[x_1 x_2] = p_{11}.$$
The three quantities
$$\boldsymbol{\eta} = (\eta_1, \eta_2, \eta_{12}) \qquad (1)$$
form another coordinate system of $S_2$, called the $\eta$-coordinates. They are the coordinates of the expectation parameters in an exponential probability family in general (Cox and Hinkley, 1974; Barndorff-Nielsen, 1978; Lehmann, 1983). In the context of neural data, $\eta_1$ and $\eta_2$ are the mean firing rates of neurons 1 and 2, respectively, whereas $\eta_{12}$ is the mean rate of their coincident firing. The covariance,
$$\mathrm{Cov}[X_1, X_2] = E[(x_1 - \eta_1)(x_2 - \eta_2)] = \eta_{12} - \eta_1 \eta_2,$$

may also be considered as a quantity representing the degree of correlation of $X_1$ and $X_2$. Therefore, $(\eta_1, \eta_2, \mathrm{Cov}[X_1, X_2])$ can be another coordinate system. The term $\mathrm{Cov}[X_1, X_2]$ becomes zero when the probability distribution is independent, because we have $\eta_{12} = \eta_1 \eta_2$ in that case. There are many candidates to specify the correlation component. The correlation coefficient
$$\rho = \frac{\eta_{12} - \eta_1 \eta_2}{\sqrt{\eta_1 (1 - \eta_1)\, \eta_2 (1 - \eta_2)}}$$
is also such a quantity. The triplet $(\eta_1, \eta_2, \rho)$ then forms another coordinate system of $S_2$. The correlation coefficient is used to show the pairwise correlation of two neurons in the N-JPSTH (Aertsen et al., 1989). Which quantity is convenient for representing the pairwise correlational component? It is desirable to define the degree of pairwise interaction independently of the marginals $\eta_1$ and $\eta_2$. To this end, we use the `orthogonal coordinates' $(\eta_1, \eta_2, \theta)$ such that the coordinate curve of $\theta$ is always orthogonal to those of $\eta_1$ and $\eta_2$. This characteristic is particularly desirable in the context of neural data, as shown later. Once such a $\theta$ is defined, we have a subset $E(\theta)$ for each $\theta$, a family of distributions having the same $\theta$ value (Fig. 1 A). $E(\theta)$ is a two-dimensional submanifold on which $(\eta_1, \eta_2)$ can vary freely but $\theta$ is fixed. We put the origin $\theta = 0$ when there is no correlation (i.e., $\eta_{12} = \eta_1 \eta_2$) for convenience (see below), and then $E(0)$ is the set of all the independent distributions. Similarly, we consider the set of all the probability distributions whose marginals are common, specified by $(\eta_1, \eta_2)$, but only $\theta$ is free. This is denoted by $M(\eta_1, \eta_2)$, forming a one-dimensional submanifold in $S_2$. The tangential direction of $M(\eta_1, \eta_2)$ represents the direction in which

only the pure correlation changes, while the tangential directions of $E(\theta)$ span the directions in which only $\eta_1$ and $\eta_2$ change but $\theta$ is fixed. We now require that $E(\theta)$ and $M(\eta_1, \eta_2)$ be orthogonal at any point, that is, that the directions of changes in the correlation and in the marginals be mutually ``orthogonal''. The orthogonality of two directions in $S_2$ is defined by using the Riemannian metric due to the Fisher information matrix (Rao, 1945; Barndorff-Nielsen, 1978; Amari, 1982; Nagaoka and Amari, 1982; Amari and Han, 1989; Amari and Nagaoka, 2000). Here, we define the orthogonality directly. Let us specify the probability distributions by $p(\mathbf{x}; \eta_1, \eta_2, \theta)$. The directions of small changes in the coordinates $\eta_i$ and $\theta$ are represented, respectively, by
$$\frac{\partial}{\partial \eta_i} l(\mathbf{x}; \eta_1, \eta_2, \theta), \qquad \frac{\partial}{\partial \theta} l(\mathbf{x}; \eta_1, \eta_2, \theta),$$
where $l(\mathbf{x}; \eta_1, \eta_2, \theta) = \log p(\mathbf{x}; \eta_1, \eta_2, \theta)$. They are random variables, denoting how the log probability changes by small changes in the parameters in the respective directions. These directions are said to be orthogonal when the corresponding random variables are uncorrelated,
$$E\left[\frac{\partial}{\partial \theta} l(\mathbf{x}; \eta_1, \eta_2, \theta)\, \frac{\partial}{\partial \eta_i} l(\mathbf{x}; \eta_1, \eta_2, \theta)\right] = 0, \qquad (2)$$
where $E$ denotes the expectation with respect to $p(\mathbf{x}; \eta_1, \eta_2, \theta)$. This implies that the cross components of $\theta$ and $\eta_i$ in the Fisher information matrix vanish. When the coordinate $\theta$ is defined to be orthogonal to the coordinates $\eta_1$ and $\eta_2$ of the marginals, we say that $\theta$ represents the pure correlation independently of the marginals. Such a $\theta$ is given by the following theorem.

Theorem 1. The coordinate
$$\theta = \log \frac{p_{11}\, p_{00}}{p_{01}\, p_{10}} \qquad (3)$$
is orthogonal to the marginals $\eta_1$ and $\eta_2$.

The proof can be shown by direct calculation, which is omitted here. A more general result is shown later. We have another interpretation of $\theta$. Let us expand $\log p(\mathbf{x})$ in a polynomial of $\mathbf{x}$,
$$\log p(\mathbf{x}) = \sum_{i=1}^{2} \theta_i x_i + \theta_{12} x_1 x_2 - \psi. \qquad (4)$$
Since $x_i$ takes on the binary values $0, 1$, this is an exact expansion. The coefficient $\theta_{12}$ is given by (3), while
$$\theta_1 = \log \frac{p_{10}}{p_{00}}, \quad \theta_2 = \log \frac{p_{01}}{p_{00}}, \quad \psi = -\log p_{00}. \qquad (5)$$
We remark here that the above $\theta_{12}$ is well known, having frequently been used in the additive decomposition of log probabilities. It is 0 when and only when $X_1$ and $X_2$ are independent. The triple $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_{12})$ forms another coordinate system of $S_2$, called the $\theta$-coordinates. They are the coordinates of the natural parameters in the exponential probability family in general (Cox and Hinkley, 1974; Barndorff-Nielsen, 1978; Lehmann, 1983). Furthermore, the triple $\boldsymbol{\zeta} = (\eta_1, \eta_2, \theta_{12})$ forms an `orthogonal' coordinate system of $S_2$, called the mixed coordinates (Amari, 1985; Amari and Nagaoka, 2000).
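As a concrete illustration of these coordinate systems, the following sketch (our own Python example with made-up probabilities, not code or data from the paper) computes the $P$-, $\eta$-, $\theta$- and mixed coordinates of a single 2x2 joint distribution and checks that the log-linear expansion (4)-(5) reproduces it.

```python
# A minimal sketch of the two-neuron coordinate systems: P-coordinates,
# eta-coordinates, theta-coordinates, and the mixed coordinates of Eqs (1)-(5).
# The joint probabilities below are illustrative, not data from the paper.
import numpy as np

# P-coordinates: joint probabilities p[x1, x2], all assumed > 0.
p = np.array([[0.40, 0.15],    # p00, p01
              [0.20, 0.25]])   # p10, p11
assert np.all(p > 0) and np.isclose(p.sum(), 1.0)

# eta-coordinates (expectation parameters).
eta1  = p[1, 0] + p[1, 1]          # Prob{X1 = 1}
eta2  = p[0, 1] + p[1, 1]          # Prob{X2 = 1}
eta12 = p[1, 1]                    # Prob{X1 = X2 = 1}

# theta-coordinates (natural parameters) from the log-linear expansion (4)-(5).
theta1  = np.log(p[1, 0] / p[0, 0])
theta2  = np.log(p[0, 1] / p[0, 0])
theta12 = np.log(p[1, 1] * p[0, 0] / (p[0, 1] * p[1, 0]))   # Eq. (3)
psi     = -np.log(p[0, 0])

# Mixed coordinates: marginals from eta, interaction from theta.
print("eta   =", (eta1, eta2, eta12))
print("theta =", (theta1, theta2, theta12), "psi =", psi)
print("mixed =", (eta1, eta2, theta12))

# Sanity check: rebuild p from the theta-coordinates and compare.
rebuilt = np.empty((2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        rebuilt[x1, x2] = np.exp(theta1*x1 + theta2*x2 + theta12*x1*x2 - psi)
assert np.allclose(rebuilt, p)
```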

3.2 KL divergence, projections and the Pythagoras relation

The Kullback-Leibler (KL) divergence between two probability distributions $p(\mathbf{x})$ and $q(\mathbf{x})$ is defined by
$$D[p : q] = \sum_{\mathbf{x}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})}. \qquad (6)$$
The KL divergence provides a quasi-distance between two probability distributions: $D[p : q] \geq 0$, with equality if and only if $p(\mathbf{x}) = q(\mathbf{x})$, whereas the symmetric relationship does not generally hold, i.e., $D[p : q] \neq D[q : p]$. Let $\bar{p}(\mathbf{x})$ be the independent distribution that is closest to a distribution $p(\mathbf{x})$,
$$\bar{p}(\mathbf{x}) = \arg\min_{q \in E(0)} D[p : q],$$
where $E(0)$ is the set of all the independent distributions. We call $\bar{p}(\mathbf{x}) = \prod_i p_i(x_i)$ the m-projection of $p$ to $E(0)$ (Fig. 1 B). Let the mixed coordinates of $p$ be $(\eta_1, \eta_2, \theta)$. The coordinates of $\bar{p}$ are given by $(\eta_1, \eta_2, 0)$, because of the orthogonality, so that
$$\bar{p}(\mathbf{x}) = \prod_i p_i(x_i; \eta_i) = p_1(x_1; \eta_1)\, p_2(x_2; \eta_2),$$
where $p_i(x_i; \eta_i)$ is the marginal distribution of $p$. Interestingly, the minimized divergence is given by the mutual information,
$$D[p : \bar{p}] = I(X_1; X_2) = \sum p(x_1, x_2) \log \frac{p(x_1, x_2)}{p_1(x_1)\, p_2(x_2)}.$$
We have another characterization of $\bar{p}$. Let $p_0$ be the uniform distribution, whose mixed coordinates are $(0.5, 0.5, 0)$. Let $M(\eta_1, \eta_2)$ be the sub-

space that includes $p$. Then,
$$\bar{p} = \arg\min_{q \in M(\eta_1, \eta_2)} D[q : p_0].$$
Such a $\bar{p}$ is called the e-projection of $p_0$ to $M(\eta_1, \eta_2)$, and it belongs to $E(0)$. Since we easily have $D[q : p_0] = -H[q] + H_0$, where $H[q]$ is the entropy of $q$ and $H_0 = 2 \log 2$ is a constant, $\bar{p}$ has the maximal entropy among those belonging to $M(\eta_1, \eta_2)$. This fact is called the maximum entropy principle (Jaynes, 1982). It is well known that we have the decomposition
$$D[p : p_0] = D[p : \bar{p}] + D[\bar{p} : p_0].$$
Now let us generalize the above observation and let $p(\mathbf{x})$ and $q(\mathbf{x})$ be two probability distributions whose mixed coordinates are $\boldsymbol{\zeta}_p = (\zeta^p_1, \zeta^p_2, \zeta^p_3)$ and $\boldsymbol{\zeta}_q = (\zeta^q_1, \zeta^q_2, \zeta^q_3)$, respectively. Let $r^*(\mathbf{x})$ be the m-projection of $p(\mathbf{x})$ to $E(\theta_q)$, and $\tilde{r}(\mathbf{x})$ be the e-projection of $p(\mathbf{x})$ to $M(\eta^q_1, \eta^q_2)$,
$$r^*(\mathbf{x}) = \arg\min_{r \in E(\theta_q)} D[p : r], \qquad \tilde{r}(\mathbf{x}) = \arg\min_{r \in M(\eta^q_1, \eta^q_2)} D[r : p].$$
The mixed coordinates of $r^*$ and $\tilde{r}$ are explicitly given by $(\zeta^p_1, \zeta^p_2, \zeta^q_3)$ and $(\zeta^q_1, \zeta^q_2, \zeta^p_3)$, respectively. Hence, the following Pythagoras relations hold (Fig. 1 B).

Theorem 2.
$$D[p : q] = D[p : r^*] + D[r^* : q], \qquad (7)$$
$$D[q : p] = D[q : \tilde{r}] + D[\tilde{r} : p]. \qquad (8)$$
Theorem 2 shows that the divergence $D[p : q]$ from $p$ to $q$ is decomposed into two terms, $D[p : r^*]$ and $D[r^* : q]$, where the former represents the degree of difference in their correlation and the latter the difference in their marginals.
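The Pythagoras relation and the identity $D[p : \bar{p}] = I(X_1; X_2)$ are easy to verify numerically. The following short check is our own sketch with an illustrative joint distribution; it is not taken from the paper.

```python
# A small numerical check of the m-projection to E(0) and the relation
# D[p : p0] = D[p : pbar] + D[pbar : p0], with D[p : pbar] = I(X1; X2).
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D[p : q] for arrays of probabilities."""
    return float(np.sum(p * np.log(p / q)))

# Joint distribution p(x1, x2) (illustrative values, all positive).
p = np.array([[0.40, 0.15],
              [0.20, 0.25]])

# m-projection of p onto E(0): the independent distribution with the same
# marginals (eta_1, eta_2); its mixed coordinates are (eta_1, eta_2, 0).
m1 = p.sum(axis=1)            # marginal of X1
m2 = p.sum(axis=0)            # marginal of X2
pbar = np.outer(m1, m2)

p0 = np.full((2, 2), 0.25)    # uniform distribution, mixed coords (0.5, 0.5, 0)

print("I(X1;X2) = D[p:pbar]   =", kl(p, pbar))
print("D[p:p0]                =", kl(p, p0))
print("D[p:pbar] + D[pbar:p0] =", kl(p, pbar) + kl(pbar, p0))
```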

3.3 Local orthogonality and Fisher information

For any parameterization $p(\mathbf{x}; \boldsymbol{\xi})$, the Fisher information matrix $G = (g_{ij})$ in terms of the coordinates $\boldsymbol{\xi}$ is given by
$$g_{ij}(\boldsymbol{\xi}) = E\left[\frac{\partial \log p(\mathbf{x}; \boldsymbol{\xi})}{\partial \xi_i} \frac{\partial \log p(\mathbf{x}; \boldsymbol{\xi})}{\partial \xi_j}\right].$$
This $G(\boldsymbol{\xi})$ plays the role of a Riemannian metric tensor. The squared distance $ds^2$ between two nearby distributions $p(\mathbf{x}; \boldsymbol{\xi})$ and $p(\mathbf{x}; \boldsymbol{\xi} + d\boldsymbol{\xi})$ is given by the quadratic form of $d\boldsymbol{\xi}$,
$$ds^2 = \sum_{i,j \in (1,2,3)} g_{ij}(\boldsymbol{\xi})\, d\xi_i\, d\xi_j.$$
It is known that this is approximately twice the Kullback-Leibler divergence:
$$ds^2 \approx 2 D[p(\mathbf{x}; \boldsymbol{\xi}) : p(\mathbf{x}; \boldsymbol{\xi} + d\boldsymbol{\xi})].$$
When we use the mixed coordinates $\boldsymbol{\zeta}$, the Fisher information is of the form
$$G = (g_{ij}) = \begin{pmatrix} g_{11} & g_{12} & 0 \\ g_{12} & g_{22} & 0 \\ 0 & 0 & g_{33} \end{pmatrix},$$
as is seen from Eq. 2. This is the local property induced by the orthogonality of $\theta$ and $\eta_i$. In this case, by putting
$$ds_1^2 = g_{33}\, (d\zeta_3)^2, \qquad ds_2^2 = \sum_{i,j \in (1,2)} g_{ij}\, d\zeta_i\, d\zeta_j,$$

we have the orthogonal decomposition corresponding to Eq. 7,
$$ds^2 = ds_1^2 + ds_2^2. \qquad (9)$$
We now show the merits of the orthogonal coordinates for statistical inference. Let us estimate the parameters $\boldsymbol{\eta} = (\eta_1, \eta_2)$ and $\theta$ from $N$ observed data $\mathbf{x}_1, \dots, \mathbf{x}_N$. The maximum likelihood estimator is asymptotically unbiased and efficient, where the covariance of the estimation errors, $\Delta\hat{\boldsymbol{\eta}}$ and $\Delta\hat{\theta}$, is given asymptotically by
$$\mathrm{Cov}\left[\Delta\hat{\boldsymbol{\zeta}}\right] = \frac{1}{N} G^{-1}.$$
Since the cross terms of $G$ or $G^{-1}$ vanish for the orthogonal coordinates, we have
$$\mathrm{Cov}[\Delta\hat{\boldsymbol{\eta}}, \Delta\hat{\theta}] = 0, \qquad (10)$$
implying that the estimation error of the marginals and that of the interaction are mutually independent. Such a property does not hold for other non-orthogonal parameterizations such as the correlation coefficient, the covariance, etc. This property greatly simplifies procedures of hypothesis testing, as shown below.

3.4 Hypothesis testing

Let us consider the estimation of $\boldsymbol{\eta}$ and $\theta$ more directly. A natural estimate for the $\eta$-coordinates is
$$\hat{\eta}_i = \frac{1}{N} \#\{x_i = 1\} \quad (i = 1, 2), \qquad \hat{\eta}_{12} = \frac{1}{N} \#\{x_1 x_2 = 1\}. \qquad (11)$$

This is the maximum likelihood estimator. The estimator $\hat{\theta}$ is obtained by the coordinate transformation from the $\eta$- to the $\theta$-coordinates,
$$\hat{\theta} = \log \frac{\hat{\eta}_{12}\,(1 - \hat{\eta}_1 - \hat{\eta}_2 + \hat{\eta}_{12})}{(\hat{\eta}_1 - \hat{\eta}_{12})(\hat{\eta}_2 - \hat{\eta}_{12})}.$$
Notably, the estimation of $\theta$ can be performed `independently' of the estimation of $\boldsymbol{\eta}$ in the sense of Eq. 10. This brings a simple procedure of hypothesis testing concerning the null hypothesis
$$H_0 : \theta = \theta_0 \quad \text{against} \quad H_1 : \theta \neq \theta_0.$$
In previous studies, under different frameworks (e.g., using the N-JPSTH), the null hypothesis of independent firing is often examined. This corresponds to the null hypothesis $\theta_0 = 0$ in the current framework. Let the maximal log likelihoods of the models $H_0 : \theta = \theta_0$ and $H_1 : \theta \neq \theta_0$ be, respectively,
$$l_0 = \max_{\boldsymbol{\eta}} \log p(\mathbf{x}_1, \dots, \mathbf{x}_N; \boldsymbol{\eta}, \theta_0), \qquad l_1 = \max_{\boldsymbol{\eta}, \theta} \log p(\mathbf{x}_1, \dots, \mathbf{x}_N; \boldsymbol{\eta}, \theta),$$
where $N$ is the number of observations. The likelihood ratio test uses the test statistic
$$\lambda = 2\,(l_1 - l_0), \qquad (12)$$
which is subject to the $\chi^2$-distribution. With the orthogonal coordinates, the likelihood maximization with respect to $\boldsymbol{\eta} = (\eta_1, \eta_2)$ and $\theta$ can be performed independently, so that we have
$$l_0 = \log p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \theta_0), \qquad l_1 = \log p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \hat{\theta}),$$

where $\hat{\boldsymbol{\eta}}$ denotes the same marginals in both models. If a non-orthogonal parameterization is used, this property does not hold. A similar situation holds in the case of testing $\boldsymbol{\eta} = \boldsymbol{\eta}_0$ against $\boldsymbol{\eta} \neq \boldsymbol{\eta}_0$ for unknown $\theta$. Now let us calculate the test statistic in more detail. Under the hypothesis $H_0$, $\lambda$ is approximated for a large $N$ as
$$\lambda = 2 \sum_{i=1}^{N} \log \frac{p(\mathbf{x}_i; \hat{\boldsymbol{\eta}}, \hat{\theta})}{p(\mathbf{x}_i; \hat{\boldsymbol{\eta}}, \theta_0)} \approx 2N \tilde{E}\left[\log \frac{p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \hat{\theta})}{p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \theta_0)}\right] \approx 2N D\left[p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \hat{\theta}) : p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \theta_0)\right] \approx N g_{33}\,(\hat{\theta} - \theta_0)^2, \qquad (13)$$
where $\tilde{E}$ is the expectation over the empirical distribution and the approximation in the third step comes from our assumption of the null hypothesis $H_0$. Here $g_{33}$ is the Fisher information of the mixed coordinates in the $\theta$-direction at $\boldsymbol{\zeta}_0 = (\hat{\eta}_1, \hat{\eta}_2, \theta_0)$, which is easily calculated as
$$g_{33} = g_{33}(\boldsymbol{\zeta}_0) = \frac{\hat{\eta}_3\,(\hat{\eta}_1 - \hat{\eta}_3)(\hat{\eta}_2 - \hat{\eta}_3)(\hat{\eta}_1 + \hat{\eta}_2 - \hat{\eta}_3 - 1)}{\hat{\eta}_1 \hat{\eta}_2\,(\hat{\eta}_1 + \hat{\eta}_2 - 1 - 2\hat{\eta}_3) + \hat{\eta}_3^2}.$$
Asymptotically, we have
$$\sqrt{N}\,\sqrt{g_{33}}\,(\hat{\zeta}_3 - \zeta_3) \sim N(0, 1), \quad \text{and hence} \quad \lambda \sim \chi^2(1),$$
where $m$ in $\chi^2(m)$ indicates the degrees of freedom of the $\chi^2$ distribution; in our case, the degree of freedom is 1. We must note that the above approach is valid regardless of whether $\theta_3 = 0$ or $\theta_3 \neq 0$.
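The following sketch shows how such a test might be carried out in practice on synthetic spike indicators; the data, bin counts and random-number seed are assumptions of ours, not the paper's recordings, and the statistic follows the Wald-type approximation of Eq. (13).

```python
# A sketch of the test of H0: theta = 0 (independent firing) for two neurons,
# using lambda ~ N * g33 * (theta_hat - theta_0)^2 ~ chi^2(1) as in Eq. (13).
# Synthetic data only; not the recordings analyzed in the paper.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
N = 1000                                   # number of time bins / trials

# Synthetic spike indicators with a weak positive interaction.
x1 = rng.random(N) < 0.20
x2 = np.where(x1, rng.random(N) < 0.35, rng.random(N) < 0.25)

# Maximum likelihood estimates of the eta-coordinates, Eq. (11).
eta1_hat  = x1.mean()
eta2_hat  = x2.mean()
eta12_hat = (x1 & x2).mean()

# Coordinate transformation to theta_hat (the log odds ratio).
theta_hat = np.log(eta12_hat * (1 - eta1_hat - eta2_hat + eta12_hat)
                   / ((eta1_hat - eta12_hat) * (eta2_hat - eta12_hat)))

# Fisher information g33 of the mixed coordinates in the theta direction,
# evaluated with theta_0 = 0, i.e. with eta_3 replaced by eta_1 * eta_2.
def g33(e1, e2, e3):
    num = e3 * (e1 - e3) * (e2 - e3) * (e1 + e2 - e3 - 1)
    den = e1 * e2 * (e1 + e2 - 1 - 2 * e3) + e3 ** 2
    return num / den

theta0 = 0.0
lam = N * g33(eta1_hat, eta2_hat, eta1_hat * eta2_hat) * (theta_hat - theta0) ** 2
print("theta_hat =", theta_hat)
print("lambda    =", lam, " p-value =", chi2.sf(lam, df=1))
```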

In contrast, the decomposition shown in Eq. 9 does not exist, for example, for the coordinate system $(\eta_1, \eta_2, \rho)$, where $\rho$ is the correlation coefficient. The plane $\theta_3 = 0$, or $E(0)$, coincides with the plane $\rho = 0$, which is $\eta_3 = \eta_1 \eta_2$. However, $E(c)$ ($c = \mathrm{const} \neq 0$) cannot be equal to any plane defined by $\rho = c'$, where $c' = \mathrm{const}$. Only in the case of $\rho = 0$ is it possible to formulate a test for $\rho$ similarly to the above discussion, which is testing against the hypothesis of independent firing.

3.5 Application to firing of two neurons

Here, we discuss the application of the above theoretical results to the firing of two neurons and relate different choices of the null hypothesis with corresponding hypothesis tests. Given $N$ trials of experiments, the probability distribution of $X$ in a time bin $[t, t + \Delta t]$ can be estimated; it is denoted by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}) = p(\mathbf{x}; \hat{\boldsymbol{\zeta}}(t, t + \Delta t))$, where $\hat{\boldsymbol{\zeta}}$ can be any coordinate system. If stationarity is assumed in a certain time interval, we obtain the probability distribution in the interval by averaging the estimated probabilities over the many bins of the interval. The maximum likelihood estimate (MLE) of the $P$-coordinates is given by
$$\hat{p}_{ij} = \frac{N_{ij}}{N} \quad (i, j = 0, 1),$$
where $N_{ij}$ indicates the number of trials in which the event $(X_1 = i, X_2 = j)$ occurs. The maximum likelihood estimator is retained under any coordinate transformation. Any coordinate transformation is easy in the case of two neurons, so we freely change the coordinate systems in this section.

Let us denote our estimated probability distribution by its mixed coordinates $\hat{\boldsymbol{\zeta}}$. We also denote by $\boldsymbol{\zeta}^0$ our null hypothesis. Then, we have
$$D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}\right] = D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}^0\right] + D\left[\hat{\boldsymbol{\zeta}}^0 : \hat{\boldsymbol{\zeta}}\right] = D_1 + D_2, \qquad (14)$$
where $D_1 = D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}^0]$, $D_2 = D[\hat{\boldsymbol{\zeta}}^0 : \hat{\boldsymbol{\zeta}}]$, and $\hat{\boldsymbol{\zeta}}^0 = (\zeta^0_1, \zeta^0_2, \hat{\zeta}_3)$. We use the abbreviation $D[\boldsymbol{\zeta} : \boldsymbol{\zeta}']$ for the divergence between the probability distributions specified by $\boldsymbol{\zeta}$ and $\boldsymbol{\zeta}'$, i.e., $D[p(\mathbf{x}; \boldsymbol{\zeta}) : p(\mathbf{x}; \boldsymbol{\zeta}')]$. Here, $D_1$ and $D_2$ are the quantities representing the discrepancies of $p(\hat{\boldsymbol{\zeta}})$ from $p(\boldsymbol{\zeta}^0)$ with respect to the coincident firing and the marginals, respectively. We have
$$\lambda_1 = 2ND_1 \approx N g_{33}(\boldsymbol{\zeta}^0)(\zeta^0_3 - \hat{\zeta}_3)^2 \sim \chi^2(1),$$
$$\lambda_2 = 2ND_2 \approx N \sum_{i,j=1}^{2} g_{ij}(\boldsymbol{\zeta}^0)(\zeta^0_i - \hat{\zeta}_i)(\zeta^0_j - \hat{\zeta}_j) \sim \chi^2(2).$$
Here, $\lambda_1$ tests whether the estimated coincident firing significantly differs from that of the null hypothesis, while $\lambda_2$ tests whether the estimated marginals significantly differ from the hypothesized marginals. In particular, a test of whether the estimated coincident firing $\hat{\zeta}_3$ is significantly different from zero is given by $\boldsymbol{\zeta}^0 = (\hat{\zeta}_1, \hat{\zeta}_2, 0)$. This $p(\mathbf{x}; \boldsymbol{\zeta}^0)$ is the probability distribution that has the same marginals as those of $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ but with independent firing. In this case, $\lambda_1 = 2ND_1 = 2ND[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}]$ gives a test statistic against $\hat{\zeta}_3 = 0$, while $D_2 = 0$. Let us consider another typical situation, where we need to compare two estimated probability distributions. This case is very important but somewhat ignored in the testing of coincident firing. Many previous studies often assumed independent firing as the null hypothesis. However, for example, to say that a single neuron's firing is `task-related', e.g., in a memory-guided

saccade task (Hikosaka and Wurtz, 1983), the existence of firing in the `task period' alone does not guarantee that the firing is task-related. It is normal to examine the firing in the task period against that in a `control period'. The firing in the control period serves as the resting-level activity, or as the null hypothesis. We hence propose that a procedure for testing coincident firing should be performed in a similar manner: we should test whether two neurons have any significant pairwise interaction in one period in comparison to the other (control) period. Investigation of coincident firing in the task period against the null hypothesis of independent firing may lead to a wrong interpretation of its significance when there is already a weak correlation in the control period (see examples in Section 6). Similar arguments can be applied to different tasks. One example would be a rat's maze task: a rat is in the left room in one period, while in the other period it is in the right room. We may like to test whether coincident firing of two neurons, say, in the hippocampus, is significantly larger or smaller in one room than in the other room. The null hypothesis of independent firing is not plausible in this case. Let us denote the estimated probability distributions in the two periods by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}^1)$ and $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}^2)$. Using the mixed coordinates, by Theorem 2 we have
$$D\left[\hat{\boldsymbol{\zeta}}^1 : \hat{\boldsymbol{\zeta}}^2\right] = D\left[\hat{\boldsymbol{\zeta}}^1 : \hat{\boldsymbol{\zeta}}^3\right] + D\left[\hat{\boldsymbol{\zeta}}^3 : \hat{\boldsymbol{\zeta}}^2\right],$$
where $\hat{\boldsymbol{\zeta}}^3 = (\hat{\zeta}^1_1, \hat{\zeta}^1_2, \hat{\zeta}^2_3) = (\hat{\eta}^1_1, \hat{\eta}^1_2, \hat{\theta}^2_3)$. Here $\hat{\boldsymbol{\zeta}}^1$ is an estimated probability distribution. If we can guarantee that $\hat{\boldsymbol{\zeta}}^1$ is the true underlying distribution, denoted by $\boldsymbol{\zeta}^1$, we can have
$$\lambda = 2ND\left[\boldsymbol{\zeta}^1 : \hat{\boldsymbol{\zeta}}^3\right] \approx N g_{33}(\boldsymbol{\zeta}^1)(\hat{\zeta}^2_3 - \zeta^1_3)^2 \sim \chi^2(1). \qquad (15)$$

This $\chi^2$ test is, precisely speaking, to examine whether $\hat{\zeta}^2_3$ is significantly different from $\zeta^1_3$ when $\hat{\boldsymbol{\zeta}}^1$ is a true distribution. In general, when $\hat{\boldsymbol{\zeta}}^1$ is an estimated distribution, we should test whether $\hat{\zeta}^1_3$ and $\hat{\zeta}^2_3$ come from the same interaction component, which we denote by $\zeta_3$. In this case, the maximum likelihood estimators, denoted by $\hat{\boldsymbol{\zeta}}^{1\prime}$ and $\hat{\boldsymbol{\zeta}}^{2\prime}$, are given by
$$(\hat{\boldsymbol{\zeta}}^{1\prime}, \hat{\boldsymbol{\zeta}}^{2\prime}) = \arg\max \sum_{j=1}^{N} \log p(\mathbf{x}_j; \boldsymbol{\zeta}^1)\, p(\mathbf{x}_j; \boldsymbol{\zeta}^2) \quad \text{subject to} \quad \zeta^1_3 = \zeta^2_3 = \zeta_3.$$
Then, our likelihood ratio test against this null hypothesis yields
$$\lambda' = 2ND\left[\hat{\boldsymbol{\zeta}}^{1\prime} : \hat{\boldsymbol{\zeta}}^1\right] + 2ND\left[\hat{\boldsymbol{\zeta}}^{2\prime} : \hat{\boldsymbol{\zeta}}^2\right] \approx N g_{33}(\hat{\boldsymbol{\zeta}}^{1\prime})(\hat{\zeta}^1_3 - \hat{\zeta}'_3)^2 + N g_{33}(\hat{\boldsymbol{\zeta}}^{2\prime})(\hat{\zeta}^2_3 - \hat{\zeta}'_3)^2, \qquad (16)$$
where $\hat{\zeta}'_3 = \hat{\zeta}^{1\prime}_3 = \hat{\zeta}^{2\prime}_3$. In Eq. 15, we can convert $\lambda$ into a $\chi^2$ test, because $g_{33}$ is the true value by our assumption. In Eq. 16, however, rigorously speaking, we cannot convert $\lambda'$ into a $\chi^2$ test, because both $g_{33}$ terms are estimates, determined at each estimated point, i.e., depending on $\hat{\boldsymbol{\zeta}}^{1\prime}$ and $\hat{\boldsymbol{\zeta}}^{2\prime}$, respectively. This issue is analogous to the famous Fisher-Behrens problem in the context of the t-test (Stuart et al., 1999). Yet, since all of the terms in Eq. 16 asymptotically converge to their true values, we suggest using
$$\lambda' \approx N g_{33}(\hat{\boldsymbol{\zeta}}^{1\prime})(\hat{\zeta}^1_3 - \hat{\zeta}'_3)^2 + N g_{33}(\hat{\boldsymbol{\zeta}}^{2\prime})(\hat{\zeta}^2_3 - \hat{\zeta}'_3)^2 \sim \chi^2(2).$$
This $\chi^2(2)$ formulation gives a more appropriate test under the null hypothesis against the average activity in the control period. At the same time, to compare significant events between the two null hypotheses, namely, against independent firing and against the average activity in the control period, we suggest still using the $\chi^2(1)$ formulation for the latter hypothesis.
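A minimal numerical illustration of the two-period comparison is given below. It uses the simpler form of Eq. (15), treating the control-period estimate as the reference distribution; the spike counts are hypothetical, and the constrained-MLE procedure of Eq. (16) is not implemented here.

```python
# A sketch of testing whether the pairwise interaction theta_3 in a "task"
# period differs from that in a "control" period, following the simplification
# of Eq. (15) with the control-period estimate taken as the reference.
# All counts are made up for illustration.
import numpy as np
from scipy.stats import chi2

def eta_hat(n00, n01, n10, n11):
    """MLE of (eta1, eta2, eta12) and N from the 2x2 table of joint counts."""
    N = n00 + n01 + n10 + n11
    return (n10 + n11) / N, (n01 + n11) / N, n11 / N, N

def theta_of(e1, e2, e3):
    return np.log(e3 * (1 - e1 - e2 + e3) / ((e1 - e3) * (e2 - e3)))

def g33(e1, e2, e3):
    num = e3 * (e1 - e3) * (e2 - e3) * (e1 + e2 - e3 - 1)
    den = e1 * e2 * (e1 + e2 - 1 - 2 * e3) + e3 ** 2
    return num / den

# Hypothetical joint spike counts (x1, x2) per bin in the two periods.
e1c, e2c, e3c, Nc = eta_hat(n00=620, n01=130, n10=150, n11=100)   # control
e1t, e2t, e3t, Nt = eta_hat(n00=520, n01=140, n10=160, n11=180)   # task

theta_ctrl = theta_of(e1c, e2c, e3c)
theta_task = theta_of(e1t, e2t, e3t)

# Wald-type statistic of Eq. (15): g33 is evaluated at the control (reference)
# point and N is the number of bins used for the task-period estimate.
lam = Nt * g33(e1c, e2c, e3c) * (theta_task - theta_ctrl) ** 2
print("theta_ctrl =", theta_ctrl, " theta_task =", theta_task)
print("lambda =", lam, " p-value =", chi2.sf(lam, df=1))
```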

3.6 Relationship between neural firing and behavior

The orthogonality between the $\theta$ and $\eta$ parameters has played a fundamental role in the above results, so that pairwise coincident firing, characterized by $\zeta_3$, can be examined by a simple hypothesis testing procedure. In the analysis of neural data, it is also important to investigate whether or not any coincident firing has behavioral significance. For this purpose, we use the mutual information to relate neural firing with behavioral events. The above orthogonality can again play an important role. Let us denote by $Y$ a discrete random variable representing behavioral choices, for example, making a saccade right or left, and/or presented stimuli, for example red dots, blue rectangles, or green triangles. The mutual information between $X = (X_1, X_2)$ and $Y$ is defined by
$$I(X; Y) = E_{p(X, Y)}\left[\log \frac{p(\mathbf{x}, y)}{p(\mathbf{x})\, p(y)}\right],$$
which is equivalent to
$$I(X; Y) = E_{p(Y)}\left[D[p(\mathbf{x}|y) : p(\mathbf{x})]\right] = E_{p(X)}\left[D[p(y|\mathbf{x}) : p(y)]\right].$$
We can apply the Pythagoras decomposition to the above equation. We use the mixed coordinates for $p(\mathbf{x}|y)$ and $p(\mathbf{x})$, denoted by $\boldsymbol{\zeta}(X|y)$ and $\boldsymbol{\zeta}(X)$, respectively. Then, we have
$$D[p(\mathbf{x}|y) : p(\mathbf{x})] = D[\boldsymbol{\zeta}(X|y) : \boldsymbol{\zeta}(X)] = D[\boldsymbol{\zeta}(X|y) : \boldsymbol{\zeta}_0] + D[\boldsymbol{\zeta}_0 : \boldsymbol{\zeta}(X)],$$
where
$$\boldsymbol{\zeta}_0 = \boldsymbol{\zeta}_0(X, y) = (\zeta_1(X|y), \zeta_2(X|y), \zeta_3(X)) = (\eta_1(X|y), \eta_2(X|y), \theta_3(X)).$$

Thus, $\boldsymbol{\zeta}_0$ has the first two components (i.e., $\zeta_1, \zeta_2$) the same as those of $\boldsymbol{\zeta}(X|y)$ and the third component (i.e., $\zeta_3$) the same as that of $\boldsymbol{\zeta}(X)$. Using this relationship, the mutual information between $X$ and $Y$ is decomposed.

Theorem 3.
$$I(X; Y) = I_1(X; Y) + I_2(X; Y), \qquad (17)$$
where $I_1(X; Y)$ and $I_2(X; Y)$ are given by
$$I_1(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}(X|y) : \boldsymbol{\zeta}_0(X, y)]\right], \qquad I_2(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_0(X, y) : \boldsymbol{\zeta}(X)]\right].$$
A similar result holds for the conditional distribution $p(y|\mathbf{x})$. The above decomposition states that the mutual information $I(X; Y)$ is the sum of two terms: $I_1(X; Y)$ is the mutual information carried by modulation of the correlation component of $X$, while $I_2(X; Y)$ is the mutual information carried by modulation of the marginal means of $X$. This observation helps us investigate the behavioral significance of each modulation of the coincident firing and of the mean firing rate.
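The decomposition of Theorem 3 can be checked numerically by constructing the intermediate distribution $\boldsymbol{\zeta}_0(X, y)$ explicitly. The sketch below is our own illustration (with an invented joint distribution of firing and behavior); the helper that rebuilds a 2x2 distribution from given mixed coordinates solves Eq. (3) for $p_{11}$ numerically.

```python
# A numerical sketch of Theorem 3: I(X;Y) = I_1 + I_2, where I_1 is carried by
# modulation of the pairwise interaction theta_3 and I_2 by modulation of the
# marginals.  The conditional distributions below are illustrative only.
import numpy as np
from scipy.optimize import brentq

def mixed_coords(p):
    """Mixed coordinates (eta1, eta2, theta3) of a 2x2 distribution p[x1,x2]."""
    e1, e2 = p[1].sum(), p[:, 1].sum()
    th = np.log(p[1, 1] * p[0, 0] / (p[0, 1] * p[1, 0]))
    return e1, e2, th

def from_mixed(e1, e2, th):
    """Rebuild the 2x2 distribution with marginals (e1, e2) and interaction th."""
    def f(p11):
        p10, p01, p00 = e1 - p11, e2 - p11, 1 - e1 - e2 + p11
        return np.log(p11 * p00 / (p01 * p10)) - th
    lo = max(0.0, e1 + e2 - 1.0) + 1e-12
    hi = min(e1, e2) - 1e-12
    p11 = brentq(f, lo, hi)
    return np.array([[1 - e1 - e2 + p11, e2 - p11],
                     [e1 - p11, p11]])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Conditional distributions p(x | y) for two behavioral conditions y = 0, 1,
# and the prior p(y); all values are made up for illustration.
p_y = np.array([0.5, 0.5])
p_x_given_y = [np.array([[0.50, 0.15], [0.20, 0.15]]),
               np.array([[0.30, 0.20], [0.15, 0.35]])]
p_x = sum(py * pxy for py, pxy in zip(p_y, p_x_given_y))

_, _, th_marg = mixed_coords(p_x)
I = I1 = I2 = 0.0
for py, pxy in zip(p_y, p_x_given_y):
    e1, e2, _ = mixed_coords(pxy)
    zeta0 = from_mixed(e1, e2, th_marg)   # marginals of p(x|y), theta of p(x)
    I  += py * kl(pxy, p_x)
    I1 += py * kl(pxy, zeta0)
    I2 += py * kl(zeta0, p_x)

print("I =", I, " I1 + I2 =", I1 + I2, " (I1, I2) =", (I1, I2))
```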

4 Triple Interactions among Three Variables

The previous section discussed the pairwise interaction between two variables. Given more than two variables, we need to look not only into pairwise interactions but also into higher-order interactions. It is useful to study triplewise interactions before stating the general case.

4.1 Orthogonal coordinates and pure triple interaction

Let us consider three binary random variables $X_1, X_2$ and $X_3$, and let $p(\mathbf{x}) > 0$, $\mathbf{x} = (x_1, x_2, x_3)$, be their joint probability distribution. We put $p_{ijk} = \mathrm{Prob}\{x_1 = i, x_2 = j, x_3 = k\}$, $i, j, k = 0, 1$. The set of all such distributions forms a 7-dimensional manifold $S_3$, because $\sum p_{ijk} = 1$ among the eight $p_{ijk}$'s. The single and pairwise marginal distributions of the $X_i$ are defined by
$$\eta_i = E[x_i] = \mathrm{Prob}\{x_i = 1\} \quad (i = 1, 2, 3), \qquad \eta_{ij} = E[x_i x_j] = \mathrm{Prob}\{x_i = x_j = 1\} \quad (i < j;\; i, j = 1, 2, 3).$$
The three quantities $\eta_i$, $\eta_j$ and $\eta_{ij}$ together determine the joint marginal distribution of any two random variables $X_i$ and $X_j$. Let us further put
$$\eta_{123} = E[x_1 x_2 x_3] = \mathrm{Prob}\{x_1 = x_2 = x_3 = 1\}.$$
All of these together have 7 degrees of freedom,
$$\boldsymbol{\eta} = (\eta_1, \eta_2, \dots, \eta_7) = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}, \eta_{123}), \qquad (18)$$
which specify any distribution $p(\mathbf{x})$ in $S_3$. Hence, this is a coordinate system of $S_3$, called the m- or $\eta$-coordinates. The pairwise correlation between any two of $X_1, X_2$ and $X_3$ is determined from the marginal distributions of $X_i$ and $X_j$, or $\eta_i$, $\eta_j$ and $\eta_{ij}$. However, even when all the pairwise correlations vanish, this does not imply that $X_1, X_2$, and $X_3$ are independent. Therefore, one should define the intrinsic triplewise

interaction independently of pairwise correlations. The coordinate $\eta_{123}$ itself does not directly give the degree of pure triplewise interaction. In order to define the degree of pure triplewise interaction, the orthogonality plays a fundamental role. Let us fix the three pairwise marginal distributions, specified by the six coordinates
$$\boldsymbol{\eta}_2 = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}).$$
There are many distributions with the same $\boldsymbol{\eta}_2$. Let us consider the set $M_2(\boldsymbol{\eta}_2)$ of all the distributions in which we have the same single and pairwise marginals $\boldsymbol{\eta}_2$ but $\eta_{123}$ may take any value. This is a one-dimensional submanifold specified by $\boldsymbol{\eta}_2$. Let us introduce a coordinate $\theta$ in $M_2(\boldsymbol{\eta}_2)$; then $(\boldsymbol{\eta}_2, \theta)$ is a coordinate system of $S_3$. When the coordinate $\theta$ is orthogonal to $\boldsymbol{\eta}_2$, that is, when a change in the log likelihood along $\theta$ is not correlated with that along any of the components of $\boldsymbol{\eta}_2$, we may say that $\theta$ represents the degree of pure triple interaction irrespective of the pairwise marginals $\boldsymbol{\eta}_2$, and we require that $\theta$ has this property. The tangent direction of $M_2$, that is, the direction in which only $\theta$ changes but the second-order marginals $\boldsymbol{\eta}_2$ are fixed, represents a change in the pure triple interaction among $X_1, X_2$, and $X_3$. To show this geometrically, let us consider a family of submanifolds $E_2(\theta)$ in which all the distributions have the same $\theta$ but the single and pairwise marginals $\boldsymbol{\eta}_2$ are free. An $E_2(\theta)$ is a six-dimensional submanifold transversal to $M_2(\boldsymbol{\eta}_2)$. Tangent directions of $E_2(\theta)$ represent changes in the marginals $\boldsymbol{\eta}_2$, keeping $\theta$ fixed, and $E_2(\theta)$ and $M_2(\boldsymbol{\eta}_2)$ are orthogonal at any $\theta$ and $\boldsymbol{\eta}_2$. In order to obtain such a $\theta$, let us expand $\log p(\mathbf{x})$ in a polynomial of

$\mathbf{x}$,
$$\log p(\mathbf{x}) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \theta_{123}\, x_1 x_2 x_3 - \psi. \qquad (19)$$
This is an exact formula, since each $x_i$ ($i = 1, 2, 3$) is binary. One can check that the coefficient $\theta = \theta_{123}$ is given by
$$\theta_{123} = \log \frac{p_{111}\, p_{100}\, p_{010}\, p_{001}}{p_{110}\, p_{101}\, p_{011}\, p_{000}}. \qquad (20)$$
The other coefficients are
$$\theta_1 = \log \frac{p_{100}}{p_{000}}, \quad \theta_2 = \log \frac{p_{010}}{p_{000}}, \quad \theta_3 = \log \frac{p_{001}}{p_{000}}, \qquad (21)$$
$$\theta_{12} = \log \frac{p_{110}\, p_{000}}{p_{100}\, p_{010}}, \quad \theta_{23} = \log \frac{p_{011}\, p_{000}}{p_{010}\, p_{001}}, \quad \theta_{13} = \log \frac{p_{101}\, p_{000}}{p_{100}\, p_{001}}, \qquad (22)$$
$$\psi = -\log p_{000}. \qquad (23)$$
Information geometry gives the following theorem.

Theorem 4. The quantity $\theta_{123}$ represents the pure triplewise interaction in the sense that it is orthogonal to any changes in the single and pairwise marginals.

We can prove this directly by calculating the derivatives of the log likelihood. Equation 19 shows that $S_3$ is an exponential family with the canonical parameters $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3, \theta_{12}, \theta_{13}, \theta_{23}, \theta_{123})$. The corresponding expectation parameters are $\boldsymbol{\eta} = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}, \eta_{123})$, so that they are dually orthogonal. We can compose the mixed orthogonal coordinates, denoted by $\boldsymbol{\zeta}_2$, as
$$\boldsymbol{\zeta}_2 = (\boldsymbol{\eta}_2, \theta_{123}) = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}, \theta_{123}). \qquad (24)$$
In this coordinate system, $\boldsymbol{\eta}_2$ and $\theta = \theta_{123}$ are orthogonal. Note that $\theta_{123}$ is not orthogonal to $\theta_{12}, \theta_{13}, \theta_{23}$. Hence, except when there is no triplewise interaction ($\theta_{123} = 0$), the quantities $\theta_{12}$, $\theta_{23}$ and $\theta_{13}$ in (22) do not directly represent the degrees of pairwise correlations of the respective two random variables.
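For reference, the following minimal sketch (illustrative probabilities, our own code, not the authors') evaluates $\theta_{123}$ and a few of the other coefficients of the expansion (19) directly from a 2x2x2 joint table.

```python
# A minimal sketch of the pure triplewise interaction theta_123 of Eq. (20)
# for three binary variables, with a few other theta-coefficients of Eq. (19).
import numpy as np

# p[x1, x2, x3]: joint probabilities of three neurons, all positive (made up).
p = np.array([[[0.15, 0.10], [0.10, 0.08]],
              [[0.12, 0.09], [0.11, 0.25]]])
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)

theta_123 = np.log((p[1,1,1] * p[1,0,0] * p[0,1,0] * p[0,0,1]) /
                   (p[1,1,0] * p[1,0,1] * p[0,1,1] * p[0,0,0]))   # Eq. (20)

theta_1  = np.log(p[1,0,0] / p[0,0,0])                            # Eq. (21)
theta_12 = np.log(p[1,1,0] * p[0,0,0] / (p[1,0,0] * p[0,1,0]))    # Eq. (22)
psi      = -np.log(p[0,0,0])                                      # Eq. (23)

print("theta_123 =", theta_123)
print("theta_1   =", theta_1, " theta_12 =", theta_12, " psi =", psi)
```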

Notably, the submanifold $E_2(0)$ consists of all the distributions having no triple interaction but possibly pairwise interactions. The log probability $\log p(\mathbf{x})$ is quadratic and given by $\log p(\mathbf{x}) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j - \psi$. A stable distribution of a Boltzmann machine in neural networks belongs to this class, because there are no triple interactions among neurons (Amari et al., 1992). The submanifold $E_2(0)$ is characterized by $\theta_{123} = 0$, or in terms of $\boldsymbol{\eta}$ (see Eq. 20) by
$$\eta_{123} = \frac{(\eta_{12} - \eta_{123})(\eta_{13} - \eta_{123})(\eta_{23} - \eta_{123})(1 - \eta_1 - \eta_2 - \eta_3 + \eta_{12} + \eta_{13} + \eta_{23} - \eta_{123})}{(\eta_1 - \eta_{12} - \eta_{13} + \eta_{123})(\eta_2 - \eta_{12} - \eta_{23} + \eta_{123})(\eta_3 - \eta_{13} - \eta_{23} + \eta_{123})}.$$

4.2 Another orthogonal coordinate system

In the above, we extracted the pure triple interaction by using the coordinate $\theta_{123}$, such that $\boldsymbol{\eta}_2$ and $\theta_{123}$ are orthogonal. If we are interested in separating the simple marginals from the various kinds of interactions, we can use another decomposition. Let us summarize the three simple marginals in $\boldsymbol{\eta}_1 = (\eta_1, \eta_2, \eta_3)$ and then summarize all of the interaction terms in $\boldsymbol{\theta}_{1+} = (\theta_{12}, \theta_{13}, \theta_{23}, \theta_{123})$. Here, $\boldsymbol{\theta}_{1+}$ denotes the $\theta$-coordinates complementary to $\boldsymbol{\eta}_1$. Using this pair, we have another mixed coordinate system, denoted by $\boldsymbol{\zeta}_1$, as
$$\boldsymbol{\zeta}_1 = (\zeta_{11}, \dots, \zeta_{17}) = (\boldsymbol{\eta}_1, \boldsymbol{\theta}_{1+}). \qquad (25)$$
Here, $\boldsymbol{\eta}_1$ and $\boldsymbol{\theta}_{1+}$ are orthogonal. Geometrically, let $M_1(\boldsymbol{\eta}_1)$, specified by $\boldsymbol{\eta}_1 = (\eta_1, \eta_2, \eta_3)$, be the set of all the distributions having the same simple

marginals $\boldsymbol{\eta}_1 = (\eta_1, \eta_2, \eta_3)$ but having any pairwise and triplewise correlations. $M_1(\boldsymbol{\eta}_1)$ is a four-dimensional submanifold in which $\boldsymbol{\theta}_{1+}$ takes arbitrary values. On the other hand, let $E_1(\boldsymbol{\theta}_{1+})$ be a three-dimensional submanifold in which all of the distributions have the same $\boldsymbol{\theta}_{1+} = (\theta_{12}, \theta_{13}, \theta_{23}, \theta_{123})$ but different marginals $\boldsymbol{\eta}_1$. We have the following theorem.

Theorem 5. The coordinates $\boldsymbol{\eta}_1$ and $\boldsymbol{\theta}_{1+}$ are orthogonal, that is, $E_1(\boldsymbol{\theta}_{1+})$ is orthogonal to $M_1(\boldsymbol{\eta}_1)$.

Here, $\boldsymbol{\theta}_{1+}$ represents the degrees of pure correlations independent of the marginals $\boldsymbol{\eta}_1$, and it includes correlations resulting from the triplewise interaction in addition to the pairwise interactions. Because of the non-Euclidean character of $S_3$ (Amari and Nagaoka, 2000; Amari, 2001), we cannot have a coordinate system in which $\{\eta_i\}$, $\{\theta_{ij}\}$, and $\theta_{123}$ are all mutually orthogonal. The submanifold $E_1(0)$ has zero pairwise and triplewise correlations and hence consists entirely of independent distributions, in which $\eta_{ij} = \eta_i \eta_j$ and $\eta_{123} = \eta_1 \eta_2 \eta_3$ hold. The function $\log p(\mathbf{x})$ is then linear in $\mathbf{x}$, because $\boldsymbol{\theta}_{1+} = 0$ (see Eq. 19).

4.3 Projections and decompositions of divergence

Using the above two mixed coordinate systems, we decompose a probability distribution in the following two ways. Let us consider two probability distributions, $p(\mathbf{x})$ and $q(\mathbf{x})$, whose coordinates in any coordinate system are denoted by the superscripts $p$ and $q$, respectively. First, let us consider the case where $q$ is the independent uniform distribution. By using the mixed orthogonal coordinate system $\boldsymbol{\zeta}_2$, we now seek

to extract the pure triplewise interaction $\theta_{123}$. For $q$, we have
$$\theta^q_{123} = 0, \qquad \eta^q_1 = \eta^q_2 = \eta^q_3 = \frac{1}{2}, \qquad \eta^q_{12} = \eta^q_{23} = \eta^q_{13} = \frac{1}{4}.$$
Furthermore, we note that $q \in E_2(0)$ and also $q \in E_1(0)$. Let us m-project $p$ to $E_2(0)$ by
$$\bar{p}(\mathbf{x}) = \arg\min_{r \in E_2(0)} D[p(\mathbf{x}) : r(\mathbf{x})].$$
This $\bar{p}$ has the same pairwise marginals as $p$ but does not include any triplewise interaction, and its mixed coordinates are given by $\boldsymbol{\zeta}_2(\bar{p}) = (\boldsymbol{\eta}^p_2, \theta^q_{123}) = (\boldsymbol{\eta}^p_2, 0)$. The Pythagorean theorem gives us
$$D[p : q] = D[p : \bar{p}] + D[\bar{p} : q],$$
where $D[p : \bar{p}]$ represents the degree of pure triplewise interaction, while $D[\bar{p} : q]$ represents how $p$ differs from $q$ in the simple marginals and pairwise correlations. Let us next extract the pairwise interactions in $p(\mathbf{x})$ by using the other mixed coordinates $\boldsymbol{\zeta}_1$. To this end, let us project $p$ to $E_1(0)$, which is composed of independent distributions,
$$\tilde{p}(\mathbf{x}) = \arg\min_{s \in E_1(0)} D[p(\mathbf{x}) : s(\mathbf{x})].$$
More explicitly, we have $\boldsymbol{\zeta}_1(\tilde{p}) = (\boldsymbol{\eta}^p_1, \boldsymbol{\theta}^q_{1+}) = (\boldsymbol{\eta}^p_1, 0)$ and
$$D[p : q] = D[p : \tilde{p}] + D[\tilde{p} : q].$$
Here, $D[p : \tilde{p}]$ summarizes the effect of all the pairwise and triplewise interactions, while $D[\tilde{p} : q]$ represents the difference of the simple marginals from uniformity.

By taking the two decompositions together, we have
$$D[p : q] = D[p : \bar{p}] + D[\bar{p} : \tilde{p}] + D[\tilde{p} : q]. \qquad (26)$$
Here $D[p : \bar{p}]$ represents the degree of pure triplewise interaction in the probability distribution $p$, $D[\bar{p} : \tilde{p}]$ that of the pairwise interactions, and $D[\tilde{p} : q]$ that of the non-uniformity of the firing rates. Let us generalize Eq. 26 by dropping our assumption that $q$ is the independent uniform distribution. We then redefine $\bar{p}$ and $\tilde{p}$ as
$$\bar{p}(\mathbf{x}) = \arg\min_{r \in E_2(\theta^q_{123})} D[p(\mathbf{x}) : r(\mathbf{x})], \qquad \tilde{p}(\mathbf{x}) = \arg\min_{s \in E_1(\boldsymbol{\theta}^q_{1+})} D[p(\mathbf{x}) : s(\mathbf{x})].$$
We now have

Theorem 6.
$$D[p : q] = D[p : \bar{p}] + D[\bar{p} : q] \qquad (27)$$
$$D[p : q] = D[p : \tilde{p}] + D[\tilde{p} : q] \qquad (28)$$
$$D[p : q] = D[p : \bar{p}] + D[\bar{p} : \tilde{p}] + D[\tilde{p} : q]. \qquad (29)$$
The decompositions in the first and second lines are particularly interesting for neural data analysis purposes, as shown in the next section. Any coordinate transformation can be done freely in this three-variable case in a numerical sense. In general, however, coordinate transformations between $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are not easy when the dimensions are high. Later, we discuss several practical approaches in the $n$-neuron case for use in neural data analysis.
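The three-term decomposition (26) can be computed numerically. In the sketch below (our own implementation, with $q$ taken as the uniform distribution), the m-projection $\bar{p}$ onto $E_2(0)$ is obtained by iterative proportional fitting of the pairwise marginals, which converges to the maximum-entropy distribution with those marginals; $\tilde{p}$ is simply the product of the single marginals.

```python
# A sketch of the decomposition D[p:q] = D[p:pbar] + D[pbar:ptilde] + D[ptilde:q]
# of Eq. (26) for three binary variables, with q uniform.  pbar (same pairwise
# marginals as p, no triple interaction) is computed by iterative proportional
# fitting; ptilde is the independent distribution with p's single marginals.
import itertools
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def project_pairwise(p, iters=200):
    """m-projection of p onto E_2(0): max-entropy fit to p's 2-way marginals."""
    r = np.full(p.shape, 1.0 / p.size)
    for _ in range(iters):
        for (i, j) in itertools.combinations(range(p.ndim), 2):
            other = tuple(k for k in range(p.ndim) if k not in (i, j))
            ratio = p.sum(axis=other) / r.sum(axis=other)
            shape = [1] * p.ndim
            shape[i], shape[j] = p.shape[i], p.shape[j]
            r = r * ratio.reshape(shape)       # rescale to match the 2-way marginal
    return r

p = np.array([[[0.15, 0.10], [0.10, 0.08]],   # illustrative joint distribution
              [[0.12, 0.09], [0.11, 0.25]]])
q = np.full(p.shape, 1.0 / p.size)            # uniform distribution

pbar = project_pairwise(p)                    # no pure triple interaction
m = [p.sum(axis=tuple(k for k in range(3) if k != i)) for i in range(3)]
ptilde = m[0][:, None, None] * m[1][None, :, None] * m[2][None, None, :]

print("D[p:q]            =", kl(p, q))
print("sum of three terms=", kl(p, pbar) + kl(pbar, ptilde) + kl(ptilde, q))
print("triple, pairwise, marginal parts:",
      kl(p, pbar), kl(pbar, ptilde), kl(ptilde, q))
```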

4.4 Applications to firing of three neurons

Here, we briefly discuss the application of the above results to the firing of three neurons. The discussion in Section 3.5 can be naturally extended. We consider three binary random variables $X = (X_1, X_2, X_3)$ and denote our estimated probability distribution and the distribution of our null hypothesis by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ and $p(\mathbf{x}; \boldsymbol{\zeta}^0)$, respectively, where $\boldsymbol{\zeta}$ is now a seven-dimensional coordinate system. We use the following decompositions,
$$D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}\right] = D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}''_2\right] + D\left[\hat{\boldsymbol{\zeta}}''_2 : \hat{\boldsymbol{\zeta}}\right] = D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}'_1\right] + D\left[\hat{\boldsymbol{\zeta}}'_1 : \hat{\boldsymbol{\zeta}}\right],$$
where $\hat{\boldsymbol{\zeta}}'_1 = (\boldsymbol{\eta}^0_1, \hat{\boldsymbol{\theta}}_{1+})$ and $\hat{\boldsymbol{\zeta}}''_2 = (\boldsymbol{\eta}^0_2, \hat{\theta}_{123})$. In the first decomposition, $D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}''_2]$ represents the discrepancy in the triplewise interaction of $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ from $p(\mathbf{x}; \boldsymbol{\zeta}^0)$, fixing the pairwise interactions and marginals as specified by $p(\mathbf{x}; \boldsymbol{\zeta}^0)$. $D[\hat{\boldsymbol{\zeta}}''_2 : \hat{\boldsymbol{\zeta}}]$ then collects all the residual discrepancy and, more precisely, represents the discrepancy of the distribution $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ from $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}''_2)$, which has the same simple and second-order marginals as those of $p(\mathbf{x}; \boldsymbol{\zeta}^0)$ (i.e., $\boldsymbol{\eta}^0_2$) and the same triplewise interaction $\hat{\theta}_{123}$ as that of $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$. Therefore, $D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}''_2]$ is particularly useful for investigating whether there is any significant triplewise interaction in our data, i.e., $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$, in comparison with our null hypothesis $p(\mathbf{x}; \boldsymbol{\zeta}^0)$. A significant triplewise interaction, for example, may be considered as indicative of three neurons functioning together. As for hypothesis testing, we can use
$$\lambda_2 = 2ND\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}''_2\right] \approx N g_{77}(\boldsymbol{\zeta}^0_2)\,(\theta^0_{123} - \hat{\theta}_{123})^2 \sim \chi^2(1), \qquad (30)$$
where $N$ is the number of trials and the indices refer to $\boldsymbol{\zeta}_2 = (\zeta_1, \dots, \zeta_6, \zeta_7) = (\boldsymbol{\eta}_2, \theta_{123})$.

In the second decomposition, $D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}'_1]$ represents the discrepancy in both the triplewise and pairwise interactions of $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ from $p(\mathbf{x}; \boldsymbol{\zeta}^0)$, fixing the marginals as specified by $p(\mathbf{x}; \boldsymbol{\zeta}^0)$, while $D[\hat{\boldsymbol{\zeta}}'_1 : \hat{\boldsymbol{\zeta}}]$ collects all the residual discrepancy. $D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}'_1]$ is useful for investigating whether there is a significant coincident firing, taking the pairwise and triplewise interactions together, compared with the null hypothesis. We now have
$$\lambda_1 = 2ND\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}'_1\right] \approx N \sum_{i,j=4}^{7} g_{ij}(\boldsymbol{\zeta}^0_1)(\zeta^0_i - \hat{\zeta}_i)(\zeta^0_j - \hat{\zeta}_j) \sim \chi^2(4), \qquad (31)$$
where the indices are given by $\boldsymbol{\zeta}_1 = (\zeta_1, \dots, \zeta_7) = (\boldsymbol{\eta}_1, \boldsymbol{\theta}_{1+})$. We can also compare two probability distributions estimated under different experimental conditions. Let us denote the two estimated distributions by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}^1)$ and $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}^2)$. We first detect the triplewise interaction. The maximum likelihood estimators, denoted by $\hat{\boldsymbol{\zeta}}^{1\prime}_2$ and $\hat{\boldsymbol{\zeta}}^{2\prime}_2$, of our null hypothesis, that is, $\theta^1_{123} = \theta^2_{123} = \theta_{123}$, are given by
$$(\hat{\boldsymbol{\zeta}}^{1\prime}_2, \hat{\boldsymbol{\zeta}}^{2\prime}_2) = \arg\max \sum_{j=1}^{N} \log p(\mathbf{x}_j; \boldsymbol{\zeta}^1_2)\, p(\mathbf{x}_j; \boldsymbol{\zeta}^2_2) \quad \text{subject to} \quad \theta^1_{123} = \theta^2_{123} = \theta_{123}.$$
Then, we have
$$\lambda = 2ND\left[\hat{\boldsymbol{\zeta}}^{1\prime}_2 : \hat{\boldsymbol{\zeta}}^1_2\right] + 2ND\left[\hat{\boldsymbol{\zeta}}^{2\prime}_2 : \hat{\boldsymbol{\zeta}}^2_2\right] \approx N g_{77}(\hat{\boldsymbol{\zeta}}^{1\prime}_2)(\hat{\theta}^1_{123} - \hat{\theta}'_{123})^2 + N g_{77}(\hat{\boldsymbol{\zeta}}^{2\prime}_2)(\hat{\theta}^2_{123} - \hat{\theta}'_{123})^2 \qquad (32)$$
$$\sim \chi^2(2). \qquad (33)$$
When we investigate the coincident firing, taking the pairwise and triplewise interactions together, we use the second decomposition above. The MLE

of our null hypothesis in this case is given by
$$(\hat{\boldsymbol{\zeta}}^{1\prime}_1, \hat{\boldsymbol{\zeta}}^{2\prime}_1) = \arg\max \sum_{j=1}^{N} \log p(\mathbf{x}_j; \boldsymbol{\zeta}^1_1)\, p(\mathbf{x}_j; \boldsymbol{\zeta}^2_1) \quad \text{subject to} \quad \boldsymbol{\theta}^1_{1+} = \boldsymbol{\theta}^2_{1+}.$$
For hypothesis testing, we can use
$$\lambda = 2ND\left[\hat{\boldsymbol{\zeta}}^{1\prime}_1 : \hat{\boldsymbol{\zeta}}^1_1\right] + 2ND\left[\hat{\boldsymbol{\zeta}}^{2\prime}_1 : \hat{\boldsymbol{\zeta}}^2_1\right] \approx N \sum_{i,j=4}^{7} g_{ij}(\hat{\boldsymbol{\zeta}}^{1\prime}_1)(\hat{\zeta}^{1\prime}_i - \hat{\zeta}^1_i)(\hat{\zeta}^{1\prime}_j - \hat{\zeta}^1_j) + N \sum_{i,j=4}^{7} g_{ij}(\hat{\boldsymbol{\zeta}}^{2\prime}_1)(\hat{\zeta}^{2\prime}_i - \hat{\zeta}^2_i)(\hat{\zeta}^{2\prime}_j - \hat{\zeta}^2_j) \sim \chi^2(8). \qquad (34)$$
The decompositions of the Kullback-Leibler divergence also allow us to decompose the mutual information between the firing pattern of three neurons $X = (X_1, X_2, X_3)$ and the behavior $Y$, in a similar manner to Section 3.6.

Theorem 7.
$$I(X; Y) = E_{p(X, Y)}\left[\log \frac{p(\mathbf{x}, y)}{p(\mathbf{x})\, p(y)}\right] \qquad (35)$$
$$= I_1(X; Y) + I_2(X; Y) \qquad (36)$$
$$= I_3(X; Y) + I_4(X; Y), \qquad (37)$$
where
$$I_1(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_1(X|y) : \boldsymbol{\zeta}_1(X, y)]\right], \qquad I_2(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_1(X, y) : \boldsymbol{\zeta}_1(X)]\right],$$
and we define $\boldsymbol{\zeta}_1(X, y) = (\boldsymbol{\eta}_1(X|y), \boldsymbol{\theta}_{1+}(X))$. Similarly,
$$I_3(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_2(X|y) : \boldsymbol{\zeta}_2(X, y)]\right], \qquad I_4(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_2(X, y) : \boldsymbol{\zeta}_2(X)]\right],$$

and we define $\boldsymbol{\zeta}_2(X, y) = (\boldsymbol{\eta}_2(X|y), \theta_{123}(X))$. In Eq. 36, the mutual information $I(X; Y)$ is decomposed into two parts: $I_1$, the mutual information conveyed by the pairwise and triplewise interactions of the firing, and $I_2$, the mutual information conveyed by the mean firing rate modulation. In Eq. 37, $I(X; Y)$ is decomposed differently: $I_3$, conveyed by the triplewise interaction, and $I_4$, conveyed by the other terms, that is, the pairwise and mean firing rate modulations.

5 General Case: Joint Distributions of $X_1, \dots, X_n$

Here we study the general case of $n$ neurons. Let $X = (X_1, \dots, X_n)$ be $n$ binary variables and let $p = p(\mathbf{x})$, $\mathbf{x} = (x_1, \dots, x_n)$, $x_i = 0, 1$, be its probability, where we assume $p(\mathbf{x}) > 0$ for all $\mathbf{x}$. We begin by briefly recapitulating Amari (2001) for the theoretical framework and then move to its applications.

5.1 Coordinate systems of $S_n$

As mentioned in Section 2, the set of all probability distributions $\{p(\mathbf{x})\}$ forms a $(2^n - 1)$-dimensional manifold $S_n$. Any $p(\mathbf{x})$ in $S_n$ can be represented by the $P$-coordinate system, the $\eta$-coordinate system or the $\theta$-coordinate system. The $P$-coordinate system is defined by
$$p_{i_1 \cdots i_n} = \mathrm{Prob}\{X_1 = i_1, \dots, X_n = i_n\}, \quad i_k = 0, 1, \quad \text{subject to} \quad \sum_{i_1, \dots, i_n} p_{i_1 \cdots i_n} = 1.$$

The $\theta$-coordinate system is defined by the expansion of $\log p(\mathbf{x})$ as
$$\log p(\mathbf{x}) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{1 \cdots n} x_1 \cdots x_n - \psi, \qquad (38)$$
where the indices of $\theta_{ijk}$, etc., satisfy $i < j < k$, and then
$$\boldsymbol{\theta} = (\theta_i, \theta_{ij}, \theta_{ijk}, \dots, \theta_{12 \cdots n}) \qquad (39)$$
has $2^n - 1$ components and forms the $\theta$-coordinate system. It is easy to compute any component of $\boldsymbol{\theta}$; for example, we can get $\theta_1 = \log \frac{p_{10 \cdots 0}}{p_{0 \cdots 0}}$. For later convenience, we use the notation $\boldsymbol{\theta}_1 = (\theta_i)$, $\boldsymbol{\theta}_2 = (\theta_{ij})$, $\boldsymbol{\theta}_3 = (\theta_{ijk})$, $\dots$, $\boldsymbol{\theta}_n = \theta_{12 \cdots n}$, where the index of $\boldsymbol{\theta}_l$ runs over the $l$-tuples among the $n$ binary variables, yielding $\binom{n}{l}$ components ($\binom{n}{l}$ is the binomial coefficient). Then, we can write
$$\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \dots, \boldsymbol{\theta}_n).$$
On the other hand, the $\eta$-coordinate system is defined by using
$$\eta_i = E[x_i] \quad (i = 1, \dots, n), \qquad \eta_{ij} = E[x_i x_j] \quad (i < j), \qquad \dots, \qquad \eta_{12 \cdots n} = E[x_1 \cdots x_n],$$
which has $2^n - 1$ components (see Section 2); in other words, $\boldsymbol{\eta} = (\eta_i, \eta_{ij}, \dots, \eta_{1 \cdots n})$ forms the $\eta$-coordinate system in $S_n$. We also write $\boldsymbol{\eta} = (\boldsymbol{\eta}_1, \boldsymbol{\eta}_2, \dots, \boldsymbol{\eta}_n)$, which is linearly related to $\{p_{i_1 \cdots i_n}\}$. In the rest of this section, we mention some notions of information geometry, in an informal manner, for later convenience; readers who are interested in more details can refer to (Amari and Nagaoka, 2000).
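For small $n$, the transformation from the $P$-coordinates to the full set of $\boldsymbol{\theta}$-coordinates in Eqs. (38)-(39) can be written compactly as a Moebius inversion over subsets of indices, as in the following sketch (our own helper, not the authors' code); the cost grows exponentially with $n$, which is precisely the practical difficulty mentioned above.

```python
# A sketch of the coordinate transformation from the P-coordinates of n binary
# neurons to all theta-coordinates of Eq. (38)-(39), via Moebius inversion
# over subsets.  The function and example table below are our own.
import itertools
import numpy as np

def theta_coordinates(p):
    """p: array of shape (2,)*n with p > 0.  Returns {subset: theta_subset}."""
    n = p.ndim
    logp = np.log(p)
    theta = {}
    for k in range(1, n + 1):                       # order of the interaction
        for A in itertools.combinations(range(n), k):
            val = 0.0
            for r in range(len(A) + 1):
                for B in itertools.combinations(A, r):
                    # log p at the firing pattern with spikes exactly on B
                    x = tuple(1 if i in B else 0 for i in range(n))
                    val += (-1) ** (len(A) - len(B)) * logp[x]
            theta[A] = val
    return theta

# Example with n = 3 (same illustrative table as before).
p = np.array([[[0.15, 0.10], [0.10, 0.08]],
              [[0.12, 0.09], [0.11, 0.25]]])
th = theta_coordinates(p)
print("theta_1   =", th[(0,)])        # log p100/p000
print("theta_12  =", th[(0, 1)])      # log (p110 p000)/(p100 p010)
print("theta_123 =", th[(0, 1, 2)])   # Eq. (20)
```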

When a submanifold of $S_n$, denoted by $E$, is represented by linear constraints among the $\theta$-coordinates, $E$ is called exponentially flat, or e-flat. On the other hand, when a submanifold of $S_n$, denoted by $M$, is represented by linear constraints among the $\eta$-coordinates, $M$ is called mixture flat, or m-flat. The Fisher information matrices in the respective coordinate systems play the role of Riemannian metric tensors. The two coordinate systems $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are dually coupled in the following sense. Let $A$, $B$, etc., denote ordered subsets of indices, which stand for components of $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$, i.e., $\boldsymbol{\theta} = (\theta_A)$, $\boldsymbol{\eta} = (\eta_B)$.

Theorem 8. The two metric tensors $G(\boldsymbol{\theta})$ and $G(\boldsymbol{\eta})$ are mutually inverse,
$$G(\boldsymbol{\eta}) = G(\boldsymbol{\theta})^{-1}, \qquad (40)$$
where $G(\boldsymbol{\theta}) = (g_{AB}(\boldsymbol{\theta}))$ and $G(\boldsymbol{\eta}) = (g^{AB}(\boldsymbol{\eta}))$ are defined by
$$g_{AB}(\boldsymbol{\theta}) = E\left[\frac{\partial \log p(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_A} \frac{\partial \log p(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_B}\right], \qquad g^{AB}(\boldsymbol{\eta}) = E\left[\frac{\partial \log p(\mathbf{x}; \boldsymbol{\eta})}{\partial \eta_A} \frac{\partial \log p(\mathbf{x}; \boldsymbol{\eta})}{\partial \eta_B}\right].$$
The following generalized Pythagoras theorem is known in $S_n$ (Csiszar, 1967b; Csiszar, 1975; Amari et al., 1992; Amari and Han, 1989). It holds in more general cases, playing a most important role in information geometry (Amari, 1987; Amari and Nagaoka, 2000).

Theorem 9. Let $p(\mathbf{x})$, $q(\mathbf{x})$ and $r(\mathbf{x})$ be three distributions such that the m-geodesic connecting $p(\mathbf{x})$ and $q(\mathbf{x})$ is orthogonal to the e-geodesic connecting $q(\mathbf{x})$ and $r(\mathbf{x})$. Then,
$$D[p : q] + D[q : r] = D[p : r]. \qquad (41)$$

5.2 Higher-order interactions

This section aims at defining the higher-order interactions, using the k-cut mixed coordinate system. Section 5.1 introduced $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_n)$ and $\boldsymbol{\eta} = (\boldsymbol{\eta}_1, \dots, \boldsymbol{\eta}_n)$, each of which spans $S_n$. Let us define their partitions, called a k-cut, as follows,
$$\boldsymbol{\eta} = (\boldsymbol{\eta}_{k-}, \boldsymbol{\eta}_{k+}), \qquad \boldsymbol{\theta} = (\boldsymbol{\theta}_{k-}, \boldsymbol{\theta}_{k+}), \qquad (42)$$
where $\boldsymbol{\eta}_{k-}$ and $\boldsymbol{\theta}_{k-}$ consist of the coordinates whose subindices have no more than $k$ indices, i.e., $\boldsymbol{\eta}_{k-} = (\boldsymbol{\eta}_1, \boldsymbol{\eta}_2, \dots, \boldsymbol{\eta}_k)$, $\boldsymbol{\theta}_{k-} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \dots, \boldsymbol{\theta}_k)$, and $\boldsymbol{\eta}_{k+}$ and $\boldsymbol{\theta}_{k+}$ consist of the coordinates whose subindices have more than $k$ indices, i.e., $\boldsymbol{\eta}_{k+} = (\boldsymbol{\eta}_{k+1}, \boldsymbol{\eta}_{k+2}, \dots, \boldsymbol{\eta}_n)$, $\boldsymbol{\theta}_{k+} = (\boldsymbol{\theta}_{k+1}, \boldsymbol{\theta}_{k+2}, \dots, \boldsymbol{\theta}_n)$. First, note that $\boldsymbol{\eta}_{k-}$ specifies the marginal distributions of any $k$ (or fewer than $k$) random variables among $X_1, \dots, X_n$. Let us consider a family of m-flat submanifolds in $S_n$,
$$M_k(\mathbf{m}_k) = \{\boldsymbol{\eta} \mid \boldsymbol{\eta}_{k-} = \mathbf{m}_k\}.$$
It consists of all the distributions having the same $k$-marginals specified by a fixed $\boldsymbol{\eta}_{k-} = \mathbf{m}_k$. They differ from one another only by higher-order interactions of more than $k$ variables. Second, all coordinate curves represented by $\boldsymbol{\theta}_{k+}$ are orthogonal to $\boldsymbol{\eta}_{k-}$, or to any component of $\boldsymbol{\eta}_{k-}$. Hence, $\boldsymbol{\theta}_{k+}$ represents interactions among more than $k$ variables independently of the $k$-marginals $\boldsymbol{\eta}_{k-}$. Then, for a constant vector $\mathbf{c}_k$, let us compose a family of e-flat submanifolds
$$E_{k+}(\mathbf{c}_k) = \{\boldsymbol{\theta} \mid \boldsymbol{\theta}_{k+} = \mathbf{c}_k\}.$$

Third, $E_{k+}(\mathbf{c}_k)$ and $M_k(\mathbf{m}_k)$ are mutually orthogonal, and they introduce a new coordinate system, called the k-cut mixed coordinate system, defined by
$$\boldsymbol{\zeta}_k = (\boldsymbol{\eta}_{k-}, \boldsymbol{\theta}_{k+}).$$
Any k-cut mixed coordinate system forms a coordinate system of $S_n$. A change in the $\boldsymbol{\theta}_{k+}$ part preserves the $k$-marginals of $p(\mathbf{x})$ (i.e., $\boldsymbol{\eta}_{k-}$), while a change in the $\boldsymbol{\eta}_{k-}$ part preserves the interactions among more than $k$ variables. These changes are mutually orthogonal. Thus, $E_{k+}(\boldsymbol{\theta}_{k+})$ is regarded as the submanifold consisting of distributions having the same degree of higher-order interactions. When $\boldsymbol{\theta}_{k+} = 0$, $E_{k+}(0)$ denotes the set of all the distributions having no intrinsic interactions of more than $k$ variables.

5.3 Projections and decompositions of higher-order interactions

Given $p(\mathbf{x})$, we define $p^{(k)}(\mathbf{x}) = \Pi^{(k)} p$ by
$$p^{(k)}(\mathbf{x}) = \Pi^{(k)} p = \arg\min_{q \in E_{k+}(0)} D[p : q].$$
This is the point closest to $p$ among those that do not have intrinsic interactions of more than $k$ variables. We note that another characterization of $p^{(k)}$ is given by
$$p^{(k)}(\mathbf{x}) = \arg\min_{q \in M_k(\boldsymbol{\eta}^p_{k-})} D\left[q : p^{(0)}\right],$$

where it should be easy to see that $p^{(0)}$ is the uniform distribution, by definition of $p^{(0)}$. The e-geodesic connecting $p^{(k)}$ and $p^{(0)}$ is orthogonal to $M_k(\boldsymbol{\eta}^p_{k-})$, to which the original $p$ belongs. The k-cut mixed coordinates of $p^{(k)}$ are given by $\boldsymbol{\zeta}_k(p^{(k)}) = (\boldsymbol{\eta}_{k-}, \boldsymbol{\theta}_{k+} = 0)$. The degree of interactions higher than $k$ is hence defined by $D[p : p^{(k)}]$. Since the m-geodesic connecting $p$ and $p^{(k)}$ is orthogonal to $E_{k+}(0)$, the Pythagoras theorem guarantees the following decomposition,
$$D\left[p : p^{(0)}\right] = D\left[p : p^{(k)}\right] + D\left[p^{(k)} : p^{(0)}\right].$$
Let us put
$$D_k(p) = D\left[p^{(k)} : p^{(k-1)}\right].$$
Then, $D_k(p)$ is interpreted as the degree of interaction purely among $k$ variables. We then have the following decomposition, in which $D_k(p)$ denotes the degree of interaction among $k$ variables.

Theorem 10.
$$D\left[p : p^{(0)}\right] = \sum_{k=1}^{n} D_k(p). \qquad (43)$$
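Theorem 10 can be illustrated numerically for small $n$ by computing each projection $p^{(k)}$ with iterative proportional fitting of all $k$-th order marginals. The sketch below is our own generalization of the earlier fitting helper (illustrative distribution, not the paper's data).

```python
# A sketch of the hierarchical decomposition of Theorem 10:
# D[p : p^(0)] = sum_k D_k(p), where p^(k) is the maximum-entropy distribution
# sharing all k-th (and lower) order marginals with p.
import itertools
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def project_order_k(p, k, iters=300):
    """m-projection of p onto E_{k+}(0): fit all k-way marginals of p by IPF."""
    n = p.ndim
    if k == 0:
        return np.full(p.shape, 1.0 / p.size)
    if k >= n:
        return p.copy()
    r = np.full(p.shape, 1.0 / p.size)
    for _ in range(iters):
        for A in itertools.combinations(range(n), k):
            other = tuple(i for i in range(n) if i not in A)
            ratio = p.sum(axis=other) / r.sum(axis=other)
            shape = [p.shape[i] if i in A else 1 for i in range(n)]
            r = r * ratio.reshape(shape)
    return r

# Illustrative joint distribution of n = 3 neurons (all probabilities > 0).
p = np.array([[[0.15, 0.10], [0.10, 0.08]],
              [[0.12, 0.09], [0.11, 0.25]]])
n = p.ndim

proj = [project_order_k(p, k) for k in range(n + 1)]       # p^(0), ..., p^(n)
D_k = [kl(proj[k], proj[k - 1]) for k in range(1, n + 1)]  # per-order terms

print("D_k for k = 1..n :", D_k)
print("sum of D_k       =", sum(D_k))
print("D[p : p^(0)]     =", kl(p, proj[0]))                # Theorem 10 check
```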

It is straightforward to generalize the above results to the case where we are given two distributions $p(\mathbf{x})$ and $q(\mathbf{x})$. Let us define
$$\boldsymbol{\zeta}^{(k')}_k = \boldsymbol{\zeta}_k(p^{(k')}) = (\boldsymbol{\eta}_{k-}(p), \boldsymbol{\theta}_{k+}(q)).$$
Then, we have
$$D[p : q] = D\left[\boldsymbol{\zeta}^p_k : \boldsymbol{\zeta}^q_k\right] = D\left[\boldsymbol{\zeta}^p_k : \boldsymbol{\zeta}^{(k')}_k\right] + D\left[\boldsymbol{\zeta}^{(k')}_k : \boldsymbol{\zeta}^q_k\right],$$
which is induced from Theorem 9. By defining
$$D_{k'}(p) = D\left[p^{(k')} : p^{((k-1)')}\right],$$
we obtain
$$D[p : q] = \sum_{k=1}^{n} D_{k'}(p). \qquad (44)$$
The decompositions shown in Eqs. 43 and 44 are obviously similar to each other. A critical difference, however, exists in the interpretation of the two decompositions. Each term in Eq. 43, $D_k(p)$, represents the degree of the purely $k$-th order interaction, whereas $D_{k'}(p)$ in Eq. 44 does not necessarily do so. This is because $\boldsymbol{\zeta}_k(p^{(k)})$ always has $\boldsymbol{\theta}_{k+} = 0$, that is, zero in the coordinates of order higher than $k$. On the other hand, $\boldsymbol{\zeta}_k(p^{(k')})$ does not necessarily have zero in the corresponding part. In other words, $\boldsymbol{\zeta}_k$ represents the pure $k$-th order interaction only if $\boldsymbol{\theta}_{k+} = 0$.

5.4 Application to neural firing

To apply the results in the above sections to neural firing data, the discussion for the case of three neurons can be directly applied. Hence, we mainly provide some remarks in this section. First, suppose that we have an estimated probability distribution of $n$ neurons, denoted by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$, and a probability distribution of our null hypothesis, $p(\mathbf{x}; \boldsymbol{\zeta}^0)$. Then, using the k-cut mixed coordinates, we obtain the decomposition
$$D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}] = D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}^0_k] + D[\hat{\boldsymbol{\zeta}}^0_k : \hat{\boldsymbol{\zeta}}]$$


Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Information geometry of Bayesian statistics

Information geometry of Bayesian statistics Information geometry of Bayesian statistics Hiroshi Matsuzoe Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya 466-8555, Japan Abstract.

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

These outputs can be written in a more convenient form: with y(i) = Hc m (i) n(i) y(i) = (y(i); ; y K (i)) T ; c m (i) = (c m (i); ; c m K(i)) T and n

These outputs can be written in a more convenient form: with y(i) = Hc m (i) n(i) y(i) = (y(i); ; y K (i)) T ; c m (i) = (c m (i); ; c m K(i)) T and n Binary Codes for synchronous DS-CDMA Stefan Bruck, Ulrich Sorger Institute for Network- and Signal Theory Darmstadt University of Technology Merckstr. 25, 6428 Darmstadt, Germany Tel.: 49 65 629, Fax:

More information

Model Complexity of Pseudo-independent Models

Model Complexity of Pseudo-independent Models Model Complexity of Pseudo-independent Models Jae-Hyuck Lee and Yang Xiang Department of Computing and Information Science University of Guelph, Guelph, Canada {jaehyuck, yxiang}@cis.uoguelph,ca Abstract

More information

Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation

Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation NOTE Communicated by Jonathan Victor Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation Robert E. Kass kass@stat.cmu.edu Valérie Ventura vventura@stat.cmu.edu

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

Information geometry in optimization, machine learning and statistical inference

Information geometry in optimization, machine learning and statistical inference Front. Electr. Electron. Eng. China 2010, 5(3): 241 260 DOI 10.1007/s11460-010-0101-3 Shun-ichi AMARI Information geometry in optimization, machine learning and statistical inference c Higher Education

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,

More information

Information Geometric view of Belief Propagation

Information Geometric view of Belief Propagation Information Geometric view of Belief Propagation Yunshu Liu 2013-10-17 References: [1]. Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, Stochastic reasoning, Free energy and Information Geometry, Neural

More information

Natural Gradient Learning for Over- and Under-Complete Bases in ICA

Natural Gradient Learning for Over- and Under-Complete Bases in ICA NOTE Communicated by Jean-François Cardoso Natural Gradient Learning for Over- and Under-Complete Bases in ICA Shun-ichi Amari RIKEN Brain Science Institute, Wako-shi, Hirosawa, Saitama 351-01, Japan Independent

More information

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY OUTLINE 3.1 Why Probability? 3.2 Random Variables 3.3 Probability Distributions 3.4 Marginal Probability 3.5 Conditional Probability 3.6 The Chain

More information

Information Geometric Structure on Positive Definite Matrices and its Applications

Information Geometric Structure on Positive Definite Matrices and its Applications Information Geometric Structure on Positive Definite Matrices and its Applications Atsumi Ohara Osaka University 2010 Feb. 21 at Osaka City University 大阪市立大学数学研究所情報幾何関連分野研究会 2010 情報工学への幾何学的アプローチ 1 Outline

More information

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition)

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition) Vector Space Basics (Remark: these notes are highly formal and may be a useful reference to some students however I am also posting Ray Heitmann's notes to Canvas for students interested in a direct computational

More information

In: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999

In: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999 In: Advances in Intelligent Data Analysis (AIDA), Computational Intelligence Methods and Applications (CIMA), International Computer Science Conventions Rochester New York, 999 Feature Selection Based

More information

STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY

STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY 2nd International Symposium on Information Geometry and its Applications December 2-6, 2005, Tokyo Pages 000 000 STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY JUN-ICHI TAKEUCHI, ANDREW R. BARRON, AND

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Connexions module: m11446 1 Maximum Likelihood Estimation Clayton Scott Robert Nowak This work is produced by The Connexions Project and licensed under the Creative Commons Attribution License Abstract

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

2 Kanatani Fig. 1. Class A is never chosen whatever distance measure is used. not more than the residual of point tting. This is because the discrepan

2 Kanatani Fig. 1. Class A is never chosen whatever distance measure is used. not more than the residual of point tting. This is because the discrepan International Journal of Computer Vision, 26, 1{21 (1998) c 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. Geometric Information Criterion for Model Selection KENICHI KANATANI

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Detection of spike patterns using pattern ltering, with applications to sleep replay in birdsong

Detection of spike patterns using pattern ltering, with applications to sleep replay in birdsong Neurocomputing 52 54 (2003) 19 24 www.elsevier.com/locate/neucom Detection of spike patterns using pattern ltering, with applications to sleep replay in birdsong Zhiyi Chi a;, Peter L. Rauske b, Daniel

More information

PRIME GENERATING LUCAS SEQUENCES

PRIME GENERATING LUCAS SEQUENCES PRIME GENERATING LUCAS SEQUENCES PAUL LIU & RON ESTRIN Science One Program The University of British Columbia Vancouver, Canada April 011 1 PRIME GENERATING LUCAS SEQUENCES Abstract. The distribution of

More information

Why is Deep Learning so effective?

Why is Deep Learning so effective? Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap

More information

Applications of Information Geometry to Hypothesis Testing and Signal Detection

Applications of Information Geometry to Hypothesis Testing and Signal Detection CMCAA 2016 Applications of Information Geometry to Hypothesis Testing and Signal Detection Yongqiang Cheng National University of Defense Technology July 2016 Outline 1. Principles of Information Geometry

More information

Estimation of information-theoretic quantities

Estimation of information-theoretic quantities Estimation of information-theoretic quantities Liam Paninski Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/ liam liam@gatsby.ucl.ac.uk November 16, 2004 Some

More information

2 JOSE BURILLO It was proved by Thurston [2, Ch.8], using geometric methods, and by Gersten [3], using combinatorial methods, that the integral 3-dime

2 JOSE BURILLO It was proved by Thurston [2, Ch.8], using geometric methods, and by Gersten [3], using combinatorial methods, that the integral 3-dime DIMACS Series in Discrete Mathematics and Theoretical Computer Science Volume 00, 1997 Lower Bounds of Isoperimetric Functions for Nilpotent Groups Jose Burillo Abstract. In this paper we prove that Heisenberg

More information

Neuronal Tuning: To Sharpen or Broaden?

Neuronal Tuning: To Sharpen or Broaden? NOTE Communicated by Laurence Abbott Neuronal Tuning: To Sharpen or Broaden? Kechen Zhang Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute for Biological Studies,

More information

Inner Product Spaces

Inner Product Spaces Inner Product Spaces Introduction Recall in the lecture on vector spaces that geometric vectors (i.e. vectors in two and three-dimensional Cartesian space have the properties of addition, subtraction,

More information

Error Empirical error. Generalization error. Time (number of iteration)

Error Empirical error. Generalization error. Time (number of iteration) Submitted to Neural Networks. Dynamics of Batch Learning in Multilayer Networks { Overrealizability and Overtraining { Kenji Fukumizu The Institute of Physical and Chemical Research (RIKEN) E-mail: fuku@brain.riken.go.jp

More information

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. A Superharmonic Prior for the Autoregressive Process of the Second Order

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. A Superharmonic Prior for the Autoregressive Process of the Second Order MATHEMATICAL ENGINEERING TECHNICAL REPORTS A Superharmonic Prior for the Autoregressive Process of the Second Order Fuyuhiko TANAKA and Fumiyasu KOMAKI METR 2006 30 May 2006 DEPARTMENT OF MATHEMATICAL

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

460 HOLGER DETTE AND WILLIAM J STUDDEN order to examine how a given design behaves in the model g` with respect to the D-optimality criterion one uses

460 HOLGER DETTE AND WILLIAM J STUDDEN order to examine how a given design behaves in the model g` with respect to the D-optimality criterion one uses Statistica Sinica 5(1995), 459-473 OPTIMAL DESIGNS FOR POLYNOMIAL REGRESSION WHEN THE DEGREE IS NOT KNOWN Holger Dette and William J Studden Technische Universitat Dresden and Purdue University Abstract:

More information

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix Definition: Let L : V 1 V 2 be a linear operator. The null space N (L) of L is the subspace of V 1 defined by N (L) = {x

More information

Independent Component Analysis on the Basis of Helmholtz Machine

Independent Component Analysis on the Basis of Helmholtz Machine Independent Component Analysis on the Basis of Helmholtz Machine Masashi OHATA *1 ohatama@bmc.riken.go.jp Toshiharu MUKAI *1 tosh@bmc.riken.go.jp Kiyotoshi MATSUOKA *2 matsuoka@brain.kyutech.ac.jp *1 Biologically

More information

COMPSCI 650 Applied Information Theory Jan 21, Lecture 2

COMPSCI 650 Applied Information Theory Jan 21, Lecture 2 COMPSCI 650 Applied Information Theory Jan 21, 2016 Lecture 2 Instructor: Arya Mazumdar Scribe: Gayane Vardoyan, Jong-Chyi Su 1 Entropy Definition: Entropy is a measure of uncertainty of a random variable.

More information

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science Three-dimensional Stable Matching Problems Cheng Ng and Daniel S Hirschberg Department of Information and Computer Science University of California, Irvine Irvine, CA 92717 Abstract The stable marriage

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

2 1. Introduction. Neuronal networks often exhibit a rich variety of oscillatory behavior. The dynamics of even a single cell may be quite complicated

2 1. Introduction. Neuronal networks often exhibit a rich variety of oscillatory behavior. The dynamics of even a single cell may be quite complicated GEOMETRIC ANALYSIS OF POPULATION RHYTHMS IN SYNAPTICALLY COUPLED NEURONAL NETWORKS J. Rubin and D. Terman Dept. of Mathematics; Ohio State University; Columbus, Ohio 43210 Abstract We develop geometric

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Maximum Likelihood (ML) Estimation

Maximum Likelihood (ML) Estimation Econometrics 2 Fall 2004 Maximum Likelihood (ML) Estimation Heino Bohn Nielsen 1of32 Outline of the Lecture (1) Introduction. (2) ML estimation defined. (3) ExampleI:Binomialtrials. (4) Example II: Linear

More information

Disambiguating Different Covariation Types

Disambiguating Different Covariation Types NOTE Communicated by George Gerstein Disambiguating Different Covariation Types Carlos D. Brody Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 925, U.S.A. Covariations

More information

Page 52. Lecture 3: Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 2008/10/03 Date Given: 2008/10/03

Page 52. Lecture 3: Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 2008/10/03 Date Given: 2008/10/03 Page 5 Lecture : Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 008/10/0 Date Given: 008/10/0 Inner Product Spaces: Definitions Section. Mathematical Preliminaries: Inner

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Comparison of single spike train descriptive models. by information geometric measure

Comparison of single spike train descriptive models. by information geometric measure RIKEN BSI BSIS Technical Report: No05-1. Comparison of single spie train descriptive models by information geometric measure Hiroyui Naahara 1,3,, Shun-ichi Amari 1, Barry J. Richmond 2 1 Lab. for Mathematical

More information

On the Equivariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund Computer Vision Laboratory, Depa

On the Equivariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund Computer Vision Laboratory, Depa On the Invariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund LiTH-ISY-R-530 993-09-08 On the Equivariance of the Orientation and the Tensor Field

More information

Example Bases and Basic Feasible Solutions 63 Let q = >: ; > and M = >: ;2 > and consider the LCP (q M). The class of ; ;2 complementary cones

Example Bases and Basic Feasible Solutions 63 Let q = >: ; > and M = >: ;2 > and consider the LCP (q M). The class of ; ;2 complementary cones Chapter 2 THE COMPLEMENTARY PIVOT ALGORITHM AND ITS EXTENSION TO FIXED POINT COMPUTING LCPs of order 2 can be solved by drawing all the complementary cones in the q q 2 - plane as discussed in Chapter.

More information

G. Larry Bretthorst. Washington University, Department of Chemistry. and. C. Ray Smith

G. Larry Bretthorst. Washington University, Department of Chemistry. and. C. Ray Smith in Infrared Systems and Components III, pp 93.104, Robert L. Caswell ed., SPIE Vol. 1050, 1989 Bayesian Analysis of Signals from Closely-Spaced Objects G. Larry Bretthorst Washington University, Department

More information

Inference. Data. Model. Variates

Inference. Data. Model. Variates Data Inference Variates Model ˆθ (,..., ) mˆθn(d) m θ2 M m θ1 (,, ) (,,, ) (,, ) α = :=: (, ) F( ) = = {(, ),, } F( ) X( ) = Γ( ) = Σ = ( ) = ( ) ( ) = { = } :=: (U, ) , = { = } = { = } x 2 e i, e j

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure

Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure Entropy 014, 16, 131-145; doi:10.3390/e1604131 OPEN ACCESS entropy ISSN 1099-4300 www.mdpi.com/journal/entropy Article Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable

More information

Unsupervised learning: beyond simple clustering and PCA

Unsupervised learning: beyond simple clustering and PCA Unsupervised learning: beyond simple clustering and PCA Liza Rebrova Self organizing maps (SOM) Goal: approximate data points in R p by a low-dimensional manifold Unlike PCA, the manifold does not have

More information

12 : Variational Inference I

12 : Variational Inference I 10-708: Probabilistic Graphical Models, Spring 2015 12 : Variational Inference I Lecturer: Eric P. Xing Scribes: Fattaneh Jabbari, Eric Lei, Evan Shapiro 1 Introduction Probabilistic inference is one of

More information

Efficient Coding. Odelia Schwartz 2017

Efficient Coding. Odelia Schwartz 2017 Efficient Coding Odelia Schwartz 2017 1 Levels of modeling Descriptive (what) Mechanistic (how) Interpretive (why) 2 Levels of modeling Fitting a receptive field model to experimental data (e.g., using

More information

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M =

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = CALCULUS ON MANIFOLDS 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = a M T am, called the tangent bundle, is itself a smooth manifold, dim T M = 2n. Example 1.

More information

Series 7, May 22, 2018 (EM Convergence)

Series 7, May 22, 2018 (EM Convergence) Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18

More information

RESEARCH STATEMENT. Nora Youngs, University of Nebraska - Lincoln

RESEARCH STATEMENT. Nora Youngs, University of Nebraska - Lincoln RESEARCH STATEMENT Nora Youngs, University of Nebraska - Lincoln 1. Introduction Understanding how the brain encodes information is a major part of neuroscience research. In the field of neural coding,

More information

Measuring Information Spatial Densities

Measuring Information Spatial Densities LETTER Communicated by Misha Tsodyks Measuring Information Spatial Densities Michele Bezzi michele@dma.unifi.it Cognitive Neuroscience Sector, S.I.S.S.A, Trieste, Italy, and INFM sez. di Firenze, 2 I-50125

More information

Probability on a Riemannian Manifold

Probability on a Riemannian Manifold Probability on a Riemannian Manifold Jennifer Pajda-De La O December 2, 2015 1 Introduction We discuss how we can construct probability theory on a Riemannian manifold. We make comparisons to this and

More information

Initial-Value Problems in General Relativity

Initial-Value Problems in General Relativity Initial-Value Problems in General Relativity Michael Horbatsch March 30, 2006 1 Introduction In this paper the initial-value formulation of general relativity is reviewed. In section (2) domains of dependence,

More information

1. Introduction As is well known, the bosonic string can be described by the two-dimensional quantum gravity coupled with D scalar elds, where D denot

1. Introduction As is well known, the bosonic string can be described by the two-dimensional quantum gravity coupled with D scalar elds, where D denot RIMS-1161 Proof of the Gauge Independence of the Conformal Anomaly of Bosonic String in the Sense of Kraemmer and Rebhan Mitsuo Abe a; 1 and Noboru Nakanishi b; 2 a Research Institute for Mathematical

More information

Dually Flat Geometries in the State Space of Statistical Models

Dually Flat Geometries in the State Space of Statistical Models 1/ 12 Dually Flat Geometries in the State Space of Statistical Models Jan Naudts Universiteit Antwerpen ECEA, November 2016 J. Naudts, Dually Flat Geometries in the State Space of Statistical Models. In

More information

Information Theory in Intelligent Decision Making

Information Theory in Intelligent Decision Making Information Theory in Intelligent Decision Making Adaptive Systems and Algorithms Research Groups School of Computer Science University of Hertfordshire, United Kingdom June 7, 2015 Information Theory

More information

1 Introduction Tasks like voice or face recognition are quite dicult to realize with conventional computer systems, even for the most powerful of them

1 Introduction Tasks like voice or face recognition are quite dicult to realize with conventional computer systems, even for the most powerful of them Information Storage Capacity of Incompletely Connected Associative Memories Holger Bosch Departement de Mathematiques et d'informatique Ecole Normale Superieure de Lyon Lyon, France Franz Kurfess Department

More information

LECTURE 28: VECTOR BUNDLES AND FIBER BUNDLES

LECTURE 28: VECTOR BUNDLES AND FIBER BUNDLES LECTURE 28: VECTOR BUNDLES AND FIBER BUNDLES 1. Vector Bundles In general, smooth manifolds are very non-linear. However, there exist many smooth manifolds which admit very nice partial linear structures.

More information

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v 250) Contents 2 Vector Spaces 1 21 Vectors in R n 1 22 The Formal Denition of a Vector Space 4 23 Subspaces 6 24 Linear Combinations and

More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

i=1 h n (ˆθ n ) = 0. (2)

i=1 h n (ˆθ n ) = 0. (2) Stat 8112 Lecture Notes Unbiased Estimating Equations Charles J. Geyer April 29, 2012 1 Introduction In this handout we generalize the notion of maximum likelihood estimation to solution of unbiased estimating

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 3, 25 Random Variables Until now we operated on relatively simple sample spaces and produced measure functions over sets of outcomes. In many situations,

More information

ECE534, Spring 2018: Solutions for Problem Set #3

ECE534, Spring 2018: Solutions for Problem Set #3 ECE534, Spring 08: Solutions for Problem Set #3 Jointly Gaussian Random Variables and MMSE Estimation Suppose that X, Y are jointly Gaussian random variables with µ X = µ Y = 0 and σ X = σ Y = Let their

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information