
To appear in Neural Computation

Information-Geometric Measure for Neural Spikes

Hiroyuki Nakahara, Shun-ichi Amari
Lab. for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Saitama, Japan

Abstract
The present study introduces information-geometric measures to analyze neural firing patterns by taking not only the second-order but also higher-order interactions among neurons into account. Information geometry provides useful tools and concepts for this purpose, including the orthogonality of coordinate parameters and the Pythagoras relation in the Kullback-Leibler divergence. Based on this orthogonality, we show a novel method to analyze spike firing patterns by decomposing the interactions of neurons of various orders. As a result, purely pairwise, triplewise, and higher-order interactions are singled out. We also demonstrate the benefits of our proposal by using real neural data, recorded in the prefrontal and parietal cortices of monkeys.

1 Introduction

One of the central challenges in neuroscience is to understand what and how information is carried by a population of neural firing (Georgopoulos et al., 1986; Abeles, 1991; Aertsen and Arndt, 1993; Singer and Gray, 1995; Deadwyler and Hampson, 1997; Parker and Newsome, 1998). As a first step towards this end, many experimental studies have shown that the mean firing rate of each single neuron can be significantly modulated by experimental conditions and thereby may carry information about these experimental conditions, that is, sensory and/or motor signals. Information conveyed by a population of firing neurons, however, may not only be a sum of mean firing rates. Other statistical structures embedded in the neural firing may also carry behavioral information. In particular, growing attention has been paid to the possibility that coincident firing, correlated firing, synchronization, or specific firing patterns may alter conveyed information and/or carry significant behavioral information, whether such a possibility is supported or discarded (Gerstein et al., 1989; Engel et al., 1992; Wilson and McNaughton, 1993; Zohary et al., 1994; Vaadia et al., 1995; Nicolelis et al., 1997; Riehle et al., 1997; Lisman, 1997; Zhang et al., 1998; Maynard et al., 1999; Nadasdy et al., 1999; Kudrimoti et al., 1999; Oram et al., 1999; Nawrot et al., 1999; Baker and Lemon, 2000; Reinagel and Reid, 2000; Steinmetz et al., 2000; Salinas and Sejnowski, 2001; Oram et al., 2001). For this purpose, it is important to develop a sound statistical method for analyzing neural data. An obvious first step is to investigate a significant coincident firing between two neurons, i.e., the pairwise correlation (Perkel

et al., 1967; Palm, 1981; Gerstein and Aertsen, 1985; Palm et al., 1988; Aertsen et al., 1989; Grun, 1996; Ito and Tsuji, 2000; Pauluis and Baker, 2000; Roy et al., 2000; Grun et al., 2002a; Grun et al., 2002b; Gutig et al., 2002). In general, however, it is not sufficient to test a pairwise correlation of neural firing, because there can be triplewise and higher correlations. For example, three variables (neurons) are not independent in general even when they are pairwise independent. We need to establish a systematic method of analysis which includes these higher-order correlations (Abeles and Gerstein, 1988; Abeles et al., 1993; Martignon et al., 1995; Grun, 1996; Tetko and Villa, 1992; Victor and Purpura, 1997; Prut et al., 1998; Del Prete and Martignon, 1998; MacLeod et al., 1998; Martignon et al., 2000; Bohte et al., 2000; Roy et al., 2000). We are mostly interested in methods able to address the following issues: (1) to analyze correlated firing of neurons, including higher-order interactions, and (2) further to connect such a technique with behavioral events, for which we use mutual information between firing and behavior (Tsukada et al., 1975; Optican and Richmond, 1987; Richmond et al., 1990; McClurkin et al., 1991; Bialek et al., 1991; Gawne and Richmond, 1993; Tovee et al., 1993; Abbott et al., 1996; Rolls et al., 1997; Richmond and Gawne, 1998; Kitazawa et al., 1998; Sugase et al., 1999; Panzeri et al., 1999a; Panzeri et al., 1999b; Brenner et al., 2000; Samengo et al., 2000; Panzeri and Schultz, 2001). To address these issues, the present study uses the orthogonality of the natural and expectation parameters in the exponential family of distributions and proposes methods useful for analyzing a population of neural

firing in a systematic manner, based on information geometry (Amari, 1985; Amari and Nagaoka, 2000) and the theory of hierarchical structure (Amari, 2001). By use of the orthogonal coordinates, we will show that both hypothesis testing of neural interaction and calculation of mutual information can be drastically simplified. An extended abstract previously appeared as (Nakahara and Amari, 2002). The present paper is organized as follows. In Section 2, we briefly give our perspective on the merits of using an information-geometric measure. In Section 3, we begin with an introductory description of information geometry, using two random binary variables, and treat the application of this two-variable case to the analysis of two neurons' firing. Section 4 discusses the interaction of three binary variables and shows how to extract the pure triplewise correlation, which is different from pairwise correlation. Section 5 gives a general theory of decomposition of correlations among n variables and discusses some approaches to overcome practical difficulties that arise in this case. Section 6 gives illustrative examples. Section 7 gives the Discussion.

2 Perspective

In this section, we state our perspective on the merits of using an information-geometric measure, briefly referring to a general case of n neurons. A detailed discussion of the general case is given in Section 5. We represent a neural firing pattern by a binary random vector variable so that the probability distribution of firing (of any number of neurons) can be exactly expanded by a log-linear model. Let $X = (X_1, \dots, X_n)$

be $n$ binary variables and let $p = p(\mathbf{x})$, $\mathbf{x} = (x_1, \dots, x_n)$, $x_i = 0, 1$, be its probability, where we assume $p(\mathbf{x}) > 0$ for all $\mathbf{x}$. Each $X_i$ indicates that the $i$-th neuron is silent ($X_i(\Delta t_i) = 0$) or has a spike ($X_i(\Delta t_i) = 1$) in a short time bin, which is denoted by $\Delta t_i$. In general, $\Delta t_i$ can be different for each neuron, but in the present paper we assume $\Delta t_i = \Delta t$ for $i = 1, \dots, n$ for simplicity and drop $\Delta t$ in the following notation (see Discussion). Each $p(\mathbf{x})$ is given by the $2^n$ probabilities
$$p_{i_1 \cdots i_n} = \mathrm{Prob}\{X_1 = i_1, \dots, X_n = i_n\}, \quad i_k = 0, 1, \quad \text{subject to} \quad \sum_{i_1, \dots, i_n} p_{i_1 \cdots i_n} = 1,$$
and hence the set of all the probability distributions $\{p(\mathbf{x})\}$ forms a $(2^n - 1)$-dimensional manifold $S_n$. One coordinate system of $S_n$ is given by the expectation parameters,
$$\eta_i = E[x_i] = \mathrm{Prob}\{x_i = 1\}, \quad i = 1, \dots, n,$$
$$\eta_{ij} = E[x_i x_j] = \mathrm{Prob}\{x_i = x_j = 1\}, \quad i < j, \quad \dots, \quad \eta_{12 \cdots n} = E[x_1 \cdots x_n] = \mathrm{Prob}\{x_1 = x_2 = \cdots = x_n = 1\},$$
which have $2^n - 1$ components. This coordinate system is called the $\eta$-coordinates and, in more general terms, defines the m-flat structure in $S_n$ (see Section 5). On the other hand, $\log p(\mathbf{x})$ can be exactly expanded as
$$\log p(\mathbf{x}) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{1 \cdots n} x_1 \cdots x_n - \psi,$$
where the indices of $\theta_{ijk}$, etc., satisfy $i < j < k$, etc., and $\psi$ is a normalization term, corresponding to $-\log p(x_1 = x_2 = \cdots = x_n = 0)$. All $\theta_{ijk}$, etc.,

together have $2^n - 1$ components and form another coordinate system, called the $\theta$-coordinates, corresponding to the e-flat structure in $S_n$ (see Section 5). Findings in information geometry assure us that e-flat and m-flat manifolds are dually flat: the $\theta$-coordinates and $\eta$-coordinates are dually orthogonal coordinates. The properties of the dual orthogonal coordinates remarkably simplify some apparently complicated issues. For example, the generalized Pythagoras theorem gives a decomposition of the Kullback-Leibler divergence by which we can inspect different contributions in the discrepancy of two probability distributions, or contributions of different-order interactions in neural firing. This is a global property of the dual orthogonal coordinates in the probability space. As a local property, the dual orthogonal coordinates give a simple form of the Fisher information metric, which is useful, for example, in hypothesis testing. The present study exploits these properties. In the next section, we start from the case of two neurons.

3 Pairwise Interaction, Mutual Information and Orthogonal Decomposition

3.1 Orthogonal coordinates

Let us begin with two binary random variables $X_1$ and $X_2$ whose joint probability $p(\mathbf{x})$, $\mathbf{x} = (x_1, x_2)$, is given by
$$p_{ij} = \mathrm{Prob}\{X_1 = i, X_2 = j\} > 0, \quad i, j = 0, 1.$$
Among the four probabilities $\{p_{00}, p_{01}, p_{10}, p_{11}\}$, only three are free, because

of the constraint $p_{00} + p_{01} + p_{10} + p_{11} = 1$. Thus, the set of all such distributions of $\mathbf{x}$ forms a three-dimensional manifold $S_2$, where the suffix 2 refers to the number of random variables in $\mathbf{x}$. Any three of the $p_{ij}$ can be used as a coordinate system of $S_2$, which we call the $P$-coordinates for later convenience. In the context of neural firing, the random variables $X_1$ and $X_2$ stand for two neurons, neuron 1 and neuron 2. $X_i = 1$ and $X_i = 0$ indicate whether neuron $i$ ($i = 1, 2$) has a spike or not in a short time bin. A distribution $p(\mathbf{x})$ can be decomposed into marginal and (pairwise) correlational components. The two quantities
$$\eta_i = \mathrm{Prob}\{x_i = 1\} = E[x_i], \quad i = 1, 2,$$
specify the marginal distributions of $x_i$, where $E$ denotes the expectation. Obviously, we have $\eta_1 = p_{10} + p_{11}$ and $\eta_2 = p_{01} + p_{11}$. Let us put
$$\eta_{12} = E[x_1 x_2] = p_{11}.$$
The three quantities
$$\boldsymbol{\eta} = (\eta_1, \eta_2, \eta_{12}) \qquad (1)$$
form another coordinate system of $S_2$, called the $\eta$-coordinates. They are the coordinates of the expectation parameters in an exponential probability family in general (Cox and Hinkley, 1974; Barndorff-Nielsen, 1978; Lehmann, 1983). In the context of neural data, $\eta_1$ and $\eta_2$ are the mean firing rates of neurons 1 and 2, respectively, whereas $\eta_{12}$ is the mean rate of their coincident firing. The covariance,
$$\mathrm{Cov}[X_1, X_2] = E[(x_1 - \eta_1)(x_2 - \eta_2)] = \eta_{12} - \eta_1 \eta_2,$$

may also be considered as a quantity representing the degree of correlation of $X_1$ and $X_2$. Therefore, $(\eta_1, \eta_2, \mathrm{Cov}[X_1, X_2])$ can be another coordinate system. The term $\mathrm{Cov}[X_1, X_2]$ becomes zero when the probability distribution is independent, because we have $\eta_{12} = \eta_1 \eta_2$ in that case. There are many candidates to specify the correlation component. The correlation coefficient
$$\rho = \frac{\eta_{12} - \eta_1 \eta_2}{\sqrt{\eta_1 (1 - \eta_1)\, \eta_2 (1 - \eta_2)}}$$
is also such a quantity. The triplet $(\eta_1, \eta_2, \rho)$ then forms another coordinate system of $S_2$. The correlation coefficient is used to show the pairwise correlation of two neurons in the N-JPSTH (Aertsen et al., 1989). Which quantity is convenient for representing the pairwise correlational component? It is desirable to define the degree of pairwise interaction independently of the marginals $\eta_1$ and $\eta_2$. To this end, we use the `orthogonal coordinates' $(\eta_1, \eta_2, \theta)$ such that the coordinate curve of $\theta$ is always orthogonal to those of $\eta_1$ and $\eta_2$. This characteristic is particularly desirable in the context of neural data, as shown later. Once such a $\theta$ is defined, we have a subset $E(\theta)$ for each $\theta$, a family of distributions having the same $\theta$ value (Fig. 1 A). $E(\theta)$ is a two-dimensional submanifold on which $(\eta_1, \eta_2)$ can vary freely but $\theta$ is fixed. We put the origin $\theta = 0$ when there is no correlation (i.e., $\eta_{12} = \eta_1 \eta_2$) for convenience (see below), and then $E(0)$ is the set of all the independent distributions. Similarly, we consider the set of all the probability distributions whose marginals are common, specified by $(\eta_1, \eta_2)$, but only $\theta$ is free. This is denoted by $M(\eta_1, \eta_2)$, forming a one-dimensional submanifold in $S_2$. The tangential direction of $M(\eta_1, \eta_2)$ represents the direction in which

only the pure correlation changes, while the tangential directions of $E(\theta)$ span the directions in which only $\eta_1$ and $\eta_2$ change but $\theta$ is fixed. We now require that $E(\theta)$ and $M(\eta_1, \eta_2)$ be orthogonal at any point, that is, that the directions of changes in the correlation and in the marginals be mutually ``orthogonal''. The orthogonality of two directions in $S_2$ is defined by using the Riemannian metric due to the Fisher information matrix (Rao, 1945; Barndorff-Nielsen, 1978; Amari, 1982; Nagaoka and Amari, 1982; Amari and Han, 1989; Amari and Nagaoka, 2000). Here, we define the orthogonality directly. Let us specify the probability distributions by $p(\mathbf{x}; \eta_1, \eta_2, \theta)$. The directions of small changes in the coordinates $\eta_i$ and $\theta$ are represented, respectively, by
$$\frac{\partial}{\partial \eta_i} l(\mathbf{x}; \eta_1, \eta_2, \theta), \qquad \frac{\partial}{\partial \theta} l(\mathbf{x}; \eta_1, \eta_2, \theta),$$
where $l(\mathbf{x}; \eta_1, \eta_2, \theta) = \log p(\mathbf{x}; \eta_1, \eta_2, \theta)$. They are random variables, denoting how the log probability changes by small changes in the parameters in the respective directions. These directions are said to be orthogonal when the corresponding random variables are uncorrelated,
$$E\left[\frac{\partial}{\partial \theta} l(\mathbf{x}; \eta_1, \eta_2, \theta)\, \frac{\partial}{\partial \eta_i} l(\mathbf{x}; \eta_1, \eta_2, \theta)\right] = 0, \qquad (2)$$
where $E$ denotes the expectation with respect to $p(\mathbf{x}; \eta_1, \eta_2, \theta)$. This implies that the cross components of $\theta$ and $\eta_i$ in the Fisher information matrix vanish. When the coordinate $\theta$ is defined to be orthogonal to the coordinates $\eta_1$ and $\eta_2$ of the marginals, we say that $\theta$ represents the pure correlation independently of the marginals. Such a $\theta$ is given by the following theorem.

Theorem 1. The coordinate
$$\theta = \log \frac{p_{11}\, p_{00}}{p_{01}\, p_{10}} \qquad (3)$$
is orthogonal to the marginals $\eta_1$ and $\eta_2$.

The proof can be shown by direct calculation, which is omitted here. A more general result is shown later. We have another interpretation of $\theta$. Let us expand $\log p(\mathbf{x})$ in a polynomial of $\mathbf{x}$,
$$\log p(\mathbf{x}) = \sum_{i=1}^{2} \theta_i x_i + \theta_{12} x_1 x_2 - \psi. \qquad (4)$$
Since $x_i$ takes on the binary values $0, 1$, this is an exact expansion. The coefficient $\theta_{12}$ is given by (3), while
$$\theta_1 = \log \frac{p_{10}}{p_{00}}, \quad \theta_2 = \log \frac{p_{01}}{p_{00}}, \quad \psi = -\log p_{00}. \qquad (5)$$
We remark here that the above $\theta_{12}$ is well known, having frequently been used in the additive decomposition of log probabilities. It is 0 when and only when $X_1$ and $X_2$ are independent. The triple $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_{12})$ forms another coordinate system of $S_2$, called the $\theta$-coordinates. They are the coordinates of the natural parameters in the exponential probability family in general (Cox and Hinkley, 1974; Barndorff-Nielsen, 1978; Lehmann, 1983). Furthermore, the triple $\boldsymbol{\zeta} = (\eta_1, \eta_2, \theta_{12})$ forms an `orthogonal' coordinate system of $S_2$, called the mixed coordinates (Amari, 1985; Amari and Nagaoka, 2000).
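As a concrete illustration of these coordinate systems, the following sketch (our own Python example with made-up probabilities, not code or data from the paper) computes the $P$-, $\eta$-, $\theta$- and mixed coordinates of a single 2x2 joint distribution and checks that the log-linear expansion (4)-(5) reproduces it.

```python
# A minimal sketch of the two-neuron coordinate systems: P-coordinates,
# eta-coordinates, theta-coordinates, and the mixed coordinates of Eqs (1)-(5).
# The joint probabilities below are illustrative, not data from the paper.
import numpy as np

# P-coordinates: joint probabilities p[x1, x2], all assumed > 0.
p = np.array([[0.40, 0.15],    # p00, p01
              [0.20, 0.25]])   # p10, p11
assert np.all(p > 0) and np.isclose(p.sum(), 1.0)

# eta-coordinates (expectation parameters).
eta1  = p[1, 0] + p[1, 1]          # Prob{X1 = 1}
eta2  = p[0, 1] + p[1, 1]          # Prob{X2 = 1}
eta12 = p[1, 1]                    # Prob{X1 = X2 = 1}

# theta-coordinates (natural parameters) from the log-linear expansion (4)-(5).
theta1  = np.log(p[1, 0] / p[0, 0])
theta2  = np.log(p[0, 1] / p[0, 0])
theta12 = np.log(p[1, 1] * p[0, 0] / (p[0, 1] * p[1, 0]))   # Eq. (3)
psi     = -np.log(p[0, 0])

# Mixed coordinates: marginals from eta, interaction from theta.
print("eta   =", (eta1, eta2, eta12))
print("theta =", (theta1, theta2, theta12), "psi =", psi)
print("mixed =", (eta1, eta2, theta12))

# Sanity check: rebuild p from the theta-coordinates and compare.
rebuilt = np.empty((2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        rebuilt[x1, x2] = np.exp(theta1*x1 + theta2*x2 + theta12*x1*x2 - psi)
assert np.allclose(rebuilt, p)
```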

3.2 KL divergence, projections and the Pythagoras relation

The Kullback-Leibler (KL) divergence between two probability distributions $p(\mathbf{x})$ and $q(\mathbf{x})$ is defined by
$$D[p : q] = \sum_{\mathbf{x}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})}. \qquad (6)$$
The KL divergence provides a quasi-distance between two probability distributions: $D[p : q] \geq 0$, with equality if and only if $p(\mathbf{x}) = q(\mathbf{x})$, whereas the symmetric relationship does not generally hold, i.e., $D[p : q] \neq D[q : p]$. Let $\bar{p}(\mathbf{x})$ be the independent distribution that is closest to a distribution $p(\mathbf{x})$,
$$\bar{p}(\mathbf{x}) = \arg\min_{q \in E(0)} D[p : q],$$
where $E(0)$ is the set of all the independent distributions. We call $\bar{p}(\mathbf{x}) = \prod_i p_i(x_i)$ the m-projection of $p$ to $E(0)$ (Fig. 1 B). Let the mixed coordinates of $p$ be $(\eta_1, \eta_2, \theta)$. The coordinates of $\bar{p}$ are given by $(\eta_1, \eta_2, 0)$, because of the orthogonality, so that
$$\bar{p}(\mathbf{x}) = \prod_i p_i(x_i; \eta_i) = p_1(x_1; \eta_1)\, p_2(x_2; \eta_2),$$
where $p_i(x_i; \eta_i)$ is the marginal distribution of $p$. Interestingly, the minimized divergence is given by the mutual information,
$$D[p : \bar{p}] = I(X_1; X_2) = \sum p(x_1, x_2) \log \frac{p(x_1, x_2)}{p_1(x_1)\, p_2(x_2)}.$$
We have another characterization of $\bar{p}$. Let $p_0$ be the uniform distribution, whose mixed coordinates are $(0.5, 0.5, 0)$. Let $M(\eta_1, \eta_2)$ be the sub-

space that includes $p$. Then,
$$\bar{p} = \arg\min_{q \in M(\eta_1, \eta_2)} D[q : p_0].$$
Such a $\bar{p}$ is called the e-projection of $p_0$ to $M(\eta_1, \eta_2)$, and it belongs to $E(0)$. Since we easily have $D[q : p_0] = -H[q] + H_0$, where $H[q]$ is the entropy of $q$ and $H_0 = 2 \log 2$ is a constant, $\bar{p}$ has the maximal entropy among those belonging to $M(\eta_1, \eta_2)$. This fact is called the maximum entropy principle (Jaynes, 1982). It is well known that we have the decomposition
$$D[p : p_0] = D[p : \bar{p}] + D[\bar{p} : p_0].$$
Now let us generalize the above observation and let $p(\mathbf{x})$ and $q(\mathbf{x})$ be two probability distributions whose mixed coordinates are $\boldsymbol{\zeta}_p = (\zeta^p_1, \zeta^p_2, \zeta^p_3)$ and $\boldsymbol{\zeta}_q = (\zeta^q_1, \zeta^q_2, \zeta^q_3)$, respectively. Let $r^*(\mathbf{x})$ be the m-projection of $p(\mathbf{x})$ to $E(\theta_q)$, and $\tilde{r}(\mathbf{x})$ be the e-projection of $p(\mathbf{x})$ to $M(\eta^q_1, \eta^q_2)$,
$$r^*(\mathbf{x}) = \arg\min_{r \in E(\theta_q)} D[p : r], \qquad \tilde{r}(\mathbf{x}) = \arg\min_{r \in M(\eta^q_1, \eta^q_2)} D[r : p].$$
The mixed coordinates of $r^*$ and $\tilde{r}$ are explicitly given by $(\zeta^p_1, \zeta^p_2, \zeta^q_3)$ and $(\zeta^q_1, \zeta^q_2, \zeta^p_3)$, respectively. Hence, the following Pythagoras relations hold (Fig. 1 B).

Theorem 2.
$$D[p : q] = D[p : r^*] + D[r^* : q], \qquad (7)$$
$$D[q : p] = D[q : \tilde{r}] + D[\tilde{r} : p]. \qquad (8)$$
Theorem 2 shows that the divergence $D[p : q]$ from $p$ to $q$ is decomposed into two terms, $D[p : r^*]$ and $D[r^* : q]$, where the former represents the degree of difference in their correlation and the latter the difference in their marginals.
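The Pythagoras relation and the identity $D[p : \bar{p}] = I(X_1; X_2)$ are easy to verify numerically. The following short check is our own sketch with an illustrative joint distribution; it is not taken from the paper.

```python
# A small numerical check of the m-projection to E(0) and the relation
# D[p : p0] = D[p : pbar] + D[pbar : p0], with D[p : pbar] = I(X1; X2).
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D[p : q] for arrays of probabilities."""
    return float(np.sum(p * np.log(p / q)))

# Joint distribution p(x1, x2) (illustrative values, all positive).
p = np.array([[0.40, 0.15],
              [0.20, 0.25]])

# m-projection of p onto E(0): the independent distribution with the same
# marginals (eta_1, eta_2); its mixed coordinates are (eta_1, eta_2, 0).
m1 = p.sum(axis=1)            # marginal of X1
m2 = p.sum(axis=0)            # marginal of X2
pbar = np.outer(m1, m2)

p0 = np.full((2, 2), 0.25)    # uniform distribution, mixed coords (0.5, 0.5, 0)

print("I(X1;X2) = D[p:pbar]   =", kl(p, pbar))
print("D[p:p0]                =", kl(p, p0))
print("D[p:pbar] + D[pbar:p0] =", kl(p, pbar) + kl(pbar, p0))
```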

3.3 Local orthogonality and Fisher information

For any parameterization $p(\mathbf{x}; \boldsymbol{\xi})$, the Fisher information matrix $G = (g_{ij})$ in terms of the coordinates $\boldsymbol{\xi}$ is given by
$$g_{ij}(\boldsymbol{\xi}) = E\left[\frac{\partial \log p(\mathbf{x}; \boldsymbol{\xi})}{\partial \xi_i} \frac{\partial \log p(\mathbf{x}; \boldsymbol{\xi})}{\partial \xi_j}\right].$$
This $G(\boldsymbol{\xi})$ plays the role of a Riemannian metric tensor. The squared distance $ds^2$ between two nearby distributions $p(\mathbf{x}; \boldsymbol{\xi})$ and $p(\mathbf{x}; \boldsymbol{\xi} + d\boldsymbol{\xi})$ is given by the quadratic form of $d\boldsymbol{\xi}$,
$$ds^2 = \sum_{i,j \in (1,2,3)} g_{ij}(\boldsymbol{\xi})\, d\xi_i\, d\xi_j.$$
It is known that this is approximately twice the Kullback-Leibler divergence:
$$ds^2 \approx 2 D[p(\mathbf{x}; \boldsymbol{\xi}) : p(\mathbf{x}; \boldsymbol{\xi} + d\boldsymbol{\xi})].$$
When we use the mixed coordinates $\boldsymbol{\zeta}$, the Fisher information is of the form
$$G = (g_{ij}) = \begin{pmatrix} g_{11} & g_{12} & 0 \\ g_{12} & g_{22} & 0 \\ 0 & 0 & g_{33} \end{pmatrix},$$
as is seen from Eq. 2. This is the local property induced by the orthogonality of $\theta$ and $\eta_i$. In this case, by putting
$$ds_1^2 = g_{33}\, (d\zeta_3)^2, \qquad ds_2^2 = \sum_{i,j \in (1,2)} g_{ij}\, d\zeta_i\, d\zeta_j,$$

we have the orthogonal decomposition corresponding to Eq. 7,
$$ds^2 = ds_1^2 + ds_2^2. \qquad (9)$$
We now show the merits of the orthogonal coordinates for statistical inference. Let us estimate the parameters $\boldsymbol{\eta} = (\eta_1, \eta_2)$ and $\theta$ from $N$ observed data $\mathbf{x}_1, \dots, \mathbf{x}_N$. The maximum likelihood estimator is asymptotically unbiased and efficient, where the covariance of the estimation errors, $\Delta\hat{\boldsymbol{\eta}}$ and $\Delta\hat{\theta}$, is given asymptotically by
$$\mathrm{Cov}\left[\Delta\hat{\boldsymbol{\zeta}}\right] = \frac{1}{N} G^{-1}.$$
Since the cross terms of $G$ or $G^{-1}$ vanish for the orthogonal coordinates, we have
$$\mathrm{Cov}[\Delta\hat{\boldsymbol{\eta}}, \Delta\hat{\theta}] = 0, \qquad (10)$$
implying that the estimation error of the marginals and that of the interaction are mutually independent. Such a property does not hold for other non-orthogonal parameterizations such as the correlation coefficient, the covariance, etc. This property greatly simplifies procedures of hypothesis testing, as shown below.

3.4 Hypothesis testing

Let us consider the estimation of $\boldsymbol{\eta}$ and $\theta$ more directly. A natural estimate for the $\eta$-coordinates is
$$\hat{\eta}_i = \frac{1}{N} \#\{x_i = 1\} \quad (i = 1, 2), \qquad \hat{\eta}_{12} = \frac{1}{N} \#\{x_1 x_2 = 1\}. \qquad (11)$$

This is the maximum likelihood estimator. The estimator $\hat{\theta}$ is obtained by the coordinate transformation from the $\eta$- to the $\theta$-coordinates,
$$\hat{\theta} = \log \frac{\hat{\eta}_{12}\,(1 - \hat{\eta}_1 - \hat{\eta}_2 + \hat{\eta}_{12})}{(\hat{\eta}_1 - \hat{\eta}_{12})(\hat{\eta}_2 - \hat{\eta}_{12})}.$$
Notably, the estimation of $\theta$ can be performed `independently' of the estimation of $\boldsymbol{\eta}$ in the sense of Eq. 10. This brings a simple procedure of hypothesis testing concerning the null hypothesis
$$H_0 : \theta = \theta_0 \quad \text{against} \quad H_1 : \theta \neq \theta_0.$$
In previous studies, under different frameworks (e.g., using the N-JPSTH), the null hypothesis of independent firing is often examined. This corresponds to the null hypothesis $\theta_0 = 0$ in the current framework. Let the maximal log likelihoods of the models $H_0 : \theta = \theta_0$ and $H_1 : \theta \neq \theta_0$ be, respectively,
$$l_0 = \max_{\boldsymbol{\eta}} \log p(\mathbf{x}_1, \dots, \mathbf{x}_N; \boldsymbol{\eta}, \theta_0), \qquad l_1 = \max_{\boldsymbol{\eta}, \theta} \log p(\mathbf{x}_1, \dots, \mathbf{x}_N; \boldsymbol{\eta}, \theta),$$
where $N$ is the number of observations. The likelihood ratio test uses the test statistic
$$\lambda = 2\,(l_1 - l_0), \qquad (12)$$
which is subject to the $\chi^2$-distribution. With the orthogonal coordinates, the likelihood maximization with respect to $\boldsymbol{\eta} = (\eta_1, \eta_2)$ and $\theta$ can be performed independently, so that we have
$$l_0 = \log p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \theta_0), \qquad l_1 = \log p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \hat{\theta}),$$

where $\hat{\boldsymbol{\eta}}$ denotes the same marginals in both models. If a non-orthogonal parameterization is used, this property does not hold. A similar situation holds in the case of testing $\boldsymbol{\eta} = \boldsymbol{\eta}_0$ against $\boldsymbol{\eta} \neq \boldsymbol{\eta}_0$ for unknown $\theta$. Now let us calculate the test statistic in more detail. Under the hypothesis $H_0$, $\lambda$ is approximated for a large $N$ as
$$\lambda = 2 \sum_{i=1}^{N} \log \frac{p(\mathbf{x}_i; \hat{\boldsymbol{\eta}}, \hat{\theta})}{p(\mathbf{x}_i; \hat{\boldsymbol{\eta}}, \theta_0)} \approx 2N \tilde{E}\left[\log \frac{p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \hat{\theta})}{p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \theta_0)}\right] \approx 2N D\left[p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \hat{\theta}) : p(\mathbf{x}; \hat{\boldsymbol{\eta}}, \theta_0)\right] \approx N g_{33}\,(\hat{\theta} - \theta_0)^2, \qquad (13)$$
where $\tilde{E}$ is the expectation over the empirical distribution and the approximation in the third step comes from our assumption of the null hypothesis $H_0$. Here $g_{33}$ is the Fisher information of the mixed coordinates in the $\theta$-direction at $\boldsymbol{\zeta}_0 = (\hat{\eta}_1, \hat{\eta}_2, \theta_0)$, which is easily calculated as
$$g_{33} = g_{33}(\boldsymbol{\zeta}_0) = \frac{\hat{\eta}_3\,(\hat{\eta}_1 - \hat{\eta}_3)(\hat{\eta}_2 - \hat{\eta}_3)(\hat{\eta}_1 + \hat{\eta}_2 - \hat{\eta}_3 - 1)}{\hat{\eta}_1 \hat{\eta}_2\,(\hat{\eta}_1 + \hat{\eta}_2 - 1 - 2\hat{\eta}_3) + \hat{\eta}_3^2}.$$
Asymptotically, we have
$$\sqrt{N}\,\sqrt{g_{33}}\,(\hat{\zeta}_3 - \zeta_3) \sim N(0, 1), \quad \text{and hence} \quad \lambda \sim \chi^2(1),$$
where $m$ in $\chi^2(m)$ indicates the degrees of freedom of the $\chi^2$ distribution; in our case, the degree of freedom is 1. We must note that the above approach is valid regardless of whether $\theta_3 = 0$ or $\theta_3 \neq 0$.
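The following sketch shows how such a test might be carried out in practice on synthetic spike indicators; the data, bin counts and random-number seed are assumptions of ours, not the paper's recordings, and the statistic follows the Wald-type approximation of Eq. (13).

```python
# A sketch of the test of H0: theta = 0 (independent firing) for two neurons,
# using lambda ~ N * g33 * (theta_hat - theta_0)^2 ~ chi^2(1) as in Eq. (13).
# Synthetic data only; not the recordings analyzed in the paper.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
N = 1000                                   # number of time bins / trials

# Synthetic spike indicators with a weak positive interaction.
x1 = rng.random(N) < 0.20
x2 = np.where(x1, rng.random(N) < 0.35, rng.random(N) < 0.25)

# Maximum likelihood estimates of the eta-coordinates, Eq. (11).
eta1_hat  = x1.mean()
eta2_hat  = x2.mean()
eta12_hat = (x1 & x2).mean()

# Coordinate transformation to theta_hat (the log odds ratio).
theta_hat = np.log(eta12_hat * (1 - eta1_hat - eta2_hat + eta12_hat)
                   / ((eta1_hat - eta12_hat) * (eta2_hat - eta12_hat)))

# Fisher information g33 of the mixed coordinates in the theta direction,
# evaluated with theta_0 = 0, i.e. with eta_3 replaced by eta_1 * eta_2.
def g33(e1, e2, e3):
    num = e3 * (e1 - e3) * (e2 - e3) * (e1 + e2 - e3 - 1)
    den = e1 * e2 * (e1 + e2 - 1 - 2 * e3) + e3 ** 2
    return num / den

theta0 = 0.0
lam = N * g33(eta1_hat, eta2_hat, eta1_hat * eta2_hat) * (theta_hat - theta0) ** 2
print("theta_hat =", theta_hat)
print("lambda    =", lam, " p-value =", chi2.sf(lam, df=1))
```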

In contrast, the decomposition shown in Eq. 9 does not exist, for example, for the coordinate system $(\eta_1, \eta_2, \rho)$, where $\rho$ is the correlation coefficient. The plane $\theta_3 = 0$, or $E(0)$, coincides with the plane $\rho = 0$, which is $\eta_3 = \eta_1 \eta_2$. However, $E(c)$ ($c = \mathrm{const} \neq 0$) cannot be equal to any plane defined by $\rho = c'$, where $c' = \mathrm{const}$. Only in the case of $\rho = 0$ is it possible to formulate a test for $\rho$ similarly to the above discussion, which is testing against the hypothesis of independent firing.

3.5 Application to firing of two neurons

Here, we discuss the application of the above theoretical results to the firing of two neurons and relate different choices of the null hypothesis with corresponding hypothesis tests. Given $N$ trials of experiments, the probability distribution of $X$ in a time bin $[t, t + \Delta t]$ can be estimated; it is denoted by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}) = p(\mathbf{x}; \hat{\boldsymbol{\zeta}}(t, t + \Delta t))$, where $\hat{\boldsymbol{\zeta}}$ can be any coordinate system. If stationarity is assumed in a certain time interval, we obtain the probability distribution in the interval by averaging the estimated probabilities over the many bins of the interval. The maximum likelihood estimate (MLE) of the $P$-coordinates is given by
$$\hat{p}_{ij} = \frac{N_{ij}}{N} \quad (i, j = 0, 1),$$
where $N_{ij}$ indicates the number of trials in which the event $(X_1 = i, X_2 = j)$ occurs. The maximum likelihood estimator is retained under any coordinate transformation. Any coordinate transformation is easy in the case of two neurons, so we freely change the coordinate systems in this section.

Let us denote our estimated probability distribution by its mixed coordinates $\hat{\boldsymbol{\zeta}}$. We also denote by $\boldsymbol{\zeta}^0$ our null hypothesis. Then, we have
$$D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}\right] = D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}^0\right] + D\left[\hat{\boldsymbol{\zeta}}^0 : \hat{\boldsymbol{\zeta}}\right] = D_1 + D_2, \qquad (14)$$
where $D_1 = D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}^0]$, $D_2 = D[\hat{\boldsymbol{\zeta}}^0 : \hat{\boldsymbol{\zeta}}]$, and $\hat{\boldsymbol{\zeta}}^0 = (\zeta^0_1, \zeta^0_2, \hat{\zeta}_3)$. We use the abbreviation $D[\boldsymbol{\zeta} : \boldsymbol{\zeta}']$ for the divergence between the probability distributions specified by $\boldsymbol{\zeta}$ and $\boldsymbol{\zeta}'$, i.e., $D[p(\mathbf{x}; \boldsymbol{\zeta}) : p(\mathbf{x}; \boldsymbol{\zeta}')]$. Here, $D_1$ and $D_2$ are the quantities representing the discrepancies of $p(\hat{\boldsymbol{\zeta}})$ from $p(\boldsymbol{\zeta}^0)$ with respect to the coincident firing and the marginals, respectively. We have
$$\lambda_1 = 2ND_1 \approx N g_{33}(\boldsymbol{\zeta}^0)(\zeta^0_3 - \hat{\zeta}_3)^2 \sim \chi^2(1),$$
$$\lambda_2 = 2ND_2 \approx N \sum_{i,j=1}^{2} g_{ij}(\boldsymbol{\zeta}^0)(\zeta^0_i - \hat{\zeta}_i)(\zeta^0_j - \hat{\zeta}_j) \sim \chi^2(2).$$
Here, $\lambda_1$ tests whether the estimated coincident firing significantly differs from that of the null hypothesis, while $\lambda_2$ tests whether the estimated marginals significantly differ from the hypothesized marginals. In particular, a test of whether the estimated coincident firing $\hat{\zeta}_3$ is significantly different from zero is given by $\boldsymbol{\zeta}^0 = (\hat{\zeta}_1, \hat{\zeta}_2, 0)$. This $p(\mathbf{x}; \boldsymbol{\zeta}^0)$ is the probability distribution that has the same marginals as those of $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ but with independent firing. In this case, $\lambda_1 = 2ND_1 = 2ND[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}]$ gives a test statistic against $\hat{\zeta}_3 = 0$, while $D_2 = 0$. Let us consider another typical situation, where we need to compare two estimated probability distributions. This case is very important but somewhat ignored in the testing of coincident firing. Many previous studies often assumed independent firing as the null hypothesis. However, for example, to say that a single neuron's firing is `task-related', e.g., in a memory-guided

saccade task (Hikosaka and Wurtz, 1983), the existence of firing in the `task period' alone does not guarantee that the firing is task-related. It is normal to examine the firing in the task period against that in a `control period'. The firing in the control period serves as the resting-level activity, or as the null hypothesis. We hence propose that a procedure for testing coincident firing should be performed in a similar manner: we should test whether two neurons have any significant pairwise interaction in one period in comparison to the other (control) period. Investigation of coincident firing in the task period against the null hypothesis of independent firing may lead to a wrong interpretation of its significance when there is already a weak correlation in the control period (see examples in Section 6). Similar arguments can be applied to different tasks. One example would be a rat's maze task: a rat is in the left room in one period, while in the other period it is in the right room. We may like to test whether coincident firing of two neurons, say, in the hippocampus, is significantly larger or smaller in one room than in the other room. The null hypothesis of independent firing is not plausible in this case. Let us denote the estimated probability distributions in the two periods by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}^1)$ and $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}^2)$. Using the mixed coordinates, by Theorem 2 we have
$$D\left[\hat{\boldsymbol{\zeta}}^1 : \hat{\boldsymbol{\zeta}}^2\right] = D\left[\hat{\boldsymbol{\zeta}}^1 : \hat{\boldsymbol{\zeta}}^3\right] + D\left[\hat{\boldsymbol{\zeta}}^3 : \hat{\boldsymbol{\zeta}}^2\right],$$
where $\hat{\boldsymbol{\zeta}}^3 = (\hat{\zeta}^1_1, \hat{\zeta}^1_2, \hat{\zeta}^2_3) = (\hat{\eta}^1_1, \hat{\eta}^1_2, \hat{\theta}^2_3)$. Here $\hat{\boldsymbol{\zeta}}^1$ is an estimated probability distribution. If we can guarantee that $\hat{\boldsymbol{\zeta}}^1$ is the true underlying distribution, denoted by $\boldsymbol{\zeta}^1$, we can have
$$\lambda = 2ND\left[\boldsymbol{\zeta}^1 : \hat{\boldsymbol{\zeta}}^3\right] \approx N g_{33}(\boldsymbol{\zeta}^1)(\hat{\zeta}^2_3 - \zeta^1_3)^2 \sim \chi^2(1). \qquad (15)$$

This $\chi^2$ test is, precisely speaking, to examine whether $\hat{\zeta}^2_3$ is significantly different from $\zeta^1_3$ when $\hat{\boldsymbol{\zeta}}^1$ is a true distribution. In general, when $\hat{\boldsymbol{\zeta}}^1$ is an estimated distribution, we should test whether $\hat{\zeta}^1_3$ and $\hat{\zeta}^2_3$ come from the same interaction component, which we denote by $\zeta_3$. In this case, the maximum likelihood estimators, denoted by $\hat{\boldsymbol{\zeta}}^{1\prime}$ and $\hat{\boldsymbol{\zeta}}^{2\prime}$, are given by
$$(\hat{\boldsymbol{\zeta}}^{1\prime}, \hat{\boldsymbol{\zeta}}^{2\prime}) = \arg\max \sum_{j=1}^{N} \log p(\mathbf{x}_j; \boldsymbol{\zeta}^1)\, p(\mathbf{x}_j; \boldsymbol{\zeta}^2) \quad \text{subject to} \quad \zeta^1_3 = \zeta^2_3 = \zeta_3.$$
Then, our likelihood ratio test against this null hypothesis yields
$$\lambda' = 2ND\left[\hat{\boldsymbol{\zeta}}^{1\prime} : \hat{\boldsymbol{\zeta}}^1\right] + 2ND\left[\hat{\boldsymbol{\zeta}}^{2\prime} : \hat{\boldsymbol{\zeta}}^2\right] \approx N g_{33}(\hat{\boldsymbol{\zeta}}^{1\prime})(\hat{\zeta}^1_3 - \hat{\zeta}'_3)^2 + N g_{33}(\hat{\boldsymbol{\zeta}}^{2\prime})(\hat{\zeta}^2_3 - \hat{\zeta}'_3)^2, \qquad (16)$$
where $\hat{\zeta}'_3 = \hat{\zeta}^{1\prime}_3 = \hat{\zeta}^{2\prime}_3$. In Eq. 15, we can convert $\lambda$ into a $\chi^2$ test, because $g_{33}$ is the true value by our assumption. In Eq. 16, however, rigorously speaking, we cannot convert $\lambda'$ into a $\chi^2$ test, because both $g_{33}$ terms are estimates, determined at each estimated point, i.e., depending on $\hat{\boldsymbol{\zeta}}^{1\prime}$ and $\hat{\boldsymbol{\zeta}}^{2\prime}$, respectively. This issue is analogous to the famous Fisher-Behrens problem in the context of the t-test (Stuart et al., 1999). Yet, since all of the terms in Eq. 16 asymptotically converge to their true values, we suggest using
$$\lambda' \approx N g_{33}(\hat{\boldsymbol{\zeta}}^{1\prime})(\hat{\zeta}^1_3 - \hat{\zeta}'_3)^2 + N g_{33}(\hat{\boldsymbol{\zeta}}^{2\prime})(\hat{\zeta}^2_3 - \hat{\zeta}'_3)^2 \sim \chi^2(2).$$
This $\chi^2(2)$ formulation gives a more appropriate test under the null hypothesis against the average activity in the control period. At the same time, to compare significant events between the two null hypotheses, namely, against independent firing and against the average activity in the control period, we suggest still using the $\chi^2(1)$ formulation for the latter hypothesis.
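A minimal numerical illustration of the two-period comparison is given below. It uses the simpler form of Eq. (15), treating the control-period estimate as the reference distribution; the spike counts are hypothetical, and the constrained-MLE procedure of Eq. (16) is not implemented here.

```python
# A sketch of testing whether the pairwise interaction theta_3 in a "task"
# period differs from that in a "control" period, following the simplification
# of Eq. (15) with the control-period estimate taken as the reference.
# All counts are made up for illustration.
import numpy as np
from scipy.stats import chi2

def eta_hat(n00, n01, n10, n11):
    """MLE of (eta1, eta2, eta12) and N from the 2x2 table of joint counts."""
    N = n00 + n01 + n10 + n11
    return (n10 + n11) / N, (n01 + n11) / N, n11 / N, N

def theta_of(e1, e2, e3):
    return np.log(e3 * (1 - e1 - e2 + e3) / ((e1 - e3) * (e2 - e3)))

def g33(e1, e2, e3):
    num = e3 * (e1 - e3) * (e2 - e3) * (e1 + e2 - e3 - 1)
    den = e1 * e2 * (e1 + e2 - 1 - 2 * e3) + e3 ** 2
    return num / den

# Hypothetical joint spike counts (x1, x2) per bin in the two periods.
e1c, e2c, e3c, Nc = eta_hat(n00=620, n01=130, n10=150, n11=100)   # control
e1t, e2t, e3t, Nt = eta_hat(n00=520, n01=140, n10=160, n11=180)   # task

theta_ctrl = theta_of(e1c, e2c, e3c)
theta_task = theta_of(e1t, e2t, e3t)

# Wald-type statistic of Eq. (15): g33 is evaluated at the control (reference)
# point and N is the number of bins used for the task-period estimate.
lam = Nt * g33(e1c, e2c, e3c) * (theta_task - theta_ctrl) ** 2
print("theta_ctrl =", theta_ctrl, " theta_task =", theta_task)
print("lambda =", lam, " p-value =", chi2.sf(lam, df=1))
```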

3.6 Relationship between neural firing and behavior

The orthogonality between the $\theta$ and $\eta$ parameters has played a fundamental role in the above results, so that pairwise coincident firing, characterized by $\zeta_3$, can be examined by a simple hypothesis testing procedure. In the analysis of neural data, it is also important to investigate whether or not any coincident firing has behavioral significance. For this purpose, we use the mutual information to relate neural firing with behavioral events. The above orthogonality can again play an important role. Let us denote by $Y$ a discrete random variable representing behavioral choices, for example, making a saccade right or left, and/or presented stimuli, for example red dots, blue rectangles, or green triangles. The mutual information between $X = (X_1, X_2)$ and $Y$ is defined by
$$I(X; Y) = E_{p(X, Y)}\left[\log \frac{p(\mathbf{x}, y)}{p(\mathbf{x})\, p(y)}\right],$$
which is equivalent to
$$I(X; Y) = E_{p(Y)}\left[D[p(\mathbf{x}|y) : p(\mathbf{x})]\right] = E_{p(X)}\left[D[p(y|\mathbf{x}) : p(y)]\right].$$
We can apply the Pythagoras decomposition to the above equation. We use the mixed coordinates for $p(\mathbf{x}|y)$ and $p(\mathbf{x})$, denoted by $\boldsymbol{\zeta}(X|y)$ and $\boldsymbol{\zeta}(X)$, respectively. Then, we have
$$D[p(\mathbf{x}|y) : p(\mathbf{x})] = D[\boldsymbol{\zeta}(X|y) : \boldsymbol{\zeta}(X)] = D[\boldsymbol{\zeta}(X|y) : \boldsymbol{\zeta}_0] + D[\boldsymbol{\zeta}_0 : \boldsymbol{\zeta}(X)],$$
where
$$\boldsymbol{\zeta}_0 = \boldsymbol{\zeta}_0(X, y) = (\zeta_1(X|y), \zeta_2(X|y), \zeta_3(X)) = (\eta_1(X|y), \eta_2(X|y), \theta_3(X)).$$

Thus, $\boldsymbol{\zeta}_0$ has the first two components (i.e., $\zeta_1, \zeta_2$) the same as those of $\boldsymbol{\zeta}(X|y)$ and the third component (i.e., $\zeta_3$) the same as that of $\boldsymbol{\zeta}(X)$. Using this relationship, the mutual information between $X$ and $Y$ is decomposed.

Theorem 3.
$$I(X; Y) = I_1(X; Y) + I_2(X; Y), \qquad (17)$$
where $I_1(X; Y)$ and $I_2(X; Y)$ are given by
$$I_1(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}(X|y) : \boldsymbol{\zeta}_0(X, y)]\right], \qquad I_2(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_0(X, y) : \boldsymbol{\zeta}(X)]\right].$$
A similar result holds for the conditional distribution $p(y|\mathbf{x})$. The above decomposition states that the mutual information $I(X; Y)$ is the sum of two terms: $I_1(X; Y)$ is the mutual information carried by modulation of the correlation component of $X$, while $I_2(X; Y)$ is the mutual information carried by modulation of the marginal means of $X$. This observation helps us investigate the behavioral significance of each modulation of the coincident firing and of the mean firing rate.
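The decomposition of Theorem 3 can be checked numerically by constructing the intermediate distribution $\boldsymbol{\zeta}_0(X, y)$ explicitly. The sketch below is our own illustration (with an invented joint distribution of firing and behavior); the helper that rebuilds a 2x2 distribution from given mixed coordinates solves Eq. (3) for $p_{11}$ numerically.

```python
# A numerical sketch of Theorem 3: I(X;Y) = I_1 + I_2, where I_1 is carried by
# modulation of the pairwise interaction theta_3 and I_2 by modulation of the
# marginals.  The conditional distributions below are illustrative only.
import numpy as np
from scipy.optimize import brentq

def mixed_coords(p):
    """Mixed coordinates (eta1, eta2, theta3) of a 2x2 distribution p[x1,x2]."""
    e1, e2 = p[1].sum(), p[:, 1].sum()
    th = np.log(p[1, 1] * p[0, 0] / (p[0, 1] * p[1, 0]))
    return e1, e2, th

def from_mixed(e1, e2, th):
    """Rebuild the 2x2 distribution with marginals (e1, e2) and interaction th."""
    def f(p11):
        p10, p01, p00 = e1 - p11, e2 - p11, 1 - e1 - e2 + p11
        return np.log(p11 * p00 / (p01 * p10)) - th
    lo = max(0.0, e1 + e2 - 1.0) + 1e-12
    hi = min(e1, e2) - 1e-12
    p11 = brentq(f, lo, hi)
    return np.array([[1 - e1 - e2 + p11, e2 - p11],
                     [e1 - p11, p11]])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Conditional distributions p(x | y) for two behavioral conditions y = 0, 1,
# and the prior p(y); all values are made up for illustration.
p_y = np.array([0.5, 0.5])
p_x_given_y = [np.array([[0.50, 0.15], [0.20, 0.15]]),
               np.array([[0.30, 0.20], [0.15, 0.35]])]
p_x = sum(py * pxy for py, pxy in zip(p_y, p_x_given_y))

_, _, th_marg = mixed_coords(p_x)
I = I1 = I2 = 0.0
for py, pxy in zip(p_y, p_x_given_y):
    e1, e2, _ = mixed_coords(pxy)
    zeta0 = from_mixed(e1, e2, th_marg)   # marginals of p(x|y), theta of p(x)
    I  += py * kl(pxy, p_x)
    I1 += py * kl(pxy, zeta0)
    I2 += py * kl(zeta0, p_x)

print("I =", I, " I1 + I2 =", I1 + I2, " (I1, I2) =", (I1, I2))
```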

4 Triple Interactions among Three Variables

The previous section discussed the pairwise interaction between two variables. Given more than two variables, we need to look not only into pairwise interactions but also into higher-order interactions. It is useful to study triplewise interactions before stating the general case.

4.1 Orthogonal coordinates and pure triple interaction

Let us consider three binary random variables $X_1, X_2$ and $X_3$, and let $p(\mathbf{x}) > 0$, $\mathbf{x} = (x_1, x_2, x_3)$, be their joint probability distribution. We put $p_{ijk} = \mathrm{Prob}\{x_1 = i, x_2 = j, x_3 = k\}$, $i, j, k = 0, 1$. The set of all such distributions forms a 7-dimensional manifold $S_3$, because $\sum p_{ijk} = 1$ among the eight $p_{ijk}$'s. The single and pairwise marginal distributions of the $X_i$ are defined by
$$\eta_i = E[x_i] = \mathrm{Prob}\{x_i = 1\} \quad (i = 1, 2, 3), \qquad \eta_{ij} = E[x_i x_j] = \mathrm{Prob}\{x_i = x_j = 1\} \quad (i < j;\; i, j = 1, 2, 3).$$
The three quantities $\eta_i$, $\eta_j$ and $\eta_{ij}$ together determine the joint marginal distribution of any two random variables $X_i$ and $X_j$. Let us further put
$$\eta_{123} = E[x_1 x_2 x_3] = \mathrm{Prob}\{x_1 = x_2 = x_3 = 1\}.$$
All of these together have 7 degrees of freedom,
$$\boldsymbol{\eta} = (\eta_1, \eta_2, \dots, \eta_7) = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}, \eta_{123}), \qquad (18)$$
which specify any distribution $p(\mathbf{x})$ in $S_3$. Hence, this is a coordinate system of $S_3$, called the m- or $\eta$-coordinates. The pairwise correlation between any two of $X_1, X_2$ and $X_3$ is determined from the marginal distributions of $X_i$ and $X_j$, or $\eta_i$, $\eta_j$ and $\eta_{ij}$. However, even when all the pairwise correlations vanish, this does not imply that $X_1, X_2$, and $X_3$ are independent. Therefore, one should define the intrinsic triplewise

interaction independently of pairwise correlations. The coordinate $\eta_{123}$ itself does not directly give the degree of pure triplewise interaction. In order to define the degree of pure triplewise interaction, the orthogonality plays a fundamental role. Let us fix the three pairwise marginal distributions, specified by the six coordinates
$$\boldsymbol{\eta}_2 = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}).$$
There are many distributions with the same $\boldsymbol{\eta}_2$. Let us consider the set $M_2(\boldsymbol{\eta}_2)$ of all the distributions in which we have the same single and pairwise marginals $\boldsymbol{\eta}_2$ but $\eta_{123}$ may take any value. This is a one-dimensional submanifold specified by $\boldsymbol{\eta}_2$. Let us introduce a coordinate $\theta$ in $M_2(\boldsymbol{\eta}_2)$; then $(\boldsymbol{\eta}_2, \theta)$ is a coordinate system of $S_3$. When the coordinate $\theta$ is orthogonal to $\boldsymbol{\eta}_2$, that is, when a change in the log likelihood along $\theta$ is not correlated with that along any of the components of $\boldsymbol{\eta}_2$, we may say that $\theta$ represents the degree of pure triple interaction irrespective of the pairwise marginals $\boldsymbol{\eta}_2$, and we require that $\theta$ has this property. The tangent direction of $M_2$, that is, the direction in which only $\theta$ changes but the second-order marginals $\boldsymbol{\eta}_2$ are fixed, represents a change in the pure triple interaction among $X_1, X_2$, and $X_3$. To show this geometrically, let us consider a family of submanifolds $E_2(\theta)$ in which all the distributions have the same $\theta$ but the single and pairwise marginals $\boldsymbol{\eta}_2$ are free. An $E_2(\theta)$ is a six-dimensional submanifold transversal to $M_2(\boldsymbol{\eta}_2)$. Tangent directions of $E_2(\theta)$ represent changes in the marginals $\boldsymbol{\eta}_2$, keeping $\theta$ fixed, and $E_2(\theta)$ and $M_2(\boldsymbol{\eta}_2)$ are orthogonal at any $\theta$ and $\boldsymbol{\eta}_2$. In order to obtain such a $\theta$, let us expand $\log p(\mathbf{x})$ in a polynomial of

$\mathbf{x}$,
$$\log p(\mathbf{x}) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \theta_{123}\, x_1 x_2 x_3 - \psi. \qquad (19)$$
This is an exact formula, since each $x_i$ ($i = 1, 2, 3$) is binary. One can check that the coefficient $\theta = \theta_{123}$ is given by
$$\theta_{123} = \log \frac{p_{111}\, p_{100}\, p_{010}\, p_{001}}{p_{110}\, p_{101}\, p_{011}\, p_{000}}. \qquad (20)$$
The other coefficients are
$$\theta_1 = \log \frac{p_{100}}{p_{000}}, \quad \theta_2 = \log \frac{p_{010}}{p_{000}}, \quad \theta_3 = \log \frac{p_{001}}{p_{000}}, \qquad (21)$$
$$\theta_{12} = \log \frac{p_{110}\, p_{000}}{p_{100}\, p_{010}}, \quad \theta_{23} = \log \frac{p_{011}\, p_{000}}{p_{010}\, p_{001}}, \quad \theta_{13} = \log \frac{p_{101}\, p_{000}}{p_{100}\, p_{001}}, \qquad (22)$$
$$\psi = -\log p_{000}. \qquad (23)$$
Information geometry gives the following theorem.

Theorem 4. The quantity $\theta_{123}$ represents the pure triplewise interaction in the sense that it is orthogonal to any changes in the single and pairwise marginals.

We can prove this directly by calculating the derivatives of the log likelihood. Equation 19 shows that $S_3$ is an exponential family with the canonical parameters $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3, \theta_{12}, \theta_{13}, \theta_{23}, \theta_{123})$. The corresponding expectation parameters are $\boldsymbol{\eta} = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}, \eta_{123})$, so that they are dually orthogonal. We can compose the mixed orthogonal coordinates, denoted by $\boldsymbol{\zeta}_2$, as
$$\boldsymbol{\zeta}_2 = (\boldsymbol{\eta}_2, \theta_{123}) = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}, \theta_{123}). \qquad (24)$$
In this coordinate system, $\boldsymbol{\eta}_2$ and $\theta = \theta_{123}$ are orthogonal. Note that $\theta_{123}$ is not orthogonal to $\theta_{12}, \theta_{13}, \theta_{23}$. Hence, except when there is no triplewise interaction ($\theta_{123} = 0$), the quantities $\theta_{12}$, $\theta_{23}$ and $\theta_{13}$ in (22) do not directly represent the degrees of pairwise correlations of the respective two random variables.
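For reference, the following minimal sketch (illustrative probabilities, our own code, not the authors') evaluates $\theta_{123}$ and a few of the other coefficients of the expansion (19) directly from a 2x2x2 joint table.

```python
# A minimal sketch of the pure triplewise interaction theta_123 of Eq. (20)
# for three binary variables, with a few other theta-coefficients of Eq. (19).
import numpy as np

# p[x1, x2, x3]: joint probabilities of three neurons, all positive (made up).
p = np.array([[[0.15, 0.10], [0.10, 0.08]],
              [[0.12, 0.09], [0.11, 0.25]]])
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)

theta_123 = np.log((p[1,1,1] * p[1,0,0] * p[0,1,0] * p[0,0,1]) /
                   (p[1,1,0] * p[1,0,1] * p[0,1,1] * p[0,0,0]))   # Eq. (20)

theta_1  = np.log(p[1,0,0] / p[0,0,0])                            # Eq. (21)
theta_12 = np.log(p[1,1,0] * p[0,0,0] / (p[1,0,0] * p[0,1,0]))    # Eq. (22)
psi      = -np.log(p[0,0,0])                                      # Eq. (23)

print("theta_123 =", theta_123)
print("theta_1   =", theta_1, " theta_12 =", theta_12, " psi =", psi)
```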

Notably, the submanifold $E_2(0)$ consists of all the distributions having no triple interaction but possibly pairwise interactions. The log probability $\log p(\mathbf{x})$ is quadratic and given by $\log p(\mathbf{x}) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j - \psi$. A stable distribution of a Boltzmann machine in neural networks belongs to this class, because there are no triple interactions among neurons (Amari et al., 1992). The submanifold $E_2(0)$ is characterized by $\theta_{123} = 0$, or in terms of $\boldsymbol{\eta}$ (see Eq. 20) by
$$\eta_{123} = \frac{(\eta_{12} - \eta_{123})(\eta_{13} - \eta_{123})(\eta_{23} - \eta_{123})(1 - \eta_1 - \eta_2 - \eta_3 + \eta_{12} + \eta_{13} + \eta_{23} - \eta_{123})}{(\eta_1 - \eta_{12} - \eta_{13} + \eta_{123})(\eta_2 - \eta_{12} - \eta_{23} + \eta_{123})(\eta_3 - \eta_{13} - \eta_{23} + \eta_{123})}.$$

4.2 Another orthogonal coordinate system

In the above, we extracted the pure triple interaction by using the coordinate $\theta_{123}$, such that $\boldsymbol{\eta}_2$ and $\theta_{123}$ are orthogonal. If we are interested in separating the simple marginals from the various kinds of interactions, we can use another decomposition. Let us summarize the three simple marginals in $\boldsymbol{\eta}_1 = (\eta_1, \eta_2, \eta_3)$ and then summarize all of the interaction terms in $\boldsymbol{\theta}_{1+} = (\theta_{12}, \theta_{13}, \theta_{23}, \theta_{123})$. Here, $\boldsymbol{\theta}_{1+}$ denotes the $\theta$-coordinates complementary to $\boldsymbol{\eta}_1$. Using this pair, we have another mixed coordinate system, denoted by $\boldsymbol{\zeta}_1$, as
$$\boldsymbol{\zeta}_1 = (\zeta_{11}, \dots, \zeta_{17}) = (\boldsymbol{\eta}_1, \boldsymbol{\theta}_{1+}). \qquad (25)$$
Here, $\boldsymbol{\eta}_1$ and $\boldsymbol{\theta}_{1+}$ are orthogonal. Geometrically, let $M_1(\boldsymbol{\eta}_1)$, specified by $\boldsymbol{\eta}_1 = (\eta_1, \eta_2, \eta_3)$, be the set of all the distributions having the same simple

marginals $\boldsymbol{\eta}_1 = (\eta_1, \eta_2, \eta_3)$ but having any pairwise and triplewise correlations. $M_1(\boldsymbol{\eta}_1)$ is a four-dimensional submanifold in which $\boldsymbol{\theta}_{1+}$ takes arbitrary values. On the other hand, let $E_1(\boldsymbol{\theta}_{1+})$ be a three-dimensional submanifold in which all of the distributions have the same $\boldsymbol{\theta}_{1+} = (\theta_{12}, \theta_{13}, \theta_{23}, \theta_{123})$ but different marginals $\boldsymbol{\eta}_1$. We have the following theorem.

Theorem 5. The coordinates $\boldsymbol{\eta}_1$ and $\boldsymbol{\theta}_{1+}$ are orthogonal, that is, $E_1(\boldsymbol{\theta}_{1+})$ is orthogonal to $M_1(\boldsymbol{\eta}_1)$.

Here, $\boldsymbol{\theta}_{1+}$ represents the degrees of pure correlations independent of the marginals $\boldsymbol{\eta}_1$, and it includes correlations resulting from the triplewise interaction in addition to the pairwise interactions. Because of the non-Euclidean character of $S_3$ (Amari and Nagaoka, 2000; Amari, 2001), we cannot have a coordinate system in which $\{\eta_i\}$, $\{\theta_{ij}\}$, and $\theta_{123}$ are all mutually orthogonal. The submanifold $E_1(0)$ has zero pairwise and triplewise correlations and hence consists entirely of independent distributions, in which $\eta_{ij} = \eta_i \eta_j$ and $\eta_{123} = \eta_1 \eta_2 \eta_3$ hold. The function $\log p(\mathbf{x})$ is then linear in $\mathbf{x}$, because $\boldsymbol{\theta}_{1+} = 0$ (see Eq. 19).

4.3 Projections and decompositions of divergence

Using the above two mixed coordinate systems, we decompose a probability distribution in the following two ways. Let us consider two probability distributions, $p(\mathbf{x})$ and $q(\mathbf{x})$, whose coordinates in any coordinate system are denoted by the superscripts $p$ and $q$, respectively. First, let us consider the case where $q$ is the independent uniform distribution. By using the mixed orthogonal coordinate system $\boldsymbol{\zeta}_2$, we now seek

to extract the pure triplewise interaction $\theta_{123}$. For $q$, we have
$$\theta^q_{123} = 0, \qquad \eta^q_1 = \eta^q_2 = \eta^q_3 = \frac{1}{2}, \qquad \eta^q_{12} = \eta^q_{23} = \eta^q_{13} = \frac{1}{4}.$$
Furthermore, we note that $q \in E_2(0)$ and also $q \in E_1(0)$. Let us m-project $p$ to $E_2(0)$ by
$$\bar{p}(\mathbf{x}) = \arg\min_{r \in E_2(0)} D[p(\mathbf{x}) : r(\mathbf{x})].$$
This $\bar{p}$ has the same pairwise marginals as $p$ but does not include any triplewise interaction, and its mixed coordinates are given by $\boldsymbol{\zeta}_2(\bar{p}) = (\boldsymbol{\eta}^p_2, \theta^q_{123}) = (\boldsymbol{\eta}^p_2, 0)$. The Pythagorean theorem gives us
$$D[p : q] = D[p : \bar{p}] + D[\bar{p} : q],$$
where $D[p : \bar{p}]$ represents the degree of pure triplewise interaction, while $D[\bar{p} : q]$ represents how $p$ differs from $q$ in the simple marginals and pairwise correlations. Let us next extract the pairwise interactions in $p(\mathbf{x})$ by using the other mixed coordinates $\boldsymbol{\zeta}_1$. To this end, let us project $p$ to $E_1(0)$, which is composed of independent distributions,
$$\tilde{p}(\mathbf{x}) = \arg\min_{s \in E_1(0)} D[p(\mathbf{x}) : s(\mathbf{x})].$$
More explicitly, we have $\boldsymbol{\zeta}_1(\tilde{p}) = (\boldsymbol{\eta}^p_1, \boldsymbol{\theta}^q_{1+}) = (\boldsymbol{\eta}^p_1, 0)$ and
$$D[p : q] = D[p : \tilde{p}] + D[\tilde{p} : q].$$
Here, $D[p : \tilde{p}]$ summarizes the effect of all the pairwise and triplewise interactions, while $D[\tilde{p} : q]$ represents the difference of the simple marginals from uniformity.

By taking the two decompositions together, we have
$$D[p : q] = D[p : \bar{p}] + D[\bar{p} : \tilde{p}] + D[\tilde{p} : q]. \qquad (26)$$
Here $D[p : \bar{p}]$ represents the degree of pure triplewise interaction in the probability distribution $p$, $D[\bar{p} : \tilde{p}]$ that of the pairwise interactions, and $D[\tilde{p} : q]$ that of the non-uniformity of the firing rates. Let us generalize Eq. 26 by dropping our assumption that $q$ is the independent uniform distribution. We then redefine $\bar{p}$ and $\tilde{p}$ as
$$\bar{p}(\mathbf{x}) = \arg\min_{r \in E_2(\theta^q_{123})} D[p(\mathbf{x}) : r(\mathbf{x})], \qquad \tilde{p}(\mathbf{x}) = \arg\min_{s \in E_1(\boldsymbol{\theta}^q_{1+})} D[p(\mathbf{x}) : s(\mathbf{x})].$$
We now have

Theorem 6.
$$D[p : q] = D[p : \bar{p}] + D[\bar{p} : q] \qquad (27)$$
$$D[p : q] = D[p : \tilde{p}] + D[\tilde{p} : q] \qquad (28)$$
$$D[p : q] = D[p : \bar{p}] + D[\bar{p} : \tilde{p}] + D[\tilde{p} : q]. \qquad (29)$$
The decompositions in the first and second lines are particularly interesting for neural data analysis purposes, as shown in the next section. Any coordinate transformation can be done freely in this three-variable case in a numerical sense. In general, however, coordinate transformations between $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are not easy when the dimensions are high. Later, we discuss several practical approaches in the $n$-neuron case for use in neural data analysis.
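The three-term decomposition (26) can be computed numerically. In the sketch below (our own implementation, with $q$ taken as the uniform distribution), the m-projection $\bar{p}$ onto $E_2(0)$ is obtained by iterative proportional fitting of the pairwise marginals, which converges to the maximum-entropy distribution with those marginals; $\tilde{p}$ is simply the product of the single marginals.

```python
# A sketch of the decomposition D[p:q] = D[p:pbar] + D[pbar:ptilde] + D[ptilde:q]
# of Eq. (26) for three binary variables, with q uniform.  pbar (same pairwise
# marginals as p, no triple interaction) is computed by iterative proportional
# fitting; ptilde is the independent distribution with p's single marginals.
import itertools
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def project_pairwise(p, iters=200):
    """m-projection of p onto E_2(0): max-entropy fit to p's 2-way marginals."""
    r = np.full(p.shape, 1.0 / p.size)
    for _ in range(iters):
        for (i, j) in itertools.combinations(range(p.ndim), 2):
            other = tuple(k for k in range(p.ndim) if k not in (i, j))
            ratio = p.sum(axis=other) / r.sum(axis=other)
            shape = [1] * p.ndim
            shape[i], shape[j] = p.shape[i], p.shape[j]
            r = r * ratio.reshape(shape)       # rescale to match the 2-way marginal
    return r

p = np.array([[[0.15, 0.10], [0.10, 0.08]],   # illustrative joint distribution
              [[0.12, 0.09], [0.11, 0.25]]])
q = np.full(p.shape, 1.0 / p.size)            # uniform distribution

pbar = project_pairwise(p)                    # no pure triple interaction
m = [p.sum(axis=tuple(k for k in range(3) if k != i)) for i in range(3)]
ptilde = m[0][:, None, None] * m[1][None, :, None] * m[2][None, None, :]

print("D[p:q]            =", kl(p, q))
print("sum of three terms=", kl(p, pbar) + kl(pbar, ptilde) + kl(ptilde, q))
print("triple, pairwise, marginal parts:",
      kl(p, pbar), kl(pbar, ptilde), kl(ptilde, q))
```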

4.4 Applications to firing of three neurons

Here, we briefly discuss the application of the above results to the firing of three neurons. The discussion in Section 3.5 can be naturally extended. We consider three binary random variables $X = (X_1, X_2, X_3)$ and denote our estimated probability distribution and the distribution of our null hypothesis by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ and $p(\mathbf{x}; \boldsymbol{\zeta}^0)$, respectively, where $\boldsymbol{\zeta}$ is now a seven-dimensional coordinate system. We use the following decompositions,
$$D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}\right] = D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}''_2\right] + D\left[\hat{\boldsymbol{\zeta}}''_2 : \hat{\boldsymbol{\zeta}}\right] = D\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}'_1\right] + D\left[\hat{\boldsymbol{\zeta}}'_1 : \hat{\boldsymbol{\zeta}}\right],$$
where $\hat{\boldsymbol{\zeta}}'_1 = (\boldsymbol{\eta}^0_1, \hat{\boldsymbol{\theta}}_{1+})$ and $\hat{\boldsymbol{\zeta}}''_2 = (\boldsymbol{\eta}^0_2, \hat{\theta}_{123})$. In the first decomposition, $D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}''_2]$ represents the discrepancy in the triplewise interaction of $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ from $p(\mathbf{x}; \boldsymbol{\zeta}^0)$, fixing the pairwise interactions and marginals as specified by $p(\mathbf{x}; \boldsymbol{\zeta}^0)$. $D[\hat{\boldsymbol{\zeta}}''_2 : \hat{\boldsymbol{\zeta}}]$ then collects all the residual discrepancy and, more precisely, represents the discrepancy of the distribution $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ from $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}''_2)$, which has the same simple and second-order marginals as those of $p(\mathbf{x}; \boldsymbol{\zeta}^0)$ (i.e., $\boldsymbol{\eta}^0_2$) and the same triplewise interaction $\hat{\theta}_{123}$ as that of $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$. Therefore, $D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}''_2]$ is particularly useful for investigating whether there is any significant triplewise interaction in our data, i.e., $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$, in comparison with our null hypothesis $p(\mathbf{x}; \boldsymbol{\zeta}^0)$. A significant triplewise interaction, for example, may be considered as indicative of three neurons functioning together. As for hypothesis testing, we can use
$$\lambda_2 = 2ND\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}''_2\right] \approx N g_{77}(\boldsymbol{\zeta}^0_2)\,(\theta^0_{123} - \hat{\theta}_{123})^2 \sim \chi^2(1), \qquad (30)$$
where $N$ is the number of trials and the indices refer to $\boldsymbol{\zeta}_2 = (\zeta_1, \dots, \zeta_6, \zeta_7) = (\boldsymbol{\eta}_2, \theta_{123})$.

In the second decomposition, $D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}'_1]$ represents the discrepancy in both the triplewise and pairwise interactions of $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$ from $p(\mathbf{x}; \boldsymbol{\zeta}^0)$, fixing the marginals as specified by $p(\mathbf{x}; \boldsymbol{\zeta}^0)$, while $D[\hat{\boldsymbol{\zeta}}'_1 : \hat{\boldsymbol{\zeta}}]$ collects all the residual discrepancy. $D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}'_1]$ is useful for investigating whether there is a significant coincident firing, taking the pairwise and triplewise interactions together, compared with the null hypothesis. We now have
$$\lambda_1 = 2ND\left[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}'_1\right] \approx N \sum_{i,j=4}^{7} g_{ij}(\boldsymbol{\zeta}^0_1)(\zeta^0_i - \hat{\zeta}_i)(\zeta^0_j - \hat{\zeta}_j) \sim \chi^2(4), \qquad (31)$$
where the indices are given by $\boldsymbol{\zeta}_1 = (\zeta_1, \dots, \zeta_7) = (\boldsymbol{\eta}_1, \boldsymbol{\theta}_{1+})$. We can also compare two probability distributions estimated under different experimental conditions. Let us denote the two estimated distributions by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}^1)$ and $p(\mathbf{x}; \hat{\boldsymbol{\zeta}}^2)$. We first detect the triplewise interaction. The maximum likelihood estimators, denoted by $\hat{\boldsymbol{\zeta}}^{1\prime}_2$ and $\hat{\boldsymbol{\zeta}}^{2\prime}_2$, of our null hypothesis, that is, $\theta^1_{123} = \theta^2_{123} = \theta_{123}$, are given by
$$(\hat{\boldsymbol{\zeta}}^{1\prime}_2, \hat{\boldsymbol{\zeta}}^{2\prime}_2) = \arg\max \sum_{j=1}^{N} \log p(\mathbf{x}_j; \boldsymbol{\zeta}^1_2)\, p(\mathbf{x}_j; \boldsymbol{\zeta}^2_2) \quad \text{subject to} \quad \theta^1_{123} = \theta^2_{123} = \theta_{123}.$$
Then, we have
$$\lambda = 2ND\left[\hat{\boldsymbol{\zeta}}^{1\prime}_2 : \hat{\boldsymbol{\zeta}}^1_2\right] + 2ND\left[\hat{\boldsymbol{\zeta}}^{2\prime}_2 : \hat{\boldsymbol{\zeta}}^2_2\right] \approx N g_{77}(\hat{\boldsymbol{\zeta}}^{1\prime}_2)(\hat{\theta}^1_{123} - \hat{\theta}'_{123})^2 + N g_{77}(\hat{\boldsymbol{\zeta}}^{2\prime}_2)(\hat{\theta}^2_{123} - \hat{\theta}'_{123})^2 \qquad (32)$$
$$\sim \chi^2(2). \qquad (33)$$
When we investigate the coincident firing, taking the pairwise and triplewise interactions together, we use the second decomposition above. The MLE

of our null hypothesis in this case is given by
$$(\hat{\boldsymbol{\zeta}}^{1\prime}_1, \hat{\boldsymbol{\zeta}}^{2\prime}_1) = \arg\max \sum_{j=1}^{N} \log p(\mathbf{x}_j; \boldsymbol{\zeta}^1_1)\, p(\mathbf{x}_j; \boldsymbol{\zeta}^2_1) \quad \text{subject to} \quad \boldsymbol{\theta}^1_{1+} = \boldsymbol{\theta}^2_{1+}.$$
For hypothesis testing, we can use
$$\lambda = 2ND\left[\hat{\boldsymbol{\zeta}}^{1\prime}_1 : \hat{\boldsymbol{\zeta}}^1_1\right] + 2ND\left[\hat{\boldsymbol{\zeta}}^{2\prime}_1 : \hat{\boldsymbol{\zeta}}^2_1\right] \approx N \sum_{i,j=4}^{7} g_{ij}(\hat{\boldsymbol{\zeta}}^{1\prime}_1)(\hat{\zeta}^{1\prime}_i - \hat{\zeta}^1_i)(\hat{\zeta}^{1\prime}_j - \hat{\zeta}^1_j) + N \sum_{i,j=4}^{7} g_{ij}(\hat{\boldsymbol{\zeta}}^{2\prime}_1)(\hat{\zeta}^{2\prime}_i - \hat{\zeta}^2_i)(\hat{\zeta}^{2\prime}_j - \hat{\zeta}^2_j) \sim \chi^2(8). \qquad (34)$$
The decompositions of the Kullback-Leibler divergence also allow us to decompose the mutual information between the firing pattern of three neurons $X = (X_1, X_2, X_3)$ and the behavior $Y$, in a similar manner to Section 3.6.

Theorem 7.
$$I(X; Y) = E_{p(X, Y)}\left[\log \frac{p(\mathbf{x}, y)}{p(\mathbf{x})\, p(y)}\right] \qquad (35)$$
$$= I_1(X; Y) + I_2(X; Y) \qquad (36)$$
$$= I_3(X; Y) + I_4(X; Y), \qquad (37)$$
where
$$I_1(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_1(X|y) : \boldsymbol{\zeta}_1(X, y)]\right], \qquad I_2(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_1(X, y) : \boldsymbol{\zeta}_1(X)]\right],$$
and we define $\boldsymbol{\zeta}_1(X, y) = (\boldsymbol{\eta}_1(X|y), \boldsymbol{\theta}_{1+}(X))$. Similarly,
$$I_3(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_2(X|y) : \boldsymbol{\zeta}_2(X, y)]\right], \qquad I_4(X; Y) = E_{p(Y)}\left[D[\boldsymbol{\zeta}_2(X, y) : \boldsymbol{\zeta}_2(X)]\right],$$

and we define $\boldsymbol{\zeta}_2(X, y) = (\boldsymbol{\eta}_2(X|y), \theta_{123}(X))$. In Eq. 36, the mutual information $I(X; Y)$ is decomposed into two parts: $I_1$, the mutual information conveyed by the pairwise and triplewise interactions of the firing, and $I_2$, the mutual information conveyed by the mean firing rate modulation. In Eq. 37, $I(X; Y)$ is decomposed differently: $I_3$, conveyed by the triplewise interaction, and $I_4$, conveyed by the other terms, that is, the pairwise and mean firing rate modulations.

5 General Case: Joint Distributions of $X_1, \dots, X_n$

Here we study the general case of $n$ neurons. Let $X = (X_1, \dots, X_n)$ be $n$ binary variables and let $p = p(\mathbf{x})$, $\mathbf{x} = (x_1, \dots, x_n)$, $x_i = 0, 1$, be its probability, where we assume $p(\mathbf{x}) > 0$ for all $\mathbf{x}$. We begin by briefly recapitulating Amari (2001) for the theoretical framework and then move to its applications.

5.1 Coordinate systems of $S_n$

As mentioned in Section 2, the set of all probability distributions $\{p(\mathbf{x})\}$ forms a $(2^n - 1)$-dimensional manifold $S_n$. Any $p(\mathbf{x})$ in $S_n$ can be represented by the $P$-coordinate system, the $\eta$-coordinate system or the $\theta$-coordinate system. The $P$-coordinate system is defined by
$$p_{i_1 \cdots i_n} = \mathrm{Prob}\{X_1 = i_1, \dots, X_n = i_n\}, \quad i_k = 0, 1, \quad \text{subject to} \quad \sum_{i_1, \dots, i_n} p_{i_1 \cdots i_n} = 1.$$

The $\theta$-coordinate system is defined by the expansion of $\log p(\mathbf{x})$ as
$$\log p(\mathbf{x}) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j + \sum_{i<j<k} \theta_{ijk} x_i x_j x_k + \cdots + \theta_{1 \cdots n} x_1 \cdots x_n - \psi, \qquad (38)$$
where the indices of $\theta_{ijk}$, etc., satisfy $i < j < k$, and then
$$\boldsymbol{\theta} = (\theta_i, \theta_{ij}, \theta_{ijk}, \dots, \theta_{12 \cdots n}) \qquad (39)$$
has $2^n - 1$ components and forms the $\theta$-coordinate system. It is easy to compute any component of $\boldsymbol{\theta}$; for example, we can get $\theta_1 = \log \frac{p_{10 \cdots 0}}{p_{0 \cdots 0}}$. For later convenience, we use the notation $\boldsymbol{\theta}_1 = (\theta_i)$, $\boldsymbol{\theta}_2 = (\theta_{ij})$, $\boldsymbol{\theta}_3 = (\theta_{ijk})$, $\dots$, $\boldsymbol{\theta}_n = \theta_{12 \cdots n}$, where the index of $\boldsymbol{\theta}_l$ runs over the $l$-tuples among the $n$ binary variables, yielding $\binom{n}{l}$ components ($\binom{n}{l}$ is the binomial coefficient). Then, we can write
$$\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \dots, \boldsymbol{\theta}_n).$$
On the other hand, the $\eta$-coordinate system is defined by using
$$\eta_i = E[x_i] \quad (i = 1, \dots, n), \qquad \eta_{ij} = E[x_i x_j] \quad (i < j), \qquad \dots, \qquad \eta_{12 \cdots n} = E[x_1 \cdots x_n],$$
which has $2^n - 1$ components (see Section 2); in other words, $\boldsymbol{\eta} = (\eta_i, \eta_{ij}, \dots, \eta_{1 \cdots n})$ forms the $\eta$-coordinate system in $S_n$. We also write $\boldsymbol{\eta} = (\boldsymbol{\eta}_1, \boldsymbol{\eta}_2, \dots, \boldsymbol{\eta}_n)$, which is linearly related to $\{p_{i_1 \cdots i_n}\}$. In the rest of this section, we mention some notions of information geometry, in an informal manner, for later convenience; readers who are interested in more details can refer to (Amari and Nagaoka, 2000).
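For small $n$, the transformation from the $P$-coordinates to the full set of $\boldsymbol{\theta}$-coordinates in Eqs. (38)-(39) can be written compactly as a Moebius inversion over subsets of indices, as in the following sketch (our own helper, not the authors' code); the cost grows exponentially with $n$, which is precisely the practical difficulty mentioned above.

```python
# A sketch of the coordinate transformation from the P-coordinates of n binary
# neurons to all theta-coordinates of Eq. (38)-(39), via Moebius inversion
# over subsets.  The function and example table below are our own.
import itertools
import numpy as np

def theta_coordinates(p):
    """p: array of shape (2,)*n with p > 0.  Returns {subset: theta_subset}."""
    n = p.ndim
    logp = np.log(p)
    theta = {}
    for k in range(1, n + 1):                       # order of the interaction
        for A in itertools.combinations(range(n), k):
            val = 0.0
            for r in range(len(A) + 1):
                for B in itertools.combinations(A, r):
                    # log p at the firing pattern with spikes exactly on B
                    x = tuple(1 if i in B else 0 for i in range(n))
                    val += (-1) ** (len(A) - len(B)) * logp[x]
            theta[A] = val
    return theta

# Example with n = 3 (same illustrative table as before).
p = np.array([[[0.15, 0.10], [0.10, 0.08]],
              [[0.12, 0.09], [0.11, 0.25]]])
th = theta_coordinates(p)
print("theta_1   =", th[(0,)])        # log p100/p000
print("theta_12  =", th[(0, 1)])      # log (p110 p000)/(p100 p010)
print("theta_123 =", th[(0, 1, 2)])   # Eq. (20)
```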

When a submanifold of $S_n$, denoted by $E$, is represented by linear constraints among the $\theta$-coordinates, $E$ is called exponentially flat, or e-flat. On the other hand, when a submanifold of $S_n$, denoted by $M$, is represented by linear constraints among the $\eta$-coordinates, $M$ is called mixture flat, or m-flat. The Fisher information matrices in the respective coordinate systems play the role of Riemannian metric tensors. The two coordinate systems $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ are dually coupled in the following sense. Let $A$, $B$, etc., denote ordered subsets of indices, which stand for components of $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$, i.e., $\boldsymbol{\theta} = (\theta_A)$, $\boldsymbol{\eta} = (\eta_B)$.

Theorem 8. The two metric tensors $G(\boldsymbol{\theta})$ and $G(\boldsymbol{\eta})$ are mutually inverse,
$$G(\boldsymbol{\eta}) = G(\boldsymbol{\theta})^{-1}, \qquad (40)$$
where $G(\boldsymbol{\theta}) = (g_{AB}(\boldsymbol{\theta}))$ and $G(\boldsymbol{\eta}) = (g^{AB}(\boldsymbol{\eta}))$ are defined by
$$g_{AB}(\boldsymbol{\theta}) = E\left[\frac{\partial \log p(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_A} \frac{\partial \log p(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_B}\right], \qquad g^{AB}(\boldsymbol{\eta}) = E\left[\frac{\partial \log p(\mathbf{x}; \boldsymbol{\eta})}{\partial \eta_A} \frac{\partial \log p(\mathbf{x}; \boldsymbol{\eta})}{\partial \eta_B}\right].$$
The following generalized Pythagoras theorem is known in $S_n$ (Csiszar, 1967b; Csiszar, 1975; Amari et al., 1992; Amari and Han, 1989). It holds in more general cases, playing a most important role in information geometry (Amari, 1987; Amari and Nagaoka, 2000).

Theorem 9. Let $p(\mathbf{x})$, $q(\mathbf{x})$ and $r(\mathbf{x})$ be three distributions such that the m-geodesic connecting $p(\mathbf{x})$ and $q(\mathbf{x})$ is orthogonal to the e-geodesic connecting $q(\mathbf{x})$ and $r(\mathbf{x})$. Then,
$$D[p : q] + D[q : r] = D[p : r]. \qquad (41)$$

5.2 Higher-order interactions

This section aims at defining the higher-order interactions, using the k-cut mixed coordinate system. Section 5.1 introduced $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_n)$ and $\boldsymbol{\eta} = (\boldsymbol{\eta}_1, \dots, \boldsymbol{\eta}_n)$, each of which spans $S_n$. Let us define their partitions, called a k-cut, as follows,
$$\boldsymbol{\eta} = (\boldsymbol{\eta}_{k-}, \boldsymbol{\eta}_{k+}), \qquad \boldsymbol{\theta} = (\boldsymbol{\theta}_{k-}, \boldsymbol{\theta}_{k+}), \qquad (42)$$
where $\boldsymbol{\eta}_{k-}$ and $\boldsymbol{\theta}_{k-}$ consist of the coordinates whose subindices have no more than $k$ indices, i.e., $\boldsymbol{\eta}_{k-} = (\boldsymbol{\eta}_1, \boldsymbol{\eta}_2, \dots, \boldsymbol{\eta}_k)$, $\boldsymbol{\theta}_{k-} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \dots, \boldsymbol{\theta}_k)$, and $\boldsymbol{\eta}_{k+}$ and $\boldsymbol{\theta}_{k+}$ consist of the coordinates whose subindices have more than $k$ indices, i.e., $\boldsymbol{\eta}_{k+} = (\boldsymbol{\eta}_{k+1}, \boldsymbol{\eta}_{k+2}, \dots, \boldsymbol{\eta}_n)$, $\boldsymbol{\theta}_{k+} = (\boldsymbol{\theta}_{k+1}, \boldsymbol{\theta}_{k+2}, \dots, \boldsymbol{\theta}_n)$. First, note that $\boldsymbol{\eta}_{k-}$ specifies the marginal distributions of any $k$ (or fewer than $k$) random variables among $X_1, \dots, X_n$. Let us consider a family of m-flat submanifolds in $S_n$,
$$M_k(\mathbf{m}_k) = \{\boldsymbol{\eta} \mid \boldsymbol{\eta}_{k-} = \mathbf{m}_k\}.$$
It consists of all the distributions having the same $k$-marginals specified by a fixed $\boldsymbol{\eta}_{k-} = \mathbf{m}_k$. They differ from one another only by higher-order interactions of more than $k$ variables. Second, all coordinate curves represented by $\boldsymbol{\theta}_{k+}$ are orthogonal to $\boldsymbol{\eta}_{k-}$, or to any component of $\boldsymbol{\eta}_{k-}$. Hence, $\boldsymbol{\theta}_{k+}$ represents interactions among more than $k$ variables independently of the $k$-marginals $\boldsymbol{\eta}_{k-}$. Then, for a constant vector $\mathbf{c}_k$, let us compose a family of e-flat submanifolds
$$E_{k+}(\mathbf{c}_k) = \{\boldsymbol{\theta} \mid \boldsymbol{\theta}_{k+} = \mathbf{c}_k\}.$$

Third, $E_{k+}(\mathbf{c}_k)$ and $M_k(\mathbf{m}_k)$ are mutually orthogonal, and they introduce a new coordinate system, called the k-cut mixed coordinate system, defined by
$$\boldsymbol{\zeta}_k = (\boldsymbol{\eta}_{k-}, \boldsymbol{\theta}_{k+}).$$
Any k-cut mixed coordinate system forms a coordinate system of $S_n$. A change in the $\boldsymbol{\theta}_{k+}$ part preserves the $k$-marginals of $p(\mathbf{x})$ (i.e., $\boldsymbol{\eta}_{k-}$), while a change in the $\boldsymbol{\eta}_{k-}$ part preserves the interactions among more than $k$ variables. These changes are mutually orthogonal. Thus, $E_{k+}(\boldsymbol{\theta}_{k+})$ is regarded as the submanifold consisting of distributions having the same degree of higher-order interactions. When $\boldsymbol{\theta}_{k+} = 0$, $E_{k+}(0)$ denotes the set of all the distributions having no intrinsic interactions of more than $k$ variables.

5.3 Projections and decompositions of higher-order interactions

Given $p(\mathbf{x})$, we define $p^{(k)}(\mathbf{x}) = \Pi^{(k)} p$ by
$$p^{(k)}(\mathbf{x}) = \Pi^{(k)} p = \arg\min_{q \in E_{k+}(0)} D[p : q].$$
This is the point closest to $p$ among those that do not have intrinsic interactions of more than $k$ variables. We note that another characterization of $p^{(k)}$ is given by
$$p^{(k)}(\mathbf{x}) = \arg\min_{q \in M_k(\boldsymbol{\eta}^p_{k-})} D\left[q : p^{(0)}\right],$$

where it should be easy to see that $p^{(0)}$ is the uniform distribution, by definition of $p^{(0)}$. The e-geodesic connecting $p^{(k)}$ and $p^{(0)}$ is orthogonal to $M_k(\boldsymbol{\eta}^p_{k-})$, to which the original $p$ belongs. The k-cut mixed coordinates of $p^{(k)}$ are given by $\boldsymbol{\zeta}_k(p^{(k)}) = (\boldsymbol{\eta}_{k-}, \boldsymbol{\theta}_{k+} = 0)$. The degree of interactions higher than $k$ is hence defined by $D[p : p^{(k)}]$. Since the m-geodesic connecting $p$ and $p^{(k)}$ is orthogonal to $E_{k+}(0)$, the Pythagoras theorem guarantees the following decomposition,
$$D\left[p : p^{(0)}\right] = D\left[p : p^{(k)}\right] + D\left[p^{(k)} : p^{(0)}\right].$$
Let us put
$$D_k(p) = D\left[p^{(k)} : p^{(k-1)}\right].$$
Then, $D_k(p)$ is interpreted as the degree of interaction purely among $k$ variables. We then have the following decomposition, in which $D_k(p)$ denotes the degree of interaction among $k$ variables.

Theorem 10.
$$D\left[p : p^{(0)}\right] = \sum_{k=1}^{n} D_k(p). \qquad (43)$$
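Theorem 10 can be illustrated numerically for small $n$ by computing each projection $p^{(k)}$ with iterative proportional fitting of all $k$-th order marginals. The sketch below is our own generalization of the earlier fitting helper (illustrative distribution, not the paper's data).

```python
# A sketch of the hierarchical decomposition of Theorem 10:
# D[p : p^(0)] = sum_k D_k(p), where p^(k) is the maximum-entropy distribution
# sharing all k-th (and lower) order marginals with p.
import itertools
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def project_order_k(p, k, iters=300):
    """m-projection of p onto E_{k+}(0): fit all k-way marginals of p by IPF."""
    n = p.ndim
    if k == 0:
        return np.full(p.shape, 1.0 / p.size)
    if k >= n:
        return p.copy()
    r = np.full(p.shape, 1.0 / p.size)
    for _ in range(iters):
        for A in itertools.combinations(range(n), k):
            other = tuple(i for i in range(n) if i not in A)
            ratio = p.sum(axis=other) / r.sum(axis=other)
            shape = [p.shape[i] if i in A else 1 for i in range(n)]
            r = r * ratio.reshape(shape)
    return r

# Illustrative joint distribution of n = 3 neurons (all probabilities > 0).
p = np.array([[[0.15, 0.10], [0.10, 0.08]],
              [[0.12, 0.09], [0.11, 0.25]]])
n = p.ndim

proj = [project_order_k(p, k) for k in range(n + 1)]       # p^(0), ..., p^(n)
D_k = [kl(proj[k], proj[k - 1]) for k in range(1, n + 1)]  # per-order terms

print("D_k for k = 1..n :", D_k)
print("sum of D_k       =", sum(D_k))
print("D[p : p^(0)]     =", kl(p, proj[0]))                # Theorem 10 check
```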

It is straightforward to generalize the above results to the case where we are given two distributions $p(\mathbf{x})$ and $q(\mathbf{x})$. Let us define
$$\boldsymbol{\zeta}^{(k')}_k = \boldsymbol{\zeta}_k(p^{(k')}) = (\boldsymbol{\eta}_{k-}(p), \boldsymbol{\theta}_{k+}(q)).$$
Then, we have
$$D[p : q] = D\left[\boldsymbol{\zeta}^p_k : \boldsymbol{\zeta}^q_k\right] = D\left[\boldsymbol{\zeta}^p_k : \boldsymbol{\zeta}^{(k')}_k\right] + D\left[\boldsymbol{\zeta}^{(k')}_k : \boldsymbol{\zeta}^q_k\right],$$
which is induced from Theorem 9. By defining
$$D_{k'}(p) = D\left[p^{(k')} : p^{((k-1)')}\right],$$
we obtain
$$D[p : q] = \sum_{k=1}^{n} D_{k'}(p). \qquad (44)$$
The decompositions shown in Eqs. 43 and 44 are obviously similar to each other. A critical difference, however, exists in the interpretation of the two decompositions. Each term in Eq. 43, $D_k(p)$, represents the degree of the purely $k$-th order interaction, whereas $D_{k'}(p)$ in Eq. 44 does not necessarily do so. This is because $\boldsymbol{\zeta}_k(p^{(k)})$ always has $\boldsymbol{\theta}_{k+} = 0$, that is, zero in the coordinates of order higher than $k$. On the other hand, $\boldsymbol{\zeta}_k(p^{(k')})$ does not necessarily have zero in the corresponding part. In other words, $\boldsymbol{\zeta}_k$ represents the pure $k$-th order interaction only if $\boldsymbol{\theta}_{k+} = 0$.

5.4 Application to neural firing

To apply the results in the above sections to neural firing data, the discussion for the case of three neurons can be directly applied. Hence, we mainly provide some remarks in this section. First, suppose that we have an estimated probability distribution of $n$ neurons, denoted by $p(\mathbf{x}; \hat{\boldsymbol{\zeta}})$, and a probability distribution of our null hypothesis, $p(\mathbf{x}; \boldsymbol{\zeta}^0)$. Then, using the k-cut mixed coordinates, we obtain the decomposition
$$D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}] = D[\boldsymbol{\zeta}^0 : \hat{\boldsymbol{\zeta}}^0_k] + D[\hat{\boldsymbol{\zeta}}^0_k : \hat{\boldsymbol{\zeta}}]$$


Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Information geometry of Bayesian statistics

Information geometry of Bayesian statistics Information geometry of Bayesian statistics Hiroshi Matsuzoe Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya 466-8555, Japan Abstract.

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

These outputs can be written in a more convenient form: with y(i) = Hc m (i) n(i) y(i) = (y(i); ; y K (i)) T ; c m (i) = (c m (i); ; c m K(i)) T and n

These outputs can be written in a more convenient form: with y(i) = Hc m (i) n(i) y(i) = (y(i); ; y K (i)) T ; c m (i) = (c m (i); ; c m K(i)) T and n Binary Codes for synchronous DS-CDMA Stefan Bruck, Ulrich Sorger Institute for Network- and Signal Theory Darmstadt University of Technology Merckstr. 25, 6428 Darmstadt, Germany Tel.: 49 65 629, Fax:

More information

Model Complexity of Pseudo-independent Models

Model Complexity of Pseudo-independent Models Model Complexity of Pseudo-independent Models Jae-Hyuck Lee and Yang Xiang Department of Computing and Information Science University of Guelph, Guelph, Canada {jaehyuck, yxiang}@cis.uoguelph,ca Abstract

More information

Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation

Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation NOTE Communicated by Jonathan Victor Spike Count Correlation Increases with Length of Time Interval in the Presence of Trial-to-Trial Variation Robert E. Kass kass@stat.cmu.edu Valérie Ventura vventura@stat.cmu.edu

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

Information geometry in optimization, machine learning and statistical inference

Information geometry in optimization, machine learning and statistical inference Front. Electr. Electron. Eng. China 2010, 5(3): 241 260 DOI 10.1007/s11460-010-0101-3 Shun-ichi AMARI Information geometry in optimization, machine learning and statistical inference c Higher Education

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models

f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,

More information

Information Geometric view of Belief Propagation

Information Geometric view of Belief Propagation Information Geometric view of Belief Propagation Yunshu Liu 2013-10-17 References: [1]. Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, Stochastic reasoning, Free energy and Information Geometry, Neural

More information

Natural Gradient Learning for Over- and Under-Complete Bases in ICA

Natural Gradient Learning for Over- and Under-Complete Bases in ICA NOTE Communicated by Jean-François Cardoso Natural Gradient Learning for Over- and Under-Complete Bases in ICA Shun-ichi Amari RIKEN Brain Science Institute, Wako-shi, Hirosawa, Saitama 351-01, Japan Independent

More information

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY OUTLINE 3.1 Why Probability? 3.2 Random Variables 3.3 Probability Distributions 3.4 Marginal Probability 3.5 Conditional Probability 3.6 The Chain

More information

Information Geometric Structure on Positive Definite Matrices and its Applications

Information Geometric Structure on Positive Definite Matrices and its Applications Information Geometric Structure on Positive Definite Matrices and its Applications Atsumi Ohara Osaka University 2010 Feb. 21 at Osaka City University 大阪市立大学数学研究所情報幾何関連分野研究会 2010 情報工学への幾何学的アプローチ 1 Outline

More information

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition)

Vector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition) Vector Space Basics (Remark: these notes are highly formal and may be a useful reference to some students however I am also posting Ray Heitmann's notes to Canvas for students interested in a direct computational

More information

In: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999

In: Advances in Intelligent Data Analysis (AIDA), International Computer Science Conventions. Rochester New York, 1999 In: Advances in Intelligent Data Analysis (AIDA), Computational Intelligence Methods and Applications (CIMA), International Computer Science Conventions Rochester New York, 999 Feature Selection Based

More information

STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY

STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY 2nd International Symposium on Information Geometry and its Applications December 2-6, 2005, Tokyo Pages 000 000 STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY JUN-ICHI TAKEUCHI, ANDREW R. BARRON, AND

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Connexions module: m11446 1 Maximum Likelihood Estimation Clayton Scott Robert Nowak This work is produced by The Connexions Project and licensed under the Creative Commons Attribution License Abstract

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

2 Kanatani Fig. 1. Class A is never chosen whatever distance measure is used. not more than the residual of point tting. This is because the discrepan

2 Kanatani Fig. 1. Class A is never chosen whatever distance measure is used. not more than the residual of point tting. This is because the discrepan International Journal of Computer Vision, 26, 1{21 (1998) c 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. Geometric Information Criterion for Model Selection KENICHI KANATANI

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Detection of spike patterns using pattern ltering, with applications to sleep replay in birdsong

Detection of spike patterns using pattern ltering, with applications to sleep replay in birdsong Neurocomputing 52 54 (2003) 19 24 www.elsevier.com/locate/neucom Detection of spike patterns using pattern ltering, with applications to sleep replay in birdsong Zhiyi Chi a;, Peter L. Rauske b, Daniel

More information

PRIME GENERATING LUCAS SEQUENCES

PRIME GENERATING LUCAS SEQUENCES PRIME GENERATING LUCAS SEQUENCES PAUL LIU & RON ESTRIN Science One Program The University of British Columbia Vancouver, Canada April 011 1 PRIME GENERATING LUCAS SEQUENCES Abstract. The distribution of

More information

Why is Deep Learning so effective?

Why is Deep Learning so effective? Ma191b Winter 2017 Geometry of Neuroscience The unreasonable effectiveness of deep learning This lecture is based entirely on the paper: Reference: Henry W. Lin and Max Tegmark, Why does deep and cheap

More information

Applications of Information Geometry to Hypothesis Testing and Signal Detection

Applications of Information Geometry to Hypothesis Testing and Signal Detection CMCAA 2016 Applications of Information Geometry to Hypothesis Testing and Signal Detection Yongqiang Cheng National University of Defense Technology July 2016 Outline 1. Principles of Information Geometry

More information

Estimation of information-theoretic quantities

Estimation of information-theoretic quantities Estimation of information-theoretic quantities Liam Paninski Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/ liam liam@gatsby.ucl.ac.uk November 16, 2004 Some

More information

2 JOSE BURILLO It was proved by Thurston [2, Ch.8], using geometric methods, and by Gersten [3], using combinatorial methods, that the integral 3-dime

2 JOSE BURILLO It was proved by Thurston [2, Ch.8], using geometric methods, and by Gersten [3], using combinatorial methods, that the integral 3-dime DIMACS Series in Discrete Mathematics and Theoretical Computer Science Volume 00, 1997 Lower Bounds of Isoperimetric Functions for Nilpotent Groups Jose Burillo Abstract. In this paper we prove that Heisenberg

More information

Neuronal Tuning: To Sharpen or Broaden?

Neuronal Tuning: To Sharpen or Broaden? NOTE Communicated by Laurence Abbott Neuronal Tuning: To Sharpen or Broaden? Kechen Zhang Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute for Biological Studies,

More information

Inner Product Spaces

Inner Product Spaces Inner Product Spaces Introduction Recall in the lecture on vector spaces that geometric vectors (i.e. vectors in two and three-dimensional Cartesian space have the properties of addition, subtraction,

More information

Error Empirical error. Generalization error. Time (number of iteration)

Error Empirical error. Generalization error. Time (number of iteration) Submitted to Neural Networks. Dynamics of Batch Learning in Multilayer Networks { Overrealizability and Overtraining { Kenji Fukumizu The Institute of Physical and Chemical Research (RIKEN) E-mail: fuku@brain.riken.go.jp

More information

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. A Superharmonic Prior for the Autoregressive Process of the Second Order

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. A Superharmonic Prior for the Autoregressive Process of the Second Order MATHEMATICAL ENGINEERING TECHNICAL REPORTS A Superharmonic Prior for the Autoregressive Process of the Second Order Fuyuhiko TANAKA and Fumiyasu KOMAKI METR 2006 30 May 2006 DEPARTMENT OF MATHEMATICAL

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

460 HOLGER DETTE AND WILLIAM J STUDDEN order to examine how a given design behaves in the model g` with respect to the D-optimality criterion one uses

460 HOLGER DETTE AND WILLIAM J STUDDEN order to examine how a given design behaves in the model g` with respect to the D-optimality criterion one uses Statistica Sinica 5(1995), 459-473 OPTIMAL DESIGNS FOR POLYNOMIAL REGRESSION WHEN THE DEGREE IS NOT KNOWN Holger Dette and William J Studden Technische Universitat Dresden and Purdue University Abstract:

More information

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix Definition: Let L : V 1 V 2 be a linear operator. The null space N (L) of L is the subspace of V 1 defined by N (L) = {x

More information

Independent Component Analysis on the Basis of Helmholtz Machine

Independent Component Analysis on the Basis of Helmholtz Machine Independent Component Analysis on the Basis of Helmholtz Machine Masashi OHATA *1 ohatama@bmc.riken.go.jp Toshiharu MUKAI *1 tosh@bmc.riken.go.jp Kiyotoshi MATSUOKA *2 matsuoka@brain.kyutech.ac.jp *1 Biologically

More information

COMPSCI 650 Applied Information Theory Jan 21, Lecture 2

COMPSCI 650 Applied Information Theory Jan 21, Lecture 2 COMPSCI 650 Applied Information Theory Jan 21, 2016 Lecture 2 Instructor: Arya Mazumdar Scribe: Gayane Vardoyan, Jong-Chyi Su 1 Entropy Definition: Entropy is a measure of uncertainty of a random variable.

More information

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science Three-dimensional Stable Matching Problems Cheng Ng and Daniel S Hirschberg Department of Information and Computer Science University of California, Irvine Irvine, CA 92717 Abstract The stable marriage

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

2 1. Introduction. Neuronal networks often exhibit a rich variety of oscillatory behavior. The dynamics of even a single cell may be quite complicated

2 1. Introduction. Neuronal networks often exhibit a rich variety of oscillatory behavior. The dynamics of even a single cell may be quite complicated GEOMETRIC ANALYSIS OF POPULATION RHYTHMS IN SYNAPTICALLY COUPLED NEURONAL NETWORKS J. Rubin and D. Terman Dept. of Mathematics; Ohio State University; Columbus, Ohio 43210 Abstract We develop geometric

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Maximum Likelihood (ML) Estimation

Maximum Likelihood (ML) Estimation Econometrics 2 Fall 2004 Maximum Likelihood (ML) Estimation Heino Bohn Nielsen 1of32 Outline of the Lecture (1) Introduction. (2) ML estimation defined. (3) ExampleI:Binomialtrials. (4) Example II: Linear

More information

Disambiguating Different Covariation Types

Disambiguating Different Covariation Types NOTE Communicated by George Gerstein Disambiguating Different Covariation Types Carlos D. Brody Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 925, U.S.A. Covariations

More information

Page 52. Lecture 3: Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 2008/10/03 Date Given: 2008/10/03

Page 52. Lecture 3: Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 2008/10/03 Date Given: 2008/10/03 Page 5 Lecture : Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 008/10/0 Date Given: 008/10/0 Inner Product Spaces: Definitions Section. Mathematical Preliminaries: Inner

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Comparison of single spike train descriptive models. by information geometric measure

Comparison of single spike train descriptive models. by information geometric measure RIKEN BSI BSIS Technical Report: No05-1. Comparison of single spie train descriptive models by information geometric measure Hiroyui Naahara 1,3,, Shun-ichi Amari 1, Barry J. Richmond 2 1 Lab. for Mathematical

More information

On the Equivariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund Computer Vision Laboratory, Depa

On the Equivariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund Computer Vision Laboratory, Depa On the Invariance of the Orientation and the Tensor Field Representation Klas Nordberg Hans Knutsson Gosta Granlund LiTH-ISY-R-530 993-09-08 On the Equivariance of the Orientation and the Tensor Field

More information

Example Bases and Basic Feasible Solutions 63 Let q = >: ; > and M = >: ;2 > and consider the LCP (q M). The class of ; ;2 complementary cones

Example Bases and Basic Feasible Solutions 63 Let q = >: ; > and M = >: ;2 > and consider the LCP (q M). The class of ; ;2 complementary cones Chapter 2 THE COMPLEMENTARY PIVOT ALGORITHM AND ITS EXTENSION TO FIXED POINT COMPUTING LCPs of order 2 can be solved by drawing all the complementary cones in the q q 2 - plane as discussed in Chapter.

More information

G. Larry Bretthorst. Washington University, Department of Chemistry. and. C. Ray Smith

G. Larry Bretthorst. Washington University, Department of Chemistry. and. C. Ray Smith in Infrared Systems and Components III, pp 93.104, Robert L. Caswell ed., SPIE Vol. 1050, 1989 Bayesian Analysis of Signals from Closely-Spaced Objects G. Larry Bretthorst Washington University, Department

More information

Inference. Data. Model. Variates

Inference. Data. Model. Variates Data Inference Variates Model ˆθ (,..., ) mˆθn(d) m θ2 M m θ1 (,, ) (,,, ) (,, ) α = :=: (, ) F( ) = = {(, ),, } F( ) X( ) = Γ( ) = Σ = ( ) = ( ) ( ) = { = } :=: (U, ) , = { = } = { = } x 2 e i, e j

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure

Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure Entropy 014, 16, 131-145; doi:10.3390/e1604131 OPEN ACCESS entropy ISSN 1099-4300 www.mdpi.com/journal/entropy Article Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable

More information

Unsupervised learning: beyond simple clustering and PCA

Unsupervised learning: beyond simple clustering and PCA Unsupervised learning: beyond simple clustering and PCA Liza Rebrova Self organizing maps (SOM) Goal: approximate data points in R p by a low-dimensional manifold Unlike PCA, the manifold does not have

More information

12 : Variational Inference I

12 : Variational Inference I 10-708: Probabilistic Graphical Models, Spring 2015 12 : Variational Inference I Lecturer: Eric P. Xing Scribes: Fattaneh Jabbari, Eric Lei, Evan Shapiro 1 Introduction Probabilistic inference is one of

More information

Efficient Coding. Odelia Schwartz 2017

Efficient Coding. Odelia Schwartz 2017 Efficient Coding Odelia Schwartz 2017 1 Levels of modeling Descriptive (what) Mechanistic (how) Interpretive (why) 2 Levels of modeling Fitting a receptive field model to experimental data (e.g., using

More information

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M =

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = CALCULUS ON MANIFOLDS 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = a M T am, called the tangent bundle, is itself a smooth manifold, dim T M = 2n. Example 1.

More information

Series 7, May 22, 2018 (EM Convergence)

Series 7, May 22, 2018 (EM Convergence) Exercises Introduction to Machine Learning SS 2018 Series 7, May 22, 2018 (EM Convergence) Institute for Machine Learning Dept. of Computer Science, ETH Zürich Prof. Dr. Andreas Krause Web: https://las.inf.ethz.ch/teaching/introml-s18

More information

RESEARCH STATEMENT. Nora Youngs, University of Nebraska - Lincoln

RESEARCH STATEMENT. Nora Youngs, University of Nebraska - Lincoln RESEARCH STATEMENT Nora Youngs, University of Nebraska - Lincoln 1. Introduction Understanding how the brain encodes information is a major part of neuroscience research. In the field of neural coding,

More information

Measuring Information Spatial Densities

Measuring Information Spatial Densities LETTER Communicated by Misha Tsodyks Measuring Information Spatial Densities Michele Bezzi michele@dma.unifi.it Cognitive Neuroscience Sector, S.I.S.S.A, Trieste, Italy, and INFM sez. di Firenze, 2 I-50125

More information

Probability on a Riemannian Manifold

Probability on a Riemannian Manifold Probability on a Riemannian Manifold Jennifer Pajda-De La O December 2, 2015 1 Introduction We discuss how we can construct probability theory on a Riemannian manifold. We make comparisons to this and

More information

Initial-Value Problems in General Relativity

Initial-Value Problems in General Relativity Initial-Value Problems in General Relativity Michael Horbatsch March 30, 2006 1 Introduction In this paper the initial-value formulation of general relativity is reviewed. In section (2) domains of dependence,

More information

1. Introduction As is well known, the bosonic string can be described by the two-dimensional quantum gravity coupled with D scalar elds, where D denot

1. Introduction As is well known, the bosonic string can be described by the two-dimensional quantum gravity coupled with D scalar elds, where D denot RIMS-1161 Proof of the Gauge Independence of the Conformal Anomaly of Bosonic String in the Sense of Kraemmer and Rebhan Mitsuo Abe a; 1 and Noboru Nakanishi b; 2 a Research Institute for Mathematical

More information

Dually Flat Geometries in the State Space of Statistical Models

Dually Flat Geometries in the State Space of Statistical Models 1/ 12 Dually Flat Geometries in the State Space of Statistical Models Jan Naudts Universiteit Antwerpen ECEA, November 2016 J. Naudts, Dually Flat Geometries in the State Space of Statistical Models. In

More information

Information Theory in Intelligent Decision Making

Information Theory in Intelligent Decision Making Information Theory in Intelligent Decision Making Adaptive Systems and Algorithms Research Groups School of Computer Science University of Hertfordshire, United Kingdom June 7, 2015 Information Theory

More information

1 Introduction Tasks like voice or face recognition are quite dicult to realize with conventional computer systems, even for the most powerful of them

1 Introduction Tasks like voice or face recognition are quite dicult to realize with conventional computer systems, even for the most powerful of them Information Storage Capacity of Incompletely Connected Associative Memories Holger Bosch Departement de Mathematiques et d'informatique Ecole Normale Superieure de Lyon Lyon, France Franz Kurfess Department

More information

LECTURE 28: VECTOR BUNDLES AND FIBER BUNDLES

LECTURE 28: VECTOR BUNDLES AND FIBER BUNDLES LECTURE 28: VECTOR BUNDLES AND FIBER BUNDLES 1. Vector Bundles In general, smooth manifolds are very non-linear. However, there exist many smooth manifolds which admit very nice partial linear structures.

More information

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces

Contents. 2.1 Vectors in R n. Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v. 2.50) 2 Vector Spaces Linear Algebra (part 2) : Vector Spaces (by Evan Dummit, 2017, v 250) Contents 2 Vector Spaces 1 21 Vectors in R n 1 22 The Formal Denition of a Vector Space 4 23 Subspaces 6 24 Linear Combinations and

More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

i=1 h n (ˆθ n ) = 0. (2)

i=1 h n (ˆθ n ) = 0. (2) Stat 8112 Lecture Notes Unbiased Estimating Equations Charles J. Geyer April 29, 2012 1 Introduction In this handout we generalize the notion of maximum likelihood estimation to solution of unbiased estimating

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 3, 25 Random Variables Until now we operated on relatively simple sample spaces and produced measure functions over sets of outcomes. In many situations,

More information

ECE534, Spring 2018: Solutions for Problem Set #3

ECE534, Spring 2018: Solutions for Problem Set #3 ECE534, Spring 08: Solutions for Problem Set #3 Jointly Gaussian Random Variables and MMSE Estimation Suppose that X, Y are jointly Gaussian random variables with µ X = µ Y = 0 and σ X = σ Y = Let their

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information