Information Geometry on Hierarchy of Probability Distributions


Information Geometry on Hierarchy of Probability Distributions
Shun-ichi Amari, Fellow, IEEE

Abstract: An exponential family or mixture family of probability distributions has a natural hierarchical structure. This paper gives an orthogonal decomposition of such a system based on information geometry. A typical example is the decomposition of stochastic dependency among a number of random variables. In general, they have a complex structure of dependencies. Pairwise dependency is easily represented by correlation, but it is more difficult to measure effects of pure triplewise or higher order interactions (dependencies) among these variables. Stochastic dependency is decomposed quantitatively into an orthogonal sum of pairwise, triplewise, and further higher order dependencies. This gives a new invariant decomposition of joint entropy. This problem is important for extracting intrinsic interactions in firing patterns of an ensemble of neurons and for estimating its functional connections. The orthogonal decomposition is given in a wide class of hierarchical structures including both exponential and mixture families. As an example, we decompose the dependency in a higher order Markov chain into a sum of those in various lower order Markov chains.

Index Terms: Decomposition of entropy, e- and m-projections, extended Pythagoras theorem, higher order interactions, higher order Markov chain, information geometry, Kullback divergence.

Manuscript received November 15, 1999; revised December 28, 2000. The author is with the Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Saitama 351-0198, Japan (e-mail: amari@brain.riken.go.jp). Communicated by I. Csiszár, Associate Editor for Shannon Theory.

I. INTRODUCTION

We study structures of hierarchical systems of probability distributions by information geometry. Examples of such systems are exponential families, mixture families, higher order Markov chains, autoregressive (AR) and moving-average (MA) models, and others. Given a probability distribution, we decompose it into hierarchical components. Unlike in a Euclidean space, an orthogonal decomposition into components does not exist in general. However, when a system of probability distributions forms a dually flat Riemannian manifold, we can decompose the effects in the various hierarchies in a quasi-orthogonal manner.

A typical example we study is interactions among a number of random variables. Interactions among them include not only pairwise correlations, but also triplewise and higher interactions, forming a hierarchical structure. This case has been studied extensively by the log-linear model [2], which gives a hierarchical structure, but the log-linear model itself does not give an orthogonal decomposition of interactions. Given a joint distribution of random variables, it is important to search for an invariant orthogonal decomposition of their degrees or amounts of interactions into pairwise, triplewise, and higher order interactions. To this end, we study the family of joint probability distributions of the variables which have at most $k$-way interactions but no higher interactions. Two dual types of projections, namely, the e-projection and the m-projection, to such subspaces play a fundamental role. The present paper studies such a hierarchical structure and the related invariant, quasi-orthogonal, quantitative decomposition by using information geometry [3], [8], [12], [14], [28], [30].
Information geometry studies the intrinsic geometrical structure to be introduced in the manifold of a family of probability distributions. Its Riemannian structure was introduced by Rao [37]. Csiszár [21], [22], [23] studied the geometry of the KL divergence in detail and applied it to information theory. It was Chentsov [19] who developed Rao's idea further and introduced new invariant affine connections in the manifolds of probability distributions. Nagaoka and Amari [31] developed a theory of dual structures and unified all of these theories in the dual differential-geometrical framework (see also [3], [14], [31]). Information geometry has so far been used not only for the mathematical foundations of statistical inference ([3], [12], [28], and many others) but has also been applied to information theory [5], [11], [25], [18], neural networks [6], [7], [9], [13], systems theory [4], [32], mathematical programming [33], statistical physics [10], [16], [38], and others. Mathematical foundations of information geometry in the function space were given by Pistone and his coworkers [35], [36] and are now being developed further.

The present paper shows how information geometry gives an answer to the problem of invariant decomposition for hierarchical systems of probability distributions. This leads to a new invariant decomposition of entropy and information. It can be applied to the analysis of synchronous firing patterns of neurons by decomposing their effects into hierarchies [34], [1]. Such a hierarchical structure also exists in the family of higher order Markov chains and also in graphical conditional independence models [29].

The present paper is organized as follows. After the Introduction, Section II is devoted to simple introductory explanations of information geometry. We then study the e-flat and m-flat hierarchical structures, and give the quantitative orthogonal decomposition of higher order effects in these structures. We then show a simple example consisting of three binary variables and study how a joint distribution is quantitatively decomposed into pairwise and pure triplewise interactions. This gives a new decomposition of the joint entropy. Section IV is devoted to a general theory on the decomposition of interactions among n variables. A new decomposition of entropy is also derived there. We

1702 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 5, JULY 2001 touch upon the cases of multivalued rom variables continuous variables. We finally explain the hierarchical structures of higher order Markov chains. II. PRELIMINARIES FROM INFORMATION GEOMETRY A. Manifold, Curve, Orthogonality Let us consider a parameterized family of probability distributions, is a rom variable is a real vector parameter to specify a distribution. The family is regarded as an -dimensional manifold having as a coordinate system. When the Fisher information matrix denotes expectation with respect to, is nondegenerate, is a Riemannian manifold, plays the role of a Riemannian metric tensor. The squared distance between two nearby distributions is given by the quadratic form of It is known that this is twice the Kullback Leibler divergence Let us consider a curve parameterized by in, that is, a one-parameter family of distributions in.itis convenient to represent the tangent vector of the curve at by the rom variable called the score which shows how the log probability changes as increases. Given two curves intersecting at, the inner product of the two tangent vectors is given by (1) (2) (3) (4) (5) (6) B. Dually Flat Manifolds A manifold is said to be -flat (exponential-flat), when there exists a coordinate system (parameterization) such that, for all identically. Such is called -affine coordinates. When a curve is given by a linear function in the -coordinates, are constant vectors, it is called an -geodesic. Any coordinate curve itself is an -geodesic. (It is possible to define an -geodesic in any manifold, but it is no more linear we need the concept of the affine connection.) A typical example of an -flat manifold is the well-known exponential family written as are given functions is the normalizing factor. The -affine coordinates are the canonical parameters, (8) holds because (8) (9) (10) does not depend on. Dually to the above, a manifold is said to be -flat (mixtureflat), when there exists a coordinate system such that (11) identically. Here, is called -affine coordinates. A curve is called an -geodesic when it is represented by a linear function in the -affine coordinates. Any coordinate curve of is an -geodesic. A typical example of an -flat manifold is the mixture family (12) are given probability distributions,. The following theorem is known in information geometry. Theorem 1: A manifold is -flat when only when it is -flat vice versa. This shows that an exponential family is automatically -flat although it is not necessarily a mixture family. A mixture family is -flat, although it is not in general an exponential family. The -affine coordinates ( -coordinates) of an exponential family are given by (13) The two curves intersect at orthogonally when (7) which is known as the expectation parameters. The coordinate transformation between is given by the Legendre transformation, the inverse transformation is that is, when the two scores are noncorrelated. (14)
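The basic relations of this section can be checked numerically on a toy exponential family. The following sketch (Python with NumPy; the sufficient statistics, function names, and finite-difference checks are our own illustrative choices, not part of the original paper) builds p(x; theta) = exp(theta . t(x) - psi(theta)) over two binary variables, verifies that eta = d psi / d theta and that the Fisher metric is the Hessian of psi, and checks that the canonical divergence psi(theta_P) + phi(eta_Q) - theta_P . eta_Q coincides with a Kullback-Leibler divergence.

```python
import numpy as np
from itertools import product

# Exponential family over x in {0,1}^2 with sufficient statistics t(x) = (x1, x2, x1*x2):
#   p(x; theta) = exp(theta . t(x) - psi(theta))
X = np.array(list(product([0, 1], repeat=2)))
T = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]]).astype(float)

def psi(theta):
    """Cumulant generating (log-partition) function."""
    return np.log(np.exp(T @ theta).sum())

def p(theta):
    return np.exp(T @ theta - psi(theta))

def eta(theta):
    """Expectation (m-affine) coordinates: eta_i = E[t_i(x)] = d psi / d theta_i."""
    return T.T @ p(theta)

def fisher(theta):
    """Fisher metric g_ij = Cov[t_i, t_j], i.e., the Hessian of psi."""
    pr, m = p(theta), eta(theta)
    return (T - m).T @ np.diag(pr) @ (T - m)

theta = np.array([0.3, -0.2, 0.5])
eps = 1e-5
grad_fd = np.array([(psi(theta + eps * e) - psi(theta - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
hess_fd = np.array([[(psi(theta + eps * (ei + ej)) - psi(theta + eps * (ei - ej))
                      - psi(theta - eps * (ei - ej)) + psi(theta - eps * (ei + ej)))
                     / (4 * eps ** 2)
                     for ej in np.eye(3)] for ei in np.eye(3)])
print(np.allclose(eta(theta), grad_fd, atol=1e-6))     # eta = d psi / d theta
print(np.allclose(fisher(theta), hess_fd, atol=1e-4))  # g = Hessian of psi

# Dual potential phi(eta) = theta . eta - psi(theta) at the matching theta (the negative
# entropy), and the canonical divergence D[P:Q] = psi(theta_P) + phi(eta_Q) - theta_P . eta_Q.
# For an exponential family this equals the Kullback-Leibler divergence
#   KL(Q || P) = sum_x q(x) log(q(x) / p(x)).
theta_p, theta_q = theta, np.array([-0.4, 0.1, 0.0])
phi_q = theta_q @ eta(theta_q) - psi(theta_q)
D = psi(theta_p) + phi_q - theta_p @ eta(theta_q)
kl_qp = float(np.sum(p(theta_q) * np.log(p(theta_q) / p(theta_p))))
print(np.isclose(D, kl_qp))
```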

AMARI: INFORMATION GEOMETRY ON HIERARCHY OF PROBABILITY DISTRIBUTIONS 1703 is the negative entropy (15) (16) holds with. This was first remarked by Barndorff- Nielsen [15] in the case of exponential families. Given a distribution in a flat manifold, in the coordinates in the coordinates have different function forms, so that they should be denoted differently such as, respectively. However, we abuse notation such that the parameter or decides the function form of or automatically. Dually to the above, a mixture family is -flat, although it is not an exponential family in general. The -affine coordinates are derived from The -function of (16) in a mixture family is (17) (18) When is a dually flat manifold, the -affine coordinates -affine coordinates, connected by the Legendre transformations (13) (14), satisfy the following dual relation. Theorem 2: The tangent vectors (represented by rom variables) of the coordinate curves (19) Fig. 1. Generalized Pythagoras theorem. The divergence satisfies with equality when, only when,. In the cases of an exponential family a mixture family, this is equal to the Kullback Leibler divergence (23) is the expectation with respect to. For a dually flat manifold, the following Pythagoras theorem plays a key role (Fig. 1). Theorem 3: Let be three distributions in. When the -geodesic connecting is orthogonal at to the -geodesic connecting (24) The same theorem can be reformulated in a dual way. Theorem 4: For, when the -geodesic connecting is orthogonal at to the -geodesic connecting the tangent vectors of the coordinate curves (20) with (25) (26) are orthonormal at all the points (21) III. FLAT HIERARCHICAL STRUCTURES We have summarized the geometrical features of dually flat families of probability distributions. We extend them to the geometry of flat hierarchical structures. is the Kronecker delta. C. Divergence Generalized Pythagoras Theorem Let be two distributions in a dually flat manifold, let be the corresponding -affine coordinates. They have two convex potential functions. In the case of exponential families, is the cumulant generating function is the negative entropy. For a mixture family, is also the negative entropy. By using the two functions, we can define a divergence from to by (22) A. -Flat Structures Let be a submanifold of a dually flat manifold.itis called an -flat submanifold, when is written as a linear subspace in the -affine coordinates of. It is called an -flat submanifold, when it is linear in the -affine coordinates of. An -flat submanifold is by itself an -flat manifold, hence is an -flat manifold because of Theorem 1. However, it is not usually an -flat submanifold of, because it is not linear in. (Mathematically speaking, an -flat submanifold has vanishing embedding curvature in the sense of the -affine connection, but its -embedding curvature is nonvanishing, although both - -Riemann Christoffel curvatures vanish.) Dually,

1704 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 5, JULY 2001 an -flat submanifold is an -flat manifold but is not an -flat submanifold. Let us consider a nested series of -flat submanifolds (27) every is an -flat submanifold of. Each is automatically dually flat, but is not an -flat submanifold. We call such a nested series an -flat hierarchical structure or, shortly, the -structure. A typical example of the -structure is the following exponential-type distributions: (28) Fig. 2. Information projection (m-projection). is a partition of the entire parameter, being the th subvector, is a rom vector variable corresponding to it. The expectation parameter of (28) is partitioned correspondingly as (29) is expectation with respect to. Thus,. Here, is a function of the entire, not of only. B. Orthogonal Foliations Let us consider a new coordinate system called the mixed ones -cut (30) It consists of a pair of complementary parts of, namely, (31) (32) It is defined for any. We define a subset for of consisting of all the distributions having the same coordinates, specified by, but the other coordinates are free. This is written as They give a foliation of the entire manifold (33) (34) The hierarchical -structure is introduced in by putting They define -flat foliations. Because of (21), the two foliations are orthogonal in the sense that submanifolds are complementary orthogonal at any point. C. Projections Pythagoras Theorem Given a distribution with partition, it belongs to when.in general, is considered to represent the effect emerging from but not from. We call it the th effect of. In later examples, it represents the th-order effect of mutual interactions of rom variables. The submanifold includes only the probability distributions that do not have effects higher than. Consider the problem of evaluating the effects higher than. To this end, we define the information projection of to by (37) This is the point in that is closest to in the sense of divergence. Theorem 5: Let be the point in such that the -geodesic connecting is orthogonal to. Then, is unique is given by. Proof: For any point, the -geodesic connecting is included in, hence is orthogonal to the -geodesic connecting (Fig. 2). Hence, the Pythagoras theorem holds. This proves is unique. We call the -projection of to, write (38) (39) (35) Let be a fixed distribution. We then have Dually to the above, let (36) be the subset in in which the -part of the distributions have a common fixed value. This is an -flat submanifold. (40) Hence, is regarded as the amount representing effects of higher than, as is that not higher than.
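For the simplest concrete case, the m-projection of a joint distribution of two binary variables onto the e-flat submanifold of independent (product) distributions is the product of its marginals, and the generalized Pythagoras theorem then splits the divergence to any other product distribution. The following sketch (our own construction, not taken from the paper) checks both facts numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float((p * np.log(p / q)).sum())

# Joint distribution p(x1, x2) of two binary variables, as a 2x2 table.
p = rng.random((2, 2)); p /= p.sum()

# m-projection of p onto the e-flat submanifold E of independent (product)
# distributions: the product of p's marginals (it keeps the first-order
# eta-coordinates of p and has no interaction term).
q = np.outer(p.sum(axis=1), p.sum(axis=0))

# Generalized Pythagoras relation (Theorem 3): for any other r in E,
#   D(p || r) = D(p || q) + D(q || r),
# so q is the unique minimizer of D(p || .) over E (Theorem 5).
for _ in range(5):
    a, b = rng.random(2)
    r = np.outer([1 - a, a], [1 - b, b])
    print(round(kl(p, r), 6), "=", round(kl(p, q) + kl(q, r), 6))
```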

AMARI: INFORMATION GEOMETRY ON HIERARCHY OF PROBABILITY DISTRIBUTIONS 1705 The coordinates of are not simply Theorem 7: The projection of to is the maximizer of entropy among having the same -marginals as since the manifold is not Euclidean but is Riemannian. In order to obtain, the -cut mixed coordinates are convenient. Theorem 6: Let the - -coordinates of be. Let the - -coordinates of the projection be, respectively. Then, the -cut mixed coordinates of is (41) that is,. Proof: The point given by (41) is included in. Since the -coordinates of differ only in, the -geodesic connecting is included in. Since is orthogonal to, this is the -projection of to. In order to obtain the full -or -coordinates of,, we need to solve the set of equations (42) (43) (44) (45) Note that,,. Hence, changes by the -projection. This implies that does not give any orthogonal decomposition of the effects of various orders that does not give the pure order effect except for the case of. D. Maximal Principle The projection is closely related with the maximal entropy principle [27]. The projection belongs to which consists of all the probability distributions having the same -marginal distributions as, that is, the same coordinates. For any, is its projection to. Hence, because of the Pythagoras theorem the minimizer of for is. We have (46) (47) This relation is useful for calculating. (48) E. Orthogonal Decomposition The next problem is to single out the amount of the th-order effects in, by separating it from the others. This is not given by. The amount of the effect of order is given by the divergence (49) which is certified by the following orthogonal decomposition. Theorem 8: (50) Theorem 8 shows that the th-order effects are orthogonally decomposed in (50). The theorem might be thought as a trivial result from the Pythagoras theorem. This is so when is a Euclidean space. However, this is not trivial in the case of a dually flat manifold. To show this, let us consider the theorem of three squares in a Euclidean space: Let be four corners in a rectangular parallelpiped. Then (51) In a dually flat manifold, the Pythagoras theorem holds when one edge is an -geodesic the other is an -geodesic. Hence, (51) cannot trivially be extended to this case, because cannot be an -geodesic an -geodesic at the same time. The theorem is proved by the properties of the orthogonal foliation. We show the simplest case. Let us consider let be the -projections of. We write their coordinates as (52) (53) (54) Then, we see that the -projection of to is, although the -geodesic connecting is not usually included in. This proves because depends only on the marginal distributions of is constant for all. Hence, we have the geometric form of the maximum principle [27]. The general case is similar. (55)
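Theorems 7 and 8 can be illustrated numerically for three binary variables: the m-projections p_k are the maximum-entropy distributions sharing p's marginals up to order k, and the divergences between successive projections add up orthogonally. The sketch below solves the projection equations (42)-(45) by iterative proportional fitting, a standard algorithm we assume here for convenience; it is not the paper's own procedure, and all names are ours.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)

def kl(p, q):
    return float((p * np.log(p / q)).sum())

def m_project(p, order, iters=500):
    """m-projection of a positive joint table p (shape (2,)*n) onto E_order:
    the maximum-entropy distribution with p's marginals up to the given order
    (Theorem 7), computed here by iterative proportional fitting."""
    n = p.ndim
    q = np.full_like(p, 1.0 / p.size)
    if order == 0:
        return q                                  # E_0: the uniform distribution
    for _ in range(iters):
        for group in combinations(range(n), order):
            rest = tuple(i for i in range(n) if i not in group)
            ratio = p.sum(axis=rest) / q.sum(axis=rest)   # match the marginal on `group`
            shape = [2 if i in group else 1 for i in range(n)]
            q = q * ratio.reshape(shape)
    return q

# Random positive joint distribution of three binary variables.
p = rng.random((2, 2, 2)); p /= p.sum()
p0, p1, p2 = (m_project(p, k) for k in (0, 1, 2))

# Orthogonal decomposition (Theorem 8):
#   D(p || p0) = D(p || p2) + D(p2 || p1) + D(p1 || p0)
lhs = kl(p, p0)
parts = [kl(p, p2), kl(p2, p1), kl(p1, p0)]
print(round(lhs, 6), [round(d, 6) for d in parts], round(sum(parts), 6))
```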

1706 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 5, JULY 2001 F. -Flat Structure Dually to the -structure, we can study the -hierarchical structure (56) is an -flat submanifold of. A typical example is the mixture family The -projection of to is defined by We can also show the orthogonal decomposition theorem. The quantity (57) (58) (59) (60) is regarded as the th-order effect of in the mixture family or in a more general -structure. A hierarchical MA model in time series analysis is another example of the -structure [4], the minimal principle holds instead of the maximal principle [4]. IV. SIMPLE EXAMPLE: TRIPLEWISE INTERACTIONS The general results are explained by a simple example of the set of joint probability distributions,, of binary rom variables, or. We can exp (61) obtaining the log-linear model [2]. This shows that is an exponential family. The canonical or -affine coordinates are which are partitioned as (62) (63) (64) This defines a hierarchical -structure in, represents pairwise interactions represents the triple interaction, although they are not orthogonal. The corresponding -affine coordinates are partitioned as (65),, with (66) We have a hierarchical -structure (67) (68) (69) is a singleton with, is defined by, is defined by.on the other h, gives the marginal distributions of, gives the all the pairwise marginal distributions. Consider the two mixed cut coordinates:. Since are orthogonal to the coordinates that specify the marginal distributions, represents the effect of mutual interactions of,,, independently of their marginals. Similarly, defined by is composed of all the distributions which have no intrinsic triplewise interactions but pairwise correlations. The two distributions given by have the same pairwise marginal distributions but differ only in the pure triplewise interactions. Since is orthogonal to, it represents purely triplewise interactions, as is well known in the log-linear model [2], [18]. The partitioned coordinates are not orthogonal, so that we cannot say that summarizes all the pure pairwise correlations, except for the special case of. Given, we need to separate pairwise correlations triplewise interactions invariantly obtain the orthogonal quantitative decomposition of these effects. We project to, giving, respectively. Then, is the independent distribution having the same marginals as, is the distribution having the same pairwise marginals as but no triplewise interaction. By putting we have the decomposition (70) (71) (72) (73) Here, represents the amount of purely triplewise interaction, the amount of purely pairwise interaction, the degree of deviations of the marginals of from the uniform distribution. We have thus the orthogonal quantitative decomposition of interactions. It is interesting to show a new type of decomposition of entropy or information from this result. We have (74) (75)
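The theta- and eta-coordinates of the log-linear expansion (61)-(66) for three binary variables can be read off directly from a joint probability table: the eta's are the moments E[x_i], E[x_i x_j], E[x_1 x_2 x_3], and the theta's are finite differences of log p. The following sketch (variable names are our own) computes both; the mixed coordinates discussed above combine the low-order eta's with theta_123, which alone carries the purely triplewise interaction.

```python
import numpy as np

rng = np.random.default_rng(4)
p = rng.random((2, 2, 2)); p /= p.sum()   # positive joint distribution of (x1, x2, x3)

# eta-coordinates (expectation parameters): moments of the variables.
eta = {
    "eta1": p[1, :, :].sum(), "eta2": p[:, 1, :].sum(), "eta3": p[:, :, 1].sum(),
    "eta12": p[1, 1, :].sum(), "eta13": p[1, :, 1].sum(), "eta23": p[:, 1, 1].sum(),
    "eta123": p[1, 1, 1],
}

# theta-coordinates of the log-linear expansion: finite differences of log p.
lp = np.log(p)
theta = {
    "theta1": lp[1, 0, 0] - lp[0, 0, 0],
    "theta2": lp[0, 1, 0] - lp[0, 0, 0],
    "theta3": lp[0, 0, 1] - lp[0, 0, 0],
    "theta12": lp[1, 1, 0] - lp[1, 0, 0] - lp[0, 1, 0] + lp[0, 0, 0],
    "theta13": lp[1, 0, 1] - lp[1, 0, 0] - lp[0, 0, 1] + lp[0, 0, 0],
    "theta23": lp[0, 1, 1] - lp[0, 1, 0] - lp[0, 0, 1] + lp[0, 0, 0],
    "theta123": (lp[1, 1, 1] - lp[1, 1, 0] - lp[1, 0, 1] - lp[0, 1, 1]
                 + lp[1, 0, 0] + lp[0, 1, 0] + lp[0, 0, 1] - lp[0, 0, 0]),
}
print(eta)
print(theta)
# The mixed coordinates (eta1, eta2, eta3, eta12, eta13, eta23; theta123) also determine p,
# and theta123 alone represents the pure triplewise interaction.
```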

AMARI: INFORMATION GEOMETRY ON HIERARCHY OF PROBABILITY DISTRIBUTIONS 1707 are the total marginal entropies of. Hence, The manifold is an exponential family is also a mixture family at the same time. Let us exp to obtain the log-linear model [2], [17] (76) One can define mutual information among,, by (77) Then, (73) gives a new invariant positive decomposition of (78) In information theory, the mutual information among three variables is sometimes defined by (79) Unfortunately, this quantity is not necessarily nonnegative ([20], see also [24]). Hence, our decomposition is new is completely different from the conventional one [20] (80) It is easy to obtain from given. The coordinates of are obtained by solving (42) (45). In the present case, the mixed cut of is. Hence, the -coordinates of are, the marginals are the same as is determined such that becomes. Since we have (81) by putting, the of is given by solving (82) shown at the bottom of the page. V. HIGHER ORDER INTERACTIONS OF RANDOM VARIABLES A. Coordinate Systems of Let be binary variables let be its probability,,. We assume that for all. The set of all such probability distributions is a -dimensional manifold, probabilities (83) constitute a coordinate system in which one of them is determined from (84) (85) indexes of, etc., satisfy, etc. Then (86) has components. They define the -flat coordinate system of. In order to simplify index notation, we introduce the following two hyper-index notations. Indexes, etc., run over -tuples of binary numbers (87) except for. Hence, the cardinality of the index set is. Indexes, etc., run over the set consisting of a single index, a pair of indexes,a triple of indexes, -tuple, that is, sts for any element in the set. In terms of these indexes, the two coordinate systems given by (83) (86) are written as (88) (89) respectively. We now study the coordinate transformations among them. Let us consider functions of defined by (90) otherwise. Here, implies that, for,. Each is a polynomial of degree, written as we put The polynomials can be exped as or, shortly, as. (91) (92) (93) (94) (82)
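Two of the statements above are easy to check numerically: the total correlation, the sum of the marginal entropies minus the joint entropy, equals D(p || p1), and the inclusion-exclusion quantity in (79) can be negative, for instance when x3 = x1 XOR x2 with x1, x2 fair and independent. A short sketch follows (our own example, entropies in nats).

```python
import numpy as np

rng = np.random.default_rng(5)

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def kl(p, q):
    return float((p * np.log(p / q)).sum())

# A random positive joint distribution of three binary variables.
p = rng.random((2, 2, 2)); p /= p.sum()
marg = [p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))]
p1 = np.einsum('i,j,k->ijk', *marg)          # product of marginals (projection onto E_1)

# Sum of marginal entropies minus joint entropy equals D(p || p1) >= 0.
total_corr = sum(H(m) for m in marg) - H(p)
print(np.isclose(total_corr, kl(p, p1)))      # True

# The inclusion-exclusion "triple information" of (79) can be negative.
# Example: x1, x2 fair and independent, x3 = x1 XOR x2.
q = np.zeros((2, 2, 2))
for i in range(2):
    for j in range(2):
        q[i, j, i ^ j] = 0.25
H1 = H(q.sum(axis=(1, 2))); H2 = H(q.sum(axis=(0, 2))); H3 = H(q.sum(axis=(0, 1)))
H12 = H(q.sum(axis=2)); H13 = H(q.sum(axis=1)); H23 = H(q.sum(axis=0))
triple = H1 + H2 + H3 - H12 - H13 - H23 + H(q)
print(triple)   # = -log 2 < 0, so (79) is not a nonnegative measure of triple interaction
```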

1708 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 5, JULY 2001 we set which proves (103). We have (95) (109) when is. The matrix is a nonsingular matrix. We have (96) B. -Hierarchical Structure of Let us naturally decompose the coordinates into are the elements of the inverse of. It is not difficult to write down these elements explicitly. The probability distributions of is rewritten as (97) when otherwise. This shows that is a mixture family. The expansion (85) is rewritten as (98) This shows that is an exponential family. The corresponding -affine coordinates are given by the expectation parameters For (99) (110) summarizes with, that is, those consisting of indexes. Hence has components. Let be the subspace defined by. Then, s form an -hierarchical structure. We consider the corresponding partition of Then, the coordinates consist of all (111) (112) for. All the marginal distributions of rom variables are completely determined by.for any, we can define the -cut coordinates (113) The relations among the coordinate systems are given by the following theorem. Theorem 9: (100) (101) (102) (103) (104) C. Orthogonal Decomposition of Interactions Entropy Given, is the point closest to among those that do not have intrinsic interactions more than variables. The amount of interactions higher than is defined by. Moreover, may be interpreted as the degree of purely order interactions among variables. We then have the following decomposition. From the orthogonal decomposition, we have a new invariant decomposition of entropy information, which is different from those studied by many researchers as the decomposition of entropy of a number of rom variables. Theorem 10: (114) (105) (115) Proof: Given a probability distribution,wehave (106) (116) On the other h, from (85) we have (107) (108) D. Extension to Finite Alphabet We have so far treated binary rom variables. The hierarchical structure of interactions the decomposition of divergence or entropy holds in the general case. We consider the case are rom variables taking on a common

AMARI: INFORMATION GEOMETRY ON HIERARCHY OF PROBABILITY DISTRIBUTIONS 1709 finite alphabet set. The argument is extended easily to the case take on different alphabet sets. Let us define the indicator functions for by The corresponding -coordinates are given by (122) (117) (123) denotes any element in. Let the joint probability of be. The joint probability function is now written as We now exp as (118) (119) sts for nonzero elements of. The coefficients s form the -coordinate system. Here, the term representing interactions of variables have a number of components as. The corresponding coordinates consist of (120) (124) When holds identically, there are no intrinsic interactions among the three, so that all interactions are pairwise in this case. When,, there are no interactions so that the three rom variables are independent. Given, the independent distribution is easily obtained by (125) However, it is difficult to obtain the analytical expression of. In general, which consists of distributions having pairwise correlations but no intrinsic triplewise interactions, is written in the form (126) which represent the marginal distributions of variables. We can then define the -cut, in which are orthogonal. Given, we can define The is the one that satisfies (127) (128) which has the same marginals as up to variables, but has no intrinsic interactions more than rom variables. The orthogonal decompositions of mutual information in terms of s hold as before. E. Continuous Distributions It was difficult to give a rigorous mathematical foundation to the function space of all the density functions with respect to the Lebesgue measure on the real. Recently, Pistone et al. [35], [36] have succeeded in constructing information geometry in the infinite-dimensional case, although we do not enter in its mathematical details. Roughly speaking, the -coordinates of is the density function itself,, the -coordinates are the function (121) They are dually coupled. We define higher order interactions among variables. In order to avoid notational complications, we show only the case of. The three marginals three joint marginals of two variables are given by, respectively. They are obtained by integrating. VI. HIGHER ORDER MARKOV PROCESSES As another example of the -structure, we briefly touch upon the higher order Markov processes of the binary alphabet. The th-order Markov process is specified by the conditional probability of the next output (129) is the past sequence of letters. We define functions (130) (131) Then, the Markov chain is specified by parameters, taking on any binary sequences of length. For an observed long data sequence, let be the relative number of subsequence appearing in. The Markov chain is an exponential family when is sufficiently large, s are sufficient statistics. The probability of is written in terms of the relative frequencies of various letter sequences (132)
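The decomposition of a Markov chain's dependency into orders can also be estimated from data. Per symbol, the divergence between the projections to the k-th and (k-1)-th order Markov families is the difference of conditional entropies h_{k-1} - h_k, and the divergence from the fair Bernoulli process is log 2 minus the entropy rate. The sketch below is a rough empirical illustration using plug-in estimates from a simulated second-order binary chain; the transition probabilities, sample size, and all names are our own assumptions and not from the paper.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

# Simulate a second-order binary Markov chain:
# P(x_t = 1 | x_{t-2}, x_{t-1}) indexed by the two previous symbols.
p1_given = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.2}
x = [0, 1]
for _ in range(200_000):
    x.append(int(rng.random() < p1_given[(x[-2], x[-1])]))
x = np.array(x)

def block_probs(seq, k):
    """Empirical probabilities of length-k blocks (plug-in estimates)."""
    c = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    total = sum(c.values())
    return {b: n / total for b, n in c.items()}

def cond_entropy(seq, k):
    """h_k = H(X_0 | X_{-1}, ..., X_{-k}) from empirical block frequencies (in nats)."""
    if k == 0:
        p = block_probs(seq, 1)
        return -sum(q * np.log(q) for q in p.values())
    joint, past = block_probs(seq, k + 1), block_probs(seq, k)
    return -sum(q * np.log(q / past[b[:-1]]) for b, q in joint.items())

h = [cond_entropy(x, k) for k in range(4)]   # h_0, h_1, h_2, h_3
# Per-symbol decomposition of the divergence from the fair Bernoulli process:
#   log 2 - h_3  =  (log 2 - h_0) + (h_0 - h_1) + (h_1 - h_2) + (h_2 - h_3)
terms = [np.log(2) - h[0]] + [h[k - 1] - h[k] for k in range(1, 4)]
print([round(t, 4) for t in terms])   # order-0, 1, 2, 3 contributions
# For a second-order chain the order-3 term should be close to zero.
```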

1710 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 5, JULY 2001 Hence, the th-order Markov chain forms an -flat manifold of dimensions. The set of Markov chains of various orders naturally has the hierarchical structure (133) is the Bernoulli process (independent identically distributed process) is the first-order Markov chain. In order to decompose the degree of dependency of a Markov chain into various orders, we use the following frequencies observed from the sequence. In order to avoid complicated notations, we use as an exemple the second-order Markov chain, but generalization is easy. We define rom variables related to a sequence by relative frequency of (134) relative frequency of (135) relative frequency of (136) We summarize their expectations as is arbitrary relative frequency of (137) (138) They form a mixture affine coordinate system to specify the second-order Markov chain, are functions of,,,. Higher order generalization is easy. In order to obtain the third-order Markov chain, for example, we obtain eight s by adding in the suffix of each, for example, emerge from. We then have eight quantities whose expectations form the -coordinates. The coordinate is responsible only for the Bernoulli structure. Therefore, if the process is Bernoulli, determines its probability distribution. The coordinates together are responsible for the first-order Markov structure all of are necessary to determine the second-order Markov structure. Therefore, we have the following decomposition of the -coordinates: (139),,. The parameters define the th-order Markov structure. We have the corresponding -coordinates, because s are linear combinations of s. They are given in the partitioned form in this case as (140) (141) (142) (143) (144) (145) (146) (147) We can see here that together show how the Markov structure is apart from the first-order one, because those coordinates are orthogonal to. The projections to lower order Markov chains are defined in the same way as before, we have the following decomposition: (148) is the projection to the th-order Markov chain is a fair Bernoulli process (that is, ). VII. CONCLUSION The present paper used information geometry to elucidate the hierarchical structures of rom variables their quasi-orthogonal decomposition. A typical example is the decomposition of interactions among binary rom variables into a quasi-orthogonal sum of interactions among exactly rom variables. The Kullback Leibler divergence, entropy, information are decomposed into a sum of nonnegative quantities representing the order interactions. This is a new result in information theory. This problem is important for analyzing joint firing patterns of an ensemble of neurons. The present paper gives a method of extracting various degrees of interactions among these neurons. The present theories of hierarchical structures can be applicable to more general cases including mixture families. Applications to higher order Markov chains are briefly described. REFERENCES [1] A. M. H. J. Aertsen, G. L. Gerstein, M. K. Habib, G. Palm, Dynamics of neuronal firing correlation: Modulation of effective connectivity, J. Neurophysiol., vol. 61, pp. 900 918, 1989. [2] A. Agresti, Categorical Data Analysis, New York: Wiley, 1990. [3] S. Amari, Differential-Geometrical Methods of Statistics (Lecture Notes in Statistics). Berlin, Germany: Springer, 1985, vol. 25. [4], Differential geometry of a parametric family of invertible linear systems Riemannian metric, dual affine connections divergence, Math. Syst. 
Theory, vol. 20, pp. 53–82, 1987.
[5] S. Amari, Fisher information under restriction of Shannon information, Ann. Inst. Statist. Math., vol. 41, pp. 623–648, 1989.
[6] S. Amari, Dualistic geometry of the manifold of higher-order neurons, Neural Networks, vol. 4, pp. 443–451, 1991.
[7] S. Amari, Information geometry of the EM and em algorithms for neural networks, Neural Networks, vol. 8, no. 9, pp. 1379–1408, 1995.
[8] S. Amari, Information geometry, Contemp. Math., vol. 203, pp. 81–95, 1997.
[9] S. Amari, Natural gradient works efficiently in learning, Neural Comput., vol. 10, pp. 251–276, 1998.
[10] S. Amari, S. Ikeda, and H. Shimokawa, Information geometry of α-projection in mean-field approximation, in Recent Developments of Mean Field Approximation, M. Opper and D. Saad, Eds. Cambridge, MA: MIT Press, 2000.
[11] S. Amari and T. S. Han, Statistical inference under multi-terminal rate restrictions: A differential geometrical approach, IEEE Trans. Inform. Theory, vol. 35, pp. 217–227, Jan. 1989.

[12] S. Amari and M. Kawanabe, Information geometry of estimating functions in semiparametric statistical models, Bernoulli, vol. 3, pp. 29–54, 1997.
[13] S. Amari, K. Kurata, and H. Nagaoka, Information geometry of Boltzmann machines, IEEE Trans. Neural Networks, vol. 3, pp. 260–271, Mar. 1992.
[14] S. Amari and H. Nagaoka, Methods of Information Geometry. New York: AMS and Oxford Univ. Press, 2000.
[15] O. E. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory. New York: Wiley, 1978.
[16] C. Bhattacharyya and S. S. Keerthi, Mean-field methods for stochastic connectionist networks, Phys. Rev., to be published.
[17] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press, 1975.
[18] L. L. Campbell, The relation between information theory and the differential geometric approach to statistics, Inform. Sci., vol. 35, pp. 199–210, 1985.
[19] N. N. Chentsov, Statistical Decision Rules and Optimal Inference (in Russian). Moscow, U.S.S.R.: Nauka, 1972. English translation: Providence, RI: AMS, 1982.
[20] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[21] I. Csiszár, On topological properties of f-divergence, Studia Sci. Math. Hungar., vol. 2, pp. 329–339, 1967.
[22] I. Csiszár, I-divergence geometry of probability distributions and minimization problems, Ann. Probab., vol. 3, pp. 146–158, 1975.
[23] I. Csiszár, T. M. Cover, and B. S. Choi, Conditional limit theorems under Markov conditioning, IEEE Trans. Inform. Theory, vol. IT-33, pp. 788–801, Nov. 1987.
[24] T. S. Han, Nonnegative entropy measures of multivariate symmetric correlations, Inform. Contr., vol. 36, pp. 133–156, 1978.
[25] T. S. Han and S. Amari, Statistical inference under multiterminal data compression, IEEE Trans. Inform. Theory, vol. 44, pp. 2300–2324, Oct. 1998.
[26] H. Ito and S. Tsuji, Model dependence in quantification of spike interdependence by joint peri-stimulus time histogram, Neural Comput., vol. 12, pp. 195–217, 2000.
[27] E. T. Jaynes, On the rationale of maximum entropy methods, Proc. IEEE, vol. 70, pp. 939–952, 1982.
[28] R. E. Kass and P. W. Vos, Geometrical Foundations of Asymptotic Inference. New York: Wiley, 1997.
[29] S. L. Lauritzen, Graphical Models. Oxford, U.K.: Oxford Science, 1996.
[30] M. K. Murray and J. W. Rice, Differential Geometry and Statistics. London, U.K.: Chapman & Hall, 1993.
[31] H. Nagaoka and S. Amari, Differential geometry of smooth families of probability distributions, Univ. Tokyo, Tokyo, Japan, Tech. Rep. METR 82-7, 1982.
[32] A. Ohara, N. Suda, and S. Amari, Dualistic differential geometry of positive definite matrices and its applications to related problems, Linear Alg. Appl., vol. 247, pp. 31–53, 1996.
[33] A. Ohara, Information geometric analysis of an interior point method for semidefinite programming, in Geometry in Present Day Science, O. E. Barndorff-Nielsen and E. B. V. Jensen, Eds. Singapore: World Scientific, 1999, pp. 49–74.
[34] G. Palm, A. M. H. J. Aertsen, and G. L. Gerstein, On the significance of correlations among neuronal spike trains, Biol. Cybern., vol. 59, pp. 1–11, 1988.
[35] G. Pistone and C. Sempi, An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one, Ann. Statist., vol. 23, pp. 1543–1561, 1995.
[36] G. Pistone and M.-P. Rogantin, The exponential statistical manifold: Mean parameters, orthogonality, and space transformation, Bernoulli, vol. 5, pp. 721–760, 1999.
[37] C. R. Rao, Information and the accuracy attainable in the estimation of statistical parameters, Bull. Calcutta Math. Soc., vol. 37, pp. 81–91, 1945.
[38] T. Tanaka, Information geometry of mean field approximation, Neural Comput., vol. 12, pp. 1951–1968, 2000.