Multivariate Dependence and the Sarmanov-Lancaster Expansion

Ilan N. Goodman and Don H. Johnson
ECE Department, Rice University, Houston, TX
July 2005

Abstract

We extend the work of Sarmanov and Lancaster to obtain an expansion for multivariate distributions. The expansion reveals a flexible, detailed dependence structure that goes beyond pairwise linear correlation to include non-linear and higher-order statistical dependencies. Through examples we show how to use the expansion to analyze existing distributions and to construct new distributions with given properties. We also provide a related dependence measure which we decompose into the separate contributions from each subset of random variables. Using the decomposition we analyze neural population data, revealing significant dependencies not captured by cross-correlation.

1 Introduction

Characterizing statistical dependencies in large groups of random variables presents a considerable challenge; traditional multivariate models tend to be quite narrow in scope, and most commonly used dependence measures are appropriate only for a small class of random variables. For instance, the most popular dependence measure, the correlation coefficient, measures only linear dependence between pairs of random variables. For jointly Gaussian variables, this pairwise linear dependence completely characterizes the dependence structure. However, even simple examples show that, in general, groups of random variables can express more than just pairwise dependence. For example, consider N binary random variables. Fully specifying their joint distribution requires 2^N − 1 parameters; there are N(N−1)/2 pairwise correlations, which, together with the N marginal probabilities, can specify the joint distribution only when N = 2. For larger groups, third- and higher-order dependencies must also be determined.
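To make the counting argument concrete, the following tally for N = 3 (an illustration added here, not part of the original argument) shows the gap explicitly:

\[
2^3 - 1 = 7 \ \text{free parameters}, \qquad \underbrace{3}_{\text{marginal probabilities}} \;+\; \underbrace{3}_{\text{pairwise correlations}} \;=\; 6 \;<\; 7,
\]

so even for three binary variables one additional, genuinely third-order quantity is needed to pin down the joint distribution.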

While for some applications simple models like the multivariate Gaussian may be perfectly adequate, many applications require more flexible models that account for interactions between large numbers of variables. For example, neuroscientists are currently able to record from tens to hundreds of neurons simultaneously, and studies have shown that these ensembles exhibit time-varying dependencies that may contribute to stimulus encoding [1]. Consequently, dependence analysis techniques that can be applied to such high-dimensional random vectors are vital to cracking the neural population code, the way in which neurons encode information jointly. The challenge is to model dependencies in a coherent, meaningful way.

In this paper we extend the work of Sarmanov [2, 3] and Lancaster [4–6] to obtain the Sarmanov-Lancaster (SL) expansion, which generates a highly flexible family of multivariate distributions having attractive properties for many applications. Like the more familiar copula models [7], the SL expansion expresses a joint probability function in terms of its univariate marginal distributions and a multivariate dependence structure. However, the SL expansion benefits both from being more flexible than most copulas, since it is applicable to an extremely broad class of distributions, and from inducing a more intuitive and meaningful dependence structure. From the SL expansion we also derive a powerful non-parametric dependence measure φ², a generalization of Pearson's coefficient of mean-square contingency [8]. This measure has the particularly useful property that it can be decomposed into elements that quantify the dependencies within each subset of the random variables. As a practical example, we analyze neural population data and show how the decomposition of φ² provides a level of detail unavailable using conventional techniques.

2 Background

In their study of noise in nonlinear devices, Barrett and Lampard [9] discussed a double Fourier series expansion for certain bivariate probability distributions using orthogonal polynomials. They restricted their study to distributions whose expansions are diagonal; in other words, if the expansion coefficients are arranged into a matrix, they considered only distributions for which the off-diagonal elements are zero. Though this type of expansion would later be used by Sarmanov and Lancaster to generalize the theory of bivariate dependence, Barrett and Lampard focused on applications concerning processes that are subject to a nonlinear distortion. Several years later, Bahadur [10] described a similar expansion for the joint distribution of binary random variables. Bahadur also used orthogonal polynomials to expand the joint probability function, though his focus was quite different; he was concerned with analyzing multivariate categorical data, for instance to determine relationships between survey questions in a psychology experiment.

Apparently unaware of this previous work, in papers published between 1958 and 1963 O. V. Sarmanov [2, 3] and H. O. Lancaster [4, 5] each described what was essentially the same series expansion of bivariate probability functions. These authors were interested in two related dependence properties. Sarmanov sought to generalize the maximum correlation coefficient of Hirschfeld [11]. Sarmanov defined the maximum correlation coefficient to be the largest eigenvalue of a kernel related to the bivariate probability function, and showed that in the finite discrete case this quantity corresponds to Hirschfeld's coefficient. Furthermore, Sarmanov showed that if the correlation is purely linear, as in the bivariate Gaussian case, his maximum correlation coefficient corresponds to the usual product-moment correlation coefficient.

Around the same time, Lancaster obtained the same expansion while generalizing Hotelling's [12] canonical correlation theory. Lancaster derived a set of canonical variables and correlations for a class of bivariate distributions, including (but not limited to) the bivariate Gaussian distribution. Moreover, he showed that if the expansion is diagonal, then the expansion variables and coefficients are the canonical variables and correlations as defined by Hotelling.

Sarmanov and Lancaster each sought to generalize linear correlation theory to a larger class of non-Gaussian bivariate distributions. While their methods differed slightly, they both treated correlation analysis as a problem of evaluating the spectrum of a kernel corresponding to the ratio of the joint distribution to the product of the marginals. The result is an expansion for a class of bivariate distributions that is expressed in terms of the product of the marginals and a set of correction terms that completely specify the dependence structure. Lancaster's derivation is somewhat more general; he considers the expansion of the Radon-Nikodym derivative of the joint distribution measure with respect to the product measure. As a result, the derivation is not limited to distributions having a density with respect to the Lebesgue measure. In the next section, we extend Lancaster's method to generalize the expansion to the case of more than two variables.

3 The Sarmanov-Lancaster (SL) Expansion

To obtain the SL expansion, we begin by constructing a Hilbert basis for a space of univariate probability functions. Consider a random variable X on the probability space (Ω, B, P), and define the space L²(P) to be the set of measurable functions g(X) having finite variance,

\[
L^2(P) = \left\{\, g : \mathbb{R} \to \mathbb{R},\ g\ \mathcal{B}(\mathbb{R})/\mathcal{B}(\mathbb{R})\text{-measurable, s.t. } E\!\left[g^2(X)\right] < \infty \,\right\},
\tag{1}
\]

where E[g²(X)] = ∫_Ω g²(X) dP. Letting F = P X⁻¹ be the distribution of X on (ℝ, B(ℝ)), we see that L²(P) defines a separable Hilbert space on ℝ [13] with the inner product

\[
\langle g, h \rangle = \int_{\mathbb{R}} g(x)\, h(x)\, F(dx) = E\!\left[g(X)\, h(X)\right].
\]

Hence, L²(P) contains a complete orthonormal sequence {ψ_i = ψ_i(X)}_{i∈ℕ} ⊂ L²(P), and every function g ∈ L²(P) can be expanded as

\[
g(X) = \sum_{i=0}^{\infty} a_i\, \psi_i(X), \qquad a_i = E\!\left[g(X)\, \psi_i(X)\right].
\]

Extending the construction to a collection of random variables, we let X = (X_1, …, X_N), where each X_n : (Ω_n, B_n) → (ℝ, B(ℝ)) is equipped with probability measure P_n, and for each n we define the space L²(P_n) in the same way as equation (1). As we have already shown, for each n there exists a complete orthonormal sequence {ψ_{i_n}^{(n)}}_{i_n∈ℕ} in L²(P_n), which is a basis for that space. Define (Ω, B) = (Ω_1 × ⋯ × Ω_N, B_1 ⊗ ⋯ ⊗ B_N) to be the product space, and let P = P_1 × ⋯ × P_N be the product measure. Finally, define the likelihood ratio Λ = dQ/dP, where Q is an arbitrary measure on the product space, absolutely continuous with respect to P (Λ is the Radon-Nikodym derivative of Q with respect to P).

Then, if Λ ∈ L²(P), we can expand Q as

\[
dQ = dP \left[ \sum_{i_1, \dots, i_N} a_{i_1 \cdots i_N} \prod_{n=1}^{N} \psi^{(n)}_{i_n} \right],
\tag{2}
\]

where

\[
a_{i_1 \cdots i_N} = E\!\left[ \Lambda \prod_{n=1}^{N} \psi^{(n)}_{i_n} \right] = \int_{\Omega} \prod_{n=1}^{N} \psi^{(n)}_{i_n}\, dQ.
\]

The expansion is a straightforward application of Hilbert space theory; the space L²(P) = L²(P_1) ⊗ ⋯ ⊗ L²(P_N), so the tensor product of the marginal bases {ψ^{(1)}_{i_1} ⋯ ψ^{(N)}_{i_N}} is a basis for L²(P) [14].

Equation (2) provides a multivariate model that is uniquely determined by the marginal distributions and a set of dependence parameters which are expectations of products of functions defined on the marginals. In general, the choice of marginal bases is arbitrary, which makes the dependence structure induced by this model very flexible. However, this also means that the dependence structure may tell us very little about the actual statistical interactions between the variables. The Sarmanov-Lancaster (SL) expansion solves this problem by imposing a structure on the expansion coefficients. Specifically, we choose the marginal bases such that ψ_0^{(n)} = 1 for all n. Then, the basis element ∏_{n=1}^{N} ψ^{(n)}_{i_n} is a function of a subset of the N variables. Using the terminology of Bahadur [10], we say that the coefficient a_{i_1⋯i_N} is of order k if the corresponding basis element is a function of exactly k variables.

Define S_k^{(N)}, k = 2, …, N, to be the class of all unique combinations of k integers between 1 and N, so that the elements of S_2^{(N)} correspond to all distinct pairs, S_3^{(N)} to triplets, and so on. For example, if N = 3, we would have S_2^{(3)} = {{1,2}, {1,3}, {2,3}} and S_3^{(3)} = {{1,2,3}}; in general, the number of elements of S_k^{(N)} is the binomial coefficient N-choose-k. Each element of S_k^{(N)} corresponds to a distinct subset of variables, so for example the set {1,2} denotes the random variables X_1, X_2. Noting that each subset of variables is, in general, described by more than one coefficient, we define I_j^{(N)}(k) to be the class of indices denoting coefficients for the j-th subset of order k. In other words,

\[
I_j^{(N)}(k) = \left\{\, i_1 \cdots i_N : i_n = 0 \ \forall\, n \in S_j^{c} \ \text{and}\ i_n > 0 \ \forall\, n \in S_j \,\right\}, \qquad S_j \in S_k^{(N)}.
\]

Continuing the above example, we have I_1^{(3)}(2) = {110, 120, 130, …, 210, 220, 230, …}. The classes I_j^{(N)}(k) form a disjoint partition of the set of indices {i_1⋯i_N}. We can now write equation (2) as

\[
dQ = dP \left[ 1 + \sum_{k=2}^{N} \sum_{j=1}^{\binom{N}{k}} \sum_{i \in I_j^{(N)}(k)} a_i\, \psi_i \right],
\tag{3}
\]

where for clarity we have streamlined the notation so that a_i = a_{i_1⋯i_N} and ψ_i = ∏_{n=1}^{N} ψ^{(n)}_{i_n}. Now, for a given i ∈ I_j^{(N)}(k), ψ_i is a function only of the j-th subset of k variables; hence, the coefficients a_i represent the interactions within that particular subset of variables. In other words, the subscript k denotes the order of the interaction, j indexes a particular subset of k variables, and i is an index into the coefficients corresponding to that set.
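The bookkeeping behind S_k^{(N)} and I_j^{(N)}(k) is easy to mechanize. The short Python sketch below is an illustration added here (the function names are ours, not from the original text); it enumerates the variable subsets of each order and, for a finite per-variable basis of size M + 1, the multi-indices belonging to each subset:

from itertools import combinations, product

def subsets(N):
    """S_k^(N): all k-element subsets of {1,...,N}, for k = 2,...,N."""
    return {k: list(combinations(range(1, N + 1), k)) for k in range(2, N + 1)}

def index_class(S, N, M):
    """I_j^(N)(k), truncated to a basis of size M+1 per variable:
    multi-indices (i_1,...,i_N) with i_n = 0 off the subset S and i_n > 0 on it."""
    ranges = [range(1, M + 1) if n in S else (0,) for n in range(1, N + 1)]
    return list(product(*ranges))

N, M = 3, 2
for k, subs in subsets(N).items():
    for S in subs:
        print(k, S, index_class(S, N, M))
# for N = 3, M = 2 the pair {1,2} yields (1,1,0), (1,2,0), (2,1,0), (2,2,0), and so on

Each printed tuple corresponds to one coefficient a_i in equation (3), and the index classes attached to different subsets never overlap, which is what makes the decomposition of section 4 possible.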

Note that, in general, the class I_j^{(N)}(k) contains more than one index, so there may be multiple coefficients associated with each subset of variables. Moreover, since the basis elements are orthogonal, the interaction in a given subset is independent of interactions in every other subset. For example, if N = 3, the coefficient a_{110} denotes a pairwise interaction between X_1 and X_2, whereas the coefficient a_{111} corresponds strictly to a third-order interaction between all three variables.

In the special case of N = 2, Lancaster [4] showed that for any distribution for which Λ ∈ L²(P) there exist bases such that the resulting expansion is diagonal; in other words, a_{i_1 i_2} = 0 when i_1 ≠ i_2. Moreover, the diagonal basis is unique, and the functions ψ_i = ψ^{(1)}_i ψ^{(2)}_i and the coefficients a_i are exactly equal to the canonical variables and correlations of the well-known canonical correlation analysis. In adapting the expansion to multivariate (N > 2) distributions, it is unclear whether there is an analogous property.

As an analytic tool, the SL expansion can be used in one of two ways. First, given a set of univariate marginal distributions, we can choose a corresponding set of bases and construct arbitrary joint distributions with a given dependence structure. When doing this, however, we must take care to ensure that the result is in fact a valid distribution. In general, the set of coefficients a_i that produces a valid distribution does not span the entire real line; for example, certain sets of coefficients may result in the distribution being negative. It is usually difficult to know a priori whether a given set of coefficients will produce a valid distribution, so a certain amount of experimentation may be necessary to obtain a valid distribution this way. The second way in which the SL expansion can be used is to evaluate the dependence structure inherent in a given distribution. Given a joint distribution for which Λ ∈ L²(P), we can choose a convenient set of bases for the marginal spaces and determine the corresponding dependence coefficients. Exploiting the dependence structure of the SL expansion, we can then tell what sort of interactions are induced by the given distribution. As we will show in section 4, this information is independent of the choice of basis; as a result, performing this type of analysis does not require explicit calculation of the SL expansion, but rather can be carried out by decomposing a non-parametric dependence measure known as φ². Before introducing this dependence measure, however, we discuss a few useful families of distributions that can be characterized by the SL expansion.

3.1 Example 1: Distributions with Gaussian marginals.

An obvious choice of basis for the Gaussian distribution is the collection of Hermite polynomials, which are orthogonal with respect to the weighting function e^{−x²}. The n-th Hermite polynomial is defined as [15]

\[
H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}.
\]

Letting X = (X_1, …, X_N) be a collection of Gaussian random variables with means μ_1, …, μ_N and variances σ_1², …, σ_N², the SL expansion for the joint probability density function p_X(x) is

\[
p_X(x) = \left[ \prod_{n=1}^{N} p_{X_n}(x_n) \right]
\sum_{i_1=0}^{\infty} \cdots \sum_{i_N=0}^{\infty} a_{i_1 \cdots i_N}
\prod_{n=1}^{N} \frac{1}{\sqrt{2^{i_n}\, i_n!}}\, H_{i_n}\!\left( \frac{x_n - \mu_n}{\sqrt{2}\, \sigma_n} \right).
\tag{4}
\]

Figure 1: Expanding the joint distribution of Gaussian random variables. Panel (a) shows the contours of the distribution of two standard normal random variables that are jointly Gaussian. Panel (b) shows the contours of a different distribution of two standard normal variables that has off-diagonal terms in its SL expansion. The random variables in this case are not jointly Gaussian.

Note the weighting factors in equation (4), which normalize the Hermite polynomials with respect to the non-standard marginal distributions. In general, distributions taking the form of equation (4) are not jointly Gaussian, although each of the marginal distributions is Gaussian. This is clearly the case when the expansion includes non-linear and non-pairwise terms. Figure 1 shows an example of two bivariate distributions with standard normal marginals having very different dependence structures.

In the special case of two jointly Gaussian variables, Barrett and Lampard [9] used Mehler's expansion to express the expansion coefficients in terms of the correlation coefficient, resulting in the simplified expression

\[
p_X(x) = p_{X_1}(x_1)\, p_{X_2}(x_2)
\left[ 1 + \sum_{i=1}^{\infty} \frac{\rho^i}{2^i\, i!}\,
H_i\!\left( \frac{x_1 - \mu_1}{\sqrt{2}\, \sigma_1} \right)
H_i\!\left( \frac{x_2 - \mu_2}{\sqrt{2}\, \sigma_2} \right) \right].
\]

Slepian [16] further generalized Mehler's formula to N > 2. His method, while complicated, can be used to obtain the expansion coefficients for an arbitrary Gaussian random vector. For example, consider the case N = 3, with σ_1 = σ_2 = σ_3 = 1, and correlation coefficients ρ_12, ρ_13, and ρ_23. The resulting expression is then

\[
p_X(x) = \left[ \prod_{n=1}^{3} p_{X_n}(x_n) \right]
\sum_{i_1=0}^{\infty} \sum_{i_2=0}^{\infty} \sum_{i_3=0}^{\infty}
\frac{\rho_{23}^{\,i_1}\, \rho_{13}^{\,i_2}\, \rho_{12}^{\,i_3}}{2^{\,i_1+i_2+i_3}\, i_1!\, i_2!\, i_3!}\,
H_{i_2+i_3}(z_1)\, H_{i_1+i_3}(z_2)\, H_{i_1+i_2}(z_3),
\]

where z_n = (x_n − μ_n)/(√2 σ_n).
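To make this example concrete, the following sketch (our illustration, assuming standard normal marginals; numpy's physicists' Hermite routines supply H_n) evaluates the truncated Mehler form of the SL expansion and compares it with the exact jointly Gaussian density:

from math import exp, factorial, pi, sqrt
import numpy as np
from numpy.polynomial.hermite import hermval

def H(n, x):
    """Physicists' Hermite polynomial H_n evaluated at x."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return hermval(x, c)

def mehler_density(x1, x2, rho, terms=30):
    """Truncated Mehler/SL expansion for two standard normal marginals."""
    phi = lambda x: exp(-x * x / 2) / sqrt(2 * pi)      # N(0,1) pdf
    s = 1.0
    for i in range(1, terms):
        s += (rho ** i) / (2 ** i * factorial(i)) * H(i, x1 / sqrt(2)) * H(i, x2 / sqrt(2))
    return phi(x1) * phi(x2) * s

def exact_bivariate_normal(x1, x2, rho):
    z = (x1 ** 2 - 2 * rho * x1 * x2 + x2 ** 2) / (1 - rho ** 2)
    return exp(-z / 2) / (2 * pi * sqrt(1 - rho ** 2))

print(mehler_density(0.3, -1.1, 0.5), exact_bivariate_normal(0.3, -1.1, 0.5))
# the truncated series matches the jointly Gaussian density to several decimal places

Placing weight on off-diagonal terms instead (for instance on H_1(·)H_2(·)) produces densities like the one in panel (b) of figure 1: still standard normal marginals, but no longer jointly Gaussian.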

3.2 Example 2: Distributions with uniform marginals.

Another interesting example is the case of random variables that are uniformly distributed on the interval [0, 1]. In this case, the Haar wavelet basis is a natural choice. The Haar functions are defined by W_{ij}(x) = W(2^i x − j), with i ∈ ℕ and 0 ≤ j < 2^i, where

\[
W(x) =
\begin{cases}
\;\;\,1 & 0 \le x \le \tfrac{1}{2}, \\
-1 & \tfrac{1}{2} < x \le 1, \\
\;\;\,0 & \text{otherwise}.
\end{cases}
\]

The Haar functions together with the constant function ψ_0 = 1 form a complete basis for the space L²[0, 1] [17]. So, given a collection X = (X_1, …, X_N) of random variables, each one uniformly distributed on the unit interval, the joint pdf p_X(x) can be expressed by the SL expansion

\[
p_X(x) = 1 + \sum_{i_1, j_1} \cdots \sum_{i_N, j_N} a_{i_1 j_1 \cdots i_N j_N} \prod_{n=1}^{N} W_{i_n j_n}(x_n).
\]

For example, let N = 3, and consider the family of densities

\[
p(x_1, x_2, x_3) = 1 + a\, W(x_1) W(x_2) + b\, W(x_1) W(x_2) W(x_3).
\]

Figure 2: Expanding the joint distribution of uniform random variables. The first two panels depict basis functions for the SL expansion of three uniform random variables; panel (a) is a second-order function of X_1 and X_2, and panel (b) is a third-order function of all three variables. The last two panels depict two different joint densities; panel (c) shows a density having only third-order dependencies, while the density in panel (d) has both second- and third-order dependencies.

The two basis functions for this distribution are depicted in figure 2. When a = 0 and b ≠ 0, there is only third-order dependence between the variables. Panel (c) shows the corresponding joint density when b = 0.5. Though the density is clearly highly structured, the correlation between each pair of variables is zero. If a ≠ 0, there is pairwise dependence between X_1 and X_2, yielding a distribution like the one in panel (d). Here, a = 0.5 and b = 0.5, resulting in a non-zero correlation ρ_12 = 3a/4 = 3/8.
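A quick numerical check of this example (our sketch; grid evaluation is just one convenient way to approximate the integrals) confirms that the b-term alone leaves every pairwise correlation at zero while the a-term introduces correlation between X_1 and X_2:

import numpy as np

def W(x):
    # mother Haar function on [0, 1]: +1 on [0, 1/2], -1 on (1/2, 1]
    return np.where(x <= 0.5, 1.0, -1.0)

a, b, m = 0.0, 0.5, 200                      # only the third-order term is active
g = (np.arange(m) + 0.5) / m                 # midpoint grid on [0, 1]
X1, X2, X3 = np.meshgrid(g, g, g, indexing="ij")
p = 1 + a * W(X1) * W(X2) + b * W(X1) * W(X2) * W(X3)
w = p / m**3                                 # cell probabilities (they sum to 1)

rho12 = np.sum(w * (X1 - 0.5) * (X2 - 0.5)) / (1 / 12)   # pairwise correlation of X1, X2
triple = np.sum(w * W(X1) * W(X2) * W(X3))                # third-order SL coefficient
print(rho12, triple)   # ~0 and ~0.5: no pairwise correlation, yet a strong triple interaction
# setting a = 0.5 instead gives rho12 = 3a/4 = 0.375 while the triple term is unchanged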

3.3 Example 3: Distributions on the integers.

The Bahadur representation [10] is a well-known representation of the joint distribution of Bernoulli random variables. Letting X = (X_1, …, X_N) be a collection of random variables with P[X_n = 1] = p_n and P[X_n = 0] = 1 − p_n, and letting p_{[1]}(x) = ∏_{n=1}^{N} p_n^{x_n} (1 − p_n)^{1−x_n} be the product distribution, Bahadur showed that the joint distribution can always be expressed as

\[
p_X(x) = p_{[1]}(x) \left[ 1 + \sum_{0 < i < j \le N} r_{ij}\, z_i z_j + \sum_{0 < i < j < k \le N} r_{ijk}\, z_i z_j z_k + \cdots + r_{12 \cdots N}\, z_1 z_2 \cdots z_N \right],
\tag{5}
\]

where z_n = (x_n − p_n)/√(p_n(1 − p_n)), and r_{ij} = E[z_i z_j], r_{ijk} = E[z_i z_j z_k], etc., with the expectations taken with respect to the joint distribution p_X(x). It is easy to see that the Bahadur representation is a special case of the SL expansion, using the sets {1, z_n} as bases for the marginal spaces ℓ²(p_{X_n}).

We can extend the Bahadur representation to distributions on the integers by enlarging the span of the marginal bases. Again, consider a collection of random variables X = (X_1, …, X_N), and for each n set p_n(x) = P[X_n = x]. For clarity we assume that 0 ≤ X_n ≤ M almost surely; the extension to the negative integers is straightforward. Now, we can find a complete orthonormal basis for each space ℓ²(p_{X_n}) by applying the Gram-Schmidt procedure [18] to the polynomials {1, x, x², …, x^M}. Letting ψ_i^{(n)}(x) denote the resulting i-th basis function, we obtain the expansion

\[
p_X(x) = \left[ \prod_{n=1}^{N} p_{X_n}(x_n) \right]
\sum_{i_1=0}^{M} \cdots \sum_{i_N=0}^{M} a_{i_1 \cdots i_N} \prod_{n=1}^{N} \psi_{i_n}^{(n)}(x_n).
\]

The functions ψ_1^{(n)}(x) = (x − E[X_n])/σ_n take the same form as Bahadur's functions in equation (5). Thus, the Bahadur expansion is simply the SL expansion for integer-valued random variables using orthonormal polynomials, for the special case M = 1. Later, in section 4.3, we show how this construction is useful for analyzing neural population recordings.
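As a small illustration of this construction (our sketch, not code from the original text), the following routine builds the orthonormal polynomial basis for one marginal pmf on {0, …, M} by Gram-Schmidt, using the pmf-weighted inner product ⟨f, g⟩ = Σ_x f(x) g(x) p(x):

import numpy as np

def orthonormal_poly_basis(pmf):
    """Gram-Schmidt on {1, x, ..., x^M} under <f,g> = sum_x f(x) g(x) pmf(x).
    Returns an (M+1) x (M+1) array whose i-th row is psi_i evaluated at x = 0..M."""
    M = len(pmf) - 1
    x = np.arange(M + 1, dtype=float)
    monomials = np.vstack([x ** i for i in range(M + 1)])   # rows: 1, x, ..., x^M
    ip = lambda f, g: np.sum(f * g * pmf)                   # pmf-weighted inner product
    basis = []
    for v in monomials:
        for u in basis:
            v = v - ip(v, u) * u                            # remove components already spanned
        basis.append(v / np.sqrt(ip(v, v)))
    return np.vstack(basis)

pmf = np.array([0.5, 0.3, 0.2])                   # a toy pmf on {0, 1, 2}
psi = orthonormal_poly_basis(pmf)
print(np.round(psi @ np.diag(pmf) @ psi.T, 12))   # identity: the psi_i are orthonormal
# psi[0] is the constant 1 and psi[1] is (x - E[X]) / sigma, matching Bahadur's z

Tensor products of these per-variable functions give the basis elements ψ_i of equation (3), and the dependence coefficients a_i are then expectations of those products under the joint distribution.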

We have seen how the SL expansion provides a flexible model with a useful and intuitive multivariate dependence structure. This type of model is extremely useful for many applications, particularly those in which the variables have a complicated high-order dependence structure that must be preserved. However, it is often infeasible or undesirable to compute the expansion or to estimate it from data. For example, while we can easily construct an SL expansion for a collection of discrete random variables, estimating the combinatorial number of parameters might require a prohibitively large quantity of data. In that case, we would prefer a non-parametric measure of dependence that can be estimated more reliably from fewer data. The case of continuous random variables is even more problematic, since the number of non-zero parameters may be infinite, and computing the bases may be analytically intractable. In the next section, we describe a summary measure of dependence derived from the SL expansion that can be used to characterize the collection of variables when direct estimation of the SL parameters is infeasible.

4 The Phi-Squared Dependence Measure

Pearson defined φ², his coefficient of mean-square contingency, as a generalization of the χ² statistic to test association in a multi-dimensional contingency table [8]. Letting X = (X_1, …, X_N) be a random vector with joint probability density function (or mass function) p_X(x) and marginal densities p_{X_n}(x_n), the classical definition is

\[
\varphi^2 = \int \frac{p_X^2(x)}{\prod_{n=1}^{N} p_{X_n}(x_n)}\, dx \;-\; 1.
\]

Rearranging this formula, we obtain

\[
\varphi^2 = E\!\left[ \left( \frac{p_X(x)}{\prod_{n=1}^{N} p_{X_n}(x_n)} \right)^{\!2}\, \right] - 1,
\tag{6}
\]

where the expectation is with respect to the product density. Thus, φ² is the variance of the likelihood ratio Λ, as we defined it in section 3. It is also a member of a general class of dependence measures that can be defined as Ali-Silvey distances [19] between the joint distribution and the product distribution. Besides φ², another notable member of this class is mutual information, which has gained use in recent years as a dependence measure [20–23]. Dependence measures in this class have a number of important properties, which are discussed at length in [24].

4.1 Components of Phi-Squared

One of the most important properties of the Ali-Silvey dependence measures is that if X and Y are jointly Gaussian random vectors then every Ali-Silvey dependence measure between them is a non-decreasing function of each of the canonical correlations [24]. Letting ρ_1, …, ρ_M be the canonical correlations between X and Y, we obtain φ² = ∏_{m=1}^{M} (1 − ρ_m²)^{-1} − 1, which is clearly non-decreasing in each ρ_m. However, we get a more general result if we consider the SL expansion of p_X(x); substituting equation (2) into equation (6) we obtain

\[
\varphi^2 = \sum_{i} a_i^2,
\]

which is a consequence of Parseval's theorem (the sum runs over the coefficients of order k ≥ 2). Thus, φ² is an increasing function of each of the dependence parameters in the SL expansion. So, for example, recalling the bivariate example of section 3.1 where X and Y were jointly Gaussian with correlation ρ, we get φ² = Σ_{i=1}^{∞} ρ^{2i}, a geometric sum having the well-known solution (1 − ρ²)^{-1} − 1 as noted earlier.

In addition, by exploiting the inherent structure in the SL parameters we obtain an explicit decomposition of φ². Recalling equation (3), we let I_j^{(N)}(k) be the class of coefficients that correspond to interactions in the j-th subset of k variables. Now, letting

\[
\varphi^2_{j(k)} = \sum_{i \in I_j^{(N)}(k)} a_i^2,
\]

we can rewrite the total dependence φ² as the sum

\[
\varphi^2 = \sum_{k=2}^{N} \sum_{j=1}^{\binom{N}{k}} \varphi^2_{j(k)}.
\]

For clarity, we re-index the components to indicate directly which subset of variables they represent:

\[
\varphi^2 = \sum_{i<j} \varphi^2_{ij} + \sum_{i<j<k} \varphi^2_{ijk} + \cdots + \varphi^2_{12 \cdots N}.
\]

For example, φ²_{12} is the component of φ² due entirely to pairwise interactions between the variables X_1 and X_2, and φ²_{123} is the component corresponding to third-order interactions between X_1, X_2, and X_3.
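For a finite discrete joint pmf, equation (6) and its components can be computed directly. The sketch below (our illustration, written for a joint pmf stored as a numpy array) computes the total φ² and the φ² of the (X_1, X_2) pair, which, as shown in the following argument, equals the component φ²_{12}:

import numpy as np

def phi2(p, axes=None):
    """phi^2 = sum p^2 / prod(marginals) - 1 over the support of p (eq. 6).
    `axes` selects a subset of variables; None means all of them."""
    if axes is not None:
        drop = tuple(ax for ax in range(p.ndim) if ax not in axes)
        p = p.sum(axis=drop)
    prod = np.ones_like(p)
    for ax in range(p.ndim):
        shape = [1] * p.ndim
        shape[ax] = p.shape[ax]
        prod = prod * p.sum(axis=tuple(a for a in range(p.ndim) if a != ax)).reshape(shape)
    mask = p > 0
    return np.sum(p[mask] ** 2 / prod[mask]) - 1

# toy example: three binary variables with a purely third-order interaction (X3 = X1 xor X2)
p = np.zeros((2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        p[x1, x2, x1 ^ x2] = 0.25

print(phi2(p))                # total dependence: 1.0
print(phi2(p, axes=(0, 1)))   # the pair (X1, X2) looks independent: 0.0

Every pairwise component vanishes here, yet the total φ² equals 1; the entire dependence is third order, exactly the situation the decomposition is designed to expose.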

Although we used the SL expansion explicitly to obtain the decomposition of φ², it is important to note that the decomposition does not depend on the basis used for the expansion. This fact is easiest to see in the simple case of N = 3. There are three second-order subsets of variables and one third-order subset to consider. Expanding equation (3) and choosing any complete orthonormal basis, we have the SL expansion

\[
p_X(x) = \left[ \prod_{n=1}^{3} p_{X_n}(x_n) \right]
\left[ 1 + \sum_{i \in I_1^{(3)}(2)} a_i \psi_i + \sum_{i \in I_2^{(3)}(2)} a_i \psi_i + \sum_{i \in I_3^{(3)}(2)} a_i \psi_i + \sum_{i \in I_1^{(3)}(3)} a_i \psi_i \right],
\tag{7}
\]

and the corresponding decomposition φ² = φ²_{12} + φ²_{13} + φ²_{23} + φ²_{123}. Now, suppose that I_1^{(3)}(2) denotes the set of indices corresponding to pairwise interactions between X_1 and X_2. Then, integrating equation (7) with respect to x_3, we obtain the marginal density

\[
p(x_1, x_2) = p_{X_1}(x_1)\, p_{X_2}(x_2) \left[ 1 + \sum_{i \in I_1^{(3)}(2)} a_i \psi_i \right].
\]

Letting φ̃² be the phi-squared dependence in the pair (X_1, X_2), we get

\[
\tilde{\varphi}^2 = \int \frac{p^2(x_1, x_2)}{p_{X_1}(x_1)\, p_{X_2}(x_2)}\, dx_1\, dx_2 \;-\; 1 \;=\; \sum_{i \in I_1^{(3)}(2)} a_i^2 \;=\; \varphi^2_{12},
\]

which is independent of the bases ψ_i. Consequently, the φ² decomposition is uniquely specified by the joint distribution, and does not depend on any particular choice of basis. Moreover, the φ² decomposition can always be computed without explicitly computing an SL expansion, by a process of onion peeling. First, we compute all second-order components by computing every pairwise marginal distribution and applying equation (6). Then, we compute the third-order components by computing each third-order marginal distribution, applying equation (6) and subtracting the appropriate second-order components. Proceeding in this way, we obtain the complete decomposition of φ² without calculating the SL expansion explicitly.

As Ali and Silvey noted, which dependence measure one chooses to use is largely arbitrary, since all measures in the Ali-Silvey class possess the same essential properties [24]. We have shown, however, that φ² possesses the additional property that it can be decomposed into separate contributions from each subset of variables. In section 4.3 we will illustrate the usefulness of this property in data analysis. First, however, it is worth discussing a second dependence measure possessing a similar, though less-detailed, decomposition.
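Before moving to that second measure, here is how the onion-peeling computation of the preceding paragraphs looks in code. This sketch is ours and reuses the phi2 helper and the XOR array p from the earlier illustration:

from itertools import combinations

def phi2_decomposition(p):
    """Onion peeling for a 3-variable joint pmf `p` (a 3-d numpy array):
    pairwise components come straight from pairwise marginals; the triple
    component is what remains after subtracting them from the full phi^2."""
    comp = {}
    for pair in combinations(range(3), 2):
        comp[pair] = phi2(p, axes=pair)                     # second-order components
    comp[(0, 1, 2)] = phi2(p) - sum(comp[pair] for pair in combinations(range(3), 2))
    return comp

# XOR example from above: all pairwise components are 0, the triple term carries everything
print(phi2_decomposition(p))   # {(0, 1): 0.0, (0, 2): 0.0, (1, 2): 0.0, (0, 1, 2): 1.0}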

4.2 Kullback-Leibler (KL) Dependence

Perhaps the best known dependence measure in the Ali-Silvey class is the mutual information between two random variables, which is widely used in information theory and statistics. Here we generalize it to multiple random variables. Let X = (X_1, …, X_N) be a random vector with joint probability density function p_X(x) and marginal densities p_{X_n}(x_n), and define the likelihood ratio Λ = p_X(x)/∏_{n=1}^{N} p_{X_n}(x_n). The Kullback-Leibler (KL) dependence ν is the KL divergence between the joint probability function and the product distribution,

\[
\nu = E\!\left[ \Lambda \log \Lambda \right] = \int_{x} p_X(x) \log \frac{p_X(x)}{\prod_{n=1}^{N} p_{X_n}(x_n)}\, dx.
\]

Unlike φ², no simple expression exists for ν as a function of the SL parameters. However, for a certain class of distributions a similar decomposition of the dependence measure exists. Amari [25] describes a decomposition of the KL dependence measure for N variables that can be described by a log-linear model:

\[
\nu = \sum_{n=2}^{N} \nu_n.
\]

Here, each component ν_n represents interactions strictly of the n-th order. For example, ν_2 summarizes all pairwise dependencies, and ν_3 summarizes all third-order dependencies. Finding the decomposition is extremely computationally intensive, even for small N [20]. Moreover, the decomposition is less detailed than the decomposition of φ², since it only separates interactions of each order but does not distinguish between different sets of variables. Finally, the ν decomposition is only available when the variables can be described by a log-linear model, which heavily restricts the class of distributions to which it applies. Consequently, while the total KL dependence measure is widely used in many fields, we prefer φ² for its detail, flexibility, and the computational efficiency of its decomposition. We illustrate the use of φ² in data analysis by analyzing a neural population recording.

4.3 Example: Neural Populations.

An important topic in computational neuroscience is the study of population codes, the mechanism through which sensory information is encoded in the coordinated action of multiple neurons. As a first step toward understanding this mechanism, uncovering the statistical dependencies between neural responses is vital. Neurons encode sensory information in sequences of identical electrical spikes, differing only in their timing. Consequently, to analyze a neural recording, we divide the response time into discrete time bins and count the number of spikes that occurred in each bin. The neural response in each bin can then be viewed as a random variable distributed on the non-negative integers. Thus, the probability law described in section 3.3 provides a natural characterization of the neural response.

For the dependence analysis, we compute a normalized φ² dependence measure. All Ali-Silvey dependence measures achieve their maximum value when the random variables are completely mutually dependent. In general, the maximum value is infinity, so no normalization exists. However, when we are dealing with a finite discrete alphabet, we can normalize φ² so that the normalized measure equals 1 when the variables are completely mutually dependent. Using Bayes' theorem [26], for any n we can write the joint distribution as p_X(x) = p_{X_n}(x_n) p(x_1, …, x_{n−1}, x_{n+1}, …, x_N | x_n). Since the conditional probabilities are all less than or equal to 1, the joint distribution is upper bounded by min_n p_{X_n}(x_n). Hence,

\[
\varphi^2 \;\le\; \sum_{x:\, p_X(x) > 0} \frac{\left[ \min_n p_{X_n}(x_n) \right]^2}{\prod_{n=1}^{N} p_{X_n}(x_n)} \;-\; 1 \;\equiv\; \varphi^2_{\max}.
\]

The normalized measure is then φ̃² = φ²/φ²_max.
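Continuing the discrete sketches above (again our illustration, reusing the phi2 helper and the XOR array p defined earlier), the normalization is a short computation once the marginals are in hand:

import numpy as np

def phi2_normalized(p):
    """phi^2 / phi^2_max for a discrete joint pmf `p` (any number of axes)."""
    marginals = [p.sum(axis=tuple(a for a in range(p.ndim) if a != ax)) for ax in range(p.ndim)]
    prod = np.ones_like(p)
    mins = np.full_like(p, np.inf)
    for ax, m in enumerate(marginals):
        shape = [1] * p.ndim
        shape[ax] = p.shape[ax]
        prod = prod * m.reshape(shape)
        mins = np.minimum(mins, m.reshape(shape))   # min_n p_{X_n}(x_n) at each cell
    support = p > 0
    phi2_max = np.sum(mins[support] ** 2 / prod[support]) - 1
    return phi2(p) / phi2_max

q = np.zeros((2, 2, 2))
q[0, 0, 0] = q[1, 1, 1] = 0.5       # X1 = X2 = X3: complete mutual dependence
print(phi2_normalized(q))           # 1.0
print(phi2_normalized(p))           # the XOR example from above falls well short of 1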

Complete mutual dependence occurs when the random variables are one-to-one functions of each other. In that case, p_X(x) equals p_{X_n}(x_n) at each point of its support and equals 0 everywhere else; hence φ̃² = 1 if the variables in X are completely mutually dependent. This normalization should be used with caution, however. Since complete mutual dependence can only be achieved under specific constraints on the marginal distributions, the upper bound is not always achievable for an arbitrary set of distributions. Thus, although it is always true that φ̃² ≤ 1, small values could indicate relatively strong dependencies for a given set of marginal distributions, whereas large values could correspond to weak dependencies for another set of marginal distributions.

Figure 3: Dependence analysis of three spiking neurons in the crayfish optic nerve. The top left plot shows the mean firing rate (spikes/s) for each neuron over repeated presentations of the stimulus. The remaining plots show the normalized φ² dependence in each subgroup of neurons as a function of time (seconds). 90% confidence intervals were computed using the bootstrap method and are indicated by dotted lines. Note the significant levels of 2nd- and 3rd-order dependence throughout the experiment.

Figure 3 shows the results of an experiment on the crayfish optic nerve¹. Micro-electrodes inserted into a crayfish brain recorded from three neurons responding to a visual light stimulus (in this case, a triangle-wave light grating moving at constant spatial frequency). We estimated the dependence measure φ̃² in each time bin using histogram estimates for the response distributions, and computed confidence intervals using the bootstrap method [28]. The neurons exhibited statistically significant dependencies throughout the stimulus presentation. It is particularly interesting that the third-order dependence has the same order of magnitude as the pairwise dependencies; consequently, a significant portion of the total dependence would not be revealed by cross-correlation analysis.

¹ The authors are grateful to Dr. R. M. Glantz, Professor of Biochemistry and Cell Biology, Rice University, for providing the data analyzed here. For more experimental details, see [27].

5 Conclusion

The Sarmanov-Lancaster expansion provides an intuitive characterization of the dependence structure in any number of random variables, and applies to a large class of distributions. The SL expansion has utility as a constructive model; by selecting a convenient basis we can compute families of distributions with a given set of marginals and a particular dependence structure. We can also use it to analyze an existing distribution, projecting the distribution onto an orthogonal basis to reveal its inherent dependence structure. As an added bonus, the dependencies revealed by the SL expansion are completely captured by the φ² dependence measure. This dependence measure, which has the same basic properties as other, more commonly used dependence measures (such as mutual information), has the additional property that it can be decomposed into the separate contributions of each subset of variables to the overall dependence. The decomposition of φ² summarizes the essential elements of the dependence structure, revealing exactly which variables are interacting, and on what level. Moreover, the decomposition is independent of the choice of SL basis, making it easy to compute and universally applicable.

The uses and limitations of pairwise linear dependence models are well understood in statistics, but nevertheless they are often used inappropriately when significant higher-order and non-linear dependencies exist [29]. For example, as we saw in section 4.3, neural populations exhibit complicated dependencies that may be the key to understanding how they encode information, and which would be missed by traditional correlation analysis. High-order dependencies exist in other applications as well; in multi-modal data fusion, for example, different kinds of signals produced by the same source (e.g., audio and video) often exhibit dependencies that are not well modeled by pairwise linear correlation [30, 31]. Document retrieval systems have similarly complicated dependence structures, and researchers have been able to improve the performance of such systems using higher-order dependence models to describe the data [32, 33]. The SL expansion and the φ² dependence measure can detail the statistical dependence structure of an arbitrary number of random variables, and thus should prove useful in a variety of applications.

References

[1] M. Bezzi, M. E. Diamond, and A. Treves, "Redundancy and synergy arising from pairwise correlations in neuronal ensembles," J. Computational Neuroscience, vol. 12, no. 3, May–June 2002.

[2] O. V. Sarmanov, "Maximum correlation coefficient (nonsymmetric case)," in Selected Translations in Mathematical Statistics and Probability, vol. 2, Amer. Math. Soc., 1962.

[3] O. V. Sarmanov, "Maximum correlation coefficient (symmetric case)," in Selected Translations in Mathematical Statistics and Probability, vol. 4, Amer. Math. Soc., 1963.

[4] H. O. Lancaster, "The structure of bivariate distributions," Ann. Math. Statistics, vol. 29, no. 3, pp. 719–736, September 1958.

[5] H. O. Lancaster, "Correlation and complete dependence of random variables," Ann. Math. Statistics, vol. 34, no. 4, December 1963.

[6] H. O. Lancaster, "Correlations and canonical forms of bivariate distributions," Ann. Math. Statistics, vol. 34, no. 2, June 1963.

[7] H. Joe, Multivariate Models and Dependence Concepts, Chapman & Hall, 1997.

[8] L. A. Goodman and W. H. Kruskal, "Measures of association for cross classifications," J. Amer. Stat. Assoc., vol. 49, no. 268, pp. 732–764, 1954.

[9] J. F. Barrett and D. G. Lampard, "An expansion for some second-order probability distributions and its application to noise problems," IRE Transactions on Information Theory, vol. 1, pp. 10–15, 1955.

[10] R. R. Bahadur, "A representation of the joint distribution of responses to n dichotomous items," in Studies in Item Analysis and Prediction, H. Solomon, Ed., Stanford University Press, 1961.

[11] H. O. Hirschfeld, "A connection between correlation and contingency," Proc. Cambridge Philos. Soc., vol. 31, pp. 520–524, 1935.

[12] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, pp. 321–377, 1936.

[13] N. Young, An Introduction to Hilbert Space, Cambridge University Press, 1988.

[14] M. Reed and B. Simon, Methods of Modern Mathematical Physics: Functional Analysis, vol. 1, Academic Press, New York, NY, 1980.

[15] A. M. Krall, Hilbert Space, Boundary Value Problems, and Orthogonal Polynomials, vol. 133 of Operator Theory: Advances and Applications, Birkhäuser, 2002.

[16] D. Slepian, "On the symmetrized Kronecker power of a matrix and extensions of Mehler's formula for Hermite polynomials," SIAM J. Math. Anal., vol. 3, no. 4, pp. 606–616, 1972.

[17] G. Strang, "Wavelet transforms versus Fourier transforms," Bulletin of the American Mathematical Society, vol. 28, no. 2, pp. 288–305, 1993.

[18] G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press.

[19] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," J. Royal Stat. Soc. Series B, vol. 28, no. 1, pp. 131–142, 1966.

[20] I. N. Goodman and D. H. Johnson, "Orthogonal decompositions of multivariate statistical dependence measures," in Proc. 2004 International Conference on Acoustics, Speech, and Signal Processing, May 2004.

[21] H. Joe, "Relative entropy measures of multivariate dependence," Journal of the American Statistical Association, vol. 84, no. 405, pp. 157–164, March 1989.

[22] C. B. Bell, "Mutual information and maximal correlation as measures of dependence," Ann. Math. Stat., vol. 33, no. 2, pp. 587–595, June 1962.

[23] G. Pola, A. Thiele, K.-P. Hoffmann, and S. Panzeri, "An exact method to quantify the information transmitted by different mechanisms of correlational coding," Network: Comput. Neural Syst., vol. 14, pp. 35–60, 2003.

[24] S. M. Ali and S. D. Silvey, "Association between random variables and the dispersion of a Radon-Nikodym derivative," J. Royal Stat. Soc. Series B, vol. 27, no. 1, pp. 100–107, 1965.

[25] S. Amari, "Information geometry on hierarchy of probability distributions," IEEE Trans. Info. Theory, vol. 47, no. 5, pp. 1701–1711, July 2001.

[26] H. Stark and J. W. Woods, Probability, Random Processes, and Estimation Theory for Engineers, Prentice Hall, 2nd edition.

[27] C. S. Miller, D. H. Johnson, J. P. Schroeter, L. L. Myint, and R. M. Glantz, "Visual signals in an optomotor reflex: Systems and information theoretic analysis," J. Computational Neuroscience, vol. 13, no. 1, pp. 5–21, July 2002.

[28] B. Efron, "Better bootstrap confidence intervals," J. American Statistical Association, vol. 82, no. 397, pp. 171–185, March 1987.

[29] D. Drouet Mari and S. Kotz, Correlation and Dependence, Imperial College Press, 2001.

[30] J. W. Fisher III, T. Darrell, W. T. Freeman, and P. Viola, "Learning joint statistical models for audio-visual fusion and segregation," in Advances in Neural Information Processing Systems, Nov. 2000.

[31] J. Hershey and J. Movellan, "Using audio-visual synchrony to locate sounds," in Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K.-R. Müller, Eds., MIT Press, 2000.

[32] R. M. Losee, "Term dependence: Truncating the Bahadur-Lazarsfeld expansion," Information Processing and Management, vol. 30, no. 2, 1994.

[33] G. Salton, C. Buckley, and C. T. Yu, "An evaluation of term dependence models in information retrieval," in SIGIR '82: Proceedings of the 5th Annual ACM Conference on Research and Development in Information Retrieval, G. Goos and J. Hartmanis, Eds., Springer-Verlag New York, 1982.


Mathematical Tools for Neuroscience (NEU 314) Princeton University, Spring 2016 Jonathan Pillow. Homework 8: Logistic Regression & Information Theory Mathematical Tools for Neuroscience (NEU 34) Princeton University, Spring 206 Jonathan Pillow Homework 8: Logistic Regression & Information Theory Due: Tuesday, April 26, 9:59am Optimization Toolbox One

More information

A DECOMPOSITION THEOREM FOR FRAMES AND THE FEICHTINGER CONJECTURE

A DECOMPOSITION THEOREM FOR FRAMES AND THE FEICHTINGER CONJECTURE PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 00, Number 0, Pages 000 000 S 0002-9939(XX)0000-0 A DECOMPOSITION THEOREM FOR FRAMES AND THE FEICHTINGER CONJECTURE PETER G. CASAZZA, GITTA KUTYNIOK,

More information

Discrete Simulation of Power Law Noise

Discrete Simulation of Power Law Noise Discrete Simulation of Power Law Noise Neil Ashby 1,2 1 University of Colorado, Boulder, CO 80309-0390 USA 2 National Institute of Standards and Technology, Boulder, CO 80305 USA ashby@boulder.nist.gov

More information

Beyond Wiener Askey Expansions: Handling Arbitrary PDFs

Beyond Wiener Askey Expansions: Handling Arbitrary PDFs Journal of Scientific Computing, Vol. 27, Nos. 1 3, June 2006 ( 2005) DOI: 10.1007/s10915-005-9038-8 Beyond Wiener Askey Expansions: Handling Arbitrary PDFs Xiaoliang Wan 1 and George Em Karniadakis 1

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

A Note on a General Expansion of Functions of Binary Variables

A Note on a General Expansion of Functions of Binary Variables INFORMATION AND CONTROL 19-, 206-211 (1968) A Note on a General Expansion of Functions of Binary Variables TAIC~YASU ITO Stanford University In this note a general expansion of functions of binary variables

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

Bootstrap Approximation of Gibbs Measure for Finite-Range Potential in Image Analysis

Bootstrap Approximation of Gibbs Measure for Finite-Range Potential in Image Analysis Bootstrap Approximation of Gibbs Measure for Finite-Range Potential in Image Analysis Abdeslam EL MOUDDEN Business and Management School Ibn Tofaïl University Kenitra, Morocco Abstract This paper presents

More information

An Improved Cumulant Based Method for Independent Component Analysis

An Improved Cumulant Based Method for Independent Component Analysis An Improved Cumulant Based Method for Independent Component Analysis Tobias Blaschke and Laurenz Wiskott Institute for Theoretical Biology Humboldt University Berlin Invalidenstraße 43 D - 0 5 Berlin Germany

More information

DETECTION theory deals primarily with techniques for

DETECTION theory deals primarily with techniques for ADVANCED SIGNAL PROCESSING SE Optimum Detection of Deterministic and Random Signals Stefan Tertinek Graz University of Technology turtle@sbox.tugraz.at Abstract This paper introduces various methods for

More information

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS Michael A. Lexa and Don H. Johnson Rice University Department of Electrical and Computer Engineering Houston, TX 775-892 amlexa@rice.edu,

More information

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Xiaodong Lin 1 and Yu Zhu 2 1 Statistical and Applied Mathematical Science Institute, RTP, NC, 27709 USA University of Cincinnati,

More information

Ensembles and incomplete information

Ensembles and incomplete information p. 1/32 Ensembles and incomplete information So far in this course, we have described quantum systems by states that are normalized vectors in a complex Hilbert space. This works so long as (a) the system

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

GAUSSIAN PROCESS TRANSFORMS

GAUSSIAN PROCESS TRANSFORMS GAUSSIAN PROCESS TRANSFORMS Philip A. Chou Ricardo L. de Queiroz Microsoft Research, Redmond, WA, USA pachou@microsoft.com) Computer Science Department, Universidade de Brasilia, Brasilia, Brazil queiroz@ieee.org)

More information

Mean-field equations for higher-order quantum statistical models : an information geometric approach

Mean-field equations for higher-order quantum statistical models : an information geometric approach Mean-field equations for higher-order quantum statistical models : an information geometric approach N Yapage Department of Mathematics University of Ruhuna, Matara Sri Lanka. arxiv:1202.5726v1 [quant-ph]

More information

Regression models for multivariate ordered responses via the Plackett distribution

Regression models for multivariate ordered responses via the Plackett distribution Journal of Multivariate Analysis 99 (2008) 2472 2478 www.elsevier.com/locate/jmva Regression models for multivariate ordered responses via the Plackett distribution A. Forcina a,, V. Dardanoni b a Dipartimento

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

3. Probability and Statistics

3. Probability and Statistics FE661 - Statistical Methods for Financial Engineering 3. Probability and Statistics Jitkomut Songsiri definitions, probability measures conditional expectations correlation and covariance some important

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

Maximum variance formulation

Maximum variance formulation 12.1. Principal Component Analysis 561 Figure 12.2 Principal component analysis seeks a space of lower dimensionality, known as the principal subspace and denoted by the magenta line, such that the orthogonal

More information

MULTIPLEXING AND DEMULTIPLEXING FRAME PAIRS

MULTIPLEXING AND DEMULTIPLEXING FRAME PAIRS MULTIPLEXING AND DEMULTIPLEXING FRAME PAIRS AZITA MAYELI AND MOHAMMAD RAZANI Abstract. Based on multiplexing and demultiplexing techniques in telecommunication, we study the cases when a sequence of several

More information

REPRESENTATION THEORY NOTES FOR MATH 4108 SPRING 2012

REPRESENTATION THEORY NOTES FOR MATH 4108 SPRING 2012 REPRESENTATION THEORY NOTES FOR MATH 4108 SPRING 2012 JOSEPHINE YU This note will cover introductory material on representation theory, mostly of finite groups. The main references are the books of Serre

More information

Recent Developments in Numerical Methods for 4d-Var

Recent Developments in Numerical Methods for 4d-Var Recent Developments in Numerical Methods for 4d-Var Mike Fisher Slide 1 Recent Developments Numerical Methods 4d-Var Slide 2 Outline Non-orthogonal wavelets on the sphere: - Motivation: Covariance Modelling

More information

Analytic Geometry. Orthogonal projection. Chapter 4 Matrix decomposition

Analytic Geometry. Orthogonal projection. Chapter 4 Matrix decomposition 1541 3 Analytic Geometry 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 In Chapter 2, we studied vectors, vector spaces and linear mappings at a general but abstract level. In this

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

CHAPTER VIII HILBERT SPACES

CHAPTER VIII HILBERT SPACES CHAPTER VIII HILBERT SPACES DEFINITION Let X and Y be two complex vector spaces. A map T : X Y is called a conjugate-linear transformation if it is a reallinear transformation from X into Y, and if T (λx)

More information

Topics in Representation Theory: Fourier Analysis and the Peter Weyl Theorem

Topics in Representation Theory: Fourier Analysis and the Peter Weyl Theorem Topics in Representation Theory: Fourier Analysis and the Peter Weyl Theorem 1 Fourier Analysis, a review We ll begin with a short review of simple facts about Fourier analysis, before going on to interpret

More information

REPRESENTATION THEORY OF S n

REPRESENTATION THEORY OF S n REPRESENTATION THEORY OF S n EVAN JENKINS Abstract. These are notes from three lectures given in MATH 26700, Introduction to Representation Theory of Finite Groups, at the University of Chicago in November

More information

Pramod K. Varshney. EECS Department, Syracuse University This research was sponsored by ARO grant W911NF

Pramod K. Varshney. EECS Department, Syracuse University This research was sponsored by ARO grant W911NF Pramod K. Varshney EECS Department, Syracuse University varshney@syr.edu This research was sponsored by ARO grant W911NF-09-1-0244 2 Overview of Distributed Inference U i s may be 1. Local decisions 2.

More information

TAKING THE CONVOLUTED OUT OF BERNOULLI CONVOLUTIONS: A DISCRETE APPROACH

TAKING THE CONVOLUTED OUT OF BERNOULLI CONVOLUTIONS: A DISCRETE APPROACH TAKING THE CONVOLUTED OUT OF BERNOULLI CONVOLUTIONS: A DISCRETE APPROACH NEIL CALKIN, JULIA DAVIS, MICHELLE DELCOURT, ZEBEDIAH ENGBERG, JOBBY JACOB, AND KEVIN JAMES Abstract. In this paper we consider

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Mutual Information and Optimal Data Coding

Mutual Information and Optimal Data Coding Mutual Information and Optimal Data Coding May 9 th 2012 Jules de Tibeiro Université de Moncton à Shippagan Bernard Colin François Dubeau Hussein Khreibani Université de Sherbooe Abstract Introduction

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416)

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) D. ARAPURA This is a summary of the essential material covered so far. The final will be cumulative. I ve also included some review problems

More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

Encoding or decoding

Encoding or decoding Encoding or decoding Decoding How well can we learn what the stimulus is by looking at the neural responses? We will discuss two approaches: devise and evaluate explicit algorithms for extracting a stimulus

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Mathematics Department Stanford University Math 61CM/DM Inner products

Mathematics Department Stanford University Math 61CM/DM Inner products Mathematics Department Stanford University Math 61CM/DM Inner products Recall the definition of an inner product space; see Appendix A.8 of the textbook. Definition 1 An inner product space V is a vector

More information

Lecture 3: Central Limit Theorem

Lecture 3: Central Limit Theorem Lecture 3: Central Limit Theorem Scribe: Jacy Bird (Division of Engineering and Applied Sciences, Harvard) February 8, 003 The goal of today s lecture is to investigate the asymptotic behavior of P N (εx)

More information