Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs


1 1 Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs Christopher J. Quinn*, Student Member, IEEE, Negar Kiyavash, Senior Member, IEEE, and Todd P. Coleman, Senior Member, IEEE Abstract Recently, directed information graphs have been proposed as concise graphical representations of the statistical dynamics amongst multiple random processes. A directed edge from one node to another indicates that the past of one random process statistically affects the future of another, given the past of all other processes. When the number of processes is large, computing those conditional dependence tests becomes difficult. Also, when the number of interactions becomes too large, the graph no longer facilitates visual extraction of relevant information for decision-making. This work considers approximating the true joint distribution on multiple random processes by another, whose directed information graph has at most one parent for any node. Under a Kullback-Leibler (KL) divergence minimization criterion, we show that the optimal approximate joint distribution can be obtained by maximizing a sum of directed informations. In particular, (a) each directed information calculation only involves statistics amongst a pair of processes and can be efficiently estimated; (b) given all pairwise directed informations, an efficient minimum weight spanning directed tree algorithm can be solved to find the best tree. We demonstrate the efficacy of this approach using simulated and experimental data. In both, the approximations preserve the relevant information for decisionmaking. EDICS classification identifier: MLR-GRKN I. INTRODUCTION Many important inference problems involve reasoning about dynamic relationships between time series. In such cases, observations of multiple time series are recorded and the objective of the observer is to understand relationships between the past of some processes, and how they affect the future of others. 
In general, with knowledge of joint statistics amongst multiple random processes, such decision-making could in principle be done. However, if these processes exhibit complex dynamics, gaining knowledge can be prohibitive from computational and storage perspectives. As such, it is appealing to develop an approximation of the joint distribution on multiple random processes which can be calculated efficiently and is less complex for inference. Moreover, simplified representations of joint statistics can facilitate easier visualization and The material in this paper was presented (in part) at the International Symposium on Information Theory and Applications, Taichung, Taiwan, October C. Quinn is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana Champaign, Urbana, Illinois ( quinn7@illinois.edu). N. Kiyavash is with the Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana Champaign, Urbana, Illinois ( kiyavash@illinois.edu). T. P. Coleman is with the Department of Bioengineering, University of California, San Diego, La Jolla, CA ( tpcoleman@ucsd.edu) human comprehension of complex relationships. For instance, in situations such as network intrusion detection, decision making in adversarial environments, and first response tasks where a rapid decision is required, such representations can greatly aid the situation awareness and the decision making process. Graphical models have been used to describe both full and approximating joint distributions of random variables [1]. For many graphical models, random variables are represented as nodes and edges between pairs encode conditional dependence relationships. Markov networks and Bayesian networks are two common examples. Markov networks are undirected graphs, while Bayesian networks are directed acyclic graphs. A Bayesian network s graphical structure depends on the variable indexing. 
This methodology could in principle be applied to describing multiple random processes. For example, if we have n time indices and m random processes, then we could create a Markov or Bayesian network on mn random variables. However, if m or n is large, this could be prohibitive from a complexity and visualization perspective. We have recently developed an alternative graphical model, termed a directed information graph, to parsimoniously describe statistical dynamics amongst a collection of random processes [2]. Each process is represented by a single node, and directed edges encode conditional independence relationships pertaining to how the past of one process affects the future of another, given the past of all other processes. As such, in this framework, directed edges represent directions of causal influence 1. They are motivated by simplified generative models of coupled dynamical systems. They admit cycles, and can thus represent feedback between processes. Under appropriate technical conditions, they do not depend on process indexing and moreover are unique [2]. Directed information graphs are particularly attractive when we have m processes and a large number n of time units: it collapses a graph of mn nodes to a graph on m nodes and a directed arrow encodes information about temporal dynamics. In some situations, however, the number m of processes we record itself can be very large, and in such a situation each conditional independence test, involving all m processes, can be difficult to evaluate. Moreover, even visualization of the directed information graph with up to m 2 edges can be cumbersome. As such, the benefits of having a precise picture of the statistical dynamics might be out-weighed by the costs 1 Causal in this work refers to Granger causality [3], where a process X is said to causally influence a process Y if the past of X helps to predict the future of Y, already conditioned on the past Y and all other processes.

2 2 (a) The full influence structure of the social network. It is difficult to determine which users to target to indirectly influence the whole network. (b) The graph of an approximation which captures key structural components. By targeting only the root of the tree, who is circled, it is possible influence will spread throughout the rest of the network. Fig. 1. Graphical models of the full user influence dynamics of the social network and an approximation of those dynamics. Arrow widths correspond to strengths of influence. Although some structural components are lost, the graph of the approximation makes it clear who to target and the paths of expected influence. in computation, storage, and ease-of-use to a human. An approximation of the joint distribution that preserves a small number of important interactions could alleviate this problem. As an example, consider how a social network company negotiates the costs of advertisements to its users with another company. If the preferences or actions of certain users on average have a large causal influence on the subsequent preferences or actions of friends in their network, then a business might be willing to pay more money to advertise to those users, as compared to the down-stream friends with less influence. By paying to advertise to the influential users, the business is effectively advertising to many. For the social network company and the business to agree on pricing, however, it needs to be agreed on which users are the most influential. With a complicated social network, such as Figure 1(a), a simple procedure to identify who to advertise to, and for how much, might be onerous to develop. However, if the social network company could approximate the user interactions dynamics into a simplified - but accurate - picture, such as Figure 1(b), then it would be much easier for the business to see who to target to influence the whole network. This is the motivation of this work. 
Directed trees, such as Figure 1(b), are among the simplest structures that could be used for approximation - each node has at most one parent. In reducing the computational, storage, and visual complexity substantially, directed trees are much more amenable to analysis than the full structure. They also depict a clear hierarchy between nodes. We here consider the problem of finding the best approximation of a joint distribution on m random processes so that each node in the subsequent directed information graph has at most one parent. We will demonstrate the efficacy of this approach from complexity, visualization, and decision-making perspectives. II. OUR CONTRIBUTION AND RELATED WORK A. Our Contribution In this paper, we consider the problem of approximating a joint distribution on m random processes by another joint distribution on m random processes, where each node in the subsequent directed information graph has at most one parent. We consider two variants, one in which the approximation s directed information graph need not be connected, and the second for which it is (i.e. it must be a directed tree). We use the KL divergence as the metric to find the best approximation, and show that the subsequent optimization problem is equivalent to maximizing a sum of pairwise directed informations. Both cases only require knowledge of statistics between pairs of processes to find the best such approximations. For the connected case, a minimum weight spanning tree algorithm can be solved in time that is quadratic in the number of processes. Both approximations have similar algorithmic and storage complexity. We demonstrate the utility of this approach in simulated and experimental data, where the relevant information for decision-making is maintained in the tree approximation. B. Related work Chow and Liu proposed an efficient algorithm to find an optimal tree structured approximation to the joint distribution on a collection of random variables [4]. 
Since then, many works have been developed to approximate joint distributions, in terms of underlying Markov and Bayesian networks. There have been other works that approximated with more complicated structures; see [1] for an overview. In this work, we use directed information graphs to describe the joint distribution on random processes, in terms of how the past of processes statistically affect the future of others. These were recently introduced in [2], where it was also shown that they are a generalized embodiment of Granger s notion of causality [3] and that under mild assumptions, they are equivalent to minimal generative model graphs. Many methods to estimate joint distributions on random processes from a generative model perspective have recently been developed. Shalizi et al. [5] have developed methods using a stochastic state reconstruction algorithm for discrete valued processes to identify interactions between processes and functional communities. Group Lasso is a method to infer the causal relationships between multivariate auto-regressive models [6]. Bolstad et al. recently showed conditions under which the estimates of Group Lasso are consistent [7]. Puig et al. have developed a multidimensional shrinkage-threshold operator which arises in problems with Group Lasso type penalties [8]. Tan and Willsky analyzed sample complexity for identifying the topology of a tree structured network of LTI systems [9]. Materassi et al. have developed methods based on Wiener filtering to statistically infer causal influences in linear stochastic dynamical systems; consistency results have been derived for the case when the underlying dynamics have a tree structure [10], [11]. For the setting where the directed information graph has a tree structure and some processes are not observed, Etesami et al. developed a procedure to recover the graph [12]. C. Paper organization The paper organization is as follows. Section III establishes definitions and notations. 
In Section IV, we review directed

3 3 information graphs and discuss their relationship with generative models of stochastic dynamical systems to motivate our approach. In Section V, we present our main results pertaining to finding the optimal approximations of the joint distribution where each node can have at most one parent, both unconstrained and when the structure is constrained to be a directed tree. Here we show that in both cases, the optimization simplifies to maximizing a sum of pairwise directed informations. In Section VI, we analyze the algorithmic and storage complexity of the approximations. In Section VII, we review parametric estimation, evaluate the performance of the approximations in a simulated binary classification experiment, and showcase the utility of this approach in elucidating the wave-like phenomena in the joint neural spiking activity of primary motor cortex. III. DEFINITIONS AND NOTATION This section presents probabilistic notations and information-theoretic definitions and identities that will be used throughout the remainder of the manuscript. Unless otherwise noted, the definitions and identities come from Cover & Thomas [13]. For a sequence a 1, a 2,..., denote a i (a 1,..., a i ). For any Borel space Z, denote its Borel sets by B(Z) and the space of probability measures on (Z, B(Z)) as P (Z). Consider two probability measures P and Q on P (Z). P is absolutely continuous with respect to Q (denoted as P Q) if Q(A) = 0 implies that P(A) = 0 for all A B(Z). If P Q, denote the Radon-Nikodym derivative as the random variable dp dq : Z R that satisfies P(A) = z A dp dq (z)q(dz), A B(Z). The Kullback-Leibler divergence between P P (Z) and Q P (Z) is defined as [ D(P Q) E P log dp ] = dq z Z log dp (z)p(dz) (1) dq if P Q and otherwise. For a sample space Ω, sigma-algebra F, and probability measure P, denote the probability space as (Ω, F, P). 
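For finite alphabets, the KL divergence in (1) reduces to a sum, and the absolute continuity requirement P ≪ Q corresponds to Q assigning positive mass wherever P does. A minimal illustrative sketch (not part of the original development):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) for finite-alphabet distributions, per (1).
    Requires P << Q: wherever q is zero, p must be zero too;
    otherwise the divergence is infinite."""
    d = 0.0
    for pz, qz in zip(p, q):
        if pz > 0:
            if qz == 0:
                return math.inf  # absolute continuity violated
            d += pz * math.log(pz / qz)
    return d
```

Note that D(P‖Q) ≠ D(Q‖P) in general, and D(P‖Q) = 0 if and only if P = Q, which is what makes it a usable (if asymmetric) approximation criterion.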
Throughout this paper, we will consider m random processes where the ith (with i {1,..., m}) random process at time t (with t {1,..., n}), takes values in a Borel space X. Denote the ith random variable at time t by X i,t : Ω X, the ith random process as X i = (X i,1,..., X i,n ), and the whole collection of all m random processes as X = (X 1,..., X m ). The probability measure P thus induces a joint distribution on X given by P X ( ) P (X mn ), a joint distribution on X i given by P Xi ( ) P (X n ), and a marginal distribution on X i,t given by P Xi,t ( ) P (X). With slight abuse of notation, denote X X i for some i and Y X j for some j i and denote the conditional distribution and causally conditioned distribution of Y given X as P Y X=x (dy) P Y X (dy x) n ( = P Yt Y t 1,X n dyt y t 1, x n) P Y X=x (dy) P Y X (dy x) n ( P Yt Y t 1,X t 1 dyt y t 1, x t 1). (2) Note the similarity with regular conditioning in (2), except in causal conditioning the future (x n t ) is not conditioned on [14]. The mutual information and directed information [15] between random process X and random process Y are I(X; Y) = D ( ) P Y X=x P Y PX (dx) x I(X Y) = D ( ) P Y X=x P Y PX (dx). (3) x Conceptually, mutual information and directed information are related. However, while mutual information quantifies statistical correlation (in the colloquial sense of statistical interdependence), directed information quantifies statistical causation. Note that I(X; Y) = I(Y; X), but I(X Y) I(Y X) in general. Remark 1: Note that in (2), there is no conditioning on the present x t. This follows Marko s definition [14] and is consistent with Granger causality [3]. Massey [15] and Kramer [16] later included conditioning on x t for the specific setting of communication channels. In such settings, since the directions of causation (e.g. that X is input and Y is output) are known, it is convenient to work with synchronized time, for which conditioning on x t is meaningful. 
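When the pair of processes is jointly Markov, the directed information in (3) admits simple plug-in estimates. The sketch below (an illustration of the definition, not the estimator of Section VII-A) estimates the directed information rate for jointly first-order-Markov processes over finite alphabets by averaging the empirical conditional mutual information I(X_{t-1}; Y_t | Y_{t-1}):

```python
from collections import Counter
import math

def directed_information_rate(x, y):
    """Plug-in estimate of the directed information rate I(X -> Y),
    assuming (X, Y) is jointly Markov of order 1, so each term of the
    sum defining directed information is I(X_{t-1}; Y_t | Y_{t-1})."""
    # empirical counts of triples (x_{t-1}, y_t, y_{t-1})
    triples = Counter(zip(x[:-1], y[1:], y[:-1]))
    n = sum(triples.values())
    xz, yz, z = Counter(), Counter(), Counter()
    for (a, b, c), k in triples.items():
        xz[(a, c)] += k   # counts of (x_{t-1}, y_{t-1})
        yz[(b, c)] += k   # counts of (y_t, y_{t-1})
        z[c] += k         # counts of y_{t-1}
    # plug-in conditional mutual information
    return sum((k / n) * math.log(k * z[c] / (xz[(a, c)] * yz[(b, c)]))
               for (a, b, c), k in triples.items())
```

For instance, if Y is a one-step-delayed copy of an i.i.d. binary X, the estimate in the direction X → Y approaches log 2 while the reverse direction is near zero, reflecting the asymmetry I(X → Y) ≠ I(Y → X) noted above.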
Note, however, that by conditioning on the present x t in (2), that in a binary symmetric channel (for example) with input X, output Y, and no feedback, I(Y X) > 0, even though Y does not influence X. Directed information has been shown to play important roles in characterizing the capacity of channels with feedback [17] [19], quantifying achievable rates for source coding with feedforward [20], for feedback control over noisy channels [21], [22], and gambling, hypothesis testing, and portfolio theory [23]. See [24] for examples and further discussion. Remark 2: This work is in the setting of discrete time, such as sampled continuous-time processes. Under appropriate technical assumptions, directed information can be directly extended to continuous time on the [0, T ] interval. Define F t = σ(x τ : 0 τ < t, Y τ : 0 τ < t) to be the sigma-algebra generated by the past of all processes and F t = σ(y τ : 0 τ < t) to be the sigma-algebra generated by the past of all processes excluding X. If we assume that all processes

4 4 are well-behaved (i.e. on Polish spaces), then we have that regular versions of P (Y t F t ) and P (Y t F t ), exist almost-surely [25]. As such, we can denote the regular conditional probabilities by by P t ( ) P (Y) and P t ( ) P (Y) respectively. Then the directed information in continuous time is given in complete analogy with discrete time by [ ] T I(X Y) E D (P t P ) t dt. Connections between directed information in continuous time, causal continuous-time estimation, and communication in continuous time have also recently been proposed [26]. A treatment of the continuous-time setting is outside the scope of this work. IV. BACKGROUND AND MOTIVATING EXAMPLE: APPROXIMATING THE STRUCTURE OF DYNAMICAL 0 SYSTEMS In this section, we describe the problem of identifying the structure of a stochastic, dynamical system, and then approximating it with another stochastic dynamical system. We will review the definitions and basic properties of directed information graphs. We first consider an example of a deterministic dynamical system described in state space format in terms of coupled differential equations. Example 1: Consider a system with three deterministic processes, {x t, y t, z t }, which evolves according to: ẋ = g 1 (x, y, z) x t+ = x t + g 1 (x t, y t, z t ) ẏ = g 2 (x, y, z) y t+ = y t + g 2 (x t, y t, z t ) ż = g 3 (x, y, z) z t+ = z t + g 3 (x t, y t, z t ). Given the full past of the whole network, {x t, y t, z t }, the future of each process (at time t + ) can be constructed. In many cases, some processes do not depend on the past of every other process, but only some subset of other processes. Suppose we can simplify the above equations by removing all of the dependencies of how one process evolves given others: x t+ = x t + g 1 (x t, y t ) y t+ = y t + g 2 (x t, y t ) z t+ = z t + g 3 (x t, y t, z t ). This structure can be depicted graphically (see Figure 2(a)). 
We can further approximate this dynamical system by approximating the functions {g 1 (x t, y t ), g 2 (x t, y t ), g 3 (x t, y t, z t )} whose generative models have fewer inputs. One approximation for the system is: x t+ = x t + g 1 (x t, y t ) x t + g 1(x t ) y t+ = y t + g 2 (x t, y t ) z t+ = z t + g 3 (x t, y t, z t ) z t + g 3(y t, z t ). Figure 2(b) depicts the corresponding directed tree structure. We refer to such structures as causal dependence trees. A similar procedure can be used for networks of stochastic processes, where the system is described in a timeevolving manner through conditional probabilities. Consider X Y Z (a) Full causal dependence structure, the directed information graph. X Y Z (b) Causal dependence tree approximation (7). Fig. 2. Directed information graph and a causal dependence tree approximation for the dynamical system in Example 1. three processes {X, Y, Z}, formed by including i.i.d. noises {B t, C t, D t } n to the above dynamical system and relabeling the time indices: X t+1 = X t + g 1 (X t, Y t, Z t ) + B t+1 Y t+1 = Y t + g 2 (X t, Y t, Z t ) + C t+1 Z t+1 = Z t + g 3 (X t, Y t, Z t ) + D t+1. The system can alternatively be described through the joint distribution (up to time n) as P X,Y,Z (dx, dy, dz) = n P Xt,Y t,z t X t 1,Y t 1,Z t 1(dx t, dy t, dz t x t 1, y t 1, z t 1 ). Because of the causal structure of the dynamical system and the statistical independence of the noises, given the full past, the present values are conditionally independent: P X,Y,Z (dx, dy, dz) = (4) n P Xt X t 1,Y t 1,Z t 1(dx t x t 1, y t 1, z t 1 ) P Yt X t 1,Y t 1,Z t 1(dy t x t 1, y t 1, z t 1 ) P Zt X t 1,Y t 1,Z t 1(dz t x t 1, y t 1, z t 1 ). More generally, we will make the analogous assumption about the chain rule and how each process at time t is conditionally independent of one another, given the full past of all processes. Assumption 1: Equation (4) holds and dp X dφ (x) > 0 for all x and some measure φ P X. 
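For concreteness, the stochastic system above can be simulated directly. The linear drift functions g1, g2, g3 below are hypothetical choices (not specified in the text) respecting the sparse structure of Figure 2(a): X and Y evolve from the pasts of (X, Y) only, while Z depends on all three.

```python
import random

def simulate(n, dt=0.1, seed=0):
    """Euler-type simulation of three coupled processes with i.i.d.
    Gaussian noises B, C, D, mirroring the structure of Figure 2(a).
    The drifts are illustrative stand-ins for g1, g2, g3."""
    rng = random.Random(seed)
    x, y, z = [0.0], [0.0], [0.0]
    for t in range(n - 1):
        g1 = -x[t] + 0.5 * y[t]              # uses past of X, Y only
        g2 = 0.5 * x[t] - y[t]               # uses past of X, Y only
        g3 = 0.3 * x[t] + 0.3 * y[t] - z[t]  # uses past of X, Y, Z
        x.append(x[t] + dt * g1 + rng.gauss(0.0, 0.1))
        y.append(y[t] + dt * g2 + rng.gauss(0.0, 0.1))
        z.append(z[t] + dt * g3 + rng.gauss(0.0, 0.1))
    return x, y, z
```

Because the noises are independent and each update depends only on the past, sample paths generated this way satisfy the factorization (4) by construction.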
A large class of stochastic systems satisfy Assumption 1. For example, coupled stochastic processes described by an Ito stochastic differential equation with independent Brownian noise satisfy the continuous-time equivalent of this assumption [2]. Granger argued that this is a valid assumption for real world systems, provided the sampling rate 1/ is high [3]. We can rewrite (4) using causal conditioning notation (2): P X,Y,Z (dx, dy, dz) = P X Y,Z (dx y, z)p Y X,Z (dy x, z) P Z X,Y (dz x, y). As in the deterministic case, often the evolution of one process does not depend on every other process, but only some subset. We can remove the unnecessary dependencies to obtain P X,Y,Z (dx, dy, dz) = P X Y (dx y)p Y X (dy x) P Z X,Y (dz x, y).

5 5 The dependence structure of this stochastic system is represented by Figure 2(a). We next generalize this procedure. For each process X i, let A(i) {1,..., m}\{i} denote a potential subset of parent processes. Define the corresponding induced probability measure P A : m P A (dx) = P Xi X A(i) (dx i x A(i) ). (5) To find a minimal graph, for each process X i, we would like to find the smallest set of parents that fully describes the dynamics of X i as well as the whole network does: D ( P X P A ) = 0. (6) In Example 1, the A(i) s would correspond to {Y}, {X}, and {X, Y} for X, Y, and Z, respectively. The parent sets {A(i)} n can be independently minimized so that (6) holds. With these minimal parent sets, we can define the graphical model we will use throughout this discussion. 2 Definition 4.1: A directed information graph is a directed graph, where each process is represented by a node, and there is a directed edge from X j to X i for i, j [m] iff j A(i), where the cardinalities { A(i) } m are minimal such that (6) holds. Lemma 4.2 ([2]): Under Assumption 1, directed information graphs are unique. Furthermore, for a given process X i, a directed edge is placed from j to i (j A(i)) if and only if I(X j X i X\{X j, X i }) > 0. Directed information graphs can have cycles, representing feedback between processes, and can even be complete. For some systems, there might be a large number of influences between processes, with varying magnitudes. For analysis and even storage purposes, it can be helpful to have succinct approximations. For the stochastic system of Example 1, we can apply a similar approximation to this system as was done in the discrete case with: P X Y (dx y) P X (dx) P Z X,Y (dz x, y) P Z Y (dz y). Thus, our causal dependence tree approximation to these stochastic processes, denoted by P X, is: P X (dx) P X (dx) P X (dx)p Y X (dy x)p Z Y (dz y). (7) This approximation is represented graphically in Figure 2(b). 
Although the system in Example 1 only had three processes, with a large number m of processes, the directed information graph could be quite complex, difficult to compute and analyze visually. As we will show, it is possible to construct efficient optimal tree-like approximations to the directed information graph, and these approximations do not suffer greatly in decision-making performance nor in visualization of relevant features. 2 In [2], minimal generative model graphs are defined by Definition 4.1. Under mild technical assumptions they are equivalent to directed information graphs; for clarity we refer to them together as directed information graphs. (a) Best parent approximation. (b) Best tree approximation. Fig. 3. Examples of directed information graph approximations. The best parent approximation is better in terms of KL divergence. However, the best tree approximation is connected and has a clearly distinguished root with paths from the root to all other nodes. Thus, it is more useful for applications such as targeted advertising. V. MAIN RESULT: BEST PARENT AND CAUSAL DEPENDENCE TREE APPROXIMATIONS We now describe two approaches to approximate joint distributions of networks of stochastic processes, with corresponding low complexity directed information graphs. In both cases, at most a single parent will be kept. The first case is an unconstrained optimization. The second constrains the approximating structure to be a causal dependence tree (this was presented in part at [27]). Minimizing the KL divergence between the full and approximating joint distributions in both cases will result in a sum of pairwise directed informations. We first examine the problem of finding the best approximation where each process has at most one parent. See Figure 3(a). Consider the joint distribution P X of m random processes {X 1, X 2,, X m }, each of length n. We will consider approximations of the form P X (dx) m P Xi X a(i) (dx i x a(i) ). 
(8) where a(i) {1,..., m}\{i} selects the parent. Let G 1 denote the set of all such approximations. We want to find the P X G 1 that minimizes the KL divergence. Theorem 1: arg min D(P X P X ) = arg max a(i) {1,...,m}\{i} Proof: First define the product distribution P X (dx) I(X a(i) X i ). (9) m P Xi (dx i ), (10) which is equivalent to P X (x) when the processes are statistically independent. Note that P X, P X, P X all lie in P (Ω), and moreover, P X P X P X. Thus, the Radon-Nikodym derivative dp X d P satisfies X the chain rule [28]: dp X d P X = dp X d P X d P X d P X. (11)

6 6 Thus, arg min D(P X P X ) = arg min = arg min = arg max = arg max = arg max E PX E PX E PX [ ] log dp X d P X [ ] [ + E PX log dp X d P X [ ] x log d P X d P X log d P X d P X ] (12) (13) log dp X i X a(i) =x a(i) dp Xi P X (dx) (14) ) D (P Xi Xa(i) =xa(i) P Xi P Xa(i) (dx) (15) x = arg max I(X a(i) X i ) (16) = arg max I(X a(i) X i ), (17) a(i) {1,...,m}\{i} where (12) applies the log to (11) and rearranges; (13) follows from dp X d P not depending on P X ; (14) follows from (8) and X (10); (15) follows from (1); (16) follows from (3); and (17) follows from the choice of each a(i) effecting only a single term in the sum. Thus, finding the optimal structure where each node has at most one parent is equivalent to individually maximizing pairwise directed informations. The process is described in Algorithm 1. Let R denote the set of all pairwise marginal distributions of P X : R = {P Xi,X j : i, j {1,..., m}}. Algorithm 1. Best Parent Input: R 1. For i {1,..., m} 2. a(i) 3. For j {1,..., m}\{i} 4. Compute I(X j X i ) 5. a(i) arg max I(X j X i ) j Algorithm 1 will return the best possible approximation where only pairwise interactions are preserved. It is possible, though, that PX could be disconnected. See Figure 3(a). For some applications, such as picking a single most influential user in a group of friends for targeted advertising, it is useful to have a connected structure with a dominant node. See Figure 3(b). Next consider the case where candidate approximations have causal dependence tree structures. The approximations have the form m P X (dx) P Xπ(i) X l(π(i)) (dx π(i) x l(π(i)) ) (18) where π is a permutation on {1,..., m} and 0 l(i) < i with X 0 denoting a deterministic constant (for the root node s dependence). Let T C denote the set of all possible causal dependence tree approximations. Like before, we want to find the P X T C that minimizes the KL divergence. Theorem 2: arg min D(P X P X ) = arg max I(X l(π(i)) X π(i) ). 
(19) P X T C P X T C Proof: The proof is similar to the proof of Theorem 1, except that (16) cannot be broken up, as the structural constraint couples choosing π( ) and l( ). Because the maximization became decoupled in Theorem 1, there was a simple algorithm to find the best structure, and that algorithm could be run in a distributed manner. Although that does not happen here, note that the optimal PX T C is maximizing a sum of pairwise directed information values. Each value corresponds to an edge weight for one directed edge in a complete directed graph on the m processes. To find the tree with maximal weight, we can employ a maximum-weight directed spanning tree (MWDST) algorithm. We discuss MWDST algorithms in Section VI-A. Algorithm 2 describes the procedure to find the best approximating distribution with a causal dependence tree structure. Algorithm 2. Causal Dependence Tree Input: R 1. For i {1,..., m} 2. a(i) 3. For j {1,..., m}\{i} 4. Compute I(X j X i ) 5. {a(i)} m MWDST ({I(X j X i )} 1 i j m ) Since T C contains simpler approximations than G 1, Algorithm 1 s approximations are superior to Algorithm 2 s in terms of KL divergence. For some applications, however, having a directed tree can be more useful for analysis and allocation of resources. Remark 3: Chow and Liu [4] solved an analogous problem for a collection of random variables. They developed an algorithm to efficiently find the best tree structured approximation for a Markov network (or, equivalently for that problem, a Bayesian network). They showed that using KL divergence, finding the best tree approximation was equivalent to maximizing a sum of mutual informations. They used a maximum weight spanning tree to solve the optimization. Thus, even though directed information graphs have different properties than Markov or Bayesian networks, and operate on a collection of random processes not variables, the method for finding the best tree is analogous. 
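Algorithms 1 and 2 can be sketched as follows, taking the pairwise directed informations as a given weight matrix di, with di[j][i] the weight of the candidate edge X_j → X_i. The tree search here is a brute-force stand-in, viable only for small m, for the O(m^2) Chu-Liu/Edmonds MWDST algorithm discussed in Section VI-A:

```python
import itertools

def best_parent(di):
    """Algorithm 1: each node independently keeps its highest-weight
    single parent; the result may be disconnected or contain cycles."""
    m = len(di)
    return [max((j for j in range(m) if j != i), key=lambda j: di[j][i])
            for i in range(m)]

def best_tree(di):
    """Algorithm 2 by exhaustive search: maximize the summed edge
    weights over all directed spanning trees (one root, every other
    node has exactly one parent, all nodes reachable from the root)."""
    m = len(di)
    best, best_w = None, -float('inf')
    options = [[None] + [j for j in range(m) if j != i] for i in range(m)]
    for parents in itertools.product(*options):
        if sum(p is None for p in parents) != 1:
            continue  # exactly one root
        root = parents.index(None)
        reached, frontier = {root}, [root]
        while frontier:            # check reachability from the root
            u = frontier.pop()
            for v in range(m):
                if parents[v] == u and v not in reached:
                    reached.add(v)
                    frontier.append(v)
        if len(reached) != m:
            continue  # parent map is not a tree
        w = sum(di[p][i] for i, p in enumerate(parents) if p is not None)
        if w > best_w:
            best, best_w = parents, w
    return best, best_w
```

Consistent with the discussion above, the unconstrained best-parent total weight always dominates the best-tree weight, since every spanning tree is one feasible parent assignment.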
Next, we consider the consistency of these algorithms in the setting of estimation from data. We discuss estimation in Section VII-A. Theorem 3: Suppose P_X ∈ T_C and the estimates {Î(X_j → X_i)}_{1 ≤ i ≠ j ≤ m} converge almost surely (a.s.). Then for the output P̂_X of Algorithm 2, P̂_X → P_X a.s. (20)

7 7 Proof: Since P X T C, by Lemma 4.2, P X is the unique tree structure with maximal sum of directed informations along its edges. Algorithm 2 finds the tree with maximal weight, and thus if the edge weights converge almost surely, then the tree estimate does also. Note that an analogous result holds for Algorithm 1 in the case P X G 1. In general, there could be multiple approximation structures in T C or G 1 with the same maximal weight, so P X might not converge, but the approximating structures picked would almost surely be among those of maximal weight. VI. COMPLEXITY In this section, we will discuss the complexity both of the algorithms and storage requirements for the solution. A. Algorithmic complexity Both algorithms first compute the directed information values between each pair. For discrete random processes, computing the directed information, a divergence (3), in general involves summations over exponentially large alphabets. Computing one directed information value for two processes of length n is O( X 2n ). If the distributions are assumed to be jointly Markov of order k, then it becomes linear O(n X 2k ) = O(n) for fixed k. Thus computing the directed information for each ordered pair of processes is O(nm 2 ) work when Markovicity is assumed. For both algorithms, computation of the directed informations can be done independently: the for loops in lines 1 and 4 of both algorithms can be done in a distributed fashion. Note that computing only pairwise relationships is computationally much more tractable than in the full case. To identify the true directed information graph, divergence calculations using the whole network of processes are used [2], requiring O( X mn ) time without Markovicity, and O(n X mk ) with Markovicity. Furthermore, the computation can reduced by calculating mutual informations initially for line 4 in both algorithms. Equation (4) holding means P X,Y = P X Y P Y X which implies I(X; Y) = I(X Y) + I(Y X) [14]. 
Since mutual and directed informations are non-negative, the mutual information upper-bounds each of the two directed informations; either directed information can later be computed to resolve both. After computing the pairwise directed informations, Algorithm 1 picks the best parent for each process, which takes $O(m^2)$ total, so its total runtime is $O(nm^2)$ assuming Markovicity. Algorithm 2 additionally computes a maximum weight spanning directed tree. Chu and Liu [29], Edmonds [30], and Bock [31] independently developed an efficient MWDST algorithm, which runs in $O(m^2)$ time. Thus, like Algorithm 1, Algorithm 2 runs in $O(nm^2)$. Note that Humblet [32] proposed a distributed MWDST algorithm, which constructs the maximum weight tree for each node as root in $O(m^2)$ time; in some applications, it is useful to be able to choose from multiple potential roots.

B. Storage complexity

The full joint distribution involves $mn$ variables, and each possible outcome may have a unique probability. Thus, for discrete variables with alphabet $\mathcal{X}$, storing the full joint distribution takes $O(|\mathcal{X}|^{mn})$ space. Both approximations we consider reduce the full joint distribution to $m$ pairwise distributions, so the storage is $O(m|\mathcal{X}|^{2n})$. Further, if the approximations are Markov of order $k$, the total storage becomes $O(mn|\mathcal{X}|^{2k}) = O(mn)$ for constant $k$.

VII. APPLICATIONS TO SIMULATED AND EXPERIMENTAL DATA

In this section, we demonstrate the efficacy of the approximations in a classification experiment with simulated time-series. We then show that the approximations capture important structural characteristics of a network of brain cells from a neuroscience experiment. First, we discuss parametric estimation of directed information from data.
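Before turning to parametric estimators, the linear-in-$n$ cost claimed in Section VI can be made concrete: for a small alphabet and fixed Markov order, a plug-in estimate needs only a single counting pass over the data. The following sketch (an illustration, not the estimator used in this paper) computes the order-1 directed information rate $I(X_{t-1}; Y_t \mid Y_{t-1})$ for binary sequences:

```python
import math
from collections import Counter

def di_rate_plugin(x, y):
    """Plug-in estimate of the order-1 directed information rate
    I(X_{t-1}; Y_t | Y_{t-1}) from two equal-length binary sequences.
    A single pass over the data, so O(n) for a fixed Markov order."""
    n = len(y) - 1
    pabc = Counter(zip(x[:-1], y[1:], y[:-1]))  # counts of (x_{t-1}, y_t, y_{t-1})
    pac = Counter(zip(x[:-1], y[:-1]))          # counts of (x_{t-1}, y_{t-1})
    pbc = Counter(zip(y[1:], y[:-1]))           # counts of (y_t, y_{t-1})
    pc = Counter(y[:-1])                        # counts of y_{t-1}
    # conditional mutual information of the empirical distribution
    return sum((k / n) * math.log(k * pc[c] / (pac[a, c] * pbc[b, c]))
               for (a, b, c), k in pabc.items())
```

Because this is the conditional mutual information of an empirical distribution, the estimate is always non-negative; higher Markov orders only lengthen the counted contexts, leaving the pass linear in $n$.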
A. Parametric Estimation

While a thorough discussion of estimation techniques is outside the scope of this work, for completeness we briefly describe the consistent parametric estimation technique for directed information proposed in [24] and [33] and applied to study brain cell networks. Afterwards, we discuss estimation for the specific setting of autoregressive time-series.

1) Point-Process Parametric Models: Let $Y$ and $X$ denote two binary time series of brain cell activity, with $Y_t = 1$ if cell $Y$ was active at time $t$ and $Y_t = 0$ otherwise. Truccolo et al. [34] proposed modeling how $Y$ depends on its own past and the past of $X$ using a point-process framework. The conditional log-likelihood has the form

$$\log f_{Y\|X}(y\|x;\theta) = \sum_{t=1}^{n} \Big[ \log\big(\lambda_\theta(t, y^{t-1}, x^{t-1})\,\Delta\big)\, y_t - \lambda_\theta(t, y^{t-1}, x^{t-1})\,\Delta \Big],$$

where $\Delta$ is the time length between samples and $\lambda_\theta(t, y^{t-1}, x^{t-1})$ is the conditional intensity function [34]

$$\log \lambda_\theta(t, y^{t-1}, x^{t-1}) = \alpha_0 + \sum_{j=1}^{J} \alpha_j y_{t-j} + \sum_{r=1}^{R} \beta_r x_{t-r}.$$

$\lambda_\theta(t, y^{t-1}, x^{t-1})$ can be interpreted as the propensity of $Y$ to be active at time $t$ based on its past and the past of $X$. The Markov orders $J$ and $R$ are assumed to be unknown. To avoid over-fitting, the minimum description length penalty [35] is used to select the MLE $\widehat{\theta}$:

$$(\widehat{J}, \widehat{R}, \widehat{\theta}) = \arg\min_{(J,R,\theta)} \; -\frac{1}{n} \log f_{Y\|X}(y\|x;\theta) + \frac{(J+R)\log n}{2n}.$$

This penalty balances the Shannon code-length of encoding $Y$ with causal side information $X$ using an MLE $\widehat{\theta}(J,R)$ against the code-length required to describe the MLE $\widehat{\theta}(J,R)$. The directed information estimate is

$$\widehat{I}(X \rightarrow Y) = \frac{1}{n} \log \frac{f_{Y\|X}(y\|x;\widehat{\theta})}{f_Y(y;\widehat{\theta}')}, \qquad (21)$$

where $\widehat{\theta}$ and $\widehat{\theta}'$ are the MLE parameter vectors for their respective models. Under stationarity, ergodicity, and Markovicity, almost sure convergence of $\widehat{I}(X \rightarrow Y)$ is shown in [24]. These results extend to general parametric classes. Also note that, for the setting of finite alphabets, [36] proposed universal estimation of directed information using context tree weighting.

2) Autoregressive Models: Next, consider the specific parametric class of autoregressive time-series. A Markov-order-one autoregressive (AR-1) model for $\mathbf{X}$ is

$$\mathbf{X}_t = B \mathbf{X}_{t-1} + \mathbf{N}_t, \qquad (22)$$

where $B$ is a coefficient matrix and $\mathbf{N}_t$ is i.i.d. white Gaussian noise with covariance matrix $\Sigma$. The noise components are assumed to be independent, so $\Sigma$ is diagonal. The coefficients $(B, \Sigma)$ are fixed, so for two processes $\mathbf{X} = (X, Y)$ modeled as AR-1,

$$I(X \rightarrow Y) = \frac{1}{n} \sum_{t=1}^{n} I(X_{t-1}; Y_t \mid Y_{t-1}) = \frac{1}{n}\, n\, I(X_{n-1}; Y_n \mid Y_{n-1}) \qquad (23)$$

$$= \frac{1}{2} \log \left[ \frac{ |K_{Y_n, Y_{n-1}}|\, |K_{X_{n-1}, Y_{n-1}}| }{ |K_{Y_{n-1}}|\, |K_{X_{n-1}, Y_n, Y_{n-1}}| } \right], \qquad (24)$$

where (23) follows from stationarity and Markovicity, (24) follows from (pg. 249 of [13]), and $|K_{Y_n, Y_{n-1}}|$ denotes the determinant of the covariance matrix of $\{Y_n, Y_{n-1}\}$. Note that by the recurrence relation (22), the covariance matrix $K_{\mathbf{X}_t, \mathbf{X}_{t'}}$ can be computed as

$$K_{\mathbf{X}_t, \mathbf{X}_{t'}} = \sum_{s=1}^{\min(t,t')} B^{t-s}\, \Sigma\, (B^{t'-s})^\top. \qquad (25)$$

Thus, estimates $\widehat{I}(X \rightarrow Y)$ can be computed by first finding the least squares estimate $\widehat{B}$ of the coefficient matrix in (22), then computing covariance matrices using (25), and finally computing (24).

B. Classification experiment

We tested the utility of the approximation methods in a binary classification experiment.

1) Setup: For each number of processes $m \in \{5, 10, 15\}$, 100 pairs of AR-1 models $(B, \Sigma)$ and $(B', \Sigma')$ were randomly generated. Each element of the $m \times m$ coefficient matrix $B$ was generated i.i.d. from a $\mathcal{N}(0,1)$ distribution. $\Sigma$ was an $m \times m$ diagonal matrix with entries selected uniformly at random from the interval $[\frac{1}{4}, 1]$.
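Equations (23)–(25) make the AR-1 directed information available in closed form. A sketch, assuming (as (25) does) that the process is started from zero; the function names and the choice of a large fixed $n$ to approximate stationarity are illustrative:

```python
import numpy as np

def cov_k(B, S, t, u):
    """K_{X_t, X_u} = sum_{s=1}^{min(t,u)} B^{t-s} S (B^{u-s})^T, as in (25)."""
    mp = np.linalg.matrix_power
    return sum(mp(B, t - s) @ S @ mp(B, u - s).T
               for s in range(1, min(t, u) + 1))

def di_ar1(B, S, i, j, n=60):
    """I(X_i -> X_j) for an AR-1 model via the determinant ratio (24),
    applied to the jointly Gaussian triple (X_{i,n-1}, X_{j,n}, X_{j,n-1})."""
    idx = [(n - 1, i), (n, j), (n - 1, j)]
    K = np.array([[cov_k(B, S, t, a)[p, q] for (a, q) in idx]
                  for (t, p) in idx])
    det = np.linalg.det
    num = det(K[np.ix_([1, 2], [1, 2])]) * det(K[np.ix_([0, 2], [0, 2])])
    return 0.5 * np.log(num / (K[2, 2] * det(K)))
```

For instance, with $B = [[0.5, 0], [0.4, 0.3]]$ the second component is driven by the first but not conversely, so the computed directed information is strictly positive in one direction and zero (up to numerical error) in the other.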
For each AR model $(B, \Sigma)$, time-series of lengths $n \in \{50, 10^2, 10^3, 10^4\}$ were generated using (22). The coefficients $(B, \Sigma)$ were estimated from each time-series using least squares, and the best parent and best tree approximations were computed using the estimated coefficients. The directed informations $\{\widehat{I}(X \rightarrow Y)\}$ between each pair $(X, Y)$ were estimated using the method in Section VII-A2 with $\mathbf{X} = (X, Y)$. To identify the MWDSTs, a Matlab implementation of Edmonds's algorithm [37] was used. Coefficients $(B', \Sigma')$ were generated and estimated likewise.

Next, classification was performed. For each pair of models $(B, \Sigma)$ and $(B', \Sigma')$, time-series of length $n = 10^6$ were generated from each model using (22). First, the log-likelihood of each time-step conditioned on the past was computed for the full distributions using the estimates of $(B, \Sigma)$ and $(B', \Sigma')$, and the frequency of correct classification was calculated. Next, the log-likelihoods were calculated using the best parent approximations with estimated coefficients, and then using the best tree approximations. This was repeated for each set of coefficient estimates, corresponding to $n \in \{50, 10^2, 10^3, 10^4\}$.

2) Results: The results of these classification experiments are shown in Figure 4. The classification rates are averaged over the 100 trials; error bars show standard deviation. The best parent approximations perform only slightly better than the best tree approximations. Both achieved close to an 85% correct classification rate, improving slightly with larger $m$. Classification using the full distribution improves noticeably with $m$. This is due to the increased complexity of the distributions: with more processes, there are more relationships to distinguish the distributions. There are $m(m-1)$ edges in the full distribution, compared to $m$ in the best parent and $m-1$ in the best tree approximations. Despite having significantly fewer edges, the approximations capture enough structure to distinguish the models.
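Given the matrix of estimated pairwise directed informations, the best parent step used above is just a per-node argmax; the best tree additionally requires a MWDST routine such as [37]. A minimal sketch of the parent selection (the MWDST step is omitted, and the function name is illustrative):

```python
import numpy as np

def best_parents(D):
    """D[i, j] = estimated I(X_i -> X_j). For each node j, pick the parent
    i != j with maximal incoming directed information, as in Algorithm 1."""
    D = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(D, -np.inf)   # a node cannot be its own parent
    return D.argmax(axis=0)        # parent[j] = argmax_i D[i, j]
```

This step costs $O(m^2)$ comparisons, matching the runtime analysis of Section VI-A.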
The effect of having a small number of samples to estimate the AR coefficients becomes more dramatic as $m$ increases. For $m \in \{5, 10, 15\}$, coefficients estimated with $n = 10^3$ and $n = 10^4$ length time-series performed almost identically.

C. Application to Experimental Data

We now discuss an application of these methods to the analysis of neural activity. A recent study computed the directed information graph for a group of neurons in a monkey's primary motor cortex [24]. Using that graph, the authors identified a dominant axis of local interactions, which corresponded to the known, primary direction of wave propagation of regional synchronous activity, believed to mediate information transfer [38]. We show that the best parent and best tree approximations preserve that dominant axis.

The monkey was performing a sequence of arm-reaching tasks, with its arm constrained to move along a horizontal surface. Each task involved the presentation of a randomly positioned, fixed target on the surface, the monkey moving its hand to meet the target, and a reward (a drop of juice) given to the monkey if it was successful. For more details, see [24], [39]. Neural activity in the primary motor cortex was recorded by an implanted silicon micro-electrode array. The recorded waveforms were filtered and processed to produce, for each detected neuron, a sequence of times when that neuron became active (i.e., "spiked"). The 37 neurons with the greatest total activity (number of spikes) were used for the analysis.

To study the flow of activity between individual neurons, we constructed a directed information graph on the collection of neurons. To simplify computation, pairwise directed informations were estimated using the parametric estimation procedure discussed in Section VII-A.
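A much simplified stand-in for that procedure can be sketched with a Bernoulli GLM in place of the point-process conditional-intensity model of Section VII-A1: fit the spiking probability of $Y$ with and without lags of $X$, and take the normalized log-likelihood ratio as in (21). The helper names, fixed lag orders, and plain gradient-ascent fit below are illustrative assumptions, not the estimator of [24]:

```python
import numpy as np

def fit_glm(Z, y, iters=500, lr=1.0):
    """MLE of a logistic (Bernoulli) GLM by plain gradient ascent."""
    w = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Z @ w)))
        w += lr * Z.T @ (y - p) / len(y)
    return w

def loglik(Z, y, w):
    """Bernoulli log-likelihood: sum of y*z - log(1 + e^z)."""
    z = Z @ w
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def di_glm(x, y, J=1, R=1):
    """Estimate I(X -> Y) as in (21): the normalized log-likelihood ratio of
    the model with X-lags versus the model with Y-lags only."""
    T = max(J, R)
    cols = [np.ones(len(y) - T)]
    cols += [y[T - j:len(y) - j] for j in range(1, J + 1)]   # y_{t-j}
    cols += [x[T - r:len(x) - r] for r in range(1, R + 1)]   # x_{t-r}
    Zf = np.column_stack(cols)
    yt = y[T:]
    Zr = Zf[:, :1 + J]   # reduced model: drop the X-lag columns
    return (loglik(Zf, yt, fit_glm(Zf, yt))
            - loglik(Zr, yt, fit_glm(Zr, yt))) / len(yt)
```

A full implementation would also search over $(J, R)$ with the MDL penalty of Section VII-A1 rather than fixing the lag orders.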

Fig. 4. Classification rate between pairs of autoregressive series, for (a) $m = 5$, (b) $m = 10$, and (c) $m = 15$. For each $m \in \{5, 10, 15\}$, 100 pairs of autoregressive coefficients were generated randomly. Classification was performed using the full structures, best parent approximations, and best tree approximations, with coefficients estimated from $n \in \{50, 10^2, 10^3, 10^4\}$ length time-series. Error bars depict standard deviation.

Fig. 5. (a) Graphical structure of non-zero pairwise directed information values from [24]. (b) Causal dependence tree approximation. The best parent approximation was almost identical and is not shown.

Figure 5(a) depicts the pairwise directed information graph. The relative positions of the neurons in the graph correspond to the relative positions of the recording electrodes. The blue arrow indicates a dominant orientation of the edges. This orientation, along the rostro-caudal axis, is consistent with the direction of propagation of local field potentials, which researchers believe mediates information transfer between regions [38].

We applied Algorithms 1 and 2 to this data set. The structure of the causal dependence tree approximation is shown in Figure 5(b). The best parent approximation is almost identical; the only differences are that the parents of nodes 28 and 13 are 27 and 3, respectively. The original graph had 117 edges with many complicated loops. Both approximations reduced the number of edges to roughly a third of that, improving the clarity of the graph.
Both approximations preserve the dominant edge orientation, pertaining to wave propagation, depicted by the blue arrow in Figure 5(a). This suggests that these approximation methodologies preserve information relevant for decision-making and visualization in the analysis of mechanistic biological phenomena.

VIII. CONCLUSION

In this work, we presented efficient methods to optimally approximate networks of stochastic, dynamically interacting processes with low-complexity structures. Both approximations require only pairwise marginal statistics between the processes, which are computationally significantly

more tractable than the full joint distribution. Also, the corresponding directed information graphs are much more accessible for analysis and practical use in many applications.

An important line of future work involves investigating methods to approximate with other, more complicated structures. Best parent approximations and causal dependence tree approximations will always reduce the storage complexity dramatically and facilitate analysis. However, for some applications it might be desirable to use slightly more complicated structures, such as connected graphs with at most three parents for each node. Such approximations highlight a richer set of interactions and feedback while still being visually and computationally simpler to analyze than the full structure. Although it might not always be possible to efficiently find optimal approximations of such graphical complexity, even near-optimal approximations could prove quite beneficial in real-world applications.

ACKNOWLEDGMENTS

The authors thank Jalal Etesami and Mavis Rodrigues for their assistance with computer simulations. This work was supported in part to C. J. Quinn by the NSF IGERT fellowship under grant number DGE , and the Department of Energy Computational Science Graduate Fellowship under grant number DE-FG02-97ER25308; to N. Kiyavash by AFOSR under grants FA , FA , and FA , and by NSF grant CCF CAR; and to T. P. Coleman by NSF Science & Technology Center grant CCF and NSF grant CCF .

REFERENCES

[1] D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques. The MIT Press,
[2] C. Quinn, N. Kiyavash, and T. Coleman, Directed information graphs, Arxiv preprint arxiv: ,
[3] C. Granger, Investigating causal relations by econometric models and cross-spectral methods, Econometrica, vol. 37, no. 3, pp ,
[4] C. Chow and C. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. on Information Theory, vol. 14, no. 3, pp ,
[5] C. Shalizi, M.
Camperi, and K. Klinkner, Discovering functional communities in dynamical networks, Statistical Network Analysis: Models, Issues, and New Directions, pp ,
[6] M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp ,
[7] A. Bolstad, B. Van Veen, and R. Nowak, Causal network inference via group sparse regularization, Signal Processing, IEEE Trans. on, vol. 59, no. 6, pp ,
[8] A. Puig, A. Wiesel, G. Fleury, and A. Hero, Multidimensional shrinkage-thresholding operator and group lasso penalties, Signal Processing Letters, IEEE, vol. 18, no. 6, pp , June
[9] V. Tan and A. Willsky, Sample complexity for topology estimation in networks of LTI systems, in Decision and Control, IEEE Conference on. IEEE,
[10] D. Materassi and G. Innocenti, Topological identification in networks of dynamical systems, Automatic Control, IEEE Trans. on, vol. 55, no. 8, pp ,
[11] D. Materassi and M. Salapaka, On the problem of reconstructing an unknown topology via locality properties of the Wiener filter, Automatic Control, IEEE Trans. on, no. 99, pp. 1 1,
[12] J. Etesami, N. Kiyavash, and T. P. Coleman, Learning minimal latent directed information trees, in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on. IEEE, 2012, pp
[13] T. Cover and J. Thomas, Elements of information theory. Wiley-Interscience,
[14] H. Marko, The bidirectional communication theory a generalization of information theory, Communications, IEEE Trans. on, vol. 21, no. 12, pp , Dec
[15] J. Massey, Causality, feedback and directed information, in Proc Intl. Symp. on Info. Th. and its Applications, 1990, pp
[16] G. Kramer, Directed information for channels with feedback, Ph.D. dissertation, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland,
[17] S. Tatikonda and S. Mitter, The Capacity of Channels With Feedback, IEEE Trans. on Information Theory, vol.
55, no. 1, pp ,
[18] H. Permuter, T. Weissman, and A. Goldsmith, Finite State Channels With Time-Invariant Deterministic Feedback, IEEE Trans. on Information Theory, vol. 55, no. 2, pp ,
[19] C. Li and N. Elia, The information flow and capacity of channels with noisy feedback, arxiv preprint arxiv: ,
[20] R. Venkataramanan and S. Pradhan, Source coding with feed-forward: rate-distortion theorems and error exponents for a general source, IEEE Trans. on Information Theory, vol. 53, no. 6, pp ,
[21] N. Martins and M. Dahleh, Feedback control in the presence of noisy channels: Bode-like fundamental limitations of performance, Automatic Control, IEEE Trans. on, vol. 53, no. 7, pp , Aug
[22] S. K. Gorantla, The interplay between information and control theory within interactive decision-making problems, Ph.D. dissertation, University of Illinois at Urbana-Champaign,
[23] H. Permuter, Y. Kim, and T. Weissman, Interpretations of directed information in portfolio theory, data compression, and hypothesis testing, Information Theory, IEEE Trans. on, vol. 57, no. 6, pp ,
[24] C. Quinn, T. Coleman, N. Kiyavash, and N. Hatsopoulos, Estimating the directed information to infer causal relationships in ensemble neural spike train recordings, Journal of Computational Neuroscience, vol. 30, no. 1, pp ,
[25] R. M. Gray, Probability, random processes, and ergodic properties. Springer,
[26] T. Weissman, Y.-H. Kim, and H. H. Permuter, Directed Information, Causal Estimation, and Communication in Continuous Time, ArXiv e-prints, Sep
[27] C. Quinn, T. Coleman, and N. Kiyavash, Approximating discrete probability distributions with causal dependence trees, in Info. Theory and its App. (ISITA), 2010 Intl. Symp. on. IEEE, 2010, pp
[28] H. Royden and P. Fitzpatrick, Real analysis, 3rd ed. Macmillan New York,
[29] Y. Chu and T. Liu, On the shortest arborescence of a directed graph, Science Sinica, vol. 14, no , p. 270,
[30] J. Edmonds, Optimum branchings, J. Res. Natl. Bur. Stand., Sect.
B, vol. 71, pp ,
[31] F. Bock, An algorithm to construct a minimum directed spanning tree in a directed network, Developments in operations research, vol. 1, pp ,
[32] P. Humblet, A distributed algorithm for minimum weight directed spanning trees, Communications, IEEE Trans. on, vol. 31, no. 6, pp ,
[33] S. Kim, D. Putrino, S. Ghosh, and E. N. Brown, A Granger causality measure for point process models of ensemble neural spiking activity, PLoS Comput Biol, vol. 7, no. 3, March
[34] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown, A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects, Journal of Neurophysiology, vol. 93, no. 2, pp ,
[35] P. D. Grünwald, The minimum description length principle. MIT press,
[36] J. Jiao, H. Permuter, L. Zhao, Y. Kim, and T. Weissman, Universal estimation of directed information, Arxiv preprint arxiv: ,
[37] G. Li, Maximum Weight Spanning tree (Undirected), Computer software, June [Online]. Available: maximum-weight-spanning-tree-undirected
[38] D. Rubino, K. Robbins, and N. Hatsopoulos, Propagating waves mediate information transfer in the motor cortex, Nature neuroscience, vol. 9, no. 12, pp ,
[39] W. Wu and N. Hatsopoulos, Evidence against a single coordinate system representation in the motor cortex, Experimental brain research, vol. 175, no. 2, pp , 2006.

IEEE Transactions on Signal Processing, vol. 61, no. 12, June 15, 2013 (p. 3173).

More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Dinesh Krithivasan and S. Sandeep Pradhan Department of Electrical Engineering and Computer Science,

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Scalable robust hypothesis tests using graphical models

Scalable robust hypothesis tests using graphical models Scalable robust hypothesis tests using graphical models Umamahesh Srinivas ipal Group Meeting October 22, 2010 Binary hypothesis testing problem Random vector x = (x 1,...,x n ) R n generated from either

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

Information maximization in a network of linear neurons

Information maximization in a network of linear neurons Information maximization in a network of linear neurons Holger Arnold May 30, 005 1 Introduction It is known since the work of Hubel and Wiesel [3], that many cells in the early visual areas of mammals

More information

Inference and estimation in probabilistic time series models

Inference and estimation in probabilistic time series models 1 Inference and estimation in probabilistic time series models David Barber, A Taylan Cemgil and Silvia Chiappa 11 Time series The term time series refers to data that can be represented as a sequence

More information

Online Forest Density Estimation

Online Forest Density Estimation Online Forest Density Estimation Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr UAI 16 1 Outline 1 Probabilistic Graphical Models 2 Online Density Estimation 3 Online Forest Density

More information

On Scalable Coding in the Presence of Decoder Side Information

On Scalable Coding in the Presence of Decoder Side Information On Scalable Coding in the Presence of Decoder Side Information Emrah Akyol, Urbashi Mitra Dep. of Electrical Eng. USC, CA, US Email: {eakyol, ubli}@usc.edu Ertem Tuncel Dep. of Electrical Eng. UC Riverside,

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds.

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. A SIMULATION-BASED COMPARISON OF MAXIMUM ENTROPY AND COPULA

More information

QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS

QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS Parvathinathan Venkitasubramaniam, Gökhan Mergen, Lang Tong and Ananthram Swami ABSTRACT We study the problem of quantization for

More information

Observed Brain Dynamics

Observed Brain Dynamics Observed Brain Dynamics Partha P. Mitra Hemant Bokil OXTORD UNIVERSITY PRESS 2008 \ PART I Conceptual Background 1 1 Why Study Brain Dynamics? 3 1.1 Why Dynamics? An Active Perspective 3 Vi Qimnü^iQ^Dv.aamics'v

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

PERFORMANCE STUDY OF CAUSALITY MEASURES

PERFORMANCE STUDY OF CAUSALITY MEASURES PERFORMANCE STUDY OF CAUSALITY MEASURES T. Bořil, P. Sovka Department of Circuit Theory, Faculty of Electrical Engineering, Czech Technical University in Prague Abstract Analysis of dynamic relations in

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Independent Component Analysis. Contents

Independent Component Analysis. Contents Contents Preface xvii 1 Introduction 1 1.1 Linear representation of multivariate data 1 1.1.1 The general statistical setting 1 1.1.2 Dimension reduction methods 2 1.1.3 Independence as a guiding principle

More information

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs Professor Erik Sudderth Brown University Computer Science October 6, 2016 Some figures and materials courtesy

More information

Layered Synthesis of Latent Gaussian Trees

Layered Synthesis of Latent Gaussian Trees Layered Synthesis of Latent Gaussian Trees Ali Moharrer, Shuangqing Wei, George T. Amariucai, and Jing Deng arxiv:1608.04484v2 [cs.it] 7 May 2017 Abstract A new synthesis scheme is proposed to generate

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

More information

Bayesian probability theory and generative models

Bayesian probability theory and generative models Bayesian probability theory and generative models Bruno A. Olshausen November 8, 2006 Abstract Bayesian probability theory provides a mathematical framework for peforming inference, or reasoning, using

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy Coding and Information Theory Chris Williams, School of Informatics, University of Edinburgh Overview What is information theory? Entropy Coding Information Theory Shannon (1948): Information theory is

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

On Information Maximization and Blind Signal Deconvolution

On Information Maximization and Blind Signal Deconvolution On Information Maximization and Blind Signal Deconvolution A Röbel Technical University of Berlin, Institute of Communication Sciences email: roebel@kgwtu-berlinde Abstract: In the following paper we investigate

More information

Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels

Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Nan Liu and Andrea Goldsmith Department of Electrical Engineering Stanford University, Stanford CA 94305 Email:

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

A variational radial basis function approximation for diffusion processes

A variational radial basis function approximation for diffusion processes A variational radial basis function approximation for diffusion processes Michail D. Vrettas, Dan Cornford and Yuan Shen Aston University - Neural Computing Research Group Aston Triangle, Birmingham B4

More information

A Monte Carlo Sequential Estimation for Point Process Optimum Filtering

A Monte Carlo Sequential Estimation for Point Process Optimum Filtering 2006 International Joint Conference on Neural Networks Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006 A Monte Carlo Sequential Estimation for Point Process Optimum Filtering

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

On Common Information and the Encoding of Sources that are Not Successively Refinable

On Common Information and the Encoding of Sources that are Not Successively Refinable On Common Information and the Encoding of Sources that are Not Successively Refinable Kumar Viswanatha, Emrah Akyol, Tejaswi Nanjundaswamy and Kenneth Rose ECE Department, University of California - Santa

More information

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 Probabilistic

More information

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality

More information

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing Linear recurrent networks Simpler, much more amenable to analytic treatment E.g. by choosing + ( + ) = Firing rates can be negative Approximates dynamics around fixed point Approximation often reasonable

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Parametrizations of Discrete Graphical Models

Parametrizations of Discrete Graphical Models Parametrizations of Discrete Graphical Models Robin J. Evans www.stat.washington.edu/ rje42 10th August 2011 1/34 Outline 1 Introduction Graphical Models Acyclic Directed Mixed Graphs Two Problems 2 Ingenuous

More information

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,

More information

Directed Graphical Models or Bayesian Networks

Directed Graphical Models or Bayesian Networks Directed Graphical Models or Bayesian Networks Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Bayesian Networks One of the most exciting recent advancements in statistical AI Compact

More information

Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) September 26 & October 3, 2017 Section 1 Preliminaries Kullback-Leibler divergence KL divergence (continuous case) p(x) andq(x) are two density distributions. Then the KL-divergence is defined as Z KL(p

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information