Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 12, JUNE 15, 2013

Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs

Christopher J. Quinn, Student Member, IEEE, Negar Kiyavash, Senior Member, IEEE, and Todd P. Coleman, Senior Member, IEEE

Abstract: Recently, directed information graphs have been proposed as concise graphical representations of the statistical dynamics among multiple random processes. A directed edge from one node to another indicates that the past of one random process statistically affects the future of another, given the past of all other processes. When the number of processes is large, computing those conditional dependence tests becomes difficult. Also, when the number of interactions becomes too large, the graph no longer facilitates visual extraction of relevant information for decision-making. This work considers approximating the true joint distribution on multiple random processes by another, whose directed information graph has at most one parent for any node. Under a Kullback-Leibler (KL) divergence minimization criterion, we show that the optimal approximate joint distribution can be obtained by maximizing a sum of directed informations. In particular, each directed information calculation only involves statistics among a pair of processes and can be efficiently estimated, and given all pairwise directed informations, an efficient minimum weight spanning directed tree algorithm can be solved to find the best tree. We demonstrate the efficacy of this approach using simulated and experimental data. In both, the approximations preserve the relevant information for decision-making.

Index Terms: Directed trees, graphical models, network approximations.

Manuscript received July 03, 2012; revised March 31, 2013; accepted April 03, 2013. Date of publication April 19, 2013; date of current version May 22, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ignacio Santamaria. The work of C. J. Quinn was supported in part by the NSF IGERT fellowship under Grant Number DGE and by the Department of Energy Computational Science Graduate Fellowship under Grant Number DE-FG02-97ER. The work of N. Kiyavash was supported by AFOSR under Grants FA, FA, and FA; and by NSF Grant CCF CAR. The work of T. P. Coleman was supported by the NSF Science & Technology Center Grant CCF and NSF Grant CCF. The material in this paper was presented in part at the International Symposium on Information Theory and Its Applications, Taichung, Taiwan, October 2010. C. Quinn is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA (e-mail: quinn7@illinois.edu). N. Kiyavash is with the Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA (e-mail: kiyavash@illinois.edu). T. P. Coleman is with the Department of Bioengineering, University of California, San Diego, La Jolla, CA, USA (e-mail: tpcoleman@ucsd.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier /TSP

I. INTRODUCTION

MANY important inference problems involve reasoning about dynamic relationships between time series. In such cases, observations of multiple time series are recorded, and the objective of the observer is to understand relationships between the past of some processes and how they affect the future of others.
In general, with knowledge of joint statistics among multiple random processes, such decision-making could in principle be done. However, if these processes exhibit complex dynamics, gaining that knowledge can be prohibitive from computational and storage perspectives. As such, it is appealing to develop an approximation of the joint distribution on multiple random processes which can be calculated efficiently and is less complex for inference. Moreover, simplified representations of joint statistics can facilitate easier visualization and human comprehension of complex relationships. For instance, in situations such as network intrusion detection, decision making in adversarial environments, and first response tasks where a rapid decision is required, such representations can greatly aid situation awareness and the decision-making process.

Graphical models have been used to describe both full and approximating joint distributions of random variables [1]. For many graphical models, random variables are represented as nodes and edges between pairs encode conditional dependence relationships. Markov networks and Bayesian networks are two common examples. Markov networks are undirected graphs, while Bayesian networks are directed acyclic graphs. A Bayesian network's graphical structure depends on the variable indexing. This methodology could in principle be applied to describe multiple random processes. For example, given the time indices and the random processes, we could create a Markov or Bayesian network on all of the time-indexed random variables. However, if the number of time indices or the number of processes is large, this could be prohibitive from a complexity and visualization perspective.

We have recently developed an alternative graphical model, termed a directed information graph, to parsimoniously describe statistical dynamics among a collection of random processes [2]. Each process is represented by a single node, and directed edges encode conditional independence relationships pertaining to how the past of one process affects the future of another, given the past of all other processes. As such, in this framework, directed edges represent directions of causal influence.¹ They are motivated by simplified generative models of coupled dynamical systems. They admit cycles, and can thus represent feedback between processes. Under appropriate technical conditions, they do not depend on process indexing and moreover are unique [2]. Directed information graphs are particularly attractive when we have a large number of time units: they collapse a graph whose nodes are all of the time-indexed random variables to a graph with one node per process, and a directed arrow encodes information about temporal dynamics.

¹ Causal in this work refers to Granger causality [3], where one process is said to causally influence another if the past of the first helps to predict the future of the second, already conditioned on the past of the second and of all other processes.

In some situations, however, the number of processes we record can itself be very large, and in such a situation each conditional independence test, involving all processes, can be difficult to evaluate. Moreover, even visualization of the directed information graph, which can have an edge between every ordered pair of processes, can be cumbersome. As such, the benefits of having a precise picture of the statistical dynamics might be outweighed by the costs in computation, storage, and ease of use to a human. An approximation of the joint distribution that preserves a small number of important interactions could alleviate this problem.

As an example, consider how a social network company negotiates the costs of advertisements to its users with another company. If the preferences or actions of certain users on average have a large causal influence on the subsequent preferences or actions of friends in their network, then a business might be willing to pay more money to advertise to those users, as compared to the downstream friends with less influence. By paying to advertise to the influential users, the business is effectively advertising to many. For the social network company and the business to agree on pricing, however, it needs to be agreed on which users are the most influential. With a complicated social network, such as Fig. 1(a), a simple procedure to identify whom to advertise to, and for how much, might be onerous to develop. However, if the social network company could approximate the user interaction dynamics in a simplified but accurate picture, such as Fig. 1(b), then it would be much easier for the business to see whom to target to influence the whole network. This is the motivation of this work.

Directed trees, such as Fig. 1(b), are among the simplest structures that could be used for approximation: each node has at most one parent. In reducing the computational, storage, and visual complexity substantially, directed trees are much more amenable to analysis than the full structure. They also depict a clear hierarchy between nodes. We here consider the problem of finding the best approximation of a joint distribution on random processes so that each node in the subsequent directed information graph has at most one parent. We will demonstrate the efficacy of this approach from complexity, visualization, and decision-making perspectives.

II. OUR CONTRIBUTION AND RELATED WORK

A. Our Contribution

In this paper, we consider the problem of approximating a joint distribution on random processes by another joint distribution on random processes, where each node in the subsequent directed information graph has at most one parent. We consider two variants, one in which the approximation's directed information graph need not be connected, and a second for which it is (i.e., it must be a directed tree). We use the KL divergence as the metric to find the best approximation, and show that the subsequent optimization problem is equivalent to maximizing a sum of pairwise directed informations. Both cases only require knowledge of statistics between pairs of processes to find the best such approximations. For the connected case, a minimum weight spanning tree algorithm can be solved in time that is quadratic in the number of processes.
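Stated compactly, in notation of our own choosing (the paper's displayed equations are not reproduced in this transcription): writing the m processes as X_1, ..., X_m and letting a(i) denote the single parent assigned to process i, Theorems 1 and 2 below say, roughly, that

```latex
\min_{\hat{P}}\; D\!\left(P_{\mathbf{X}} \,\middle\|\, \hat{P}_{\mathbf{X}}\right)
\;\;\Longleftrightarrow\;\;
\max_{a(\cdot)}\; \sum_{i=1}^{m} I\!\left(\mathbf{X}_{a(i)} \to \mathbf{X}_{i}\right),
```

where the minimization is over approximations in which each process keeps at most one parent (optionally constrained so that the kept edges form a directed tree), and I(· → ·) is the pairwise directed information defined in Section III.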
Fig. 1. Graphical models of the full user influence dynamics of the social network and an approximation of those dynamics. Arrow widths correspond to strengths of influence. Although some structural components are lost, the graph of the approximation makes it clear whom to target and the paths of expected influence. (a) The full influence structure of the social network. (b) The graph of an approximation which captures key structural components.

Both approximations have similar algorithmic and storage complexity. We demonstrate the utility of this approach in simulated and experimental data, where the relevant information for decision-making is maintained in the tree approximation.

B. Related Work

Chow and Liu proposed an efficient algorithm to find an optimal tree-structured approximation to the joint distribution on a collection of random variables [4]. Since then, many works have been developed to approximate joint distributions, in terms of underlying Markov and Bayesian networks. There have been other works that approximated with more complicated structures; see [1] for an overview.

In this work, we use directed information graphs to describe the joint distribution on random processes, in terms of how the past of some processes statistically affects the future of others. These were recently introduced in [2], where it was also shown that they are a generalized embodiment of Granger's notion of causality [3] and that, under mild assumptions, they are equivalent to minimal generative model graphs.

Many methods to estimate joint distributions on random processes from a generative model perspective have recently been developed. Shalizi et al. [5] have developed methods using a stochastic state reconstruction algorithm for discrete-valued processes to identify interactions between processes and functional communities. Group Lasso is a method to infer the causal relationships between multivariate autoregressive models [6]. Bolstad et al. recently showed conditions under which the estimates of Group Lasso are consistent [7]. Puig et al. have developed a multidimensional shrinkage-thresholding operator which arises in problems with Group Lasso type penalties [8].

Tan and Willsky analyzed sample complexity for identifying the topology of a tree-structured network of LTI systems [9]. Materassi et al. have developed methods based on Wiener filtering to statistically infer causal influences in linear stochastic dynamical systems; consistency results have been derived for the case when the underlying dynamics have a tree structure [10], [11]. For the setting where the directed information graph has a tree structure and some processes are not observed, Etesami et al. developed a procedure to recover the graph [12].

C. Paper Organization

The paper organization is as follows. Section III establishes definitions and notation. In Section IV, we review directed information graphs and discuss their relationship with generative models of stochastic dynamical systems to motivate our approach. In Section V, we present our main results pertaining to finding the optimal approximations of the joint distribution where each node can have at most one parent, both unconstrained and when the structure is constrained to be a directed tree. Here we show that in both cases, the optimization simplifies to maximizing a sum of pairwise directed informations. In Section VI, we analyze the algorithmic and storage complexity of the approximations. In Section VII, we review parametric estimation, evaluate the performance of the approximations in a simulated binary classification experiment, and showcase the utility of this approach in elucidating the wavelike phenomena in the joint neural spiking activity of primary motor cortex.

III. DEFINITIONS AND NOTATION

This section presents probabilistic notation and information-theoretic definitions and identities that will be used throughout the remainder of the manuscript. Unless otherwise noted, the definitions and identities come from Cover & Thomas [13]. For a sequence $a_1, a_2, \ldots$, denote $a^j = (a_1, \ldots, a_j)$. For any Borel space, we also work with its Borel sets and the space of probability measures on it. Consider two probability measures $P$ and $Q$ on the same space. $P$ is absolutely continuous with respect to $Q$ (denoted $P \ll Q$) if $Q(A) = 0$ implies $P(A) = 0$ for all measurable sets $A$. If $P \ll Q$, denote the Radon-Nikodym derivative as the random variable $\frac{dP}{dQ}$ that satisfies $P(A) = \int_A \frac{dP}{dQ}\, dQ$ for all measurable $A$. The Kullback-Leibler divergence between $P$ and $Q$ is defined as

$$D(P \,\|\, Q) = \mathbb{E}_P\!\left[\log \tfrac{dP}{dQ}\right] \text{ if } P \ll Q, \quad \text{and } D(P \,\|\, Q) = \infty \text{ otherwise.} \qquad (1)$$

For a sample space, sigma-algebra, and probability measure, we denote the corresponding probability space as the usual triple. Throughout this paper, we will consider $m$ random processes over a time horizon of length $n$, where each process at each time takes values in a Borel space. Denote the $i$th random variable at time $t$ by $X_{i,t}$, the $i$th random process by $\mathbf{X}_i$, and the whole collection of all $m$ random processes by $\mathbf{X}$. The underlying probability measure induces the corresponding joint distribution on all of the variables, joint distributions on pairs of processes, and marginal distributions on individual processes. With slight abuse of notation, denote $\mathbf{X} = \mathbf{X}_i$ for some $i$ and $\mathbf{Y} = \mathbf{X}_j$ for some $j \neq i$, and denote the conditional distribution and causally conditioned distribution of $\mathbf{Y}$ given $\mathbf{X}$ as

$$P_{\mathbf{Y}\mid\mathbf{X}}(y^n \mid x^n) = \prod_{t=1}^{n} P\big(y_t \mid y^{t-1}, x^{n}\big), \qquad P_{\mathbf{Y}\|\mathbf{X}}(y^n \,\|\, x^n) = \prod_{t=1}^{n} P\big(y_t \mid y^{t-1}, x^{t-1}\big). \qquad (2)$$

Note the similarity between the two expressions in (2), except that in causal conditioning the future of $\mathbf{X}$ is not conditioned on [14]. The mutual information and directed information [15] between random process $\mathbf{X}$ and random process $\mathbf{Y}$ are

$$I(\mathbf{X};\mathbf{Y}) = \mathbb{E}\!\left[\log \frac{dP_{\mathbf{X},\mathbf{Y}}}{d(P_{\mathbf{X}} \times P_{\mathbf{Y}})}\right], \qquad I(\mathbf{X}\to\mathbf{Y}) = \mathbb{E}\!\left[\log \frac{dP_{\mathbf{Y}\|\mathbf{X}}}{dP_{\mathbf{Y}}}\right] = \sum_{t=1}^{n} I\big(X^{t-1}; Y_t \mid Y^{t-1}\big). \qquad (3)$$

Conceptually, mutual information and directed information are related. However, while mutual information quantifies statistical correlation (in the colloquial sense of statistical interdependence), directed information quantifies statistical causation. Note that $I(\mathbf{X};\mathbf{Y}) = I(\mathbf{Y};\mathbf{X})$, but in general $I(\mathbf{X}\to\mathbf{Y}) \neq I(\mathbf{Y}\to\mathbf{X})$.
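As a concrete illustration of the directed information in (3), the following is a minimal sketch (not an estimator used in the paper) of a plug-in estimate of the per-step rate I(X_{t-1}; Y_t | Y_{t-1}) for a pair of binary processes, under the simplifying assumption that the pair is jointly stationary and Markov of order one; the simulated coupling and all names are illustrative only.

```python
import numpy as np

def di_rate_plugin(x, y):
    """Plug-in estimate of the per-step directed information rate
    I(X -> Y) ~ I(X_{t-1}; Y_t | Y_{t-1}) for binary sequences x, y,
    assuming joint stationarity and Markov order one."""
    # Empirical joint pmf of (x_{t-1}, y_{t-1}, y_t).
    counts = np.zeros((2, 2, 2))
    for t in range(1, len(y)):
        counts[x[t - 1], y[t - 1], y[t]] += 1
    p = counts / counts.sum()
    # Conditional mutual information I(X_{t-1}; Y_t | Y_{t-1}).
    di = 0.0
    for a in range(2):          # value of x_{t-1}
        for b in range(2):      # value of y_{t-1}
            for c in range(2):  # value of y_t
                if p[a, b, c] == 0:
                    continue
                p_b = p[:, b, :].sum()
                p_ab = p[a, b, :].sum()
                p_bc = p[:, b, c].sum()
                di += p[a, b, c] * np.log(p[a, b, c] * p_b / (p_ab * p_bc))
    return di  # nats per time step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.integers(0, 2, size=n)           # i.i.d. driver process
    y = np.zeros(n, dtype=int)
    for t in range(1, n):
        # Hypothetical coupling: Y_t copies X_{t-1} with probability 0.8.
        y[t] = x[t - 1] if rng.random() < 0.8 else 1 - x[t - 1]
    print("I(X -> Y) rate estimate:", di_rate_plugin(x, y))
    print("I(Y -> X) rate estimate:", di_rate_plugin(y, x))
```

With this hypothetical coupling, the X-to-Y estimate should be clearly positive while the reverse direction should be near zero, illustrating the asymmetry noted above.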
Remark 1: Note that in (2), there is no conditioning on the present. This follows Marko's definition [14] and is consistent with Granger causality [3]. Massey [15] and Kramer [16] later included conditioning on the present input for the specific setting of communication channels. In such settings, since the directions of causation (e.g., which process is the input and which is the output) are known, it is convenient to work with synchronized time, for which conditioning on the present is meaningful. Note, however, that by conditioning on the present in (2), in a binary symmetric channel (for example) with input X, output Y, and no feedback, the directed information from output to input would be positive, even though the output does not influence the input. Directed information has been shown to play important roles in characterizing the capacity of channels with feedback [17]-[19], quantifying achievable rates for source coding with feedforward [20], for feedback control over noisy channels [21], [22], and in gambling, hypothesis testing, and portfolio theory [23]. See [24] for examples and further discussion.

Remark 2: This work is in the setting of discrete time, such as sampled continuous-time processes. Under appropriate technical assumptions, directed information can be directly extended to continuous time on an interval. Define, for each time, one sigma-algebra generated by the past of all processes and another generated by the past of all processes excluding the process of interest.

If we assume that all processes are well-behaved (i.e., on Polish spaces), then regular versions of the corresponding conditional probabilities exist almost surely [25], and the directed information in continuous time is given in complete analogy with discrete time. Connections between directed information in continuous time, causal continuous-time estimation, and communication in continuous time have also recently been proposed [26]. A treatment of the continuous-time setting is outside the scope of this work.

IV. BACKGROUND AND MOTIVATING EXAMPLE: APPROXIMATING THE STRUCTURE OF DYNAMICAL SYSTEMS

In this section, we describe the problem of identifying the structure of a stochastic dynamical system, and then approximating it with another stochastic dynamical system. We will review the definitions and basic properties of directed information graphs. We first consider an example of a deterministic dynamical system described in state-space format in terms of coupled differential equations.

Example 1: Consider a system with three deterministic processes which evolves according to a set of coupled differential equations.

Fig. 2. Directed information graph and a causal dependence tree approximation for the dynamical system in Example 1. (a) Full causal dependence structure, the directed information graph. (b) Causal dependence tree approximation (7).

Fig. 2(a) depicts the dependence structure of this system, and Fig. 2(b) depicts a corresponding directed tree approximation of that structure. We refer to such structures as causal dependence trees. A similar procedure can be used for networks of stochastic processes, where the system is described in a time-evolving manner through conditional probabilities. Consider three processes formed by including i.i.d. noises in the above dynamical system and relabeling the time indices. The system can alternatively be described through the joint distribution up to the time horizon. Given the full past of the whole network, the future of each process can be constructed. In many cases, some processes do not depend on the past of every other process, but only some subset of other processes. Suppose we can simplify the above equations by removing all of the dependencies of how one process evolves given others. Because of the causal structure of the dynamical system and the statistical independence of the noises, given the full past, the present values are conditionally independent:

(4)

This structure can be depicted graphically (see Fig. 2(a)). We can further approximate this dynamical system by approximating the functions with ones whose generative models have fewer inputs. More generally, we will make the analogous assumption about the chain rule and how each process at a given time is conditionally independent of the others, given the full past of all processes.

Assumption 1: Equation (4) holds, and the relevant conditional distributions are absolutely continuous with respect to some common reference measure.

A large class of stochastic systems satisfy Assumption 1. For example, coupled stochastic processes described by an Ito stochastic differential equation with independent Brownian noise satisfy the continuous-time equivalent of this assumption [2].
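As a concrete (hypothetical) stand-in for the kind of system described in Example 1, the following minimal sketch simulates three coupled noisy processes in which each process is driven by its own past, the past of its parent processes, and independent noise, so the conditional independence in (4) holds by construction; the particular coupling and all names are illustrative only, not the equations of Example 1.

```python
import numpy as np

def simulate_network(n, parents, coef=0.5, self_coef=0.3, noise=0.1, seed=0):
    """Simulate coupled scalar processes X_1, ..., X_m where each X_i(t)
    depends on its own past, the past of its parent processes, and
    independent Gaussian noise, so Assumption 1 holds by construction.
    `parents` maps each process index to a list of parent indices
    (a hypothetical structure standing in for Example 1)."""
    m = len(parents)
    rng = np.random.default_rng(seed)
    x = np.zeros((n, m))
    for t in range(1, n):
        for i in range(m):
            drive = sum(np.sin(x[t - 1, j]) for j in parents[i])
            x[t, i] = (self_coef * x[t - 1, i]
                       + coef * drive
                       + noise * rng.standard_normal())
    return x

# Hypothetical structure: X_1 -> X_2 -> X_3, with feedback X_3 -> X_1.
data = simulate_network(n=5000, parents={0: [2], 1: [0], 2: [1]})
print(data.shape)
```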

Granger argued that this is a valid assumption for real-world systems, provided the sampling rate is high [3]. We can rewrite (4) using the causal conditioning notation (2). As in the deterministic case, often the evolution of one process does not depend on every other process, but only some subset. We can remove the unnecessary dependencies. The dependence structure of this stochastic system is represented by Fig. 2(a). We next generalize this procedure. For each process, let there be a potential subset of parent processes, and define the corresponding induced probability measure:

(5)

To find a minimal graph, for each process, we would like to find the smallest set of parents that describes the dynamics of that process as well as the whole network does:

(6)

In Example 1, these parent sets would correspond to the processes appearing in each process's evolution equation. The parent sets can be independently minimized so that (6) holds. With these minimal parent sets, we can define the graphical model we will use throughout this discussion.²

Definition 4.1: A directed information graph is a directed graph where each process is represented by a node, and there is a directed edge from one process to another if and only if the former belongs to the latter's parent set, where the cardinalities of the parent sets are minimal such that (6) holds.

Lemma 4.2 ([2]): Under Assumption 1, directed information graphs are unique. Furthermore, for a given process, a directed edge is placed from another process to it if and only if the corresponding causally conditioned directed information is positive.

Directed information graphs can have cycles, representing feedback between processes, and can even be complete. For some systems, there might be a large number of influences between processes, with varying magnitudes. For analysis and even storage purposes, it can be helpful to have succinct approximations. For the stochastic system of Example 1, we can apply a similar approximation as was done in the discrete case. Thus, our causal dependence tree approximation to these stochastic processes is:

(7)

This approximation is represented graphically in Fig. 2(b). Although the system in Example 1 only had three processes, with a large number of processes the directed information graph could be quite complex, and difficult to compute and analyze visually. As we will show, it is possible to construct efficient optimal tree-like approximations to the directed information graph, and these approximations do not suffer greatly in decision-making performance nor in visualization of relevant features.

V. MAIN RESULT: BEST PARENT AND CAUSAL DEPENDENCE TREE APPROXIMATIONS

We now describe two approaches to approximate joint distributions of networks of stochastic processes, with corresponding low-complexity directed information graphs. In both cases, at most a single parent will be kept. The first case is an unconstrained optimization. The second constrains the approximating structure to be a causal dependence tree (this was presented in part in [27]). Minimizing the KL divergence between the full and approximating joint distributions in both cases will result in a sum of pairwise directed informations.

We first examine the problem of finding the best approximation where each process has at most one parent. See Fig. 3(a). Consider the joint distribution of the random processes, each of length n. We will consider approximations of the form

(8)

where a selection function assigns the single parent of each process. Consider the set of all such approximations. We want to find the one that minimizes the KL divergence.

Theorem 1:

(9)

² In [2], minimal generative model graphs are defined by Definition 4.1. Under mild technical assumptions they are equivalent to directed information graphs; for clarity we refer to them together as directed information graphs.
Proof: First define the product distribution

(10)

which is equivalent to the joint distribution when the processes are statistically independent. Note that all of the candidate approximations, as well as the true joint distribution, are absolutely continuous with respect to this product distribution. Thus, the Radon-Nikodym derivative satisfies the chain rule [28]:

(11)

(12)-(17)

where (12) applies (11) and rearranges; (13) follows because the corresponding term does not depend on the choice of approximation; (14) follows from (8) and (10); (15) follows from (1); (16) follows from (3); and (17) follows from the choice of each parent affecting only a single term in the sum. Thus, finding the optimal structure where each node has at most one parent is equivalent to individually maximizing pairwise directed informations. The process is described in Algorithm 1, whose input is the set of all pairwise marginal distributions of the processes.

Algorithm 1 (Best Parent): For each ordered pair of processes, compute the pairwise directed information from the first to the second; then, for each process, select as its parent a process maximizing that value.

Thus, Algorithm 1 will return the best possible approximation where only pairwise interactions are preserved. It is possible, though, that the resulting structure could be disconnected. See Fig. 3(a). For some applications, such as picking a single most influential user in a group of friends for targeted advertising, it is useful to have a connected structure with a dominant node. See Fig. 3(b).

Fig. 3. Examples of directed information graph approximations. The best parent approximation is better in terms of KL divergence. However, the best tree approximation is connected and has a clearly distinguished root with paths from the root to all other nodes. Thus, it is more useful for applications such as targeted advertising. (a) Best parent approximation. (b) Best tree approximation.

Next consider the case where candidate approximations have causal dependence tree structures. The approximations have the form

(18)

where the node ordering is given by a permutation and the root node's "parent" is a deterministic constant. Consider the set of all possible causal dependence tree approximations. Like before, we want to find the one that minimizes the KL divergence.

Theorem 2:

(19)

Proof: The proof is similar to the proof of Theorem 1, except that (16) cannot be broken up, as the structural constraint couples the choice of the permutation and the parent assignments.

Because the maximization became decoupled in Theorem 1, there was a simple algorithm to find the best structure, and that algorithm could be run in a distributed manner. Although that does not happen here, note that the optimal tree maximizes a sum of pairwise directed information values. Each value corresponds to an edge weight for one directed edge in a complete directed graph on the processes. To find the tree with maximal weight, we can employ a maximum-weight directed spanning tree (MWDST) algorithm. We discuss MWDST algorithms in Section VI-A. Algorithm 2 describes the procedure to find the best approximating distribution with a causal dependence tree structure.
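Since the algorithm listings did not survive this transcription cleanly, the following is a minimal sketch of both procedures, assuming the pairwise directed informations have already been estimated into a matrix di[i, j] ≈ I(X_i → X_j); the use of networkx's maximum_spanning_arborescence routine (an Edmonds-type MWDST solver) is our choice for illustration, not necessarily the implementation used in the paper.

```python
import numpy as np
import networkx as nx

def best_parent(di):
    """Algorithm 1 (sketch): for each process, keep the single incoming
    edge with the largest pairwise directed information, or no parent
    if every candidate value is zero.  di[i, j] ~ I(X_i -> X_j)."""
    m = di.shape[0]
    parents = {}
    for j in range(m):
        weights = di[:, j].copy()
        weights[j] = -np.inf               # a node cannot be its own parent
        i = int(np.argmax(weights))
        parents[j] = i if weights[i] > 0 else None
    return parents

def best_tree(di):
    """Algorithm 2 (sketch): maximum-weight directed spanning tree over
    the complete graph whose edge (i, j) has weight I(X_i -> X_j)."""
    m = di.shape[0]
    g = nx.DiGraph()
    g.add_edges_from((i, j, {"weight": float(di[i, j])})
                     for i in range(m) for j in range(m) if i != j)
    arb = nx.maximum_spanning_arborescence(g, attr="weight")
    return sorted(arb.edges())             # list of (parent, child) pairs

if __name__ == "__main__":
    # Toy pairwise directed-information matrix (nats), purely illustrative.
    di = np.array([[0.0, 0.4, 0.1],
                   [0.0, 0.0, 0.3],
                   [0.2, 0.1, 0.0]])
    print("best parents:", best_parent(di))
    print("best tree edges:", best_tree(di))
```

In this toy example the best-parent solution forms a cycle (and so is not a tree), while the spanning-arborescence step returns a connected tree with a single root, illustrating the distinction drawn in Fig. 3.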

Algorithm 2 (Causal Dependence Tree): For each ordered pair of processes, compute the pairwise directed information from the first to the second; then find a maximum-weight directed spanning tree of the complete directed graph whose edge weights are those values.

Since the set of approximations considered by Algorithm 1 contains the causal dependence tree approximations considered by Algorithm 2, Algorithm 1's approximations are superior to Algorithm 2's in terms of KL divergence. For some applications, however, having a directed tree can be more useful for analysis and allocation of resources.

Remark 3: Chow and Liu [4] solved an analogous problem for a collection of random variables. They developed an algorithm to efficiently find the best tree-structured approximation for a Markov network (or, equivalently for that problem, a Bayesian network). They showed that using KL divergence, finding the best tree approximation was equivalent to maximizing a sum of mutual informations. They used a maximum weight spanning tree to solve the optimization. Thus, even though directed information graphs have different properties than Markov or Bayesian networks, and operate on a collection of random processes, not variables, the method for finding the best tree is analogous.

Next, we consider the consistency of these algorithms in the setting of estimating from data. We discuss estimation in Section VII-A.

Theorem 3: Suppose the true directed information graph is a causal dependence tree and the pairwise directed information estimates converge almost surely (a.s.). Then, for the output of Algorithm 2,

(20)

Proof: By Lemma 4.2, the true graph is the unique tree structure with maximal sum of directed informations along its edges. Algorithm 2 finds the tree with maximal weight, and thus if the edge weights converge almost surely, then the tree estimate does also.

Note that an analogous result holds for Algorithm 1 in the case that the true directed information graph has at most one parent per node. In general, there could be multiple approximation structures with the same maximal weight, so the estimated structure might not converge, but the approximating structures picked would almost surely be among those of maximal weight.

VI. COMPLEXITY

In this section, we will discuss both the complexity of the algorithms and the storage requirements for the solution.

A. Algorithmic Complexity

Both algorithms first compute the directed information values between each pair. For discrete random processes, computing the directed information, a divergence (3), in general involves summations over exponentially large alphabets. Computing one directed information value for two processes of length n in general takes time exponential in n. If the distributions are assumed to be jointly Markov of some fixed order, then it becomes linear in n. Thus, computing the directed information for each of the quadratically many ordered pairs of processes is tractable when Markovicity is assumed. For both algorithms, computation of the directed informations can be done independently: the loops over processes in both algorithms can be done in a distributed fashion. Note that computing only pairwise relationships is computationally much more tractable than in the full case. To identify the true directed information graph, divergence calculations using the whole network of processes are used [2], requiring time exponential in both the number of processes and their length without Markovicity, and exponential in the number of processes even with Markovicity.

Furthermore, the computation can be reduced by first calculating mutual informations in both algorithms. Equation (4) holding means that there is no instantaneous coupling between the processes, which implies that the mutual information between a pair decomposes as the sum of the two directed informations [14]. Since mutual and directed informations are nonnegative, the mutual information bounds each directed information. Either directed information can later be computed to resolve both. After computing the pairwise directed informations, Algorithm 1 then picks the best parent for each process, so the total runtime remains quadratic in the number of processes assuming Markovicity.
Algorithm 2 additionally computes a maximum weight spanning tree. Chu and Liu [29], Edmonds [30], and Bock [31] independently developed an efficient MWDST algorithm which, for the complete graphs arising here, runs in time quadratic in the number of processes. Thus, like Algorithm 1, Algorithm 2 also runs in time quadratic in the number of processes. Note that Humblet [32] proposed a distributed MWDST algorithm, which constructs the maximum weight tree for each node as root. In some applications, it is useful to be able to choose from multiple potential roots.

B. Storage Complexity

In the full joint distribution, there are as many random variables as processes times time steps, and each possible outcome might have a unique probability. Thus, for discrete variables over a finite alphabet, the total storage for the joint distribution is exponential in both the number of processes and the process length. Both approximations we consider reduce the full joint distribution to pairwise distributions, so the storage grows only with the number of retained pairs and the storage of each pairwise distribution. Further, if the approximations have Markovicity of some fixed order, the storage of each pairwise distribution is reduced further.

VII. APPLICATIONS TO SIMULATED AND EXPERIMENTAL DATA

In this section, we demonstrate the efficacy of the approximations in a classification experiment with simulated time series. We then show the approximations capture important structural characteristics of a network of brain cells from a neuroscience experiment. First we discuss parametric estimation of directed information from data.

A. Parametric Estimation

While a thorough discussion of estimation techniques is outside the scope of this work, for completeness we briefly describe the consistent parametric estimation technique for directed information proposed in [24] and [33] and applied to study brain cell networks. Afterwards, we discuss estimation for the specific setting of autoregressive time series.

1) Point-Process Parametric Models: Let X and Y denote two binary time series of brain cell activity, where a sample is 1 if the corresponding cell was active at that time and 0 otherwise. Truccolo et al. [34] proposed modeling how one cell's activity depends on its own past and the past of the other cell using a point-process framework.

The conditional log-likelihood has the standard point-process form, in which the conditional intensity function [34] can be interpreted as the propensity of a cell to be active at a given time based on its own past and the past of the other cell. The Markov orders of the two history terms are assumed to be unknown. To avoid overfitting, the minimum description length penalty [35] is used to select the maximum likelihood estimate (MLE). This penalty balances the Shannon code-length of encoding the activity with causal side information using an MLE against the code-length required to describe the MLE. The directed information estimates take the form of a normalized difference of maximized log-likelihoods,

(21)

where the two terms use the MLE parameter vectors for their respective models (with and without the other cell's past). Under stationarity, ergodicity, and Markovicity, almost sure convergence of these estimates is shown in [24]. These results extend to general parametric classes. Also note that for the setting of finite alphabets, [36] proposed universal estimation of directed information using context tree weighting.

2) Autoregressive Models: Next consider the specific parametric class of autoregressive time series. Specifically, a Markov-order-one autoregressive (AR-1) model for the collection of processes is

$$\mathbf{Z}_t = A \mathbf{Z}_{t-1} + \mathbf{W}_t, \qquad (22)$$

where $A$ is a coefficient matrix and $\mathbf{W}_t$ is i.i.d. white Gaussian noise with covariance matrix $\Sigma$. The noise components are assumed to be independent, so $\Sigma$ is diagonal. The coefficients are fixed, so for two processes modeled jointly as AR-1,

(23)
(24)

where (23) follows from stationarity and Markovicity, and (24) follows from the Gaussian entropy formula (p. 249 of [13]) expressed in terms of determinants of covariance matrices. Note that, by the recurrence relation (22), these covariance matrices can be computed as

(25)

Thus, directed information estimates can be computed by first finding the least squares estimate of the coefficient matrix in (22), then computing covariance matrices using (25), and then computing (24).

B. Classification Experiment

We tested the utility of the approximation methods using a binary classification experiment.

1) Setup: For each choice of the number of processes, pairs of AR-1 models were randomly generated. Each element of the coefficient matrix was generated i.i.d. from a fixed distribution. The noise covariance was a diagonal matrix with entries selected uniformly at random from an interval. For each AR model, time series of several lengths were generated using (22). The coefficients were estimated using least squares for each of the time series. The best parent and best tree approximations were computed using the estimated coefficients. The directed informations between each pair were estimated using the method in Section VII-A2. To identify the MWDSTs, a Matlab implementation of Edmonds's algorithm [37] was used. Coefficients for the second model of each pair were generated and estimated likewise. Next, classification was performed. For each pair of models, test time series were generated from each model using (22). First, the log-likelihood of each time step conditioned on the past was computed for the full distributions using the coefficient estimates, and the frequency of correct classification was calculated. Next, the log-likelihoods using the best parent approximations with estimated coefficients were calculated, and then those for the best tree approximations. This was repeated for each set of coefficient estimates, corresponding to the different training lengths.
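As a rough illustration of the estimation route just described, the following is a minimal sketch for a pair of processes, assuming joint stationarity and a bivariate AR-1 model as in (22); the truncation of the conditioning past to a fixed number of lags, the use of scipy's discrete Lyapunov solver for the stationary covariance in place of (25), and all variable names are our choices, not necessarily those of the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def fit_ar1(z):
    """Least-squares fit of Z_t = A Z_{t-1} + W_t; returns (A, Sigma_W)."""
    z_past, z_next = z[:-1], z[1:]
    coef, *_ = np.linalg.lstsq(z_past, z_next, rcond=None)
    a = coef.T
    resid = z_next - z_past @ coef
    return a, np.cov(resid.T)

def gaussian_di_rate(a, sigma, lags=20):
    """Approximate per-step directed information rate I(X -> Y) in nats for
    the stationary bivariate Gaussian AR-1 model Z_t = A Z_{t-1} + W_t,
    Z = (X, Y), using
      0.5 * log( Var(Y_t | past Y) / Var(Y_t | past X, past Y) ),
    with the conditioning past truncated to `lags` samples."""
    k = solve_discrete_lyapunov(a, sigma)          # stationary Cov(Z_t)
    # Joint covariance of (Z_{t-lags}, ..., Z_{t-1}, Z_t), built from
    # Cov(Z_s, Z_{s-j}) = A^j K.
    p = lags + 1
    s = np.zeros((2 * p, 2 * p))
    for i in range(p):
        for j in range(p):
            d = i - j
            block = (np.linalg.matrix_power(a, d) @ k if d >= 0
                     else k @ np.linalg.matrix_power(a.T, -d))
            s[2 * i:2 * i + 2, 2 * j:2 * j + 2] = block
    y_t = 2 * lags + 1                                  # index of Y_t
    past_y = [2 * i + 1 for i in range(lags)]           # Y_{t-lags..t-1}
    past_xy = [j for i in range(lags) for j in (2 * i, 2 * i + 1)]
    def cond_var(given):
        sgg = s[np.ix_(given, given)]
        stg = s[np.ix_([y_t], given)]
        return s[y_t, y_t] - (stg @ np.linalg.solve(sgg, stg.T)).item()
    return 0.5 * np.log(cond_var(past_y) / cond_var(past_xy))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a_true = np.array([[0.5, 0.0],     # X driven only by its own past
                       [0.4, 0.3]])    # Y driven by past X and past Y
    n = 50_000
    z = np.zeros((n, 2))
    for t in range(1, n):
        z[t] = a_true @ z[t - 1] + 0.1 * rng.standard_normal(2)
    a_hat, sigma_hat = fit_ar1(z)
    print("I(X -> Y) rate:", gaussian_di_rate(a_hat, sigma_hat))
    # Swapping coordinates gives the model for (Y, X), hence the reverse rate.
    print("I(Y -> X) rate:",
          gaussian_di_rate(a_hat[::-1, ::-1], sigma_hat[::-1, ::-1]))
```

For the hypothetical coefficients above, the X-to-Y rate should be clearly positive while the reverse rate should be near zero.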
2) Results: The results of these classification experiments are shown in Fig. 4. The classification rates are averaged over the 100 trials. Error bars show standard deviation. The best parent approximations perform only slightly better than the best tree approximations. Both performed close to an 85% correct classification rate, improving slightly with a larger number of processes. Classification using the full distribution noticeably improves with the number of processes. This is due to the increased complexity of the distributions; with more processes, there are more relationships to distinguish the distributions. There are many more edges in the full distribution than in the best parent and best tree approximations. Despite having significantly fewer edges, the approximations capture enough structure to distinguish models. The effect of having a small number of samples to estimate AR coefficients is more dramatic as the number of processes increases. For larger numbers of processes, coefficients estimated with the two longer time-series lengths performed almost identically.

C. Application to Experimental Data

We now discuss an application of these methods to analysis of neural activity. A recent study computed the directed information graph for a group of neurons in a monkey's primary motor cortex [24]. Using that graph, the authors identified a dominant axis of local interactions, which corresponded to the known, primary direction of wave propagation of regional synchronous activity, believed to mediate information transfer [38]. We show that the best parent and best tree approximations preserve that dominant axis.

Fig. 4. Classification rate between pairs of autoregressive series. For each setting, pairs of autoregressive coefficients were generated randomly. Classification was performed using the full structures, best parent approximations, and best tree approximations, using coefficients estimated from time series of several lengths. Error bars depict standard deviation.

The monkey was performing a sequence of arm-reaching tasks. Its arm was constrained to move along a horizontal surface. Each task involved presentation of a randomly positioned, fixed target on the surface, the monkey moving its hand to meet the target, and a reward (a drop of juice) given to the monkey if it was successful. For more details, see [24], [39]. Neural activity in the primary motor cortex was recorded by an implanted silicon micro-electrode array. The recorded waveforms were filtered and processed to produce, for each neuron that was detected, a sequence of times when that neuron became active (e.g., it "spiked"). The 37 neurons with the greatest total activity (number of spikes) were used for analysis.

To study the flow of activity between individual neurons, we constructed a directed information graph on the collection of neurons. To simplify computation, pairwise directed informations were estimated using the parametric estimation procedure discussed in Section VII-A. Fig. 5(a) depicts the pairwise directed information graph. The relative positions of the neurons in the graph correspond to the relative positions of the recording electrodes. The blue arrow indicates a dominant orientation of the edges. This orientation along the rostro-caudal axis is consistent with the direction of propagation of local field potentials, which researchers believe mediates information transfer between regions [38].

We applied Algorithms 1 and 2 to this data set. The structure of the dependence tree approximation is shown in Fig. 5(b). The best parent approximation is almost identical; the only differences are that the parents of nodes 28 and 13 are 27 and 3, respectively. The original graph had 117 edges with many complicated loops. Both approximations reduced the number of edges by a third, improving the clarity of the graph. Both approximations preserve the dominant edge orientation pertaining to wave propagation depicted by the blue arrow in Fig. 5(a). This suggests that these approximation methodologies preserve relevant information for decision-making and visualization for analysis of mechanistic biological phenomena.

Fig. 5. Graphical structures of nonzero pairwise directed information values from [24] and the causal dependence tree approximation. The best parent approximation was almost identical and is not shown. The blue arrow in Fig. 5(a) depicts a dominant orientation of the edges. That orientation is consistent with the direction of propagation of local field potentials, which is believed to mediate information transfer [38]. Both approximations preserve that structure. (a) Graphical structure of nonzero pairwise directed information values. (b) Causal dependence tree approximation.
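The pairwise directed informations for this data set were estimated with the point-process approach of Section VII-A1. As a rough illustration of that style of estimator (not the paper's exact procedure: history orders are fixed here rather than selected by MDL, and a plain logistic model stands in for the point-process GLM of [34]), consider the following sketch; the simulated coupling and all names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def history_matrix(series, order):
    """Row t holds (series[t-1], ..., series[t-order]) for t = order..n-1."""
    n = len(series)
    return np.column_stack([series[order - k: n - k] for k in range(1, order + 1)])

def log_likelihood(model, features, target):
    """Average Bernoulli log-likelihood of `target` under a fitted model."""
    p = np.clip(model.predict_proba(features)[:, 1], 1e-12, 1 - 1e-12)
    return np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def di_glm_estimate(x, y, order=3):
    """Rough estimate of I(X -> Y) in nats per time step: the gain in average
    log-likelihood of Y_t when the past of X is added to the past of Y,
    mirroring the log-likelihood-ratio form of (21)."""
    target = y[order:]
    hy, hx = history_matrix(y, order), history_matrix(x, order)
    hfull = np.hstack([hy, hx])
    reduced = LogisticRegression().fit(hy, target)
    full = LogisticRegression().fit(hfull, target)
    return log_likelihood(full, hfull, target) - log_likelihood(reduced, hy, target)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50_000
    x = (rng.random(n) < 0.2).astype(int)        # driver "cell"
    y = np.zeros(n, dtype=int)
    for t in range(1, n):
        rate = 0.05 + 0.6 * x[t - 1]             # hypothetical coupling
        y[t] = int(rng.random() < rate)
    print("I(X -> Y) estimate:", di_glm_estimate(x, y))
    print("I(Y -> X) estimate:", di_glm_estimate(y, x))
```

With the hypothetical coupling above, the estimate in the X-to-Y direction should be clearly positive and the reverse direction near zero; without a penalty such as MDL, small negative values can appear in the null direction due to finite-sample effects.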

VIII. CONCLUSION

In this work, we presented efficient methods to optimally approximate networks of stochastic, dynamically interacting processes with low-complexity approximations. Both approximations only required pairwise marginal statistics between the processes, which computationally are significantly more tractable than the full joint distribution. Also, the corresponding directed information graphs are much more accessible to analysis and practical usage for many applications.

An important line of future work involves investigating methods to approximate with other, more complicated structures. Best-parent approximations and causal dependence tree approximations will always reduce the storage complexity dramatically and facilitate analysis. However, for some applications, it might be desirable to have slightly more complicated structures, such as connected graphs with at most three parents for each node. Such approximations highlight a richer set of interactions and feedback while still being visually and computationally simpler to analyze than the full structure. Although it might not always be possible to efficiently find optimal approximations of such graphical complexity, even near-optimal approximations could prove quite beneficial to real-world applications.

ACKNOWLEDGMENT

The authors would like to thank J. Etesami and M. Rodrigues for their assistance with computer simulations.

REFERENCES

[1] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA, USA: MIT Press.
[2] C. Quinn, N. Kiyavash, and T. Coleman, "Directed information graphs," 2012, arXiv preprint.
[3] C. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, vol. 37, no. 3, 1969.
[4] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Inf. Theory, vol. IT-14, no. 3.
[5] C. Shalizi, M. Camperi, and K. Klinkner, "Discovering functional communities in dynamical networks," in Statistical Network Analysis: Models, Issues, and New Directions, ser. Lecture Notes in Computer Science, E. Airoldi, D. Blei, S. Fienberg, A. Goldenberg, E. Xing, and A. Zheng, Eds. Berlin, Germany: Springer, 2007, vol. 4503.
[6] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Royal Statist. Soc.: Ser. B (Statist. Method.), vol. 68, no. 1.
[7] A. Bolstad, B. Van Veen, and R. Nowak, "Causal network inference via group sparse regularization," IEEE Trans. Signal Process., vol. 59, no. 6.
[8] A. Puig, A. Wiesel, G. Fleury, and A. Hero, "Multidimensional shrinkage-thresholding operator and group lasso penalties," IEEE Signal Process. Lett., vol. 18, no. 6, Jun.
[9] V. Tan and A. Willsky, "Sample complexity for topology estimation in networks of LTI systems," in Proc. 50th IEEE Conf. Decision Control (CDC-ECC), 2011.
[10] D. Materassi and G. Innocenti, "Topological identification in networks of dynamical systems," IEEE Trans. Autom. Control, vol. 55, no. 8.
[11] D. Materassi and M. Salapaka, "On the problem of reconstructing an unknown topology via locality properties of the Wiener filter," IEEE Trans. Autom. Control, vol. 57, no. 7.
[12] J. Etesami, N. Kiyavash, and T. P. Coleman, "Learning minimal latent directed information trees," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2012.
[13] T. Cover and J. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience.
[14] H. Marko, "The bidirectional communication theory: A generalization of information theory," IEEE Trans. Commun., vol. COM-21, no. 12, Dec.
[15] J. Massey, "Causality, feedback and directed information," in Proc. Int. Symp. Inf. Theory and Its Appl., 1990.
[16] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, Electr. and Comput. Eng. Dept., Swiss Fed. Inst. Technol. (ETH), Zürich, Switzerland.
[17] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Trans. Inf. Theory, vol. 55, no. 1.
[18] H. Permuter, T. Weissman, and A. Goldsmith, "Finite state channels with time-invariant deterministic feedback," IEEE Trans. Inf. Theory, vol. 55, no. 2.
[19] C. Li and N. Elia, "The information flow and capacity of channels with noisy feedback," 2011, arXiv preprint.
[20] R. Venkataramanan and S. Pradhan, "Source coding with feed-forward: Rate-distortion theorems and error exponents for a general source," IEEE Trans. Inf. Theory, vol. 53, no. 6.
[21] N. Martins and M. Dahleh, "Feedback control in the presence of noisy channels: Bode-like fundamental limitations of performance," IEEE Trans. Autom. Control, vol. 53, no. 7, Aug.
[22] S. K. Gorantla, "The interplay between information and control theory within interactive decision-making problems," Ph.D. dissertation, Electr. and Comput. Eng. Dept., Univ. Illinois at Urbana-Champaign, Champaign, IL, USA.
[23] H. Permuter, Y. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," IEEE Trans. Inf. Theory, vol. 57, no. 6.
[24] C. Quinn, T. Coleman, N. Kiyavash, and N. Hatsopoulos, "Estimating the directed information to infer causal relationships in ensemble neural spike train recordings," J. Comput. Neurosci., vol. 30, no. 1.
[25] R. M. Gray, Probability, Random Processes, and Ergodic Properties. New York, NY, USA: Springer.
[26] T. Weissman, Y.-H. Kim, and H. Permuter, "Directed information, causal estimation, and communication in continuous time," IEEE Trans. Inf. Theory, vol. 59, no. 3.
[27] C. Quinn, T. Coleman, and N. Kiyavash, "Approximating discrete probability distributions with causal dependence trees," in Proc. IEEE Int. Symp. Inf. Theory Appl. (ISITA), 2010.
[28] H. Royden and P. Fitzpatrick, Real Analysis, 3rd ed. New York, NY, USA: Macmillan.
[29] Y. Chu and T. Liu, "On the shortest arborescence of a directed graph," Sci. Sinica, vol. 14, p. 270.
[30] J. Edmonds, "Optimum branchings," J. Res. Natl. Bur. Stand., Sect. B, vol. 71.
[31] F. Bock, "An algorithm to construct a minimum directed spanning tree in a directed network," Develop. Operat. Res., vol. 1.
[32] P. Humblet, "A distributed algorithm for minimum weight directed spanning trees," IEEE Trans. Commun., vol. COM-31, no. 6.
[33] S. Kim, D. Putrino, S. Ghosh, and E. N. Brown, "A Granger causality measure for point process models of ensemble neural spiking activity," PLoS Comput. Biol., vol. 7, no. 3, Mar.
[34] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown, "A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects," J. Neurophysiol., vol. 93, no. 2.
[35] P. D. Grünwald, The Minimum Description Length Principle. Cambridge, MA, USA: MIT Press.
[36] J. Jiao, H. Permuter, L. Zhao, Y. Kim, and T. Weissman, "Universal estimation of directed information," 2012, arXiv preprint.
[37] G. Li, Maximum Weight Spanning Tree (Undirected), computer software, Jun. [Online].
[38] D. Rubino, K. Robbins, and N. Hatsopoulos, "Propagating waves mediate information transfer in the motor cortex," Nature Neurosci., vol. 9, no. 12.
[39] W. Wu and N. Hatsopoulos, "Evidence against a single coordinate system representation in the motor cortex," Experiment. Brain Res., vol. 175, no. 2.

Christopher J. Quinn (S'11), photograph and biography not available at the time of publication.

Negar Kiyavash (SM'13), photograph and biography not available at the time of publication.

Todd P. Coleman (SM'12), photograph and biography not available at the time of publication.


More information

Large-Deviations and Applications for Learning Tree-Structured Graphical Models

Large-Deviations and Applications for Learning Tree-Structured Graphical Models Large-Deviations and Applications for Learning Tree-Structured Graphical Models Vincent Tan Stochastic Systems Group, Lab of Information and Decision Systems, Massachusetts Institute of Technology Thesis

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Chapter 2 Review of Classical Information Theory

Chapter 2 Review of Classical Information Theory Chapter 2 Review of Classical Information Theory Abstract This chapter presents a review of the classical information theory which plays a crucial role in this thesis. We introduce the various types of

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Information Dynamics Foundations and Applications

Information Dynamics Foundations and Applications Gustavo Deco Bernd Schürmann Information Dynamics Foundations and Applications With 89 Illustrations Springer PREFACE vii CHAPTER 1 Introduction 1 CHAPTER 2 Dynamical Systems: An Overview 7 2.1 Deterministic

More information

COURSE INTRODUCTION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

COURSE INTRODUCTION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception COURSE INTRODUCTION COMPUTATIONAL MODELING OF VISUAL PERCEPTION 2 The goal of this course is to provide a framework and computational tools for modeling visual inference, motivated by interesting examples

More information

Auxiliary signal design for failure detection in uncertain systems

Auxiliary signal design for failure detection in uncertain systems Auxiliary signal design for failure detection in uncertain systems R. Nikoukhah, S. L. Campbell and F. Delebecque Abstract An auxiliary signal is an input signal that enhances the identifiability of a

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Representation of Correlated Sources into Graphs for Transmission over Broadcast Channels

Representation of Correlated Sources into Graphs for Transmission over Broadcast Channels Representation of Correlated s into Graphs for Transmission over Broadcast s Suhan Choi Department of Electrical Eng. and Computer Science University of Michigan, Ann Arbor, MI 80, USA Email: suhanc@eecs.umich.edu

More information

A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models

A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models arxiv:1811.03735v1 [math.st] 9 Nov 2018 Lu Zhang UCLA Department of Biostatistics Lu.Zhang@ucla.edu Sudipto Banerjee UCLA

More information

Communication Theory II

Communication Theory II Communication Theory II Lecture 8: Stochastic Processes Ahmed Elnakib, PhD Assistant Professor, Mansoura University, Egypt March 5 th, 2015 1 o Stochastic processes What is a stochastic process? Types:

More information

High-dimensional graphical model selection: Practical and information-theoretic limits

High-dimensional graphical model selection: Practical and information-theoretic limits 1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4 Inference: Exploiting Local Structure aphne Koller Stanford University CS228 Handout #4 We have seen that N inference exploits the network structure, in particular the conditional independence and the

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Directed acyclic graphs and the use of linear mixed models

Directed acyclic graphs and the use of linear mixed models Directed acyclic graphs and the use of linear mixed models Siem H. Heisterkamp 1,2 1 Groningen Bioinformatics Centre, University of Groningen 2 Biostatistics and Research Decision Sciences (BARDS), MSD,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Online Forest Density Estimation

Online Forest Density Estimation Online Forest Density Estimation Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr UAI 16 1 Outline 1 Probabilistic Graphical Models 2 Online Density Estimation 3 Online Forest Density

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks

Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks Yoshua Bengio Dept. IRO Université de Montréal Montreal, Qc, Canada, H3C 3J7 bengioy@iro.umontreal.ca Samy Bengio IDIAP CP 592,

More information

Generalized Writing on Dirty Paper

Generalized Writing on Dirty Paper Generalized Writing on Dirty Paper Aaron S. Cohen acohen@mit.edu MIT, 36-689 77 Massachusetts Ave. Cambridge, MA 02139-4307 Amos Lapidoth lapidoth@isi.ee.ethz.ch ETF E107 ETH-Zentrum CH-8092 Zürich, Switzerland

More information

arxiv:cs/ v2 [cs.it] 1 Oct 2006

arxiv:cs/ v2 [cs.it] 1 Oct 2006 A General Computation Rule for Lossy Summaries/Messages with Examples from Equalization Junli Hu, Hans-Andrea Loeliger, Justin Dauwels, and Frank Kschischang arxiv:cs/060707v [cs.it] 1 Oct 006 Abstract

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Diversity-Promoting Bayesian Learning of Latent Variable Models

Diversity-Promoting Bayesian Learning of Latent Variable Models Diversity-Promoting Bayesian Learning of Latent Variable Models Pengtao Xie 1, Jun Zhu 1,2 and Eric Xing 1 1 Machine Learning Department, Carnegie Mellon University 2 Department of Computer Science and

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics)

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics) COMS 4771 Lecture 1 1. Course overview 2. Maximum likelihood estimation (review of some statistics) 1 / 24 Administrivia This course Topics http://www.satyenkale.com/coms4771/ 1. Supervised learning Core

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

6.867 Machine learning, lecture 23 (Jaakkola)

6.867 Machine learning, lecture 23 (Jaakkola) Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context

More information

Feasibility Conditions for Interference Alignment

Feasibility Conditions for Interference Alignment Feasibility Conditions for Interference Alignment Cenk M. Yetis Istanbul Technical University Informatics Inst. Maslak, Istanbul, TURKEY Email: cenkmyetis@yahoo.com Tiangao Gou, Syed A. Jafar University

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Course content (will be adapted to the background knowledge of the class):

Course content (will be adapted to the background knowledge of the class): Biomedical Signal Processing and Signal Modeling Lucas C Parra, parra@ccny.cuny.edu Departamento the Fisica, UBA Synopsis This course introduces two fundamental concepts of signal processing: linear systems

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Probabilistic Graphical Models for Image Analysis - Lecture 1

Probabilistic Graphical Models for Image Analysis - Lecture 1 Probabilistic Graphical Models for Image Analysis - Lecture 1 Alexey Gronskiy, Stefan Bauer 21 September 2018 Max Planck ETH Center for Learning Systems Overview 1. Motivation - Why Graphical Models 2.

More information

Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex Y Gao M J Black E Bienenstock S Shoham J P Donoghue Division of Applied Mathematics, Brown University, Providence, RI 292 Dept

More information

On Scalable Coding in the Presence of Decoder Side Information

On Scalable Coding in the Presence of Decoder Side Information On Scalable Coding in the Presence of Decoder Side Information Emrah Akyol, Urbashi Mitra Dep. of Electrical Eng. USC, CA, US Email: {eakyol, ubli}@usc.edu Ertem Tuncel Dep. of Electrical Eng. UC Riverside,

More information

Tight Lower Bounds on the Ergodic Capacity of Rayleigh Fading MIMO Channels

Tight Lower Bounds on the Ergodic Capacity of Rayleigh Fading MIMO Channels Tight Lower Bounds on the Ergodic Capacity of Rayleigh Fading MIMO Channels Özgür Oyman ), Rohit U. Nabar ), Helmut Bölcskei 2), and Arogyaswami J. Paulraj ) ) Information Systems Laboratory, Stanford

More information

Belief propagation decoding of quantum channels by passing quantum messages

Belief propagation decoding of quantum channels by passing quantum messages Belief propagation decoding of quantum channels by passing quantum messages arxiv:67.4833 QIP 27 Joseph M. Renes lempelziv@flickr To do research in quantum information theory, pick a favorite text on classical

More information

Estimation of linear non-gaussian acyclic models for latent factors

Estimation of linear non-gaussian acyclic models for latent factors Estimation of linear non-gaussian acyclic models for latent factors Shohei Shimizu a Patrik O. Hoyer b Aapo Hyvärinen b,c a The Institute of Scientific and Industrial Research, Osaka University Mihogaoka

More information

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department Decision-Making under Statistical Uncertainty Jayakrishnan Unnikrishnan PhD Defense ECE Department University of Illinois at Urbana-Champaign CSL 141 12 June 2010 Statistical Decision-Making Relevant in

More information

de Blanc, Peter Ontological Crises in Artificial Agents Value Systems. The Singularity Institute, San Francisco, CA, May 19.

de Blanc, Peter Ontological Crises in Artificial Agents Value Systems. The Singularity Institute, San Francisco, CA, May 19. MIRI MACHINE INTELLIGENCE RESEARCH INSTITUTE Ontological Crises in Artificial Agents Value Systems Peter de Blanc Machine Intelligence Research Institute Abstract Decision-theoretic agents predict and

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Independent Component Analysis. Contents

Independent Component Analysis. Contents Contents Preface xvii 1 Introduction 1 1.1 Linear representation of multivariate data 1 1.1.1 The general statistical setting 1 1.1.2 Dimension reduction methods 2 1.1.3 Independence as a guiding principle

More information

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 Probabilistic

More information

A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model

A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model The MIT Faculty has made this article openly available Please share how this access benefits you Your story matters Citation

More information

Equivalence in Non-Recursive Structural Equation Models

Equivalence in Non-Recursive Structural Equation Models Equivalence in Non-Recursive Structural Equation Models Thomas Richardson 1 Philosophy Department, Carnegie-Mellon University Pittsburgh, P 15213, US thomas.richardson@andrew.cmu.edu Introduction In the

More information

2012 IEEE International Symposium on Information Theory Proceedings

2012 IEEE International Symposium on Information Theory Proceedings Decoding of Cyclic Codes over Symbol-Pair Read Channels Eitan Yaakobi, Jehoshua Bruck, and Paul H Siegel Electrical Engineering Department, California Institute of Technology, Pasadena, CA 9115, USA Electrical

More information

Hierarchy. Will Penny. 24th March Hierarchy. Will Penny. Linear Models. Convergence. Nonlinear Models. References

Hierarchy. Will Penny. 24th March Hierarchy. Will Penny. Linear Models. Convergence. Nonlinear Models. References 24th March 2011 Update Hierarchical Model Rao and Ballard (1999) presented a hierarchical model of visual cortex to show how classical and extra-classical Receptive Field (RF) effects could be explained

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

How Random is a Coin Toss? Bayesian Inference and the Symbolic Dynamics of Deterministic Chaos

How Random is a Coin Toss? Bayesian Inference and the Symbolic Dynamics of Deterministic Chaos How Random is a Coin Toss? Bayesian Inference and the Symbolic Dynamics of Deterministic Chaos Christopher C. Strelioff Center for Complex Systems Research and Department of Physics University of Illinois

More information

Variable Length Codes for Degraded Broadcast Channels

Variable Length Codes for Degraded Broadcast Channels Variable Length Codes for Degraded Broadcast Channels Stéphane Musy School of Computer and Communication Sciences, EPFL CH-1015 Lausanne, Switzerland Email: stephane.musy@ep.ch Abstract This paper investigates

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

(Preprint of paper to appear in Proc Intl. Symp. on Info. Th. and its Applications, Waikiki, Hawaii, Nov , 1990.)

(Preprint of paper to appear in Proc Intl. Symp. on Info. Th. and its Applications, Waikiki, Hawaii, Nov , 1990.) (Preprint of paper to appear in Proc. 1990 Intl. Symp. on Info. Th. and its Applications, Waikiki, Hawaii, ov. 27-30, 1990.) CAUSALITY, FEEDBACK AD DIRECTED IFORMATIO James L. Massey Institute for Signal

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information