Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 12, JUNE 15, 2013

Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs

Christopher J. Quinn, Student Member, IEEE, Negar Kiyavash, Senior Member, IEEE, and Todd P. Coleman, Senior Member, IEEE

Abstract: Recently, directed information graphs have been proposed as concise graphical representations of the statistical dynamics among multiple random processes. A directed edge from one node to another indicates that the past of one random process statistically affects the future of another, given the past of all other processes. When the number of processes is large, computing those conditional dependence tests becomes difficult. Also, when the number of interactions becomes too large, the graph no longer facilitates visual extraction of relevant information for decision-making. This work considers approximating the true joint distribution on multiple random processes by another, whose directed information graph has at most one parent for any node. Under a Kullback-Leibler (KL) divergence minimization criterion, we show that the optimal approximate joint distribution can be obtained by maximizing a sum of directed informations. In particular, each directed information calculation only involves statistics among a pair of processes and can be efficiently estimated, and given all pairwise directed informations, an efficient minimum weight spanning directed tree algorithm can be solved to find the best tree. We demonstrate the efficacy of this approach using simulated and experimental data. In both, the approximations preserve the relevant information for decision-making.

Index Terms: Directed trees, graphical models, network approximations.

Manuscript received July 03, 2012; revised March 31, 2013; accepted April 03, 2013. Date of publication April 19, 2013; date of current version May 22, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ignacio Santamaria. The work of C. J. Quinn was supported in part by the NSF IGERT fellowship under Grant Number DGE and by the Department of Energy Computational Science Graduate Fellowship under Grant Number DE-FG02-97ER. The work of N. Kiyavash was supported by AFOSR under Grants FA, FA, and FA; and by NSF Grant CCF CAR. The work of T. P. Coleman was supported by the NSF Science & Technology Center Grant CCF and NSF Grant CCF. The material in this paper was presented in part at the International Symposium on Information Theory and Its Applications, Taichung, Taiwan, October 2010. C. Quinn is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA (e-mail: quinn7@illinois.edu). N. Kiyavash is with the Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA (e-mail: kiyavash@illinois.edu). T. P. Coleman is with the Department of Bioengineering, University of California, San Diego, La Jolla, CA, USA (e-mail: tpcoleman@ucsd.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier /TSP

I. INTRODUCTION

MANY important inference problems involve reasoning about dynamic relationships between time series. In such cases, observations of multiple time series are recorded, and the objective of the observer is to understand relationships between the past of some processes and how they affect the future of others.
In general, with knowledge of joint statistics among multiple random processes, such decision-making could in principle be done. However, if these processes exhibit complex dynamics, gaining that knowledge can be prohibitive from computational and storage perspectives. As such, it is appealing to develop an approximation of the joint distribution on multiple random processes which can be calculated efficiently and is less complex for inference. Moreover, simplified representations of joint statistics can facilitate easier visualization and human comprehension of complex relationships. For instance, in situations such as network intrusion detection, decision making in adversarial environments, and first response tasks where a rapid decision is required, such representations can greatly aid situation awareness and the decision-making process.

Graphical models have been used to describe both full and approximating joint distributions of random variables [1]. For many graphical models, random variables are represented as nodes and edges between pairs encode conditional dependence relationships. Markov networks and Bayesian networks are two common examples. Markov networks are undirected graphs, while Bayesian networks are directed acyclic graphs. A Bayesian network's graphical structure depends on the variable indexing. This methodology could in principle be applied to describe multiple random processes. For example, given the time indices and the random processes, we could create a Markov or Bayesian network on all of the time-indexed random variables. However, if the number of time indices or the number of processes is large, this could be prohibitive from a complexity and visualization perspective.

We have recently developed an alternative graphical model, termed a directed information graph, to parsimoniously describe statistical dynamics among a collection of random processes [2]. Each process is represented by a single node, and directed edges encode conditional independence relationships pertaining to how the past of one process affects the future of another, given the past of all other processes. As such, in this framework, directed edges represent directions of causal influence.¹ They are motivated by simplified generative models of coupled dynamical systems. They admit cycles, and can thus represent feedback between processes. Under appropriate technical conditions, they do not depend on process indexing and moreover are unique [2]. Directed information graphs are particularly attractive when we have a large number of time units: they collapse a graph whose nodes are all of the time-indexed random variables to a graph with one node per process, and a directed arrow encodes information about temporal dynamics.

¹ Causal in this work refers to Granger causality [3], where one process is said to causally influence another if the past of the first helps to predict the future of the second, already conditioned on the past of the second and of all other processes.

In some situations, however, the number of processes we record can itself be very large, and in such a situation each conditional independence test, involving all processes, can be difficult to evaluate. Moreover, even visualization of the directed information graph, which can have an edge between every ordered pair of processes, can be cumbersome. As such, the benefits of having a precise picture of the statistical dynamics might be outweighed by the costs in computation, storage, and ease of use to a human. An approximation of the joint distribution that preserves a small number of important interactions could alleviate this problem.

As an example, consider how a social network company negotiates the costs of advertisements to its users with another company. If the preferences or actions of certain users on average have a large causal influence on the subsequent preferences or actions of friends in their network, then a business might be willing to pay more money to advertise to those users, as compared to the downstream friends with less influence. By paying to advertise to the influential users, the business is effectively advertising to many. For the social network company and the business to agree on pricing, however, it needs to be agreed on which users are the most influential. With a complicated social network, such as Fig. 1(a), a simple procedure to identify whom to advertise to, and for how much, might be onerous to develop. However, if the social network company could approximate the user interaction dynamics in a simplified but accurate picture, such as Fig. 1(b), then it would be much easier for the business to see whom to target to influence the whole network. This is the motivation of this work.

Directed trees, such as Fig. 1(b), are among the simplest structures that could be used for approximation: each node has at most one parent. In reducing the computational, storage, and visual complexity substantially, directed trees are much more amenable to analysis than the full structure. They also depict a clear hierarchy between nodes. We here consider the problem of finding the best approximation of a joint distribution on random processes so that each node in the subsequent directed information graph has at most one parent. We will demonstrate the efficacy of this approach from complexity, visualization, and decision-making perspectives.

II. OUR CONTRIBUTION AND RELATED WORK

A. Our Contribution

In this paper, we consider the problem of approximating a joint distribution on random processes by another joint distribution on random processes, where each node in the subsequent directed information graph has at most one parent. We consider two variants, one in which the approximation's directed information graph need not be connected, and a second for which it is (i.e., it must be a directed tree). We use the KL divergence as the metric to find the best approximation, and show that the subsequent optimization problem is equivalent to maximizing a sum of pairwise directed informations. Both cases only require knowledge of statistics between pairs of processes to find the best such approximations. For the connected case, a minimum weight spanning tree algorithm can be solved in time that is quadratic in the number of processes.
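Stated compactly, in notation of our own choosing (the paper's displayed equations are not reproduced in this transcription): writing the m processes as X_1, ..., X_m and letting a(i) denote the single parent assigned to process i, Theorems 1 and 2 below say, roughly, that

```latex
\min_{\hat{P}}\; D\!\left(P_{\mathbf{X}} \,\middle\|\, \hat{P}_{\mathbf{X}}\right)
\;\;\Longleftrightarrow\;\;
\max_{a(\cdot)}\; \sum_{i=1}^{m} I\!\left(\mathbf{X}_{a(i)} \to \mathbf{X}_{i}\right),
```

where the minimization is over approximations in which each process keeps at most one parent (optionally constrained so that the kept edges form a directed tree), and I(· → ·) is the pairwise directed information defined in Section III.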
Fig. 1. Graphical models of the full user influence dynamics of the social network and an approximation of those dynamics. Arrow widths correspond to strengths of influence. Although some structural components are lost, the graph of the approximation makes it clear whom to target and the paths of expected influence. (a) The full influence structure of the social network. (b) The graph of an approximation which captures key structural components.

Both approximations have similar algorithmic and storage complexity. We demonstrate the utility of this approach in simulated and experimental data, where the relevant information for decision-making is maintained in the tree approximation.

B. Related Work

Chow and Liu proposed an efficient algorithm to find an optimal tree-structured approximation to the joint distribution on a collection of random variables [4]. Since then, many works have been developed to approximate joint distributions, in terms of underlying Markov and Bayesian networks. There have been other works that approximated with more complicated structures; see [1] for an overview.

In this work, we use directed information graphs to describe the joint distribution on random processes, in terms of how the past of some processes statistically affects the future of others. These were recently introduced in [2], where it was also shown that they are a generalized embodiment of Granger's notion of causality [3] and that, under mild assumptions, they are equivalent to minimal generative model graphs.

Many methods to estimate joint distributions on random processes from a generative model perspective have recently been developed. Shalizi et al. [5] have developed methods using a stochastic state reconstruction algorithm for discrete-valued processes to identify interactions between processes and functional communities. Group Lasso is a method to infer the causal relationships between multivariate autoregressive models [6]. Bolstad et al. recently showed conditions under which the estimates of Group Lasso are consistent [7]. Puig et al. have developed a multidimensional shrinkage-thresholding operator which arises in problems with Group Lasso type penalties [8].

Tan and Willsky analyzed sample complexity for identifying the topology of a tree-structured network of LTI systems [9]. Materassi et al. have developed methods based on Wiener filtering to statistically infer causal influences in linear stochastic dynamical systems; consistency results have been derived for the case when the underlying dynamics have a tree structure [10], [11]. For the setting where the directed information graph has a tree structure and some processes are not observed, Etesami et al. developed a procedure to recover the graph [12].

C. Paper Organization

The paper organization is as follows. Section III establishes definitions and notation. In Section IV, we review directed information graphs and discuss their relationship with generative models of stochastic dynamical systems to motivate our approach. In Section V, we present our main results pertaining to finding the optimal approximations of the joint distribution where each node can have at most one parent, both unconstrained and when the structure is constrained to be a directed tree. Here we show that in both cases, the optimization simplifies to maximizing a sum of pairwise directed informations. In Section VI, we analyze the algorithmic and storage complexity of the approximations. In Section VII, we review parametric estimation, evaluate the performance of the approximations in a simulated binary classification experiment, and showcase the utility of this approach in elucidating the wavelike phenomena in the joint neural spiking activity of primary motor cortex.

III. DEFINITIONS AND NOTATION

This section presents probabilistic notation and information-theoretic definitions and identities that will be used throughout the remainder of the manuscript. Unless otherwise noted, the definitions and identities come from Cover & Thomas [13]. For a sequence $a_1, a_2, \ldots$, denote $a^j = (a_1, \ldots, a_j)$. For any Borel space, we also work with its Borel sets and the space of probability measures on it. Consider two probability measures $P$ and $Q$ on the same space. $P$ is absolutely continuous with respect to $Q$ (denoted $P \ll Q$) if $Q(A) = 0$ implies $P(A) = 0$ for all measurable sets $A$. If $P \ll Q$, denote the Radon-Nikodym derivative as the random variable $\frac{dP}{dQ}$ that satisfies $P(A) = \int_A \frac{dP}{dQ}\, dQ$ for all measurable $A$. The Kullback-Leibler divergence between $P$ and $Q$ is defined as

$$D(P \,\|\, Q) = \mathbb{E}_P\!\left[\log \tfrac{dP}{dQ}\right] \text{ if } P \ll Q, \quad \text{and } D(P \,\|\, Q) = \infty \text{ otherwise.} \qquad (1)$$

For a sample space, sigma-algebra, and probability measure, we denote the corresponding probability space as the usual triple. Throughout this paper, we will consider $m$ random processes over a time horizon of length $n$, where each process at each time takes values in a Borel space. Denote the $i$th random variable at time $t$ by $X_{i,t}$, the $i$th random process by $\mathbf{X}_i$, and the whole collection of all $m$ random processes by $\mathbf{X}$. The underlying probability measure induces the corresponding joint distribution on all of the variables, joint distributions on pairs of processes, and marginal distributions on individual processes. With slight abuse of notation, denote $\mathbf{X} = \mathbf{X}_i$ for some $i$ and $\mathbf{Y} = \mathbf{X}_j$ for some $j \neq i$, and denote the conditional distribution and causally conditioned distribution of $\mathbf{Y}$ given $\mathbf{X}$ as

$$P_{\mathbf{Y}\mid\mathbf{X}}(y^n \mid x^n) = \prod_{t=1}^{n} P\big(y_t \mid y^{t-1}, x^{n}\big), \qquad P_{\mathbf{Y}\|\mathbf{X}}(y^n \,\|\, x^n) = \prod_{t=1}^{n} P\big(y_t \mid y^{t-1}, x^{t-1}\big). \qquad (2)$$

Note the similarity between the two expressions in (2), except that in causal conditioning the future of $\mathbf{X}$ is not conditioned on [14]. The mutual information and directed information [15] between random process $\mathbf{X}$ and random process $\mathbf{Y}$ are

$$I(\mathbf{X};\mathbf{Y}) = \mathbb{E}\!\left[\log \frac{dP_{\mathbf{X},\mathbf{Y}}}{d(P_{\mathbf{X}} \times P_{\mathbf{Y}})}\right], \qquad I(\mathbf{X}\to\mathbf{Y}) = \mathbb{E}\!\left[\log \frac{dP_{\mathbf{Y}\|\mathbf{X}}}{dP_{\mathbf{Y}}}\right] = \sum_{t=1}^{n} I\big(X^{t-1}; Y_t \mid Y^{t-1}\big). \qquad (3)$$

Conceptually, mutual information and directed information are related. However, while mutual information quantifies statistical correlation (in the colloquial sense of statistical interdependence), directed information quantifies statistical causation. Note that $I(\mathbf{X};\mathbf{Y}) = I(\mathbf{Y};\mathbf{X})$, but in general $I(\mathbf{X}\to\mathbf{Y}) \neq I(\mathbf{Y}\to\mathbf{X})$.
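As a concrete illustration of the directed information in (3), the following is a minimal sketch (not an estimator used in the paper) of a plug-in estimate of the per-step rate I(X_{t-1}; Y_t | Y_{t-1}) for a pair of binary processes, under the simplifying assumption that the pair is jointly stationary and Markov of order one; the simulated coupling and all names are illustrative only.

```python
import numpy as np

def di_rate_plugin(x, y):
    """Plug-in estimate of the per-step directed information rate
    I(X -> Y) ~ I(X_{t-1}; Y_t | Y_{t-1}) for binary sequences x, y,
    assuming joint stationarity and Markov order one."""
    # Empirical joint pmf of (x_{t-1}, y_{t-1}, y_t).
    counts = np.zeros((2, 2, 2))
    for t in range(1, len(y)):
        counts[x[t - 1], y[t - 1], y[t]] += 1
    p = counts / counts.sum()
    # Conditional mutual information I(X_{t-1}; Y_t | Y_{t-1}).
    di = 0.0
    for a in range(2):          # value of x_{t-1}
        for b in range(2):      # value of y_{t-1}
            for c in range(2):  # value of y_t
                if p[a, b, c] == 0:
                    continue
                p_b = p[:, b, :].sum()
                p_ab = p[a, b, :].sum()
                p_bc = p[:, b, c].sum()
                di += p[a, b, c] * np.log(p[a, b, c] * p_b / (p_ab * p_bc))
    return di  # nats per time step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.integers(0, 2, size=n)           # i.i.d. driver process
    y = np.zeros(n, dtype=int)
    for t in range(1, n):
        # Hypothetical coupling: Y_t copies X_{t-1} with probability 0.8.
        y[t] = x[t - 1] if rng.random() < 0.8 else 1 - x[t - 1]
    print("I(X -> Y) rate estimate:", di_rate_plugin(x, y))
    print("I(Y -> X) rate estimate:", di_rate_plugin(y, x))
```

With this hypothetical coupling, the X-to-Y estimate should be clearly positive while the reverse direction should be near zero, illustrating the asymmetry noted above.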
Remark 1: Note that in (2), there is no conditioning on the present. This follows Marko's definition [14] and is consistent with Granger causality [3]. Massey [15] and Kramer [16] later included conditioning on the present input for the specific setting of communication channels. In such settings, since the directions of causation (e.g., which process is the input and which is the output) are known, it is convenient to work with synchronized time, for which conditioning on the present is meaningful. Note, however, that by conditioning on the present in (2), in a binary symmetric channel (for example) with input X, output Y, and no feedback, the directed information from output to input would be positive, even though the output does not influence the input. Directed information has been shown to play important roles in characterizing the capacity of channels with feedback [17]-[19], quantifying achievable rates for source coding with feedforward [20], for feedback control over noisy channels [21], [22], and in gambling, hypothesis testing, and portfolio theory [23]. See [24] for examples and further discussion.

Remark 2: This work is in the setting of discrete time, such as sampled continuous-time processes. Under appropriate technical assumptions, directed information can be directly extended to continuous time on an interval. Define, for each time, one sigma-algebra generated by the past of all processes and another generated by the past of all processes excluding the process of interest.

If we assume that all processes are well-behaved (i.e., on Polish spaces), then regular versions of the corresponding conditional probabilities exist almost surely [25], and the directed information in continuous time is given in complete analogy with discrete time. Connections between directed information in continuous time, causal continuous-time estimation, and communication in continuous time have also recently been proposed [26]. A treatment of the continuous-time setting is outside the scope of this work.

IV. BACKGROUND AND MOTIVATING EXAMPLE: APPROXIMATING THE STRUCTURE OF DYNAMICAL SYSTEMS

In this section, we describe the problem of identifying the structure of a stochastic dynamical system, and then approximating it with another stochastic dynamical system. We will review the definitions and basic properties of directed information graphs. We first consider an example of a deterministic dynamical system described in state-space format in terms of coupled differential equations.

Example 1: Consider a system with three deterministic processes which evolves according to a set of coupled differential equations.

Fig. 2. Directed information graph and a causal dependence tree approximation for the dynamical system in Example 1. (a) Full causal dependence structure, the directed information graph. (b) Causal dependence tree approximation (7).

Fig. 2(a) depicts the dependence structure of this system, and Fig. 2(b) depicts a corresponding directed tree approximation of that structure. We refer to such structures as causal dependence trees. A similar procedure can be used for networks of stochastic processes, where the system is described in a time-evolving manner through conditional probabilities. Consider three processes formed by including i.i.d. noises in the above dynamical system and relabeling the time indices. The system can alternatively be described through the joint distribution up to the time horizon. Given the full past of the whole network, the future of each process can be constructed. In many cases, some processes do not depend on the past of every other process, but only some subset of other processes. Suppose we can simplify the above equations by removing all of the dependencies of how one process evolves given others. Because of the causal structure of the dynamical system and the statistical independence of the noises, given the full past, the present values are conditionally independent:

(4)

This structure can be depicted graphically (see Fig. 2(a)). We can further approximate this dynamical system by approximating the functions with ones whose generative models have fewer inputs. More generally, we will make the analogous assumption about the chain rule and how each process at a given time is conditionally independent of the others, given the full past of all processes.

Assumption 1: Equation (4) holds, and the relevant conditional distributions are absolutely continuous with respect to some common reference measure.

A large class of stochastic systems satisfy Assumption 1. For example, coupled stochastic processes described by an Ito stochastic differential equation with independent Brownian noise satisfy the continuous-time equivalent of this assumption [2].
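As a concrete (hypothetical) stand-in for the kind of system described in Example 1, the following minimal sketch simulates three coupled noisy processes in which each process is driven by its own past, the past of its parent processes, and independent noise, so the conditional independence in (4) holds by construction; the particular coupling and all names are illustrative only, not the equations of Example 1.

```python
import numpy as np

def simulate_network(n, parents, coef=0.5, self_coef=0.3, noise=0.1, seed=0):
    """Simulate coupled scalar processes X_1, ..., X_m where each X_i(t)
    depends on its own past, the past of its parent processes, and
    independent Gaussian noise, so Assumption 1 holds by construction.
    `parents` maps each process index to a list of parent indices
    (a hypothetical structure standing in for Example 1)."""
    m = len(parents)
    rng = np.random.default_rng(seed)
    x = np.zeros((n, m))
    for t in range(1, n):
        for i in range(m):
            drive = sum(np.sin(x[t - 1, j]) for j in parents[i])
            x[t, i] = (self_coef * x[t - 1, i]
                       + coef * drive
                       + noise * rng.standard_normal())
    return x

# Hypothetical structure: X_1 -> X_2 -> X_3, with feedback X_3 -> X_1.
data = simulate_network(n=5000, parents={0: [2], 1: [0], 2: [1]})
print(data.shape)
```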

Granger argued that this is a valid assumption for real-world systems, provided the sampling rate is high [3]. We can rewrite (4) using the causal conditioning notation (2). As in the deterministic case, often the evolution of one process does not depend on every other process, but only some subset. We can remove the unnecessary dependencies. The dependence structure of this stochastic system is represented by Fig. 2(a). We next generalize this procedure. For each process, let there be a potential subset of parent processes, and define the corresponding induced probability measure:

(5)

To find a minimal graph, for each process, we would like to find the smallest set of parents that describes the dynamics of that process as well as the whole network does:

(6)

In Example 1, these parent sets would correspond to the processes appearing in each process's evolution equation. The parent sets can be independently minimized so that (6) holds. With these minimal parent sets, we can define the graphical model we will use throughout this discussion.²

Definition 4.1: A directed information graph is a directed graph where each process is represented by a node, and there is a directed edge from one process to another if and only if the former belongs to the latter's parent set, where the cardinalities of the parent sets are minimal such that (6) holds.

Lemma 4.2 ([2]): Under Assumption 1, directed information graphs are unique. Furthermore, for a given process, a directed edge is placed from another process to it if and only if the corresponding causally conditioned directed information is positive.

Directed information graphs can have cycles, representing feedback between processes, and can even be complete. For some systems, there might be a large number of influences between processes, with varying magnitudes. For analysis and even storage purposes, it can be helpful to have succinct approximations. For the stochastic system of Example 1, we can apply a similar approximation as was done in the discrete case. Thus, our causal dependence tree approximation to these stochastic processes is:

(7)

This approximation is represented graphically in Fig. 2(b). Although the system in Example 1 only had three processes, with a large number of processes the directed information graph could be quite complex, and difficult to compute and analyze visually. As we will show, it is possible to construct efficient optimal tree-like approximations to the directed information graph, and these approximations do not suffer greatly in decision-making performance nor in visualization of relevant features.

V. MAIN RESULT: BEST PARENT AND CAUSAL DEPENDENCE TREE APPROXIMATIONS

We now describe two approaches to approximate joint distributions of networks of stochastic processes, with corresponding low-complexity directed information graphs. In both cases, at most a single parent will be kept. The first case is an unconstrained optimization. The second constrains the approximating structure to be a causal dependence tree (this was presented in part in [27]). Minimizing the KL divergence between the full and approximating joint distributions in both cases will result in a sum of pairwise directed informations.

We first examine the problem of finding the best approximation where each process has at most one parent. See Fig. 3(a). Consider the joint distribution of the random processes, each of length n. We will consider approximations of the form

(8)

where a selection function assigns the single parent of each process. Consider the set of all such approximations. We want to find the one that minimizes the KL divergence.

Theorem 1:

(9)

² In [2], minimal generative model graphs are defined by Definition 4.1. Under mild technical assumptions they are equivalent to directed information graphs; for clarity we refer to them together as directed information graphs.
Proof: First define the product distribution

(10)

which is equivalent to the joint distribution when the processes are statistically independent. Note that all of the candidate approximations, as well as the true joint distribution, are absolutely continuous with respect to this product distribution. Thus, the Radon-Nikodym derivative satisfies the chain rule [28]:

(11)

(12)-(17)

where (12) applies (11) and rearranges; (13) follows because the corresponding term does not depend on the choice of approximation; (14) follows from (8) and (10); (15) follows from (1); (16) follows from (3); and (17) follows from the choice of each parent affecting only a single term in the sum. Thus, finding the optimal structure where each node has at most one parent is equivalent to individually maximizing pairwise directed informations. The process is described in Algorithm 1, whose input is the set of all pairwise marginal distributions of the processes.

Algorithm 1 (Best Parent): For each ordered pair of processes, compute the pairwise directed information from the first to the second; then, for each process, select as its parent a process maximizing that value.

Thus, Algorithm 1 will return the best possible approximation where only pairwise interactions are preserved. It is possible, though, that the resulting structure could be disconnected. See Fig. 3(a). For some applications, such as picking a single most influential user in a group of friends for targeted advertising, it is useful to have a connected structure with a dominant node. See Fig. 3(b).

Fig. 3. Examples of directed information graph approximations. The best parent approximation is better in terms of KL divergence. However, the best tree approximation is connected and has a clearly distinguished root with paths from the root to all other nodes. Thus, it is more useful for applications such as targeted advertising. (a) Best parent approximation. (b) Best tree approximation.

Next consider the case where candidate approximations have causal dependence tree structures. The approximations have the form

(18)

where the node ordering is given by a permutation and the root node's "parent" is a deterministic constant. Consider the set of all possible causal dependence tree approximations. Like before, we want to find the one that minimizes the KL divergence.

Theorem 2:

(19)

Proof: The proof is similar to the proof of Theorem 1, except that (16) cannot be broken up, as the structural constraint couples the choice of the permutation and the parent assignments.

Because the maximization became decoupled in Theorem 1, there was a simple algorithm to find the best structure, and that algorithm could be run in a distributed manner. Although that does not happen here, note that the optimal tree maximizes a sum of pairwise directed information values. Each value corresponds to an edge weight for one directed edge in a complete directed graph on the processes. To find the tree with maximal weight, we can employ a maximum-weight directed spanning tree (MWDST) algorithm. We discuss MWDST algorithms in Section VI-A. Algorithm 2 describes the procedure to find the best approximating distribution with a causal dependence tree structure.
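Since the algorithm listings did not survive this transcription cleanly, the following is a minimal sketch of both procedures, assuming the pairwise directed informations have already been estimated into a matrix di[i, j] ≈ I(X_i → X_j); the use of networkx's maximum_spanning_arborescence routine (an Edmonds-type MWDST solver) is our choice for illustration, not necessarily the implementation used in the paper.

```python
import numpy as np
import networkx as nx

def best_parent(di):
    """Algorithm 1 (sketch): for each process, keep the single incoming
    edge with the largest pairwise directed information, or no parent
    if every candidate value is zero.  di[i, j] ~ I(X_i -> X_j)."""
    m = di.shape[0]
    parents = {}
    for j in range(m):
        weights = di[:, j].copy()
        weights[j] = -np.inf               # a node cannot be its own parent
        i = int(np.argmax(weights))
        parents[j] = i if weights[i] > 0 else None
    return parents

def best_tree(di):
    """Algorithm 2 (sketch): maximum-weight directed spanning tree over
    the complete graph whose edge (i, j) has weight I(X_i -> X_j)."""
    m = di.shape[0]
    g = nx.DiGraph()
    g.add_edges_from((i, j, {"weight": float(di[i, j])})
                     for i in range(m) for j in range(m) if i != j)
    arb = nx.maximum_spanning_arborescence(g, attr="weight")
    return sorted(arb.edges())             # list of (parent, child) pairs

if __name__ == "__main__":
    # Toy pairwise directed-information matrix (nats), purely illustrative.
    di = np.array([[0.0, 0.4, 0.1],
                   [0.0, 0.0, 0.3],
                   [0.2, 0.1, 0.0]])
    print("best parents:", best_parent(di))
    print("best tree edges:", best_tree(di))
```

In this toy example the best-parent solution forms a cycle (and so is not a tree), while the spanning-arborescence step returns a connected tree with a single root, illustrating the distinction drawn in Fig. 3.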

Algorithm 2 (Causal Dependence Tree): For each ordered pair of processes, compute the pairwise directed information from the first to the second; then find a maximum-weight directed spanning tree of the complete directed graph whose edge weights are those values.

Since the set of approximations considered by Algorithm 1 contains the causal dependence tree approximations considered by Algorithm 2, Algorithm 1's approximations are superior to Algorithm 2's in terms of KL divergence. For some applications, however, having a directed tree can be more useful for analysis and allocation of resources.

Remark 3: Chow and Liu [4] solved an analogous problem for a collection of random variables. They developed an algorithm to efficiently find the best tree-structured approximation for a Markov network (or, equivalently for that problem, a Bayesian network). They showed that using KL divergence, finding the best tree approximation was equivalent to maximizing a sum of mutual informations. They used a maximum weight spanning tree to solve the optimization. Thus, even though directed information graphs have different properties than Markov or Bayesian networks, and operate on a collection of random processes, not variables, the method for finding the best tree is analogous.

Next, we consider the consistency of these algorithms in the setting of estimating from data. We discuss estimation in Section VII-A.

Theorem 3: Suppose the true directed information graph is a causal dependence tree and the pairwise directed information estimates converge almost surely (a.s.). Then, for the output of Algorithm 2,

(20)

Proof: By Lemma 4.2, the true graph is the unique tree structure with maximal sum of directed informations along its edges. Algorithm 2 finds the tree with maximal weight, and thus if the edge weights converge almost surely, then the tree estimate does also.

Note that an analogous result holds for Algorithm 1 in the case that the true directed information graph has at most one parent per node. In general, there could be multiple approximation structures with the same maximal weight, so the estimated structure might not converge, but the approximating structures picked would almost surely be among those of maximal weight.

VI. COMPLEXITY

In this section, we will discuss both the complexity of the algorithms and the storage requirements for the solution.

A. Algorithmic Complexity

Both algorithms first compute the directed information values between each pair. For discrete random processes, computing the directed information, a divergence (3), in general involves summations over exponentially large alphabets. Computing one directed information value for two processes of length n in general takes time exponential in n. If the distributions are assumed to be jointly Markov of some fixed order, then it becomes linear in n. Thus, computing the directed information for each of the quadratically many ordered pairs of processes is tractable when Markovicity is assumed. For both algorithms, computation of the directed informations can be done independently: the loops over processes in both algorithms can be done in a distributed fashion. Note that computing only pairwise relationships is computationally much more tractable than in the full case. To identify the true directed information graph, divergence calculations using the whole network of processes are used [2], requiring time exponential in both the number of processes and their length without Markovicity, and exponential in the number of processes even with Markovicity.

Furthermore, the computation can be reduced by first calculating mutual informations in both algorithms. Equation (4) holding means that there is no instantaneous coupling between the processes, which implies that the mutual information between a pair decomposes as the sum of the two directed informations [14]. Since mutual and directed informations are nonnegative, the mutual information bounds each directed information. Either directed information can later be computed to resolve both. After computing the pairwise directed informations, Algorithm 1 then picks the best parent for each process, so the total runtime remains quadratic in the number of processes assuming Markovicity.
Algorithm 2 additionally computes a maximum weight spanning tree. Chu and Liu [29], Edmonds [30], and Bock [31] independently developed an efficient MWDST algorithm which, for the complete graphs arising here, runs in time quadratic in the number of processes. Thus, like Algorithm 1, Algorithm 2 also runs in time quadratic in the number of processes. Note that Humblet [32] proposed a distributed MWDST algorithm, which constructs the maximum weight tree for each node as root. In some applications, it is useful to be able to choose from multiple potential roots.

B. Storage Complexity

In the full joint distribution, there are as many random variables as processes times time steps, and each possible outcome might have a unique probability. Thus, for discrete variables over a finite alphabet, the total storage for the joint distribution is exponential in both the number of processes and the process length. Both approximations we consider reduce the full joint distribution to pairwise distributions, so the storage grows only with the number of retained pairs and the storage of each pairwise distribution. Further, if the approximations have Markovicity of some fixed order, the storage of each pairwise distribution is reduced further.

VII. APPLICATIONS TO SIMULATED AND EXPERIMENTAL DATA

In this section, we demonstrate the efficacy of the approximations in a classification experiment with simulated time series. We then show the approximations capture important structural characteristics of a network of brain cells from a neuroscience experiment. First we discuss parametric estimation of directed information from data.

A. Parametric Estimation

While a thorough discussion of estimation techniques is outside the scope of this work, for completeness we briefly describe the consistent parametric estimation technique for directed information proposed in [24] and [33] and applied to study brain cell networks. Afterwards, we discuss estimation for the specific setting of autoregressive time series.

1) Point-Process Parametric Models: Let X and Y denote two binary time series of brain cell activity, where a sample is 1 if the corresponding cell was active at that time and 0 otherwise. Truccolo et al. [34] proposed modeling how one cell's activity depends on its own past and the past of the other cell using a point-process framework.

The conditional log-likelihood has the standard point-process form, in which the conditional intensity function [34] can be interpreted as the propensity of a cell to be active at a given time based on its own past and the past of the other cell. The Markov orders of the two history terms are assumed to be unknown. To avoid overfitting, the minimum description length penalty [35] is used to select the maximum likelihood estimate (MLE). This penalty balances the Shannon code-length of encoding the activity with causal side information using an MLE against the code-length required to describe the MLE. The directed information estimates take the form of a normalized difference of maximized log-likelihoods,

(21)

where the two terms use the MLE parameter vectors for their respective models (with and without the other cell's past). Under stationarity, ergodicity, and Markovicity, almost sure convergence of these estimates is shown in [24]. These results extend to general parametric classes. Also note that for the setting of finite alphabets, [36] proposed universal estimation of directed information using context tree weighting.

2) Autoregressive Models: Next consider the specific parametric class of autoregressive time series. Specifically, a Markov-order-one autoregressive (AR-1) model for the collection of processes is

$$\mathbf{Z}_t = A \mathbf{Z}_{t-1} + \mathbf{W}_t, \qquad (22)$$

where $A$ is a coefficient matrix and $\mathbf{W}_t$ is i.i.d. white Gaussian noise with covariance matrix $\Sigma$. The noise components are assumed to be independent, so $\Sigma$ is diagonal. The coefficients are fixed, so for two processes modeled jointly as AR-1,

(23)
(24)

where (23) follows from stationarity and Markovicity, and (24) follows from the Gaussian entropy formula (p. 249 of [13]) expressed in terms of determinants of covariance matrices. Note that, by the recurrence relation (22), these covariance matrices can be computed as

(25)

Thus, directed information estimates can be computed by first finding the least squares estimate of the coefficient matrix in (22), then computing covariance matrices using (25), and then computing (24).

B. Classification Experiment

We tested the utility of the approximation methods using a binary classification experiment.

1) Setup: For each choice of the number of processes, pairs of AR-1 models were randomly generated. Each element of the coefficient matrix was generated i.i.d. from a fixed distribution. The noise covariance was a diagonal matrix with entries selected uniformly at random from an interval. For each AR model, time series of several lengths were generated using (22). The coefficients were estimated using least squares for each of the time series. The best parent and best tree approximations were computed using the estimated coefficients. The directed informations between each pair were estimated using the method in Section VII-A2. To identify the MWDSTs, a Matlab implementation of Edmonds's algorithm [37] was used. Coefficients for the second model of each pair were generated and estimated likewise. Next, classification was performed. For each pair of models, test time series were generated from each model using (22). First, the log-likelihood of each time step conditioned on the past was computed for the full distributions using the coefficient estimates, and the frequency of correct classification was calculated. Next, the log-likelihoods using the best parent approximations with estimated coefficients were calculated, and then those for the best tree approximations. This was repeated for each set of coefficient estimates, corresponding to the different training lengths.
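As a rough illustration of the estimation route just described, the following is a minimal sketch for a pair of processes, assuming joint stationarity and a bivariate AR-1 model as in (22); the truncation of the conditioning past to a fixed number of lags, the use of scipy's discrete Lyapunov solver for the stationary covariance in place of (25), and all variable names are our choices, not necessarily those of the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def fit_ar1(z):
    """Least-squares fit of Z_t = A Z_{t-1} + W_t; returns (A, Sigma_W)."""
    z_past, z_next = z[:-1], z[1:]
    coef, *_ = np.linalg.lstsq(z_past, z_next, rcond=None)
    a = coef.T
    resid = z_next - z_past @ coef
    return a, np.cov(resid.T)

def gaussian_di_rate(a, sigma, lags=20):
    """Approximate per-step directed information rate I(X -> Y) in nats for
    the stationary bivariate Gaussian AR-1 model Z_t = A Z_{t-1} + W_t,
    Z = (X, Y), using
      0.5 * log( Var(Y_t | past Y) / Var(Y_t | past X, past Y) ),
    with the conditioning past truncated to `lags` samples."""
    k = solve_discrete_lyapunov(a, sigma)          # stationary Cov(Z_t)
    # Joint covariance of (Z_{t-lags}, ..., Z_{t-1}, Z_t), built from
    # Cov(Z_s, Z_{s-j}) = A^j K.
    p = lags + 1
    s = np.zeros((2 * p, 2 * p))
    for i in range(p):
        for j in range(p):
            d = i - j
            block = (np.linalg.matrix_power(a, d) @ k if d >= 0
                     else k @ np.linalg.matrix_power(a.T, -d))
            s[2 * i:2 * i + 2, 2 * j:2 * j + 2] = block
    y_t = 2 * lags + 1                                  # index of Y_t
    past_y = [2 * i + 1 for i in range(lags)]           # Y_{t-lags..t-1}
    past_xy = [j for i in range(lags) for j in (2 * i, 2 * i + 1)]
    def cond_var(given):
        sgg = s[np.ix_(given, given)]
        stg = s[np.ix_([y_t], given)]
        return s[y_t, y_t] - (stg @ np.linalg.solve(sgg, stg.T)).item()
    return 0.5 * np.log(cond_var(past_y) / cond_var(past_xy))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a_true = np.array([[0.5, 0.0],     # X driven only by its own past
                       [0.4, 0.3]])    # Y driven by past X and past Y
    n = 50_000
    z = np.zeros((n, 2))
    for t in range(1, n):
        z[t] = a_true @ z[t - 1] + 0.1 * rng.standard_normal(2)
    a_hat, sigma_hat = fit_ar1(z)
    print("I(X -> Y) rate:", gaussian_di_rate(a_hat, sigma_hat))
    # Swapping coordinates gives the model for (Y, X), hence the reverse rate.
    print("I(Y -> X) rate:",
          gaussian_di_rate(a_hat[::-1, ::-1], sigma_hat[::-1, ::-1]))
```

For the hypothetical coefficients above, the X-to-Y rate should be clearly positive while the reverse rate should be near zero.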
2) Results: The results of these classification experiments are shown in Fig. 4. The classification rates are averaged over the 100 trials. Error bars show standard deviation. The best parent approximations perform only slightly better than the best tree approximations. Both performed close to an 85% correct classification rate, improving slightly with a larger number of processes. Classification using the full distribution noticeably improves with the number of processes. This is due to the increased complexity of the distributions; with more processes, there are more relationships to distinguish the distributions. There are many more edges in the full distribution than in the best parent and best tree approximations. Despite having significantly fewer edges, the approximations capture enough structure to distinguish models. The effect of having a small number of samples to estimate AR coefficients is more dramatic as the number of processes increases. For larger numbers of processes, coefficients estimated with the two longer time-series lengths performed almost identically.

C. Application to Experimental Data

We now discuss an application of these methods to analysis of neural activity. A recent study computed the directed information graph for a group of neurons in a monkey's primary motor cortex [24]. Using that graph, the authors identified a dominant axis of local interactions, which corresponded to the known, primary direction of wave propagation of regional synchronous activity, believed to mediate information transfer [38]. We show that the best parent and best tree approximations preserve that dominant axis.

Fig. 4. Classification rate between pairs of autoregressive series. For each setting, pairs of autoregressive coefficients were generated randomly. Classification was performed using the full structures, best parent approximations, and best tree approximations, using coefficients estimated from time series of several lengths. Error bars depict standard deviation.

The monkey was performing a sequence of arm-reaching tasks. Its arm was constrained to move along a horizontal surface. Each task involved presentation of a randomly positioned, fixed target on the surface, the monkey moving its hand to meet the target, and a reward (a drop of juice) given to the monkey if it was successful. For more details, see [24], [39]. Neural activity in the primary motor cortex was recorded by an implanted silicon micro-electrode array. The recorded waveforms were filtered and processed to produce, for each neuron that was detected, a sequence of times when that neuron became active (e.g., it "spiked"). The 37 neurons with the greatest total activity (number of spikes) were used for analysis.

To study the flow of activity between individual neurons, we constructed a directed information graph on the collection of neurons. To simplify computation, pairwise directed informations were estimated using the parametric estimation procedure discussed in Section VII-A. Fig. 5(a) depicts the pairwise directed information graph. The relative positions of the neurons in the graph correspond to the relative positions of the recording electrodes. The blue arrow indicates a dominant orientation of the edges. This orientation along the rostro-caudal axis is consistent with the direction of propagation of local field potentials, which researchers believe mediates information transfer between regions [38].

We applied Algorithms 1 and 2 to this data set. The structure of the dependence tree approximation is shown in Fig. 5(b). The best parent approximation is almost identical; the only differences are that the parents of nodes 28 and 13 are 27 and 3, respectively. The original graph had 117 edges with many complicated loops. Both approximations reduced the number of edges by a third, improving the clarity of the graph. Both approximations preserve the dominant edge orientation pertaining to wave propagation depicted by the blue arrow in Fig. 5(a). This suggests that these approximation methodologies preserve relevant information for decision-making and visualization for analysis of mechanistic biological phenomena.

Fig. 5. Graphical structures of nonzero pairwise directed information values from [24] and the causal dependence tree approximation. The best parent approximation was almost identical and is not shown. The blue arrow in Fig. 5(a) depicts a dominant orientation of the edges. That orientation is consistent with the direction of propagation of local field potentials, which is believed to mediate information transfer [38]. Both approximations preserve that structure. (a) Graphical structure of nonzero pairwise directed information values. (b) Causal dependence tree approximation.
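The pairwise directed informations for this data set were estimated with the point-process approach of Section VII-A1. As a rough illustration of that style of estimator (not the paper's exact procedure: history orders are fixed here rather than selected by MDL, and a plain logistic model stands in for the point-process GLM of [34]), consider the following sketch; the simulated coupling and all names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def history_matrix(series, order):
    """Row t holds (series[t-1], ..., series[t-order]) for t = order..n-1."""
    n = len(series)
    return np.column_stack([series[order - k: n - k] for k in range(1, order + 1)])

def log_likelihood(model, features, target):
    """Average Bernoulli log-likelihood of `target` under a fitted model."""
    p = np.clip(model.predict_proba(features)[:, 1], 1e-12, 1 - 1e-12)
    return np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def di_glm_estimate(x, y, order=3):
    """Rough estimate of I(X -> Y) in nats per time step: the gain in average
    log-likelihood of Y_t when the past of X is added to the past of Y,
    mirroring the log-likelihood-ratio form of (21)."""
    target = y[order:]
    hy, hx = history_matrix(y, order), history_matrix(x, order)
    hfull = np.hstack([hy, hx])
    reduced = LogisticRegression().fit(hy, target)
    full = LogisticRegression().fit(hfull, target)
    return log_likelihood(full, hfull, target) - log_likelihood(reduced, hy, target)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50_000
    x = (rng.random(n) < 0.2).astype(int)        # driver "cell"
    y = np.zeros(n, dtype=int)
    for t in range(1, n):
        rate = 0.05 + 0.6 * x[t - 1]             # hypothetical coupling
        y[t] = int(rng.random() < rate)
    print("I(X -> Y) estimate:", di_glm_estimate(x, y))
    print("I(Y -> X) estimate:", di_glm_estimate(y, x))
```

With the hypothetical coupling above, the estimate in the X-to-Y direction should be clearly positive and the reverse direction near zero; without a penalty such as MDL, small negative values can appear in the null direction due to finite-sample effects.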

VIII. CONCLUSION

In this work, we presented efficient methods to optimally approximate networks of stochastic, dynamically interacting processes with low-complexity approximations. Both approximations only required pairwise marginal statistics between the processes, which computationally are significantly more tractable than the full joint distribution. Also, the corresponding directed information graphs are much more accessible to analysis and practical usage for many applications.

An important line of future work involves investigating methods to approximate with other, more complicated structures. Best-parent approximations and causal dependence tree approximations will always reduce the storage complexity dramatically and facilitate analysis. However, for some applications, it might be desirable to have slightly more complicated structures, such as connected graphs with at most three parents for each node. Such approximations highlight a richer set of interactions and feedback while still being visually and computationally simpler to analyze than the full structure. Although it might not always be possible to efficiently find optimal approximations of such graphical complexity, even near-optimal approximations could prove quite beneficial to real-world applications.

ACKNOWLEDGMENT

The authors would like to thank J. Etesami and M. Rodrigues for their assistance with computer simulations.

REFERENCES

[1] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA, USA: MIT Press.
[2] C. Quinn, N. Kiyavash, and T. Coleman, "Directed information graphs," 2012, arXiv preprint.
[3] C. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, vol. 37, no. 3, 1969.
[4] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Inf. Theory, vol. IT-14, no. 3.
[5] C. Shalizi, M. Camperi, and K. Klinkner, "Discovering functional communities in dynamical networks," in Statistical Network Analysis: Models, Issues, and New Directions, ser. Lecture Notes in Computer Science, E. Airoldi, D. Blei, S. Fienberg, A. Goldenberg, E. Xing, and A. Zheng, Eds. Berlin, Germany: Springer, 2007, vol. 4503.
[6] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Royal Statist. Soc.: Ser. B (Statist. Method.), vol. 68, no. 1.
[7] A. Bolstad, B. Van Veen, and R. Nowak, "Causal network inference via group sparse regularization," IEEE Trans. Signal Process., vol. 59, no. 6.
[8] A. Puig, A. Wiesel, G. Fleury, and A. Hero, "Multidimensional shrinkage-thresholding operator and group lasso penalties," IEEE Signal Process. Lett., vol. 18, no. 6, Jun.
[9] V. Tan and A. Willsky, "Sample complexity for topology estimation in networks of LTI systems," in Proc. 50th IEEE Conf. Decision Control (CDC-ECC), 2011.
[10] D. Materassi and G. Innocenti, "Topological identification in networks of dynamical systems," IEEE Trans. Autom. Control, vol. 55, no. 8.
[11] D. Materassi and M. Salapaka, "On the problem of reconstructing an unknown topology via locality properties of the Wiener filter," IEEE Trans. Autom. Control, vol. 57, no. 7.
[12] J. Etesami, N. Kiyavash, and T. P. Coleman, "Learning minimal latent directed information trees," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2012.
[13] T. Cover and J. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience.
[14] H. Marko, "The bidirectional communication theory: A generalization of information theory," IEEE Trans. Commun., vol. COM-21, no. 12, Dec.
[15] J. Massey, "Causality, feedback and directed information," in Proc. Int. Symp. Inf. Theory and Its Appl., 1990.
[16] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, Electr. and Comput. Eng. Dept., Swiss Fed. Inst. Technol. (ETH), Zürich, Switzerland.
[17] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Trans. Inf. Theory, vol. 55, no. 1.
[18] H. Permuter, T. Weissman, and A. Goldsmith, "Finite state channels with time-invariant deterministic feedback," IEEE Trans. Inf. Theory, vol. 55, no. 2.
[19] C. Li and N. Elia, "The information flow and capacity of channels with noisy feedback," 2011, arXiv preprint.
[20] R. Venkataramanan and S. Pradhan, "Source coding with feed-forward: Rate-distortion theorems and error exponents for a general source," IEEE Trans. Inf. Theory, vol. 53, no. 6.
[21] N. Martins and M. Dahleh, "Feedback control in the presence of noisy channels: Bode-like fundamental limitations of performance," IEEE Trans. Autom. Control, vol. 53, no. 7, Aug.
[22] S. K. Gorantla, "The interplay between information and control theory within interactive decision-making problems," Ph.D. dissertation, Electr. and Comput. Eng. Dept., Univ. Illinois at Urbana-Champaign, Champaign, IL, USA.
[23] H. Permuter, Y. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," IEEE Trans. Inf. Theory, vol. 57, no. 6.
[24] C. Quinn, T. Coleman, N. Kiyavash, and N. Hatsopoulos, "Estimating the directed information to infer causal relationships in ensemble neural spike train recordings," J. Comput. Neurosci., vol. 30, no. 1.
[25] R. M. Gray, Probability, Random Processes, and Ergodic Properties. New York, NY, USA: Springer.
[26] T. Weissman, Y.-H. Kim, and H. Permuter, "Directed information, causal estimation, and communication in continuous time," IEEE Trans. Inf. Theory, vol. 59, no. 3.
[27] C. Quinn, T. Coleman, and N. Kiyavash, "Approximating discrete probability distributions with causal dependence trees," in Proc. IEEE Int. Symp. Inf. Theory Appl. (ISITA), 2010.
[28] H. Royden and P. Fitzpatrick, Real Analysis, 3rd ed. New York, NY, USA: Macmillan.
[29] Y. Chu and T. Liu, "On the shortest arborescence of a directed graph," Sci. Sinica, vol. 14, p. 270.
[30] J. Edmonds, "Optimum branchings," J. Res. Natl. Bur. Stand., Sect. B, vol. 71.
[31] F. Bock, "An algorithm to construct a minimum directed spanning tree in a directed network," Develop. Operat. Res., vol. 1.
[32] P. Humblet, "A distributed algorithm for minimum weight directed spanning trees," IEEE Trans. Commun., vol. COM-31, no. 6.
[33] S. Kim, D. Putrino, S. Ghosh, and E. N. Brown, "A Granger causality measure for point process models of ensemble neural spiking activity," PLoS Comput. Biol., vol. 7, no. 3, Mar.
[34] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown, "A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects," J. Neurophysiol., vol. 93, no. 2.
[35] P. D. Grünwald, The Minimum Description Length Principle. Cambridge, MA, USA: MIT Press.
[36] J. Jiao, H. Permuter, L. Zhao, Y. Kim, and T. Weissman, "Universal estimation of directed information," 2012, arXiv preprint.
[37] G. Li, Maximum Weight Spanning Tree (Undirected), computer software, Jun. [Online].
[38] D. Rubino, K. Robbins, and N. Hatsopoulos, "Propagating waves mediate information transfer in the motor cortex," Nature Neurosci., vol. 9, no. 12.
[39] W. Wu and N. Hatsopoulos, "Evidence against a single coordinate system representation in the motor cortex," Experiment. Brain Res., vol. 175, no. 2.

Christopher J. Quinn (S'11), photograph and biography not available at the time of publication.

Negar Kiyavash (SM'13), photograph and biography not available at the time of publication.

Todd P. Coleman (SM'12), photograph and biography not available at the time of publication.


More information

Large-Deviations and Applications for Learning Tree-Structured Graphical Models

Large-Deviations and Applications for Learning Tree-Structured Graphical Models Large-Deviations and Applications for Learning Tree-Structured Graphical Models Vincent Tan Stochastic Systems Group, Lab of Information and Decision Systems, Massachusetts Institute of Technology Thesis

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Chapter 2 Review of Classical Information Theory

Chapter 2 Review of Classical Information Theory Chapter 2 Review of Classical Information Theory Abstract This chapter presents a review of the classical information theory which plays a crucial role in this thesis. We introduce the various types of

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Information Dynamics Foundations and Applications

Information Dynamics Foundations and Applications Gustavo Deco Bernd Schürmann Information Dynamics Foundations and Applications With 89 Illustrations Springer PREFACE vii CHAPTER 1 Introduction 1 CHAPTER 2 Dynamical Systems: An Overview 7 2.1 Deterministic

More information

COURSE INTRODUCTION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

COURSE INTRODUCTION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception COURSE INTRODUCTION COMPUTATIONAL MODELING OF VISUAL PERCEPTION 2 The goal of this course is to provide a framework and computational tools for modeling visual inference, motivated by interesting examples

More information

Auxiliary signal design for failure detection in uncertain systems

Auxiliary signal design for failure detection in uncertain systems Auxiliary signal design for failure detection in uncertain systems R. Nikoukhah, S. L. Campbell and F. Delebecque Abstract An auxiliary signal is an input signal that enhances the identifiability of a

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Representation of Correlated Sources into Graphs for Transmission over Broadcast Channels

Representation of Correlated Sources into Graphs for Transmission over Broadcast Channels Representation of Correlated s into Graphs for Transmission over Broadcast s Suhan Choi Department of Electrical Eng. and Computer Science University of Michigan, Ann Arbor, MI 80, USA Email: suhanc@eecs.umich.edu

More information

A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models

A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models A Note on the comparison of Nearest Neighbor Gaussian Process (NNGP) based models arxiv:1811.03735v1 [math.st] 9 Nov 2018 Lu Zhang UCLA Department of Biostatistics Lu.Zhang@ucla.edu Sudipto Banerjee UCLA

More information

Communication Theory II

Communication Theory II Communication Theory II Lecture 8: Stochastic Processes Ahmed Elnakib, PhD Assistant Professor, Mansoura University, Egypt March 5 th, 2015 1 o Stochastic processes What is a stochastic process? Types:

More information

High-dimensional graphical model selection: Practical and information-theoretic limits

High-dimensional graphical model selection: Practical and information-theoretic limits 1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4

Y1 Y2 Y3 Y4 Y1 Y2 Y3 Y4 Z1 Z2 Z3 Z4 Inference: Exploiting Local Structure aphne Koller Stanford University CS228 Handout #4 We have seen that N inference exploits the network structure, in particular the conditional independence and the

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Directed acyclic graphs and the use of linear mixed models

Directed acyclic graphs and the use of linear mixed models Directed acyclic graphs and the use of linear mixed models Siem H. Heisterkamp 1,2 1 Groningen Bioinformatics Centre, University of Groningen 2 Biostatistics and Research Decision Sciences (BARDS), MSD,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Online Forest Density Estimation

Online Forest Density Estimation Online Forest Density Estimation Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr UAI 16 1 Outline 1 Probabilistic Graphical Models 2 Online Density Estimation 3 Online Forest Density

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks

Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks Yoshua Bengio Dept. IRO Université de Montréal Montreal, Qc, Canada, H3C 3J7 bengioy@iro.umontreal.ca Samy Bengio IDIAP CP 592,

More information

Generalized Writing on Dirty Paper

Generalized Writing on Dirty Paper Generalized Writing on Dirty Paper Aaron S. Cohen acohen@mit.edu MIT, 36-689 77 Massachusetts Ave. Cambridge, MA 02139-4307 Amos Lapidoth lapidoth@isi.ee.ethz.ch ETF E107 ETH-Zentrum CH-8092 Zürich, Switzerland

More information

arxiv:cs/ v2 [cs.it] 1 Oct 2006

arxiv:cs/ v2 [cs.it] 1 Oct 2006 A General Computation Rule for Lossy Summaries/Messages with Examples from Equalization Junli Hu, Hans-Andrea Loeliger, Justin Dauwels, and Frank Kschischang arxiv:cs/060707v [cs.it] 1 Oct 006 Abstract

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Diversity-Promoting Bayesian Learning of Latent Variable Models

Diversity-Promoting Bayesian Learning of Latent Variable Models Diversity-Promoting Bayesian Learning of Latent Variable Models Pengtao Xie 1, Jun Zhu 1,2 and Eric Xing 1 1 Machine Learning Department, Carnegie Mellon University 2 Department of Computer Science and

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics)

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics) COMS 4771 Lecture 1 1. Course overview 2. Maximum likelihood estimation (review of some statistics) 1 / 24 Administrivia This course Topics http://www.satyenkale.com/coms4771/ 1. Supervised learning Core

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

6.867 Machine learning, lecture 23 (Jaakkola)

6.867 Machine learning, lecture 23 (Jaakkola) Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context

More information

Feasibility Conditions for Interference Alignment

Feasibility Conditions for Interference Alignment Feasibility Conditions for Interference Alignment Cenk M. Yetis Istanbul Technical University Informatics Inst. Maslak, Istanbul, TURKEY Email: cenkmyetis@yahoo.com Tiangao Gou, Syed A. Jafar University

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Course content (will be adapted to the background knowledge of the class):

Course content (will be adapted to the background knowledge of the class): Biomedical Signal Processing and Signal Modeling Lucas C Parra, parra@ccny.cuny.edu Departamento the Fisica, UBA Synopsis This course introduces two fundamental concepts of signal processing: linear systems

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Probabilistic Graphical Models for Image Analysis - Lecture 1

Probabilistic Graphical Models for Image Analysis - Lecture 1 Probabilistic Graphical Models for Image Analysis - Lecture 1 Alexey Gronskiy, Stefan Bauer 21 September 2018 Max Planck ETH Center for Learning Systems Overview 1. Motivation - Why Graphical Models 2.

More information

Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex Y Gao M J Black E Bienenstock S Shoham J P Donoghue Division of Applied Mathematics, Brown University, Providence, RI 292 Dept

More information

On Scalable Coding in the Presence of Decoder Side Information

On Scalable Coding in the Presence of Decoder Side Information On Scalable Coding in the Presence of Decoder Side Information Emrah Akyol, Urbashi Mitra Dep. of Electrical Eng. USC, CA, US Email: {eakyol, ubli}@usc.edu Ertem Tuncel Dep. of Electrical Eng. UC Riverside,

More information

Tight Lower Bounds on the Ergodic Capacity of Rayleigh Fading MIMO Channels

Tight Lower Bounds on the Ergodic Capacity of Rayleigh Fading MIMO Channels Tight Lower Bounds on the Ergodic Capacity of Rayleigh Fading MIMO Channels Özgür Oyman ), Rohit U. Nabar ), Helmut Bölcskei 2), and Arogyaswami J. Paulraj ) ) Information Systems Laboratory, Stanford

More information

Belief propagation decoding of quantum channels by passing quantum messages

Belief propagation decoding of quantum channels by passing quantum messages Belief propagation decoding of quantum channels by passing quantum messages arxiv:67.4833 QIP 27 Joseph M. Renes lempelziv@flickr To do research in quantum information theory, pick a favorite text on classical

More information

Estimation of linear non-gaussian acyclic models for latent factors

Estimation of linear non-gaussian acyclic models for latent factors Estimation of linear non-gaussian acyclic models for latent factors Shohei Shimizu a Patrik O. Hoyer b Aapo Hyvärinen b,c a The Institute of Scientific and Industrial Research, Osaka University Mihogaoka

More information

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department Decision-Making under Statistical Uncertainty Jayakrishnan Unnikrishnan PhD Defense ECE Department University of Illinois at Urbana-Champaign CSL 141 12 June 2010 Statistical Decision-Making Relevant in

More information

de Blanc, Peter Ontological Crises in Artificial Agents Value Systems. The Singularity Institute, San Francisco, CA, May 19.

de Blanc, Peter Ontological Crises in Artificial Agents Value Systems. The Singularity Institute, San Francisco, CA, May 19. MIRI MACHINE INTELLIGENCE RESEARCH INSTITUTE Ontological Crises in Artificial Agents Value Systems Peter de Blanc Machine Intelligence Research Institute Abstract Decision-theoretic agents predict and

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Independent Component Analysis. Contents

Independent Component Analysis. Contents Contents Preface xvii 1 Introduction 1 1.1 Linear representation of multivariate data 1 1.1.1 The general statistical setting 1 1.1.2 Dimension reduction methods 2 1.1.3 Independence as a guiding principle

More information

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 Probabilistic

More information

A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model

A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model A Log-Frequency Approach to the Identification of the Wiener-Hammerstein Model The MIT Faculty has made this article openly available Please share how this access benefits you Your story matters Citation

More information

Equivalence in Non-Recursive Structural Equation Models

Equivalence in Non-Recursive Structural Equation Models Equivalence in Non-Recursive Structural Equation Models Thomas Richardson 1 Philosophy Department, Carnegie-Mellon University Pittsburgh, P 15213, US thomas.richardson@andrew.cmu.edu Introduction In the

More information

2012 IEEE International Symposium on Information Theory Proceedings

2012 IEEE International Symposium on Information Theory Proceedings Decoding of Cyclic Codes over Symbol-Pair Read Channels Eitan Yaakobi, Jehoshua Bruck, and Paul H Siegel Electrical Engineering Department, California Institute of Technology, Pasadena, CA 9115, USA Electrical

More information

Hierarchy. Will Penny. 24th March Hierarchy. Will Penny. Linear Models. Convergence. Nonlinear Models. References

Hierarchy. Will Penny. 24th March Hierarchy. Will Penny. Linear Models. Convergence. Nonlinear Models. References 24th March 2011 Update Hierarchical Model Rao and Ballard (1999) presented a hierarchical model of visual cortex to show how classical and extra-classical Receptive Field (RF) effects could be explained

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

How Random is a Coin Toss? Bayesian Inference and the Symbolic Dynamics of Deterministic Chaos

How Random is a Coin Toss? Bayesian Inference and the Symbolic Dynamics of Deterministic Chaos How Random is a Coin Toss? Bayesian Inference and the Symbolic Dynamics of Deterministic Chaos Christopher C. Strelioff Center for Complex Systems Research and Department of Physics University of Illinois

More information

Variable Length Codes for Degraded Broadcast Channels

Variable Length Codes for Degraded Broadcast Channels Variable Length Codes for Degraded Broadcast Channels Stéphane Musy School of Computer and Communication Sciences, EPFL CH-1015 Lausanne, Switzerland Email: stephane.musy@ep.ch Abstract This paper investigates

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

(Preprint of paper to appear in Proc Intl. Symp. on Info. Th. and its Applications, Waikiki, Hawaii, Nov , 1990.)

(Preprint of paper to appear in Proc Intl. Symp. on Info. Th. and its Applications, Waikiki, Hawaii, Nov , 1990.) (Preprint of paper to appear in Proc. 1990 Intl. Symp. on Info. Th. and its Applications, Waikiki, Hawaii, ov. 27-30, 1990.) CAUSALITY, FEEDBACK AD DIRECTED IFORMATIO James L. Massey Institute for Signal

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information