Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs


1 1 Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs Christopher J. Quinn*, Student Member, IEEE, Negar Kiyavash, Senior Member, IEEE, and Todd P. Coleman, Senior Member, IEEE Abstract Recently, directed information graphs have been proposed as concise graphical representations of the statistical dynamics amongst multiple random processes. A directed edge from one node to another indicates that the past of one random process statistically affects the future of another, given the past of all other processes. When the number of processes is large, computing those conditional dependence tests becomes difficult. Also, when the number of interactions becomes too large, the graph no longer facilitates visual extraction of relevant information for decision-making. This work considers approximating the true joint distribution on multiple random processes by another, whose directed information graph has at most one parent for any node. Under a Kullback-Leibler (KL) divergence minimization criterion, we show that the optimal approximate joint distribution can be obtained by maximizing a sum of directed informations. In particular, (a) each directed information calculation only involves statistics amongst a pair of processes and can be efficiently estimated; (b) given all pairwise directed informations, an efficient minimum weight spanning directed tree algorithm can be solved to find the best tree. We demonstrate the efficacy of this approach using simulated and experimental data. In both, the approximations preserve the relevant information for decisionmaking. EDICS classification identifier: MLR-GRKN I. INTRODUCTION Many important inference problems involve reasoning about dynamic relationships between time series. In such cases, observations of multiple time series are recorded and the objective of the observer is to understand relationships between the past of some processes, and how they affect the future of others. 
In general, with knowledge of joint statistics amongst multiple random processes, such decision-making could in principle be done. However, if these processes exhibit complex dynamics, gaining knowledge can be prohibitive from computational and storage perspectives. As such, it is appealing to develop an approximation of the joint distribution on multiple random processes which can be calculated efficiently and is less complex for inference. Moreover, simplified representations of joint statistics can facilitate easier visualization and The material in this paper was presented (in part) at the International Symposium on Information Theory and Applications, Taichung, Taiwan, October C. Quinn is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana Champaign, Urbana, Illinois ( quinn7@illinois.edu). N. Kiyavash is with the Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana Champaign, Urbana, Illinois ( kiyavash@illinois.edu). T. P. Coleman is with the Department of Bioengineering, University of California, San Diego, La Jolla, CA ( tpcoleman@ucsd.edu) human comprehension of complex relationships. For instance, in situations such as network intrusion detection, decision making in adversarial environments, and first response tasks where a rapid decision is required, such representations can greatly aid the situation awareness and the decision making process. Graphical models have been used to describe both full and approximating joint distributions of random variables [1]. For many graphical models, random variables are represented as nodes and edges between pairs encode conditional dependence relationships. Markov networks and Bayesian networks are two common examples. Markov networks are undirected graphs, while Bayesian networks are directed acyclic graphs. A Bayesian network s graphical structure depends on the variable indexing. 
This methodology could in principle be applied to describing multiple random processes. For example, if we have n time indices and m random processes, then we could create a Markov or Bayesian network on mn random variables. However, if m or n is large, this could be prohibitive from a complexity and visualization perspective. We have recently developed an alternative graphical model, termed a directed information graph, to parsimoniously describe statistical dynamics amongst a collection of random processes [2]. Each process is represented by a single node, and directed edges encode conditional independence relationships pertaining to how the past of one process affects the future of another, given the past of all other processes. As such, in this framework, directed edges represent directions of causal influence 1. They are motivated by simplified generative models of coupled dynamical systems. They admit cycles, and can thus represent feedback between processes. Under appropriate technical conditions, they do not depend on process indexing and moreover are unique [2]. Directed information graphs are particularly attractive when we have m processes and a large number n of time units: it collapses a graph of mn nodes to a graph on m nodes and a directed arrow encodes information about temporal dynamics. In some situations, however, the number m of processes we record itself can be very large, and in such a situation each conditional independence test, involving all m processes, can be difficult to evaluate. Moreover, even visualization of the directed information graph with up to m 2 edges can be cumbersome. As such, the benefits of having a precise picture of the statistical dynamics might be out-weighed by the costs 1 Causal in this work refers to Granger causality [3], where a process X is said to causally influence a process Y if the past of X helps to predict the future of Y, already conditioned on the past Y and all other processes.

2 2 (a) The full influence structure of the social network. It is difficult to determine which users to target to indirectly influence the whole network. (b) The graph of an approximation which captures key structural components. By targeting only the root of the tree, who is circled, it is possible influence will spread throughout the rest of the network. Fig. 1. Graphical models of the full user influence dynamics of the social network and an approximation of those dynamics. Arrow widths correspond to strengths of influence. Although some structural components are lost, the graph of the approximation makes it clear who to target and the paths of expected influence. in computation, storage, and ease-of-use to a human. An approximation of the joint distribution that preserves a small number of important interactions could alleviate this problem. As an example, consider how a social network company negotiates the costs of advertisements to its users with another company. If the preferences or actions of certain users on average have a large causal influence on the subsequent preferences or actions of friends in their network, then a business might be willing to pay more money to advertise to those users, as compared to the down-stream friends with less influence. By paying to advertise to the influential users, the business is effectively advertising to many. For the social network company and the business to agree on pricing, however, it needs to be agreed on which users are the most influential. With a complicated social network, such as Figure 1(a), a simple procedure to identify who to advertise to, and for how much, might be onerous to develop. However, if the social network company could approximate the user interactions dynamics into a simplified - but accurate - picture, such as Figure 1(b), then it would be much easier for the business to see who to target to influence the whole network. This is the motivation of this work. 
Directed trees, such as Figure 1(b), are among the simplest structures that could be used for approximation - each node has at most one parent. In reducing the computational, storage, and visual complexity substantially, directed trees are much more amenable to analysis than the full structure. They also depict a clear hierarchy between nodes. We here consider the problem of finding the best approximation of a joint distribution on m random processes so that each node in the subsequent directed information graph has at most one parent. We will demonstrate the efficacy of this approach from complexity, visualization, and decision-making perspectives. II. OUR CONTRIBUTION AND RELATED WORK A. Our Contribution In this paper, we consider the problem of approximating a joint distribution on m random processes by another joint distribution on m random processes, where each node in the subsequent directed information graph has at most one parent. We consider two variants, one in which the approximation s directed information graph need not be connected, and the second for which it is (i.e. it must be a directed tree). We use the KL divergence as the metric to find the best approximation, and show that the subsequent optimization problem is equivalent to maximizing a sum of pairwise directed informations. Both cases only require knowledge of statistics between pairs of processes to find the best such approximations. For the connected case, a minimum weight spanning tree algorithm can be solved in time that is quadratic in the number of processes. Both approximations have similar algorithmic and storage complexity. We demonstrate the utility of this approach in simulated and experimental data, where the relevant information for decision-making is maintained in the tree approximation. B. Related work Chow and Liu proposed an efficient algorithm to find an optimal tree structured approximation to the joint distribution on a collection of random variables [4]. 
Since then, many works have been developed to approximate joint distributions, in terms of underlying Markov and Bayesian networks. There have been other works that approximated with more complicated structures; see [1] for an overview. In this work, we use directed information graphs to describe the joint distribution on random processes, in terms of how the past of processes statistically affect the future of others. These were recently introduced in [2], where it was also shown that they are a generalized embodiment of Granger s notion of causality [3] and that under mild assumptions, they are equivalent to minimal generative model graphs. Many methods to estimate joint distributions on random processes from a generative model perspective have recently been developed. Shalizi et al. [5] have developed methods using a stochastic state reconstruction algorithm for discrete valued processes to identify interactions between processes and functional communities. Group Lasso is a method to infer the causal relationships between multivariate auto-regressive models [6]. Bolstad et al. recently showed conditions under which the estimates of Group Lasso are consistent [7]. Puig et al. have developed a multidimensional shrinkage-threshold operator which arises in problems with Group Lasso type penalties [8]. Tan and Willsky analyzed sample complexity for identifying the topology of a tree structured network of LTI systems [9]. Materassi et al. have developed methods based on Wiener filtering to statistically infer causal influences in linear stochastic dynamical systems; consistency results have been derived for the case when the underlying dynamics have a tree structure [10], [11]. For the setting where the directed information graph has a tree structure and some processes are not observed, Etesami et al. developed a procedure to recover the graph [12]. C. Paper organization The paper organization is as follows. Section III establishes definitions and notations. 
In Section IV, we review directed

3 3 information graphs and discuss their relationship with generative models of stochastic dynamical systems to motivate our approach. In Section V, we present our main results pertaining to finding the optimal approximations of the joint distribution where each node can have at most one parent, both unconstrained and when the structure is constrained to be a directed tree. Here we show that in both cases, the optimization simplifies to maximizing a sum of pairwise directed informations. In Section VI, we analyze the algorithmic and storage complexity of the approximations. In Section VII, we review parametric estimation, evaluate the performance of the approximations in a simulated binary classification experiment, and showcase the utility of this approach in elucidating the wave-like phenomena in the joint neural spiking activity of primary motor cortex. III. DEFINITIONS AND NOTATION This section presents probabilistic notations and information-theoretic definitions and identities that will be used throughout the remainder of the manuscript. Unless otherwise noted, the definitions and identities come from Cover & Thomas [13]. For a sequence a 1, a 2,..., denote a i (a 1,..., a i ). For any Borel space Z, denote its Borel sets by B(Z) and the space of probability measures on (Z, B(Z)) as P (Z). Consider two probability measures P and Q on P (Z). P is absolutely continuous with respect to Q (denoted as P Q) if Q(A) = 0 implies that P(A) = 0 for all A B(Z). If P Q, denote the Radon-Nikodym derivative as the random variable dp dq : Z R that satisfies P(A) = z A dp dq (z)q(dz), A B(Z). The Kullback-Leibler divergence between P P (Z) and Q P (Z) is defined as [ D(P Q) E P log dp ] = dq z Z log dp (z)p(dz) (1) dq if P Q and otherwise. For a sample space Ω, sigma-algebra F, and probability measure P, denote the probability space as (Ω, F, P). 
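For finite alphabets, the KL divergence in (1) reduces to a sum, and the absolute continuity requirement P ≪ Q corresponds to Q assigning positive mass wherever P does. A minimal illustrative sketch (not part of the original development):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) for finite-alphabet distributions, per (1).
    Requires P << Q: wherever q is zero, p must be zero too;
    otherwise the divergence is infinite."""
    d = 0.0
    for pz, qz in zip(p, q):
        if pz > 0:
            if qz == 0:
                return math.inf  # absolute continuity violated
            d += pz * math.log(pz / qz)
    return d
```

Note that D(P‖Q) ≠ D(Q‖P) in general, and D(P‖Q) = 0 if and only if P = Q, which is what makes it a usable (if asymmetric) approximation criterion.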
Throughout this paper, we will consider m random processes where the ith (with i {1,..., m}) random process at time t (with t {1,..., n}), takes values in a Borel space X. Denote the ith random variable at time t by X i,t : Ω X, the ith random process as X i = (X i,1,..., X i,n ), and the whole collection of all m random processes as X = (X 1,..., X m ). The probability measure P thus induces a joint distribution on X given by P X ( ) P (X mn ), a joint distribution on X i given by P Xi ( ) P (X n ), and a marginal distribution on X i,t given by P Xi,t ( ) P (X). With slight abuse of notation, denote X X i for some i and Y X j for some j i and denote the conditional distribution and causally conditioned distribution of Y given X as P Y X=x (dy) P Y X (dy x) n ( = P Yt Y t 1,X n dyt y t 1, x n) P Y X=x (dy) P Y X (dy x) n ( P Yt Y t 1,X t 1 dyt y t 1, x t 1). (2) Note the similarity with regular conditioning in (2), except in causal conditioning the future (x n t ) is not conditioned on [14]. The mutual information and directed information [15] between random process X and random process Y are I(X; Y) = D ( ) P Y X=x P Y PX (dx) x I(X Y) = D ( ) P Y X=x P Y PX (dx). (3) x Conceptually, mutual information and directed information are related. However, while mutual information quantifies statistical correlation (in the colloquial sense of statistical interdependence), directed information quantifies statistical causation. Note that I(X; Y) = I(Y; X), but I(X Y) I(Y X) in general. Remark 1: Note that in (2), there is no conditioning on the present x t. This follows Marko s definition [14] and is consistent with Granger causality [3]. Massey [15] and Kramer [16] later included conditioning on x t for the specific setting of communication channels. In such settings, since the directions of causation (e.g. that X is input and Y is output) are known, it is convenient to work with synchronized time, for which conditioning on x t is meaningful. 
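When the pair of processes is jointly Markov, the directed information in (3) admits simple plug-in estimates. The sketch below (an illustration of the definition, not the estimator of Section VII-A) estimates the directed information rate for jointly first-order-Markov processes over finite alphabets by averaging the empirical conditional mutual information I(X_{t-1}; Y_t | Y_{t-1}):

```python
from collections import Counter
import math

def directed_information_rate(x, y):
    """Plug-in estimate of the directed information rate I(X -> Y),
    assuming (X, Y) is jointly Markov of order 1, so each term of the
    sum defining directed information is I(X_{t-1}; Y_t | Y_{t-1})."""
    # empirical counts of triples (x_{t-1}, y_t, y_{t-1})
    triples = Counter(zip(x[:-1], y[1:], y[:-1]))
    n = sum(triples.values())
    xz, yz, z = Counter(), Counter(), Counter()
    for (a, b, c), k in triples.items():
        xz[(a, c)] += k   # counts of (x_{t-1}, y_{t-1})
        yz[(b, c)] += k   # counts of (y_t, y_{t-1})
        z[c] += k         # counts of y_{t-1}
    # plug-in conditional mutual information
    return sum((k / n) * math.log(k * z[c] / (xz[(a, c)] * yz[(b, c)]))
               for (a, b, c), k in triples.items())
```

For instance, if Y is a one-step-delayed copy of an i.i.d. binary X, the estimate in the direction X → Y approaches log 2 while the reverse direction is near zero, reflecting the asymmetry I(X → Y) ≠ I(Y → X) noted above.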
Note, however, that by conditioning on the present x t in (2), that in a binary symmetric channel (for example) with input X, output Y, and no feedback, I(Y X) > 0, even though Y does not influence X. Directed information has been shown to play important roles in characterizing the capacity of channels with feedback [17] [19], quantifying achievable rates for source coding with feedforward [20], for feedback control over noisy channels [21], [22], and gambling, hypothesis testing, and portfolio theory [23]. See [24] for examples and further discussion. Remark 2: This work is in the setting of discrete time, such as sampled continuous-time processes. Under appropriate technical assumptions, directed information can be directly extended to continuous time on the [0, T ] interval. Define F t = σ(x τ : 0 τ < t, Y τ : 0 τ < t) to be the sigma-algebra generated by the past of all processes and F t = σ(y τ : 0 τ < t) to be the sigma-algebra generated by the past of all processes excluding X. If we assume that all processes

4 4 are well-behaved (i.e. on Polish spaces), then we have that regular versions of P (Y t F t ) and P (Y t F t ), exist almost-surely [25]. As such, we can denote the regular conditional probabilities by by P t ( ) P (Y) and P t ( ) P (Y) respectively. Then the directed information in continuous time is given in complete analogy with discrete time by [ ] T I(X Y) E D (P t P ) t dt. Connections between directed information in continuous time, causal continuous-time estimation, and communication in continuous time have also recently been proposed [26]. A treatment of the continuous-time setting is outside the scope of this work. IV. BACKGROUND AND MOTIVATING EXAMPLE: APPROXIMATING THE STRUCTURE OF DYNAMICAL 0 SYSTEMS In this section, we describe the problem of identifying the structure of a stochastic, dynamical system, and then approximating it with another stochastic dynamical system. We will review the definitions and basic properties of directed information graphs. We first consider an example of a deterministic dynamical system described in state space format in terms of coupled differential equations. Example 1: Consider a system with three deterministic processes, {x t, y t, z t }, which evolves according to: ẋ = g 1 (x, y, z) x t+ = x t + g 1 (x t, y t, z t ) ẏ = g 2 (x, y, z) y t+ = y t + g 2 (x t, y t, z t ) ż = g 3 (x, y, z) z t+ = z t + g 3 (x t, y t, z t ). Given the full past of the whole network, {x t, y t, z t }, the future of each process (at time t + ) can be constructed. In many cases, some processes do not depend on the past of every other process, but only some subset of other processes. Suppose we can simplify the above equations by removing all of the dependencies of how one process evolves given others: x t+ = x t + g 1 (x t, y t ) y t+ = y t + g 2 (x t, y t ) z t+ = z t + g 3 (x t, y t, z t ). This structure can be depicted graphically (see Figure 2(a)). 
We can further approximate this dynamical system by approximating the functions {g 1 (x t, y t ), g 2 (x t, y t ), g 3 (x t, y t, z t )} whose generative models have fewer inputs. One approximation for the system is: x t+ = x t + g 1 (x t, y t ) x t + g 1(x t ) y t+ = y t + g 2 (x t, y t ) z t+ = z t + g 3 (x t, y t, z t ) z t + g 3(y t, z t ). Figure 2(b) depicts the corresponding directed tree structure. We refer to such structures as causal dependence trees. A similar procedure can be used for networks of stochastic processes, where the system is described in a timeevolving manner through conditional probabilities. Consider X Y Z (a) Full causal dependence structure, the directed information graph. X Y Z (b) Causal dependence tree approximation (7). Fig. 2. Directed information graph and a causal dependence tree approximation for the dynamical system in Example 1. three processes {X, Y, Z}, formed by including i.i.d. noises {B t, C t, D t } n to the above dynamical system and relabeling the time indices: X t+1 = X t + g 1 (X t, Y t, Z t ) + B t+1 Y t+1 = Y t + g 2 (X t, Y t, Z t ) + C t+1 Z t+1 = Z t + g 3 (X t, Y t, Z t ) + D t+1. The system can alternatively be described through the joint distribution (up to time n) as P X,Y,Z (dx, dy, dz) = n P Xt,Y t,z t X t 1,Y t 1,Z t 1(dx t, dy t, dz t x t 1, y t 1, z t 1 ). Because of the causal structure of the dynamical system and the statistical independence of the noises, given the full past, the present values are conditionally independent: P X,Y,Z (dx, dy, dz) = (4) n P Xt X t 1,Y t 1,Z t 1(dx t x t 1, y t 1, z t 1 ) P Yt X t 1,Y t 1,Z t 1(dy t x t 1, y t 1, z t 1 ) P Zt X t 1,Y t 1,Z t 1(dz t x t 1, y t 1, z t 1 ). More generally, we will make the analogous assumption about the chain rule and how each process at time t is conditionally independent of one another, given the full past of all processes. Assumption 1: Equation (4) holds and dp X dφ (x) > 0 for all x and some measure φ P X. 
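For concreteness, the stochastic system above can be simulated directly. The linear drift functions g1, g2, g3 below are hypothetical choices (not specified in the text) respecting the sparse structure of Figure 2(a): X and Y evolve from the pasts of (X, Y) only, while Z depends on all three.

```python
import random

def simulate(n, dt=0.1, seed=0):
    """Euler-type simulation of three coupled processes with i.i.d.
    Gaussian noises B, C, D, mirroring the structure of Figure 2(a).
    The drifts are illustrative stand-ins for g1, g2, g3."""
    rng = random.Random(seed)
    x, y, z = [0.0], [0.0], [0.0]
    for t in range(n - 1):
        g1 = -x[t] + 0.5 * y[t]              # uses past of X, Y only
        g2 = 0.5 * x[t] - y[t]               # uses past of X, Y only
        g3 = 0.3 * x[t] + 0.3 * y[t] - z[t]  # uses past of X, Y, Z
        x.append(x[t] + dt * g1 + rng.gauss(0.0, 0.1))
        y.append(y[t] + dt * g2 + rng.gauss(0.0, 0.1))
        z.append(z[t] + dt * g3 + rng.gauss(0.0, 0.1))
    return x, y, z
```

Because the noises are independent and each update depends only on the past, sample paths generated this way satisfy the factorization (4) by construction.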
A large class of stochastic systems satisfy Assumption 1. For example, coupled stochastic processes described by an Ito stochastic differential equation with independent Brownian noise satisfy the continuous-time equivalent of this assumption [2]. Granger argued that this is a valid assumption for real world systems, provided the sampling rate 1/ is high [3]. We can rewrite (4) using causal conditioning notation (2): P X,Y,Z (dx, dy, dz) = P X Y,Z (dx y, z)p Y X,Z (dy x, z) P Z X,Y (dz x, y). As in the deterministic case, often the evolution of one process does not depend on every other process, but only some subset. We can remove the unnecessary dependencies to obtain P X,Y,Z (dx, dy, dz) = P X Y (dx y)p Y X (dy x) P Z X,Y (dz x, y).

5 5 The dependence structure of this stochastic system is represented by Figure 2(a). We next generalize this procedure. For each process X i, let A(i) {1,..., m}\{i} denote a potential subset of parent processes. Define the corresponding induced probability measure P A : m P A (dx) = P Xi X A(i) (dx i x A(i) ). (5) To find a minimal graph, for each process X i, we would like to find the smallest set of parents that fully describes the dynamics of X i as well as the whole network does: D ( P X P A ) = 0. (6) In Example 1, the A(i) s would correspond to {Y}, {X}, and {X, Y} for X, Y, and Z, respectively. The parent sets {A(i)} n can be independently minimized so that (6) holds. With these minimal parent sets, we can define the graphical model we will use throughout this discussion. 2 Definition 4.1: A directed information graph is a directed graph, where each process is represented by a node, and there is a directed edge from X j to X i for i, j [m] iff j A(i), where the cardinalities { A(i) } m are minimal such that (6) holds. Lemma 4.2 ([2]): Under Assumption 1, directed information graphs are unique. Furthermore, for a given process X i, a directed edge is placed from j to i (j A(i)) if and only if I(X j X i X\{X j, X i }) > 0. Directed information graphs can have cycles, representing feedback between processes, and can even be complete. For some systems, there might be a large number of influences between processes, with varying magnitudes. For analysis and even storage purposes, it can be helpful to have succinct approximations. For the stochastic system of Example 1, we can apply a similar approximation to this system as was done in the discrete case with: P X Y (dx y) P X (dx) P Z X,Y (dz x, y) P Z Y (dz y). Thus, our causal dependence tree approximation to these stochastic processes, denoted by P X, is: P X (dx) P X (dx) P X (dx)p Y X (dy x)p Z Y (dz y). (7) This approximation is represented graphically in Figure 2(b). 
Although the system in Example 1 only had three processes, with a large number m of processes, the directed information graph could be quite complex, difficult to compute and analyze visually. As we will show, it is possible to construct efficient optimal tree-like approximations to the directed information graph, and these approximations do not suffer greatly in decision-making performance nor in visualization of relevant features. 2 In [2], minimal generative model graphs are defined by Definition 4.1. Under mild technical assumptions they are equivalent to directed information graphs; for clarity we refer to them together as directed information graphs. (a) Best parent approximation. (b) Best tree approximation. Fig. 3. Examples of directed information graph approximations. The best parent approximation is better in terms of KL divergence. However, the best tree approximation is connected and has a clearly distinguished root with paths from the root to all other nodes. Thus, it is more useful for applications such as targeted advertising. V. MAIN RESULT: BEST PARENT AND CAUSAL DEPENDENCE TREE APPROXIMATIONS We now describe two approaches to approximate joint distributions of networks of stochastic processes, with corresponding low complexity directed information graphs. In both cases, at most a single parent will be kept. The first case is an unconstrained optimization. The second constrains the approximating structure to be a causal dependence tree (this was presented in part at [27]). Minimizing the KL divergence between the full and approximating joint distributions in both cases will result in a sum of pairwise directed informations. We first examine the problem of finding the best approximation where each process has at most one parent. See Figure 3(a). Consider the joint distribution P X of m random processes {X 1, X 2,, X m }, each of length n. We will consider approximations of the form P X (dx) m P Xi X a(i) (dx i x a(i) ). 
(8) where a(i) {1,..., m}\{i} selects the parent. Let G 1 denote the set of all such approximations. We want to find the P X G 1 that minimizes the KL divergence. Theorem 1: arg min D(P X P X ) = arg max a(i) {1,...,m}\{i} Proof: First define the product distribution P X (dx) I(X a(i) X i ). (9) m P Xi (dx i ), (10) which is equivalent to P X (x) when the processes are statistically independent. Note that P X, P X, P X all lie in P (Ω), and moreover, P X P X P X. Thus, the Radon-Nikodym derivative dp X d P satisfies X the chain rule [28]: dp X d P X = dp X d P X d P X d P X. (11)

6 6 Thus, arg min D(P X P X ) = arg min = arg min = arg max = arg max = arg max E PX E PX E PX [ ] log dp X d P X [ ] [ + E PX log dp X d P X [ ] x log d P X d P X log d P X d P X ] (12) (13) log dp X i X a(i) =x a(i) dp Xi P X (dx) (14) ) D (P Xi Xa(i) =xa(i) P Xi P Xa(i) (dx) (15) x = arg max I(X a(i) X i ) (16) = arg max I(X a(i) X i ), (17) a(i) {1,...,m}\{i} where (12) applies the log to (11) and rearranges; (13) follows from dp X d P not depending on P X ; (14) follows from (8) and X (10); (15) follows from (1); (16) follows from (3); and (17) follows from the choice of each a(i) effecting only a single term in the sum. Thus, finding the optimal structure where each node has at most one parent is equivalent to individually maximizing pairwise directed informations. The process is described in Algorithm 1. Let R denote the set of all pairwise marginal distributions of P X : R = {P Xi,X j : i, j {1,..., m}}. Algorithm 1. Best Parent Input: R 1. For i {1,..., m} 2. a(i) 3. For j {1,..., m}\{i} 4. Compute I(X j X i ) 5. a(i) arg max I(X j X i ) j Algorithm 1 will return the best possible approximation where only pairwise interactions are preserved. It is possible, though, that PX could be disconnected. See Figure 3(a). For some applications, such as picking a single most influential user in a group of friends for targeted advertising, it is useful to have a connected structure with a dominant node. See Figure 3(b). Next consider the case where candidate approximations have causal dependence tree structures. The approximations have the form m P X (dx) P Xπ(i) X l(π(i)) (dx π(i) x l(π(i)) ) (18) where π is a permutation on {1,..., m} and 0 l(i) < i with X 0 denoting a deterministic constant (for the root node s dependence). Let T C denote the set of all possible causal dependence tree approximations. Like before, we want to find the P X T C that minimizes the KL divergence. Theorem 2: arg min D(P X P X ) = arg max I(X l(π(i)) X π(i) ). 
(19) P X T C P X T C Proof: The proof is similar to the proof of Theorem 1, except that (16) cannot be broken up, as the structural constraint couples choosing π( ) and l( ). Because the maximization became decoupled in Theorem 1, there was a simple algorithm to find the best structure, and that algorithm could be run in a distributed manner. Although that does not happen here, note that the optimal PX T C is maximizing a sum of pairwise directed information values. Each value corresponds to an edge weight for one directed edge in a complete directed graph on the m processes. To find the tree with maximal weight, we can employ a maximum-weight directed spanning tree (MWDST) algorithm. We discuss MWDST algorithms in Section VI-A. Algorithm 2 describes the procedure to find the best approximating distribution with a causal dependence tree structure. Algorithm 2. Causal Dependence Tree Input: R 1. For i {1,..., m} 2. a(i) 3. For j {1,..., m}\{i} 4. Compute I(X j X i ) 5. {a(i)} m MWDST ({I(X j X i )} 1 i j m ) Since T C contains simpler approximations than G 1, Algorithm 1 s approximations are superior to Algorithm 2 s in terms of KL divergence. For some applications, however, having a directed tree can be more useful for analysis and allocation of resources. Remark 3: Chow and Liu [4] solved an analogous problem for a collection of random variables. They developed an algorithm to efficiently find the best tree structured approximation for a Markov network (or, equivalently for that problem, a Bayesian network). They showed that using KL divergence, finding the best tree approximation was equivalent to maximizing a sum of mutual informations. They used a maximum weight spanning tree to solve the optimization. Thus, even though directed information graphs have different properties than Markov or Bayesian networks, and operate on a collection of random processes not variables, the method for finding the best tree is analogous. 
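Algorithms 1 and 2 can be sketched as follows, taking the pairwise directed informations as a given weight matrix di, with di[j][i] the weight of the candidate edge X_j → X_i. The tree search here is a brute-force stand-in, viable only for small m, for the O(m^2) Chu-Liu/Edmonds MWDST algorithm discussed in Section VI-A:

```python
import itertools

def best_parent(di):
    """Algorithm 1: each node independently keeps its highest-weight
    single parent; the result may be disconnected or contain cycles."""
    m = len(di)
    return [max((j for j in range(m) if j != i), key=lambda j: di[j][i])
            for i in range(m)]

def best_tree(di):
    """Algorithm 2 by exhaustive search: maximize the summed edge
    weights over all directed spanning trees (one root, every other
    node has exactly one parent, all nodes reachable from the root)."""
    m = len(di)
    best, best_w = None, -float('inf')
    options = [[None] + [j for j in range(m) if j != i] for i in range(m)]
    for parents in itertools.product(*options):
        if sum(p is None for p in parents) != 1:
            continue  # exactly one root
        root = parents.index(None)
        reached, frontier = {root}, [root]
        while frontier:            # check reachability from the root
            u = frontier.pop()
            for v in range(m):
                if parents[v] == u and v not in reached:
                    reached.add(v)
                    frontier.append(v)
        if len(reached) != m:
            continue  # parent map is not a tree
        w = sum(di[p][i] for i, p in enumerate(parents) if p is not None)
        if w > best_w:
            best, best_w = parents, w
    return best, best_w
```

Consistent with the discussion above, the unconstrained best-parent total weight always dominates the best-tree weight, since every spanning tree is one feasible parent assignment.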
Next, we consider the consistency of these algorithms in the setting of estimation from data. We discuss estimation in Section VII-A. Theorem 3: Suppose P_X ∈ T_C and the estimates {Î(X_j → X_i)}_{1 ≤ i ≠ j ≤ m} converge almost surely (a.s.). Then for the output P̂_X of Algorithm 2, P̂_X → P_X a.s. (20)

7 7 Proof: Since P X T C, by Lemma 4.2, P X is the unique tree structure with maximal sum of directed informations along its edges. Algorithm 2 finds the tree with maximal weight, and thus if the edge weights converge almost surely, then the tree estimate does also. Note that an analogous result holds for Algorithm 1 in the case P X G 1. In general, there could be multiple approximation structures in T C or G 1 with the same maximal weight, so P X might not converge, but the approximating structures picked would almost surely be among those of maximal weight. VI. COMPLEXITY In this section, we will discuss the complexity both of the algorithms and storage requirements for the solution. A. Algorithmic complexity Both algorithms first compute the directed information values between each pair. For discrete random processes, computing the directed information, a divergence (3), in general involves summations over exponentially large alphabets. Computing one directed information value for two processes of length n is O( X 2n ). If the distributions are assumed to be jointly Markov of order k, then it becomes linear O(n X 2k ) = O(n) for fixed k. Thus computing the directed information for each ordered pair of processes is O(nm 2 ) work when Markovicity is assumed. For both algorithms, computation of the directed informations can be done independently: the for loops in lines 1 and 4 of both algorithms can be done in a distributed fashion. Note that computing only pairwise relationships is computationally much more tractable than in the full case. To identify the true directed information graph, divergence calculations using the whole network of processes are used [2], requiring O( X mn ) time without Markovicity, and O(n X mk ) with Markovicity. Furthermore, the computation can reduced by calculating mutual informations initially for line 4 in both algorithms. Equation (4) holding means P X,Y = P X Y P Y X which implies I(X; Y) = I(X Y) + I(Y X) [14]. 
Since mutual and directed informations are non-negative, the mutual information upper-bounds each of the two directed informations; either directed information can later be computed to resolve both. After computing the pairwise directed informations, Algorithm 1 picks the best parent for each process, which takes $O(m^2)$ total, so its total runtime is $O(nm^2)$ assuming Markovicity. Algorithm 2 additionally computes a maximum weight spanning directed tree. Chu and Liu [29], Edmonds [30], and Bock [31] independently developed an efficient MWDST algorithm, which runs in $O(m^2)$ time. Thus, like Algorithm 1, Algorithm 2 runs in $O(nm^2)$. Note that Humblet [32] proposed a distributed MWDST algorithm, which constructs the maximum weight tree for each node as root in $O(m^2)$ time; in some applications, it is useful to be able to choose from multiple potential roots.

B. Storage complexity

The full joint distribution involves $mn$ variables, and each possible outcome may have a unique probability. Thus, for discrete variables with alphabet $\mathcal{X}$, storing the full joint distribution takes $O(|\mathcal{X}|^{mn})$ space. Both approximations we consider reduce the full joint distribution to $m$ pairwise distributions, so the storage is $O(m|\mathcal{X}|^{2n})$. Further, if the approximations are Markov of order $k$, the total storage becomes $O(mn|\mathcal{X}|^{2k}) = O(mn)$ for constant $k$.

VII. APPLICATIONS TO SIMULATED AND EXPERIMENTAL DATA

In this section, we demonstrate the efficacy of the approximations in a classification experiment with simulated time-series. We then show that the approximations capture important structural characteristics of a network of brain cells from a neuroscience experiment. First, we discuss parametric estimation of directed information from data.
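Before turning to parametric estimators, the linear-in-$n$ cost claimed in Section VI can be made concrete: for a small alphabet and fixed Markov order, a plug-in estimate needs only a single counting pass over the data. The following sketch (an illustration, not the estimator used in this paper) computes the order-1 directed information rate $I(X_{t-1}; Y_t \mid Y_{t-1})$ for binary sequences:

```python
import math
from collections import Counter

def di_rate_plugin(x, y):
    """Plug-in estimate of the order-1 directed information rate
    I(X_{t-1}; Y_t | Y_{t-1}) from two equal-length binary sequences.
    A single pass over the data, so O(n) for a fixed Markov order."""
    n = len(y) - 1
    pabc = Counter(zip(x[:-1], y[1:], y[:-1]))  # counts of (x_{t-1}, y_t, y_{t-1})
    pac = Counter(zip(x[:-1], y[:-1]))          # counts of (x_{t-1}, y_{t-1})
    pbc = Counter(zip(y[1:], y[:-1]))           # counts of (y_t, y_{t-1})
    pc = Counter(y[:-1])                        # counts of y_{t-1}
    # conditional mutual information of the empirical distribution
    return sum((k / n) * math.log(k * pc[c] / (pac[a, c] * pbc[b, c]))
               for (a, b, c), k in pabc.items())
```

Because this is the conditional mutual information of an empirical distribution, the estimate is always non-negative; higher Markov orders only lengthen the counted contexts, leaving the pass linear in $n$.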
A. Parametric Estimation

While a thorough discussion of estimation techniques is outside the scope of this work, for completeness we briefly describe the consistent parametric estimation technique for directed information proposed in [24] and [33] and applied to study brain cell networks. Afterwards, we discuss estimation for the specific setting of autoregressive time-series.

1) Point-Process Parametric Models: Let $Y$ and $X$ denote two binary time series of brain cell activity, with $Y_t = 1$ if cell $Y$ was active at time $t$ and $Y_t = 0$ otherwise. Truccolo et al. [34] proposed modeling how $Y$ depends on its own past and the past of $X$ using a point-process framework. The conditional log-likelihood has the form

$$\log f_{Y\|X}(y\|x;\theta) = \sum_{t=1}^{n} \Big[ \log\big(\lambda_\theta(t, y^{t-1}, x^{t-1})\,\Delta\big)\, y_t - \lambda_\theta(t, y^{t-1}, x^{t-1})\,\Delta \Big],$$

where $\Delta$ is the time length between samples and $\lambda_\theta(t, y^{t-1}, x^{t-1})$ is the conditional intensity function [34]

$$\log \lambda_\theta(t, y^{t-1}, x^{t-1}) = \alpha_0 + \sum_{j=1}^{J} \alpha_j y_{t-j} + \sum_{r=1}^{R} \beta_r x_{t-r}.$$

$\lambda_\theta(t, y^{t-1}, x^{t-1})$ can be interpreted as the propensity of $Y$ to be active at time $t$ based on its past and the past of $X$. The Markov orders $J$ and $R$ are assumed to be unknown. To avoid over-fitting, the minimum description length penalty [35] is used to select the MLE $\widehat{\theta}$:

$$(\widehat{J}, \widehat{R}, \widehat{\theta}) = \arg\min_{(J,R,\theta)} \; -\frac{1}{n} \log f_{Y\|X}(y\|x;\theta) + \frac{(J+R)\log n}{2n}.$$

This penalty balances the Shannon code-length of encoding $Y$ with causal side information $X$ using an MLE $\widehat{\theta}(J,R)$ against the code-length required to describe the MLE $\widehat{\theta}(J,R)$. The directed information estimate is

$$\widehat{I}(X \rightarrow Y) = \frac{1}{n} \log \frac{f_{Y\|X}(y\|x;\widehat{\theta})}{f_Y(y;\widehat{\theta}')}, \qquad (21)$$

where $\widehat{\theta}$ and $\widehat{\theta}'$ are the MLE parameter vectors for their respective models. Under stationarity, ergodicity, and Markovicity, almost sure convergence of $\widehat{I}(X \rightarrow Y)$ is shown in [24]. These results extend to general parametric classes. Also note that, for the setting of finite alphabets, [36] proposed universal estimation of directed information using context tree weighting.

2) Autoregressive Models: Next, consider the specific parametric class of autoregressive time-series. A Markov-order-one autoregressive (AR-1) model for $\mathbf{X}$ is

$$\mathbf{X}_t = B \mathbf{X}_{t-1} + \mathbf{N}_t, \qquad (22)$$

where $B$ is a coefficient matrix and $\mathbf{N}_t$ is i.i.d. white Gaussian noise with covariance matrix $\Sigma$. The noise components are assumed to be independent, so $\Sigma$ is diagonal. The coefficients $(B, \Sigma)$ are fixed, so for two processes $\mathbf{X} = (X, Y)$ modeled as AR-1,

$$I(X \rightarrow Y) = \frac{1}{n} \sum_{t=1}^{n} I(X_{t-1}; Y_t \mid Y_{t-1}) = \frac{1}{n}\, n\, I(X_{n-1}; Y_n \mid Y_{n-1}) \qquad (23)$$

$$= \frac{1}{2} \log \left[ \frac{ |K_{Y_n, Y_{n-1}}|\, |K_{X_{n-1}, Y_{n-1}}| }{ |K_{Y_{n-1}}|\, |K_{X_{n-1}, Y_n, Y_{n-1}}| } \right], \qquad (24)$$

where (23) follows from stationarity and Markovicity, (24) follows from (pg. 249 of [13]), and $|K_{Y_n, Y_{n-1}}|$ denotes the determinant of the covariance matrix of $\{Y_n, Y_{n-1}\}$. Note that by the recurrence relation (22), the covariance matrix $K_{\mathbf{X}_t, \mathbf{X}_{t'}}$ can be computed as

$$K_{\mathbf{X}_t, \mathbf{X}_{t'}} = \sum_{s=1}^{\min(t,t')} B^{t-s}\, \Sigma\, (B^{t'-s})^\top. \qquad (25)$$

Thus, estimates $\widehat{I}(X \rightarrow Y)$ can be computed by first finding the least squares estimate $\widehat{B}$ of the coefficient matrix in (22), then computing covariance matrices using (25), and finally computing (24).

B. Classification experiment

We tested the utility of the approximation methods in a binary classification experiment.

1) Setup: For each number of processes $m \in \{5, 10, 15\}$, 100 pairs of AR-1 models $(B, \Sigma)$ and $(B', \Sigma')$ were randomly generated. Each element of the $m \times m$ coefficient matrix $B$ was generated i.i.d. from a $\mathcal{N}(0,1)$ distribution. $\Sigma$ was an $m \times m$ diagonal matrix with entries selected uniformly at random from the interval $[\frac{1}{4}, 1]$.
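Equations (23)–(25) make the AR-1 directed information available in closed form. A sketch, assuming (as (25) does) that the process is started from zero; the function names and the choice of a large fixed $n$ to approximate stationarity are illustrative:

```python
import numpy as np

def cov_k(B, S, t, u):
    """K_{X_t, X_u} = sum_{s=1}^{min(t,u)} B^{t-s} S (B^{u-s})^T, as in (25)."""
    mp = np.linalg.matrix_power
    return sum(mp(B, t - s) @ S @ mp(B, u - s).T
               for s in range(1, min(t, u) + 1))

def di_ar1(B, S, i, j, n=60):
    """I(X_i -> X_j) for an AR-1 model via the determinant ratio (24),
    applied to the jointly Gaussian triple (X_{i,n-1}, X_{j,n}, X_{j,n-1})."""
    idx = [(n - 1, i), (n, j), (n - 1, j)]
    K = np.array([[cov_k(B, S, t, a)[p, q] for (a, q) in idx]
                  for (t, p) in idx])
    det = np.linalg.det
    num = det(K[np.ix_([1, 2], [1, 2])]) * det(K[np.ix_([0, 2], [0, 2])])
    return 0.5 * np.log(num / (K[2, 2] * det(K)))
```

For instance, with $B = [[0.5, 0], [0.4, 0.3]]$ the second component is driven by the first but not conversely, so the computed directed information is strictly positive in one direction and zero (up to numerical error) in the other.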
For each AR model $(B, \Sigma)$, time-series of lengths $n \in \{50, 10^2, 10^3, 10^4\}$ were generated using (22). The coefficients $(B, \Sigma)$ were estimated from each time-series using least squares, and the best parent and best tree approximations were computed using the estimated coefficients. The directed informations $\{\widehat{I}(X \rightarrow Y)\}$ between each pair $(X, Y)$ were estimated using the method in Section VII-A2 with $\mathbf{X} = (X, Y)$. To identify the MWDSTs, a Matlab implementation of Edmonds's algorithm [37] was used. Coefficients $(B', \Sigma')$ were generated and estimated likewise.

Next, classification was performed. For each pair of models $(B, \Sigma)$ and $(B', \Sigma')$, time-series of length $n = 10^6$ were generated from each model using (22). First, the log-likelihood of each time-step conditioned on the past was computed for the full distributions using the estimates of $(B, \Sigma)$ and $(B', \Sigma')$, and the frequency of correct classification was calculated. Next, the log-likelihoods were calculated using the best parent approximations with estimated coefficients, and then using the best tree approximations. This was repeated for each set of coefficient estimates, corresponding to $n \in \{50, 10^2, 10^3, 10^4\}$.

2) Results: The results of these classification experiments are shown in Figure 4. The classification rates are averaged over the 100 trials; error bars show standard deviation. The best parent approximations perform only slightly better than the best tree approximations. Both achieved close to an 85% correct classification rate, improving slightly with larger $m$. Classification using the full distribution improves noticeably with $m$. This is due to the increased complexity of the distributions: with more processes, there are more relationships to distinguish the distributions. There are $m(m-1)$ edges in the full distribution, compared to $m$ in the best parent and $m-1$ in the best tree approximations. Despite having significantly fewer edges, the approximations capture enough structure to distinguish the models.
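Given the matrix of estimated pairwise directed informations, the best parent step used above is just a per-node argmax; the best tree additionally requires a MWDST routine such as [37]. A minimal sketch of the parent selection (the MWDST step is omitted, and the function name is illustrative):

```python
import numpy as np

def best_parents(D):
    """D[i, j] = estimated I(X_i -> X_j). For each node j, pick the parent
    i != j with maximal incoming directed information, as in Algorithm 1."""
    D = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(D, -np.inf)   # a node cannot be its own parent
    return D.argmax(axis=0)        # parent[j] = argmax_i D[i, j]
```

This step costs $O(m^2)$ comparisons, matching the runtime analysis of Section VI-A.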
The effect of having a small number of samples to estimate the AR coefficients becomes more dramatic as $m$ increases. For $m \in \{5, 10, 15\}$, coefficients estimated with $n = 10^3$ and $n = 10^4$ length time-series performed almost identically.

C. Application to Experimental Data

We now discuss an application of these methods to the analysis of neural activity. A recent study computed the directed information graph for a group of neurons in a monkey's primary motor cortex [24]. Using that graph, the authors identified a dominant axis of local interactions, which corresponded to the known, primary direction of wave propagation of regional synchronous activity, believed to mediate information transfer [38]. We show that the best parent and best tree approximations preserve that dominant axis.

The monkey was performing a sequence of arm-reaching tasks, with its arm constrained to move along a horizontal surface. Each task involved the presentation of a randomly positioned, fixed target on the surface, the monkey moving its hand to meet the target, and a reward (a drop of juice) given to the monkey if it was successful. For more details, see [24], [39]. Neural activity in the primary motor cortex was recorded by an implanted silicon micro-electrode array. The recorded waveforms were filtered and processed to produce, for each detected neuron, a sequence of times when that neuron became active (i.e., "spiked"). The 37 neurons with the greatest total activity (number of spikes) were used for the analysis.

To study the flow of activity between individual neurons, we constructed a directed information graph on the collection of neurons. To simplify computation, pairwise directed informations were estimated using the parametric estimation procedure discussed in Section VII-A.
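A much simplified stand-in for that procedure can be sketched with a Bernoulli GLM in place of the point-process conditional-intensity model of Section VII-A1: fit the spiking probability of $Y$ with and without lags of $X$, and take the normalized log-likelihood ratio as in (21). The helper names, fixed lag orders, and plain gradient-ascent fit below are illustrative assumptions, not the estimator of [24]:

```python
import numpy as np

def fit_glm(Z, y, iters=500, lr=1.0):
    """MLE of a logistic (Bernoulli) GLM by plain gradient ascent."""
    w = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Z @ w)))
        w += lr * Z.T @ (y - p) / len(y)
    return w

def loglik(Z, y, w):
    """Bernoulli log-likelihood: sum of y*z - log(1 + e^z)."""
    z = Z @ w
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def di_glm(x, y, J=1, R=1):
    """Estimate I(X -> Y) as in (21): the normalized log-likelihood ratio of
    the model with X-lags versus the model with Y-lags only."""
    T = max(J, R)
    cols = [np.ones(len(y) - T)]
    cols += [y[T - j:len(y) - j] for j in range(1, J + 1)]   # y_{t-j}
    cols += [x[T - r:len(x) - r] for r in range(1, R + 1)]   # x_{t-r}
    Zf = np.column_stack(cols)
    yt = y[T:]
    Zr = Zf[:, :1 + J]   # reduced model: drop the X-lag columns
    return (loglik(Zf, yt, fit_glm(Zf, yt))
            - loglik(Zr, yt, fit_glm(Zr, yt))) / len(yt)
```

A full implementation would also search over $(J, R)$ with the MDL penalty of Section VII-A1 rather than fixing the lag orders.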

Fig. 4. Classification rate between pairs of autoregressive series, for (a) $m = 5$, (b) $m = 10$, and (c) $m = 15$. For each $m \in \{5, 10, 15\}$, 100 pairs of autoregressive coefficients were generated randomly. Classification was performed using the full structures, best parent approximations, and best tree approximations, with coefficients estimated from $n \in \{50, 10^2, 10^3, 10^4\}$ length time-series. Error bars depict standard deviation.

Fig. 5. (a) Graphical structure of non-zero pairwise directed information values from [24]. (b) Causal dependence tree approximation. The best parent approximation was almost identical and is not shown.

Figure 5(a) depicts the pairwise directed information graph. The relative positions of the neurons in the graph correspond to the relative positions of the recording electrodes. The blue arrow indicates a dominant orientation of the edges. This orientation, along the rostro-caudal axis, is consistent with the direction of propagation of local field potentials, which researchers believe mediates information transfer between regions [38].

We applied Algorithms 1 and 2 to this data set. The structure of the causal dependence tree approximation is shown in Figure 5(b). The best parent approximation is almost identical; the only differences are that the parents of nodes 28 and 13 are 27 and 3, respectively. The original graph had 117 edges with many complicated loops. Both approximations reduced the number of edges to roughly a third of that, improving the clarity of the graph.
Both approximations preserve the dominant edge orientation, pertaining to wave propagation, depicted by the blue arrow in Figure 5(a). This suggests that these approximation methodologies preserve information relevant for decision-making and visualization in the analysis of mechanistic biological phenomena.

VIII. CONCLUSION

In this work, we presented efficient methods to optimally approximate networks of stochastic, dynamically interacting processes with low-complexity structures. Both approximations require only pairwise marginal statistics between the processes, which are computationally significantly

more tractable than the full joint distribution. Also, the corresponding directed information graphs are much more accessible for analysis and practical use in many applications.

An important line of future work involves investigating methods to approximate with other, more complicated structures. Best parent approximations and causal dependence tree approximations will always reduce the storage complexity dramatically and facilitate analysis. However, for some applications it might be desirable to use slightly more complicated structures, such as connected graphs with at most three parents for each node. Such approximations highlight a richer set of interactions and feedback while still being visually and computationally simpler to analyze than the full structure. Although it might not always be possible to efficiently find optimal approximations of such graphical complexity, even near-optimal approximations could prove quite beneficial in real-world applications.

ACKNOWLEDGMENTS

The authors thank Jalal Etesami and Mavis Rodrigues for their assistance with computer simulations. This work was supported in part to C. J. Quinn by the NSF IGERT fellowship under grant number DGE , and the Department of Energy Computational Science Graduate Fellowship under grant number DE-FG02-97ER25308; to N. Kiyavash by AFOSR under grants FA , FA , and FA , and by NSF grant CCF CAR; and to T. P. Coleman by NSF Science & Technology Center grant CCF and NSF grant CCF .

REFERENCES

[1] D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques. The MIT Press,
[2] C. Quinn, N. Kiyavash, and T. Coleman, Directed information graphs, Arxiv preprint arxiv: ,
[3] C. Granger, Investigating causal relations by econometric models and cross-spectral methods, Econometrica, vol. 37, no. 3, pp ,
[4] C. Chow and C. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. on Information Theory, vol. 14, no. 3, pp ,
[5] C. Shalizi, M.
Camperi, and K. Klinkner, Discovering functional communities in dynamical networks, Statistical Network Analysis: Models, Issues, and New Directions, pp ,
[6] M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp ,
[7] A. Bolstad, B. Van Veen, and R. Nowak, Causal network inference via group sparse regularization, Signal Processing, IEEE Trans. on, vol. 59, no. 6, pp ,
[8] A. Puig, A. Wiesel, G. Fleury, and A. Hero, Multidimensional shrinkage-thresholding operator and group lasso penalties, Signal Processing Letters, IEEE, vol. 18, no. 6, pp , June
[9] V. Tan and A. Willsky, Sample complexity for topology estimation in networks of LTI systems, in Decision and Control, IEEE Conference on. IEEE,
[10] D. Materassi and G. Innocenti, Topological identification in networks of dynamical systems, Automatic Control, IEEE Trans. on, vol. 55, no. 8, pp ,
[11] D. Materassi and M. Salapaka, On the problem of reconstructing an unknown topology via locality properties of the Wiener filter, Automatic Control, IEEE Trans. on, no. 99, pp. 1 1,
[12] J. Etesami, N. Kiyavash, and T. P. Coleman, Learning minimal latent directed information trees, in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on. IEEE, 2012, pp
[13] T. Cover and J. Thomas, Elements of information theory. Wiley-Interscience,
[14] H. Marko, The bidirectional communication theory a generalization of information theory, Communications, IEEE Trans. on, vol. 21, no. 12, pp , Dec
[15] J. Massey, Causality, feedback and directed information, in Proc Intl. Symp. on Info. Th. and its Applications, 1990, pp
[16] G. Kramer, Directed information for channels with feedback, Ph.D. dissertation, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland,
[17] S. Tatikonda and S. Mitter, The Capacity of Channels With Feedback, IEEE Trans. on Information Theory, vol.
55, no. 1, pp ,
[18] H. Permuter, T. Weissman, and A. Goldsmith, Finite State Channels With Time-Invariant Deterministic Feedback, IEEE Trans. on Information Theory, vol. 55, no. 2, pp ,
[19] C. Li and N. Elia, The information flow and capacity of channels with noisy feedback, arxiv preprint arxiv: ,
[20] R. Venkataramanan and S. Pradhan, Source coding with feed-forward: rate-distortion theorems and error exponents for a general source, IEEE Trans. on Information Theory, vol. 53, no. 6, pp ,
[21] N. Martins and M. Dahleh, Feedback control in the presence of noisy channels: Bode-like fundamental limitations of performance, Automatic Control, IEEE Trans. on, vol. 53, no. 7, pp , Aug
[22] S. K. Gorantla, The interplay between information and control theory within interactive decision-making problems, Ph.D. dissertation, University of Illinois at Urbana-Champaign,
[23] H. Permuter, Y. Kim, and T. Weissman, Interpretations of directed information in portfolio theory, data compression, and hypothesis testing, Information Theory, IEEE Trans. on, vol. 57, no. 6, pp ,
[24] C. Quinn, T. Coleman, N. Kiyavash, and N. Hatsopoulos, Estimating the directed information to infer causal relationships in ensemble neural spike train recordings, Journal of Computational Neuroscience, vol. 30, no. 1, pp ,
[25] R. M. Gray, Probability, random processes, and ergodic properties. Springer,
[26] T. Weissman, Y.-H. Kim, and H. H. Permuter, Directed Information, Causal Estimation, and Communication in Continuous Time, ArXiv e-prints, Sep
[27] C. Quinn, T. Coleman, and N. Kiyavash, Approximating discrete probability distributions with causal dependence trees, in Info. Theory and its App. (ISITA), 2010 Intl. Symp. on. IEEE, 2010, pp
[28] H. Royden and P. Fitzpatrick, Real analysis, 3rd ed. Macmillan New York,
[29] Y. Chu and T. Liu, On the shortest arborescence of a directed graph, Science Sinica, vol. 14, no , p. 270,
[30] J. Edmonds, Optimum branchings, J. Res. Natl. Bur. Stand., Sect.
B, vol. 71, pp ,
[31] F. Bock, An algorithm to construct a minimum directed spanning tree in a directed network, Developments in operations research, vol. 1, pp ,
[32] P. Humblet, A distributed algorithm for minimum weight directed spanning trees, Communications, IEEE Trans. on, vol. 31, no. 6, pp ,
[33] S. Kim, D. Putrino, S. Ghosh, and E. N. Brown, A Granger causality measure for point process models of ensemble neural spiking activity, PLoS Comput Biol, vol. 7, no. 3, March
[34] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, and E. N. Brown, A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects, Journal of Neurophysiology, vol. 93, no. 2, pp ,
[35] P. D. Grünwald, The minimum description length principle. MIT press,
[36] J. Jiao, H. Permuter, L. Zhao, Y. Kim, and T. Weissman, Universal estimation of directed information, Arxiv preprint arxiv: ,
[37] G. Li, Maximum Weight Spanning tree (Undirected), Computer software, June [Online]. Available: maximum-weight-spanning-tree-undirected
[38] D. Rubino, K. Robbins, and N. Hatsopoulos, Propagating waves mediate information transfer in the motor cortex, Nature neuroscience, vol. 9, no. 12, pp ,
[39] W. Wu and N. Hatsopoulos, Evidence against a single coordinate system representation in the motor cortex, Experimental brain research, vol. 175, no. 2, pp , 2006.

IEEE Transactions on Signal Processing, vol. 61, no. 12, June 15, 2013 (p. 3173).

More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Dinesh Krithivasan and S. Sandeep Pradhan Department of Electrical Engineering and Computer Science,

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Scalable robust hypothesis tests using graphical models

Scalable robust hypothesis tests using graphical models Scalable robust hypothesis tests using graphical models Umamahesh Srinivas ipal Group Meeting October 22, 2010 Binary hypothesis testing problem Random vector x = (x 1,...,x n ) R n generated from either

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

Information maximization in a network of linear neurons

Information maximization in a network of linear neurons Information maximization in a network of linear neurons Holger Arnold May 30, 005 1 Introduction It is known since the work of Hubel and Wiesel [3], that many cells in the early visual areas of mammals

More information

Inference and estimation in probabilistic time series models

Inference and estimation in probabilistic time series models 1 Inference and estimation in probabilistic time series models David Barber, A Taylan Cemgil and Silvia Chiappa 11 Time series The term time series refers to data that can be represented as a sequence

More information

Online Forest Density Estimation

Online Forest Density Estimation Online Forest Density Estimation Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr UAI 16 1 Outline 1 Probabilistic Graphical Models 2 Online Density Estimation 3 Online Forest Density

More information

On Scalable Coding in the Presence of Decoder Side Information

On Scalable Coding in the Presence of Decoder Side Information On Scalable Coding in the Presence of Decoder Side Information Emrah Akyol, Urbashi Mitra Dep. of Electrical Eng. USC, CA, US Email: {eakyol, ubli}@usc.edu Ertem Tuncel Dep. of Electrical Eng. UC Riverside,

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds.

Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. Proceedings of the 2016 Winter Simulation Conference T. M. K. Roeder, P. I. Frazier, R. Szechtman, E. Zhou, T. Huschka, and S. E. Chick, eds. A SIMULATION-BASED COMPARISON OF MAXIMUM ENTROPY AND COPULA

More information

QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS

QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS Parvathinathan Venkitasubramaniam, Gökhan Mergen, Lang Tong and Ananthram Swami ABSTRACT We study the problem of quantization for

More information

Observed Brain Dynamics

Observed Brain Dynamics Observed Brain Dynamics Partha P. Mitra Hemant Bokil OXTORD UNIVERSITY PRESS 2008 \ PART I Conceptual Background 1 1 Why Study Brain Dynamics? 3 1.1 Why Dynamics? An Active Perspective 3 Vi Qimnü^iQ^Dv.aamics'v

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

PERFORMANCE STUDY OF CAUSALITY MEASURES

PERFORMANCE STUDY OF CAUSALITY MEASURES PERFORMANCE STUDY OF CAUSALITY MEASURES T. Bořil, P. Sovka Department of Circuit Theory, Faculty of Electrical Engineering, Czech Technical University in Prague Abstract Analysis of dynamic relations in

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Independent Component Analysis. Contents

Independent Component Analysis. Contents Contents Preface xvii 1 Introduction 1 1.1 Linear representation of multivariate data 1 1.1.1 The general statistical setting 1 1.1.2 Dimension reduction methods 2 1.1.3 Independence as a guiding principle

More information

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs Professor Erik Sudderth Brown University Computer Science October 6, 2016 Some figures and materials courtesy

More information

Layered Synthesis of Latent Gaussian Trees

Layered Synthesis of Latent Gaussian Trees Layered Synthesis of Latent Gaussian Trees Ali Moharrer, Shuangqing Wei, George T. Amariucai, and Jing Deng arxiv:1608.04484v2 [cs.it] 7 May 2017 Abstract A new synthesis scheme is proposed to generate

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

More information

Bayesian probability theory and generative models

Bayesian probability theory and generative models Bayesian probability theory and generative models Bruno A. Olshausen November 8, 2006 Abstract Bayesian probability theory provides a mathematical framework for peforming inference, or reasoning, using

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy Coding and Information Theory Chris Williams, School of Informatics, University of Edinburgh Overview What is information theory? Entropy Coding Information Theory Shannon (1948): Information theory is

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

On Information Maximization and Blind Signal Deconvolution

On Information Maximization and Blind Signal Deconvolution On Information Maximization and Blind Signal Deconvolution A Röbel Technical University of Berlin, Institute of Communication Sciences email: roebel@kgwtu-berlinde Abstract: In the following paper we investigate

More information

Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels

Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Superposition Encoding and Partial Decoding Is Optimal for a Class of Z-interference Channels Nan Liu and Andrea Goldsmith Department of Electrical Engineering Stanford University, Stanford CA 94305 Email:

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

A variational radial basis function approximation for diffusion processes

A variational radial basis function approximation for diffusion processes A variational radial basis function approximation for diffusion processes Michail D. Vrettas, Dan Cornford and Yuan Shen Aston University - Neural Computing Research Group Aston Triangle, Birmingham B4

More information

A Monte Carlo Sequential Estimation for Point Process Optimum Filtering

A Monte Carlo Sequential Estimation for Point Process Optimum Filtering 2006 International Joint Conference on Neural Networks Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006 A Monte Carlo Sequential Estimation for Point Process Optimum Filtering

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

On Common Information and the Encoding of Sources that are Not Successively Refinable

On Common Information and the Encoding of Sources that are Not Successively Refinable On Common Information and the Encoding of Sources that are Not Successively Refinable Kumar Viswanatha, Emrah Akyol, Tejaswi Nanjundaswamy and Kenneth Rose ECE Department, University of California - Santa

More information

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 Probabilistic

More information

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality

More information

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing Linear recurrent networks Simpler, much more amenable to analytic treatment E.g. by choosing + ( + ) = Firing rates can be negative Approximates dynamics around fixed point Approximation often reasonable

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Parametrizations of Discrete Graphical Models

Parametrizations of Discrete Graphical Models Parametrizations of Discrete Graphical Models Robin J. Evans www.stat.washington.edu/ rje42 10th August 2011 1/34 Outline 1 Introduction Graphical Models Acyclic Directed Mixed Graphs Two Problems 2 Ingenuous

More information

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,

More information

Directed Graphical Models or Bayesian Networks

Directed Graphical Models or Bayesian Networks Directed Graphical Models or Bayesian Networks Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Bayesian Networks One of the most exciting recent advancements in statistical AI Compact

More information

Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) September 26 & October 3, 2017 Section 1 Preliminaries Kullback-Leibler divergence KL divergence (continuous case) p(x) andq(x) are two density distributions. Then the KL-divergence is defined as Z KL(p

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information