A Spectral Algorithm for Latent Junction Trees


Ankur P. Parikh (Carnegie Mellon University), Le Song (Georgia Tech), Mariya Ishteva (Georgia Tech), Gabi Teodoru (Gatsby Unit, UCL), Eric P. Xing (Carnegie Mellon University, epxing@cs.cmu.edu)

Abstract

Latent variable models are an elegant framework for capturing rich probabilistic dependencies in many applications. However, current approaches typically parametrize these models using conditional probability tables, and learning relies predominantly on local search heuristics such as Expectation Maximization. Using tensor algebra, we propose an alternative parameterization of latent variable models (where the model structures are junction trees) that still allows for computation of marginals among observed variables. While this novel representation leads to a moderate increase in the number of parameters for junction trees of low treewidth, it lets us design a local-minimum-free algorithm for learning this parameterization. The main computation of the algorithm involves only tensor operations and SVDs, which can be orders of magnitude faster than EM for large datasets. To our knowledge, this is the first provably consistent parameter learning technique for a large class of low-treewidth latent graphical models beyond trees. We demonstrate the advantages of our method on synthetic and real datasets.

1 Introduction

Latent variable models such as Hidden Markov Models (HMMs) (Rabiner & Juang, 1986) and Latent Dirichlet Allocation (Blei et al., 2003) have become a popular framework for modeling complex dependencies among variables. A latent variable can represent an abstract concept (such as a topic or a state), thus enriching the dependency structure among the observed variables while simultaneously allowing for a more tractable representation. Typically, a latent variable model is parameterized by a set of conditional probability tables (CPTs), each associated with an edge in the latent graph structure. For instance, an HMM can be parametrized compactly by a transition probability table and an observation probability table. By summing out the latent variables in the HMM, we obtain a fully connected graphical model for the observed variables.

Although the parametrization of latent variable models using CPTs is very compact, parameters in this representation can be difficult to learn. Compared to parameter learning in fully observed models, which is either closed form or convex (Koller & Friedman, 2009), most parameter learning algorithms for latent variable models resort to maximizing a non-convex objective via Expectation Maximization (EM) (Dempster et al., 1977). EM can get trapped in local optima and has slow convergence.

While EM explicitly learns the CPTs of a latent variable model, in many cases the goal of the model is primarily prediction, and thus the actual latent parameters are not needed. One example is determining splicing sites in DNA sequences (Asuncion & Newman, 2007). One can build a different latent variable model, such as an HMM, for each type of splice site from training data. A new sequence is then classified by determining which model it is most likely to have been generated by. Other examples include supervised topic modelling (Blei & McAuliffe, 2007; Lacoste-Julien et al., 2008; Zhu et al., 2009) and collaborative filtering (Su & Khoshgoftaar, 2009).
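The splice-site example above is essentially likelihood-based model selection: fit one latent variable model per class and score a new sequence under each. As a concrete, hypothetical illustration of that scoring step, the sketch below evaluates a sequence under several HMMs with the forward algorithm; the per-class transition and emission tables are made up for the example and are not parameters from this paper.

```python
import numpy as np

def hmm_log_likelihood(obs, pi, T, E):
    """Forward algorithm: log P(obs) for an HMM with initial distribution pi,
    transitions T[i, j] = P(h_t = j | h_{t-1} = i), emissions E[i, o] = P(x_t = o | h_t = i)."""
    alpha = pi * E[:, obs[0]]                      # P(h_1, x_1)
    logp = np.log(alpha.sum()); alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]              # P(h_t, x_t | x_{1:t-1})
        logp += np.log(alpha.sum()); alpha /= alpha.sum()
    return logp

# Hypothetical per-class HMMs (k_h = 2 hidden states, k_o = 4 symbols A, C, G, T).
rng = np.random.default_rng(0)
def random_hmm():
    return (rng.dirichlet(np.ones(2)),             # initial state distribution
            rng.dirichlet(np.ones(2), size=2),     # transition matrix (rows sum to 1)
            rng.dirichlet(np.ones(4), size=2))     # emission matrix (rows sum to 1)

models = {"exon/intron": random_hmm(), "intron/exon": random_hmm(), "neither": random_hmm()}
sequence = rng.integers(0, 4, size=60)             # a length-60 DNA sequence encoded as 0..3

# Classify by the model under which the sequence is most likely.
scores = {c: hmm_log_likelihood(sequence, *m) for c, m in models.items()}
print(max(scores, key=scores.get))
```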
In these cases, it is natural to ask whether there exists an alternative representation/parameterization of a latent variable model where parameter learning can be done consistently and the representation remains tractable for inference among the observed variables. This question has been tackled recently by Hsu et al. (2009), Balle et al. (2011), and Parikh et al. (2011), who proposed spectral algorithms for local-minimum-free learning of HMMs, finite state transducers, and latent tree graphical models respectively. Unlike traditional parameter learning algorithms such as EM, spectral algorithms do not directly learn the CPTs of a latent variable model.

Instead they learn an alternative parameterization (called the observable representation) which generally contains a larger number of parameters than the CPTs, but where computing observed marginals is still tractable. Moreover, these alternative parameters have the advantage that they only depend on observed variables and can therefore be directly estimated from data. Thus, parameter learning in the alternative representation is fast, local-minimum-free, and provably consistent. Furthermore, spectral algorithms can be generalized to nonparametric latent models (Song et al., 2010, 2011) where it is difficult to run EM.

However, existing spectral algorithms apply only to restricted latent structures (HMMs and latent trees), while latent structures beyond trees, such as higher order HMMs (Kundu et al., 1989), factorial HMMs (Ghahramani & Jordan, 1997) and Dynamic Bayesian Networks (Murphy, 2002), are needed and have been proven useful in many real world problems. The challenges for generalizing spectral algorithms to general latent structured models include the larger factors, more complicated conditional independence structures, and the need to sum out multiple variables simultaneously.

The goal of this paper is to develop a new representation for latent variable models with structures beyond trees, and design a spectral algorithm for learning this representation. We will focus on latent junction trees; thus the algorithm is suitable for both directed and undirected models which can be transformed into junction trees. Concurrently to our work, Cohen et al. (2012) proposed a spectral algorithm for Latent Probabilistic Context Free Grammars (PCFGs). Latent PCFGs are not trees, but have many tree-like properties, and so the representation Cohen et al. (2012) propose does not easily extend to other non-tree models such as higher order/factorial HMMs that we consider here. Our more general approach requires more complex tensor operations, such as multi-mode inversion, that are not used in the latent PCFG case.

Figure 1: Our algorithm for local-minimum-free learning of latent variable models consists of four major steps (illustrated by panels: latent variable model, latent junction tree, tensor representation, transformed tensor representation, estimation). (1) First, we transform a model into a junction tree, such that each node in the junction tree corresponds to a maximal clique of variables in the triangulated graph of the original model. (2) Then we embed the clique potentials of the junction tree into higher order tensors and express the marginal distribution of the observed variables as a tensor-tensor/matrix multiplication according to the message passing algorithm. (3) Next we transform the tensor representation by inserting a pair of transformations between those tensor-tensor/matrix operations. Each pair of transformations is chosen so that they are inversions of each other. (4) Lastly, we show that each transformed representation is a function of only observed variables. Thus, we can estimate each individual transformed tensor quantity using samples from observed variables.
The key idea of our approach is to embed the clique potentials of the junction tree into higher order tensors such that the computation of the marginal probability of observed variables can be carried out via tensor operations. While this novel representation leads only to a moderate increase in the number of parameters for junction trees of low treewidth, it allows us to design an algorithm that can recover a transformed version of the tensor parameterization and ensure that the joint probability of observed variables is computed correctly and consistently. The main computation of the algorithm involves only tensor operations and singular value decompositions (hence the name "spectral"), which can be orders of magnitude faster than EM on large datasets. To our knowledge, this is the first provably consistent parameter learning technique for a large class of low-treewidth latent graphical models beyond trees. In our experiments with large scale synthetic datasets, we show that our spectral algorithm can be almost 2 orders of magnitude faster than EM while at the same time achieving considerably better accuracy. Our spectral algorithm also achieves comparable accuracy to EM on real data.

Organization of paper. A high level overview of our approach is given in Figure 1. We first provide some background on tensor algebra and latent junction trees. We then derive the spectral algorithm by representing junction tree message passing with tensor operations, and then transform this representation into one that only depends on observed variables. Finally, we analyze the sample complexity of our method and evaluate it on synthetic and real datasets.

2 Tensor Notation

We first give an introduction to the tensor notation tailored to this paper. An $N$th order tensor is a multiway array with $N$ modes, i.e., $N$ indices $\{i_1, i_2, \ldots, i_N\}$ are needed to access its entries.

Subarrays of a tensor are formed when a subset of the indices is fixed, and we use a colon to denote all elements of a mode. For instance, $A(i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_N)$ denotes all elements in the $n$th mode of a tensor $A$ with the indices of the other $N-1$ modes fixed to $\{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N\}$ respectively. Furthermore, we also use the shorthand $i_{p:q} = \{i_p, i_{p+1}, \ldots, i_{q-1}, i_q\}$ for consecutive indices, e.g., $A(i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_N) = A(i_{1:n-1}, :, i_{n+1:N})$.

Labeling tensor modes with variables. In contrast to the conventional tensor notation such as the one described in Kolda & Bader (2009), the ordering of the modes of a tensor will not be essential in this paper. We will use random variables to label the modes of a tensor: each mode will correspond to a random variable, and what is important is to keep track of this correspondence. Therefore, we consider two tensors equivalent if they have the same set of labels and they can be obtained from each other by a permutation of the modes for which the labels are aligned. In the matrix case this translates to $A$ and $A^\top$ being equivalent, in the sense that $A^\top$ carries the same information as $A$ as long as we remember that the rows of $A^\top$ are the columns of $A$ and vice versa. We will use the following notation to denote this equivalence:

$A \cong A^\top. \quad (1)$

Under this notation, the dimension (or size) of a mode labeled by variable $X$ will be the same as the number of possible values for variable $X$. Furthermore, when we multiply two tensors together, we will always carry out the operation along (a set of) modes with matching labels.

Tensor multiplication with mode labels. Let $A \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ be an $N$th order tensor and $B \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$ be an $M$th order tensor. If $X$ is a common mode label for both $A$ and $B$ (w.l.o.g. we assume that this is the first mode, implying also that $I_1 = J_1$), multiplying along this mode will give

$C = A \times_X B \in \mathbb{R}^{I_2 \times \cdots \times I_N \times J_2 \times \cdots \times J_M}, \quad (2)$

where the entries of $C$ are defined as $C(i_{2:N}, j_{2:M}) = \sum_{i=1}^{I_1} A(i, i_{2:N}) \, B(i, j_{2:M})$.

Similarly, we can multiply two tensors along multiple modes. Let $\sigma = \{X_1, \ldots, X_k\}$ be an arbitrary set of $k$ modes ($k$ variables) shared by $A$ and $B$ (w.l.o.g. we assume these labels correspond to the first $k$ modes, and $I_1 = J_1, \ldots, I_k = J_k$ holds for the corresponding dimensions). Then multiplying $A$ and $B$ along $\sigma$ results in

$D = A \times_\sigma B \in \mathbb{R}^{I_{k+1} \times \cdots \times I_N \times J_{k+1} \times \cdots \times J_M}, \quad (3)$

where the entries of $D$ are defined as $D(i_{k+1:N}, j_{k+1:M}) = \sum_{i_{1:k}} A(i_{1:k}, i_{k+1:N}) \, B(i_{1:k}, j_{k+1:M})$. Multi-mode multiplication can also be interpreted as reshaping the $\sigma$ modes of $A$ and $B$ into a single mode and doing single-mode tensor multiplication. Furthermore, tensor multiplication with labels is symmetric in its arguments, i.e., $A \times_\sigma B = B \times_\sigma A$.

Mode-specific identity tensor. We now define our notion of identity tensor with respect to a set of modes $\sigma = \{X_1, \ldots, X_K\}$. Let $A$ be a tensor with mode labels containing $\sigma$, and let $I_\sigma$ be a tensor with $2K$ modes with mode labels $\{X_1, \ldots, X_K, X_1, \ldots, X_K\}$. Then $I_\sigma$ is an identity tensor with respect to modes $\sigma$ if

$A \times_\sigma I_\sigma = A. \quad (4)$

One can also understand $I_\sigma$ using its matrix representation: flattening $I_\sigma$ with respect to $\sigma$ (the first $\sigma$ modes mapped to rows and the second $\sigma$ modes mapped to columns) results in an identity matrix.

Mode-specific tensor inversion. Let $F, F^{-1} \in \mathbb{R}^{I_1 \times \cdots \times I_K \times I_{K+1} \times \cdots \times I_{K+K'}}$ be tensors of order $K + K'$, both with two sets of mode labels, $\sigma = \{X_1, \ldots, X_K\}$ and $\omega = \{X_{K+1}, \ldots, X_{K+K'}\}$. Then $F^{-1}$ is the inverse of $F$ w.r.t. modes $\omega$ if and only if

$F \times_\omega F^{-1} = I_\sigma. \quad (5)$

Multi-mode inversion can also be interpreted as reshaping $F$ with respect to $\omega$ into a matrix of size $(I_1 \cdots I_K) \times (I_{K+1} \cdots I_{K+K'})$, taking the inverse, and then rearranging back into a tensor. Thus the existence and uniqueness of this inverse can be characterized by the rank of the matricized version of $F$.
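For readers who prefer code: multiplication along labeled modes corresponds directly to an einsum contraction in NumPy, and the mode-specific inverse of Eq. (5) can be realized by matricizing with respect to $\omega$, (pseudo-)inverting, and reshaping back. The sketch below is a minimal illustration under assumed small dimensions; the mode-label bookkeeping is done by hand here rather than with an explicit label data structure.

```python
import numpy as np
rng = np.random.default_rng(0)

# A has modes (X, Y, Z) and B has modes (X, Y, W); multiplying along sigma = {X, Y} (Eq. 3)
# contracts the matching labels, whatever positions they occupy.
A = rng.random((2, 3, 4))                        # modes X, Y, Z
B = rng.random((2, 3, 5))                        # modes X, Y, W
D = np.einsum('xyz,xyw->zw', A, B)               # D(z, w) = sum_{x, y} A(x, y, z) B(x, y, w)

# Mode-specific inversion (Eq. 5): F has mode sets sigma = {X, Y} and omega = {Z}.
# Matricize w.r.t. omega, (pseudo-)invert, and reshape back.
F = rng.random((2, 3, 6))                        # modes X, Y, Z  (dim Z = dim X * dim Y here)
F_inv = np.linalg.pinv(F.reshape(2 * 3, 6)).reshape(6, 2, 3)   # modes Z, X, Y

# Check F x_omega F_inv = I_sigma: contracting Z gives the identity on the (X, Y) modes.
I_sigma = np.einsum('xyz,zab->xyab', F, F_inv)
print(np.allclose(I_sigma.reshape(6, 6), np.eye(6)))           # True
```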
Mode-specific diagonal tensors. We use $\delta$ to denote an $N$-way relation: its entry $\delta(i_{1:N})$ at position $i_{1:N}$ equals 1 when all indices are the same ($i_1 = i_2 = \ldots = i_N$), and 0 otherwise. We will use $\otimes^d$ to denote repetition of an index $d$ times. For instance, we use $P(\otimes^d X)$ to denote a $d$th order tensor whose entry at position $(i_{1:d})$ is specified by $\delta(i_{1:d}) \, P(X = x_{i_1})$. A diagonal matrix with its diagonal equal to $P(X)$ is then denoted as $P(\otimes^2 X)$. Similarly, we can define a $(d + d')$th order tensor $P(\otimes^d X \mid \otimes^{d'} Y)$ whose $(i_{1:d}, j_{1:d'})$th entry corresponds to $\delta(i_{1:d}) \, \delta(j_{1:d'}) \, P(X = x_{i_1} \mid Y = y_{j_1})$.
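A small sketch of how such diagonal tensors can be built in NumPy, with made-up distributions: only positions where all copies of a repeated index agree receive the probability value.

```python
import numpy as np
rng = np.random.default_rng(0)

def diag_embed(table, d_counts):
    """Repeat the j-th mode of `table` d_counts[j] times: the result is nonzero only
    where all copies of an index agree, and there it equals the original entry."""
    out = np.zeros(sum(([table.shape[j]] * d for j, d in enumerate(d_counts)), []))
    for idx in np.ndindex(table.shape):
        rep = sum(([i] * d for i, d in zip(idx, d_counts)), [])
        out[tuple(rep)] = table[idx]
    return out

pX = rng.dirichlet(np.ones(3))               # P(X) with 3 states
P2X = diag_embed(pX, [2])                    # P(otimes^2 X): a diagonal matrix
print(np.allclose(P2X, np.diag(pX)))         # True

pX_Y = rng.dirichlet(np.ones(3), size=4).T   # P(X | Y): column y is a distribution over X
P2X_Y = diag_embed(pX_Y, [2, 1])             # P(otimes^2 X | Y): zero off the X "diagonal"
print(P2X_Y.shape, P2X_Y[0, 1].sum())        # (3, 3, 4) 0.0
```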

3 Latent Junction Trees

In this paper, we will focus on discrete latent variable models where the number of states, $k_h$, for each hidden variable is much smaller than the number of states, $k_o$, for each observed variable. Uppercase letters denote random variables (e.g., $X_i$) and lowercase letters their instantiations (e.g., $x_i$). A latent variable model defines a joint probability distribution over a set of variables $\mathcal{X} = \mathcal{O} \cup \mathcal{H}$. Here, $\mathcal{O}$ denotes the set of observed variables, $\{X_1, \ldots, X_{|\mathcal{O}|}\}$, and $\mathcal{H}$ denotes the set of hidden variables, $\{X_{|\mathcal{O}|+1}, \ldots, X_{|\mathcal{O}|+|\mathcal{H}|}\}$. We will focus on latent variable models where the structure of the model is a junction tree of low treewidth (Cowell et al., 1999).

Each node $C_i$ in a junction tree corresponds to a subset (clique) of variables from the original graphical model. We will also use $C_i$ to denote the collection of variables contained in the node, i.e., $C_i \subseteq \mathcal{X}$. Let $\mathcal{C}$ denote the set of all clique nodes. The treewidth is then the size of a largest clique in a junction tree minus one, that is $t = \max_{C_i \in \mathcal{C}} |C_i| - 1$. Furthermore, we associate each edge in a junction tree with a separator set $S_{ij} := C_i \cap C_j$, which contains the common variables of the two cliques $C_i$ and $C_j$ it is connected to. If we condition on all variables in any $S_{ij}$, the variables on different sides of $S_{ij}$ become independent.

Without loss of generality, we assume that each internal clique node in the junction tree has exactly 3 neighbors (if this is not the case, the derivation is similar but notationally much heavier). Then we can pick a clique $C_r$ as the root of the tree and reorient all edges away from the root to induce a topological ordering of the clique nodes. Given the ordering, the root node will have 3 children nodes, denoted as $C_{r1}$, $C_{r2}$ and $C_{r3}$. Each other internal node $C_i$ will have a unique parent node, denoted as $C_{i0}$, and 2 children nodes denoted as $C_{i1}$ and $C_{i2}$. Each leaf node $C_l$ is only connected with its unique parent node $C_{l0}$. Furthermore, we can simplify the notation for the separator set between a node $C_i$ and its parent $C_{i0}$ as $S_i = C_i \cap C_{i0}$, omitting the index for the parent node. The remainder set of a node is then defined as $R_i = C_i \setminus S_i$. We also assume w.l.o.g. that if $C_i$ is a leaf in the junction tree, $R_i$ consists of only observed variables. We will use $r_i$ to denote an instantiation of the set of variables in $R_i$. See Figure 2 for an illustration of the notation.

Figure 2: Example latent variable model with variables $\mathcal{X} = \{A, B, C, D, E, F, G, H, I, \ldots\}$, where the observed variables are $\mathcal{O} = \{F, G, H, \ldots\}$ (only partially drawn). Its corresponding junction tree, with cliques such as ACH, ACE, BCDE, BCF and BDG, is shown in the middle panel. Corresponding to this junction tree, we also show the general notation for it in the rightmost panel.

Given a root and a topological ordering of the nodes in a junction tree, the joint distribution of all variables $\mathcal{X}$ can be factorized according to

$P(\mathcal{X}) = \prod_{i=1}^{|\mathcal{C}|} P(R_i \mid S_i), \quad (6)$

where each CPT $P(R_i \mid S_i)$, also called a clique potential, corresponds to a node $C_i$. The number of parameters needed to specify the model is $O(|\mathcal{C}| \, k_o^t)$, linear in the number of cliques but exponential in the treewidth $t$. The marginal distribution of the observed variables can then be obtained by summing over the latent variables,

$P(\mathcal{O}) = \sum_{X_{|\mathcal{O}|+1}} \cdots \sum_{X_{|\mathcal{O}|+|\mathcal{H}|}} \prod_{i=1}^{|\mathcal{C}|} P(R_i \mid S_i), \quad (7)$

where we use $\sum_X \phi(x)$ to denote summation over all possible instantiations of $\phi(x)$ w.r.t. variable $X$. Note that each (non-leaf) remainder set $R_i$ contains a small subset of all latent variables. The presence of latent variables introduces complicated dependencies between the observed variables, while at the same time only a small number of parameters corresponding to the entries in the CPTs are needed to specify the model.

The process of eliminating the latent variables in (7) can be carried out efficiently via message passing. More specifically, the summation can be broken up into local computation for each node in the junction tree.
Each node only needs to sum out a small number of variables, and then the intermediate result, called the message, is passed to its parent for further processing. In the end the root node incorporates all messages from its children and produces the final result $P(\mathcal{O})$. The local summation step, called the message update, can be generically written as

$M(S_i) = \sum_{R_i} P(R_i \mid S_i) \, M(S_{i1}) \, M(S_{i2}), \quad (8)$

where we use $M(S_i)$ to denote the intermediate result of eliminating the variables in the remainder set $R_i$ (for simplicity of notation, assume $C_i = S_i \cup S_{i1} \cup S_{i2}$). This message update is then carried out recursively according to the reverse topological order of the junction tree until we reach the root node. The local summation steps for the leaf nodes and the root node can be viewed as special cases of (8). For a leaf node $C_l$, there is no incoming message from children nodes, and hence $M(S_l) = P(r_l \mid S_l)$; for the root node $C_r$, $S_r = \emptyset$ and $R_r = C_r$, and hence $P(\mathcal{O}) = M(\emptyset) = \sum_{C_r} P(C_r) \, M(S_{r1}) \, M(S_{r2}) \, M(S_{r3})$.

Example. The message update at the internal node $C_{BCDE}$ in Figure 2 is $M(\{C, E\}) = \sum_{B, D} P(B, D \mid C, E) \, P(f \mid B, C) \, P(g \mid B, D)$.
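The message update in the example is just a sum-product contraction. The following sketch evaluates Eq. (8) for the $C_{BCDE}$ example with made-up two-state CPTs and with the evidence $f$, $g$ already absorbed into the incoming messages.

```python
import numpy as np
rng = np.random.default_rng(0)

kh = 2                                  # two states for each of the latent variables B, C, D, E
# P(B, D | C, E) with modes ordered (b, d, c, e): for each (c, e), a joint distribution over (b, d).
P_BD_CE = np.moveaxis(
    rng.dirichlet(np.ones(kh * kh), size=(kh, kh)).reshape(kh, kh, kh, kh), (2, 3), (0, 1))
# Incoming messages from the leaf cliques with the evidence F = f and G = g plugged in:
M_BC = rng.random((kh, kh))             # M({B, C}) = P(f | B, C)
M_BD = rng.random((kh, kh))             # M({B, D}) = P(g | B, D)

# Eq. (8): M({C, E}) = sum_{B, D} P(B, D | C, E) M({B, C}) M({B, D})
M_CE = np.einsum('bdce,bc,bd->ce', P_BD_CE, M_BC, M_BD)
print(M_CE)                             # a kh x kh table indexed by (C, E)
```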

4 Tensor Representation for Message Passing

Although the parametrization of latent junction trees using CPTs is very compact and inference (message passing) can be carried out efficiently, parameters in this representation can be difficult to learn. Since the likelihood of the observed data is no longer convex in the latent parameters, local search heuristics, such as EM, are often employed to learn the parameters. Therefore, our goal is to design a new representation for latent junction trees, such that subsequent learning can be carried out in a local-minimum-free fashion.

In this section, we will develop a new representation for the message update in (8) by embedding each CPT $P(R_i \mid S_i)$ into a higher order tensor $P(C_i)$. As we will see, there are two advantages to the tensor form. The first is that tensor multiplication can be used to compactly express the sum and product steps involved in message passing. As a very simplistic example, let $P_{A|B} = P(A \mid B)$ be a conditional probability matrix and $P_B = P(B)$ be a marginal probability vector. Then matrix-vector multiplication, $P_{A|B} P_B = P(A)$, sums out variable $B$. However, if we put the marginal probability of $B$ on the diagonal of a matrix, then $B$ will not be summed out: e.g., if $P_{2B} = P(\otimes^2 B)$, then $P_{A|B} P_{2B} = P(A, B)$ (but now $B$ is no longer on the diagonal). We will leverage these facts to derive our tensor representation for message passing. Moreover, we can then utilize tensor inversion to construct an alternate parameterization. In the very simplistic (matrix) example, note that $P(A, B) = P_{A|B} P_{2B} = P_{A|B} F F^{-1} P_{2B}$. The invertible transformations $F$ will give us an extra degree of freedom to allow us to design an alternate parameterization of the latent junction tree that is only a function of observed variables. This would not be possible in the traditional representation (Eq. 8).

4.1 Embedding CPTs into higher order tensors

As we can see from (6), the joint probability distribution of all variables can be represented by a set of conditional distributions over just subsets of variables. Each one of these conditionals is a low order tensor. For example, in Figure 2 the CPT corresponding to the clique node $C_{BCDE}$ would be a 4th order tensor $P(B, D \mid C, E)$ where each variable corresponds to a different mode of the tensor. However, this representation is not suitable for deriving the observable representation, since message passing cannot be defined easily using the tensor multiplication/sum connection shown above. Instead we will embed these tensors into even higher order tensors to facilitate the computation.

The key idea is to introduce duplicate indices using the mode-specific identity tensors, such that the sum-product steps in message updates can be expressed as tensor multiplications. More specifically, the number of times a mode of the tensor is duplicated will depend on how many times the corresponding variable in the clique $C_i$ appears in the separator sets incident to $C_i$. We can define the count for a variable $X_j \in C_i$ as

$d_{j,i} = \mathbb{I}[X_j \in S_i] + \mathbb{I}[X_j \in S_{i1}] + \mathbb{I}[X_j \in S_{i2}], \quad (9)$

where $\mathbb{I}[\cdot]$ is an indicator function taking value 1 if its argument is true and 0 otherwise. Then the tensor representation of the node $C_i$ is

$P(C_i) := P\big(\ldots, \underbrace{(\otimes^{d_{j,i}} X_j)}_{X_j \in R_i}, \ldots \mid \ldots, \underbrace{(\otimes^{d_{j,i}} X_j)}_{X_j \in S_i}, \ldots\big), \quad (10)$

where the labels for the modes of the tensor are the combined labels of the separator sets, i.e., $\{S_i, S_{i1}, S_{i2}\}$. The number of times a variable is repeated in the label set is exactly equal to $d_{j,i}$. Essentially, the tensor $P(C_i)$ contains exactly the same information as the original CPT $P(R_i \mid S_i)$. Furthermore, $P(C_i)$ has a lot of zero entries, and the entries from $P(R_i \mid S_i)$ are simply embedded in the higher order tensor $P(C_i)$.
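As a sketch of this embedding for the running example (with a made-up CPT), the tensor $P(\otimes^2 B, D \mid \otimes^2 C, E)$ can be built by writing each CPT entry onto the positions where the duplicated $B$ and $C$ indices agree; extracting the generalized diagonal recovers the original CPT.

```python
import numpy as np
rng = np.random.default_rng(0)

kh = 2
# A made-up CPT P(B, D | C, E) with modes ordered (b, d, c, e); each (c, e) slice sums to one.
P_BD_CE = np.moveaxis(
    rng.dirichlet(np.ones(kh * kh), size=(kh, kh)).reshape(kh, kh, kh, kh), (2, 3), (0, 1))

# Embedded tensor P(C_BCDE) = P(otimes^2 B, D | otimes^2 C, E): a 6th order tensor with
# mode labels (B, B, D, C, C, E); d_{B,i} = d_{C,i} = 2 and d_{D,i} = d_{E,i} = 1 (Eqs. 9-10).
P_C = np.zeros((kh,) * 6)
for b, d, c, e in np.ndindex(kh, kh, kh, kh):
    P_C[b, b, d, c, c, e] = P_BD_CE[b, d, c, e]   # nonzero only where duplicated indices agree

# The embedding carries exactly the information of the original CPT (everything else is zero):
print(np.allclose(np.einsum('bbdcce->bdce', P_C), P_BD_CE))   # True
print(P_C[0, 1].sum())                                        # 0.0
```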
Suppose all variables in node $C_i$ are latent variables, each taking $k_h$ values. Then the number of entries needed to specify $P(R_i \mid S_i)$ is $k_h^{|C_i|}$, while the higher order tensor $P(C_i)$ has $k_h^{d_i}$ entries, where $d_i := \sum_{j : X_j \in C_i} d_{j,i}$, which is never smaller than $k_h^{|C_i|}$. In a sense, the parametrization using the higher order tensor $P(C_i)$ is less compact than the parametrization using the original CPTs. However, constructing the tensor $P$ this way allows us to express the junction tree message update step in (8) as tensor multiplications (more details in the next section), and then we can leverage tools from tensor analysis to design a local-minimum-free learning algorithm.

The tensor representations for the leaf nodes and the root node are special cases of the representation in (10). The tensor representation at a leaf node $C_l$ is simply equal to its CPT, $P(C_l) = P(R_l \mid S_l)$. The root node $C_r$ has no parent, so $P(C_r) = P(\ldots, (\otimes^{d_{j,r}} X_j), \ldots)$, $X_j \in C_r$. Furthermore, since $d_{j,i}$ is simply a count of how many times a variable in $C_i$ appears in each of the incident separators, the size of each tensor does not depend on which clique node was selected as the root.

Example. In Figure 2, node $C_{BCDE}$ corresponds to the CPT $P(B, D \mid C, E)$. Its higher order tensor representation is $P(C_{BCDE}) = P(\otimes^2 B, D \mid \otimes^2 C, E)$, since both $B$ and $C$ occur twice in the separator sets incident to $C_{BCDE}$. Therefore the tensor $P(\{B, C, D, E\})$ is a 6th order tensor with mode labels $\{B, B, D, C, C, E\}$.

4.2 Tensor message passing

With the higher order tensor representation for clique potentials in the junction tree as in (10), we can express the message update step in (8) as tensor multiplications. Consequently, we can compute the marginal distribution of the observed variables $\mathcal{O}$ in equation (7) recursively using a sequence of tensor multiplications. More specifically, the general message update equation

for a node in a junction tree can be expressed as

$M(S_i) = P(C_i) \times_{S_{i1}} M(S_{i1}) \times_{S_{i2}} M(S_{i2}). \quad (11)$

Here the modes of the tensor $P(C_i)$ are labeled by the variables, and the mode labels are used to carry out tensor multiplications as explained in Section 2. Essentially, multiplication with respect to the duplicated modes of the tensor $P(C_i)$ implements an element-wise multiplication of the incoming messages followed by summation over the variables in the remainder set $R_i$.

The tensor message passing steps at the leaf nodes and the root node are special cases of the tensor message update in equation (11). The outgoing message $M(S_l)$ at a leaf node $C_l$ can be computed by simply setting all variables in $R_l$ to the actual observed values $r_l$, i.e.,

$M(S_l) = P(C_l)\big|_{R_l = r_l} = P(R_l = r_l \mid S_l). \quad (12)$

In this step, there is no difference between the augmented tensor representation and the standard message passing in a junction tree. At the root, we arrive at the final result of the message passing algorithm, and we obtain the marginal probability of the observed variables by aggregating all incoming messages from its 3 children, i.e.,

$P(\mathcal{O}) = P(C_r) \times_{S_{r1}} M(S_{r1}) \times_{S_{r2}} M(S_{r2}) \times_{S_{r3}} M(S_{r3}). \quad (13)$

Example. For Figure 2, using the tensors $P(\{B, C, D, E\}) = P(\otimes^2 B, D \mid \otimes^2 C, E)$, $M(\{B, C\}) = P(f \mid B, C)$, and $M(\{B, D\}) = P(g \mid B, D)$, we can write the message update for node $C_{BCDE}$ in the form of equation (11) as $M(\{C, E\}) = P(\{B, C, D, E\}) \times_{\{B, C\}} M(\{B, C\}) \times_{\{B, D\}} M(\{B, D\})$. Note how the tensor multiplication sums out $B$ and $D$: $P(\{B, C, D, E\})$ has two $B$ labels, and $B$ appears in the subscripts of the tensor multiplications twice; $D$ appears once in the labels and once in the subscripts of the tensor multiplications. Similarly, $C$ is not summed out, since there are two $C$ labels but $C$ appears only once in the subscripts of the tensor multiplications.
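Because duplicated mode labels behave like a generalized diagonal, the tensor update of Eq. (11) can be written as a single einsum contraction. The sketch below (with made-up CPT values) builds the embedded tensor for $C_{BCDE}$, applies Eq. (11), and checks that it reproduces the ordinary sum-product update of Eq. (8).

```python
import numpy as np
rng = np.random.default_rng(0)

kh = 2
# Made-up CPT P(B, D | C, E) with modes (b, d, c, e), plus leaf messages with evidence absorbed.
P_BD_CE = np.moveaxis(
    rng.dirichlet(np.ones(kh * kh), size=(kh, kh)).reshape(kh, kh, kh, kh), (2, 3), (0, 1))
M_BC = rng.random((kh, kh))      # M({B, C}) = P(f | B, C)
M_BD = rng.random((kh, kh))      # M({B, D}) = P(g | B, D)

# Embedded tensor P({B, C, D, E}) = P(otimes^2 B, D | otimes^2 C, E), mode labels (B, B', D, C, C', E).
P_C = np.zeros((kh,) * 6)
for b, d, c, e in np.ndindex(kh, kh, kh, kh):
    P_C[b, b, d, c, c, e] = P_BD_CE[b, d, c, e]

# Eq. (11): M({C, E}) = P({B, C, D, E}) x_{B,C} M({B, C}) x_{B,D} M({B, D}).
# One copy of B and one copy of C are contracted with M({B, C}); the other B copy and D
# with M({B, D}); the remaining C copy and E survive as the outgoing message modes.
M_CE_tensor = np.einsum('bBdcCe,bc,Bd->Ce', P_C, M_BC, M_BD)

# The tensor form agrees with the ordinary sum-product update of Eq. (8).
M_CE_direct = np.einsum('bdce,bc,bd->ce', P_BD_CE, M_BC, M_BD)
print(np.allclose(M_CE_tensor, M_CE_direct))    # True
```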
5 Transformed Representation

Explicitly learning the tensor representation in (10) is still an intractable problem. Our key observation is that we do not need to recover the tensor representation explicitly if our focus is to perform inference using the message passing algorithm as in (11)-(13). As long as we can recover the tensor representation up to some invertible transformation, we can still obtain the correct marginal probability $P(\mathcal{O})$. More specifically, we can insert a mode-specific identity tensor $I_\sigma$ into the message update equation in (11) without changing the outgoing message. Subsequently, we can replace the mode-specific identity tensor by a pair of tensors, $F$ and $F^{-1}$, which are mode-specific inversions of each other ($F \times_\omega F^{-1} = I_\sigma$). Then we can group these inserted tensors with the representation $P(C)$ from (10), and obtain a transformed version $\widetilde{P}(C)$ (also see Figure 1). Furthermore, we have freedom in choosing these collections of tensor inversion pairs. We will show that if we choose them systematically, we will be able to estimate each transformed tensor $\widetilde{P}(C)$ using the marginal probability of a small set of observed variables (observable representation). In this section, we first explain the transformed tensor representation.

As an illustration, consider a sequence of matrix multiplications with two identity matrices $I_1 = F_1 F_1^{-1}$ and $I_2 = F_2 F_2^{-1}$ inserted:

$ABC = A (F_1 F_1^{-1}) B (F_2 F_2^{-1}) C = \underbrace{(A F_1)}_{\widetilde{A}} \underbrace{(F_1^{-1} B F_2)}_{\widetilde{B}} \underbrace{(F_2^{-1} C)}_{\widetilde{C}}.$

We see that we can equivalently compute $ABC$ using the transformed versions, i.e., $ABC = \widetilde{A} \widetilde{B} \widetilde{C}$.

Moving to the tensor case, let us first consider a node $C_i$ and its parent node $C_{i0}$. Then the outgoing message of $C_{i0}$ can be computed recursively as

$M(S_{i0}) = P(C_{i0}) \times_{S_i} \underbrace{M(S_i)}_{P(C_i) \times_{S_{i1}} M(S_{i1}) \times_{S_{i2}} M(S_{i2})} \times \cdots$

Inserting a mode-specific identity tensor $I_{S_i}$ with labels $\{S_i, S_i\}$, and similarly defined mode-specific identity tensors $I_{S_{i1}}$ and $I_{S_{i2}}$, into the above two message updates, we obtain

$M(S_{i0}) = P(C_{i0}) \times_{S_i} \Big( I_{S_i} \times_{S_i} \underbrace{M(S_i)}_{P(C_i) \times_{S_{i1}} (I_{S_{i1}} \times_{S_{i1}} M(S_{i1})) \times_{S_{i2}} (I_{S_{i2}} \times_{S_{i2}} M(S_{i2}))} \Big) \times \cdots$

Then we can further expand $I_{S_i}$ using a tensor inversion pair $F_i, F_i^{-1}$, i.e., $I_{S_i} = F_i \times_{\omega_i} F_i^{-1}$. Note that both $F_i$ and $F_i^{-1}$ have two sets of mode labels, $S_i$ and another set $\omega_i$ which is related to the observable representation and explained in the next section. Similarly, we expand $I_{S_{i1}}$ and $I_{S_{i2}}$ using their corresponding tensor inversion pairs. After expanding the tensor identities $I$, we can regroup terms, and at node $C_i$ we have

$M(S_i) = (P(C_i) \times_{S_{i1}} F_{i1} \times_{S_{i2}} F_{i2}) \times_{\omega_{i1}} (F_{i1}^{-1} \times_{S_{i1}} M(S_{i1})) \times_{\omega_{i2}} (F_{i2}^{-1} \times_{S_{i2}} M(S_{i2})), \quad (14)$

and at the parent node $C_{i0}$ of $C_i$,

$M(S_{i0}) = (P(C_{i0}) \times_{S_i} F_i \cdots) \times_{\omega_i} (F_i^{-1} \times_{S_i} M(S_i)) \cdots \quad (15)$

Now we can define the transformed tensor representation for $P(C_i)$ as

$\widetilde{P}(C_i) := P(C_i) \times_{S_{i1}} F_{i1} \times_{S_{i2}} F_{i2} \times_{S_i} F_i^{-1}, \quad (16)$

where the two transformations $F_{i1}$ and $F_{i2}$ are obtained from the children side and the transformation $F_i^{-1}$ is obtained from the parent side.
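A quick numerical check of this invariance in the matrix special case used above: inserting invertible pairs changes the individual factors but not their product, which is all that inference needs. The dimensions and matrices below are arbitrary.

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy version of the identity ABC = (A F1)(F1^{-1} B F2)(F2^{-1} C) that motivates Eq. (16).
A, B, C = rng.random((3, 4)), rng.random((4, 4)), rng.random((4, 2))
F1, F2 = rng.random((4, 4)), rng.random((4, 4))          # almost surely invertible

A_t = A @ F1                                             # transformed factors
B_t = np.linalg.inv(F1) @ B @ F2
C_t = np.linalg.inv(F2) @ C

print(np.allclose(A @ B @ C, A_t @ B_t @ C_t))           # True: the product is unchanged
```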

Similarly, we can define the transformed representations for a leaf node and for the root node as

$\widetilde{P}(C_l) = P(C_l) \times_{S_l} F_l^{-1}, \quad (17)$

$\widetilde{P}(C_r) = P(C_r) \times_{S_{r1}} F_{r1} \times_{S_{r2}} F_{r2} \times_{S_{r3}} F_{r3}. \quad (18)$

Applying these definitions of the transformed representation recursively, we can perform message passing based purely on the transformed representations:

$\widetilde{M}(S_{i0}) = \widetilde{P}(C_{i0}) \times_{\omega_i} \underbrace{\widetilde{M}(S_i)}_{\widetilde{P}(C_i) \times_{\omega_{i1}} \widetilde{M}(S_{i1}) \times_{\omega_{i2}} \widetilde{M}(S_{i2})} \times \cdots \quad (19)$

6 Observable Representation

In the transformed tensor representation in (16)-(18), we have the freedom of choosing the collection of tensor pairs $F$ and $F^{-1}$. We will show that if we choose them systematically, we can recover each transformed tensor $\widetilde{P}(C)$ using the marginal probability of a small set of observed variables (observable representation).

We will focus on the transformed tensor representation in (16) for an internal node $C_i$ (the other cases follow as special cases). Due to the recursive way the transformed representation is defined, we only have the freedom of choosing $F_{i1}$ and $F_{i2}$ in this formula; the choice of $F_i$ will be fixed by the parent node of $C_i$. The idea is to choose $F_{i1} = P(O_{i1} \mid S_{i1})$ as the conditional distribution of some set of observed variables $O_{i1} \subseteq \mathcal{O}$ in the subtree rooted at the child node $C_{i1}$ of node $C_i$, conditioning on the corresponding separator set $S_{i1}$. Similarly, we choose $F_{i2} = P(O_{i2} \mid S_{i2})$, where $O_{i2} \subseteq \mathcal{O}$ lies in the subtree rooted at $C_{i2}$. Following this convention, $F_i$ is chosen by the parent node $C_{i0}$ and is fixed to $P(O_i \mid S_i)$. Therefore, we have

$\widetilde{P}(C_i) = P(C_i) \times_{S_{i1}} P(O_{i1} \mid S_{i1}) \times_{S_{i2}} P(O_{i2} \mid S_{i2}) \times_{S_i} P(O_i \mid S_i)^{-1}, \quad (20)$

where the first two tensor multiplications essentially eliminate the latent variables in $S_{i1}$ and $S_{i2}$ (if a latent variable in $S_{i1} \cup S_{i2}$ is also in $S_i$, it is not eliminated in this step but in another step). With these choices, we also fix the mode labels $\omega_i$, $\omega_{i1}$ and $\omega_{i2}$ in (14), (15) and (19): that is, $\omega_i = O_i$, $\omega_{i1} = O_{i1}$ and $\omega_{i2} = O_{i2}$.

To remove all dependencies on latent variables in $\widetilde{P}(C_i)$ and relate it to observed variables, we need to eliminate the latent variables in $S_i$ and the tensor $P(O_i \mid S_i)^{-1}$. For this, we multiply the transformed tensor $\widetilde{P}(C_i)$ by $P(O_i, O_i^-)$, where $O_i^-$ denotes some set of observed variables which do not belong to the subtree rooted at node $C_i$. Furthermore, $P(O_i, O_i^-)$ can be re-expressed using the conditional distributions of $O_i$ and $O_i^-$ respectively, conditioning on the separator set $S_i$, i.e., $P(O_i, O_i^-) = P(O_i \mid S_i) \times_{S_i} P(\otimes^2 S_i) \times_{S_i} P(O_i^- \mid S_i)$. Therefore, we have $P(O_i \mid S_i)^{-1} \times_{O_i} P(O_i, O_i^-) = P(\otimes^2 S_i) \times_{S_i} P(O_i^- \mid S_i)$, and plugging this into (20), we obtain

$\widetilde{P}(C_i) \times_{O_i} P(O_i, O_i^-) = P(C_i) \times_{S_{i1}} P(O_{i1} \mid S_{i1}) \times_{S_{i2}} P(O_{i2} \mid S_{i2}) \times_{S_i} P(\otimes^2 S_i) \times_{S_i} P(O_i^- \mid S_i) = P(O_{i1}, O_{i2}, O_i^-), \quad (21)$

where $\widetilde{P}(C_i)$ is now related to only marginal probabilities of observed variables. From this equivalence relation, we can invert $P(O_i, O_i^-)$ and obtain the observable representation for $\widetilde{P}(C_i)$:

$\widetilde{P}(C_i) = P(O_{i1}, O_{i2}, O_i^-) \times_{O_i^-} P(O_i, O_i^-)^{-1}. \quad (22)$

Example. For node $C_{BCDE}$ in Figure 2, the choices of $O_i$, $O_{i1}$, $O_{i2}$ and $O_i^-$ are $\{F, G\}$, $G$, $F$ and $H$ respectively. There are many valid choices of $O_i^-$. In the supplementary, we describe how these different choices can be combined via a linear system using Eq. 21. This can substantially increase performance.
For the leaf nodes and the root node, the derivations of their observable representations can be viewed as special cases of that for the internal nodes. We provide the results for their observable representations below:

$\widetilde{P}(C_r) = P(O_{r1}, O_{r2}, O_{r3}), \quad (23)$

$\widetilde{P}(C_l) = P(O_l, O_l^-) \times_{O_l^-} P(O_l, O_l^-)^{-1}. \quad (24)$

If $P(O_l, O_l^-)$ is invertible, then $\widetilde{P}(C_l) = I_{O_l}$. Otherwise we need to project $P(O_i, O_i^-)$ using a tensor $U_i$ to make it invertible, as discussed in the next section. The overall algorithm is given in Algorithm 1. Given $N$ i.i.d. samples of the observed nodes, we simply replace $P(\cdot)$ by the empirical estimate $\widehat{P}(\cdot)$.

Algorithm 1: Spectral algorithm for latent junction trees

In: Junction tree topology and $N$ i.i.d. samples $\{x_1^s, \ldots, x_{|\mathcal{O}|}^s\}_{s=1}^N$.
Out: Estimated marginal $\widehat{P}(\mathcal{O})$.
1: Estimate $\widehat{P}(C_i)$ for the root, leaf and internal nodes:
   $\widehat{P}(C_r) = \widehat{P}(O_{r1}, O_{r2}, O_{r3}) \times_{O_{r1}} U_{r1} \times_{O_{r2}} U_{r2} \times_{O_{r3}} U_{r3}$
   $\widehat{P}(C_l) = \widehat{P}(O_l, O_l^-) \times_{O_l^-} (\widehat{P}(O_l, O_l^-) \times_{O_l} U_l)^{-1}$
   $\widehat{P}(C_i) = \widehat{P}(O_{i1}, O_{i2}, O_i^-) \times_{O_{i1}} U_{i1} \times_{O_{i2}} U_{i2} \times_{O_i^-} (\widehat{P}(O_i, O_i^-) \times_{O_i} U_i)^{-1}$
2: In reverse topological order, leaf and internal nodes send messages:
   $\widehat{M}(S_l) = \widehat{P}(C_l)\big|_{O_l = o_l}$
   $\widehat{M}(S_i) = \widehat{P}(C_i) \times_{O_{i1}} \widehat{M}(S_{i1}) \times_{O_{i2}} \widehat{M}(S_{i2})$
3: At the root, obtain $\widehat{P}(\mathcal{O})$ by $\widehat{P}(C_r) \times_{O_{r1}} \widehat{M}(S_{r1}) \times_{O_{r2}} \widehat{M}(S_{r2}) \times_{O_{r3}} \widehat{M}(S_{r3})$.
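As a rough illustration of step 1 of Algorithm 1 for a leaf, the sketch below samples pairs $(O_l, O_l^-)$ from a made-up two-state model, forms the empirical pair probability table, takes $U_l$ from its SVD (as discussed in Section 7), and computes the projected leaf estimate. The toy model and dimensions are assumptions for illustration only, not the paper's experimental setup.

```python
import numpy as np
rng = np.random.default_rng(0)

kh, ko, N = 2, 4, 20_000
# Made-up leaf setting: hidden separator S, the leaf observation O_l, and an observation
# O_l_minus outside the subtree, conditionally independent given S.
pS = rng.dirichlet(np.ones(kh))
A  = rng.dirichlet(np.ones(ko), size=kh).T        # A[o, s] = P(O_l = o | S = s)
B  = rng.dirichlet(np.ones(ko), size=kh).T        # B[o, s] = P(O_l_minus = o | S = s)

s_smp = rng.choice(kh, size=N, p=pS)
ol = np.array([rng.choice(ko, p=A[:, s]) for s in s_smp])
om = np.array([rng.choice(ko, p=B[:, s]) for s in s_smp])

# Step 1 of Algorithm 1 for a leaf: empirical pair probability, SVD-based projection U_l,
# and the projected leaf estimate (rows of P_pair index O_l, columns index O_l_minus).
P_pair = np.zeros((ko, ko))
np.add.at(P_pair, (ol, om), 1.0 / N)              # empirical P(O_l, O_l_minus)
U = np.linalg.svd(P_pair)[0][:, :kh]              # U_l: top k_h left singular vectors
P_Cl = P_pair @ np.linalg.pinv(U.T @ P_pair)      # estimate with modes (O_l, projected separator)

# Sanity check: projected along U_l, the leaf estimate acts as an identity,
# mirroring the remark after Eq. (24).
print(np.allclose(U.T @ P_Cl, np.eye(kh)))        # True
```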

7 Discussion

The observable representation exists only if there exist tensor inversion pairs $F_i = P(O_i \mid S_i)$ and $F_i^{-1}$. This is equivalent to requiring that the matricized version of $F_i$ (rows correspond to modes $O_i$ and columns to modes $S_i$) has rank $\tau_i := k_h^{|S_i|}$. Similarly, the matricized version of $P(O_i^- \mid S_i)$ also needs to have rank $\tau_i$, so that the matricized version of $P(O_i, O_i^-)$ has rank $\tau_i$ and is invertible. Thus, it is required that $\#\mathrm{states}(O_i) \geq \#\mathrm{states}(S_i)$. This can be achieved by either making $O_i$ consist of a few high dimensional observations, or of many smaller dimensional ones. In the case when $\#\mathrm{states}(O_i) > \#\mathrm{states}(S_i)$, we need to project $F_i$ to a lower dimensional space using a tensor $U_i$ so that it can be inverted. In this case, we define $F_i := P(O_i \mid S_i) \times_{O_i} U_i$. For example, following this through the computation for the leaf gives us $\widetilde{P}(C_l) = P(O_l, O_l^-) \times_{O_l^-} (P(O_l, O_l^-) \times_{O_l} U_l)^{-1}$. A good choice of $U_i$ can be obtained by performing a singular value decomposition of the matricized version of $P(O_i, O_i^-)$ (variables in $O_i$ arranged as rows and those in $O_i^-$ as columns).

For HMMs and latent trees, this rank condition can be expressed simply as requiring the conditional probability tables of the underlying model to not be rank-deficient. However, junction trees encode more complex latent structures that introduce subtle considerations. A general characterization of the existence condition for the observable representation with respect to the graph topology is left for future work. In the appendix, we give some intuition using a couple of examples where observable representations do not exist.

8 Sample Complexity

We analyze the sample complexity of Algorithm 1 and show that it depends on the junction tree topology and the spectral properties of the true model. Let $d_i$ be the order of $P(C_i)$ and $e_i$ be the number of modes of $P(C_i)$ that correspond to observed variables.

Theorem 1. Let $\tau_i = k_h^{|S_i|}$, $d_{\max} = \max_i d_i$, and $e_{\max} = \max_i e_i$. Let $\sigma_\tau(\cdot)$ return the $\tau$th largest singular value, and let $\alpha = \min_i \sigma_{\tau_i}(P(O_i, O_i^-))$ and $\beta = \min_i \sigma_{\tau_i}(F_i)$. Then, for any $\epsilon > 0$ and $0 < \delta < 1$, if

$N \geq O\!\left( \left(\frac{4 k_h^2}{3 \beta^2}\right)^{d_{\max}} \frac{k_o^{e_{\max}} \, |\mathcal{C}|^2 \ln\frac{|\mathcal{C}|}{\delta}}{\epsilon^2 \alpha^4} \right),$

then with probability $1 - \delta$, $\sum_{x_1, \ldots, x_{|\mathcal{O}|}} \big| \widehat{P}(x_1, \ldots, x_{|\mathcal{O}|}) - P(x_1, \ldots, x_{|\mathcal{O}|}) \big| \leq \epsilon$.

See the supplementary for a proof. The result implies that the estimation problem depends exponentially on $d_{\max}$ and $e_{\max}$, but note that $e_{\max} \leq d_{\max}$. Furthermore, $d_{\max}$ is always greater than or equal to the treewidth. Note the dependence on the singular values of certain probability tensors. In fully observed models, the accuracy of the learned parameters depends only on how close the empirical estimates of the factors are to the true factors. However, our spectral algorithm also depends on how close the inverses of these empirical estimates are to the true inverses, which depends on the spectral properties of the matrices (Stewart & Sun, 1990).

9 Experiments

We now evaluate our method on synthetic and real data and compare it with both standard EM (Dempster et al., 1977) and stepwise online EM (Liang & Klein, 2009). All methods were implemented in C++, and the matrix library Eigen (Guennebaud et al., 2010) was used for computing SVDs and solving linear systems. For all experiments, standard EM is given 5 random restarts. Online EM tends to be sensitive to the learning rate, so it is given one restart for each of 5 choices of the learning rate {0.6, 0.7, 0.8, 0.9, 1} (the one with the highest likelihood is selected).
Convergence is determined by measuring the relative change in the log likelihood at iteration $t$ (denoted by $f(t)$): $\frac{|f(t) - f(t-1)|}{\mathrm{avg}(f(t), f(t-1))} \leq 10^{-4}$ (the same precision as used in Murphy (2005)).

For large sample sizes our method is almost two orders of magnitude faster than both EM and online EM. This is because EM is iterative, and every iteration requires inference over all the training examples, which can become expensive. On the other hand, the computational cost of our method is dominated by the SVD/linear system. Thus, it is primarily dependent only on the number of observed states and the maximum tensor order, and can easily scale to larger sample sizes. In terms of accuracy, we generally observe 3 distinct regions: low sample size, mid sample size, and large sample size. In the low sample size region, EM/online EM tend to overfit to the training data and our spectral algorithm usually performs better. In the mid sample size region, EM/online EM tend to perform better since they benefit from a smaller number of parameters. However, once a certain sample size is reached (the large sample size region), our spectral algorithm consistently outperforms EM/online EM, which suffer from local minima and convergence issues.

9.1 Synthetic Evaluation

We first perform a synthetic evaluation. 4 different latent structures are used (see Figure 3): a second order nonhomogeneous (NH) HMM, a third order NH HMM, a 2 level NH factorial HMM, and a complicated synthetic junction tree. The second/third order HMMs have $k_h = 2$ and $k_o = 4$, while the factorial HMM and synthetic junction tree have $k_h = 2$ and $k_o = 16$. For each latent structure, we generate 10 sets of model parameters, and then sample $N$ training points and 1000

test points from each set, where $N$ is varied from 100 to 100,000. For evaluation, we measure the accuracy of joint estimation using $\mathrm{error} = \frac{|\widehat{P}(x_1, \ldots, x_{|\mathcal{O}|}) - P(x_1, \ldots, x_{|\mathcal{O}|})|}{P(x_1, \ldots, x_{|\mathcal{O}|})}$, averaged over the test points. We also measure the training time of both methods.

Figure 3 shows the results. As discussed earlier, our algorithm is between one and two orders of magnitude faster than both EM and online EM for all the latent structures. EM is actually slower for very small sample sizes than for mid-range sample sizes because of overfitting. Also, in all cases, the spectral algorithm has the lowest error for large sample sizes. Moreover, the critical sample size at which the spectral algorithm overtakes EM/online EM is largely dependent on the number of parameters in the observable representation compared to that in the original parameterization of the model. In the higher order/factorial HMM models, this increase is small, while in the synthetic junction tree it is larger.

9.2 Splice dataset

We next consider the task of determining splicing sites in DNA sequences (Asuncion & Newman, 2007). Each example consists of a DNA sequence of length 60, where each position in the sequence is either an A, T, C, or G. The goal is to classify whether the sequence is an Intron/Exon site, an Exon/Intron site, or neither. During training, a different second order nonhomogeneous HMM with $k_h = 2$ and $k_o = 4$ is trained for each class (which we found to perform better than a homogeneous one). At test time, the probability of the test sequence is computed under each model, and the one with the highest probability is selected. Figure 4 shows our results, which are consistent with our synthetic evaluation. The spectral algorithm performs best at low sample sizes, while EM/online EM perform a little better in the mid sample size range. The dataset is not large enough to explore the large sample size regime. Moreover, we note that the spectral algorithm is much faster for all the sample sizes.

10 Conclusion

We have developed an alternative parameterization that allows fast, local-minimum-free, and consistent parameter learning of latent junction trees. Our approach generalizes spectral algorithms to a much wider range of structures such as higher order, factorial, and semi-hidden Markov models. Unlike traditional non-convex optimization formulations, spectral algorithms allow us to theoretically explore latent variable models in more depth. The spectral algorithm depends not only on the junction tree topology but also on the spectral properties of the parameters. Thus, two models with the same structure may pose different degrees of difficulty based on the underlying singular values. This is very different from learning fully observed junction trees, which is primarily dependent on only the topology/treewidth. Future directions include learning discriminative models and structure learning.

Figure 3: Comparison of our spectral algorithm (blue) to EM (red) and online EM (green) for various latent structures: (a) 2nd order HMM, (b) 3rd order HMM, (c) 2 level factorial HMM, (d) synthetic junction tree. Each panel reports error and runtime (s); both errors and runtimes are in log scale.

Figure 4: Results (error and runtime) on the Splice dataset.
Acknowledgements: This work is supported by an NSF Graduate Fellowship (Grant No ) to APP, Georgia Tech Startup Funding to LS, NIH 1R01-GM093156, and The Gatsby Charitable Foundation. We thank Byron Boots for valuable discussion.

References

Asuncion, A. and Newman, D. J. UCI machine learning repository, 2007.

Balle, B., Quattoni, A., and Carreras, X. A spectral learning algorithm for finite state transducers. In Machine Learning and Knowledge Discovery in Databases, 2011.

Blei, D. and McAuliffe, J. Supervised topic models. In Advances in Neural Information Processing Systems 20, 2007.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 2003.

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. Spectral learning of latent-variable PCFGs. In Association for Computational Linguistics (ACL), volume 50, 2012.

Cowell, R., Dawid, A., Lauritzen, S., and Spiegelhalter, D. Probabilistic Networks and Expert Systems. Springer, New York, 1999.

Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-22, 1977.

Ghahramani, Z. and Jordan, M. I. Factorial hidden Markov models. Machine Learning, 29(2), 1997.

Guennebaud, G., Jacob, B., et al. Eigen v3, 2010.

Hsu, D., Kakade, S., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Proc. Annual Conf. Computational Learning Theory, 2009.

Kolda, T. and Bader, B. Tensor decompositions and applications. SIAM Review, 51(3), 2009.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

Kundu, A., He, Y., and Bahl, P. Recognition of handwritten word: first and second order hidden Markov model based approach. Pattern Recognition, 22(3), 1989.

Lacoste-Julien, S., Sha, F., and Jordan, M. I. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems, volume 21, 2008.

Liang, P. and Klein, D. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009.

Murphy, K. Hidden Markov model (HMM) toolbox for Matlab, 2005.

Murphy, K. P. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.

Parikh, A. P., Song, L., and Xing, E. P. A spectral algorithm for latent tree graphical models. In Proceedings of the 28th International Conference on Machine Learning. ACM, 2011.

Rabiner, L. R. and Juang, B. H. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986.

Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning. ACM, 2010.

Song, L., Parikh, A. P., and Xing, E. P. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems (NIPS), volume 24, 2011.

Stewart, G. W. and Sun, J. Matrix Perturbation Theory. Academic Press, 1990.

Su, X. and Khoshgoftaar, T. M. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:4, 2009.

Zhu, J., Ahmed, A., and Xing, E. P. MedLDA: maximum margin supervised topic models for regression and classification. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.


More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Representation of undirected GM. Kayhan Batmanghelich

Representation of undirected GM. Kayhan Batmanghelich Representation of undirected GM Kayhan Batmanghelich Review Review: Directed Graphical Model Represent distribution of the form ny p(x 1,,X n = p(x i (X i i=1 Factorizes in terms of local conditional probabilities

More information

Spectral Unsupervised Parsing with Additive Tree Metrics

Spectral Unsupervised Parsing with Additive Tree Metrics Spectral Unsupervised Parsing with Additive Tree Metrics Ankur Parikh, Shay Cohen, Eric P. Xing Carnegie Mellon, University of Edinburgh Ankur Parikh 2014 1 Overview Model: We present a novel approach

More information

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 Probabilistic

More information

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4 ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

More information

2 : Directed GMs: Bayesian Networks

2 : Directed GMs: Bayesian Networks 10-708: Probabilistic Graphical Models 10-708, Spring 2017 2 : Directed GMs: Bayesian Networks Lecturer: Eric P. Xing Scribes: Jayanth Koushik, Hiroaki Hayashi, Christian Perez Topic: Directed GMs 1 Types

More information

Estimating Latent-Variable Graphical Models using Moments and Likelihoods

Estimating Latent-Variable Graphical Models using Moments and Likelihoods Arun Tejasvi Chaganty Percy Liang Stanford University, Stanford, CA, USA CHAGANTY@CS.STANFORD.EDU PLIANG@CS.STANFORD.EDU Abstract Recent work on the method of moments enable consistent parameter estimation,

More information

Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science 2-24-2004 Conditional Random Fields: An Introduction Hanna M. Wallach University of Pennsylvania

More information

Hidden Markov models

Hidden Markov models Hidden Markov models Charles Elkan November 26, 2012 Important: These lecture notes are based on notes written by Lawrence Saul. Also, these typeset notes lack illustrations. See the classroom lectures

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Graphical models: parameter learning

Graphical models: parameter learning Graphical models: parameter learning Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London London WC1N 3AR, England http://www.gatsby.ucl.ac.uk/ zoubin/ zoubin@gatsby.ucl.ac.uk

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Graphical Models Adrian Weller MLSALT4 Lecture Feb 26, 2016 With thanks to David Sontag (NYU) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Probabilistic Graphical Models for Image Analysis - Lecture 1

Probabilistic Graphical Models for Image Analysis - Lecture 1 Probabilistic Graphical Models for Image Analysis - Lecture 1 Alexey Gronskiy, Stefan Bauer 21 September 2018 Max Planck ETH Center for Learning Systems Overview 1. Motivation - Why Graphical Models 2.

More information

Sum-Product Networks: A New Deep Architecture

Sum-Product Networks: A New Deep Architecture Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng. University of Washington Joint work with Hoifung Poon 1 Graphical Models: Challenges Bayesian Network Markov Network

More information

Message Passing Algorithms and Junction Tree Algorithms

Message Passing Algorithms and Junction Tree Algorithms Message Passing lgorithms and Junction Tree lgorithms Le Song Machine Learning II: dvanced Topics S 8803ML, Spring 2012 Inference in raphical Models eneral form of the inference problem P X 1,, X n Ψ(

More information

Object Detection Grammars

Object Detection Grammars Object Detection Grammars Pedro F. Felzenszwalb and David McAllester February 11, 2010 1 Introduction We formulate a general grammar model motivated by the problem of object detection in computer vision.

More information

Spectral Learning of Predictive State Representations with Insufficient Statistics

Spectral Learning of Predictive State Representations with Insufficient Statistics Spectral Learning of Predictive State Representations with Insufficient Statistics Alex Kulesza and Nan Jiang and Satinder Singh Computer Science & Engineering University of Michigan Ann Arbor, MI, USA

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Lecture 4 October 18th

Lecture 4 October 18th Directed and undirected graphical models Fall 2017 Lecture 4 October 18th Lecturer: Guillaume Obozinski Scribe: In this lecture, we will assume that all random variables are discrete, to keep notations

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Belief Update in CLG Bayesian Networks With Lazy Propagation

Belief Update in CLG Bayesian Networks With Lazy Propagation Belief Update in CLG Bayesian Networks With Lazy Propagation Anders L Madsen HUGIN Expert A/S Gasværksvej 5 9000 Aalborg, Denmark Anders.L.Madsen@hugin.com Abstract In recent years Bayesian networks (BNs)

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine CS 484 Data Mining Classification 7 Some slides are from Professor Padhraic Smyth at UC Irvine Bayesian Belief networks Conditional independence assumption of Naïve Bayes classifier is too strong. Allows

More information

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Undirected Graphical Models: Markov Random Fields

Undirected Graphical Models: Markov Random Fields Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

12 : Variational Inference I

12 : Variational Inference I 10-708: Probabilistic Graphical Models, Spring 2015 12 : Variational Inference I Lecturer: Eric P. Xing Scribes: Fattaneh Jabbari, Eric Lei, Evan Shapiro 1 Introduction Probabilistic inference is one of

More information

Efficient Sensitivity Analysis in Hidden Markov Models

Efficient Sensitivity Analysis in Hidden Markov Models Efficient Sensitivity Analysis in Hidden Markov Models Silja Renooij Department of Information and Computing Sciences, Utrecht University P.O. Box 80.089, 3508 TB Utrecht, The Netherlands silja@cs.uu.nl

More information

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Structure Learning: the good, the bad, the ugly

Structure Learning: the good, the bad, the ugly Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 Structure Learning: the good, the bad, the ugly Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 29 th, 2006 1 Understanding the uniform

More information

2-Step Temporal Bayesian Networks (2TBN): Filtering, Smoothing, and Beyond Technical Report: TRCIM1030

2-Step Temporal Bayesian Networks (2TBN): Filtering, Smoothing, and Beyond Technical Report: TRCIM1030 2-Step Temporal Bayesian Networks (2TBN): Filtering, Smoothing, and Beyond Technical Report: TRCIM1030 Anqi Xu anqixu(at)cim(dot)mcgill(dot)ca School of Computer Science, McGill University, Montreal, Canada,

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Fractional Belief Propagation

Fractional Belief Propagation Fractional Belief Propagation im iegerinck and Tom Heskes S, niversity of ijmegen Geert Grooteplein 21, 6525 EZ, ijmegen, the etherlands wimw,tom @snn.kun.nl Abstract e consider loopy belief propagation

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Graphical Models 359

Graphical Models 359 8 Graphical Models Probabilities play a central role in modern pattern recognition. We have seen in Chapter 1 that probability theory can be expressed in terms of two simple equations corresponding to

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Statistical Approaches to Learning and Discovery

Statistical Approaches to Learning and Discovery Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University

More information

Linear-Time Inverse Covariance Matrix Estimation in Gaussian Processes

Linear-Time Inverse Covariance Matrix Estimation in Gaussian Processes Linear-Time Inverse Covariance Matrix Estimation in Gaussian Processes Joseph Gonzalez Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jegonzal@cs.cmu.edu Sue Ann Hong Computer

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Directed Graphical Models or Bayesian Networks

Directed Graphical Models or Bayesian Networks Directed Graphical Models or Bayesian Networks Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Bayesian Networks One of the most exciting recent advancements in statistical AI Compact

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information