A Spectral Algorithm for Latent Junction Trees


Ankur P. Parikh (Carnegie Mellon University), Le Song (Georgia Tech), Mariya Ishteva (Georgia Tech), Gabi Teodoru (Gatsby Unit, UCL), Eric P. Xing (Carnegie Mellon University, epxing@cs.cmu.edu)

Abstract

Latent variable models are an elegant framework for capturing rich probabilistic dependencies in many applications. However, current approaches typically parametrize these models using conditional probability tables, and learning relies predominantly on local search heuristics such as Expectation Maximization. Using tensor algebra, we propose an alternative parameterization of latent variable models (where the model structures are junction trees) that still allows for computation of marginals among observed variables. While this novel representation leads to a moderate increase in the number of parameters for junction trees of low treewidth, it lets us design a local-minimum-free algorithm for learning this parameterization. The main computation of the algorithm involves only tensor operations and SVDs, which can be orders of magnitude faster than EM for large datasets. To our knowledge, this is the first provably consistent parameter learning technique for a large class of low-treewidth latent graphical models beyond trees. We demonstrate the advantages of our method on synthetic and real datasets.

1 Introduction

Latent variable models such as Hidden Markov Models (HMMs) (Rabiner & Juang, 1986) and Latent Dirichlet Allocation (Blei et al., 2003) have become a popular framework for modeling complex dependencies among variables. A latent variable can represent an abstract concept (such as a topic or a state), thus enriching the dependency structure among the observed variables while simultaneously allowing for a more tractable representation. Typically, a latent variable model is parameterized by a set of conditional probability tables (CPTs), each associated with an edge in the latent graph structure. For instance, an HMM can be parametrized compactly by a transition probability table and an observation probability table. By summing out the latent variables in the HMM, we obtain a fully connected graphical model for the observed variables.

Although the parametrization of latent variable models using CPTs is very compact, parameters in this representation can be difficult to learn. Compared to parameter learning in fully observed models, which is either closed form or convex (Koller & Friedman, 2009), most parameter learning algorithms for latent variable models resort to maximizing a non-convex objective via Expectation Maximization (EM) (Dempster et al., 1977). EM can get trapped in local optima and has slow convergence.

While EM explicitly learns the CPTs of a latent variable model, in many cases the goal of the model is primarily prediction, and thus the actual latent parameters are not needed. One example is determining splicing sites in DNA sequences (Asuncion & Newman, 2007). One can build a different latent variable model, such as an HMM, for each type of splice site from training data. A new sequence is then classified by determining which model it is most likely to have been generated by. Other examples include supervised topic modelling (Blei & McAuliffe, 2007; Lacoste-Julien et al., 2008; Zhu et al., 2009) and collaborative filtering (Su & Khoshgoftaar, 2009).
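The splice-site example above is essentially likelihood-based model selection: fit one latent variable model per class and score a new sequence under each. As a concrete, hypothetical illustration of that scoring step, the sketch below evaluates a sequence under several HMMs with the forward algorithm; the per-class transition and emission tables are made up for the example and are not parameters from this paper.

```python
import numpy as np

def hmm_log_likelihood(obs, pi, T, E):
    """Forward algorithm: log P(obs) for an HMM with initial distribution pi,
    transitions T[i, j] = P(h_t = j | h_{t-1} = i), emissions E[i, o] = P(x_t = o | h_t = i)."""
    alpha = pi * E[:, obs[0]]                      # P(h_1, x_1)
    logp = np.log(alpha.sum()); alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]              # P(h_t, x_t | x_{1:t-1})
        logp += np.log(alpha.sum()); alpha /= alpha.sum()
    return logp

# Hypothetical per-class HMMs (k_h = 2 hidden states, k_o = 4 symbols A, C, G, T).
rng = np.random.default_rng(0)
def random_hmm():
    return (rng.dirichlet(np.ones(2)),             # initial state distribution
            rng.dirichlet(np.ones(2), size=2),     # transition matrix (rows sum to 1)
            rng.dirichlet(np.ones(4), size=2))     # emission matrix (rows sum to 1)

models = {"exon/intron": random_hmm(), "intron/exon": random_hmm(), "neither": random_hmm()}
sequence = rng.integers(0, 4, size=60)             # a length-60 DNA sequence encoded as 0..3

# Classify by the model under which the sequence is most likely.
scores = {c: hmm_log_likelihood(sequence, *m) for c, m in models.items()}
print(max(scores, key=scores.get))
```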
In these cases, it is natural to ask whether there exists an alternative representation/parameterization of a latent variable model where parameter learning can be done consistently and the representation remains tractable for inference among the observed variables. This question has been tackled recently by Hsu et al. (2009), Balle et al. (2011), and Parikh et al. (2011), who proposed spectral algorithms for local-minimum-free learning of HMMs, finite state transducers, and latent tree graphical models respectively. Unlike traditional parameter learning algorithms such as EM, spectral algorithms do not directly learn the CPTs of a latent variable model.

Instead they learn an alternative parameterization (called the observable representation) which generally contains a larger number of parameters than the CPTs, but where computing observed marginals is still tractable. Moreover, these alternative parameters have the advantage that they only depend on observed variables and can therefore be directly estimated from data. Thus, parameter learning in the alternative representation is fast, local-minimum-free, and provably consistent. Furthermore, spectral algorithms can be generalized to nonparametric latent models (Song et al., 2010, 2011) where it is difficult to run EM.

However, existing spectral algorithms apply only to restricted latent structures (HMMs and latent trees), while latent structures beyond trees, such as higher order HMMs (Kundu et al., 1989), factorial HMMs (Ghahramani & Jordan, 1997) and Dynamic Bayesian Networks (Murphy, 2002), are needed and have been proven useful in many real world problems. The challenges for generalizing spectral algorithms to general latent structured models include the larger factors, more complicated conditional independence structures, and the need to sum out multiple variables simultaneously.

The goal of this paper is to develop a new representation for latent variable models with structures beyond trees, and design a spectral algorithm for learning this representation. We will focus on latent junction trees; thus the algorithm is suitable for both directed and undirected models which can be transformed into junction trees. Concurrently to our work, Cohen et al. (2012) proposed a spectral algorithm for Latent Probabilistic Context Free Grammars (PCFGs). Latent PCFGs are not trees, but have many tree-like properties, and so the representation Cohen et al. (2012) propose does not easily extend to other non-tree models such as higher order/factorial HMMs that we consider here. Our more general approach requires more complex tensor operations, such as multi-mode inversion, that are not used in the latent PCFG case.

Figure 1: Our algorithm for local-minimum-free learning of latent variable models consists of four major steps (illustrated by panels: latent variable model, latent junction tree, tensor representation, transformed tensor representation, estimation). (1) First, we transform a model into a junction tree, such that each node in the junction tree corresponds to a maximal clique of variables in the triangulated graph of the original model. (2) Then we embed the clique potentials of the junction tree into higher order tensors and express the marginal distribution of the observed variables as a tensor-tensor/matrix multiplication according to the message passing algorithm. (3) Next we transform the tensor representation by inserting a pair of transformations between those tensor-tensor/matrix operations. Each pair of transformations is chosen so that they are inversions of each other. (4) Lastly, we show that each transformed representation is a function of only observed variables. Thus, we can estimate each individual transformed tensor quantity using samples from observed variables.
The key idea of our approach is to embed the clique potentials of the junction tree into higher order tensors such that the computation of the marginal probability of observed variables can be carried out via tensor operations. While this novel representation leads only to a moderate increase in the number of parameters for junction trees of low treewidth, it allows us to design an algorithm that can recover a transformed version of the tensor parameterization and ensure that the joint probability of observed variables is computed correctly and consistently. The main computation of the algorithm involves only tensor operations and singular value decompositions (hence the name "spectral"), which can be orders of magnitude faster than EM on large datasets. To our knowledge, this is the first provably consistent parameter learning technique for a large class of low-treewidth latent graphical models beyond trees. In our experiments with large scale synthetic datasets, we show that our spectral algorithm can be almost 2 orders of magnitude faster than EM while at the same time achieving considerably better accuracy. Our spectral algorithm also achieves comparable accuracy to EM on real data.

Organization of paper. A high level overview of our approach is given in Figure 1. We first provide some background on tensor algebra and latent junction trees. We then derive the spectral algorithm by representing junction tree message passing with tensor operations, and then transform this representation into one that only depends on observed variables. Finally, we analyze the sample complexity of our method and evaluate it on synthetic and real datasets.

2 Tensor Notation

We first give an introduction to the tensor notation tailored to this paper. An $N$th order tensor is a multiway array with $N$ modes, i.e., $N$ indices $\{i_1, i_2, \ldots, i_N\}$ are needed to access its entries.

Subarrays of a tensor are formed when a subset of the indices is fixed, and we use a colon to denote all elements of a mode. For instance, $A(i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_N)$ denotes all elements in the $n$th mode of a tensor $A$ with the indices of the other $N-1$ modes fixed to $\{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N\}$ respectively. Furthermore, we also use the shorthand $i_{p:q} = \{i_p, i_{p+1}, \ldots, i_{q-1}, i_q\}$ for consecutive indices, e.g., $A(i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_N) = A(i_{1:n-1}, :, i_{n+1:N})$.

Labeling tensor modes with variables. In contrast to the conventional tensor notation such as the one described in Kolda & Bader (2009), the ordering of the modes of a tensor will not be essential in this paper. We will use random variables to label the modes of a tensor: each mode will correspond to a random variable, and what is important is to keep track of this correspondence. Therefore, we consider two tensors equivalent if they have the same set of labels and they can be obtained from each other by a permutation of the modes for which the labels are aligned. In the matrix case this translates to $A$ and $A^\top$ being equivalent, in the sense that $A^\top$ carries the same information as $A$ as long as we remember that the rows of $A^\top$ are the columns of $A$ and vice versa. We will use the following notation to denote this equivalence:

$A \cong A^\top. \quad (1)$

Under this notation, the dimension (or size) of a mode labeled by variable $X$ will be the same as the number of possible values for variable $X$. Furthermore, when we multiply two tensors together, we will always carry out the operation along (a set of) modes with matching labels.

Tensor multiplication with mode labels. Let $A \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ be an $N$th order tensor and $B \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$ be an $M$th order tensor. If $X$ is a common mode label for both $A$ and $B$ (w.l.o.g. we assume that this is the first mode, implying also that $I_1 = J_1$), multiplying along this mode will give

$C = A \times_X B \in \mathbb{R}^{I_2 \times \cdots \times I_N \times J_2 \times \cdots \times J_M}, \quad (2)$

where the entries of $C$ are defined as $C(i_{2:N}, j_{2:M}) = \sum_{i=1}^{I_1} A(i, i_{2:N}) \, B(i, j_{2:M})$.

Similarly, we can multiply two tensors along multiple modes. Let $\sigma = \{X_1, \ldots, X_k\}$ be an arbitrary set of $k$ modes ($k$ variables) shared by $A$ and $B$ (w.l.o.g. we assume these labels correspond to the first $k$ modes, and $I_1 = J_1, \ldots, I_k = J_k$ holds for the corresponding dimensions). Then multiplying $A$ and $B$ along $\sigma$ results in

$D = A \times_\sigma B \in \mathbb{R}^{I_{k+1} \times \cdots \times I_N \times J_{k+1} \times \cdots \times J_M}, \quad (3)$

where the entries of $D$ are defined as $D(i_{k+1:N}, j_{k+1:M}) = \sum_{i_{1:k}} A(i_{1:k}, i_{k+1:N}) \, B(i_{1:k}, j_{k+1:M})$. Multi-mode multiplication can also be interpreted as reshaping the $\sigma$ modes of $A$ and $B$ into a single mode and doing single-mode tensor multiplication. Furthermore, tensor multiplication with labels is symmetric in its arguments, i.e., $A \times_\sigma B = B \times_\sigma A$.

Mode-specific identity tensor. We now define our notion of identity tensor with respect to a set of modes $\sigma = \{X_1, \ldots, X_K\}$. Let $A$ be a tensor with mode labels containing $\sigma$, and let $I_\sigma$ be a tensor with $2K$ modes with mode labels $\{X_1, \ldots, X_K, X_1, \ldots, X_K\}$. Then $I_\sigma$ is an identity tensor with respect to modes $\sigma$ if

$A \times_\sigma I_\sigma = A. \quad (4)$

One can also understand $I_\sigma$ using its matrix representation: flattening $I_\sigma$ with respect to $\sigma$ (the first $\sigma$ modes mapped to rows and the second $\sigma$ modes mapped to columns) results in an identity matrix.

Mode-specific tensor inversion. Let $F, F^{-1} \in \mathbb{R}^{I_1 \times \cdots \times I_K \times I_{K+1} \times \cdots \times I_{K+K'}}$ be tensors of order $K + K'$, both with two sets of mode labels, $\sigma = \{X_1, \ldots, X_K\}$ and $\omega = \{X_{K+1}, \ldots, X_{K+K'}\}$. Then $F^{-1}$ is the inverse of $F$ w.r.t. modes $\omega$ if and only if

$F \times_\omega F^{-1} = I_\sigma. \quad (5)$

Multi-mode inversion can also be interpreted as reshaping $F$ with respect to $\omega$ into a matrix of size $(I_1 \cdots I_K) \times (I_{K+1} \cdots I_{K+K'})$, taking the inverse, and then rearranging back into a tensor. Thus the existence and uniqueness of this inverse can be characterized by the rank of the matricized version of $F$.
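For readers who prefer code: multiplication along labeled modes corresponds directly to an einsum contraction in NumPy, and the mode-specific inverse of Eq. (5) can be realized by matricizing with respect to $\omega$, (pseudo-)inverting, and reshaping back. The sketch below is a minimal illustration under assumed small dimensions; the mode-label bookkeeping is done by hand here rather than with an explicit label data structure.

```python
import numpy as np
rng = np.random.default_rng(0)

# A has modes (X, Y, Z) and B has modes (X, Y, W); multiplying along sigma = {X, Y} (Eq. 3)
# contracts the matching labels, whatever positions they occupy.
A = rng.random((2, 3, 4))                        # modes X, Y, Z
B = rng.random((2, 3, 5))                        # modes X, Y, W
D = np.einsum('xyz,xyw->zw', A, B)               # D(z, w) = sum_{x, y} A(x, y, z) B(x, y, w)

# Mode-specific inversion (Eq. 5): F has mode sets sigma = {X, Y} and omega = {Z}.
# Matricize w.r.t. omega, (pseudo-)invert, and reshape back.
F = rng.random((2, 3, 6))                        # modes X, Y, Z  (dim Z = dim X * dim Y here)
F_inv = np.linalg.pinv(F.reshape(2 * 3, 6)).reshape(6, 2, 3)   # modes Z, X, Y

# Check F x_omega F_inv = I_sigma: contracting Z gives the identity on the (X, Y) modes.
I_sigma = np.einsum('xyz,zab->xyab', F, F_inv)
print(np.allclose(I_sigma.reshape(6, 6), np.eye(6)))           # True
```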
Mode-specific diagonal tensors. We use $\delta$ to denote an $N$-way relation: its entry $\delta(i_{1:N})$ at position $i_{1:N}$ equals 1 when all indices are the same ($i_1 = i_2 = \ldots = i_N$), and 0 otherwise. We will use $\otimes^d$ to denote repetition of an index $d$ times. For instance, we use $P(\otimes^d X)$ to denote a $d$th order tensor whose entry at position $(i_{1:d})$ is specified by $\delta(i_{1:d}) \, P(X = x_{i_1})$. A diagonal matrix with its diagonal equal to $P(X)$ is then denoted as $P(\otimes^2 X)$. Similarly, we can define a $(d + d')$th order tensor $P(\otimes^d X \mid \otimes^{d'} Y)$ whose $(i_{1:d}, j_{1:d'})$th entry corresponds to $\delta(i_{1:d}) \, \delta(j_{1:d'}) \, P(X = x_{i_1} \mid Y = y_{j_1})$.
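A small sketch of how such diagonal tensors can be built in NumPy, with made-up distributions: only positions where all copies of a repeated index agree receive the probability value.

```python
import numpy as np
rng = np.random.default_rng(0)

def diag_embed(table, d_counts):
    """Repeat the j-th mode of `table` d_counts[j] times: the result is nonzero only
    where all copies of an index agree, and there it equals the original entry."""
    out = np.zeros(sum(([table.shape[j]] * d for j, d in enumerate(d_counts)), []))
    for idx in np.ndindex(table.shape):
        rep = sum(([i] * d for i, d in zip(idx, d_counts)), [])
        out[tuple(rep)] = table[idx]
    return out

pX = rng.dirichlet(np.ones(3))               # P(X) with 3 states
P2X = diag_embed(pX, [2])                    # P(otimes^2 X): a diagonal matrix
print(np.allclose(P2X, np.diag(pX)))         # True

pX_Y = rng.dirichlet(np.ones(3), size=4).T   # P(X | Y): column y is a distribution over X
P2X_Y = diag_embed(pX_Y, [2, 1])             # P(otimes^2 X | Y): zero off the X "diagonal"
print(P2X_Y.shape, P2X_Y[0, 1].sum())        # (3, 3, 4) 0.0
```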

3 Latent Junction Trees

In this paper, we will focus on discrete latent variable models where the number of states, $k_h$, for each hidden variable is much smaller than the number of states, $k_o$, for each observed variable. Uppercase letters denote random variables (e.g., $X_i$) and lowercase letters their instantiations (e.g., $x_i$). A latent variable model defines a joint probability distribution over a set of variables $\mathcal{X} = \mathcal{O} \cup \mathcal{H}$. Here, $\mathcal{O}$ denotes the set of observed variables, $\{X_1, \ldots, X_{|\mathcal{O}|}\}$, and $\mathcal{H}$ denotes the set of hidden variables, $\{X_{|\mathcal{O}|+1}, \ldots, X_{|\mathcal{O}|+|\mathcal{H}|}\}$. We will focus on latent variable models where the structure of the model is a junction tree of low treewidth (Cowell et al., 1999).

Each node $C_i$ in a junction tree corresponds to a subset (clique) of variables from the original graphical model. We will also use $C_i$ to denote the collection of variables contained in the node, i.e., $C_i \subseteq \mathcal{X}$. Let $\mathcal{C}$ denote the set of all clique nodes. The treewidth is then the size of a largest clique in a junction tree minus one, that is $t = \max_{C_i \in \mathcal{C}} |C_i| - 1$. Furthermore, we associate each edge in a junction tree with a separator set $S_{ij} := C_i \cap C_j$, which contains the common variables of the two cliques $C_i$ and $C_j$ it is connected to. If we condition on all variables in any $S_{ij}$, the variables on different sides of $S_{ij}$ become independent.

Without loss of generality, we assume that each internal clique node in the junction tree has exactly 3 neighbors (if this is not the case, the derivation is similar but notationally much heavier). Then we can pick a clique $C_r$ as the root of the tree and reorient all edges away from the root to induce a topological ordering of the clique nodes. Given the ordering, the root node will have 3 children nodes, denoted as $C_{r1}$, $C_{r2}$ and $C_{r3}$. Each other internal node $C_i$ will have a unique parent node, denoted as $C_{i0}$, and 2 children nodes denoted as $C_{i1}$ and $C_{i2}$. Each leaf node $C_l$ is only connected with its unique parent node $C_{l0}$. Furthermore, we can simplify the notation for the separator set between a node $C_i$ and its parent $C_{i0}$ as $S_i = C_i \cap C_{i0}$, omitting the index for the parent node. The remainder set of a node is then defined as $R_i = C_i \setminus S_i$. We also assume w.l.o.g. that if $C_i$ is a leaf in the junction tree, $R_i$ consists of only observed variables. We will use $r_i$ to denote an instantiation of the set of variables in $R_i$. See Figure 2 for an illustration of the notation.

Figure 2: Example latent variable model with variables $\mathcal{X} = \{A, B, C, D, E, F, G, H, I, \ldots\}$, where the observed variables are $\mathcal{O} = \{F, G, H, \ldots\}$ (only partially drawn). Its corresponding junction tree, with cliques such as ACH, ACE, BCDE, BCF and BDG, is shown in the middle panel. Corresponding to this junction tree, we also show the general notation for it in the rightmost panel.

Given a root and a topological ordering of the nodes in a junction tree, the joint distribution of all variables $\mathcal{X}$ can be factorized according to

$P(\mathcal{X}) = \prod_{i=1}^{|\mathcal{C}|} P(R_i \mid S_i), \quad (6)$

where each CPT $P(R_i \mid S_i)$, also called a clique potential, corresponds to a node $C_i$. The number of parameters needed to specify the model is $O(|\mathcal{C}| \, k_o^t)$, linear in the number of cliques but exponential in the treewidth $t$. The marginal distribution of the observed variables can then be obtained by summing over the latent variables,

$P(\mathcal{O}) = \sum_{X_{|\mathcal{O}|+1}} \cdots \sum_{X_{|\mathcal{O}|+|\mathcal{H}|}} \prod_{i=1}^{|\mathcal{C}|} P(R_i \mid S_i), \quad (7)$

where we use $\sum_X \phi(x)$ to denote summation over all possible instantiations of $\phi(x)$ w.r.t. variable $X$. Note that each (non-leaf) remainder set $R_i$ contains a small subset of all latent variables. The presence of latent variables introduces complicated dependencies between the observed variables, while at the same time only a small number of parameters corresponding to the entries in the CPTs are needed to specify the model.

The process of eliminating the latent variables in (7) can be carried out efficiently via message passing. More specifically, the summation can be broken up into local computation for each node in the junction tree.
Each node only needs to sum out a small number of variables, and then the intermediate result, called the message, is passed to its parent for further processing. In the end the root node incorporates all messages from its children and produces the final result $P(\mathcal{O})$. The local summation step, called the message update, can be generically written as

$M(S_i) = \sum_{R_i} P(R_i \mid S_i) \, M(S_{i1}) \, M(S_{i2}), \quad (8)$

where we use $M(S_i)$ to denote the intermediate result of eliminating the variables in the remainder set $R_i$ (for simplicity of notation, assume $C_i = S_i \cup S_{i1} \cup S_{i2}$). This message update is then carried out recursively according to the reverse topological order of the junction tree until we reach the root node. The local summation steps for the leaf nodes and the root node can be viewed as special cases of (8). For a leaf node $C_l$, there is no incoming message from children nodes, and hence $M(S_l) = P(r_l \mid S_l)$; for the root node $C_r$, $S_r = \emptyset$ and $R_r = C_r$, and hence $P(\mathcal{O}) = M(\emptyset) = \sum_{C_r} P(C_r) \, M(S_{r1}) \, M(S_{r2}) \, M(S_{r3})$.

Example. The message update at the internal node $C_{BCDE}$ in Figure 2 is $M(\{C, E\}) = \sum_{B, D} P(B, D \mid C, E) \, P(f \mid B, C) \, P(g \mid B, D)$.
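The message update in the example is just a sum-product contraction. The following sketch evaluates Eq. (8) for the $C_{BCDE}$ example with made-up two-state CPTs and with the evidence $f$, $g$ already absorbed into the incoming messages.

```python
import numpy as np
rng = np.random.default_rng(0)

kh = 2                                  # two states for each of the latent variables B, C, D, E
# P(B, D | C, E) with modes ordered (b, d, c, e): for each (c, e), a joint distribution over (b, d).
P_BD_CE = np.moveaxis(
    rng.dirichlet(np.ones(kh * kh), size=(kh, kh)).reshape(kh, kh, kh, kh), (2, 3), (0, 1))
# Incoming messages from the leaf cliques with the evidence F = f and G = g plugged in:
M_BC = rng.random((kh, kh))             # M({B, C}) = P(f | B, C)
M_BD = rng.random((kh, kh))             # M({B, D}) = P(g | B, D)

# Eq. (8): M({C, E}) = sum_{B, D} P(B, D | C, E) M({B, C}) M({B, D})
M_CE = np.einsum('bdce,bc,bd->ce', P_BD_CE, M_BC, M_BD)
print(M_CE)                             # a kh x kh table indexed by (C, E)
```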

4 Tensor Representation for Message Passing

Although the parametrization of latent junction trees using CPTs is very compact and inference (message passing) can be carried out efficiently, parameters in this representation can be difficult to learn. Since the likelihood of the observed data is no longer convex in the latent parameters, local search heuristics, such as EM, are often employed to learn the parameters. Therefore, our goal is to design a new representation for latent junction trees, such that subsequent learning can be carried out in a local-minimum-free fashion.

In this section, we will develop a new representation for the message update in (8) by embedding each CPT $P(R_i \mid S_i)$ into a higher order tensor $P(C_i)$. As we will see, there are two advantages to the tensor form. The first is that tensor multiplication can be used to compactly express the sum and product steps involved in message passing. As a very simplistic example, let $P_{A|B} = P(A \mid B)$ be a conditional probability matrix and $P_B = P(B)$ be a marginal probability vector. Then matrix-vector multiplication, $P_{A|B} P_B = P(A)$, sums out variable $B$. However, if we put the marginal probability of $B$ on the diagonal of a matrix, then $B$ will not be summed out: e.g., if $P_{2B} = P(\otimes^2 B)$, then $P_{A|B} P_{2B} = P(A, B)$ (but now $B$ is no longer on the diagonal). We will leverage these facts to derive our tensor representation for message passing. Moreover, we can then utilize tensor inversion to construct an alternate parameterization. In the very simplistic (matrix) example, note that $P(A, B) = P_{A|B} P_{2B} = P_{A|B} F F^{-1} P_{2B}$. The invertible transformations $F$ will give us an extra degree of freedom to allow us to design an alternate parameterization of the latent junction tree that is only a function of observed variables. This would not be possible in the traditional representation (Eq. 8).

4.1 Embedding CPTs into higher order tensors

As we can see from (6), the joint probability distribution of all variables can be represented by a set of conditional distributions over just subsets of variables. Each one of these conditionals is a low order tensor. For example, in Figure 2 the CPT corresponding to the clique node $C_{BCDE}$ would be a 4th order tensor $P(B, D \mid C, E)$ where each variable corresponds to a different mode of the tensor. However, this representation is not suitable for deriving the observable representation, since message passing cannot be defined easily using the tensor multiplication/sum connection shown above. Instead we will embed these tensors into even higher order tensors to facilitate the computation.

The key idea is to introduce duplicate indices using the mode-specific identity tensors, such that the sum-product steps in message updates can be expressed as tensor multiplications. More specifically, the number of times a mode of the tensor is duplicated will depend on how many times the corresponding variable in the clique $C_i$ appears in the separator sets incident to $C_i$. We can define the count for a variable $X_j \in C_i$ as

$d_{j,i} = \mathbb{I}[X_j \in S_i] + \mathbb{I}[X_j \in S_{i1}] + \mathbb{I}[X_j \in S_{i2}], \quad (9)$

where $\mathbb{I}[\cdot]$ is an indicator function taking value 1 if its argument is true and 0 otherwise. Then the tensor representation of the node $C_i$ is

$P(C_i) := P\big(\ldots, \underbrace{(\otimes^{d_{j,i}} X_j)}_{X_j \in R_i}, \ldots \mid \ldots, \underbrace{(\otimes^{d_{j,i}} X_j)}_{X_j \in S_i}, \ldots\big), \quad (10)$

where the labels for the modes of the tensor are the combined labels of the separator sets, i.e., $\{S_i, S_{i1}, S_{i2}\}$. The number of times a variable is repeated in the label set is exactly equal to $d_{j,i}$. Essentially, the tensor $P(C_i)$ contains exactly the same information as the original CPT $P(R_i \mid S_i)$. Furthermore, $P(C_i)$ has a lot of zero entries, and the entries from $P(R_i \mid S_i)$ are simply embedded in the higher order tensor $P(C_i)$.
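As a sketch of this embedding for the running example (with a made-up CPT), the tensor $P(\otimes^2 B, D \mid \otimes^2 C, E)$ can be built by writing each CPT entry onto the positions where the duplicated $B$ and $C$ indices agree; extracting the generalized diagonal recovers the original CPT.

```python
import numpy as np
rng = np.random.default_rng(0)

kh = 2
# A made-up CPT P(B, D | C, E) with modes ordered (b, d, c, e); each (c, e) slice sums to one.
P_BD_CE = np.moveaxis(
    rng.dirichlet(np.ones(kh * kh), size=(kh, kh)).reshape(kh, kh, kh, kh), (2, 3), (0, 1))

# Embedded tensor P(C_BCDE) = P(otimes^2 B, D | otimes^2 C, E): a 6th order tensor with
# mode labels (B, B, D, C, C, E); d_{B,i} = d_{C,i} = 2 and d_{D,i} = d_{E,i} = 1 (Eqs. 9-10).
P_C = np.zeros((kh,) * 6)
for b, d, c, e in np.ndindex(kh, kh, kh, kh):
    P_C[b, b, d, c, c, e] = P_BD_CE[b, d, c, e]   # nonzero only where duplicated indices agree

# The embedding carries exactly the information of the original CPT (everything else is zero):
print(np.allclose(np.einsum('bbdcce->bdce', P_C), P_BD_CE))   # True
print(P_C[0, 1].sum())                                        # 0.0
```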
Suppose all variables in node $C_i$ are latent variables, each taking $k_h$ values. Then the number of entries needed to specify $P(R_i \mid S_i)$ is $k_h^{|C_i|}$, while the higher order tensor $P(C_i)$ has $k_h^{d_i}$ entries, where $d_i := \sum_{j : X_j \in C_i} d_{j,i}$, which is never smaller than $k_h^{|C_i|}$. In a sense, the parametrization using the higher order tensor $P(C_i)$ is less compact than the parametrization using the original CPTs. However, constructing the tensor $P$ this way allows us to express the junction tree message update step in (8) as tensor multiplications (more details in the next section), and then we can leverage tools from tensor analysis to design a local-minimum-free learning algorithm.

The tensor representations for the leaf nodes and the root node are special cases of the representation in (10). The tensor representation at a leaf node $C_l$ is simply equal to its CPT, $P(C_l) = P(R_l \mid S_l)$. The root node $C_r$ has no parent, so $P(C_r) = P(\ldots, (\otimes^{d_{j,r}} X_j), \ldots)$, $X_j \in C_r$. Furthermore, since $d_{j,i}$ is simply a count of how many times a variable in $C_i$ appears in each of the incident separators, the size of each tensor does not depend on which clique node was selected as the root.

Example. In Figure 2, node $C_{BCDE}$ corresponds to the CPT $P(B, D \mid C, E)$. Its higher order tensor representation is $P(C_{BCDE}) = P(\otimes^2 B, D \mid \otimes^2 C, E)$, since both $B$ and $C$ occur twice in the separator sets incident to $C_{BCDE}$. Therefore the tensor $P(\{B, C, D, E\})$ is a 6th order tensor with mode labels $\{B, B, D, C, C, E\}$.

4.2 Tensor message passing

With the higher order tensor representation for clique potentials in the junction tree as in (10), we can express the message update step in (8) as tensor multiplications. Consequently, we can compute the marginal distribution of the observed variables $\mathcal{O}$ in equation (7) recursively using a sequence of tensor multiplications. More specifically, the general message update equation

for a node in a junction tree can be expressed as

$M(S_i) = P(C_i) \times_{S_{i1}} M(S_{i1}) \times_{S_{i2}} M(S_{i2}). \quad (11)$

Here the modes of the tensor $P(C_i)$ are labeled by the variables, and the mode labels are used to carry out tensor multiplications as explained in Section 2. Essentially, multiplication with respect to the duplicated modes of the tensor $P(C_i)$ implements an element-wise multiplication of the incoming messages followed by summation over the variables in the remainder set $R_i$.

The tensor message passing steps at the leaf nodes and the root node are special cases of the tensor message update in equation (11). The outgoing message $M(S_l)$ at a leaf node $C_l$ can be computed by simply setting all variables in $R_l$ to the actual observed values $r_l$, i.e.,

$M(S_l) = P(C_l)\big|_{R_l = r_l} = P(R_l = r_l \mid S_l). \quad (12)$

In this step, there is no difference between the augmented tensor representation and the standard message passing in a junction tree. At the root, we arrive at the final result of the message passing algorithm, and we obtain the marginal probability of the observed variables by aggregating all incoming messages from its 3 children, i.e.,

$P(\mathcal{O}) = P(C_r) \times_{S_{r1}} M(S_{r1}) \times_{S_{r2}} M(S_{r2}) \times_{S_{r3}} M(S_{r3}). \quad (13)$

Example. For Figure 2, using the tensors $P(\{B, C, D, E\}) = P(\otimes^2 B, D \mid \otimes^2 C, E)$, $M(\{B, C\}) = P(f \mid B, C)$, and $M(\{B, D\}) = P(g \mid B, D)$, we can write the message update for node $C_{BCDE}$ in the form of equation (11) as $M(\{C, E\}) = P(\{B, C, D, E\}) \times_{\{B, C\}} M(\{B, C\}) \times_{\{B, D\}} M(\{B, D\})$. Note how the tensor multiplication sums out $B$ and $D$: $P(\{B, C, D, E\})$ has two $B$ labels, and $B$ appears in the subscripts of the tensor multiplications twice; $D$ appears once in the labels and once in the subscripts of the tensor multiplications. Similarly, $C$ is not summed out, since there are two $C$ labels but $C$ appears only once in the subscripts of the tensor multiplications.
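Because duplicated mode labels behave like a generalized diagonal, the tensor update of Eq. (11) can be written as a single einsum contraction. The sketch below (with made-up CPT values) builds the embedded tensor for $C_{BCDE}$, applies Eq. (11), and checks that it reproduces the ordinary sum-product update of Eq. (8).

```python
import numpy as np
rng = np.random.default_rng(0)

kh = 2
# Made-up CPT P(B, D | C, E) with modes (b, d, c, e), plus leaf messages with evidence absorbed.
P_BD_CE = np.moveaxis(
    rng.dirichlet(np.ones(kh * kh), size=(kh, kh)).reshape(kh, kh, kh, kh), (2, 3), (0, 1))
M_BC = rng.random((kh, kh))      # M({B, C}) = P(f | B, C)
M_BD = rng.random((kh, kh))      # M({B, D}) = P(g | B, D)

# Embedded tensor P({B, C, D, E}) = P(otimes^2 B, D | otimes^2 C, E), mode labels (B, B', D, C, C', E).
P_C = np.zeros((kh,) * 6)
for b, d, c, e in np.ndindex(kh, kh, kh, kh):
    P_C[b, b, d, c, c, e] = P_BD_CE[b, d, c, e]

# Eq. (11): M({C, E}) = P({B, C, D, E}) x_{B,C} M({B, C}) x_{B,D} M({B, D}).
# One copy of B and one copy of C are contracted with M({B, C}); the other B copy and D
# with M({B, D}); the remaining C copy and E survive as the outgoing message modes.
M_CE_tensor = np.einsum('bBdcCe,bc,Bd->Ce', P_C, M_BC, M_BD)

# The tensor form agrees with the ordinary sum-product update of Eq. (8).
M_CE_direct = np.einsum('bdce,bc,bd->ce', P_BD_CE, M_BC, M_BD)
print(np.allclose(M_CE_tensor, M_CE_direct))    # True
```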
5 Transformed Representation

Explicitly learning the tensor representation in (10) is still an intractable problem. Our key observation is that we do not need to recover the tensor representation explicitly if our focus is to perform inference using the message passing algorithm as in (11)-(13). As long as we can recover the tensor representation up to some invertible transformation, we can still obtain the correct marginal probability $P(\mathcal{O})$. More specifically, we can insert a mode-specific identity tensor $I_\sigma$ into the message update equation in (11) without changing the outgoing message. Subsequently, we can replace the mode-specific identity tensor by a pair of tensors, $F$ and $F^{-1}$, which are mode-specific inversions of each other ($F \times_\omega F^{-1} = I_\sigma$). Then we can group these inserted tensors with the representation $P(C)$ from (10), and obtain a transformed version $\widetilde{P}(C)$ (also see Figure 1). Furthermore, we have freedom in choosing these collections of tensor inversion pairs. We will show that if we choose them systematically, we will be able to estimate each transformed tensor $\widetilde{P}(C)$ using the marginal probability of a small set of observed variables (observable representation). In this section, we first explain the transformed tensor representation.

As an illustration, consider a sequence of matrix multiplications with two identity matrices $I_1 = F_1 F_1^{-1}$ and $I_2 = F_2 F_2^{-1}$ inserted:

$ABC = A (F_1 F_1^{-1}) B (F_2 F_2^{-1}) C = \underbrace{(A F_1)}_{\widetilde{A}} \underbrace{(F_1^{-1} B F_2)}_{\widetilde{B}} \underbrace{(F_2^{-1} C)}_{\widetilde{C}}.$

We see that we can equivalently compute $ABC$ using the transformed versions, i.e., $ABC = \widetilde{A} \widetilde{B} \widetilde{C}$.

Moving to the tensor case, let us first consider a node $C_i$ and its parent node $C_{i0}$. Then the outgoing message of $C_{i0}$ can be computed recursively as

$M(S_{i0}) = P(C_{i0}) \times_{S_i} \underbrace{M(S_i)}_{P(C_i) \times_{S_{i1}} M(S_{i1}) \times_{S_{i2}} M(S_{i2})} \times \cdots$

Inserting a mode-specific identity tensor $I_{S_i}$ with labels $\{S_i, S_i\}$, and similarly defined mode-specific identity tensors $I_{S_{i1}}$ and $I_{S_{i2}}$, into the above two message updates, we obtain

$M(S_{i0}) = P(C_{i0}) \times_{S_i} \Big( I_{S_i} \times_{S_i} \underbrace{M(S_i)}_{P(C_i) \times_{S_{i1}} (I_{S_{i1}} \times_{S_{i1}} M(S_{i1})) \times_{S_{i2}} (I_{S_{i2}} \times_{S_{i2}} M(S_{i2}))} \Big) \times \cdots$

Then we can further expand $I_{S_i}$ using a tensor inversion pair $F_i, F_i^{-1}$, i.e., $I_{S_i} = F_i \times_{\omega_i} F_i^{-1}$. Note that both $F_i$ and $F_i^{-1}$ have two sets of mode labels, $S_i$ and another set $\omega_i$ which is related to the observable representation and explained in the next section. Similarly, we expand $I_{S_{i1}}$ and $I_{S_{i2}}$ using their corresponding tensor inversion pairs. After expanding the tensor identities $I$, we can regroup terms, and at node $C_i$ we have

$M(S_i) = (P(C_i) \times_{S_{i1}} F_{i1} \times_{S_{i2}} F_{i2}) \times_{\omega_{i1}} (F_{i1}^{-1} \times_{S_{i1}} M(S_{i1})) \times_{\omega_{i2}} (F_{i2}^{-1} \times_{S_{i2}} M(S_{i2})), \quad (14)$

and at the parent node $C_{i0}$ of $C_i$,

$M(S_{i0}) = (P(C_{i0}) \times_{S_i} F_i \cdots) \times_{\omega_i} (F_i^{-1} \times_{S_i} M(S_i)) \cdots \quad (15)$

Now we can define the transformed tensor representation for $P(C_i)$ as

$\widetilde{P}(C_i) := P(C_i) \times_{S_{i1}} F_{i1} \times_{S_{i2}} F_{i2} \times_{S_i} F_i^{-1}, \quad (16)$

where the two transformations $F_{i1}$ and $F_{i2}$ are obtained from the children side and the transformation $F_i^{-1}$ is obtained from the parent side.
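A quick numerical check of this invariance in the matrix special case used above: inserting invertible pairs changes the individual factors but not their product, which is all that inference needs. The dimensions and matrices below are arbitrary.

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy version of the identity ABC = (A F1)(F1^{-1} B F2)(F2^{-1} C) that motivates Eq. (16).
A, B, C = rng.random((3, 4)), rng.random((4, 4)), rng.random((4, 2))
F1, F2 = rng.random((4, 4)), rng.random((4, 4))          # almost surely invertible

A_t = A @ F1                                             # transformed factors
B_t = np.linalg.inv(F1) @ B @ F2
C_t = np.linalg.inv(F2) @ C

print(np.allclose(A @ B @ C, A_t @ B_t @ C_t))           # True: the product is unchanged
```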

Similarly, we can define the transformed representations for a leaf node and for the root node as

$\widetilde{P}(C_l) = P(C_l) \times_{S_l} F_l^{-1}, \quad (17)$

$\widetilde{P}(C_r) = P(C_r) \times_{S_{r1}} F_{r1} \times_{S_{r2}} F_{r2} \times_{S_{r3}} F_{r3}. \quad (18)$

Applying these definitions of the transformed representation recursively, we can perform message passing based purely on the transformed representations:

$\widetilde{M}(S_{i0}) = \widetilde{P}(C_{i0}) \times_{\omega_i} \underbrace{\widetilde{M}(S_i)}_{\widetilde{P}(C_i) \times_{\omega_{i1}} \widetilde{M}(S_{i1}) \times_{\omega_{i2}} \widetilde{M}(S_{i2})} \times \cdots \quad (19)$

6 Observable Representation

In the transformed tensor representation in (16)-(18), we have the freedom of choosing the collection of tensor pairs $F$ and $F^{-1}$. We will show that if we choose them systematically, we can recover each transformed tensor $\widetilde{P}(C)$ using the marginal probability of a small set of observed variables (observable representation).

We will focus on the transformed tensor representation in (16) for an internal node $C_i$ (the other cases follow as special cases). Due to the recursive way the transformed representation is defined, we only have the freedom of choosing $F_{i1}$ and $F_{i2}$ in this formula; the choice of $F_i$ will be fixed by the parent node of $C_i$. The idea is to choose $F_{i1} = P(O_{i1} \mid S_{i1})$ as the conditional distribution of some set of observed variables $O_{i1} \subseteq \mathcal{O}$ in the subtree rooted at the child node $C_{i1}$ of node $C_i$, conditioning on the corresponding separator set $S_{i1}$. Similarly, we choose $F_{i2} = P(O_{i2} \mid S_{i2})$, where $O_{i2} \subseteq \mathcal{O}$ lies in the subtree rooted at $C_{i2}$. Following this convention, $F_i$ is chosen by the parent node $C_{i0}$ and is fixed to $P(O_i \mid S_i)$. Therefore, we have

$\widetilde{P}(C_i) = P(C_i) \times_{S_{i1}} P(O_{i1} \mid S_{i1}) \times_{S_{i2}} P(O_{i2} \mid S_{i2}) \times_{S_i} P(O_i \mid S_i)^{-1}, \quad (20)$

where the first two tensor multiplications essentially eliminate the latent variables in $S_{i1}$ and $S_{i2}$ (if a latent variable in $S_{i1} \cup S_{i2}$ is also in $S_i$, it is not eliminated in this step but in another step). With these choices, we also fix the mode labels $\omega_i$, $\omega_{i1}$ and $\omega_{i2}$ in (14), (15) and (19): that is, $\omega_i = O_i$, $\omega_{i1} = O_{i1}$ and $\omega_{i2} = O_{i2}$.

To remove all dependencies on latent variables in $\widetilde{P}(C_i)$ and relate it to observed variables, we need to eliminate the latent variables in $S_i$ and the tensor $P(O_i \mid S_i)^{-1}$. For this, we multiply the transformed tensor $\widetilde{P}(C_i)$ by $P(O_i, O_i^-)$, where $O_i^-$ denotes some set of observed variables which do not belong to the subtree rooted at node $C_i$. Furthermore, $P(O_i, O_i^-)$ can be re-expressed using the conditional distributions of $O_i$ and $O_i^-$ respectively, conditioning on the separator set $S_i$, i.e., $P(O_i, O_i^-) = P(O_i \mid S_i) \times_{S_i} P(\otimes^2 S_i) \times_{S_i} P(O_i^- \mid S_i)$. Therefore, we have $P(O_i \mid S_i)^{-1} \times_{O_i} P(O_i, O_i^-) = P(\otimes^2 S_i) \times_{S_i} P(O_i^- \mid S_i)$, and plugging this into (20), we obtain

$\widetilde{P}(C_i) \times_{O_i} P(O_i, O_i^-) = P(C_i) \times_{S_{i1}} P(O_{i1} \mid S_{i1}) \times_{S_{i2}} P(O_{i2} \mid S_{i2}) \times_{S_i} P(\otimes^2 S_i) \times_{S_i} P(O_i^- \mid S_i) = P(O_{i1}, O_{i2}, O_i^-), \quad (21)$

where $\widetilde{P}(C_i)$ is now related to only marginal probabilities of observed variables. From this equivalence relation, we can invert $P(O_i, O_i^-)$ and obtain the observable representation for $\widetilde{P}(C_i)$:

$\widetilde{P}(C_i) = P(O_{i1}, O_{i2}, O_i^-) \times_{O_i^-} P(O_i, O_i^-)^{-1}. \quad (22)$

Example. For node $C_{BCDE}$ in Figure 2, the choices of $O_i$, $O_{i1}$, $O_{i2}$ and $O_i^-$ are $\{F, G\}$, $G$, $F$ and $H$ respectively. There are many valid choices of $O_i^-$. In the supplementary, we describe how these different choices can be combined via a linear system using Eq. 21. This can substantially increase performance.
For the leaf nodes and the root node, the derivations of their observable representations can be viewed as special cases of that for the internal nodes. We provide the results for their observable representations below:

$\widetilde{P}(C_r) = P(O_{r1}, O_{r2}, O_{r3}), \quad (23)$

$\widetilde{P}(C_l) = P(O_l, O_l^-) \times_{O_l^-} P(O_l, O_l^-)^{-1}. \quad (24)$

If $P(O_l, O_l^-)$ is invertible, then $\widetilde{P}(C_l) = I_{O_l}$. Otherwise we need to project $P(O_i, O_i^-)$ using a tensor $U_i$ to make it invertible, as discussed in the next section. The overall algorithm is given in Algorithm 1. Given $N$ i.i.d. samples of the observed nodes, we simply replace $P(\cdot)$ by the empirical estimate $\widehat{P}(\cdot)$.

Algorithm 1: Spectral algorithm for latent junction trees

In: Junction tree topology and $N$ i.i.d. samples $\{x_1^s, \ldots, x_{|\mathcal{O}|}^s\}_{s=1}^N$.
Out: Estimated marginal $\widehat{P}(\mathcal{O})$.
1: Estimate $\widehat{P}(C_i)$ for the root, leaf and internal nodes:
   $\widehat{P}(C_r) = \widehat{P}(O_{r1}, O_{r2}, O_{r3}) \times_{O_{r1}} U_{r1} \times_{O_{r2}} U_{r2} \times_{O_{r3}} U_{r3}$
   $\widehat{P}(C_l) = \widehat{P}(O_l, O_l^-) \times_{O_l^-} (\widehat{P}(O_l, O_l^-) \times_{O_l} U_l)^{-1}$
   $\widehat{P}(C_i) = \widehat{P}(O_{i1}, O_{i2}, O_i^-) \times_{O_{i1}} U_{i1} \times_{O_{i2}} U_{i2} \times_{O_i^-} (\widehat{P}(O_i, O_i^-) \times_{O_i} U_i)^{-1}$
2: In reverse topological order, leaf and internal nodes send messages:
   $\widehat{M}(S_l) = \widehat{P}(C_l)\big|_{O_l = o_l}$
   $\widehat{M}(S_i) = \widehat{P}(C_i) \times_{O_{i1}} \widehat{M}(S_{i1}) \times_{O_{i2}} \widehat{M}(S_{i2})$
3: At the root, obtain $\widehat{P}(\mathcal{O})$ by $\widehat{P}(C_r) \times_{O_{r1}} \widehat{M}(S_{r1}) \times_{O_{r2}} \widehat{M}(S_{r2}) \times_{O_{r3}} \widehat{M}(S_{r3})$.
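As a rough illustration of step 1 of Algorithm 1 for a leaf, the sketch below samples pairs $(O_l, O_l^-)$ from a made-up two-state model, forms the empirical pair probability table, takes $U_l$ from its SVD (as discussed in Section 7), and computes the projected leaf estimate. The toy model and dimensions are assumptions for illustration only, not the paper's experimental setup.

```python
import numpy as np
rng = np.random.default_rng(0)

kh, ko, N = 2, 4, 20_000
# Made-up leaf setting: hidden separator S, the leaf observation O_l, and an observation
# O_l_minus outside the subtree, conditionally independent given S.
pS = rng.dirichlet(np.ones(kh))
A  = rng.dirichlet(np.ones(ko), size=kh).T        # A[o, s] = P(O_l = o | S = s)
B  = rng.dirichlet(np.ones(ko), size=kh).T        # B[o, s] = P(O_l_minus = o | S = s)

s_smp = rng.choice(kh, size=N, p=pS)
ol = np.array([rng.choice(ko, p=A[:, s]) for s in s_smp])
om = np.array([rng.choice(ko, p=B[:, s]) for s in s_smp])

# Step 1 of Algorithm 1 for a leaf: empirical pair probability, SVD-based projection U_l,
# and the projected leaf estimate (rows of P_pair index O_l, columns index O_l_minus).
P_pair = np.zeros((ko, ko))
np.add.at(P_pair, (ol, om), 1.0 / N)              # empirical P(O_l, O_l_minus)
U = np.linalg.svd(P_pair)[0][:, :kh]              # U_l: top k_h left singular vectors
P_Cl = P_pair @ np.linalg.pinv(U.T @ P_pair)      # estimate with modes (O_l, projected separator)

# Sanity check: projected along U_l, the leaf estimate acts as an identity,
# mirroring the remark after Eq. (24).
print(np.allclose(U.T @ P_Cl, np.eye(kh)))        # True
```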

7 Discussion

The observable representation exists only if there exist tensor inversion pairs $F_i = P(O_i \mid S_i)$ and $F_i^{-1}$. This is equivalent to requiring that the matricized version of $F_i$ (rows correspond to modes $O_i$ and columns to modes $S_i$) has rank $\tau_i := k_h^{|S_i|}$. Similarly, the matricized version of $P(O_i^- \mid S_i)$ also needs to have rank $\tau_i$, so that the matricized version of $P(O_i, O_i^-)$ has rank $\tau_i$ and is invertible. Thus, it is required that $\#\mathrm{states}(O_i) \geq \#\mathrm{states}(S_i)$. This can be achieved by either making $O_i$ consist of a few high dimensional observations, or of many smaller dimensional ones. In the case when $\#\mathrm{states}(O_i) > \#\mathrm{states}(S_i)$, we need to project $F_i$ to a lower dimensional space using a tensor $U_i$ so that it can be inverted. In this case, we define $F_i := P(O_i \mid S_i) \times_{O_i} U_i$. For example, following this through the computation for the leaf gives us $\widetilde{P}(C_l) = P(O_l, O_l^-) \times_{O_l^-} (P(O_l, O_l^-) \times_{O_l} U_l)^{-1}$. A good choice of $U_i$ can be obtained by performing a singular value decomposition of the matricized version of $P(O_i, O_i^-)$ (variables in $O_i$ arranged as rows and those in $O_i^-$ as columns).

For HMMs and latent trees, this rank condition can be expressed simply as requiring the conditional probability tables of the underlying model to not be rank-deficient. However, junction trees encode more complex latent structures that introduce subtle considerations. A general characterization of the existence condition for the observable representation with respect to the graph topology is left for future work. In the appendix, we give some intuition using a couple of examples where observable representations do not exist.

8 Sample Complexity

We analyze the sample complexity of Algorithm 1 and show that it depends on the junction tree topology and the spectral properties of the true model. Let $d_i$ be the order of $P(C_i)$ and $e_i$ be the number of modes of $P(C_i)$ that correspond to observed variables.

Theorem 1. Let $\tau_i = k_h^{|S_i|}$, $d_{\max} = \max_i d_i$, and $e_{\max} = \max_i e_i$. Let $\sigma_\tau(\cdot)$ return the $\tau$th largest singular value, and let $\alpha = \min_i \sigma_{\tau_i}(P(O_i, O_i^-))$ and $\beta = \min_i \sigma_{\tau_i}(F_i)$. Then, for any $\epsilon > 0$ and $0 < \delta < 1$, if

$N \geq O\!\left( \left(\frac{4 k_h^2}{3 \beta^2}\right)^{d_{\max}} \frac{k_o^{e_{\max}} \, |\mathcal{C}|^2 \ln\frac{|\mathcal{C}|}{\delta}}{\epsilon^2 \alpha^4} \right),$

then with probability $1 - \delta$, $\sum_{x_1, \ldots, x_{|\mathcal{O}|}} \big| \widehat{P}(x_1, \ldots, x_{|\mathcal{O}|}) - P(x_1, \ldots, x_{|\mathcal{O}|}) \big| \leq \epsilon$.

See the supplementary for a proof. The result implies that the estimation problem depends exponentially on $d_{\max}$ and $e_{\max}$, but note that $e_{\max} \leq d_{\max}$. Furthermore, $d_{\max}$ is always greater than or equal to the treewidth. Note the dependence on the singular values of certain probability tensors. In fully observed models, the accuracy of the learned parameters depends only on how close the empirical estimates of the factors are to the true factors. However, our spectral algorithm also depends on how close the inverses of these empirical estimates are to the true inverses, which depends on the spectral properties of the matrices (Stewart & Sun, 1990).

9 Experiments

We now evaluate our method on synthetic and real data and compare it with both standard EM (Dempster et al., 1977) and stepwise online EM (Liang & Klein, 2009). All methods were implemented in C++, and the matrix library Eigen (Guennebaud et al., 2010) was used for computing SVDs and solving linear systems. For all experiments, standard EM is given 5 random restarts. Online EM tends to be sensitive to the learning rate, so it is given one restart for each of 5 choices of the learning rate {0.6, 0.7, 0.8, 0.9, 1} (the one with the highest likelihood is selected).
Convergence is determined by measuring the relative change in the log likelihood at iteration $t$ (denoted by $f(t)$): $\frac{|f(t) - f(t-1)|}{\mathrm{avg}(f(t), f(t-1))} \leq 10^{-4}$ (the same precision as used in Murphy (2005)).

For large sample sizes our method is almost two orders of magnitude faster than both EM and online EM. This is because EM is iterative, and every iteration requires inference over all the training examples, which can become expensive. On the other hand, the computational cost of our method is dominated by the SVD/linear system. Thus, it is primarily dependent only on the number of observed states and the maximum tensor order, and can easily scale to larger sample sizes. In terms of accuracy, we generally observe 3 distinct regions: low sample size, mid sample size, and large sample size. In the low sample size region, EM/online EM tend to overfit to the training data and our spectral algorithm usually performs better. In the mid sample size region, EM/online EM tend to perform better since they benefit from a smaller number of parameters. However, once a certain sample size is reached (the large sample size region), our spectral algorithm consistently outperforms EM/online EM, which suffer from local minima and convergence issues.

9.1 Synthetic Evaluation

We first perform a synthetic evaluation. 4 different latent structures are used (see Figure 3): a second order nonhomogeneous (NH) HMM, a third order NH HMM, a 2 level NH factorial HMM, and a complicated synthetic junction tree. The second/third order HMMs have $k_h = 2$ and $k_o = 4$, while the factorial HMM and synthetic junction tree have $k_h = 2$ and $k_o = 16$. For each latent structure, we generate 10 sets of model parameters, and then sample $N$ training points and 1000

test points from each set, where $N$ is varied from 100 to 100,000. For evaluation, we measure the accuracy of joint estimation using $\mathrm{error} = \frac{|\widehat{P}(x_1, \ldots, x_{|\mathcal{O}|}) - P(x_1, \ldots, x_{|\mathcal{O}|})|}{P(x_1, \ldots, x_{|\mathcal{O}|})}$, averaged over the test points. We also measure the training time of both methods.

Figure 3 shows the results. As discussed earlier, our algorithm is between one and two orders of magnitude faster than both EM and online EM for all the latent structures. EM is actually slower for very small sample sizes than for mid-range sample sizes because of overfitting. Also, in all cases, the spectral algorithm has the lowest error for large sample sizes. Moreover, the critical sample size at which the spectral algorithm overtakes EM/online EM is largely dependent on the number of parameters in the observable representation compared to that in the original parameterization of the model. In the higher order/factorial HMM models, this increase is small, while in the synthetic junction tree it is larger.

9.2 Splice dataset

We next consider the task of determining splicing sites in DNA sequences (Asuncion & Newman, 2007). Each example consists of a DNA sequence of length 60, where each position in the sequence is either an A, T, C, or G. The goal is to classify whether the sequence is an Intron/Exon site, an Exon/Intron site, or neither. During training, a different second order nonhomogeneous HMM with $k_h = 2$ and $k_o = 4$ is trained for each class (which we found to perform better than a homogeneous one). At test time, the probability of the test sequence is computed under each model, and the one with the highest probability is selected. Figure 4 shows our results, which are consistent with our synthetic evaluation. The spectral algorithm performs best at low sample sizes, while EM/online EM perform a little better in the mid sample size range. The dataset is not large enough to explore the large sample size regime. Moreover, we note that the spectral algorithm is much faster for all the sample sizes.

10 Conclusion

We have developed an alternative parameterization that allows fast, local-minimum-free, and consistent parameter learning of latent junction trees. Our approach generalizes spectral algorithms to a much wider range of structures such as higher order, factorial, and semi-hidden Markov models. Unlike traditional non-convex optimization formulations, spectral algorithms allow us to theoretically explore latent variable models in more depth. The spectral algorithm depends not only on the junction tree topology but also on the spectral properties of the parameters. Thus, two models with the same structure may pose different degrees of difficulty based on the underlying singular values. This is very different from learning fully observed junction trees, which is primarily dependent on only the topology/treewidth. Future directions include learning discriminative models and structure learning.

Figure 3: Comparison of our spectral algorithm (blue) to EM (red) and online EM (green) for various latent structures: (a) 2nd order HMM, (b) 3rd order HMM, (c) 2 level factorial HMM, (d) synthetic junction tree. Each panel reports error and runtime (s); both errors and runtimes are in log scale.

Figure 4: Results (error and runtime) on the Splice dataset.
Acknowledgements: This work is supported by an NSF Graduate Fellowship (Grant No ) to APP, Georgia Tech Startup Funding to LS, NIH 1R01-GM093156, and The Gatsby Charitable Foundation. We thank Byron Boots for valuable discussion.

References

Asuncion, A. and Newman, D. J. UCI machine learning repository, 2007.

Balle, B., Quattoni, A., and Carreras, X. A spectral learning algorithm for finite state transducers. In Machine Learning and Knowledge Discovery in Databases, 2011.

Blei, D. and McAuliffe, J. Supervised topic models. In Advances in Neural Information Processing Systems 20, 2007.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 2003.

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. Spectral learning of latent-variable PCFGs. In Association for Computational Linguistics (ACL), volume 50, 2012.

Cowell, R., Dawid, A., Lauritzen, S., and Spiegelhalter, D. Probabilistic Networks and Expert Systems. Springer, New York, 1999.

Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-22, 1977.

Ghahramani, Z. and Jordan, M. I. Factorial hidden Markov models. Machine Learning, 29(2), 1997.

Guennebaud, G., Jacob, B., et al. Eigen v3, 2010.

Hsu, D., Kakade, S., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Proc. Annual Conf. Computational Learning Theory, 2009.

Kolda, T. and Bader, B. Tensor decompositions and applications. SIAM Review, 51(3), 2009.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

Kundu, A., He, Y., and Bahl, P. Recognition of handwritten word: first and second order hidden Markov model based approach. Pattern Recognition, 22(3), 1989.

Lacoste-Julien, S., Sha, F., and Jordan, M. I. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems, volume 21, 2008.

Liang, P. and Klein, D. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009.

Murphy, K. Hidden Markov model (HMM) toolbox for Matlab, 2005.

Murphy, K. P. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.

Parikh, A. P., Song, L., and Xing, E. P. A spectral algorithm for latent tree graphical models. In Proceedings of the 28th International Conference on Machine Learning. ACM, 2011.

Rabiner, L. R. and Juang, B. H. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986.

Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning. ACM, 2010.

Song, L., Parikh, A. P., and Xing, E. P. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems (NIPS), volume 24, 2011.

Stewart, G. W. and Sun, J. Matrix Perturbation Theory. Academic Press, 1990.

Su, X. and Khoshgoftaar, T. M. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:4, 2009.

Zhu, J., Ahmed, A., and Xing, E. P. MedLDA: maximum margin supervised topic models for regression and classification. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.


More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Representation of undirected GM. Kayhan Batmanghelich

Representation of undirected GM. Kayhan Batmanghelich Representation of undirected GM Kayhan Batmanghelich Review Review: Directed Graphical Model Represent distribution of the form ny p(x 1,,X n = p(x i (X i i=1 Factorizes in terms of local conditional probabilities

More information

Spectral Unsupervised Parsing with Additive Tree Metrics

Spectral Unsupervised Parsing with Additive Tree Metrics Spectral Unsupervised Parsing with Additive Tree Metrics Ankur Parikh, Shay Cohen, Eric P. Xing Carnegie Mellon, University of Edinburgh Ankur Parikh 2014 1 Overview Model: We present a novel approach

More information

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 Probabilistic

More information

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4 ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

More information

2 : Directed GMs: Bayesian Networks

2 : Directed GMs: Bayesian Networks 10-708: Probabilistic Graphical Models 10-708, Spring 2017 2 : Directed GMs: Bayesian Networks Lecturer: Eric P. Xing Scribes: Jayanth Koushik, Hiroaki Hayashi, Christian Perez Topic: Directed GMs 1 Types

More information

Estimating Latent-Variable Graphical Models using Moments and Likelihoods

Estimating Latent-Variable Graphical Models using Moments and Likelihoods Arun Tejasvi Chaganty Percy Liang Stanford University, Stanford, CA, USA CHAGANTY@CS.STANFORD.EDU PLIANG@CS.STANFORD.EDU Abstract Recent work on the method of moments enable consistent parameter estimation,

More information

Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science 2-24-2004 Conditional Random Fields: An Introduction Hanna M. Wallach University of Pennsylvania

More information

Hidden Markov models

Hidden Markov models Hidden Markov models Charles Elkan November 26, 2012 Important: These lecture notes are based on notes written by Lawrence Saul. Also, these typeset notes lack illustrations. See the classroom lectures

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Graphical models: parameter learning

Graphical models: parameter learning Graphical models: parameter learning Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London London WC1N 3AR, England http://www.gatsby.ucl.ac.uk/ zoubin/ zoubin@gatsby.ucl.ac.uk

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Graphical Models Adrian Weller MLSALT4 Lecture Feb 26, 2016 With thanks to David Sontag (NYU) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Probabilistic Graphical Models for Image Analysis - Lecture 1

Probabilistic Graphical Models for Image Analysis - Lecture 1 Probabilistic Graphical Models for Image Analysis - Lecture 1 Alexey Gronskiy, Stefan Bauer 21 September 2018 Max Planck ETH Center for Learning Systems Overview 1. Motivation - Why Graphical Models 2.

More information

Sum-Product Networks: A New Deep Architecture

Sum-Product Networks: A New Deep Architecture Sum-Product Networks: A New Deep Architecture Pedro Domingos Dept. Computer Science & Eng. University of Washington Joint work with Hoifung Poon 1 Graphical Models: Challenges Bayesian Network Markov Network

More information

Message Passing Algorithms and Junction Tree Algorithms

Message Passing Algorithms and Junction Tree Algorithms Message Passing lgorithms and Junction Tree lgorithms Le Song Machine Learning II: dvanced Topics S 8803ML, Spring 2012 Inference in raphical Models eneral form of the inference problem P X 1,, X n Ψ(

More information

Object Detection Grammars

Object Detection Grammars Object Detection Grammars Pedro F. Felzenszwalb and David McAllester February 11, 2010 1 Introduction We formulate a general grammar model motivated by the problem of object detection in computer vision.

More information

Spectral Learning of Predictive State Representations with Insufficient Statistics

Spectral Learning of Predictive State Representations with Insufficient Statistics Spectral Learning of Predictive State Representations with Insufficient Statistics Alex Kulesza and Nan Jiang and Satinder Singh Computer Science & Engineering University of Michigan Ann Arbor, MI, USA

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Lecture 4 October 18th

Lecture 4 October 18th Directed and undirected graphical models Fall 2017 Lecture 4 October 18th Lecturer: Guillaume Obozinski Scribe: In this lecture, we will assume that all random variables are discrete, to keep notations

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Belief Update in CLG Bayesian Networks With Lazy Propagation

Belief Update in CLG Bayesian Networks With Lazy Propagation Belief Update in CLG Bayesian Networks With Lazy Propagation Anders L Madsen HUGIN Expert A/S Gasværksvej 5 9000 Aalborg, Denmark Anders.L.Madsen@hugin.com Abstract In recent years Bayesian networks (BNs)

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine CS 484 Data Mining Classification 7 Some slides are from Professor Padhraic Smyth at UC Irvine Bayesian Belief networks Conditional independence assumption of Naïve Bayes classifier is too strong. Allows

More information

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Undirected Graphical Models: Markov Random Fields

Undirected Graphical Models: Markov Random Fields Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

12 : Variational Inference I

12 : Variational Inference I 10-708: Probabilistic Graphical Models, Spring 2015 12 : Variational Inference I Lecturer: Eric P. Xing Scribes: Fattaneh Jabbari, Eric Lei, Evan Shapiro 1 Introduction Probabilistic inference is one of

More information

Efficient Sensitivity Analysis in Hidden Markov Models

Efficient Sensitivity Analysis in Hidden Markov Models Efficient Sensitivity Analysis in Hidden Markov Models Silja Renooij Department of Information and Computing Sciences, Utrecht University P.O. Box 80.089, 3508 TB Utrecht, The Netherlands silja@cs.uu.nl

More information

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Structure Learning: the good, the bad, the ugly

Structure Learning: the good, the bad, the ugly Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 Structure Learning: the good, the bad, the ugly Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 29 th, 2006 1 Understanding the uniform

More information

2-Step Temporal Bayesian Networks (2TBN): Filtering, Smoothing, and Beyond Technical Report: TRCIM1030

2-Step Temporal Bayesian Networks (2TBN): Filtering, Smoothing, and Beyond Technical Report: TRCIM1030 2-Step Temporal Bayesian Networks (2TBN): Filtering, Smoothing, and Beyond Technical Report: TRCIM1030 Anqi Xu anqixu(at)cim(dot)mcgill(dot)ca School of Computer Science, McGill University, Montreal, Canada,

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Fractional Belief Propagation

Fractional Belief Propagation Fractional Belief Propagation im iegerinck and Tom Heskes S, niversity of ijmegen Geert Grooteplein 21, 6525 EZ, ijmegen, the etherlands wimw,tom @snn.kun.nl Abstract e consider loopy belief propagation

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Graphical Models 359

Graphical Models 359 8 Graphical Models Probabilities play a central role in modern pattern recognition. We have seen in Chapter 1 that probability theory can be expressed in terms of two simple equations corresponding to

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Statistical Approaches to Learning and Discovery

Statistical Approaches to Learning and Discovery Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University

More information

Linear-Time Inverse Covariance Matrix Estimation in Gaussian Processes

Linear-Time Inverse Covariance Matrix Estimation in Gaussian Processes Linear-Time Inverse Covariance Matrix Estimation in Gaussian Processes Joseph Gonzalez Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jegonzal@cs.cmu.edu Sue Ann Hong Computer

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Directed Graphical Models or Bayesian Networks

Directed Graphical Models or Bayesian Networks Directed Graphical Models or Bayesian Networks Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Bayesian Networks One of the most exciting recent advancements in statistical AI Compact

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information