Supplemental for "A Spectral Algorithm for Latent Tree Graphical Models"
Ankur P. Parikh, Le Song, Eric P. Xing

The supplemental contains three main things. 1. The first is network plots of the latent variable tree learned by [1] for the stock market data and of the Chow-Liu tree, which give a more intuitive explanation of why latent variable trees can lead to better performance. 2. The second is a more detailed presentation of the tensor representation in which internal nodes are allowed to be evidence variables. 3. The third is the proof of Theorem 1.

1 Latent Tree Structure for Stock Data

The latent tree structure learned by the algorithm of [1] is shown in Figure 1. The blue nodes are hidden nodes and the red nodes are observed. Note how integrating out some of these hidden nodes could lead to very large cliques; thus it is not surprising that both our spectral method and EM perform better than Chow-Liu. The Chow-Liu tree is shown in Figure 2. Note how it is forced to pick some of the observed variables as hubs even when latent variables may be more natural.

Figure 1: Latent variable tree learned by [1]. The hidden variables are in blue while the observed variables are in red. As one can see, the hidden variables can model significantly more complex relationships among the observed variables.

2 More Detailed Information about the Tensor Representation for LTMs

The computation of the marginal distribution of the observed variables can be expressed in terms of tensor multiplications. Basically, the information contained in each tensor corresponds to the information in a conditional probability table (CPT) of the model, and the tensor multiplications implement the summations. However, there are multiple ways of rewriting the marginal distribution of the observed variables in tensor notation, and not all of them provide intuition or an easy derivation of a spectral algorithm.
In this section, we derive a specific representation of latent tree models which requires only tensors up to 3rd order and provides a basis for deriving a spectral algorithm. More specifically, we first select a latent or observed variable as the root node and sort the nodes in the tree in topological order. Then we associate the root node $X_r$ with a vector $r$ related to the marginal probability of $X_r$. Depending on whether the root node is latent or observed, its entries are defined as

$$r(k) = \begin{cases} P[X_r = k] & X_r \text{ latent} \\ \delta_{k x_r}\, P[x_r] & X_r \text{ observed} \end{cases}$$

where $\delta_{k x_r}$ is an indicator variable defined as $\delta_{k x_r} = 1$ if $k = x_r$ and $0$ otherwise. Effectively, $\delta_{k x_r}$ sets all entries of $r$ to zero except the one corresponding to $P[x_r]$. Next, we associate each internal node $X_i$ with a 3rd order tensor $T_i$ related to the conditional probability table between $X_i$ and its parent $X_{\pi_i}$. This tensor is diagonal in its 2nd and 3rd modes, and hence its nonzero entries can be accessed by two indices $k$ and $l$. Depending on whether the internal node and its parent are latent or observed variables, the nonzero entries of $T_i$ are defined as
Figure 2: Tree learned by the Chow-Liu algorithm over only the observed variables. Note how it is forced to pick some of the observed variables as hubs even when latent variables may be more natural.

$$T_i(k, l, l) = \begin{cases} P[X_i = k \mid X_{\pi_i} = l] & X_i \text{ latent},\ X_{\pi_i} \text{ latent} \\ \delta_{l x_{\pi_i}}\, P[X_i = k \mid x_{\pi_i}] & X_i \text{ latent},\ X_{\pi_i} \text{ observed} \\ \delta_{k x_i}\, P[x_i \mid X_{\pi_i} = l] & X_i \text{ observed},\ X_{\pi_i} \text{ latent} \\ \delta_{k x_i}\, \delta_{l x_{\pi_i}}\, P[x_i \mid x_{\pi_i}] & X_i \text{ observed},\ X_{\pi_i} \text{ observed} \end{cases}$$

where $\delta_{k x_i}$ and $\delta_{l x_{\pi_i}}$ are also indicator variables. Effectively, the indicator variables zero out further entries in $T_i$ for those values that are not equal to the actual observation. Last, we associate each leaf node $X_i$, which is always observed, with a diagonal matrix $M_i$ related to the likelihood of $x_i$. Depending on whether the parent of $X_i$ is latent or observed, the diagonal entries of $M_i$ are defined as

$$M_i(k, k) = \begin{cases} P[x_i \mid X_{\pi_i} = k] & X_{\pi_i} \text{ latent} \\ \delta_{k x_{\pi_i}}\, P[x_i \mid x_{\pi_i}] & X_{\pi_i} \text{ observed} \end{cases}$$

Let the $M_i$ defined above be the messages passed from the leaf nodes to their parents. We can show that the marginal probability of the observed variables can be computed recursively using a message passing algorithm: each node in the tree sends a message to its parent according to the reverse topological order of the nodes, and the final messages are aggregated at the root to give the desired quantity. More formally, the outgoing message from an internal node $X_i$ to its parent can be computed as

$$M_i = T_i \times_1 (M_{j_1} M_{j_2} \cdots M_{j_J}\, 1_i), \qquad (1)$$

where $j_1, j_2, \ldots, j_J \in \chi_i$ range over all children of $X_i$ ($J = |\chi_i|$). Here $1_i$ is a vector of all ones of suitable size, and it is used to reduce the incoming messages (all are diagonal matrices) to a single vector. The computation in (1) essentially implements the message update we often see in an ordinary message passing algorithm [5], i.e.,

$$m_i[x_{\pi_i}] = \sum_{x_i} P[x_i \mid x_{\pi_i}]\, m_{j_1}[x_i] \cdots m_{j_J}[x_i], \qquad (2)$$

where $m_j[x_i]$ represents the incoming messages to $X_i$, or intermediate results of the marginalization operation obtained by summing out all latent variables in the subtree $T_j$. The $M_{j_1} M_{j_2} \cdots M_{j_J} 1_i$ corresponds to aggregating all incoming messages $m_{j_1}[x_i] \cdots$
$m_{j_J}[x_i]$, and the $T_i \times_1 (\cdot)$ corresponds to $\sum_{x_i} P[x_i \mid x_{\pi_i}]\,(\cdot)$. At the root node, all incoming messages are combined to produce the final joint probability, i.e.,

$$P[x_1, \ldots, x_O] = r^\top M_{j_1} M_{j_2} \cdots M_{j_J}\, 1_r. \qquad (3)$$

Here $r^\top$ basically implements the operation $\sum_{x_r} P[x_r]\,(\cdot)$, which sums out the root variable.
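To make equations (1)-(3) concrete, the following minimal NumPy sketch (our own illustration, not the authors' code; the toy tree structure, state space sizes, and all CPT values are invented) computes the joint probability of the leaves of a small latent tree both by brute-force enumeration over the hidden variables and by the tensor message passing above, and checks that the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)
S_H, S_O = 2, 3  # hidden and observed state space sizes (hypothetical)

def rand_cpt(rows, cols):
    """Random conditional probability table: columns index the parent state."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

# Toy tree: root H_r -> {internal latent H_1, leaf X_3}; H_1 -> {leaves X_1, X_2}
pi_r = rng.random(S_H); pi_r /= pi_r.sum()   # P[H_r]
A    = rand_cpt(S_H, S_H)                    # A[k, l]  = P[H_1 = k | H_r = l]
O1   = rand_cpt(S_O, S_H)                    # O1[x, k] = P[X_1 = x | H_1 = k]
O2   = rand_cpt(S_O, S_H)
O3   = rand_cpt(S_O, S_H)                    # O3[x, l] = P[X_3 = x | H_r = l]

x1, x2, x3 = 0, 2, 1  # a fixed observation

# --- Brute force: sum out both hidden variables ---
brute = sum(pi_r[l] * O3[x3, l]
            * sum(A[k, l] * O1[x1, k] * O2[x2, k] for k in range(S_H))
            for l in range(S_H))

# --- Tensor message passing, eqs (1)-(3) ---
M1 = np.diag(O1[x1, :])       # leaf messages: diagonal likelihood matrices
M2 = np.diag(O2[x2, :])
M3 = np.diag(O3[x3, :])
T_H1 = A                      # nonzero entries T_{H_1}(k, l, l), stored as a matrix
v = (M1 @ M2) @ np.ones(S_H)  # reduce incoming diagonal messages to a vector, eq (1)
M_H1 = np.diag(T_H1.T @ v)    # (T_{H_1} x_1 v)(l, l) = sum_k A[k, l] v[k]
prob = pi_r @ ((M_H1 @ M3) @ np.ones(S_H))  # root aggregation, eq (3)

assert np.allclose(prob, brute)
print(prob)
```

Because the tensors are diagonal in their 2nd and 3rd modes, the mode-1 contraction in (1) collapses to an ordinary matrix-vector product, which is what makes the whole marginal computable with matrix algebra.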
3 Notation for the Proof of Theorem 1

We now proceed to prove Theorem 1. $\|\cdot\|_2$ refers to the spectral norm for matrices and tensors but to the ordinary Euclidean norm for vectors. $\|\cdot\|_1$ refers to the induced 1-norm for matrices and tensors (max column sum), but to the ordinary $\ell_1$ norm for vectors. $\|\cdot\|_F$ refers to the Frobenius norm. The tensor spectral norm for 3rd order tensors is defined in [4]:

$$\|T\|_2 = \sup_{\|v_1\|_2 \le 1,\, \|v_2\|_2 \le 1,\, \|v_3\|_2 \le 1} T \times_1 v_1 \times_2 v_2 \times_3 v_3 \qquad (4)$$

We will define the induced 1-norm of a tensor as

$$\|T\|_{1,1} = \sup_{\|v\|_1 \le 1} \|T \times_1 v\|_1, \qquad (5)$$

using the induced $\ell_1$ norm of a matrix, i.e., $\|A\|_1 = \sup_{\|v\|_1 \le 1} \|Av\|_1$. For more information about matrix norms see [2].

In general, we suppress the actual subscripts/superscripts on $U$ and $O$: the $U$'s and $O$'s can be different depending on the transform being considered, but writing this out makes the notation very messy, and it will generally be clear from context which $U$ and $O$ are being referred to. When it is not, we arbitrarily index them $1, 2, \ldots$ so that it is clear which corresponds to which. In general, for simplicity of exposition, we assume that all internal nodes in the tree are unobserved and all leaves are observed, since this is the hardest case. The proof generally follows the technique of HKZ [3], but has key differences due to the tree topology instead of the HMM.

We define $\widetilde{M}_i = (O^\top \widehat{U})\, M_i\, (O^\top \widehat{U})^{-1}$. Then, as long as $O^\top \widehat{U}$ is invertible, $(O^\top \widehat{U})^{-1}\, \widetilde{M}_i\, (O^\top \widehat{U}) = M_i$. We admit this is a slight abuse of notation, since $\widetilde{M}_i$ was previously defined via $U^\top O$, but as long as $O^\top \widehat{U}$ is invertible it does not really matter whether $\widehat{U}$ equals $U$ for the purposes of this proof. The other quantities are defined similarly. We seek to prove the following theorem:

Theorem 1 Pick any $\epsilon > 0$, $\delta < 1$. Let

$$N \ge O\!\left(\frac{d_{\max}^2\, S_H^{2(l+1)}\, S_O}{\epsilon^2\, \min_i \sigma_{S_H}(O_i)^2\, \min_{i \neq j} \sigma_{S_H}(P_{i,j})^4}\, \log\frac{O}{\delta}\right) \qquad (6)$$

Then with probability $1 - \delta$,

$$|\widehat{P}[x_1, \ldots, x_O] - P[x_1, \ldots, x_O]| \le \epsilon \qquad (7)$$

In many cases, if the frequencies of the observation symbols follow certain distributions, then the dependence on $S_O$ can be removed, as shown in HKZ [3].
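The induced 1-norms in (5) are easy to compute explicitly, which is convenient for numerically sanity-checking the later bounds. A small NumPy sketch (our own illustration, not part of the paper): since $\|T \times_1 v\|_1$ is convex in $v$, the supremum over the $\ell_1$ ball is attained at a signed basis vector, so the tensor induced 1-norm reduces to a maximum over mode-1 slices.

```python
import numpy as np

def induced_1norm(A):
    """Induced matrix 1-norm: the maximum absolute column sum."""
    return np.abs(A).sum(axis=0).max()

def tensor_1norm(T):
    """Induced 1-norm of a 3rd order tensor, eq (5).
    The sup over the l1 ball is attained at a signed basis vector e_k,
    so it reduces to the max induced 1-norm over the mode-1 slices T[k]."""
    return max(induced_1norm(T[k]) for k in range(T.shape[0]))

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

# Sanity check against the definition ||A||_1 = sup_{||v||_1 <= 1} ||Av||_1:
# the sup is attained at some +/- e_k, i.e., at a signed column of the identity.
best = max(np.abs(A @ (s * e)).sum() for e in np.eye(4) for s in (1, -1))
assert np.isclose(induced_1norm(A), best)

T = rng.standard_normal((3, 4, 4))
print(tensor_1norm(T))
```

Note the contrast with the tensor spectral norm (4), which has no such closed form; only the 1-norm collapses to a finite maximum in this way.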
That observation can easily be incorporated into our theorem if desired.

4 Concentration Bounds

Here $x$ denotes a fixed element while $i, j, k$ range over indices. As the number of samples $N$ gets large, we expect these quantities to be small:

$$\epsilon_i = \|\widehat{P}_i - P_i\|_F \qquad (8)$$

$$\epsilon_{i,j} = \|\widehat{P}_{i,j} - P_{i,j}\|_F \qquad (9)$$

$$\epsilon_{x,i,j} = \|\widehat{P}_{x,i,j} - P_{x,i,j}\|_F \qquad (10)$$

$$\epsilon_{i,j,k} = \|\widehat{P}_{i,j,k} - P_{i,j,k}\|_F \qquad (11)$$

Lemma 1 (variant of HKZ [3]) If the algorithm independently samples $N$ observation triples from the tree, then with probability at least $1 - \delta$:

$$\epsilon_i \le \sqrt{\frac{1}{N}\ln\frac{C}{\delta}} + \sqrt{\frac{1}{N}} \qquad (12)$$

$$\epsilon_{i,j} \le \sqrt{\frac{1}{N}\ln\frac{C}{\delta}} + \sqrt{\frac{1}{N}} \qquad (13)$$

$$\epsilon_{i,j,k} \le \sqrt{\frac{1}{N}\ln\frac{C}{\delta}} + \sqrt{\frac{1}{N}} \qquad (14)$$

$$\max_x\, \epsilon_{i,x,j} \le \sqrt{\frac{1}{N}\ln\frac{C}{\delta}} + \sqrt{\frac{1}{N}} \qquad (15)$$

$$\max_x\, \epsilon_{x,i,j} \le \sqrt{\frac{S_O}{N}\ln\frac{C}{\delta}} + \sqrt{\frac{S_O}{N}} \qquad (16)$$
where $C$ is some constant arising from the union bound over the $O(V^3)$ events; $V$ is the total number of observed variables in the tree. The proof is the same as that of HKZ [3] except the union bound is larger. The last bound can be made tighter, identical to HKZ, but for simplicity we do not pursue that approach here.

5 Eigenvalue Bounds

Basically this is Lemma 9 in HKZ [3], which is stated below for completeness:

Lemma 2 Suppose $\epsilon_{i,j} \le \varepsilon \cdot \sigma_{S_H}(P_{i,j})$ for some $\varepsilon < 1/2$. Let $\varepsilon_0 = \epsilon_{i,j}^2 / ((1 - \varepsilon)\,\sigma_{S_H}(P_{i,j}))^2$. Then:

1. $\varepsilon_0 < 1$
2. $\sigma_{S_H}(\widehat{U}^\top \widehat{P}_{i,j}) \ge (1 - \varepsilon)\,\sigma_{S_H}(P_{i,j})$
3. $\sigma_{S_H}(\widehat{U}^\top P_{i,j}) \ge \sqrt{1 - \varepsilon_0}\,\sigma_{S_H}(P_{i,j})$
4. $\sigma_{S_H}(O^\top \widehat{U}) \ge \sqrt{1 - \varepsilon_0}\,\sigma_{S_H}(O)$

The proof is in HKZ [3].

6 Bounding the Transformed Quantities

If Lemma 2 holds then $O^\top \widehat{U}$ is invertible. Thus, if we define $\widetilde{M}_i = (O^\top \widehat{U})\, M_i\, (O^\top \widehat{U})^{-1}$, then clearly $(O^\top \widehat{U})^{-1}\, \widetilde{M}_i\, (O^\top \widehat{U}) = M_i$. We admit this is a slight abuse of notation, since $\widetilde{M}_i$ was previously defined via $U^\top O$, but as long as $O^\top \widehat{U}$ is invertible it does not really matter for the purposes of this proof whether $\widehat{U}$ equals $U$. The other quantities are defined similarly. We seek to bound the following four quantities:

$$\delta^{one}_i = \|(O^\top \widehat{U})(\widehat{1}_i - \widetilde{1}_i)\|_1 \qquad (17)$$

$$\gamma_i = \|(\widehat{T}_i - \widetilde{T}_i) \times_1 (O_1^\top \widehat{U}_1)^{-1} \times_2 (O_2^\top \widehat{U}_2) \times_3 (O_3^\top \widehat{U}_3)^{-1}\|_{1,1} \qquad (18)$$

$$\delta_{root} = \|(\widehat{r} - \widetilde{r})^\top (O_{j_1(r)}^\top \widehat{U}_{j_1(r)})^{-1}\|_1 \qquad (19)$$

$$\Delta_i = \|(O_1^\top \widehat{U}_1)(\widehat{M}_i - \widetilde{M}_i)(O_2^\top \widehat{U}_2)^{-1}\|_1 \qquad (20)$$

Here $\widehat{M}_i$ depends on all observations in the subtree of node $i$ (note $i$ itself may be hidden or observed). Sometimes we like to distinguish between when $i$ is observed and when $i$ is hidden; thus, we sometimes refer to the quantities $\Delta^{obs}_i$ and $\Delta^{hidden}_i$ for when $i$ is observed or hidden, respectively. Again note that the numbering in $O_1^\top \widehat{U}_1$ and $O_2^\top \widehat{U}_2$ is just there to avoid confusion within the same equation; in reality there are many $U$'s and $O$'s.

Lemma 3 Assume $\epsilon_{i,j} \le \sigma_{S_H}(P_{i,j})/3$ for all $i \neq j$. Then:

$$\delta_{root} \le \frac{2\,\epsilon_r}{\sqrt{3}\,\sigma_{S_H}(O_{j_1(r)})} \qquad (21)$$

$$\delta^{one}_i \le 4\sqrt{S_H}\left(\frac{\epsilon_{i,j}}{\sigma_{S_H}(P_{i,j})^2} + \frac{\epsilon_i}{\sqrt{3}\,\sigma_{S_H}(P_{i,j})}\right) \qquad (22)$$

$$\gamma_i \le \frac{4\, S_H}{\sigma_{S_H}(O)}\left(\frac{\epsilon_{i,j}}{\sigma_{S_H}(P_{i,j})^2} + \frac{\epsilon_{i,j,k}}{\sqrt{3}\,\sigma_{S_H}(P_{i,j})}\right) \qquad (23)$$

$$\Delta^{hidden}_i \le (1 + \gamma_i)\prod_{k=1}^{J}(1 + \Delta_{j_k})\,\delta^{one}_i + (1 + \gamma_i)\, S_H \prod_{k=1}^{J}(1 + \Delta_{j_k}) - S_H \qquad (24)$$

$$\Delta^{obs}_i \le \frac{4\sqrt{S_H}}{\sigma_{S_H}(O)}\left(\frac{\epsilon_{j,i}}{\sigma_{S_H}(P_{j,i})^2} + \frac{\sum_x \epsilon_{x,i,j}}{\sqrt{3}\,\sigma_{S_H}(P_{j,i})}\right) \qquad (25)$$

where $j_1, \ldots, j_J$ are the children of node $i$. The main challenges in this part are $\Delta^{hidden}_i$ and $\gamma_i$.
The rest are similar to HKZ. However, we go through the other bounds to be more explicit about some of the properties used, since sometimes we have used different norms, etc.

6.1 $\delta_{root}$

We note that $\widetilde{r}^\top = P_{j_1(r)}^\top \widehat{U}$ and similarly $\widehat{r}^\top = \widehat{P}_{j_1(r)}^\top \widehat{U}$. Then

$$\delta_{root} = \|(\widehat{r} - \widetilde{r})^\top (O_{j_1(r)}^\top \widehat{U}_{j_1(r)})^{-1}\|_1 \le \|(\widehat{P}_{j_1(r)} - P_{j_1(r)})^\top \widehat{U}_{j_1(r)}\|_2\, \|(O_{j_1(r)}^\top \widehat{U}_{j_1(r)})^{-1}\|_2 \qquad (26)$$

$$\le \frac{\epsilon_r}{\sigma_{S_H}(O_{j_1(r)}^\top \widehat{U}_{j_1(r)})} \qquad (27)$$
The first inequality follows from the relationship between the $\ell_1$ and $\ell_2$ norms and submultiplicativity. The second follows from the matrix perturbation bound given in Section 9.1. We also use the fact that since $\widehat{U}$ is orthonormal it has spectral norm 1. Assuming that $\epsilon_{i,j} \le \sigma_{S_H}(P_{i,j})/3$ gives, by Lemma 2,

$$\delta_{root} \le \frac{2\,\epsilon_r}{\sqrt{3}\,\sigma_{S_H}(O_{j_1(r)})}.$$

6.2 $\delta^{one}_i$

$$\delta^{one}_i = \|(O^\top \widehat{U})(\widehat{1}_i - \widetilde{1}_i)\|_1 \le \sqrt{S_H}\, \|(O^\top \widehat{U})(\widehat{1}_i - \widetilde{1}_i)\|_2 \qquad (28)$$

$$\le \sqrt{S_H}\, \|\widehat{1}_i - \widetilde{1}_i\|_2 \qquad (29)$$

Here we have converted the $\ell_1$ norm to the $\ell_2$ norm, used submultiplicativity, the fact that $\widehat{U}$ is orthonormal so it has spectral norm 1, and that $O$ is a conditional probability matrix and therefore also has spectral norm 1. We note that $\widetilde{1}_i = (P_{i,j}^\top \widehat{U})^{+} P_i$ and similarly $\widehat{1}_i = (\widehat{P}_{i,j}^\top \widehat{U})^{+} \widehat{P}_i$, where $i$ and $j$ are a particular pair of observations described in the main paper.

$$\widehat{1}_i - \widetilde{1}_i = \left[(\widehat{P}_{i,j}^\top \widehat{U})^{+} - (P_{i,j}^\top \widehat{U})^{+}\right] \widehat{P}_i + (P_{i,j}^\top \widehat{U})^{+} (\widehat{P}_i - P_i) \qquad (30\text{--}31)$$

$$\|\widehat{1}_i - \widetilde{1}_i\|_2 \le \|(\widehat{P}_{i,j}^\top \widehat{U})^{+} - (P_{i,j}^\top \widehat{U})^{+}\|_2\, \|\widehat{P}_i\|_2 + \|(P_{i,j}^\top \widehat{U})^{+}\|_2\, \|\widehat{P}_i - P_i\|_2 \qquad (32\text{--}33)$$

$$\le \frac{1 + \sqrt{5}}{2} \cdot \frac{\epsilon_{i,j}}{\min\{\sigma_{S_H}(\widehat{P}_{i,j}^\top \widehat{U}),\, \sigma_{S_H}(P_{i,j}^\top \widehat{U})\}^2} + \frac{\epsilon_i}{\sigma_{S_H}(P_{i,j}^\top \widehat{U})} \qquad (34)$$

where we have used the triangle inequality in the first inequality and the submultiplicative property of matrix norms in the second. The last inequality follows from the matrix perturbation bound in Section 9.1 together with $\|\widehat{P}_i\|_2 \le 1$. Thus, using the assumption that $\epsilon_{i,j} \le \sigma_{S_H}(P_{i,j})/3$, we get

$$\delta^{one}_i \le 4\sqrt{S_H}\left(\frac{\epsilon_{i,j}}{\sigma_{S_H}(P_{i,j})^2} + \frac{\epsilon_i}{\sqrt{3}\,\sigma_{S_H}(P_{i,j})}\right) \qquad (35)$$

6.3 Tensor $\gamma_i$

Recall that

$$\widetilde{T}_i = T_i \times_1 (O_1^\top \widehat{U}_1) \times_2 (\widehat{U}_2^\top O_2)^{-1} \times_3 (O_3^\top \widehat{U}_3) = P_{m,j,k} \times_1 \widehat{U}_1^\top \times_2 (P_{l,j}^\top \widehat{U}_2)^{+} \times_3 \widehat{U}_3^\top.$$

Similarly, $\widehat{T}_i = \widehat{P}_{m,j,k} \times_1 \widehat{U}_1^\top \times_2 (\widehat{P}_{l,j}^\top \widehat{U}_2)^{+} \times_3 \widehat{U}_3^\top$. Then

$$\|(\widehat{T}_i - \widetilde{T}_i) \times_1 (O_1^\top \widehat{U}_1)^{-1} \times_2 (O_2^\top \widehat{U}_2) \times_3 (O_3^\top \widehat{U}_3)^{-1}\|_{1,1} \le \frac{S_H}{\sigma_{S_H}(O)}\, \|\widehat{T}_i - \widetilde{T}_i\|_2 \qquad (36)$$

This is because both $\widehat{U}$ and $O$ have spectral norm one, and the $S_H$ factor is the cost of converting from the induced 1-norm to the spectral norm.
$$\widehat{T}_i - \widetilde{T}_i = \widehat{P}_{m,j,k} \times_1 \widehat{U}_1^\top \times_2 (\widehat{P}_{l,j}^\top \widehat{U}_2)^{+} \times_3 \widehat{U}_3^\top - P_{m,j,k} \times_1 \widehat{U}_1^\top \times_2 (P_{l,j}^\top \widehat{U}_2)^{+} \times_3 \widehat{U}_3^\top \qquad (37)$$

$$= P_{m,j,k} \times_1 \widehat{U}_1^\top \times_2 \left[(\widehat{P}_{l,j}^\top \widehat{U}_2)^{+} - (P_{l,j}^\top \widehat{U}_2)^{+}\right] \times_3 \widehat{U}_3^\top \qquad (38\text{--}40)$$

$$+\ (\widehat{P}_{m,j,k} - P_{m,j,k}) \times_1 \widehat{U}_1^\top \times_2 (\widehat{P}_{l,j}^\top \widehat{U}_2)^{+} \times_3 \widehat{U}_3^\top \qquad (41)$$

$$\|\widehat{T}_i - \widetilde{T}_i\|_2 \le \frac{1 + \sqrt{5}}{2} \cdot \frac{\|P_{m,j,k}\|_2\, \epsilon_{l,j}}{\min\{\sigma_{S_H}(\widehat{P}_{l,j}^\top \widehat{U}_2),\, \sigma_{S_H}(P_{l,j}^\top \widehat{U}_2)\}^2} + \frac{\epsilon_{m,j,k}}{\sigma_{S_H}(\widehat{P}_{l,j}^\top \widehat{U}_2)} \qquad (42)$$

It is clear that $\|P_{m,j,k}\|_2 \le \|P_{m,j,k}\|_F \le 1$. Using the fact that $\epsilon_{i,j} \le \sigma_{S_H}(P_{i,j})/3$ gives us the following bound:

$$\gamma_i \le \frac{4\, S_H}{\sigma_{S_H}(O)}\left(\frac{\epsilon_{i,j}}{\sigma_{S_H}(P_{i,j})^2} + \frac{\epsilon_{i,j,k}}{\sqrt{3}\,\sigma_{S_H}(P_{i,j})}\right) \qquad (43)$$
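The pseudoinverse perturbation bound from Stewart and Sun (restated in Section 9.1) drives the last step here and in the analogous leaf-node bounds. It can be sanity-checked numerically; a small NumPy sketch (our own illustration with random matrices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))       # full column rank with probability 1
E = 1e-3 * rng.standard_normal((m, n))
At = A + E                            # perturbed matrix

# ||At^+ - A^+||_2 <= (1 + sqrt(5))/2 * max(||A^+||_2^2, ||At^+||_2^2) * ||E||_2
lhs = np.linalg.norm(np.linalg.pinv(At) - np.linalg.pinv(A), 2)
rhs = ((1 + np.sqrt(5)) / 2
       * max(np.linalg.norm(np.linalg.pinv(A), 2) ** 2,
             np.linalg.norm(np.linalg.pinv(At), 2) ** 2)
       * np.linalg.norm(E, 2))
assert lhs <= rhs
print(lhs, rhs)
```

Note that `np.linalg.norm(X, 2)` on a matrix returns the spectral norm (largest singular value), matching the norm in which the bound is stated.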
6.4 Bounding $\Delta_i$

We now seek to bound $\Delta_i = \|(O_1^\top \widehat{U}_1)(\widehat{M}_i - \widetilde{M}_i)(O_2^\top \widehat{U}_2)^{-1}\|_1$. There are two cases: either $i$ is a leaf or it is not.

6.4.1 $i$ is a leaf node

In this case our proof simply follows HKZ [3] and is repeated here for convenience.

$$\|(O_1^\top \widehat{U}_1)(\widehat{M}_i - \widetilde{M}_i)(O_2^\top \widehat{U}_2)^{-1}\|_1 \le \sqrt{S_H}\, \|(O_1^\top \widehat{U}_1)(\widehat{M}_i - \widetilde{M}_i)(O_2^\top \widehat{U}_2)^{-1}\|_2 \qquad (44)$$

$$\le \frac{\sqrt{S_H}\, \|\widehat{M}_i - \widetilde{M}_i\|_2}{\sigma_{S_H}(O_2^\top \widehat{U}_2)} \qquad (45)$$

Note that $\widetilde{M}_i = (P_{j,i}^\top \widehat{U}_1)^{+}\, P_{m,x_i,j}\, \widehat{U}_2$ and similarly $\widehat{M}_i = (\widehat{P}_{j,i}^\top \widehat{U}_1)^{+}\, \widehat{P}_{m,x_i,j}\, \widehat{U}_2$. Then

$$\widehat{M}_i - \widetilde{M}_i = \left[(\widehat{P}_{j,i}^\top \widehat{U}_1)^{+} - (P_{j,i}^\top \widehat{U}_1)^{+}\right] \widehat{P}_{m,x_i,j}\, \widehat{U}_2 + (P_{j,i}^\top \widehat{U}_1)^{+} (\widehat{P}_{m,x_i,j} - P_{m,x_i,j})\, \widehat{U}_2 \qquad (46\text{--}48)$$

$$\|\widehat{M}_i - \widetilde{M}_i\|_2 \le \frac{1 + \sqrt{5}}{2} \cdot \frac{\|P_{m,x_i,j}\|_2\, \epsilon_{j,i}}{\min\{\sigma_{S_H}(\widehat{P}_{j,i}^\top \widehat{U}_1),\, \sigma_{S_H}(P_{j,i}^\top \widehat{U}_1)\}^2} + \frac{\epsilon_{x_i,i,j}}{\sigma_{S_H}(P_{j,i}^\top \widehat{U}_1)} \qquad (49\text{--}50)$$

where the first inequality follows from the triangle inequality, and the second uses the matrix perturbation bound of Section 9.1 and the fact that the spectral norm of $\widehat{U}$ is 1. The final ingredient is the fact that the spectral norm is at most the Frobenius norm, which is at most the entrywise $\ell_1$ norm:

$$\|P_{m,x_i,j}\|_2 \le \|P_{m,x_i,j}\|_F \le \sum_{m,j} [P_{m,x_i,j}]_{m,j} = P[x_i = x] \qquad (51)$$

This last step uses the relation between the entrywise norms; also, because $O$ is a conditional probability matrix, $\|O\|_1 = 1$, i.e., the max column sum is 1. Using the fact that $\epsilon_{i,j} \le \sigma_{S_H}(P_{i,j})/3$ gives, for a fixed value $x$ of $X_i$,

$$\Delta_{i,x} \le \frac{4\sqrt{S_H}}{\sigma_{S_H}(O)}\left(\frac{P[x_i = x]\, \epsilon_{j,i}}{\sigma_{S_H}(P_{j,i})^2} + \frac{\epsilon_{x,i,j}}{\sqrt{3}\,\sigma_{S_H}(P_{j,i})}\right) \qquad (52)$$

Summing over $x$ gives

$$\Delta^{obs}_i \le \frac{4\sqrt{S_H}}{\sigma_{S_H}(O)}\left(\frac{\epsilon_{j,i}}{\sigma_{S_H}(P_{j,i})^2} + \frac{\sum_x \epsilon_{x,i,j}}{\sqrt{3}\,\sigma_{S_H}(P_{j,i})}\right) \qquad (53)$$
6.4.2 $i$ is not a leaf node

Let $\widehat{m}_{J:1} = \widehat{M}_{j_J} \cdots \widehat{M}_{j_1} \widehat{1}_i$ and $\widetilde{m}_{J:1} = \widetilde{M}_{j_J} \cdots \widetilde{M}_{j_1} \widetilde{1}_i$. Then

$$\|(O_2^\top \widehat{U}_2)(\widehat{M}_i - \widetilde{M}_i)(O_3^\top \widehat{U}_3)^{-1}\|_1 \qquad (54)$$

$$= \|(O_2^\top \widehat{U}_2)\big(\widehat{T}_i \times_1 \widehat{m}_{J:1} - \widetilde{T}_i \times_1 \widetilde{m}_{J:1}\big)(O_3^\top \widehat{U}_3)^{-1}\|_1 \qquad (55\text{--}56)$$

$$= \|(O_2^\top \widehat{U}_2)\big((\widehat{T}_i - \widetilde{T}_i) \times_1 \widetilde{m}_{J:1} + (\widehat{T}_i - \widetilde{T}_i) \times_1 (\widehat{m}_{J:1} - \widetilde{m}_{J:1}) + \widetilde{T}_i \times_1 (\widehat{m}_{J:1} - \widetilde{m}_{J:1})\big)(O_3^\top \widehat{U}_3)^{-1}\|_1 \qquad (57)$$

$$\le \|(\widehat{T}_i - \widetilde{T}_i) \times_1 (O_1^\top \widehat{U}_1)^{-1} \times_2 (O_2^\top \widehat{U}_2) \times_3 (O_3^\top \widehat{U}_3)^{-1}\|_{1,1}\, \|(O_1^\top \widehat{U}_1)\widetilde{m}_{J:1}\|_1 \qquad (58)$$

$$+\ \|(\widehat{T}_i - \widetilde{T}_i) \times_1 (O_1^\top \widehat{U}_1)^{-1} \times_2 (O_2^\top \widehat{U}_2) \times_3 (O_3^\top \widehat{U}_3)^{-1}\|_{1,1}\, \|(O_1^\top \widehat{U}_1)(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1 \qquad (59)$$

$$+\ \|\widetilde{T}_i \times_1 (O_1^\top \widehat{U}_1)^{-1} \times_2 (O_2^\top \widehat{U}_2) \times_3 (O_3^\top \widehat{U}_3)^{-1}\|_{1,1}\, \|(O_1^\top \widehat{U}_1)(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1 \qquad (60)$$

The first term is bounded by

$$\|(\widehat{T}_i - \widetilde{T}_i) \times_1 (O_1^\top \widehat{U}_1)^{-1} \times_2 (O_2^\top \widehat{U}_2) \times_3 (O_3^\top \widehat{U}_3)^{-1}\|_{1,1}\, \|(O_1^\top \widehat{U}_1)\widetilde{m}_{J:1}\|_1 \le S_H\, \gamma_i, \qquad (61)$$

since the entries of $(O_1^\top \widehat{U}_1)\widetilde{m}_{J:1}$ are probabilities, so its $\ell_1$ norm is at most $S_H$. The second term is bounded by

$$\gamma_i\, \|(O_1^\top \widehat{U}_1)(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1 \qquad (62\text{--}63)$$

The third term is bounded by

$$\|\widetilde{T}_i \times_1 (O_1^\top \widehat{U}_1)^{-1} \times_2 (O_2^\top \widehat{U}_2) \times_3 (O_3^\top \widehat{U}_3)^{-1}\|_{1,1}\, \|(O_1^\top \widehat{U}_1)(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1 \le \|(O_1^\top \widehat{U}_1)(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1, \qquad (64)$$

since untransforming $\widetilde{T}_i$ recovers $T_i$, which is a conditional probability table and hence has induced 1-norm at most 1. In the next section, we will see that

$$\|(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1 \le \prod_{k=1}^{J}(1 + \Delta_{j_k})\,\delta^{one}_i + S_H\left(\prod_{k=1}^{J}(1 + \Delta_{j_k}) - 1\right).$$

So the overall bound is

$$\Delta_i \le (1 + \gamma_i)\prod_{k=1}^{J}(1 + \Delta_{j_k})\,\delta^{one}_i + (1 + \gamma_i)\, S_H \prod_{k=1}^{J}(1 + \Delta_{j_k}) - S_H, \qquad (65)$$

where $j_1, \ldots, j_J$ are the children of node $i$.

6.5 Bounding $\|(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1$

Lemma 4

$$\|(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1 \le \prod_{k=1}^{J}(1 + \Delta_{j_k})\,\delta^{one}_i + S_H\left(\prod_{k=1}^{J}(1 + \Delta_{j_k}) - 1\right) \qquad (66)$$

where $j_1, \ldots, j_J$ are the children of node $i$.

The proof is by induction. Base case: $\|(O^\top \widehat{U})(\widehat{1}_i - \widetilde{1}_i)\|_1 \le \delta^{one}_i$, by definition of $\delta^{one}_i$. Inductive step: suppose the claim holds up to $u - 1$; we show it holds for $u$. Thus

$$\|(O^\top \widehat{U})(\widehat{m}_{u-1:1} - \widetilde{m}_{u-1:1})\|_1 \le \prod_{k=1}^{u-1}(1 + \Delta_{j_k})\,\delta^{one}_i + S_H\left(\prod_{k=1}^{u-1}(1 + \Delta_{j_k}) - 1\right) \qquad (67)$$
We now decompose

$$(O^\top \widehat{U})(\widehat{m}_{u:1} - \widetilde{m}_{u:1}) = (O^\top \widehat{U})\big[(\widehat{M}_u - \widetilde{M}_u)\widetilde{m}_{u-1:1} + (\widehat{M}_u - \widetilde{M}_u)(\widehat{m}_{u-1:1} - \widetilde{m}_{u-1:1}) + \widetilde{M}_u(\widehat{m}_{u-1:1} - \widetilde{m}_{u-1:1})\big] \qquad (68)$$

Using the triangle inequality, $\|(O^\top \widehat{U})(\widehat{m}_{u:1} - \widetilde{m}_{u:1})\|_1$ is at most

$$\|(O^\top \widehat{U})(\widehat{M}_u - \widetilde{M}_u)(O_1^\top \widehat{U}_1)^{-1}\|_1\, \|(O_1^\top \widehat{U}_1)\widetilde{m}_{u-1:1}\|_1 \qquad (69)$$

$$+\ \|(O^\top \widehat{U})(\widehat{M}_u - \widetilde{M}_u)(O_1^\top \widehat{U}_1)^{-1}\|_1\, \|(O_1^\top \widehat{U}_1)(\widehat{m}_{u-1:1} - \widetilde{m}_{u-1:1})\|_1 \qquad (70)$$

$$+\ \|(O^\top \widehat{U})\widetilde{M}_u(O^\top \widehat{U})^{-1}\|_1\, \|(O^\top \widehat{U})(\widehat{m}_{u-1:1} - \widetilde{m}_{u-1:1})\|_1 \qquad (71)$$

Again, we are just numbering the $U$'s and $O$'s for clarity, to see which corresponds with which; the numbering is omitted in the actual theorem statements since we take minimums, etc., at the end. We now bound these terms.

The first term is bounded by

$$\|(O^\top \widehat{U})(\widehat{M}_u - \widetilde{M}_u)(O_1^\top \widehat{U}_1)^{-1}\|_1\, \|(O_1^\top \widehat{U}_1)\widetilde{m}_{u-1:1}\|_1 \le S_H\, \Delta_u, \qquad (72)$$

since $\Delta_u = \|(O^\top \widehat{U})(\widehat{M}_u - \widetilde{M}_u)(O_1^\top \widehat{U}_1)^{-1}\|_1$ and the entries of $(O_1^\top \widehat{U}_1)\widetilde{m}_{u-1:1}$ are probabilities, so its $\ell_1$ norm is at most $S_H$.

The second term can be bounded by the inductive hypothesis:

$$\|(O^\top \widehat{U})(\widehat{M}_u - \widetilde{M}_u)(O_1^\top \widehat{U}_1)^{-1}\|_1\, \|(O_1^\top \widehat{U}_1)(\widehat{m}_{u-1:1} - \widetilde{m}_{u-1:1})\|_1 \le \Delta_u\left[\prod_{k=1}^{u-1}(1 + \Delta_{j_k})\,\delta^{one}_i + S_H\left(\prod_{k=1}^{u-1}(1 + \Delta_{j_k}) - 1\right)\right] \qquad (73)$$

The third term is bounded by observing that $(O^\top \widehat{U})\widetilde{M}_u(O^\top \widehat{U})^{-1} = \mathrm{diag}(\Pr[x_u \mid \mathrm{Parent}])$. Thus it is diagonal, and $\Pr[x_u \mid \mathrm{Parent}]$ has max row or column sum at most 1. This means that the third term is bounded by the inductive hypothesis as well:

$$\|(O^\top \widehat{U})\widetilde{M}_u(O^\top \widehat{U})^{-1}\|_1\, \|(O^\top \widehat{U})(\widehat{m}_{u-1:1} - \widetilde{m}_{u-1:1})\|_1 \le \prod_{k=1}^{u-1}(1 + \Delta_{j_k})\,\delta^{one}_i + S_H\left(\prod_{k=1}^{u-1}(1 + \Delta_{j_k}) - 1\right) \qquad (74)$$

Combining the three terms completes the induction.

7 Bounding the Propagation of Error in the Tree

We now wrap up the proof based on the approach of HKZ [3].

Lemma 5

$$|\widehat{P}[x_1, \ldots, x_O] - P[x_1, \ldots, x_O]| \le \sqrt{S_H}\,\delta_{root} + (1 + \delta_{root})\left[\prod_{k=1}^{J}(1 + \Delta_{j_k})\,\delta^{one}_r + S_H\left(\prod_{k=1}^{J}(1 + \Delta_{j_k}) - 1\right)\right] \qquad (75)$$

To see this, decompose

$$\widehat{P}[x_1, \ldots, x_O] - P[x_1, \ldots, x_O] = \widehat{r}^\top \widehat{M}_{j_J} \cdots \widehat{M}_{j_1} \widehat{1}_r - \widetilde{r}^\top \widetilde{M}_{j_J} \cdots \widetilde{M}_{j_1} \widetilde{1}_r \qquad (76)$$

$$= (\widehat{r} - \widetilde{r})^\top (O^\top \widehat{U})^{-1}(O^\top \widehat{U})\,\widetilde{m}_{J:1} \qquad (77)$$

$$+\ (\widehat{r} - \widetilde{r})^\top (O^\top \widehat{U})^{-1}(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1}) \qquad (78)$$

$$+\ \widetilde{r}^\top (O^\top \widehat{U})^{-1}(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1}) \qquad (79)$$

The first sum is bounded using Hölder's inequality, noting that $(O^\top \widehat{U})\widetilde{m}_{J:1}$ is a vector of conditional probabilities of all observed variables conditioned on the root:

$$|(\widehat{r} - \widetilde{r})^\top (O^\top \widehat{U})^{-1}(O^\top \widehat{U})\widetilde{m}_{J:1}| \le \|(\widehat{r} - \widetilde{r})^\top (O^\top \widehat{U})^{-1}\|_1\, \|(O^\top \widehat{U})\widetilde{m}_{J:1}\|_\infty \le \sqrt{S_H}\,\delta_{root} \qquad (80\text{--}81)$$
The second sum is bounded by another application of Hölder's inequality and the previous lemma:

$$|(\widehat{r} - \widetilde{r})^\top (O^\top \widehat{U})^{-1}(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1})| \le \|(\widehat{r} - \widetilde{r})^\top (O^\top \widehat{U})^{-1}\|_\infty\, \|(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1 \qquad (82\text{--}83)$$

$$\le \delta_{root}\left[\prod_{k=1}^{J}(1 + \Delta_{j_k})\,\delta^{one}_r + S_H\left(\prod_{k=1}^{J}(1 + \Delta_{j_k}) - 1\right)\right] \qquad (84)$$

The third sum is also bounded by Hölder's inequality and the previous lemmas, noting that the entries of $\widetilde{r}^\top (O^\top \widehat{U})^{-1}$ are the probabilities $P[X_r = k]$, so its $\ell_\infty$ norm is at most 1:

$$|\widetilde{r}^\top (O^\top \widehat{U})^{-1}(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1})| \le \|\widetilde{r}^\top (O^\top \widehat{U})^{-1}\|_\infty\, \|(O^\top \widehat{U})(\widehat{m}_{J:1} - \widetilde{m}_{J:1})\|_1 \qquad (85\text{--}86)$$

$$\le \prod_{k=1}^{J}(1 + \Delta_{j_k})\,\delta^{one}_r + S_H\left(\prod_{k=1}^{J}(1 + \Delta_{j_k}) - 1\right) \qquad (87)$$

Combining these bounds gives the desired result.

8 Putting It All Together

We seek

$$|\widehat{P}[x_1, \ldots, x_O] - P[x_1, \ldots, x_O]| \le \epsilon \qquad (88)$$

Using the fact that $(1 + a/t)^t \le 1 + 2a$ for $a < 1/2$, it suffices that each $\Delta_{j_k} \le O(\epsilon/(S_H J))$. However, $\Delta_j$ is defined recursively, and thus the error accumulates exponentially in the longest path of hidden nodes: for example, we need $\Delta^{obs}_i \le O\!\left(\frac{\epsilon}{d_{\max} S_H^{\,l}}\right)$, where $l$ is the longest path of hidden nodes. Tracing this back through the concentration bounds gives the result: Pick any $\epsilon > 0$, $\delta < 1$. Let

$$N \ge O\!\left(\frac{d_{\max}^2\, S_H^{2(l+1)}\, S_O}{\epsilon^2\, \min_i \sigma_{S_H}(O_i)^2\, \min_{i \neq j} \sigma_{S_H}(P_{i,j})^4}\, \log\frac{O}{\delta}\right) \qquad (89)$$

Then with probability $1 - \delta$,

$$|\widehat{P}[x_1, \ldots, x_O] - P[x_1, \ldots, x_O]| \le \epsilon \qquad (90)$$

In many cases, if the frequencies of the observation symbols follow certain distributions, then the dependence on $S_O$ can be removed, as shown in HKZ [3].

9 Appendix

9.1 Matrix Perturbation Bounds

This is Theorem 3.8 from p. 143 of Stewart and Sun, 1990 [6]. Let $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, and let $\widetilde{A} = A + E$. Then

$$\|\widetilde{A}^{+} - A^{+}\|_2 \le \frac{1 + \sqrt{5}}{2}\, \max\{\|A^{+}\|_2^2,\, \|\widetilde{A}^{+}\|_2^2\}\, \|E\|_2 \qquad (91)$$

References

[1] Myung J. Choi, Vincent Y. Tan, Animashree Anandkumar, and Alan S. Willsky. Learning latent tree graphical models. arXiv:1009.2722v1, 2010.
[2] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[3] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proc. Annual Conf. Computational Learning Theory, 2009.
[4] N.H. Nguyen, P. Drineas, and T.D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. arXiv preprint arXiv:1005.4732, 2010.
[5] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann, 1988.
[6] G.W. Stewart and J. Sun. Matrix Perturbation Theory, volume 175. Academic Press, New York, 1990.