Supplemental for Spectral Algorithm For Latent Tree Graphical Models

Ankur P. Parikh, Le Song, Eric P. Xing

This supplemental contains three main parts.

1. The first is network plots of the latent variable tree learned by [1] for the stock market data, together with the Chow-Liu tree, to give a more intuitive explanation of why latent variable trees can lead to better performance.
2. The second is a more detailed presentation of the tensor representation in which internal nodes are allowed to be evidence variables.
3. The third is the proof of Theorem 1.

1 Latent Tree Structure for Stock Data

The latent tree structure learned by the algorithm of [1] is shown in Figure 1. The blue nodes are hidden and the red nodes are observed. Note how integrating out some of these hidden nodes could lead to very large cliques. Thus it is not surprising that both our spectral method and EM perform better than Chow-Liu. The Chow-Liu tree is shown in Figure 2. Note how it is forced to pick some of the observed variables as hubs even if latent variables may be more natural.

Figure 1: Latent variable tree learned by [1]. The hidden variables are in blue while the observed variables are in red. As one can see, the hidden variables can model significantly more complex relationships among the observed variables.

Figure 2: Tree learned by the Chow-Liu algorithm over only the observed variables. Note how it is forced to pick some of the observed variables as hubs even if latent variables may be more natural.

2 More Detailed Information about the Tensor Representation for LTMs

The computation of the marginal distribution of the observed variables can be expressed in terms of tensor multiplications. Basically, the information contained in each tensor corresponds to the information in a conditional probability table (CPT) of the model, and the tensor multiplications implement the summations. However, there are multiple ways of rewriting the marginal distribution of the observed variables using tensor notation, and not all of them provide intuition or an easy derivation of a spectral algorithm. In this section, we derive a specific representation of latent tree models which requires only tensors up to 3rd order and which provides us a basis for deriving a spectral algorithm.

More specifically, we first select a latent or observed variable as the root node and sort the nodes in the tree in topological order. Then we associate the root node X_r with a vector r related to the marginal probability of X_r. Depending on whether the root node is latent or observed, its entries are defined as

    r_k = P[X_r = k]              (X_r latent)
    r_k = δ_{k,x_r} P[x_r]        (X_r observed)

where δ_{k,x_r} is an indicator variable defined as δ_{k,x_r} = 1 if k = x_r and 0 otherwise. Effectively, δ_{k,x_r} sets all entries of r to zero except the one corresponding to P[x_r].

Next, we associate each internal node X_i with a 3rd order tensor T_i related to the conditional probability table between X_i and its parent X_{π_i}. This tensor is diagonal in its 2nd and 3rd modes, and hence its nonzero entries can be accessed by two indices k and l. Depending on whether the internal node and its parent are latent or observed variables, the nonzero entries of T_i are defined as

    T_i(k, l, l) = P[X_i = k | X_{π_i} = l]                   (X_i latent,   X_{π_i} latent)
    T_i(k, l, l) = δ_{l,x_{π_i}} P[X_i = k | x_{π_i}]          (X_i latent,   X_{π_i} observed)
    T_i(k, l, l) = δ_{k,x_i} P[x_i | X_{π_i} = l]              (X_i observed, X_{π_i} latent)
    T_i(k, l, l) = δ_{k,x_i} δ_{l,x_{π_i}} P[x_i | x_{π_i}]    (X_i observed, X_{π_i} observed)

where δ_{k,x_i} and δ_{l,x_{π_i}} are also indicator variables. Effectively, the indicator variables zero out further entries in T_i for those values that are not equal to the actual observation.

Last, we associate each leaf node X_i, which is always observed, with a diagonal matrix M_i related to the likelihood of x_i. Depending on whether the parent of X_i is latent or observed, the diagonal entries of M_i are defined as

    M_i(k, k) = P[x_i | X_{π_i} = k]               (X_{π_i} latent)
    M_i(k, k) = δ_{k,x_{π_i}} P[x_i | x_{π_i}]     (X_{π_i} observed)

Let the M_i defined above be the messages passed from the leaf nodes to their parents. We can show that the marginal probability of the observed variables can be computed recursively using a message passing algorithm: each node in the tree sends a message to its parent according to the reverse topological order of the nodes, and the final messages are aggregated at the root to give the desired quantity. More formally, the outgoing message from an internal node X_i to its parent can be computed as

    M_i = T_i ×_1 (M_{j_1} M_{j_2} ... M_{j_J} 1_i),    (1)

where j_1, j_2, ..., j_J ∈ χ_i range over all children of X_i (J = |χ_i|). The vector 1_i is a vector of all ones of suitable size, and it is used to reduce the incoming messages (which are all diagonal matrices) to a single vector. The computation in (1) essentially implements the message update we often see in an ordinary message passing algorithm [5], i.e.,

    m_i[x_{π_i}] = Σ_{x_i} P[x_i | x_{π_i}] m_{j_1}[x_i] ... m_{j_J}[x_i],    (2)

where m_j[x_i] represents an incoming message to X_i, or equivalently the intermediate result of the marginalization operation obtained by summing out all latent variables in the subtree T_j. The product M_{j_1} M_{j_2} ... M_{j_J} 1_i corresponds to aggregating all incoming messages m_{j_1}[x_i] ... m_{j_J}[x_i], and the multiplication T_i ×_1 (·) corresponds to the multiplication by P[x_i | x_{π_i}] and the summation over x_i. At the root node, all incoming messages are combined to produce the final joint probability, i.e.,

    P[x_1, ..., x_{|O|}] = r^T (M_{j_1} M_{j_2} ... M_{j_J}) 1_r.    (3)

Here multiplying by r^T basically implements the operation Σ_{x_r} P[x_r] (·), which sums out the root variable.
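To make the recursion in (1)-(3) concrete, here is a minimal numpy sketch (not the authors' code; all variable names, table sizes, and the tiny example tree are illustrative assumptions). It uses a hidden root X_r, one hidden internal node X_i with two observed leaf children X_1 and X_2, and a third observed leaf X_3 attached directly to the root; it builds T_i and the leaf matrices as defined above, runs the message passing of (1) and (3), and checks the result against brute-force enumeration of the hidden variables.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
S_H, S_O = 3, 4  # hidden and observed state space sizes (illustrative)

def rand_cpt(rows, cols):
    """Random conditional probability table; columns index the parent state."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

pi_r = rng.random(S_H); pi_r /= pi_r.sum()   # P[X_r]
A    = rand_cpt(S_H, S_H)                    # P[X_i = k | X_r = l]
O1   = rand_cpt(S_O, S_H)                    # P[X_1 | X_i]
O2   = rand_cpt(S_O, S_H)                    # P[X_2 | X_i]
O3   = rand_cpt(S_O, S_H)                    # P[X_3 | X_r]

x1, x2, x3 = 1, 3, 0                          # a fixed observation

# Leaf messages: diagonal matrices M_j(k, k) = P[x_j | parent = k]
M1, M2, M3 = np.diag(O1[x1]), np.diag(O2[x2]), np.diag(O3[x3])

# Internal tensor T_i(k, l, l) = P[X_i = k | X_r = l], diagonal in modes 2 and 3
T_i = np.zeros((S_H, S_H, S_H))
for k, l in product(range(S_H), range(S_H)):
    T_i[k, l, l] = A[k, l]

# Eq. (1): M_i = T_i x_1 (M_1 M_2 1); the result is again a diagonal matrix
v   = M1 @ M2 @ np.ones(S_H)                  # aggregate incoming messages
M_i = np.einsum('kab,k->ab', T_i, v)          # mode-1 product with v

# Eq. (3): combine messages at the root
p_spectral = pi_r @ (M_i @ M3) @ np.ones(S_H)

# Brute-force check: sum over the two hidden variables
p_brute = sum(pi_r[l] * A[k, l] * O1[x1, k] * O2[x2, k] * O3[x3, l]
              for k, l in product(range(S_H), range(S_H)))

print(p_spectral, p_brute)                    # the two numbers should agree
```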

3 Notation for the Proof of Theorem 1

We now proceed to prove Theorem 1. ‖·‖ refers to the spectral norm for matrices and tensors, but to the ordinary Euclidean norm for vectors. ‖·‖_1 refers to the induced 1-norm for matrices and tensors (max absolute column sum), but to the ordinary ℓ1 norm for vectors. ‖·‖_F refers to the Frobenius norm. The tensor spectral norm for 3rd order tensors is defined in [4]:

    ‖T‖ = sup_{‖v_1‖ ≤ 1, ‖v_2‖ ≤ 1, ‖v_3‖ ≤ 1} |T ×_1 v_1 ×_2 v_2 ×_3 v_3|.    (4)

We will define the induced 1-norm of a tensor as

    ‖T‖_{1,1} = sup_{‖v‖_1 ≤ 1} ‖T ×_1 v‖_1,    (5)

using the induced ℓ1 norm of a matrix, i.e., ‖A‖_1 = sup_{‖v‖_1 ≤ 1} ‖Av‖_1. For more information about matrix norms see [2].

In general, we suppress the actual subscripts/superscripts on U and O. It is implied that U and O can often be different depending on the transform being considered; however, carrying the indices makes the notation very messy. It will generally be clear from context which U and O are being referred to. When it is not, we will arbitrarily index them 1, 2, ..., so that it is clear which corresponds to which.

In general, for simplicity of exposition, we assume that all internal nodes in the tree are unobserved and all leaves are observed, since this is the hardest case. The proof generally follows the technique of HKZ [3], but has key differences due to the tree topology instead of the HMM.

We define M̃_i = (O^T Û)^{-1} M_i (O^T Û). Then, as long as O^T Û is invertible, (O^T Û) M̃_i (O^T Û)^{-1} = M_i. We admit this is a slight abuse of notation, since M̃_i was previously defined to be (U^T O)^{-1} M_i (U^T O), but as long as O^T Û is invertible it does not really matter whether it equals U^T O or not for the purposes of this proof. The other quantities are defined similarly; throughout, a hat denotes the corresponding quantity computed from the empirical probability tables.

We seek to prove the following theorem:

Theorem 1. Pick any ε > 0 and δ < 1. Let

    N ≥ O( d_max S_H^{l+1} S_O log(|O|/δ) / (ε² min_i σ_{S_H}(O_i)² min_{i,j} σ_{S_H}(P_{i,j})⁴) ).    (6)

Then with probability at least 1 − δ,

    Σ_{x_1,...,x_{|O|}} | P[x_1, ..., x_{|O|}] − P̂[x_1, ..., x_{|O|}] | ≤ ε.    (7)

In many cases, if the frequencies of the observation symbols follow certain distributions, then the dependence on S_O can be removed, as shown in HKZ [3]. That observation can easily be incorporated into our theorem if desired.

4 Concentration Bounds

Here x denotes a fixed element while i, j, k range over indices. As the number of samples N gets large, we expect the following quantities to be small:

    ε_i       = ‖P̂_i − P_i‖_F,            (8)
    ε_{i,j}   = ‖P̂_{i,j} − P_{i,j}‖_F,    (9)
    ε_{x,i,j} = ‖P̂_{x,i,j} − P_{x,i,j}‖_F,    (10)
    ε_{i,j,k} = ‖P̂_{i,j,k} − P_{i,j,k}‖_F.    (11)

Lemma 1 (variant of HKZ [3]). If the algorithm independently samples N observation triples from the tree, then with probability at least 1 − δ,

    ε_i             ≤ √( ln(C|O|³/δ) / N ) + √( 1/N ),           (12)
    ε_{i,j}         ≤ √( ln(C|O|³/δ) / N ) + √( 1/N ),           (13)
    ε_{i,j,k}       ≤ √( ln(C|O|³/δ) / N ) + √( 1/N ),           (14)
    max_x ε_{i,x,j} ≤ √( ln(C|O|³/δ) / N ) + √( 1/N ),           (15)
    Σ_x ε_{x,i,j}   ≤ √( S_O ln(C|O|³/δ) / N ) + √( S_O/N ),     (16)

where C is some constant arising from the union bound over the O(V³) events involved; V is the total number of observed variables in the tree. The proof is the same as that of HKZ [3] except that the union bound is larger. The last bound can be made tighter, identically to HKZ, but for simplicity we do not pursue that approach here.
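As a rough numerical illustration of the quantities in (8)-(11) and of the 1/√N behavior asserted in Lemma 1, the following small numpy sketch (purely illustrative; the joint table, sample sizes, and names are assumptions, not part of the paper) estimates a pairwise table P_{i,j} from N i.i.d. samples and measures ε_{i,j} = ‖P̂_{i,j} − P_{i,j}‖_F.

```python
import numpy as np

rng = np.random.default_rng(2)
S_O = 5

# A fixed "true" joint table P_{i,j} over a pair of observed variables
P_ij = rng.random((S_O, S_O))
P_ij /= P_ij.sum()

def empirical_pair_table(P, N):
    """Draw N i.i.d. pairs from P and return the empirical table P_hat."""
    flat = rng.choice(P.size, size=N, p=P.ravel())
    counts = np.bincount(flat, minlength=P.size).reshape(P.shape)
    return counts / N

for N in [100, 1_000, 10_000, 100_000]:
    eps = np.linalg.norm(empirical_pair_table(P_ij, N) - P_ij, 'fro')
    print(f"N = {N:6d}   eps_ij = {eps:.4f}   eps_ij * sqrt(N) = {eps * np.sqrt(N):.3f}")
# eps_ij decays roughly like 1/sqrt(N), matching the form of the bounds in Lemma 1.
```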

5 Eigenvalue Bounds

Basically this is Lemma 9 in HKZ [3], which is stated below for completeness:

Lemma 2. Suppose ε_{i,j} ≤ ε · σ_{S_H}(P_{i,j}) for some ε < 1/2. Let ε_0 = ε_{i,j}² / ((1 − ε) σ_{S_H}(P_{i,j}))². Then:
1. ε_0 < 1,
2. σ_{S_H}(Û^T P̂_{i,j}) ≥ (1 − ε) σ_{S_H}(P_{i,j}),
3. σ_{S_H}(Û^T P_{i,j}) ≥ √(1 − ε_0) σ_{S_H}(P_{i,j}),
4. σ_{S_H}(O^T Û) ≥ √(1 − ε_0) σ_{S_H}(O).

The proof is in HKZ [3].

6 Bounding the Transformed Quantities

If Lemma 2 holds, then O^T Û is invertible. Thus, if we define M̃_i = (O^T Û)^{-1} M_i (O^T Û), then clearly (O^T Û) M̃_i (O^T Û)^{-1} = M_i. We admit this is a slight abuse of notation, since M̃_i was previously defined to be (U^T O)^{-1} M_i (U^T O), but as long as O^T Û is invertible it does not really matter whether it equals U^T O or not for the purposes of this proof. The other quantities are defined similarly.

We seek to bound the following four quantities:

    δ^i_one = ‖ (O^T Û)(1̂_i − 1̃_i) ‖_1,    (17)
    γ_i = ‖ (T̂_i − T̃_i) ×_1 (O_1^T Û_1)^{-1} ×_2 (O_2^T Û_2) ×_3 (O_3^T Û_3)^{-1} ‖_{1,1},    (18)
    δ_root = ‖ (r̂ − r̃)^T (O_{j_1(r)}^T Û_{j_1(r)})^{-1} ‖_1,    (19)
    Δ_i = ‖ (O_1^T Û_1)(M̂_i − M̃_i)(O_2^T Û_2)^{-1} ‖_1.    (20)

Here the message M̃_i aggregates all observations that are in the subtree of node i, and i itself may be hidden or observed. Sometimes we like to distinguish between the case when i is observed and when i is hidden; thus, we sometimes refer to the quantities Δ^obs_i and Δ^hidden_i for when i is observed or hidden, respectively. Again, note that the numbering in O_1^T Û_1 and O_2^T Û_2 is just there to avoid confusion within the same equation; in reality there are many U's and O's.

Lemma 3. Assume ε_{i,j} ≤ σ_{S_H}(P_{i,j})/3 for all pairs i, j. Then:

    δ_root ≤ 2 ε_r / (√3 σ_{S_H}(O_{j_1(r)})),    (21)
    δ^i_one ≤ 4√S_H ( ε_{i,j}/σ_{S_H}(P_{i,j})² + ε_i/(√3 σ_{S_H}(P_{i,j})) ),    (22)
    γ_i ≤ (4 S_H / σ_{S_H}(O)) ( ε_{i,j}/σ_{S_H}(P_{i,j})² + ε_{i,j,k}/(√3 σ_{S_H}(P_{i,j})) ),    (23)
    Δ^hidden_i ≤ (1 + γ_i) Π_{k=1}^J (1 + Δ_{j_k}) δ^i_one + (1 + γ_i) √S_H Π_{k=1}^J (1 + Δ_{j_k}) − √S_H,    (24)
    Δ^obs_i ≤ (4√S_H / σ_{S_H}(O)) ( ε_{j,i}/σ_{S_H}(P_{j,i})² + Σ_x ε_{x,i,j}/(√3 σ_{S_H}(P_{j,i})) ),    (25)

where j_1, ..., j_J are the children of node i and j_1(r) is a particular child of the root. The main challenge in this part is Δ^hidden_i and γ_i; the rest are similar to HKZ. However, we go through the other bounds as well to be more explicit about some of the properties used, since we sometimes use different norms, etc.

6.1 δ_root

We note that r̃ = Û^T P_{j_1(r)} and similarly r̂ = Û^T P̂_{j_1(r)}. Then

    δ_root = ‖ (r̂ − r̃)^T (O_{j_1(r)}^T Û_{j_1(r)})^{-1} ‖_1
           ≤ ‖ P̂_{j_1(r)} − P_{j_1(r)} ‖ ‖ Û_{j_1(r)} ‖ ‖ (O_{j_1(r)}^T Û_{j_1(r)})^{-1} ‖_1
           ≤ ε_r / σ_{S_H}(O_{j_1(r)}^T Û_{j_1(r)}).

The first inequality follows from the relationship between the ℓ1 and ℓ2 norms and submultiplicativity; the second follows from the matrix perturbation bound (91) in the appendix. We also use the fact that, since Û is orthonormal, it has spectral norm 1. Assuming that ε_{i,j} ≤ σ_{S_H}(P_{i,j})/3, Lemma 2 then gives

    δ_root ≤ 2 ε_r / (√3 σ_{S_H}(O_{j_1(r)})).
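As an aside on the induced 1-norm (5) that appears in the definition of γ_i in (18): for a finite 3rd order tensor the supremum over the ℓ1 ball is attained at a signed standard basis vector (the map v ↦ ‖T ×_1 v‖_1 is convex), so ‖T‖_{1,1} is simply the largest induced 1-norm among the mode-1 slices of T. The following numpy sketch (an illustration under the conventions stated in Section 3, not part of the authors' algorithm) computes it that way and compares against random sampling, which only provides a lower bound.

```python
import numpy as np

def tensor_induced_1_norm(T):
    """||T||_{1,1} = sup_{||v||_1 <= 1} ||T x_1 v||_1  (Eq. 5).

    v -> ||T x_1 v||_1 is convex, so the sup over the l1 ball is attained at
    +/- a standard basis vector; T x_1 e_i is the i-th mode-1 slice, and the
    matrix induced 1-norm of a slice is its max absolute column sum.
    """
    slice_norms = [np.abs(T[i]).sum(axis=0).max() for i in range(T.shape[0])]
    return max(slice_norms)

def mode1_product(T, v):
    """(T x_1 v)[j, k] = sum_i T[i, j, k] v[i]."""
    return np.einsum('ijk,i->jk', T, v)

rng = np.random.default_rng(1)
T = rng.standard_normal((4, 3, 5))

exact = tensor_induced_1_norm(T)

# Monte-Carlo lower bound: random v on the boundary of the l1 ball
best = 0.0
for _ in range(2000):
    v = rng.standard_normal(4)
    v /= np.abs(v).sum()                       # ||v||_1 = 1
    best = max(best, np.abs(mode1_product(T, v)).sum(axis=0).max())

print(exact, best)   # best is a lower bound on exact
```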

6.2 δ^i_one

    δ^i_one = ‖(O^T Û)(1̂_i − 1̃_i)‖_1 ≤ √S_H ‖(O^T Û)(1̂_i − 1̃_i)‖ ≤ √S_H ‖1̂_i − 1̃_i‖.

Here we have converted the ℓ1 norm to the ℓ2 norm, used submultiplicativity, the fact that Û is orthonormal and so has spectral norm 1, and the fact that O is a conditional probability matrix and therefore also has spectral norm 1.

We note that 1̃_i = (P_{i,j} Û)^+ P_i and similarly 1̂_i = (P̂_{i,j} Û)^+ P̂_i, where i and j are a particular pair of observations described in the main paper. Then

    ‖1̂_i − 1̃_i‖ = ‖(P̂_{i,j} Û)^+ P̂_i − (P_{i,j} Û)^+ P_i‖
                ≤ ‖(P̂_{i,j} Û)^+ P̂_i − (P_{i,j} Û)^+ P̂_i‖ + ‖(P_{i,j} Û)^+ P̂_i − (P_{i,j} Û)^+ P_i‖
                ≤ ‖(P̂_{i,j} Û)^+ − (P_{i,j} Û)^+‖ ‖P̂_i‖ + ‖(P_{i,j} Û)^+‖ ‖P̂_i − P_i‖
                ≤ ((1 + √5)/2) ε_{i,j} / min( σ_{S_H}(P_{i,j} Û), σ_{S_H}(P̂_{i,j} Û) )² + ε_i / σ_{S_H}(P_{i,j} Û),

where we have used the triangle inequality in the first inequality and the submultiplicative property of matrix norms in the second. The last inequality follows from the matrix perturbation bound (91), together with the fact that ‖P̂_i‖ ≤ 1 since P̂_i is a probability vector. Thus, using the assumption that ε_{i,j} ≤ σ_{S_H}(P_{i,j})/3 together with Lemma 2, we get

    δ^i_one ≤ 4√S_H ( ε_{i,j}/σ_{S_H}(P_{i,j})² + ε_i/(√3 σ_{S_H}(P_{i,j})) ).

6.3 Tensor γ_i

Recall that

    T̃_i = T_i ×_1 (O_1^T Û_1) ×_2 (Û_2^T O_2)^{-1} ×_3 (O_3^T Û_3) = P_{m,j,k} ×_1 Û_1^T ×_2 (P_{l,j} Û_2)^+ ×_3 Û_3^T,

and similarly T̂_i = P̂_{m,j,k} ×_1 Û_1^T ×_2 (P̂_{l,j} Û_2)^+ ×_3 Û_3^T, where m, j, k, l index particular observed variables described in the main paper. First,

    γ_i = ‖(T̂_i − T̃_i) ×_1 (O_1^T Û_1)^{-1} ×_2 (O_2^T Û_2) ×_3 (O_3^T Û_3)^{-1}‖_{1,1} ≤ (S_H / σ_{S_H}(O^T Û)) ‖T̂_i − T̃_i‖.

This is because both Û and O have spectral norm 1, and the S_H factor is the cost of converting from the 1-norm to the spectral norm. Next,

    ‖T̂_i − T̃_i‖ ≤ ‖P_{m,j,k}‖ ‖(P̂_{l,j} Û_2)^+ − (P_{l,j} Û_2)^+‖ + ‖P̂_{m,j,k} − P_{m,j,k}‖ ‖(P̂_{l,j} Û_2)^+‖
                ≤ ((1 + √5)/2) ε_{l,j} / min( σ_{S_H}(P_{l,j} Û_2), σ_{S_H}(P̂_{l,j} Û_2) )² + ε_{m,j,k} / σ_{S_H}(P̂_{l,j} Û_2),

by the triangle inequality, submultiplicativity, the matrix perturbation bound (91), and the fact that Û has spectral norm 1. It is clear that ‖P_{m,j,k}‖ ≤ ‖P_{m,j,k}‖_F ≤ 1. Using the fact that ε_{i,j} ≤ σ_{S_H}(P_{i,j})/3 together with Lemma 2 gives us the following bound:

    γ_i ≤ (4 S_H / σ_{S_H}(O)) ( ε_{i,j}/σ_{S_H}(P_{i,j})² + ε_{i,j,k}/(√3 σ_{S_H}(P_{i,j})) ).

6.4 Bounding Δ_i

We now seek to bound Δ_i = ‖(O_1^T Û_1)(M̂_i − M̃_i)(O_2^T Û_2)^{-1}‖_1. There are two cases: either i is a leaf or it is not.

6.4.1 i is a leaf node

In this case our proof simply follows from HKZ [3] and is repeated here for convenience.

    ‖(O_1^T Û_1)(M̂_i − M̃_i)(O_2^T Û_2)^{-1}‖_1 ≤ √S_H ‖(O_1^T Û_1)(M̂_i − M̃_i)(O_2^T Û_2)^{-1}‖
                                              ≤ √S_H ‖M̂_i − M̃_i‖ / σ_{S_H}(O_2^T Û_2).

The first inequality follows from the relation between the induced 1-norm and the induced 2-norm; also, because O is a conditional probability matrix, ‖O‖_1 = 1, i.e., its max column sum is 1.

Note that M̃_i = (P_{j,i} Û_1)^+ (P_{m,x_i,j} Û_2) and M̂_i = (P̂_{j,i} Û_1)^+ (P̂_{m,x_i,j} Û_2). Then

    ‖M̂_i − M̃_i‖ ≤ ‖(P̂_{j,i} Û_1)^+ − (P_{j,i} Û_1)^+‖ ‖P_{m,x_i,j} Û_2‖ + ‖(P̂_{j,i} Û_1)^+‖ ‖(P̂_{m,x_i,j} − P_{m,x_i,j}) Û_2‖
                ≤ ((1 + √5)/2) ε_{j,i} P[X_i = x] / min( σ_{S_H}(P_{j,i} Û_1), σ_{S_H}(P̂_{j,i} Û_1) )² + ε_{x,i,j} / σ_{S_H}(P̂_{j,i} Û_1),

where the first inequality follows from the triangle inequality, and the second uses the matrix perturbation bound (91) and the fact that the spectral norm of Û is 1. The final step also uses the fact that the spectral norm is at most the Frobenius norm, which is at most the ℓ1 norm:

    ‖P_{m,x_i,j}‖ ≤ √( Σ_{m,j} [P_{m,x_i,j}]_{m,j}² ) ≤ Σ_{m,j} [P_{m,x_i,j}]_{m,j} = P[X_i = x].

Using the fact that ε_{i,j} ≤ σ_{S_H}(P_{i,j})/3 together with Lemma 2 gives us the following bound:

    Δ_{i,x} ≤ (4√S_H / σ_{S_H}(O)) ( ε_{j,i} P[X_i = x] / σ_{S_H}(P_{j,i})² + ε_{x,i,j} / (√3 σ_{S_H}(P_{j,i})) ).

Summing over x would give

    Δ^obs_i ≤ (4√S_H / σ_{S_H}(O)) ( ε_{j,i} / σ_{S_H}(P_{j,i})² + Σ_x ε_{x,i,j} / (√3 σ_{S_H}(P_{j,i})) ).

6.4.2 i is not a leaf node

Let m̃_{J:1} = M̃_{j_J} ... M̃_{j_1} 1̃_i and m̂_{J:1} = M̂_{j_J} ... M̂_{j_1} 1̂_i. Then

    ‖(O_2^T Û_2)(M̂_i − M̃_i)(O_3^T Û_3)^{-1}‖_1
      = ‖(O_2^T Û_2)( T̂_i ×_1 m̂_{J:1} − T̃_i ×_1 m̃_{J:1} )(O_3^T Û_3)^{-1}‖_1
      = ‖(O_2^T Û_2)( (T̂_i − T̃_i) ×_1 m̃_{J:1} + (T̂_i − T̃_i) ×_1 (m̂_{J:1} − m̃_{J:1}) + T̃_i ×_1 (m̂_{J:1} − m̃_{J:1}) )(O_3^T Û_3)^{-1}‖_1
      ≤ ‖(T̂_i − T̃_i) ×_1 (O_1^T Û_1)^{-1} ×_2 (O_2^T Û_2) ×_3 (O_3^T Û_3)^{-1}‖_{1,1} ‖(O_1^T Û_1) m̃_{J:1}‖_1
        + ‖(T̂_i − T̃_i) ×_1 (O_1^T Û_1)^{-1} ×_2 (O_2^T Û_2) ×_3 (O_3^T Û_3)^{-1}‖_{1,1} ‖(O_1^T Û_1)(m̂_{J:1} − m̃_{J:1})‖_1
        + ‖T̃_i ×_1 (O_1^T Û_1)^{-1} ×_2 (O_2^T Û_2) ×_3 (O_3^T Û_3)^{-1}‖_{1,1} ‖(O_1^T Û_1)(m̂_{J:1} − m̃_{J:1})‖_1.

The first term is bounded by

    ‖(T̂_i − T̃_i) ×_1 (O_1^T Û_1)^{-1} ×_2 (O_2^T Û_2) ×_3 (O_3^T Û_3)^{-1}‖_{1,1} ‖(O_1^T Û_1) m̃_{J:1}‖_1 ≤ √S_H γ_i.

The second term is bounded by

    γ_i ‖(O_1^T Û_1)(m̂_{J:1} − m̃_{J:1})‖_1.

The third term is bounded by

    ‖T̃_i ×_1 (O_1^T Û_1)^{-1} ×_2 (O_2^T Û_2) ×_3 (O_3^T Û_3)^{-1}‖_{1,1} ‖(O_1^T Û_1)(m̂_{J:1} − m̃_{J:1})‖_1 ≤ ‖(O_1^T Û_1)(m̂_{J:1} − m̃_{J:1})‖_1.

In the next section, we will see that

    ‖(O_1^T Û_1)(m̂_{J:1} − m̃_{J:1})‖_1 ≤ Π_{k=1}^J (1 + Δ_{j_k}) δ^i_one + √S_H Π_{k=1}^J (1 + Δ_{j_k}) − √S_H.

So the overall bound is

    Δ_i ≤ (1 + γ_i) Π_{k=1}^J (1 + Δ_{j_k}) δ^i_one + (1 + γ_i) √S_H Π_{k=1}^J (1 + Δ_{j_k}) − √S_H,

where j_1, ..., j_J are the children of node i.

6.5 Bounding ‖(O^T Û)(m̂_{J:1} − m̃_{J:1})‖_1

Lemma 4.

    ‖(O^T Û)(m̂_{J:1} − m̃_{J:1})‖_1 ≤ Π_{k=1}^J (1 + Δ_{j_k}) δ^i_one + √S_H Π_{k=1}^J (1 + Δ_{j_k}) − √S_H,

where j_1, ..., j_J are the children of node i.

The proof is by induction.

Base case: ‖(O^T Û)(1̂_i − 1̃_i)‖_1 ≤ δ^i_one, by the definition of δ^i_one.

Inductive step: Suppose the claim holds up to u − 1; we show it holds for u. Thus

    ‖(O^T Û)(m̂_{u-1:1} − m̃_{u-1:1})‖_1 ≤ Π_{k=1}^{u-1} (1 + Δ_{j_k}) δ^i_one + √S_H Π_{k=1}^{u-1} (1 + Δ_{j_k}) − √S_H.

We now decompose the sum over x as

    ‖(O^T Û)(m̂_{u:1} − m̃_{u:1})‖_1
      = ‖(O^T Û)( (M̂_{j_u} − M̃_{j_u}) m̃_{u-1:1} + (M̂_{j_u} − M̃_{j_u})(m̂_{u-1:1} − m̃_{u-1:1}) + M̃_{j_u}(m̂_{u-1:1} − m̃_{u-1:1}) )‖_1.

Using the triangle inequality, we get

    ‖(O^T Û)(M̂_{j_u} − M̃_{j_u})(O_1^T Û_1)^{-1}‖_1 ‖(O_1^T Û_1) m̃_{u-1:1}‖_1
      + ‖(O^T Û)(M̂_{j_u} − M̃_{j_u})(O_1^T Û_1)^{-1}‖_1 ‖(O_1^T Û_1)(m̂_{u-1:1} − m̃_{u-1:1})‖_1
      + ‖(O^T Û) M̃_{j_u} (O^T Û)^{-1}‖_1 ‖(O^T Û)(m̂_{u-1:1} − m̃_{u-1:1})‖_1.

Again, we are just numbering the U's and O's for clarity, to see which corresponds with which; they are omitted in the actual theorem statements, since we will take minimums, etc., at the end. We now must bound these terms.

First term:

    ‖(O^T Û)(M̂_{j_u} − M̃_{j_u})(O_1^T Û_1)^{-1}‖_1 ‖(O_1^T Û_1) m̃_{u-1:1}‖_1 ≤ √S_H Δ_{j_u},

since Δ_{j_u} = ‖(O^T Û)(M̂_{j_u} − M̃_{j_u})(O_1^T Û_1)^{-1}‖_1 and ‖(O_1^T Û_1) m̃_{u-1:1}‖_1 ≤ √S_H.

The second term can be bounded by the inductive hypothesis:

    ‖(O^T Û)(M̂_{j_u} − M̃_{j_u})(O_1^T Û_1)^{-1}‖_1 ‖(O_1^T Û_1)(m̂_{u-1:1} − m̃_{u-1:1})‖_1
      ≤ Δ_{j_u} ( Π_{k=1}^{u-1} (1 + Δ_{j_k}) δ^i_one + √S_H Π_{k=1}^{u-1} (1 + Δ_{j_k}) − √S_H ).

The third term is bounded by observing that (O^T Û) M̃_{j_u} (O^T Û)^{-1} = diag(P[x_{j_u} | Parent]). Thus it is diagonal, and P[x | Parent] has max row or column sum 1. This means that the third term is bounded by the inductive hypothesis as well:

    ‖(O^T Û) M̃_{j_u} (O^T Û)^{-1}‖_1 ‖(O^T Û)(m̂_{u-1:1} − m̃_{u-1:1})‖_1
      ≤ Π_{k=1}^{u-1} (1 + Δ_{j_k}) δ^i_one + √S_H Π_{k=1}^{u-1} (1 + Δ_{j_k}) − √S_H.

Combining the three bounds and collecting factors yields the claim for u, completing the induction.

7 Bounding the Propagation of Error in the Tree

We now wrap up the proof based on the approach of HKZ [3].

Lemma 5.

    Σ_{x_1,...,x_{|O|}} | P̂[x_1, ..., x_{|O|}] − P[x_1, ..., x_{|O|}] |
      ≤ √S_H δ_root + (1 + δ_root) ( Π_{k=1}^J (1 + Δ_{j_k}) δ^r_one + √S_H Π_{k=1}^J (1 + Δ_{j_k}) − √S_H ),

where j_1, ..., j_J are the children of the root. To see this, write

    Σ | P̂[x_1, ..., x_{|O|}] − P[x_1, ..., x_{|O|}] | = Σ | r̂^T M̂_{j_1} ... M̂_{j_J} 1̂_r − r̃^T M̃_{j_1} ... M̃_{j_J} 1̃_r |
      ≤ Σ | (r̂ − r̃)^T (O^T Û)^{-1} (O^T Û) M̃_{J:1} 1̃_r |
        + Σ | (r̂ − r̃)^T (O^T Û)^{-1} (O^T Û)( M̂_{J:1} 1̂_r − M̃_{J:1} 1̃_r ) |
        + Σ | r̃^T (O^T Û)^{-1} (O^T Û)( M̂_{J:1} 1̂_r − M̃_{J:1} 1̃_r ) |.

The first sum is bounded using Hölder's inequality and noting that M̃_{J:1} 1̃_r maps back to the conditional probability of all the observed variables conditioned on the root:

    Σ | (r̂ − r̃)^T (O^T Û)^{-1} (O^T Û) M̃_{J:1} 1̃_r | ≤ ‖(r̂ − r̃)^T (O^T Û)^{-1}‖_1 · √S_H ≤ √S_H δ_root.

The second sum is bounded by another application of Hölder's inequality and the previous lemma:

    Σ | (r̂ − r̃)^T (O^T Û)^{-1} (O^T Û)( M̂_{J:1} 1̂_r − M̃_{J:1} 1̃_r ) |
      ≤ δ_root ( Π_{k=1}^J (1 + Δ_{j_k}) δ^r_one + √S_H Π_{k=1}^J (1 + Δ_{j_k}) − √S_H ).

The third sum is also bounded by Hölder's inequality and the previous lemmas, noting that r̃^T (O^T Û)^{-1} corresponds to the root marginal P[X_r = ·]:

    Σ | r̃^T (O^T Û)^{-1} (O^T Û)( M̂_{J:1} 1̂_r − M̃_{J:1} 1̃_r ) |
      ≤ Π_{k=1}^J (1 + Δ_{j_k}) δ^r_one + √S_H Π_{k=1}^J (1 + Δ_{j_k}) − √S_H.

Combining these bounds gives us the desired result.

The second sum is bounded by another application of Holder s inequality and the previous lemma: r r O Û 1 O Û M J:1 1 r M J:1 1 r 8 r r O Û 1 O Û M J:1 1 r M J:1 1 r 1 83 J J δ root 1 + jk δone r + S H 1 + jk S H 84 The third sum is also bounded by Holder s Inequality and previous lemmas and noting that r U O 1 = P[R = r]: r O Û 1 O Û M J:1 1 i M J:1 1 85 r O Û 1 O Û M J:1 1 r M J:1 1 r 1 86 J J 1 + jk δone r + S H 1 + jk S H 87 Combining these bounds gives us the desired solution. 8 Putting it all together We seek for P[x 1,..., x O ] P[x 1,..., x O ] ɛ 88 Using the fact that for a <.5, 1 + a/t 1 + a, we get that jk Oɛ/S H J. However, j is defined recursively, and thus the error accumulates exponential in the longest path of hidden nodes. For example, obs ɛ i O where l is the d maxs H l longest path of hidden nodes. Tracing this back through will gives the result: Pick any ɛ > 0, δ < 1. Let 1 d max S H l+1 S O O ɛ min i σ SH O i log O 89 min i j σ SH P i,j 4 δ Then with probability 1 δ P[x 1,..., x O ] P[x 1,..., x O ] ɛ 90 In many cases, if the frequency of the observation symbols follow certain distributions, than the dependence on S O can be removed as showed in HKZ [3]. 9 Appendix 9.1 Matrix Perturbation Bounds This is Theorem 3.8 from pg. 143 in Stewart and Sun, 1990 [6]. Let A R m n, with m n and let à = A + E. Then References Ã+ A + 1 + 5 max A +, à E 91 [1] Myung J. Choi, Vicent Y. Tan, Animashree Anandkumar, and Alan S. Willsky. Learning latent tree graphical models. In arxiv:1009.7v1, 010. [] R.A. Horn and C.R. Johnson. Matrix analysis. Cambridge Univ Pr, 1990. [3] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning hidden markov models. In Proc. Annual Conf. Computational Learning Theory, 009. [4].H. guyen, P. Drineas, and T.D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. Arxiv preprint arxiv:1005.473, 010. [5] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988. [6] G.W. Stewart and J. Sun. Matrix perturbation theory, volume 175. Academic press ew York, 1990. 9