A Spectral Algorithm For Latent Junction Trees - Supplementary Material


Ankur P. Parikh, Le Song, Mariya Ishteva, Gabi Teodoru, Eric P. Xing

1 Discussion of Conditions for Observable Representation

The observable representation exists only if there exist transformations F_i = P[O_i | S_i] with rank τ_i := k^{|S_i|}, and P[O_i^- | S_i] also has rank τ_i (so that P(O_i, O_i^-) has rank τ_i). Thus, it is required that #states(O_i) ≥ #states(S_i). This can be achieved either by making O_i consist of a few high-dimensional observations, or of many lower-dimensional ones. In the case when #states(O_i) > #states(S_i), we need to project F_i to a lower-dimensional space, using a tensor U_i, such that it can be inverted. In this case, we define F_i := P[O_i | S_i] ×_{O_i} U_i. Following this through the computation gives us that P(C_l) = P(O_l, O_l^-) ×_{O_l^-} (P(O_l, O_l^-) ×_{O_l} U_l)^{-1}. A good choice of U_i can be obtained by performing a singular value decomposition of the matricized version of P(O_i, O_i^-) (variables in O_i are arranged into rows and those in O_i^- into columns).

For HMMs and latent trees, this rank condition can be expressed simply as requiring that the conditional probability tables of the underlying model not be rank-deficient. However, junction trees encode significantly more complex latent structures that introduce more subtle considerations. While we leave a general characterization of the models for which the observable representation exists to future work, here we try to give some intuition about the types of latent structures for which the rank condition may fail.

First, the rank condition can fail if there are not enough observed nodes/states, so that #states(O_i) < τ_i. Intuitively, this corresponds to a model where the latent space is too expressive and inherently represents an intractable model (e.g., a set of n binary variables connected by a hidden node with 2^n states is equivalent to a clique of size n).

However, there are more subtle considerations unique to non-tree models. In general, our method is not limited to non-triangulated graphs (see the factorial HMM in Figure 3 of the main paper), but the process of triangulation can introduce artificial dependencies that lead to complications. Consider Figure 1(a), which shows a DAG and its corresponding junction tree. To construct P(C_{AD}), we may set P[O_i, O_i^-] = P[D, F] based on the junction tree topology. However, in the original model before triangulation, D ⊥ F because of the v-structure. As a result, P[D, F] does not have rank k and thus cannot be inverted. Note, however, that choosing P[O_i, O_i^-] = P[D, E] is valid.

Finally, consider Figure 1(b), ignoring the orange nodes for now and assuming the variables are binary. In this model, the hidden variables are largely redundant, since integrating out A and B would simply give a chain. F_r must be set to P[F | S_r] = P[F | A, B]. If we think of A, B as just one large variable, then it is clear that P[F | A, B] = P[F | D] P[D | A, B]. However, D only has two states while (A, B) has four, so P[F | A, B] only has rank 2. Now consider adding the orange node. In this case we could set F_r to P[F, G | S_r] = P[F, G | A, B], whose matricized version has rank 4. Note that once the orange node has been added, integrating out A and B no longer produces a simple chain, but a more complicated structure. Thus, we believe that characterizing the existence of the observable representation more rigorously may shed light on the intrinsic complexity/redundancy of latent variable models in the context of linear and tensor algebra.
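As a concrete illustration of the SVD-based choice of U_i just described, here is a minimal numpy sketch. This is not the authors' code: the function name, shapes, and tolerance are our own assumptions.

```python
import numpy as np

def choose_projection(P_oi_oiminus, tau_i, tol=1e-12):
    """Hypothetical helper: pick U_i from the SVD of the matricized P(O_i, O_i^-).

    P_oi_oiminus: matrix whose rows index joint states of O_i and whose
    columns index joint states of O_i^-; tau_i = k ** |S_i|.
    """
    U, s, Vt = np.linalg.svd(P_oi_oiminus, full_matrices=False)
    if s[tau_i - 1] <= tol:
        # the rank condition fails: P(O_i, O_i^-) has (numerical) rank < tau_i
        raise ValueError("rank condition violated")
    return U[:, :tau_i]  # shape: (#states(O_i), tau_i)

# F_i := P[O_i | S_i] x_{O_i} U_i is then the tau_i x tau_i matrix
# U_i.T @ P_oi_given_si, which is invertible when the rank condition holds.
```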
2 Linear Systems to Improve Stability

As derived in Section 6 of the main paper, for non-root and non-leaf nodes:

$$\widehat{P}(C_i) \times_{O_i} P(O_i, O_{i^-}) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}) \qquad (1)$$

Figure 1: Models and their corresponding junction trees, where constructing the observable representation poses complications. See Section 1.

Equation (1) then implies that

$$\widehat{P}(C_i) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}) \times_{O_{i^-}} P(O_i, O_{i^-})^{-1} \qquad (2)$$

However, there may be many choices of O_i^- (which we denote O_{i^-}^{(1)}, ..., O_{i^-}^{(N)}) for which the above equality is true:

$$\widehat{P}(C_i) \times_{O_i} P(O_i, O_{i^-}^{(1)}) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(1)})$$
$$\widehat{P}(C_i) \times_{O_i} P(O_i, O_{i^-}^{(2)}) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(2)})$$
$$\vdots$$
$$\widehat{P}(C_i) \times_{O_i} P(O_i, O_{i^-}^{(N)}) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(N)})$$

This defines an over-constrained linear system, and we can solve for P̂(C_i) using least squares. In the case where #states(O_i) > #states(S_i) and the projection tensor U_i is needed, the system of equations becomes:

$$\widehat{P}(C_i) \times_{O_i} \big(P(O_i, O_{i^-}^{(1)}) \times_{O_i} U_i\big) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(1)}) \times_{O_{\alpha_i(1)}} U_{\alpha_i(1)} \times_{O_{\alpha_i(2)}} U_{\alpha_i(2)}$$
$$\vdots$$
$$\widehat{P}(C_i) \times_{O_i} \big(P(O_i, O_{i^-}^{(N)}) \times_{O_i} U_i\big) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(N)}) \times_{O_{\alpha_i(1)}} U_{\alpha_i(1)} \times_{O_{\alpha_i(2)}} U_{\alpha_i(2)}$$

In this case, one good choice of U_i is the matrix of top singular vectors of the horizontal concatenation of the matricized P(O_i, O_{i^-}^{(1)}), ..., P(O_i, O_{i^-}^{(N)}). This linear system method allows for more robust estimation, especially at smaller sample sizes (at the cost of more computation). It can be applied to the leaf case as well. One does not need to set up the linear system with all the valid choices; a subset is also acceptable.
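A minimal numpy sketch of this least-squares estimate follows. The names and shapes are our own assumptions, not the authors' code: each constraint is matricized so that the unknown X = R(P̂(C_i)) multiplies a known matrix, and all constraints are concatenated and solved in one shot.

```python
import numpy as np

def solve_transformed(A_list, B_list):
    """Least-squares solve of X @ A_j = B_j for all j simultaneously.

    X is the matricized transformed tensor R(P_hat(C_i)), whose columns
    index the projected O_i mode. A_list[j] is U_i.T @ P(O_i, O_iminus_j);
    B_list[j] is the matricized, U-projected P(O_children, O_iminus_j),
    with the same row ordering as X.
    """
    A_all = np.hstack(A_list)          # stack constraints side by side
    B_all = np.hstack(B_list)
    # X @ A_all = B_all  <=>  A_all.T @ X.T = B_all.T
    X_T, *_ = np.linalg.lstsq(A_all.T, B_all.T, rcond=None)
    return X_T.T

# A subset of the valid choices O_iminus_j is acceptable; using more
# constraints trades computation for robustness at small sample sizes.
```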

3 Sample Complexity Theorem

We prove the sample complexity theorem.

3.1 Notation

For simplicity, the main text describes the algorithm in the context of a binary tree. However, in the sample complexity proof we adopt a more general notation. Let C_i be a clique, and let C_{α_i(1)}, ..., C_{α_i(γ_i)} denote its γ_i children. Let 𝒞 denote the set of all the cliques in the junction tree, and |𝒞| denote its size. Define d_max to be the maximum degree of a tensor in the observable representation, and e_max to be the maximum number of observed variables in any tensor. Furthermore, let τ_i = k^{|S_i|} (i.e., the number of states associated with the separator S_i). We can now write the transformed representation for junction trees whose cliques may have more than 3 neighbors as:

$$\text{Root:}\quad \widehat{P}(C_i) = P(C_i) \times_{S_{\alpha_i(1)}} F_{\alpha_i(1)} \cdots \times_{S_{\alpha_i(\gamma_i)}} F_{\alpha_i(\gamma_i)}$$
$$\text{Internal nodes:}\quad \widehat{P}(C_i) = P(C_i) \times_{S_i} F_i^{-1} \times_{S_{\alpha_i(1)}} F_{\alpha_i(1)} \cdots \times_{S_{\alpha_i(\gamma_i)}} F_{\alpha_i(\gamma_i)}$$
$$\text{Leaf:}\quad \widehat{P}(C_i) = P(C_i) \times_{S_i} F_i^{-1}$$

and the observable representation as:

$$\text{root:}\quad \widehat{P}(C_i) = P(O_{\alpha_i(1)}, \ldots, O_{\alpha_i(\gamma_i)}) \times_{O_{\alpha_i(1)}} U_{\alpha_i(1)} \cdots \times_{O_{\alpha_i(\gamma_i)}} U_{\alpha_i(\gamma_i)}$$
$$\text{internal:}\quad \widehat{P}(C_i) = P(O_{\alpha_i(1)}, \ldots, O_{\alpha_i(\gamma_i)}, O_{i^-}) \times_{O_{i^-}} \big(P(O_i, O_{i^-}) \times_{O_i} U_i\big)^{-1} \times_{O_{\alpha_i(1)}} U_{\alpha_i(1)} \cdots \times_{O_{\alpha_i(\gamma_i)}} U_{\alpha_i(\gamma_i)}$$
$$\text{leaf:}\quad \widehat{P}(C_i) = P(R_i, O_{i^-}) \times_{O_{i^-}} \big(P(O_i, O_{i^-}) \times_{O_i} U_i\big)^{-1}$$

Sometimes, when the multiplication indices are clear, we will omit them to make things simpler.

Rearranged version: We will often find it convenient to rearrange the tensors into lower-order tensors for the purpose of taking certain norms (such as the spectral norm). We define R(·) as the rearranging operation. For example, R(P(O_i, O_i^-)) is the matricized version of P(O_i, O_i^-), with O_i mapped to the rows and O_i^- mapped to the columns. More generally, if P(C_i) has order d_i, then R(P(C_i)) is a rearrangement of P(C_i) such that it has order equal to the number of neighbors of C_i: the set of modes of P(C_i) corresponding to a single separator of C_i is mapped to a single mode of R(P(C_i)). We let R(S_{α_i(j)}) denote the mode that S_{α_i(j)} maps to in R(P(C_i)). In the case of the root, R(P(C_i)) rearranges P(C_i) into a tensor of order γ_i. For other clique nodes, R(P(C_i)) rearranges P(C_i) into a tensor of order γ_i + 1.

Example: in the example of the main text, P(C_{BCDE}) = P(B, D | C, E). C_{BCDE} has 3 neighbors, and so R(P(C_{BCDE})) is of order 3, where the modes correspond to {B, C}, {B, D}, and {C, E} respectively.

This rearrangement can be applied to the other quantities. For example, R(P(O_{α_i(1)}, ..., O_{α_i(γ_i)})) is a rearrangement of P(O_{α_i(1)}, ..., O_{α_i(γ_i)}) into a tensor of order γ_i, where each O_{α_i(j)} corresponds to one mode. R(P̂(O_i, O_i^-)) is the matricized version of P̂(O_i, O_i^-), with O_i mapped to the rows and O_i^- mapped to the columns. Similarly, R(F_i) is the matricized version of F_i, and R(U_i) is the matricized version of U_i. Thus, we can define the rearranged quantities:

$$\text{Root:}\quad R(\widehat{P}(C_i)) = R(P(C_i)) \times_{R(S_{\alpha_i(1)})} R(F_{\alpha_i(1)}) \cdots \times_{R(S_{\alpha_i(\gamma_i)})} R(F_{\alpha_i(\gamma_i)})$$
$$\text{Internal nodes:}\quad R(\widehat{P}(C_i)) = R(P(C_i)) \times_{R(S_i)} R(F_i^{-1}) \times_{R(S_{\alpha_i(1)})} R(F_{\alpha_i(1)}) \cdots \times_{R(S_{\alpha_i(\gamma_i)})} R(F_{\alpha_i(\gamma_i)})$$
$$\text{Leaf:}\quad R(\widehat{P}(C_i)) = R(P(C_i)) \times_{R(S_i)} R(F_i^{-1})$$
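Throughout, ×_S denotes a mode-specific tensor-matrix product. As a minimal numpy sketch (our own helper, with an assumed orientation: the matrix's columns index the contracted mode):

```python
import numpy as np

def mode_product(T, M, mode):
    """Contract mode `mode` of tensor T with the columns of matrix M.

    M has shape (new_dim, old_dim), so T x_{mode} M replaces a mode of
    size old_dim with one of size new_dim, kept in the same position.
    """
    out = np.tensordot(T, M, axes=([mode], [1]))  # new mode appended last
    return np.moveaxis(out, -1, mode)             # restore mode position

# E.g., a leaf's transformed tensor P_hat(C_i) = P(C_i) x_{S_i} inv(F_i)
# could be formed as mode_product(P_Ci, np.linalg.inv(F_i), s_mode),
# where s_mode is the (assumed) position of the S_i mode in P(C_i).
```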

Correspondingly, the observable representation rearranges as:

$$\text{root:}\quad R(\widehat{P}(C_i)) = R(P(O_{\alpha_i(1)}, \ldots, O_{\alpha_i(\gamma_i)})) \times_{R(S_{\alpha_i(1)})} R(U_{\alpha_i(1)}) \cdots \times_{R(S_{\alpha_i(\gamma_i)})} R(U_{\alpha_i(\gamma_i)})$$
$$\text{internal:}\quad R(\widehat{P}(C_i)) = R(P(O_{\alpha_i(1)}, \ldots, O_{\alpha_i(\gamma_i)}, O_{i^-})) \times_{R(S_i)} R\big((P(O_i, O_{i^-}) \times_{O_i} U_i)^{-1}\big) \times_{R(S_{\alpha_i(1)})} R(U_{\alpha_i(1)}) \cdots \times_{R(S_{\alpha_i(\gamma_i)})} R(U_{\alpha_i(\gamma_i)}) \qquad (3)$$
$$\text{leaf:}\quad R(\widehat{P}(C_i)) = R(P(R_i, O_{i^-})) \times_{R(O_i)} R\big((P(O_i, O_{i^-}) \times_{O_i} U_i)^{-1}\big)$$

Furthermore, define the following:

$$\alpha = \min_{C_i \in \mathcal{C}} \sigma_{\tau_i}(P[O_i, O_{i^-}]) \qquad (4)$$
$$\beta = \min_{C_i \in \mathcal{C}} \sigma_{\tau_i}(F_i) \qquad (5)$$

The proof is based on the technique of HKZ [2], but has differences due to the junction tree topology and the higher-order tensors involved.

3.2 Tensor Norms

We briefly define several tensor norms that generalize the corresponding matrix norms. For more information about matrix norms, see [1].

Frobenius Norm: Just as for matrices, the Frobenius norm of a tensor of order N is defined as:

$$\|T\|_F = \sqrt{\sum_{i_1, \ldots, i_N} T(i_1, \ldots, i_N)^2} \qquad (6)$$

Elementwise One Norm: Similarly, the elementwise one norm of a tensor of order N is defined as:

$$\|T\|_1 = \sum_{i_1, \ldots, i_N} |T(i_1, \ldots, i_N)| \qquad (7)$$

Spectral Norm: For tensors of order N, the spectral norm [3] can be defined as:

$$\|T\| = \sup_{v_1, \ldots, v_N \,:\, \|v_i\| \le 1} T \times_1 v_1 \cdots \times_N v_N \qquad (8)$$

In our case, we will find it more convenient to use the rearranged spectral norm, which we define as:

$$\|T\| = \|R(T)\| \qquad (9)$$

where R(·) was defined in the previous section.

Induced One Norm: For matrices, the induced one norm is defined as the max column sum of the matrix: ‖M‖₁ = sup_{v : ‖v‖₁ ≤ 1} ‖Mv‖₁. We can generalize this to the max slice sum of a tensor, where some modes are fixed and the others are summed over. Let σ denote the modes that will be maxed over. Then:

$$\|T\|_\sigma = \sup_{v_i \,:\, \|v_i\|_1 \le 1,\; i \in \sigma} \big\| T \times_{\sigma(1)} v_{\sigma(1)} \cdots \times_{\sigma(|\sigma|)} v_{\sigma(|\sigma|)} \big\|_1 \qquad (10)$$

In the Appendix, we prove several lemmas regarding these tensor norms.

3.3 Concentration Bounds

Define

$$\epsilon(O_1, \ldots, O_d) = \big\| \widehat{P}(O_1, \ldots, O_d) - P(O_1, \ldots, O_d) \big\|_F \qquad (11)$$
$$\epsilon(O_1, \ldots, O_{d-e}, o_{d-e+1}, \ldots, o_d) = \big\| \widehat{P}(O_1, \ldots, O_{d-e}, o_{d-e+1}, \ldots, o_d) - P(O_1, \ldots, O_{d-e}, o_{d-e+1}, \ldots, o_d) \big\|_F \qquad (12)$$

where 1, ..., d−e denote the d−e non-evidence variables, d−e+1, ..., d denote the e evidence variables, and d indicates the total number of modes of the tensor. As the number of samples N gets large, we expect these quantities to be small.
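For concreteness, here is a minimal numpy sketch of these norms on small dense tensors. The function names are our own; since the higher-order spectral norm has no closed form, the sketch falls back to the Frobenius upper bound of Lemma 7.

```python
import numpy as np

def frobenius(T):                       # Eq. (6)
    return np.sqrt((T ** 2).sum())

def elementwise_one(T):                 # Eq. (7)
    return np.abs(T).sum()

def rearranged_spectral(T, mode_groups):
    # Eq. (9): merge each group of modes (one group per separator) into a
    # single mode, then take the spectral norm of the rearranged tensor R(T).
    perm = [m for g in mode_groups for m in g]
    dims = [int(np.prod([T.shape[m] for m in g])) for g in mode_groups]
    R = T.transpose(perm).reshape(dims)
    if R.ndim == 2:                     # matricized case: ordinary 2-norm
        return np.linalg.norm(R, 2)
    return frobenius(R)                 # crude upper bound via Lemma 7
```

For example, rearranged_spectral(T, [[0], [1, 2]]) computes the matricized spectral norm with mode 0 as rows, and the ε(·) quantities of Eqs. (11)-(12) are plain Frobenius distances, frobenius(P_hat - P).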

Lemma 1 (variant of HKZ [2]). If the algorithm independently samples N of the observations, then with probability at least 1 − δ,

$$\epsilon(O_1, \ldots, O_d) \le \sqrt{\frac{1}{N}\ln\frac{2|\mathcal{C}|}{\delta}} + \sqrt{\frac{1}{N}} \qquad (13)$$

$$\sum_{o_{d-e+1}, \ldots, o_d} \epsilon(O_1, \ldots, O_{d-e}, o_{d-e+1}, \ldots, o_d) \le \sqrt{\frac{k^{e_{\max}}}{N}\ln\frac{2|\mathcal{C}|}{\delta}} + \sqrt{\frac{k^{e_{\max}}}{N}} \qquad (14)$$

for all tuples (O_1, ..., O_d) that are used in the spectral algorithm. The proof is the same as that of HKZ [2], except the union bound is taken over 2|𝒞| events instead of 3 (since each transformed quantity in the spectral algorithm is composed of at most two such terms). The last bound can be made tighter, identically to HKZ, but for simplicity we do not pursue that approach here.

3.4 Singular Value Bounds

This is the generalized version of Lemma 9 in HKZ [2], which is stated below for completeness.

Lemma 2 (variant of HKZ [2]). Suppose ε(O_i, O_i^-) ≤ ϵ · σ_{τ_i}(P(O_i, O_i^-)) for some ϵ < 1/2. Let ε_0 = ε(O_i, O_i^-)² / ((1 − ϵ) σ_{τ_i}(P(O_i, O_i^-)))². Then:

1. ε_0 < 1
2. σ_{τ_i}(P̂(O_i, O_i^-) ×_{O_i} Û_i) ≥ (1 − ε_0) σ_{τ_i}(P(O_i, O_i^-))
3. σ_{τ_i}(P(O_i, O_i^-) ×_{O_i} Û_i) ≥ √(1 − ε_0) σ_{τ_i}(P(O_i, O_i^-))
4. σ_{τ_i}(P(O_i | S_i) ×_{O_i} Û_i) ≥ √(1 − ε_0) σ_{τ_i}(P(O_i | S_i))

It follows that if ε(O_i, O_i^-) ≤ σ_{τ_i}(P(O_i, O_i^-))/3, then ε_0 ≤ (1/9)/(4/9) = 1/4. It then follows that:

1. σ_{τ_i}(P̂(O_i, O_i^-) ×_{O_i} Û_i) ≥ (3/4) σ_{τ_i}(P(O_i, O_i^-))
2. σ_{τ_i}(P(O_i, O_i^-) ×_{O_i} Û_i) ≥ (√3/2) σ_{τ_i}(P(O_i, O_i^-))
3. σ_{τ_i}(P(O_i | S_i) ×_{O_i} Û_i) ≥ (√3/2) σ_{τ_i}(P(O_i | S_i))

3.5 Bounding the Transformed Quantities

Define:

1. root: P̃(C_i) := P(O_{α_i(1)}, ..., O_{α_i(γ_i)}) ×_{O_{α_i(1)}} Û_{α_i(1)} ⋯ ×_{O_{α_i(γ_i)}} Û_{α_i(γ_i)}
2. leaf: P̃_{r_i}(C_i) := P(R_i = r_i, O_i^-) ×_{O_i^-} (P_Û(O_i, O_i^-))^{-1}
3. internal: P̃(C_i) := P(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-) ×_{O_i^-} (P_Û(O_i, O_i^-))^{-1} ×_{O_{α_i(1)}} Û_{α_i(1)} ⋯ ×_{O_{α_i(γ_i)}} Û_{α_i(γ_i)}

(which are the observable quantities with the true probabilities, but empirical Û's). Similarly, we define F̃_i = P[O_i | S_i] ×_{O_i} Û_i. We have also abbreviated P(O_i, O_i^-) ×_{O_i} Û_i as P_Û(O_i, O_i^-). We seek to bound the following three quantities:

$$\delta_i^{root} := \big\| (\widehat{P}(C_i) - \widetilde{P}(C_i)) \times_{O_{\alpha_i(1)}} \widetilde{F}_{\alpha_i(1)}^{-1} \cdots \times_{O_{\alpha_i(\gamma_i)}} \widetilde{F}_{\alpha_i(\gamma_i)}^{-1} \big\|_1 \qquad (15)$$

$$\delta_i^{internal} := \big\| (\widehat{P}(C_i) - \widetilde{P}(C_i)) \times_{O_i} \widetilde{F}_i \times_{O_{\alpha_i(1)}} \widetilde{F}_{\alpha_i(1)}^{-1} \cdots \times_{O_{\alpha_i(\gamma_i)}} \widetilde{F}_{\alpha_i(\gamma_i)}^{-1} \big\|_{S_i} \qquad (16)$$

$$\xi_i^{leaf} := \sum_{r_i} \big\| (\widehat{P}_{r_i}(C_i) - \widetilde{P}_{r_i}(C_i)) \times_{O_i} \widetilde{F}_i \big\|_{S_i} \qquad (17)$$

Lemma 3 If ɛ(o i, O i σ τi (P(O i, O i /3 ten δi root dmax ɛ(o αi(,..., O αi(γ i (8 3 d/ β ( d i dmax+3 ɛ(oαi(, 3..., O αi(γ i, O i ɛ(o i, O i 3 d + β d σ τi (P(O i, O i (σ τi (P(O i, O i (9 ( ξ leaf i 8kdmax ɛ(o i, O i 3 σ τi (P(O i, O i + r i ɛ(r i = r i, O i (0 σ τi (PÛ(O i, O i δ internal We define = max(δ root i, δ internal i, ξ leaf i (over all i. Proof Case δi root : For te root, P(Ci = P(O αi(,..., O αi(γ i Oαi Û ( αi(... Oαi (γ i Û αi(γ i and similarly P(C i = P(O αi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i. δ root = = ( P(Oαi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i P(O αi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i Oαi ( α i(,..., Oαi (γ i ( P(O αi(,..., O αi(γ i P(O αi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i... αi( α i( γi j= σ τ αi (j ( αi(j γi α i( ( P(O αi(,..., O αi(γ i P(O αi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i j= σ τ ( ( P(O αi(,..., O αi(γ i P(O αi(,..., O αi(γ i αi (j αi(j Û αi(... Û α i(γ i Û α i( γi j= σ τ αi (j ( αi(j ɛ(o α i(,..., O αi(γ i Between te first and second line we use Lemma 8 to convert from elementwise one norm to spectral norm, and Lemma 6 (submultiplicativity. Lemma 6 (submultiplicativity is applied again to get to te second-to-last line. We also use te fact tat Û α i( =. Tus, by Lemma δ root i dmax ɛ(o αi(,..., O αi(γ i ( 3 dmax/ β dmax Case ξ leaf i : We note tat P ri (C i = P(R i = r i, O i O i ( PÛ(O i, O i and similarly P ri (C i = P(R i = r i, O i O i (PÛ(O i, O i. Again, we use Lemma 8 to convert from te one norm to te spectral norm, and Lemma 6 for submulti- 6

plicativity:

$$\begin{aligned}
\xi_i^{leaf} &= \sum_{r_i} \big\| (\widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i)) \times_{O_i} \widetilde F_i \big\|_{S_i}
= \sum_{r_i} \big\| (\widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i)) \times_{O_i} \big( P[O_i \mid S_i] \times_{O_i} \hat U_i \big) \big\|_{S_i} \\
&\le \sum_{r_i} \big\| P[O_i \mid S_i] \big\|_{S_i}\, \big\| (\widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i)) \times_{O_i} \hat U_i \big\|_{S_i}
\le k^{d_{\max}} \sum_{r_i} \big\| \widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i) \big\| \qquad (22)
\end{aligned}$$

Note that ‖P[O_i | S_i]‖_{S_i} = 1 and ‖Û_i‖ = 1. Next,

$$\begin{aligned}
\big\| \widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i) \big\|
&= \big\| \widehat P(R_i = r_i, O_{i^-}) \times_{O_{i^-}} (\widehat P_{\hat U}(O_i, O_{i^-}))^{-1} - P(R_i = r_i, O_{i^-}) \times_{O_{i^-}} (P_{\hat U}(O_i, O_{i^-}))^{-1} \big\| \\
&\le \big\| \widehat P(R_i = r_i, O_{i^-}) \times_{O_{i^-}} \big( (\widehat P_{\hat U}(O_i, O_{i^-}))^{-1} - (P_{\hat U}(O_i, O_{i^-}))^{-1} \big) \big\| \\
&\qquad + \big\| \big( \widehat P(R_i = r_i, O_{i^-}) - P(R_i = r_i, O_{i^-}) \big) \times_{O_{i^-}} (P_{\hat U}(O_i, O_{i^-}))^{-1} \big\| \\
&\le P(R_i = r_i)\cdot\frac{1+\sqrt 5}{2}\cdot \frac{\epsilon(O_i, O_{i^-})}{\min\big(\sigma_{\tau_i}(\widehat P_{\hat U}(O_i,O_{i^-})),\, \sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-}))\big)^2} + \frac{\epsilon(R_i = r_i, O_{i^-})}{\sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-}))}
\end{aligned}$$

The last line follows from Eq. (35). We have also used the fact that ‖P̂(R_i = r_i, O_i^-)‖ ≤ ‖P̂(R_i = r_i, O_i^-)‖_F ≤ P̂[R_i = r_i], by Lemma 7. Thus, using Lemma 2,

$$\xi_i^{leaf} \le k^{d_{\max}} \sum_{r_i} \left( \frac{1+\sqrt 5}{2}\cdot\frac{16\,P(R_i = r_i)\,\epsilon(O_i,O_{i^-})}{9\,\sigma_{\tau_i}(P(O_i,O_{i^-}))^2} + \frac{2}{\sqrt 3}\cdot\frac{\epsilon(R_i = r_i, O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))} \right) \qquad (23)$$

$$\xi_i^{leaf} \le \frac{8\,k^{d_{\max}}}{3} \left( \frac{\epsilon(O_i,O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))^2} + \frac{\sum_{r_i}\epsilon(R_i = r_i, O_{i^-})}{\sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-}))} \right) \qquad (24)$$

3.5.1 Case δ_i^internal

Here P̃(C_i) = P(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-) ×_{O_i^-} (P_Û(O_i, O_i^-))^{-1} ×_{O_{α_i(1)}} Û_{α_i(1)} ⋯ ×_{O_{α_i(γ_i)}} Û_{α_i(γ_i)}. Similarly, P̂(C_i) = P̂(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-) ×_{O_i^-} (P̂_Û(O_i, O_i^-))^{-1} ×_{O_{α_i(1)}} Û_{α_i(1)} ⋯ ×_{O_{α_i(γ_i)}} Û_{α_i(γ_i)}.

Again, we use Lemma 8 to convert from the one norm to the spectral norm, and Lemma 6 for submultiplicativity:

$$\begin{aligned}
\delta_i^{internal} &= \big\| (\widehat P(C_i) - \widetilde P(C_i)) \times_{O_i} \widetilde F_i \times_{O_{\alpha_i(1)}} \widetilde F_{\alpha_i(1)}^{-1} \cdots \times_{O_{\alpha_i(\gamma_i)}} \widetilde F_{\alpha_i(\gamma_i)}^{-1} \big\|_{S_i} \\
&\le \big\| P[O_i \mid S_i] \big\|_{S_i}\; k^{d_{\max}} \prod_{j=1}^{\gamma_i} \frac{1}{\sigma_{\tau_{\alpha_i(j)}}(\widetilde F_{\alpha_i(j)})}\; \big\| \widehat P(C_i) - \widetilde P(C_i) \big\| \qquad (25)
\end{aligned}$$

Note that ‖P[O_i | S_i]‖_{S_i} = 1 and ‖Û_i‖ = 1. Next,

$$\begin{aligned}
\big\| \widehat P(C_i) - \widetilde P(C_i) \big\|
&\le \big\| \big( \widehat P(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)}, O_{i^-}) - P(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)}, O_{i^-}) \big) \times_{O_{i^-}} (P_{\hat U}(O_i,O_{i^-}))^{-1} \big\| \\
&\qquad + \big\| \widehat P(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)}, O_{i^-}) \times_{O_{i^-}} \big( (\widehat P_{\hat U}(O_i,O_{i^-}))^{-1} - (P_{\hat U}(O_i,O_{i^-}))^{-1} \big) \big\| \\
&\le \frac{\epsilon(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)}, O_{i^-})}{\sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-}))} + \frac{1+\sqrt5}{2}\cdot \frac{\epsilon(O_i,O_{i^-})}{\min\big(\sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-})),\, \sigma_{\tau_i}(\widehat P_{\hat U}(O_i,O_{i^-}))\big)^2}
\end{aligned}$$

The last line follows from Eq. (35). We have also used the fact that ‖P̂(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-)‖ ≤ ‖P̂(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-)‖_F ≤ 1, via Lemma 7. Using Lemma 2,

$$\delta_i^{internal} \le \frac{(2k)^{d_{\max}}}{3^{d_{\max}/2}\,\beta^{d_{\max}}} \left( \frac{2}{\sqrt3}\cdot\frac{\epsilon(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)},O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))} + \frac{1+\sqrt5}{2}\cdot\frac{16\,\epsilon(O_i,O_{i^-})}{9\,\sigma_{\tau_i}(P(O_i,O_{i^-}))^2} \right) \qquad (26)$$

$$\delta_i^{internal} \le \frac{2^{d_{\max}+3}\,k^{d_{\max}}}{3^{d_{\max}/2}\,\beta^{d_{\max}}} \left( \frac{\epsilon(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)},O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))} + \frac{\epsilon(O_i,O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))^2} \right) \qquad (27)$$

3.6 Bounding the Propagation of Error

We now show how all these errors propagate through the junction tree. For this section, assume the clique nodes are numbered 1, 2, ..., |𝒞| in breadth-first order (so that 1 is the root).

Moreover, let Φ̂_{1:c}(C) denote the transformed factors accumulated so far if we computed the joint probability from the root down (instead of from the bottom up). For example,

$$\widehat\Phi_{1:1}(C) = \widehat P(C_1) \qquad (28)$$
$$\widehat\Phi_{1:2}(C) = \widehat P(C_1) \times_{S_2} \widehat P(C_2) \qquad (29)$$
$$\widehat\Phi_{1:3}(C) = \widehat P(C_1) \times_{S_2} \widehat P(C_2) \times_{S_3} \widehat P(C_3) \qquad (30)$$

(Note how this is very computationally inefficient: the tensors get very large. However, it is useful for proving statistical properties.) The modes of Φ̂_{1:c} can be partitioned into mode groups M_1, ..., M_{d_c}, where each mode group consists of the variables on the corresponding separator edge. We now prove the following lemma.

Lemma 4. Define Δ = max_i(δ_i^root, δ_i^internal, ξ_i^leaf). Then,

$$\sum_{x_{1:c}} \big\| \widehat\Phi_{1:c}(C) - \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}} \le (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \qquad (31)$$

where x_{1:c} denotes all the observed variables in cliques 1 through c. Note that when c = |𝒞|, this implies that Φ̂_{1:c}(C) = P̂[x_1, ..., x_{|𝒪|}], and thus

$$\sum_x \big| \widehat P[x_1,\ldots,x_{|\mathcal O|}] - P[x_1,\ldots,x_{|\mathcal O|}] \big| \le (1+\Delta)^{|\mathcal C|}\,\delta_1^{root} + (1+\Delta)^{|\mathcal C|} - 1 \qquad (32)$$

Proof. The proof is by induction on c. The base case follows trivially from the definition of δ_1^root. For the induction step, assume the claim holds for c; we then prove it holds for c+1:

$$\begin{aligned}
&\sum_{x_{1:c+1}} \big\| \widehat\Phi_{1:c+1}(C) - \Phi_{1:c+1}(C) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\quad= \sum_{x_{1:c+1}} \big\| \Phi_{1:c}(C) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) + (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \\
&\qquad\qquad + (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times \widetilde P(C_{c+1}) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\quad\le \sum_{x_{1:c+1}} \big\| \Phi_{1:c}(C) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\qquad + \sum_{x_{1:c+1}} \big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\qquad + \sum_{x_{1:c+1}} \big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times \widetilde P(C_{c+1}) \big\|_{M_1,\ldots,M_{d_{c+1}}}
\end{aligned}$$

Now we consider two cases: when C_{c+1} is an internal clique and when it is a leaf clique.

Case 1: Internal Clique. Note that here the summation over x_{c+1} is irrelevant, since an internal clique carries no evidence. We use Lemma 5 to break up the three terms. The first term:

$$\big\| \Phi_{1:c}(C) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \big\| (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \times_{O_{c+1}} \widetilde F_{c+1} \times_{O_{\alpha_{c+1}(1)}} \widetilde F_{\alpha_{c+1}(1)}^{-1} \cdots \times_{O_{\alpha_{c+1}(\gamma_{c+1})}} \widetilde F_{\alpha_{c+1}(\gamma_{c+1})}^{-1} \big\|_{S_{c+1}}\, \big\| \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}}$$

The first factor above is simply δ_{c+1}^internal, while the second equals one, since it is a joint distribution. Now for the second term,

$$\begin{aligned}
&\big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\quad\le \big\| \widehat\Phi_{1:c}(C) - \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}}\, \big\| (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \times_{O_{c+1}} \widetilde F_{c+1} \times_{O_{\alpha_{c+1}(1)}} \widetilde F_{\alpha_{c+1}(1)}^{-1} \cdots \big\|_{S_{c+1}} \\
&\quad\le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \delta_{c+1}^{internal} \quad\text{(via the induction hypothesis)} \\
&\quad\le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \Delta
\end{aligned}$$

The third term,

$$\big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times \widetilde P(C_{c+1}) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \big\| \widehat\Phi_{1:c}(C) - \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}}\, \big\| \widetilde P(C_{c+1}) \times_{O_{c+1}} \widetilde F_{c+1} \cdots \big\|_{S_{c+1}} \le (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1$$

Case 2: Leaf Clique. Again we use Lemma 5 to break up the three terms. The first term,

$$\sum_{x_{1:c+1}} \big\| \Phi_{1:c}(C) \times (\widehat P_{r_{c+1}}(C_{c+1}) - \widetilde P_{r_{c+1}}(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \sum_{x_{1:c}} \big\| \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}}\, \sum_{x_{c+1}} \big\| (\widehat P_{r_{c+1}}(C_{c+1}) - \widetilde P_{r_{c+1}}(C_{c+1})) \times_{O_{c+1}} \widetilde F_{c+1} \big\|_{S_{c+1}}$$

The first factor above equals one because it is a joint distribution, and the second is the bound on the transformed quantity we proved earlier (since r_{c+1} = x_{c+1}), i.e. ξ_{c+1}^leaf. The second term,

$$\sum_{x_{1:c+1}} \big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times (\widehat P_{r_{c+1}}(C_{c+1}) - \widetilde P_{r_{c+1}}(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \xi_{c+1}^{leaf} \le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \Delta$$

The third term,

$$\sum_{x_{1:c+1}} \big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times \widetilde P_{r_{c+1}}(C_{c+1}) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \sum_{x_{c+1}} \big\| \widetilde P_{r_{c+1}}(C_{c+1}) \times_{O_{c+1}} \widetilde F_{c+1} \big\|_{S_{c+1}} \le (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1$$

Combining these terms proves the induction step. ∎

3.7 Putting It All Together

We use the fact from HKZ [2] that (1 + a/t)^t ≤ 1 + 2a for a ≤ 1/2. Now Δ is the main source of error. We set Δ = O(ε_total/|𝒞|). Note that, by Lemma 3,

$$\Delta \le \frac{2^{d_{\max}+3}\,k^{d_{\max}}}{3^{d_{\max}/2}\,\beta^{d_{\max}}} \left( \frac{\sum_{o_{d-e+1},\ldots,o_d} \epsilon(O_1,\ldots,O_{d-e},o_{d-e+1},\ldots,o_d)}{\alpha} + \frac{\sum_{o_{d-e+1},\ldots,o_d} \epsilon(O_1,\ldots,O_{d-e},o_{d-e+1},\ldots,o_d)}{\alpha^2} \right)$$

This gives the requirement

$$\frac{2^{d_{\max}+3}\,k^{d_{\max}}}{3^{d_{\max}/2}\,\beta^{d_{\max}}} \left( \frac{\sum \epsilon(\cdot)}{\alpha} + \frac{\sum \epsilon(\cdot)}{\alpha^2} \right) \le K\,\epsilon_{total}/|\mathcal C|$$

where K is some constant. This implies

$$\sum_{o_{d-e+1},\ldots,o_d} \epsilon(O_1,\ldots,O_{d-e},o_{d-e+1},\ldots,o_d) \le \frac{K\,3^{d_{\max}/2}\,\epsilon_{total}\,\alpha^2\,\beta^{d_{\max}}}{2^{d_{\max}+3}\,k^{d_{\max}}\,|\mathcal C|} \qquad (33)$$

Now using the concentration bound (Lemma 1) will give

$$\frac{K\,3^{d_{\max}/2}\,\epsilon_{total}\,\alpha^2\,\beta^{d_{\max}}}{2^{d_{\max}+3}\,k^{d_{\max}}\,|\mathcal C|} \ge \sqrt{\frac{k^{e_{\max}}}{N}\ln\frac{2|\mathcal C|}{\delta}} + \sqrt{\frac{k^{e_{\max}}}{N}}$$

Solving for N:

$$N \ge O\left( \frac{4^{d_{\max}}\,k^{2 d_{\max}}\,k^{e_{\max}}\,|\mathcal C|^2 \ln\frac{|\mathcal C|}{\delta}}{3^{d_{\max}}\,\beta^{2 d_{\max}}\,\epsilon_{total}^2\,\alpha^4} \right) \qquad (34)$$

and this completes the proof.

4 Appendix

4.1 Matrix Perturbation Bounds

This is Theorem 3.8 from page 143 of Stewart and Sun, 1990 [4]. Let A ∈ ℝ^{m×n}, with m ≥ n, and let à = A + E. Then

$$\big\| \tilde A^+ - A^+ \big\| \le \frac{1+\sqrt5}{2}\,\max\big( \|A^+\|^2,\, \|\tilde A^+\|^2 \big)\, \|E\| \qquad (35)$$

4.2 Tensor Norm Bounds

For matrices it is true that ‖Mv‖₁ ≤ ‖M‖₁‖v‖₁. We prove the generalization to tensors.

Lemma 5. Let T₁ and T₂ be tensors, and let σ be a set of (labeled) modes shared by both. Then

$$\big\| T_1 \times_\sigma T_2 \big\|_1 \le \|T_1\|_\sigma\, \|T_2\|_1 \qquad (36)$$

Proof.

$$\begin{aligned}
\|T_1 \times_\sigma T_2\|_1 &= \sum_{i_{1:N}\setminus\sigma}\; \sum_{j_{1:M}\setminus\sigma} \Big| \sum_x T_1(i_{1:N}\setminus\sigma,\, \sigma = x)\, T_2(j_{1:M}\setminus\sigma,\, \sigma = x) \Big| \\
&\le \sum_{i_{1:N}\setminus\sigma}\; \sum_{j_{1:M}\setminus\sigma}\; \sum_x \big| T_1(i_{1:N}\setminus\sigma,\, \sigma = x) \big|\, \big| T_2(j_{1:M}\setminus\sigma,\, \sigma = x) \big| \\
&\le \Big( \max_x \sum_{i_{1:N}\setminus\sigma} \big| T_1(i_{1:N}\setminus\sigma,\, \sigma = x) \big| \Big)\, \sum_{j_{1:M}\setminus\sigma}\; \sum_x \big| T_2(j_{1:M}\setminus\sigma,\, \sigma = x) \big| \\
&= \|T_1\|_\sigma\, \|T_2\|_1
\end{aligned}$$

We also prove a restricted analog of the fact that the spectral norm is submultiplicative for matrices, i.e. ‖AB‖ ≤ ‖A‖‖B‖.

Lemma 6. Let T be a tensor of order N and let M be a matrix. Then

$$\|T \times_1 M\| \le \|T\|\,\|M\| \qquad (37)$$

Proof.

$$\begin{aligned}
\|T \times_1 M\| &= \sup_{v_m, v_2, \ldots, v_N} \sum_{i_1,\ldots,i_N,\,m} T(i_1, i_2, \ldots, i_N)\, M(i_1, m)\, v_m(m)\, v_2(i_2) \cdots v_N(i_N) \\
&= \sup_{v_m, v_2, \ldots, v_N} \sum_{i_1,\ldots,i_N} T(i_1, \ldots, i_N) \Big( \sum_m M(i_1, m)\, v_m(m) \Big)\, v_2(i_2) \cdots v_N(i_N) \\
&\le \Big( \sup_{v_m} \big\| M v_m \big\| \Big) \sup_{u, v_2, \ldots, v_N} \sum_{i_1,\ldots,i_N} T(i_1, \ldots, i_N)\, u(i_1)\, v_2(i_2) \cdots v_N(i_N) \;\le\; \|M\|\, \|T\|
\end{aligned}$$

where all suprema are over vectors of norm at most one, and we use (Σ_m M(i_1, m) v_m(m)) = (M v_m)(i_1) with ‖M v_m‖ ≤ ‖M‖.

Lemma 7. Let T be a tensor of order N. Then

$$\|T\| \le \big\| T \times_1 v_1 \cdots \times_{N-1} v_{N-1} \big\|_F \le \big\| T \times_1 v_1 \cdots \times_{N-2} v_{N-2} \big\|_F \le \cdots \le \|T\|_F \qquad (38)$$

Proof. It suffices to show that sup_{v : ‖v‖ ≤ 1} ‖T ×_n v‖_F ≤ ‖T‖_F. By submultiplicativity of the Frobenius norm, ‖T ×_n v‖_F ≤ ‖T‖_F ‖v‖_F ≤ ‖T‖_F, since ‖v‖_F = ‖v‖₂.

Lemma 8. Let T be a tensor of order N, where each mode is of dimension k. Then, for any σ,

$$\|T\|_\sigma \le k^N\, \|T\| \qquad (39)$$

Proof. We simply prove this for σ = ∅ (which corresponds to the elementwise one norm), since ‖T‖_σ ≤ ‖T‖_{σ'} whenever σ' ⊆ σ. Note that ‖T‖₁ ≤ k^N max(|T|), where max(|T|) is the maximum absolute element of T. Similarly, max(|T|) ≤ ‖T‖, which implies that ‖T‖₁ ≤ k^N ‖T‖.
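As a quick numeric sanity check (not part of the proof, and using the matricized spectral norm of Eq. (9) as a stand-in for ‖·‖, which only makes the checked inequalities weaker), the bounds of Lemmas 6-8 can be exercised on a random tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
k, N = 3, 3
T = rng.standard_normal((k,) * N)      # order-N tensor, all modes of size k
M = rng.standard_normal((k, k))

spec = lambda A: np.linalg.norm(A.reshape(k, -1), 2)   # mode-1 matricization

TM = np.tensordot(T, M, axes=([0], [1]))               # T x_1 M, new mode last
TM = np.moveaxis(TM, -1, 0)

assert spec(TM) <= spec(T) * np.linalg.norm(M, 2) + 1e-9   # Lemma 6
assert spec(T) <= np.sqrt((T ** 2).sum()) + 1e-9           # Lemma 7
assert np.abs(T).sum() <= k ** N * spec(T) + 1e-9          # Lemma 8
```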

References

[1] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1990.

[2] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proc. Annual Conf. Computational Learning Theory, 2009.

[3] N.H. Nguyen, P. Drineas, and T.D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. arXiv preprint arXiv:1005.4732, 2010.

[4] G.W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.