A Spectral Algorithm For Latent Junction Trees - Supplementary Material


Ankur P. Parikh, Le Song, Mariya Ishteva, Gabi Teodoru, Eric P. Xing

1 Discussion of Conditions for Observable Representation

The observable representation exists only if there exist transformations F_i = P[O_i | S_i] with rank τ_i := k^{|S_i|}, and P[O_i^- | S_i] also has rank τ_i (so that P(O_i, O_i^-) has rank τ_i). Thus, it is required that #states(O_i) ≥ #states(S_i). This can be achieved either by making O_i consist of a few high-dimensional observations, or of many lower-dimensional ones. In the case when #states(O_i) > #states(S_i), we need to project F_i to a lower-dimensional space, using a tensor U_i, such that it can be inverted. In this case, we define F_i := P[O_i | S_i] ×_{O_i} U_i. Following this through the computation gives us that P(C_l) = P(O_l, O_l^-) ×_{O_l^-} (P(O_l, O_l^-) ×_{O_l} U_l)^{-1}. A good choice of U_i can be obtained by performing a singular value decomposition of the matricized version of P(O_i, O_i^-) (variables in O_i are arranged into rows and those in O_i^- into columns).

For HMMs and latent trees, this rank condition can be expressed simply as requiring that the conditional probability tables of the underlying model not be rank-deficient. However, junction trees encode significantly more complex latent structures that introduce more subtle considerations. While we leave a general characterization of the models for which the observable representation exists to future work, here we try to give some intuition about the types of latent structures for which the rank condition may fail.

First, the rank condition can fail if there are not enough observed nodes/states, so that #states(O_i) < τ_i. Intuitively, this corresponds to a model where the latent space is too expressive and inherently represents an intractable model (e.g., a set of n binary variables connected by a hidden node with 2^n states is equivalent to a clique of size n).

However, there are more subtle considerations unique to non-tree models. In general, our method is not limited to non-triangulated graphs (see the factorial HMM in Figure 3 of the main paper), but the process of triangulation can introduce artificial dependencies that lead to complications. Consider Figure 1(a), which shows a DAG and its corresponding junction tree. To construct P(C_{AD}), we may set P[O_i, O_i^-] = P[D, F] based on the junction tree topology. However, in the original model before triangulation, D ⊥ F because of the v-structure. As a result, P[D, F] does not have rank k and thus cannot be inverted. Note, however, that choosing P[O_i, O_i^-] = P[D, E] is valid.

Finally, consider Figure 1(b), ignoring the orange nodes for now and assuming the variables are binary. In this model, the hidden variables are largely redundant, since integrating out A and B would simply give a chain. F_r must be set to P[F | S_r] = P[F | A, B]. If we think of A, B as just one large variable, then it is clear that P[F | A, B] = P[F | D] P[D | A, B]. However, D only has two states while (A, B) has four, so P[F | A, B] only has rank 2. Now consider adding the orange node. In this case we could set F_r to P[F, G | S_r] = P[F, G | A, B], whose matricized version has rank 4. Note that once the orange node has been added, integrating out A and B no longer produces a simple chain, but a more complicated structure. Thus, we believe that characterizing the existence of the observable representation more rigorously may shed light on the intrinsic complexity/redundancy of latent variable models in the context of linear and tensor algebra.
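As a concrete illustration of the SVD-based choice of U_i just described, here is a minimal numpy sketch. This is not the authors' code: the function name, shapes, and tolerance are our own assumptions.

```python
import numpy as np

def choose_projection(P_oi_oiminus, tau_i, tol=1e-12):
    """Hypothetical helper: pick U_i from the SVD of the matricized P(O_i, O_i^-).

    P_oi_oiminus: matrix whose rows index joint states of O_i and whose
    columns index joint states of O_i^-; tau_i = k ** |S_i|.
    """
    U, s, Vt = np.linalg.svd(P_oi_oiminus, full_matrices=False)
    if s[tau_i - 1] <= tol:
        # the rank condition fails: P(O_i, O_i^-) has (numerical) rank < tau_i
        raise ValueError("rank condition violated")
    return U[:, :tau_i]  # shape: (#states(O_i), tau_i)

# F_i := P[O_i | S_i] x_{O_i} U_i is then the tau_i x tau_i matrix
# U_i.T @ P_oi_given_si, which is invertible when the rank condition holds.
```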
2 Linear Systems to Improve Stability

As derived in Section 6 of the main paper, for non-root and non-leaf nodes:

$$\widehat{P}(C_i) \times_{O_i} P(O_i, O_{i^-}) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}) \qquad (1)$$

Figure 1: Models and their corresponding junction trees, where constructing the observable representation poses complications. See Section 1.

Equation (1) then implies that

$$\widehat{P}(C_i) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}) \times_{O_{i^-}} P(O_i, O_{i^-})^{-1} \qquad (2)$$

However, there may be many choices of O_i^- (which we denote O_{i^-}^{(1)}, ..., O_{i^-}^{(N)}) for which the above equality is true:

$$\widehat{P}(C_i) \times_{O_i} P(O_i, O_{i^-}^{(1)}) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(1)})$$
$$\widehat{P}(C_i) \times_{O_i} P(O_i, O_{i^-}^{(2)}) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(2)})$$
$$\vdots$$
$$\widehat{P}(C_i) \times_{O_i} P(O_i, O_{i^-}^{(N)}) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(N)})$$

This defines an over-constrained linear system, and we can solve for P̂(C_i) using least squares. In the case where #states(O_i) > #states(S_i) and the projection tensor U_i is needed, the system of equations becomes:

$$\widehat{P}(C_i) \times_{O_i} \big(P(O_i, O_{i^-}^{(1)}) \times_{O_i} U_i\big) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(1)}) \times_{O_{\alpha_i(1)}} U_{\alpha_i(1)} \times_{O_{\alpha_i(2)}} U_{\alpha_i(2)}$$
$$\vdots$$
$$\widehat{P}(C_i) \times_{O_i} \big(P(O_i, O_{i^-}^{(N)}) \times_{O_i} U_i\big) = P(O_{\alpha_i(1)}, O_{\alpha_i(2)}, O_{i^-}^{(N)}) \times_{O_{\alpha_i(1)}} U_{\alpha_i(1)} \times_{O_{\alpha_i(2)}} U_{\alpha_i(2)}$$

In this case, one good choice of U_i is the matrix of top singular vectors of the horizontal concatenation of the matricized P(O_i, O_{i^-}^{(1)}), ..., P(O_i, O_{i^-}^{(N)}). This linear system method allows for more robust estimation, especially at smaller sample sizes (at the cost of more computation). It can be applied to the leaf case as well. One does not need to set up the linear system with all the valid choices; a subset is also acceptable.
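A minimal numpy sketch of this least-squares estimate follows. The names and shapes are our own assumptions, not the authors' code: each constraint is matricized so that the unknown X = R(P̂(C_i)) multiplies a known matrix, and all constraints are concatenated and solved in one shot.

```python
import numpy as np

def solve_transformed(A_list, B_list):
    """Least-squares solve of X @ A_j = B_j for all j simultaneously.

    X is the matricized transformed tensor R(P_hat(C_i)), whose columns
    index the projected O_i mode. A_list[j] is U_i.T @ P(O_i, O_iminus_j);
    B_list[j] is the matricized, U-projected P(O_children, O_iminus_j),
    with the same row ordering as X.
    """
    A_all = np.hstack(A_list)          # stack constraints side by side
    B_all = np.hstack(B_list)
    # X @ A_all = B_all  <=>  A_all.T @ X.T = B_all.T
    X_T, *_ = np.linalg.lstsq(A_all.T, B_all.T, rcond=None)
    return X_T.T

# A subset of the valid choices O_iminus_j is acceptable; using more
# constraints trades computation for robustness at small sample sizes.
```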

3 Sample Complexity Theorem

We prove the sample complexity theorem.

3.1 Notation

For simplicity, the main text describes the algorithm in the context of a binary tree. However, in the sample complexity proof we adopt a more general notation. Let C_i be a clique, and let C_{α_i(1)}, ..., C_{α_i(γ_i)} denote its γ_i children. Let 𝒞 denote the set of all the cliques in the junction tree, and |𝒞| denote its size. Define d_max to be the maximum degree of a tensor in the observable representation, and e_max to be the maximum number of observed variables in any tensor. Furthermore, let τ_i = k^{|S_i|} (i.e., the number of states associated with the separator S_i). We can now write the transformed representation for junction trees whose cliques may have more than 3 neighbors as:

$$\text{Root:}\quad \widehat{P}(C_i) = P(C_i) \times_{S_{\alpha_i(1)}} F_{\alpha_i(1)} \cdots \times_{S_{\alpha_i(\gamma_i)}} F_{\alpha_i(\gamma_i)}$$
$$\text{Internal nodes:}\quad \widehat{P}(C_i) = P(C_i) \times_{S_i} F_i^{-1} \times_{S_{\alpha_i(1)}} F_{\alpha_i(1)} \cdots \times_{S_{\alpha_i(\gamma_i)}} F_{\alpha_i(\gamma_i)}$$
$$\text{Leaf:}\quad \widehat{P}(C_i) = P(C_i) \times_{S_i} F_i^{-1}$$

and the observable representation as:

$$\text{root:}\quad \widehat{P}(C_i) = P(O_{\alpha_i(1)}, \ldots, O_{\alpha_i(\gamma_i)}) \times_{O_{\alpha_i(1)}} U_{\alpha_i(1)} \cdots \times_{O_{\alpha_i(\gamma_i)}} U_{\alpha_i(\gamma_i)}$$
$$\text{internal:}\quad \widehat{P}(C_i) = P(O_{\alpha_i(1)}, \ldots, O_{\alpha_i(\gamma_i)}, O_{i^-}) \times_{O_{i^-}} \big(P(O_i, O_{i^-}) \times_{O_i} U_i\big)^{-1} \times_{O_{\alpha_i(1)}} U_{\alpha_i(1)} \cdots \times_{O_{\alpha_i(\gamma_i)}} U_{\alpha_i(\gamma_i)}$$
$$\text{leaf:}\quad \widehat{P}(C_i) = P(R_i, O_{i^-}) \times_{O_{i^-}} \big(P(O_i, O_{i^-}) \times_{O_i} U_i\big)^{-1}$$

Sometimes, when the multiplication indices are clear, we will omit them to make things simpler.

Rearranged version: We will often find it convenient to rearrange the tensors into lower-order tensors for the purpose of taking certain norms (such as the spectral norm). We define R(·) as the rearranging operation. For example, R(P(O_i, O_i^-)) is the matricized version of P(O_i, O_i^-), with O_i mapped to the rows and O_i^- mapped to the columns. More generally, if P(C_i) has order d_i, then R(P(C_i)) is a rearrangement of P(C_i) such that it has order equal to the number of neighbors of C_i: the set of modes of P(C_i) corresponding to a single separator of C_i is mapped to a single mode of R(P(C_i)). We let R(S_{α_i(j)}) denote the mode that S_{α_i(j)} maps to in R(P(C_i)). In the case of the root, R(P(C_i)) rearranges P(C_i) into a tensor of order γ_i. For other clique nodes, R(P(C_i)) rearranges P(C_i) into a tensor of order γ_i + 1.

Example: in the example of the main text, P(C_{BCDE}) = P(B, D | C, E). C_{BCDE} has 3 neighbors, and so R(P(C_{BCDE})) is of order 3, where the modes correspond to {B, C}, {B, D}, and {C, E} respectively.

This rearrangement can be applied to the other quantities. For example, R(P(O_{α_i(1)}, ..., O_{α_i(γ_i)})) is a rearrangement of P(O_{α_i(1)}, ..., O_{α_i(γ_i)}) into a tensor of order γ_i, where each O_{α_i(j)} corresponds to one mode. R(P̂(O_i, O_i^-)) is the matricized version of P̂(O_i, O_i^-), with O_i mapped to the rows and O_i^- mapped to the columns. Similarly, R(F_i) is the matricized version of F_i, and R(U_i) is the matricized version of U_i. Thus, we can define the rearranged quantities:

$$\text{Root:}\quad R(\widehat{P}(C_i)) = R(P(C_i)) \times_{R(S_{\alpha_i(1)})} R(F_{\alpha_i(1)}) \cdots \times_{R(S_{\alpha_i(\gamma_i)})} R(F_{\alpha_i(\gamma_i)})$$
$$\text{Internal nodes:}\quad R(\widehat{P}(C_i)) = R(P(C_i)) \times_{R(S_i)} R(F_i^{-1}) \times_{R(S_{\alpha_i(1)})} R(F_{\alpha_i(1)}) \cdots \times_{R(S_{\alpha_i(\gamma_i)})} R(F_{\alpha_i(\gamma_i)})$$
$$\text{Leaf:}\quad R(\widehat{P}(C_i)) = R(P(C_i)) \times_{R(S_i)} R(F_i^{-1})$$
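Throughout, ×_S denotes a mode-specific tensor-matrix product. As a minimal numpy sketch (our own helper, with an assumed orientation: the matrix's columns index the contracted mode):

```python
import numpy as np

def mode_product(T, M, mode):
    """Contract mode `mode` of tensor T with the columns of matrix M.

    M has shape (new_dim, old_dim), so T x_{mode} M replaces a mode of
    size old_dim with one of size new_dim, kept in the same position.
    """
    out = np.tensordot(T, M, axes=([mode], [1]))  # new mode appended last
    return np.moveaxis(out, -1, mode)             # restore mode position

# E.g., a leaf's transformed tensor P_hat(C_i) = P(C_i) x_{S_i} inv(F_i)
# could be formed as mode_product(P_Ci, np.linalg.inv(F_i), s_mode),
# where s_mode is the (assumed) position of the S_i mode in P(C_i).
```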

Correspondingly, the observable representation rearranges as:

$$\text{root:}\quad R(\widehat{P}(C_i)) = R(P(O_{\alpha_i(1)}, \ldots, O_{\alpha_i(\gamma_i)})) \times_{R(S_{\alpha_i(1)})} R(U_{\alpha_i(1)}) \cdots \times_{R(S_{\alpha_i(\gamma_i)})} R(U_{\alpha_i(\gamma_i)})$$
$$\text{internal:}\quad R(\widehat{P}(C_i)) = R(P(O_{\alpha_i(1)}, \ldots, O_{\alpha_i(\gamma_i)}, O_{i^-})) \times_{R(S_i)} R\big((P(O_i, O_{i^-}) \times_{O_i} U_i)^{-1}\big) \times_{R(S_{\alpha_i(1)})} R(U_{\alpha_i(1)}) \cdots \times_{R(S_{\alpha_i(\gamma_i)})} R(U_{\alpha_i(\gamma_i)}) \qquad (3)$$
$$\text{leaf:}\quad R(\widehat{P}(C_i)) = R(P(R_i, O_{i^-})) \times_{R(O_i)} R\big((P(O_i, O_{i^-}) \times_{O_i} U_i)^{-1}\big)$$

Furthermore, define the following:

$$\alpha = \min_{C_i \in \mathcal{C}} \sigma_{\tau_i}(P[O_i, O_{i^-}]) \qquad (4)$$
$$\beta = \min_{C_i \in \mathcal{C}} \sigma_{\tau_i}(F_i) \qquad (5)$$

The proof is based on the technique of HKZ [2], but has differences due to the junction tree topology and the higher-order tensors involved.

3.2 Tensor Norms

We briefly define several tensor norms that generalize the corresponding matrix norms. For more information about matrix norms, see [1].

Frobenius Norm: Just as for matrices, the Frobenius norm of a tensor of order N is defined as:

$$\|T\|_F = \sqrt{\sum_{i_1, \ldots, i_N} T(i_1, \ldots, i_N)^2} \qquad (6)$$

Elementwise One Norm: Similarly, the elementwise one norm of a tensor of order N is defined as:

$$\|T\|_1 = \sum_{i_1, \ldots, i_N} |T(i_1, \ldots, i_N)| \qquad (7)$$

Spectral Norm: For tensors of order N, the spectral norm [3] can be defined as:

$$\|T\| = \sup_{v_1, \ldots, v_N \,:\, \|v_i\| \le 1} T \times_1 v_1 \cdots \times_N v_N \qquad (8)$$

In our case, we will find it more convenient to use the rearranged spectral norm, which we define as:

$$\|T\| = \|R(T)\| \qquad (9)$$

where R(·) was defined in the previous section.

Induced One Norm: For matrices, the induced one norm is defined as the max column sum of the matrix: ‖M‖₁ = sup_{v : ‖v‖₁ ≤ 1} ‖Mv‖₁. We can generalize this to the max slice sum of a tensor, where some modes are fixed and the others are summed over. Let σ denote the modes that will be maxed over. Then:

$$\|T\|_\sigma = \sup_{v_i \,:\, \|v_i\|_1 \le 1,\; i \in \sigma} \big\| T \times_{\sigma(1)} v_{\sigma(1)} \cdots \times_{\sigma(|\sigma|)} v_{\sigma(|\sigma|)} \big\|_1 \qquad (10)$$

In the Appendix, we prove several lemmas regarding these tensor norms.

3.3 Concentration Bounds

Define

$$\epsilon(O_1, \ldots, O_d) = \big\| \widehat{P}(O_1, \ldots, O_d) - P(O_1, \ldots, O_d) \big\|_F \qquad (11)$$
$$\epsilon(O_1, \ldots, O_{d-e}, o_{d-e+1}, \ldots, o_d) = \big\| \widehat{P}(O_1, \ldots, O_{d-e}, o_{d-e+1}, \ldots, o_d) - P(O_1, \ldots, O_{d-e}, o_{d-e+1}, \ldots, o_d) \big\|_F \qquad (12)$$

where 1, ..., d−e denote the d−e non-evidence variables, d−e+1, ..., d denote the e evidence variables, and d indicates the total number of modes of the tensor. As the number of samples N gets large, we expect these quantities to be small.
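For concreteness, here is a minimal numpy sketch of these norms on small dense tensors. The function names are our own; since the higher-order spectral norm has no closed form, the sketch falls back to the Frobenius upper bound of Lemma 7.

```python
import numpy as np

def frobenius(T):                       # Eq. (6)
    return np.sqrt((T ** 2).sum())

def elementwise_one(T):                 # Eq. (7)
    return np.abs(T).sum()

def rearranged_spectral(T, mode_groups):
    # Eq. (9): merge each group of modes (one group per separator) into a
    # single mode, then take the spectral norm of the rearranged tensor R(T).
    perm = [m for g in mode_groups for m in g]
    dims = [int(np.prod([T.shape[m] for m in g])) for g in mode_groups]
    R = T.transpose(perm).reshape(dims)
    if R.ndim == 2:                     # matricized case: ordinary 2-norm
        return np.linalg.norm(R, 2)
    return frobenius(R)                 # crude upper bound via Lemma 7
```

For example, rearranged_spectral(T, [[0], [1, 2]]) computes the matricized spectral norm with mode 0 as rows, and the ε(·) quantities of Eqs. (11)-(12) are plain Frobenius distances, frobenius(P_hat - P).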

Lemma 1 (variant of HKZ [2]). If the algorithm independently samples N of the observations, then with probability at least 1 − δ,

$$\epsilon(O_1, \ldots, O_d) \le \sqrt{\frac{1}{N}\ln\frac{2|\mathcal{C}|}{\delta}} + \sqrt{\frac{1}{N}} \qquad (13)$$

$$\sum_{o_{d-e+1}, \ldots, o_d} \epsilon(O_1, \ldots, O_{d-e}, o_{d-e+1}, \ldots, o_d) \le \sqrt{\frac{k^{e_{\max}}}{N}\ln\frac{2|\mathcal{C}|}{\delta}} + \sqrt{\frac{k^{e_{\max}}}{N}} \qquad (14)$$

for all tuples (O_1, ..., O_d) that are used in the spectral algorithm. The proof is the same as that of HKZ [2], except the union bound is taken over 2|𝒞| events instead of 3 (since each transformed quantity in the spectral algorithm is composed of at most two such terms). The last bound can be made tighter, identically to HKZ, but for simplicity we do not pursue that approach here.

3.4 Singular Value Bounds

This is the generalized version of Lemma 9 in HKZ [2], which is stated below for completeness.

Lemma 2 (variant of HKZ [2]). Suppose ε(O_i, O_i^-) ≤ ϵ · σ_{τ_i}(P(O_i, O_i^-)) for some ϵ < 1/2. Let ε_0 = ε(O_i, O_i^-)² / ((1 − ϵ) σ_{τ_i}(P(O_i, O_i^-)))². Then:

1. ε_0 < 1
2. σ_{τ_i}(P̂(O_i, O_i^-) ×_{O_i} Û_i) ≥ (1 − ε_0) σ_{τ_i}(P(O_i, O_i^-))
3. σ_{τ_i}(P(O_i, O_i^-) ×_{O_i} Û_i) ≥ √(1 − ε_0) σ_{τ_i}(P(O_i, O_i^-))
4. σ_{τ_i}(P(O_i | S_i) ×_{O_i} Û_i) ≥ √(1 − ε_0) σ_{τ_i}(P(O_i | S_i))

It follows that if ε(O_i, O_i^-) ≤ σ_{τ_i}(P(O_i, O_i^-))/3, then ε_0 ≤ (1/9)/(4/9) = 1/4. It then follows that:

1. σ_{τ_i}(P̂(O_i, O_i^-) ×_{O_i} Û_i) ≥ (3/4) σ_{τ_i}(P(O_i, O_i^-))
2. σ_{τ_i}(P(O_i, O_i^-) ×_{O_i} Û_i) ≥ (√3/2) σ_{τ_i}(P(O_i, O_i^-))
3. σ_{τ_i}(P(O_i | S_i) ×_{O_i} Û_i) ≥ (√3/2) σ_{τ_i}(P(O_i | S_i))

3.5 Bounding the Transformed Quantities

Define:

1. root: P̃(C_i) := P(O_{α_i(1)}, ..., O_{α_i(γ_i)}) ×_{O_{α_i(1)}} Û_{α_i(1)} ⋯ ×_{O_{α_i(γ_i)}} Û_{α_i(γ_i)}
2. leaf: P̃_{r_i}(C_i) := P(R_i = r_i, O_i^-) ×_{O_i^-} (P_Û(O_i, O_i^-))^{-1}
3. internal: P̃(C_i) := P(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-) ×_{O_i^-} (P_Û(O_i, O_i^-))^{-1} ×_{O_{α_i(1)}} Û_{α_i(1)} ⋯ ×_{O_{α_i(γ_i)}} Û_{α_i(γ_i)}

(which are the observable quantities with the true probabilities, but empirical Û's). Similarly, we define F̃_i = P[O_i | S_i] ×_{O_i} Û_i. We have also abbreviated P(O_i, O_i^-) ×_{O_i} Û_i as P_Û(O_i, O_i^-). We seek to bound the following three quantities:

$$\delta_i^{root} := \big\| (\widehat{P}(C_i) - \widetilde{P}(C_i)) \times_{O_{\alpha_i(1)}} \widetilde{F}_{\alpha_i(1)}^{-1} \cdots \times_{O_{\alpha_i(\gamma_i)}} \widetilde{F}_{\alpha_i(\gamma_i)}^{-1} \big\|_1 \qquad (15)$$

$$\delta_i^{internal} := \big\| (\widehat{P}(C_i) - \widetilde{P}(C_i)) \times_{O_i} \widetilde{F}_i \times_{O_{\alpha_i(1)}} \widetilde{F}_{\alpha_i(1)}^{-1} \cdots \times_{O_{\alpha_i(\gamma_i)}} \widetilde{F}_{\alpha_i(\gamma_i)}^{-1} \big\|_{S_i} \qquad (16)$$

$$\xi_i^{leaf} := \sum_{r_i} \big\| (\widehat{P}_{r_i}(C_i) - \widetilde{P}_{r_i}(C_i)) \times_{O_i} \widetilde{F}_i \big\|_{S_i} \qquad (17)$$

Lemma 3 If ɛ(o i, O i σ τi (P(O i, O i /3 ten δi root dmax ɛ(o αi(,..., O αi(γ i (8 3 d/ β ( d i dmax+3 ɛ(oαi(, 3..., O αi(γ i, O i ɛ(o i, O i 3 d + β d σ τi (P(O i, O i (σ τi (P(O i, O i (9 ( ξ leaf i 8kdmax ɛ(o i, O i 3 σ τi (P(O i, O i + r i ɛ(r i = r i, O i (0 σ τi (PÛ(O i, O i δ internal We define = max(δ root i, δ internal i, ξ leaf i (over all i. Proof Case δi root : For te root, P(Ci = P(O αi(,..., O αi(γ i Oαi Û ( αi(... Oαi (γ i Û αi(γ i and similarly P(C i = P(O αi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i. δ root = = ( P(Oαi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i P(O αi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i Oαi ( α i(,..., Oαi (γ i ( P(O αi(,..., O αi(γ i P(O αi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i... αi( α i( γi j= σ τ αi (j ( αi(j γi α i( ( P(O αi(,..., O αi(γ i P(O αi(,..., O αi(γ i Û Oαi ( αi(... Û Oαi (γ i αi(γ i j= σ τ ( ( P(O αi(,..., O αi(γ i P(O αi(,..., O αi(γ i αi (j αi(j Û αi(... Û α i(γ i Û α i( γi j= σ τ αi (j ( αi(j ɛ(o α i(,..., O αi(γ i Between te first and second line we use Lemma 8 to convert from elementwise one norm to spectral norm, and Lemma 6 (submultiplicativity. Lemma 6 (submultiplicativity is applied again to get to te second-to-last line. We also use te fact tat Û α i( =. Tus, by Lemma δ root i dmax ɛ(o αi(,..., O αi(γ i ( 3 dmax/ β dmax Case ξ leaf i : We note tat P ri (C i = P(R i = r i, O i O i ( PÛ(O i, O i and similarly P ri (C i = P(R i = r i, O i O i (PÛ(O i, O i. Again, we use Lemma 8 to convert from te one norm to te spectral norm, and Lemma 6 for submulti- 6

plicativity:

$$\begin{aligned}
\xi_i^{leaf} &= \sum_{r_i} \big\| (\widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i)) \times_{O_i} \widetilde F_i \big\|_{S_i}
= \sum_{r_i} \big\| (\widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i)) \times_{O_i} \big( P[O_i \mid S_i] \times_{O_i} \hat U_i \big) \big\|_{S_i} \\
&\le \sum_{r_i} \big\| P[O_i \mid S_i] \big\|_{S_i}\, \big\| (\widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i)) \times_{O_i} \hat U_i \big\|_{S_i}
\le k^{d_{\max}} \sum_{r_i} \big\| \widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i) \big\| \qquad (22)
\end{aligned}$$

Note that ‖P[O_i | S_i]‖_{S_i} = 1 and ‖Û_i‖ = 1. Next,

$$\begin{aligned}
\big\| \widehat P_{r_i}(C_i) - \widetilde P_{r_i}(C_i) \big\|
&= \big\| \widehat P(R_i = r_i, O_{i^-}) \times_{O_{i^-}} (\widehat P_{\hat U}(O_i, O_{i^-}))^{-1} - P(R_i = r_i, O_{i^-}) \times_{O_{i^-}} (P_{\hat U}(O_i, O_{i^-}))^{-1} \big\| \\
&\le \big\| \widehat P(R_i = r_i, O_{i^-}) \times_{O_{i^-}} \big( (\widehat P_{\hat U}(O_i, O_{i^-}))^{-1} - (P_{\hat U}(O_i, O_{i^-}))^{-1} \big) \big\| \\
&\qquad + \big\| \big( \widehat P(R_i = r_i, O_{i^-}) - P(R_i = r_i, O_{i^-}) \big) \times_{O_{i^-}} (P_{\hat U}(O_i, O_{i^-}))^{-1} \big\| \\
&\le P(R_i = r_i)\cdot\frac{1+\sqrt 5}{2}\cdot \frac{\epsilon(O_i, O_{i^-})}{\min\big(\sigma_{\tau_i}(\widehat P_{\hat U}(O_i,O_{i^-})),\, \sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-}))\big)^2} + \frac{\epsilon(R_i = r_i, O_{i^-})}{\sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-}))}
\end{aligned}$$

The last line follows from Eq. (35). We have also used the fact that ‖P̂(R_i = r_i, O_i^-)‖ ≤ ‖P̂(R_i = r_i, O_i^-)‖_F ≤ P̂[R_i = r_i], by Lemma 7. Thus, using Lemma 2,

$$\xi_i^{leaf} \le k^{d_{\max}} \sum_{r_i} \left( \frac{1+\sqrt 5}{2}\cdot\frac{16\,P(R_i = r_i)\,\epsilon(O_i,O_{i^-})}{9\,\sigma_{\tau_i}(P(O_i,O_{i^-}))^2} + \frac{2}{\sqrt 3}\cdot\frac{\epsilon(R_i = r_i, O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))} \right) \qquad (23)$$

$$\xi_i^{leaf} \le \frac{8\,k^{d_{\max}}}{3} \left( \frac{\epsilon(O_i,O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))^2} + \frac{\sum_{r_i}\epsilon(R_i = r_i, O_{i^-})}{\sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-}))} \right) \qquad (24)$$

3.5.1 Case δ_i^internal

Here P̃(C_i) = P(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-) ×_{O_i^-} (P_Û(O_i, O_i^-))^{-1} ×_{O_{α_i(1)}} Û_{α_i(1)} ⋯ ×_{O_{α_i(γ_i)}} Û_{α_i(γ_i)}. Similarly, P̂(C_i) = P̂(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-) ×_{O_i^-} (P̂_Û(O_i, O_i^-))^{-1} ×_{O_{α_i(1)}} Û_{α_i(1)} ⋯ ×_{O_{α_i(γ_i)}} Û_{α_i(γ_i)}.

Again, we use Lemma 8 to convert from the one norm to the spectral norm, and Lemma 6 for submultiplicativity:

$$\begin{aligned}
\delta_i^{internal} &= \big\| (\widehat P(C_i) - \widetilde P(C_i)) \times_{O_i} \widetilde F_i \times_{O_{\alpha_i(1)}} \widetilde F_{\alpha_i(1)}^{-1} \cdots \times_{O_{\alpha_i(\gamma_i)}} \widetilde F_{\alpha_i(\gamma_i)}^{-1} \big\|_{S_i} \\
&\le \big\| P[O_i \mid S_i] \big\|_{S_i}\; k^{d_{\max}} \prod_{j=1}^{\gamma_i} \frac{1}{\sigma_{\tau_{\alpha_i(j)}}(\widetilde F_{\alpha_i(j)})}\; \big\| \widehat P(C_i) - \widetilde P(C_i) \big\| \qquad (25)
\end{aligned}$$

Note that ‖P[O_i | S_i]‖_{S_i} = 1 and ‖Û_i‖ = 1. Next,

$$\begin{aligned}
\big\| \widehat P(C_i) - \widetilde P(C_i) \big\|
&\le \big\| \big( \widehat P(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)}, O_{i^-}) - P(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)}, O_{i^-}) \big) \times_{O_{i^-}} (P_{\hat U}(O_i,O_{i^-}))^{-1} \big\| \\
&\qquad + \big\| \widehat P(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)}, O_{i^-}) \times_{O_{i^-}} \big( (\widehat P_{\hat U}(O_i,O_{i^-}))^{-1} - (P_{\hat U}(O_i,O_{i^-}))^{-1} \big) \big\| \\
&\le \frac{\epsilon(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)}, O_{i^-})}{\sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-}))} + \frac{1+\sqrt5}{2}\cdot \frac{\epsilon(O_i,O_{i^-})}{\min\big(\sigma_{\tau_i}(P_{\hat U}(O_i,O_{i^-})),\, \sigma_{\tau_i}(\widehat P_{\hat U}(O_i,O_{i^-}))\big)^2}
\end{aligned}$$

The last line follows from Eq. (35). We have also used the fact that ‖P̂(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-)‖ ≤ ‖P̂(O_{α_i(1)}, ..., O_{α_i(γ_i)}, O_i^-)‖_F ≤ 1, via Lemma 7. Using Lemma 2,

$$\delta_i^{internal} \le \frac{(2k)^{d_{\max}}}{3^{d_{\max}/2}\,\beta^{d_{\max}}} \left( \frac{2}{\sqrt3}\cdot\frac{\epsilon(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)},O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))} + \frac{1+\sqrt5}{2}\cdot\frac{16\,\epsilon(O_i,O_{i^-})}{9\,\sigma_{\tau_i}(P(O_i,O_{i^-}))^2} \right) \qquad (26)$$

$$\delta_i^{internal} \le \frac{2^{d_{\max}+3}\,k^{d_{\max}}}{3^{d_{\max}/2}\,\beta^{d_{\max}}} \left( \frac{\epsilon(O_{\alpha_i(1)},\ldots,O_{\alpha_i(\gamma_i)},O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))} + \frac{\epsilon(O_i,O_{i^-})}{\sigma_{\tau_i}(P(O_i,O_{i^-}))^2} \right) \qquad (27)$$

3.6 Bounding the Propagation of Error

We now show how all these errors propagate through the junction tree. For this section, assume the clique nodes are numbered 1, 2, ..., |𝒞| in breadth-first order (so that 1 is the root).

Moreover, let Φ̂_{1:c}(C) denote the transformed factors accumulated so far if we computed the joint probability from the root down (instead of from the bottom up). For example,

$$\widehat\Phi_{1:1}(C) = \widehat P(C_1) \qquad (28)$$
$$\widehat\Phi_{1:2}(C) = \widehat P(C_1) \times_{S_2} \widehat P(C_2) \qquad (29)$$
$$\widehat\Phi_{1:3}(C) = \widehat P(C_1) \times_{S_2} \widehat P(C_2) \times_{S_3} \widehat P(C_3) \qquad (30)$$

(Note how this is very computationally inefficient: the tensors get very large. However, it is useful for proving statistical properties.) The modes of Φ̂_{1:c} can be partitioned into mode groups M_1, ..., M_{d_c}, where each mode group consists of the variables on the corresponding separator edge. We now prove the following lemma.

Lemma 4. Define Δ = max_i(δ_i^root, δ_i^internal, ξ_i^leaf). Then,

$$\sum_{x_{1:c}} \big\| \widehat\Phi_{1:c}(C) - \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}} \le (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \qquad (31)$$

where x_{1:c} denotes all the observed variables in cliques 1 through c. Note that when c = |𝒞|, this implies that Φ̂_{1:c}(C) = P̂[x_1, ..., x_{|𝒪|}], and thus

$$\sum_x \big| \widehat P[x_1,\ldots,x_{|\mathcal O|}] - P[x_1,\ldots,x_{|\mathcal O|}] \big| \le (1+\Delta)^{|\mathcal C|}\,\delta_1^{root} + (1+\Delta)^{|\mathcal C|} - 1 \qquad (32)$$

Proof. The proof is by induction on c. The base case follows trivially from the definition of δ_1^root. For the induction step, assume the claim holds for c; we then prove it holds for c+1:

$$\begin{aligned}
&\sum_{x_{1:c+1}} \big\| \widehat\Phi_{1:c+1}(C) - \Phi_{1:c+1}(C) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\quad= \sum_{x_{1:c+1}} \big\| \Phi_{1:c}(C) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) + (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \\
&\qquad\qquad + (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times \widetilde P(C_{c+1}) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\quad\le \sum_{x_{1:c+1}} \big\| \Phi_{1:c}(C) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\qquad + \sum_{x_{1:c+1}} \big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\qquad + \sum_{x_{1:c+1}} \big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times \widetilde P(C_{c+1}) \big\|_{M_1,\ldots,M_{d_{c+1}}}
\end{aligned}$$

Now we consider two cases: when C_{c+1} is an internal clique and when it is a leaf clique.

Case 1: Internal Clique. Note that here the summation over x_{c+1} is irrelevant, since an internal clique carries no evidence. We use Lemma 5 to break up the three terms. The first term:

$$\big\| \Phi_{1:c}(C) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \big\| (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \times_{O_{c+1}} \widetilde F_{c+1} \times_{O_{\alpha_{c+1}(1)}} \widetilde F_{\alpha_{c+1}(1)}^{-1} \cdots \times_{O_{\alpha_{c+1}(\gamma_{c+1})}} \widetilde F_{\alpha_{c+1}(\gamma_{c+1})}^{-1} \big\|_{S_{c+1}}\, \big\| \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}}$$

The first factor above is simply δ_{c+1}^internal, while the second equals one, since it is a joint distribution. Now for the second term,

$$\begin{aligned}
&\big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \\
&\quad\le \big\| \widehat\Phi_{1:c}(C) - \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}}\, \big\| (\widehat P(C_{c+1}) - \widetilde P(C_{c+1})) \times_{O_{c+1}} \widetilde F_{c+1} \times_{O_{\alpha_{c+1}(1)}} \widetilde F_{\alpha_{c+1}(1)}^{-1} \cdots \big\|_{S_{c+1}} \\
&\quad\le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \delta_{c+1}^{internal} \quad\text{(via the induction hypothesis)} \\
&\quad\le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \Delta
\end{aligned}$$

The third term,

$$\big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times \widetilde P(C_{c+1}) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \big\| \widehat\Phi_{1:c}(C) - \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}}\, \big\| \widetilde P(C_{c+1}) \times_{O_{c+1}} \widetilde F_{c+1} \cdots \big\|_{S_{c+1}} \le (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1$$

Case 2: Leaf Clique. Again we use Lemma 5 to break up the three terms. The first term,

$$\sum_{x_{1:c+1}} \big\| \Phi_{1:c}(C) \times (\widehat P_{r_{c+1}}(C_{c+1}) - \widetilde P_{r_{c+1}}(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \sum_{x_{1:c}} \big\| \Phi_{1:c}(C) \big\|_{M_1,\ldots,M_{d_c}}\, \sum_{x_{c+1}} \big\| (\widehat P_{r_{c+1}}(C_{c+1}) - \widetilde P_{r_{c+1}}(C_{c+1})) \times_{O_{c+1}} \widetilde F_{c+1} \big\|_{S_{c+1}}$$

The first factor above equals one because it is a joint distribution, and the second is the bound on the transformed quantity we proved earlier (since r_{c+1} = x_{c+1}), i.e. ξ_{c+1}^leaf. The second term,

$$\sum_{x_{1:c+1}} \big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times (\widehat P_{r_{c+1}}(C_{c+1}) - \widetilde P_{r_{c+1}}(C_{c+1})) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \xi_{c+1}^{leaf} \le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \Delta$$

The third term,

$$\sum_{x_{1:c+1}} \big\| (\widehat\Phi_{1:c}(C) - \Phi_{1:c}(C)) \times \widetilde P_{r_{c+1}}(C_{c+1}) \big\|_{M_1,\ldots,M_{d_{c+1}}} \le \big( (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1 \big)\, \sum_{x_{c+1}} \big\| \widetilde P_{r_{c+1}}(C_{c+1}) \times_{O_{c+1}} \widetilde F_{c+1} \big\|_{S_{c+1}} \le (1+\Delta)^c\,\delta_1^{root} + (1+\Delta)^c - 1$$

Combining these terms proves the induction step. ∎

3.7 Putting It All Together

We use the fact from HKZ [2] that (1 + a/t)^t ≤ 1 + 2a for a ≤ 1/2. Now Δ is the main source of error. We set Δ = O(ε_total/|𝒞|). Note that, by Lemma 3,

$$\Delta \le \frac{2^{d_{\max}+3}\,k^{d_{\max}}}{3^{d_{\max}/2}\,\beta^{d_{\max}}} \left( \frac{\sum_{o_{d-e+1},\ldots,o_d} \epsilon(O_1,\ldots,O_{d-e},o_{d-e+1},\ldots,o_d)}{\alpha} + \frac{\sum_{o_{d-e+1},\ldots,o_d} \epsilon(O_1,\ldots,O_{d-e},o_{d-e+1},\ldots,o_d)}{\alpha^2} \right)$$

This gives the requirement

$$\frac{2^{d_{\max}+3}\,k^{d_{\max}}}{3^{d_{\max}/2}\,\beta^{d_{\max}}} \left( \frac{\sum \epsilon(\cdot)}{\alpha} + \frac{\sum \epsilon(\cdot)}{\alpha^2} \right) \le K\,\epsilon_{total}/|\mathcal C|$$

where K is some constant. This implies

$$\sum_{o_{d-e+1},\ldots,o_d} \epsilon(O_1,\ldots,O_{d-e},o_{d-e+1},\ldots,o_d) \le \frac{K\,3^{d_{\max}/2}\,\epsilon_{total}\,\alpha^2\,\beta^{d_{\max}}}{2^{d_{\max}+3}\,k^{d_{\max}}\,|\mathcal C|} \qquad (33)$$

Now using the concentration bound (Lemma 1) will give

$$\frac{K\,3^{d_{\max}/2}\,\epsilon_{total}\,\alpha^2\,\beta^{d_{\max}}}{2^{d_{\max}+3}\,k^{d_{\max}}\,|\mathcal C|} \ge \sqrt{\frac{k^{e_{\max}}}{N}\ln\frac{2|\mathcal C|}{\delta}} + \sqrt{\frac{k^{e_{\max}}}{N}}$$

Solving for N:

$$N \ge O\left( \frac{4^{d_{\max}}\,k^{2 d_{\max}}\,k^{e_{\max}}\,|\mathcal C|^2 \ln\frac{|\mathcal C|}{\delta}}{3^{d_{\max}}\,\beta^{2 d_{\max}}\,\epsilon_{total}^2\,\alpha^4} \right) \qquad (34)$$

and this completes the proof.

4 Appendix

4.1 Matrix Perturbation Bounds

This is Theorem 3.8 from page 143 of Stewart and Sun, 1990 [4]. Let A ∈ ℝ^{m×n}, with m ≥ n, and let à = A + E. Then

$$\big\| \tilde A^+ - A^+ \big\| \le \frac{1+\sqrt5}{2}\,\max\big( \|A^+\|^2,\, \|\tilde A^+\|^2 \big)\, \|E\| \qquad (35)$$

4.2 Tensor Norm Bounds

For matrices it is true that ‖Mv‖₁ ≤ ‖M‖₁‖v‖₁. We prove the generalization to tensors.

Lemma 5. Let T₁ and T₂ be tensors, and let σ be a set of (labeled) modes shared by both. Then

$$\big\| T_1 \times_\sigma T_2 \big\|_1 \le \|T_1\|_\sigma\, \|T_2\|_1 \qquad (36)$$

Proof.

$$\begin{aligned}
\|T_1 \times_\sigma T_2\|_1 &= \sum_{i_{1:N}\setminus\sigma}\; \sum_{j_{1:M}\setminus\sigma} \Big| \sum_x T_1(i_{1:N}\setminus\sigma,\, \sigma = x)\, T_2(j_{1:M}\setminus\sigma,\, \sigma = x) \Big| \\
&\le \sum_{i_{1:N}\setminus\sigma}\; \sum_{j_{1:M}\setminus\sigma}\; \sum_x \big| T_1(i_{1:N}\setminus\sigma,\, \sigma = x) \big|\, \big| T_2(j_{1:M}\setminus\sigma,\, \sigma = x) \big| \\
&\le \Big( \max_x \sum_{i_{1:N}\setminus\sigma} \big| T_1(i_{1:N}\setminus\sigma,\, \sigma = x) \big| \Big)\, \sum_{j_{1:M}\setminus\sigma}\; \sum_x \big| T_2(j_{1:M}\setminus\sigma,\, \sigma = x) \big| \\
&= \|T_1\|_\sigma\, \|T_2\|_1
\end{aligned}$$

We also prove a restricted analog of the fact that the spectral norm is submultiplicative for matrices, i.e. ‖AB‖ ≤ ‖A‖‖B‖.

Lemma 6. Let T be a tensor of order N and let M be a matrix. Then

$$\|T \times_1 M\| \le \|T\|\,\|M\| \qquad (37)$$

Proof.

$$\begin{aligned}
\|T \times_1 M\| &= \sup_{v_m, v_2, \ldots, v_N} \sum_{i_1,\ldots,i_N,\,m} T(i_1, i_2, \ldots, i_N)\, M(i_1, m)\, v_m(m)\, v_2(i_2) \cdots v_N(i_N) \\
&= \sup_{v_m, v_2, \ldots, v_N} \sum_{i_1,\ldots,i_N} T(i_1, \ldots, i_N) \Big( \sum_m M(i_1, m)\, v_m(m) \Big)\, v_2(i_2) \cdots v_N(i_N) \\
&\le \Big( \sup_{v_m} \big\| M v_m \big\| \Big) \sup_{u, v_2, \ldots, v_N} \sum_{i_1,\ldots,i_N} T(i_1, \ldots, i_N)\, u(i_1)\, v_2(i_2) \cdots v_N(i_N) \;\le\; \|M\|\, \|T\|
\end{aligned}$$

where all suprema are over vectors of norm at most one, and we use (Σ_m M(i_1, m) v_m(m)) = (M v_m)(i_1) with ‖M v_m‖ ≤ ‖M‖.

Lemma 7. Let T be a tensor of order N. Then

$$\|T\| \le \big\| T \times_1 v_1 \cdots \times_{N-1} v_{N-1} \big\|_F \le \big\| T \times_1 v_1 \cdots \times_{N-2} v_{N-2} \big\|_F \le \cdots \le \|T\|_F \qquad (38)$$

Proof. It suffices to show that sup_{v : ‖v‖ ≤ 1} ‖T ×_n v‖_F ≤ ‖T‖_F. By submultiplicativity of the Frobenius norm, ‖T ×_n v‖_F ≤ ‖T‖_F ‖v‖_F ≤ ‖T‖_F, since ‖v‖_F = ‖v‖₂.

Lemma 8. Let T be a tensor of order N, where each mode is of dimension k. Then, for any σ,

$$\|T\|_\sigma \le k^N\, \|T\| \qquad (39)$$

Proof. We simply prove this for σ = ∅ (which corresponds to the elementwise one norm), since ‖T‖_σ ≤ ‖T‖_{σ'} whenever σ' ⊆ σ. Note that ‖T‖₁ ≤ k^N max(|T|), where max(|T|) is the maximum absolute element of T. Similarly, max(|T|) ≤ ‖T‖, which implies that ‖T‖₁ ≤ k^N ‖T‖.
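As a quick numeric sanity check (not part of the proof, and using the matricized spectral norm of Eq. (9) as a stand-in for ‖·‖, which only makes the checked inequalities weaker), the bounds of Lemmas 6-8 can be exercised on a random tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
k, N = 3, 3
T = rng.standard_normal((k,) * N)      # order-N tensor, all modes of size k
M = rng.standard_normal((k, k))

spec = lambda A: np.linalg.norm(A.reshape(k, -1), 2)   # mode-1 matricization

TM = np.tensordot(T, M, axes=([0], [1]))               # T x_1 M, new mode last
TM = np.moveaxis(TM, -1, 0)

assert spec(TM) <= spec(T) * np.linalg.norm(M, 2) + 1e-9   # Lemma 6
assert spec(T) <= np.sqrt((T ** 2).sum()) + 1e-9           # Lemma 7
assert np.abs(T).sum() <= k ** N * spec(T) + 1e-9          # Lemma 8
```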

References

[1] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1990.

[2] D. Hsu, S. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proc. Annual Conf. Computational Learning Theory, 2009.

[3] N.H. Nguyen, P. Drineas, and T.D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. arXiv preprint arXiv:1005.4732, 2010.

[4] G.W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.