Orthogonal tensor decomposition
Daniel Hsu, Columbia University
Largely based on the 2012 arXiv report "Tensor decompositions for learning latent variable models", with Anandkumar, Ge, Kakade, and Telgarsky.
The basic decomposition problem

Notation: For a vector x = (x_1, x_2, ..., x_n) ∈ R^n, x ⊗ x ⊗ x denotes the 3-way array (call it a tensor) in R^{n×n×n} whose (i, j, k)-th entry is x_i x_j x_k.

Problem: Given T ∈ R^{n×n×n} with the promise that

  T = ∑_{t=1}^n λ_t v_t ⊗ v_t ⊗ v_t

for some orthonormal basis {v_t} of R^n (w.r.t. the standard inner product) and positive scalars {λ_t > 0}, approximately find {(v_t, λ_t)} (up to some desired precision).
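A minimal numpy sketch of such a promised T, built from a random orthonormal basis (the helper names `rank_one` and `odeco_tensor` are illustrative, not from the talk):

```python
import numpy as np

def rank_one(v):
    # v ⊗ v ⊗ v as an n×n×n array: entry (i, j, k) equals v[i]*v[j]*v[k]
    return np.einsum('i,j,k->ijk', v, v, v)

def odeco_tensor(lams, V):
    # T = Σ_t lams[t] · v_t ⊗ v_t ⊗ v_t, where the columns of V are orthonormal
    return sum(l * rank_one(v) for l, v in zip(lams, V.T))

rng = np.random.default_rng(0)
n = 4
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthonormal basis
lams = rng.uniform(0.5, 1.0, size=n)               # positive scalars λ_t
T = odeco_tensor(lams, V)
```

Because each rank-one term is symmetric in its three indices, T is symmetric under any permutation of its axes.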
Basic questions
1. Is {(v_t, λ_t)} uniquely determined?
2. If so, is there an efficient algorithm for finding the decomposition?
3. What if T is perturbed by some small amount?

Perturbed problem: Same as the original problem, except instead of T, we are given T + E for some error tensor E. How large can E be if we want ε precision?
Analogous matrix problem

Matrix problem: Given M ∈ R^{n×n} with the promise that

  M = ∑_{t=1}^n λ_t v_t v_t^⊤

for some orthonormal basis {v_t} of R^n (w.r.t. the standard inner product) and positive scalars {λ_t > 0}, approximately find {(v_t, λ_t)} (up to some desired precision).
Analogous matrix problem

We're promised that M is symmetric and positive definite, so the requested decomposition is an eigendecomposition. In this case, an eigendecomposition always exists and can be found efficiently. It is unique if and only if the {λ_i} are distinct.

What if M is perturbed by some small amount?

Perturbed matrix problem: Same as the original problem, except instead of M, we are given M + E for some error matrix E (assumed to be symmetric).

Answer provided by matrix perturbation theory (e.g., Davis-Kahan), which requires ‖E‖_2 < min_{i≠j} |λ_i − λ_j|.
Back to the original problem

Problem: Given T ∈ R^{n×n×n} with the promise that T = ∑_{t=1}^n λ_t v_t ⊗ v_t ⊗ v_t for some orthonormal basis {v_t} of R^n (w.r.t. the standard inner product) and positive scalars {λ_t > 0}, approximately find {(v_t, λ_t)} (up to some desired precision).

Such decompositions do not necessarily exist, even for symmetric tensors. Where the decompositions do exist, the perturbed problem asks whether they are robust.
Main ideas

Easy claim: Repeated application of a certain quadratic operator based on T (a "power iteration") recovers a single (v_t, λ_t) up to any desired precision.

Self-reduction: Replace T with T − λ_t v_t ⊗ v_t ⊗ v_t.
Why?: T − λ_t v_t ⊗ v_t ⊗ v_t = ∑_{τ≠t} λ_τ v_τ ⊗ v_τ ⊗ v_τ.

Catch: We don't recover (v_t, λ_t) exactly, so we can actually only replace T with

  T − λ_t v_t ⊗ v_t ⊗ v_t + E_t

for some error tensor E_t. Therefore, we must deal with perturbations anyway.
Rest of this talk
1. Identifiability of the decomposition {(v_t, λ_t)} from T.
2. A decomposition algorithm based on tensor power iteration.
3. Error analysis of the decomposition algorithm.
Identifiability of the decomposition

Orthonormal basis {v_t} of R^n, positive scalars {λ_t > 0}:

  T = ∑_{t=1}^n λ_t v_t ⊗ v_t ⊗ v_t

In what sense is {(v_t, λ_t)} uniquely determined?

Claim: {v_t} are the n isolated local maximizers of a certain cubic form f_T : B^n → R, and f_T(v_t) = λ_t.
Aside: multilinear form

There is a natural trilinear form associated with T:

  (x, y, z) ↦ ∑_{i,j,k} T_{i,j,k} x_i y_j z_k.

For matrices M, it looks like (x, y) ↦ ∑_{i,j} M_{i,j} x_i y_j = x^⊤ M y.
Review: Rayleigh quotient

Recall the Rayleigh quotient for a matrix M := ∑_{t=1}^n λ_t v_t v_t^⊤ (assuming x ∈ S^{n−1}):

  R_M(x) := x^⊤ M x = ∑_{t=1}^n λ_t (v_t^⊤ x)^2.

Every v_t such that λ_t = max_τ λ_τ is a maximizer of R_M. (These are also the only local maximizers.)
The natural cubic form

Consider the function f_T : B^n → R given by

  x ↦ f_T(x) = ∑_{i,j,k} T_{i,j,k} x_i x_j x_k.

For our promised T = ∑_{t=1}^n λ_t v_t ⊗ v_t ⊗ v_t, f_T becomes

  f_T(x) = ∑_{t=1}^n λ_t ∑_{i,j,k} (v_t ⊗ v_t ⊗ v_t)_{i,j,k} x_i x_j x_k
         = ∑_{t=1}^n λ_t ∑_{i,j,k} (v_t)_i (v_t)_j (v_t)_k x_i x_j x_k
         = ∑_{t=1}^n λ_t (v_t^⊤ x)^3.

Observation: f_T(v_t) = λ_t.
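The chain of equalities above is easy to confirm numerically; a minimal sketch assuming numpy, with T built from a random orthonormal basis:

```python
import numpy as np

def f(T, x):
    # cubic form f_T(x) = Σ_{i,j,k} T[i,j,k] · x_i x_j x_k
    return np.einsum('ijk,i,j,k->', T, x, x, x)

rng = np.random.default_rng(0)
n = 5
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal columns v_t
lams = rng.uniform(0.5, 1.0, n)
T = np.einsum('t,it,jt,kt->ijk', lams, V, V, V)    # Σ_t λ_t v_t ⊗ v_t ⊗ v_t

x = rng.standard_normal(n)
lhs = f(T, x)                                      # coordinate-wise definition
rhs = np.sum(lams * (V.T @ x) ** 3)                # Σ_t λ_t (v_t·x)^3
vals = np.array([f(T, V[:, t]) for t in range(n)]) # should recover the λ_t
```

The last line checks the Observation directly: evaluating f_T at each basis vector returns that vector's coefficient λ_t.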
Variational characterization

Claim: The isolated local maximizers of f_T on B^n are {v_t}.

Objective function (with constraint):

  x ↦ inf_{λ≥0} ∑_{t=1}^n λ_t (v_t^⊤ x)^3 − 1.5 λ (‖x‖_2^2 − 1).

First-order condition for local maxima:

  ∑_{t=1}^n λ_t (v_t^⊤ x)^2 v_t = λ x.

Second-order condition for isolated local maxima:

  w^⊤ ( 2 ∑_{t=1}^n λ_t (v_t^⊤ x) v_t v_t^⊤ − λ I ) w < 0,  ∀ w ⊥ x.
Intuition behind variational characterization

May as well assume v_t is the t-th coordinate basis vector, so

  max_{x∈R^n} ∑_{t=1}^n λ_t x_t^3  s.t. ∑_{t=1}^n x_t^2 ≤ 1.

Intuition: Suppose supp(x) = {1, 2}, and x_1, x_2 > 0. Then

  f_T(x) = λ_1 x_1^3 + λ_2 x_2^3 < λ_1 x_1^2 + λ_2 x_2^2 ≤ max{λ_1, λ_2}.

Better to have |supp(x)| = 1, i.e., picking x to be a coordinate basis vector.
Aside: canonical polyadic decomposition

Rank-K canonical polyadic decomposition (CPD) of T (also called PARAFAC, CANDECOMP, or CP):

  T = ∑_{i=1}^K σ_i u_i ⊗ v_i ⊗ w_i.

[Harshman/Jennrich, 1970; Kruskal, 1977; Leurgans et al., 1993]

Number of parameters: K(3n + 1) (compared to n^3 in general).
Fact: Our promised T has a rank-n CPD.
N.B.: Overcomplete (K > n) CPD is also interesting and a possibility as long as K(3n + 1) ≪ n^3.
The quadratic operator

Easy claim: Repeated application of a certain quadratic operator (based on T) recovers a single (λ_t, v_t) up to any desired precision.

For any T ∈ R^{n×n×n} and x = (x_1, x_2, ..., x_n) ∈ R^n, define the quadratic operator

  φ_T(x) := ∑_{i,j,k} T_{i,j,k} x_j x_k e_i ∈ R^n,

where e_i ∈ R^n is the i-th coordinate basis vector.

If T = ∑_{t=1}^n λ_t v_t ⊗ v_t ⊗ v_t, then φ_T(x) = ∑_{t=1}^n λ_t (v_t^⊤ x)^2 v_t.
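The operator φ_T is one einsum contraction; a small numpy sketch verifying the closed form for an orthogonally decomposable T:

```python
import numpy as np

def phi(T, x):
    # quadratic operator: φ_T(x)_i = Σ_{j,k} T[i,j,k] x_j x_k
    return np.einsum('ijk,j,k->i', T, x, x)

rng = np.random.default_rng(0)
n = 5
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal columns v_t
lams = rng.uniform(0.5, 1.0, n)
T = np.einsum('t,it,jt,kt->ijk', lams, V, V, V)    # Σ_t λ_t v_t ⊗ v_t ⊗ v_t

x = rng.standard_normal(n)
# for this T: φ_T(x) = Σ_t λ_t (v_t·x)^2 v_t
expected = V @ (lams * (V.T @ x) ** 2)
result = phi(T, x)
```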
An algorithm?

Recall: the first-order condition for local maxima of f_T(x) = ∑_{t=1}^n λ_t (v_t^⊤ x)^3 for x ∈ B^n:

  φ_T(x) = ∑_{t=1}^n λ_t (v_t^⊤ x)^2 v_t = λ x,

i.e., an "eigenvector"-like condition.

Algorithm: Find x ∈ B^n fixed under x ↦ φ_T(x)/‖φ_T(x)‖. (Ignoring numerical issues, can just repeatedly apply φ_T and defer normalization until later.)

N.B.: Gradient ascent also works [Kolda & Mayo, '11].
Tensor power iteration [De Lathauwer et al, 2000]

Start with some x^(0), and for j = 1, 2, ...:

  x^(j) := φ_T(x^(j−1)) = ∑_{t=1}^n λ_t (v_t^⊤ x^(j−1))^2 v_t.

Claim: For almost all initial x^(0), the sequence (x^(j)/‖x^(j)‖)_{j≥1} converges quadratically fast to some v_t.
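The iteration above can be sketched in a few lines of numpy (a toy demonstration, not a tuned implementation); after a handful of steps the normalized iterate aligns with one of the v_t:

```python
import numpy as np

def phi(T, x):
    # φ_T(x)_i = Σ_{j,k} T[i,j,k] x_j x_k
    return np.einsum('ijk,j,k->i', T, x, x)

rng = np.random.default_rng(0)
n = 8
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal columns v_t
lams = rng.uniform(0.5, 1.0, n)
T = np.einsum('t,it,jt,kt->ijk', lams, V, V, V)

x = rng.standard_normal(n)
x /= np.linalg.norm(x)
for _ in range(30):              # quadratic convergence: few iterations needed
    x = phi(T, x)
    x /= np.linalg.norm(x)

align = np.abs(V.T @ x)          # |v_t^T x| for each t
t_star = int(np.argmax(align))   # the v_t the iteration converged to
```

Which v_t wins depends on the random start (it maximizes λ_t |c_t|), so only the alignment with *some* basis vector is checked.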
Review: matrix power iteration

Recall matrix power iteration for a matrix M := ∑_{t=1}^n λ_t v_t v_t^⊤: start with some x^(0), and for j = 1, 2, ...:

  x^(j) := M x^(j−1) = ∑_{t=1}^n λ_t (v_t^⊤ x^(j−1)) v_t,

i.e., the component in the v_t direction is scaled by λ_t. If λ_1 > λ_2, then

  (v_1^⊤ x^(j))^2 / ∑_{t=1}^n (v_t^⊤ x^(j))^2 ≥ 1 − k (λ_2/λ_1)^{2j}

for some k depending on x^(0), i.e., the iteration converges linearly to v_1 (assuming gap λ_2/λ_1 < 1).
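For contrast with the tensor case, a toy numpy run of matrix power iteration (the spectrum is chosen with an explicit gap; the 200-step budget reflects the O(log(1/ε)) linear rate):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal eigenvectors
lams = np.linspace(1.0, 0.1, n)                    # λ_1 > λ_2 > ... (clear gap)
M = (V * lams) @ V.T                               # M = Σ_t λ_t v_t v_t^T

x = rng.standard_normal(n)
for _ in range(200):                               # linear convergence
    x = M @ x
    x /= np.linalg.norm(x)

# x approaches the top eigenvector v_1 (up to sign)
err = min(np.linalg.norm(x - V[:, 0]), np.linalg.norm(x + V[:, 0]))
```

Unlike tensor power iteration, the limit here is always the top eigenvector, regardless of the (generic) initialization.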
Tensor power iteration convergence analysis

Let c_t := v_t^⊤ x^(0) (initial component in the v_t direction); assume WLOG λ_1 |c_1| > λ_2 |c_2| ≥ λ_3 |c_3| ≥ ···. Then

  x^(1) = ∑_{t=1}^n λ_t (v_t^⊤ x^(0))^2 v_t = ∑_{t=1}^n λ_t c_t^2 v_t,

i.e., the component in the v_t direction is squared and then scaled by λ_t. Easy to show

  (v_1^⊤ x^(j))^2 / ∑_{t=1}^n (v_t^⊤ x^(j))^2 ≥ 1 − k (λ_2 |c_2| / (λ_1 |c_1|))^{2^{j+1}},

where k depends on n and the ratios λ_1/λ_t.
Example

n = 1024, λ_t drawn u.a.r. from [0, 1].

[Six bar plots, one per slide, show the values of (v_t^⊤ x^(j))^2 for t = 1, 2, ..., 1024 at iterations j = 0, 1, 2, 3, 4, 5. The mass, initially spread across all 1024 coordinates (max ≈ 0.012 at j = 0), concentrates rapidly: max ≈ 0.045 at j = 1, ≈ 0.2 at j = 2, ≈ 0.7 at j = 3, and essentially 1 on a single coordinate by j = 4 and j = 5.]
Matrix vs. tensor power iteration

Matrix power iteration:
1. Requires a gap between the largest and second-largest λ_t. (Property of the matrix only.)
2. Converges to the top v_t.
3. Linear convergence. (Need O(log(1/ε)) iterations.)

Tensor power iteration:
1. Requires a gap between the largest and second-largest λ_t |c_t|. (Property of the tensor and the initialization x^(0).)
2. Converges to the v_t for which λ_t |c_t| is largest (could be any of them).
3. Quadratic convergence. (Need O(log log(1/ε)) iterations.)
Initialization of tensor power iteration

Convergence of tensor power iteration requires a gap between the largest and second-largest λ_t |v_t^⊤ x^(0)|.

Example of bad initialization: Suppose T = ∑_t v_t ⊗ v_t ⊗ v_t, and x^(0) = (v_1 + v_2)/√2. Then

  φ_T(x^(0)) = (v_1^⊤ x^(0))^2 v_1 + (v_2^⊤ x^(0))^2 v_2 = (1/2)(v_1 + v_2) = (1/√2) x^(0).

Fortunately, bad initialization points are atypical.
Full decomposition algorithm

Input: T ∈ R^{n×n×n}.
Initialize: T̃ := T.
For i = 1, 2, ..., n:
1. Pick x^(0) ∈ S^{n−1} uniformly at random.
2. Run tensor power iteration with T̃ starting from x^(0) for N iterations.
3. Set v̂_i := x^(N)/‖x^(N)‖ and λ̂_i := f_T̃(v̂_i).
4. Replace T̃ := T̃ − λ̂_i v̂_i ⊗ v̂_i ⊗ v̂_i.
Output: {(v̂_i, λ̂_i) : i ∈ [n]}.

Actually: repeat Steps 1-3 several times, and take the results of the trial yielding the largest λ̂_i.
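A literal numpy sketch of this loop on dense O(n^3) tensors (the function name `decompose` and the trial/iteration counts are illustrative choices, not the talk's tuned algorithm):

```python
import numpy as np

def phi(T, x):
    # φ_T(x)_i = Σ_{j,k} T[i,j,k] x_j x_k
    return np.einsum('ijk,j,k->i', T, x, x)

def f(T, x):
    # f_T(x) = Σ_{i,j,k} T[i,j,k] x_i x_j x_k
    return np.einsum('ijk,i,j,k->', T, x, x, x)

def decompose(T, n_iter=30, n_trials=10, rng=None):
    # Greedy deflation: extract one (v_t, λ_t) at a time via power iteration,
    # keeping the best of several random restarts (largest λ̂).
    rng = rng or np.random.default_rng(0)
    n = T.shape[0]
    T = T.copy()
    results = []
    for _ in range(n):
        best = None
        for _ in range(n_trials):
            x = rng.standard_normal(n)
            x /= np.linalg.norm(x)
            for _ in range(n_iter):
                x = phi(T, x)
                x /= np.linalg.norm(x)
            lam = f(T, x)
            if best is None or lam > best[1]:
                best = (x, lam)
        v, lam = best
        results.append((v, lam))
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)   # deflate
    return results
```

On an exactly orthogonally decomposable input, this recovers all n pairs to near machine precision; the perturbation analysis later in the talk addresses what happens when T is noisy.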
Aside: direct minimization

Can also consider directly minimizing

  ‖T − ∑_{t=1}^n λ̂_t v̂_t ⊗ v̂_t ⊗ v̂_t‖_F^2

via local optimization (e.g., coordinate descent, alternating least squares).

The decomposition algorithm via tensor power iteration can be viewed as an orthogonal greedy algorithm for minimizing the above objective [Zhang & Golub, '01].
Aside: implementation for bag-of-words models

Let f^(i) be the empirical word-frequency vector for document i:

  (f^(i))_j = (# times word j appears in document i) / (length of document i).

Matrix of word-pair frequencies (from m documents):

  Pairs := (1/m) ∑_{i=1}^m f^(i) ⊗ f^(i) ≈ ∑_{t=1}^K µ_t ⊗ µ_t.

Tensor of word-triple frequencies (from m documents):

  Triples := (1/m) ∑_{i=1}^m f^(i) ⊗ f^(i) ⊗ f^(i) ≈ ∑_{t=1}^K µ_t ⊗ µ_t ⊗ µ_t.
Aside: implementation for bag-of-words models

Use the inner product system given by ⟨x, y⟩ := x^⊤ Pairs^† y.
Why?: If Pairs = ∑_{t=1}^K µ_t ⊗ µ_t, then ⟨µ_i, µ_j⟩ = 1{i = j}: the {µ_t} are orthonormal under this inner product system.

Power iteration step:

  φ_Triples(x) := (1/m) ∑_{i=1}^m ⟨x, f^(i)⟩^2 f^(i) = (1/m) ∑_{i=1}^m (x^⊤ Pairs^† f^(i))^2 f^(i).

1. First compute y := Pairs^† x (use low-rank factors of Pairs).
2. Then compute (y^⊤ f^(i))^2 f^(i) for all documents i, and add them up (all sparse operations).

Final running time ∝ # topics × (model size + input size).
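A dense, unoptimized sketch of this power step, assuming numpy and reading the inner product as the pseudo-inverse of Pairs (the sparse and low-rank tricks from the slide, which give the stated running time, are omitted here; `triples_power_step` is an illustrative name):

```python
import numpy as np

def triples_power_step(F, x):
    # One step of φ_Triples under ⟨x, y⟩ := x^T Pairs^† y.
    # F is the m×n matrix whose rows are document frequency vectors f^(i).
    m = F.shape[0]
    pairs = F.T @ F / m                 # Pairs = (1/m) Σ_i f^(i) ⊗ f^(i)
    y = np.linalg.pinv(pairs) @ x       # y := Pairs^† x
    c = F @ y                           # c_i = y^T f^(i): one pass over documents
    return F.T @ (c ** 2) / m           # (1/m) Σ_i (y^T f^(i))^2 f^(i)

# toy corpus: 3 documents over a 4-word vocabulary (rows sum to 1)
F = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.25, 0.25, 0.25]])
out = triples_power_step(F, np.ones(4))
```

In practice one would keep a low-rank factorization of Pairs^† and exploit the sparsity of the f^(i), as the slide's two-step recipe suggests.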
Effect of errors in tensor power iterations

Suppose we are given T̂ := T + E, with

  T = ∑_{t=1}^n λ_t v_t ⊗ v_t ⊗ v_t,  ε := sup_{x∈S^{n−1}} ‖φ_E(x)‖.

What can we say about the resulting v̂_i and λ̂_i?
Perturbation analysis

Theorem: If ε ≤ O(min_t λ_t / n), then with high probability, a modified variant of the full decomposition algorithm returns {(v̂_i, λ̂_i) : i ∈ [n]} with

  ‖v̂_i − v_i‖ ≤ O(ε/λ_i),  |λ̂_i − λ_i| ≤ O(ε),  ∀ i ∈ [n].

Essentially a third-order analogue of Wedin's theorem for the SVD of matrices, but specific to the fixed-point iteration algorithm.

A similar analysis holds for the variational characterization.
Effect of errors in tensor power iterations

Quadratic operator φ_T̂ with T̂ := T + E:

  φ_T̂(x) = ∑_{t=1}^n λ_t (v_t^⊤ x)^2 v_t + φ_E(x).

Claim: If ε ≤ O(min_t λ_t / n) and N ≥ Ω(log(n) + log log(max_t λ_t / ε)), then N steps of tensor power iteration on T + E (with good initialization) gives

  ‖v̂_i − v_i‖ ≤ O(ε/λ_i),  |λ̂_i − λ_i| ≤ O(ε).
Deflation

(For simplicity, assume λ_1 = ··· = λ_n = 1.)

Using tensor power iteration on T̂ := T + E: approximate (say) v_1 with v̂_1 up to error ‖v_1 − v̂_1‖ ≤ ε.

Deflation danger: To find the next v_t, use

  T̂ − v̂_1 ⊗ v̂_1 ⊗ v̂_1 = ∑_{t=2}^n v_t ⊗ v_t ⊗ v_t + E + (v_1 ⊗ v_1 ⊗ v_1 − v̂_1 ⊗ v̂_1 ⊗ v̂_1).

Now the error seems to be of size 2ε... exponential explosion?
How do the errors look?

  E_1 := v_1 ⊗ v_1 ⊗ v_1 − v̂_1 ⊗ v̂_1 ⊗ v̂_1.

Take any direction x orthogonal to v_1:

  φ_{E_1}(x) = (v_1^⊤ x)^2 v_1 − (v̂_1^⊤ x)^2 v̂_1
             = −(v̂_1^⊤ x)^2 v̂_1
             = −((v̂_1 − v_1)^⊤ x)^2 v̂_1,

so ‖φ_{E_1}(x)‖ ≤ ‖v̂_1 − v_1‖^2 ≤ ε^2.

Effect of E + E_1 in directions orthogonal to v_1 is just (1 + o(1))ε.
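The ε² bound is easy to confirm numerically; a quick sketch assuming numpy (with λ_1 = 1 and `phi_rank1` an illustrative helper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
v = np.zeros(n)
v[0] = 1.0                                   # true v_1
vhat = v + 1e-3 * rng.standard_normal(n)     # estimate with small error
vhat /= np.linalg.norm(vhat)
eps = np.linalg.norm(vhat - v)               # ‖v̂_1 − v_1‖

def phi_rank1(u, x):
    # φ of the rank-1 tensor u ⊗ u ⊗ u applied to x: (u·x)^2 u
    return (u @ x) ** 2 * u

# pick a direction x orthogonal to v_1
x = rng.standard_normal(n)
x -= (v @ x) * v
x /= np.linalg.norm(x)

err = phi_rank1(v, x) - phi_rank1(vhat, x)   # φ_{E_1}(x)
```

The residual `err` has norm on the order of eps², not eps, matching the slide's calculation.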
Deflation analysis

Upshot: all errors due to deflation have only lower-order effects on the ability to find subsequent v_t.

The analogous statement for matrix power iteration is not true.
Recap and remarks

Orthogonally diagonalizable tensors have very nice identifiability, computational, and robustness properties.
Many analogues to the matrix SVD, but also many important differences arising from non-linearity.
The greedy algorithm for finding the decomposition can be rigorously analyzed and shown to be effective and efficient.
Many other approaches to moment-based estimation (e.g., subspace ID / OOMs, local optimization).
Other stuff I didn't talk about

1. Overcomplete tensor decomposition: K > n components in R^n.

  T = ∑_{t=1}^K λ_t v_t ⊗ v_t ⊗ v_t.

  ICA/blind source separation [Cardoso, 1991; Goyal et al, 2014]
  Mixture models [Bhaskara et al, 2014; Anderson et al, 2014]
  Dictionary learning [Barak et al, 2014]
  ...

2. General Tucker decompositions (CPD is a special case).
  Exploit other structure (e.g., sparsity)
  Talk to Anima about this!
Questions?
Tensor product of vector spaces

What is the tensor product V ⊗ W of vector spaces V and W?

Define objects E_{v,w} for v ∈ V and w ∈ W. Declare equivalences (for c ∈ R):

  E_{v_1 + v_2, w} ≡ E_{v_1, w} + E_{v_2, w}
  E_{v, w_1 + w_2} ≡ E_{v, w_1} + E_{v, w_2}
  c E_{v, w} ≡ E_{c v, w} ≡ E_{v, c w}

Pick any bases B_V for V and B_W for W.

  V ⊗ W := span of {E_{v,w} : v ∈ B_V, w ∈ B_W}, modulo equivalences

(eliminating the dependence on the choice of bases). Can check that V ⊗ W is a vector space.

v ⊗ w (the tensor product of v ∈ V and w ∈ W) is the equivalence class of E_{v,w} in V ⊗ W.
Tensor algebra perspective

From tensor algebra: Since {v_t : t ∈ [n]} is a basis for R^n,

  {v_i ⊗ v_j ⊗ v_k : i, j, k ∈ [n]}

is a basis for R^n ⊗ R^n ⊗ R^n (where ⊗ denotes the tensor product of vector spaces). Every tensor T ∈ R^n ⊗ R^n ⊗ R^n has a unique representation in this basis:

  T = ∑_{i,j,k} c_{i,j,k} v_i ⊗ v_j ⊗ v_k.

N.B.: dim(R^n ⊗ R^n ⊗ R^n) = n^3.
Aside: general bases for R^n ⊗ R^n ⊗ R^n

Pick any bases ({α_i}, {β_i}, {γ_i}) for R^n (not necessarily orthonormal). Basis for R^n ⊗ R^n ⊗ R^n:

  {α_i ⊗ β_j ⊗ γ_k : 1 ≤ i, j, k ≤ n}.

Every tensor T ∈ R^n ⊗ R^n ⊗ R^n has a unique representation in this basis:

  T = ∑_{i,j,k} c_{i,j,k} α_i ⊗ β_j ⊗ γ_k.

A tensor T such that c_{i,j,k} ≠ 0 ⟹ i = j = k is called diagonal:

  T = ∑_{i=1}^n c_{i,i,i} α_i ⊗ β_i ⊗ γ_i.

Claim: A tensor T can be diagonal w.r.t. at most one basis.
Aside: canonical polyadic decomposition

Rank-K canonical polyadic decomposition (CPD) of T (also called PARAFAC, CANDECOMP, or CP):

  T = ∑_{i=1}^K σ_i u_i ⊗ v_i ⊗ w_i.

Number of parameters: K(3n + 1) (compared to n^3 in general).
Fact: If T is diagonal w.r.t. bases, then it has a rank-K CPD with K ≤ n. Diagonal w.r.t. bases ⟹ non-overcomplete CPD.
N.B.: Overcomplete (K > n) CPD is also interesting and a possibility as long as K(3n + 1) ≪ n^3.
Initialization of tensor power iteration

Let t_max := arg max_t λ_t, and draw x^(0) ∈ S^{n−1} uniformly at random. Most coefficients of x^(0) are around 1/√n; the largest is around √(log(n)/n).

Almost surely, a gap exists:

  max_{t ≠ t_max} λ_t |v_t^⊤ x^(0)| / (λ_{t_max} |v_{t_max}^⊤ x^(0)|) < 1.

With probability ≥ 1/n^{1.2}, the gap is non-negligible:

  max_{t ≠ t_max} λ_t |v_t^⊤ x^(0)| / (λ_{t_max} |v_{t_max}^⊤ x^(0)|) < 0.9.

Try O(n^{1.3}) initializers; chances are at least one is good. (Very conservative estimate only; can be much better than this.)