Orthogonal tensor decomposition

Orthogonal tensor decomposition. Daniel Hsu, Columbia University. Largely based on the 2012 arXiv report "Tensor decompositions for learning latent variable models", with Anandkumar, Ge, Kakade, and Telgarsky.

The basic decomposition problem. Notation: for a vector $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$, $x \otimes x \otimes x$ denotes the 3-way array (call it a "tensor") in $\mathbb{R}^{n \times n \times n}$ whose $(i,j,k)$-th entry is $x_i x_j x_k$. Problem: given $T \in \mathbb{R}^{n \times n \times n}$ with the promise that $T = \sum_{t=1}^n \lambda_t \, v_t \otimes v_t \otimes v_t$ for some orthonormal basis $\{v_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(v_t, \lambda_t)\}$ (up to some desired precision).
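To make the promised structure concrete, here is a minimal numpy sketch (not from the talk) that builds a synthetic tensor of exactly this form; the names V, lam, and T are illustrative and are reused in the later sketches.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)

# Random orthonormal basis: rows of V are the vectors v_t.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
V = Q.T
lam = rng.uniform(0.5, 1.5, size=n)        # positive scalars lambda_t

# T[i,j,k] = sum_t lam[t] * V[t,i] * V[t,j] * V[t,k]
T = np.einsum('t,ti,tj,tk->ijk', lam, V, V, V)

# Sanity check: T is symmetric under permutation of its indices.
assert np.allclose(T, T.transpose(1, 0, 2))
assert np.allclose(T, T.transpose(2, 1, 0))
```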

Basic questions:
1. Is $\{(v_t, \lambda_t)\}$ uniquely determined?
2. If so, is there an efficient algorithm for finding the decomposition?
3. What if $T$ is perturbed by some small amount?
Perturbed problem: same as the original problem, except instead of $T$ we are given $T + E$ for some error tensor $E$. How large can $\|E\|$ be if we want $\varepsilon$ precision?

Analogous matrix problem. Matrix problem: given $M \in \mathbb{R}^{n \times n}$ with the promise that $M = \sum_{t=1}^n \lambda_t \, v_t v_t^\top$ for some orthonormal basis $\{v_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(v_t, \lambda_t)\}$ (up to some desired precision).

Analogous matrix problem. We're promised that $M$ is symmetric and positive definite, so the requested decomposition is an eigendecomposition. In this case, an eigendecomposition always exists and can be found efficiently; it is unique if and only if the $\{\lambda_i\}$ are distinct. What if $M$ is perturbed by some small amount? Perturbed matrix problem: same as the original problem, except instead of $M$ we are given $M + E$ for some error matrix $E$ (assumed to be symmetric). The answer is provided by matrix perturbation theory (e.g., Davis-Kahan), which requires $\|E\|_2 < \min_{i \neq j} |\lambda_i - \lambda_j|$.

Back to the original problem. Problem: given $T \in \mathbb{R}^{n \times n \times n}$ with the promise that $T = \sum_{t=1}^n \lambda_t \, v_t \otimes v_t \otimes v_t$ for some orthonormal basis $\{v_t\}$ of $\mathbb{R}^n$ (w.r.t. the standard inner product) and positive scalars $\{\lambda_t > 0\}$, approximately find $\{(v_t, \lambda_t)\}$ (up to some desired precision). Such decompositions do not necessarily exist, even for symmetric tensors. Where the decompositions do exist, the perturbed problem asks whether they are robust.

Main ideas. Easy claim: repeated application of a certain quadratic operator based on $T$ (a "power iteration") recovers a single $(v_t, \lambda_t)$ up to any desired precision. Self-reduction: replace $T$ with $T - \lambda_t \, v_t \otimes v_t \otimes v_t$. Why? Because $T - \lambda_t \, v_t \otimes v_t \otimes v_t = \sum_{\tau \neq t} \lambda_\tau \, v_\tau \otimes v_\tau \otimes v_\tau$. Catch: we don't recover $(v_t, \lambda_t)$ exactly, so we can actually only replace $T$ with $T - \lambda_t \, v_t \otimes v_t \otimes v_t + E_t$ for some error tensor $E_t$. Therefore, we must deal with perturbations anyway.

Rest of this talk:
1. Identifiability of the decomposition $\{(v_t, \lambda_t)\}$ from $T$.
2. A decomposition algorithm based on tensor power iteration.
3. Error analysis of the decomposition algorithm.

Identifiability of the decomposition. Orthonormal basis $\{v_t\}$ of $\mathbb{R}^n$, positive scalars $\{\lambda_t > 0\}$: $T = \sum_{t=1}^n \lambda_t \, v_t \otimes v_t \otimes v_t$. In what sense is $\{(v_t, \lambda_t)\}$ uniquely determined? Claim: $\{v_t\}$ are the $n$ isolated local maximizers of a certain cubic form $f_T : B^n \to \mathbb{R}$, and $f_T(v_t) = \lambda_t$.

Aside: multilinear form. There is a natural trilinear form associated with $T$: $(x, y, z) \mapsto \sum_{i,j,k} T_{i,j,k} \, x_i y_j z_k$. For matrices $M$, it looks like $(x, y) \mapsto \sum_{i,j} M_{i,j} \, x_i y_j = x^\top M y$.

Review: Rayleigh quotient. Recall the Rayleigh quotient for the matrix $M := \sum_{t=1}^n \lambda_t \, v_t v_t^\top$ (assuming $x \in S^{n-1}$): $R_M(x) := x^\top M x = \sum_{t=1}^n \lambda_t (v_t^\top x)^2$. Every $v_t$ such that $\lambda_t$ is maximal is a maximizer of $R_M$. (These are also the only local maximizers.)

The natural cubic form. Consider the function $f_T : B^n \to \mathbb{R}$ given by $x \mapsto f_T(x) = \sum_{i,j,k} T_{i,j,k} \, x_i x_j x_k$. For our promised $T = \sum_{t=1}^n \lambda_t \, v_t \otimes v_t \otimes v_t$, $f_T$ becomes
$f_T(x) = \sum_{t=1}^n \lambda_t \sum_{i,j,k} (v_t \otimes v_t \otimes v_t)_{i,j,k} \, x_i x_j x_k = \sum_{t=1}^n \lambda_t \sum_{i,j,k} (v_t)_i (v_t)_j (v_t)_k \, x_i x_j x_k = \sum_{t=1}^n \lambda_t (v_t^\top x)^3.$
Observation: $f_T(v_t) = \lambda_t$.
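Continuing the synthetic example from the earlier sketch (T, V, lam, n assumed in scope), this checks the observation $f_T(v_t) = \lambda_t$ numerically:

```python
import numpy as np

def f_T(T, x):
    # f_T(x) = sum_{i,j,k} T[i,j,k] * x_i * x_j * x_k
    return np.einsum('ijk,i,j,k->', T, x, x, x)

for t in range(n):
    assert np.isclose(f_T(T, V[t]), lam[t])   # f_T(v_t) = lambda_t
```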

Variational characterization. Claim: the isolated local maximizers of $f_T$ on $B^n$ are exactly $\{v_t\}$. Objective function (with constraint): $x \mapsto \inf_{\lambda \geq 0} \sum_{t=1}^n \lambda_t (v_t^\top x)^3 - 1.5\,\lambda (\|x\|_2^2 - 1)$. First-order condition for local maxima: $\sum_{t=1}^n \lambda_t (v_t^\top x)^2 \, v_t = \lambda x$. Second-order condition for isolated local maxima: $w^\top \big( 2 \sum_{t=1}^n \lambda_t (v_t^\top x) \, v_t v_t^\top - \lambda I \big) w < 0$ for all $w \perp x$.

Intuition behind variational characterization. We may as well assume $v_t$ is the $t$-th coordinate basis vector, so the problem is $\max_{x \in \mathbb{R}^n} f_T(x) = \sum_{t=1}^n \lambda_t x_t^3$ subject to $\sum_{t=1}^n x_t^2 \leq 1$. Intuition: suppose $\mathrm{supp}(x) = \{1, 2\}$ and $x_1, x_2 > 0$. Then $f_T(x) = \lambda_1 x_1^3 + \lambda_2 x_2^3 < \lambda_1 x_1^2 + \lambda_2 x_2^2 \leq \max\{\lambda_1, \lambda_2\}$. It is better to have $|\mathrm{supp}(x)| = 1$, i.e., to pick $x$ to be a coordinate basis vector.

Aside: canonical polyadic decomposition. Rank-$K$ canonical polyadic decomposition (CPD) of $T$ (also called PARAFAC, CANDECOMP, or CP): $T = \sum_{i=1}^K \sigma_i \, u_i \otimes v_i \otimes w_i$ [Harshman/Jennrich, 1970; Kruskal, 1977; Leurgans et al., 1993]. Number of parameters: $K(3n+1)$ (compared to $n^3$ in general). Fact: our promised $T$ has a rank-$n$ CPD. N.B.: overcomplete ($K > n$) CPD is also interesting and a possibility as long as $K(3n+1) \leq n^3$.

The quadratic operator. Easy claim: repeated application of a certain quadratic operator (based on $T$) recovers a single $(\lambda_t, v_t)$ up to any desired precision. For any $T \in \mathbb{R}^{n \times n \times n}$ and $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$, define the quadratic operator $\phi_T(x) := \sum_{i,j,k} T_{i,j,k} \, x_j x_k \, e_i \in \mathbb{R}^n$, where $e_i \in \mathbb{R}^n$ is the $i$-th coordinate basis vector. If $T = \sum_{t=1}^n \lambda_t \, v_t \otimes v_t \otimes v_t$, then $\phi_T(x) = \sum_{t=1}^n \lambda_t (v_t^\top x)^2 \, v_t$.
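A corresponding sketch of the quadratic operator, again continuing the synthetic example (T, V, lam, n, rng assumed in scope); it verifies the claimed closed form on a random input:

```python
import numpy as np

def phi_T(T, x):
    # phi_T(x)_i = sum_{j,k} T[i,j,k] * x_j * x_k
    return np.einsum('ijk,j,k->i', T, x, x)

x = rng.standard_normal(n)
expected = sum(lam[t] * (V[t] @ x) ** 2 * V[t] for t in range(n))
assert np.allclose(phi_T(T, x), expected)
```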

An algorithm? Recall the first-order condition for local maxima of $f_T(x) = \sum_{t=1}^n \lambda_t (v_t^\top x)^3$ for $x \in B^n$: $\phi_T(x) = \sum_{t=1}^n \lambda_t (v_t^\top x)^2 \, v_t = \lambda x$, i.e., an "eigenvector"-like condition. Algorithm: find $x \in B^n$ fixed under $x \mapsto \phi_T(x) / \|\phi_T(x)\|$. (Ignoring numerical issues, we can just repeatedly apply $\phi_T$ and defer normalization until later.) N.B.: gradient ascent also works [Kolda & Mayo, 2011].

Tensor power iteration [De Lathauwer et al., 2000]. Start with some $x^{(0)}$, and for $j = 1, 2, \ldots$: $x^{(j)} := \phi_T(x^{(j-1)}) = \sum_{t=1}^n \lambda_t (v_t^\top x^{(j-1)})^2 \, v_t$. Claim: for almost all initial $x^{(0)}$, the sequence $(x^{(j)} / \|x^{(j)}\|)_{j \geq 1}$ converges quadratically fast to some $v_t$.
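A sketch of the iteration itself, reusing phi_T and f_T from the sketches above; the iteration count and tolerances are arbitrary illustrative choices:

```python
import numpy as np

def tensor_power_iteration(T, x0, num_iters=30):
    x = x0 / np.linalg.norm(x0)
    for _ in range(num_iters):
        x = phi_T(T, x)             # apply the quadratic map
        x /= np.linalg.norm(x)      # normalize
    return x

x_hat = tensor_power_iteration(T, rng.standard_normal(n))
lam_hat = f_T(T, x_hat)
t_star = int(np.argmax(np.abs(V @ x_hat)))
assert abs(abs(V[t_star] @ x_hat) - 1.0) < 1e-8    # x_hat is (essentially) one of the v_t
assert abs(lam_hat - lam[t_star]) < 1e-8           # and f_T(x_hat) recovers its lambda_t
```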

Review: matrix power iteration. Recall matrix power iteration for the matrix $M := \sum_{t=1}^n \lambda_t \, v_t v_t^\top$. Start with some $x^{(0)}$, and for $j = 1, 2, \ldots$: $x^{(j)} := M x^{(j-1)} = \sum_{t=1}^n \lambda_t (v_t^\top x^{(j-1)}) \, v_t$, i.e., the component in the $v_t$ direction is scaled by $\lambda_t$. If $\lambda_1 > \lambda_2 \geq \cdots$, then
$\frac{(v_1^\top x^{(j)})^2}{\sum_{t=1}^n (v_t^\top x^{(j)})^2} \geq 1 - k \left( \frac{\lambda_2}{\lambda_1} \right)^{2j},$
i.e., convergence to $v_1$ is linear (assuming the gap $\lambda_2 / \lambda_1 < 1$).
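For contrast, a sketch of ordinary matrix power iteration on a matrix with the same eigenvectors (V, n, rng from the earlier sketches); the eigenvalues here are chosen explicitly so the gap $\lambda_2 / \lambda_1$ is comfortably below 1:

```python
import numpy as np

eigs = np.linspace(0.5, 1.5, n)              # distinct eigenvalues with a clear gap
M = np.einsum('t,ti,tj->ij', eigs, V, V)     # M = sum_t eigs[t] * v_t v_t^T
y = rng.standard_normal(n)
for _ in range(300):                         # linear convergence: many more steps needed
    y = M @ y
    y /= np.linalg.norm(y)
top = int(np.argmax(eigs))
assert abs(abs(V[top] @ y) - 1.0) < 1e-6     # converges to the top eigenvector
```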

Tensor power iteration convergence analysis. Let $c_t := v_t^\top x^{(0)}$ (the initial component in the $v_t$ direction); assume WLOG $\lambda_1 c_1 > \lambda_2 c_2 \geq \lambda_3 c_3 \geq \cdots$. Then $x^{(1)} = \sum_{t=1}^n \lambda_t (v_t^\top x^{(0)})^2 \, v_t = \sum_{t=1}^n \lambda_t c_t^2 \, v_t$, i.e., the component in the $v_t$ direction is squared and then scaled by $\lambda_t$. It is easy to show
$\frac{(v_1^\top x^{(j)})^2}{\sum_{t=1}^n (v_t^\top x^{(j)})^2} \geq 1 - k \, \lambda_1^2 \Big( \max_{t \neq 1} \frac{1}{\lambda_t^2} \Big) \left( \frac{\lambda_2 c_2}{\lambda_1 c_1} \right)^{2^{j+1}}.$

Example: $n = 1024$, $\lambda_t$ drawn uniformly at random from $[0, 1]$. [Plots of the values of $(v_t^\top x^{(j)})^2$ for $t = 1, 2, \ldots, 1024$ and $j = 0, 1, \ldots, 5$: the largest squared component grows from roughly $0.01$ at $j = 0$ to essentially $1$ by $j = 5$, with all other components collapsing toward $0$.]
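A self-contained sketch that reproduces this experiment; since materializing the full $n^3$ tensor is impractical at $n = 1024$, the quadratic map is applied in the $\{v_t\}$ basis ($c_t \mapsto \lambda_t c_t^2$), which is equivalent for this $T$:

```python
import numpy as np

def squared_components_demo(n=1024, num_iters=6, seed=1):
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.0, 1.0, size=n)              # lambda_t uniform at random on [0, 1]
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V = Q.T                                          # rows form an orthonormal basis {v_t}
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    for j in range(num_iters):
        c = V @ x                                    # c_t = v_t . x^(j)
        print(f"j = {j}: max_t (v_t . x^(j))^2 = {np.max(c ** 2):.3f}")
        x = V.T @ (lam * c ** 2)                     # phi_T(x) = sum_t lambda_t c_t^2 v_t
        x /= np.linalg.norm(x)

squared_components_demo()
```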

Matrix vs. tensor power iteration.
Matrix power iteration:
1. Requires a gap between the largest and second-largest $\lambda_t$ (a property of the matrix only).
2. Converges to the top $v_t$.
3. Linear convergence: needs $O(\log(1/\varepsilon))$ iterations.
Tensor power iteration:
1. Requires a gap between the largest and second-largest $\lambda_t c_t$ (a property of the tensor and the initialization $x^{(0)}$).
2. Converges to the $v_t$ for which $\lambda_t c_t$ is maximal (could be any of them).
3. Quadratic convergence: needs $O(\log \log(1/\varepsilon))$ iterations.

Initialization of tensor power iteration. Convergence of tensor power iteration requires a gap between the largest and second-largest $\lambda_t \, v_t^\top x^{(0)}$. Example of bad initialization: suppose $T = \sum_t v_t \otimes v_t \otimes v_t$ and $x^{(0)} = \frac{1}{\sqrt{2}}(v_1 + v_2)$. Then $\phi_T(x^{(0)}) = (v_1^\top x^{(0)})^2 \, v_1 + (v_2^\top x^{(0)})^2 \, v_2 = \frac{1}{2}(v_1 + v_2) = \frac{1}{\sqrt{2}} x^{(0)}$, so the normalized iterate never moves. Fortunately, bad initialization points are atypical.
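A quick numerical check of this bad initialization, reusing the orthonormal rows V (and n) from the first sketch, with all $\lambda_t = 1$ as on the slide:

```python
import numpy as np

x0 = (V[0] + V[1]) / np.sqrt(2)
phi = sum((V[t] @ x0) ** 2 * V[t] for t in range(n))   # phi_T(x0) with lambda_t = 1
assert np.allclose(phi, x0 / np.sqrt(2))               # fixed point after normalization
```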

Full decomposition algorithm.
Input: $T \in \mathbb{R}^{n \times n \times n}$. Initialize $\tilde{T} := T$.
For $i = 1, 2, \ldots, n$:
1. Pick $x^{(0)} \in S^{n-1}$ uniformly at random.
2. Run tensor power iteration with $\tilde{T}$ starting from $x^{(0)}$ for $N$ iterations.
3. Set $\hat{v}_i := x^{(N)} / \|x^{(N)}\|$ and $\hat{\lambda}_i := f_{\tilde{T}}(\hat{v}_i)$.
4. Replace $\tilde{T} := \tilde{T} - \hat{\lambda}_i \, \hat{v}_i \otimes \hat{v}_i \otimes \hat{v}_i$.
Output: $\{(\hat{v}_i, \hat{\lambda}_i) : i \in [n]\}$.
Actually: repeat steps 1-3 several times, and take the results of the trial yielding the largest $\hat{\lambda}_i$.
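A sketch of this greedy loop in numpy (illustrative only: it omits the repeated trials mentioned on the slide, and the iteration count is arbitrary):

```python
import numpy as np

def decompose(T, num_iters=30, seed=2):
    """Greedy orthogonal tensor decomposition: tensor power iteration + deflation."""
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    T_work = T.copy()
    estimates = []
    for _ in range(n):
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)
        for _ in range(num_iters):                            # tensor power iteration
            x = np.einsum('ijk,j,k->i', T_work, x, x)
            x /= np.linalg.norm(x)
        lam_hat = np.einsum('ijk,i,j,k->', T_work, x, x, x)   # f of the deflated tensor at x
        estimates.append((x, lam_hat))
        T_work = T_work - lam_hat * np.einsum('i,j,k->ijk', x, x, x)   # deflate
    return estimates

# e.g. on the synthetic T from the first sketch, each returned (v_hat, lam_hat)
# should match some (v_t, lambda_t) up to small error:
# estimates = decompose(T)
```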

Aside: direct minimization. One can also consider directly minimizing $\big\| T - \sum_{t=1}^n \hat{\lambda}_t \, \hat{v}_t \otimes \hat{v}_t \otimes \hat{v}_t \big\|_F^2$ via local optimization (e.g., coordinate descent, alternating least squares). The decomposition algorithm via tensor power iteration can be viewed as an orthogonal greedy algorithm for minimizing the above objective [Zhang & Golub, 2001].

Aside: implementation for bag-of-words models. Let $f^{(i)}$ be the empirical word frequency vector for document $i$: $(f^{(i)})_j = \frac{\#\text{ times word } j \text{ appears in document } i}{\text{length of document } i}$. Matrix of word-pair frequencies (from $m$ documents): $\mathrm{Pairs} := \frac{1}{m} \sum_{i=1}^m f^{(i)} \otimes f^{(i)} \approx \sum_{t=1}^K \mu_t \otimes \mu_t$. Tensor of word-triple frequencies (from $m$ documents): $\mathrm{Triples} := \frac{1}{m} \sum_{i=1}^m f^{(i)} \otimes f^{(i)} \otimes f^{(i)} \approx \sum_{t=1}^K \mu_t \otimes \mu_t \otimes \mu_t$.

Aside: implementation for bag-of-words models. Use the inner product system given by $\langle x, y \rangle := x^\top \mathrm{Pairs}^{\dagger} y$ (with $\mathrm{Pairs}^{\dagger}$ the pseudo-inverse). Why? If $\mathrm{Pairs} = \sum_{t=1}^K \mu_t \otimes \mu_t$, then $\langle \mu_i, \mu_j \rangle = 1_{\{i=j\}}$, i.e., $\{\mu_i\}$ are orthonormal under this inner product system. Power iteration step: $\phi_{\mathrm{Triples}}(x) := \frac{1}{m} \sum_{i=1}^m \langle x, f^{(i)} \rangle^2 f^{(i)} = \frac{1}{m} \sum_{i=1}^m (x^\top \mathrm{Pairs}^{\dagger} f^{(i)})^2 f^{(i)}$.
1. First compute $y := \mathrm{Pairs}^{\dagger} x$ (using the low-rank factors of $\mathrm{Pairs}$).
2. Then compute $(y^\top f^{(i)})^2 f^{(i)}$ for all documents $i$, and add them up (all sparse operations).
Final running time: on the order of (# topics) $\times$ (model size + input size).
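A sketch of one such power-iteration step with sparse document frequencies. The matrix F (m x vocabulary, row i holding f^(i)) and the explicit pseudo-inverse are illustrative assumptions; in practice the pseudo-inverse would be applied via the low-rank factors of Pairs, as the slide suggests:

```python
import numpy as np
import scipy.sparse as sp

def triples_power_step(F, Pairs_pinv, x):
    """One step x -> phi_Triples(x) under the Pairs-based inner product."""
    y = Pairs_pinv @ x                      # step 1: y := Pairs^+ x
    w = np.asarray(F @ y).ravel()           # w_i = <x, f^(i)> = y . f^(i)  (sparse mat-vec)
    return np.asarray(F.T @ (w ** 2)).ravel() / F.shape[0]   # step 2: (1/m) sum_i w_i^2 f^(i)

# Tiny illustrative usage with a random nonnegative "corpus" (real document-term data is sparse):
m, vocab = 1000, 50
rng = np.random.default_rng(3)
counts = rng.poisson(1.0, size=(m, vocab)) + 1               # +1 avoids empty documents
F = sp.csr_matrix(counts / counts.sum(axis=1, keepdims=True))
Pairs = np.asarray((F.T @ F).todense()) / m
x = triples_power_step(F, np.linalg.pinv(Pairs), rng.standard_normal(vocab))
```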

Effect of errors in tensor power iterations. Suppose we are given $\hat{T} := T + E$, with $T = \sum_{t=1}^n \lambda_t \, v_t \otimes v_t \otimes v_t$ and $\varepsilon := \sup_{x \in S^{n-1}} \|\phi_E(x)\|$. What can we say about the resulting $\hat{v}_i$ and $\hat{\lambda}_i$?

Perturbation analysis. Theorem: if $\varepsilon \leq O\big(\frac{\min_t \lambda_t}{n}\big)$, then with high probability, a modified variant of the full decomposition algorithm returns $\{(\hat{v}_i, \hat{\lambda}_i) : i \in [n]\}$ with $\|\hat{v}_i - v_i\| \leq O(\varepsilon / \lambda_i)$ and $|\hat{\lambda}_i - \lambda_i| \leq O(\varepsilon)$ for all $i \in [n]$. This is essentially a third-order analogue of Wedin's theorem for the SVD of matrices, but specific to the fixed-point iteration algorithm. A similar analysis holds for the variational characterization.

Effect of errors in tensor power iterations. The quadratic operator $\phi_{\hat{T}}$ with $\hat{T} = T + E$: $\phi_{\hat{T}}(x) = \sum_{t=1}^n \lambda_t (v_t^\top x)^2 \, v_t + \phi_E(x)$. Claim: if $\varepsilon \leq O\big(\frac{\min_t \lambda_t}{n}\big)$ and $N \geq \Omega\big(\log(n) + \log \log \frac{\max_t \lambda_t}{\varepsilon}\big)$, then $N$ steps of tensor power iteration on $T + E$ (with good initialization) give $\|\hat{v}_i - v_i\| \leq O(\varepsilon / \lambda_i)$ and $|\hat{\lambda}_i - \lambda_i| \leq O(\varepsilon)$.

Deflation. (For simplicity, assume $\lambda_1 = \cdots = \lambda_n = 1$.) Using tensor power iteration on $\hat{T} := T + E$, approximate (say) $v_1$ with $\hat{v}_1$ up to error $\|v_1 - \hat{v}_1\| \leq \varepsilon$. Deflation danger: to find the next $v_t$, use
$\hat{T} - \hat{v}_1 \otimes \hat{v}_1 \otimes \hat{v}_1 = \sum_{t=2}^n v_t \otimes v_t \otimes v_t + E + (v_1 \otimes v_1 \otimes v_1 - \hat{v}_1 \otimes \hat{v}_1 \otimes \hat{v}_1).$
Now the error seems to be of size $2\varepsilon$... exponential explosion?

How do the errors look? Let $E_1 := v_1 \otimes v_1 \otimes v_1 - \hat{v}_1 \otimes \hat{v}_1 \otimes \hat{v}_1$. Take any direction $x$ orthogonal to $v_1$:
$\phi_{E_1}(x) = (v_1^\top x)^2 v_1 - (\hat{v}_1^\top x)^2 \hat{v}_1 = -(\hat{v}_1^\top x)^2 \hat{v}_1$, so $\|\phi_{E_1}(x)\| = (\hat{v}_1^\top x)^2 = ((\hat{v}_1 - v_1)^\top x)^2 \leq \|\hat{v}_1 - v_1\|^2 \leq \varepsilon^2$.
The effect of $E + E_1$ in directions orthogonal to $v_1$ is just $(1 + o(1))\varepsilon$.
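A small self-contained numerical check of this calculation; the dimension, $\varepsilon$, and perturbation direction below are arbitrary illustrative choices:

```python
import numpy as np

dim, eps = 6, 1e-3
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
v1, u = Q[:, 0], Q[:, 1]                  # v_1 and a unit direction orthogonal to it
v1_hat = v1 + eps * u
v1_hat /= np.linalg.norm(v1_hat)          # ||v1_hat - v1|| is about eps

E1 = np.einsum('i,j,k->ijk', v1, v1, v1) - np.einsum('i,j,k->ijk', v1_hat, v1_hat, v1_hat)

x = u                                     # a direction orthogonal to v_1
phi = np.einsum('ijk,j,k->i', E1, x, x)   # phi_{E_1}(x)
print(np.linalg.norm(phi), "vs eps^2 =", eps ** 2)   # about eps^2, far smaller than eps
```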

Deflation analysis. Upshot: all errors due to deflation have only lower-order effects on the ability to find the subsequent $v_t$. The analogous statement for matrix power iteration is not true.

Recap and remarks. Orthogonally diagonalizable tensors have very nice identifiability, computational, and robustness properties. There are many analogues to the matrix SVD, but also many important differences arising from the non-linearity. The greedy algorithm for finding the decomposition can be rigorously analyzed and shown to be effective and efficient. There are many other approaches to moment-based estimation (e.g., subspace ID / OOMs, local optimization).

Other stuff I didn't talk about:
1. Overcomplete tensor decomposition: $K > n$ components in $\mathbb{R}^n$, $T = \sum_{t=1}^K \lambda_t \, v_t \otimes v_t \otimes v_t$. Applications include ICA / blind source separation [Cardoso, 1991; Goyal et al., 2014], mixture models [Bhaskara et al., 2014; Anderson et al., 2014], dictionary learning [Barak et al., 2014], ...
2. General Tucker decompositions (CPD is a special case); exploiting other structure (e.g., sparsity). Talk to Anima about this!

Questions?

Tensor product of vector spaces. What is the tensor product $V \otimes W$ of vector spaces $V$ and $W$? Define objects $E_{v,w}$ for $v \in V$ and $w \in W$. Declare the equivalences $E_{v_1 + v_2,\, w} \equiv E_{v_1, w} + E_{v_2, w}$, $E_{v,\, w_1 + w_2} \equiv E_{v, w_1} + E_{v, w_2}$, and $c\, E_{v,w} \equiv E_{cv, w} \equiv E_{v, cw}$ for $c \in \mathbb{R}$. Pick any bases $B_V$ for $V$ and $B_W$ for $W$. Then $V \otimes W :=$ the span of $\{E_{v,w} : v \in B_V, w \in B_W\}$, modulo the equivalences (eliminating the dependence on the choice of bases). One can check that $V \otimes W$ is a vector space. $v \otimes w$ (the tensor product of $v \in V$ and $w \in W$) is the equivalence class of $E_{v,w}$ in $V \otimes W$.

Tensor algebra perspective. From tensor algebra: since $\{v_t : t \in [n]\}$ is a basis for $\mathbb{R}^n$, $\{v_i \otimes v_j \otimes v_k : i, j, k \in [n]\}$ is a basis for $\mathbb{R}^n \otimes \mathbb{R}^n \otimes \mathbb{R}^n$ ($\otimes$ denotes the tensor product of vector spaces). Every tensor $T \in \mathbb{R}^n \otimes \mathbb{R}^n \otimes \mathbb{R}^n$ has a unique representation in this basis: $T = \sum_{i,j,k} c_{i,j,k} \, v_i \otimes v_j \otimes v_k$. N.B.: $\dim(\mathbb{R}^n \otimes \mathbb{R}^n \otimes \mathbb{R}^n) = n^3$.

Aside: general bases for $\mathbb{R}^n \otimes \mathbb{R}^n \otimes \mathbb{R}^n$. Pick any bases $(\{\alpha_i\}, \{\beta_i\}, \{\gamma_i\})$ for $\mathbb{R}^n$ (not necessarily orthonormal). A basis for $\mathbb{R}^n \otimes \mathbb{R}^n \otimes \mathbb{R}^n$ is $\{\alpha_i \otimes \beta_j \otimes \gamma_k : 1 \leq i, j, k \leq n\}$. Every tensor $T \in \mathbb{R}^n \otimes \mathbb{R}^n \otimes \mathbb{R}^n$ has a unique representation in this basis: $T = \sum_{i,j,k} c_{i,j,k} \, \alpha_i \otimes \beta_j \otimes \gamma_k$. A tensor $T$ such that $c_{i,j,k} \neq 0 \implies i = j = k$ is called diagonal: $T = \sum_{i=1}^n c_{i,i,i} \, \alpha_i \otimes \beta_i \otimes \gamma_i$. Claim: a tensor $T$ can be diagonal w.r.t. at most one basis.

Aside: canonical polyadic decomposition. Rank-$K$ canonical polyadic decomposition (CPD) of $T$ (also called PARAFAC, CANDECOMP, or CP): $T = \sum_{i=1}^K \sigma_i \, u_i \otimes v_i \otimes w_i$. Number of parameters: $K(3n+1)$ (compared to $n^3$ in general). Fact: if $T$ is diagonal w.r.t. some bases, then it has a rank-$K$ CPD with $K \leq n$; that is, diagonal w.r.t. bases $\implies$ non-overcomplete CPD. N.B.: overcomplete ($K > n$) CPD is also interesting and a possibility as long as $K(3n+1) \leq n^3$.

Initialization of tensor power iteration. Let $t_{\max} := \arg\max_t \lambda_t$, and draw $x^{(0)} \in S^{n-1}$ uniformly at random. Most coefficients of $x^{(0)}$ are around $1/\sqrt{n}$; the largest is around $\sqrt{\log(n)/n}$. Almost surely, a gap exists:
$\frac{\max_{t \neq t_{\max}} \lambda_t |v_t^\top x^{(0)}|}{\lambda_{t_{\max}} |v_{t_{\max}}^\top x^{(0)}|} < 1.$
With probability at least $1/n^{1.2}$, the gap is non-negligible:
$\frac{\max_{t \neq t_{\max}} \lambda_t |v_t^\top x^{(0)}|}{\lambda_{t_{\max}} |v_{t_{\max}}^\top x^{(0)}|} < 0.9.$
Try $O(n^{1.3})$ initializers; chances are at least one is good. (This is a very conservative estimate only; it can be much better than this.)