Guaranteed Learning of Latent Variable Models through Tensor Methods

Guaranteed Learning of Latent Variable Models through Tensor Methods Furong Huang University of Maryland furongh@cs.umd.edu ACM SIGMETRICS Tutorial 2018 1/75

Tutorial Topic: Learning algorithms (parameter estimation) for latent variable models based on decompositions of moment tensors. [Figure: unlabeled data → latent variable model → tensor decomposition → inference; method-of-moments (Pearson, 1894).] 2/75

Application 1: Clustering. Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint: the groups represent different distributions (e.g., Gaussians), and each data point is drawn from one of these distributions (e.g., Gaussian mixtures). 3/75

Application 2: Topic Modeling Document modeling Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization. 4/75

Application 3: Understanding Human Communities Social Networks Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of social actors. 5/75

Application 4: Recommender Systems Recommender System Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products. 6/75

Application 5: Feature Learning Feature Engineering Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures. 7/75

Application 6: Computational Biology Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups 8/75

Application 7: Human Disease Hierarchy Discovery. CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases. Scalable Latent Tree Model and its Application to Health Analytics, by F. Huang, N. U. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop. 9/75

How to model hidden effects? Basic approach: mixtures/clusters, where the hidden variable h is categorical. Advanced: probabilistic models in which the hidden variable h has more general distributions and can model mixed memberships. [Figure: graphical model with hidden nodes h_1, h_2, h_3 and observed nodes x_1, ..., x_5.] This talk: basic mixture model and some advanced models. 10/75

Challenges in Learning
Basic goal in all the applications mentioned: discover hidden structure in data, i.e., unsupervised learning. [Figure: unlabeled data → latent variable model → tensor decomposition → inference.]
Challenge: conditions for identifiability. Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?
Challenge: efficient learning of latent variable models. MCMC: random sampling, slow (exponential mixing time). Likelihood: non-convex, not scalable (exponentially many critical points). Can we achieve efficient computational and sample complexities?
Guaranteed and efficient learning through spectral methods. 11/75

What this tutorial will cover
Outline
1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents (identifiability; parameter recovery via decomposition of exact moments)
5 Error-tolerant Algorithms for Tensor Decompositions (decomposition for tensors with linearly independent components; decomposition for tensors with orthogonal components)
6 Tensor Decomposition for Neural Network Compression
7 Conclusion 12/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 13/75

Gaussian Mixture Model
Generative model: samples come from K different Gaussian components with mixing weights Cat(π_1, π_2, ..., π_K); each sample is drawn from one of the K Gaussians N(μ_h, Σ_h), h ∈ [K]:
H ~ Cat(π_1, π_2, ..., π_K),  X | H = h ~ N(μ_h, Σ_h),  h ∈ [K].
Learning problem: estimate the mean vector μ_h, covariance matrix Σ_h, and mixing weight π_h of each subpopulation from unlabeled data. 14/75
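To make the generative model concrete, here is a minimal NumPy sketch (not from the slides; the parameters pi, mus, sigmas below are made-up illustrative values) that draws samples by first choosing a component H ~ Cat(π) and then drawing X from the corresponding Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D mixture with K = 3 spherical components (illustrative values only).
pi = np.array([0.5, 0.3, 0.2])                        # mixing weights
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # component means
sigmas = np.array([0.5, 1.0, 0.8])                    # per-component std devs

def sample_gmm(n):
    """Draw n samples: H ~ Cat(pi), X | H = h ~ N(mu_h, sigma_h^2 I)."""
    h = rng.choice(len(pi), size=n, p=pi)             # latent component labels
    x = mus[h] + sigmas[h][:, None] * rng.standard_normal((n, 2))
    return x, h

X, H = sample_gmm(1000)
print(X.shape, np.bincount(H) / len(H))               # empirical mixing weights
```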

Maximum Likelihood Estimator (MLE)
Data {x_i}_{i=1}^n. Likelihood: Pr_θ(data) = ∏_{i=1}^n Pr_θ(x_i) (iid).
Model parameter estimation: θ_mle := argmax_{θ ∈ Θ} log Pr_θ(data).
Latent variable models: some variables are hidden, so there is no direct estimator; local optimization via Expectation-Maximization (EM) (Dempster, Laird, & Rubin, 1977). 15/75

MLE for Gaussian Mixture Models
Given data {x_i}_{i=1}^n and the number of Gaussian components K, the model parameters to be estimated are θ = {(μ_h, Σ_h, π_h)}_{h=1}^K.
θ_mle for Gaussian mixture models:
θ_mle := argmax_θ Σ_{i=1}^n log( Σ_{h=1}^K π_h det(Σ_h)^{-1/2} exp( -(1/2) (x_i − μ_h)^⊤ Σ_h^{-1} (x_i − μ_h) ) )
Solving for the MLE is NP-hard (Dasgupta, 2008; Aloise, Deshpande, Hansen, & Popat, 2009; Mahajan, Nimbhorkar, & Varadarajan, 2009; Vattani, 2009; Awasthi, Charikar, Krishnaswamy, & Sinop, 2015). 16/75

Definition: Consistent Estimator. Suppose iid samples {x_i}_{i=1}^n are generated by a distribution Pr_θ(x_i) whose model parameters θ ∈ Θ are unknown. An estimator θ̂ is consistent if E‖θ̂ − θ‖ → 0 as n → ∞.
Spherical Gaussian mixtures (Σ_h = I): for K = 2 and π_h = 1/2, EM is consistent (Xu, H., & Maleki, 2016; Daskalakis, Tzamos, & Zampetakis, 2016). For larger K, EM is easily trapped in local maxima far from the global max (Jin, Zhang, Balakrishnan, Wainwright, & Jordan, 2016). Practitioners often run EM with many (random) restarts, but it may take a long time to get near the global max. 17/75
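The "EM with many random restarts" practice can be sketched with scikit-learn's GaussianMixture, which runs EM from n_init random initializations and keeps the highest-likelihood fit; a minimal sketch, assuming a sample X like the one generated in the sketch above:

```python
from sklearn.mixture import GaussianMixture

# EM with 20 random restarts; fit() keeps the initialization with the highest
# log-likelihood. covariance_type='spherical' matches Sigma_h proportional to I.
gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      n_init=20, random_state=0)
gmm.fit(X)
print(gmm.weights_)   # estimated mixing weights pi_h
print(gmm.means_)     # estimated means mu_h
```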

Hardness of Parameter Estimation
It can be exponentially difficult, computationally or statistically, to learn model parameters, even in the parametric setting. Cryptographic hardness (e.g., Mossel & Roch, 2006); information-theoretic hardness (e.g., Moitra & Valiant, 2010). May require 2^Ω(K) running time or 2^Ω(K) sample size. 18/75

Ways Around the Hardness
Separation conditions: e.g., assume min_{i≠j} ‖μ_i − μ_j‖₂ / √(σ_i² + σ_j²) is sufficiently large (Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2002; ...).
Structural assumptions: e.g., assume sparsity or separability (anchor words) (Spielman, Wang & Wright, 2012; Arora, Ge & Moitra, 2012; ...).
Non-degeneracy conditions: e.g., assume μ_1, ..., μ_K span a K-dimensional space.
This tutorial: statistically and computationally efficient learning algorithms for non-degenerate instances via the method of moments. 19/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 20/75

Method-of-Moments At A Glance
1 Determine functions of the model parameters θ that are estimatable from observable data: moments E_θ[f(X)].
2 Form estimates of the moments using data (iid samples {x_i}_{i=1}^n): empirical moments Ê[f(X)].
3 Solve the approximate equations for the parameters θ: moment matching E_θ[f(X)] ≈ Ê[f(X)].
Toy example: how do we estimate a Gaussian, i.e., (μ, σ), given iid samples {x_i}_{i=1}^n ~ N(μ, σ²)? 21/75
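For the toy example, matching the first two moments gives μ̂ = Ê[X] and σ̂² = Ê[X²] − Ê[X]²; a minimal sketch with made-up true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5
x = rng.normal(mu_true, sigma_true, size=10_000)   # iid samples from N(mu, sigma^2)

# Moment matching: E[X] = mu, E[X^2] = mu^2 + sigma^2.
mu_hat = x.mean()
sigma_hat = np.sqrt((x ** 2).mean() - mu_hat ** 2)
print(mu_hat, sigma_hat)                           # both converge at rate n^{-1/2}
```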

What is a tensor? A multi-dimensional array: a higher-order analogue of a matrix. The number of dimensions is called the tensor order. 22/75

Tensor Product
[a ⊗ b]_{i1,i2} = a_{i1} b_{i2}  (rank-1 matrix)
[a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}  (rank-1 tensor) 23/75
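These rank-1 objects are just outer products; a minimal NumPy sketch (illustrative vectors only):

```python
import numpy as np

a, b, c = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])

M = np.outer(a, b)                      # [a ⊗ b]_{i1,i2} = a_{i1} b_{i2}, a rank-1 matrix
T = np.einsum("i,j,k->ijk", a, b, c)    # [a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}

print(M[0, 1] == a[0] * b[1])           # True
print(T[1, 0, 1] == a[1] * b[0] * c[1]) # True
```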

Slices Horizontal slices Lateral slices Frontal slices 24/75

Fiber Mode-1 (column) fibers Mode-2 (row) fibers Mode-3 (tube) fibers 25/75

CP decomposition: X = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h. Rank: the minimum number of rank-1 tensors whose sum generates the tensor. 26/75

Multi-linear Transform
Multi-linear operation: if T = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h, a multi-linear operation using matrices (X, Y, Z) is
T(X, Y, Z) := Σ_{h=1}^R (X^⊤ a_h) ⊗ (Y^⊤ b_h) ⊗ (Z^⊤ c_h).
Similarly, for a multi-linear operation using vectors (x, y, z),
T(x, y, z) := Σ_{h=1}^R ⟨x, a_h⟩ ⟨y, b_h⟩ ⟨z, c_h⟩. 27/75
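With the convention T(X, Y, Z)_{ijk} = Σ_{pqr} T_{pqr} X_{pi} Y_{qj} Z_{rk}, the multi-linear operation is a single tensor contraction; a minimal sketch (the helper names multilinear and multilinear_vec are ours, not from the slides):

```python
import numpy as np

def multilinear(T, X, Y, Z):
    """T(X, Y, Z)_{ijk} = sum_{pqr} T_{pqr} X_{pi} Y_{qj} Z_{rk}."""
    return np.einsum("pqr,pi,qj,rk->ijk", T, X, Y, Z)

def multilinear_vec(T, x, y, z):
    """T(x, y, z) = sum_{pqr} T_{pqr} x_p y_q z_r (a scalar)."""
    return np.einsum("pqr,p,q,r->", T, x, y, z)

# Sanity check on a rank-1 tensor a ⊗ b ⊗ c: T(X, Y, Z) = (X^T a) ⊗ (Y^T b) ⊗ (Z^T c).
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(4)
T = np.einsum("i,j,k->ijk", a, b, c)
X, Y, Z = (rng.standard_normal((4, 2)) for _ in range(3))
lhs = multilinear(T, X, Y, Z)
rhs = np.einsum("i,j,k->ijk", X.T @ a, Y.T @ b, Z.T @ c)
print(np.allclose(lhs, rhs))   # True
```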

Tensors in Method of Moments
Matrix: pair-wise relationships. Observed signal/data x ∈ R^d; rank-1 matrix [x ⊗ x]_{i,j} = x_i x_j; aggregated pair-wise relationship M_2 = E[x ⊗ x].
Tensor: triple-wise (or higher) relationships. Observed signal/data x ∈ R^d; rank-1 tensor [x ⊗ x ⊗ x]_{i,j,k} = x_i x_j x_k; aggregated triple-wise relationship M_3 = E[x ⊗ x ⊗ x] = E[x^⊗3]. 28/75
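A minimal sketch of the corresponding empirical moments, given a data matrix with one observation per row (note that forming M̂_3 explicitly costs O(d³) memory):

```python
import numpy as np

def empirical_moments(X):
    """X: (n, d) array of observations. Returns M2_hat (d, d) and M3_hat (d, d, d)."""
    n = X.shape[0]
    M2 = X.T @ X / n                              # (1/n) sum_i x_i ⊗ x_i
    M3 = np.einsum("ni,nj,nk->ijk", X, X, X) / n  # (1/n) sum_i x_i ⊗ x_i ⊗ x_i
    return M2, M3

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
M2, M3 = empirical_moments(X)
print(M2.shape, M3.shape)   # (5, 5) (5, 5, 5)
```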

Why are tensors powerful?
Matrix orthogonal decomposition: not unique without an eigenvalue gap, e.g., the 2×2 identity matrix has
I = e_1 e_1^⊤ + e_2 e_2^⊤ = u_1 u_1^⊤ + u_2 u_2^⊤,  with u_1 = [√2/2, √2/2]^⊤, u_2 = [√2/2, −√2/2]^⊤;
unique with an eigenvalue gap.
Tensor orthogonal decomposition (Harshman, 1970): unique, no eigenvalue gap needed; a slice of the tensor has an eigenvalue gap. 29/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 30/75

Topic Modeling
General topic model (e.g., Latent Dirichlet Allocation): K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K; a hidden topic proportion w^(i) ∈ Δ^{K−1} per document i; each document is an iid mixture of topics. [Figure: topic-word matrix (topics such as Sports, Science, Business, Politics over words such as "play", "game", "season") and word counts per document.] 31/75

Topic Modeling
Topic model for single-topic documents: K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K; the hidden topic proportion of document i is a basis vector, w^(i) ∈ {e_1, ..., e_K}; the words of a document are drawn iid from a_h. [Figure: single-topic case, topic proportion (1.0, 0, 0, 0).] 31/75

Model Parameters of Topic Model for Single-topic Documents
Topic proportion w = [w_1, ..., w_K], where w_h = P[topic of word = h].
Topic-word matrix A = [a_1, ..., a_K], where A_{jh} = P[word = e_j | topic = h].
Goal: estimate the model parameters {(a_h, w_h)}_{h=1}^K given iid samples of n documents (word counts {c^(i)}_{i=1}^n). Frequency vector x^(i) = c^(i)/L, where the document length is L = Σ_j c^(i)_j. 32/75

Moment Matching
Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ~ Cat(w_1, ..., w_K); generate L words iid from a_h.
E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h, since E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h. 33/75

Identifiability: how long must the documents be?
Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ~ Cat(w_1, ..., w_K); generate L words iid from a_h.
M_1: distribution of words (M̂_1: occurrence frequency of words):
M_1 = E[x] = Σ_h w_h a_h;  M̂_1 = (1/n) Σ_{i=1}^n x^(i).
[Figure: a word-frequency vector over (campus, police, witness) written as a weighted sum of topic vectors for Education, Sports, Crime.]
No unique decomposition of a vector. 33/75

M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs):
M_2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂_2 = (1/n) Σ_{i=1}^n x^(i) ⊗ x^(i).
Matrix decomposition recovers the subspace spanned by {a_h}, not the actual model. 33/75

Find a whitening matrix W such that the whitened components v_h = W^⊤ a_h are orthogonal; many such W exist, and any one of them suffices. 33/75

M_3: distribution of word triples (M̂_3: co-occurrence of word triples):
M_3 = E[x^⊗3] = Σ_h w_h a_h^⊗3;  M̂_3 = (1/n) Σ_{i=1}^n (x^(i))^⊗3.
Orthogonalize the tensor by projecting the data with W:
M_3(W, W, W) = E[(W^⊤ x)^⊗3] = Σ_h w_h (W^⊤ a_h)^⊗3;  M̂_3(W, W, W) = (1/n) Σ_{i=1}^n (W^⊤ x^(i))^⊗3.
This has a unique orthogonal tensor decomposition with components {v_h}_{h=1}^K. 33/75

Model parameter estimation: â_h = (W^⊤)^† v̂_h (up to scaling).
L ≥ 3: topic models for single-topic documents can be learned through matrix/tensor decomposition. 33/75

Take Away Message
Consider topic models whose word distributions under different topics are linearly independent. The parameters of the topic model for single-topic documents can be efficiently recovered from the distribution of three-word documents (word triples):
M_3 = E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h  (M̂_3: co-occurrence of word triples).
Two-word documents are not sufficient for identifiability. 34/75
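A quick numerical illustration of the take-away (not from the slides): sample single-topic documents, average e_{w1} ⊗ e_{w2} ⊗ e_{w3} over three distinct word draws per document (which avoids the diagonal bias of reusing the same word), and compare with Σ_h w_h a_h^⊗3. The topic proportions w and topic-word matrix A below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 6, 2, 50_000                        # vocab size, topics, documents (illustrative)
w = np.array([0.6, 0.4])                      # topic proportions
A = rng.dirichlet(np.ones(d), size=K).T       # columns a_h: word distribution per topic

M3_true = np.einsum("h,ih,jh,kh->ijk", w, A, A, A)   # sum_h w_h a_h ⊗ a_h ⊗ a_h

M3_hat = np.zeros((d, d, d))
for _ in range(n):
    h = rng.choice(K, p=w)                           # document topic
    w1, w2, w3 = rng.choice(d, size=3, p=A[:, h])    # three words drawn iid from a_h
    M3_hat[w1, w2, w3] += 1.0                        # e_{w1} ⊗ e_{w2} ⊗ e_{w3}
M3_hat /= n

print(np.max(np.abs(M3_hat - M3_true)))              # shrinks roughly like n^{-1/2}
```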

Tensor Methods Compared with Variational Inference
Learning topics from PubMed on Spark (8 million documents): [Figure: perplexity and running time for tensor vs. variational methods.]
Learning communities from graph connectivity (Facebook: n ≈ 20k; Yelp: n ≈ 40k; DBLPsub: n ≈ 0.1M; DBLP: n ≈ 1M): [Figure: error per group and running times on FB, YP, DBLPsub, DBLP.]
Orders of magnitude faster and more accurate.
Online Tensor Methods for Learning Latent Variable Models, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014. Tensor Methods on Apache Spark, F. Huang, A. Anandkumar, Oct. 2015. 35/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 36/75

Jennrich's Algorithm (Simplified)
Task: given tensor T = Σ_{h=1}^K μ_h^⊗3 with linearly independent components {μ_h}_{h=1}^K, find the components (up to scaling).
Properties of tensor slices: a linear combination of slices is T(I, I, c) = Σ_h ⟨μ_h, c⟩ μ_h μ_h^⊤.
Intuitions for Jennrich's algorithm: linear combinations of slices of the tensor share the same set of eigenvectors, and the shared eigenvectors are the tensor components {μ_h}_{h=1}^K. 37/75

Jennrich's Algorithm (Simplified)
Task: given tensor T = Σ_{h=1}^K μ_h^⊗3 with linearly independent components {μ_h}_{h=1}^K, find the components (up to scaling).
Algorithm: Jennrich's Algorithm
Require: tensor T ∈ R^{d×d×d}. Ensure: components {μ̂_h}_{h=1}^K = {μ_h}_{h=1}^K (a.s.).
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {μ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†
Consistency of Jennrich's algorithm: do the estimators {μ̂_h}_{h=1}^K recover the unknown components {μ_h}_{h=1}^K (up to scaling)? 38/75
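A minimal NumPy sketch of this simplified Jennrich procedure (our own illustration, with the pseudo-inverse written explicitly as in the analysis on the next slide; each recovered component is only defined up to sign and scale):

```python
import numpy as np

def jennrich(T, K, rng):
    """Recover K linearly independent components of T = sum_h mu_h^{⊗3}, up to scaling."""
    d = T.shape[0]
    c = rng.standard_normal(d);  c /= np.linalg.norm(c)      # random contraction vectors
    cp = rng.standard_normal(d); cp /= np.linalg.norm(cp)
    Tc = np.einsum("ijk,k->ij", T, c)                        # slice combination T(I, I, c)
    Tcp = np.einsum("ijk,k->ij", T, cp)                      # slice combination T(I, I, c')
    eigvals, eigvecs = np.linalg.eig(Tc @ np.linalg.pinv(Tcp))
    idx = np.argsort(-np.abs(eigvals))[:K]                   # keep the K dominant eigenvectors
    return np.real(eigvecs[:, idx])                          # columns ~ mu_h up to scaling

# Sanity check: random linearly independent components in R^6.
rng = np.random.default_rng(1)
mus = rng.standard_normal((6, 3))                            # columns mu_1, mu_2, mu_3
T = np.einsum("ih,jh,kh->ijk", mus, mus, mus)
est = jennrich(T, K=3, rng=rng)
# Compare directions: each column matches some mu_h up to sign and scale.
cos = np.abs((est / np.linalg.norm(est, axis=0)).T @ (mus / np.linalg.norm(mus, axis=0)))
print(np.round(cos, 3))                                      # permutation-like matrix of ~1s
```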

Analysis of Consistency of Jennrich's Algorithm
Recall: linear combinations of slices share the eigenvectors {μ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† = U D_c U^⊤ (U^⊤)^† D_{c′}^{-1} U^† = U (D_c D_{c′}^{-1}) U^†  (a.s.),
where U = [μ_1 ... μ_K] collects the linearly independent tensor components and D_c = Diag(⟨μ_1, c⟩, ..., ⟨μ_K, c⟩) is diagonal.
By linear independence of {μ_h}_{h=1}^K and the random choice of c and c′:
1 U has rank K;
2 D_c and D_{c′} are invertible (a.s.);
3 the diagonal entries of D_c D_{c′}^{-1} are distinct (a.s.).
So {μ_h}_{h=1}^K are the eigenvectors of T(I, I, c) T(I, I, c′)^† with distinct non-zero eigenvalues: Jennrich's algorithm is consistent. 39/75

Error-tolerant algorithms for tensor decompositions 40/75

Moment Estimator: Empirical Moments
Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.
Example. Third-order moment (distribution of word triples): E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h. Empirical third-order moment (co-occurrence frequency of word triples): Ê[x ⊗ x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i ⊗ x_i.
We inevitably expect an error of order n^{-1/2} in some norm, e.g.,
operator norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖ ≈ n^{-1/2}, where ‖T‖ := sup_{x,y,z ∈ S^{d−1}} T(x, y, z);
Frobenius norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖_F ≈ n^{-1/2}, where ‖T‖_F := √(Σ_{i,j,k} T_{i,j,k}²). 41/75

Stability of Jennrich's Algorithm
Recall Jennrich's algorithm: given tensor T = Σ_{h=1}^K μ_h^⊗3 with linearly independent components {μ_h}_{h=1}^K, recover the components (up to scaling) as the eigenvectors of T(I, I, c) T(I, I, c′)^†.
Challenge: we only have access to T̂ with ‖T − T̂‖ ≈ n^{-1/2}.
Stability of eigenvectors requires eigenvalue gaps. To ensure eigenvalue gaps for T̂(·,·,c) T̂(·,·,c′)^†, we need T̂(·,·,c) T̂(·,·,c′)^† ≈ T(·,·,c) T(·,·,c′)^†. Ultimately, ‖T − T̂‖_F ≤ 1/poly(d) is required.
A different approach? 42/75

Initial Ideas
In many applications, we estimate moments of the form M_3 = Σ_{h=1}^K w_h a_h^⊗3, where {a_h}_{h=1}^K are assumed to be linearly independent.
What if {a_h}_{h=1}^K were orthonormal? Then M_3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i for all i. This is analogous to matrix eigenvectors: Mv = M(I, v) = λv. So define orthonormal {a_h}_{h=1}^K as the eigenvectors of the tensor M_3.
Two problems: {a_h}_{h=1}^K is not orthogonal in general, and how do we find eigenvectors of a tensor? 43/75

Whitening is the process of finding a whitening matrix W such that the multi-linear operation (using W) on M_3 orthogonalizes its components:
M_3(W, W, W) = Σ_h w_h (W^⊤ a_h)^⊗3 = Σ_h w_h v_h^⊗3,  with v_h ⊥ v_{h′} for h ≠ h′. 44/75

Whitening
Given M_3 = Σ_h w_h a_h^⊗3 and M_2 = Σ_h w_h a_h ⊗ a_h, find a whitening matrix W such that the v_h = W^⊤ a_h are orthogonal. When A = [a_1, ..., a_K] ∈ R^{d×K} has full column rank, this is an invertible transformation. [Figure: a_1, a_2, a_3 mapped by W to orthogonal v_1, v_2, v_3.] 45/75

Using Whitening to Obtain an Orthogonal Tensor
Multi-linear transform: T = M_3(W, W, W) = Σ_h w_h (W^⊤ a_h)^⊗3, so T = Σ_{h∈[K]} w_h v_h^⊗3 has orthogonal components.
Dimensionality reduction when K ≪ d: M_3 ∈ R^{d×d×d} while T ∈ R^{K×K×K}. 46/75

How to Find the Whitening Matrix?
Given M_3 = Σ_h w_h a_h^⊗3 and M_2 = Σ_h w_h a_h ⊗ a_h, the goal is a W such that the components become orthogonal.
Use the pairwise moments M_2 to find W with W^⊤ M_2 W = I: take W = U Diag(λ̃)^{-1/2}, where M_2 = U Diag(λ̃) U^⊤ is the eigen-decomposition of M_2.
Then V := W^⊤ A Diag(w)^{1/2} is an orthogonal matrix, and
T = M_3(W, W, W) = Σ_h w_h^{-1/2} (W^⊤ a_h √w_h)^⊗3 = Σ_h λ_h v_h^⊗3,  with λ_h := w_h^{-1/2},
so T is an orthogonal tensor. 47/75
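A minimal sketch of the whitening construction: build W from the top-K eigen-pairs of M_2, form T = M_3(W, W, W), and check that the whitened components are orthonormal (the weights w and components A below are made-up inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3
w = np.array([0.5, 0.3, 0.2])
A = rng.standard_normal((d, K))                      # linearly independent columns a_h

M2 = np.einsum("h,ih,jh->ij", w, A, A)               # sum_h w_h a_h ⊗ a_h
M3 = np.einsum("h,ih,jh,kh->ijk", w, A, A, A)        # sum_h w_h a_h^{⊗3}

# Whitening matrix from the top-K eigen-pairs of M2: W = U diag(lam)^{-1/2}.
lam, U = np.linalg.eigh(M2)
lam, U = lam[-K:], U[:, -K:]                         # keep the K largest eigenvalues
W = U / np.sqrt(lam)                                 # so that W^T M2 W = I_K

T = np.einsum("pqr,pi,qj,rk->ijk", M3, W, W, W)      # T = M3(W, W, W), a K x K x K tensor

V = (W.T @ A) * np.sqrt(w)                           # columns v_h = sqrt(w_h) W^T a_h
print(np.allclose(W.T @ M2 @ W, np.eye(K)))          # True: M2 is whitened
print(np.allclose(V.T @ V, np.eye(K)))               # True: whitened components orthonormal
print(np.allclose(T, np.einsum("h,ih,jh,kh->ijk",
                               1 / np.sqrt(w), V, V, V)))  # True: T = sum_h w_h^{-1/2} v_h^{⊗3}
```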

Initial Ideas (recap)
We estimate moments of the form M_3 = Σ_h w_h a_h^⊗3 with linearly independent {a_h}_{h=1}^K; if the components were orthonormal, they would be tensor eigenvectors (M_3(I, a_i, a_i) = w_i a_i, analogous to Mv = M(I, v) = λv). Two problems: {a_h}_{h=1}^K is not orthogonal in general, and how do we find eigenvectors of a tensor? 48/75

Review: Orthogonal Matrix Eigen Decomposition
Task: given matrix M = Σ_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′} for h ≠ h′), find the components/eigenvectors.
Properties of matrix eigenvectors: fixed point of the linear transform, M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i.
Intuition for the matrix power method: the linear transform preserves the direction of the eigenvectors {v_h}_{h=1}^K. 49/75

Orthogonal Tensor Eigen Decomposition
Task: given tensor T = Σ_{h=1}^K λ_h v_h^⊗3 with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′} for h ≠ h′), find the components/eigenvectors.
Properties of tensor eigenvectors: fixed point of the bilinear transform, T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i.
Intuition for the tensor power method: the bilinear transform preserves the direction of the eigenvectors {v_h}_{h=1}^K. 50/75

Orthogonal Matrix Eigen Decomposition
Task: given matrix M = Σ_{h=1}^K λ_h v_h^⊗2 with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors.
Algorithm: Matrix Power Method
Require: matrix M ∈ R^{K×K}. Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K (w.h.p.).
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T,  λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate M ← M − λ̂_h v̂_h^⊗2
8: end for
Consistency of the matrix power method: does it converge, with {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? Does the convergence depend on the initialization? 51/75
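A minimal NumPy sketch of the matrix power method with deflation, for a symmetric M with distinct eigenvalues (T_iter plays the role of T in the pseudocode; this is our illustration, not code from the tutorial):

```python
import numpy as np

def matrix_power_method(M, K, T_iter=100, seed=0):
    """Recover K eigen-pairs of symmetric M = sum_h lambda_h v_h v_h^T by power iteration + deflation."""
    rng = np.random.default_rng(seed)
    M = M.copy()
    vals, vecs = [], []
    for _ in range(K):
        u = rng.standard_normal(M.shape[0]); u /= np.linalg.norm(u)
        for _ in range(T_iter):
            u = M @ u                       # linear map M(I, u)
            u /= np.linalg.norm(u)
        lam = u @ M @ u                     # Rayleigh quotient M(v, v)
        vals.append(lam); vecs.append(u)
        M = M - lam * np.outer(u, u)        # deflation
    return np.array(vals), np.column_stack(vecs)

# Sanity check against a known spectrum.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
true_vals = np.array([5.0, 3.0, 2.0, 1.0, 0.5])
M = Q @ np.diag(true_vals) @ Q.T
vals, vecs = matrix_power_method(M, K=5)
print(np.round(vals, 4))                    # ~ [5, 3, 2, 1, 0.5], recovered top-down
```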

Orthogonal Tensor Eigen Decomposition
Task: given tensor T = Σ_{h=1}^K λ_h v_h^⊗3 with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors.
Algorithm: Tensor Power Method
Require: tensor T ∈ R^{K×K×K}. Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K (w.h.p.).
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T,  λ̂_h ← T(v̂_h, v̂_h, v̂_h)
7:   Deflate T ← T − λ̂_h v̂_h^⊗3
8: end for
Consistency of the tensor power method: does it converge, with {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? Does the convergence depend on the initialization? 52/75
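And a corresponding minimal sketch of the tensor power method with deflation on an orthogonally decomposable tensor (a single random start per deflation step suffices for this well-conditioned example; in general one would use multiple restarts, as discussed in the analysis below):

```python
import numpy as np

def tensor_power_method(T, K, T_iter=50, seed=0):
    """Recover K eigen-pairs of T = sum_h lambda_h v_h^{⊗3} with orthonormal v_h."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    vals, vecs = [], []
    for _ in range(K):
        u = rng.standard_normal(T.shape[0]); u /= np.linalg.norm(u)
        for _ in range(T_iter):
            u = np.einsum("ijk,j,k->i", T, u, u)       # bilinear map T(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum("ijk,i,j,k->", T, u, u, u)     # T(v, v, v)
        vals.append(lam); vecs.append(u)
        T = T - lam * np.einsum("i,j,k->ijk", u, u, u) # deflation
    return np.array(vals), np.column_stack(vecs)

# Sanity check: orthonormal components from a QR factorization.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
true_vals = np.array([4.0, 2.5, 1.5, 1.0])
T = np.einsum("h,ih,jh,kh->ijk", true_vals, V, V, V)
vals, vecs = tensor_power_method(T, K=4)
# Recovery order depends on the initialization (largest lambda_h * c_h first), so sort.
print(np.round(np.sort(vals)[::-1], 4))                # ~ [4, 2.5, 1.5, 1]
print(np.round(np.abs(vecs.T @ V), 2))                 # permutation-like matrix of ~1s
```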

Analysis of Consistency of Matrix Power Method
Order the eigenvectors {v_h}_{h=1}^K so that the corresponding eigenvalues satisfy λ_1 ≥ λ_2 ≥ ... ≥ λ_K. Project the initial point u_0 onto the eigenvectors {v_h}_{h=1}^K: c_h = ⟨u_0, v_h⟩ for all h.
Convergence properties:
Unique (identifiable) iff {λ_h}_{h=1}^K are distinct.
If the gap λ_2/λ_1 < 1 and c_1 ≠ 0, the matrix power method converges to v_1; convergence is linear given the gap λ_2/λ_1 < 1.
The linear transform gives M(I, u_0) = Σ_h λ_h (v_h^⊤ u_0) v_h = Σ_h λ_h c_h v_h, i.e., the projection in the v_h direction is scaled by λ_h.
After t iterations, 1 − (v_1^⊤ u_t)² ≤ K max_{i≠1} (c_i/c_1)² (λ_2/λ_1)^{2t}. 53/75

Analysis of Consistency of Tensor Power Method
Project the initial point u_0 onto the eigenvectors: c_h = ⟨u_0, v_h⟩ for all h. Order the eigenvectors {v_h}_{h=1}^K so that λ_1 c_1 > λ_2 c_2 ≥ ... ≥ λ_K c_K.
Convergence properties:
Identifiable iff {λ_h c_h}_{h=1}^K are distinct; initialization dependent.
If λ_2 c_2 / (λ_1 c_1) < 1 and λ_1 c_1 ≠ 0, the tensor power method converges to v_1. Note that v_1 is NOT necessarily the largest eigenvector. Convergence is quadratic given the gap λ_2 c_2 / (λ_1 c_1) < 1.
The bilinear transform gives T(I, u_0, u_0) = Σ_h λ_h (v_h^⊤ u_0)² v_h = Σ_h λ_h c_h² v_h, i.e., the projection in the v_h direction is squared and then scaled by λ_h.
After t iterations, 1 − (v_1^⊤ u_t)² ≤ K max_{i≠1} (λ_1/λ_i)² (λ_2 c_2 / (λ_1 c_1))^{2^{t+1}}. 54/75

Matrix vs. Tensor Power Iteration
Matrix power iteration: (1) requires a gap between the largest and second-largest eigenvalues, a property of the matrix only; (2) converges to the top eigenvector; (3) linear convergence, needing O(log(1/ε)) iterations.
Tensor power iteration: (1) requires a gap between the largest and second-largest λ_h c_h, a property of the tensor and the initialization u_0; (2) converges to the v_h with the largest λ_h c_h, not necessarily the largest eigenvector; (3) quadratic convergence, needing O(log log(1/ε)) iterations. 55/75

Spurious Eigenvectors for Tensor Eigen Decomposition
T = Σ_{h∈[K]} λ_h v_h^⊗3. Characterization of eigenvectors: T(I, v, v) = λv?
{v_h}_{h=1}^K are eigenvectors, since T(I, v_h, v_h) = λ_h v_h.
Bad news: there can be other eigenvectors (unlike the matrix case). E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v_1 + v_2)/√2 satisfies T(I, v, v) = (1/√2) v.
How do we avoid spurious solutions (eigenvectors that are not components {v_h}_{h=1}^K)? The optimization viewpoint of tensor eigen decomposition will help: all spurious eigenvectors are saddle points. 56/75

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition 57/75

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition
Optimization problem:
Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).
Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).
Non-convex: stationary points = {global optima, local optima, saddle points}.
Stationary points (first derivative ∇L(v, λ) = 0):
Matrix: ∇L(v, λ) = 2(M(I, v) − λv) = 0, so eigenvectors are stationary points; the power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
Tensor: ∇L(v, λ) = 3(T(I, v, v) − λv) = 0, so eigenvectors are stationary points; the power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
Local optima (w^⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v):
Matrix: v_1 is the only local optimum; all other eigenvectors are saddle points.
Tensor: {v_h}_{h=1}^K are the only local optima; all spurious eigenvectors are saddle points. 58/75

Question: What about performance under noise? 59/75

Tensor Perturbation Analysis
T̂ = T + E,  T = Σ_h λ_h v_h^⊗3,  ‖E‖ := max_{‖x‖=1} E(x, x, x) ≤ ε.
Theorem: let T be the number of iterations. If T ≳ log K + log log(λ_max/ε) and ε < λ_min/K, then the output (v, λ) (after polynomially many restarts) satisfies
‖v − v_1‖ ≤ O(ε/λ_1),  |λ − λ_1| ≤ O(ε),
where v_1 is such that λ_1 c_1 > λ_2 c_2 ≥ ..., c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer.
A careful analysis of deflation avoids buildup of errors; this implies polynomial sample complexity for learning. 60/75

Other tensor decomposition techniques 61/75

Orthogonal Tensor Decomposition
Simultaneous power method (Wang & Lu, 2017): simultaneous recovery of eigenvectors; initialization is not optimal.
Orthogonalized simultaneous alternating least squares (Sharan & Valiant, 2017): random initialization; proven convergence for symmetric tensors.
Initialization: SVD-based initialization (Anandkumar & Janzamin, 2014); state-of-the-art (trace-based) initialization (Li & Huang, 2018). 62/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 63/75

Neural Network - Nonlinear Function Approximation
Success of deep neural networks: image classification, speech recognition, text processing; driven by growth in computation power and enormous labeled data.
Expressive power: linear vs. nonlinear composition; shallow networks vs. deep structures. 64/75

Revolution of Depth
[Figure: ImageNet classification top-5 error (%) vs. network depth: ILSVRC'10 28.2 (shallow), ILSVRC'11 25.8 (shallow), ILSVRC'12 AlexNet (8 layers) 16.4, ILSVRC'13 11.7 (8 layers), ILSVRC'14 VGG (19 layers) 7.3, ILSVRC'14 GoogleNet (22 layers) 6.7, ILSVRC'15 ResNet (152 layers) 3.57.]
[Figure: layer-by-layer architectures of AlexNet (8 layers, ILSVRC 2012), VGG (19 layers, ILSVRC 2014), and GoogleNet (22 layers, ILSVRC 2014).]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016. 65/75