Guaranteed Learning of Latent Variable Models through Tensor Methods

1 Guaranteed Learning of Latent Variable Models through Tensor Methods Furong Huang University of Maryland ACM SIGMETRICS Tutorial /75

2 Tutorial Topic Learning algorithms for latent variable models based on decompositions of moment tensors. [Figure: single-topic model with choice variable h selecting a topic (k1-k5), topic-word matrix A, and words (life, gene, data, DNA, RNA); pipeline: unlabeled data → latent variable model → tensor decomposition → inference.] Method-of-moments (Pearson, 1894). 2/75

3 Tutorial Topic Learning algorithms (parameter estimation) for latent variable models based on decompositions of moment tensors. [Figure: as above.] Method-of-moments (Pearson, 1894). 2/75

4 Application 1: Clustering Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. 3/75

5 Application 1: Clustering Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint The groups represent different distributions. (e.g. Gaussian). Each data point is drawn from one of the given distributions. (e.g. Gaussian mixtures). 3/75

6 Application 2: Topic Modeling Document modeling Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization. 4/75

7 Application 3: Understanding Human Communities Social Networks Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of social actors. 5/75

8 Application 4: Recommender Systems Recommender System Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products. 6/75

9 Application 5: Feature Learning Feature Engineering Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures. 7/75

10 Application 6: Computational Biology Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups 8/75

11 Application 7: Human Disease Hierarchy Discovery CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases. Scalable Latent Tree Model and its Application to Health Analytics by F. Huang, U. N. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop. 9/75

12 How to model hidden effects? Basic approach: mixtures/clusters; the hidden variable h is categorical. Advanced: probabilistic models; the hidden variable h has more general distributions and can model mixed memberships. [Figure: graphical model with hidden variables h1, h2, h3 and observed variables x1-x5.] This talk: basic mixture model and some advanced models. 10/75

13 Challenges in Learning Basic goal in all mentioned applications: discover hidden structure in data (unsupervised learning). [Figure: single-topic model with choice variable h, topic-word matrix A, and words (life, gene, data, DNA, RNA); pipeline: unlabeled data → latent variable model → learning algorithm → inference.] 11/75

14 Challenges in Learning Find hidden structure in data. Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? 11/75

15 Challenges in Learning Find hidden structure in data. Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? Challenge: Efficient Learning of Latent Variable Models MCMC: random sampling, slow; exponential mixing time. 11/75

16 Challenges in Learning Find hidden structure in data. Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? Challenge: Efficient Learning of Latent Variable Models MCMC: random sampling, slow; exponential mixing time. Likelihood: non-convex, not scalable; exponentially many critical points. 11/75

17 Challenges in Learning Find hidden structure in data. Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? Challenge: Efficient Learning of Latent Variable Models MCMC: random sampling, slow; exponential mixing time. Likelihood: non-convex, not scalable; exponentially many critical points. Efficient computational and sample complexities? 11/75

18 Challenges in Learning Find hidden structure in data. [Figure: pipeline: unlabeled data → latent variable model → tensor decomposition → inference.] Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? Challenge: Efficient Learning of Latent Variable Models MCMC: random sampling, slow; exponential mixing time. Likelihood: non-convex, not scalable; exponentially many critical points. Efficient computational and sample complexities? Guaranteed and efficient learning through spectral methods. 11/75

19 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 12/75

20 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 12/75

21 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 12/75

22 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents Identifiability Parameter recovery via decomposition of exact moments 12/75

23 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents Identifiability Parameter recovery via decomposition of exact moments 5 Error-tolerant Algorithms for Tensor Decompositions 12/75

24 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents Identifiability Parameter recovery via decomposition of exact moments 5 Error-tolerant Algorithms for Tensor Decompositions Decomposition for tensors with linearly independent components Decomposition for tensors with orthogonal components 12/75

25 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents Identifiability Parameter recovery via decomposition of exact moments 5 Error-tolerant Algorithms for Tensor Decompositions Decomposition for tensors with linearly independent components Decomposition for tensors with orthogonal components 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 12/75

26 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 13/75

27 Gaussian Mixture Model Generative Model Samples are drawn from a mixture of K Gaussians: the component is chosen as H ∼ Cat(π_1, π_2, ..., π_K), and each sample is drawn from the chosen Gaussian, X | H = h ∼ N(µ_h, Σ_h), h ∈ [K]. 14/75

28 Gaussian Mixture Model Generative Model Samples are drawn from a mixture of K Gaussians: H ∼ Cat(π_1, π_2, ..., π_K), X | H = h ∼ N(µ_h, Σ_h), h ∈ [K]. Learning Problem Estimate the mean vector µ_h, covariance matrix Σ_h, and mixing weight π_h of each subpopulation from unlabeled data. 14/75

29 Maximum Likelihood Estimator (MLE) Data {x_i}_{i=1}^n. Likelihood (i.i.d.): Pr_θ(data) = ∏_{i=1}^n Pr_θ(x_i). Model parameter estimation: θ̂_MLE := argmax_{θ ∈ Θ} log Pr_θ(data). Latent variable models: some variables are hidden. No direct estimators when some variables are hidden. Local optimization via Expectation-Maximization (EM) (Dempster, Laird, & Rubin, 1977). 15/75

30 MLE for Gaussian Mixture Models Given data {x_i}_{i=1}^n and the number of Gaussian components K, the model parameters to be estimated are θ = {(µ_h, Σ_h, π_h)}_{h=1}^K. θ̂_MLE for Gaussian Mixture Models: θ̂_MLE := argmax_θ ∑_{i=1}^n log ( ∑_{h=1}^K (π_h / det(Σ_h)^{1/2}) exp( −(1/2) (x_i − µ_h)^⊤ Σ_h^{−1} (x_i − µ_h) ) ). Solving for the MLE is NP-hard (Dasgupta, 2008; Aloise, Deshpande, Hansen, & Popat, 2009; Mahajan, Nimbhorkar, & Varadarajan, 2009; Vattani, 2009; Awasthi, Charikar, Krishnaswamy, & Sinop, 2015). 16/75

31 Definition: Consistent Estimator Suppose i.i.d. samples {x_i}_{i=1}^n are generated by the distribution Pr_θ(x), where the model parameters θ ∈ Θ are unknown. An estimator θ̂ is consistent if E‖θ̂ − θ‖ → 0 as n → ∞. Spherical Gaussian Mixtures (Σ_h = I), as n → ∞: For K = 2 and π_h = 1/2, EM is consistent (Xu, Hsu, & Maleki, 2016; Daskalakis, Tzamos, & Zampetakis, 2016). For larger K, EM is easily trapped in local maxima far from the global maximum (Jin, Zhang, Balakrishnan, Wainwright, & Jordan, 2016). Practitioners often run EM with many (random) restarts, but it may take a long time to get near the global maximum. 17/75

32 Hardness of Parameter Estimation It can be exponentially difficult, computationally or statistically, to learn model parameters, even in the parametric setting. Cryptographic hardness, e.g., Mossel & Roch, 2006. Information-theoretic hardness, e.g., Moitra & Valiant, 2010. May require 2^{Ω(K)} running time or 2^{Ω(K)} sample size. 18/75

33 Ways Around the Hardness Separation conditions. E.g., assume min_{i≠j} ‖µ_i − µ_j‖_2 / √(σ_i² + σ_j²) is sufficiently large. (Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2002; ...) Structural assumptions. E.g., assume sparsity or separability (anchor words). (Spielman, Wang & Wright, 2012; Arora, Ge & Moitra, 2012; ...) Non-degeneracy conditions. E.g., assume µ_1, ..., µ_K span a K-dimensional space. This tutorial: statistically and computationally efficient learning algorithms for non-degenerate instances via method-of-moments. 19/75

34 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 20/75

35 Method-of-Moments At A Glance 1 Determine functions of the model parameters θ that are estimable from observable data: Moments E_θ[f(X)]. 2 Form estimates of the moments using data (i.i.d. samples {x_i}_{i=1}^n): Empirical Moments Ê[f(X)]. 3 Solve the approximate equations for the parameters θ: Moment matching E_θ[f(X)] ≈ Ê[f(X)]. Toy Example How do we estimate a Gaussian variable, i.e., (µ, σ), given i.i.d. samples {x_i}_{i=1}^n ∼ N(µ, σ²)? 21/75
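A minimal sketch of the three steps for the toy example, assuming a 1-D Gaussian N(µ, σ²); the function name and test values are illustrative, not from the tutorial.

```python
import numpy as np

def gaussian_method_of_moments(x):
    """Estimate (mu, sigma) of a 1-D Gaussian by matching the first two moments.

    Step 1: E[X] = mu and E[X^2] = mu^2 + sigma^2 are functions of (mu, sigma).
    Step 2: replace the expectations with empirical averages.
    Step 3: solve the two moment equations for the parameters.
    """
    m1 = x.mean()             # empirical E[X]
    m2 = (x ** 2).mean()      # empirical E[X^2]
    mu_hat = m1
    sigma_hat = np.sqrt(max(m2 - m1 ** 2, 0.0))
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=10_000)
print(gaussian_method_of_moments(samples))   # approximately (2.0, 1.5)
```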

36 What is a tensor? A tensor is a multi-dimensional array, a higher-order analogue of a matrix. The number of dimensions is called the tensor order. 22/75

37 Tensor Product [a ⊗ b]_{i_1,i_2} = a_{i_1} b_{i_2}: a rank-1 matrix. [a ⊗ b ⊗ c]_{i_1,i_2,i_3} = a_{i_1} b_{i_2} c_{i_3}: a rank-1 tensor. 23/75
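A small numpy illustration of the rank-1 objects just defined (illustrative code, not from the tutorial): the outer product of two vectors is a rank-1 matrix, and of three vectors a rank-1 third-order tensor.

```python
import numpy as np

a, b, c = np.array([1., 2.]), np.array([3., 4., 5.]), np.array([6., 7.])

# [a ⊗ b]_{i1,i2} = a_{i1} * b_{i2}: rank-1 matrix of shape (2, 3)
ab = np.einsum('i,j->ij', a, b)

# [a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} * b_{i2} * c_{i3}: rank-1 tensor of shape (2, 3, 2)
abc = np.einsum('i,j,k->ijk', a, b, c)

assert np.isclose(ab[1, 2], a[1] * b[2])
assert np.isclose(abc[1, 2, 0], a[1] * b[2] * c[0])
```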

38 Slices Horizontal slices Lateral slices Frontal slices 24/75

39 Fiber Mode-1 (column) fibers Mode-2 (row) fibers Mode-3 (tube) fibers 25/75

40 CP decomposition X = ∑_{h=1}^R a_h ⊗ b_h ⊗ c_h. Rank: the minimum number of rank-1 tensors whose sum generates the tensor. 26/75

41 Multi-linear Transform Multi-linear Operation If T = ∑_{h=1}^R a_h ⊗ b_h ⊗ c_h, a multi-linear operation using matrices (X, Y, Z) is as follows: T(X, Y, Z) := ∑_{h=1}^R (X^⊤ a_h) ⊗ (Y^⊤ b_h) ⊗ (Z^⊤ c_h). Similarly, for a multi-linear operation using vectors (x, y, z): T(x, y, z) := ∑_{h=1}^R (x^⊤ a_h)(y^⊤ b_h)(z^⊤ c_h). 27/75
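A sketch of the multi-linear operation T(X, Y, Z) with numpy's einsum, under the convention above that each factor matrix acts on the corresponding mode via its transpose (illustrative code; the variable names are assumptions).

```python
import numpy as np

def multilinear(T, X, Y, Z):
    """Return T(X, Y, Z) with entries sum_{a,b,c} T[a,b,c] X[a,i] Y[b,j] Z[c,k]."""
    return np.einsum('abc,ai,bj,ck->ijk', T, X, Y, Z)

# Check against the CP form: if T = sum_h a_h ⊗ b_h ⊗ c_h, then
# T(X, Y, Z) = sum_h (X^T a_h) ⊗ (Y^T b_h) ⊗ (Z^T c_h).
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(5, 3)), rng.normal(size=(6, 3)), rng.normal(size=(7, 3))
T = np.einsum('ah,bh,ch->abc', A, B, C)
X, Y, Z = rng.normal(size=(5, 2)), rng.normal(size=(6, 2)), rng.normal(size=(7, 2))

lhs = multilinear(T, X, Y, Z)
rhs = np.einsum('ih,jh,kh->ijk', X.T @ A, Y.T @ B, Z.T @ C)
assert np.allclose(lhs, rhs)
```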

42 Tensors in Method of Moments Matrix: pair-wise relationships. Observed signal or data x ∈ R^d. Rank-1 matrix: [x ⊗ x]_{i,j} = x_i x_j. Aggregated pair-wise relationships: M_2 = E[x ⊗ x]. Tensor: triple-wise relationships or higher. Rank-1 tensor: [x ⊗ x ⊗ x]_{i,j,k} = x_i x_j x_k. Aggregated triple-wise relationships: M_3 = E[x ⊗ x ⊗ x] = E[x^{⊗3}]. 28/75
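A sketch of how the aggregated second- and third-order moments are formed from data with numpy (illustrative; the data here is synthetic).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))       # n = 1000 samples, d = 10 observed dimensions

# M2_hat[i, j]    = (1/n) sum_s x_s[i] * x_s[j]            (empirical E[x ⊗ x])
M2_hat = np.einsum('si,sj->ij', X, X) / X.shape[0]

# M3_hat[i, j, k] = (1/n) sum_s x_s[i] * x_s[j] * x_s[k]   (empirical E[x ⊗ x ⊗ x])
M3_hat = np.einsum('si,sj,sk->ijk', X, X, X) / X.shape[0]

print(M2_hat.shape, M3_hat.shape)     # (10, 10) (10, 10, 10)
```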

43 Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without an eigenvalue gap: I = e_1 e_1^⊤ + e_2 e_2^⊤ = u_1 u_1^⊤ + u_2 u_2^⊤, where u_1 = [√2/2, √2/2], u_2 = [√2/2, −√2/2]. 29/75

44 Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without an eigenvalue gap: I = e_1 e_1^⊤ + e_2 e_2^⊤ = u_1 u_1^⊤ + u_2 u_2^⊤, where u_1 = [√2/2, √2/2], u_2 = [√2/2, −√2/2]. Unique with an eigenvalue gap. 29/75

45 Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without an eigenvalue gap; unique with an eigenvalue gap. Tensor Orthogonal Decomposition (Harshman, 1970) Unique: no eigenvalue gap needed. 29/75

46 Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without an eigenvalue gap; unique with an eigenvalue gap. Tensor Orthogonal Decomposition (Harshman, 1970) Unique: no eigenvalue gap needed. A slice of the tensor has an eigenvalue gap. 29/75

48 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 30/75

49 Topic Modeling General Topic Model (e.g., Latent Dirichlet Allocation) K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K. Hidden topic proportion w per document i: w^(i) ∈ Δ^{K−1}. Each document is an i.i.d. mixture of topics. [Figure: topic-word matrix for Sports, Science, Business, Politics, and a word count per document, e.g., play, game, season.] 31/75

50 Topic Modeling Topic Model for Single-topic Documents K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K. Hidden topic proportion w per document i: w^(i) ∈ {e_1, ..., e_K}. The words of a document are drawn i.i.d. from a_h. [Figure: topic-word matrix and word count per document.] 31/75

51 Model Parameters of Topic Model for Single-topic Documents Estimate the topic proportions: w = [w_1, ..., w_K], w_h = P[topic of word = h]. Estimate the topic-word matrix: A = [a_1, ..., a_K], A_{jh} = P[word = e_j | topic = h]. Goal: estimate the model parameters {(a_h, w_h)}_{h=1}^K, given i.i.d. samples of n documents (word counts {c^(i)}_{i=1}^n). Frequency vector x^(i) = c^(i)/L, where the length of the document is L = ∑_j c_j^(i). 32/75
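A sketch of the generative process for single-topic documents and the word-frequency vectors x^(i) = c^(i)/L, with a toy vocabulary and made-up parameters (purely illustrative; none of these values come from the tutorial).

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, L, n = 3, 6, 50, 2000               # topics, vocabulary size, words/doc, documents
w = np.array([0.5, 0.3, 0.2])             # topic proportions (assumed)
A = rng.dirichlet(np.ones(d), size=K).T   # d x K topic-word matrix, columns a_h

docs = []
for _ in range(n):
    h = rng.choice(K, p=w)                     # choose the document's single topic
    words = rng.choice(d, size=L, p=A[:, h])   # draw L words i.i.d. from a_h
    c = np.bincount(words, minlength=d)        # word-count vector c^(i)
    docs.append(c / L)                         # frequency vector x^(i) = c^(i) / L
X = np.array(docs)                             # n x d matrix of frequency vectors
```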

52 Moment Matching Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ∼ Cat(w_1, ..., w_K); generate L words i.i.d. ∼ a_h. E[x] = ∑_{h=1}^K P[topic = h] E[x | topic = h]. E[x | topic = h] = ∑_j P[word = e_j | topic = h] e_j = a_h. 33/75

53 Moment Matching Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ∼ Cat(w_1, ..., w_K); generate L words i.i.d. ∼ a_h. E[x] = ∑_{h=1}^K P[topic = h] E[x | topic = h] = ∑_{h=1}^K w_h a_h. E[x | topic = h] = ∑_j P[word = e_j | topic = h] e_j = a_h. 33/75

54 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ∼ Cat(w_1, ..., w_K); generate L words i.i.d. ∼ a_h, so that E[x] = ∑_{h=1}^K w_h a_h. M_1: distribution of words (M̂_1: occurrence frequency of words): M_1 = E[x] = ∑_h w_h a_h; M̂_1 = (1/n) ∑_{i=1}^n x^(i). [Figure: a word-frequency vector over (campus, police, witness) written as a weighted sum of the Education, Sports, and Crime topic vectors.] 33/75

55 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_1: distribution of words (M̂_1: occurrence frequency of words): M_1 = E[x] = ∑_h w_h a_h; M̂_1 = (1/n) ∑_{i=1}^n x^(i). No unique decomposition of vectors. 33/75

56 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs): M_2 = E[x ⊗ x] = ∑_h w_h a_h ⊗ a_h; M̂_2 = (1/n) ∑_{i=1}^n x^(i) ⊗ x^(i). 33/75

57 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs): M_2 = E[x ⊗ x] = ∑_h w_h a_h ⊗ a_h; M̂_2 = (1/n) ∑_{i=1}^n x^(i) ⊗ x^(i). Matrix decomposition recovers the subspace, not the actual model. 33/75

58 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). Find a W such that the transformed components are orthogonal. M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs): M_2 = E[x ⊗ x] = ∑_h w_h a_h ⊗ a_h; M̂_2 = (1/n) ∑_{i=1}^n x^(i) ⊗ x^(i). Many such W exist; find one such that the v_h = W^⊤ a_h are orthogonal. 33/75

59 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). Know a W such that the transformed components are orthogonal. M_3: distribution of word triples (M̂_3: co-occurrence of word triples): M_3 = E[x^{⊗3}] = ∑_h w_h a_h^{⊗3}; M̂_3 = (1/n) ∑_{i=1}^n (x^(i))^{⊗3}. Orthogonalize the tensor by projecting the data with W: M_3(W, W, W). 33/75

60 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). Know a W such that the transformed components are orthogonal. M_3(W, W, W) = E[(W^⊤ x)^{⊗3}] = ∑_h w_h (W^⊤ a_h)^{⊗3}; M̂_3(W, W, W) = (1/n) ∑_{i=1}^n (W^⊤ x^(i))^{⊗3}. Unique orthogonal tensor decomposition yields {v̂_h}_{h=1}^K. 33/75

61 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_3(W, W, W) = E[(W^⊤ x)^{⊗3}] = ∑_h w_h (W^⊤ a_h)^{⊗3}; M̂_3(W, W, W) = (1/n) ∑_{i=1}^n (W^⊤ x^(i))^{⊗3}. Model parameter estimation: â_h = (W^⊤)^† v̂_h. 33/75

62 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_3(W, W, W) = E[(W^⊤ x)^{⊗3}] = ∑_h w_h (W^⊤ a_h)^{⊗3}; M̂_3(W, W, W) = (1/n) ∑_{i=1}^n (W^⊤ x^(i))^{⊗3}. L ≥ 3: learning topic models through matrix/tensor decomposition. 33/75

63 Take Away Message Consider topic models whose word distributions under different topics are linearly independent. The parameters of the topic model for single-topic documents can be efficiently recovered from the distribution of three-word documents. Distribution of three-word documents (word triples): M_3 = E[x ⊗ x ⊗ x] = ∑_h w_h a_h ⊗ a_h ⊗ a_h (M̂_3: co-occurrence of word triples). Two-word documents are not sufficient for identifiability. 34/75

64 Tensor Methods Compared with Variational Inference Learning Topics from PubMed on Spark: 8 million documents. [Plot: perplexity vs. running time (s), tensor vs. variational inference.] 35/75

65 Tensor Methods Compared with Variational Inference Learning Topics from PubMed on Spark: 8 million documents. [Plot: perplexity vs. running time (s), tensor vs. variational inference.] Learning Communities from Graph Connectivity: Facebook (n ≈ 20k), Yelp (n ≈ 40k), DBLPsub (n ≈ 0.1M), DBLP (n ≈ 1M). [Plots: error per group and running times (s) on FB, YP, DBLPsub, DBLP.] 35/75

66 Tensor Methods Compared with Variational Inference Learning Topics from PubMed on Spark: 8 million documents. Learning Communities from Graph Connectivity: Facebook (n ≈ 20k), Yelp (n ≈ 40k), DBLPsub (n ≈ 0.1M), DBLP (n ≈ 1M). Orders of magnitude faster and more accurate. [Plots: perplexity, error per group, and running times.] Online Tensor Methods for Learning Latent Variable Models, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR14. Tensor Methods on Apache Spark, by F. Huang, A. Anandkumar, Oct. 35/75

67 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 36/75

68 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). 37/75

69 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Properties of Tensor Slices A linear combination of slices: T(I, I, c) = ∑_h ⟨µ_h, c⟩ µ_h ⊗ µ_h. 37/75

70 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Properties of Tensor Slices A linear combination of slices: T(I, I, c) = ∑_h ⟨µ_h, c⟩ µ_h ⊗ µ_h. Intuitions for Jennrich's Algorithm 37/75

71 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Properties of Tensor Slices A linear combination of slices: T(I, I, c) = ∑_h ⟨µ_h, c⟩ µ_h ⊗ µ_h. Intuitions for Jennrich's Algorithm Linear combinations of slices of the tensor share the same set of eigenvectors. 37/75

72 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Properties of Tensor Slices A linear combination of slices: T(I, I, c) = ∑_h ⟨µ_h, c⟩ µ_h ⊗ µ_h. Intuitions for Jennrich's Algorithm Linear combinations of slices of the tensor share the same set of eigenvectors. The shared eigenvectors are the tensor components {µ_h}_{h=1}^K. 37/75

73 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Algorithm (Jennrich's Algorithm). Require: tensor T ∈ R^{d×d×d}. Ensure: components {µ̂_h}_{h=1}^K, equal to {µ_h}_{h=1}^K almost surely. 1: Sample c and c′ independently and uniformly at random from S^{d−1}. 2: Return {µ̂_h}_{h=1}^K ← the eigenvectors of T(I, I, c) T(I, I, c′)^†. 38/75

74 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Algorithm (Jennrich's Algorithm). Require: tensor T ∈ R^{d×d×d}. Ensure: components {µ̂_h}_{h=1}^K, equal to {µ_h}_{h=1}^K almost surely. 1: Sample c and c′ independently and uniformly at random from S^{d−1}. 2: Return {µ̂_h}_{h=1}^K ← the eigenvectors of T(I, I, c) T(I, I, c′)^†. Consistency of Jennrich's Algorithm? Do the estimators {µ̂_h}_{h=1}^K recover the unknown components {µ_h}_{h=1}^K (up to scaling)? 38/75
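A minimal numpy sketch of the simplified Jennrich's algorithm above, using a pseudo-inverse for T(I, I, c′) as in the analysis that follows (illustrative code; the recovered components are identified only up to scaling, and here they come back with unit norm).

```python
import numpy as np

def jennrich(T, K, rng=np.random.default_rng(0)):
    """Recover the K linearly independent components of T = sum_h mu_h^{⊗3} (up to scaling)."""
    d = T.shape[0]
    c1, c2 = rng.normal(size=d), rng.normal(size=d)
    c1, c2 = c1 / np.linalg.norm(c1), c2 / np.linalg.norm(c2)
    S1 = np.einsum('ijk,k->ij', T, c1)        # slice combination T(I, I, c)
    S2 = np.einsum('ijk,k->ij', T, c2)        # slice combination T(I, I, c')
    eigvals, eigvecs = np.linalg.eig(S1 @ np.linalg.pinv(S2))
    order = np.argsort(-np.abs(eigvals))      # keep the K dominant eigenvectors
    return np.real(eigvecs[:, order[:K]])

# Example: build a random rank-3 symmetric tensor and recover its components.
rng = np.random.default_rng(1)
U = rng.normal(size=(8, 3))                   # linearly independent components mu_h
T = np.einsum('ih,jh,kh->ijk', U, U, U)
U_hat = jennrich(T, K=3)                      # columns match columns of U up to scaling
```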

75 Analysis of Consistency of Jennrich's algorithm Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e., T(I, I, c) T(I, I, c′)^† = U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† = U (D_c D_{c′}^{−1}) U^†, where U = [µ_1 ... µ_K] contains the linearly independent tensor components and D_c = Diag(⟨µ_1, c⟩, ..., ⟨µ_K, c⟩) is diagonal. 39/75

76 Analysis of Consistency of Jennrich's algorithm Recall: T(I, I, c) T(I, I, c′)^† = U (D_c D_{c′}^{−1}) U^†. By the linear independence of {µ_i}_{i=1}^K and the random choice of c and c′: 1. U has rank K. 39/75

77 Analysis of Consistency of Jennrich's algorithm Recall: T(I, I, c) T(I, I, c′)^† = U (D_c D_{c′}^{−1}) U^†. By the linear independence of {µ_i}_{i=1}^K and the random choice of c and c′: 1. U has rank K; 2. D_c and D_{c′} are invertible (a.s.). 39/75

78 Analysis of Consistency of Jennrich's algorithm Recall: T(I, I, c) T(I, I, c′)^† = U (D_c D_{c′}^{−1}) U^†. By the linear independence of {µ_i}_{i=1}^K and the random choice of c and c′: 1. U has rank K; 2. D_c and D_{c′} are invertible (a.s.); 3. the diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.). 39/75

79 Analysis of Consistency of Jennrich's algorithm By the linear independence of {µ_i}_{i=1}^K and the random choice of c and c′: 1. U has rank K; 2. D_c and D_{c′} are invertible (a.s.); 3. the diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.). So {µ_i}_{i=1}^K are the eigenvectors of T(I, I, c) T(I, I, c′)^† with distinct non-zero eigenvalues: Jennrich's algorithm is consistent. 39/75

80 Error-tolerant algorithms for tensor decompositions 40/75

81 Moment Estimator: Empirical Moments 41/75

82 Moment Estimator: Empirical Moments Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the i.i.d. samples {x_i}_{i=1}^n. 41/75

83 Moment Estimator: Empirical Moments Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the i.i.d. samples {x_i}_{i=1}^n. Example Third-order moment (distribution of word triples): E[x ⊗ x ⊗ x] = ∑_h w_h a_h ⊗ a_h ⊗ a_h. Empirical third-order moment (co-occurrence frequency of word triples): Ê[x ⊗ x ⊗ x] = (1/n) ∑_{i=1}^n x_i ⊗ x_i ⊗ x_i. 41/75

84 Moment Estimator: Empirical Moments Example Third-order moment: E[x ⊗ x ⊗ x] = ∑_h w_h a_h ⊗ a_h ⊗ a_h; empirical third-order moment: Ê[x ⊗ x ⊗ x] = (1/n) ∑_{i=1}^n x_i ⊗ x_i ⊗ x_i. We inevitably expect an error of order n^{−1/2} in some norm, e.g., Operator norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖ ∼ n^{−1/2}, where ‖T‖ := sup_{x,y,z ∈ S^{d−1}} T(x, y, z). Frobenius norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖_F ∼ n^{−1/2}, where ‖T‖_F := √(∑_{i,j,k} T_{i,j,k}²). 41/75

85 Stability of Jennrich's Algorithm Recall Jennrich's algorithm: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, sample c, c′ independently and uniformly at random from S^{d−1} and return {µ̂_h}_{h=1}^K ← the eigenvectors of T(I, I, c) T(I, I, c′)^†. 42/75

86 Stability of Jennrich's Algorithm Recall Jennrich's algorithm (as above). Challenge: we only have access to T̂ such that ‖T̂ − T‖ ∼ n^{−1/2}. 42/75

87 Stability of Jennrich's Algorithm With only T̂ available, the algorithm returns the eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†; are these still close to the components? Challenge: we only have access to T̂ such that ‖T̂ − T‖ ∼ n^{−1/2}. 42/75

88 Stability of Jennrich's Algorithm Stability of eigenvectors requires eigenvalue gaps. 42/75

89 Stability of Jennrich's Algorithm Stability of eigenvectors requires eigenvalue gaps. To ensure eigenvalue gaps for T̂(I, I, c) T̂(I, I, c′)^†, it must be close to T(I, I, c) T(I, I, c′)^†. 42/75

90 Stability of Jennrich's Algorithm Stability of eigenvectors requires eigenvalue gaps. To ensure eigenvalue gaps for T̂(I, I, c) T̂(I, I, c′)^†, it must be close to T(I, I, c) T(I, I, c′)^†. Ultimately, ‖T̂ − T‖_F ≤ 1/poly(d) is required. 42/75

91 Stability of Jennrich's Algorithm Stability of eigenvectors requires eigenvalue gaps. Ultimately, ‖T̂ − T‖_F ≤ 1/poly(d) is required. A different approach? 42/75

92 Initial Ideas In many applications, we estimate moments of the form M_3 = ∑_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent. What if {a_h}_{h=1}^K were orthonormal? 43/75

93 Initial Ideas M_3 = ∑_{h=1}^K w_h a_h^{⊗3} with linearly independent {a_h}_{h=1}^K. What if {a_h}_{h=1}^K were orthonormal? Then M_3(I, a_i, a_i) = ∑_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i. 43/75

94 Initial Ideas M_3 = ∑_{h=1}^K w_h a_h^{⊗3} with linearly independent {a_h}_{h=1}^K. If the {a_h}_{h=1}^K were orthonormal, then M_3(I, a_i, a_i) = ∑_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i. Analogous to matrix eigenvectors: Mv = M(I, v) = λv. 43/75

95 Initial Ideas M_3 = ∑_{h=1}^K w_h a_h^{⊗3} with linearly independent {a_h}_{h=1}^K. If the {a_h}_{h=1}^K were orthonormal, then M_3(I, a_i, a_i) = w_i a_i, ∀i, analogous to matrix eigenvectors Mv = M(I, v) = λv. Define the orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M_3. 43/75

96 Initial Ideas M_3 = ∑_{h=1}^K w_h a_h^{⊗3} with linearly independent {a_h}_{h=1}^K. If the {a_h}_{h=1}^K were orthonormal, they would be eigenvectors of the tensor M_3 in the sense M_3(I, a_i, a_i) = w_i a_i. Two Problems: {a_h}_{h=1}^K is not orthogonal in general. How do we find the eigenvectors of a tensor? 43/75

98 Whitening Whitening is the process of finding a whitening matrix W such that a multi-linear operation (using W) on M_3 orthogonalizes its components: M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^{⊗3} = ∑_h w_h v_h^{⊗3}, with v_h ⊥ v_{h′} for h ≠ h′. 44/75

99 Whitening Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. 45/75

100 Whitening Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Find a whitening matrix W s.t. the v_h = W^⊤ a_h are orthogonal. 45/75

101 Whitening Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Find a whitening matrix W s.t. the v_h = W^⊤ a_h are orthogonal. When A = [a_1, ..., a_K] ∈ R^{d×K} has full column rank, this is an invertible transformation. [Figure: a_1, a_2, a_3 mapped by W to orthogonal v_1, v_2, v_3.] 45/75

102 Using Whitening to Obtain Orthogonal Tensor [Figure: the tensor M_3 is transformed into a tensor T.] 46/75

103 Using Whitening to Obtain Orthogonal Tensor Multi-linear transform: T = M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^{⊗3}. 46/75

104 Using Whitening to Obtain Orthogonal Tensor Multi-linear transform: T = M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^{⊗3}. T = ∑_{h ∈ [K]} w_h v_h^{⊗3} has orthogonal components. 46/75

105 Using Whitening to Obtain Orthogonal Tensor Multi-linear transform: T = M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^{⊗3}. T = ∑_{h ∈ [K]} w_h v_h^{⊗3} has orthogonal components. Dimensionality reduction when K ≪ d, as M_3 ∈ R^{d×d×d} and T ∈ R^{K×K×K}. 46/75

106 How to Find Whitening Matrix? Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Goal: a W such that the v_h = W^⊤ a_h are orthogonal. 47/75

107 How to Find Whitening Matrix? Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Goal: a W such that the v_h = W^⊤ a_h are orthogonal. Use the pairwise moments M_2 to find W s.t. W^⊤ M_2 W = I. 47/75

108 How to Find Whitening Matrix? Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Use the pairwise moments M_2 to find W s.t. W^⊤ M_2 W = I: W = U Diag(λ̃^{−1/2}), from the eigendecomposition M_2 = U Diag(λ̃) U^⊤. 47/75

109 How to Find Whitening Matrix? Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Use M_2 to find W s.t. W^⊤ M_2 W = I: W = U Diag(λ̃^{−1/2}), from the eigendecomposition M_2 = U Diag(λ̃) U^⊤. Then V := W^⊤ A Diag(w)^{1/2} is an orthogonal matrix, and T = M_3(W, W, W) = ∑_h w_h^{−1/2} (√(w_h) W^⊤ a_h)^{⊗3} = ∑_h λ_h v_h^{⊗3}, with λ_h := w_h^{−1/2}. T is an orthogonal tensor. 47/75
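A numpy sketch of the whitening step just described, building exact moments M_2 and M_3 from known (w, A) just to check the claim that the transformed tensor is orthogonal (illustrative; all parameter values are made up).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3
A = np.linalg.qr(rng.normal(size=(d, K)))[0] @ np.diag([1.0, 2.0, 3.0])  # full column rank
w = np.array([0.2, 0.3, 0.5])

M2 = np.einsum('h,ih,jh->ij', w, A, A)             # sum_h w_h a_h ⊗ a_h
M3 = np.einsum('h,ih,jh,kh->ijk', w, A, A, A)      # sum_h w_h a_h ⊗ a_h ⊗ a_h

# Whitening matrix from the top-K eigenpairs of M2, so that W^T M2 W = I_K.
lam, U = np.linalg.eigh(M2)
lam, U = lam[-K:], U[:, -K:]
W = U @ np.diag(lam ** -0.5)

T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)    # T = M3(W, W, W), a K x K x K tensor

V = W.T @ A @ np.diag(w ** 0.5)                    # columns v_h = sqrt(w_h) W^T a_h
assert np.allclose(V.T @ V, np.eye(K), atol=1e-8)  # V is orthogonal, so T is an orthogonal tensor
```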

110 Initial Ideas In many applications, we estimate moments of the form M_3 = ∑_h w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent. If the {a_h}_{h=1}^K were orthonormal, then M_3(I, a_i, a_i) = ∑_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i, analogous to matrix eigenvectors Mv = M(I, v) = λv; we could define the orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M_3. Two Problems: {a_h}_{h=1}^K is not orthogonal in general. How do we find the eigenvectors of a tensor? 48/75

111 Review: Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, h ≠ h′), find the components/eigenvectors. 49/75

112 Review: Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, h ≠ h′), find the components/eigenvectors. Properties of Matrix Eigenvectors Fixed point of a linear transform: M(I, v_i) = ∑_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i. 49/75

113 Review: Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Properties of Matrix Eigenvectors Fixed point of a linear transform: M(I, v_i) = ∑_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i. Intuitions for the Matrix Power Method 49/75

114 Review: Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Properties of Matrix Eigenvectors Fixed point of a linear transform: M(I, v_i) = ∑_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i. Intuitions for the Matrix Power Method The linear transform preserves the direction of the eigenvectors {v_h}_{h=1}^K. 49/75

115 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, h ≠ h′), find the components/eigenvectors. 50/75

116 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, h ≠ h′), find the components/eigenvectors. Properties of Tensor Eigenvectors Fixed point of a bi-linear transform: T(I, v_i, v_i) = ∑_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i. 50/75

117 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Properties of Tensor Eigenvectors Fixed point of a bi-linear transform: T(I, v_i, v_i) = ∑_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i. Intuitions for the Tensor Power Method 50/75

118 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Properties of Tensor Eigenvectors Fixed point of a bi-linear transform: T(I, v_i, v_i) = ∑_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i. Intuitions for the Tensor Power Method The bilinear transform preserves the direction of the eigenvectors {v_h}_{h=1}^K. 50/75

119 Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h^{⊗2} with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Algorithm (Matrix Power Method). Require: matrix M ∈ R^{K×K}. Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K w.h.p. 1: for h = 1 : K do 2: sample u_0 uniformly at random from S^{K−1}; 3: for i = 1 : T do 4: u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖; 5: end for; 6: v̂_h ← u_T, λ̂_h ← M(v̂_h, v̂_h); 7: deflate M ← M − λ̂_h v̂_h^{⊗2}; 8: end for. 51/75

120 Orthogonal Matrix Eigen Decomposition Matrix Power Method (as above). Consistency of the Matrix Power Method? Is there convergence, i.e., {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? 51/75

121 Orthogonal Matrix Eigen Decomposition Matrix Power Method (as above). Consistency of the Matrix Power Method? Is there convergence, i.e., {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? Does the convergence depend on the initialization? 51/75

122 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Algorithm (Tensor Power Method). Require: tensor T ∈ R^{K×K×K}. Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K w.h.p. 1: for h = 1 : K do 2: sample u_0 uniformly at random from S^{K−1}; 3: for i = 1 : T do 4: u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖; 5: end for; 6: v̂_h ← u_T, λ̂_h ← T(v̂_h, v̂_h, v̂_h); 7: deflate T ← T − λ̂_h v̂_h^{⊗3}; 8: end for. 52/75

123 Orthogonal Tensor Eigen Decomposition Tensor Power Method (as above). Consistency of the Tensor Power Method? Is there convergence, i.e., {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? Does the convergence depend on the initialization? 52/75
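A compact numpy sketch of the tensor power method with deflation, following the pseudocode above (illustrative; it uses a fixed number of random restarts per component rather than a tuned restart schedule, and keeps the restart with the largest estimated eigenvalue).

```python
import numpy as np

def tensor_power_method(T, K, n_iter=100, n_restarts=10, rng=np.random.default_rng(0)):
    """Decompose an (approximately) orthogonal tensor T ≈ sum_h lam_h v_h^{⊗3}."""
    T = T.copy()
    lams, vecs = [], []
    for _ in range(K):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            u = rng.normal(size=T.shape[0])
            u /= np.linalg.norm(u)
            for _ in range(n_iter):
                u = np.einsum('ijk,j,k->i', T, u, u)        # u <- T(I, u, u)
                u /= np.linalg.norm(u)
            lam = np.einsum('ijk,i,j,k->', T, u, u, u)       # lam <- T(u, u, u)
            if lam > best_lam:                               # keep the best restart
                best_v, best_lam = u, lam
        vecs.append(best_v)
        lams.append(best_lam)
        T = T - best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)  # deflate
    return np.array(lams), np.array(vecs)

# Example on an exactly orthogonal tensor: recovers (lam_h, v_h) up to ordering.
V = np.linalg.qr(np.random.default_rng(1).normal(size=(5, 3)))[0]
T = np.einsum('h,ih,jh,kh->ijk', np.array([3., 2., 1.]), V, V, V)
lams, vecs = tensor_power_method(T, K=3)
```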

124 Analysis of Consistency of Matrix Power Method Order the eigenvectors {v_h}_{h=1}^K so that the corresponding eigenvalues satisfy λ_1 ≥ λ_2 ≥ ... ≥ λ_K. Project the initial point u_0 onto the eigenvectors {v_h}_{h=1}^K: c_h = ⟨u_0, v_h⟩, ∀h. Convergence properties: Unique (identifiable) iff the {λ_h}_{h=1}^K are distinct. If the gap λ_2/λ_1 < 1 and c_1 ≠ 0, the matrix power method converges to v_1. Converges linearly to v_1 assuming the gap λ_2/λ_1 < 1. The linear transform satisfies M(I, u_0) = ∑_h λ_h (v_h^⊤ u_0) v_h = ∑_h λ_h c_h v_h, i.e., the projection in the v_h direction is scaled by λ_h. After t iterations, 1 − ⟨v_1, u_t⟩² ≲ K (λ_2/λ_1)^{2t}. 53/75

125 Analysis of Consistency of Tensor Power Method Project the initial point u_0 onto the eigenvectors: c_h = ⟨u_0, v_h⟩, ∀h. Order the eigenvectors {v_h}_{h=1}^K so that λ_1|c_1| > λ_2|c_2| ≥ ... ≥ λ_K|c_K|. Convergence properties: Identifiable iff the {λ_h c_h}_{h=1}^K are distinct. Initialization dependent: if the gap λ_2|c_2|/(λ_1|c_1|) < 1 and λ_1 c_1 ≠ 0, the tensor power method converges to v_1. Note v_1 is NOT necessarily the top eigenvector. Converges quadratically to v_1 assuming the gap λ_2|c_2|/(λ_1|c_1|) < 1. The bi-linear transform satisfies T(I, u_0, u_0) = ∑_h λ_h (v_h^⊤ u_0)² v_h = ∑_h λ_h c_h² v_h, i.e., the projection in the v_h direction is squared and then scaled by λ_h. After t iterations, 1 − ⟨v_1, u_t⟩² ≲ K max_{i≠1}(λ_i/λ_1)² · |λ_2 c_2/(λ_1 c_1)|^{2^{t+1}}. 54/75

126 Matrix vs. tensor power iteration Matrix power iteration: Tensor power iteration: 55/75

127 Matrix vs. tensor power iteration Matrix power iteration: 1. Requires a gap between the largest and second-largest eigenvalues; a property of the matrix only. Tensor power iteration: 1. Requires a gap between the largest and second-largest |λ_h c_h|; a property of the tensor and of the initialization u_0. 55/75

128 Matrix vs. tensor power iteration Matrix power iteration: 1. Requires a gap between the largest and second-largest eigenvalues; a property of the matrix only. 2. Converges to the top eigenvector. Tensor power iteration: 1. Requires a gap between the largest and second-largest |λ_h c_h|; a property of the tensor and of the initialization u_0. 2. Converges to the v_h with the largest |λ_h c_h|; not necessarily the top eigenvector. 55/75

129 Matrix vs. tensor power iteration Matrix power iteration: 1. Requires a gap between the largest and second-largest eigenvalues; a property of the matrix only. 2. Converges to the top eigenvector. 3. Linear convergence; needs O(log(1/ǫ)) iterations. Tensor power iteration: 1. Requires a gap between the largest and second-largest |λ_h c_h|; a property of the tensor and of the initialization u_0. 2. Converges to the v_h with the largest |λ_h c_h|; not necessarily the top eigenvector. 3. Quadratic convergence; needs O(log log(1/ǫ)) iterations. 55/75

130 Spurious Eigenvectors for Tensor Eigen Decomposition T = ∑_{h∈[K]} λ_h v_h^{⊗3}. Characterization of eigenvectors: T(I, v, v) = λv? {v_h}_{h=1}^K are eigenvectors, since T(I, v_h, v_h) = λ_h v_h. 56/75

131 Spurious Eigenvectors for Tensor Eigen Decomposition T = ∑_{h∈[K]} λ_h v_h^{⊗3}. {v_h}_{h=1}^K are eigenvectors, since T(I, v_h, v_h) = λ_h v_h. Bad news: there can be other eigenvectors (unlike the matrix case). E.g., when λ_h ≡ 1, v = (v_1 + v_2)/√2 satisfies T(I, v, v) = (1/√2) v. 56/75

132 Spurious Eigenvectors for Tensor Eigen Decomposition Bad news: there can be other eigenvectors (unlike the matrix case), e.g., v = (v_1 + v_2)/√2 when λ_h ≡ 1. How do we avoid spurious solutions (eigenvectors that are not components {v_h}_{h=1}^K)? 56/75

133 Spurious Eigenvectors for Tensor Eigen Decomposition How do we avoid spurious solutions (eigenvectors that are not components {v_h}_{h=1}^K)? The optimization viewpoint of tensor eigen decomposition will help. 56/75

134 Spurious Eigenvectors for Tensor Eigen Decomposition How do we avoid spurious solutions? The optimization viewpoint of tensor eigen decomposition will help: all spurious eigenvectors are saddle points. 56/75

135 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition 57/75

136 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition Optimization Problem Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤ v − 1). Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤ v − 1). 58/75

137 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition Optimization Problem Matrix: max_v M(v, v) s.t. ‖v‖ = 1, with Lagrangian L(v, λ) := M(v, v) − λ(v^⊤ v − 1). Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1, with Lagrangian L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤ v − 1). Non-convex: stationary points = {global optima, local optima, saddle points}. 58/75

138 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition Non-convex: stationary points = {global optima, local optima, saddle points}. Stationary points: first derivative ∇_v L(v, λ) = 0. Matrix: ∇_v L(v, λ) = 2(M(I, v) − λv) = 0, so eigenvectors are stationary points; the power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent. Tensor: ∇_v L(v, λ) = 3(T(I, v, v) − λv) = 0, so eigenvectors are stationary points; the power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent. 58/75

139 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition Stationary points: ∇_v L(v, λ) = 0; eigenvectors are stationary points, and the power method is a version of gradient ascent. Local optima: w^⊤ ∇² L(v, λ) w < 0 for all w ⊥ v at a stationary point v. Matrix: v_1 is the only local optimum; all other eigenvectors are saddle points. Tensor: {v_h}_{h=1}^K are the only local optima; all spurious eigenvectors are saddle points. 58/75

140 Question: What about performance under noise? 59/75

141 Tensor Perturbation Analysis T̂ = T + E, T = ∑_h λ_h v_h^{⊗3}, ‖E‖ := max_{x: ‖x‖=1} |E(x, x, x)| ≤ ǫ. 60/75

142 Tensor Perturbation Analysis T̂ = T + E, T = ∑_h λ_h v_h^{⊗3}, ‖E‖ ≤ ǫ. Theorem: let T be the number of iterations. If T ≳ log K + log log(λ_max/ǫ) and ǫ < λ_min/K, then the output (v, λ) (after polynomially many restarts) satisfies ‖v − v_1‖ ≤ O(ǫ/λ_1) and |λ − λ_1| ≤ O(ǫ), where v_1 is such that λ_1|c_1| > λ_2|c_2| ≥ ..., c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer. 60/75

143 Tensor Perturbation Analysis Theorem (as above): with T ≳ log K + log log(λ_max/ǫ) iterations and ǫ < λ_min/K, the output satisfies ‖v − v_1‖ ≤ O(ǫ/λ_1) and |λ − λ_1| ≤ O(ǫ). A careful analysis of the deflation avoids a buildup of errors. This implies polynomial sample complexity for learning. 60/75

144 Other tensor decomposition techniques 61/75

145 Orthogonal Tensor Decomposition Simultaneous Power Method (Wang & Lu, 2017): simultaneous recovery of eigenvectors; initialization is not optimal. Orthogonalized Simultaneous Alternating Least Squares (Sharan & Valiant, 2017): random initialization; proven convergence for symmetric tensors. Initialization: SVD-based initialization (Anandkumar & Janzamin, 2014); state-of-the-art (trace-based) initialization (Li & Huang, 2018). 62/75

146 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 63/75

147 Neural Network - Nonlinear Function Approximation Success of Deep Neural Networks Image classification, speech recognition, text processing. Enabled by growth in computation power and enormous labeled data. 64/75

148 Neural Network - Nonlinear Function Approximation Success of Deep Neural Networks Image classification, speech recognition, text processing. Enabled by growth in computation power and enormous labeled data. Expressive Power Linear composition vs. nonlinear composition; shallow networks vs. deep structures. 64/75

149 Revolution of Depth [Chart: ImageNet classification top-5 error (%) by year and network depth, from shallow models (ILSVRC'10-'11) to AlexNet (8 layers, ILSVRC'12), VGG (19 layers, ILSVRC'14), GoogleNet (ILSVRC'14), and ResNet (ILSVRC'15).] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.

150 Revolution of Depth AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2 → 5x5 conv, 256, pool/2 → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256, pool/2 → fc 4096 → fc 4096 → fc 1000. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.

151 Revolution of Depth AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2 → 5x5 conv, 256, pool/2 → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256, pool/2 → fc 4096 → fc 4096 → fc 1000. VGG, 19 layers (ILSVRC 2014): 3x3 conv, 64 → 3x3 conv, 64, pool/2 → 3x3 conv, 128 → 3x3 conv, 128, pool/2 → 3x3 conv, 256 (x4), pool/2 → 3x3 conv, 512 (x4), pool/2 → 3x3 conv, 512 (x4), pool/2 → fc 4096 → fc 4096 → fc 1000. GoogleNet, 22 layers (ILSVRC 2014). Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.


More information

Appendix A. Proof to Theorem 1

Appendix A. Proof to Theorem 1 Appendix A Proof to Theorem In this section, we prove the sample complexity bound given in Theorem The proof consists of three main parts In Appendix A, we prove perturbation lemmas that bound the estimation

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa VS model in practice Document and query are represented by term vectors Terms are not necessarily orthogonal to each other Synonymy: car v.s. automobile Polysemy:

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Fourier PCA. Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech)

Fourier PCA. Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech) Fourier PCA Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech) Introduction 1. Describe a learning problem. 2. Develop an efficient tensor decomposition. Independent component

More information

Computational Lower Bounds for Statistical Estimation Problems

Computational Lower Bounds for Statistical Estimation Problems Computational Lower Bounds for Statistical Estimation Problems Ilias Diakonikolas (USC) (joint with Daniel Kane (UCSD) and Alistair Stewart (USC)) Workshop on Local Algorithms, MIT, June 2018 THIS TALK

More information

Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints

Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints Anima Anandkumar U.C. Irvine Joint work with Daniel Hsu, Majid Janzamin Adel Javanmard and Sham Kakade.

More information

STATS 306B: Unsupervised Learning Spring Lecture 13 May 12

STATS 306B: Unsupervised Learning Spring Lecture 13 May 12 STATS 306B: Unsupervised Learning Spring 2014 Lecture 13 May 12 Lecturer: Lester Mackey Scribe: Jessy Hwang, Minzhe Wang 13.1 Canonical correlation analysis 13.1.1 Recap CCA is a linear dimensionality

More information

Efficient Spectral Methods for Learning Mixture Models!

Efficient Spectral Methods for Learning Mixture Models! Efficient Spectral Methods for Learning Mixture Models Qingqing Huang 2016 February Laboratory for Information & Decision Systems Based on joint works with Munther Dahleh, Rong Ge, Sham Kakade, Greg Valiant.

More information

The Informativeness of k-means for Learning Mixture Models

The Informativeness of k-means for Learning Mixture Models The Informativeness of k-means for Learning Mixture Models Vincent Y. F. Tan (Joint work with Zhaoqiang Liu) National University of Singapore June 18, 2018 1/35 Gaussian distribution For F dimensions,

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics)

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics) COMS 4771 Lecture 1 1. Course overview 2. Maximum likelihood estimation (review of some statistics) 1 / 24 Administrivia This course Topics http://www.satyenkale.com/coms4771/ 1. Supervised learning Core

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning

Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning David Steurer Cornell Approximation Algorithms and Hardness, Banff, August 2014 for many problems (e.g., all UG-hard ones): better guarantees

More information

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010 Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same

More information

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization Tim Roughgarden February 28, 2017 1 Preamble This lecture fulfills a promise made back in Lecture #1,

More information

Learning Tractable Graphical Models: Latent Trees and Tree Mixtures

Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Anima Anandkumar U.C. Irvine Joint work with Furong Huang, U.N. Niranjan, Daniel Hsu and Sham Kakade. High-Dimensional Graphical Modeling

More information

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis

More information

STATS 306B: Unsupervised Learning Spring Lecture 3 April 7th

STATS 306B: Unsupervised Learning Spring Lecture 3 April 7th STATS 306B: Unsupervised Learning Spring 2014 Lecture 3 April 7th Lecturer: Lester Mackey Scribe: Jordan Bryan, Dangna Li 3.1 Recap: Gaussian Mixture Modeling In the last lecture, we discussed the Gaussian

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

Sketching for Large-Scale Learning of Mixture Models

Sketching for Large-Scale Learning of Mixture Models Sketching for Large-Scale Learning of Mixture Models Nicolas Keriven Université Rennes 1, Inria Rennes Bretagne-atlantique Adv. Rémi Gribonval Outline Introduction Practical Approach Results Theoretical

More information

HOSTOS COMMUNITY COLLEGE DEPARTMENT OF MATHEMATICS

HOSTOS COMMUNITY COLLEGE DEPARTMENT OF MATHEMATICS HOSTOS COMMUNITY COLLEGE DEPARTMENT OF MATHEMATICS MAT 217 Linear Algebra CREDIT HOURS: 4.0 EQUATED HOURS: 4.0 CLASS HOURS: 4.0 PREREQUISITE: PRE/COREQUISITE: MAT 210 Calculus I MAT 220 Calculus II RECOMMENDED

More information

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa Recap: vector space model Represent both doc and query by concept vectors Each concept defines one dimension K concepts define a high-dimensional space Element

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY OUTLINE Why We Need Matrix Decomposition SVD (Singular Value Decomposition) NMF (Nonnegative

More information

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

More information

Gaussian Mixture Models

Gaussian Mixture Models Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 22 1 / 21 Overview

More information

Faloutsos, Tong ICDE, 2009

Faloutsos, Tong ICDE, 2009 Large Graph Mining: Patterns, Tools and Case Studies Christos Faloutsos Hanghang Tong CMU Copyright: Faloutsos, Tong (29) 2-1 Outline Part 1: Patterns Part 2: Matrix and Tensor Tools Part 3: Proximity

More information

Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints

Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints Animashree Anandkumar 1, Daniel Hsu 2, Adel Javanmard 3, and Sham M. Kakade 2 1 Department of EECS, University of California,

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences

Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences Chi Jin UC Berkeley chijin@cs.berkeley.edu Yuchen Zhang UC Berkeley yuczhang@berkeley.edu Sivaraman

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

A Practical Algorithm for Topic Modeling with Provable Guarantees

A Practical Algorithm for Topic Modeling with Provable Guarantees 1 A Practical Algorithm for Topic Modeling with Provable Guarantees Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu Reviewed by Zhao Song December

More information

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

More information

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018

ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018 ELE 538B: Mathematics of High-Dimensional Data Spectral methods Yuxin Chen Princeton University, Fall 2018 Outline A motivating application: graph clustering Distance and angles between two subspaces Eigen-space

More information

Provable Alternating Minimization Methods for Non-convex Optimization

Provable Alternating Minimization Methods for Non-convex Optimization Provable Alternating Minimization Methods for Non-convex Optimization Prateek Jain Microsoft Research, India Joint work with Praneeth Netrapalli, Sujay Sanghavi, Alekh Agarwal, Animashree Anandkumar, Rashish

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Chapter 7: Symmetric Matrices and Quadratic Forms

Chapter 7: Symmetric Matrices and Quadratic Forms Chapter 7: Symmetric Matrices and Quadratic Forms (Last Updated: December, 06) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). A few theorems have been moved

More information

1 Singular Value Decomposition and Principal Component

1 Singular Value Decomposition and Principal Component Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora Scribe: Today we continue the

More information

Overlapping Variable Clustering with Statistical Guarantees and LOVE

Overlapping Variable Clustering with Statistical Guarantees and LOVE with Statistical Guarantees and LOVE Department of Statistical Science Cornell University WHOA-PSI, St. Louis, August 2017 Joint work with Mike Bing, Yang Ning and Marten Wegkamp Cornell University, Department

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

1 Matrix notation and preliminaries from spectral graph theory

1 Matrix notation and preliminaries from spectral graph theory Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.

More information

Review problems for MA 54, Fall 2004.

Review problems for MA 54, Fall 2004. Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Machine learning for pervasive systems Classification in high-dimensional spaces

Machine learning for pervasive systems Classification in high-dimensional spaces Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

https://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:

More information

Estimating Covariance Using Factorial Hidden Markov Models

Estimating Covariance Using Factorial Hidden Markov Models Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu

More information

Compressed Sensing and Neural Networks

Compressed Sensing and Neural Networks and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications

More information

THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING

THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING Luis Rademacher, Ohio State University, Computer Science and Engineering. Joint work with Mikhail Belkin and James Voss This talk A new approach to multi-way

More information

Lecture 8: Linear Algebra Background

Lecture 8: Linear Algebra Background CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected

More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS

METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS Y. Cem Subakan, Johannes Traa, Paris Smaragdis,,, Daniel Hsu UIUC Computer Science Department, Columbia University Computer Science Department

More information

Scaling Neighbourhood Methods

Scaling Neighbourhood Methods Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)

More information

Matrix and Tensor Factorization from a Machine Learning Perspective

Matrix and Tensor Factorization from a Machine Learning Perspective Matrix and Tensor Factorization from a Machine Learning Perspective Christoph Freudenthaler Information Systems and Machine Learning Lab, University of Hildesheim Research Seminar, Vienna University of

More information

EM for Spherical Gaussians

EM for Spherical Gaussians EM for Spherical Gaussians Karthekeyan Chandrasekaran Hassan Kingravi December 4, 2007 1 Introduction In this project, we examine two aspects of the behavior of the EM algorithm for mixtures of spherical

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2 STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

CS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery

CS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery CS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery Tim Roughgarden & Gregory Valiant May 3, 2017 Last lecture discussed singular value decomposition (SVD), and we

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information