Guaranteed Learning of Latent Variable Models through Tensor Methods

1 Guaranteed Learning of Latent Variable Models through Tensor Methods Furong Huang University of Maryland ACM SIGMETRICS Tutorial /75

2 Tutorial Topic Learning algorithms for latent variable models based on decompositions of moment tensors. [Figure: single-topic model with choice variable h selecting a topic (k1-k5), topic-word matrix A, and words (life, gene, data, DNA, RNA); pipeline: unlabeled data → latent variable model → tensor decomposition → inference.] Method-of-moments (Pearson, 1894). 2/75

3 Tutorial Topic Learning algorithms (parameter estimation) for latent variable models based on decompositions of moment tensors. [Figure: as above.] Method-of-moments (Pearson, 1894). 2/75

4 Application 1: Clustering Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. 3/75

5 Application 1: Clustering Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint The groups represent different distributions. (e.g. Gaussian). Each data point is drawn from one of the given distributions. (e.g. Gaussian mixtures). 3/75

6 Application 2: Topic Modeling Document modeling Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization. 4/75

7 Application 3: Understanding Human Communities Social Networks Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of social actors. 5/75

8 Application 4: Recommender Systems Recommender System Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products. 6/75

9 Application 5: Feature Learning Feature Engineering Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures. 7/75

10 Application 6: Computational Biology Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups 8/75

11 Application 7: Human Disease Hierarchy Discovery CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases. Scalable Latent Tree Model and its Application to Health Analytics by F. Huang, U. N. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop. 9/75

12 How to model hidden effects? Basic approach: mixtures/clusters; the hidden variable h is categorical. Advanced: probabilistic models; the hidden variable h has more general distributions and can model mixed memberships. [Figure: graphical model with hidden variables h1, h2, h3 and observed variables x1-x5.] This talk: basic mixture model and some advanced models. 10/75

13 Challenges in Learning Basic goal in all mentioned applications: discover hidden structure in data (unsupervised learning). [Figure: single-topic model with choice variable h, topic-word matrix A, and words (life, gene, data, DNA, RNA); pipeline: unlabeled data → latent variable model → learning algorithm → inference.] 11/75

14 Challenges in Learning Find hidden structure in data. Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? 11/75

15 Challenges in Learning Find hidden structure in data. Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? Challenge: Efficient Learning of Latent Variable Models MCMC: random sampling, slow; exponential mixing time. 11/75

16 Challenges in Learning Find hidden structure in data. Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? Challenge: Efficient Learning of Latent Variable Models MCMC: random sampling, slow; exponential mixing time. Likelihood: non-convex, not scalable; exponentially many critical points. 11/75

17 Challenges in Learning Find hidden structure in data. Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? Challenge: Efficient Learning of Latent Variable Models MCMC: random sampling, slow; exponential mixing time. Likelihood: non-convex, not scalable; exponentially many critical points. Efficient computational and sample complexities? 11/75

18 Challenges in Learning Find hidden structure in data. [Figure: pipeline: unlabeled data → latent variable model → tensor decomposition → inference.] Challenge: Conditions for Identifiability Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability? Challenge: Efficient Learning of Latent Variable Models MCMC: random sampling, slow; exponential mixing time. Likelihood: non-convex, not scalable; exponentially many critical points. Efficient computational and sample complexities? Guaranteed and efficient learning through spectral methods. 11/75

19 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 12/75

20 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 12/75

21 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 12/75

22 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents Identifiability Parameter recovery via decomposition of exact moments 12/75

23 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents Identifiability Parameter recovery via decomposition of exact moments 5 Error-tolerant Algorithms for Tensor Decompositions 12/75

24 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents Identifiability Parameter recovery via decomposition of exact moments 5 Error-tolerant Algorithms for Tensor Decompositions Decomposition for tensors with linearly independent components Decomposition for tensors with orthogonal components 12/75

25 What this tutorial will cover Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents Identifiability Parameter recovery via decomposition of exact moments 5 Error-tolerant Algorithms for Tensor Decompositions Decomposition for tensors with linearly independent components Decomposition for tensors with orthogonal components 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 12/75

26 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 13/75

27 Gaussian Mixture Model Generative Model Samples are drawn from a mixture of K Gaussians: the component is chosen as H ∼ Cat(π_1, π_2, ..., π_K), and each sample is drawn from the chosen Gaussian, X | H = h ∼ N(µ_h, Σ_h), h ∈ [K]. 14/75

28 Gaussian Mixture Model Generative Model Samples are drawn from a mixture of K Gaussians: H ∼ Cat(π_1, π_2, ..., π_K), X | H = h ∼ N(µ_h, Σ_h), h ∈ [K]. Learning Problem Estimate the mean vector µ_h, covariance matrix Σ_h, and mixing weight π_h of each subpopulation from unlabeled data. 14/75

29 Maximum Likelihood Estimator (MLE) Data {x_i}_{i=1}^n. Likelihood (i.i.d.): Pr_θ(data) = ∏_{i=1}^n Pr_θ(x_i). Model parameter estimation: θ̂_MLE := argmax_{θ ∈ Θ} log Pr_θ(data). Latent variable models: some variables are hidden. No direct estimators when some variables are hidden. Local optimization via Expectation-Maximization (EM) (Dempster, Laird, & Rubin, 1977). 15/75

30 MLE for Gaussian Mixture Models Given data {x_i}_{i=1}^n and the number of Gaussian components K, the model parameters to be estimated are θ = {(µ_h, Σ_h, π_h)}_{h=1}^K. θ̂_MLE for Gaussian Mixture Models: θ̂_MLE := argmax_θ ∑_{i=1}^n log ( ∑_{h=1}^K (π_h / det(Σ_h)^{1/2}) exp( −(1/2) (x_i − µ_h)^⊤ Σ_h^{−1} (x_i − µ_h) ) ). Solving for the MLE is NP-hard (Dasgupta, 2008; Aloise, Deshpande, Hansen, & Popat, 2009; Mahajan, Nimbhorkar, & Varadarajan, 2009; Vattani, 2009; Awasthi, Charikar, Krishnaswamy, & Sinop, 2015). 16/75

31 Definition: Consistent Estimator Suppose i.i.d. samples {x_i}_{i=1}^n are generated by the distribution Pr_θ(x), where the model parameters θ ∈ Θ are unknown. An estimator θ̂ is consistent if E‖θ̂ − θ‖ → 0 as n → ∞. Spherical Gaussian Mixtures (Σ_h = I), as n → ∞: For K = 2 and π_h = 1/2, EM is consistent (Xu, Hsu, & Maleki, 2016; Daskalakis, Tzamos, & Zampetakis, 2016). For larger K, EM is easily trapped in local maxima far from the global maximum (Jin, Zhang, Balakrishnan, Wainwright, & Jordan, 2016). Practitioners often run EM with many (random) restarts, but it may take a long time to get near the global maximum. 17/75

32 Hardness of Parameter Estimation It can be exponentially difficult, computationally or statistically, to learn model parameters, even in the parametric setting. Cryptographic hardness, e.g., Mossel & Roch, 2006. Information-theoretic hardness, e.g., Moitra & Valiant, 2010. May require 2^{Ω(K)} running time or 2^{Ω(K)} sample size. 18/75

33 Ways Around the Hardness Separation conditions. E.g., assume min_{i≠j} ‖µ_i − µ_j‖_2 / √(σ_i² + σ_j²) is sufficiently large. (Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2002; ...) Structural assumptions. E.g., assume sparsity or separability (anchor words). (Spielman, Wang & Wright, 2012; Arora, Ge & Moitra, 2012; ...) Non-degeneracy conditions. E.g., assume µ_1, ..., µ_K span a K-dimensional space. This tutorial: statistically and computationally efficient learning algorithms for non-degenerate instances via method-of-moments. 19/75

34 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 20/75

35 Method-of-Moments At A Glance 1 Determine functions of the model parameters θ that are estimable from observable data: Moments E_θ[f(X)]. 2 Form estimates of the moments using data (i.i.d. samples {x_i}_{i=1}^n): Empirical Moments Ê[f(X)]. 3 Solve the approximate equations for the parameters θ: Moment matching E_θ[f(X)] ≈ Ê[f(X)]. Toy Example How do we estimate a Gaussian variable, i.e., (µ, σ), given i.i.d. samples {x_i}_{i=1}^n ∼ N(µ, σ²)? 21/75
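A minimal sketch of the three steps for the toy example, assuming a 1-D Gaussian N(µ, σ²); the function name and test values are illustrative, not from the tutorial.

```python
import numpy as np

def gaussian_method_of_moments(x):
    """Estimate (mu, sigma) of a 1-D Gaussian by matching the first two moments.

    Step 1: E[X] = mu and E[X^2] = mu^2 + sigma^2 are functions of (mu, sigma).
    Step 2: replace the expectations with empirical averages.
    Step 3: solve the two moment equations for the parameters.
    """
    m1 = x.mean()             # empirical E[X]
    m2 = (x ** 2).mean()      # empirical E[X^2]
    mu_hat = m1
    sigma_hat = np.sqrt(max(m2 - m1 ** 2, 0.0))
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=10_000)
print(gaussian_method_of_moments(samples))   # approximately (2.0, 1.5)
```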

36 What is a tensor? A tensor is a multi-dimensional array, a higher-order analogue of a matrix. The number of dimensions is called the tensor order. 22/75

37 Tensor Product [a ⊗ b]_{i_1,i_2} = a_{i_1} b_{i_2}: a rank-1 matrix. [a ⊗ b ⊗ c]_{i_1,i_2,i_3} = a_{i_1} b_{i_2} c_{i_3}: a rank-1 tensor. 23/75
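A small numpy illustration of the rank-1 objects just defined (illustrative code, not from the tutorial): the outer product of two vectors is a rank-1 matrix, and of three vectors a rank-1 third-order tensor.

```python
import numpy as np

a, b, c = np.array([1., 2.]), np.array([3., 4., 5.]), np.array([6., 7.])

# [a ⊗ b]_{i1,i2} = a_{i1} * b_{i2}: rank-1 matrix of shape (2, 3)
ab = np.einsum('i,j->ij', a, b)

# [a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} * b_{i2} * c_{i3}: rank-1 tensor of shape (2, 3, 2)
abc = np.einsum('i,j,k->ijk', a, b, c)

assert np.isclose(ab[1, 2], a[1] * b[2])
assert np.isclose(abc[1, 2, 0], a[1] * b[2] * c[0])
```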

38 Slices Horizontal slices Lateral slices Frontal slices 24/75

39 Fiber Mode-1 (column) fibers Mode-2 (row) fibers Mode-3 (tube) fibers 25/75

40 CP decomposition X = ∑_{h=1}^R a_h ⊗ b_h ⊗ c_h. Rank: the minimum number of rank-1 tensors whose sum generates the tensor. 26/75

41 Multi-linear Transform Multi-linear Operation If T = ∑_{h=1}^R a_h ⊗ b_h ⊗ c_h, a multi-linear operation using matrices (X, Y, Z) is as follows: T(X, Y, Z) := ∑_{h=1}^R (X^⊤ a_h) ⊗ (Y^⊤ b_h) ⊗ (Z^⊤ c_h). Similarly, for a multi-linear operation using vectors (x, y, z): T(x, y, z) := ∑_{h=1}^R (x^⊤ a_h)(y^⊤ b_h)(z^⊤ c_h). 27/75
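A sketch of the multi-linear operation T(X, Y, Z) with numpy's einsum, under the convention above that each factor matrix acts on the corresponding mode via its transpose (illustrative code; the variable names are assumptions).

```python
import numpy as np

def multilinear(T, X, Y, Z):
    """Return T(X, Y, Z) with entries sum_{a,b,c} T[a,b,c] X[a,i] Y[b,j] Z[c,k]."""
    return np.einsum('abc,ai,bj,ck->ijk', T, X, Y, Z)

# Check against the CP form: if T = sum_h a_h ⊗ b_h ⊗ c_h, then
# T(X, Y, Z) = sum_h (X^T a_h) ⊗ (Y^T b_h) ⊗ (Z^T c_h).
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(5, 3)), rng.normal(size=(6, 3)), rng.normal(size=(7, 3))
T = np.einsum('ah,bh,ch->abc', A, B, C)
X, Y, Z = rng.normal(size=(5, 2)), rng.normal(size=(6, 2)), rng.normal(size=(7, 2))

lhs = multilinear(T, X, Y, Z)
rhs = np.einsum('ih,jh,kh->ijk', X.T @ A, Y.T @ B, Z.T @ C)
assert np.allclose(lhs, rhs)
```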

42 Tensors in Method of Moments Matrix: pair-wise relationships. Observed signal or data x ∈ R^d. Rank-1 matrix: [x ⊗ x]_{i,j} = x_i x_j. Aggregated pair-wise relationships: M_2 = E[x ⊗ x]. Tensor: triple-wise relationships or higher. Rank-1 tensor: [x ⊗ x ⊗ x]_{i,j,k} = x_i x_j x_k. Aggregated triple-wise relationships: M_3 = E[x ⊗ x ⊗ x] = E[x^{⊗3}]. 28/75
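A sketch of how the aggregated second- and third-order moments are formed from data with numpy (illustrative; the data here is synthetic).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))       # n = 1000 samples, d = 10 observed dimensions

# M2_hat[i, j]    = (1/n) sum_s x_s[i] * x_s[j]            (empirical E[x ⊗ x])
M2_hat = np.einsum('si,sj->ij', X, X) / X.shape[0]

# M3_hat[i, j, k] = (1/n) sum_s x_s[i] * x_s[j] * x_s[k]   (empirical E[x ⊗ x ⊗ x])
M3_hat = np.einsum('si,sj,sk->ijk', X, X, X) / X.shape[0]

print(M2_hat.shape, M3_hat.shape)     # (10, 10) (10, 10, 10)
```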

43 Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without an eigenvalue gap: I = e_1 e_1^⊤ + e_2 e_2^⊤ = u_1 u_1^⊤ + u_2 u_2^⊤, where u_1 = [√2/2, √2/2], u_2 = [√2/2, −√2/2]. 29/75

44 Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without an eigenvalue gap: I = e_1 e_1^⊤ + e_2 e_2^⊤ = u_1 u_1^⊤ + u_2 u_2^⊤, where u_1 = [√2/2, √2/2], u_2 = [√2/2, −√2/2]. Unique with an eigenvalue gap. 29/75

45 Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without an eigenvalue gap; unique with an eigenvalue gap. Tensor Orthogonal Decomposition (Harshman, 1970) Unique: no eigenvalue gap needed. 29/75

46 Why are tensors powerful? Matrix Orthogonal Decomposition Not unique without an eigenvalue gap; unique with an eigenvalue gap. Tensor Orthogonal Decomposition (Harshman, 1970) Unique: no eigenvalue gap needed. A slice of the tensor has an eigenvalue gap. 29/75

48 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 30/75

49 Topic Modeling General Topic Model (e.g., Latent Dirichlet Allocation) K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K. Hidden topic proportion w per document i: w^(i) ∈ Δ^{K−1}. Each document is an i.i.d. mixture of topics. [Figure: topic-word matrix for Sports, Science, Business, Politics, and a word count per document, e.g., play, game, season.] 31/75

50 Topic Modeling Topic Model for Single-topic Documents K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K. Hidden topic proportion w per document i: w^(i) ∈ {e_1, ..., e_K}. The words of a document are drawn i.i.d. from a_h. [Figure: topic-word matrix and word count per document.] 31/75

51 Model Parameters of Topic Model for Single-topic Documents Estimate the topic proportions: w = [w_1, ..., w_K], w_h = P[topic of word = h]. Estimate the topic-word matrix: A = [a_1, ..., a_K], A_{jh} = P[word = e_j | topic = h]. Goal: estimate the model parameters {(a_h, w_h)}_{h=1}^K, given i.i.d. samples of n documents (word counts {c^(i)}_{i=1}^n). Frequency vector x^(i) = c^(i)/L, where the length of the document is L = ∑_j c_j^(i). 32/75
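A sketch of the generative process for single-topic documents and the word-frequency vectors x^(i) = c^(i)/L, with a toy vocabulary and made-up parameters (purely illustrative; none of these values come from the tutorial).

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, L, n = 3, 6, 50, 2000               # topics, vocabulary size, words/doc, documents
w = np.array([0.5, 0.3, 0.2])             # topic proportions (assumed)
A = rng.dirichlet(np.ones(d), size=K).T   # d x K topic-word matrix, columns a_h

docs = []
for _ in range(n):
    h = rng.choice(K, p=w)                     # choose the document's single topic
    words = rng.choice(d, size=L, p=A[:, h])   # draw L words i.i.d. from a_h
    c = np.bincount(words, minlength=d)        # word-count vector c^(i)
    docs.append(c / L)                         # frequency vector x^(i) = c^(i) / L
X = np.array(docs)                             # n x d matrix of frequency vectors
```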

52 Moment Matching Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ∼ Cat(w_1, ..., w_K); generate L words i.i.d. ∼ a_h. E[x] = ∑_{h=1}^K P[topic = h] E[x | topic = h]. E[x | topic = h] = ∑_j P[word = e_j | topic = h] e_j = a_h. 33/75

53 Moment Matching Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ∼ Cat(w_1, ..., w_K); generate L words i.i.d. ∼ a_h. E[x] = ∑_{h=1}^K P[topic = h] E[x | topic = h] = ∑_{h=1}^K w_h a_h. E[x | topic = h] = ∑_j P[word = e_j | topic = h] e_j = a_h. 33/75

54 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ∼ Cat(w_1, ..., w_K); generate L words i.i.d. ∼ a_h, so that E[x] = ∑_{h=1}^K w_h a_h. M_1: distribution of words (M̂_1: occurrence frequency of words): M_1 = E[x] = ∑_h w_h a_h; M̂_1 = (1/n) ∑_{i=1}^n x^(i). [Figure: a word-frequency vector over (campus, police, witness) written as a weighted sum of the Education, Sports, and Crime topic vectors.] 33/75

55 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_1: distribution of words (M̂_1: occurrence frequency of words): M_1 = E[x] = ∑_h w_h a_h; M̂_1 = (1/n) ∑_{i=1}^n x^(i). No unique decomposition of vectors. 33/75

56 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs): M_2 = E[x ⊗ x] = ∑_h w_h a_h ⊗ a_h; M̂_2 = (1/n) ∑_{i=1}^n x^(i) ⊗ x^(i). 33/75

57 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs): M_2 = E[x ⊗ x] = ∑_h w_h a_h ⊗ a_h; M̂_2 = (1/n) ∑_{i=1}^n x^(i) ⊗ x^(i). Matrix decomposition recovers the subspace, not the actual model. 33/75

58 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). Find a W such that the transformed components are orthogonal. M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs): M_2 = E[x ⊗ x] = ∑_h w_h a_h ⊗ a_h; M̂_2 = (1/n) ∑_{i=1}^n x^(i) ⊗ x^(i). Many such W exist; find one such that the v_h = W^⊤ a_h are orthogonal. 33/75

59 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). Know a W such that the transformed components are orthogonal. M_3: distribution of word triples (M̂_3: co-occurrence of word triples): M_3 = E[x^{⊗3}] = ∑_h w_h a_h^{⊗3}; M̂_3 = (1/n) ∑_{i=1}^n (x^(i))^{⊗3}. Orthogonalize the tensor by projecting the data with W: M_3(W, W, W). 33/75

60 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). Know a W such that the transformed components are orthogonal. M_3(W, W, W) = E[(W^⊤ x)^{⊗3}] = ∑_h w_h (W^⊤ a_h)^{⊗3}; M̂_3(W, W, W) = (1/n) ∑_{i=1}^n (W^⊤ x^(i))^{⊗3}. Unique orthogonal tensor decomposition yields {v̂_h}_{h=1}^K. 33/75

61 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_3(W, W, W) = E[(W^⊤ x)^{⊗3}] = ∑_h w_h (W^⊤ a_h)^{⊗3}; M̂_3(W, W, W) = (1/n) ∑_{i=1}^n (W^⊤ x^(i))^{⊗3}. Model parameter estimation: â_h = (W^⊤)^† v̂_h. 33/75

62 Identifiability: how long must the documents be? Nondegenerate model (linearly independent topic-word matrix). M_3(W, W, W) = E[(W^⊤ x)^{⊗3}] = ∑_h w_h (W^⊤ a_h)^{⊗3}; M̂_3(W, W, W) = (1/n) ∑_{i=1}^n (W^⊤ x^(i))^{⊗3}. L ≥ 3: learning topic models through matrix/tensor decomposition. 33/75

63 Take Away Message Consider topic models whose word distributions under different topics are linearly independent. The parameters of the topic model for single-topic documents can be efficiently recovered from the distribution of three-word documents. Distribution of three-word documents (word triples): M_3 = E[x ⊗ x ⊗ x] = ∑_h w_h a_h ⊗ a_h ⊗ a_h (M̂_3: co-occurrence of word triples). Two-word documents are not sufficient for identifiability. 34/75

64 Tensor Methods Compared with Variational Inference Learning Topics from PubMed on Spark: 8 million documents. [Plot: perplexity vs. running time (s), tensor vs. variational inference.] 35/75

65 Tensor Methods Compared with Variational Inference Learning Topics from PubMed on Spark: 8 million documents. [Plot: perplexity vs. running time (s), tensor vs. variational inference.] Learning Communities from Graph Connectivity: Facebook (n ≈ 20k), Yelp (n ≈ 40k), DBLPsub (n ≈ 0.1M), DBLP (n ≈ 1M). [Plots: error per group and running times (s) on FB, YP, DBLPsub, DBLP.] 35/75

66 Tensor Methods Compared with Variational Inference Learning Topics from PubMed on Spark: 8 million documents. Learning Communities from Graph Connectivity: Facebook (n ≈ 20k), Yelp (n ≈ 40k), DBLPsub (n ≈ 0.1M), DBLP (n ≈ 1M). Orders of magnitude faster and more accurate. [Plots: perplexity, error per group, and running times.] Online Tensor Methods for Learning Latent Variable Models, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR14. Tensor Methods on Apache Spark, by F. Huang, A. Anandkumar, Oct. 35/75

67 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 36/75

68 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). 37/75

69 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Properties of Tensor Slices A linear combination of slices: T(I, I, c) = ∑_h ⟨µ_h, c⟩ µ_h ⊗ µ_h. 37/75

70 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Properties of Tensor Slices A linear combination of slices: T(I, I, c) = ∑_h ⟨µ_h, c⟩ µ_h ⊗ µ_h. Intuitions for Jennrich's Algorithm 37/75

71 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Properties of Tensor Slices A linear combination of slices: T(I, I, c) = ∑_h ⟨µ_h, c⟩ µ_h ⊗ µ_h. Intuitions for Jennrich's Algorithm Linear combinations of slices of the tensor share the same set of eigenvectors. 37/75

72 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Properties of Tensor Slices A linear combination of slices: T(I, I, c) = ∑_h ⟨µ_h, c⟩ µ_h ⊗ µ_h. Intuitions for Jennrich's Algorithm Linear combinations of slices of the tensor share the same set of eigenvectors. The shared eigenvectors are the tensor components {µ_h}_{h=1}^K. 37/75

73 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Algorithm (Jennrich's Algorithm). Require: tensor T ∈ R^{d×d×d}. Ensure: components {µ̂_h}_{h=1}^K, equal to {µ_h}_{h=1}^K almost surely. 1: Sample c and c′ independently and uniformly at random from S^{d−1}. 2: Return {µ̂_h}_{h=1}^K ← the eigenvectors of T(I, I, c) T(I, I, c′)^†. 38/75

74 Jennrich's Algorithm (Simplified) Task: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, find the components (up to scaling). Algorithm (Jennrich's Algorithm). Require: tensor T ∈ R^{d×d×d}. Ensure: components {µ̂_h}_{h=1}^K, equal to {µ_h}_{h=1}^K almost surely. 1: Sample c and c′ independently and uniformly at random from S^{d−1}. 2: Return {µ̂_h}_{h=1}^K ← the eigenvectors of T(I, I, c) T(I, I, c′)^†. Consistency of Jennrich's Algorithm? Do the estimators {µ̂_h}_{h=1}^K recover the unknown components {µ_h}_{h=1}^K (up to scaling)? 38/75
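A minimal numpy sketch of the simplified Jennrich's algorithm above, using a pseudo-inverse for T(I, I, c′) as in the analysis that follows (illustrative code; the recovered components are identified only up to scaling, and here they come back with unit norm).

```python
import numpy as np

def jennrich(T, K, rng=np.random.default_rng(0)):
    """Recover the K linearly independent components of T = sum_h mu_h^{⊗3} (up to scaling)."""
    d = T.shape[0]
    c1, c2 = rng.normal(size=d), rng.normal(size=d)
    c1, c2 = c1 / np.linalg.norm(c1), c2 / np.linalg.norm(c2)
    S1 = np.einsum('ijk,k->ij', T, c1)        # slice combination T(I, I, c)
    S2 = np.einsum('ijk,k->ij', T, c2)        # slice combination T(I, I, c')
    eigvals, eigvecs = np.linalg.eig(S1 @ np.linalg.pinv(S2))
    order = np.argsort(-np.abs(eigvals))      # keep the K dominant eigenvectors
    return np.real(eigvecs[:, order[:K]])

# Example: build a random rank-3 symmetric tensor and recover its components.
rng = np.random.default_rng(1)
U = rng.normal(size=(8, 3))                   # linearly independent components mu_h
T = np.einsum('ih,jh,kh->ijk', U, U, U)
U_hat = jennrich(T, K=3)                      # columns match columns of U up to scaling
```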

75 Analysis of Consistency of Jennrich's algorithm Recall: linear combinations of slices share the eigenvectors {µ_h}_{h=1}^K, i.e., T(I, I, c) T(I, I, c′)^† = U D_c U^⊤ (U^⊤)^† D_{c′}^{−1} U^† = U (D_c D_{c′}^{−1}) U^†, where U = [µ_1 ... µ_K] contains the linearly independent tensor components and D_c = Diag(⟨µ_1, c⟩, ..., ⟨µ_K, c⟩) is diagonal. 39/75

76 Analysis of Consistency of Jennrich's algorithm Recall: T(I, I, c) T(I, I, c′)^† = U (D_c D_{c′}^{−1}) U^†. By the linear independence of {µ_i}_{i=1}^K and the random choice of c and c′: 1. U has rank K. 39/75

77 Analysis of Consistency of Jennrich's algorithm Recall: T(I, I, c) T(I, I, c′)^† = U (D_c D_{c′}^{−1}) U^†. By the linear independence of {µ_i}_{i=1}^K and the random choice of c and c′: 1. U has rank K; 2. D_c and D_{c′} are invertible (a.s.). 39/75

78 Analysis of Consistency of Jennrich's algorithm Recall: T(I, I, c) T(I, I, c′)^† = U (D_c D_{c′}^{−1}) U^†. By the linear independence of {µ_i}_{i=1}^K and the random choice of c and c′: 1. U has rank K; 2. D_c and D_{c′} are invertible (a.s.); 3. the diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.). 39/75

79 Analysis of Consistency of Jennrich's algorithm By the linear independence of {µ_i}_{i=1}^K and the random choice of c and c′: 1. U has rank K; 2. D_c and D_{c′} are invertible (a.s.); 3. the diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.). So {µ_i}_{i=1}^K are the eigenvectors of T(I, I, c) T(I, I, c′)^† with distinct non-zero eigenvalues: Jennrich's algorithm is consistent. 39/75

80 Error-tolerant algorithms for tensor decompositions 40/75

81 Moment Estimator: Empirical Moments 41/75

82 Moment Estimator: Empirical Moments Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the i.i.d. samples {x_i}_{i=1}^n. 41/75

83 Moment Estimator: Empirical Moments Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the i.i.d. samples {x_i}_{i=1}^n. Example Third-order moment (distribution of word triples): E[x ⊗ x ⊗ x] = ∑_h w_h a_h ⊗ a_h ⊗ a_h. Empirical third-order moment (co-occurrence frequency of word triples): Ê[x ⊗ x ⊗ x] = (1/n) ∑_{i=1}^n x_i ⊗ x_i ⊗ x_i. 41/75

84 Moment Estimator: Empirical Moments Example Third-order moment: E[x ⊗ x ⊗ x] = ∑_h w_h a_h ⊗ a_h ⊗ a_h; empirical third-order moment: Ê[x ⊗ x ⊗ x] = (1/n) ∑_{i=1}^n x_i ⊗ x_i ⊗ x_i. We inevitably expect an error of order n^{−1/2} in some norm, e.g., Operator norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖ ∼ n^{−1/2}, where ‖T‖ := sup_{x,y,z ∈ S^{d−1}} T(x, y, z). Frobenius norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖_F ∼ n^{−1/2}, where ‖T‖_F := √(∑_{i,j,k} T_{i,j,k}²). 41/75

85 Stability of Jennrich's Algorithm Recall Jennrich's algorithm: given a tensor T = ∑_{h=1}^K µ_h^{⊗3} with linearly independent components {µ_h}_{h=1}^K, sample c, c′ independently and uniformly at random from S^{d−1} and return {µ̂_h}_{h=1}^K ← the eigenvectors of T(I, I, c) T(I, I, c′)^†. 42/75

86 Stability of Jennrich's Algorithm Recall Jennrich's algorithm (as above). Challenge: we only have access to T̂ such that ‖T̂ − T‖ ∼ n^{−1/2}. 42/75

87 Stability of Jennrich's Algorithm With only T̂ available, the algorithm returns the eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†; are these still close to the components? Challenge: we only have access to T̂ such that ‖T̂ − T‖ ∼ n^{−1/2}. 42/75

88 Stability of Jennrich's Algorithm Stability of eigenvectors requires eigenvalue gaps. 42/75

89 Stability of Jennrich's Algorithm Stability of eigenvectors requires eigenvalue gaps. To ensure eigenvalue gaps for T̂(I, I, c) T̂(I, I, c′)^†, it must be close to T(I, I, c) T(I, I, c′)^†. 42/75

90 Stability of Jennrich's Algorithm Stability of eigenvectors requires eigenvalue gaps. To ensure eigenvalue gaps for T̂(I, I, c) T̂(I, I, c′)^†, it must be close to T(I, I, c) T(I, I, c′)^†. Ultimately, ‖T̂ − T‖_F ≤ 1/poly(d) is required. 42/75

91 Stability of Jennrich's Algorithm Stability of eigenvectors requires eigenvalue gaps. Ultimately, ‖T̂ − T‖_F ≤ 1/poly(d) is required. A different approach? 42/75

92 Initial Ideas In many applications, we estimate moments of the form M_3 = ∑_{h=1}^K w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent. What if {a_h}_{h=1}^K were orthonormal? 43/75

93 Initial Ideas M_3 = ∑_{h=1}^K w_h a_h^{⊗3} with linearly independent {a_h}_{h=1}^K. What if {a_h}_{h=1}^K were orthonormal? Then M_3(I, a_i, a_i) = ∑_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i. 43/75

94 Initial Ideas M_3 = ∑_{h=1}^K w_h a_h^{⊗3} with linearly independent {a_h}_{h=1}^K. If the {a_h}_{h=1}^K were orthonormal, then M_3(I, a_i, a_i) = ∑_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i. Analogous to matrix eigenvectors: Mv = M(I, v) = λv. 43/75

95 Initial Ideas M_3 = ∑_{h=1}^K w_h a_h^{⊗3} with linearly independent {a_h}_{h=1}^K. If the {a_h}_{h=1}^K were orthonormal, then M_3(I, a_i, a_i) = w_i a_i, ∀i, analogous to matrix eigenvectors Mv = M(I, v) = λv. Define the orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M_3. 43/75

96 Initial Ideas M_3 = ∑_{h=1}^K w_h a_h^{⊗3} with linearly independent {a_h}_{h=1}^K. If the {a_h}_{h=1}^K were orthonormal, they would be eigenvectors of the tensor M_3 in the sense M_3(I, a_i, a_i) = w_i a_i. Two Problems: {a_h}_{h=1}^K is not orthogonal in general. How do we find the eigenvectors of a tensor? 43/75

98 Whitening Whitening is the process of finding a whitening matrix W such that a multi-linear operation (using W) on M_3 orthogonalizes its components: M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^{⊗3} = ∑_h w_h v_h^{⊗3}, with v_h ⊥ v_{h′} for h ≠ h′. 44/75

99 Whitening Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. 45/75

100 Whitening Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Find a whitening matrix W s.t. the v_h = W^⊤ a_h are orthogonal. 45/75

101 Whitening Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Find a whitening matrix W s.t. the v_h = W^⊤ a_h are orthogonal. When A = [a_1, ..., a_K] ∈ R^{d×K} has full column rank, this is an invertible transformation. [Figure: a_1, a_2, a_3 mapped by W to orthogonal v_1, v_2, v_3.] 45/75

102 Using Whitening to Obtain Orthogonal Tensor [Figure: the tensor M_3 is transformed into a tensor T.] 46/75

103 Using Whitening to Obtain Orthogonal Tensor Multi-linear transform: T = M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^{⊗3}. 46/75

104 Using Whitening to Obtain Orthogonal Tensor Multi-linear transform: T = M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^{⊗3}. T = ∑_{h ∈ [K]} w_h v_h^{⊗3} has orthogonal components. 46/75

105 Using Whitening to Obtain Orthogonal Tensor Multi-linear transform: T = M_3(W, W, W) = ∑_h w_h (W^⊤ a_h)^{⊗3}. T = ∑_{h ∈ [K]} w_h v_h^{⊗3} has orthogonal components. Dimensionality reduction when K ≪ d, as M_3 ∈ R^{d×d×d} and T ∈ R^{K×K×K}. 46/75

106 How to Find Whitening Matrix? Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Goal: a W such that the v_h = W^⊤ a_h are orthogonal. 47/75

107 How to Find Whitening Matrix? Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Goal: a W such that the v_h = W^⊤ a_h are orthogonal. Use the pairwise moments M_2 to find W s.t. W^⊤ M_2 W = I. 47/75

108 How to Find Whitening Matrix? Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Use the pairwise moments M_2 to find W s.t. W^⊤ M_2 W = I: W = U Diag(λ̃^{−1/2}), from the eigendecomposition M_2 = U Diag(λ̃) U^⊤. 47/75

109 How to Find Whitening Matrix? Given M_3 = ∑_h w_h a_h^{⊗3} and M_2 = ∑_h w_h a_h ⊗ a_h. Use M_2 to find W s.t. W^⊤ M_2 W = I: W = U Diag(λ̃^{−1/2}), from the eigendecomposition M_2 = U Diag(λ̃) U^⊤. Then V := W^⊤ A Diag(w)^{1/2} is an orthogonal matrix, and T = M_3(W, W, W) = ∑_h w_h^{−1/2} (√(w_h) W^⊤ a_h)^{⊗3} = ∑_h λ_h v_h^{⊗3}, with λ_h := w_h^{−1/2}. T is an orthogonal tensor. 47/75
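A numpy sketch of the whitening step just described, building exact moments M_2 and M_3 from known (w, A) just to check the claim that the transformed tensor is orthogonal (illustrative; all parameter values are made up).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3
A = np.linalg.qr(rng.normal(size=(d, K)))[0] @ np.diag([1.0, 2.0, 3.0])  # full column rank
w = np.array([0.2, 0.3, 0.5])

M2 = np.einsum('h,ih,jh->ij', w, A, A)             # sum_h w_h a_h ⊗ a_h
M3 = np.einsum('h,ih,jh,kh->ijk', w, A, A, A)      # sum_h w_h a_h ⊗ a_h ⊗ a_h

# Whitening matrix from the top-K eigenpairs of M2, so that W^T M2 W = I_K.
lam, U = np.linalg.eigh(M2)
lam, U = lam[-K:], U[:, -K:]
W = U @ np.diag(lam ** -0.5)

T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)    # T = M3(W, W, W), a K x K x K tensor

V = W.T @ A @ np.diag(w ** 0.5)                    # columns v_h = sqrt(w_h) W^T a_h
assert np.allclose(V.T @ V, np.eye(K), atol=1e-8)  # V is orthogonal, so T is an orthogonal tensor
```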

110 Initial Ideas In many applications, we estimate moments of the form M_3 = ∑_h w_h a_h^{⊗3}, where {a_h}_{h=1}^K are assumed to be linearly independent. If the {a_h}_{h=1}^K were orthonormal, then M_3(I, a_i, a_i) = ∑_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, ∀i, analogous to matrix eigenvectors Mv = M(I, v) = λv; we could define the orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M_3. Two Problems: {a_h}_{h=1}^K is not orthogonal in general. How do we find the eigenvectors of a tensor? 48/75

111 Review: Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, h ≠ h′), find the components/eigenvectors. 49/75

112 Review: Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, h ≠ h′), find the components/eigenvectors. Properties of Matrix Eigenvectors Fixed point of a linear transform: M(I, v_i) = ∑_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i. 49/75

113 Review: Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Properties of Matrix Eigenvectors Fixed point of a linear transform: M(I, v_i) = ∑_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i. Intuitions for the Matrix Power Method 49/75

114 Review: Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Properties of Matrix Eigenvectors Fixed point of a linear transform: M(I, v_i) = ∑_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i. Intuitions for the Matrix Power Method The linear transform preserves the direction of the eigenvectors {v_h}_{h=1}^K. 49/75

115 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, h ≠ h′), find the components/eigenvectors. 50/75

116 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′}, h ≠ h′), find the components/eigenvectors. Properties of Tensor Eigenvectors Fixed point of a bi-linear transform: T(I, v_i, v_i) = ∑_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i. 50/75

117 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Properties of Tensor Eigenvectors Fixed point of a bi-linear transform: T(I, v_i, v_i) = ∑_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i. Intuitions for the Tensor Power Method 50/75

118 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Properties of Tensor Eigenvectors Fixed point of a bi-linear transform: T(I, v_i, v_i) = ∑_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i. Intuitions for the Tensor Power Method The bilinear transform preserves the direction of the eigenvectors {v_h}_{h=1}^K. 50/75

119 Orthogonal Matrix Eigen Decomposition Task: given a matrix M = ∑_{h=1}^K λ_h v_h^{⊗2} with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Algorithm (Matrix Power Method). Require: matrix M ∈ R^{K×K}. Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K w.h.p. 1: for h = 1 : K do 2: sample u_0 uniformly at random from S^{K−1}; 3: for i = 1 : T do 4: u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖; 5: end for; 6: v̂_h ← u_T, λ̂_h ← M(v̂_h, v̂_h); 7: deflate M ← M − λ̂_h v̂_h^{⊗2}; 8: end for. 51/75

120 Orthogonal Matrix Eigen Decomposition Matrix Power Method (as above). Consistency of the Matrix Power Method? Is there convergence, i.e., {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? 51/75

121 Orthogonal Matrix Eigen Decomposition Matrix Power Method (as above). Consistency of the Matrix Power Method? Is there convergence, i.e., {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? Does the convergence depend on the initialization? 51/75

122 Orthogonal Tensor Eigen Decomposition Task: given a tensor T = ∑_{h=1}^K λ_h v_h^{⊗3} with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors. Algorithm (Tensor Power Method). Require: tensor T ∈ R^{K×K×K}. Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K w.h.p. 1: for h = 1 : K do 2: sample u_0 uniformly at random from S^{K−1}; 3: for i = 1 : T do 4: u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖; 5: end for; 6: v̂_h ← u_T, λ̂_h ← T(v̂_h, v̂_h, v̂_h); 7: deflate T ← T − λ̂_h v̂_h^{⊗3}; 8: end for. 52/75

123 Orthogonal Tensor Eigen Decomposition Tensor Power Method (as above). Consistency of the Tensor Power Method? Is there convergence, i.e., {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? Does the convergence depend on the initialization? 52/75
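A compact numpy sketch of the tensor power method with deflation, following the pseudocode above (illustrative; it uses a fixed number of random restarts per component rather than a tuned restart schedule, and keeps the restart with the largest estimated eigenvalue).

```python
import numpy as np

def tensor_power_method(T, K, n_iter=100, n_restarts=10, rng=np.random.default_rng(0)):
    """Decompose an (approximately) orthogonal tensor T ≈ sum_h lam_h v_h^{⊗3}."""
    T = T.copy()
    lams, vecs = [], []
    for _ in range(K):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            u = rng.normal(size=T.shape[0])
            u /= np.linalg.norm(u)
            for _ in range(n_iter):
                u = np.einsum('ijk,j,k->i', T, u, u)        # u <- T(I, u, u)
                u /= np.linalg.norm(u)
            lam = np.einsum('ijk,i,j,k->', T, u, u, u)       # lam <- T(u, u, u)
            if lam > best_lam:                               # keep the best restart
                best_v, best_lam = u, lam
        vecs.append(best_v)
        lams.append(best_lam)
        T = T - best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)  # deflate
    return np.array(lams), np.array(vecs)

# Example on an exactly orthogonal tensor: recovers (lam_h, v_h) up to ordering.
V = np.linalg.qr(np.random.default_rng(1).normal(size=(5, 3)))[0]
T = np.einsum('h,ih,jh,kh->ijk', np.array([3., 2., 1.]), V, V, V)
lams, vecs = tensor_power_method(T, K=3)
```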

124 Analysis of Consistency of Matrix Power Method Order the eigenvectors {v_h}_{h=1}^K so that the corresponding eigenvalues satisfy λ_1 ≥ λ_2 ≥ ... ≥ λ_K. Project the initial point u_0 onto the eigenvectors {v_h}_{h=1}^K: c_h = ⟨u_0, v_h⟩, ∀h. Convergence properties: Unique (identifiable) iff the {λ_h}_{h=1}^K are distinct. If the gap λ_2/λ_1 < 1 and c_1 ≠ 0, the matrix power method converges to v_1. Converges linearly to v_1 assuming the gap λ_2/λ_1 < 1. The linear transform satisfies M(I, u_0) = ∑_h λ_h (v_h^⊤ u_0) v_h = ∑_h λ_h c_h v_h, i.e., the projection in the v_h direction is scaled by λ_h. After t iterations, 1 − ⟨v_1, u_t⟩² ≲ K (λ_2/λ_1)^{2t}. 53/75

125 Analysis of Consistency of Tensor Power Method Project the initial point u_0 onto the eigenvectors: c_h = ⟨u_0, v_h⟩, ∀h. Order the eigenvectors {v_h}_{h=1}^K so that λ_1|c_1| > λ_2|c_2| ≥ ... ≥ λ_K|c_K|. Convergence properties: Identifiable iff the {λ_h c_h}_{h=1}^K are distinct. Initialization dependent: if the gap λ_2|c_2|/(λ_1|c_1|) < 1 and λ_1 c_1 ≠ 0, the tensor power method converges to v_1. Note v_1 is NOT necessarily the top eigenvector. Converges quadratically to v_1 assuming the gap λ_2|c_2|/(λ_1|c_1|) < 1. The bi-linear transform satisfies T(I, u_0, u_0) = ∑_h λ_h (v_h^⊤ u_0)² v_h = ∑_h λ_h c_h² v_h, i.e., the projection in the v_h direction is squared and then scaled by λ_h. After t iterations, 1 − ⟨v_1, u_t⟩² ≲ K max_{i≠1}(λ_i/λ_1)² · |λ_2 c_2/(λ_1 c_1)|^{2^{t+1}}. 54/75

126 Matrix vs. tensor power iteration Matrix power iteration: Tensor power iteration: 55/75

127 Matrix vs. tensor power iteration Matrix power iteration: 1. Requires a gap between the largest and second-largest eigenvalues; a property of the matrix only. Tensor power iteration: 1. Requires a gap between the largest and second-largest |λ_h c_h|; a property of the tensor and of the initialization u_0. 55/75

128 Matrix vs. tensor power iteration Matrix power iteration: 1. Requires a gap between the largest and second-largest eigenvalues; a property of the matrix only. 2. Converges to the top eigenvector. Tensor power iteration: 1. Requires a gap between the largest and second-largest |λ_h c_h|; a property of the tensor and of the initialization u_0. 2. Converges to the v_h with the largest |λ_h c_h|; not necessarily the top eigenvector. 55/75

129 Matrix vs. tensor power iteration Matrix power iteration: 1. Requires a gap between the largest and second-largest eigenvalues; a property of the matrix only. 2. Converges to the top eigenvector. 3. Linear convergence; needs O(log(1/ǫ)) iterations. Tensor power iteration: 1. Requires a gap between the largest and second-largest |λ_h c_h|; a property of the tensor and of the initialization u_0. 2. Converges to the v_h with the largest |λ_h c_h|; not necessarily the top eigenvector. 3. Quadratic convergence; needs O(log log(1/ǫ)) iterations. 55/75

130 Spurious Eigenvectors for Tensor Eigen Decomposition T = ∑_{h∈[K]} λ_h v_h^{⊗3}. Characterization of eigenvectors: T(I, v, v) = λv? {v_h}_{h=1}^K are eigenvectors, since T(I, v_h, v_h) = λ_h v_h. 56/75

131 Spurious Eigenvectors for Tensor Eigen Decomposition T = ∑_{h∈[K]} λ_h v_h^{⊗3}. {v_h}_{h=1}^K are eigenvectors, since T(I, v_h, v_h) = λ_h v_h. Bad news: there can be other eigenvectors (unlike the matrix case). E.g., when λ_h ≡ 1, v = (v_1 + v_2)/√2 satisfies T(I, v, v) = (1/√2) v. 56/75

132 Spurious Eigenvectors for Tensor Eigen Decomposition Bad news: there can be other eigenvectors (unlike the matrix case), e.g., v = (v_1 + v_2)/√2 when λ_h ≡ 1. How do we avoid spurious solutions (eigenvectors that are not components {v_h}_{h=1}^K)? 56/75

133 Spurious Eigenvectors for Tensor Eigen Decomposition How do we avoid spurious solutions (eigenvectors that are not components {v_h}_{h=1}^K)? The optimization viewpoint of tensor eigen decomposition will help. 56/75

134 Spurious Eigenvectors for Tensor Eigen Decomposition How do we avoid spurious solutions? The optimization viewpoint of tensor eigen decomposition will help: all spurious eigenvectors are saddle points. 56/75

135 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition 57/75

136 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition Optimization Problem Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤ v − 1). Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤ v − 1). 58/75

137 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition Optimization Problem Matrix: max_v M(v, v) s.t. ‖v‖ = 1, with Lagrangian L(v, λ) := M(v, v) − λ(v^⊤ v − 1). Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1, with Lagrangian L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤ v − 1). Non-convex: stationary points = {global optima, local optima, saddle points}. 58/75

138 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition Non-convex: stationary points = {global optima, local optima, saddle points}. Stationary points: first derivative ∇_v L(v, λ) = 0. Matrix: ∇_v L(v, λ) = 2(M(I, v) − λv) = 0, so eigenvectors are stationary points; the power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent. Tensor: ∇_v L(v, λ) = 3(T(I, v, v) − λv) = 0, so eigenvectors are stationary points; the power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent. 58/75

139 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition Stationary points: ∇_v L(v, λ) = 0; eigenvectors are stationary points, and the power method is a version of gradient ascent. Local optima: w^⊤ ∇² L(v, λ) w < 0 for all w ⊥ v at a stationary point v. Matrix: v_1 is the only local optimum; all other eigenvectors are saddle points. Tensor: {v_h}_{h=1}^K are the only local optima; all spurious eigenvectors are saddle points. 58/75

140 Question: What about performance under noise? 59/75

141 Tensor Perturbation Analysis T̂ = T + E, T = ∑_h λ_h v_h^{⊗3}, ‖E‖ := max_{x: ‖x‖=1} |E(x, x, x)| ≤ ǫ. 60/75

142 Tensor Perturbation Analysis T̂ = T + E, T = ∑_h λ_h v_h^{⊗3}, ‖E‖ ≤ ǫ. Theorem: let T be the number of iterations. If T ≳ log K + log log(λ_max/ǫ) and ǫ < λ_min/K, then the output (v, λ) (after polynomially many restarts) satisfies ‖v − v_1‖ ≤ O(ǫ/λ_1) and |λ − λ_1| ≤ O(ǫ), where v_1 is such that λ_1|c_1| > λ_2|c_2| ≥ ..., c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer. 60/75

143 Tensor Perturbation Analysis Theorem (as above): with T ≳ log K + log log(λ_max/ǫ) iterations and ǫ < λ_min/K, the output satisfies ‖v − v_1‖ ≤ O(ǫ/λ_1) and |λ − λ_1| ≤ O(ǫ). A careful analysis of the deflation avoids a buildup of errors. This implies polynomial sample complexity for learning. 60/75

144 Other tensor decomposition techniques 61/75

145 Orthogonal Tensor Decomposition Simultaneous Power Method (Wang & Lu, 2017): simultaneous recovery of eigenvectors; initialization is not optimal. Orthogonalized Simultaneous Alternating Least Squares (Sharan & Valiant, 2017): random initialization; proven convergence for symmetric tensors. Initialization: SVD-based initialization (Anandkumar & Janzamin, 2014); state-of-the-art (trace-based) initialization (Li & Huang, 2018). 62/75

146 Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 63/75

147 Neural Network - Nonlinear Function Approximation Success of Deep Neural Networks Image classification, speech recognition, text processing. Enabled by growth in computation power and enormous labeled data. 64/75

148 Neural Network - Nonlinear Function Approximation Success of Deep Neural Networks Image classification, speech recognition, text processing. Enabled by growth in computation power and enormous labeled data. Expressive Power Linear composition vs. nonlinear composition; shallow networks vs. deep structures. 64/75

149 Revolution of Depth [Chart: ImageNet classification top-5 error (%) by year and network depth, from shallow models (ILSVRC'10-'11) to AlexNet (8 layers, ILSVRC'12), VGG (19 layers, ILSVRC'14), GoogleNet (ILSVRC'14), and ResNet (ILSVRC'15).] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.

150 Revolution of Depth AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2 → 5x5 conv, 256, pool/2 → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256, pool/2 → fc 4096 → fc 4096 → fc 1000. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.

151 Revolution of Depth AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2 → 5x5 conv, 256, pool/2 → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256, pool/2 → fc 4096 → fc 4096 → fc 1000. VGG, 19 layers (ILSVRC 2014): 3x3 conv, 64 → 3x3 conv, 64, pool/2 → 3x3 conv, 128 → 3x3 conv, 128, pool/2 → 3x3 conv, 256 (x4), pool/2 → 3x3 conv, 512 (x4), pool/2 → 3x3 conv, 512 (x4), pool/2 → fc 4096 → fc 4096 → fc 1000. GoogleNet, 22 layers (ILSVRC 2014). Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.


More information

Appendix A. Proof to Theorem 1

Appendix A. Proof to Theorem 1 Appendix A Proof to Theorem In this section, we prove the sample complexity bound given in Theorem The proof consists of three main parts In Appendix A, we prove perturbation lemmas that bound the estimation

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa VS model in practice Document and query are represented by term vectors Terms are not necessarily orthogonal to each other Synonymy: car v.s. automobile Polysemy:

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Fourier PCA. Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech)

Fourier PCA. Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech) Fourier PCA Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech) Introduction 1. Describe a learning problem. 2. Develop an efficient tensor decomposition. Independent component

More information

Computational Lower Bounds for Statistical Estimation Problems

Computational Lower Bounds for Statistical Estimation Problems Computational Lower Bounds for Statistical Estimation Problems Ilias Diakonikolas (USC) (joint with Daniel Kane (UCSD) and Alistair Stewart (USC)) Workshop on Local Algorithms, MIT, June 2018 THIS TALK

More information

Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints

Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints Anima Anandkumar U.C. Irvine Joint work with Daniel Hsu, Majid Janzamin Adel Javanmard and Sham Kakade.

More information

STATS 306B: Unsupervised Learning Spring Lecture 13 May 12

STATS 306B: Unsupervised Learning Spring Lecture 13 May 12 STATS 306B: Unsupervised Learning Spring 2014 Lecture 13 May 12 Lecturer: Lester Mackey Scribe: Jessy Hwang, Minzhe Wang 13.1 Canonical correlation analysis 13.1.1 Recap CCA is a linear dimensionality

More information

Efficient Spectral Methods for Learning Mixture Models!

Efficient Spectral Methods for Learning Mixture Models! Efficient Spectral Methods for Learning Mixture Models Qingqing Huang 2016 February Laboratory for Information & Decision Systems Based on joint works with Munther Dahleh, Rong Ge, Sham Kakade, Greg Valiant.

More information

The Informativeness of k-means for Learning Mixture Models

The Informativeness of k-means for Learning Mixture Models The Informativeness of k-means for Learning Mixture Models Vincent Y. F. Tan (Joint work with Zhaoqiang Liu) National University of Singapore June 18, 2018 1/35 Gaussian distribution For F dimensions,

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics)

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics) COMS 4771 Lecture 1 1. Course overview 2. Maximum likelihood estimation (review of some statistics) 1 / 24 Administrivia This course Topics http://www.satyenkale.com/coms4771/ 1. Supervised learning Core

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning

Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning David Steurer Cornell Approximation Algorithms and Hardness, Banff, August 2014 for many problems (e.g., all UG-hard ones): better guarantees

More information

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010 Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same

More information

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization Tim Roughgarden February 28, 2017 1 Preamble This lecture fulfills a promise made back in Lecture #1,

More information

Learning Tractable Graphical Models: Latent Trees and Tree Mixtures

Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Anima Anandkumar U.C. Irvine Joint work with Furong Huang, U.N. Niranjan, Daniel Hsu and Sham Kakade. High-Dimensional Graphical Modeling

More information

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization Jialin Dong ShanghaiTech University 1 Outline Introduction FourVignettes: System Model and Problem Formulation Problem Analysis

More information

STATS 306B: Unsupervised Learning Spring Lecture 3 April 7th

STATS 306B: Unsupervised Learning Spring Lecture 3 April 7th STATS 306B: Unsupervised Learning Spring 2014 Lecture 3 April 7th Lecturer: Lester Mackey Scribe: Jordan Bryan, Dangna Li 3.1 Recap: Gaussian Mixture Modeling In the last lecture, we discussed the Gaussian

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

Sketching for Large-Scale Learning of Mixture Models

Sketching for Large-Scale Learning of Mixture Models Sketching for Large-Scale Learning of Mixture Models Nicolas Keriven Université Rennes 1, Inria Rennes Bretagne-atlantique Adv. Rémi Gribonval Outline Introduction Practical Approach Results Theoretical

More information

HOSTOS COMMUNITY COLLEGE DEPARTMENT OF MATHEMATICS

HOSTOS COMMUNITY COLLEGE DEPARTMENT OF MATHEMATICS HOSTOS COMMUNITY COLLEGE DEPARTMENT OF MATHEMATICS MAT 217 Linear Algebra CREDIT HOURS: 4.0 EQUATED HOURS: 4.0 CLASS HOURS: 4.0 PREREQUISITE: PRE/COREQUISITE: MAT 210 Calculus I MAT 220 Calculus II RECOMMENDED

More information

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa Recap: vector space model Represent both doc and query by concept vectors Each concept defines one dimension K concepts define a high-dimensional space Element

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY OUTLINE Why We Need Matrix Decomposition SVD (Singular Value Decomposition) NMF (Nonnegative

More information

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

More information

Gaussian Mixture Models

Gaussian Mixture Models Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 22 1 / 21 Overview

More information

Faloutsos, Tong ICDE, 2009

Faloutsos, Tong ICDE, 2009 Large Graph Mining: Patterns, Tools and Case Studies Christos Faloutsos Hanghang Tong CMU Copyright: Faloutsos, Tong (29) 2-1 Outline Part 1: Patterns Part 2: Matrix and Tensor Tools Part 3: Proximity

More information

Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints

Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints Animashree Anandkumar 1, Daniel Hsu 2, Adel Javanmard 3, and Sham M. Kakade 2 1 Department of EECS, University of California,

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences

Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences Chi Jin UC Berkeley chijin@cs.berkeley.edu Yuchen Zhang UC Berkeley yuczhang@berkeley.edu Sivaraman

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

A Practical Algorithm for Topic Modeling with Provable Guarantees

A Practical Algorithm for Topic Modeling with Provable Guarantees 1 A Practical Algorithm for Topic Modeling with Provable Guarantees Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu Reviewed by Zhao Song December

More information

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,

More information

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto

Unsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian

More information

ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018

ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018 ELE 538B: Mathematics of High-Dimensional Data Spectral methods Yuxin Chen Princeton University, Fall 2018 Outline A motivating application: graph clustering Distance and angles between two subspaces Eigen-space

More information

Provable Alternating Minimization Methods for Non-convex Optimization

Provable Alternating Minimization Methods for Non-convex Optimization Provable Alternating Minimization Methods for Non-convex Optimization Prateek Jain Microsoft Research, India Joint work with Praneeth Netrapalli, Sujay Sanghavi, Alekh Agarwal, Animashree Anandkumar, Rashish

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Chapter 7: Symmetric Matrices and Quadratic Forms

Chapter 7: Symmetric Matrices and Quadratic Forms Chapter 7: Symmetric Matrices and Quadratic Forms (Last Updated: December, 06) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). A few theorems have been moved

More information

1 Singular Value Decomposition and Principal Component

1 Singular Value Decomposition and Principal Component Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora Scribe: Today we continue the

More information

Overlapping Variable Clustering with Statistical Guarantees and LOVE

Overlapping Variable Clustering with Statistical Guarantees and LOVE with Statistical Guarantees and LOVE Department of Statistical Science Cornell University WHOA-PSI, St. Louis, August 2017 Joint work with Mike Bing, Yang Ning and Marten Wegkamp Cornell University, Department

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

1 Matrix notation and preliminaries from spectral graph theory

1 Matrix notation and preliminaries from spectral graph theory Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.

More information

Review problems for MA 54, Fall 2004.

Review problems for MA 54, Fall 2004. Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on

More information

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October

Finding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Machine learning for pervasive systems Classification in high-dimensional spaces

Machine learning for pervasive systems Classification in high-dimensional spaces Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

https://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:

More information

Estimating Covariance Using Factorial Hidden Markov Models

Estimating Covariance Using Factorial Hidden Markov Models Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu

More information

Compressed Sensing and Neural Networks

Compressed Sensing and Neural Networks and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications

More information

THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING

THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING Luis Rademacher, Ohio State University, Computer Science and Engineering. Joint work with Mikhail Belkin and James Voss This talk A new approach to multi-way

More information

Lecture 8: Linear Algebra Background

Lecture 8: Linear Algebra Background CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected

More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS

METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS Y. Cem Subakan, Johannes Traa, Paris Smaragdis,,, Daniel Hsu UIUC Computer Science Department, Columbia University Computer Science Department

More information

Scaling Neighbourhood Methods

Scaling Neighbourhood Methods Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)

More information

Matrix and Tensor Factorization from a Machine Learning Perspective

Matrix and Tensor Factorization from a Machine Learning Perspective Matrix and Tensor Factorization from a Machine Learning Perspective Christoph Freudenthaler Information Systems and Machine Learning Lab, University of Hildesheim Research Seminar, Vienna University of

More information

EM for Spherical Gaussians

EM for Spherical Gaussians EM for Spherical Gaussians Karthekeyan Chandrasekaran Hassan Kingravi December 4, 2007 1 Introduction In this project, we examine two aspects of the behavior of the EM algorithm for mixtures of spherical

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2 STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

CS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery

CS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery CS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery Tim Roughgarden & Gregory Valiant May 3, 2017 Last lecture discussed singular value decomposition (SVD), and we

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information