Guaranteed Learning of Latent Variable Models through Tensor Methods
Furong Huang, University of Maryland. ACM SIGMETRICS Tutorial.
Tutorial Topic
Learning algorithms (parameter estimation) for latent variable models, based on decompositions of moment tensors.
[Figure: unlabeled data → latent variable model (choice variable h over topics k1–k5, topic-word matrix A, words: life, gene, data, DNA, RNA) → tensor decomposition → inference.]
Method-of-moments (Pearson, 1894).
Application 1: Clustering
Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group.
Probabilistic/latent variable viewpoint: the groups represent different distributions (e.g., Gaussians), and each data point is drawn from one of them (e.g., Gaussian mixtures).
Application 2: Topic Modeling
Document modeling. Observed: words in a document corpus. Hidden: topics. Goal: document summarization.
Application 3: Understanding Human Communities
Social networks. Observed: network of social ties, e.g., friendships, co-authorships. Hidden: groups/communities of social actors.
Application 4: Recommender Systems
Observed: ratings of users for various products, e.g., Yelp reviews. Goal: predict new recommendations. Modeling: find groups/communities of users and products.
Application 5: Feature Learning
Feature engineering: learn good features/representations for classification tasks, e.g., image and speech recognition. Sparse representations, low-dimensional hidden structures.
Application 6: Computational Biology
Observed: gene expression levels. Goal: discover gene groups. Hidden variables: regulators controlling gene groups.
Application 7: Human Disease Hierarchy Discovery
CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases.
Scalable Latent Tree Model and its Application to Health Analytics, by F. Huang, N. U. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop.
How to model hidden effects?
Basic approach: mixtures/clusters; the hidden variable h is categorical.
Advanced: probabilistic models; the hidden variable h has more general distributions and can model mixed memberships. [Graphical model: hidden nodes h_1, h_2, h_3 over observed nodes x_1, ..., x_5.]
This talk: the basic mixture model and some advanced models.
Challenges in Learning
Basic goal in all of the applications above: discover hidden structure in data, i.e., unsupervised learning.
Challenge: conditions for identifiability. Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?
Challenge: efficient learning of latent variable models. MCMC uses random sampling and is slow (exponential mixing time); maximum likelihood is non-convex and not scalable (exponentially many critical points). Can we achieve efficient computational and sample complexities?
Guaranteed and efficient learning through spectral methods.
What this tutorial will cover
Outline
1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents
  Identifiability
  Parameter recovery via decomposition of exact moments
5 Error-tolerant Algorithms for Tensor Decompositions
  Decomposition for tensors with linearly independent components
  Decomposition for tensors with orthogonal components
6 Tensor Decomposition for Neural Network Compression
7 Conclusion
Outline
1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents
5 Algorithms for Tensor Decompositions
6 Tensor Decomposition for Neural Network Compression
7 Conclusion
Gaussian Mixture Model
Generative model: samples are drawn from a mixture of K Gaussians. A component is chosen according to Cat(π_1, ..., π_K), and the sample is drawn from the chosen Gaussian N(μ_h, Σ_h), h ∈ [K]:
H ~ Cat(π_1, ..., π_K),  X | H = h ~ N(μ_h, Σ_h).
Learning problem: estimate the mean vector μ_h, covariance matrix Σ_h, and mixing weight π_h of each subpopulation from unlabeled data.
Maximum Likelihood Estimator (MLE)
Data {x_i}_{i=1}^n. Likelihood (iid): Pr_θ(data) = Π_{i=1}^n Pr_θ(x_i).
Model parameter estimation: θ_mle := argmax_{θ ∈ Θ} log Pr_θ(data).
Latent variable models: some variables are hidden, so there are no direct estimators; local optimization via Expectation-Maximization (EM) (Dempster, Laird, & Rubin, 1977).
MLE for Gaussian Mixture Models
Given data {x_i}_{i=1}^n and the number of Gaussian components K, the model parameters to be estimated are θ = {(μ_h, Σ_h, π_h)}_{h=1}^K.
θ_mle for Gaussian mixture models (up to the normalizing constant):
θ_mle := argmax_θ Σ_{i=1}^n log( Σ_{h=1}^K π_h det(Σ_h)^{-1/2} exp( -(1/2) (x_i - μ_h)^T Σ_h^{-1} (x_i - μ_h) ) ).
Solving the MLE is NP-hard (Dasgupta, 2008; Aloise, Deshpande, Hansen, & Popat, 2009; Mahajan, Nimbhorkar, & Varadarajan, 2009; Vattani, 2009; Awasthi, Charikar, Krishnaswamy, & Sinop, 2015).
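The objective above is straightforward to evaluate even though maximizing it is NP-hard. A minimal numpy sketch of the evaluation (the (2π)^{-d/2} constant omitted on the slide is included here; parameter values in the signature are placeholders):

```python
import numpy as np

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Evaluate sum_i log sum_h pi_h N(x_i; mu_h, Sigma_h).

    X: (n, d) data; pis: (K,) mixing weights;
    mus: (K, d) means; Sigmas: (K, d, d) covariances.
    """
    n, d = X.shape
    per_component = []
    for pi, mu, Sigma in zip(pis, mus, Sigmas):
        diff = X - mu
        _, logdet = np.linalg.slogdet(Sigma)
        # Mahalanobis distance (x_i - mu)^T Sigma^{-1} (x_i - mu) for each sample
        maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
        per_component.append(np.log(pi) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    L = np.stack(per_component, axis=1)   # (n, K): log of pi_h * N(x_i; mu_h, Sigma_h)
    m = L.max(axis=1)                     # log-sum-exp over components for stability
    return float(np.sum(m + np.log(np.exp(L - m[:, None]).sum(axis=1))))
```

Because the inner sum over components sits inside the logarithm, this objective is non-convex in θ, which is the source of the local-maxima issues discussed next.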
Definition: Consistent Estimator
Suppose iid samples {x_i}_{i=1}^n are generated by a distribution Pr_θ(x) whose model parameters θ ∈ Θ are unknown. An estimator θ̂ is consistent if E‖θ̂ − θ‖ → 0 as n → ∞.
Spherical Gaussian mixtures (Σ_h = I, as n → ∞):
For K = 2 and π_h = 1/2, EM is consistent (Xu, H., & Maleki, 2016; Daskalakis, Tzamos, & Zampetakis, 2016).
For larger K, EM is easily trapped in local maxima far from the global maximum (Jin, Zhang, Balakrishnan, Wainwright, & Jordan, 2016).
Practitioners often run EM with many (random) restarts, but it may take a long time to get near the global maximum.
Hardness of Parameter Estimation
It can be exponentially difficult, computationally or statistically, to learn model parameters, even in the parametric setting.
Cryptographic hardness: e.g., Mossel & Roch, 2006. Information-theoretic hardness: e.g., Moitra & Valiant, 2010.
Learning may require 2^{Ω(K)} running time or 2^{Ω(K)} sample size.
Ways Around the Hardness
Separation conditions. E.g., assume min_{i≠j} ‖μ_i − μ_j‖_2 / √(σ_i² + σ_j²) is sufficiently large. (Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2002; ...)
Structural assumptions. E.g., assume sparsity or separability (anchor words). (Spielman, Wang & Wright, 2012; Arora, Ge & Moitra, 2012; ...)
Non-degeneracy conditions. E.g., assume μ_1, ..., μ_K span a K-dimensional space.
This tutorial: statistically and computationally efficient learning algorithms for non-degenerate instances via the method-of-moments.
Outline
1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents
5 Algorithms for Tensor Decompositions
6 Tensor Decomposition for Neural Network Compression
7 Conclusion
Method-of-Moments At A Glance
1 Determine functions of the model parameters θ that can be estimated from observable data: moments E_θ[f(X)].
2 Form estimates of the moments using data (iid samples {x_i}_{i=1}^n): empirical moments Ê[f(X)].
3 Solve the approximate equations for the parameters θ: moment matching E_θ[f(X)] ≈ Ê[f(X)].
Toy example: how do we estimate a Gaussian, i.e., (μ, σ), given iid samples {x_i}_{i=1}^n ~ N(μ, σ²)?
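The three steps above can be sketched for the toy example. Assuming hypothetical true values (μ, σ) = (2, 3), we match the first two moments E[X] = μ and E[X²] = μ² + σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 3.0
x = rng.normal(mu_true, sigma_true, size=200_000)

# Steps 1-2: empirical versions of E[X] = mu and E[X^2] = mu^2 + sigma^2
m1_hat = x.mean()
m2_hat = (x ** 2).mean()

# Step 3: solve the moment equations for (mu, sigma)
mu_hat = m1_hat
sigma_hat = np.sqrt(m2_hat - m1_hat ** 2)
```

With n = 200,000 samples the estimates land within a few hundredths of the true values, illustrating the expected n^{-1/2} accuracy of moment matching.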
What is a tensor?
A multi-dimensional array; a tensor is a higher-order analogue of a matrix. The number of dimensions is called the tensor order.
Tensor Product
[a ⊗ b]_{i1,i2} = a_{i1} b_{i2}: a rank-1 matrix.
[a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}: a rank-1 tensor.
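These outer products can be checked entrywise in a few lines; a small sketch with arbitrary example vectors:

```python
import numpy as np

a, b, c = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])

A = np.outer(a, b)                    # [a ⊗ b], a rank-1 matrix
T = np.einsum('i,j,k->ijk', a, b, c)  # [a ⊗ b ⊗ c], a rank-1 third-order tensor
```

Each entry of `A` is `a[i1] * b[i2]` and each entry of `T` is `a[i1] * b[i2] * c[i3]`, matching the definitions above.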
Slices
Horizontal slices, lateral slices, frontal slices.
Fibers
Mode-1 (column) fibers, mode-2 (row) fibers, mode-3 (tube) fibers.
CP Decomposition
X = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h.
Rank: the minimum number of rank-1 tensors whose sum generates the tensor.
Multi-linear Transform
If T = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h, a multi-linear operation using matrices (X, Y, Z) is
T(X, Y, Z) := Σ_{h=1}^R (X^T a_h) ⊗ (Y^T b_h) ⊗ (Z^T c_h).
Similarly, a multi-linear operation using vectors (x, y, z) is
T(x, y, z) := Σ_{h=1}^R ⟨x, a_h⟩ ⟨y, b_h⟩ ⟨z, c_h⟩.
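A sketch verifying the definition numerically: the contraction form of T(X, Y, Z) must agree with the factor form Σ_h (X^T a_h) ⊗ (Y^T b_h) ⊗ (Z^T c_h) for a synthetic rank-R tensor (dimensions here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, R, k = 5, 3, 2
A, B, C = (rng.standard_normal((d, R)) for _ in range(3))

# T = sum_h a_h ⊗ b_h ⊗ c_h
T = np.einsum('ih,jh,kh->ijk', A, B, C)

def multilinear(T, X, Y, Z):
    """T(X, Y, Z)_{pqr} = sum_{ijk} T_{ijk} X_{ip} Y_{jq} Z_{kr}."""
    return np.einsum('ijk,ip,jq,kr->pqr', T, X, Y, Z)

X, Y, Z = (rng.standard_normal((d, k)) for _ in range(3))
lhs = multilinear(T, X, Y, Z)
rhs = np.einsum('ph,qh,rh->pqr', X.T @ A, Y.T @ B, Z.T @ C)
```

The vector case T(x, y, z) is the special case where X, Y, Z each have a single column, giving a scalar.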
Tensors in the Method of Moments
Matrix: pairwise relationships. For an observed signal or data point x ∈ R^d, [x ⊗ x]_{i,j} = x_i x_j is a rank-1 matrix; the aggregated pairwise relationship is M_2 = E[x ⊗ x].
Tensor: triple-wise (or higher-order) relationships. [x ⊗ x ⊗ x]_{i,j,k} = x_i x_j x_k is a rank-1 tensor; the aggregated triple-wise relationship is M_3 = E[x ⊗ x ⊗ x] = E[x^{⊗3}].
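The aggregated moments can be accumulated directly from samples. A sketch with standard-normal data, for which M_2 = I and M_3 = 0 (a convenient check, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 10_000, 4
x = rng.standard_normal((n, d))

M2_hat = np.einsum('ni,nj->ij', x, x) / n         # empirical E[x ⊗ x]
M3_hat = np.einsum('ni,nj,nk->ijk', x, x, x) / n  # empirical E[x ⊗ x ⊗ x]
```

With n = 10,000 samples, `M2_hat` is close to the identity and every entry of `M3_hat` is close to zero, up to the usual n^{-1/2} sampling error.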
Why are tensors powerful?
Matrix orthogonal decomposition is not unique without an eigenvalue gap:
[[1, 0], [0, 1]] = e_1 e_1^T + e_2 e_2^T = u_1 u_1^T + u_2 u_2^T, where u_1 = [√2/2, √2/2], u_2 = [√2/2, −√2/2].
It is unique with an eigenvalue gap.
Tensor orthogonal decomposition (Harshman, 1970) is unique: no eigenvalue gap is needed, because a slice of the tensor has an eigenvalue gap.
Outline
1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents
5 Algorithms for Tensor Decompositions
6 Tensor Decomposition for Neural Network Compression
7 Conclusion
Topic Modeling
General topic model (e.g., Latent Dirichlet Allocation): K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K; a hidden topic proportion w^{(i)} ∈ Δ^{K−1} per document i; each document is an iid mixture of topics. [Figure: topic-word matrix and word counts per document for topics such as sports, science, business, politics.]
Topic Model for Single-topic Documents
K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K; a hidden topic w^{(i)} ∈ {e_1, ..., e_K} per document i; the words of the document are drawn iid from a_h.
Model Parameters of the Topic Model for Single-topic Documents
Topic proportion w = [w_1, ..., w_K], where w_h = P[topic = h].
Topic-word matrix A = [a_1, ..., a_K], where A_{jh} = P[word = e_j | topic = h].
Goal: estimate the model parameters {(a_h, w_h)}_{h=1}^K given iid samples of n documents (word counts {c^{(i)}}_{i=1}^n). The frequency vector is x^{(i)} = c^{(i)} / L, where L = Σ_j c_j^{(i)} is the document length.
Moment Matching
Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ~ Cat(w_1, ..., w_K); generate L words iid from a_h.
E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h, so
E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h.
Identifiability: how long must the documents be?
M_1: distribution of words (M̂_1: occurrence frequency of words).
M_1 = E[x] = Σ_h w_h a_h;  M̂_1 = (1/n) Σ_{i=1}^n x^{(i)}.
[Example: a word distribution over {campus, police, witness} is a weighted sum of education, sports, and crime topic vectors.]
There is no unique decomposition of vectors: first-order moments do not identify the model.
M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs).
M_2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂_2 = (1/n) Σ_{i=1}^n x^{(i)} ⊗ x^{(i)}.
Matrix decomposition recovers the topic subspace, but not the actual model.
Find a whitening matrix W such that the transformed components v_h = W^T a_h are orthogonal. Many such W's exist; find one (e.g., from M_2).
Know a W such that v_h = W^T a_h are orthogonal.
M_3: distribution of word triples (M̂_3: co-occurrence of word triples).
M_3 = E[x^{⊗3}] = Σ_h w_h a_h^{⊗3};  M̂_3 = (1/n) Σ_{i=1}^n (x^{(i)})^{⊗3}.
Orthogonalize the tensor by projecting the data with W: M_3(W, W, W).
M_3(W, W, W) = E[(W^T x)^{⊗3}] = Σ_h w_h (W^T a_h)^{⊗3};  M̂_3(W, W, W) = (1/n) Σ_{i=1}^n (W^T x^{(i)})^{⊗3}.
A unique orthogonal tensor decomposition yields {v_h}_{h=1}^K.
Model parameter estimation: un-project with the pseudo-inverse, â_h = (W^T)^† v̂_h.
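The whitening-then-unprojection steps above can be sketched on synthetic parameters (the specific A, w, and the eigendecomposition-based choice of W below are illustrative assumptions): W built from the rank-K eigendecomposition of M_2 satisfies W^T M_2 W = I, the scaled components √w_h W^T a_h come out orthonormal, and (W^T)^† undoes the projection.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 6, 3
A = rng.random((d, K)) + 0.1
A /= A.sum(axis=0)                       # columns a_h are word distributions
w = np.array([0.5, 0.3, 0.2])            # topic proportions

M2 = A @ np.diag(w) @ A.T                # M2 = sum_h w_h a_h ⊗ a_h, rank K

# Whitening from the rank-K eigendecomposition of M2: W = U D^{-1/2}
vals, vecs = np.linalg.eigh(M2)
U, D = vecs[:, -K:], vals[-K:]           # top-K eigenpairs (eigh sorts ascending)
W = U / np.sqrt(D)

# The scaled whitened components sqrt(w_h) W^T a_h are orthonormal
V = (W.T @ A) * np.sqrt(w)

# Un-project with the pseudo-inverse: (W^T)^+ (W^T a_h) recovers a_h
A_rec = np.linalg.pinv(W.T) @ (W.T @ A)
```

The un-projection works because the columns of A lie in the column space of M_2, which is exactly the space that W^T preserves.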
With documents of length L ≥ 3, topic models can be learned through matrix/tensor decomposition.
Take-Away Message
Consider topic models whose word distributions under different topics are linearly independent.
The parameters of the topic model for single-topic documents can be efficiently recovered from the distribution of three-word documents.
Distribution of word triples: M_3 = E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h (M̂_3: co-occurrence of word triples).
Two-word documents are not sufficient for identifiability.
Tensor Methods Compared with Variational Inference
Learning topics from PubMed on Spark (8 million documents): [plot of perplexity vs. running time, tensor vs. variational].
Learning communities from graph connectivity (Facebook: n ≈ 20k; Yelp: n ≈ 40k; DBLPsub: n ≈ 0.1M; DBLP: n ≈ 1M): [plots of error per group and running times on each dataset].
Tensor methods are orders of magnitude faster and more accurate.
Online Tensor Methods for Learning Latent Variable Models, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014.
Tensor Methods on Apache Spark, F. Huang, A. Anandkumar, Oct.
Outline
1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents
5 Algorithms for Tensor Decompositions
6 Tensor Decomposition for Neural Network Compression
7 Conclusion
Jennrich's Algorithm (Simplified)
Task: given a tensor T = Σ_{h=1}^K μ_h^{⊗3} with linearly independent components {μ_h}_{h=1}^K, find the components (up to scaling).
Properties of tensor slices: a linear combination of slices is T(I, I, c) = Σ_h ⟨μ_h, c⟩ μ_h μ_h^T.
Intuitions for Jennrich's algorithm: linear combinations of slices of the tensor share the same set of eigenvectors, and the shared eigenvectors are the tensor components {μ_h}_{h=1}^K.
Algorithm: Jennrich's Algorithm
Require: tensor T ∈ R^{d×d×d}.
Ensure: components {μ̂_h}_{h=1}^K = {μ_h}_{h=1}^K (a.s.).
1: Sample c and c′ independently and uniformly at random from S^{d−1}.
2: Return {μ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†.
Consistency of Jennrich's algorithm: do the estimators {μ̂_h}_{h=1}^K recover the unknown components {μ_h}_{h=1}^K (up to scaling)?
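The two steps above can be sketched on a synthetic exact tensor (random components, which are linearly independent almost surely; `np.linalg.pinv` plays the role of the pseudo-inverse †):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 4
U = rng.standard_normal((d, K))            # components mu_h, linearly independent (a.s.)
T = np.einsum('ih,jh,kh->ijk', U, U, U)    # T = sum_h mu_h ⊗ mu_h ⊗ mu_h

# Step 1: random contraction vectors on the sphere
c = rng.standard_normal(d);  c /= np.linalg.norm(c)
cp = rng.standard_normal(d); cp /= np.linalg.norm(cp)

Tc = np.einsum('ijk,k->ij', T, c)          # slice combination T(I, I, c)
Tcp = np.einsum('ijk,k->ij', T, cp)        # T(I, I, c')

# Step 2: eigenvectors of T(I,I,c) T(I,I,c')^+; keep the K with largest |eigenvalue|
eigvals, eigvecs = np.linalg.eig(Tc @ np.linalg.pinv(Tcp))
top = np.argsort(-np.abs(eigvals))[:K]
V = np.real(eigvecs[:, top])               # recovered components, up to scaling
```

Matching recovered directions to the true components up to sign and scale (via cosine similarity) confirms the recovery on this exact-moment instance.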
Analysis of Consistency of Jennrich's Algorithm
Recall: linear combinations of slices share the eigenvectors {μ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† = U D_c U^T (U^T)^† D_{c′}^{−1} U^† = U (D_c D_{c′}^{−1}) U^†,
where U = [μ_1 ... μ_K] collects the linearly independent tensor components and D_c = Diag(⟨μ_1, c⟩, ..., ⟨μ_K, c⟩) is diagonal.
By the linear independence of {μ_h}_{h=1}^K and the random choice of c and c′:
1 U has rank K;
2 D_c and D_{c′} are invertible (a.s.);
3 the diagonal entries of D_c D_{c′}^{−1} are distinct (a.s.).
So {μ_h}_{h=1}^K are the eigenvectors of T(I, I, c) T(I, I, c′)^† with distinct non-zero eigenvalues: Jennrich's algorithm is consistent.
Error-tolerant algorithms for tensor decompositions
Moment Estimator: Empirical Moments
Moments E_θ[f(X)] are functions of the model parameters θ; empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.
Example. Third-order moment (distribution of word triples): E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h. Empirical third-order moment (co-occurrence frequency of word triples): Ê[x ⊗ x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i ⊗ x_i.
We inevitably expect error of order n^{−1/2} in some norm, e.g.:
Operator norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖ ≈ n^{−1/2}, where ‖T‖ := sup_{x,y,z ∈ S^{d−1}} T(x, y, z).
Frobenius norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖_F ≈ n^{−1/2}, where ‖T‖_F := √(Σ_{i,j,k} T_{i,j,k}²).
Stability of Jennrich's Algorithm
Recall Jennrich's algorithm: given a tensor T = Σ_{h=1}^K μ_h^{⊗3} with linearly independent components {μ_h}_{h=1}^K, sample c, c′ uniformly at random from S^{d−1} and return the eigenvectors of T(I, I, c) T(I, I, c′)^†.
Challenge: we only have access to an estimate T̂ with ‖T̂ − T‖ ≈ n^{−1/2}, so the algorithm must use the eigenvectors of T̂(I, I, c) T̂(I, I, c′)^†.
Stability of eigenvectors requires eigenvalue gaps. To ensure eigenvalue gaps for T̂(·, ·, c) T̂(·, ·, c′)^†, we need T̂(·, ·, c) T̂(·, ·, c′)^† to be close to T(·, ·, c) T(·, ·, c′)^†.
Ultimately, ‖T̂ − T‖_F ≤ 1/poly(d) is required. A different approach?
92 Initial Ideas

In many applications, we estimate moments of the form
M3 = Σ_{h=1}^K w_h a_h^⊗3,
where {a_h}_{h=1}^K are assumed to be linearly independent.

What if the {a_h}_{h=1}^K are orthonormal?
M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, for all i.
Analogous to matrix eigenvectors: Mv = M(I, v) = λv.
Define orthonormal {a_h}_{h=1}^K as eigenvectors of the tensor M3.

Two Problems
- {a_h}_{h=1}^K is not orthogonal in general.
- How to find eigenvectors of a tensor?
43/75
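The fixed-point identity above is easy to check numerically; a minimal sketch, assuming orthonormal components (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 6, 3

# Orthonormal components a_h (columns of Q) and positive weights w_h.
Q, _ = np.linalg.qr(rng.standard_normal((d, K)))
w = np.array([2.0, 1.0, 0.5])
M3 = np.einsum('h,ih,jh,kh->ijk', w, Q, Q, Q)

# M3(I, a_i, a_i) = sum_h w_h <a_h, a_i>^2 a_h = w_i a_i for every i.
for i in range(K):
    a_i = Q[:, i]
    out = np.einsum('ijk,j,k->i', M3, a_i, a_i)
    assert np.allclose(out, w[i] * a_i)
```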
98 Whitening

Whitening is the process of finding a whitening matrix W such that a multi-linear operation (using W) on M3 orthogonalizes its components:
M3(W, W, W) = Σ_h w_h (W^T a_h)^⊗3 = Σ_h w_h v_h^⊗3,  with v_h ⊥ v_h' for h ≠ h'.
44/75
99 Whitening

Given M3 = Σ_h w_h a_h^⊗3 and M2 = Σ_h w_h a_h a_h^T, find a whitening matrix W such that the W^T a_h = v_h are orthogonal.
When {a_h}_{h=1}^K ∈ R^{d×K} has full column rank, this is an invertible transformation.
45/75
102 Using Whitening to Obtain an Orthogonal Tensor

Multi-linear transform: T = M3(W, W, W) = Σ_h w_h (W^T a_h)^⊗3.
T = Σ_{h∈[K]} w_h v_h^⊗3 has orthogonal components.
Dimensionality reduction when K ≪ d, as M3 ∈ R^{d×d×d} while T ∈ R^{K×K×K}.
46/75
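The multi-linear transform T = M3(W, W, W) is a triple mode-wise contraction; a sketch of the operation itself (W here is a random stand-in, not an actual whitening matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 10, 4

M3 = rng.standard_normal((d, d, d))   # stand-in for the moment tensor
W = rng.standard_normal((d, K))       # stand-in for a whitening matrix

# Contract each of the three modes of M3 with W.
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
```

The result lives in R^{K×K×K}, which is the dimensionality reduction the slide refers to when K ≪ d.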
106 How to Find the Whitening Matrix?

Given M3 = Σ_h w_h a_h^⊗3 and M2 = Σ_h w_h a_h a_h^T. Goal: W such that the W^T a_h are orthogonal.
- Use the pairwise moments M2 to find W such that W^T M2 W = I.
- W = U Diag(λ̃^{-1/2}), from the eigen-decomposition M2 = U Diag(λ̃) U^T.
- V := W^T A Diag(w)^{1/2} is an orthogonal matrix, and
T = M3(W, W, W) = Σ_h w_h^{-1/2} (√w_h W^T a_h)^⊗3 = Σ_h λ_h v_h^⊗3,  with λ_h := w_h^{-1/2}.
T is an orthogonal tensor.
47/75
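The construction of W from the pairwise moments can be carried through end to end; a hedged sketch (the component values are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 8, 3

# Linearly independent (not orthogonal) components a_h and weights w_h.
A = rng.standard_normal((d, K))
w = np.array([1.0, 2.0, 3.0])
M2 = np.einsum('h,ih,jh->ij', w, A, A)
M3 = np.einsum('h,ih,jh,kh->ijk', w, A, A, A)

# W = U Diag(lam^{-1/2}) from the top-K eigen-decomposition of M2.
lam, U = np.linalg.eigh(M2)
lam, U = lam[-K:], U[:, -K:]          # keep the K nonzero eigenvalues
W = U / np.sqrt(lam)

assert np.allclose(W.T @ M2 @ W, np.eye(K), atol=1e-8)

# V = W^T A Diag(w)^{1/2} has orthonormal columns, so T = M3(W, W, W)
# is an orthogonally decomposable tensor.
V = (W.T @ A) * np.sqrt(w)
assert np.allclose(V.T @ V, np.eye(K), atol=1e-8)

T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
```

The two assertions check exactly the two facts on the slide: W whitens M2, and V = W^T A Diag(w)^{1/2} is orthogonal.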
110 Initial Ideas (recap)

M3 = Σ_h w_h a_h^⊗3 with linearly independent {a_h}_{h=1}^K. For orthonormal components, M3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i, analogous to matrix eigenvectors: Mv = M(I, v) = λv.

Two Problems
- {a_h}_{h=1}^K is not orthogonal in general (addressed by whitening above).
- How to find eigenvectors of a tensor?
48/75
111 Review: Orthogonal Matrix Eigen-Decomposition

Task: Given a matrix M = Σ_{h=1}^K λ_h v_h v_h^T with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_h' for h ≠ h'), find the components/eigenvectors.

Properties of matrix eigenvectors
- Fixed point of a linear transform: M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i.

Intuition for the matrix power method
- The linear transform preserves the direction of each eigenvector v_h.
49/75
115 Orthogonal Tensor Eigen-Decomposition

Task: Given a tensor T = Σ_{h=1}^K λ_h v_h^⊗3 with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_h' for h ≠ h'), find the components/eigenvectors.

Properties of tensor eigenvectors
- Fixed point of a bi-linear transform: T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i.

Intuition for the tensor power method
- The bi-linear transform preserves the direction of each eigenvector v_h.
50/75
119 Orthogonal Matrix Eigen-Decomposition

Task: Given a matrix M = Σ_{h=1}^K λ_h v_h^⊗2 with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors.

Algorithm: Matrix Power Method
Require: matrix M ∈ R^{K×K}
Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K w.h.p.?
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K-1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i-1}) / ‖M(I, u_{i-1})‖
5:   end for
6:   v̂_h ← u_T,  λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate: M ← M − λ̂_h v̂_h^⊗2
8: end for

Consistency of the matrix power method?
- Is there convergence: {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.?
- Does the convergence depend on initialization?
51/75
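The algorithm above, with deflation, is only a few lines of NumPy; a sketch on a synthetic matrix with a known spectrum (the spectrum and iteration count are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)
K, n_iters = 5, 100

# Symmetric matrix M = sum_h lam_h v_h v_h^T with a known spectrum.
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
lam_true = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
M = (Q * lam_true) @ Q.T

M_work = M.copy()
vals, vecs = [], []
for _ in range(K):
    u = rng.standard_normal(K)
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        u = M_work @ u                  # u <- M(I, u)
        u /= np.linalg.norm(u)          # normalize
    lam_hat = u @ M_work @ u            # Rayleigh quotient M(u, u)
    vals.append(lam_hat)
    vecs.append(u)
    M_work = M_work - lam_hat * np.outer(u, u)   # deflation
```

Each outer pass recovers the current top eigenpair and then deflates it, so `vals` comes out in decreasing order.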
122 Orthogonal Tensor Eigen-Decomposition

Task: Given a tensor T = Σ_{h=1}^K λ_h v_h^⊗3 with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors.

Algorithm: Tensor Power Method
Require: tensor T ∈ R^{K×K×K}
Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K w.h.p.?
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K-1}
3:   for i = 1 : T do
4:     u_i ← T(I, u_{i-1}, u_{i-1}) / ‖T(I, u_{i-1}, u_{i-1})‖
5:   end for
6:   v̂_h ← u_T,  λ̂_h ← T(v̂_h, v̂_h, v̂_h)
7:   Deflate: T ← T − λ̂_h v̂_h^⊗3
8: end for

Consistency of the tensor power method?
- Is there convergence: {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.?
- Does the convergence depend on initialization?
52/75
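The tensor analogue, again with deflation; a sketch on an orthogonally decomposable synthetic tensor (the spectrum and iteration count are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
K, n_iters = 4, 30

# Orthogonally decomposable tensor T = sum_h lam_h v_h^{x3}.
V, _ = np.linalg.qr(rng.standard_normal((K, K)))
lam_true = np.array([4.0, 3.0, 2.0, 1.0])
T = np.einsum('h,ih,jh,kh->ijk', lam_true, V, V, V)

T_work = T.copy()
found = []
for _ in range(K):
    u = rng.standard_normal(K)
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        u = np.einsum('ijk,j,k->i', T_work, u, u)   # u <- T(I, u, u)
        u /= np.linalg.norm(u)                      # normalize
    lam_hat = np.einsum('ijk,i,j,k->', T_work, u, u, u)   # T(u, u, u)
    found.append(lam_hat)
    T_work = T_work - lam_hat * np.einsum('i,j,k->ijk', u, u, u)  # deflation

lams = np.sort(found)[::-1]
```

Unlike the matrix case, which component each outer pass converges to depends on the random u_0 (whichever |λ_h c_h| is largest), so the recovered eigenvalues are sorted before comparison.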
124 Analysis of Consistency of the Matrix Power Method

Order the eigenvectors {v_h}_{h=1}^K such that the corresponding eigenvalues satisfy λ_1 ≥ λ_2 ≥ … ≥ λ_K. Project the initial point u_0 onto the eigenvectors {v_h}_{h=1}^K: c_h = ⟨u_0, v_h⟩ for all h.

Convergence properties
- Unique (identifiable) iff the {λ_h}_{h=1}^K are distinct.
- If the gap λ_2/λ_1 < 1 and c_1 ≠ 0, the matrix power method converges to v_1.
- Converges linearly to v_1, assuming the gap λ_2/λ_1 < 1.
- The linear transform gives M(I, u_0) = Σ_h λ_h (v_h^T u_0) v_h = Σ_h λ_h c_h v_h, i.e., the projection in direction v_h is scaled by λ_h.
- After t iterations, 1 − (v_1^T u_t)² ≤ K (λ_2/λ_1)^{2t}.
53/75
125 Analysis of Consistency of the Tensor Power Method

Project the initial point u_0 onto the eigenvectors: c_h = ⟨u_0, v_h⟩ for all h. Order the eigenvectors {v_h}_{h=1}^K such that |λ_1 c_1| > |λ_2 c_2| ≥ … ≥ |λ_K c_K|.

Convergence properties
- Identifiable iff the {λ_h c_h}_{h=1}^K are distinct. Initialization dependent.
- If |λ_2 c_2| / |λ_1 c_1| < 1 and λ_1 c_1 ≠ 0, the tensor power method converges to v_1. Note v_1 is NOT necessarily the eigenvector with the largest eigenvalue.
- Converges quadratically to v_1, assuming the gap |λ_2 c_2| / |λ_1 c_1| < 1.
- The bi-linear transform gives T(I, u_0, u_0) = Σ_h λ_h (v_h^T u_0)² v_h = Σ_h λ_h c_h² v_h, i.e., the projection in direction v_h is squared and then scaled by λ_h.
- After t iterations, the error 1 − (v_1^T u_t)² decays on the order of |λ_2 c_2 / (λ_1 c_1)|^{2^{t+1}} (up to problem-dependent constants).
54/75
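The "squared then scaled" effect in the last bullet is visible on a two-component example; a small sketch (the initial coefficients 0.8 and 0.6 are arbitrary):

```python
import numpy as np

# T = e1^{x3} + e2^{x3} in R^2: lam_1 = lam_2 = 1, so the iteration is
# governed purely by the coefficients c_h of the iterate.
T = np.einsum('ih,jh,kh->ijk', np.eye(2), np.eye(2), np.eye(2))

u = np.array([0.8, 0.6])   # c_1 = 0.8, c_2 = 0.6
ratios = []
for _ in range(3):
    ratios.append(u[1] / u[0])
    u = np.einsum('ijk,j,k->i', T, u, u)   # components become (c_1^2, c_2^2)
    u /= np.linalg.norm(u)
```

The ratio c_2/c_1 squares every step (0.75, 0.75², 0.75⁴, ...); this doubling of the exponent per iteration is exactly the quadratic convergence claimed above, versus the fixed geometric factor λ_2/λ_1 in the matrix case.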
126 Matrix vs. Tensor Power Iteration

Matrix power iteration:
1. Requires a gap between the largest and second-largest eigenvalue: a property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence: needs O(log(1/ε)) iterations.

Tensor power iteration:
1. Requires a gap between the largest and second-largest |λ_h c_h|: a property of the tensor and the initialization u_0.
2. Converges to the v_h with the largest |λ_h c_h|, not necessarily the top eigenvector.
3. Quadratic convergence: needs O(log log(1/ε)) iterations.
55/75
130 Spurious Eigenvectors in Tensor Eigen-Decomposition

T = Σ_{h∈[K]} λ_h v_h^⊗3. Characterization of eigenvectors: T(I, v, v) = λv?
- {v_h}_{h=1}^K are eigenvectors, as T(I, v_h, v_h) = λ_h v_h.
- Bad news: there can be other eigenvectors (unlike the matrix case). E.g., when λ_h = 1 for all h, v = (v_1 + v_2)/√2 satisfies T(I, v, v) = (1/√2) v.
- How do we avoid spurious solutions (vectors that are not among the components {v_h}_{h=1}^K)?
- An optimization viewpoint of tensor eigen-decomposition will help.
- All spurious eigenvectors are saddle points.
56/75
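Both halves of the slide (the spurious eigenvector exists, yet it is unstable) can be verified on the two-component example; a sketch (the perturbation size is arbitrary):

```python
import numpy as np

# T = v1^{x3} + v2^{x3} with orthonormal v1, v2 (standard basis, lam_h = 1).
T = np.einsum('ih,jh,kh->ijk', np.eye(2), np.eye(2), np.eye(2))

# v = (v1 + v2)/sqrt(2) is a spurious eigenvector with eigenvalue 1/sqrt(2).
v = np.array([1.0, 1.0]) / np.sqrt(2)
Tvv = np.einsum('ijk,j,k->i', T, v, v)
assert np.allclose(Tvv, v / np.sqrt(2))

# But it is unstable under power iteration: a tiny perturbation escapes
# to a true component, the saddle-point behavior noted on the slide.
u = v + np.array([1e-3, -1e-3])
u /= np.linalg.norm(u)
for _ in range(20):
    u = np.einsum('ijk,j,k->i', T, u, u)
    u /= np.linalg.norm(u)
```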
135 Optimization Viewpoint of Matrix/Tensor Eigen Decomposition 57/75
136 Optimization Viewpoint of Matrix/Tensor Eigen-Decomposition

Optimization problem
- Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^T v − 1).
- Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^T v − 1).

Non-convex: stationary points = {global optima, local optima, saddle points}.

Stationary points: first derivative ∇L(v, λ) = 0
- Matrix: ∇L(v, λ) = 2(M(I, v) − λv) = 0, so eigenvectors are stationary points. The power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
- Tensor: ∇L(v, λ) = 3(T(I, v, v) − λv) = 0, so eigenvectors are stationary points. The power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.

Local optima: w^T ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v
- Matrix: v_1 is the only local optimum; all other eigenvectors are saddle points.
- Tensor: {v_h}_{h=1}^K are the only local optima; all spurious eigenvectors are saddle points.
58/75
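The gradient computation behind the stationarity conditions can be checked numerically; a sketch using a random symmetric tensor (the symmetrization and step size are my choices):

```python
import numpy as np

rng = np.random.default_rng(6)
K = 4

# Random symmetric tensor T (average over all six axis permutations).
S = rng.standard_normal((K, K, K))
T = (S + S.transpose(0, 2, 1) + S.transpose(1, 0, 2)
     + S.transpose(1, 2, 0) + S.transpose(2, 0, 1) + S.transpose(2, 1, 0)) / 6
v = rng.standard_normal(K)
v /= np.linalg.norm(v)

def f(x):
    # Objective f(x) = T(x, x, x).
    return np.einsum('ijk,i,j,k->', T, x, x, x)

# Central-difference gradient of f at v matches 3 T(I, v, v).
eps = 1e-6
num_grad = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps)
                     for e in np.eye(K)])
analytic = 3 * np.einsum('ijk,j,k->i', T, v, v)
assert np.allclose(num_grad, analytic, atol=1e-5)
```

Since ∇_v T(v, v, v) = 3 T(I, v, v) for symmetric T, the normalized power update moves along the gradient direction, which is the gradient-ascent reading on the slide.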
140 Question: What about performance under noise? 59/75
141 Tensor Perturbation Analysis

T̂ = T + E, with T = Σ_h λ_h v_h^⊗3 and ‖E‖ := max_{x: ‖x‖=1} |E(x, x, x)| ≤ ε.

Theorem: Let T be the number of iterations. If
T ≳ log K + log log(λ_max/ε),  ε < λ_min/K,
then the output (v, λ) (after polynomially many restarts) satisfies
‖v − v_1‖ ≤ O(ε/λ_1),  |λ − λ_1| ≤ O(ε),
where v_1 is such that |λ_1 c_1| > |λ_2 c_2| ≥ …, c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer.

- A careful analysis of deflation avoids buildup of errors.
- Implies polynomial sample complexity for learning.
60/75
144 Other tensor decomposition techniques 61/75
145 Orthogonal Tensor Decomposition

- Simultaneous power method (Wang & Lu, 2017): simultaneous recovery of the eigenvectors; initialization is not optimal.
- Orthogonalized simultaneous alternating least squares (Sharan & Valiant, 2017): random initialization; proved convergence for symmetric tensors.
- Initialization: SVD-based initialization (Anandkumar & Janzamin, 2014); state-of-the-art (trace-based) initialization (Li & Huang, 2018).
62/75
146 Outline
1. Introduction
2. Motivation: Challenges of MLE for Gaussian Mixtures
3. Introduction of Method of Moments and Tensor Notations
4. Topic Model for Single-topic Documents
5. Algorithms for Tensor Decompositions
6. Tensor Decomposition for Neural Network Compression
7. Conclusion
63/75
147 Neural Networks: Nonlinear Function Approximation

Success of deep neural networks
- Image classification, speech recognition, text processing
- Enabled by growth in computation power and enormous labeled data

Expressive power
- Linear composition vs. nonlinear composition
- Shallow networks vs. deep structures
64/75
149 Revolution of Depth

[Figure: ImageNet classification top-5 error (%) vs. network depth: shallow (ILSVRC'10, '11); AlexNet, 8 layers (ILSVRC'12); VGG, 19 layers and GoogleNet, 22 layers (ILSVRC'14); ResNet, 152 layers (ILSVRC'15).]
[Figure: layer-by-layer architectures of AlexNet (8 layers), VGG-19, and GoogleNet (22 layers).]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
65-67/75
More informationHOSTOS COMMUNITY COLLEGE DEPARTMENT OF MATHEMATICS
HOSTOS COMMUNITY COLLEGE DEPARTMENT OF MATHEMATICS MAT 217 Linear Algebra CREDIT HOURS: 4.0 EQUATED HOURS: 4.0 CLASS HOURS: 4.0 PREREQUISITE: PRE/COREQUISITE: MAT 210 Calculus I MAT 220 Calculus II RECOMMENDED
More informationData Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis
Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationLatent Semantic Analysis. Hongning Wang
Latent Semantic Analysis Hongning Wang CS@UVa Recap: vector space model Represent both doc and query by concept vectors Each concept defines one dimension K concepts define a high-dimensional space Element
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationMatrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY
Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY OUTLINE Why We Need Matrix Decomposition SVD (Singular Value Decomposition) NMF (Nonnegative
More informationSum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017
Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth
More informationGaussian Mixture Models
Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some
More informationEECS 275 Matrix Computation
EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 22 1 / 21 Overview
More informationFaloutsos, Tong ICDE, 2009
Large Graph Mining: Patterns, Tools and Case Studies Christos Faloutsos Hanghang Tong CMU Copyright: Faloutsos, Tong (29) 2-1 Outline Part 1: Patterns Part 2: Matrix and Tensor Tools Part 3: Proximity
More informationLearning Topic Models and Latent Bayesian Networks Under Expansion Constraints
Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints Animashree Anandkumar 1, Daniel Hsu 2, Adel Javanmard 3, and Sham M. Kakade 2 1 Department of EECS, University of California,
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationLocal Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences
Local Maxima in the Likelihood of Gaussian Mixture Models: Structural Results and Algorithmic Consequences Chi Jin UC Berkeley chijin@cs.berkeley.edu Yuchen Zhang UC Berkeley yuczhang@berkeley.edu Sivaraman
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works
CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The
More informationA Practical Algorithm for Topic Modeling with Provable Guarantees
1 A Practical Algorithm for Topic Modeling with Provable Guarantees Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu Reviewed by Zhao Song December
More informationVasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks
C.M. Bishop s PRML: Chapter 5; Neural Networks Introduction The aim is, as before, to find useful decompositions of the target variable; t(x) = y(x, w) + ɛ(x) (3.7) t(x n ) and x n are the observations,
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018
ELE 538B: Mathematics of High-Dimensional Data Spectral methods Yuxin Chen Princeton University, Fall 2018 Outline A motivating application: graph clustering Distance and angles between two subspaces Eigen-space
More informationProvable Alternating Minimization Methods for Non-convex Optimization
Provable Alternating Minimization Methods for Non-convex Optimization Prateek Jain Microsoft Research, India Joint work with Praneeth Netrapalli, Sujay Sanghavi, Alekh Agarwal, Animashree Anandkumar, Rashish
More informationPCA and admixture models
PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1
More informationChapter 7: Symmetric Matrices and Quadratic Forms
Chapter 7: Symmetric Matrices and Quadratic Forms (Last Updated: December, 06) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). A few theorems have been moved
More information1 Singular Value Decomposition and Principal Component
Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)
More informationThe University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.
The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational
More informationLecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora
princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora Scribe: Today we continue the
More informationOverlapping Variable Clustering with Statistical Guarantees and LOVE
with Statistical Guarantees and LOVE Department of Statistical Science Cornell University WHOA-PSI, St. Louis, August 2017 Joint work with Mike Bing, Yang Ning and Marten Wegkamp Cornell University, Department
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More information1 Matrix notation and preliminaries from spectral graph theory
Graph clustering (or community detection or graph partitioning) is one of the most studied problems in network analysis. One reason for this is that there are a variety of ways to define a cluster or community.
More informationReview problems for MA 54, Fall 2004.
Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on
More informationFinding normalized and modularity cuts by spectral clustering. Ljubjana 2010, October
Finding normalized and modularity cuts by spectral clustering Marianna Bolla Institute of Mathematics Budapest University of Technology and Economics marib@math.bme.hu Ljubjana 2010, October Outline Find
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationData Mining Techniques
Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!
More informationMachine learning for pervasive systems Classification in high-dimensional spaces
Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More informationDS-GA 1002 Lecture notes 10 November 23, Linear models
DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.
More informationTutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning
Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationEstimating Covariance Using Factorial Hidden Markov Models
Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu
More informationCompressed Sensing and Neural Networks
and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications
More informationTHE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING
THE HIDDEN CONVEXITY OF SPECTRAL CLUSTERING Luis Rademacher, Ohio State University, Computer Science and Engineering. Joint work with Mikhail Belkin and James Voss This talk A new approach to multi-way
More informationLecture 8: Linear Algebra Background
CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected
More informationMathematical foundations - linear algebra
Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar
More informationMETHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS
METHOD OF MOMENTS LEARNING FOR LEFT TO RIGHT HIDDEN MARKOV MODELS Y. Cem Subakan, Johannes Traa, Paris Smaragdis,,, Daniel Hsu UIUC Computer Science Department, Columbia University Computer Science Department
More informationScaling Neighbourhood Methods
Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)
More informationMatrix and Tensor Factorization from a Machine Learning Perspective
Matrix and Tensor Factorization from a Machine Learning Perspective Christoph Freudenthaler Information Systems and Machine Learning Lab, University of Hildesheim Research Seminar, Vienna University of
More informationEM for Spherical Gaussians
EM for Spherical Gaussians Karthekeyan Chandrasekaran Hassan Kingravi December 4, 2007 1 Introduction In this project, we examine two aspects of the behavior of the EM algorithm for mixtures of spherical
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04
More informationSTATS 306B: Unsupervised Learning Spring Lecture 2 April 2
STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised
More informationPrincipal Component Analysis
Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental
More informationCPSC 340: Machine Learning and Data Mining. More PCA Fall 2017
CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).
More informationCS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery
CS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery Tim Roughgarden & Gregory Valiant May 3, 2017 Last lecture discussed singular value decomposition (SVD), and we
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More information