Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods

Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods Anima Anandkumar U.C. Irvine

Application 1: Clustering. Basic operation: grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint: the groups represent different distributions (e.g., Gaussians), and each data point is drawn from one of these distributions (e.g., Gaussian mixtures).

Application 2: Topic Modeling Document modeling Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization.

Application 3: Understanding Human Communities Social Networks Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of actors.

Application 4: Recommender Systems Recommender System Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products.

Application 5: Feature Learning Feature Engineering Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures.

Application 6: Computational Biology Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups

How to model hidden effects? Basic approach: mixtures/clusters, where the hidden variable h is categorical. Advanced: probabilistic models, where the hidden variable h has more general distributions and can model mixed memberships. [Graphical model figure: hidden variables h_1, h_2, h_3 over observed variables x_1, ..., x_5.] This talk: basic mixture model. Tomorrow: advanced models.

Challenges in Learning. Basic goal in all the applications above: discover hidden structure in data, i.e., unsupervised learning. Challenge: conditions for identifiability. When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms? Challenge: efficient learning of latent variable models. Maximum likelihood is NP-hard, and in practice EM and variational Bayes have no consistency guarantees. Can we get efficient computational and sample complexities? In this series: guaranteed and efficient learning through spectral methods.

What this talk series will cover. This talk: start with the most basic latent variable model, Gaussian mixtures. Basic spectral approach: principal component analysis (PCA). Beyond correlations: higher order moments. Tensor notation and the extension of PCA to tensors. Guaranteed learning of Gaussian mixtures. Introduce topic models; present a unified viewpoint. Tomorrow: using tensor methods for learning latent variable models, and analysis of the tensor decomposition method. Wednesday: implementation of tensor methods.

Resources for Today's Talk. A Method of Moments for Mixture Models and Hidden Markov Models, by A., D. Hsu, and S.M. Kakade, Proc. of COLT, June 2012. Tensor Decompositions for Learning Latent Variable Models, by A., R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky, Oct. 2012. Resources available at http://newport.eecs.uci.edu/anandkumar/mlss.html

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Warm-up: PCA. Optimization problem: for (centered) points $x_i \in \mathbb{R}^d$, find a projection $P \in \mathbb{R}^{d \times d}$ with $\mathrm{Rank}(P) = k$ minimizing $\frac{1}{n}\sum_{i \in [n]} \|x_i - P x_i\|^2$. Result: if $S = \mathrm{Cov}(X)$ with eigen-decomposition $S = U \Lambda U^\top$, then $P = U^{(k)} U^{(k)\top}$, where $U^{(k)}$ holds the top-$k$ eigenvectors. Proof: by the Pythagorean theorem, $\sum_i \|x_i - P x_i\|^2 = \sum_i \|x_i\|^2 - \sum_i \|P x_i\|^2$, so it suffices to maximize $\frac{1}{n}\sum_i \|P x_i\|^2 = \frac{1}{n}\sum_i \mathrm{Tr}[P x_i x_i^\top P^\top] = \mathrm{Tr}[P S P^\top]$.
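
Illustration (a minimal MATLAB sketch; the sizes and variable names are made up, not part of the lecture): form the sample covariance, take its top-k eigenvectors, and build the rank-k projector.

% PCA sketch: rank-k projection from the top-k eigenvectors of the covariance.
n = 1000; d = 10; k = 3;                   % illustrative sizes
X = randn(n, d) * randn(d, d);             % synthetic data, one sample per row
Xc = X - mean(X, 1);                       % center the points
S = (Xc' * Xc) / n;                        % sample covariance
[U, lam] = eig(S, 'vector');
[~, idx] = sort(lam, 'descend');
Uk = U(:, idx(1:k));                       % top-k eigenvectors
P = Uk * Uk';                              % optimal rank-k projector
residual = norm(Xc - Xc * P, 'fro')^2 / n  % average squared reconstruction error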

Review of linear algebra. For a matrix $S$, $u$ is an eigenvector with eigenvalue $\lambda$ if $S u = \lambda u$. A symmetric $S \in \mathbb{R}^{d \times d}$ has $d$ eigenvalues and $S = \sum_{i \in [d]} \lambda_i u_i u_i^\top$, with $U$ orthogonal. Rayleigh quotient: for $S$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ and corresponding eigenvectors $u_1, \dots, u_d$, $\max_{\|z\|=1} z^\top S z = \lambda_1$ and $\min_{\|z\|=1} z^\top S z = \lambda_d$, with optimizing vectors $u_1$ and $u_d$ respectively. Optimal projection: over rank-$k$ orthogonal projections $P$, $\max \mathrm{Tr}(P^\top S P) = \lambda_1 + \dots + \lambda_k$, attained when $P$ spans $\{u_1, \dots, u_k\}$.
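
A quick numerical check of these two facts (a MATLAB sketch with made-up names):

% Rayleigh quotient and optimal projection, checked numerically.
d = 6; k = 2;
B = randn(d); S = (B + B') / 2;            % random symmetric matrix
[U, lam] = eig(S, 'vector');
[lam, idx] = sort(lam, 'descend'); U = U(:, idx);
z = randn(d, 1); z = z / norm(z);
z' * S * z                                 % always lies between lam(d) and lam(1)
P = U(:, 1:k) * U(:, 1:k)';                % projection onto span{u_1, ..., u_k}
[trace(P * S * P), sum(lam(1:k))]          % the two values agree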

PCA on Gaussian Mixtures. Mixture of $k$ Gaussians: each sample is $x = Ah + z$, where $h \in \{e_1, \dots, e_k\}$ (the basis vectors) with $\mathbb{E}[h] = w$, $A \in \mathbb{R}^{d \times k}$ has the component means as its columns, $\mu := Aw$ is the overall mean, and $z \sim \mathcal{N}(0, \sigma^2 I)$ is white Gaussian noise. Then $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$. How the above equation is obtained: $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \mathbb{E}[(Ah-\mu)(Ah-\mu)^\top] + \mathbb{E}[z z^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$.

PCA on Gaussian Mixtures. $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$. The vectors $\{a_i - \mu\}$ are linearly dependent, since $\sum_i w_i (a_i - \mu) = 0$, so the PSD matrix $\sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top$ has rank $k-1$. A $(k-1)$-PCA of the covariance matrix therefore yields $\mathrm{span}\{a_i - \mu\}$, and the lowest eigenvalue of the covariance matrix yields $\sigma^2$. How to learn $A$?
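
Illustration (a minimal MATLAB sketch on simulated data; sizes, weights, and names are made up): the smallest eigenvalues of the covariance give sigma^2, and the top k-1 eigenvectors give span{a_i - mu}.

% Read off sigma^2 and span{a_i - mu} from the covariance of a Gaussian mixture.
d = 10; k = 3; n = 100000; sigma = 0.5;
A = randn(d, k); w = ones(k, 1) / k;              % component means and mixing weights
cw = cumsum(w);
[~, h] = max(rand(n, 1) <= cw', [], 2);           % component label for each sample
X = A(:, h)' + sigma * randn(n, d);               % x = a_h + z, one sample per row
S = cov(X);                                       % ~ sum_i w_i (a_i - mu)(a_i - mu)' + sigma^2 I
[U, lam] = eig(S, 'vector');
[lam, idx] = sort(lam, 'descend'); U = U(:, idx);
sigma2_hat = mean(lam(k:end))                     % the d-k+1 smallest eigenvalues ~ sigma^2
span_hat = U(:, 1:k-1);                           % estimates span{a_i - mu}
mu_hat = mean(X, 1)';
norm((eye(d) - span_hat * span_hat') * (A - mu_hat), 'fro')   % small: a_i - mu lie in the span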

Learning through Spectral Clustering. Learning $A$ through spectral clustering: project the samples $x$ onto $\mathrm{span}(A)$ and run distance-based clustering (e.g., $k$-means); a series of works, e.g., Vempala & Wang. This fails to cluster under large variance. Can we learn Gaussian mixtures without separation constraints?

Beyond PCA: Spectral Methods on Tensors. How can we learn the component means $A$ (and not just their span) without separation constraints? PCA is a spectral method on (covariance) matrices; are higher order moments helpful? What if the number of components exceeds the observed dimensionality, $k > d$? Do higher order moments help to learn such overcomplete models? What if the data is not Gaussian? Can we do moment-based estimation of probabilistic latent variable models?

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Tensor Notation for Higher Order Moments. Multivariate higher order moments form tensors. Are there spectral operations on tensors akin to PCA? The matrix $\mathbb{E}[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second order tensor, with $\mathbb{E}[x \otimes x]_{i_1,i_2} = \mathbb{E}[x_{i_1} x_{i_2}]$; for matrices, $\mathbb{E}[x \otimes x] = \mathbb{E}[x x^\top]$. The tensor $\mathbb{E}[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third order tensor, with $\mathbb{E}[x \otimes x \otimes x]_{i_1,i_2,i_3} = \mathbb{E}[x_{i_1} x_{i_2} x_{i_3}]$.
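
Illustration (a minimal MATLAB sketch with made-up sizes): forming the empirical second- and third-order moments of a data matrix.

% Empirical moments of X (n samples x d dimensions).
n = 5000; d = 4;
X = randn(n, d);
M2_hat = (X' * X) / n;                     % M2_hat(i1,i2) = (1/n) sum_t x_t(i1) x_t(i2)
M3_hat = zeros(d, d, d);                   % M3_hat(i1,i2,i3) = (1/n) sum_t x_t(i1) x_t(i2) x_t(i3)
for t = 1:n
    xt = X(t, :)';
    M3_hat = M3_hat + reshape(kron(xt, kron(xt, xt)), d, d, d);
end
M3_hat = M3_hat / n;
M3_hat(1, 2, 3) - M3_hat(3, 1, 2)          % ~ 0: the moment tensor is symmetric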

Third order moment for Gaussian mixtures. Consider a mixture of $k$ Gaussians: each sample is $x = Ah + z$, with $h \in \{e_1, \dots, e_k\}$ (the basis vectors), $\mathbb{E}[h] = w$, $A \in \mathbb{R}^{d \times k}$ whose columns are the component means, $\mu := Aw$ the mean, and $z \sim \mathcal{N}(0, \sigma^2 I)$ white Gaussian noise. Then $\mathbb{E}[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + e_i \otimes \mu \otimes e_i + e_i \otimes e_i \otimes \mu)$. Intuition behind the equation: $\mathbb{E}[x \otimes x \otimes x] = \mathbb{E}[(Ah) \otimes (Ah) \otimes (Ah)] + \mathbb{E}[(Ah) \otimes z \otimes z] + \dots = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \dots)$. How do we recover the parameters $A$ and $w$ from the third order moment?

Simplifications. $\mathbb{E}[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \dots)$. $\sigma^2$ is obtained from $\sigma_{\min}(\mathrm{Cov}(x))$, and $\mathbb{E}[x \otimes x] = \sum_i w_i\, a_i a_i^\top + \sigma^2 I$. We can therefore obtain $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$ and $M_2 = \sum_i w_i\, a_i a_i^\top$. How do we obtain the parameters $A$ and $w$ from $M_2$ and $M_3$?

Tensor Slices. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Multilinear transformation of the tensor: $M_3(B, C, D) := \sum_i w_i\, (B^\top a_i) \otimes (C^\top a_i) \otimes (D^\top a_i)$. Slice of the tensor: $M_3(I, I, r) = \sum_i w_i \langle a_i, r \rangle\, a_i a_i^\top = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, while $M_2 = A\, \mathrm{Diag}(w)\, A^\top$.
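
Illustration (a minimal MATLAB sketch with made-up parameters): the slice identity can be checked numerically.

% Verify M3(I, I, r) = A * Diag(w) * Diag(A'*r) * A' on synthetic parameters.
d = 5; k = 3;
A = randn(d, k); w = rand(k, 1); w = w / sum(w); r = randn(d, 1);
M3 = zeros(d, d, d);
for i = 1:k
    ai = A(:, i);
    M3 = M3 + w(i) * reshape(kron(ai, kron(ai, ai)), d, d, d);   % w_i a_i (x) a_i (x) a_i
end
slice = zeros(d, d);
for i3 = 1:d
    slice = slice + M3(:, :, i3) * r(i3);                        % contract the third index with r
end
norm(slice - A * diag(w) * diag(A' * r) * A', 'fro')             % ~ 0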

Eigen-decomposition. $M_3(I, I, r) = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, $M_2 = A\, \mathrm{Diag}(w)\, A^\top$. Assumption: $A \in \mathbb{R}^{d \times k}$ has full column rank. Let $M_2 = U \Lambda U^\top$ be its eigen-decomposition, with $U \in \mathbb{R}^{d \times k}$; then $U^\top M_2 U = \Lambda \in \mathbb{R}^{k \times k}$ is invertible. Form $X = (U^\top M_3(I, I, r)\, U)(U^\top M_2 U)^{-1} = V\, \mathrm{Diag}(\tilde\lambda)\, V^{-1}$. Substitution gives $X = (U^\top A)\, \mathrm{Diag}(A^\top r)\, (U^\top A)^{-1}$, so $v_i \propto U^\top a_i$. Technical detail: take $r = U\theta$ with $\theta$ drawn uniformly from the sphere, to ensure an eigen-gap.

Learning Gaussian Mixtures through Eigen-decomposition of Tensor Slices. $M_3(I, I, r) = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, $M_2 = A\, \mathrm{Diag}(w)\, A^\top$, and $(U^\top M_3(I, I, r)\, U)(U^\top M_2 U)^{-1} = V\, \mathrm{Diag}(\tilde\lambda)\, V^{-1}$. Recovery of $A$ (method 1): since $v_i \propto U^\top a_i$, recover $A$ up to scale via $a_i \propto U v_i$. Recovery of $A$ (method 2): let $\Theta \in \mathbb{R}^{k \times k}$ be a random rotation matrix, consider the $k$ slices $M_3(I, I, \theta_i)$ and find the eigen-decomposition above for each; let $\tilde\Lambda = [\tilde\lambda_1 \dots \tilde\lambda_k]$ collect the eigenvalues of all slices, and recover $A$ as $A = U \Theta^{-1} \tilde\Lambda$.
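
Illustration (a minimal MATLAB sketch of recovery method 1, run on exact population moments with made-up parameters; the actual algorithm works with empirical moments):

% Recover the component means from a random slice of M3, using exact M2 and M3.
d = 6; k = 3;
A = randn(d, k); w = rand(k, 1); w = w / sum(w);
M2 = A * diag(w) * A';
M3slice = @(r) A * diag(w) * diag(A' * r) * A';     % M3(I, I, r) in closed form
[U, ~, ~] = svd(M2); U = U(:, 1:k);                 % range of M2 = range of A
theta = randn(k, 1); theta = theta / norm(theta);
r = U * theta;                                      % random direction, to create an eigen-gap
X = (U' * M3slice(r) * U) / (U' * M2 * U);          % = (U'A) Diag(A'r) (U'A)^(-1)
[V, ~] = eig(X);
A_hat = U * V;                                      % columns proportional to the a_i
cosA = abs((A_hat ./ vecnorm(A_hat))' * (A ./ vecnorm(A)));
max(cosA, [], 2)                                    % each entry ~ 1: match up to order and scale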

Implications. Putting it together: we learn Gaussian mixtures through slices of the third-order moment, with guaranteed learning through eigen-decomposition. A Method of Moments for Mixture Models and Hidden Markov Models, by A., D. Hsu, and S.M. Kakade, Proc. of COLT, June 2012. Shortcomings: the resulting product is not symmetric, so the eigen-decomposition $V\, \mathrm{Diag}(\tilde\lambda)\, V^{-1}$ does not yield an orthonormal $V$, which is more involved in practice. Recovery requires a good eigen-gap in $\mathrm{Diag}(\tilde\lambda)$: for $r = U\theta$ with $\theta$ drawn uniformly from the unit sphere, the gap is of order $1/k^{2.5}$, leading to numerical instability in practice. Moreover, $M_3(I, I, r)$ is only a (random) slice of the tensor, so the full information is not utilized. Can we find more efficient learning methods using higher order moments?

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Tensor Factorization. Recover $A$ and $w$ from $M_2$ and $M_3$: $M_2 = \sum_i w_i\, a_i a_i^\top$, $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. [Figure: the tensor $M_3$ drawn as the sum $w_1\, a_1 \otimes a_1 \otimes a_1 + w_2\, a_2 \otimes a_2 \otimes a_2 + \dots$] Here $a \otimes a \otimes a$ is a rank-1 tensor, since its $(i_1,i_2,i_3)$-th entry is $a_{i_1} a_{i_2} a_{i_3}$, so $M_3$ is a sum of rank-1 terms. When is this the most compact representation? (Identifiability.) Can we recover the decomposition? (Algorithm.) The most compact representation is known as the CP decomposition (CANDECOMP/PARAFAC).

Initial Ideas. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. What if $A$ has orthonormal columns? Then $M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1 \rangle^2 a_i = w_1 a_1$, so the $a_i$ are eigenvectors of the tensor $M_3$, analogous to matrix eigenvectors ($Mv = M(I,v) = \lambda v$). Two problems: how to find eigenvectors of a tensor, and $A$ is not orthogonal in general.

Whitening. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Find a whitening matrix $W$ such that $W^\top A = V$ is an orthogonal matrix. When $A \in \mathbb{R}^{d \times k}$ has full column rank, this is an invertible transformation. [Figure: whitening maps the vectors $a_1, a_2, a_3$ to orthogonal vectors $v_1, v_2, v_3$.]

Using Whitening to Obtain an Orthogonal Tensor. Multilinear transform: $M_3 \in \mathbb{R}^{d \times d \times d}$, $T \in \mathbb{R}^{k \times k \times k}$, with $T = M_3(W, W, W) = \sum_i w_i (W^\top a_i)^{\otimes 3}$. Then $T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$ is an orthogonal tensor. This is a dimensionality reduction when $k \ll d$. [Figure: the tensor $M_3$ mapped to the tensor $T$ by the multilinear transform.]

How to Find the Whitening Matrix? $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Use the pairwise moments $M_2$ to find $W$ such that $W^\top M_2 W = I$: from the eigen-decomposition $M_2 = U\, \mathrm{Diag}(\tilde\lambda)\, U^\top$, take $W = U\, \mathrm{Diag}(\tilde\lambda)^{-1/2}$. Then $V := W^\top A\, \mathrm{Diag}(w)^{1/2}$ is an orthogonal matrix, and $T = M_3(W, W, W) = \sum_i w_i^{-1/2} (W^\top a_i \sqrt{w_i})^{\otimes 3} = \sum_i \lambda_i\, v_i \otimes v_i \otimes v_i$ with $\lambda_i := w_i^{-1/2}$. So $T$ is an orthogonal tensor. [Figure: whitening maps $a_1, a_2, a_3$ to orthonormal $v_1, v_2, v_3$.]
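
Illustration (a minimal MATLAB sketch with made-up parameters): compute W from M2, check that it whitens, and form the orthogonal tensor T = M3(W, W, W).

% Whitening and the multilinear transform, on exact population moments.
d = 6; k = 3;
A = randn(d, k); w = rand(k, 1); w = w / sum(w);
M2 = A * diag(w) * A';
M3 = zeros(d, d, d);
for i = 1:k
    ai = A(:, i);
    M3 = M3 + w(i) * reshape(kron(ai, kron(ai, ai)), d, d, d);
end
[U, L, ~] = svd(M2);
W = U(:, 1:k) * diag(1 ./ sqrt(diag(L(1:k, 1:k))));            % whitening matrix
norm(W' * M2 * W - eye(k), 'fro')                              % ~ 0
V = W' * A * diag(sqrt(w));
norm(V' * V - eye(k), 'fro')                                   % ~ 0: V is orthogonal
T = reshape(W' * reshape(M3, d, d * d) * kron(W, W), k, k, k); % T = M3(W, W, W) via mode-1 unfolding
% T = sum_i lambda_i v_i (x) v_i (x) v_i with lambda_i = w_i^(-1/2)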

Orthogonal Tensor Eigen-Decomposition. $T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, with $\langle v_i, v_j \rangle = \delta_{i,j}$ for all $i, j$. Then $T(I, v_1, v_1) = \sum_i \lambda_i \langle v_i, v_1 \rangle^2 v_i = \lambda_1 v_1$, so the $v_i$ are eigenvectors of the tensor $T$. Tensor power method: start from an initial vector $v$ and iterate $v \mapsto T(I, v, v) / \|T(I, v, v)\|$. Canonical recovery method: randomly initialize the power method, run it to convergence to obtain $v$ with eigenvalue $\lambda$, deflate $T \leftarrow T - \lambda\, v \otimes v \otimes v$, and repeat.
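
Illustration (a minimal MATLAB sketch with made-up names; the recap later applies the same update directly to whitened samples): power iteration with deflation on an exactly orthogonal tensor.

% Tensor power method with deflation on T = sum_i lambda_i v_i (x) v_i (x) v_i.
k = 4;
[Vt, ~] = qr(randn(k));                            % orthonormal v_i (columns of Vt)
lam_true = 1 + rand(k, 1);
T = zeros(k, k, k);
for i = 1:k
    vi = Vt(:, i);
    T = T + lam_true(i) * reshape(kron(vi, kron(vi, vi)), k, k, k);
end
Tvv = @(T, v) reshape(T, k, k * k) * kron(v, v);   % the map v -> T(I, v, v)
V = zeros(k, k); lam = zeros(k, 1);
for i = 1:k
    v = randn(k, 1); v = v / norm(v);
    for iter = 1:100
        v_new = Tvv(T, v); v_new = v_new / norm(v_new);        % power update
        if norm(v_new - v) < 1e-12, v = v_new; break; end
        v = v_new;
    end
    lam(i) = v' * Tvv(T, v);                       % eigenvalue T(v, v, v)
    V(:, i) = v;
    T = T - lam(i) * reshape(kron(v, kron(v, v)), k, k, k);    % deflation
end
abs(V' * Vt)                                       % ~ a permutation matrix: recovered up to order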

Putting it together. Gaussian mixture: $x = Ah + z$, with $\mathbb{E}[h] = w$ and $z \sim \mathcal{N}(0, \sigma^2 I)$; $M_2 = \sum_i w_i\, a_i a_i^\top$, $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. Obtain the whitening matrix $W$ from the SVD of $M_2$, use $W$ for the multilinear transform $T = M_3(W, W, W)$, and find the eigenvectors of $T$ through the power method and deflation. What about learning other latent variable models?

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Topic Modeling

Probabilistic Topic Models. A useful abstraction for automatic categorization of documents. Observed: words; hidden: topics. Bag of words: the order of words does not matter. Graphical model representation: $\ell$ words in a document, $x_1, \dots, x_\ell$; $h$ gives the proportions of topics in the document; word $x_i$ is generated from topic $y_i$; exchangeability: $x_1 \perp x_2 \perp \dots \mid h$. $A(i,j) := \mathbb{P}[x_m = i \mid y_m = j]$ is the topic-word matrix. [Graphical model figure: topic mixture $h$ over topics $y_1, \dots, y_5$, which generate words $x_1, \dots, x_5$ through $A$.]

Distribution of the topic proportions vector $h$. If there are $k$ topics, $h$ is distributed over the simplex $\Delta^{k-1} := \{h \in \mathbb{R}^k : h_i \in [0,1],\ \sum_i h_i = 1\}$. Simplification for today: there is only one topic in each document, i.e., $h \in \{e_1, \dots, e_k\}$, the basis vectors. Tomorrow: latent Dirichlet allocation.

Formulation as Linear Models. Distribution of the words $x_1, x_2, \dots$: order the $d$ words in the vocabulary; if $x_1$ is the $j$-th word, assign it $e_j \in \mathbb{R}^d$. The distribution of each $x_i$ is supported on the vertices of $\Delta^{d-1}$. Properties: $\Pr[x_1 \mid \text{topic } h = i] = \mathbb{E}[x_1 \mid \text{topic } h = i] = a_i$. Linear model: $\mathbb{E}[x_i \mid h] = Ah$. Multiview model: $h$ is fixed and multiple words $x_i$ are generated.

Geometric Picture for Topic Models. [Figure: the topic proportions vector $h$ of a document lives on the simplex; in the single-topic case, $h$ sits at a vertex, and the words $x_1, x_2, x_3$ are generated from the corresponding column of $A$.]

Gaussian Mixtures vs. Topic Models. (Spherical) mixture of Gaussians: $k$ means $a_1, \dots, a_k$; component $h = i$ with probability $w_i$; observe $x$ with spherical noise, $x = a_i + z$, $z \sim \mathcal{N}(0, \sigma_i^2 I)$. (Single) topic model: $k$ topics $a_1, \dots, a_k$; topic $h = i$ with probability $w_i$; observe $\ell$ (exchangeable) words $x_1, x_2, \dots, x_\ell$ drawn i.i.d. from $a_i$. Unified linear model: $\mathbb{E}[x \mid h] = Ah$. The Gaussian mixture is single-view with spherical noise; the topic model is multi-view with heteroskedastic noise. Three words per document suffice for learning: $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$.

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Recap: Basic Tensor Decomposition Method. Toy example in MATLAB: simulated samples from an exchangeable model; whitening the samples (second order moments, matrix decomposition); orthogonal tensor eigen-decomposition (third order moments, power iteration).

Simulated Samples: Exchangeable Model. Model parameters: hidden state $h$ in the basis $\{e_1, \dots, e_k\}$ with $k = 2$; observed states $x_i$ in the basis $\{e_1, \dots, e_d\}$ with $d = 3$; conditional independence $x_1 \perp x_2 \perp x_3 \mid h$; transition matrix $A$; exchangeability $\mathbb{E}[x_i \mid h] = Ah$ for $i \in \{1, 2, 3\}$. [Graphical model figure: $h$ with children $x_1, x_2, x_3$.]

Generate Samples Snippet:

% assumes n samples and a column-stochastic d x k matrix A_true are given
h = zeros(n, k); x1 = zeros(n, d); x2 = zeros(n, d); x3 = zeros(n, d);
for t = 1 : n
    % generate h for this sample
    h_category = (rand() > 0.5) + 1;
    h(t, h_category) = 1;
    transition_cum = cumsum(A_true(:, h_category));
    % generate x1 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x1(t, x_category) = 1;
    % generate x2 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x2(t, x_category) = 1;
    % generate x3 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x3(t, x_category) = 1;
end

Whiten the Samples. Second order moments: $M_2 = \frac{1}{n} \sum_t x_1^t \otimes x_2^t$. Whitening matrix: $W = U_w L_w^{-1/2}$, where $[U_w, L_w] = k\text{-SVD}(M_2)$.

Whitening Snippet:

fprintf('The second order moment M2:\n');
M2 = x1' * x2 / n
[Uw, Lw, Vw] = svd(M2);
fprintf('M2 singular values:\n');
Lw
W = Uw(:, 1:k) * sqrt(pinv(Lw(1:k, 1:k)));
y1 = x1 * W; y2 = x2 * W; y3 = x3 * W;

Whitened data: $y_1^t = W^\top x_1^t$; orthogonal basis $V = W^\top A$ with $V^\top V = I$. [Figure: whitening maps $a_1, a_2$ to orthogonal $v_1, v_2$.]

Orthogonal Tensor Eigen-Decomposition. Third order moments: $T = \frac{1}{n} \sum_{t \in [n]} y_1^t \otimes y_2^t \otimes y_3^t \approx \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, with $V^\top V = I$. Gradient ascent (power update): $T(I, v_1, v_1) = \frac{1}{n} \sum_{t \in [n]} \langle v_1, y_2^t \rangle \langle v_1, y_3^t \rangle\, y_1^t = \sum_i \lambda_i \langle v_i, v_1 \rangle^2 v_i = \lambda_1 v_1$. The $v_i$ are eigenvectors of the tensor $T$.

Orthogonal Tensor Eigen-Decomposition. $T \approx \sum_j \lambda_j v_j^{\otimes 3}$; power update $v \mapsto T(I, v, v) / \|T(I, v, v)\|$.

Power Iteration Snippet:

% Maxiter and TOL are set elsewhere in the script; normc normalizes columns (v/norm(v) also works).
V = zeros(k, k); Lambda = zeros(k, 1);
for i = 1:k
    v_old = rand(k, 1); v_old = normc(v_old);
    for iter = 1 : Maxiter
        v_new = (y1' * ((y2 * v_old) .* (y3 * v_old))) / n;   % empirical T(I, v, v)
        if i > 1
            % deflation
            for j = 1 : i-1
                v_new = v_new - (V(:, j) * (v_old' * V(:, j))^2) * Lambda(j);
            end
        end
        lambda = norm(v_new); v_new = normc(v_new);
        if norm(v_old - v_new) < TOL
            fprintf('Converged at iteration %d.\n', iter);
            V(:, i) = v_new; Lambda(i, 1) = lambda;
            break;
        end
        v_old = v_new;
    end
end

[Figure: power iteration estimates at iterations 0, 1, 23, and 45; green: ground truth, red: estimate at each iteration.]

Conclusion. [Graphical model figure: $h$ with children $x_1, x_2, x_3$.] Gaussian mixtures and topic models: a unified linear representation, learning through higher order moments, and tensor decomposition via whitening and the power method. Tomorrow's lecture: latent variable models and moments; analysis of the tensor power method. Wednesday's lecture: implementation of tensor methods. Code snippet available at http://newport.eecs.uci.edu/anandkumar/mlss.html