Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods

Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods Anima Anandkumar U.C. Irvine

Application 1: Clustering. Basic operation: grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint: the groups represent different distributions (e.g., Gaussians), and each data point is drawn from one of these distributions (e.g., Gaussian mixtures).

Application 2: Topic Modeling Document modeling Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization.

Application 3: Understanding Human Communities Social Networks Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of actors.

Application 4: Recommender Systems Recommender System Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products.

Application 5: Feature Learning Feature Engineering Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures.

Application 6: Computational Biology Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups

How to model hidden effects? Basic approach: mixtures/clusters, where the hidden variable h is categorical. Advanced: probabilistic models, where the hidden variable h has more general distributions and can model mixed memberships. [Graphical model figure: hidden variables h_1, h_2, h_3 over observed variables x_1, ..., x_5.] This talk: basic mixture model. Tomorrow: advanced models.

Challenges in Learning. Basic goal in all the applications above: discover hidden structure in data, i.e., unsupervised learning. Challenge: conditions for identifiability. When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms? Challenge: efficient learning of latent variable models. Maximum likelihood is NP-hard, and in practice EM and variational Bayes have no consistency guarantees. Can we get efficient computational and sample complexities? In this series: guaranteed and efficient learning through spectral methods.

What this talk series will cover. This talk: start with the most basic latent variable model, Gaussian mixtures. Basic spectral approach: principal component analysis (PCA). Beyond correlations: higher order moments. Tensor notation and the extension of PCA to tensors. Guaranteed learning of Gaussian mixtures. Introduce topic models; present a unified viewpoint. Tomorrow: using tensor methods for learning latent variable models, and analysis of the tensor decomposition method. Wednesday: implementation of tensor methods.

Resources for Today's Talk. A Method of Moments for Mixture Models and Hidden Markov Models, by A., D. Hsu, and S.M. Kakade, Proc. of COLT, June 2012. Tensor Decompositions for Learning Latent Variable Models, by A., R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky, Oct. 2012. Resources available at http://newport.eecs.uci.edu/anandkumar/mlss.html

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Warm-up: PCA. Optimization problem: for (centered) points $x_i \in \mathbb{R}^d$, find a projection $P \in \mathbb{R}^{d \times d}$ with $\mathrm{Rank}(P) = k$ minimizing $\frac{1}{n}\sum_{i \in [n]} \|x_i - P x_i\|^2$. Result: if $S = \mathrm{Cov}(X)$ with eigen-decomposition $S = U \Lambda U^\top$, then $P = U^{(k)} U^{(k)\top}$, where $U^{(k)}$ holds the top-$k$ eigenvectors. Proof: by the Pythagorean theorem, $\sum_i \|x_i - P x_i\|^2 = \sum_i \|x_i\|^2 - \sum_i \|P x_i\|^2$, so it suffices to maximize $\frac{1}{n}\sum_i \|P x_i\|^2 = \frac{1}{n}\sum_i \mathrm{Tr}[P x_i x_i^\top P^\top] = \mathrm{Tr}[P S P^\top]$.
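
Illustration (a minimal MATLAB sketch; the sizes and variable names are made up, not part of the lecture): form the sample covariance, take its top-k eigenvectors, and build the rank-k projector.

% PCA sketch: rank-k projection from the top-k eigenvectors of the covariance.
n = 1000; d = 10; k = 3;                   % illustrative sizes
X = randn(n, d) * randn(d, d);             % synthetic data, one sample per row
Xc = X - mean(X, 1);                       % center the points
S = (Xc' * Xc) / n;                        % sample covariance
[U, lam] = eig(S, 'vector');
[~, idx] = sort(lam, 'descend');
Uk = U(:, idx(1:k));                       % top-k eigenvectors
P = Uk * Uk';                              % optimal rank-k projector
residual = norm(Xc - Xc * P, 'fro')^2 / n  % average squared reconstruction error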

Review of linear algebra. For a matrix $S$, $u$ is an eigenvector with eigenvalue $\lambda$ if $S u = \lambda u$. A symmetric $S \in \mathbb{R}^{d \times d}$ has $d$ eigenvalues and $S = \sum_{i \in [d]} \lambda_i u_i u_i^\top$, with $U$ orthogonal. Rayleigh quotient: for $S$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ and corresponding eigenvectors $u_1, \dots, u_d$, $\max_{\|z\|=1} z^\top S z = \lambda_1$ and $\min_{\|z\|=1} z^\top S z = \lambda_d$, with optimizing vectors $u_1$ and $u_d$ respectively. Optimal projection: over rank-$k$ orthogonal projections $P$, $\max \mathrm{Tr}(P^\top S P) = \lambda_1 + \dots + \lambda_k$, attained when $P$ spans $\{u_1, \dots, u_k\}$.
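
A quick numerical check of these two facts (a MATLAB sketch with made-up names):

% Rayleigh quotient and optimal projection, checked numerically.
d = 6; k = 2;
B = randn(d); S = (B + B') / 2;            % random symmetric matrix
[U, lam] = eig(S, 'vector');
[lam, idx] = sort(lam, 'descend'); U = U(:, idx);
z = randn(d, 1); z = z / norm(z);
z' * S * z                                 % always lies between lam(d) and lam(1)
P = U(:, 1:k) * U(:, 1:k)';                % projection onto span{u_1, ..., u_k}
[trace(P * S * P), sum(lam(1:k))]          % the two values agree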

PCA on Gaussian Mixtures. Mixture of $k$ Gaussians: each sample is $x = Ah + z$, where $h \in \{e_1, \dots, e_k\}$ (the basis vectors) with $\mathbb{E}[h] = w$, $A \in \mathbb{R}^{d \times k}$ has the component means as its columns, $\mu := Aw$ is the overall mean, and $z \sim \mathcal{N}(0, \sigma^2 I)$ is white Gaussian noise. Then $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$. How the above equation is obtained: $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \mathbb{E}[(Ah-\mu)(Ah-\mu)^\top] + \mathbb{E}[z z^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$.

PCA on Gaussian Mixtures. $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$. The vectors $\{a_i - \mu\}$ are linearly dependent, since $\sum_i w_i (a_i - \mu) = 0$, so the PSD matrix $\sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top$ has rank $k-1$. A $(k-1)$-PCA of the covariance matrix therefore yields $\mathrm{span}\{a_i - \mu\}$, and the lowest eigenvalue of the covariance matrix yields $\sigma^2$. How to learn $A$?
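
Illustration (a minimal MATLAB sketch on simulated data; sizes, weights, and names are made up): the smallest eigenvalues of the covariance give sigma^2, and the top k-1 eigenvectors give span{a_i - mu}.

% Read off sigma^2 and span{a_i - mu} from the covariance of a Gaussian mixture.
d = 10; k = 3; n = 100000; sigma = 0.5;
A = randn(d, k); w = ones(k, 1) / k;              % component means and mixing weights
cw = cumsum(w);
[~, h] = max(rand(n, 1) <= cw', [], 2);           % component label for each sample
X = A(:, h)' + sigma * randn(n, d);               % x = a_h + z, one sample per row
S = cov(X);                                       % ~ sum_i w_i (a_i - mu)(a_i - mu)' + sigma^2 I
[U, lam] = eig(S, 'vector');
[lam, idx] = sort(lam, 'descend'); U = U(:, idx);
sigma2_hat = mean(lam(k:end))                     % the d-k+1 smallest eigenvalues ~ sigma^2
span_hat = U(:, 1:k-1);                           % estimates span{a_i - mu}
mu_hat = mean(X, 1)';
norm((eye(d) - span_hat * span_hat') * (A - mu_hat), 'fro')   % small: a_i - mu lie in the span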

Learning through Spectral Clustering. Learning $A$ through spectral clustering: project the samples $x$ onto $\mathrm{span}(A)$ and run distance-based clustering (e.g., $k$-means); a series of works, e.g., Vempala & Wang. This fails to cluster under large variance. Can we learn Gaussian mixtures without separation constraints?

Beyond PCA: Spectral Methods on Tensors. How can we learn the component means $A$ (and not just their span) without separation constraints? PCA is a spectral method on (covariance) matrices; are higher order moments helpful? What if the number of components exceeds the observed dimensionality, $k > d$? Do higher order moments help to learn such overcomplete models? What if the data is not Gaussian? Can we do moment-based estimation of probabilistic latent variable models?

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Tensor Notation for Higher Order Moments. Multivariate higher order moments form tensors. Are there spectral operations on tensors akin to PCA? The matrix $\mathbb{E}[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second order tensor, with $\mathbb{E}[x \otimes x]_{i_1,i_2} = \mathbb{E}[x_{i_1} x_{i_2}]$; for matrices, $\mathbb{E}[x \otimes x] = \mathbb{E}[x x^\top]$. The tensor $\mathbb{E}[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third order tensor, with $\mathbb{E}[x \otimes x \otimes x]_{i_1,i_2,i_3} = \mathbb{E}[x_{i_1} x_{i_2} x_{i_3}]$.
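
Illustration (a minimal MATLAB sketch with made-up sizes): forming the empirical second- and third-order moments of a data matrix.

% Empirical moments of X (n samples x d dimensions).
n = 5000; d = 4;
X = randn(n, d);
M2_hat = (X' * X) / n;                     % M2_hat(i1,i2) = (1/n) sum_t x_t(i1) x_t(i2)
M3_hat = zeros(d, d, d);                   % M3_hat(i1,i2,i3) = (1/n) sum_t x_t(i1) x_t(i2) x_t(i3)
for t = 1:n
    xt = X(t, :)';
    M3_hat = M3_hat + reshape(kron(xt, kron(xt, xt)), d, d, d);
end
M3_hat = M3_hat / n;
M3_hat(1, 2, 3) - M3_hat(3, 1, 2)          % ~ 0: the moment tensor is symmetric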

Third order moment for Gaussian mixtures. Consider a mixture of $k$ Gaussians: each sample is $x = Ah + z$, with $h \in \{e_1, \dots, e_k\}$ (the basis vectors), $\mathbb{E}[h] = w$, $A \in \mathbb{R}^{d \times k}$ whose columns are the component means, $\mu := Aw$ the mean, and $z \sim \mathcal{N}(0, \sigma^2 I)$ white Gaussian noise. Then $\mathbb{E}[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + e_i \otimes \mu \otimes e_i + e_i \otimes e_i \otimes \mu)$. Intuition behind the equation: $\mathbb{E}[x \otimes x \otimes x] = \mathbb{E}[(Ah) \otimes (Ah) \otimes (Ah)] + \mathbb{E}[(Ah) \otimes z \otimes z] + \dots = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \dots)$. How do we recover the parameters $A$ and $w$ from the third order moment?

Simplifications. $\mathbb{E}[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \dots)$. $\sigma^2$ is obtained from $\sigma_{\min}(\mathrm{Cov}(x))$, and $\mathbb{E}[x \otimes x] = \sum_i w_i\, a_i a_i^\top + \sigma^2 I$. We can therefore obtain $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$ and $M_2 = \sum_i w_i\, a_i a_i^\top$. How do we obtain the parameters $A$ and $w$ from $M_2$ and $M_3$?

Tensor Slices. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Multilinear transformation of the tensor: $M_3(B, C, D) := \sum_i w_i\, (B^\top a_i) \otimes (C^\top a_i) \otimes (D^\top a_i)$. Slice of the tensor: $M_3(I, I, r) = \sum_i w_i \langle a_i, r \rangle\, a_i a_i^\top = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, while $M_2 = A\, \mathrm{Diag}(w)\, A^\top$.
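
Illustration (a minimal MATLAB sketch with made-up parameters): the slice identity can be checked numerically.

% Verify M3(I, I, r) = A * Diag(w) * Diag(A'*r) * A' on synthetic parameters.
d = 5; k = 3;
A = randn(d, k); w = rand(k, 1); w = w / sum(w); r = randn(d, 1);
M3 = zeros(d, d, d);
for i = 1:k
    ai = A(:, i);
    M3 = M3 + w(i) * reshape(kron(ai, kron(ai, ai)), d, d, d);   % w_i a_i (x) a_i (x) a_i
end
slice = zeros(d, d);
for i3 = 1:d
    slice = slice + M3(:, :, i3) * r(i3);                        % contract the third index with r
end
norm(slice - A * diag(w) * diag(A' * r) * A', 'fro')             % ~ 0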

Eigen-decomposition. $M_3(I, I, r) = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, $M_2 = A\, \mathrm{Diag}(w)\, A^\top$. Assumption: $A \in \mathbb{R}^{d \times k}$ has full column rank. Let $M_2 = U \Lambda U^\top$ be its eigen-decomposition, with $U \in \mathbb{R}^{d \times k}$; then $U^\top M_2 U = \Lambda \in \mathbb{R}^{k \times k}$ is invertible. Form $X = (U^\top M_3(I, I, r)\, U)(U^\top M_2 U)^{-1} = V\, \mathrm{Diag}(\tilde\lambda)\, V^{-1}$. Substitution gives $X = (U^\top A)\, \mathrm{Diag}(A^\top r)\, (U^\top A)^{-1}$, so $v_i \propto U^\top a_i$. Technical detail: take $r = U\theta$ with $\theta$ drawn uniformly from the sphere, to ensure an eigen-gap.

Learning Gaussian Mixtures through Eigen-decomposition of Tensor Slices. $M_3(I, I, r) = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, $M_2 = A\, \mathrm{Diag}(w)\, A^\top$, and $(U^\top M_3(I, I, r)\, U)(U^\top M_2 U)^{-1} = V\, \mathrm{Diag}(\tilde\lambda)\, V^{-1}$. Recovery of $A$ (method 1): since $v_i \propto U^\top a_i$, recover $A$ up to scale via $a_i \propto U v_i$. Recovery of $A$ (method 2): let $\Theta \in \mathbb{R}^{k \times k}$ be a random rotation matrix, consider the $k$ slices $M_3(I, I, \theta_i)$ and find the eigen-decomposition above for each; let $\tilde\Lambda = [\tilde\lambda_1 \dots \tilde\lambda_k]$ collect the eigenvalues of all slices, and recover $A$ as $A = U \Theta^{-1} \tilde\Lambda$.
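
Illustration (a minimal MATLAB sketch of recovery method 1, run on exact population moments with made-up parameters; the actual algorithm works with empirical moments):

% Recover the component means from a random slice of M3, using exact M2 and M3.
d = 6; k = 3;
A = randn(d, k); w = rand(k, 1); w = w / sum(w);
M2 = A * diag(w) * A';
M3slice = @(r) A * diag(w) * diag(A' * r) * A';     % M3(I, I, r) in closed form
[U, ~, ~] = svd(M2); U = U(:, 1:k);                 % range of M2 = range of A
theta = randn(k, 1); theta = theta / norm(theta);
r = U * theta;                                      % random direction, to create an eigen-gap
X = (U' * M3slice(r) * U) / (U' * M2 * U);          % = (U'A) Diag(A'r) (U'A)^(-1)
[V, ~] = eig(X);
A_hat = U * V;                                      % columns proportional to the a_i
cosA = abs((A_hat ./ vecnorm(A_hat))' * (A ./ vecnorm(A)));
max(cosA, [], 2)                                    % each entry ~ 1: match up to order and scale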

Implications. Putting it together: we learn Gaussian mixtures through slices of the third-order moment, with guaranteed learning through eigen-decomposition. A Method of Moments for Mixture Models and Hidden Markov Models, by A., D. Hsu, and S.M. Kakade, Proc. of COLT, June 2012. Shortcomings: the resulting product is not symmetric, so the eigen-decomposition $V\, \mathrm{Diag}(\tilde\lambda)\, V^{-1}$ does not yield an orthonormal $V$, which is more involved in practice. Recovery requires a good eigen-gap in $\mathrm{Diag}(\tilde\lambda)$: for $r = U\theta$ with $\theta$ drawn uniformly from the unit sphere, the gap is of order $1/k^{2.5}$, leading to numerical instability in practice. Moreover, $M_3(I, I, r)$ is only a (random) slice of the tensor, so the full information is not utilized. Can we find more efficient learning methods using higher order moments?

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Tensor Factorization. Recover $A$ and $w$ from $M_2$ and $M_3$: $M_2 = \sum_i w_i\, a_i a_i^\top$, $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. [Figure: the tensor $M_3$ drawn as the sum $w_1\, a_1 \otimes a_1 \otimes a_1 + w_2\, a_2 \otimes a_2 \otimes a_2 + \dots$] Here $a \otimes a \otimes a$ is a rank-1 tensor, since its $(i_1,i_2,i_3)$-th entry is $a_{i_1} a_{i_2} a_{i_3}$, so $M_3$ is a sum of rank-1 terms. When is this the most compact representation? (Identifiability.) Can we recover the decomposition? (Algorithm.) The most compact representation is known as the CP decomposition (CANDECOMP/PARAFAC).

Initial Ideas. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. What if $A$ has orthonormal columns? Then $M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1 \rangle^2 a_i = w_1 a_1$, so the $a_i$ are eigenvectors of the tensor $M_3$, analogous to matrix eigenvectors ($Mv = M(I,v) = \lambda v$). Two problems: how to find eigenvectors of a tensor, and $A$ is not orthogonal in general.

Whitening. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Find a whitening matrix $W$ such that $W^\top A = V$ is an orthogonal matrix. When $A \in \mathbb{R}^{d \times k}$ has full column rank, this is an invertible transformation. [Figure: whitening maps the vectors $a_1, a_2, a_3$ to orthogonal vectors $v_1, v_2, v_3$.]

Using Whitening to Obtain an Orthogonal Tensor. Multilinear transform: $M_3 \in \mathbb{R}^{d \times d \times d}$, $T \in \mathbb{R}^{k \times k \times k}$, with $T = M_3(W, W, W) = \sum_i w_i (W^\top a_i)^{\otimes 3}$. Then $T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$ is an orthogonal tensor. This is a dimensionality reduction when $k \ll d$. [Figure: the tensor $M_3$ mapped to the tensor $T$ by the multilinear transform.]

How to Find the Whitening Matrix? $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Use the pairwise moments $M_2$ to find $W$ such that $W^\top M_2 W = I$: from the eigen-decomposition $M_2 = U\, \mathrm{Diag}(\tilde\lambda)\, U^\top$, take $W = U\, \mathrm{Diag}(\tilde\lambda)^{-1/2}$. Then $V := W^\top A\, \mathrm{Diag}(w)^{1/2}$ is an orthogonal matrix, and $T = M_3(W, W, W) = \sum_i w_i^{-1/2} (W^\top a_i \sqrt{w_i})^{\otimes 3} = \sum_i \lambda_i\, v_i \otimes v_i \otimes v_i$ with $\lambda_i := w_i^{-1/2}$. So $T$ is an orthogonal tensor. [Figure: whitening maps $a_1, a_2, a_3$ to orthonormal $v_1, v_2, v_3$.]
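
Illustration (a minimal MATLAB sketch with made-up parameters): compute W from M2, check that it whitens, and form the orthogonal tensor T = M3(W, W, W).

% Whitening and the multilinear transform, on exact population moments.
d = 6; k = 3;
A = randn(d, k); w = rand(k, 1); w = w / sum(w);
M2 = A * diag(w) * A';
M3 = zeros(d, d, d);
for i = 1:k
    ai = A(:, i);
    M3 = M3 + w(i) * reshape(kron(ai, kron(ai, ai)), d, d, d);
end
[U, L, ~] = svd(M2);
W = U(:, 1:k) * diag(1 ./ sqrt(diag(L(1:k, 1:k))));            % whitening matrix
norm(W' * M2 * W - eye(k), 'fro')                              % ~ 0
V = W' * A * diag(sqrt(w));
norm(V' * V - eye(k), 'fro')                                   % ~ 0: V is orthogonal
T = reshape(W' * reshape(M3, d, d * d) * kron(W, W), k, k, k); % T = M3(W, W, W) via mode-1 unfolding
% T = sum_i lambda_i v_i (x) v_i (x) v_i with lambda_i = w_i^(-1/2)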

Orthogonal Tensor Eigen-Decomposition. $T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, with $\langle v_i, v_j \rangle = \delta_{i,j}$ for all $i, j$. Then $T(I, v_1, v_1) = \sum_i \lambda_i \langle v_i, v_1 \rangle^2 v_i = \lambda_1 v_1$, so the $v_i$ are eigenvectors of the tensor $T$. Tensor power method: start from an initial vector $v$ and iterate $v \mapsto T(I, v, v) / \|T(I, v, v)\|$. Canonical recovery method: randomly initialize the power method, run it to convergence to obtain $v$ with eigenvalue $\lambda$, deflate $T \leftarrow T - \lambda\, v \otimes v \otimes v$, and repeat.
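
Illustration (a minimal MATLAB sketch with made-up names; the recap later applies the same update directly to whitened samples): power iteration with deflation on an exactly orthogonal tensor.

% Tensor power method with deflation on T = sum_i lambda_i v_i (x) v_i (x) v_i.
k = 4;
[Vt, ~] = qr(randn(k));                            % orthonormal v_i (columns of Vt)
lam_true = 1 + rand(k, 1);
T = zeros(k, k, k);
for i = 1:k
    vi = Vt(:, i);
    T = T + lam_true(i) * reshape(kron(vi, kron(vi, vi)), k, k, k);
end
Tvv = @(T, v) reshape(T, k, k * k) * kron(v, v);   % the map v -> T(I, v, v)
V = zeros(k, k); lam = zeros(k, 1);
for i = 1:k
    v = randn(k, 1); v = v / norm(v);
    for iter = 1:100
        v_new = Tvv(T, v); v_new = v_new / norm(v_new);        % power update
        if norm(v_new - v) < 1e-12, v = v_new; break; end
        v = v_new;
    end
    lam(i) = v' * Tvv(T, v);                       % eigenvalue T(v, v, v)
    V(:, i) = v;
    T = T - lam(i) * reshape(kron(v, kron(v, v)), k, k, k);    % deflation
end
abs(V' * Vt)                                       % ~ a permutation matrix: recovered up to order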

Putting it together. Gaussian mixture: $x = Ah + z$, with $\mathbb{E}[h] = w$ and $z \sim \mathcal{N}(0, \sigma^2 I)$; $M_2 = \sum_i w_i\, a_i a_i^\top$, $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. Obtain the whitening matrix $W$ from the SVD of $M_2$, use $W$ for the multilinear transform $T = M_3(W, W, W)$, and find the eigenvectors of $T$ through the power method and deflation. What about learning other latent variable models?

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Topic Modeling

Probabilistic Topic Models. A useful abstraction for automatic categorization of documents. Observed: words; hidden: topics. Bag of words: the order of words does not matter. Graphical model representation: $\ell$ words in a document, $x_1, \dots, x_\ell$; $h$ gives the proportions of topics in the document; word $x_i$ is generated from topic $y_i$; exchangeability: $x_1 \perp x_2 \perp \dots \mid h$. $A(i,j) := \mathbb{P}[x_m = i \mid y_m = j]$ is the topic-word matrix. [Graphical model figure: topic mixture $h$ over topics $y_1, \dots, y_5$, which generate words $x_1, \dots, x_5$ through $A$.]

Distribution of the topic proportions vector $h$. If there are $k$ topics, $h$ is distributed over the simplex $\Delta^{k-1} := \{h \in \mathbb{R}^k : h_i \in [0,1],\ \sum_i h_i = 1\}$. Simplification for today: there is only one topic in each document, i.e., $h \in \{e_1, \dots, e_k\}$, the basis vectors. Tomorrow: latent Dirichlet allocation.

Formulation as Linear Models. Distribution of the words $x_1, x_2, \dots$: order the $d$ words in the vocabulary; if $x_1$ is the $j$-th word, assign it $e_j \in \mathbb{R}^d$. The distribution of each $x_i$ is supported on the vertices of $\Delta^{d-1}$. Properties: $\Pr[x_1 \mid \text{topic } h = i] = \mathbb{E}[x_1 \mid \text{topic } h = i] = a_i$. Linear model: $\mathbb{E}[x_i \mid h] = Ah$. Multiview model: $h$ is fixed and multiple words $x_i$ are generated.

Geometric Picture for Topic Models. [Figure: the topic proportions vector $h$ of a document lives on the simplex; in the single-topic case, $h$ sits at a vertex, and the words $x_1, x_2, x_3$ are generated from the corresponding column of $A$.]

Gaussian Mixtures vs. Topic Models. (Spherical) mixture of Gaussians: $k$ means $a_1, \dots, a_k$; component $h = i$ with probability $w_i$; observe $x$ with spherical noise, $x = a_i + z$, $z \sim \mathcal{N}(0, \sigma_i^2 I)$. (Single) topic model: $k$ topics $a_1, \dots, a_k$; topic $h = i$ with probability $w_i$; observe $\ell$ (exchangeable) words $x_1, x_2, \dots, x_\ell$ drawn i.i.d. from $a_i$. Unified linear model: $\mathbb{E}[x \mid h] = Ah$. The Gaussian mixture is single-view with spherical noise; the topic model is multi-view with heteroskedastic noise. Three words per document suffice for learning: $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$.

Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

Recap: Basic Tensor Decomposition Method. Toy example in MATLAB: simulated samples from an exchangeable model; whitening the samples (second order moments, matrix decomposition); orthogonal tensor eigen-decomposition (third order moments, power iteration).

Simulated Samples: Exchangeable Model. Model parameters: hidden state $h$ in the basis $\{e_1, \dots, e_k\}$ with $k = 2$; observed states $x_i$ in the basis $\{e_1, \dots, e_d\}$ with $d = 3$; conditional independence $x_1 \perp x_2 \perp x_3 \mid h$; transition matrix $A$; exchangeability $\mathbb{E}[x_i \mid h] = Ah$ for $i \in \{1, 2, 3\}$. [Graphical model figure: $h$ with children $x_1, x_2, x_3$.]

Generate Samples Snippet:

% assumes n samples and a column-stochastic d x k matrix A_true are given
h = zeros(n, k); x1 = zeros(n, d); x2 = zeros(n, d); x3 = zeros(n, d);
for t = 1 : n
    % generate h for this sample
    h_category = (rand() > 0.5) + 1;
    h(t, h_category) = 1;
    transition_cum = cumsum(A_true(:, h_category));
    % generate x1 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x1(t, x_category) = 1;
    % generate x2 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x2(t, x_category) = 1;
    % generate x3 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x3(t, x_category) = 1;
end

Whiten the Samples. Second order moments: $M_2 = \frac{1}{n} \sum_t x_1^t \otimes x_2^t$. Whitening matrix: $W = U_w L_w^{-1/2}$, where $[U_w, L_w] = k\text{-SVD}(M_2)$.

Whitening Snippet:

fprintf('The second order moment M2:\n');
M2 = x1' * x2 / n
[Uw, Lw, Vw] = svd(M2);
fprintf('M2 singular values:\n');
Lw
W = Uw(:, 1:k) * sqrt(pinv(Lw(1:k, 1:k)));
y1 = x1 * W; y2 = x2 * W; y3 = x3 * W;

Whitened data: $y_1^t = W^\top x_1^t$; orthogonal basis $V = W^\top A$ with $V^\top V = I$. [Figure: whitening maps $a_1, a_2$ to orthogonal $v_1, v_2$.]

Orthogonal Tensor Eigen-Decomposition. Third order moments: $T = \frac{1}{n} \sum_{t \in [n]} y_1^t \otimes y_2^t \otimes y_3^t \approx \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, with $V^\top V = I$. Gradient ascent (power update): $T(I, v_1, v_1) = \frac{1}{n} \sum_{t \in [n]} \langle v_1, y_2^t \rangle \langle v_1, y_3^t \rangle\, y_1^t = \sum_i \lambda_i \langle v_i, v_1 \rangle^2 v_i = \lambda_1 v_1$. The $v_i$ are eigenvectors of the tensor $T$.

Orthogonal Tensor Eigen-Decomposition. $T \approx \sum_j \lambda_j v_j^{\otimes 3}$; power update $v \mapsto T(I, v, v) / \|T(I, v, v)\|$.

Power Iteration Snippet:

% Maxiter and TOL are set elsewhere in the script; normc normalizes columns (v/norm(v) also works).
V = zeros(k, k); Lambda = zeros(k, 1);
for i = 1:k
    v_old = rand(k, 1); v_old = normc(v_old);
    for iter = 1 : Maxiter
        v_new = (y1' * ((y2 * v_old) .* (y3 * v_old))) / n;   % empirical T(I, v, v)
        if i > 1
            % deflation
            for j = 1 : i-1
                v_new = v_new - (V(:, j) * (v_old' * V(:, j))^2) * Lambda(j);
            end
        end
        lambda = norm(v_new); v_new = normc(v_new);
        if norm(v_old - v_new) < TOL
            fprintf('Converged at iteration %d.\n', iter);
            V(:, i) = v_new; Lambda(i, 1) = lambda;
            break;
        end
        v_old = v_new;
    end
end

[Figure: power iteration estimates at iterations 0, 1, 23, and 45; green: ground truth, red: estimate at each iteration.]

Conclusion. [Graphical model figure: $h$ with children $x_1, x_2, x_3$.] Gaussian mixtures and topic models: a unified linear representation, learning through higher order moments, and tensor decomposition via whitening and the power method. Tomorrow's lecture: latent variable models and moments; analysis of the tensor power method. Wednesday's lecture: implementation of tensor methods. Code snippet available at http://newport.eecs.uci.edu/anandkumar/mlss.html