Guaranteed Learning of Latent Variable Models through Tensor Methods

Guaranteed Learning of Latent Variable Models through Tensor Methods Furong Huang University of Maryland furongh@cs.umd.edu ACM SIGMETRICS Tutorial 2018 1/75

Tutorial Topic: Learning algorithms (parameter estimation) for latent variable models based on decompositions of moment tensors. [Figure: unlabeled data → latent variable model → tensor decomposition → inference; method-of-moments (Pearson, 1894).] 2/75

Application 1: Clustering. Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint: the groups represent different distributions (e.g., Gaussians), and each data point is drawn from one of these distributions (e.g., Gaussian mixtures). 3/75

Application 2: Topic Modeling Document modeling Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization. 4/75

Application 3: Understanding Human Communities Social Networks Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of social actors. 5/75

Application 4: Recommender Systems Recommender System Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products. 6/75

Application 5: Feature Learning Feature Engineering Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures. 7/75

Application 6: Computational Biology Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups 8/75

Application 7: Human Disease Hierarchy Discovery. CMS: 1.6 million patients, 168 million diagnostic events, 11k diseases. Scalable Latent Tree Model and its Application to Health Analytics, by F. Huang, N. U. Niranjan, I. Perros, R. Chen, J. Sun, A. Anandkumar, NIPS 2015 MLHC workshop. 9/75

How to model hidden effects? Basic approach: mixtures/clusters, where the hidden variable h is categorical. Advanced: probabilistic models in which the hidden variable h has more general distributions and can model mixed memberships. [Figure: graphical model with hidden nodes h_1, h_2, h_3 and observed nodes x_1, ..., x_5.] This talk: basic mixture model and some advanced models. 10/75

Challenges in Learning
Basic goal in all the applications mentioned: discover hidden structure in data, i.e., unsupervised learning. [Figure: unlabeled data → latent variable model → tensor decomposition → inference.]
Challenge: conditions for identifiability. Can the model be identified given infinite computation and data? Are there tractable algorithms under identifiability?
Challenge: efficient learning of latent variable models. MCMC: random sampling, slow (exponential mixing time). Likelihood: non-convex, not scalable (exponentially many critical points). Can we achieve efficient computational and sample complexities?
Guaranteed and efficient learning through spectral methods. 11/75

What this tutorial will cover
Outline
1 Introduction
2 Motivation: Challenges of MLE for Gaussian Mixtures
3 Introduction of Method of Moments and Tensor Notations
4 Topic Model for Single-topic Documents (identifiability; parameter recovery via decomposition of exact moments)
5 Error-tolerant Algorithms for Tensor Decompositions (decomposition for tensors with linearly independent components; decomposition for tensors with orthogonal components)
6 Tensor Decomposition for Neural Network Compression
7 Conclusion 12/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 13/75

Gaussian Mixture Model
Generative model: samples come from K different Gaussian components with mixing weights Cat(π_1, π_2, ..., π_K); each sample is drawn from one of the K Gaussians N(μ_h, Σ_h), h ∈ [K]:
H ~ Cat(π_1, π_2, ..., π_K),  X | H = h ~ N(μ_h, Σ_h),  h ∈ [K].
Learning problem: estimate the mean vector μ_h, covariance matrix Σ_h, and mixing weight π_h of each subpopulation from unlabeled data. 14/75
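To make the generative model concrete, here is a minimal NumPy sketch (not from the slides; the parameters pi, mus, sigmas below are made-up illustrative values) that draws samples by first choosing a component H ~ Cat(π) and then drawing X from the corresponding Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D mixture with K = 3 spherical components (illustrative values only).
pi = np.array([0.5, 0.3, 0.2])                        # mixing weights
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # component means
sigmas = np.array([0.5, 1.0, 0.8])                    # per-component std devs

def sample_gmm(n):
    """Draw n samples: H ~ Cat(pi), X | H = h ~ N(mu_h, sigma_h^2 I)."""
    h = rng.choice(len(pi), size=n, p=pi)             # latent component labels
    x = mus[h] + sigmas[h][:, None] * rng.standard_normal((n, 2))
    return x, h

X, H = sample_gmm(1000)
print(X.shape, np.bincount(H) / len(H))               # empirical mixing weights
```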

Maximum Likelihood Estimator (MLE)
Data {x_i}_{i=1}^n. Likelihood: Pr_θ(data) = ∏_{i=1}^n Pr_θ(x_i) (iid).
Model parameter estimation: θ_mle := argmax_{θ ∈ Θ} log Pr_θ(data).
Latent variable models: some variables are hidden, so there is no direct estimator; local optimization via Expectation-Maximization (EM) (Dempster, Laird, & Rubin, 1977). 15/75

MLE for Gaussian Mixture Models
Given data {x_i}_{i=1}^n and the number of Gaussian components K, the model parameters to be estimated are θ = {(μ_h, Σ_h, π_h)}_{h=1}^K.
θ_mle for Gaussian mixture models:
θ_mle := argmax_θ Σ_{i=1}^n log( Σ_{h=1}^K π_h det(Σ_h)^{-1/2} exp( -(1/2) (x_i − μ_h)^⊤ Σ_h^{-1} (x_i − μ_h) ) )
Solving for the MLE is NP-hard (Dasgupta, 2008; Aloise, Deshpande, Hansen, & Popat, 2009; Mahajan, Nimbhorkar, & Varadarajan, 2009; Vattani, 2009; Awasthi, Charikar, Krishnaswamy, & Sinop, 2015). 16/75

Definition: Consistent Estimator. Suppose iid samples {x_i}_{i=1}^n are generated by a distribution Pr_θ(x_i) whose model parameters θ ∈ Θ are unknown. An estimator θ̂ is consistent if E‖θ̂ − θ‖ → 0 as n → ∞.
Spherical Gaussian mixtures (Σ_h = I): for K = 2 and π_h = 1/2, EM is consistent (Xu, H., & Maleki, 2016; Daskalakis, Tzamos, & Zampetakis, 2016). For larger K, EM is easily trapped in local maxima far from the global max (Jin, Zhang, Balakrishnan, Wainwright, & Jordan, 2016). Practitioners often run EM with many (random) restarts, but it may take a long time to get near the global max. 17/75
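The "EM with many random restarts" practice can be sketched with scikit-learn's GaussianMixture, which runs EM from n_init random initializations and keeps the highest-likelihood fit; a minimal sketch, assuming a sample X like the one generated in the sketch above:

```python
from sklearn.mixture import GaussianMixture

# EM with 20 random restarts; fit() keeps the initialization with the highest
# log-likelihood. covariance_type='spherical' matches Sigma_h proportional to I.
gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      n_init=20, random_state=0)
gmm.fit(X)
print(gmm.weights_)   # estimated mixing weights pi_h
print(gmm.means_)     # estimated means mu_h
```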

Hardness of Parameter Estimation
It can be exponentially difficult, computationally or statistically, to learn model parameters, even in the parametric setting. Cryptographic hardness (e.g., Mossel & Roch, 2006); information-theoretic hardness (e.g., Moitra & Valiant, 2010). May require 2^Ω(K) running time or 2^Ω(K) sample size. 18/75

Ways Around the Hardness
Separation conditions: e.g., assume min_{i≠j} ‖μ_i − μ_j‖₂ / √(σ_i² + σ_j²) is sufficiently large (Dasgupta, 1999; Arora & Kannan, 2001; Vempala & Wang, 2002; ...).
Structural assumptions: e.g., assume sparsity or separability (anchor words) (Spielman, Wang & Wright, 2012; Arora, Ge & Moitra, 2012; ...).
Non-degeneracy conditions: e.g., assume μ_1, ..., μ_K span a K-dimensional space.
This tutorial: statistically and computationally efficient learning algorithms for non-degenerate instances via the method of moments. 19/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 20/75

Method-of-Moments At A Glance
1 Determine functions of the model parameters θ that are estimatable from observable data: moments E_θ[f(X)].
2 Form estimates of the moments using data (iid samples {x_i}_{i=1}^n): empirical moments Ê[f(X)].
3 Solve the approximate equations for the parameters θ: moment matching E_θ[f(X)] ≈ Ê[f(X)].
Toy example: how do we estimate a Gaussian, i.e., (μ, σ), given iid samples {x_i}_{i=1}^n ~ N(μ, σ²)? 21/75
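For the toy example, matching the first two moments gives μ̂ = Ê[X] and σ̂² = Ê[X²] − Ê[X]²; a minimal sketch with made-up true parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5
x = rng.normal(mu_true, sigma_true, size=10_000)   # iid samples from N(mu, sigma^2)

# Moment matching: E[X] = mu, E[X^2] = mu^2 + sigma^2.
mu_hat = x.mean()
sigma_hat = np.sqrt((x ** 2).mean() - mu_hat ** 2)
print(mu_hat, sigma_hat)                           # both converge at rate n^{-1/2}
```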

What is a tensor? A multi-dimensional array: a higher-order analogue of a matrix. The number of dimensions is called the tensor order. 22/75

Tensor Product
[a ⊗ b]_{i1,i2} = a_{i1} b_{i2}  (rank-1 matrix)
[a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}  (rank-1 tensor) 23/75
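These rank-1 objects are just outer products; a minimal NumPy sketch (illustrative vectors only):

```python
import numpy as np

a, b, c = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])

M = np.outer(a, b)                      # [a ⊗ b]_{i1,i2} = a_{i1} b_{i2}, a rank-1 matrix
T = np.einsum("i,j,k->ijk", a, b, c)    # [a ⊗ b ⊗ c]_{i1,i2,i3} = a_{i1} b_{i2} c_{i3}

print(M[0, 1] == a[0] * b[1])           # True
print(T[1, 0, 1] == a[1] * b[0] * c[1]) # True
```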

Slices Horizontal slices Lateral slices Frontal slices 24/75

Fiber Mode-1 (column) fibers Mode-2 (row) fibers Mode-3 (tube) fibers 25/75

CP decomposition: X = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h. Rank: the minimum number of rank-1 tensors whose sum generates the tensor. 26/75

Multi-linear Transform
Multi-linear operation: if T = Σ_{h=1}^R a_h ⊗ b_h ⊗ c_h, a multi-linear operation using matrices (X, Y, Z) is
T(X, Y, Z) := Σ_{h=1}^R (X^⊤ a_h) ⊗ (Y^⊤ b_h) ⊗ (Z^⊤ c_h).
Similarly, for a multi-linear operation using vectors (x, y, z),
T(x, y, z) := Σ_{h=1}^R ⟨x, a_h⟩ ⟨y, b_h⟩ ⟨z, c_h⟩. 27/75
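With the convention T(X, Y, Z)_{ijk} = Σ_{pqr} T_{pqr} X_{pi} Y_{qj} Z_{rk}, the multi-linear operation is a single tensor contraction; a minimal sketch (the helper names multilinear and multilinear_vec are ours, not from the slides):

```python
import numpy as np

def multilinear(T, X, Y, Z):
    """T(X, Y, Z)_{ijk} = sum_{pqr} T_{pqr} X_{pi} Y_{qj} Z_{rk}."""
    return np.einsum("pqr,pi,qj,rk->ijk", T, X, Y, Z)

def multilinear_vec(T, x, y, z):
    """T(x, y, z) = sum_{pqr} T_{pqr} x_p y_q z_r (a scalar)."""
    return np.einsum("pqr,p,q,r->", T, x, y, z)

# Sanity check on a rank-1 tensor a ⊗ b ⊗ c: T(X, Y, Z) = (X^T a) ⊗ (Y^T b) ⊗ (Z^T c).
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(4)
T = np.einsum("i,j,k->ijk", a, b, c)
X, Y, Z = (rng.standard_normal((4, 2)) for _ in range(3))
lhs = multilinear(T, X, Y, Z)
rhs = np.einsum("i,j,k->ijk", X.T @ a, Y.T @ b, Z.T @ c)
print(np.allclose(lhs, rhs))   # True
```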

Tensors in Method of Moments
Matrix: pair-wise relationships. Observed signal/data x ∈ R^d; rank-1 matrix [x ⊗ x]_{i,j} = x_i x_j; aggregated pair-wise relationship M_2 = E[x ⊗ x].
Tensor: triple-wise (or higher) relationships. Observed signal/data x ∈ R^d; rank-1 tensor [x ⊗ x ⊗ x]_{i,j,k} = x_i x_j x_k; aggregated triple-wise relationship M_3 = E[x ⊗ x ⊗ x] = E[x^⊗3]. 28/75
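A minimal sketch of the corresponding empirical moments, given a data matrix with one observation per row (note that forming M̂_3 explicitly costs O(d³) memory):

```python
import numpy as np

def empirical_moments(X):
    """X: (n, d) array of observations. Returns M2_hat (d, d) and M3_hat (d, d, d)."""
    n = X.shape[0]
    M2 = X.T @ X / n                              # (1/n) sum_i x_i ⊗ x_i
    M3 = np.einsum("ni,nj,nk->ijk", X, X, X) / n  # (1/n) sum_i x_i ⊗ x_i ⊗ x_i
    return M2, M3

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
M2, M3 = empirical_moments(X)
print(M2.shape, M3.shape)   # (5, 5) (5, 5, 5)
```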

Why are tensors powerful?
Matrix orthogonal decomposition: not unique without an eigenvalue gap, e.g., the 2×2 identity matrix has
I = e_1 e_1^⊤ + e_2 e_2^⊤ = u_1 u_1^⊤ + u_2 u_2^⊤,  with u_1 = [√2/2, √2/2]^⊤, u_2 = [√2/2, −√2/2]^⊤;
unique with an eigenvalue gap.
Tensor orthogonal decomposition (Harshman, 1970): unique, no eigenvalue gap needed; a slice of the tensor has an eigenvalue gap. 29/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 30/75

Topic Modeling
General topic model (e.g., Latent Dirichlet Allocation): K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K; a hidden topic proportion w^(i) ∈ Δ^{K−1} per document i; each document is an iid mixture of topics. [Figure: topic-word matrix (topics such as Sports, Science, Business, Politics over words such as "play", "game", "season") and word counts per document.] 31/75

Topic Modeling
Topic model for single-topic documents: K topics, each associated with a distribution over vocabulary words, {a_h}_{h=1}^K; the hidden topic proportion of document i is a basis vector, w^(i) ∈ {e_1, ..., e_K}; the words of a document are drawn iid from a_h. [Figure: single-topic case, topic proportion (1.0, 0, 0, 0).] 31/75

Model Parameters of Topic Model for Single-topic Documents
Topic proportion w = [w_1, ..., w_K], where w_h = P[topic of word = h].
Topic-word matrix A = [a_1, ..., a_K], where A_{jh} = P[word = e_j | topic = h].
Goal: estimate the model parameters {(a_h, w_h)}_{h=1}^K given iid samples of n documents (word counts {c^(i)}_{i=1}^n). Frequency vector x^(i) = c^(i)/L, where the document length is L = Σ_j c^(i)_j. 32/75

Moment Matching
Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ~ Cat(w_1, ..., w_K); generate L words iid from a_h.
E[x] = Σ_{h=1}^K P[topic = h] E[x | topic = h] = Σ_{h=1}^K w_h a_h, since E[x | topic = h] = Σ_j P[word = e_j | topic = h] e_j = a_h. 33/75

Identifiability: how long must the documents be?
Nondegenerate model (linearly independent topic-word matrix). Generative process: choose h ~ Cat(w_1, ..., w_K); generate L words iid from a_h.
M_1: distribution of words (M̂_1: occurrence frequency of words):
M_1 = E[x] = Σ_h w_h a_h;  M̂_1 = (1/n) Σ_{i=1}^n x^(i).
[Figure: a word-frequency vector over (campus, police, witness) written as a weighted sum of topic vectors for Education, Sports, Crime.]
No unique decomposition of a vector. 33/75

M_2: distribution of word pairs (M̂_2: co-occurrence of word pairs):
M_2 = E[x ⊗ x] = Σ_h w_h a_h ⊗ a_h;  M̂_2 = (1/n) Σ_{i=1}^n x^(i) ⊗ x^(i).
Matrix decomposition recovers the subspace spanned by {a_h}, not the actual model. 33/75

Find a whitening matrix W such that the whitened components v_h = W^⊤ a_h are orthogonal; many such W exist, and any one of them suffices. 33/75

M_3: distribution of word triples (M̂_3: co-occurrence of word triples):
M_3 = E[x^⊗3] = Σ_h w_h a_h^⊗3;  M̂_3 = (1/n) Σ_{i=1}^n (x^(i))^⊗3.
Orthogonalize the tensor by projecting the data with W:
M_3(W, W, W) = E[(W^⊤ x)^⊗3] = Σ_h w_h (W^⊤ a_h)^⊗3;  M̂_3(W, W, W) = (1/n) Σ_{i=1}^n (W^⊤ x^(i))^⊗3.
This has a unique orthogonal tensor decomposition with components {v_h}_{h=1}^K. 33/75

Model parameter estimation: â_h = (W^⊤)^† v̂_h (up to scaling).
L ≥ 3: topic models for single-topic documents can be learned through matrix/tensor decomposition. 33/75

Take Away Message
Consider topic models whose word distributions under different topics are linearly independent. The parameters of the topic model for single-topic documents can be efficiently recovered from the distribution of three-word documents (word triples):
M_3 = E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h  (M̂_3: co-occurrence of word triples).
Two-word documents are not sufficient for identifiability. 34/75
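A quick numerical illustration of the take-away (not from the slides): sample single-topic documents, average e_{w1} ⊗ e_{w2} ⊗ e_{w3} over three distinct word draws per document (which avoids the diagonal bias of reusing the same word), and compare with Σ_h w_h a_h^⊗3. The topic proportions w and topic-word matrix A below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 6, 2, 50_000                        # vocab size, topics, documents (illustrative)
w = np.array([0.6, 0.4])                      # topic proportions
A = rng.dirichlet(np.ones(d), size=K).T       # columns a_h: word distribution per topic

M3_true = np.einsum("h,ih,jh,kh->ijk", w, A, A, A)   # sum_h w_h a_h ⊗ a_h ⊗ a_h

M3_hat = np.zeros((d, d, d))
for _ in range(n):
    h = rng.choice(K, p=w)                           # document topic
    w1, w2, w3 = rng.choice(d, size=3, p=A[:, h])    # three words drawn iid from a_h
    M3_hat[w1, w2, w3] += 1.0                        # e_{w1} ⊗ e_{w2} ⊗ e_{w3}
M3_hat /= n

print(np.max(np.abs(M3_hat - M3_true)))              # shrinks roughly like n^{-1/2}
```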

Tensor Methods Compared with Variational Inference
Learning topics from PubMed on Spark (8 million documents): [Figure: perplexity and running time for tensor vs. variational methods.]
Learning communities from graph connectivity (Facebook: n ≈ 20k; Yelp: n ≈ 40k; DBLPsub: n ≈ 0.1M; DBLP: n ≈ 1M): [Figure: error per group and running times on FB, YP, DBLPsub, DBLP.]
Orders of magnitude faster and more accurate.
Online Tensor Methods for Learning Latent Variable Models, F. Huang, U. Niranjan, M. Hakeem, A. Anandkumar, JMLR 2014. Tensor Methods on Apache Spark, F. Huang, A. Anandkumar, Oct. 2015. 35/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 36/75

Jennrich's Algorithm (Simplified)
Task: given tensor T = Σ_{h=1}^K μ_h^⊗3 with linearly independent components {μ_h}_{h=1}^K, find the components (up to scaling).
Properties of tensor slices: a linear combination of slices is T(I, I, c) = Σ_h ⟨μ_h, c⟩ μ_h μ_h^⊤.
Intuitions for Jennrich's algorithm: linear combinations of slices of the tensor share the same set of eigenvectors, and the shared eigenvectors are the tensor components {μ_h}_{h=1}^K. 37/75

Jennrich's Algorithm (Simplified)
Task: given tensor T = Σ_{h=1}^K μ_h^⊗3 with linearly independent components {μ_h}_{h=1}^K, find the components (up to scaling).
Algorithm: Jennrich's Algorithm
Require: tensor T ∈ R^{d×d×d}. Ensure: components {μ̂_h}_{h=1}^K = {μ_h}_{h=1}^K (a.s.).
1: Sample c and c′ independently and uniformly at random from S^{d−1}
2: Return {μ̂_h}_{h=1}^K ← eigenvectors of T(I, I, c) T(I, I, c′)^†
Consistency of Jennrich's algorithm: do the estimators {μ̂_h}_{h=1}^K recover the unknown components {μ_h}_{h=1}^K (up to scaling)? 38/75
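A minimal NumPy sketch of this simplified Jennrich procedure (our own illustration, with the pseudo-inverse written explicitly as in the analysis on the next slide; each recovered component is only defined up to sign and scale):

```python
import numpy as np

def jennrich(T, K, rng):
    """Recover K linearly independent components of T = sum_h mu_h^{⊗3}, up to scaling."""
    d = T.shape[0]
    c = rng.standard_normal(d);  c /= np.linalg.norm(c)      # random contraction vectors
    cp = rng.standard_normal(d); cp /= np.linalg.norm(cp)
    Tc = np.einsum("ijk,k->ij", T, c)                        # slice combination T(I, I, c)
    Tcp = np.einsum("ijk,k->ij", T, cp)                      # slice combination T(I, I, c')
    eigvals, eigvecs = np.linalg.eig(Tc @ np.linalg.pinv(Tcp))
    idx = np.argsort(-np.abs(eigvals))[:K]                   # keep the K dominant eigenvectors
    return np.real(eigvecs[:, idx])                          # columns ~ mu_h up to scaling

# Sanity check: random linearly independent components in R^6.
rng = np.random.default_rng(1)
mus = rng.standard_normal((6, 3))                            # columns mu_1, mu_2, mu_3
T = np.einsum("ih,jh,kh->ijk", mus, mus, mus)
est = jennrich(T, K=3, rng=rng)
# Compare directions: each column matches some mu_h up to sign and scale.
cos = np.abs((est / np.linalg.norm(est, axis=0)).T @ (mus / np.linalg.norm(mus, axis=0)))
print(np.round(cos, 3))                                      # permutation-like matrix of ~1s
```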

Analysis of Consistency of Jennrich's Algorithm
Recall: linear combinations of slices share the eigenvectors {μ_h}_{h=1}^K, i.e.,
T(I, I, c) T(I, I, c′)^† = U D_c U^⊤ (U^⊤)^† D_{c′}^{-1} U^† = U (D_c D_{c′}^{-1}) U^†  (a.s.),
where U = [μ_1 ... μ_K] collects the linearly independent tensor components and D_c = Diag(⟨μ_1, c⟩, ..., ⟨μ_K, c⟩) is diagonal.
By linear independence of {μ_h}_{h=1}^K and the random choice of c and c′:
1 U has rank K;
2 D_c and D_{c′} are invertible (a.s.);
3 the diagonal entries of D_c D_{c′}^{-1} are distinct (a.s.).
So {μ_h}_{h=1}^K are the eigenvectors of T(I, I, c) T(I, I, c′)^† with distinct non-zero eigenvalues: Jennrich's algorithm is consistent. 39/75

Error-tolerant algorithms for tensor decompositions 40/75

Moment Estimator: Empirical Moments
Moments E_θ[f(X)] are functions of the model parameters θ. Empirical moments Ê[f(X)] are computed using only the iid samples {x_i}_{i=1}^n.
Example. Third-order moment (distribution of word triples): E[x ⊗ x ⊗ x] = Σ_h w_h a_h ⊗ a_h ⊗ a_h. Empirical third-order moment (co-occurrence frequency of word triples): Ê[x ⊗ x ⊗ x] = (1/n) Σ_{i=1}^n x_i ⊗ x_i ⊗ x_i.
We inevitably expect an error of order n^{-1/2} in some norm, e.g.,
operator norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖ ≈ n^{-1/2}, where ‖T‖ := sup_{x,y,z ∈ S^{d−1}} T(x, y, z);
Frobenius norm: ‖E[x ⊗ x ⊗ x] − Ê[x ⊗ x ⊗ x]‖_F ≈ n^{-1/2}, where ‖T‖_F := √(Σ_{i,j,k} T_{i,j,k}²). 41/75

Stability of Jennrich's Algorithm
Recall Jennrich's algorithm: given tensor T = Σ_{h=1}^K μ_h^⊗3 with linearly independent components {μ_h}_{h=1}^K, recover the components (up to scaling) as the eigenvectors of T(I, I, c) T(I, I, c′)^†.
Challenge: we only have access to T̂ with ‖T − T̂‖ ≈ n^{-1/2}.
Stability of eigenvectors requires eigenvalue gaps. To ensure eigenvalue gaps for T̂(·,·,c) T̂(·,·,c′)^†, we need T̂(·,·,c) T̂(·,·,c′)^† ≈ T(·,·,c) T(·,·,c′)^†. Ultimately, ‖T − T̂‖_F ≤ 1/poly(d) is required.
A different approach? 42/75

Initial Ideas
In many applications, we estimate moments of the form M_3 = Σ_{h=1}^K w_h a_h^⊗3, where {a_h}_{h=1}^K are assumed to be linearly independent.
What if {a_h}_{h=1}^K were orthonormal? Then M_3(I, a_i, a_i) = Σ_h w_h ⟨a_h, a_i⟩² a_h = w_i a_i for all i. This is analogous to matrix eigenvectors: Mv = M(I, v) = λv. So define orthonormal {a_h}_{h=1}^K as the eigenvectors of the tensor M_3.
Two problems: {a_h}_{h=1}^K is not orthogonal in general, and how do we find eigenvectors of a tensor? 43/75

Whitening is the process of finding a whitening matrix W such that the multi-linear operation (using W) on M_3 orthogonalizes its components:
M_3(W, W, W) = Σ_h w_h (W^⊤ a_h)^⊗3 = Σ_h w_h v_h^⊗3,  with v_h ⊥ v_{h′} for h ≠ h′. 44/75

Whitening
Given M_3 = Σ_h w_h a_h^⊗3 and M_2 = Σ_h w_h a_h ⊗ a_h, find a whitening matrix W such that the v_h = W^⊤ a_h are orthogonal. When A = [a_1, ..., a_K] ∈ R^{d×K} has full column rank, this is an invertible transformation. [Figure: a_1, a_2, a_3 mapped by W to orthogonal v_1, v_2, v_3.] 45/75

Using Whitening to Obtain an Orthogonal Tensor
Multi-linear transform: T = M_3(W, W, W) = Σ_h w_h (W^⊤ a_h)^⊗3, so T = Σ_{h∈[K]} w_h v_h^⊗3 has orthogonal components.
Dimensionality reduction when K ≪ d: M_3 ∈ R^{d×d×d} while T ∈ R^{K×K×K}. 46/75

How to Find the Whitening Matrix?
Given M_3 = Σ_h w_h a_h^⊗3 and M_2 = Σ_h w_h a_h ⊗ a_h, the goal is a W such that the components become orthogonal.
Use the pairwise moments M_2 to find W with W^⊤ M_2 W = I: take W = U Diag(λ̃)^{-1/2}, where M_2 = U Diag(λ̃) U^⊤ is the eigen-decomposition of M_2.
Then V := W^⊤ A Diag(w)^{1/2} is an orthogonal matrix, and
T = M_3(W, W, W) = Σ_h w_h^{-1/2} (W^⊤ a_h √w_h)^⊗3 = Σ_h λ_h v_h^⊗3,  with λ_h := w_h^{-1/2},
so T is an orthogonal tensor. 47/75
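A minimal sketch of the whitening construction: build W from the top-K eigen-pairs of M_2, form T = M_3(W, W, W), and check that the whitened components are orthonormal (the weights w and components A below are made-up inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3
w = np.array([0.5, 0.3, 0.2])
A = rng.standard_normal((d, K))                      # linearly independent columns a_h

M2 = np.einsum("h,ih,jh->ij", w, A, A)               # sum_h w_h a_h ⊗ a_h
M3 = np.einsum("h,ih,jh,kh->ijk", w, A, A, A)        # sum_h w_h a_h^{⊗3}

# Whitening matrix from the top-K eigen-pairs of M2: W = U diag(lam)^{-1/2}.
lam, U = np.linalg.eigh(M2)
lam, U = lam[-K:], U[:, -K:]                         # keep the K largest eigenvalues
W = U / np.sqrt(lam)                                 # so that W^T M2 W = I_K

T = np.einsum("pqr,pi,qj,rk->ijk", M3, W, W, W)      # T = M3(W, W, W), a K x K x K tensor

V = (W.T @ A) * np.sqrt(w)                           # columns v_h = sqrt(w_h) W^T a_h
print(np.allclose(W.T @ M2 @ W, np.eye(K)))          # True: M2 is whitened
print(np.allclose(V.T @ V, np.eye(K)))               # True: whitened components orthonormal
print(np.allclose(T, np.einsum("h,ih,jh,kh->ijk",
                               1 / np.sqrt(w), V, V, V)))  # True: T = sum_h w_h^{-1/2} v_h^{⊗3}
```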

Initial Ideas (recap)
We estimate moments of the form M_3 = Σ_h w_h a_h^⊗3 with linearly independent {a_h}_{h=1}^K; if the components were orthonormal, they would be tensor eigenvectors (M_3(I, a_i, a_i) = w_i a_i, analogous to Mv = M(I, v) = λv). Two problems: {a_h}_{h=1}^K is not orthogonal in general, and how do we find eigenvectors of a tensor? 48/75

Review: Orthogonal Matrix Eigen Decomposition
Task: given matrix M = Σ_{h=1}^K λ_h v_h v_h^⊤ with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′} for h ≠ h′), find the components/eigenvectors.
Properties of matrix eigenvectors: fixed point of the linear transform, M(I, v_i) = Σ_h λ_h ⟨v_i, v_h⟩ v_h = λ_i v_i.
Intuition for the matrix power method: the linear transform preserves the direction of the eigenvectors {v_h}_{h=1}^K. 49/75

Orthogonal Tensor Eigen Decomposition
Task: given tensor T = Σ_{h=1}^K λ_h v_h^⊗3 with orthonormal components {v_h}_{h=1}^K (v_h ⊥ v_{h′} for h ≠ h′), find the components/eigenvectors.
Properties of tensor eigenvectors: fixed point of the bilinear transform, T(I, v_i, v_i) = Σ_h λ_h ⟨v_i, v_h⟩² v_h = λ_i v_i.
Intuition for the tensor power method: the bilinear transform preserves the direction of the eigenvectors {v_h}_{h=1}^K. 50/75

Orthogonal Matrix Eigen Decomposition
Task: given matrix M = Σ_{h=1}^K λ_h v_h^⊗2 with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors.
Algorithm: Matrix Power Method
Require: matrix M ∈ R^{K×K}. Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K (w.h.p.).
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← M(I, u_{i−1}) / ‖M(I, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T,  λ̂_h ← M(v̂_h, v̂_h)
7:   Deflate M ← M − λ̂_h v̂_h^⊗2
8: end for
Consistency of the matrix power method: does it converge, with {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? Does the convergence depend on the initialization? 51/75
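A minimal NumPy sketch of the matrix power method with deflation, for a symmetric M with distinct eigenvalues (T_iter plays the role of T in the pseudocode; this is our illustration, not code from the tutorial):

```python
import numpy as np

def matrix_power_method(M, K, T_iter=100, seed=0):
    """Recover K eigen-pairs of symmetric M = sum_h lambda_h v_h v_h^T by power iteration + deflation."""
    rng = np.random.default_rng(seed)
    M = M.copy()
    vals, vecs = [], []
    for _ in range(K):
        u = rng.standard_normal(M.shape[0]); u /= np.linalg.norm(u)
        for _ in range(T_iter):
            u = M @ u                       # linear map M(I, u)
            u /= np.linalg.norm(u)
        lam = u @ M @ u                     # Rayleigh quotient M(v, v)
        vals.append(lam); vecs.append(u)
        M = M - lam * np.outer(u, u)        # deflation
    return np.array(vals), np.column_stack(vecs)

# Sanity check against a known spectrum.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
true_vals = np.array([5.0, 3.0, 2.0, 1.0, 0.5])
M = Q @ np.diag(true_vals) @ Q.T
vals, vecs = matrix_power_method(M, K=5)
print(np.round(vals, 4))                    # ~ [5, 3, 2, 1, 0.5], recovered top-down
```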

Orthogonal Tensor Eigen Decomposition
Task: given tensor T = Σ_{h=1}^K λ_h v_h^⊗3 with orthonormal components {v_h}_{h=1}^K, find the components/eigenvectors.
Algorithm: Tensor Power Method
Require: tensor T ∈ R^{K×K×K}. Ensure: components {v̂_h}_{h=1}^K = {v_h}_{h=1}^K (w.h.p.).
1: for h = 1 : K do
2:   Sample u_0 uniformly at random from S^{K−1}
3:   for i = 1 : T do
4:     u_i ← T(I, u_{i−1}, u_{i−1}) / ‖T(I, u_{i−1}, u_{i−1})‖
5:   end for
6:   v̂_h ← u_T,  λ̂_h ← T(v̂_h, v̂_h, v̂_h)
7:   Deflate T ← T − λ̂_h v̂_h^⊗3
8: end for
Consistency of the tensor power method: does it converge, with {v̂_h}_{h=1}^K → {v_h}_{h=1}^K w.h.p.? Does the convergence depend on the initialization? 52/75
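And a corresponding minimal sketch of the tensor power method with deflation on an orthogonally decomposable tensor (a single random start per deflation step suffices for this well-conditioned example; in general one would use multiple restarts, as discussed in the analysis below):

```python
import numpy as np

def tensor_power_method(T, K, T_iter=50, seed=0):
    """Recover K eigen-pairs of T = sum_h lambda_h v_h^{⊗3} with orthonormal v_h."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    vals, vecs = [], []
    for _ in range(K):
        u = rng.standard_normal(T.shape[0]); u /= np.linalg.norm(u)
        for _ in range(T_iter):
            u = np.einsum("ijk,j,k->i", T, u, u)       # bilinear map T(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum("ijk,i,j,k->", T, u, u, u)     # T(v, v, v)
        vals.append(lam); vecs.append(u)
        T = T - lam * np.einsum("i,j,k->ijk", u, u, u) # deflation
    return np.array(vals), np.column_stack(vecs)

# Sanity check: orthonormal components from a QR factorization.
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
true_vals = np.array([4.0, 2.5, 1.5, 1.0])
T = np.einsum("h,ih,jh,kh->ijk", true_vals, V, V, V)
vals, vecs = tensor_power_method(T, K=4)
# Recovery order depends on the initialization (largest lambda_h * c_h first), so sort.
print(np.round(np.sort(vals)[::-1], 4))                # ~ [4, 2.5, 1.5, 1]
print(np.round(np.abs(vecs.T @ V), 2))                 # permutation-like matrix of ~1s
```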

Analysis of Consistency of Matrix Power Method
Order the eigenvectors {v_h}_{h=1}^K so that the corresponding eigenvalues satisfy λ_1 ≥ λ_2 ≥ ... ≥ λ_K. Project the initial point u_0 onto the eigenvectors {v_h}_{h=1}^K: c_h = ⟨u_0, v_h⟩ for all h.
Convergence properties:
Unique (identifiable) iff {λ_h}_{h=1}^K are distinct.
If the gap λ_2/λ_1 < 1 and c_1 ≠ 0, the matrix power method converges to v_1; convergence is linear given the gap λ_2/λ_1 < 1.
The linear transform gives M(I, u_0) = Σ_h λ_h (v_h^⊤ u_0) v_h = Σ_h λ_h c_h v_h, i.e., the projection in the v_h direction is scaled by λ_h.
After t iterations, 1 − (v_1^⊤ u_t)² ≤ K max_{i≠1} (c_i/c_1)² (λ_2/λ_1)^{2t}. 53/75

Analysis of Consistency of Tensor Power Method
Project the initial point u_0 onto the eigenvectors: c_h = ⟨u_0, v_h⟩ for all h. Order the eigenvectors {v_h}_{h=1}^K so that λ_1 c_1 > λ_2 c_2 ≥ ... ≥ λ_K c_K.
Convergence properties:
Identifiable iff {λ_h c_h}_{h=1}^K are distinct; initialization dependent.
If λ_2 c_2 / (λ_1 c_1) < 1 and λ_1 c_1 ≠ 0, the tensor power method converges to v_1. Note that v_1 is NOT necessarily the largest eigenvector. Convergence is quadratic given the gap λ_2 c_2 / (λ_1 c_1) < 1.
The bilinear transform gives T(I, u_0, u_0) = Σ_h λ_h (v_h^⊤ u_0)² v_h = Σ_h λ_h c_h² v_h, i.e., the projection in the v_h direction is squared and then scaled by λ_h.
After t iterations, 1 − (v_1^⊤ u_t)² ≤ K max_{i≠1} (λ_1/λ_i)² (λ_2 c_2 / (λ_1 c_1))^{2^{t+1}}. 54/75

Matrix vs. Tensor Power Iteration
Matrix power iteration: (1) requires a gap between the largest and second-largest eigenvalues, a property of the matrix only; (2) converges to the top eigenvector; (3) linear convergence, needing O(log(1/ε)) iterations.
Tensor power iteration: (1) requires a gap between the largest and second-largest λ_h c_h, a property of the tensor and the initialization u_0; (2) converges to the v_h with the largest λ_h c_h, not necessarily the largest eigenvector; (3) quadratic convergence, needing O(log log(1/ε)) iterations. 55/75

Spurious Eigenvectors for Tensor Eigen Decomposition
T = Σ_{h∈[K]} λ_h v_h^⊗3. Characterization of eigenvectors: T(I, v, v) = λv?
{v_h}_{h=1}^K are eigenvectors, since T(I, v_h, v_h) = λ_h v_h.
Bad news: there can be other eigenvectors (unlike the matrix case). E.g., when {λ_h}_{h=1}^K ≡ 1, v = (v_1 + v_2)/√2 satisfies T(I, v, v) = (1/√2) v.
How do we avoid spurious solutions (eigenvectors that are not components {v_h}_{h=1}^K)? The optimization viewpoint of tensor eigen decomposition will help: all spurious eigenvectors are saddle points. 56/75

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition 57/75

Optimization Viewpoint of Matrix/Tensor Eigen Decomposition
Optimization problem:
Matrix: max_v M(v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := M(v, v) − λ(v^⊤v − 1).
Tensor: max_v T(v, v, v) s.t. ‖v‖ = 1. Lagrangian: L(v, λ) := T(v, v, v) − 1.5 λ(v^⊤v − 1).
Non-convex: stationary points = {global optima, local optima, saddle points}.
Stationary points (first derivative ∇L(v, λ) = 0):
Matrix: ∇L(v, λ) = 2(M(I, v) − λv) = 0, so eigenvectors are stationary points; the power method v ← M(I, v)/‖M(I, v)‖ is a version of gradient ascent.
Tensor: ∇L(v, λ) = 3(T(I, v, v) − λv) = 0, so eigenvectors are stationary points; the power method v ← T(I, v, v)/‖T(I, v, v)‖ is a version of gradient ascent.
Local optima (w^⊤ ∇²L(v, λ) w < 0 for all w ⊥ v, at a stationary point v):
Matrix: v_1 is the only local optimum; all other eigenvectors are saddle points.
Tensor: {v_h}_{h=1}^K are the only local optima; all spurious eigenvectors are saddle points. 58/75

Question: What about performance under noise? 59/75

Tensor Perturbation Analysis
T̂ = T + E,  T = Σ_h λ_h v_h^⊗3,  ‖E‖ := max_{‖x‖=1} E(x, x, x) ≤ ε.
Theorem: let T be the number of iterations. If T ≳ log K + log log(λ_max/ε) and ε < λ_min/K, then the output (v, λ) (after polynomially many restarts) satisfies
‖v − v_1‖ ≤ O(ε/λ_1),  |λ − λ_1| ≤ O(ε),
where v_1 is such that λ_1 c_1 > λ_2 c_2 ≥ ..., c_i := ⟨v_i, u_0⟩, and u_0 is the (successful) initializer.
A careful analysis of deflation avoids buildup of errors; this implies polynomial sample complexity for learning. 60/75

Other tensor decomposition techniques 61/75

Orthogonal Tensor Decomposition
Simultaneous power method (Wang & Lu, 2017): simultaneous recovery of eigenvectors; initialization is not optimal.
Orthogonalized simultaneous alternating least squares (Sharan & Valiant, 2017): random initialization; proven convergence for symmetric tensors.
Initialization: SVD-based initialization (Anandkumar & Janzamin, 2014); state-of-the-art (trace-based) initialization (Li & Huang, 2018). 62/75

Outline 1 Introduction 2 Motivation: Challenges of MLE for Gaussian Mixtures 3 Introduction of Method of Moments and Tensor Notations 4 Topic Model for Single-topic Documents 5 Algorithms for Tensor Decompositions 6 Tensor Decomposition for Neural Network Compression 7 Conclusion 63/75

Neural Network - Nonlinear Function Approximation
Success of deep neural networks: image classification, speech recognition, text processing; driven by growth in computation power and enormous labeled data.
Expressive power: linear vs. nonlinear composition; shallow networks vs. deep structures. 64/75

Revolution of Depth
[Figure: ImageNet classification top-5 error (%) vs. network depth: ILSVRC'10 28.2 (shallow), ILSVRC'11 25.8 (shallow), ILSVRC'12 AlexNet (8 layers) 16.4, ILSVRC'13 11.7 (8 layers), ILSVRC'14 VGG (19 layers) 7.3, ILSVRC'14 GoogleNet (22 layers) 6.7, ILSVRC'15 ResNet (152 layers) 3.57.]
[Figure: layer-by-layer architectures of AlexNet (8 layers, ILSVRC 2012), VGG (19 layers, ILSVRC 2014), and GoogleNet (22 layers, ILSVRC 2014).]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016. 65/75