Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016

Size: px

Start display at page:

Download "Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19, 2016"

Everett Benson
6 years ago
Views:

1 Optimization for Tensor Models Donald Goldfarb IEOR Department Columbia University UCLA Mathematics Department Distinguished Lecture Series May 17 19,

2 Tensors Matrix Tensor: higher-order matrix three-way tensor: p-way tensor: 2

3 Tensor == Vector? Fibers? Tensors can be viewed as a collection of fibers (Kolda & Bader, 2009) 3

4 Tensor == Matrix? Slices? Tensors can be viewed as a collection of slices (Kolda & Bader, 2009) 4

5 Tensors Why tensors? tensors capture multilinear structure more flexible and powerful models e.g. parameter estimation in latent variable modelling (to be discussed shortly) Tensor: object in its own right its own geometrical, statistical and computational issues much harder to work with than a matrix 5

6 Example: Single Topic Model Model: prob. k topics (dists. over d words) sample topic h: prob h = i = w i (i [k]) each document has m words x 1, x 2,, x m sampled i.i.d. from μ i o voc. Dataset: m-word documents Goal: learn parameters 6

7 Example: Single Topic Model Method of moments: Key idea: find parameters (approx.) consistent with observed moments (Pearson, 1894) Karl Pearson (1857~1936) Procedure: setup equations Moment population θ solve approximately = Moment sample 7

Example: Single Topic Model Goal: model the topics of the documents in a given corpus θ Samples of documents generated based on single topic model with params θ = ( μ i, w i

8 Example: Single Topic Model Goal: model the topics of the documents in a given corpus θ Samples of documents generated based on single topic model with params θ = ( μ i, w i ) Unsupervised learning alg. Method of moments setup equations solve them approx. Model parameters Which moments to use? 1 st -order? 2 nd -order? 3 rd -order? p th -order? 8

9 Example: Single Topic Model Binary encoding: x t = e i the t-th word in the document is i-th word in the vocabulary t, i m [d] E.g. topic: animal 3-word document 7-word vocabulary vocab. x 1 x 2 x 3 cats dogs fear I like raise want

10 Example: Single Topic Model Binary encoding: x t = e i First moment the t-th word in the document is i-th word in the vocabulary t, i m [d] Not identifiable: only d numbers for d + 1 k parameters. 10

11 Example: Single Topic Model Second moment Matrix-mode: still not identifiable even though d d+1 2 > d + 1 k. (Why?) 11

12 Example: Single Topic Model Identifiable? NO! U R n k is a solution M = UU T R n n M = UQ(UQ) T, for any Q O k (QQ T = Q T Q = I k ) U R n k is a solution for any Q O k AMBUIGUITY! Matrix mode: INSUFFICIENT to identify parameters.! 12

13 Example: Single Topic Model Third-order moment Tensor mode 13

14 Example: Single Topic Model Claim: uniquely determine the parameters θ 14

15 Example: Single Topic Model Reduction to orthogonal case via whitening Whiten: project to k dimensions transform to orthogonality M = UDU T W = UD 1/2 Reduced Eigen- Decomp. apply W to M and T v i i [k] forms an orthonormal basis for R k 15

16 Example: Single Topic Model Spectral theorem and eigen-decompositions v i i [k] forms an orthnormal basis for R k symmetric matrix symmetric tensor eigen-decomp. is unique iff λ i λ j if such decomp. exists, then it is always unique (even if λ i s all same) Uniqueness of orthogonal decomp. uniquely determine the parameters θ Identifiability issue is resolved via tensor mode! 16

17 2 nd Example: Mixture of Spherical Gaussians Model: k means Sample cluster h = i with probability w i (i [k]) Observe x, with i.i.d. homogeneous spherical noise Dataset: multiple points Goal: learn parameters 17

18 2 nd Example: Mixture of Spherical Gaussians Identifiable using 1 st, 2 nd and 3 rd order moments together [Hsu & Kakade, 13] Same approach as simple topic model! 18

19 General Principle Similar structures prevail in latent variable models Latent Dirichlet Allocatioin (LDA) Mixed Multinomial Logit Model Mixture of Gaussians (MoG) Hidden Markov Models (HMMs) 19

20 Orthogonal Decomposition How to find (λ i, v i ) i [k]? ( < v i, v j > = δ ij ) symmetric matrix symmetric tensor successive rank-one approximation (SROA) generalized SROA? 20

21 Successive Rank-One Approximation Symmetric matrix SROA: rank-one approximation + deflation Repeat k times (rank-one approx.) (deflation) Recover exactly ( ) 21

22 Successive Rank-One Approximation Symmetric orthogonal decomposable (SOD) tensor SROA: rank-one approximation + deflation Repeat k times (rank-one approx.) (deflation) Recover exactly up to sign flips [Zhang & Golub, 01] N.B.: no requirement on λ s to be distinct 22

23 Successive Rank-One Approximation Sampling error # samples = # samples < Other potential sources of perturbation Model misspecification Numerical error Is SROA robust to the perturbation? Recall matrix perturbation theory (e.g. Davis-Kahan), which requires perturbation matrix < min i j λ i λ j. 23

24 Successive Rank-One Approximation Perturbed SOD tensor SROA: rank-one approximation + deflation Repeat k times (rank-one approx.) (deflation) 24

25 Successive Rank-One Approximation Input: the perturbed SOD tensor Repeat k times Output: (rank-one approx.) (deflation) Theorem (MHG, 15) Output perm. on s.t. N.B. generalizes matrix perturbation analysis NO spectral gap quantity involved 25

26 Best Rank-One Tensor Approximation eliminate, then solve & 26

27 SDP Relaxation [Jiang, Ma & Zhang, 14] (*) reshape to induced symmetry & convex relaxation (*) The p = 3 tensor problem can be transformed into a p = 4 tensor problem [JMZ, 14]. 27

28 SDP Relaxation Linear constraints E.g. v = v 1, v 2 T X = vec(v v) vec v v T = = 4 v 1 v 3 1 v 2 3 v 2 v 1 v v 2 v 3 1 v 2 v v 2 v v 2 3 v 1 v 2 3 v 2 v 1 v v 2 v v 2 3 v 1 v 2 v v 2 3 v 1 v 2 3 v 1 v 2 4 v 2 v 1 2 v 1 v 2 v 2 v 1 v 2 2 v 1 2 v 1 v 2 v 2 v 1 v 2 2 spherical constr. 1 = v 2 2 = v v 2 2 = v v = trace(x) hyper- symmetry X = a d d b d b b e d b b e b e e c 28

29 SDP Relaxation N.B. SDP relaxation proposed by [Jiang, Ma & Zhang, 14] Square reshaping trick for low-rank tensor recovery [MHWG, 14] Equivalent SDP (of reduced size) proposed by [Nie & Wang, 14] using moment-based convex relaxation Empirically, observed almost always! i.e., SDP relaxation solves nonconvex problem! 29

30 Solving SDP Semidefinite programming (SDP) Linear constraints: 30

31 Deriving the dual problem Lagrangian function Dual problem 31

32 Augmented Lagrangian Method Augmented Lagrangian function Augmented Lagrangian method (ALM) k-th iteration: compute update minimizing jointly over is hard! 32

33 Alternating Direction Method of Multipliers (ADMM) Remedy: alternating direction minimize alternatively along y-direction and S-direction ADMM k-th iteration: y-update: S-update: X-update: Each step is (relatively) easy to compute! 33

34 Update y First-order optimality condition: assume is onto (surjective) 34

35 Update S with ( Eigen-Decomp.) 35

36 Update muliplier X 36

37 Convergence Semidefinite programming (SDP) Assumption: is onto s.t. and (Slater s condition) Convergence result Theorem (WGY, 10) From any starting point, a primal and dual solution 37

38 Another ADMM for Tensor Problem Tensor robust principal component anaysis (T-RPCA) X = L + S R n 1 n 2 n K Structural assumptions: L: low rank in each mode ( ) S: sparsely supported L (k) i.e. rank(l (k) ) is small for all k [K] i.e. cardinality S: S 0 is small Problem: Given X, recover L and S. 38

39 T-RPCA Convex surrogates Convex optimization rank(l (i) ) L (i) σ j (L (i) ) cardinality(s) S 1 S i1 i 2 i K min L,S s.t. λ i L i + S 1 L + S = X N.B. generalizes matrix robust PCA [CLMW, 11] theoretical guarantees [MHWG, 14] [HMGW, 15] 39

40 T-RPCA Variable Splitting i λ i L (i) L L 1 = L 2 = = L K i λ i L i, i 40

41 T-RPCA Reformulated problem min L i, S s.t. λ i L i, i + S 1 L i + S = X, i [K] Augmented Lagrangian function ADMM k-th iteration: L i -update: S-update: Λ i -update: 41

42 References [ZG] Zhang, T. and Golub G. H.. "Rank-one approximation to high order tensors." SIAM Journal on Matrix Analysis and Applications 23.2 (2001): [AGHKT] Anandkumar, A., et al. "Tensor decompositions for learning latent variable models." The Journal of Machine Learning Research 15.1 (2014): [HS] Hsu, D. and Sham M. K.. "Learning mixtures of spherical gaussians: moment methods and spectral decompositions." Proceedings of the 4th conference on Innovations in Theoretical Computer Science. ACM, [TB] Kolda, T. G. and Bader B. W. Bader.. "Tensor decompositions and applications." SIAM review 51.3 (2009): [JMZ] Jiang, B., Ma, S. and Zhang S.. "Tensor principal component analysis via convex optimization." Mathematical Programming150.2 (2015): [NW] Nie, J. and Wang, L.. "Semidefinite relaxations for best rank-1 tensor approximations." SIAM Journal on Matrix Analysis and Applications 35.3 (2014): [CLMW] Candès, E. J., et al. "Robust principal component analysis?."journal of the ACM (JACM) 58.3 (2011):

References [WGY] Wen, Z., Goldfarb, D., & Yin, W. (2010). Alternating direction augmented Lagrangian methods for semidefinite programming. Mathematical Programming Computation, 2(3-4), 203-230.

43 References [WGY] Wen, Z., Goldfarb, D., & Yin, W. (2010). Alternating direction augmented Lagrangian methods for semidefinite programming. Mathematical Programming Computation, 2(3-4), [MHWG] Mu, C., et al. "Square deal: Lower bounds and improved relaxations for tensor recovery." Proceedings of the international conference on Machine learning. ACM, 2014 [GQ] Goldfarb, D., & Qin, Z.. "Robust low-rank tensor recovery: Models and algorithms." SIAM Journal on Matrix Analysis and Applications 35.1 (2014): [HMGW] Huang, B., et al. "Provable models for robust low-rank tensor completion." Pacific Journal of Optimization 11 (2015): [MHG] Mu, C., Hsu, D., & Goldfarb, D.. "Successive Rank-One Approximations for Nearly Orthogonally Decomposable Symmetric Tensors." SIAM Journal on Matrix Analysis and Applications 36.4 (2015):

Robust Principal Component Analysis

ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M