Dictionary Learning Using Tensor Methods

1 Dictionary Learning Using Tensor Methods
Anima Anandkumar, U.C. Irvine
Joint work with Rong Ge, Majid Janzamin and Furong Huang.

5 Feature learning as cornerstone of ML
Find efficient representations of data, e.g. based on sparsity, invariances, low-dimensional structures, etc.
[Figure labels: ML Practice, ML Papers]
Feature engineering is typically critical for good performance.
Deep learning has shown considerable promise for feature learning.
Can we provide principled approaches that are guaranteed to learn good features?

8 Applications of Representation Learning
Compressed sensing
Extensive literature on compressed sensing: a few linear measurements suffice to recover sparse signals.
What if the signal is not sparse in the input representation?
What if the dictionary has invariances, e.g. shift or rotation?
Can we learn a representation in which the signal is sparse?
Topic modeling
Unsupervised learning of admixtures in text documents, social networks (community modeling), biological models, ...

12 Dictionary Learning Model
Goal: find a dictionary $A$ with $k$ elements such that each data point is a sparse linear combination of dictionary elements: $X = AH$.
Topic models: $x_i$ is a document, $A$ contains topics, $h_i$ gives the topics in document $i$.
Compressed sensing: $x_i$ are the signals, $A$ is a basis in which they have a sparse representation.
Images: $x_i$ is an image, $A$ contains filters, $h_i$ gives the filters present in image $i$ (also need to incorporate invariances).
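
To make the model $X = AH$ concrete, here is a minimal sketch that draws synthetic data from it. The function name, the unit-norm columns of $A$ and the Bernoulli-Gaussian choice for $H$ (each coefficient active with probability $s/k$) are assumptions made for illustration, not something the slides prescribe.

```python
import numpy as np

def sample_dictionary_model(d, k, n, s, seed=0):
    """Draw n samples x_i = A h_i with about s nonzeros per h_i (illustrative)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, k))
    A /= np.linalg.norm(A, axis=0)              # unit-norm dictionary elements (columns)
    support = rng.random((k, n)) < s / k        # each coefficient active with prob. s/k
    H = support * rng.standard_normal((k, n))   # Bernoulli-Gaussian coefficients
    X = A @ H                                   # data matrix, one sample per column
    return X, A, H

# Overcomplete example: k = 30 dictionary elements in d = 20 dimensions.
X, A, H = sample_dictionary_model(d=20, k=30, n=1000, s=3)
```

Each column of `H` has about `s` nonzero entries on average, so each column of `X` mixes only a few dictionary elements.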

13 Outline
1 Introduction
2 Tensor Methods for Dictionary Learning
3 Convolutional Dictionary Models
4 Conclusion

14 Computational Challenges: Learning Dictionary Models
Maximum likelihood: non-convex optimization, NP-hard.
Practice: local search approaches such as gradient descent, EM and variational Bayes have no consistency guarantees. They can get stuck in bad local optima, have poor convergence rates and are hard to parallelize.
Tensor methods can yield guaranteed learning.

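15 Moment Matrices and Tensors
Multivariate moments: $M_1 := \mathbb{E}[x]$, $M_2 := \mathbb{E}[x \otimes x]$, $M_3 := \mathbb{E}[x \otimes x \otimes x]$.
The matrix $\mathbb{E}[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second-order tensor: $\mathbb{E}[x \otimes x]_{i_1,i_2} = \mathbb{E}[x_{i_1} x_{i_2}]$. For matrices, $\mathbb{E}[x \otimes x] = \mathbb{E}[x x^\top]$.
The tensor $\mathbb{E}[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third-order tensor: $\mathbb{E}[x \otimes x \otimes x]_{i_1,i_2,i_3} = \mathbb{E}[x_{i_1} x_{i_2} x_{i_3}]$.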
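
The moment definitions on this slide translate directly into sample averages. The following is a small sketch using `numpy.einsum`; the function name and the convention that samples are stored as columns are assumptions for illustration.

```python
import numpy as np

def empirical_moments(X):
    """Empirical M1, M2, M3 of a (d, n) data matrix with samples as columns."""
    n = X.shape[1]
    M1 = X.mean(axis=1)                                 # E[x]
    M2 = np.einsum("in,jn->ij", X, X) / n               # E[x (x) x] = E[x x^T]
    M3 = np.einsum("in,jn,kn->ijk", X, X, X) / n        # E[x (x) x (x) x]
    return M1, M2, M3

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 500))
M1, M2, M3 = empirical_moments(X)
```

Entry `M3[i1, i2, i3]` is the sample average of `x[i1] * x[i2] * x[i3]`, matching the entrywise definition of the third-order moment.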

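17 Spectral Decomposition of Tensors
Matrix: $M_2 = \sum_i \lambda_i\, u_i \otimes v_i = \lambda_1\, u_1 \otimes v_1 + \lambda_2\, u_2 \otimes v_2 + \dots$
Tensor: $M_3 = \sum_i \lambda_i\, u_i \otimes v_i \otimes w_i = \lambda_1\, u_1 \otimes v_1 \otimes w_1 + \lambda_2\, u_2 \otimes v_2 \otimes w_2 + \dots$
$u \otimes v \otimes w$ is a rank-1 tensor since its $(i_1,i_2,i_3)$-th entry is $u_{i_1} v_{i_2} w_{i_3}$.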

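18 Moment forms for Dictionary Models
$x_i = A h_i$, $i \in [n]$.
Independent components analysis (ICA): the $h_i$ are independent, e.g. Bernoulli-Gaussian.
$M_4 := \mathbb{E}[x \otimes x \otimes x \otimes x] - T$, where
$T_{i_1,i_2,i_3,i_4} := \mathbb{E}[x_{i_1} x_{i_2}]\,\mathbb{E}[x_{i_3} x_{i_4}] + \mathbb{E}[x_{i_1} x_{i_3}]\,\mathbb{E}[x_{i_2} x_{i_4}] + \mathbb{E}[x_{i_1} x_{i_4}]\,\mathbb{E}[x_{i_2} x_{i_3}]$.
Let $\kappa_j := \mathbb{E}[h_j^4] - 3\,\mathbb{E}^2[h_j^2]$, $j \in [k]$. Then $M_4 = \sum_{j \in [k]} \kappa_j\, a_j \otimes a_j \otimes a_j \otimes a_j$.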
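
The ICA identity on this slide can be checked exactly on a toy example. The sketch below assumes zero-mean independent coefficients taking values in {-1, 0, +1} (an illustrative choice, with $P(h_j = \pm 1) = p_j/2$), enumerates all outcomes so that expectations are exact rather than sampled, and compares the fourth-order cumulant with $\sum_j \kappa_j a_j^{\otimes 4}$. All names are hypothetical.

```python
import itertools
import numpy as np

def fourth_cumulant(X, probs):
    """M4 = E[x^(x)4] - T for zero-mean x; X is (d, n) outcomes, probs their weights."""
    M2 = np.einsum("n,in,jn->ij", probs, X, X)
    E4 = np.einsum("n,in,jn,kn,ln->ijkl", probs, X, X, X, X)
    T = (np.einsum("ij,kl->ijkl", M2, M2)
         + np.einsum("ik,jl->ijkl", M2, M2)
         + np.einsum("il,jk->ijkl", M2, M2))
    return E4 - T

rng = np.random.default_rng(0)
d, k = 4, 3
A = rng.standard_normal((d, k))
p = np.array([0.2, 0.3, 0.5])          # P(h_j != 0), assumed for the example
values = np.array([-1.0, 0.0, 1.0])

# Enumerate all 3^k outcomes of h with their exact probabilities.
outcomes, probs = [], []
for idx in itertools.product(range(3), repeat=k):
    h = values[list(idx)]
    pr = np.prod([p[j] / 2 if h[j] != 0 else 1 - p[j] for j in range(k)])
    outcomes.append(h)
    probs.append(pr)
H = np.array(outcomes).T               # (k, 27)
probs = np.array(probs)

M4 = fourth_cumulant(A @ H, probs)
kappa = p - 3 * p**2                   # E[h^4] - 3 E^2[h^2], since E[h^4] = E[h^2] = p
expected = np.einsum("j,aj,bj,cj,dj->abcd", kappa, A, A, A, A)
```

Here `M4` and `expected` agree up to floating-point error, which is the statement $M_4 = \sum_j \kappa_j\, a_j^{\otimes 4}$ for this particular distribution.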

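19 Moment forms for Dictionary Models
General (sparse) coefficients: $x_i = A h_i$, $i \in [n]$, $\mathbb{E}[h_i] = s$.
$\mathbb{E}[h_i^4] = \mathbb{E}[h_i^2] = \beta s / k$,
$\mathbb{E}[h_i^2 h_j^2] \le \tau$, $i \neq j$,
$\mathbb{E}[h_i^3 h_j] = 0$, $i \neq j$.
$\mathbb{E}[x \otimes x \otimes x \otimes x] = \sum_{j \in [k]} \kappa_j\, a_j \otimes a_j \otimes a_j \otimes a_j + E$, where $\|E\| \le \tau \|A\|^4$.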

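22 Tensor Rank and Tensor Decomposition
Rank-1 tensor: $T = w \cdot a \otimes b \otimes c \iff T(i,j,l) = w \cdot a(i) \cdot b(j) \cdot c(l)$.
CANDECOMP/PARAFAC (CP) decomposition:
$T = \sum_{j \in [k]} w_j\, a_j \otimes b_j \otimes c_j \in \mathbb{R}^{d \times d \times d}$, $a_j, b_j, c_j \in \mathcal{S}^{d-1}$,
i.e. $T = w_1\, a_1 \otimes b_1 \otimes c_1 + w_2\, a_2 \otimes b_2 \otimes c_2 + \dots$
$k$: tensor rank, $d$: ambient dimension. $k \le d$: undercomplete; $k > d$: overcomplete.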
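
As a quick illustration of the CP form, the sketch below assembles a tensor from weights and unit-norm factor matrices. The helper name `cp_to_tensor` and the random factors are assumptions for the example; the overcomplete sizes ($k > d$) are chosen only to show that the construction does not require $k \le d$.

```python
import numpy as np

def cp_to_tensor(w, A, B, C):
    """Sum_j w[j] * a_j (x) b_j (x) c_j, with a_j, b_j, c_j the columns of A, B, C."""
    return np.einsum("j,aj,bj,cj->abc", w, A, B, C)

def unit_columns(M):
    """Normalize each column to lie on the unit sphere S^{d-1}."""
    return M / np.linalg.norm(M, axis=0)

rng = np.random.default_rng(0)
d, k = 3, 5                                   # k > d: overcomplete
A, B, C = (unit_columns(rng.standard_normal((d, k))) for _ in range(3))
w = rng.random(k) + 0.5
T = cp_to_tensor(w, A, B, C)
```

Entry `T[i, j, l]` equals `sum(w * A[i] * B[j] * C[l])`, the sum of the $k$ rank-1 entries $w_j\, a_j(i)\, b_j(j)\, c_j(l)$.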

29 Orthogonal Tensor Power Method
Symmetric orthogonal tensor $T \in \mathbb{R}^{d \times d \times d}$: $T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$.
Recall the matrix power method: $v \mapsto \dfrac{M(I,v)}{\|M(I,v)\|}$.
Algorithm (tensor power method): $v \mapsto \dfrac{T(I,v,v)}{\|T(I,v,v)\|}$.
How do we avoid spurious solutions (not part of the decomposition)?
The $\{v_i\}$ are the only robust fixed points.
All other eigenvectors are saddle points.
For an orthogonal tensor, there are no spurious local optima!
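
The update $v \mapsto T(I,v,v)/\|T(I,v,v)\|$ is short enough to write out. This is a minimal sketch with a fixed iteration count and a random start; the function name, the stopping rule and the test tensor are assumptions, and deflation (recovering the remaining components) is not shown.

```python
import numpy as np

def tensor_power_iteration(T, v0, n_iter=50):
    """Iterate v <- T(I, v, v) / ||T(I, v, v)|| and return (eigenvalue, eigenvector)."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(n_iter):
        u = np.einsum("ijk,j,k->i", T, v, v)      # T(I, v, v)
        v = u / np.linalg.norm(u)
    lam = np.einsum("ijk,i,j,k->", T, v, v, v)    # T(v, v, v)
    return lam, v

# Build a symmetric orthogonal tensor with known components v_1..v_k.
rng = np.random.default_rng(0)
d, k = 5, 3
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
V = V[:, :k]                                      # orthonormal columns
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum("j,aj,bj,cj->abc", lam, V, V, V)

lam_hat, v_hat = tensor_power_iteration(T, rng.standard_normal(d))
```

Which component the iteration lands on depends on the starting point; for an orthogonal tensor with positive weights it converges to one of the $v_i$ and returns the matching $\lambda_i$.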

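31 Putting it together
Non-orthogonal tensor: $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$.
Whitening matrix $W$; multilinear transform: $T = M_3(W,W,W)$.
[Figure: $W$ maps the non-orthogonal components $a_1, a_2, a_3$ of tensor $M_3$ to orthogonal components $v_1, v_2, v_3$ of tensor $T$]
Tensor decomposition in the undercomplete case: solved!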
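
The slide does not spell out how $W$ is built. One standard construction, assumed here, takes the top-$k$ eigenpairs of $M_2 = U D U^\top$ and sets $W = U D^{-1/2}$, so that $W^\top M_2 W = I$. The sketch below applies it and checks that the whitened components $v_i = \sqrt{w_i}\, W^\top a_i$ are orthonormal; function and variable names are hypothetical.

```python
import numpy as np

def whiten(M2, M3, k):
    """Return W with W^T M2 W = I_k and the whitened tensor T = M3(W, W, W)."""
    evals, evecs = np.linalg.eigh(M2)
    top = np.argsort(evals)[::-1][:k]               # indices of the k largest eigenvalues
    W = evecs[:, top] / np.sqrt(evals[top])         # U D^{-1/2}, shape (d, k)
    T = np.einsum("abc,ai,bj,ck->ijk", M3, W, W, W)
    return W, T

# Undercomplete example: k = 3 linearly independent components in d = 6 dimensions.
rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.standard_normal((d, k))
w = np.array([0.5, 0.3, 0.2])
M2 = np.einsum("j,aj,bj->ab", w, A, A)
M3 = np.einsum("j,aj,bj,cj->abc", w, A, A, A)

W, T = whiten(M2, M3, k)
V = (W.T @ A) * np.sqrt(w)                          # columns v_i = sqrt(w_i) W^T a_i
T_expected = np.einsum("j,aj,bj,cj->abc", w ** -0.5, V, V, V)
```

After whitening, `T` is a symmetric orthogonal tensor with components `V` and weights $w_i^{-1/2}$, so the orthogonal tensor power method applies to it.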

34 Overcomplete Setting
In general, tensor decomposition is NP-hard. It is tractable when $A$ is incoherent, i.e. $\langle a_i, a_j \rangle \sim \frac{1}{\sqrt{d}}$ for $i \neq j$.
SVD initialization: find the top singular vectors of $T(I,I,\theta)$ for $\theta \sim \mathcal{N}(0,I)$ and use them to initialize the power method. Repeat for $L$ trials.
Assumptions
Number of initializations: $L \ge k^{\Omega(k/d)^2}$.
Tensor rank: $k = O(d)$.
Number of iterations: $N = \Theta(\log(1/\|E\|))$. Recall $\|E\|$: recovery error.
Theorem (Global Convergence) [AGJ-COLT2015]: $\|a_1 - \hat{a}^{(N)}\| \le O(\|E\|)$.
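
The SVD initialization step can be sketched as follows: contract the tensor with a random Gaussian vector along one mode and keep the top left singular vector of the resulting matrix. The function name is hypothetical, and the small orthogonal test tensor is used only because its components are known, which makes the result easy to inspect; it is not an overcomplete instance.

```python
import numpy as np

def svd_initializations(T, n_trials, seed=0):
    """For each trial, return the top left singular vector of T(I, I, theta)."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    inits = []
    for _ in range(n_trials):
        theta = rng.standard_normal(d)                  # theta ~ N(0, I)
        slice_ = np.einsum("ijk,k->ij", T, theta)       # T(I, I, theta), a d x d matrix
        U, _, _ = np.linalg.svd(slice_)
        inits.append(U[:, 0])
    return np.array(inits)                              # shape (n_trials, d)

rng = np.random.default_rng(1)
d, k = 5, 3
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
V = V[:, :k]
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum("j,aj,bj,cj->abc", lam, V, V, V)

inits = svd_initializations(T, n_trials=4)
```

Each row of `inits` is a unit vector that would then be passed to the power method as a starting point. The sign of a singular vector is arbitrary.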

35 Improved Sample Complexity Analysis
Dictionary $A \in \mathbb{R}^{d \times k}$ satisfying RIP, sparse ICA model with sub-Gaussian variables.
Sparsity level $s$. Number of samples $n$.
$\|\hat{M}_4 - M_4\| = \tilde{O}\left( \dfrac{s^2}{n} + \sqrt{\dfrac{s^4}{d^3 n}} \right)$
Careful $\epsilon$-net covering and bucketing.

37 Convolutional Dictionary Model
So far, invariances in the dictionary have not been incorporated.
Convolutional models incorporate invariances such as shift invariance.
[Figure: an image and its dictionary elements]

38 Rewriting as a standard dictionary model
[Figure: (a) convolutional model, $x = \sum_i f_i * w_i$; (b) reformulated model, $x = F w$]
$x = \sum_i f_i * w_i = \sum_i \mathrm{Cir}(f_i)\, w_i = F w$
Assume the coefficients $w_i$ are independent (convolutional ICA model).
The cumulant tensor has a decomposition with components $F_i$.
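
The reformulation $\sum_i f_i * w_i = \sum_i \mathrm{Cir}(f_i)\, w_i = F w$ can be verified numerically. The sketch below assumes circular convolution of length-$n$ signals, builds each $\mathrm{Cir}(f_i)$ from cyclic shifts, and stacks them side by side into $F$; the helper name `circulant` and the random filters and activations are illustrative.

```python
import numpy as np

def circulant(f):
    """Circulant matrix whose column s is f cyclically shifted down by s."""
    n = len(f)
    return np.stack([np.roll(f, s) for s in range(n)], axis=1)

rng = np.random.default_rng(0)
n, L = 8, 3                                    # signal length, number of filters
filters = rng.standard_normal((L, n))          # f_1, ..., f_L
acts = rng.standard_normal((L, n))             # activation maps w_1, ..., w_L

# (a) Convolutional model: sum of circular convolutions, computed via FFT.
x_conv = sum(np.fft.ifft(np.fft.fft(f) * np.fft.fft(w)).real
             for f, w in zip(filters, acts))

# (b) Reformulated model: x = F w with F = [Cir(f_1), ..., Cir(f_L)].
F = np.hstack([circulant(f) for f in filters])  # shape (n, L * n)
w = acts.reshape(-1)                            # w_1, ..., w_L concatenated
x_dict = F @ w
```

Both routes give the same signal, so the convolutional model is a standard dictionary model whose dictionary `F` has block-circulant structure.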

39 Moment forms and optimization
$x = \sum_i f_i * w_i = \sum_i \mathrm{Cir}(f_i)\, w_i = F w$
Assume the coefficients $w_i$ are independent (convolutional ICA model).
The cumulant tensor has a decomposition with components $F_i$:
Cumulant $= \lambda_1 (F_1)^{\otimes 3} + \lambda_2 (F_2)^{\otimes 3} + \dots$

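44 Efficient Optimization Techniques
Cumulant $= \sum_j \lambda_j F_j^{\otimes 3}$, or in matricized form: Cumulant $= F \Lambda (F \odot F)^\top$.
Objective function:
$\min_F \|\text{Cumulant} - F \Lambda (F \odot F)^\top\|_F^2$ s.t. $\mathrm{blk}_l(F) = U\, \mathrm{Diag}(\mathrm{FFT}(f_l))\, U^H$, $\|f_l\|_2 = 1$.
Alternating minimization: relax $F \Lambda (F \odot F)^\top$ to $F \Lambda (H \odot G)^\top$.
Under full column rank of $H \odot G$, form $T := \text{Cumulant}\, ((H \odot G)^\top)^\dagger$.
Main result: the optimal solution $f_l^{\mathrm{opt}}$ is, for $p \in [n]$ and $q := (i - j) \bmod n$,
$$f_l^{\mathrm{opt}}(p) = \frac{\sum_{i,j \in [n]} \|\mathrm{blk}_l(T)^j\|^{-1} \cdot \mathrm{blk}_l(T)^i_j \cdot I^q_{p-1}}{\sum_{i,j \in [n]} I^q_{p-1}}.$$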

45 Efficient Optimization Techniques
Under full column rank of $H \odot G$, form $T := \text{Cumulant}\, ((H \odot G)^\top)^\dagger$.
The optimal solution is then computed in closed form.
Bottleneck computation: $((H \odot G)^\top)^\dagger$.
Naive implementation: $O(n^6)$ time, where $n$ is the length of the signal.
Running time of our method: for length-$n$ signals and $L$ filters, $O(\log n + \log L)$ time with $O(L^2 n^3)$ processors.
Involves $2L$ FFTs, some matrix multiplications, and inverses of diagonal matrices.
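
The FFTs and diagonal inverses mentioned on this slide rest on the fact that a circulant matrix is diagonalized by the discrete Fourier transform. The sketch below illustrates that fact only; it is not the authors' algorithm. It assumes $U$ is the conjugate transpose of the unitary DFT matrix, which makes $\mathrm{Cir}(f) = U\, \mathrm{Diag}(\mathrm{FFT}(f))\, U^H$ hold with NumPy's FFT convention, and it solves a circulant system by dividing in the Fourier domain.

```python
import numpy as np

def circulant(f):
    """Circulant matrix whose column s is f cyclically shifted down by s."""
    n = len(f)
    return np.stack([np.roll(f, s) for s in range(n)], axis=1)

rng = np.random.default_rng(0)
n = 8
f = rng.standard_normal(n)
C = circulant(f)

# Unitary DFT matrix and the assumed choice U = DFT^H.
dft = np.fft.fft(np.eye(n)) / np.sqrt(n)
U = dft.conj().T
C_fft = U @ np.diag(np.fft.fft(f)) @ U.conj().T     # U Diag(FFT(f)) U^H

# Solving C y = b costs two FFTs, a diagonal inverse and one inverse FFT.
b = rng.standard_normal(n)
y = np.fft.ifft(np.fft.fft(b) / np.fft.fft(f)).real
```

Because the matrix in the middle is diagonal, inverting a circulant block reduces to elementwise division of FFT coefficients, which is what makes such structured operations cheap compared with a dense inverse.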

46 Experiments (synthetic)
Convolutional tensor (CT) method vs. alternating minimization (AM).
[Figure: (a) Reconstruction error vs. iteration for filters $f_1$, $f_2$ and the reconstruction, CT and AM; (b) running times (seconds) scale with number of filters $L$, CT and AM; (c) running times (seconds) scale with number of samples $N$, proposed CT and baseline AM]

47 Experiments (NLP)
Microsoft paraphrase dataset: sentence pairs.
Unsupervised convolutional tensor method: no outside information. Metric: F score.

Method             Description                            Outside Information               F score
Vector Similarity  cosine similarity with tf-idf weights  word similarity                   75.3%
ESA                explicit semantic space                word semantic profiles            79.3%
LSA                latent semantic space                  word semantic profiles            79.9%
RMLMG              graph subsumption                      lexical&syntactic&synonymy info   80.5%
CT (proposed)      convolutional dictionary learning      none                              80.7%
MCS                combine word similarity measures       word similarity                   81.3%
STS                combine semantic&string similarity     semantic similarity               81.3%
SSA                salient semantic space                 word semantic profiles            81.4%
matrixjcn          JCN WordNet similarity with matrix     word similarity                   82.4%

Paraphrase detected:
(1) Amrozi accused his brother, whom he called the witness, of deliberately distorting his evidence.
(2) Referring to him as only the witness, Amrozi accused his brother of deliberately distorting his evidence.
Non-paraphrase detected:
(1) I never organised a youth camp for the diocese of Bendigo.
(2) I never attended a youth camp organised by that diocese.

51 Summary and Outlook
Summary
Method of moments for learning dictionary elements.
Invariances in convolutional models can be handled efficiently.
Outlook
Analyze the optimization landscape of tensor methods for convolutional models.
Extend to other kinds of invariances (e.g. rotation).
How is feature learning useful for classification?
Precise characterization for training neural networks: first polynomial-time methods!
"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by Majid Janzamin, Hanie Sedghi and A.
