Tensor Methods for Feature Learning


1 Tensor Methods for Feature Learning Anima Anandkumar U.C. Irvine

2 Feature Learning For Efficient Classification Find good transformations of the input for improved classification. Figures attributed to Fei-Fei Li, Rob Fergus, Antonio Torralba, et al.

3–6 Principles Behind Feature Learning
Classification/regression tasks: predict y given x. Find a feature transform φ(x) to better predict y. Feature learning: learn φ(·) from data.
Learning φ(x) from labeled vs. unlabeled samples: labeled samples {x_i, y_i} and unlabeled samples {x_i}. Labeled samples should lead to better feature learning φ(·) but are harder to obtain.
Learn features φ(x) through latent variables related to x and y.

7–12 Conditional Latent Variable Models: Two Cases
Multi-layer neural networks, with feature hierarchy x → φ(x) → φ(φ(x)) → … → y:
E[y | x] = σ(A_d σ(A_{d−1} ⋯ σ(A_2 σ(A_1 x))))
Mixture of classifiers or GLMs, with hidden choice variable h:
G(x) := E[y | x, h] = σ(⟨Uh, x⟩ + ⟨b, h⟩)

13 Challenges in Learning LVMs
Identifiability conditions: when can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Computational challenges: maximum likelihood is NP-hard in most scenarios. In practice, local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees.
Sample complexity: needs to be low in the high-dimensional regime.
Guaranteed and efficient learning through tensor methods.

14 Outline 1 Introduction 2 Spectral and Tensor Methods 3 Generative Models for Feature Learning 4 Proposed Framework 5 Conclusion

15 Classical Spectral Methods: Matrix PCA and CCA
Unsupervised setting (PCA): for centered samples {x_i}, find a projection P with Rank(P) = k minimizing (1/n) Σ_{i∈[n]} ‖x_i − P x_i‖². Result: eigen-decomposition of S = Cov(X).
Supervised setting (CCA): for centered samples {x_i, y_i}, find max_{a,b} a⊤Ê[xy⊤]b / ((a⊤Ê[xx⊤]a)^{1/2} (b⊤Ê[yy⊤]b)^{1/2}), i.e., maximally correlated projections ⟨a, x⟩ and ⟨b, y⟩. Result: generalized eigen-decomposition.
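As a concrete reference point, here is a minimal numpy sketch of the PCA computation above (my own illustration, not code from the slides): the rank-k projector comes from the top-k eigenvectors of the sample covariance.

```python
import numpy as np

def pca_projection(X, k):
    """Rank-k PCA projector from the eigen-decomposition of Cov(X).

    X: (n, d) sample matrix (one sample per row); k: target rank.
    """
    Xc = X - X.mean(axis=0)              # center the samples
    S = Xc.T @ Xc / len(Xc)              # sample covariance S = Cov(X)
    _, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    U = eigvecs[:, -k:]                  # top-k eigenvectors
    return U @ U.T                       # projector P with Rank(P) = k

X = np.random.randn(500, 10)
P = pca_projection(X, k=3)
print(np.trace(P))                       # = k
```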

16 Beyond SVD: Spectral Methods on Tensors
How can we learn mixture models without separation constraints? PCA uses the covariance matrix of the data; are higher-order moments helpful? Is there a unified framework for moment-based estimation of probabilistic latent variable models?
SVD gives the spectral decomposition of matrices. What are the analogues for tensors?

17 Moment Matrices and Tensors
Multivariate moments: M_1 := E[x], M_2 := E[x ⊗ x], M_3 := E[x ⊗ x ⊗ x].
The matrix E[x ⊗ x] ∈ ℝ^{d×d} is a second-order tensor: E[x ⊗ x]_{i_1, i_2} = E[x_{i_1} x_{i_2}]. For matrices, E[x ⊗ x] = E[xx⊤].
The tensor E[x ⊗ x ⊗ x] ∈ ℝ^{d×d×d} is a third-order tensor: E[x ⊗ x ⊗ x]_{i_1, i_2, i_3} = E[x_{i_1} x_{i_2} x_{i_3}].
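A short numpy sketch (my own illustration) of forming these empirical moments from a sample matrix:

```python
import numpy as np

def empirical_moments(X):
    """Empirical M_1, M_2, M_3 from samples X of shape (n, d)."""
    n = len(X)
    M1 = X.mean(axis=0)                                # E[x]
    M2 = np.einsum('ni,nj->ij', X, X) / n              # E[x ⊗ x] = E[xx^T]
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n       # E[x ⊗ x ⊗ x]
    return M1, M2, M3

X = np.random.randn(10_000, 5)
M1, M2, M3 = empirical_moments(X)
print(M2.shape, M3.shape)                              # (5, 5) (5, 5, 5)
```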

18 Spectral Decomposition of Tensors
Matrix: M_2 = Σ_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + …
Tensor: M_3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + …
u ⊗ v ⊗ w is a rank-1 tensor since its (i_1, i_2, i_3)-th entry is u_{i_1} v_{i_2} w_{i_3}.
Guaranteed recovery (Anandkumar et al. 2012; Zhang & Golub 2001).
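One concrete route to such a decomposition is the tensor power method with deflation. The sketch below is my own illustration, assuming a symmetric, orthogonally decomposable tensor, which is one setting where the cited guarantees apply; it recovers the components of a synthetic M_3.

```python
import numpy as np

def tensor_power_method(T, n_components, n_iter=100):
    """Recover {(lambda_j, u_j)} from a symmetric, orthogonally
    decomposable tensor T of shape (d, d, d) by power iteration
    followed by deflation. Illustrative sketch only."""
    T = T.copy()
    d = T.shape[0]
    lams, us = [], []
    for _ in range(n_components):
        u = np.random.randn(d)
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            u = np.einsum('ijk,j,k->i', T, u, u)       # u <- T(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)     # lambda = T(u, u, u)
        T -= lam * np.einsum('i,j,k->ijk', u, u, u)    # deflate
        lams.append(lam)
        us.append(u)
    return np.array(lams), np.array(us)

# sanity check on a synthetic orthogonal rank-3 tensor
d, k = 8, 3
U = np.linalg.qr(np.random.randn(d, d))[0][:, :k].T    # orthonormal rows u_j
lam_true = np.array([3.0, 2.0, 1.0])
T = np.einsum('j,ja,jb,jc->abc', lam_true, U, U, U)    # sum_j lam_j u_j^(x3)
lams, us = tensor_power_method(T, k)
print(np.round(lams, 2))                               # ~ 3, 2, 1 (some order)
```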

19–20 Moment Tensors for Conditional Models
Multivariate moments: many possibilities — E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], …
Feature transformations of the input x ↦ φ(x): how to exploit them? Are moments E[φ(x) ⊗ y] useful?
If φ(x) is a matrix/tensor, we have matrix/tensor moments and can carry out spectral decomposition of the moments.
Can we construct φ(x) based on the input distribution?

21 Outline 1 Introduction 2 Spectral and Tensor Methods 3 Generative Models for Feature Learning 4 Proposed Framework 5 Conclusion

22–23 Score Function of Input Distribution
Score function: S(x) := ∇_x log p(x).
For an energy-based density p(x) = (1/Z) exp(−E(x)), the score is ∇_x log p(x) = −∇_x E(x).
Figures (1-d PDF, 1-d score, 2-d score) from Alain and Bengio 2014.
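To make the definition concrete, a tiny sketch (my own example): for a standard Gaussian the score is S(x) = −x, which a finite-difference estimate of ∇_x log p(x) confirms.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
p = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))

def score_fd(x, eps=1e-5):
    """Central finite-difference estimate of S(x) = grad_x log p(x)."""
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        g[i] = (p.logpdf(x + e) - p.logpdf(x - e)) / (2 * eps)
    return g

x = np.random.randn(d)
print(score_fd(x))   # ~ -x: the score of N(0, I) is exactly -x
print(-x)
```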

24–27 Why Score Function Features? S(x) := ∇_x log p(x)
Utilizes generative models of the input. Can be learnt from unlabeled data. Score matching methods work for non-normalized models.
Approximation of the score function using denoising auto-encoders: ∇_x log p(x) ≈ (r(x + n) − x) / σ², where r(·) is the auto-encoder's reconstruction and n is Gaussian corruption noise of variance σ².
Recall our goal: construct moments E[y ⊗ φ(x)]. Can we go beyond vector features?
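A minimal sketch of that approximation (my own illustration; `r` stands for a pre-trained denoising auto-encoder's reconstruction function, which is assumed given, and averaging over several noise draws is a smoothing choice of mine, not from the slides):

```python
import numpy as np

def dae_score(r, x, sigma, n_draws=32):
    """Score estimate S(x) ~ (r(x + n) - x) / sigma**2 from a denoising
    auto-encoder trained with Gaussian corruption of std sigma.

    r: reconstruction function of the pre-trained DAE (assumed given).
    """
    recon = np.mean(
        [r(x + sigma * np.random.randn(*x.shape)) for _ in range(n_draws)],
        axis=0,
    )
    return (recon - x) / sigma**2
```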

28 Matrix and Tensor-valued Features
Higher-order score functions: S_m(x) := (−1)^m ∇^(m) p(x) / p(x).
These can be a matrix or a tensor instead of a vector, and can be used to construct matrix and tensor moments E[y ⊗ φ(x)].

29 Outline 1 Introduction 2 Spectral and Tensor Methods 3 Generative Models for Feature Learning 4 Proposed Framework 5 Conclusion

30–34 Operations on Score Function Features
Form the cross-moments E[y ⊗ S_m(x)].
Our result, an extension of Stein's lemma: E[y ⊗ S_m(x)] = E[∇^(m) G(x)], where G(x) := E[y | x].
Extract discriminative directions through spectral decomposition:
E[y ⊗ S_m(x)] = E[∇^(m) G(x)] = Σ_{j∈[k]} λ_j u_j ⊗ u_j ⊗ … ⊗ u_j (m times).
Construct features σ(⟨u_j, x⟩) for some nonlinearity σ.
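A quick Monte Carlo check of the m = 1 case (my own illustration): for x ~ N(0, I) the first-order score is S_1(x) = −∇ log p(x) = x, so E[y ⊗ S_1(x)] should match E[∇G(x)]; with G(x) = σ(⟨u, x⟩) for the logistic σ, the right-hand side is E[σ′(⟨u, x⟩)] · u.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1_000_000
u = rng.standard_normal(d)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = rng.standard_normal((n, d))       # x ~ N(0, I), so S_1(x) = x
y = sigmoid(x @ u)                    # y = G(x) = sigmoid(<u, x>)

lhs = (y[:, None] * x).mean(axis=0)   # E[y (x) S_1(x)] = E[y x]
rhs = (y * (1.0 - y)).mean() * u      # E[sigmoid'(<u, x>)] u = E[grad G(x)]
print(np.round(lhs, 3))
print(np.round(rhs, 3))               # the two should agree closely
```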

35 Automated Extraction of Discriminative Features

36 Learning Mixtures of Classifiers/GLMs
A mixture of r classifiers with hidden choice variable h ∈ {e_1, …, e_r}:
E[y | x, h] = g(⟨Uh, x⟩ + ⟨b, h⟩)
U = [u_1 u_2 … u_r] holds the weight vectors of the GLMs; b is the vector of biases.
M_3 = E[y ⊗ S_3(x)] = Σ_{i∈[r]} λ_i u_i ⊗ u_i ⊗ u_i.
First results for learning non-linear mixtures using spectral methods.

37 Learning Multi-layer Neural Networks
E[y | x] = ⟨a_2, σ(A_1 x)⟩ (network x → A_1 → h → a_2 → y).
Our result: M_3 = E[y ⊗ S_3(x)] = Σ_{i∈[r]} λ_i A_{1,i} ⊗ A_{1,i} ⊗ A_{1,i}, where A_{1,i} are the rows of A_1.
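An end-to-end synthetic sketch of this result (entirely my own construction, not the authors' code): Gaussian inputs, so S_3 is the third Hermite tensor S_3(x)_{ijk} = x_i x_j x_k − x_i δ_{jk} − x_j δ_{ik} − x_k δ_{ij}; orthonormal first-layer rows and the cubic nonlinearity σ(t) = t³ (so σ‴ = 6 and the decomposition is orthogonal), with the weight directions recovered by the `tensor_power_method` sketch from slide 18 above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n = 6, 3, 400_000

A1 = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :r].T  # orthonormal rows
a2 = np.array([1.0, 0.8, 0.6])

x = rng.standard_normal((n, d))            # x ~ N(0, I)
y = ((x @ A1.T) ** 3) @ a2                 # y = <a_2, sigma(A_1 x)>, sigma(t) = t**3

# M_3 = E[y (x) S_3(x)] with the Gaussian third-order score (Hermite form)
I = np.eye(d)
m1 = (y[:, None] * x).mean(axis=0)                     # E[y x]
m3 = np.einsum('n,ni,nj,nk->ijk', y, x, x, x,
               optimize=True) / n                      # E[y x(x)x(x)x]
M3 = (m3 - np.einsum('i,jk->ijk', m1, I)
         - np.einsum('j,ik->ijk', m1, I)
         - np.einsum('k,ij->ijk', m1, I))

lams, us = tensor_power_method(M3, r)      # sketch from slide 18 above
print(np.round(np.abs(us @ A1.T), 2))      # ~ permutation matrix: rows recovered
print(np.round(lams, 2))                   # ~ 6 * a2 (up to ordering and noise)
```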

38 Framework Applied to MNIST
Unlabeled data {x_i}: train an auto-encoder to obtain the reconstruction r(x).
Compute the score function S_m(x) using r(x).
Labeled data {(x_i, y_i)}: form the cross-moment E[y ⊗ S_m(x)].
Spectral/tensor method: decompose the tensor T = u_1^{⊗3} + u_2^{⊗3} + … to obtain the u_j's.
Train an SVM on the features σ(⟨x_i, u_j⟩).

39 Outline 1 Introduction 2 Spectral and Tensor Methods 3 Generative Models for Feature Learning 4 Proposed Framework 5 Conclusion

40 Conclusion: Learning Conditional Models using Tensor Methods
Tensor decomposition: efficient sample and computational complexities; better performance compared to EM, variational Bayes, etc.; scalable and embarrassingly parallel, so it can handle large datasets.
Score function features: crucial for learning conditional models.
Related guaranteed non-convex methods: overcomplete dictionary learning/sparse coding (decompose data into a sparse combination of unknown dictionary elements); non-convex robust PCA (same guarantees as convex relaxation methods, at lower computational complexity); extensions to the tensor setting.

41 Co-authors and Resources Majid Janzamin Hanie Sedghi Niranjan UN Papers available at
