Tensor Methods for Feature Learning


Tensor Methods for Feature Learning. Anima Anandkumar, U.C. Irvine.

Feature Learning for Efficient Classification
Find good transformations of the input for improved classification. Figures attributed to Fei-Fei Li, Rob Fergus, Antonio Torralba, et al.

Principles Behind Feature Learning
Classification/regression tasks: predict y given x. Find a feature transform φ(x) to better predict y. Feature learning: learn φ(·) from data.
Learning φ(x) from labeled vs. unlabeled samples: labeled samples {x_i, y_i} and unlabeled samples {x_i}. Labeled samples should lead to better feature learning of φ(·) but are harder to obtain. Learn features φ(x) through latent variables related to x and y.

Conditional Latent Variable Models: Two Cases
Multi-layer neural networks: E[y | x] = σ(A_d σ(A_{d-1} σ(··· A_2 σ(A_1 x)))).
Mixture of classifiers or GLMs, with hidden choice variable h: G(x) := E[y | x, h] = σ(⟨Uh, x⟩ + ⟨b, h⟩).
[Diagram: x → φ(x) → φ(φ(x)) → y for the multi-layer case; x, h → y for the mixture case.]

Challenges in Learning LVMs
Identifiability conditions: When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Computational challenges: Maximum likelihood is NP-hard in most scenarios. In practice, local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees.
Sample complexity: needs to be low in the high-dimensional regime.
Guaranteed and efficient learning through tensor methods.

Outline 1 Introduction 2 Spectral and Tensor Methods 3 Generative Models for Feature Learning 4 Proposed Framework 5 Conclusion

Classical Spectral Methods: Matrix PCA and CCA
Unsupervised setting, PCA: for centered samples {x_i}, find a projection P with Rank(P) = k such that min_P (1/n) Σ_{i ∈ [n]} ‖x_i − P x_i‖². Result: eigendecomposition of S = Cov(X).
Supervised setting, CCA: for centered samples {x_i, y_i}, find max_{a,b} a⊤ Ê[x y⊤] b / (√(a⊤ Ê[x x⊤] a) · √(b⊤ Ê[y y⊤] b)). Result: generalized eigendecomposition.
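
For concreteness, a minimal NumPy sketch of these two eigenproblems on synthetic data (not from the talk; the data, the rank k, and the small 1e-6 ridge on the covariances are illustrative choices): PCA as an eigendecomposition of the sample covariance, and CCA as an SVD of the whitened cross-covariance, which is equivalent to the generalized eigendecomposition mentioned above.

```python
# Minimal sketch (illustrative, synthetic data): PCA and CCA as eigenproblems.
import numpy as np

rng = np.random.default_rng(0)
n, d_x, d_y, k = 1000, 10, 6, 3

# Synthetic centered data with correlated X and Y.
X = rng.standard_normal((n, d_x))
Y = X[:, :d_y] @ rng.standard_normal((d_y, d_y)) + 0.1 * rng.standard_normal((n, d_y))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

# PCA: the rank-k projector P = U U^T built from the top-k eigenvectors of S = Cov(X)
# minimizes (1/n) * sum_i ||x_i - P x_i||^2 over all rank-k projections.
S = X.T @ X / n
evals, evecs = np.linalg.eigh(S)        # eigenvalues in ascending order
U = evecs[:, -k:]                       # top-k eigenvectors
P = U @ U.T

# CCA: maximize a' E[xy'] b / sqrt(a' E[xx'] a * b' E[yy'] b); after whitening by the
# auto-covariances this becomes an SVD (equivalently, a generalized eigenproblem).
Cxx = X.T @ X / n + 1e-6 * np.eye(d_x)
Cyy = Y.T @ Y / n + 1e-6 * np.eye(d_y)
Cxy = X.T @ Y / n
Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
u, s, vt = np.linalg.svd(np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T)
a = np.linalg.solve(Lx.T, u[:, 0])      # top canonical direction for x
b = np.linalg.solve(Ly.T, vt[0])        # top canonical direction for y
print("top canonical correlation:", round(float(s[0]), 3))
```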

Beyond SVD: Spectral Methods on Tensors
How can we learn mixture models without separation constraints? PCA uses the covariance matrix of the data; are higher-order moments helpful? Is there a unified framework for moment-based estimation of probabilistic latent variable models? SVD gives the spectral decomposition of matrices; what are the analogues for tensors?

Moment Matrices and Tensors
Multivariate moments: M_1 := E[x], M_2 := E[x ⊗ x], M_3 := E[x ⊗ x ⊗ x].
The matrix E[x ⊗ x] ∈ R^{d×d} is a second-order tensor, with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices, E[x ⊗ x] = E[x x⊤].
The tensor E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor, with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
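
A short sketch of how these moment estimates look in code (not from the talk; sample size and dimension are arbitrary), using np.einsum to form the empirical outer-product moments:

```python
# Minimal sketch: empirical moment matrix and tensor from i.i.d. samples in the rows of X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 8))                   # n samples of a d-dimensional x

M1 = X.mean(axis=0)                                  # E[x],          shape (d,)
M2 = np.einsum('ni,nj->ij', X, X) / len(X)           # E[x ⊗ x],      shape (d, d), equals E[x x^T]
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / len(X)    # E[x ⊗ x ⊗ x],  shape (d, d, d)

# Entry-wise, as on the slide: M3[i1, i2, i3] estimates E[x_{i1} x_{i2} x_{i3}].
print(M1.shape, M2.shape, M3.shape)
```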

Spectral Decomposition of Tensors
Matrix: M_2 = Σ_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ...
Tensor: M_3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ...
u ⊗ v ⊗ w is a rank-1 tensor since its (i_1, i_2, i_3)-th entry is u_{i1} v_{i2} w_{i3}. Guaranteed recovery (Anandkumar et al. 2012; Zhang & Golub 2001).
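
For symmetric tensors with orthogonal components, the recovery above can be carried out with the tensor power method. The following is a simplified sketch under the assumption of an exactly orthogonal, noise-free rank-2 tensor; practical implementations add random restarts, whitening, and noise handling.

```python
# Simplified sketch: recover {lambda_i, u_i} from a symmetric orthogonally decomposable
# tensor T = sum_i lambda_i u_i ⊗ u_i ⊗ u_i by tensor power iteration plus deflation.
import numpy as np

def tensor_apply(T, v):
    """Contract a d x d x d tensor with v twice: returns T(I, v, v) in R^d."""
    return np.einsum('ijk,j,k->i', T, v, v)

def tensor_power_method(T, n_components, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    T = T.copy()
    lams, us = [], []
    for _ in range(n_components):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):                      # power iteration: v <- T(I, v, v) / ||.||
            v = tensor_apply(T, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # lambda = T(v, v, v)
        lams.append(lam)
        us.append(v)
        T -= lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate the recovered component
    return np.array(lams), np.array(us)

# Build a rank-2 orthogonal tensor and recover its components.
d = 6
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((d, d)))
u1, u2 = Q[:, 0], Q[:, 1]
T = 3.0 * np.einsum('i,j,k->ijk', u1, u1, u1) + 1.5 * np.einsum('i,j,k->ijk', u2, u2, u2)
lams, us = tensor_power_method(T, n_components=2)
print(np.round(lams, 3))                             # approximately {3.0, 1.5}, possibly permuted
```

Deflation works here because the components are orthogonal; for non-orthogonal components one typically whitens with the matrix moment M_2 first.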

Moment Tensors for Conditional Models
Multivariate moments: many possibilities, e.g. E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], ...
Feature transformations of the input, x → φ(x): how to exploit them? Are moments E[φ(x) ⊗ y] useful? If φ(x) is a matrix/tensor, we have matrix/tensor moments and can carry out spectral decomposition of the moments. Can we construct φ(x) based on the input distribution?

Outline 1 Introduction 2 Spectral and Tensor Methods 3 Generative Models for Feature Learning 4 Proposed Framework 5 Conclusion

Score Function of Input Distribution
Score function: S(x) := ∇_x log p(x).
[Figure: (a) 1-d PDF, p(x) = (1/Z) exp(−E(x)); (b) 1-d score, ∇_x log p(x) = −∇_x E(x); 2-d score field. Figures from Alain and Bengio 2014.]

Why Score Function Features? S(x) := ∇_x log p(x)
Utilizes generative models of the input. Can be learnt from unlabeled data. Score matching methods work for non-normalized models.
Approximation of the score function using denoising auto-encoders: ∇_x log p(x) ≈ (r(x + n) − x) / σ².
Recall our goal: construct moments E[y ⊗ φ(x)]. Beyond vector features?
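
A toy numerical check of the auto-encoder approximation above, under the illustrative assumption (not made on the slide) that the input is a standard Gaussian, so the optimal denoiser E[x | x + n] = (x + n)/(1 + σ²) is available in closed form and can stand in for a trained reconstruction r(·):

```python
# Toy check (assumption: x ~ N(0, I), closed-form optimal denoiser instead of a trained DAE):
# averaging (r(x + n) - x) / sigma^2 over noise draws approximates the score grad_x log p(x) = -x.
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_draws = 5, 0.1, 1_000_000
x = rng.standard_normal(d)

noise = sigma * rng.standard_normal((n_draws, d))
r = (x + noise) / (1.0 + sigma**2)                 # closed-form optimal denoiser r(x + n)
score_hat = ((r - x) / sigma**2).mean(axis=0)      # average of (r(x + n) - x) / sigma^2
score_true = -x                                    # grad_x log p(x) for a standard Gaussian

print(np.max(np.abs(score_hat - score_true)))      # small; shrinks with sigma and more draws
```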

Matrix and Tensor-valued Features
Higher-order score functions: S_m(x) := (−1)^m ∇^(m) p(x) / p(x).
Can be a matrix or a tensor instead of a vector. Can be used to construct matrix and tensor moments E[y ⊗ φ(x)].
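
For intuition about these objects, the special case of a standard Gaussian input (an illustrative assumption, not a restriction of the framework) gives closed forms: S_1(x) = x, S_2(x) = x x⊤ − I, and S_3(x) is the third Hermite tensor. A short sketch:

```python
# Illustrative special case (assumption: x ~ N(0, I_d)): closed-form higher-order score
# functions S_m(x) = (-1)^m grad^(m) p(x) / p(x), which are the Hermite polynomial tensors.
import numpy as np

def gaussian_scores(x):
    d = len(x)
    I = np.eye(d)
    S1 = x                                              # vector,  shape (d,)
    S2 = np.outer(x, x) - I                             # matrix,  shape (d, d)
    S3 = (np.einsum('i,j,k->ijk', x, x, x)              # tensor,  shape (d, d, d):
          - np.einsum('i,jk->ijk', x, I)                # x_i x_j x_k - x_i d_jk
          - np.einsum('j,ik->ijk', x, I)                #           - x_j d_ik
          - np.einsum('k,ij->ijk', x, I))               #           - x_k d_ij
    return S1, S2, S3

S1, S2, S3 = gaussian_scores(np.array([1.0, -2.0, 0.5]))
print(S1.shape, S2.shape, S3.shape)
```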

Outline 1 Introduction 2 Spectral and Tensor Methods 3 Generative Models for Feature Learning 4 Proposed Framework 5 Conclusion

Operations on Score Function Features
Form the cross-moments E[y ⊗ S_m(x)].
Our result: E[y ⊗ S_m(x)] = E[∇^(m) G(x)], where G(x) := E[y | x]. This is an extension of Stein's lemma.
Extract discriminative directions through spectral decomposition: E[y ⊗ S_m(x)] = E[∇^(m) G(x)] = Σ_{j ∈ [k]} λ_j u_j ⊗ u_j ⊗ ··· ⊗ u_j (m times).
Construct features σ(⟨u_j, x⟩) for some nonlinearity σ.
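
A numerical sketch of the displayed identity for m = 3, under the illustrative assumptions that x is standard Gaussian (so S_3 has the closed form from the sketch above) and that G(x) = E[y | x] = σ(⟨u, x⟩) for a single direction u; the empirical cross-moment is then approximately rank one, and tensor power iteration recovers u up to sign.

```python
# Sketch (assumptions: x ~ N(0, I), y = sigmoid(<u, x>)): the cross-moment E[y * S_3(x)]
# is approximately lambda * u ⊗ u ⊗ u, and power iteration recovers the direction u.
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 200_000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

X = rng.standard_normal((n, d))
y = 1.0 / (1.0 + np.exp(-(X @ u)))                   # y = sigma(<u, x>)

I = np.eye(d)
# Empirical E[y * S_3(x)], assembled term by term to avoid an (n, d, d, d) intermediate.
T = (np.einsum('n,ni,nj,nk->ijk', y, X, X, X)
     - np.einsum('n,ni,jk->ijk', y, X, I)
     - np.einsum('n,nj,ik->ijk', y, X, I)
     - np.einsum('n,nk,ij->ijk', y, X, I)) / n

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for _ in range(100):                                 # tensor power iteration on T
    v = np.einsum('ijk,j,k->i', T, v, v)
    v /= np.linalg.norm(v)
print(round(abs(float(v @ u)), 3))                   # close to 1: direction of u recovered
```

This is the same computation that the mixture-of-GLMs and neural-network slides below rely on, with one rank-1 term per component.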

Automated Extraction of Discriminative Features

Learning Mixtures of Classifiers/GLMs
A mixture of r classifiers, with hidden choice variable h ∈ {e_1, ..., e_r}: E[y | x, h] = g(⟨Uh, x⟩ + ⟨b, h⟩), where U = [u_1 u_2 ... u_r] holds the weight vectors of the GLMs and b is the vector of biases.
M_3 = E[y ⊗ S_3(x)] = Σ_{i ∈ [r]} λ_i u_i ⊗ u_i ⊗ u_i.
First results for learning non-linear mixtures using spectral methods.

Learning Multi-layer Neural Networks
E[y | x] = ⟨a_2, σ(A_1 x)⟩.
[Diagram: x → A_1 → h → a_2 → y.]
Our result: M_3 = E[y ⊗ S_3(x)] = Σ_{i ∈ [r]} λ_i A_{1,i} ⊗ A_{1,i} ⊗ A_{1,i}.

Framework Applied to MNIST
Unlabeled data {x_i}: train an auto-encoder to obtain the reconstruction r(x), and compute the score function S_m(x) using r(x).
Labeled data {(x_i, y_i)}: form the cross-moment E[y ⊗ S_m(x)] and apply the spectral/tensor method to decompose the tensor T = u_1^⊗3 + u_2^⊗3 + ···
Train an SVM on the features σ(⟨x_i, u_j⟩).

Outline 1 Introduction 2 Spectral and Tensor Methods 3 Generative Models for Feature Learning 4 Proposed Framework 5 Conclusion

Conclusion: Learning Conditional Models using Tensor Methods
Tensor decomposition: efficient sample and computational complexities; better performance compared to EM, variational Bayes, etc.; scalable and embarrassingly parallel, so it can handle large datasets.
Score function features: crucial for learning conditional models.
Related guaranteed non-convex methods: overcomplete dictionary learning/sparse coding (decompose data into a sparse combination of unknown dictionary elements); non-convex robust PCA (same guarantees as convex relaxation methods, with lower computational complexity); extensions to the tensor setting.

Co-authors and Resources: Majid Janzamin, Hanie Sedghi, Niranjan UN. Papers available at http://newport.eecs.uci.edu/anandkumar/