Tensor Decompositions for Machine Learning. G. Roeder. UBC Machine Learning Reading Group, June 15, 2016, University of British Columbia


Tensor Decompositions for Machine Learning

Geoff Roeder, Department of Computer Science, University of British Columbia

UBC Machine Learning Reading Group, June 15, 2016

Contact information: Please inform me of any typos or errors at geoff.roeder@gmail.com

Outline

1. Tensors
2. Tensor Decomposition
3. Neural Network Feature Tensors

Formal Definition: Tensors as Multi-Linear Maps

The tensor product of two vector spaces V and W over a field F is another vector space over F. It is denoted V ⊗_F W, or V ⊗ W when the field is understood.

If {v_i} and {w_j} are bases for V and W, then the set {v_i ⊗ w_j} is a basis for V ⊗ W.

If S : V → X and T : W → Y are linear maps, then the tensor product of S and T is the linear map S ⊗ T : V ⊗ W → X ⊗ Y defined by (S ⊗ T)(v ⊗ w) = S(v) ⊗ T(w).

Sufficient Definition: Tensors as Multidimensional Arrays (adapted from Kolda and Bader 2009)

With respect to a given basis, a tensor is a multidimensional array. E.g., a real Nth-order tensor X ∈ R^{I_1 × I_2 × ... × I_N} w.r.t. the standard basis is an N-dimensional array where X_{i_1, i_2, ..., i_N} is the element at index (i_1, i_2, ..., i_N).

The order of a tensor is the number of dimensions (also called modes or indices) required to uniquely identify an element.

So a scalar is a 0-mode tensor, a vector is a 1-mode tensor, and a matrix is a 2-mode tensor.

Tensor Subarrays

Fibers: the higher-order analogue of matrix rows and columns. For a 3rd-order tensor, fix all but one index.

Figure: Fibers of a 3-mode tensor.

By convention, all fibers are treated as column vectors.

Tensor Subarrays

Slices: two-dimensional sections of a tensor, defined by fixing all but two indices.

Figure: Slices of a 3rd-order tensor.

Slices can be horizontal, lateral, or frontal, corresponding to the diagrams above.
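As a concrete illustration of fibers and slices, here is a minimal numpy sketch (added here, not part of the original slides; the array contents are arbitrary):

```python
import numpy as np

X = np.arange(24).reshape(3, 4, 2)   # an order-3 tensor: X.ndim == 3

# Fibers: fix all but one index (each comes back as a 1-D array).
mode1_fiber = X[:, 0, 0]   # column fiber, shape (3,)
mode2_fiber = X[0, :, 0]   # row fiber, shape (4,)
mode3_fiber = X[0, 0, :]   # tube fiber, shape (2,)

# Slices: fix all but two indices.
horizontal_slice = X[0, :, :]   # shape (4, 2)
lateral_slice    = X[:, 0, :]   # shape (3, 2)
frontal_slice    = X[:, :, 0]   # shape (3, 4)

print(mode1_fiber, frontal_slice.shape)
```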

Tensor Operations: Tensor-Matrix Product

Consider multiplying a tensor by a matrix or vector in the n-th mode.

The n-mode (matrix) product of a tensor X ∈ R^{I_1 × I_2 × ... × I_N} with a matrix S ∈ R^{J × I_n} is denoted X ×_n S and lives in R^{I_1 × ... × I_{n-1} × J × I_{n+1} × ... × I_N}.

Elementwise,

(X ×_n S)_{i_1, ..., i_{n-1}, j, i_{n+1}, ..., i_N} = Σ_{i_n = 1}^{I_n} x_{i_1, i_2, ..., i_N} s_{j, i_n}

Tensor Operations: Tensor-Matrix Product

Example. Consider the tensor X ∈ R^{3×2×2} with frontal slices

X_{:,:,1} = [1 4; 2 5; 3 6],   X_{:,:,2} = [13 16; 14 17; 15 18],

and the matrix

U = [1 3 5; 2 5 6] ∈ R^{2×3}.

Then the (1, 1, 1) element of the 1-mode matrix product of X and U is

(X ×_1 U)_{1,1,1} = Σ_{i_1=1}^{3} x_{i_1,1,1} u_{1,i_1} = 1·1 + 2·3 + 3·5 = 22.
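The same computation in code, as a minimal numpy sketch (an added illustration, not from the slides), reproducing the hand calculation above:

```python
import numpy as np

X = np.zeros((3, 2, 2))
X[:, :, 0] = [[1, 4], [2, 5], [3, 6]]        # frontal slice X_{:,:,1} in the slides
X[:, :, 1] = [[13, 16], [14, 17], [15, 18]]  # frontal slice X_{:,:,2}
U = np.array([[1, 3, 5],
              [2, 5, 6]])                    # U in R^{2 x 3}

# 1-mode (matrix) product: contract U against the first index of X.
Y = np.einsum('ja,abc->jbc', U, X)           # Y in R^{2 x 2 x 2}
print(Y[0, 0, 0])                            # 22.0, matching the hand computation
```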

Tensor Operations: Tensor-Vector Product

The n-mode (vector) product of a tensor X ∈ R^{I_1 × I_2 × ... × I_N} with a vector w ∈ R^{I_n} is denoted X ×_n w and lives in R^{I_1 × ... × I_{n-1} × I_{n+1} × ... × I_N}.

Elementwise,

(X ×_n w)_{i_1, ..., i_{n-1}, i_{n+1}, ..., i_N} = Σ_{i_n=1}^{I_n} x_{i_1, i_2, ..., i_N} w_{i_n}

Tensor Operations: Tensor-Vector Product

Example. Consider the tensor X ∈ R^{3×2×2} defined before, where X_{1,1,1} = 1 and X_{1,2,1} = 4. Then the (1, 1) element of the 2-mode vector product of X and w = [1 2]^T is

(X ×_2 w)_{1,1} = Σ_{i_2=1}^{2} x_{1,i_2,1} w_{i_2} = 1·1 + 4·2 = 9.
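A matching sketch for the vector case (again an added numpy illustration, reusing the same X):

```python
import numpy as np

X = np.zeros((3, 2, 2))
X[:, :, 0] = [[1, 4], [2, 5], [3, 6]]
X[:, :, 1] = [[13, 16], [14, 17], [15, 18]]
w = np.array([1, 2])

# 2-mode (vector) product: contract w against the second index; that mode disappears.
Y = np.einsum('abc,b->ac', X, w)             # Y in R^{3 x 2}
print(Y[0, 0])                               # 9.0, matching the slide's computation
```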

Multilinear Form of a Tensor X

The multilinear form for a tensor X ∈ R^{I_1 × I_2 × I_3} is defined as follows. Consider matrices M_i ∈ R^{q_i × p_i} for i ∈ {1, 2, 3}. Then the tensor X(M_1, M_2, M_3) ∈ R^{p_1 × p_2 × p_3} is defined as

X(M_1, M_2, M_3) := Σ_{j_1 ∈ [q_1]} Σ_{j_2 ∈ [q_2]} Σ_{j_3 ∈ [q_3]} X_{j_1, j_2, j_3} M_1(j_1, :) ⊗ M_2(j_2, :) ⊗ M_3(j_3, :)

A simpler case uses vectors v, w ∈ R^d:

X(I, v, w) := Σ_{j, l ∈ [d]} v_j w_l X(:, j, l) ∈ R^d,

which is a multilinear combination of the mode-1 fibers (columns) of the tensor X.
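A short numpy sketch of the multilinear form (an added illustration with arbitrary shapes, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4, 4))           # here q1 = q2 = q3 = 4
M1, M2, M3 = (rng.standard_normal((4, p)) for p in (2, 3, 5))

# X(M1, M2, M3)[i, j, k] = sum_{a,b,c} X[a, b, c] * M1[a, i] * M2[b, j] * M3[c, k]
Y = np.einsum('abc,ai,bj,ck->ijk', X, M1, M2, M3)   # shape (2, 3, 5)

# Simpler case X(I, v, w): a multilinear combination of the mode-1 fibers of X.
v, w = rng.standard_normal(4), rng.standard_normal(4)
x_Ivw = np.einsum('abc,b,c->a', X, v, w)            # a vector in R^4
print(Y.shape, x_Ivw.shape)
```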

Rank One Tensors

An N-way tensor X is called rank one if it can be written as the tensor product of N vectors, i.e.,

X = a^(1) ⊗ a^(2) ⊗ ... ⊗ a^(N),    (1)

where ⊗ is the vector (outer) product. The simple form of (1) implies that

X_{i_1, i_2, ..., i_N} = a^(1)_{i_1} a^(2)_{i_2} ... a^(N)_{i_N}.

The following diagram exhibits a rank-one 3-mode tensor X = a ⊗ b ⊗ c.
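A minimal numpy sketch (added here) of a rank-one 3-mode tensor and its elementwise form:

```python
import numpy as np

a = np.array([1., 2.])
b = np.array([3., 4., 5.])
c = np.array([6., 7.])

# X = a ⊗ b ⊗ c, so X[i, j, k] = a[i] * b[j] * c[k]
X = np.einsum('i,j,k->ijk', a, b, c)       # shape (2, 3, 2)
print(X[1, 2, 0], a[1] * b[2] * c[0])      # both 60.0
```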

Rank k Tensors

An order-3 tensor X ∈ R^{d×d×d} is said to have rank k if it can be written as the sum of k rank-1 tensors, i.e.,

X = Σ_{i=1}^{k} w_i a_i ⊗ b_i ⊗ c_i,    w_i ∈ R,  a_i, b_i, c_i ∈ R^d.

By analogy with the SVD, where M = Σ_i σ_i u_i v_i^T, this suggests finding a decomposition of an arbitrary tensor into a spectrum of rank-one components.
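A corresponding sketch (added, with arbitrary random factors) building a rank-k tensor as a weighted sum of k rank-one terms:

```python
import numpy as np

d, k = 5, 3
rng = np.random.default_rng(1)
w = rng.standard_normal(k)                                   # weights w_i
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))    # columns a_i, b_i, c_i

# X = sum_{i=1}^{k} w_i * (a_i ⊗ b_i ⊗ c_i)
X = np.einsum('i,ai,bi,ci->abc', w, A, B, C)                 # shape (d, d, d)
print(X.shape)
```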

Outline, Part 2: Tensor Decomposition

Motivation

Consider a 3rd-order tensor of the form A = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Viewing A as a multilinear map, we can represent its action on lower-order inputs (vectors and matrices) using its multilinear form:

A(B, C, D) := Σ_i w_i (B^T a_i) ⊗ (C^T a_i) ⊗ (D^T a_i).

Now suppose the components a_i are orthonormal. Then

A(I, a_1, a_1) = Σ_i w_i (I a_i) (a_i^T a_1)^2 = w_1 a_1 + 0 + ... + 0.

This is analogous to an eigenvector of a matrix: if v is an eigenvector of M, we can write Mv = M(I, v) = λv.
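A quick numerical check of this claim (an added sketch assuming numpy; the components and weights are arbitrary):

```python
import numpy as np

d, k = 4, 3
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q[:, :k]                                   # orthonormal columns a_1, ..., a_k
w = np.array([2.0, -1.5, 0.5])

# T = sum_i w_i a_i ⊗ a_i ⊗ a_i
T = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

a1 = A[:, 0]
out = np.einsum('abc,b,c->a', T, a1, a1)       # T(I, a_1, a_1)
print(np.allclose(out, w[0] * a1))             # True: a_1 behaves like an eigenvector
```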

Whitening the Tensor

We are extremely unlikely to encounter an empirical tensor built from orthogonal components like this.

But we can learn a whitening transform using the second moment, M_2 = Σ_i w_i a_i ⊗ a_i = Σ_i w_i a_i a_i^T.

Whitening transforms the covariance matrix into the identity matrix, so the data is decorrelated and has unit variance. The following diagram displays the action of a whitening transform on data sampled from a bivariate Gaussian.

The whitening transform is invertible so long as the empirical second moment matrix has full column rank.

Given the whitening matrix W, we can whiten the empirical third moment tensor by evaluating

T = M_3(W, W, W) = Σ_i w_i (W^T a_i) ⊗ (W^T a_i) ⊗ (W^T a_i) = Σ_{i ∈ [k]} w_i v_i ⊗ v_i ⊗ v_i,

where {v_i} is now an orthogonal basis.
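One way to realize the whitening step in code, as an added sketch that assumes exact (population) moments and uses a standard eigendecomposition-based construction of W; a practical pipeline would use empirical moments instead:

```python
import numpy as np

d, k = 6, 3
rng = np.random.default_rng(3)
A = rng.standard_normal((d, k))                # component matrix (full column rank)
w = np.array([1.0, 2.0, 3.0])                  # positive weights

M2 = np.einsum('i,ai,bi->ab', w, A, A)         # second moment, sum_i w_i a_i a_i^T
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)  # third moment

# Whitening matrix from the top-k eigenpairs of M2, so that W^T M2 W = I_k.
vals, vecs = np.linalg.eigh(M2)                # ascending eigenvalues
vals, vecs = vals[-k:], vecs[:, -k:]           # keep the k nonzero ones
W = vecs / np.sqrt(vals)                       # shape (d, k)
print(np.allclose(W.T @ M2 @ W, np.eye(k)))    # True

# Whitened third moment T = M3(W, W, W): its rank-one components are now orthogonal.
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
```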

Procedure

Start from a whitened tensor T. Then:

1. Randomly initialize v. Repeatedly apply the power update v <- T(I, v, v) / ||T(I, v, v)|| until convergence, obtaining an eigenvector v with eigenvalue λ = T(v, v, v).

2. Deflate: T <- T - λ v ⊗ v ⊗ v. Store the eigenvalue/eigenvector pair, then return to step 1.

This leads to an algorithm for recovering the columns of a parameter matrix by representing its columns as moments.
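A compact sketch of this procedure (added here, assuming numpy; the function name is illustrative, and robust implementations typically add multiple random restarts and a convergence test rather than a fixed iteration count):

```python
import numpy as np

def tensor_power_method(T, n_components, n_iters=100, seed=0):
    """Recover (eigenvalue, eigenvector) pairs of an (approximately) orthogonal tensor."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    pairs = []
    for _ in range(n_components):
        v = rng.standard_normal(T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            v = np.einsum('abc,b,c->a', T, v, v)    # v <- T(I, v, v)
            v /= np.linalg.norm(v)                  # ... / ||T(I, v, v)||
        lam = np.einsum('abc,a,b,c->', T, v, v, v)  # eigenvalue lambda = T(v, v, v)
        pairs.append((lam, v))
        T = T - lam * np.einsum('a,b,c->abc', v, v, v)   # deflation
    return pairs

# e.g., pairs = tensor_power_method(T, n_components=3) for a whitened T as above
```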

Properties of the Tensor Method of Moments

Tensor factorization is NP-hard in general.

For orthogonal tensors, factorization is polynomial in sample size and number of operations.

Unlike the EM algorithm or variational Bayes, this method converges to the global optimum.

For a more detailed analysis, and for how to frame any latent variable model using this method, see newport.eecs.uci.edu/anandkumar/mlss.html

Outline, Part 3: Neural Network Feature Tensors

Neural Network Feature Tensors

Very recent work (last arxiv.org datestamp: Jan 11 2016) on finding optimal weights for a two-layer neural network, with notes on how to generalize to more complex architectures: https://arxiv.org/pdf/1506.08473.pdf

Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods.

Neural Network Feature Tensors

The target network is a label-generating model with architecture

f(x) := E[ỹ | x] = A_1 σ(A_2 x + b_1) + b_2.

We must assume the input pdf p(x) is known sufficiently well for learning (or can be estimated using unsupervised methods).

Neural Network Feature Tensors

Key insight: there exists a transformation φ(·) of the input {(x_i, y_i)} that captures the relationship between the parameter matrices A_1 and A_2 and the input.

The transformation generates feature tensors that can be factorized using the method of moments.

The m-th-order score function is defined as (Janzamin et al. 2014)

S_m(x) := (-1)^m ∇_x^(m) p(x) / p(x),

where p(x) is the pdf of the random vector x ∈ R^d and ∇_x^(m) is the m-th-order derivative operator.

Neural Network Feature Tensors

The 1st-order score function is, up to sign, the gradient of the log of the input density (equivalently, the gradient of the density normalized by the density).

It encodes variations in the input distribution p(x): the gradient of the distribution indicates where a large change is occurring in the distribution.

The correlation E[y S_3(x)] between the third-order score function S_3(x) and the output y then has a particularly useful form, because the x averages out in expectation.
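As a small worked case of this definition for m = 1 (an added sketch; the Gaussian input density is an assumption chosen for concreteness, for which S_1(x) = Σ^{-1}(x - μ)):

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def score_1(x):
    # S_1(x) = -grad p(x) / p(x) = -grad log p(x) = Sigma^{-1} (x - mu) for x ~ N(mu, Sigma)
    return np.linalg.solve(Sigma, x - mu)

x = np.array([0.3, 0.7])
print(score_1(x))   # direction and scale of relative change in the input density at x
```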

Neural Network Feature Tensors

In Lemma 6 of the paper, the authors prove that the rank-1 components of the third-order tensor E[ỹ S_3(x)] are the columns of the weight matrix A_1:

E[ỹ S_3(x)] = Σ_{j=1}^{k} λ_j (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j ∈ R^{d×d×d}.

It follows that the columns of A_1 are recoverable using the method of moments, with optimal convergence guarantees.

Network Feature s Network Feature s The remainder of the steps can to the following algorithm: 28/30

Summary of Talk

Tensorial representations of latent variable models promise to overcome shortcomings of the EM algorithm and variational Bayes.

Tensor algebra involves non-trivial but conceptually straightforward operations.

These methods may point to a new direction in machine learning research that gives guarantees in unsupervised learning.

References

Anandkumar, Anima. Tutorial on Tensor Methods for Guaranteed Learning, Part 1 (2014). Lecture slides, Machine Learning Summer School 2014, Carnegie Mellon University. http://newport.eecs.uci.edu/anandkumar/mlss.html

Janzamin, Majid, Hanie Sedghi, and Anima Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. arXiv:1506.08473v3 [cs.LG]. http://arxiv.org/abs/1506.08473v3

Kolda, Tamara G. and Bader, Brett W. Tensor Decompositions and Applications. SIAM Review, 51(3), pp. 455-500 (2009).