Tensor Decompositions for Machine Learning

Geoff Roeder
Department of Computer Science, University of British Columbia
UBC Machine Learning Group, June 15, 2016
Contact Information

Please inform me of any typos or errors at geoff.roeder@gmail.com
Outline

1 Tensor Basics
2 Tensor Decomposition
3 Neural Network Feature Tensors
Formal Definition: Tensors as Multi-Linear Maps

The tensor product of two vector spaces V and W over a field F is another vector space over F. It is denoted V ⊗_F W, or V ⊗ W when the field is understood.

If {v_i} and {w_j} are bases for V and W, then the set {v_i ⊗ w_j} is a basis for V ⊗ W.

If S : V → X and T : W → Y are linear maps, then the tensor product of S and T is the linear map

  S ⊗ T : V ⊗ W → X ⊗ Y,   (S ⊗ T)(v ⊗ w) = S(v) ⊗ T(w)
Sufficient Definition: Tensors as Multidimensional Arrays (adapted from Kolda and Bader 2009)

With respect to a given basis, a tensor is a multidimensional array. E.g., a real Nth-order tensor X ∈ R^{I_1 × I_2 × ⋯ × I_N} w.r.t. the standard basis is an N-dimensional array where X_{i_1, i_2, …, i_N} is the element at index (i_1, i_2, …, i_N).

The order of a tensor is the number of dimensions (also called modes or indices) required to uniquely identify an element.

So a scalar is a 0-mode tensor, a vector is a 1-mode tensor, and a matrix is a 2-mode tensor.
Tensor Subarrays

Fibers: the higher-order analogue of matrix rows and columns. For a 3rd-order tensor, fix all but one index.

[Figure: Fibers of a 3-mode tensor]

By convention, all fibers are treated as column vectors.
Tensor Subarrays

Slices: two-dimensional sections of a tensor, defined by fixing all but two indices.

[Figure: Slices of a 3rd-order tensor]

Slices can be horizontal, lateral, or frontal, corresponding to the three cases in the figure.
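Fibers and slices correspond directly to NumPy indexing; a minimal sketch using an arbitrary 3-mode tensor with hypothetical values:

```python
import numpy as np

# An arbitrary 3-mode tensor with hypothetical values, shape (3, 4, 2)
X = np.arange(24).reshape(3, 4, 2)

# Mode-1 (column) fibers: fix all but the first index
col_fiber = X[:, 0, 0]         # shape (3,)
# Mode-2 (row) fibers: fix all but the second index
row_fiber = X[0, :, 0]         # shape (4,)
# Mode-3 (tube) fibers: fix all but the third index
tube_fiber = X[0, 0, :]        # shape (2,)

# Frontal slice: fix only the third index
frontal_slice = X[:, :, 0]     # shape (3, 4)
# Horizontal slice: fix only the first index
horizontal_slice = X[0, :, :]  # shape (4, 2)
# Lateral slice: fix only the second index
lateral_slice = X[:, 0, :]     # shape (3, 2)
```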
Tensor Operations: n-Mode Matrix Product

Consider multiplying a tensor by a matrix or vector in the nth mode.

The n-mode (matrix) product of a tensor X ∈ R^{I_1 × I_2 × ⋯ × I_N} with a matrix S ∈ R^{J × I_n} is denoted X ×_n S and lives in R^{I_1 × ⋯ × I_{n−1} × J × I_{n+1} × ⋯ × I_N}.

Elementwise,

  (X ×_n S)_{i_1, …, i_{n−1}, j, i_{n+1}, …, i_N} = Σ_{i_n=1}^{I_n} x_{i_1, i_2, …, i_N} s_{j, i_n}
Tensor Operations: n-Mode Matrix Product

Example. Consider the tensor X ∈ R^{3×2×2} with frontal slices

  X_{:,:,1} = [1 4; 2 5; 3 6],   X_{:,:,2} = [13 16; 14 17; 15 18],

and the matrix

  U = [1 3 5; 2 5 6]

Then the (1, 1, 1) element of the 1-mode matrix product of X and U is

  (X ×_1 U)_{1,1,1} = Σ_{i_1=1}^{3} x_{i_1,1,1} u_{1,i_1} = 1·1 + 2·3 + 3·5 = 22
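The example above can be checked with a short NumPy sketch; the mode_n_product helper is an illustrative implementation built on tensordot, not a library routine:

```python
import numpy as np

# The tensor and matrix from the example above
X = np.zeros((3, 2, 2))
X[:, :, 0] = [[1, 4], [2, 5], [3, 6]]
X[:, :, 1] = [[13, 16], [14, 17], [15, 18]]
U = np.array([[1, 3, 5], [2, 5, 6]])

def mode_n_product(T, M, n):
    """n-mode (matrix) product T x_n M (n is zero-indexed here):
    contract mode n of T against the second axis of M, leaving the
    rows of M in position n of the result."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

Y = mode_n_product(X, U, 0)   # shape (2, 2, 2)
# Y[0, 0, 0] = 1*1 + 2*3 + 3*5 = 22, matching the slide
```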
Tensor Operations: n-Mode Vector Product

The n-mode (vector) product of a tensor X ∈ R^{I_1 × I_2 × ⋯ × I_N} with a vector w ∈ R^{I_n} is denoted X ×̄_n w and lives in R^{I_1 × ⋯ × I_{n−1} × I_{n+1} × ⋯ × I_N}.

Elementwise,

  (X ×̄_n w)_{i_1, …, i_{n−1}, i_{n+1}, …, i_N} = Σ_{i_n=1}^{I_n} x_{i_1, i_2, …, i_N} w_{i_n}
Tensor Operations: n-Mode Vector Product

Example. Consider the tensor X ∈ R^{3×2×2} defined as before, where X_{1,1,1} = 1 and X_{1,2,1} = 4. Then the (1, 1) element of the 2-mode vector product of X and w = [1 2]^T is

  (X ×̄_2 w)_{1,1} = Σ_{i_2=1}^{2} x_{1,i_2,1} w_{i_2} = 1·1 + 4·2 = 9
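The vector product is the same contraction with the mode dropped; a sketch (the mode_n_vec_product helper is illustrative, not a library routine):

```python
import numpy as np

# The tensor from the example above
X = np.zeros((3, 2, 2))
X[:, :, 0] = [[1, 4], [2, 5], [3, 6]]
X[:, :, 1] = [[13, 16], [14, 17], [15, 18]]
w = np.array([1, 2])

def mode_n_vec_product(T, v, n):
    """n-mode (vector) product: contract mode n of T with v; that mode
    is dropped, lowering the order of the tensor by one."""
    return np.tensordot(T, v, axes=(n, 0))

Z = mode_n_vec_product(X, w, 1)   # shape (3, 2)
# Z[0, 0] = 1*1 + 4*2 = 9, matching the slide
```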
Multilinear Form of a Tensor X

The multilinear form of a tensor X ∈ R^{q_1 × q_2 × q_3} is defined as follows. Consider matrices M_i ∈ R^{q_i × p_i} for i ∈ {1, 2, 3}. Then the tensor X(M_1, M_2, M_3) ∈ R^{p_1 × p_2 × p_3} is defined as

  X(M_1, M_2, M_3) := Σ_{j_1 ∈ [q_1]} Σ_{j_2 ∈ [q_2]} Σ_{j_3 ∈ [q_3]} X_{j_1, j_2, j_3} M_1(j_1, :) ⊗ M_2(j_2, :) ⊗ M_3(j_3, :)

A simpler case uses vectors v, w ∈ R^d with X ∈ R^{d×d×d}:

  X(I, v, w) := Σ_{j,l ∈ [d]} v_j w_l X(:, j, l) ∈ R^d,

which is a multilinear combination of the mode-1 fibers (columns) of the tensor X.
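The definition above is a single einsum contraction; a minimal sketch, reusing the earlier 3 × 2 × 2 example tensor, where indicator vectors in X(I, v, w) pick out a single mode-1 fiber:

```python
import numpy as np

def multilinear_form(T, M1, M2, M3):
    """T(M1, M2, M3)[i1, i2, i3] =
    sum_{j1, j2, j3} T[j1, j2, j3] * M1[j1, i1] * M2[j2, i2] * M3[j3, i3]."""
    return np.einsum('abc,ai,bj,ck->ijk', T, M1, M2, M3)

# The 3 x 2 x 2 example tensor from earlier slides
X = np.zeros((3, 2, 2))
X[:, :, 0] = [[1, 4], [2, 5], [3, 6]]
X[:, :, 1] = [[13, 16], [14, 17], [15, 18]]

# X(I, v, w) is a multilinear combination of the mode-1 fibers; with
# indicator vectors it picks out a single fiber:
v = np.array([[1.0], [0.0]])   # vectors treated as q x 1 matrices
w = np.array([[0.0], [1.0]])
fiber = multilinear_form(X, np.eye(3), v, w)[:, 0, 0]
# fiber equals the fiber X[:, 0, 1]
```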
Rank-One Tensors

An N-way tensor X is called rank one if it can be written as the tensor product of N vectors, i.e.,

  X = a^{(1)} ⊗ a^{(2)} ⊗ ⋯ ⊗ a^{(N)},   (1)

where ⊗ is the vector (outer) product. The simple form of (1) implies that

  X_{i_1, i_2, …, i_N} = a^{(1)}_{i_1} a^{(2)}_{i_2} ⋯ a^{(N)}_{i_N}

[Figure: a rank-one 3-mode tensor X = a ⊗ b ⊗ c]
Rank-k Tensors

An order-3 tensor X ∈ R^{d×d×d} is said to have rank k if it can be written as the sum of k rank-1 tensors, i.e.,

  X = Σ_{i=1}^{k} w_i a_i ⊗ b_i ⊗ c_i,   w_i ∈ R, a_i, b_i, c_i ∈ R^d.

The analogy with the SVD, where M = Σ_i σ_i u_i v_i^T, suggests finding a decomposition of an arbitrary tensor into a spectrum of rank-one components.
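As a sketch of the definition, a rank-k tensor can be assembled from its components in one einsum call (the random components here are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical random components (illustration only)
rng = np.random.default_rng(0)
d, k = 5, 3
w = rng.standard_normal(k)
A = rng.standard_normal((d, k))   # columns a_i
B = rng.standard_normal((d, k))   # columns b_i
C = rng.standard_normal((d, k))   # columns c_i

# X = sum_{i=1}^{k} w_i * a_i (outer) b_i (outer) c_i
X = np.einsum('i,ai,bi,ci->abc', w, A, B, C)
# By construction, X has rank at most k
```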
Motivation

Consider a 3rd-order tensor of the form A = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Viewing A as a multilinear map, we can represent its action on lower-order inputs (vectors and matrices) using its multilinear form:

  A(B, C, D) := Σ_i w_i (B^T a_i) ⊗ (C^T a_i) ⊗ (D^T a_i).

Now suppose the components a_i were orthonormal. Then

  A(I, a_1, a_1) = Σ_i w_i (I a_i) (a_i^T a_1)^2 = w_1 a_1 + 0 + ⋯ + 0.

This is analogous to an eigenvector of a matrix: if v is an eigenvector of M, we can write Mv = M(I, v) = λv.
Whitening the Tensor

We're extremely unlikely to encounter an empirical tensor built from orthogonal components like this...

But we can learn a whitening transform using the second moment, M_2 = Σ_i w_i a_i ⊗ a_i = Σ_i w_i a_i a_i^T.

Whitening transforms the covariance matrix into the identity matrix; the data is thereby decorrelated, with unit variance.

[Figure: the action of a whitening transform on data sampled from a bivariate Gaussian]
Whitening the Tensor

The whitening transform is invertible so long as the empirical second moment matrix has full column rank.

Given the whitening matrix W, we can whiten the empirical third-moment tensor by evaluating

  T = M_3(W, W, W) = Σ_{i ∈ [k]} w_i (W^T a_i)^{⊗3} = Σ_{i ∈ [k]} w_i v_i^{⊗3},

where {v_i} is now an orthogonal basis.
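Under the idealization of exact moments, the whitening step can be sketched as follows. The components a_i and positive weights here are hypothetical; W is built from the top-k eigenpairs of M_2, one standard choice among several:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 3
wts = rng.uniform(1.0, 2.0, size=k)              # positive weights (assumption)
A = rng.standard_normal((d, k))                  # hypothetical non-orthogonal components

M2 = (A * wts) @ A.T                             # M2 = sum_i w_i a_i a_i^T
M3 = np.einsum('i,ai,bi,ci->abc', wts, A, A, A)  # M3 = sum_i w_i a_i^{(x)3}

# Whitening matrix from the top-k eigenpairs of M2 (M2 has rank k here)
eigvals, U = np.linalg.eigh(M2)
W = U[:, -k:] / np.sqrt(eigvals[-k:])            # so that W^T M2 W = I_k

T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)  # T = M3(W, W, W)
V = W.T @ A                                      # whitened components v_i = W^T a_i
# The v_i are mutually orthogonal, and T = sum_i w_i v_i^{(x)3}
```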
Procedure

Start from a whitened tensor T. Then:

1 Randomly initialize v. Iterate the power update

  v ← T(I, v, v) / ‖T(I, v, v)‖

until convergence, obtaining an eigenvector v with eigenvalue λ = T(v, v, v).

2 Deflate: T ← T − λ v ⊗ v ⊗ v. Store the eigenvalue/eigenvector pair, and then go to 1.

This leads to an algorithm for recovering the columns of a parameter matrix by representing its columns as moments.
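A minimal sketch of the two steps above, with a fixed iteration count and no random restarts (so not the robust algorithm analyzed in the literature), checked on a synthetic orthogonal tensor:

```python
import numpy as np

def tensor_power_method(T, n_iters=100, seed=0):
    """Recover eigenpairs of an orthogonally decomposable k x k x k tensor
    by power iteration plus deflation. A sketch only: fixed iteration
    count, no random restarts."""
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    T = T.copy()
    pairs = []
    for _ in range(k):
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            v = np.einsum('abc,b,c->a', T, v, v)        # v <- T(I, v, v)
            v /= np.linalg.norm(v)                      # normalize
        lam = np.einsum('abc,a,b,c->', T, v, v, v)      # lambda = T(v, v, v)
        pairs.append((lam, v))
        T = T - lam * np.einsum('a,b,c->abc', v, v, v)  # deflate
    return pairs

# Synthetic check: an orthogonal tensor with known eigenvalues
Q = np.linalg.qr(np.random.default_rng(2).standard_normal((3, 3)))[0]
lams = np.array([3.0, 2.0, 1.5])
T = np.einsum('i,ai,bi,ci->abc', lams, Q, Q, Q)
recovered = sorted(lam for lam, _ in tensor_power_method(T))
```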
Advantages of the Method of Moments

Tensor factorization is NP-hard in general.

For orthogonal tensors, factorization is polynomial in sample size and number of operations.

Unlike the EM algorithm or variational Bayes, this method converges to the global optimum.

For a more detailed analysis, and how to frame any latent variable model using this method, see newport.eecs.uci.edu/anandkumar/mlss.html
Neural Network Feature Tensors

Very recent work (last arxiv.org datestamp: Jan 11, 2016) on finding optimal weights for a two-layer neural network, with notes on how to generalize to more complex architectures: https://arxiv.org/pdf/1506.08473.pdf

Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods.
Neural Network Feature Tensors

The target network is a label-generating model with architecture

  f(x) := E[ỹ | x] = A_2^T σ(A_1^T x + b_1) + b_2

We must assume the input pdf p(x) is known sufficiently well for learning (or can be estimated using unsupervised methods).
Neural Network Feature Tensors

Key insight: there exists a transformation φ(·) of the input {(x_i, y_i)} that captures the relationship between the parameter matrices A_1 and A_2 and the input.

The transformation generates feature tensors that can be factorized using the method of moments.

The m-th-order score function is defined as (Janzamin et al. 2014)

  S_m(x) := (−1)^m ∇_x^{(m)} p(x) / p(x),

where p(x) is the pdf of the random vector x ∈ R^d and ∇_x^{(m)} is the m-th order derivative operator.
Score Functions

The 1st-order score function is the gradient of p(x) normalized by p(x), i.e., S_1(x) = −∇_x log p(x).

This encodes variations in the input distribution p(x): the gradient of the distribution indicates where large changes occur in the distribution.

The correlation E[ỹ · S_3(x)] between the third-order score function S_3(x) and the output then has a particularly useful form, because the x averages out in expectation.
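For example, for a standard Gaussian input x ~ N(0, I), log p(x) is quadratic up to a constant, so S_1(x) = −∇_x log p(x) = x. A finite-difference sketch (the log_p and score_1 helpers are illustrative, not from the paper):

```python
import numpy as np

def log_p(x):
    # log-density of N(0, I), up to the normalizing constant
    return -0.5 * x @ x

def score_1(x, eps=1e-6):
    """First-order score S_1(x) = -grad_x log p(x), via central differences."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (log_p(x + e) - log_p(x - e)) / (2 * eps)
    return -g

x = np.array([0.3, -1.2, 2.0])
# For the standard Gaussian, S_1(x) = x
```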
Neural Network Feature Tensors

In Lemma 6 of the paper, the authors prove that the rank-1 components of the third-order tensor E[ỹ · S_3(x)] are the columns of the weight matrix A_1:

  E[ỹ · S_3(x)] = Σ_{j=1}^{k} λ_j (A_1)_j ⊗ (A_1)_j ⊗ (A_1)_j ∈ R^{d×d×d}

It follows that the columns of A_1 are recoverable using the method of moments, with optimal convergence guarantees.
Neural Network Feature Tensors

The remaining steps lead to the following algorithm:

[Figure: training algorithm from Janzamin et al.]
Summary of Talk

Tensorial representations of latent variable models promise to overcome shortcomings of the EM algorithm and variational Bayes.

Tensor algebra involves non-trivial but conceptually straightforward operations.

These methods may point to a new direction in machine learning research that gives guarantees in unsupervised learning.
References I

Anandkumar, Anima. Tutorial on Tensor Methods for Guaranteed Learning, Part 1 (2014). Lecture slides, Machine Learning Summer School 2014, Carnegie Mellon University. http://newport.eecs.uci.edu/anandkumar/mlss.html

Janzamin, Majid, Hanie Sedghi, and Anima Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. arXiv:1506.08473v3 [cs.LG]. http://arxiv.org/abs/1506.08473v3

Kolda, Tamara G. and Brett W. Bader. Tensor Decompositions and Applications. SIAM Review, 51(3), pp. 455-500.