Overview

1. Brief tensor introduction
2. Stein's lemma
3. Score and score matching for fitting models
4. Bringing it all together for supervised deep learning
Tensor intro [1]

Tensors are multidimensional arrays. The dimensionality of a tensor is referred to as its order: vectors are first-order tensors and matrices are second-order tensors. Tensors of order 3 or higher are called higher-order tensors. Each dimension of a tensor is called a mode; matrices have 2 modes, columns and rows.

[1] Tensor Decompositions and Applications, Kolda, T.G. and Bader, B.W., SIAM Rev., 51(3), 2009.
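To make the vocabulary concrete, here is a tiny numpy sketch (the variable names are mine): the order is the number of indices, and each axis is a mode.

```python
import numpy as np

# A 3rd-order tensor: its order is the number of indices (ndim in numpy),
# and each axis is a mode whose dimension is given by shape.
X = np.zeros((4, 5, 3))   # I1 = 4, I2 = 5, I3 = 3
print(X.ndim)             # 3 -> the order of the tensor
print(X.shape)            # (4, 5, 3) -> the dimension of each mode
```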
Fibers and slices

Fibers are the higher-order analogue of matrix rows and columns: fix every index of the tensor but one. Slices are two-dimensional sections of a tensor: fix every index but two.
Rank-one tensors

An $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is of rank 1 if its entries can be expressed as

$$x_{i_1, i_2, i_3, \ldots, i_N} = a^{(1)}_{i_1} a^{(2)}_{i_2} \cdots a^{(N)}_{i_N}$$

where $a^{(n)} \in \mathbb{R}^{I_n}$ and $i_n \in \{1, \ldots, I_n\}$, for $n \in \{1, \ldots, N\}$. More succinctly,

$$\mathcal{X} = a^{(1)} \circ a^{(2)} \circ \cdots \circ a^{(N)}$$
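A minimal numpy sketch of this definition (names are mine): a rank-1 third-order tensor is just an outer product of three vectors.

```python
import numpy as np

# Rank-1 3rd-order tensor: x[i1, i2, i3] = a[i1] * b[i2] * c[i3].
a, b, c = np.random.randn(4), np.random.randn(5), np.random.randn(3)
X = np.einsum('i,j,k->ijk', a, b, c)   # the outer product a o b o c

# Entry-wise check against the definition above.
assert np.isclose(X[1, 2, 0], a[1] * b[2] * c[0])
```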
Canonical Decomposition or Parallel Factors (CP decomposition)

Represent a tensor as a sum of rank-1 tensors:

$$\mathcal{X} = \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r$$

The rank of a tensor $\mathcal{X}$ is the smallest number of rank-1 tensors whose sum generates $\mathcal{X}$.
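As a sketch of the CP form (with synthetic factors of my choosing; actually finding the factors of a given tensor is the hard part):

```python
import numpy as np

# Build a tensor in CP form: X = sum_r lam[r] * a_r o b_r o c_r,
# then confirm the sum-of-rank-1 reconstruction matches.
R, I, J, K = 2, 4, 5, 3
lam = np.random.randn(R)
A, B, C = np.random.randn(I, R), np.random.randn(J, R), np.random.randn(K, R)

X = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)
X_check = sum(lam[r] * np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])
              for r in range(R))
assert np.allclose(X, X_check)
```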
Score

In the context of recent deep learning literature, given a density $p(x)$, the score [2] is

$$\nabla_x \log p(x \mid \theta)$$

This function reflects how the log density ($\log p(x)$) changes with the data vector ($x$).

[2] A very similar and closely related function, $\nabla_\theta \log p(x \mid \theta)$, is also a score: the Fisher score.
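For intuition, a quick sketch with a standard Gaussian, where the score has the closed form $\nabla_x \log p(x) = -x$ (checked here against finite differences; the setup is mine):

```python
import numpy as np

# For p(x) = N(0, I): log p(x) = -||x||^2 / 2 + const,
# so the score is grad_x log p(x) = -x.
def log_p(x):
    return -0.5 * np.sum(x ** 2)

x = np.random.randn(5)
score = -x                                # analytic score
eps = 1e-5
numeric = np.array([(log_p(x + eps * e) - log_p(x - eps * e)) / (2 * eps)
                    for e in np.eye(5)])
assert np.allclose(score, numeric, atol=1e-4)
```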
Stein's lemma

A critical piece of machinery for working with score functions is Stein's lemma:

$$\int_x p(x) f(x) \nabla_x \log p(x) \, dx = -\int_x p(x) \nabla_x f(x) \, dx$$

$$\mathbb{E}\big[\, f(x) \underbrace{\nabla_x \log p(x)}_{\text{score}} \,\big] = -\mathbb{E}\left[\nabla_x f(x)\right]$$

In words: shift the gradient to $f$ and drop the score (picking up a sign). The left side is for telling stories about scores, the right for algorithms. In practice, you would plug in the empirical distribution:

$$\frac{1}{T} \sum_{i=1}^{T} f(x_i) \underbrace{\nabla_x \log p(x_i)}_{\text{score}} \approx -\frac{1}{T} \sum_{i=1}^{T} \nabla_x f(x_i)$$
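A quick Monte Carlo sanity check of the lemma (my toy setup: standard 1-D Gaussian, $f(x) = \sin(x)$, so the score is $-x$):

```python
import numpy as np

# Stein's lemma for x ~ N(0, 1), f(x) = sin(x):
#   E[f(x) * grad log p(x)] = -E[f'(x)],  i.e.  E[sin(x) * (-x)] = -E[cos(x)].
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

lhs = np.mean(np.sin(x) * (-x))   # f times the score
rhs = -np.mean(np.cos(x))         # minus the gradient shifted onto f
print(lhs, rhs)                   # both approx -exp(-1/2) = -0.6065
```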
Learning by score matching [3]

Fitting a pdf of the form

$$p(\xi \mid \theta) = \frac{1}{Z(\theta)} f(\xi \mid \theta)$$

can be intractable if $Z(\cdot)$ is hard to compute.

The idea: the true data density and the model density should drop off and increase in the same way. More formally, the idea of score matching is to fit the parameters $\theta$ so that the score of the model, $\nabla_\xi \log p(\xi \mid \theta)$, matches the score of the data density, $\nabla_\xi \log p_x(\xi)$.

[3] Estimation of Non-Normalized Statistical Models by Score Matching, Aapo Hyvärinen, JMLR 2005.
Score matching

The objective is

$$J(\theta) = \frac{1}{2} \int_\xi p_x(\xi) \,\Big\| \underbrace{\nabla_\xi \log p_x(\xi)}_{\text{data distribution}} - \underbrace{\nabla_\xi \log p(\xi \mid \theta)}_{\text{model distribution}} \Big\|^2 \, d\xi$$

Use Stein's lemma and plug in the empirical distribution to obtain (up to an additive constant):

$$\tilde{J}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_i \Big[ \underbrace{\partial_{\xi_i}^2 \log f(\xi_t \mid \theta)}_{\text{model}} + \frac{1}{2} \underbrace{\big( \partial_{\xi_i} \log f(\xi_t \mid \theta) \big)^2}_{\text{model}} \Big]$$

No partition function $Z(\theta)$, and no true data density $p_x(\xi)$.
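A worked 1-D example of the recipe (my toy model): fit the precision $\theta$ of a zero-mean Gaussian from the unnormalized model $f(\xi \mid \theta) = \exp(-\theta \xi^2 / 2)$, never touching $Z(\theta)$.

```python
import numpy as np

# Model: log f(xi | theta) = -theta * xi^2 / 2 (unnormalized), so
#   d/dxi log f = -theta * xi   and   d^2/dxi^2 log f = -theta.
# The empirical objective is J(theta) = mean(-theta + 0.5 * theta^2 * x^2);
# setting J'(theta) = 0 gives the closed form theta_hat = 1 / mean(x^2).
rng = np.random.default_rng(1)
true_std = 2.0
x = rng.normal(0.0, true_std, size=100_000)

theta_hat = 1.0 / np.mean(x ** 2)
print(theta_hat, 1.0 / true_std ** 2)   # both approx 0.25
```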
Score matching and denoising autoencoders [5]

Training a denoising autoencoder can be seen as minimizing the distance between the score of the model

$$p(x \mid \theta) = \frac{1}{Z(\theta)} \exp(-E(x \mid \theta))$$

$$E(x \mid W, b, c) = -\frac{c^\top x - \frac{1}{2}\|x\|^2 + \sum_j \log\big(1 + \exp(W_j x + b_j)\big)}{\sigma^2}$$

and the score of a mixture of Gaussians, each centered on a datapoint. [4]

[4] Parzen window density estimator.
[5] A Connection Between Score Matching and Denoising Autoencoders, P. Vincent, Neural Computation, 2011.
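To illustrate the flavor of this connection (a heavily simplified 1-D sketch of denoising score matching, not Vincent's exact model): corrupt the data with Gaussian noise and regress a score model onto $(x - \tilde{x})/\sigma^2$, which is what a denoising autoencoder's reconstruction residual encodes.

```python
import numpy as np

# Denoising score matching in 1-D: the regression target (x - x_noisy)/sigma^2
# is the score of the Gaussian corruption kernel. A linear score model fit by
# least squares then recovers the score of the noise-smoothed data density.
rng = np.random.default_rng(2)
sigma = 0.5
x = rng.normal(1.0, 2.0, size=50_000)              # clean data ~ N(1, 4)
x_noisy = x + sigma * rng.standard_normal(x.shape)
target = (x - x_noisy) / sigma ** 2

F = np.stack([x_noisy, np.ones_like(x_noisy)], axis=1)
w, b = np.linalg.lstsq(F, target, rcond=None)[0]

# Smoothed density is N(1, 4 + 0.25), whose score is -(x - 1) / 4.25:
print(w, b)   # approx -0.235 and +0.235
```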
Higher-order scores [6]

We can obtain higher-order scores:

$$S_m(x) := (-1)^m \frac{\nabla_x^{(m)} p(x)}{p(x)}$$

Recursive formula:

$$S_m(x) = -S_{m-1}(x) \otimes \nabla_x \log p(x) - \nabla_x S_{m-1}(x), \qquad S_0(x) = 1$$

Note that $S_1(x) = -\nabla_x \log p(x)$.

[6] Score Function Features for Discriminative Learning, M. Janzamin, H. Sedghi, A. Anandkumar, 2015.
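For a standard Gaussian these have closed forms, derived from the recursion above with $\nabla_x \log p(x) = -x$ (a small numpy sketch, names mine):

```python
import numpy as np

# For x ~ N(0, I) the recursion gives the Hermite-like tensors:
#   S_1(x) = x
#   S_2(x) = x x^T - I
#   S_3(x)[i,j,k] = x_i x_j x_k - x_i d_jk - x_j d_ik - x_k d_ij
d = 4
x = np.random.randn(d)
I = np.eye(d)

S1 = x
S2 = np.outer(x, x) - I
S3 = (np.einsum('i,j,k->ijk', x, x, x)
      - np.einsum('i,jk->ijk', x, I)
      - np.einsum('j,ik->ijk', x, I)
      - np.einsum('k,ij->ijk', x, I))
```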
Supervised learning [7,8]

So far we have only looked at fitting densities p(x | θ), i.e., unsupervised learning. What about the discriminative setting?

[7] Score Function Features for Discriminative Learning: Matrix and Tensor Frameworks, M. Janzamin, H. Sedghi, A. Anandkumar, 2015.
[8] Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods, M. Janzamin, H. Sedghi, A. Anandkumar, 2016.
Sigmoidal net

$$f(x) := \mathbb{E}[\tilde{y} \mid x] = \langle a_2, \sigma(A_1^\top x + b_1) \rangle + b_2$$
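A minimal forward-pass sketch of this net (dimensions and names are my choice; the columns $(A_1)_j$ are the weight vectors learned below):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, k = 10, 3                        # input dimension, hidden units
A1 = np.random.randn(d, k)          # first-layer weights, one column per unit
b1 = np.random.randn(k)
a2, b2 = np.random.randn(k), np.random.randn()

def f(x):
    # f(x) = <a2, sigma(A1^T x + b1)> + b2
    return a2 @ sigmoid(A1.T @ x + b1) + b2
```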
Learning the weights (A_1) of the network

A series of papers by Sedghi and Anandkumar culminates in the following observation (heavily paraphrased): by Stein's lemma, the expectation of the product between the label (y) and the third-order score is a rank-k tensor built from the weight vectors $(A_1)_j$:

$$\mathbb{E}[y \, S_3(x)] = \mathbb{E}\big[\nabla_x^{(3)} f(x)\big] = \sum_j \lambda_j (A_1)_j \otimes (A_1)_j \otimes (A_1)_j$$

Practical version:

$$\frac{1}{T} \sum_{t=1}^{T} y_t \, S_3(x_t) \approx \sum_{j=1}^{k} \lambda_{3,j} (\tilde{A}_1)_j \otimes (\tilde{A}_1)_j \otimes (\tilde{A}_1)_j$$

Algorithm: compute the left side of the equation and perform a tensor decomposition up to rank k. [9]

[9] Using $\mathbb{E}[y \, S_2(x)] = \mathbb{E}[\nabla_x^{(2)} f(x)] = \sum_j \lambda_j (A_1)_j \otimes (A_1)_j$ does not lead to a unique solution.
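A self-contained sketch of the whole pipeline, under strong simplifications I am imposing to keep it short: Gaussian inputs (so $S_3$ has the closed form above), orthonormal columns of $A_1$ (so plain tensor power iteration with deflation works; the papers instead whiten with the second-order moment), and a large sample size.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, T = 6, 2, 500_000

# Ground-truth net with orthonormal weight columns (simplifying assumption).
A1 = np.linalg.qr(rng.standard_normal((d, k)))[0]
a2 = np.array([3.0, -2.0])
b1 = 0.1 * rng.standard_normal(k)
b2 = 0.1

x = rng.standard_normal((T, d))
y = 1.0 / (1.0 + np.exp(-(x @ A1 + b1))) @ a2 + b2

# Empirical moment tensor M = (1/T) sum_t y_t S_3(x_t), using the
# closed-form Gaussian S_3 from the previous slide.
I = np.eye(d)
xy = y[:, None] * x
M = (np.einsum('ti,tj,tk->ijk', xy, x, x)
     - np.einsum('ti,jk->ijk', xy, I)
     - np.einsum('tj,ik->ijk', xy, I)
     - np.einsum('tk,ij->ijk', xy, I)) / T

# Symmetric tensor power iteration with deflation, up to rank k.
def power_iter(Mt, iters=100):
    v = rng.standard_normal(Mt.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = np.einsum('ijk,j,k->i', Mt, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', Mt, v, v, v)
    return lam, v

est = []
for _ in range(k):
    lam, v = power_iter(M)
    est.append(v)
    M = M - lam * np.einsum('i,j,k->ijk', v, v, v)

# Each recovered vector should align (up to sign) with a column of A1.
print(np.abs(np.stack(est) @ A1))   # close to a permutation matrix
```

The $\lambda_j$ here absorb $a_{2,j}\,\mathbb{E}[\sigma'''(\cdot)]$, which is small for the sigmoid; hence the large sample size in this sketch.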
Training algorithm
Crucial requirements and payoff

Crucial requirements:

- the data must have come from a sigmoidal net of the same structure (the realizable setting)
- the usual story: you must get the scores of p(x), i.e., S_1; use a denoising autoencoder

In return you get a guarantee on the mean squared error between the output of the true net and the learned net.
What I did not tell you

You still have to learn b_1, A_2, b_2:

- b_1: uses a Fourier transform
- A_2, b_2: uses plain regression
Practical concerns

1. S_3 is large, (# features)^3: you need to use an approximation (sketch the tensor)
2. getting the scores S_1 may be fragile (local minima in training the DAE)
3. the result is for shallow nets
4. the non-realizable setting needs additional assumptions on the Fourier decomposition of the target function

Still, it might be worth implementing. [10]

[10] Read the arXiv versions of these papers; the ICLR ones are short on details.
http://arxiv.org/pdf/1506.08473.pdf
http://arxiv.org/pdf/1412.2863v2.pdf