Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning


1 Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning
Anton Rodomanov
Higher School of Economics, Russia
Bayesian methods research group
14 March 2016, HSE Seminar on Applied Linear Algebra, Moscow, Russia

2 Outline
1 Tensor Train Format
2 ML Application 1: Markov Random Fields
3 ML Application 2: TensorNet

3 What is a tensor?
Tensor = multidimensional array: A = [A(i_1, \dots, i_d)], i_k \in \{1, \dots, n_k\}.
Terminology:
dimensionality = d (number of indices);
size = n_1 \times \dots \times n_d (number of nodes along each axis).
Case d = 1: vector; d = 2: matrix.

4 Curse of dimensionality
Number of elements = n^d (exponential in d).
Even for n = 2, a tensor with large d already requires petabytes of memory.
Cannot work with such tensors using standard methods.

5 Tensor rank decomposition [Hitchcock, 1927]
Recall the rank decomposition for matrices:
A(i_1, i_2) = \sum_{\alpha=1}^{r} U(i_1, \alpha) V(i_2, \alpha).
This can be generalized to tensors.
Tensor rank decomposition (canonical decomposition):
A(i_1, \dots, i_d) = \sum_{\alpha=1}^{R} U_1(i_1, \alpha) \dots U_d(i_d, \alpha).
The minimal possible R is called the (canonical) rank of the tensor A.
(+) No curse of dimensionality.
(-) Ill-posed problem [de Silva, Lim, 2008].
(-) Rank R should be known in advance for many methods.
(-) Computation of R is NP-hard [Hillar, Lim, 2013].

6 Unfolding matrices: definition
Every tensor A has d - 1 unfolding matrices:
A_k := [A(i_1 \dots i_k; i_{k+1} \dots i_d)], where A(i_1 \dots i_k; i_{k+1} \dots i_d) := A(i_1, \dots, i_d).
Here i_1 \dots i_k and i_{k+1} \dots i_d are row and column (multi)indices; A_k is a matrix of size M_k \times N_k with M_k = \prod_{s=1}^{k} n_s, N_k = \prod_{s=k+1}^{d} n_s.
This is just a reshape.
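In NumPy this reshape can be written in one line. The sketch below is illustrative only; it assumes column-major (Fortran) ordering of the multi-indices, so the first index in each multi-index varies fastest.

```python
import numpy as np

def unfolding(A, k):
    """Return the k-th unfolding A_k of shape (n_1*...*n_k, n_{k+1}*...*n_d)."""
    M_k = int(np.prod(A.shape[:k]))
    # column-major reshape: the first index in each multi-index varies fastest
    return A.reshape(M_k, -1, order='F')

A = np.random.rand(2, 3, 4, 5)    # a 4-dimensional tensor
print(unfolding(A, 2).shape)      # (6, 20)
```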

7 Unfolding matrices: example
Consider A = [A(i, j, k)] given by its elements:
A(1, 1, 1) = 111, A(2, 1, 1) = 211, A(1, 2, 1) = 121, A(2, 2, 1) = 221,
A(1, 1, 2) = 112, A(2, 1, 2) = 212, A(1, 2, 2) = 122, A(2, 2, 2) = 222.
Then (with the first index in each multi-index varying fastest)
A_1 = [A(i; jk)] = \begin{bmatrix} 111 & 121 & 112 & 122 \\ 211 & 221 & 212 & 222 \end{bmatrix},
A_2 = [A(ij; k)] = \begin{bmatrix} 111 & 112 \\ 211 & 212 \\ 121 & 122 \\ 221 & 222 \end{bmatrix}.

8 Tensor Train decomposition: motivation
Main idea: variable splitting.
Consider a rank decomposition of an unfolding matrix:
A(i_1 i_2; i_3 i_4 i_5 i_6) = \sum_{\alpha_2} U(i_1 i_2; \alpha_2) V(i_3 i_4 i_5 i_6; \alpha_2).
On the left: a 6-dimensional tensor; on the right: a 3-dimensional and a 5-dimensional one. The dimension has been reduced! Proceed recursively.

9 Tensor Train decomposition [Oseledets, 2011]
TT-format for a tensor A:
A(i_1, \dots, i_d) = \sum_{\alpha_0, \dots, \alpha_d} G_1(\alpha_0, i_1, \alpha_1) G_2(\alpha_1, i_2, \alpha_2) \dots G_d(\alpha_{d-1}, i_d, \alpha_d).
This can be written compactly as a matrix product:
A(i_1, \dots, i_d) = G_1[i_1] G_2[i_2] \dots G_d[i_d],
where G_k[i_k] is a matrix of size r_{k-1} \times r_k (with r_0 = r_d = 1).
Terminology:
G_k: TT-cores (collections of matrices);
r_k: TT-ranks;
r = \max r_k: maximal TT-rank.
The TT-format uses O(d n r^2) memory to store O(n^d) elements. It is efficient only if the ranks are small.
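As a quick illustration of the matrix-product form, the following NumPy sketch evaluates a single element of a TT tensor. It assumes each core is stored as a 3-way array of shape (r_{k-1}, n_k, r_k); this is a minimal sketch, not a library implementation.

```python
import numpy as np

def tt_element(cores, index):
    """Compute A(i_1, ..., i_d) as the product G_1[i_1] G_2[i_2] ... G_d[i_d]."""
    result = np.ones((1, 1))
    for G, i in zip(cores, index):
        result = result @ G[:, i, :]   # multiply by the r_{k-1} x r_k matrix G_k[i_k]
    return result[0, 0]
```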

10 TT-format: example
Consider a tensor:
A(i_1, i_2, i_3) := i_1 + i_2 + i_3, i_1 \in \{1, 2, 3\}, i_2 \in \{1, 2, 3, 4\}, i_3 \in \{1, 2, 3, 4, 5\}.
Its TT-format: A(i_1, i_2, i_3) = G_1[i_1] G_2[i_2] G_3[i_3], where
G_1[i_1] := \begin{bmatrix} i_1 & 1 \end{bmatrix}, \quad G_2[i_2] := \begin{bmatrix} 1 & 0 \\ i_2 & 1 \end{bmatrix}, \quad G_3[i_3] := \begin{bmatrix} 1 \\ i_3 \end{bmatrix}.
Check:
A(i_1, i_2, i_3) = \begin{bmatrix} i_1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ i_2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ i_3 \end{bmatrix} = \begin{bmatrix} i_1 + i_2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ i_3 \end{bmatrix} = i_1 + i_2 + i_3.

11 TT-format: example
Consider a tensor:
A(i_1, i_2, i_3) := i_1 + i_2 + i_3, i_1 \in \{1, 2, 3\}, i_2 \in \{1, 2, 3, 4\}, i_3 \in \{1, 2, 3, 4, 5\}.
Its TT-format: A(i_1, i_2, i_3) = G_1[i_1] G_2[i_2] G_3[i_3], where
G_1 = \left( \begin{bmatrix} 1 & 1 \end{bmatrix}, \begin{bmatrix} 2 & 1 \end{bmatrix}, \begin{bmatrix} 3 & 1 \end{bmatrix} \right),
G_2 = \left( \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 3 & 1 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix} \right),
G_3 = \left( \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \begin{bmatrix} 1 \\ 4 \end{bmatrix}, \begin{bmatrix} 1 \\ 5 \end{bmatrix} \right).
The tensor has 3 \cdot 4 \cdot 5 = 60 elements. The TT-format uses 6 + 16 + 10 = 32 numbers to describe it.
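The example can be checked numerically. The sketch below builds the cores as 3-way arrays (with zero-based index arithmetic) and verifies both the values and the 32-number storage count; it is only an illustration of the formulas above.

```python
import numpy as np

# Cores for A(i_1, i_2, i_3) = i_1 + i_2 + i_3 (indices are 1-based in the formulas,
# so 1 is added when the cores are built).
G1 = np.array([[[i1, 1]] for i1 in range(1, 4)]).transpose(1, 0, 2)           # shape (1, 3, 2)
G2 = np.array([[[1, 0], [i2, 1]] for i2 in range(1, 5)]).transpose(1, 0, 2)   # shape (2, 4, 2)
G3 = np.array([[[1], [i3]] for i3 in range(1, 6)]).transpose(1, 0, 2)         # shape (2, 5, 1)

for i1 in range(3):
    for i2 in range(4):
        for i3 in range(5):
            value = (G1[:, i1, :] @ G2[:, i2, :] @ G3[:, i3, :])[0, 0]
            assert value == (i1 + 1) + (i2 + 1) + (i3 + 1)

print("storage:", G1.size + G2.size + G3.size, "numbers instead of", 3 * 4 * 5)  # 32 vs 60
```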

12 Finding a TT-representation of a tensor
General ways of building a TT-representation of a tensor:
Analytical formulas for the TT-cores.
TT-SVD algorithm [Oseledets, 2011]: exact quasi-optimal method; suitable only for small tensors (which fit into memory).
Interpolation algorithms: AMEn-cross [Dolgov & Savostyanov, 2013], DMRG [Khoromskij & Oseledets, 2010], TT-cross [Oseledets, 2010]: approximate, heuristically-based methods; can be applied to large tensors; no strong guarantees, but they work well in practice.
Operations on other tensors already given in the TT-format: addition, element-wise product, etc.

13 TT-interpolation: problem formulation
Input: a procedure A(i_1, \dots, i_d) for computing an arbitrary element of A.
Output: a TT-representation of a tensor B \approx A:
B(i_1, \dots, i_d) = G_1[i_1] \dots G_d[i_d].

14 Matrix interpolation: matrix-cross
Let A \in \mathbb{R}^{m \times n} with rank r. It admits a skeleton decomposition [Goreinov et al., 1997]: A = C \hat{A}^{-1} R, where
\hat{A} = A(I, J): an r \times r non-singular submatrix;
C = A(:, J): the m \times r matrix of columns containing \hat{A};
R = A(I, :): the r \times n matrix of rows containing \hat{A}.
Q: Which \hat{A} to choose? A: Any non-singular submatrix: if rank A = r, the decomposition is exact.
Q: What if rank A > r? Different \hat{A} will give different errors. A: Choose a maximal-volume submatrix [Goreinov, Tyrtyshnikov, 2001]. It can be found with the maxvol algorithm [Goreinov et al., 2008].
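A small NumPy sketch of the skeleton decomposition for an exactly rank-r matrix. Here the row and column indices are picked at random rather than by maxvol, which already suffices in the exact-rank case described above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 40, 5
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # exactly rank-r matrix

I = rng.choice(m, size=r, replace=False)   # row indices
J = rng.choice(n, size=r, replace=False)   # column indices
C = A[:, J]                                # m x r columns containing Ahat
R = A[I, :]                                # r x n rows containing Ahat
Ahat = A[np.ix_(I, J)]                     # r x r intersection submatrix

A_rec = C @ np.linalg.solve(Ahat, R)       # C Ahat^{-1} R
print(np.linalg.norm(A - A_rec) / np.linalg.norm(A))   # ~1e-14 if Ahat is non-singular
```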

15 TT interpolation with TT-cross: example
Hilbert tensor: A(i_1, i_2, \dots, i_d) := \frac{1}{i_1 + i_2 + \dots + i_d}.
(Table: TT-rank vs. time, number of iterations, and relative accuracy of the TT-cross approximation; the accuracy improves as the TT-rank grows, down to about 1e-09.)
[Oseledets & Tyrtyshnikov, 2009]

16 Operations: addition and multiplication by a number
Let C = A + B: C(i_1, \dots, i_d) = A(i_1, \dots, i_d) + B(i_1, \dots, i_d).
The TT-cores of C are as follows:
C_k[i_k] = \begin{bmatrix} A_k[i_k] & 0 \\ 0 & B_k[i_k] \end{bmatrix}, \quad k = 2, \dots, d-1,
C_1[i_1] = \begin{bmatrix} A_1[i_1] & B_1[i_1] \end{bmatrix}, \quad C_d[i_d] = \begin{bmatrix} A_d[i_d] \\ B_d[i_d] \end{bmatrix}.
The ranks are doubled.
Multiplication by a number: C = A \cdot \text{const}. Multiply only one core by const. The ranks do not increase.
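A minimal sketch of building these block cores in NumPy, assuming the cores of A and B are stored as 3-way arrays of shape (r_{k-1}, n_k, r_k) with d >= 2; the helper name tt_add is illustrative only.

```python
import numpy as np

def tt_add(a_cores, b_cores):
    """TT-cores of C = A + B: block-diagonal middle cores, stacked boundary cores."""
    d = len(a_cores)
    c_cores = []
    for k, (A_k, B_k) in enumerate(zip(a_cores, b_cores)):
        ra1, n, ra2 = A_k.shape
        rb1, _, rb2 = B_k.shape
        if k == 0:
            C_k = np.concatenate([A_k, B_k], axis=2)     # [A_1[i]  B_1[i]]
        elif k == d - 1:
            C_k = np.concatenate([A_k, B_k], axis=0)     # [A_d[i]; B_d[i]]
        else:
            C_k = np.zeros((ra1 + rb1, n, ra2 + rb2))
            C_k[:ra1, :, :ra2] = A_k                     # block-diagonal structure
            C_k[ra1:, :, ra2:] = B_k
        c_cores.append(C_k)
    return c_cores
```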

17 Operations: element-wise product
Let C = A \odot B: C(i_1, \dots, i_d) = A(i_1, \dots, i_d) \, B(i_1, \dots, i_d).
The TT-cores of C can be computed as follows:
C_k[i_k] = A_k[i_k] \otimes B_k[i_k],
where \otimes is the Kronecker product. As a result, rank(C) = rank(A) \cdot rank(B).
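This rule translates into one np.kron per core slice. A minimal sketch under the same 3-way core-storage assumption as above:

```python
import numpy as np

def tt_hadamard(a_cores, b_cores):
    """TT-cores of the element-wise product: C_k[i] = A_k[i] (Kronecker) B_k[i]."""
    c_cores = []
    for A_k, B_k in zip(a_cores, b_cores):
        ra1, n, ra2 = A_k.shape
        rb1, _, rb2 = B_k.shape
        C_k = np.empty((ra1 * rb1, n, ra2 * rb2))
        for i in range(n):
            C_k[:, i, :] = np.kron(A_k[:, i, :], B_k[:, i, :])
        c_cores.append(C_k)
    return c_cores
```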

18 TT-format: efficient operations
Operation      Output rank      Complexity
A \cdot const  r_A              O(d r_A)
A + const      r_A + 1          O(d n r_A^2)
A + B          r_A + r_B        O(d n (r_A + r_B)^2)
A \odot B      r_A r_B          O(d n r_A^2 r_B^2)
sum(A)         --               O(d n r_A^2)
...
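For instance, sum(A) reduces to summing every core over its mode index and multiplying the resulting small r_{k-1} x r_k matrices, which is where the O(d n r_A^2) entry in the table comes from. A minimal sketch:

```python
import numpy as np

def tt_sum(cores):
    """Sum of all elements of a TT tensor: product of the mode-summed cores."""
    result = np.ones((1, 1))
    for G in cores:
        result = result @ G.sum(axis=1)   # sum core over i_k, then contract the ranks
    return result[0, 0]
```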

19 TT-rounding
TT-rounding procedure: \tilde{A} = round(A, \varepsilon), \varepsilon = accuracy:
1 Maximally reduces the TT-ranks while ensuring that \| A - \tilde{A} \|_F \leq \varepsilon \| A \|_F.
2 Uses SVD for compression (like TT-SVD).
Rounding allows one to trade off accuracy vs. the maximal rank of the TT-representation.
Example: round(A + A, \varepsilon_{mach}) recovers the TT-ranks of A (instead of the doubled ranks) within machine accuracy \varepsilon_{mach}.

20 Example: multivariate integration
I(d) := \int_{[0,1]^d} \sin(x_1 + x_2 + \dots + x_d) \, dx_1 dx_2 \dots dx_d = \mathrm{Im}\left( \left( \frac{e^{i} - 1}{i} \right)^d \right).
Use Chebyshev quadrature with n = 11 nodes + TT-cross with r = 2.
(Table: I(d), relative accuracy, and computation time for several values of d.)
[Oseledets & Tyrtyshnikov, 2009]
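The closed-form reference value above is easy to evaluate directly (the quadrature + TT-cross part would need a TT toolbox and is not shown). A small sketch:

```python
import numpy as np

def exact_integral(d):
    """Reference value I(d) = Im( ((exp(i) - 1) / i)^d )."""
    return (((np.exp(1j) - 1) / 1j) ** d).imag

for d in (10, 100, 1000):
    print(d, exact_integral(d))
```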

21 Outline
1 Tensor Train Format
2 ML Application 1: Markov Random Fields
3 ML Application 2: TensorNet

22 Motivational example: image segmentation
Task: assign a label y_i to each pixel of an M \times N image. Let P(y) be the joint probability of a labelling y. Two extreme cases:
No assumptions about independence: O(K^{MN}) parameters (K = total number of labels); represents every distribution; intractable in general.
Everything is independent: P(y) = p_1(y_1) \dots p_{MN}(y_{MN}); O(MNK) parameters; represents only a small class of distributions; tractable.

23 Graphical models
Provide a convenient way to define probabilistic models using graphs. Two types: directed graphical models and Markov random fields. We will consider only (discrete) Markov random fields.
The edges represent dependencies between the variables. E.g., for image segmentation: a grid-structured graph over the pixel labels y_i.
A variable y_i is independent of the rest given its immediate neighbours.

24 Markov random fields
The model:
P(y) = \frac{1}{Z} \prod_{c \in C} \Psi_c(y_c),
Z: normalization constant;
C: set of all (maximal) cliques in the graph;
\Psi_c: non-negative functions, which are called factors.
Example (a 2 \times 2 grid over y_1, y_2, y_3, y_4):
P(y_1, y_2, y_3, y_4) = \frac{1}{Z} \Psi_1(y_1) \Psi_2(y_2) \Psi_3(y_3) \Psi_4(y_4) \, \Psi_{12}(y_1, y_2) \Psi_{24}(y_2, y_4) \Psi_{34}(y_3, y_4) \Psi_{13}(y_1, y_3).
The factors \Psi_{ij} measure the compatibility between variables y_i and y_j.
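For such a tiny model, the unnormalized probability and the normalization constant Z can be computed by brute force. The sketch below uses random non-negative factors; the names unary, pairwise and the label count K are illustrative assumptions, not values from the slides.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K = 3                                                    # number of labels per variable (assumed)
unary = {i: rng.random(K) for i in range(4)}             # Psi_1, ..., Psi_4
edges = [(0, 1), (1, 3), (2, 3), (0, 2)]                 # the four pairwise factors
pairwise = {e: rng.random((K, K)) for e in edges}

def p_tilde(y):
    """Unnormalized probability: product of all unary and pairwise factors."""
    value = np.prod([unary[i][y[i]] for i in range(4)])
    for (i, j) in edges:
        value *= pairwise[(i, j)][y[i], y[j]]
    return value

Z = sum(p_tilde(y) for y in product(range(K), repeat=4))   # brute-force normalization constant
print("Z =", Z)
```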

25 Main problems of interest
Probabilistic model:
P(y) = \frac{1}{Z} \prod_{c \in C} \Psi_c(y_c) = \frac{1}{Z} \exp(-E(y)),
where E is the energy function: E(y) = \sum_{c \in C} \Theta_c(y_c), \Theta_c(y_c) = -\ln \Psi_c(y_c).
Maximum a posteriori (MAP) inference: y^* = \arg\max_y P(y) = \arg\min_y E(y).
Estimation of the normalization constant: Z = \sum_y \prod_{c \in C} \Psi_c(y_c).
Estimation of the marginal distributions: P(y_i) = \sum_{y \setminus y_i} P(y).

26 Tensorial perspective
The energy and the unnormalized probability are tensors:
E(y_1, \dots, y_n) = \sum_{c=1}^{m} \Theta_c(y_c), \quad P(y_1, \dots, y_n) = \prod_{c=1}^{m} \Psi_c(y_c),
both n-dimensional tensors, where y_i \in \{1, \dots, d\}.
In this language:
MAP-inference = finding the minimal element of E;
Normalization constant = the sum of all the elements of P.
Details: Putting MRFs on a Tensor Train [Novikov et al., ICML 2014].

27 Outline
1 Tensor Train Format
2 ML Application 1: Markov Random Fields
3 ML Application 2: TensorNet

28 Classification problem

29 Neural networks
Use a composite function: g(x) = f(W_2 f(W_1 x)), where h = f(W_1 x) \in \mathbb{R}^m is the hidden layer, h_k = f((W_1 x)_k), and f is the sigmoid f(x) = \frac{1}{1 + \exp(-x)}.
(Figure: a network with inputs x_1, \dots, x_n, hidden units h_1, \dots, h_m, and output g(x).)
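A minimal NumPy sketch of this forward pass with random placeholder weights; the layer sizes below are assumptions for illustration, not values from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m, out = 784, 100, 10                 # input, hidden and output sizes (assumed)
W1 = rng.standard_normal((m, n)) * 0.01
W2 = rng.standard_normal((out, m)) * 0.01

x = rng.standard_normal(n)
h = sigmoid(W1 @ x)                      # hidden layer: h_k = f((W1 x)_k)
g = sigmoid(W2 @ h)                      # network output g(x) = f(W2 f(W1 x))
```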

30 TensorNet: motivation
Goal: compress fully-connected layers. Why care about memory?
State-of-the-art deep neural networks don't fit into the memory of mobile devices.
Up to 95% of the parameters are in the fully-connected layers.
A shallow network with a huge fully-connected layer can achieve almost the same accuracy as an ensemble of deep CNNs [Ba and Caruana, 2014].

31 Tensor Train layer
Let x be the input and y be the output: y = Wx. The matrix W is represented in the TT-format:
y(i_1, \dots, i_d) = \sum_{j_1, \dots, j_d} G_1[i_1, j_1] \dots G_d[i_d, j_d] \, x(j_1, \dots, j_d).
Parameters: the TT-cores \{G_k\}_{k=1}^{d}.
Details: Tensorizing Neural Networks [Novikov et al., NIPS 2015].
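This sum is a sequence of small contractions, so it can be sketched in a few lines of NumPy. The sketch below is illustrative only (not the TensorNet implementation); it assumes each core is stored as a 4-way array of shape (r_{k-1}, n_k, m_k, r_k) with r_0 = r_d = 1, and that x and y are flattened in row-major (C) order over (j_1, ..., j_d) and (i_1, ..., i_d).

```python
import numpy as np

def tt_matvec(cores, x):
    """Multiply a TT-matrix, given by its 4-way cores, by a dense vector x."""
    # Running tensor R of shape (N, r, M_rest):
    #   N      = product of output modes processed so far,
    #   r      = current TT-rank,
    #   M_rest = product of input modes not processed yet.
    R = x.reshape(1, 1, -1)
    for G in cores:
        r_prev, n_k, m_k, r_next = G.shape
        N, _, M_rest = R.shape
        R = R.reshape(N, r_prev, m_k, M_rest // m_k)
        # contract over the rank index r_prev and the current input mode j_k
        R = np.einsum('apjs,pijq->aiqs', R, G)
        R = R.reshape(N * n_k, r_next, -1)
    return R.reshape(-1)

# Quick check against the dense matrix assembled from random cores (d = 3).
rng = np.random.default_rng(0)
cores = [rng.standard_normal(s) for s in [(1, 2, 3, 4), (4, 3, 2, 5), (5, 4, 3, 1)]]
W = np.einsum('aibp,pjcq,qkdr->ijkbcd', *cores).reshape(2 * 3 * 4, 3 * 2 * 3)
x = rng.standard_normal(3 * 2 * 3)
print(np.allclose(W @ x, tt_matvec(cores, x)))   # True
```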

32 Conclusions
The TT-format and the corresponding algorithms are a good way to work with huge tensors.
Memory and complexity depend on d linearly: no curse of dimensionality.
The TT-format is efficient only if the TT-ranks are small. This is the case in many applications.
Code is available in Python and MATLAB.
Thanks for attention!
