Some tensor decomposition methods for machine learning

1 Some tensor decomposition methods for machine learning
Massimiliano Pontil, Istituto Italiano di Tecnologia and University College London
16 August

2 Outline
- Problem and motivation
- Tucker decomposition
- Two convex relaxations
- Applications
- Extensions

3 Problem
Learning a tensor from a set of linear measurements. Main example: tensor completion.
- Video denoising/completion
- Context-aware recommendation
- Multisensor data analysis
- Entity-relationship learning
- ...

4 Multilinear multitask learning
- Tasks are associated with multiple indices, e.g. predict the rating given to different aspects of a restaurant by different critics.
- Task regression vectors are vertical fibers of the tensor, e.g. the fiber ( · , food ).

5 Zero-shot transfer learning
Learning tasks for which no training instances are provided.

6 Setting
We consider a supervised learning setting in which we wish to learn a tensor from a set of measurements
    y = I(W) + ε
where I : R^{d_1 × ··· × d_N} → R^m is a linear (sampling) operator, e.g. one that returns a subset of the tensor entries.
The number of parameters explodes with the order of the tensor, so regularization is key:
    minimize_W  E(W) + λ R(W)
for example E(W) = ‖y − I(W)‖².
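To make the setting concrete, here is a minimal numpy sketch of a mask-based sampling operator and the regularized objective; the helper names (sampling_operator, objective) and the 30% observation rate are illustrative choices, not taken from the slides.

```python
import numpy as np

def sampling_operator(W, mask):
    """I(W): return the entries of W selected by a boolean mask (tensor completion)."""
    return W[mask]

def objective(W, y, mask, reg, lam):
    """E(W) + lambda * R(W), with squared-error data fit E(W) = ||y - I(W)||^2."""
    residual = y - sampling_operator(W, mask)
    return np.sum(residual ** 2) + lam * reg(W)

# Toy instance: observe 30% of the entries of a noisy third-order tensor.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(5, 6, 7))
mask = rng.random(W_true.shape) < 0.3
y = sampling_operator(W_true, mask) + 0.01 * rng.normal(size=mask.sum())
```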

7 Matrix case
The matrix case has been thoroughly studied, particularly through spectral regularizers which encourage low-rank matrices:
- low-rank matrix completion
- multitask feature learning
There are different notions of rank for a tensor (some computationally intractable); many are reviewed in [Kolda and Bader, 2009].
Which ones are simple and effective?

8 Objectives
We want some kind of guarantees:
- statistical: bounds on the generalization / out-of-sample error
- optimization: convergence, rates
- function approximation
We discuss two approaches based on:
- B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, M. Pontil. Multilinear multitask learning. 30th International Conference on Machine Learning, 2013.
- B. Romera-Paredes and M. Pontil. A new convex relaxation for tensor completion. Advances in Neural Information Processing Systems 26, 2013.

9 Function approximation
Learn a function parametrized by a tensor from a sample.
Example: x = (x_1, x_2, x_3) ∈ V_1 × V_2 × V_3, three vector spaces,
    f(x) = ⟨W, φ_1(x_1) ⊗ φ_2(x_2) ⊗ φ_3(x_3)⟩,   W ∈ R^{d_1 × d_2 × d_3}
with feature maps φ_i : V_i → R^{d_i}.
    min_W  Σ_{i=1}^m Loss(y_i, f(x_i)) + γ Ω(W)
Dual problem using kernel methods: K(x, t) = K_1(x_1, t_1) K_2(x_2, t_2) K_3(x_3, t_3).
Matrix case: [Abernethy, Bach, Evgeniou, and Vert, JMLR 2009]. Tensor case: [Signoretto, De Lathauwer, Suykens, Preprint 2013].
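For intuition, here is a small numpy sketch (assuming N = 3 and explicit finite-dimensional feature vectors; the function name f and the toy dimensions are illustrative) showing how ⟨W, φ_1(x_1) ⊗ φ_2(x_2) ⊗ φ_3(x_3)⟩ can be evaluated by contracting one mode at a time, without ever forming the rank-one tensor.

```python
import numpy as np

def f(W, phi1, phi2, phi3):
    """<W, phi1 ⊗ phi2 ⊗ phi3>: contract each mode of W with the matching feature vector."""
    return np.einsum('ijk,i,j,k->', W, phi1, phi2, phi3)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3, 5))
phi1, phi2, phi3 = rng.normal(size=4), rng.normal(size=3), rng.normal(size=5)
print(f(W, phi1, phi2, phi3))
```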

10 Basic operations
Tensor W ∈ R^{d_1 × ··· × d_N}, with entries W_{i_1,...,i_N}.
A mode-n fiber is a vector in R^{d_n} formed by the elements of the tensor obtained by fixing all indices but the n-th one; e.g. in the above example W_{i_1, i_2, :} ∈ R^{d_3} is the (i_1, i_2)-th regression vector.

11 Basic operations (cont.)
Matricization is the process of rearranging the tensor into a matrix.
The mode-n matricization W_{(n)} ∈ R^{d_n × J_n}, where J_n = ∏_{k ≠ n} d_k, is the matrix whose columns are the mode-n fibers of W.
[Figure: illustration of the mode-1 matricization W_{(1)} and the mode-3 matricization W_{(3)}]
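A minimal numpy sketch of fibers and mode-n matricization. The exact column ordering of an unfolding is a convention that differs across papers; the reshape below simply fixes one such ordering, and the helper name matricize is illustrative.

```python
import numpy as np

def matricize(W, n):
    """Mode-n matricization W_(n): its columns are the mode-n fibers of W."""
    # Move mode n to the front, then flatten all remaining modes into columns.
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

W = np.arange(24).reshape(2, 3, 4)
fiber = W[0, 1, :]                  # a mode-3 fiber: fix i1 = 0, i2 = 1
print(fiber)                        # [4 5 6 7]
print(matricize(W, 0).shape)        # (2, 12)
print(matricize(W, 2).shape)        # (4, 6)
```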

12 Tucker rank
    TR(W) = (rank(W_{(1)}), ..., rank(W_{(N)}))
A natural regularizer associated with this is the sum of the ranks of the matricizations:
    R(W) := Σ_{n=1}^N rank(W_{(n)})

13 Tucker decomposition
    W = G ×_1 A^{(1)} ×_2 ··· ×_N A^{(N)}
meaning that
    W_{i_1,...,i_N} = Σ_{j_1=1}^{K_1} ··· Σ_{j_N=1}^{K_N} G_{j_1,...,j_N} A^{(1)}_{i_1 j_1} ··· A^{(N)}_{i_N j_N}
with K_n ≤ d_n (typically much smaller).
- A^{(n)} ∈ R^{d_n × K_n}, n ∈ {1,...,N}, are the factor matrices.
- G ∈ R^{K_1 × ··· × K_N} is called the core tensor and models the interaction between the factors.
- By construction rank(W_{(n)}) ≤ K_n.
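A minimal numpy sketch of the Tucker construction for N = 3 (the helper name tucker_to_tensor and the toy sizes are illustrative): the core is contracted with one factor matrix per mode, and the resulting tensor indeed has mode-n matricizations of rank at most K_n.

```python
import numpy as np

def tucker_to_tensor(G, A1, A2, A3):
    """W = G x_1 A1 x_2 A2 x_3 A3 for a third-order core G."""
    return np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)

rng = np.random.default_rng(0)
K, d = (2, 3, 2), (5, 6, 7)
G = rng.normal(size=K)
A1, A2, A3 = (rng.normal(size=(d[n], K[n])) for n in range(3))
W = tucker_to_tensor(G, A1, A2, A3)
print(W.shape)                                     # (5, 6, 7)
print(np.linalg.matrix_rank(W.reshape(d[0], -1)))  # mode-1 rank <= K_1 = 2
```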

14 Nonconvex approach
The Tucker decomposition is invariant under multiplication and division of different factors by the same scalar. With the aim of avoiding this issue and reducing overfitting, we add Frobenius-norm regularization terms on the components:
    minimize_{G, A^{(1)},...,A^{(N)}}  E(G ×_1 A^{(1)} ×_2 ··· ×_N A^{(N)}) + α [ ‖G‖_F² + Σ_{n=1}^N ‖A^{(n)}‖_F² ]
where α is a regularization parameter.
- We solve this problem by alternating minimization.
- Each step is detailed in the paper (code also available).
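The paper gives the exact block updates; the sketch below is only an illustration of the alternating structure, assuming a masked squared-error loss and N = 3, and replacing each exact block solve with a single gradient step. All names (alt_min_step, lr) and defaults are illustrative, not the authors' algorithm.

```python
import numpy as np

def reconstruct(G, A1, A2, A3):
    return np.einsum('abc,ia,jb,kc->ijk', G, A1, A2, A3)

def alt_min_step(G, A1, A2, A3, Y, mask, alpha, lr=1e-3):
    """One sweep of block updates for E(W) + alpha*(||G||_F^2 + sum_n ||A_n||_F^2),
    where E is a masked squared error. Each block solve is replaced by a gradient step."""
    for block in ('A1', 'A2', 'A3', 'G'):
        R = np.where(mask, reconstruct(G, A1, A2, A3) - Y, 0.0)  # masked residual
        if block == 'A1':
            A1 = A1 - lr * (2 * np.einsum('ijk,abc,jb,kc->ia', R, G, A2, A3) + 2 * alpha * A1)
        elif block == 'A2':
            A2 = A2 - lr * (2 * np.einsum('ijk,abc,ia,kc->jb', R, G, A1, A3) + 2 * alpha * A2)
        elif block == 'A3':
            A3 = A3 - lr * (2 * np.einsum('ijk,abc,ia,jb->kc', R, G, A1, A2) + 2 * alpha * A3)
        else:
            G = G - lr * (2 * np.einsum('ijk,ia,jb,kc->abc', R, A1, A2, A3) + 2 * alpha * G)
    return G, A1, A2, A3
```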

15 Convex approach
An alternative approach is to relax the combinatorial problem
    argmin_W  E(W) + γ Σ_{n=1}^N rank(W_{(n)})
The trace norm is a widely used convex surrogate for the rank. Therefore, we can consider the following convex relaxation:
    argmin_W  E(W) + γ Σ_{n=1}^N ‖W_{(n)}‖_Tr
Tensor completion: [Liu et al, 2009; Gandy et al, 2011; Signoretto et al, 2012].
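Evaluating the relaxed regularizer only requires the nuclear (trace) norm of each unfolding. A minimal numpy sketch (the helper name sum_of_trace_norms is illustrative):

```python
import numpy as np

def matricize(W, n):
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def sum_of_trace_norms(W):
    """Sum over modes of the trace (nuclear) norm of each matricization."""
    return sum(np.linalg.norm(matricize(W, n), ord='nuc') for n in range(W.ndim))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 6, 7))
print(sum_of_trace_norms(W))
```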

16 Discussion
- In the Tucker decomposition the factors are explicit, hence we can add a new factor without relearning the full tensor.
- Space complexity: O(Σ_{n=1}^N d_n K_n + ∏_{n=1}^N K_n), which can be much smaller than that of the convex approach, particularly if K_n ≪ d_n.
- Main drawback: no guarantee of finding even a local minimum.
[Figure: correlation vs. training set size m, comparing GRR, GTNR, GMF, TTNR and TF]

17 Alternating Direction Method of Multipliers (ADMM)
- The regularizer Σ_{n=1}^N ‖W_{(n)}‖_Tr is composite; ADMM decouples the problem.
- Introduce auxiliary tensors B_n, n ∈ {1,...,N}, accounting for each term in the sum, adding the constraints B_n = W.
- Optimizing the resulting Lagrangian w.r.t. each B_n only involves computing the proximal operator of the trace norm:
    prox_Tr(V) = argmin_{X ∈ R^{d × d}} { (1/2) ‖X − V‖_F² + ‖X‖_Tr }
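The proximal operator of the trace norm is singular-value soft-thresholding. A minimal numpy sketch (tau is the threshold, equal to 1 in the display above and to 1/β in the ADMM steps that follow; the function name is illustrative):

```python
import numpy as np

def prox_trace_norm(V, tau=1.0):
    """prox of tau*||.||_Tr at V: soft-threshold the singular values of V by tau."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

V = np.random.default_rng(0).normal(size=(4, 6))
X = prox_trace_norm(V, tau=0.5)   # singular values <= tau are zeroed out
print(np.linalg.matrix_rank(X))
```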

18 ADMM
Want to minimize
    (1/γ) E(W) + Σ_{n=1}^N Ψ(W_{(n)})
Decouple the regularization term:
    min_{W, B_1,...,B_N} { (1/γ) E(W) + Σ_{n=1}^N Ψ(B_{n(n)}) :  B_n = W,  n = 1,...,N }
Augmented Lagrangian:
    L(W, B, C) = (1/γ) E(W) + Σ_{n=1}^N [ Ψ(B_{n(n)}) − ⟨C_n, W − B_n⟩ + (β/2) ‖W − B_n‖² ]

19 ADMM (cont.)
    L(W, B, C) = (1/γ) E(W) + Σ_{n=1}^N [ Ψ(B_{n(n)}) − ⟨C_n, W − B_n⟩ + (β/2) ‖W − B_n‖² ]
Updating equations:
    W^{[i+1]} ← argmin_W  L(W, B^{[i]}, C^{[i]})
    B_n^{[i+1]} ← argmin_{B_n}  L(W^{[i+1]}, B, C^{[i]})
    C_n^{[i+1]} ← C_n^{[i]} − β (W^{[i+1]} − B_n^{[i+1]})
- The 2nd step involves the computation of the proximity operator of Ψ.
- Convergence properties of ADMM are detailed e.g. in [Eckstein and Bertsekas, Mathematical Programming 1992].
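Putting the pieces together, here is a sketch of the full ADMM loop for tensor completion with Ψ the trace norm. The slides leave E(W) generic; this sketch assumes the masked squared error E(W) = ‖y − I(W)‖², so that the W-step has an entrywise closed form. Helper names (admm_completion, fold) and the default parameters are illustrative.

```python
import numpy as np

def matricize(W, n):
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def fold(M, n, shape):
    """Inverse of matricize: refold a mode-n unfolding back into a tensor of the given shape."""
    rest = shape[:n] + shape[n + 1:]
    return np.moveaxis(M.reshape((shape[n],) + rest), 0, n)

def prox_trace_norm(V, tau):
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def admm_completion(Y, mask, gamma=1.0, beta=1.0, iters=100):
    """ADMM sketch for min_W (1/gamma)||y - I(W)||^2 + sum_n ||W_(n)||_Tr."""
    N, shape = Y.ndim, Y.shape
    W = np.where(mask, Y, 0.0)
    B = [W.copy() for _ in range(N)]
    C = [np.zeros(shape) for _ in range(N)]
    for _ in range(iters):
        # W-step: quadratic in W, solved entrywise in closed form.
        numer = (2.0 / gamma) * np.where(mask, Y, 0.0) + sum(C[n] + beta * B[n] for n in range(N))
        denom = (2.0 / gamma) * mask + N * beta
        W = numer / denom
        for n in range(N):
            # B-step: prox of the trace norm applied to the mode-n unfolding of W - C_n / beta.
            A = matricize(W - C[n] / beta, n)
            B[n] = fold(prox_trace_norm(A, 1.0 / beta), n, shape)
            # Dual update.
            C[n] = C[n] - beta * (W - B[n])
    return W
```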

20 Proximity operator
Let B = B_{n(n)} and A = (W − (1/β) C_n)_{(n)}. Rewrite the 2nd step as
    B̂ = prox_{(1/β)Ψ}(A) := argmin_B { (1/2) ‖B − A‖² + (1/β) Ψ(B) }
Case of interest: Ψ(B) = ψ(σ(B)).
By von Neumann's matrix inequality,
    prox_{(1/β)Ψ}(A) = U_A diag( prox_{(1/β)ψ}(σ_A) ) V_A^⊤
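The identity above gives a generic recipe for the prox of any spectral function: take an SVD, apply the corresponding vector prox to the singular values, and rebuild the matrix. A short numpy sketch (spectral_prox and the example vector prox are illustrative names; the ℓ1 case recovers singular-value soft-thresholding):

```python
import numpy as np

def spectral_prox(A, vector_prox):
    """prox of a spectral function Psi(B) = psi(sigma(B)):
    apply the prox of psi to the singular values of A and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(vector_prox(s)) @ Vt

# Example: psi = (1/beta) * ||.||_1 gives soft-thresholding of the singular values.
beta = 2.0
soft_threshold = lambda s: np.maximum(s - 1.0 / beta, 0.0)
A = np.random.default_rng(0).normal(size=(4, 6))
B_hat = spectral_prox(A, soft_threshold)
print(B_hat.shape)
```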

21 Rethinking the convex approach
The convex envelope of a function f on a set S is the largest convex function, denoted f̌, majorized by f at every point in S.
E.g., cardinality of a vector: f(v) = card(v), S = {v : ‖v‖_∞ ≤ α}; then f̌(v) = ‖v‖_1 / α.
The smaller S, the tighter the convex envelope.
[Figure: card(v) and its convex envelopes f̌(v) for several values of α]
In practice α is unknown and tuned by cross-validation.

22 Rethinking the convex approach
The same reasoning applies to matrices: ‖W‖_Tr / α is the convex envelope of rank(W) on the spectral-norm ball of radius α [Fazel, Hindi, & Boyd, 2001].
By using the regularizer Σ_{n=1}^N ‖W_{(n)}‖_Tr we implicitly assume the same α for the different matricizations.
Difficulty with tensors: the spectrum of W_{(n)} varies with n!

23 Rethinking the convex approach
The spectrum of W_{(n)} varies with n. Let us consider an example tensor W:
[Figure: vectors of singular values of each matricization; smallest ℓ∞ ball containing each of the vectors; smallest ℓ∞ ball containing all vectors]

24 Rethinking the convex approach
- We are interested in convex functions of matrices which are invariant to the matricization operation.
- The Frobenius norm is very appealing: it is also a spectral function.
- Therefore, we consider the set S = {W : ‖W‖_F ≤ α}.
- Calculating the convex envelope of the rank on S can be reduced to calculating the convex envelope of card(v) on the set {v : ‖v‖_2 ≤ α}, where v is the vector of singular values of W (this follows by von Neumann's matrix inequality).

25 Convex envelope of the cardinality of a vector in the ℓ2 ball
[Figure: vectors of singular values of each matricization; smallest ℓ∞ ball containing each of the vectors; smallest ℓ∞ ball containing all vectors; smallest ℓ2 ball containing all vectors]

26 Convex envelope of the cardinality of a vector in the ℓ2 ball
Aim: derive the convex envelope of f_α(x) = card(x) on the set {x ∈ R^d : ‖x‖_2 ≤ α}.
The conjugate of f_α at s ∈ R^d is
    f_α*(s) = sup_{x ∈ αB_2} ⟨s, x⟩ − card(x) = max_{r ∈ {0,...,d}} { α ‖s_{1:r}‖_2 − r }
where s_{1:r} denotes the r largest-magnitude entries of s.
Biconjugate of f_α:
    f_α**(v) = sup_{s ∈ R^d} ⟨s, v⟩ − f_α*(s),   ‖v‖_2 ≤ α
We do not know how to compute f_α** in closed form, but we can compute it and its proximal operator by projected subgradient descent.
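The conjugate, by contrast, has the simple finite maximum above. A small numpy check of that formula (conjugate_card and the test vector are illustrative):

```python
import numpy as np

def conjugate_card(s, alpha):
    """f_alpha*(s) = max over r in {0,...,d} of alpha * ||s_{1:r}||_2 - r,
    where s_{1:r} keeps the r largest-magnitude entries of s."""
    mags = np.sort(np.abs(s))[::-1]
    partial_norms = np.sqrt(np.cumsum(mags ** 2))      # ||s_{1:r}||_2 for r = 1,...,d
    values = np.concatenate(([0.0], alpha * partial_norms - np.arange(1, s.size + 1)))
    return values.max()

s = np.array([3.0, -0.5, 0.2, 1.0])
print(conjugate_card(s, alpha=1.0))   # 2.0, attained at r = 1
```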

27 Quality of relaxation (cont.)
Lemma. If ‖x‖_2 = α then ω_α(x) = card(x), where ω_α denotes the convex envelope f_α** above.
Let
    Ω_α(W) = Σ_{n=1}^N ω_α(σ(W_{(n)})),    ‖W‖_tr = Σ_{n=1}^N ‖σ(W_{(n)})‖_1
Implication:
Theorem. If W satisfies (a), (b), (c) below then Ω_{p_min}(W) > ‖W‖_tr
a) ‖W_{(n)}‖ ≤ 1 for every n
b) ‖W‖_2 = p_min
c) min_n rank(W_{(n)}) < max_n rank(W_{(n)})
On the other hand, ω_1 is the convex envelope of card on the ℓ2 unit ball, so
    Ω_1(W) ≥ ‖W‖_tr for every W with ‖W‖_2 ≤ 1.

28 Experiments on tensor completion
Datasets (both cast as tensor completion problems):
- Video compression
- Exam score prediction
[Figures: RMSE vs. training set size m for the tensor trace norm and the proposed regularizer on both datasets, and a running-time comparison in seconds]

29 Extensions and further work
- Latent tensor norm
- Tensor nuclear norm and an open problem
- Controlling the rank of all matricizations (quantum physics)
- Kernel methods
- Statistical analysis

30 Latent tensor nuclear norm
Defined by the variational problem [Tomioka and Suzuki 2013, Wimalawarne et al 2014]
    ‖W‖_LNN = inf { Σ_{j=1}^m ‖σ(V_j)‖_1 : Σ_{j=1}^m M_j V_j = W }
where M_j is the adjoint of the matricization operator.
The associated unit ball is conv{ W : rank(W_{(n)}) ≤ 1, ‖W_{(n)}‖ ≤ 1 for some n }.

31 Latent tensor nuclear norm
If we change the above set to conv{ W : rank(W_{(n)}) ≤ k, ‖W_{(n)}‖_p ≤ 1 for some n }, the induced norm becomes [Combettes et al 2016]
    ‖W‖_LNN = inf { Σ_{j=1}^m ‖σ(V_j)‖_{p,k} : Σ_{j=1}^m M_j V_j = W }
where ‖·‖_{p,k} is the (p,k)-support norm [McDonald, Pontil, Stamos 2016]. Its unit ball is
    conv{ x ∈ R^d : card(x) ≤ k, ‖x‖_p ≤ 1 }
Ongoing experiments...

32 Tensor nuclear norm
Defined by
    ‖W‖_{TNN,p} = inf { ‖λ‖_p : Σ_{k=1}^K λ_k u_k^{(1)} ⊗ ··· ⊗ u_k^{(N)} = W }
where the infimum is over K ∈ N, λ = (λ_1,...,λ_K) ∈ R^K and vectors u_k^{(n)} ∈ R^{d_n} such that ‖u_k^{(n)}‖ = 1 for all n, k.
Originally proposed and claimed to be a norm in [Lim & Comon, 2010]; however, [Friedland & Lim, 2016] show that this quantity is well defined and a norm only when p = 1, while for p > 1 the infimum is always zero, disproving the claim in the former paper.

33 Tensor nuclear norm
[Friedland & Lim, 2016] also show that computing ‖W‖_TNN is NP-hard. Assume for simplicity d_1 = ··· = d_N = d.
Claim 1. We may restrict w.l.o.g. K ≤ d^{N−1}.
Claim 2. We can rewrite the norm as
    ‖W‖_TNN = inf { Σ_{k=1}^K ∏_{n=1}^N ‖v_k^{(n)}‖ : Σ_{k=1}^K v_k^{(1)} ⊗ ··· ⊗ v_k^{(N)} = W }
where now the v_k^{(n)} ∈ R^d are not constrained to have unit norm.

34 Tensor nuclear norm
Claim 3. Let
    φ(W) = E(W) + γ ‖W‖_TNN
and
    h(V^{(1)},..., V^{(N)}) = φ(V^{(1)} ∘ ··· ∘ V^{(N)}),
where V^{(1)} ∘ ··· ∘ V^{(N)} denotes the tensor Σ_k v_k^{(1)} ⊗ ··· ⊗ v_k^{(N)} built from the columns v_k^{(n)} of the factor matrices V^{(n)}.
If (V^{(1)},..., V^{(N)}) is a local minimum of h then W = V^{(1)} ∘ ··· ∘ V^{(N)} is a local minimum of φ.
True for N = 2 (easy to show).

35 Controlling the rank of all matricizations
Let c ⊆ {1,...,N} and W_{(c, c̄)} be the matricization obtained by taking the modes in c as rows and those in the complement c̄ as columns.
Fact: if
    max_{c ⊆ {1,...,N}} rank(W_{(c, c̄)}) ≤ k
then W has a compact decomposition, called a matrix product state in quantum physics [Vidal 2003]:
    W_{i_1,...,i_N} = trace(A^1_{i_1} ··· A^N_{i_N})
where the A^n_i are k × k matrices, so we can describe the tensor with O(k² Σ_n d_n) parameters.
However, with this method we do not have control of max_n rank(W_{(n)}).
If N ≤ 6 we may still obtain a latent-norm convex relaxation.
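A minimal numpy sketch of evaluating entries of a matrix-product-state tensor in the trace form above (the helper mps_entry, the bond dimension k and the toy sizes are illustrative). Note that for this periodic (trace) form the single-mode unfoldings can have rank up to k² rather than k, one way to see the last point about the lack of direct control on max_n rank(W_{(n)}).

```python
import numpy as np

def mps_entry(cores, indices):
    """W_{i_1,...,i_N} = trace(A^1_{i_1} A^2_{i_2} ... A^N_{i_N}).
    cores[n] has shape (d_n, k, k): one k x k matrix per value of the n-th index."""
    M = cores[0][indices[0]]
    for core, i in zip(cores[1:], indices[1:]):
        M = M @ core[i]
    return np.trace(M)

# Toy MPS with N = 3 modes, dimensions d = 4 and bond dimension k = 2.
rng = np.random.default_rng(0)
d, k, N = 4, 2, 3
cores = [rng.normal(size=(d, k, k)) for _ in range(N)]
W = np.array([[[mps_entry(cores, (i, j, l)) for l in range(d)]
               for j in range(d)] for i in range(d)])
print(W.shape)                                   # (4, 4, 4)
print(np.linalg.matrix_rank(W.reshape(d, -1)))   # bounded by k**2 for the trace form
```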

36 THANK YOU!
