Algorithms for tensor approximations
1 Algorithms for tensor approximations
Lek-Heng Lim
MSRI Summer Graduate Workshop
July 7–18, 2008
2 Synopsis
Naïve: the Gauss–Seidel heuristic.
Harmonic analysis: pursuit algorithms.
Real algebraic geometry: semidefinite programming.
Riemannian geometry: Grassmann–Newton method.
3 Recap: best low rank approximation of a hypermatrix
Outer product rank: $A \in \mathbb{R}^{l \times m \times n}$. Want unit vectors $u_i \in \mathbb{R}^l$, $v_i \in \mathbb{R}^m$, $w_i \in \mathbb{R}^n$ and $\sigma_i \in \mathbb{R}$ that minimize
$$\Bigl\lVert A - \sum_{i=1}^r \sigma_i\, u_i \otimes v_i \otimes w_i \Bigr\rVert.$$
Symmetric outer product rank: $A \in S^3(\mathbb{R}^n)$. Want unit vectors $v_i$ and $\lambda_i \in \mathbb{R}$ that minimize
$$\Bigl\lVert A - \sum_{i=1}^r \lambda_i\, v_i \otimes v_i \otimes v_i \Bigr\rVert.$$
Nonnegative outer product rank: $A \in \mathbb{R}^{l \times m \times n}_+$. Want unit vectors $x_i \in \mathbb{R}^l_+$, $y_i \in \mathbb{R}^m_+$, $z_i \in \mathbb{R}^n_+$ and $\delta_i \in \mathbb{R}_+$ that minimize
$$\Bigl\lVert A - \sum_{i=1}^r \delta_i\, x_i \otimes y_i \otimes z_i \Bigr\rVert.$$
4 Recap: best low rank approximation of a hypermatrix
Multilinear rank: $A \in \mathbb{R}^{l \times m \times n}$. Want matrices $U \in \mathbb{R}^{l \times r_1}$, $V \in \mathbb{R}^{m \times r_2}$, $W \in \mathbb{R}^{n \times r_3}$ with orthonormal columns and a core $C \in \mathbb{R}^{r_1 \times r_2 \times r_3}$ that minimize
$$\lVert A - (U, V, W) \cdot C \rVert.$$
Hybrid: $A \in \mathbb{R}^{l \times m \times n}$. Want $B_1, \dots, B_r \in \mathbb{R}^{l \times m \times n}$ with $\operatorname{rank}(B_i) \le (r_1, r_2, r_3)$ and $\lVert B_i \rVert = 1$ that minimize
$$\Bigl\lVert A - \sum_{i=1}^r \sigma_i B_i \Bigr\rVert.$$
5 Gauss–Seidel method
The optimal solution $B$ to $\operatorname{argmin}_{\operatorname{rank}(B) \le r} \lVert A - B \rVert_F$ is not easy to compute since the objective function is non-convex. A widely used strategy is a nonlinear Gauss–Seidel algorithm, better known as the Alternating Least Squares (ALS) algorithm:

Algorithm: ALS for optimal rank-$r$ approximation
initialize $X^{(0)} \in \mathbb{R}^{l \times r}$, $Y^{(0)} \in \mathbb{R}^{m \times r}$, $Z^{(0)} \in \mathbb{R}^{n \times r}$; initialize $\rho^{(0)}$, $\varepsilon > 0$, $k = 0$;
while the relative decrease $(\rho^{(k)} - \rho^{(k+1)})/\rho^{(k)} > \varepsilon$:
  $X^{(k+1)} \leftarrow \operatorname{argmin}_{X \in \mathbb{R}^{l \times r}} \bigl\lVert A - \sum_{\alpha=1}^r x_\alpha \otimes y^{(k)}_\alpha \otimes z^{(k)}_\alpha \bigr\rVert_F^2$;
  $Y^{(k+1)} \leftarrow \operatorname{argmin}_{Y \in \mathbb{R}^{m \times r}} \bigl\lVert A - \sum_{\alpha=1}^r x^{(k+1)}_\alpha \otimes y_\alpha \otimes z^{(k)}_\alpha \bigr\rVert_F^2$;
  $Z^{(k+1)} \leftarrow \operatorname{argmin}_{Z \in \mathbb{R}^{n \times r}} \bigl\lVert A - \sum_{\alpha=1}^r x^{(k+1)}_\alpha \otimes y^{(k+1)}_\alpha \otimes z_\alpha \bigr\rVert_F^2$;
  $\rho^{(k+1)} \leftarrow \bigl\lVert A - \sum_{\alpha=1}^r x^{(k+1)}_\alpha \otimes y^{(k+1)}_\alpha \otimes z^{(k+1)}_\alpha \bigr\rVert_F^2$;
  $k \leftarrow k + 1$;

Coordinate cycling heuristic. May not converge.
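Each ALS update above is a linear least-squares problem in an unfolding of $A$. The following is a minimal numpy sketch of one ALS sweep, assuming the tensor is a numpy array; the helper `khatri_rao` and the variable names are illustrative, not from the slides.

```python
import numpy as np

def khatri_rao(U, V):
    # columnwise Kronecker product: column a is kron(U[:, a], V[:, a])
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def als_sweep(A, X, Y, Z):
    """One sweep of ALS for a rank-r approximation of A in R^{l x m x n}.

    Each factor update is the least-squares solution against the
    corresponding unfolding of A; a sketch, with no normalization
    or convergence safeguards.
    """
    l, m, n = A.shape
    A1 = A.reshape(l, m * n)                     # mode-1 unfolding
    A2 = A.transpose(1, 0, 2).reshape(m, l * n)  # mode-2 unfolding
    A3 = A.transpose(2, 0, 1).reshape(n, l * m)  # mode-3 unfolding
    X = np.linalg.lstsq(khatri_rao(Y, Z), A1.T, rcond=None)[0].T
    Y = np.linalg.lstsq(khatri_rao(X, Z), A2.T, rcond=None)[0].T
    Z = np.linalg.lstsq(khatri_rao(X, Y), A3.T, rcond=None)[0].T
    residual = np.linalg.norm(A1 - X @ khatri_rao(Y, Z).T) ** 2
    return X, Y, Z, residual
```

Iterating `als_sweep` until the residual stops decreasing gives the coordinate-cycling heuristic of the slide.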
6 Best $r$-term approximation
$$f \approx \alpha_1 f_1 + \alpha_2 f_2 + \cdots + \alpha_r f_r.$$
Target function $f \in H$ (vector space, cone, etc.); $f_1, \dots, f_r \in D \subseteq H$ a dictionary; $\alpha_1, \dots, \alpha_r \in \mathbb{R}$ or $\mathbb{C}$ (linear), $\mathbb{R}_+$ (convex), $\mathbb{R} \cup \{-\infty\}$ (tropical).
With respect to $\varphi : H \times H \to \mathbb{R}$, some measure of nearness between pairs of points (e.g. norms, metrics, volumes, expectation, entropy, Bregman divergences, etc.), want
$$\operatorname{argmin}\{\varphi(f, \alpha_1 f_1 + \cdots + \alpha_r f_r) \mid f_i \in D\}.$$
For concreteness: $H$ is a separable Hilbert space; the measure of nearness is a norm, but not necessarily the one induced by its inner product.
Reference: various papers by A. Cohen, R. DeVore, V. Temlyakov.
7 Recap: dictionaries
Discrete cosine: $D = \bigl\{\sqrt{2/N}\,\cos\bigl(\bigl(k+\tfrac12\bigr)\bigl(n+\tfrac12\bigr)\tfrac{\pi}{N}\bigr) \mid k \in [N-1]\bigr\} \subseteq \mathbb{C}^N$.
Taylor: $D = \{x^n \mid n \in \mathbb{N} \cup \{0\}\} \subseteq C^\omega(\mathbb{R})$.
Fourier: $D = \{\cos(nx), \sin(nx) \mid n \in \mathbb{Z}\} \subseteq L^2(-\pi, \pi)$.
Peter–Weyl: $D = \{\langle \pi(x) e_i, e_j \rangle \mid \pi \in \hat{G},\ i, j \in [d_\pi]\} \subseteq L^2(G)$.
8 Recap: dictionaries
Paley–Wiener: $D = \{\operatorname{sinc}(x - n) \mid n \in \mathbb{Z}\} \subseteq H^2(\mathbb{R})$.
Gabor: $D = \{e^{i\alpha n x} e^{-(x - m\beta)^2/2} \mid (m, n) \in \mathbb{Z} \times \mathbb{Z}\} \subseteq L^2(\mathbb{R})$.
Wavelet: $D = \{2^{n/2} \psi(2^n x - m) \mid (m, n) \in \mathbb{Z} \times \mathbb{Z}\} \subseteq L^2(\mathbb{R})$.
Friends of wavelets: $D \subseteq L^2(\mathbb{R}^2)$: beamlets, brushlets, curvelets, ridgelets, wedgelets, multiwavelets.
9 Approximants
Definition. Let $D \subseteq H$ be a dictionary. For $r \in \mathbb{N}$, the set of $r$-term approximants is
$$\Sigma_r(D) := \Bigl\{ \sum_{i=1}^r \alpha_i f_i \in H \;\Big|\; \alpha_i \in \mathbb{C},\ f_i \in D \Bigr\}.$$
Let $f \in H$. The error of $r$-term approximation is $\sigma_r(f) := \inf_{g \in \Sigma_r(D)} \lVert f - g \rVert$.
A linear combination of two $r$-term approximants may have more than $r$ non-zero terms, so $\Sigma_r(D)$ is not a subspace of $H$; hence the name nonlinear approximation, in contrast with the usual (linear) approximation $\inf_{g \in \operatorname{span}(D)} \lVert f - g \rVert$.
10 Small is beautiful
$$f \approx \sum_{f_i \in I \subseteq D} \alpha_i f_i.$$
Want a good approximation, i.e. $\lVert f - \sum_{f_i \in I} \alpha_i f_i \rVert$ small. Want a sparse/concentrated representation, i.e. $|I|$ small.
Sparsity depends on the choice of $D$. With $D_{10} = \{10^n \mid n \in \mathbb{Z}\}$ and $D_3 = \{3^n \mid n \in \mathbb{Z}\} \subseteq \mathbb{R}$:
$$\tfrac13 = [0.333\ldots]_{10} = \sum_{n=1}^\infty 3 \times 10^{-n} = [0.1]_3.$$
With $D_{\mathrm{fourier}} = \{\cos(nx), \sin(nx) \mid n \in \mathbb{Z}\}$ and $D_{\mathrm{taylor}} = \{x^n \mid n \in \mathbb{N} \cup \{0\}\}$:
$$\tfrac{x}{2} = \sin(x) - \tfrac12 \sin(2x) + \tfrac13 \sin(3x) - \cdots, \qquad \sin(x) = x - \tfrac16 x^3 + \tfrac{1}{120} x^5 - \cdots.$$
11 Bigger is better
Union of dictionaries: allows for efficient (sparse) representation of different features:
$D = D_{\mathrm{fourier}} \cup D_{\mathrm{wavelets}}$, $D = D_{\mathrm{spikes}} \cup D_{\mathrm{sinusoids}} \cup D_{\mathrm{splines}}$, $D = D_{\mathrm{wavelets}} \cup D_{\mathrm{curvelets}} \cup D_{\mathrm{beamlets}} \cup D_{\mathrm{ridgelets}}$.
Such a $D$ is called an overcomplete or redundant dictionary.
Trade-off: computational complexity. Rule of thumb: the larger and more diverse the dictionary, the more efficient/sparser the representation.
Observation: the dictionaries above are all zero-dimensional (at most countably infinite).
Question: what about dictionaries that are continuously varying families of functions?
12 Dictionaries of positive dimension
Neural networks: $D = \{\sigma(w^\top x + w_0) \in L^2(\mathbb{R}^n) \mid (w_0, w) \in \mathbb{R} \times \mathbb{R}^n\}$, where $\sigma : \mathbb{R} \to \mathbb{R}$ is a sigmoid function, e.g. $\sigma(x) = [1 + \exp(-x)]^{-1}$.
Exponential: $D = \{e^{-tx} \mid t \in \mathbb{R}_+\}$ or $D = \{e^{-\tau x} \mid \tau \in \mathbb{C}\}$.
Separable: $D = \{g \in L^2(\mathbb{R}^3) \mid g(x, y, z) = \vartheta(x)\varphi(y)\psi(z)\}$, where $\vartheta, \varphi, \psi : \mathbb{R} \to \mathbb{R}$.
Symmetric separable: $D = \{g \in L^2(\mathbb{R}^3) \mid g(x, y, z) = \varphi(x)\varphi(y)\varphi(z)\}$, where $\varphi : \mathbb{R} \to \mathbb{R}$.
13 Same thing, different names
The $r$th secant (quasiprojective) variety of the Segre variety is the set of $r$-term approximants: if $D = \operatorname{Seg}(\mathbb{R}^l, \mathbb{R}^m, \mathbb{R}^n)$, then $\Sigma_r(D) = \{A \in \mathbb{R}^{l \times m \times n} \mid \operatorname{rank}(A) \le r\}$.
Outer product decomposition: $D = \{u \otimes v \otimes w \mid (u, v, w) \in \mathbb{R}^l \times \mathbb{R}^m \times \mathbb{R}^n\} = \{A \in \mathbb{R}^{l \times m \times n} \mid \operatorname{rank}(A) \le 1\}$.
Symmetric outer product decomposition: $D = \{v \otimes v \otimes v \mid v \in \mathbb{R}^n\} = \{A \in S^3(\mathbb{R}^n) \mid \operatorname{rank}_S(A) \le 1\}$.
Nonnegative outer product decomposition: $D = \{x \otimes y \otimes z \mid (x, y, z) \in \mathbb{R}^l_+ \times \mathbb{R}^m_+ \times \mathbb{R}^n_+\} = \{A \in \mathbb{R}^{l \times m \times n}_+ \mid \operatorname{rank}_+(A) \le 1\}$.
14 Pursuit algorithms
Stepwise projection: $g_k = \operatorname{argmin}_{g \in D} \{\lVert f - h \rVert \mid h \in \operatorname{span}\{g_1, \dots, g_{k-1}, g\}\}$, then $f_k = \operatorname{proj}_{\operatorname{span}\{g_1, \dots, g_k\}}(f)$.
Orthogonal matching pursuit: $g_k = \operatorname{argmax}_{g \in D} \lvert\langle f - f_{k-1}, g \rangle\rvert$, then $f_k = \operatorname{proj}_{\operatorname{span}\{g_1, \dots, g_k\}}(f)$.
Pure greedy: $g_k = \operatorname{argmax}_{g \in D} \lvert\langle f - f_{k-1}, g \rangle\rvert$, then $f_k = f_{k-1} + \langle f - f_{k-1}, g_k \rangle g_k$.
Relaxed greedy: $g_k = \operatorname{argmin}_{g \in D} \{\lVert f - h \rVert \mid h \in \operatorname{span}\{f_{k-1}, g\}\}$, then $f_k = \alpha_k f_{k-1} + \beta_k g_k$.
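For a finite dictionary stored as the columns of a matrix, the pure greedy method (matching pursuit) takes only a few lines of numpy. A sketch, assuming the atoms are unit-norm; names are my own.

```python
import numpy as np

def matching_pursuit(f, D, r):
    """Pure greedy pursuit: select r atoms from the columns of D
    (assumed unit-norm) to approximate the vector f."""
    residual = f.copy()
    approx = np.zeros_like(f)
    for _ in range(r):
        scores = D.T @ residual          # <f - f_{k-1}, g> for every atom g
        k = np.argmax(np.abs(scores))    # best-correlated atom
        approx += scores[k] * D[:, k]
        residual = f - approx
    return approx
```

Replacing the coefficient update with a least-squares fit over all atoms selected so far turns this into orthogonal matching pursuit.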
15 Recap: hypermatrices are functions on finite sets
Totally ordered finite sets: $[n] = \{1 < 2 < \cdots < n\}$, $n \in \mathbb{N}$.
A hypermatrix (order 3) is a function $f : [l] \times [m] \times [n] \to \mathbb{R}$. If $f(i, j, k) = a_{ijk}$, then $f$ is represented by $A = [a_{ijk}]_{i,j,k=1}^{l,m,n} \in \mathbb{R}^{l \times m \times n}$.
$\ell^2([l] \times [m] \times [n]) = \ell^2([l]) \otimes \ell^2([m]) \otimes \ell^2([n])$: for $A, B \in \mathbb{R}^{l \times m \times n}$,
$$\langle A, B \rangle = \sum_{i,j,k=1}^{l,m,n} a_{ijk} b_{ijk}, \qquad \lVert A \rVert_F^2 = \sum_{i,j,k=1}^{l,m,n} a_{ijk}^2.$$
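In numpy the inner product and Frobenius norm above are elementwise sums; a one-off illustration with an arbitrary shape (not from the slides):

```python
import numpy as np

A = np.random.rand(3, 4, 5)
B = np.random.rand(3, 4, 5)
inner = np.sum(A * B)     # <A, B> = sum_{ijk} a_ijk * b_ijk
fro_sq = np.sum(A ** 2)   # ||A||_F^2
```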
16 Pursuit algorithms for tensor approximations
Tensor approximation. Target function $f : [l] \times [m] \times [n] \to \mathbb{R}$. Dictionary of separable functions,
$$D = \{g : [l] \times [m] \times [n] \to \mathbb{R} \mid g(i, j, k) = \vartheta(i)\varphi(j)\psi(k)\},$$
where $\vartheta : [l] \to \mathbb{R}$, $\varphi : [m] \to \mathbb{R}$, $\psi : [n] \to \mathbb{R}$.
Symmetric tensor approximation. Target function $f : [n] \times [n] \times [n] \to \mathbb{R}$ with $f(i, j, k) = f(j, i, k) = \cdots = f(k, j, i)$. Dictionary of symmetric separable functions,
$$D_S = \{g : [n] \times [n] \times [n] \to \mathbb{R} \mid g(i, j, k) = \vartheta(i)\vartheta(j)\vartheta(k)\},$$
where $\vartheta : [n] \to \mathbb{R}$.
17 Pursuit algorithms for tensor approximations
Nonnegative tensor approximation. Target function $f : [l] \times [m] \times [n] \to \mathbb{R}_+$. Dictionary of nonnegative separable functions,
$$D_+ = \{g : [l] \times [m] \times [n] \to \mathbb{R}_+ \mid g(i, j, k) = \vartheta(i)\varphi(j)\psi(k)\},$$
where $\vartheta : [l] \to \mathbb{R}_+$, $\varphi : [m] \to \mathbb{R}_+$, $\psi : [n] \to \mathbb{R}_+$.
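With the separable dictionary, each greedy step of slide 14 must select a (near-)best rank-1 tensor from the residual; one common way to realize the selection is a few alternating power-type iterations. A numpy sketch under that assumption, with no claim about convergence:

```python
import numpy as np

def best_rank_one(R, iters=20):
    """Approximate the dominant rank-1 term of a tensor R by
    alternating (higher-order power-type) updates."""
    l, m, n = R.shape
    x, y, z = np.ones(l), np.ones(m), np.ones(n)
    for _ in range(iters):
        x = np.einsum('ijk,j,k->i', R, y, z); x /= np.linalg.norm(x)
        y = np.einsum('ijk,i,k->j', R, x, z); y /= np.linalg.norm(y)
        z = np.einsum('ijk,i,j->k', R, x, y); z /= np.linalg.norm(z)
    sigma = np.einsum('ijk,i,j,k->', R, x, y, z)
    return sigma, x, y, z

def greedy_tensor_pursuit(A, r):
    """Pure greedy pursuit over the separable dictionary: peel off
    r approximate rank-1 terms from the residual."""
    R, terms = A.copy(), []
    for _ in range(r):
        sigma, x, y, z = best_rank_one(R)
        R = R - sigma * np.einsum('i,j,k->ijk', x, y, z)
        terms.append((sigma, x, y, z))
    return terms, R
```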
18 Some history
Let $f$ be a polynomial in the variables $x = (x_1, \dots, x_N)$. Suppose $f : \mathbb{R}^N \to \mathbb{R}$ is non-negative valued, i.e. $f(x) \ge 0$ for all $x \in \mathbb{R}^N$.
Question: can we write $f$ as a sum of squares of polynomials, $f(x) = \sum_{j=1}^M p_j(x)^2$?
Answer (Hilbert): not in general, e.g.
$$f(w, x, y, z) = w^4 + x^2 y^2 + y^2 z^2 + z^2 x^2 - 4xyzw$$
(non-negative by the AM–GM inequality applied to the first four terms, yet not a sum of squares of polynomials).
Hilbert's 17th problem: can we write $f$ as a sum of squares of rational functions,
$$f(x) = \sum_{j=1}^M \Bigl(\frac{p_j(x)}{q_j(x)}\Bigr)^2?$$
Answer (Artin): yes!
19 SDP-based algorithms
Observation 1:
$$F(x_{11}, \dots, z_{nr}) = \Bigl\lVert A - \sum_{\alpha=1}^r x_\alpha \otimes y_\alpha \otimes z_\alpha \Bigr\rVert_F^2 = \sum_{i,j,k=1}^{l,m,n} \Bigl(a_{ijk} - \sum_{\alpha=1}^r x_{i\alpha} y_{j\alpha} z_{k\alpha}\Bigr)^2$$
is a polynomial of total degree 6 (resp. $2k$ for order-$k$ tensors) in the variables $x_{11}, \dots, z_{nr}$.
Multivariate polynomial optimization: the non-convex problem $\operatorname{argmin} F(x_{11}, \dots, z_{nr})$ may be relaxed to a convex problem (so a global optimum is guaranteed), which can in turn be solved using semidefinite programming (SDP). [Lasserre; 2001], [Parrilo; 2003], [Parrilo, Sturmfels; 2003].
20 How it works
Observation 2: if $F - \lambda$ can be expressed as a sum of squares of polynomials,
$$F(x_{11}, \dots, z_{nr}) - \lambda = \sum_{i=1}^N P_i(x_{11}, \dots, z_{nr})^2,$$
then $\lambda$ is a global lower bound for $F$, i.e.
$$F(x_{11}, \dots, z_{nr}) \ge \lambda$$
for all $x_{11}, \dots, z_{nr} \in \mathbb{R}$.
Simple strategy: find the largest $\lambda$ such that $F - \lambda$ is a sum of squares. Then $\lambda$ is often $\min F(x_{11}, \dots, z_{nr})$.
21 Sketch
Write $v = (1, x_{11}, \dots, z_{nr}, \dots, x_{l1} y_{m1} z_{n1}, \dots, z_{nr}^3)$, the $D$-tuple of monomials of total degree at most 3 (so that the entries of $vv^\top$ run over monomials of total degree up to 6), where
$$D := \binom{r(l + m + n) + 3}{3}.$$
Write $F(x_{11}, \dots, z_{nr}) = \alpha^\top u$, where $u$ is the tuple of monomials of total degree at most 6 and $\alpha = (\alpha_1, \alpha_2, \dots)$ collects the coefficients of the respective monomials, $\alpha_1$ being the constant term.
Since $\deg(F)$ is even, $F$ may also be written as
$$F(x_{11}, \dots, z_{nr}) = v^\top M v$$
for some symmetric $M \in \mathbb{R}^{D \times D}$. So
$$F(x_{11}, \dots, z_{nr}) - \lambda = v^\top (M - \lambda E_{11}) v,$$
where $E_{11} = e_1 e_1^\top \in \mathbb{R}^{D \times D}$.
22 Sketch
Observation 3: the right-hand side is a sum of squares iff $M - \lambda E_{11}$ is positive semidefinite (since then $M - \lambda E_{11} = B^\top B$ and $v^\top B^\top B v = \lVert Bv \rVert^2$). Hence we have
$$\text{maximize} \quad \lambda \qquad \text{subject to} \quad v^\top (S + \lambda E_{11}) v = F, \quad S \succeq 0.$$
This is an SDP problem:
$$\text{minimize} \quad \langle 0, S \rangle - \lambda \qquad \text{subject to} \quad \langle S, B_1 \rangle + \lambda = \alpha_1, \quad \langle S, B_k \rangle = \alpha_k, \ k = 2, 3, \dots, \quad S \succeq 0, \ \lambda \in \mathbb{R},$$
with one linear constraint per monomial of $u$.
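To make the Gram-matrix construction concrete, here is a minimal single-variable instance using cvxpy (cvxpy and the example polynomial are my additions, not from the slides). We bound $F(x) = x^4 - 3x^2 + 2$ from below by the largest $\lambda$ for which $F - \lambda$ is a sum of squares in the monomial vector $v = (1, x, x^2)$.

```python
import cvxpy as cp

# F(x) = x^4 - 3x^2 + 2; c[k] is the coefficient of x^k
c = [2.0, 0.0, -3.0, 0.0, 1.0]

Q = cp.Variable((3, 3), PSD=True)   # Gram matrix S in v = (1, x, x^2)
lam = cp.Variable()

constraints = [                      # match coefficients of v^T Q v to F - lam
    Q[0, 0] == c[0] - lam,           # constant term
    2 * Q[0, 1] == c[1],             # x
    2 * Q[0, 2] + Q[1, 1] == c[2],   # x^2
    2 * Q[1, 2] == c[3],             # x^3
    Q[2, 2] == c[4],                 # x^4
]
cp.Problem(cp.Maximize(lam), constraints).solve()  # needs an SDP solver, e.g. SCS
print(lam.value)  # about -0.25 = min F, attained at x^2 = 3/2
```

For univariate polynomials nonnegativity and SOS coincide, so here the bound is exact.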
23 Properties
May be solved in polynomial time. Like all SDP-based algorithms, duality produces a certificate that tells us whether we have arrived at a globally optimal solution: the duality gap, i.e. the difference between the values of the primal and dual objective functions, is 0 at a global minimum.
Complexity: for rank-$r$ approximations to order-$k$ tensors $A \in \mathbb{R}^{d_1 \times \cdots \times d_k}$,
$$D = \binom{r(d_1 + \cdots + d_k) + k}{k}$$
is large even for moderate $d_i$, $r$ and $k$.
Sparsity to the rescue: the polynomials we are interested in are always sparse (e.g. for $k = 3$, only terms of the form $xyz$, $x^2y^2z^2$, or $uvwxyz$ appear).
24 Newton polytope
The Newton polytope of a polynomial $f$ is the convex hull of the exponent vectors of the monomials in $f$.
Example. A polynomial $f(x, y) = 3.67x^4y^{10} + \cdots$ whose monomials are $x^4y^{10}$, $x^3y^3$, $x^3$, $x^2$, and $1$ has Newton polytope the convex hull of the points $(4, 10), (3, 3), (3, 0), (2, 0), (0, 0)$ in $\mathbb{R}^2$. A polynomial $g(x, y, z) = 1.7x^4y^6z^2 + \cdots$ whose monomials are $x^4y^6z^2$, $x^3z^5$, $y^4$, and $yz^2$ has Newton polytope the convex hull of the points $(4, 6, 2), (3, 0, 5), (0, 4, 0), (0, 1, 2)$ in $\mathbb{R}^3$.
Theorem (Reznick). If $f(x) = \sum_{i=1}^m p_i(x)^2$, then the exponent vectors of the monomials in each $p_i$ must lie in $\tfrac12 \operatorname{Newton}(f)$.
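Computing a Newton polytope from the exponent vectors is just a convex-hull computation; a quick illustration with scipy (not part of the slides), using the two-variable example above:

```python
import numpy as np
from scipy.spatial import ConvexHull

# exponent vectors of the monomials of f(x, y) in the example
exponents = np.array([(4, 10), (3, 3), (3, 0), (2, 0), (0, 0)])
hull = ConvexHull(exponents)
print(exponents[hull.vertices])  # vertices of Newton(f); here only
                                 # (0,0), (3,0), (4,10) are extreme points
```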
25 Multilinear polynomial
The Newton polytope of a polynomial of the form
$$f(x_{11}, \dots, z_{nr}) = \lambda + \sum_{i,j,k=1}^{l,m,n} \Bigl(a_{ijk} - \sum_{\alpha=1}^r x_{i\alpha} y_{j\alpha} z_{k\alpha}\Bigr)^2$$
is spanned by $1$ and the monomials of the form $x_{i\alpha}^2 y_{j\alpha}^2 z_{k\alpha}^2$ (the monomials of the form $x_{i\alpha} y_{j\alpha} z_{k\alpha}$ and $x_{i\alpha} y_{j\alpha} z_{k\alpha} \cdot x_{i\beta} y_{j\beta} z_{k\beta}$ lie inside the polytope and may all be dropped as spanning points).
So if $f(x_{11}, \dots, z_{nr}) = \sum_{j=1}^N p_j(x_{11}, \dots, z_{nr})^2$, then only $1$ and monomials of the form $x_{i\alpha} y_{j\alpha} z_{k\alpha}$ may occur in $p_1, \dots, p_N$.
In other words, we have reduced the size of the problem from $\binom{r(l+m+n)+3}{3}$ to $rlmn + 1$.
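For a sense of scale, the reduction is dramatic even for small sizes (illustrative numbers, not from the slides):

```python
from math import comb

l, m, n, r = 10, 10, 10, 5
N = r * (l + m + n)            # number of variables x_11, ..., z_nr
full = comb(N + 3, 3)          # all monomials of total degree <= 3
reduced = r * l * m * n + 1    # just 1 and the monomials x_ia*y_ja*z_ka
print(full, reduced)           # 585276 vs 5001
```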
26 Global convergence
If polynomials of the form
$$\lambda + \sum_{i,j,k=1}^{l,m,n} \Bigl(a_{ijk} - \sum_{\alpha=1}^r x_{i\alpha} y_{j\alpha} z_{k\alpha}\Bigr)^2$$
can always be written as a sum of squares of polynomials (we don't know whether they can), then the SDP algorithm for optimal low-rank tensor approximation will always converge globally.
Numerical experiments performed by Parrilo on general polynomials yield $\lambda = \min F$ in all cases.
27 Best multilinear rank approximation
Given $A \in \mathbb{R}^{l \times m \times n}$, want $\operatorname{rank}(B) = (r_1, r_2, r_3)$ with
$$\min \lVert A - B \rVert_F = \min \lVert A - (X, Y, Z) \cdot C \rVert_F,$$
where $C \in \mathbb{R}^{r_1 \times r_2 \times r_3}$ and $X \in \mathbb{R}^{l \times r_1}$, $Y \in \mathbb{R}^{m \times r_2}$, $Z \in \mathbb{R}^{n \times r_3}$ have orthonormal columns.
The problem is overparameterized and equivalent to
$$\max \lVert (X^\top, Y^\top, Z^\top) \cdot A \rVert_F \quad \text{subject to} \quad X^\top X = I, \ Y^\top Y = I, \ Z^\top Z = I.$$
The problem is defined on a product of Grassmann manifolds, since
$$\lVert (X^\top, Y^\top, Z^\top) \cdot A \rVert_F = \lVert ((XQ_1)^\top, (YQ_2)^\top, (ZQ_3)^\top) \cdot A \rVert_F$$
for any $(Q_1, Q_2, Q_3) \in O(r_1) \times O(r_2) \times O(r_3)$: only the subspaces spanned by the columns of $X$, $Y$, $Z$ matter. The problem is thus reformulated as
$$\max_{(X, Y, Z) \in \operatorname{Gr}(l, r_1) \times \operatorname{Gr}(m, r_2) \times \operatorname{Gr}(n, r_3)} \Phi(X, Y, Z).$$
28 Newton and quasi-Newton algorithms on manifolds
$T_X$: tangent space at $X \in \operatorname{Gr}(n, r)$, represented as $T_X = \{\Delta \in \mathbb{R}^{n \times r} \mid \Delta^\top X = 0\}$.
1. Compute the Grassmann gradient $\nabla\Phi \in T_{(X,Y,Z)}$.
2. Compute the Hessian, or update a Hessian approximation $H : T_{(X,Y,Z)} \to T_{(X,Y,Z)}$.
3. At $(X, Y, Z) \in \operatorname{Gr}(l, r_1) \times \operatorname{Gr}(m, r_2) \times \operatorname{Gr}(n, r_3)$, solve $H\Delta = -\nabla\Phi$ for the search direction $\Delta$.
4. Update the iterate $(X, Y, Z)$: move along the geodesic from $(X, Y, Z)$ in the direction given by $\Delta$.
Optimize over a product of three (or more) Grassmannians. [Gabay; 1982], [Edelman, Arias, Smith; 1999], [Eldén, Savas; 2008].
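As a concrete first-order instance of step 1 and the update, here is a projected-gradient sketch for $\Phi(X, Y, Z) = \lVert (X^\top, Y^\top, Z^\top) \cdot A \rVert_F^2$ in numpy. It uses a QR retraction in place of the exact geodesic, so it is a simplification of the Newton scheme above, and all names are illustrative.

```python
import numpy as np

def grassmann_gradient_step(A, X, Y, Z, step=1e-2):
    """One Riemannian gradient-ascent step in X for
    Phi(X, Y, Z) = ||(X^T, Y^T, Z^T) . A||_F^2 (sketch only)."""
    # core C = A x1 X^T x2 Y^T x3 Z^T
    C = np.einsum('ijk,ia,jb,kc->abc', A, X, Y, Z)
    # Euclidean gradient w.r.t. X
    G = 2 * np.einsum('ijk,jb,kc,abc->ia', A, Y, Z, C)
    G = G - X @ (X.T @ G)              # project onto the tangent space T_X
    Q, _ = np.linalg.qr(X + step * G)  # retract back to orthonormal columns
    return Q
```

Cycling the analogous step over $X$, $Y$, $Z$ gives a simple ascent method on the product of Grassmannians.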
29 Picture
[Figure not reproduced in the transcription.]
30 Quasi-Newton and BFGS update
The BFGS update:
$$H_{k+1} = H_k - \frac{H_k s_k s_k^\top H_k}{s_k^\top H_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k},$$
where $s_k = x_{k+1} - x_k = t_k p_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$.
On the Grassmann manifold these vectors are defined at different points, belonging to different tangent spaces.
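In Euclidean coordinates the update is a one-liner; a sketch (the Grassmann version additionally requires the parallel transports described on the next slides):

```python
import numpy as np

def bfgs_update(H, s, y):
    """Standard BFGS update of the Hessian approximation H,
    with s = x_{k+1} - x_k and y = grad f_{k+1} - grad f_k."""
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)
```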
31 Different ways of parallel transporting vectors
Let $X \in \operatorname{Gr}(n, r)$, let $\Delta_1, \Delta_2 \in T_X$, and let $X(t)$ be the geodesic path along $\Delta_1$.
1. Parallel transport using global coordinates: $\Delta_2(t) = T_{\Delta_1}(t)\,\Delta_2$.
2. We also have $\Delta_1 = X_\perp D_1$ and $\Delta_2 = X_\perp D_2$, where $X_\perp$ is a basis for $T_X$. Let $X_\perp(t)$ be a basis for $T_{X(t)}$. Parallel transport using local coordinates: $\Delta_2(t) = X_\perp(t)\,D_2$.
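In global coordinates the transport operator has a closed form (Edelman, Arias, Smith). A numpy sketch, assuming $X$ has orthonormal columns and using the thin SVD $\Delta_1 = U\Sigma V^\top$; names are my own.

```python
import numpy as np

def grassmann_transport(X, D1, D2, t):
    """Parallel transport of D2 in T_X along the Grassmann geodesic
    from X in direction D1, via the Edelman-Arias-Smith formula."""
    U, s, Vt = np.linalg.svd(D1, full_matrices=False)  # D1 = U diag(s) V^T
    cosS = np.diag(np.cos(s * t))
    sinS = np.diag(np.sin(s * t))
    M = (-X @ Vt.T @ sinS + U @ cosS) @ U.T
    return M @ D2 + D2 - U @ (U.T @ D2)                # + (I - U U^T) D2
```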
32 Parallel transport in local coordinates
All transported tangent vectors have the same coordinate representation in the basis $X_\perp(t)$ at all points on the path $X(t)$.
Plus: no need to transport the gradient or the Hessian. Minus: need to compute $X_\perp(t)$.
In global coordinates we compute, with transports into $T_{k+1}$:
$$s_k = t_k T_{\Delta_k}(t_k)\, p_k, \qquad y_k = \nabla f_{k+1} - T_{\Delta_k}(t_k)\, \nabla f_k,$$
$$T_{\Delta_k}(t_k)\, H_k\, T_{\Delta_k}^{-1}(t_k) : T_{k+1} \to T_{k+1},$$
and then
$$H_{k+1} = H_k - \frac{H_k s_k s_k^\top H_k}{s_k^\top H_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}.$$
33 Limited memory BFGS
Compact representation of BFGS in Euclidean space:
$$H_k = H_0 + \begin{bmatrix} S_k & H_0 Y_k \end{bmatrix} \begin{bmatrix} R_k^{-\top}(D_k + Y_k^\top H_0 Y_k) R_k^{-1} & -R_k^{-\top} \\ -R_k^{-1} & 0 \end{bmatrix} \begin{bmatrix} S_k^\top \\ Y_k^\top H_0 \end{bmatrix},$$
where
$$S_k = [s_0, \dots, s_{k-1}], \qquad Y_k = [y_0, \dots, y_{k-1}], \qquad D_k = \operatorname{diag}\bigl[s_0^\top y_0, \dots, s_{k-1}^\top y_{k-1}\bigr],$$
$$R_k = \begin{bmatrix} s_0^\top y_0 & s_0^\top y_1 & \cdots & s_0^\top y_{k-1} \\ 0 & s_1^\top y_1 & \cdots & s_1^\top y_{k-1} \\ & & \ddots & \vdots \\ 0 & & & s_{k-1}^\top y_{k-1} \end{bmatrix}.$$
34 Limited memory BFGS
Limited memory BFGS [Byrd et al.; 1994]: replace $H_0$ by $\gamma_k I$ and keep the $m$ most recent $s_j$ and $y_j$:
$$H_k = \gamma_k I + \begin{bmatrix} S_k & \gamma_k Y_k \end{bmatrix} \begin{bmatrix} R_k^{-\top}(D_k + \gamma_k Y_k^\top Y_k) R_k^{-1} & -R_k^{-\top} \\ -R_k^{-1} & 0 \end{bmatrix} \begin{bmatrix} S_k^\top \\ \gamma_k Y_k^\top \end{bmatrix},$$
where
$$S_k = [s_{k-m}, \dots, s_{k-1}], \qquad Y_k = [y_{k-m}, \dots, y_{k-1}], \qquad D_k = \operatorname{diag}\bigl[s_{k-m}^\top y_{k-m}, \dots, s_{k-1}^\top y_{k-1}\bigr],$$
$$R_k = \begin{bmatrix} s_{k-m}^\top y_{k-m} & s_{k-m}^\top y_{k-m+1} & \cdots & s_{k-m}^\top y_{k-1} \\ 0 & s_{k-m+1}^\top y_{k-m+1} & \cdots & s_{k-m+1}^\top y_{k-1} \\ & & \ddots & \vdots \\ 0 & & & s_{k-1}^\top y_{k-1} \end{bmatrix}.$$
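The compact representation lets one apply $H_k$ to a vector without ever forming it. A numpy sketch of that matrix-vector product, with function and variable names of my own choosing:

```python
import numpy as np

def lbfgs_apply(S, Y, gamma, v):
    """Apply the compact-representation L-BFGS inverse Hessian to v.

    S, Y: n-by-m arrays whose columns are the m most recent s_j, y_j.
    gamma: scaling of the initial approximation H_0 = gamma * I.
    """
    m = S.shape[1]
    SY = S.T @ Y                       # m-by-m, entries s_i^T y_j
    R = np.triu(SY)                    # upper-triangular R_k
    D = np.diag(np.diag(SY))           # diagonal D_k
    Rinv = np.linalg.inv(R)
    top_left = Rinv.T @ (D + gamma * (Y.T @ Y)) @ Rinv
    M = np.block([[top_left, -Rinv.T],
                  [-Rinv,    np.zeros((m, m))]])
    W = np.hstack([S, gamma * Y])      # [S_k, gamma * Y_k]
    return gamma * v + W @ (M @ (W.T @ v))
```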
35 L-BFGS on the Grassmann manifold
In each iteration, parallel transport the vectors in $S_k$ and $Y_k$ to the current tangent space, i.e. perform $\bar{S}_k = TS_k$, $\bar{Y}_k = TY_k$, where $T$ is the transport matrix.
There is no need to modify $R_k$ or $D_k$: $\langle u, v \rangle = \langle Tu, Tv \rangle$, where $u, v \in T_k$ and $Tu, Tv \in T_{k+1}$.
$H_k$ is nonsingular while the Hessian is singular. This is no problem: $T_k$ at $x_k$ is an invariant subspace of $H_k$, i.e. if $v \in T_k$ then $H_k v \in T_k$. [Savas, Lim; 2008]