Conditioning and ill-posedness of tensor approximations
1 Conditioning and ill-posedness of tensor approximations, and why engineers should care
Lek-Heng Lim
MSRI Summer Graduate Workshop
July 7–18, 2008

L.-H. Lim (MSRI SGW), Conditioning and ill-posedness, July 7–18, 2008 (31 slides)
2 Slight change of plans
Week 1
Mon: Overview: tensor approximations (LH)
Wed: Conditioning and ill-posedness of tensor approximations, nonnegative and symmetric tensors, some applications (LH)
Week 2
Tue: Computations: Gauss–Seidel method, semidefinite programming, Hilbert's 17th problem, optimization on Stiefel and Grassmann manifolds (LH)
3 Tensor rank is hard to compute
Eugene L. Lawler, "The Mystical Power of Twoness": 2-SAT is easy, 3-SAT is hard; 2-dimensional matching is easy, 3-dimensional matching is hard; order-2 tensor rank is easy, order-3 tensor rank is hard.
Theorem (Håstad). Computing rank(A) for A ∈ F^{l×m×n} is NP-hard for F = Q and NP-complete for F = F_q.
Open question: Is tensor rank NP-hard/NP-complete over F = R, C in the sense of BCSS?
L. Blum, F. Cucker, M. Shub, S. Smale, Complexity and Real Computation, Springer-Verlag, New York, NY.
4 Tensor rank depends on the base field
For A ∈ R^{m×n} ⊂ C^{m×n}, rank_R(A) = rank_C(A). Not true for tensors.
Theorem (Bergman). For A ∈ R^{l×m×n} ⊂ C^{l×m×n}, rank(A) is base-field dependent.
Let x, y ∈ R^n be linearly independent and let z = x + iy. Then
A := x⊗x⊗x − x⊗y⊗y − y⊗x⊗y − y⊗y⊗x = ½(z⊗z⊗z + z̄⊗z̄⊗z̄).
One may show that rank_R(A) = 3 and rank_C(A) = 2.
R^{2×2×2} has 8 distinct orbits under GL₂(R) × GL₂(R) × GL₂(R); C^{2×2×2} has 7 distinct orbits under GL₂(C) × GL₂(C) × GL₂(C).
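Bergman's example is easy to check numerically. The sketch below (variable names x, y, z follow the slide; the helper `outer3` is mine) verifies that the real combination with three extra rank-1 terms equals ½(z⊗z⊗z + z̄⊗z̄⊗z̄), which is visibly a sum of only two complex rank-1 terms.

```python
import numpy as np

def outer3(a, b, c):
    # Order-3 outer product a ⊗ b ⊗ c.
    return np.einsum('i,j,k->ijk', a, b, c)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)  # generically independent

# Real expression: rank 3 over R when x, y are linearly independent.
A = outer3(x, x, x) - outer3(x, y, y) - outer3(y, x, y) - outer3(y, y, x)

# Complex expression with only two rank-1 terms.
z = x + 1j * y
B = 0.5 * (outer3(z, z, z) + outer3(z.conj(), z.conj(), z.conj()))

assert np.allclose(B.imag, 0) and np.allclose(A, B.real)
```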
5 Recall: fundamental problem of multiway data analysis
A: a hypermatrix, symmetric hypermatrix, or nonnegative hypermatrix. Want argmin_{rank(B) ≤ r} ‖A − B‖.
rank(B) may be the outer-product rank, multilinear rank, symmetric rank (for a symmetric hypermatrix), or nonnegative rank (for a nonnegative hypermatrix).
Example. Given A ∈ R^{d₁×d₂×d₃}, find σ_i, u_i, v_i, w_i, i = 1, …, r, that minimize
‖A − σ₁ u₁⊗v₁⊗w₁ − σ₂ u₂⊗v₂⊗w₂ − ⋯ − σ_r u_r⊗v_r⊗w_r‖,
or C ∈ R^{r₁×r₂×r₃} and U ∈ R^{d₁×r₁}, V ∈ R^{d₂×r₂}, W ∈ R^{d₃×r₃} that minimize ‖A − (U, V, W)·C‖.
May assume the u_i, v_i, w_i are unit vectors and U, V, W have orthonormal columns.
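The multilinear-rank residual ‖A − (U, V, W)·C‖ takes only a couple of einsum calls to compute. A minimal sketch, with made-up sizes, using a one-pass HOSVD-style truncation (leading left singular vectors of the three unfoldings) rather than the true minimizer:

```python
import numpy as np

def multilinear_mult(C, U, V, W):
    # (U, V, W) · C : apply U, V, W along modes 1, 2, 3 of the core C.
    return np.einsum('abc,ia,jb,kc->ijk', C, U, V, W)

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 5, 6))
r1, r2, r3 = 2, 2, 2

# Leading left singular vectors of each unfolding (HOSVD truncation).
U = np.linalg.svd(A.reshape(4, -1))[0][:, :r1]
V = np.linalg.svd(A.transpose(1, 0, 2).reshape(5, -1))[0][:, :r2]
W = np.linalg.svd(A.transpose(2, 0, 1).reshape(6, -1))[0][:, :r3]

C = np.einsum('ijk,ia,jb,kc->abc', A, U, V, W)   # core = (Uᵀ, Vᵀ, Wᵀ)·A
residual = np.linalg.norm(A - multilinear_mult(C, U, V, W))
```

Since the truncation is an orthogonal projection along each mode, the residual is never larger than ‖A‖_F.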
6 Recall: fundamental problem of multiway data analysis
Example. Given A ∈ S^k(C^n), find λ_i, u_i, i = 1, …, r, that minimize
‖A − λ₁ u₁^⊗k − λ₂ u₂^⊗k − ⋯ − λ_r u_r^⊗k‖,
or C ∈ R^{r×r×r} and U ∈ R^{n×r} that minimize ‖A − (U, U, U)·C‖.
May assume the u_i are unit vectors and U has orthonormal columns.
Pierre's lectures in Week 2.
7 Best low-rank approximation of a matrix
Given A ∈ R^{m×n}, want argmin_{rank(B) ≤ r} ‖A − B‖. More precisely, find σ_i, u_i, v_i, i = 1, …, r, that minimize
‖A − σ₁ u₁⊗v₁ − σ₂ u₂⊗v₂ − ⋯ − σ_r u_r⊗v_r‖.
Theorem (Eckart–Young). Let A = UΣVᵀ = Σ_{i=1}^{rank(A)} σ_i u_i⊗v_i be the singular value decomposition. For r ≤ rank(A), let A_r := Σ_{i=1}^r σ_i u_i⊗v_i. Then
‖A − A_r‖_F = min_{rank(B) ≤ r} ‖A − B‖_F.
No such thing for hypermatrices of order 3 or higher.
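In the matrix case the minimizer is explicit, which a few lines of NumPy confirm: the rank-r SVD truncation achieves error equal to the tail of the singular values, and no other low-rank candidate does better.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))
r = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]      # Eckart–Young optimum

# Optimal Frobenius error = sqrt of the sum of squared tail singular values.
err = np.linalg.norm(A - A_r)
assert np.isclose(err, np.sqrt(np.sum(s[r:] ** 2)))

# A random rank-2 competitor cannot beat it.
B = rng.standard_normal((6, r)) @ rng.standard_normal((r, 5))
assert np.linalg.norm(A - B) >= err
```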
8 Lemma
Let r ≥ 2 and k ≥ 3. Given the norm topology on R^{d₁×⋯×d_k}, the following statements are equivalent:
1. The set S_r(d₁, …, d_k) := {A | rank(A) ≤ r} is not closed.
2. There exists a sequence A_n with rank(A_n) ≤ r, n ∈ N, converging to B with rank(B) > r.
3. There exists B with rank(B) > r that may be approximated arbitrarily closely by hypermatrices of strictly lower rank, i.e. inf{‖B − A‖ | rank(A) ≤ r} = 0.
4. There exists C with rank(C) > r that does not have a best rank-r approximation, i.e. inf{‖C − A‖ | rank(A) ≤ r} is not attained (by any A with rank(A) ≤ r).
9 Non-existence of best low-rank approximation
For x_i, y_i ∈ R^{d_i}, i = 1, 2, 3, let
A := x₁⊗x₂⊗y₃ + x₁⊗y₂⊗x₃ + y₁⊗x₂⊗x₃
and, for n ∈ N,
A_n := n(x₁ + (1/n)y₁) ⊗ (x₂ + (1/n)y₂) ⊗ (x₃ + (1/n)y₃) − n x₁⊗x₂⊗x₃.
Lemma. rank(A) = 3 iff x_i, y_i are linearly independent, i = 1, 2, 3. Furthermore, it is clear that rank(A_n) ≤ 2 and lim_{n→∞} A_n = A.
Original result, in a slightly different form, due to: D. Bini, G. Lotti, F. Romani, "Approximate solutions for the bilinear form computational problem," SIAM J. Comput., 9 (1980), no. 4.
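This failure is easy to observe in floating point. With random (hence almost surely linearly independent) vectors, the rank-≤2 sequence A_n drives the error to the rank-3 limit A like 1/n, so the infimum 0 is approached but never attained:

```python
import numpy as np

rng = np.random.default_rng(3)
x = [rng.standard_normal(2) for _ in range(3)]
y = [rng.standard_normal(2) for _ in range(3)]

def outer3(a, b, c):
    return np.einsum('i,j,k->ijk', a, b, c)

A = outer3(x[0], x[1], y[2]) + outer3(x[0], y[1], x[2]) + outer3(y[0], x[1], x[2])

def A_n(n):
    # Difference of two rank-1 terms, so rank(A_n) <= 2.
    return n * outer3(x[0] + y[0] / n, x[1] + y[1] / n, x[2] + y[2] / n) \
         - n * outer3(x[0], x[1], x[2])

errs = [np.linalg.norm(A - A_n(n)) for n in (1, 10, 100, 1000)]
# errs shrinks roughly like 1/n and never reaches 0.
```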
10 Outer-product approximations are ill-behaved
Such phenomena can and will happen for all orders > 2, all norms, and many ranks:
Theorem. Let k ≥ 3 and d₁, …, d_k ≥ 2. For any s such that 2 ≤ s ≤ min{d₁, …, d_k}, there exists A ∈ R^{d₁×⋯×d_k} with rank(A) = s such that A has no best rank-r approximation for some r < s. The result is independent of the choice of norms.
For matrices, the quantity min{d₁, d₂} is the maximal possible rank in R^{d₁×d₂}. In general, a hypermatrix in R^{d₁×⋯×d_k} can have rank exceeding min{d₁, …, d_k}.
Vin's next three lectures.
11 Outer-product approximations are ill-behaved
Tensor rank can jump over an arbitrarily large gap:
Theorem. Let k ≥ 3. Given any s ∈ N, there exists a sequence of order-k hypermatrices A_n such that rank(A_n) ≤ r and lim_{n→∞} A_n = A with rank(A) = r + s.
Hypermatrices that fail to have best low-rank approximations are not rare. This may occur with non-zero probability, sometimes with certainty:
Theorem. Let μ be a measure that is positive or infinite on Euclidean open sets in R^{l×m×n}. Then there exists some r ∈ N such that
μ({A | A does not have a best rank-r approximation}) > 0.
In R^{2×2×2}, all rank-3 hypermatrices fail to have a best rank-2 approximation.
12 Message
That the best rank-r approximation problem for hypermatrices has no solution poses serious difficulties. It is incorrect to think that this does not matter if we just want an approximate solution. If there is no solution in the first place, then what is it that we are trying to approximate? In other words, what is the approximate solution an approximation of?
13 Weak solutions
For a hypermatrix A that has no best rank-r approximation, we will call a C in the closure of {A | rank(A) ≤ r} attaining inf{‖C − A‖ | rank(A) ≤ r} a weak solution. In particular, we must have rank(C) > r.
It is perhaps surprising that one may completely parameterize all limit points of order-3 rank-2 hypermatrices.
14 Weak solutions
Theorem. Let d₁, d₂, d₃ ≥ 2, and let A_n ∈ R^{d₁×d₂×d₃} be a sequence of hypermatrices with rank(A_n) ≤ 2 and lim_{n→∞} A_n = A, where the limit is taken in any norm topology. If the limiting hypermatrix A has rank higher than 2, then rank(A) must be exactly 3 and there exist pairs of linearly independent vectors x₁, y₁ ∈ R^{d₁}, x₂, y₂ ∈ R^{d₂}, x₃, y₃ ∈ R^{d₃} such that
A = x₁⊗x₂⊗y₃ + x₁⊗y₂⊗x₃ + y₁⊗x₂⊗x₃.
In particular, a sequence of order-3 rank-2 hypermatrices cannot jump rank by more than 1.
Details: Vin's lectures. Not possible in general: JM's lectures.
15 Conditioning of linear systems
Let A ∈ R^{n×n} and b ∈ R^n. Suppose we want to solve the system of linear equations Ax = b.
M = {A ∈ R^{n×n} | det(A) = 0} is the manifold of ill-posed problems. A ∈ M iff Ax = 0 has nontrivial solutions. Note that det(A) is a poor measure of conditioning.
Conditioning is the inverse distance to ill-posedness [Demmel, 1987] (also Dedieu, Shub, Smale), i.e. 1/‖A⁻¹‖₂. Normalizing by ‖A‖₂ yields the condition number:
1/(‖A‖₂ ‖A⁻¹‖₂) = 1/κ₂(A).
Note that
dist(A, M) = σ_n = min_{x_i, y_i} ‖A − x₁⊗y₁ − ⋯ − x_{n−1}⊗y_{n−1}‖₂.
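The matrix identity dist(A, M) = σ_n = 1/‖A⁻¹‖₂ is straightforward to confirm numerically; dropping the last singular triple yields a singular (rank n−1) matrix at exactly that distance:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))

U, s, Vt = np.linalg.svd(A)
# Distance to the nearest singular matrix = smallest singular value = 1/||A^{-1}||_2.
dist = s[-1]
assert np.isclose(dist, 1.0 / np.linalg.norm(np.linalg.inv(A), 2))

# The nearest rank-(n-1) matrix realizes it: drop the last singular triple.
B = U[:, :-1] @ np.diag(s[:-1]) @ Vt[:-1, :]
assert np.isclose(np.linalg.norm(A - B, 2), dist)
assert np.linalg.matrix_rank(B) == 4
```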
16 Conditioning of linear systems
Important for error analysis [Wilkinson, 1961]. Let A = UΣVᵀ, let x solve Ax = b, and define
S_forward(ε) = {x′ ∈ R^n | ‖x′ − x‖₂ ≤ ε} = {x′ ∈ R^n | Σ_{i=1}^n |x′_i − x_i|² ≤ ε²},
S_backward(ε) = {x′ ∈ R^n | Ax′ = b′, ‖b′ − b‖₂ ≤ ε} = {x′ ∈ R^n | x′ − x = V(y′ − y), Σ_{i=1}^n σ_i² |y′_i − y_i|² ≤ ε²}.
Then
S_backward(ε) ⊆ S_forward(σ_n⁻¹ ε),  S_forward(ε) ⊆ S_backward(σ₁ ε).
Determined by σ₁ = ‖A‖₂ and σ_n⁻¹ = ‖A⁻¹‖₂.
Rule of thumb: log₁₀ κ₂(A) ≈ loss in number of digits of precision.
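The digit-loss rule of thumb can be illustrated on a classically ill-conditioned example; the Hilbert matrix below is my choice for the demo, not from the slide:

```python
import numpy as np

# Hilbert matrix: H[i, j] = 1/(i + j + 1), notoriously ill-conditioned.
n = 10
H = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n))
x_true = np.ones(n)
b = H @ x_true

x_hat = np.linalg.solve(H, b)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
kappa = np.linalg.cond(H, 2)
# Rule of thumb: of the ~16 decimal digits of double precision, roughly
# log10(kappa) ~ 13 are lost here, leaving only a few correct digits.
```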
17 What about multilinear systems?
Look at the simplest case. Take A = (a_ijk) ∈ R^{2×2×2} and b₀, b₁, b₂ ∈ R²:
a000 x0 y0 + a010 x0 y1 + a100 x1 y0 + a110 x1 y1 = b00,
a001 x0 y0 + a011 x0 y1 + a101 x1 y0 + a111 x1 y1 = b01,
a000 x0 z0 + a001 x0 z1 + a100 x1 z0 + a101 x1 z1 = b10,
a010 x0 z0 + a011 x0 z1 + a110 x1 z0 + a111 x1 z1 = b11,
a000 y0 z0 + a001 y0 z1 + a010 y1 z0 + a011 y1 z1 = b20,
a100 y0 z0 + a101 y0 z1 + a110 y1 z0 + a111 y1 z1 = b21.
When does this have a solution? What is the corresponding manifold of ill-posed problems? When does the homogeneous system, i.e. b₀ = b₁ = b₂ = 0, have a non-trivial solution, i.e. x ≠ 0, y ≠ 0, z ≠ 0?
18 2×2×2 hyperdeterminant
Writing A₀ = [a000 a010; a001 a011] and A₁ = [a100 a110; a101 a111] for the two slices, the hyperdeterminant of A = (a_ijk) ∈ R^{2×2×2} [Cayley, 1845] is
Det_{2,2,2}(A) = ¼ [det(A₀ + A₁) − det(A₀ − A₁)]² − 4 det(A₀) det(A₁).
A result that parallels the matrix case is the following: the system of bilinear equations
a000 x0 y0 + a010 x0 y1 + a100 x1 y0 + a110 x1 y1 = 0,
a001 x0 y0 + a011 x0 y1 + a101 x1 y0 + a111 x1 y1 = 0,
a000 x0 z0 + a001 x0 z1 + a100 x1 z0 + a101 x1 z1 = 0,
a010 x0 z0 + a011 x0 z1 + a110 x1 z0 + a111 x1 z1 = 0,
a000 y0 z0 + a001 y0 z1 + a010 y1 z0 + a011 y1 z1 = 0,
a100 y0 z0 + a101 y0 z1 + a110 y1 z0 + a111 y1 z1 = 0,
has a non-trivial solution iff Det_{2,2,2}(A) = 0.
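The slice formula is a one-liner to implement. The sketch below (helper name `hyperdet_222` is mine) spot-checks the three sign regimes on standard examples: a rank-2 "diagonal" tensor (Det > 0), the rank-3 limit tensor from the non-existence example (Det = 0), and Bergman's real-rank-3 tensor (Det < 0). Note det is transpose-invariant, so the row/column convention of the slices does not affect the value.

```python
import numpy as np

def hyperdet_222(A):
    # Cayley hyperdeterminant of a 2x2x2 array via its two slices A[0], A[1]:
    # 1/4 (det(A0+A1) - det(A0-A1))^2 - 4 det(A0) det(A1).
    A0, A1, d = A[0], A[1], np.linalg.det
    return 0.25 * (d(A0 + A1) - d(A0 - A1)) ** 2 - 4.0 * d(A0) * d(A1)

E = np.zeros((2, 2, 2)); E[0, 0, 0] = E[1, 1, 1] = 1          # rank 2
W = np.zeros((2, 2, 2)); W[0, 0, 1] = W[0, 1, 0] = W[1, 0, 0] = 1  # rank 3, on the boundary
G = np.zeros((2, 2, 2))                                        # Bergman: real rank 3
G[0, 0, 0] = 1; G[0, 1, 1] = G[1, 0, 1] = G[1, 1, 0] = -1

assert hyperdet_222(E) > 0             # Det > 0
assert np.isclose(hyperdet_222(W), 0)  # Det = 0
assert hyperdet_222(G) < 0             # Det < 0
```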
19 2×2×3 hyperdeterminant
The hyperdeterminant of A = (a_ijk) ∈ R^{2×2×3} is
Det_{2,2,3}(A) = det[a000 a001 a002; a100 a101 a102; a010 a011 a012] · det[a100 a101 a102; a010 a011 a012; a110 a111 a112]
− det[a000 a001 a002; a100 a101 a102; a110 a111 a112] · det[a000 a001 a002; a010 a011 a012; a110 a111 a112].
Again, the following is true: the system
a000 x0 y0 + a010 x0 y1 + a100 x1 y0 + a110 x1 y1 = 0,
a001 x0 y0 + a011 x0 y1 + a101 x1 y0 + a111 x1 y1 = 0,
a002 x0 y0 + a012 x0 y1 + a102 x1 y0 + a112 x1 y1 = 0,
a000 x0 z0 + a001 x0 z1 + a002 x0 z2 + a100 x1 z0 + a101 x1 z1 + a102 x1 z2 = 0,
a010 x0 z0 + a011 x0 z1 + a012 x0 z2 + a110 x1 z0 + a111 x1 z1 + a112 x1 z2 = 0,
a000 y0 z0 + a001 y0 z1 + a002 y0 z2 + a010 y1 z0 + a011 y1 z1 + a012 y1 z2 = 0,
a100 y0 z0 + a101 y0 z1 + a102 y0 z2 + a110 y1 z0 + a111 y1 z1 + a112 y1 z2 = 0,
has a non-trivial solution iff Det_{2,2,3}(A) = 0.
20 Cayley hyperdeterminant and tensor rank
The Cayley hyperdeterminant Det_{2,2,2} may be extended to any A ∈ R^{d₁×d₂×d₃} with multilinear rank at most (2, 2, 2).
Theorem. Let d₁, d₂, d₃ ≥ 2. A ∈ R^{d₁×d₂×d₃} is a weak solution, i.e.
A = x₁⊗x₂⊗y₃ + x₁⊗y₂⊗x₃ + y₁⊗x₂⊗x₃,
iff Det_{2,2,2}(A) = 0.
Theorem (Kruskal). Let A ∈ R^{2×2×2}. Then rank(A) = 2 if Det_{2,2,2}(A) > 0 and rank(A) = 3 if Det_{2,2,2}(A) < 0.
Vin's next three lectures.
21 Condition number of a multilinear system
Like the matrix determinant, the value of the hyperdeterminant is a poor measure of conditioning. Need to compute the distance to M.
Theorem. Let A ∈ R^{2×2×2}. Det_{2,2,2}(A) = 0 iff
A = x₁⊗x₂⊗y₃ + x₁⊗y₂⊗x₃ + y₁⊗x₂⊗x₃
for some x_i, y_i ∈ R², i = 1, 2, 3.
The conditioning of the problem can be obtained from
min_{x_i, y_i} ‖A − x₁⊗x₂⊗y₃ − x₁⊗y₂⊗x₃ − y₁⊗x₂⊗x₃‖.
x₁⊗x₂⊗y₃ + x₁⊗y₂⊗x₃ + y₁⊗x₂⊗x₃ has outer-product rank 3 generically (in fact, iff the x_i, y_i are linearly independent). Surprising: the manifold of ill-posed problems has full rank almost everywhere!
22 Nonnegative hypermatrices and nonnegative tensor rank
Let 0 ≤ A ∈ R^{l×m×n}. The nonnegative rank of A is
rank₊(A) := min{ r | A = Σ_{p=1}^r x_p⊗y_p⊗z_p, x_p, y_p, z_p ≥ 0 }.
Clearly a nonnegative decomposition exists for any A ≥ 0. Arises in the naïve Bayes model and Gaussian mixture models.
Theorem. Let A = (a_ijk) ∈ R^{l×m×n} be nonnegative. Then
inf{ ‖A − Σ_{p=1}^r x_p⊗y_p⊗z_p‖ | x_p, y_p, z_p ≥ 0 }
is always attained.
23 Nonnegative matrix factorization
D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, 401 (1999).
Main idea behind NMF (everything else is fluff): the way dictionary functions combine to build target objects is an exclusively additive process and should not involve any cancellations between the dictionary functions.
NMF in a nutshell: given a nonnegative matrix A, decompose it into a sum of outer products of nonnegative vectors: A = XY = Σ_{i=1}^r x_i⊗y_i.
Noisy situation: approximate A by a sum of outer products of nonnegative vectors,
min_{X ≥ 0, Y ≥ 0} ‖A − XY‖_F = min_{x_i ≥ 0, y_i ≥ 0} ‖A − Σ_{i=1}^r x_i⊗y_i‖_F.
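The noisy problem is commonly attacked with Lee and Seung's multiplicative updates, which preserve nonnegativity by construction. A minimal sketch (500 fixed iterations, no convergence test; the helper name and tolerances are mine), run on data that is exactly nonnegative-rank 2:

```python
import numpy as np

def nmf_multiplicative(A, r, iters=500, eps=1e-9, seed=0):
    # Lee–Seung multiplicative updates for min ||A - X Y||_F,
    # X (l x r) >= 0, Y (r x m) >= 0. A sketch, not production code.
    rng = np.random.default_rng(seed)
    l, m = A.shape
    X = rng.random((l, r)) + eps
    Y = rng.random((r, m)) + eps
    for _ in range(iters):
        Y *= (X.T @ A) / (X.T @ X @ Y + eps)   # update keeps Y >= 0
        X *= (A @ Y.T) / (X @ Y @ Y.T + eps)   # update keeps X >= 0
    return X, Y

rng = np.random.default_rng(1)
A = rng.random((6, 2)) @ rng.random((2, 7))    # exactly nonnegative-rank 2
X, Y = nmf_multiplicative(A, r=2)
rel = np.linalg.norm(A - X @ Y) / np.linalg.norm(A)
```

The objective is nonincreasing under these updates, but they may stall at a local minimum, which is why only a loose tolerance on `rel` is reasonable.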
24 Generalizing to hypermatrices
A nonnegative outer-product decomposition for a hypermatrix A ≥ 0 is
A = Σ_{p=1}^r x_p⊗y_p⊗z_p, where x_p ∈ R^l₊, y_p ∈ R^m₊, z_p ∈ R^n₊.
Clear that such a decomposition exists for any A ≥ 0. Nonnegative outer-product rank: the minimal r for which such a decomposition is possible.
Best nonnegative outer-product rank-r approximation:
argmin{ ‖A − Σ_{p=1}^r x_p⊗y_p⊗z_p‖_F | x_p, y_p, z_p ≥ 0 }.
25 Recap: outer product decomposition in spectroscopy
Application to fluorescence spectral analysis by [Bro, 1997]. Specimens with a number of pure substances in different concentrations:
a_ijk = fluorescence emission intensity of the ith sample at wavelength λ_j^em, excited with light at wavelength λ_k^ex.
Get 3-way data A = (a_ijk) ∈ R^{l×m×n} and an outer-product decomposition of A,
A = x₁⊗y₁⊗z₁ + ⋯ + x_r⊗y_r⊗z_r.
Get the true chemical factors responsible for the data:
r: number of pure substances in the mixtures;
x_p = (x_1p, …, x_lp): relative concentrations of the pth substance in specimens 1, …, l;
y_p = (y_1p, …, y_mp): excitation spectrum of the pth substance;
z_p = (z_1p, …, z_np): emission spectrum of the pth substance.
Noisy case: find the best rank-r approximation (CANDECOMP/PARAFAC).
26 Proof
Naive choice of objective: g : (R^l × R^m × R^n)^r → R,
g(x₁, y₁, z₁, …, x_r, y_r, z_r) := ‖A − Σ_{p=1}^r x_p⊗y_p⊗z_p‖²_F.
Need to show g attains its infimum on (R^l₊ × R^m₊ × R^n₊)^r. This does not work directly because of an additional degree of freedom: x_i, y_i, z_i may be scaled by positive scalars whose product is 1,
αx ⊗ βy ⊗ γz = x⊗y⊗z, αβγ = 1.
E.g. (nx) ⊗ y ⊗ (z/n) can have a diverging loading vector even while the outer product remains fixed.
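The scaling indeterminacy is worth seeing concretely: the loading vectors can be pushed to infinity without moving the tensor at all, so level sets of the naive objective are unbounded.

```python
import numpy as np

# (αx) ⊗ (βy) ⊗ (γz) = x ⊗ y ⊗ z whenever αβγ = 1.
rng = np.random.default_rng(5)
x, y, z = rng.random(3), rng.random(4), rng.random(5)
T = np.einsum('i,j,k->ijk', x, y, z)

for n in (10.0, 1e6, 1e12):
    Tn = np.einsum('i,j,k->ijk', n * x, y, z / n)   # α = n, β = 1, γ = 1/n
    assert np.allclose(Tn, T)
# Yet ||n x|| → ∞: the set of minimizing loadings is unbounded
# until the u, v, w are normalized and a separate λ is introduced.
```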
27 Picking the right objective function
Define f : R^r × (R^l × R^m × R^n)^r → R by
f(X) := ‖A − Σ_{p=1}^r λ_p u_p⊗v_p⊗w_p‖²_F,
where X = (λ₁, …, λ_r; u₁, v₁, w₁, …, u_r, v_r, w_r). Let S^{n−1}₊ := {x ∈ R^n₊ | ‖x‖₂ = 1} and
P := R^r₊ × (S^{l−1}₊ × S^{m−1}₊ × S^{n−1}₊)^r.
A global minimizer of f on P, (λ₁, …, λ_r; u₁, v₁, w₁, …, u_r, v_r, w_r) ∈ P, gives the required global minimizer (non-unique).
28 Level sets are compact
Note P is closed but unbounded. Will show that the level set of f restricted to P,
E_α = {X ∈ P | f(X) ≤ α},
is compact for all α. E_α = P ∩ f⁻¹(−∞, α] is closed since f is continuous. Now to show E_α is bounded. Suppose not: there exist {X_n}_{n=1}^∞ ⊆ P with ‖X_n‖ → ∞ but f(X_n) ≤ α for all n. Clearly, ‖X_n‖ → ∞ implies λ_q^{(n)} → ∞ for at least one q ∈ {1, …, r}. Note
f(X) ≥ (‖Σ_{p=1}^r λ_p u_p⊗v_p⊗w_p‖_F − ‖A‖_F)².
29 Using nonnegativity
Taking X ≥ 0 into account,
‖Σ_{p=1}^r λ_p u_p⊗v_p⊗w_p‖²_F = Σ_{i,j,k=1}^{l,m,n} (Σ_{p=1}^r λ_p u_pi v_pj w_pk)²
≥ Σ_{i,j,k=1}^{l,m,n} (λ_q u_qi v_qj w_qk)²
= λ_q² Σ_{i,j,k=1}^{l,m,n} (u_qi v_qj w_qk)² = λ_q² ‖u_q⊗v_q⊗w_q‖²_F = λ_q²,
since ‖u_q‖ = ‖v_q‖ = ‖w_q‖ = 1. Hence, as λ_q^{(n)} → ∞, f(X_n) → ∞. This contradicts f(X_n) ≤ α for all n.
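The key inequality in this step, which holds precisely because every summand is nonnegative so dropping terms can only decrease the entrywise sums, checks out numerically:

```python
import numpy as np

# For nonnegative loadings with unit u_p, v_p, w_p:
# ||sum_p λ_p u_p ⊗ v_p ⊗ w_p||_F^2 >= λ_q^2 for every q.
rng = np.random.default_rng(6)
r, l, m, n = 3, 4, 5, 6
lam = rng.random(r) * 10
U = rng.random((r, l)); U /= np.linalg.norm(U, axis=1, keepdims=True)
V = rng.random((r, m)); V /= np.linalg.norm(V, axis=1, keepdims=True)
W = rng.random((r, n)); W /= np.linalg.norm(W, axis=1, keepdims=True)

T = np.einsum('p,pi,pj,pk->ijk', lam, U, V, W)
sq = np.linalg.norm(T) ** 2
assert all(sq >= lam[q] ** 2 - 1e-12 for q in range(r))
```

With sign changes allowed the inequality fails in general, which is exactly why the nonnegative problem is attained while the unconstrained one need not be.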
30 Symmetric hypermatrices for blind source separation
Problem. Given y = Mx + n. Unknown: source vector x ∈ C^n, mixing matrix M ∈ C^{m×n}, noise n ∈ C^m. Known: observation vector y ∈ C^m. Goal: recover x from y.
Assumptions: (1) the components of x are statistically independent; (2) M has full column rank; (3) n is Gaussian.
Method: use cumulants,
κ_k(y) = (M, M, …, M)·κ_k(x) + κ_k(n).
By the assumptions, κ_k(n) = 0 (for k > 2) and κ_k(x) is diagonal. So we need to diagonalize the symmetric hypermatrix κ_k(y).
Pierre's lectures in Week 2.
31 Diagonalizing a symmetric hypermatrix
Example. A best symmetric low-rank approximation may not exist either. Let x, y ∈ R^n be linearly independent. Define, for m ∈ N,
A_m := m(x + (1/m)y)^⊗k − m x^⊗k
and
A := y⊗x⊗⋯⊗x + x⊗y⊗x⊗⋯⊗x + ⋯ + x⊗⋯⊗x⊗y.
Then rank_S(A_m) ≤ 2, rank_S(A) = k, and lim_{m→∞} A_m = A.
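For k = 3 this symmetric analogue of the earlier non-existence example is easy to check numerically: the symmetric-rank-≤2 sequence A_m approaches the symmetric-rank-3 limit like 1/m without ever reaching it.

```python
import numpy as np

rng = np.random.default_rng(7)
x, y = rng.standard_normal(4), rng.standard_normal(4)

def sym3(a):
    # a^⊗3
    return np.einsum('i,j,k->ijk', a, a, a)

def o3(a, b, c):
    return np.einsum('i,j,k->ijk', a, b, c)

# Limit: y in one slot, x in the other two (symmetric rank 3 for independent x, y).
A = o3(y, x, x) + o3(x, y, x) + o3(x, x, y)
errs = [np.linalg.norm(m * sym3(x + y / m) - m * sym3(x) - A)
        for m in (1, 10, 100, 1000)]
# errs shrinks roughly like 1/m; the infimum 0 is never attained.
```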
More informationLecture 2: Linear Algebra
Lecture 2: Linear Algebra Rajat Mittal IIT Kanpur We will start with the basics of linear algebra that will be needed throughout this course That means, we will learn about vector spaces, linear independence,
More informationInverse Theory. COST WaVaCS Winterschool Venice, February Stefan Buehler Luleå University of Technology Kiruna
Inverse Theory COST WaVaCS Winterschool Venice, February 2011 Stefan Buehler Luleå University of Technology Kiruna Overview Inversion 1 The Inverse Problem 2 Simple Minded Approach (Matrix Inversion) 3
More informationVector Spaces, Orthogonality, and Linear Least Squares
Week Vector Spaces, Orthogonality, and Linear Least Squares. Opening Remarks.. Visualizing Planes, Lines, and Solutions Consider the following system of linear equations from the opener for Week 9: χ χ
More informationAM 205: lecture 6. Last time: finished the data fitting topic Today s lecture: numerical linear algebra, LU factorization
AM 205: lecture 6 Last time: finished the data fitting topic Today s lecture: numerical linear algebra, LU factorization Unit II: Numerical Linear Algebra Motivation Almost everything in Scientific Computing
More informationSymmetric Matrices and Eigendecomposition
Symmetric Matrices and Eigendecomposition Robert M. Freund January, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 Symmetric Matrices and Convexity of Quadratic Functions
More informationReview and problem list for Applied Math I
Review and problem list for Applied Math I (This is a first version of a serious review sheet; it may contain errors and it certainly omits a number of topic which were covered in the course. Let me know
More informationSYLLABUS. 1 Linear maps and matrices
Dr. K. Bellová Mathematics 2 (10-PHY-BIPMA2) SYLLABUS 1 Linear maps and matrices Operations with linear maps. Prop 1.1.1: 1) sum, scalar multiple, composition of linear maps are linear maps; 2) L(U, V
More informationLECTURE NOTES ELEMENTARY NUMERICAL METHODS. Eusebius Doedel
LECTURE NOTES on ELEMENTARY NUMERICAL METHODS Eusebius Doedel TABLE OF CONTENTS Vector and Matrix Norms 1 Banach Lemma 20 The Numerical Solution of Linear Systems 25 Gauss Elimination 25 Operation Count
More informationSTAT200C: Review of Linear Algebra
Stat200C Instructor: Zhaoxia Yu STAT200C: Review of Linear Algebra 1 Review of Linear Algebra 1.1 Vector Spaces, Rank, Trace, and Linear Equations 1.1.1 Rank and Vector Spaces Definition A vector whose
More information8 Eigenvectors and the Anisotropic Multivariate Gaussian Distribution
Eigenvectors and the Anisotropic Multivariate Gaussian Distribution Eigenvectors and the Anisotropic Multivariate Gaussian Distribution EIGENVECTORS [I don t know if you were properly taught about eigenvectors
More informationCOMMON COMPLEMENTS OF TWO SUBSPACES OF A HILBERT SPACE
COMMON COMPLEMENTS OF TWO SUBSPACES OF A HILBERT SPACE MICHAEL LAUZON AND SERGEI TREIL Abstract. In this paper we find a necessary and sufficient condition for two closed subspaces, X and Y, of a Hilbert
More informationThe Singular Value Decomposition
The Singular Value Decomposition We are interested in more than just sym+def matrices. But the eigenvalue decompositions discussed in the last section of notes will play a major role in solving general
More informationA LITTLE REAL ANALYSIS AND TOPOLOGY
A LITTLE REAL ANALYSIS AND TOPOLOGY 1. NOTATION Before we begin some notational definitions are useful. (1) Z = {, 3, 2, 1, 0, 1, 2, 3, }is the set of integers. (2) Q = { a b : aεz, bεz {0}} is the set
More informationOptimization Theory. A Concise Introduction. Jiongmin Yong
October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization
More informationFinite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product
Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )
More informationReview Questions REVIEW QUESTIONS 71
REVIEW QUESTIONS 71 MATLAB, is [42]. For a comprehensive treatment of error analysis and perturbation theory for linear systems and many other problems in linear algebra, see [126, 241]. An overview of
More informationLecture 1: Review of linear algebra
Lecture 1: Review of linear algebra Linear functions and linearization Inverse matrix, least-squares and least-norm solutions Subspaces, basis, and dimension Change of basis and similarity transformations
More informationOR MSc Maths Revision Course
OR MSc Maths Revision Course Tom Byrne School of Mathematics University of Edinburgh t.m.byrne@sms.ed.ac.uk 15 September 2017 General Information Today JCMB Lecture Theatre A, 09:30-12:30 Mathematics revision
More informationReview problems for MA 54, Fall 2004.
Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on
More informationAPPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.
APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product
More informationAlgebra C Numerical Linear Algebra Sample Exam Problems
Algebra C Numerical Linear Algebra Sample Exam Problems Notation. Denote by V a finite-dimensional Hilbert space with inner product (, ) and corresponding norm. The abbreviation SPD is used for symmetric
More informationLinear and non-linear programming
Linear and non-linear programming Benjamin Recht March 11, 2005 The Gameplan Constrained Optimization Convexity Duality Applications/Taxonomy 1 Constrained Optimization minimize f(x) subject to g j (x)
More informationMatrix Factorization and Analysis
Chapter 7 Matrix Factorization and Analysis Matrix factorizations are an important part of the practice and analysis of signal processing. They are at the heart of many signal-processing algorithms. Their
More information14 Singular Value Decomposition
14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing
More informationApplied Linear Algebra
Applied Linear Algebra Gábor P. Nagy and Viktor Vígh University of Szeged Bolyai Institute Winter 2014 1 / 262 Table of contents I 1 Introduction, review Complex numbers Vectors and matrices Determinants
More informationComputational Methods. Systems of Linear Equations
Computational Methods Systems of Linear Equations Manfred Huber 2010 1 Systems of Equations Often a system model contains multiple variables (parameters) and contains multiple equations Multiple equations
More informationSufficient Conditions to Ensure Uniqueness of Best Rank One Approximation
Sufficient Conditions to Ensure Uniqueness of Best Rank One Approximation Yang Qi Joint work with Pierre Comon and Lek-Heng Lim Supported by ERC grant #320594 Decoda August 3, 2015 Yang Qi (GIPSA-Lab)
More informationG1110 & 852G1 Numerical Linear Algebra
The University of Sussex Department of Mathematics G & 85G Numerical Linear Algebra Lecture Notes Autumn Term Kerstin Hesse (w aw S w a w w (w aw H(wa = (w aw + w Figure : Geometric explanation of the
More informationGlobal Optimization of Polynomials
Semidefinite Programming Lecture 9 OR 637 Spring 2008 April 9, 2008 Scribe: Dennis Leventhal Global Optimization of Polynomials Recall we were considering the problem min z R n p(z) where p(z) is a degree
More informationChapter 6 - Orthogonality
Chapter 6 - Orthogonality Maggie Myers Robert A. van de Geijn The University of Texas at Austin Orthogonality Fall 2009 http://z.cs.utexas.edu/wiki/pla.wiki/ 1 Orthogonal Vectors and Subspaces http://z.cs.utexas.edu/wiki/pla.wiki/
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationLecture 7: Semidefinite programming
CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 7: Semidefinite programming This lecture is on semidefinite programming, which is a powerful technique from both an analytic and computational
More informationMP 472 Quantum Information and Computation
MP 472 Quantum Information and Computation http://www.thphys.may.ie/staff/jvala/mp472.htm Outline Open quantum systems The density operator ensemble of quantum states general properties the reduced density
More informationMOST TENSOR PROBLEMS ARE NP HARD CHRISTOPHER J. HILLAR AND LEK-HENG LIM
MOST TENSOR PROBLEMS ARE NP HARD CHRISTOPHER J. HILLAR AND LEK-HENG LIM Abstract. The idea that one might extend numerical linear algebra, the collection of matrix computational methods that form the workhorse
More informationLecture 4. CP and KSVD Representations. Charles F. Van Loan
Structured Matrix Computations from Structured Tensors Lecture 4. CP and KSVD Representations Charles F. Van Loan Cornell University CIME-EMS Summer School June 22-26, 2015 Cetraro, Italy Structured Matrix
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 2: Orthogonal Vectors and Matrices; Vector Norms Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 11 Outline 1 Orthogonal
More informationTangent bundles, vector fields
Location Department of Mathematical Sciences,, G5-109. Main Reference: [Lee]: J.M. Lee, Introduction to Smooth Manifolds, Graduate Texts in Mathematics 218, Springer-Verlag, 2002. Homepage for the book,
More informationTopics in linear algebra
Chapter 6 Topics in linear algebra 6.1 Change of basis I want to remind you of one of the basic ideas in linear algebra: change of basis. Let F be a field, V and W be finite dimensional vector spaces over
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationA Review of Matrix Analysis
Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value
More information