Algebraic models for higher-order correlations
1 Algebraic models for higher-order correlations

Lek-Heng Lim and Jason Morton
U.C. Berkeley and Stanford Univ.
December 15, 2008

L.-H. Lim & J. Morton (MSRI Workshop), Algebraic models for higher-order correlations, December 15, 2008 (28 slides).
2 Tensors as hypermatrices

Up to a choice of bases on $U$, $V$, $W$, a tensor $A \in U \otimes V \otimes W$ may be represented as a hypermatrix
$$A = [a_{ijk}]_{i,j,k=1}^{l,m,n} \in \mathbb{R}^{l \times m \times n}$$
where $\dim(U) = l$, $\dim(V) = m$, $\dim(W) = n$, provided

1. we give it coordinates;
2. we ignore covariance and contravariance.

Henceforth, tensor = hypermatrix.
3 Multilinear matrix multiplication

Matrices can be multiplied on the left and right: for $A \in \mathbb{R}^{m \times n}$, $X \in \mathbb{R}^{p \times m}$, $Y \in \mathbb{R}^{q \times n}$,
$$C = (X, Y) \cdot A = XAY^\top \in \mathbb{R}^{p \times q}, \qquad c_{\alpha\beta} = \sum_{i,j=1}^{m,n} x_{\alpha i} y_{\beta j} a_{ij}.$$
3-tensors can be multiplied on three sides: for $A \in \mathbb{R}^{l \times m \times n}$, $X \in \mathbb{R}^{p \times l}$, $Y \in \mathbb{R}^{q \times m}$, $Z \in \mathbb{R}^{r \times n}$,
$$C = (X, Y, Z) \cdot A \in \mathbb{R}^{p \times q \times r}, \qquad c_{\alpha\beta\gamma} = \sum_{i,j,k=1}^{l,m,n} x_{\alpha i} y_{\beta j} z_{\gamma k} a_{ijk}.$$
These correspond to change-of-basis transformations for tensors. Define right (covariant) multiplication by $A \cdot (X, Y, Z) = (X^\top, Y^\top, Z^\top) \cdot A$.
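The contraction above maps directly onto an einsum call. This is our own NumPy sketch, not code from the talk; all names are ours.

```python
import numpy as np

# Multilinear matrix multiplication C = (X, Y, Z) . A for a 3-tensor A:
# c_{abc} = sum_{i,j,k} x_{ai} y_{bj} z_{ck} a_{ijk}
def multilinear_mult(X, Y, Z, A):
    return np.einsum('ai,bj,ck,ijk->abc', X, Y, Z, A)

rng = np.random.default_rng(0)
l, m, n = 4, 5, 6
A = rng.standard_normal((l, m, n))
X = rng.standard_normal((2, l))
Y = rng.standard_normal((3, m))
Z = rng.standard_normal((2, n))
C = multilinear_mult(X, Y, Z, A)
print(C.shape)  # (2, 3, 2)

# Sanity check: for matrices, the same pattern reduces to X B Y^T.
B = rng.standard_normal((m, n))
XB = rng.standard_normal((2, m))
YB = rng.standard_normal((3, n))
assert np.allclose(np.einsum('ai,bj,ij->ab', XB, YB, B), XB @ B @ YB.T)
```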
4 Symmetric tensors

A cubical tensor $[a_{ijk}] \in \mathbb{R}^{n \times n \times n}$ is symmetric if
$$a_{ijk} = a_{ikj} = a_{jik} = a_{jki} = a_{kij} = a_{kji}.$$
For order $p$: invariant under all permutations $\sigma \in S_p$ of the indices. $\mathsf{S}^p(\mathbb{R}^n)$ denotes the set of all order-$p$ symmetric tensors.

Symmetric multilinear matrix multiplication: $C = (X, X, X) \cdot A$, where
$$c_{\alpha\beta\gamma} = \sum_{i,j,k=1}^{n} x_{\alpha i} x_{\beta j} x_{\gamma k} a_{ijk}.$$
5 Examples of symmetric tensors

- Higher-order derivatives of real-valued multivariate functions.
- Moments of a vector-valued random variable $x = (x_1, \dots, x_n)$:
$$S_p(x) = \bigl[ E(x_{j_1} x_{j_2} \cdots x_{j_p}) \bigr]_{j_1,\dots,j_p=1}^{n}.$$
- Cumulants of a random vector $x = (x_1, \dots, x_n)$:
$$K_p(x) = \Bigl[ \sum_{A_1 \sqcup \cdots \sqcup A_q = \{j_1,\dots,j_p\}} (-1)^{q-1}(q-1)!\, E\Bigl(\prod\nolimits_{j \in A_1} x_j\Bigr) \cdots E\Bigl(\prod\nolimits_{j \in A_q} x_j\Bigr) \Bigr]_{j_1,\dots,j_p=1}^{n}.$$

$K_p(x)$ for $p = 1, 2, 3, 4$ gives the expectation, variance, skewness, and kurtosis.
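The moment tensor $S_p(x)$ can be estimated from samples by replacing the expectation with a sample mean; the symmetry claimed on this slide is then visible numerically. A NumPy sketch of ours (variable names hypothetical):

```python
import numpy as np
import itertools

# Sample estimate of the order-3 moment tensor S_3(x)[i,j,k] = E(x_i x_j x_k)
# from N observations stored as the rows of a data matrix X.
rng = np.random.default_rng(1)
N, n = 1000, 3
X = rng.standard_normal((N, n))
S3 = np.einsum('ti,tj,tk->ijk', X, X, X) / N

# S3 is symmetric: invariant under every permutation of its three indices.
for perm in itertools.permutations(range(3)):
    assert np.allclose(S3, S3.transpose(perm))
```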
6 Cumulants

In terms of the log characteristic and cumulant generating functions,
$$\kappa_{j_1 \cdots j_p}(x) = \frac{\partial^p}{\partial t_{j_1} \cdots \partial t_{j_p}} \log E(\exp\langle t, x\rangle)\Big|_{t=0} = (-i)^p \frac{\partial^p}{\partial t_{j_1} \cdots \partial t_{j_p}} \log E(\exp(i\langle t, x\rangle))\Big|_{t=0}.$$
In terms of the Edgeworth expansion,
$$\log E(\exp(i\langle t, x\rangle)) = \sum_{|\alpha|=0}^{\infty} i^{|\alpha|} \kappa_\alpha(x) \frac{t^\alpha}{\alpha!}, \qquad \log E(\exp\langle t, x\rangle) = \sum_{|\alpha|=0}^{\infty} \kappa_\alpha(x) \frac{t^\alpha}{\alpha!},$$
where $\alpha = (\alpha_1, \dots, \alpha_n)$ is a multi-index, $t^\alpha = t_1^{\alpha_1} \cdots t_n^{\alpha_n}$, and $\alpha! = \alpha_1! \cdots \alpha_n!$.

For each $x$, $K_p(x) = [\kappa_{j_1 \cdots j_p}(x)] \in \mathsf{S}^p(\mathbb{R}^n)$ is a symmetric tensor. [Fisher, Wishart; 1932]
7 Properties of cumulants

- Multilinearity: if $x$ is an $\mathbb{R}^n$-valued random variable and $A \in \mathbb{R}^{m \times n}$, then
$$K_p(Ax) = (A, \dots, A) \cdot K_p(x).$$
- Independence: if $x_1, \dots, x_k$ are mutually independent of $y_1, \dots, y_k$, then
$$K_p(x_1 + y_1, \dots, x_k + y_k) = K_p(x_1, \dots, x_k) + K_p(y_1, \dots, y_k).$$
If $I$ and $J$ partition $\{j_1, \dots, j_p\}$ so that $x_I$ and $x_J$ are independent, then $\kappa_{j_1 \cdots j_p}(x) = 0$.
- Gaussian: if $x$ is multivariate normal, then $K_p(x) = 0$ for all $p \ge 3$.
- Support: there is no distribution for which
$$K_p(x) \begin{cases} = 0 & 3 \le p \le n, \\ \ne 0 & p > n. \end{cases}$$
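For $p = 2$ the multilinearity property is the familiar rule $\operatorname{Cov}(Ax) = A \operatorname{Cov}(x) A^\top$, and it holds exactly even for sample covariances. A quick numerical check of ours:

```python
import numpy as np

# Multilinearity of the second cumulant: K_2(Ax) = (A, A) . K_2(x) = A K_2(x) A^T.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))        # 500 samples of x in R^4, one per row
A = rng.standard_normal((3, 4))
K2_x = np.cov(X, rowvar=False)           # sample K_2(x)
K2_Ax = np.cov(X @ A.T, rowvar=False)    # sample K_2(Ax)
assert np.allclose(K2_Ax, A @ K2_x @ A.T)
```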
8 Estimation of cumulants

How do we estimate $K_p(x)$ given $n$ observations $x_1, \dots, x_n$ of $x$? The central moments and power sums are
$$\hat m_{ij\cdots} = \frac{1}{n}\sum_{t=1}^n (x_{ti} - \bar x_i)(x_{tj} - \bar x_j)\cdots, \qquad \hat s_{ij\cdots} = \sum_{t=1}^n x_{ti} x_{tj} \cdots,$$
with $\hat m_i = \bar x_i$. The cumulant estimators $\hat K_p(x)$ for $p = 1, 2, 3, 4$ are given by
$$\hat\kappa_i = \hat m_i = \tfrac{1}{n}\hat s_i,$$
$$\hat\kappa_{ij} = \tfrac{n}{n-1}\hat m_{ij} = \tfrac{1}{n-1}\bigl(\hat s_{ij} - \tfrac{1}{n}\hat s_i \hat s_j\bigr),$$
$$\hat\kappa_{ijk} = \tfrac{n^2}{(n-1)(n-2)}\hat m_{ijk} = \tfrac{n}{(n-1)(n-2)}\bigl[\hat s_{ijk} - \tfrac{1}{n}(\hat s_i\hat s_{jk} + \hat s_j\hat s_{ik} + \hat s_k\hat s_{ij}) + \tfrac{2}{n^2}\hat s_i\hat s_j\hat s_k\bigr],$$
$$\hat\kappa_{ijkl} = \tfrac{n^2}{(n-1)(n-2)(n-3)}\bigl[(n+1)\hat m_{ijkl} - (n-1)(\hat m_{ij}\hat m_{kl} + \hat m_{ik}\hat m_{jl} + \hat m_{il}\hat m_{jk})\bigr]$$
$$= \tfrac{n}{(n-1)(n-2)(n-3)}\Bigl[(n+1)\hat s_{ijkl} - \tfrac{n+1}{n}(\hat s_i\hat s_{jkl} + \hat s_j\hat s_{ikl} + \hat s_k\hat s_{ijl} + \hat s_l\hat s_{ijk}) - \tfrac{n-1}{n}(\hat s_{ij}\hat s_{kl} + \hat s_{ik}\hat s_{jl} + \hat s_{il}\hat s_{jk})$$
$$\quad + \tfrac{2}{n}(\hat s_i\hat s_j\hat s_{kl} + \hat s_i\hat s_k\hat s_{jl} + \hat s_i\hat s_l\hat s_{jk} + \hat s_j\hat s_k\hat s_{il} + \hat s_j\hat s_l\hat s_{ik} + \hat s_k\hat s_l\hat s_{ij}) - \tfrac{6}{n^2}\hat s_i\hat s_j\hat s_k\hat s_l\Bigr].$$
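On the diagonal ($i = j = k = l$) these estimators reduce to the classical univariate k-statistics, which can be coded in a few lines. A sketch of ours (function name hypothetical); the second-order estimate should coincide with the unbiased sample variance:

```python
import numpy as np

def k_statistics(x):
    """Unbiased cumulant estimates (univariate k-statistics) k1..k4,
    the diagonal entries of the multivariate estimators above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    k1 = x.mean()
    k2 = n / (n - 1) * m2
    k3 = n**2 / ((n - 1) * (n - 2)) * m3
    k4 = n**2 * ((n + 1) * m4 - 3 * (n - 1) * m2**2) / ((n - 1) * (n - 2) * (n - 3))
    return k1, k2, k3, k4

x = np.random.default_rng(3).standard_normal(200)
k1, k2, k3, k4 = k_statistics(x)
assert np.isclose(k2, np.var(x, ddof=1))   # k2 is the unbiased variance
```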
9 Factor analysis

Linear generative model
$$y = As + \varepsilon,$$
with noise $\varepsilon \in \mathbb{R}^m$, factor loadings $A \in \mathbb{R}^{m \times r}$, hidden factors $s \in \mathbb{R}^r$, and observed data $y \in \mathbb{R}^m$.

We do not know $A$, $s$, $\varepsilon$, but need to recover $s$, and sometimes $A$, from multiple observations of $y$.

From a time series of observations we get matrices $Y = [y_1, \dots, y_n]$, $S = [s_1, \dots, s_n]$, $E = [\varepsilon_1, \dots, \varepsilon_n]$, and $Y = AS + E$.

Factor analysis: recover $A$ and $S$ from $Y$ by a low-rank matrix approximation $Y \approx AS$.
10 Principal and independent components analysis

Principal components analysis: $s$ Gaussian,
$$\hat K_2(y) = Q\Lambda_2 Q^\top = (Q, Q) \cdot \Lambda_2,$$
with $\Lambda_2 \approx \hat K_2(s)$ a diagonal matrix and $Q \in \mathrm{O}(n, r)$. [Pearson; 1901]

Independent components analysis: $s$ with statistically independent entries, $\varepsilon$ Gaussian,
$$\hat K_p(y) = (Q, \dots, Q) \cdot \Lambda_p, \quad p = 2, 3, \dots,$$
with $\Lambda_p \approx \hat K_p(s)$ a diagonal tensor and $Q \in \mathrm{O}(n, r)$. [Comon; 1994]

What if
- $s$ is not Gaussian, e.g. power-law distributed data in social networks;
- $s$ is not independent, e.g. functional components in neuroimaging;
- $\varepsilon$ is not white noise, e.g. idiosyncratic factors in financial modelling?
11 Principal cumulant components analysis

Note that if $\varepsilon = 0$, then $K_p(y) = K_p(Qs) = (Q, \dots, Q) \cdot K_p(s)$.

In general, we want principal components that account for variation in all cumulants simultaneously:
$$\min_{Q \in \mathrm{O}(n,r),\; C_p \in \mathsf{S}^p(\mathbb{R}^r)} \sum_{p=1}^{\infty} \alpha_p \bigl\| \hat K_p(y) - (Q, \dots, Q) \cdot C_p \bigr\|_F^2,$$
where $C_p \approx \hat K_p(s)$ is not necessarily diagonal.

Appears intractable: optimization over the infinite-dimensional manifold $\mathrm{O}(n,r) \times \prod_{p=1}^{\infty} \mathsf{S}^p(\mathbb{R}^r)$.

Surprising relaxation: optimization over a single Grassmannian $\mathrm{Gr}(n,r)$ of dimension $r(n-r)$,
$$\max_{Q \in \mathrm{Gr}(n,r)} \sum_{p=1}^{\infty} \alpha_p \bigl\| \hat K_p(y) \cdot (Q, \dots, Q) \bigr\|_F^2.$$
In practice the sum is truncated at $p = 3$ or $4$.
12 Geometric insights

- Secants of the Veronese variety in $\mathsf{S}^p(\mathbb{R}^n)$: not closed, not irreducible, difficult to study.
- Symmetric subspace variety in $\mathsf{S}^p(\mathbb{R}^n)$: closed, irreducible, easy to study.
- Stiefel manifold $\mathrm{O}(n, r)$: set of $n \times r$ real matrices with orthonormal columns; $\mathrm{O}(n, n) = \mathrm{O}(n)$, the usual orthogonal group.
- Grassmann manifold $\mathrm{Gr}(n, r)$: set of equivalence classes of $\mathrm{O}(n, r)$ under right multiplication by $\mathrm{O}(r)$.

Parameterization of the subspace variety in $\mathsf{S}^p(\mathbb{R}^n)$ via
$$\mathrm{Gr}(n, r) \times \mathsf{S}^p(\mathbb{R}^r) \to \mathsf{S}^p(\mathbb{R}^n).$$
More generally,
$$\mathrm{Gr}(n, r) \times \prod\nolimits_{p=1}^{\infty} \mathsf{S}^p(\mathbb{R}^r) \to \prod\nolimits_{p=1}^{\infty} \mathsf{S}^p(\mathbb{R}^n).$$
13 From Stiefel to Grassmann

Given $A \in \mathsf{S}^p(\mathbb{R}^n)$ and some $r \le n$, we want
$$\min_{X \in \mathrm{O}(n,r),\; C \in \mathsf{S}^p(\mathbb{R}^r)} \| A - (X, \dots, X) \cdot C \|_F.$$
Unlike approximation by secants of the Veronese, this subspace approximation problem always has a globally optimal solution. It is equivalent to
$$\max_{X \in \mathrm{O}(n,r)} \| (X^\top, \dots, X^\top) \cdot A \|_F = \max_{X \in \mathrm{O}(n,r)} \| A \cdot (X, \dots, X) \|_F.$$
The problem is defined on a Grassmannian since
$$\| A \cdot (X, \dots, X) \|_F = \| A \cdot (XQ, \dots, XQ) \|_F$$
for any $Q \in \mathrm{O}(r)$: only the subspace spanned by $X$ matters. Equivalent to
$$\max_{X \in \mathrm{Gr}(n,r)} \| A \cdot (X, \dots, X) \|_F.$$
Once we have an optimal $X \in \mathrm{Gr}(n, r)$, we may obtain $C \in \mathsf{S}^p(\mathbb{R}^r)$, up to $\mathrm{O}(r)$-equivalence, as $C = (X^\top, \dots, X^\top) \cdot A$.
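The equivalence between minimizing the residual and maximizing the projected norm rests on the identity $\|A - (X,\dots,X)\cdot C\|_F^2 = \|A\|_F^2 - \|C\|_F^2$ when $C = (X^\top,\dots,X^\top)\cdot A$ and $X$ has orthonormal columns. A numerical check of ours:

```python
import numpy as np

# For fixed X in O(n, r), the best core is C = (X^T, X^T, X^T) . A, and the
# residual satisfies ||A - (X,X,X).C||_F^2 = ||A||_F^2 - ||C||_F^2,
# so minimizing the residual = maximizing ||C||_F.
rng = np.random.default_rng(4)
n, r = 6, 3
A = rng.standard_normal((n, n, n))
X, _ = np.linalg.qr(rng.standard_normal((n, r)))     # X in O(n, r)
C = np.einsum('ia,jb,kc,ijk->abc', X, X, X, A)       # (X^T, X^T, X^T) . A
Ahat = np.einsum('ai,bj,ck,ijk->abc', X, X, X, C)    # (X, X, X) . C
resid2 = np.sum((A - Ahat)**2)
assert np.isclose(resid2, np.sum(A**2) - np.sum(C**2))
```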
14 Coordinate-cycling heuristics

Alternating least squares (i.e. Gauss-Seidel) is commonly used for optimizing
$$\Psi(X, Y, Z) = \| A \cdot (X, Y, Z) \|_F^2$$
for $A \in \mathbb{R}^{l \times m \times n}$, cycling between $X$, $Y$, $Z$ and solving a least-squares problem at each iteration.

What if $A \in \mathsf{S}^3(\mathbb{R}^n)$ and $\Phi(X) = \| A \cdot (X, X, X) \|_F^2$?

Present approach: disregard the symmetry of $A$, solve for $\Psi(X, Y, Z)$, and upon the final iteration set
$$X = Y = Z = (X + Y + Z)/3.$$
Better: L-BFGS on the Grassmannian.
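The coordinate-cycling heuristic can be sketched as follows. This is our own minimal HOOI-style implementation for illustration, not the authors' code; each cycle updates one factor from the leading singular vectors of a contracted unfolding.

```python
import numpy as np

def als_multilinear(A, r, iters=30, seed=0):
    """Gauss-Seidel sketch: cycle through X, Y, Z, increasing
    ||A . (X, Y, Z)||_F via an SVD of an unfolding at each step."""
    l, m, n = A.shape
    rng = np.random.default_rng(seed)
    Y, _ = np.linalg.qr(rng.standard_normal((m, r)))
    Z, _ = np.linalg.qr(rng.standard_normal((n, r)))
    for _ in range(iters):
        B = np.einsum('ijk,jb,kc->ibc', A, Y, Z).reshape(l, -1)
        X = np.linalg.svd(B, full_matrices=False)[0][:, :r]
        B = np.einsum('ijk,ia,kc->jac', A, X, Z).reshape(m, -1)
        Y = np.linalg.svd(B, full_matrices=False)[0][:, :r]
        B = np.einsum('ijk,ia,jb->kab', A, X, Y).reshape(n, -1)
        Z = np.linalg.svd(B, full_matrices=False)[0][:, :r]
    return X, Y, Z

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5, 5))
A = A + A.transpose(0, 2, 1) + A.transpose(1, 0, 2) + A.transpose(1, 2, 0) \
      + A.transpose(2, 0, 1) + A.transpose(2, 1, 0)   # symmetrize A
X, Y, Z = als_multilinear(A, r=2)
core = np.einsum('ia,jb,kc,ijk->abc', X, Y, Z, A)     # (X^T, Y^T, Z^T) . A
assert np.sum(core**2) <= np.sum(A**2) + 1e-9         # projection never grows norm
```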
15 Newton/quasi-Newton on a Grassmannian

Objective $\Phi : \mathrm{Gr}(n, r) \to \mathbb{R}$, $\Phi(X) = \| A \cdot (X, X, X) \|_F^2$. Let $T_X$ be the tangent space at $X \in \mathrm{Gr}(n, r)$; for $\Delta \in T_X \subset \mathbb{R}^{n \times r}$ we have $X^\top \Delta = 0$.

1. Compute the Grassmann gradient $\nabla\Phi \in T_X$.
2. Compute the Hessian, or update a Hessian approximation, $H : T_X \to T_X$.
3. At $X \in \mathrm{Gr}(n, r)$, solve $H\Delta = -\nabla\Phi$ for the search direction $\Delta$.
4. Update the iterate $X$: move along the geodesic from $X$ in the direction given by $\Delta$.

[Arias, Edelman, Smith; 1999], [Eldén, Savas; 2008], [Savas, Lim; 2008].
16 Picture

[Figure omitted in transcription.]
17 BFGS on Grassmannian

The BFGS update is
$$H_{k+1} = H_k - \frac{H_k s_k s_k^\top H_k}{s_k^\top H_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k},$$
where $s_k = x_{k+1} - x_k = t_k p_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$.

On a Grassmannian, these vectors are defined at different points and belong to different tangent spaces.
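The defining invariant of this update is the secant condition $H_{k+1} s_k = y_k$, which follows directly from the two rank-one corrections. A NumPy check of ours:

```python
import numpy as np

# Euclidean BFGS update of a Hessian approximation H:
# H+ = H - (H s s^T H)/(s^T H s) + (y y^T)/(y^T s)
def bfgs_update(H, s, y):
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(5)
n = 6
H = np.eye(n)
s = rng.standard_normal(n)
y = s + 0.1 * rng.standard_normal(n)   # keeps y^T s > 0 (curvature condition)
H1 = bfgs_update(H, s, y)
assert np.allclose(H1 @ s, y)          # secant condition
assert np.allclose(H1, H1.T)           # symmetry is preserved
```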
18 Different ways of parallel transporting vectors

Let $X \in \mathrm{Gr}(n, r)$, $\Delta_1, \Delta_2 \in T_X$, and let $X(t)$ be the geodesic path along $\Delta_1$.

Parallel transport using global coordinates:
$$\Delta_2(t) = T_{\Delta_1}(t)\,\Delta_2.$$
We may also write $\Delta_1 = \mathcal{X} D_1$ and $\Delta_2 = \mathcal{X} D_2$, where $\mathcal{X}$ is a basis for $T_X$. Let $\mathcal{X}(t)$ be a basis for $T_{X(t)}$. Parallel transport using local coordinates:
$$\Delta_2(t) = \mathcal{X}(t) D_2.$$
19 Parallel transport in local coordinates

All transported tangent vectors have the same coordinate representation in the basis $\mathcal{X}(t)$ at all points on the path $X(t)$.
- Plus: no need to transport the gradient or the Hessian.
- Minus: need to compute $\mathcal{X}(t)$.

In global coordinates we compute
$$s_k = t_k T_k(t_k) p_k \in T_{k+1}, \qquad y_k = \nabla f_{k+1} - T_k(t_k) \nabla f_k \in T_{k+1},$$
$$T_k(t_k) H_k T_k^{-1}(t_k) : T_{k+1} \to T_{k+1},$$
and then apply the BFGS update
$$H_{k+1} = H_k - \frac{H_k s_k s_k^\top H_k}{s_k^\top H_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}.$$
20 BFGS

Compact representation of BFGS in Euclidean space:
$$H_k = H_0 + \begin{bmatrix} S_k & H_0 Y_k \end{bmatrix} \begin{bmatrix} R_k^{-\top}(D_k + Y_k^\top H_0 Y_k)R_k^{-1} & -R_k^{-\top} \\ -R_k^{-1} & 0 \end{bmatrix} \begin{bmatrix} S_k^\top \\ Y_k^\top H_0 \end{bmatrix},$$
where
$$S_k = [s_0, \dots, s_{k-1}], \qquad Y_k = [y_0, \dots, y_{k-1}], \qquad D_k = \mathrm{diag}\bigl[s_0^\top y_0, \dots, s_{k-1}^\top y_{k-1}\bigr],$$
$$R_k = \begin{bmatrix} s_0^\top y_0 & s_0^\top y_1 & \cdots & s_0^\top y_{k-1} \\ 0 & s_1^\top y_1 & \cdots & s_1^\top y_{k-1} \\ & & \ddots & \vdots \\ 0 & & & s_{k-1}^\top y_{k-1} \end{bmatrix}.$$
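This compact formula (the Byrd-Nocedal-Schnabel representation of the inverse-Hessian BFGS matrix) can be checked against the sequential rank-two updates. Our own verification sketch, with hypothetical function names:

```python
import numpy as np

def inv_bfgs_update(H, s, y):
    # Standard inverse-BFGS update: H+ = V H V^T + rho s s^T, rho = 1/(y^T s).
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def compact_bfgs(H0, S, Y):
    # The compact representation from this slide.
    k = S.shape[1]
    SY = S.T @ Y
    R = np.triu(SY)                      # R_ij = s_i^T y_j for i <= j
    D = np.diag(np.diag(SY))
    Ri = np.linalg.inv(R)
    M = np.block([[Ri.T @ (D + Y.T @ H0 @ Y) @ Ri, -Ri.T],
                  [-Ri,                            np.zeros((k, k))]])
    W = np.hstack([S, H0 @ Y])
    return H0 + W @ M @ np.vstack([S.T, Y.T @ H0])

rng = np.random.default_rng(6)
n, k = 5, 3
H0 = np.eye(n)
G = rng.standard_normal((n, n))
M_spd = G @ G.T + n * np.eye(n)          # SPD "Hessian" so that s_j^T y_j > 0
S = rng.standard_normal((n, k))
Y = M_spd @ S
H_seq = H0.copy()
for j in range(k):
    H_seq = inv_bfgs_update(H_seq, S[:, j], Y[:, j])
H_compact = compact_bfgs(H0, S, Y)
assert np.allclose(H_seq, H_compact)
```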
21 L-BFGS

Limited-memory BFGS [Byrd et al; 1994]: replace $H_0$ by $\gamma_k I$ and keep only the $m$ most recent $s_j$ and $y_j$:
$$H_k = \gamma_k I + \begin{bmatrix} S_k & \gamma_k Y_k \end{bmatrix} \begin{bmatrix} R_k^{-\top}(D_k + \gamma_k Y_k^\top Y_k)R_k^{-1} & -R_k^{-\top} \\ -R_k^{-1} & 0 \end{bmatrix} \begin{bmatrix} S_k^\top \\ \gamma_k Y_k^\top \end{bmatrix},$$
where
$$S_k = [s_{k-m}, \dots, s_{k-1}], \qquad Y_k = [y_{k-m}, \dots, y_{k-1}], \qquad D_k = \mathrm{diag}\bigl[s_{k-m}^\top y_{k-m}, \dots, s_{k-1}^\top y_{k-1}\bigr],$$
$$R_k = \begin{bmatrix} s_{k-m}^\top y_{k-m} & s_{k-m}^\top y_{k-m+1} & \cdots & s_{k-m}^\top y_{k-1} \\ 0 & s_{k-m+1}^\top y_{k-m+1} & \cdots & s_{k-m+1}^\top y_{k-1} \\ & & \ddots & \vdots \\ 0 & & & s_{k-1}^\top y_{k-1} \end{bmatrix}.$$
22 L-BFGS on the Grassmannian

In each iteration, parallel transport the vectors in $S_k$ and $Y_k$, i.e. perform
$$\bar S_k = T S_k, \qquad \bar Y_k = T Y_k,$$
where $T$ is the transport matrix.

- No need to modify $R_k$ or $D_k$: $\langle u, v \rangle = \langle Tu, Tv \rangle$, where $u, v \in T_k$ and $Tu, Tv \in T_{k+1}$.
- $H_k$ is nonsingular while the true Hessian is singular. No problem: $T_k$ at $x_k$ is an invariant subspace of $H_k$, i.e. if $v \in T_k$ then $H_k v \in T_k$.

[Savas, Lim; 2008]
23 Convergence

Compares favorably with alternating least squares.
24 Higher order eigenfaces

Principal cumulant subspaces supplement the varimax subspace from PCA. Taking face recognition as an example, eigenfaces ($p = 2$) become skewfaces ($p = 3$) and kurtofaces ($p = 4$).

Eigenfaces: given an image-pixel matrix $A \in \mathbb{R}^{n \times m}$ with centered columns ($n$ pixels, $m$ images, $n \gg m$). Eigenvectors of the pixel-pixel covariance matrix $K_2^{\mathrm{pixel}} \in \mathsf{S}^2(\mathbb{R}^n)$ are the eigenfaces. For efficiency, compute the image-image covariance matrix $K_2^{\mathrm{image}} \in \mathsf{S}^2(\mathbb{R}^m)$ instead. The SVD $A = U\Sigma V^\top$ gives both implicitly:
$$K_2^{\mathrm{image}} = \frac{1}{n}(A^\top, A^\top) \cdot I_n = \frac{1}{n} A^\top A = \frac{1}{n} V \Lambda V^\top, \qquad K_2^{\mathrm{pixel}} = \frac{1}{m}(A, A) \cdot I_m = \frac{1}{m} A A^\top = \frac{1}{m} U \Lambda U^\top.$$
The orthonormal columns of $U$, eigenvectors of $m K_2^{\mathrm{pixel}} = AA^\top$, are the eigenfaces.
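One thin SVD does expose both factorizations, and it also shows how the eigenfaces $U$ are recovered from the small $m \times m$ eigenproblem via $U = AV\Sigma^{-1}$. A NumPy check of ours (illustrative sizes; we assume the columns of $A$ are already centered):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 50, 8                      # n pixels >> m images
A = rng.standard_normal((n, m))   # stand-in for a centered image-pixel matrix
U, sv, Vt = np.linalg.svd(A, full_matrices=False)
Lam = sv**2

K_image = A.T @ A / n             # m x m: cheap to form
K_pixel = A @ A.T / m             # n x n: huge in practice, shown only for the check
assert np.allclose(K_image, Vt.T @ np.diag(Lam) @ Vt / n)
assert np.allclose(K_pixel, U @ np.diag(Lam) @ U.T / m)

# Eigenfaces from the small problem: U = A V Sigma^{-1}
assert np.allclose(U, A @ Vt.T / sv)
```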
25 Computing image and pixel skewness

We want to implicitly compute $K_3^{\mathrm{pixel}} \in \mathsf{S}^3(\mathbb{R}^n)$, the third cumulant tensor of the pixels (huge). We just need the projector $\Pi$ onto the subspace of skewfaces that best explains $K_3^{\mathrm{pixel}}$.

Let $A = U\Sigma V^\top$, $U \in \mathrm{O}(n, m)$, $\Sigma \in \mathbb{R}^{m \times m}$, $V \in \mathrm{O}(m)$. Then
$$K_3^{\mathrm{pixel}} = \frac{1}{m}(A, A, A) \cdot I_m = \frac{1}{m}(U, U, U) \cdot (\Sigma, \Sigma, \Sigma) \cdot (V^\top, V^\top, V^\top) \cdot I_m,$$
$$K_3^{\mathrm{image}} = \frac{1}{n}(A^\top, A^\top, A^\top) \cdot I_n = \frac{1}{n}(V, V, V) \cdot (\Sigma, \Sigma, \Sigma) \cdot (U^\top, U^\top, U^\top) \cdot I_n.$$
Here $I_m = [\delta_{ijk}] \in \mathsf{S}^3(\mathbb{R}^m)$ is the Kronecker delta tensor, i.e. $\delta_{ijk} = 1$ iff $i = j = k$ and $\delta_{ijk} = 0$ otherwise (likewise $I_n \in \mathsf{S}^3(\mathbb{R}^n)$).
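Contracting against the order-3 delta tensor just sums rank-one cubes of the columns of $A$, which is why the pixel skewness never has to be formed from an explicit $n \times n \times n$ delta. A small check of ours:

```python
import numpy as np

# (A, A, A) . I_m with I_m the order-3 Kronecker delta tensor equals
# sum_t a_t (x) a_t (x) a_t over the columns a_t of A.
rng = np.random.default_rng(8)
n, m = 6, 4
A = rng.standard_normal((n, m))
I3 = np.zeros((m, m, m))
for t in range(m):
    I3[t, t, t] = 1.0
K3 = np.einsum('ai,bj,ck,ijk->abc', A, A, A, I3) / m    # (A, A, A) . I_m / m
K3_direct = np.einsum('it,jt,kt->ijk', A, A, A) / m     # sum of rank-one cubes
assert np.allclose(K3, K3_direct)
```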
26 Computing skewmax projection

Define $\mathcal{A} \in \mathsf{S}^3(\mathbb{R}^m)$ by
$$\mathcal{A} = (\Sigma, \Sigma, \Sigma) \cdot (V^\top, V^\top, V^\top) \cdot I_m.$$
We want $Q \in \mathrm{O}(m, s)$ and a core tensor $C \in \mathsf{S}^3(\mathbb{R}^s)$, not necessarily diagonal, so that $\mathcal{A} \approx (Q, Q, Q) \cdot C$ and thus
$$K_3^{\mathrm{pixel}} \approx \frac{1}{m}(U, U, U) \cdot (Q, Q, Q) \cdot C = \frac{1}{m}(UQ, UQ, UQ) \cdot C.$$
Solve
$$\min_{Q \in \mathrm{O}(m,s),\; C \in \mathsf{S}^3(\mathbb{R}^s)} \| \mathcal{A} - (Q, Q, Q) \cdot C \|_F.$$
Then $\Pi = UQ \in \mathrm{O}(n, s)$ is our orthonormal-column projection matrix onto the skewmax subspace.

Caveat: $Q$ is only determined up to $\mathrm{O}(s)$-equivalence. Not a problem if we are just interested in the associated subspace or its projector.
27 Combining eigen-, skew-, and kurtofaces

Combine information from multiple cumulants:
- The same procedure works for the kurtosis tensor (a little more complicated).
- Say we keep the first $r$ eigenfaces (columns of $U$), $s$ skewfaces, and $t$ kurtofaces. Their span is our optimal subspace.
- These three subspaces may overlap; orthogonalize the resulting $r + s + t$ column vectors to get a final projector.

This gives an orthonormal projector basis $W$ for the column space of $A$:
- its first $r$ vectors best explain the pixel covariance $K_2^{\mathrm{pixel}} \in \mathsf{S}^2(\mathbb{R}^n)$;
- the next $s$ vectors, together with $W_{1:r}$, best explain the pixel skewness $K_3^{\mathrm{pixel}} \in \mathsf{S}^3(\mathbb{R}^n)$;
- the last $t$ vectors, together with $W_{1:r+s}$, best explain the pixel kurtosis $K_4^{\mathrm{pixel}} \in \mathsf{S}^4(\mathbb{R}^n)$.
28 Advertisement and acknowledgement

Jason Morton, Algebraic models for multilinear dependence, SAMSI Workshop on Algebraic Statistical Models, Research Triangle Park, NC, January 15-17.

Thanks: J.M. Landsberg, Berkant Savas.
More informationTutorial on Principal Component Analysis
Tutorial on Principal Component Analysis Copyright c 1997, 2003 Javier R. Movellan. This is an open source document. Permission is granted to copy, distribute and/or modify this document under the terms
More informationEE731 Lecture Notes: Matrix Computations for Signal Processing
EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten
More informationCOMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017
COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY
More informationLecture 9: Numerical Linear Algebra Primer (February 11st)
10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template
More informationSingular Value Decomposition and Principal Component Analysis (PCA) I
Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression
More informationSolving linear equations with Gaussian Elimination (I)
Term Projects Solving linear equations with Gaussian Elimination The QR Algorithm for Symmetric Eigenvalue Problem The QR Algorithm for The SVD Quasi-Newton Methods Solving linear equations with Gaussian
More informationGeneric properties of Symmetric Tensors
2006 1/48 PComon Generic properties of Symmetric Tensors Pierre COMON - CNRS other contributors: Bernard MOURRAIN INRIA institute Lek-Heng LIM, Stanford University 2006 2/48 PComon Tensors & Arrays Definitions
More information14 Singular Value Decomposition
14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing
More informationFocus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.
Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,
More informationReview (Probability & Linear Algebra)
Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint
More informationFoundations of Matrix Analysis
1 Foundations of Matrix Analysis In this chapter we recall the basic elements of linear algebra which will be employed in the remainder of the text For most of the proofs as well as for the details, the
More informationLINEAR ALGEBRA REVIEW
LINEAR ALGEBRA REVIEW JC Stuff you should know for the exam. 1. Basics on vector spaces (1) F n is the set of all n-tuples (a 1,... a n ) with a i F. It forms a VS with the operations of + and scalar multiplication
More informationAlgebra C Numerical Linear Algebra Sample Exam Problems
Algebra C Numerical Linear Algebra Sample Exam Problems Notation. Denote by V a finite-dimensional Hilbert space with inner product (, ) and corresponding norm. The abbreviation SPD is used for symmetric
More informationProblems in Linear Algebra and Representation Theory
Problems in Linear Algebra and Representation Theory (Most of these were provided by Victor Ginzburg) The problems appearing below have varying level of difficulty. They are not listed in any specific
More informationUNIT 6: The singular value decomposition.
UNIT 6: The singular value decomposition. María Barbero Liñán Universidad Carlos III de Madrid Bachelor in Statistics and Business Mathematical methods II 2011-2012 A square matrix is symmetric if A T
More informationPHYS 705: Classical Mechanics. Rigid Body Motion Introduction + Math Review
1 PHYS 705: Classical Mechanics Rigid Body Motion Introduction + Math Review 2 How to describe a rigid body? Rigid Body - a system of point particles fixed in space i r ij j subject to a holonomic constraint:
More informationChapter 7 Iterative Techniques in Matrix Algebra
Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition
More informationGatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II
Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II Gatsby Unit University College London 27 Feb 2017 Outline Part I: Theory of ICA Definition and difference
More information1 Cricket chirps: an example
Notes for 2016-09-26 1 Cricket chirps: an example Did you know that you can estimate the temperature by listening to the rate of chirps? The data set in Table 1 1. represents measurements of the number
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 5: Projectors and QR Factorization Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 14 Outline 1 Projectors 2 QR Factorization
More informationIndex Notation for Vector Calculus
Index Notation for Vector Calculus by Ilan Ben-Yaacov and Francesc Roig Copyright c 2006 Index notation, also commonly known as subscript notation or tensor notation, is an extremely useful tool for performing
More informationThe QR Factorization
The QR Factorization How to Make Matrices Nicer Radu Trîmbiţaş Babeş-Bolyai University March 11, 2009 Radu Trîmbiţaş ( Babeş-Bolyai University) The QR Factorization March 11, 2009 1 / 25 Projectors A projector
More informationCIFAR Lectures: Non-Gaussian statistics and natural images
CIFAR Lectures: Non-Gaussian statistics and natural images Dept of Computer Science University of Helsinki, Finland Outline Part I: Theory of ICA Definition and difference to PCA Importance of non-gaussianity
More information1 Singular Value Decomposition and Principal Component
Singular Value Decomposition and Principal Component Analysis In these lectures we discuss the SVD and the PCA, two of the most widely used tools in machine learning. Principal Component Analysis (PCA)
More informationAlgebra Workshops 10 and 11
Algebra Workshops 1 and 11 Suggestion: For Workshop 1 please do questions 2,3 and 14. For the other questions, it s best to wait till the material is covered in lectures. Bilinear and Quadratic Forms on
More informationMath 321: Linear Algebra
Math 32: Linear Algebra T. Kapitula Department of Mathematics and Statistics University of New Mexico September 8, 24 Textbook: Linear Algebra,by J. Hefferon E-mail: kapitula@math.unm.edu Prof. Kapitula,
More informationMatrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =
30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can
More informationMachine Learning - MT & 14. PCA and MDS
Machine Learning - MT 2016 13 & 14. PCA and MDS Varun Kanade University of Oxford November 21 & 23, 2016 Announcements Sheet 4 due this Friday by noon Practical 3 this week (continue next week if necessary)
More informationLecture: Face Recognition and Feature Reduction
Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab 1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed in the
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More informationto appear insiam Journal on Matrix Analysis and Applications.
to appear insiam Journal on Matrix Analysis and Applications. SYMMETRIC TENSORS AND SYMMETRIC TENSOR RANK PIERRE COMON, GENE GOLUB, LEK-HENG LIM, AND BERNARD MOURRAIN Abstract. A symmetric tensor is a
More information