Data Min Knowl Disc (2011) 22 · DOI 10.1007/s…

Nonnegative tensor factorization as an alternative Csiszár–Tusnády procedure: algorithms, convergence, probabilistic interpretations and novel probabilistic tensor latent variable analysis algorithms

Stefanos Zafeiriou · Maria Petrou

Received: 1 May 2009 / Accepted: 8 July 2010 / Published online: 1 August 2010. © The Author(s) 2010

Abstract In this paper we study Nonnegative Tensor Factorization (NTF) based on the Kullback–Leibler (KL) divergence as an alternative Csiszár–Tusnády procedure. We propose new update rules for the aforementioned divergence that are based on multiplicative update rules. The proposed algorithms are built on solid theoretical foundations that guarantee that the limit point of the iterative algorithm corresponds to a stationary solution of the optimization procedure. Moreover, we study the convergence properties of the optimization procedure and we present generalized Pythagorean rules. Furthermore, we provide clear probabilistic interpretations of these algorithms. Finally, we discuss the connections between generalized Probabilistic Tensor Latent Variable Models (PTLVM) and NTF, proposing in that way algorithms for PTLVM for arbitrary multivariate probability mass functions.

Keywords Nonnegative Matrix Factorization · Nonnegative Tensor Factorization · Kullback–Leibler divergence · Probabilistic Latent Semantic Analysis

1 Introduction

Nonnegative Matrix Factorization (NMF) has been proposed for the analysis of nonnegative data in Lee and Seung (1999, 2000). In Lee and Seung (1999), NMF was motivated as a technique for discovering the parts of objects. This was shown when

Responsible editors: Tao Li, Chris Ding, Fei Wang.

S. Zafeiriou (✉) · M. Petrou
Department of Electrical and Electronic Engineering, Imperial College London, Signal Processing and Communication Research Group, South Kensington Campus, London SW7 2AZ, UK
e-mail: s.zafeiriou@imperial.ac.uk

NMF was applied to the decomposition of facial images: it was demonstrated that the derived bases of the decomposition intuitively corresponded to parts of faces. The problem of NMF is as follows: given a nonnegative matrix X ∈ R₊^{F×L} and a prespecified integer constant K, find a matrix Z ∈ R₊^{F×K}, which contains the bases of the decomposition in its columns (Lee and Seung 1999, 2000; Zafeiriou et al. 2006; Kotsia et al. 2007), and a matrix H ∈ R₊^{K×L}, which contains the weights of the decomposition, such that X ≈ ZH. In order to find the approximation X ≈ ZH, two optimization problems were defined in Lee and Seung (2000): the first minimizes the Frobenius norm between X and the decomposition ZH and yields a least squares approximation, while the second resorts to the minimization of the Kullback–Leibler (KL) divergence between X and the decomposition. Updating rules of the iterative schemes used to solve these problems were derived from auxiliary functions. The provided updating rules guaranteed the non-increasing behavior of the cost functions, but they did not guarantee that the limit point corresponds to a local minimum or to a stationary limit point (Gonzalez and Zhang 2005; Lin 2007a,b). In Gonzalez and Zhang (2005) it was numerically demonstrated that the multiplicative update rules in Lee and Seung (1999, 2000) may fail to converge to a stationary point. Recently, the convergence properties of the NMF update rules were explored in Lin (2007a,b) and Finesso and Spreij (2006). More precisely, in Lin (2007a), update rules which guarantee that the limit point of the Least Squares Error (LSE) optimization problem is a stationary point were proposed. In Lin (2007b) an alternative solution for the LSE optimization problem was proposed that was based on the Armijo rule. In both Lin (2007a,b) the LSE was used for measuring the error of the approximation.

Another very useful measure of the error of the nonnegative approximation, with many applications, is the KL divergence. The first algorithm with the KL divergence was proposed in Lee and Seung (2000), where both multiplicative and additive update rules were given. In Sra and Dhillon (2006), using the generalized Bregman distance formulation, an exponential family of update rules was proposed for the KL divergence. In Finesso and Spreij (2006), a convergence analysis for the multiplicative update rules of the KL divergence was provided and NMF was interpreted within a probabilistic framework. Moreover, in Finesso and Spreij (2006), it was proven that the proposed set of multiplicative update rules guarantees that the KL divergence converges to a stationary limit point. Finally, the relation between multiplicative update rules for NMF with the KL divergence and Probabilistic Latent Semantic Analysis (PLSA) was explored in Gaussier and Goutte (2005). Recently, the relationship of NMF approaches with various Probabilistic Latent Variable Models (PLVM) was investigated in Shashanka et al. (2007, 2008) and Smaragdis and Raj (2007).

One disadvantage of NMF is that it cannot describe more than pairwise data relations. The application of NMF to modelling relations, and furthermore to clustering, was initiated in Lee and Seung (1999). In Ding et al. it was shown that there is a close relationship between spectral clustering and NMF. In Shashua et al., the problem of clustering data, given complex relations beyond pairwise relations between data points, was considered.
The n-wise relations between the data points can be modelled by an n-order tensor, where each entry corresponds to an affinity measure, usually nonnegative, over an n-tuple of data points. In that way, NTF models for clustering n-wise relations were motivated.

Another example of the advantage of NTF over NMF can be found when these methods are applied to feature extraction from images. In this case, in order to apply NMF, the object images should be vectorized in order to find their nonnegative decomposition. This vectorization leads to information loss, since the local structure of the images is lost. In order to remedy this drawback of the NMF representation, n-order and 3-order Nonnegative Tensor Factorization (NTF) schemes were proposed (Hazan et al. 2005; Friedlander and Hatz 2006; Shashua and Hazan 2005). In Shashua and Hazan (2005), an n-order tensor factorization method was proposed, based on the Frobenius norm. In Hazan et al. (2005), an image database was represented as a 3-order tensor, i.e. a 3D cube that has the 2D images as slices. Update rules for the factors of the decomposition, which guarantee a nonincreasing behavior of the KL divergence, were proposed. The 3D NTF decomposition was proven to be more suitable for part-based object representation than NMF (Hazan et al. 2005). Examples, such as the decomposition of the Swimmer dataset (Donoho and Stodden 2004), which demonstrate the superiority of 3D NTF over NMF can be found in Hazan et al. (2005). A recent overview of NTF algorithms can be found in Zafeiriou (2009b).

Recently, there has been an increasing interest in NTF algorithms (Kim and Choi 2007; Cichocki et al. 2007, 2008; Mørup et al. 2008; Kim et al. 2008; Zafeiriou 2009a,b). In Kim and Choi (2007), Mørup et al. (2008), and Kim et al. (2008), algorithms for arbitrary order Tucker (1966) factorizations, where data, core and mode matrices are non-negative, were proposed. Multiplicative update rules for all the factors were also given. In order to reduce further the ambiguities of the decomposition, updates that can impose sparseness in any combination of modalities were proposed in Mørup et al. (2008). Moreover, the notion of nonsmoothness (Pascual-Montano et al.) for controlling the sparseness was extended to NTF algorithms in Kim and Choi (2007). Algorithms for 3D NTF using the PARAFAC2 decomposition (Kiers et al. 1999; Bro et al. 1999) were proposed in Cichocki et al. (2007, 2008). Finally, in Zafeiriou (2009a) the NMF method that uses the generalized Bregman divergence in Sra and Dhillon (2006) and the Discriminant-NMF (DNMF) method in Zafeiriou et al. (2006) were generalized to arbitrary nonnegative tensor decompositions.

The tensorization of well-established vector-based algorithms is not an easy procedure and is a very active research field. For example, in De Lathauwer et al. (2000) the Singular Value Decomposition (SVD) was extended to a multilinear SVD, in Lu et al. Principal Component Analysis (PCA) was extended to multilinear PCA, in Yan et al. and Tao et al. Fisher's Linear Discriminant Analysis (FLDA) was extended to various multilinear counterparts, Independent Component Analysis (ICA) was extended to multilinear ICA in Raj and Bovik (2008), and Canonical Correlation Analysis (CCA) to multilinear CCA in Kim and Cipolla.

In this paper, we study NTF using the KL divergence. The KL divergence is one of the most used distances for performing NMF and NTF and is of importance since it provides factorizations that share a lot of similarities with Expectation Maximization (EM) approaches. Moreover, such NMF and NTF approaches, in most cases,

lead to clear probabilistic interpretations. Algorithms for NTF based on the KL divergence were recently proposed in Zafeiriou (2009a). In Zafeiriou (2009a), generalizations of the NTF algorithms in Shashua and Hazan (2005) and of the NMF algorithms in Zafeiriou et al. (2006) and Sra and Dhillon (2006) were proposed. However, in Zafeiriou (2009a) no probabilistic interpretation of the NTF algorithms was given and the properties of the algorithms were not studied. In this paper, we interpret the NTF algorithm as an alternative Csiszár–Tusnády procedure, in a similar manner to Finesso and Spreij (2006), and we explore the convergence properties of the optimization procedure. This analysis leads to updating rules which guarantee that the limit point is stationary. Moreover, we investigate the relationship between Probabilistic Tensor Latent Variable Models (PTLVM) and NTF with the KL divergence, and we propose novel algorithms for PTLVM.

In Shashanka et al. (2008), Gaussier and Goutte (2005), and Ding et al. (2008), the relationship between Probabilistic Latent Variable Analysis (PLVA) models for bivariate distributions and NMF was explored. In Gaussier and Goutte (2005), a first attempt to connect NMF and PLSA was made. In Ding et al. (2008), it was shown that PLSA is similar, but not exactly equivalent, to NMF, and a hybrid algorithm for performing PLSA using NMF was proposed. In Shashanka et al. (2008), it was also noted that the approach can be extended to arbitrary order PTLVM, but no algorithm was proposed that could solve such problems. In this paper, we show the close connection between the algorithms in Finesso and Spreij (2006), Ding et al. (2008), Gaussier and Goutte (2005) and Shashanka et al. (2008), and we generalize them to arbitrary multivariate distributions and to arbitrary order tensors. Summarizing, the novel contributions of this paper are:

- We propose novel nonnegative tensor factorization algorithms based on the KL divergence, by interpreting the optimization problem as an alternative Csiszár–Tusnády procedure.
- We provide clear probabilistic interpretations for nonnegative tensor factorization, and study the convergence and other theoretical properties of the algorithm.
- We propose algorithms for symmetric and asymmetric probabilistic tensor latent variable analysis.
- All the algorithms are expressed using simple matrix operations.

The remainder of this paper is organized as follows. In Sect. 2, we briefly describe the problem of NMF with the KL divergence and comment on the relationship between NMF and PLVM. In Sect. 3, we briefly outline some elements of multilinear algebra and show how NMF algorithms can be extended to arbitrary order NTF. In Sect. 4, robust update rules for NTF with the KL divergence are proposed and their convergence theorems are presented. In Sect. 5, we present a probabilistic interpretation of the optimization problem and explore various properties of this approach. In Sect. 6, we show the equivalence between the proposed NTF algorithm and PTLVMs and propose algorithms for arbitrary order probability tensor decompositions. Experimental results which demonstrate the merits of the proposed approach are presented in Sect. 7. Finally, conclusions are drawn in Sect. 8. For completeness, extensive proofs of the various statements throughout the paper are given in the Appendices.

2 Nonnegative Matrix Factorization to NTF

In this section, we briefly describe how NMF is formulated. In the following, we consider a collection of L nonnegative vectors x_j ∈ R₊^F.

2.1 Nonnegative Matrix Factorization

For two vectors x = [x_1, x_2, …, x_F]^T and q = [q_1, q_2, …, q_F]^T, the KL divergence (or relative entropy) between them is defined as (Lee and Seung 2000):

KL(x ‖ q) = Σ_{i=1}^F ( x_i ln(x_i / q_i) + q_i − x_i ).   (1)

For convenience, we define 0/0 = 0 and 0 ln 0 = 0 (Finesso and Spreij 2006). It can be shown that, in general, the KL divergence is nonnegative and is equal to zero if and only if its two arguments are equal. The basic idea behind NMF is to approximate the object x by a linear combination of the elements of h ∈ R₊^M such that x ≈ Zh, where Z ∈ R₊^{F×M} is a nonnegative matrix the columns of which sum up to one. In order to measure the error of the approximation x ≈ Zh, the divergence KL(x ‖ Zh) may be used (Lee and Seung 2000).

NMF, when applied to the matrix X = [x_1 … x_j … x_L] ∈ R₊^{F×L} = [x_{ij}], aims at finding two matrices Z ∈ R₊^{F×M} = [z_{ik}] and H ∈ R₊^{M×L} = [h_{kj}] such that:

X ≈ ZH.   (2)

The vector x_j after the NMF decomposition can be written as x_j ≈ Zh_j, where h_j is the jth column of H. Thus, the columns of matrix Z define a basis and h_j consists of the corresponding weights. Decomposition (2) induces an approximation error which is the sum of the KL divergences for all x_j, i.e.:

D_KL(X ‖ ZH) = Σ_{j=1}^L KL(x_j ‖ Zh_j) = Σ_{i=1}^F Σ_{j=1}^L ( x_{ij} ln( x_{ij} / Σ_{k=1}^M z_{ik} h_{kj} ) + Σ_{k=1}^M z_{ik} h_{kj} − x_{ij} ),   (3)

which serves as the measure of the cost for factoring X into ZH (Lee and Seung 2000). Factorization (2) is the outcome of the following optimization problem:

min_{Z,H} D_KL(X ‖ ZH)  subject to  z_{ik} ≥ 0, h_{kj} ≥ 0, k = 1, …, M, i = 1, …, F, j = 1, …, L.
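The conventions 0/0 = 0 and 0 ln 0 = 0 matter in practice, since nonnegative data matrices are often sparse. The following is a minimal NumPy sketch (ours, not from the paper) of the generalized KL divergence of Eqs. (1) and (3) between two nonnegative arrays; the function name and the masking strategy are our own implementation choices.

```python
import numpy as np

def generalized_kl(x, q):
    """Generalized KL divergence sum(x*ln(x/q) + q - x) with the 0*ln(0) = 0 convention."""
    x = np.asarray(x, dtype=float)
    q = np.asarray(q, dtype=float)
    # Only accumulate the x*ln(x/q) term where x > 0, implementing 0*ln(0) = 0.
    mask = x > 0
    div = np.sum(x[mask] * np.log(x[mask] / q[mask]))
    return div + q.sum() - x.sum()
```

For matrices X, Z and H, `generalized_kl(X, Z @ H)` evaluates the cost (3) directly.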

NMF has nonnegativity constraints on both the elements of Z and those of H; these nonnegativity constraints permit the combination of multiple basis components in order to represent the original nonnegative vectors using only additions between the different basis components. The following update rules guarantee a nonincreasing behavior of the KL divergence (3) (Lee and Seung 1999, 2000):

z_{ik}^{(t)} = z_{ik}^{(t−1)} ( Σ_{m=1}^L h_{km}^{(t−1)} x_{im} / [Z^{(t−1)} H^{(t−1)}]_{im} ) / ( Σ_{m=1}^L h_{km}^{(t−1)} )   (4)

h_{kj}^{(t)} = h_{kj}^{(t−1)} ( Σ_{l=1}^F z_{lk}^{(t−1)} x_{lj} / [Z^{(t−1)} H^{(t−1)}]_{lj} ) / ( Σ_{l=1}^F z_{lk}^{(t−1)} ).   (5)

2.2 Alternative multiplicative update rules of NMF using the KL divergence and a probabilistic interpretation

In Finesso and Spreij (2006), an alternative algorithm for solving the optimization problem of minimizing (3) under nonnegativity constraints was proposed. That is, the problem was transformed into the factorization of a probability matrix P with P(i, j) ≥ 0 and Σ_{i=1}^F Σ_{j=1}^L P(i, j) = 1. In the following, for denoting the elements of matrix P we shall use either p_{ij} or P(i, j). The problem was formulated as follows:

min_{Q_−, Q_+ : Q_+ e = e, e^T Q_− e = 1} D_KL(P ‖ Q_− Q_+)   (6)

(for notational simplicity, we shall use e to denote a vector of ones of arbitrary dimension). For the above constrained optimization problem, cost function (6) simplifies to:

D_KL(P ‖ Q_− Q_+) = Σ_{i=1}^F Σ_{j=1}^L P(i, j) ln( P(i, j) / Σ_{k=1}^K Q_−(i, k) Q_+(k, j) ).   (7)

The update rules for solving the above optimization problem are:

Q_−^{(t)}(i, k) = Q_−^{(t−1)}(i, k) Σ_{m=1}^L Q_+^{(t−1)}(k, m) P(i, m) / [Q_−^{(t−1)} Q_+^{(t−1)}]_{im}   (8)

and

Q_+^{(t)}(k, j) = Q_+^{(t−1)}(k, j) ( Σ_{i=1}^F Q_−^{(t−1)}(i, k) P(i, j) / [Q_−^{(t−1)} Q_+^{(t−1)}]_{ij} ) / ( Σ_{m=1}^F Σ_{l=1}^L Q_−^{(t−1)}(m, k) Q_+^{(t−1)}(k, l) P(m, l) / [Q_−^{(t−1)} Q_+^{(t−1)}]_{ml} ).   (9)
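Before returning to the normalized formulation, here is a minimal NumPy sketch (ours, not from the paper) of the plain multiplicative updates (4)–(5); the function name, random initialization and the small constant added to avoid division by zero are implementation choices, and the normalized rules (8)–(9) differ from these only in the constraints that are enforced.

```python
import numpy as np

def nmf_kl(X, M, n_iter=200, eps=1e-12, rng=None):
    """Lee-Seung style multiplicative updates for min_{Z,H >= 0} D_KL(X || Z H)."""
    rng = np.random.default_rng(rng)
    F, L = X.shape
    Z = rng.random((F, M)) + eps
    H = rng.random((M, L)) + eps
    for _ in range(n_iter):
        R = Z @ H + eps                                        # current reconstruction
        Z *= (X / R) @ H.T / (H.sum(axis=1) + eps)             # rule (4)
        R = Z @ H + eps
        H *= Z.T @ (X / R) / (Z.sum(axis=0)[:, None] + eps)    # rule (5)
    return Z, H
```

The nonincreasing behavior of the cost (3) along the iterations can be monitored with the `generalized_kl` helper sketched earlier.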

7 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 425 Update rule 9 may be equivalently written as Q t F + k, = Qt + k, i= Q t / i, kpi, N ] [ Q t Qt + Matrices P, Q and Q + are related with X, Z and H through: i Q t l, k. 0 P = e T Xe X, Q = e T Ze Z, Q + = H. The above updating rules not-only guarantee the nonincreasigness of cost function D KL P Q Q + but also guarantee the stationarity of the limit point. By transforming the optimization problem 4 into the factorization of a probabilities matrix, not only helped in proving stationarity, but also revealed a lot of interesting properties and interpretations. The generalization of these properties to arbitrary order tensor factorizations will be explored in this paper. Before proceeding we will briefly comment on an alternative factorization introducing another vector a which is also computed during the procedure. The optimization problem is: min D KL P Q AQ + 2 Q,Q + :Q + e=e,q T e=e,at e= where A = diaga and diaga returns a diagonal matrix that has in its main diagonal the elements of vector a. The corresponding update rules are given by: and Q t i, k = Qt i, k Q t Q i, k = t Q t + k, = Qt Q t [ ] M A t Q t + Pi, m [ km ] A t Q t m= i, k Fm= Q t m, k, [ F + k, i= Q t Q t At ] ik + [ Q t At Q t + im Pi, ] i 3 + k, = Q + t k, Lm= Q t 4 m, k, a t k = a t k + F,L i=, = Pi, Q t + k, Qt k, [ ]. 5 Q t At Q t + i

8 426 S. Zafeiriou, M. Petrou In Gaussier and Goutte 2005, Ding et al. 2008, Shashanka et al. 2007, 2008, and Smaragdis and Ra 2007 the close relationship between NMF with the KL divergence, PLSA and other PLVM approaches was commented. In Gaussier and Goutte 2005, it was shown how NMF problems may be interpreted as PLSA Hofmann 999. In Ding et al the authors showed that NMF and PLSA even though they minimize similar obective functions they do not converge to the same limit point. Moreover, in Shashanka et al. 2007, 2008 the relationship between PLVM and NMF algorithms was further studied. Here we shall try to provide an interpretation optimization problem 6 and update rules 8, 9 and 0 in terms of PLVM. Moreover, we will show that the constrained optimization problem 2 which is solved via update rules 3, 4 and 5 is exactly the same as PLSA algorithm in Hofmann 999. Latent class models, like PLSA, enable one to attribute the observations to hidden or latent factors. The main characteristic of these models is that conditionally independent bivariate data are modelled as belonging to latent classes, such that the random variables within a latent class are independent of one another. Random variables X, X 2 can be thought of as if they were defined in the canonical measurable space Ω, F, where Ω is the set of all pairs i, i 2 and F = 2 Ω i.e., the powerset of Ω. Let P be a probabilistic measure on this space. Then, the distribution of the pair X, X 2 under P is given by matrix P =[p i ]. We shall use the notation Px, x 2 for distribution Px = i, x 2 = = p i. Distribution Px, x 2 is modelled as a mixture, where each component of the mixture is the product of one-dimensional marginal distributions. Let x and x 2 be random variables and let Px, x 2 be their oint probability mass function. The PLSA approximation may be written as: Px, x 2 z {,...,K } Px, zpx 2 z 6 where z is a latent random variable. In an EM Hofmann 999 EM approach we maximize the following functional: D PLSA Px, x 2 Px, zpx 2 z I,I 2 K = Px = i, x 2 = i 2 ln Px =, z = i 3 Px 2 = i 2 z = i 3 i =,i 2 = i 3 = = D KL Px, x 2 Px, zpx 2 z c 7 where c is a constant c = I,I 2 i =,i 2 = Px = i, x 2 = i 2 lnpx = i, x 2 = i 2. Thus, as can be seen is equivalent to minimization of D KL Px, x 2 Px, z. Solving maximization of 7Hofmann 999 we have the following updating rules: P t P t x, zp t x 2 z z x, x 2 = y {,2,...,K } Pt x, yp t x 2 y 8

9 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 427 and P t x, z = a 2 Px, a 2 P t z x, a 2 P t x 2 z = a Pa, x 2 P t z a, x 2 a,a 2 Pa, a 2 P t z a, a 2. 9 Now by substituting 8 into9 wehave: P t x, z = P t x, z P t a 2 z Px, a 2 a 2 y {,2,...,K } Pt x, yp t a 2 y P t x 2 z = P t a Pa, x 2 P t a,z y {,2,...,K } x 2 z Pt a,yp t x 2 y 20 P a,a 2 Pa, a 2 t a,zp t a 2 z = P t a Pa, x 2 x 2 z P t a,yp t a 2 y y {,2,...,K } P t a,z y {,2,...,K } Pt a,yp t x 2 y a P t a, z Then, since F i= Mk= Px = i, z = k = and L = Px 2 = z = k = which are equivalent to e T Q e = and Q + e = e, respectively we can see that Q + i, k = Px = i, z = k and [Q ] T, k = Px 2 =, z = k. Thus, the update rules 8, 9 and 0 are exactly the same as the EM algorithm 20. In case that we that want to calculate the probabilities Pz [which is computed in the PLSA algorithm in Hofmann 999] as well then approximation 2 is reformulated as: Px, x 2 z {,...,K } PzPx zpx 2 z 2 and optimization problem is the maximization of D PLSA Px, x 2 Px, zdiagpz Px 2 z which is the PLSA model proposed in Hofmann 999. The corresponding update rules are given by: P t x z = P t a 2 Px, a 2 x z a,a 2 Pa, a 2 P t x 2 z = P t a Pa, x 2 x 2 z a,a 2 Pa, a 2 P t zp t a 2 z y {,2,...,K } Pt yp t x yp t a 2 y P t zp t a 2 z y {,2,...,K } Pt yp t a yp t a 2 y P t zp t a z y {,2,...,K } Pt yp t a yp t x 2 y P t zp t a zp t a 2 z y {,2,...,K } Pt yp t a yp t a 2 y 22.

and

P^{(t)}(z) = P^{(t−1)}(z) Σ_{a_1, a_2} [ P(a_1, a_2) P^{(t−1)}(a_1 | z) P^{(t−1)}(a_2 | z) / Σ_{y ∈ {1,2,…,K}} P^{(t−1)}(y) P^{(t−1)}(a_1 | y) P^{(t−1)}(a_2 | y) ].   (23)

We observe that Σ_{i=1}^F P(x_1 = i | z = k) = 1, Σ_{j=1}^L P(x_2 = j | z = k) = 1 and Σ_{k=1}^K P(z = k) = 1, which are equivalent to Q_−^T e = e, Q_+ e = e and a^T e = 1, respectively, so that Q_−(i, k) = P(x_1 = i | z = k), [Q_+]^T(j, k) = P(x_2 = j | z = k) and a(k) = P(z = k). Thus, update rules (13), (14) and (15) are exactly the same as the PLSA algorithm given by (22) and (23).

2.3 Relationship between PLSA, the original NMF and the approach proposed in Ding et al. (2008)

The relationship between NMF and PLSA has been commented upon in Gaussier and Goutte (2005), Ding et al. (2008), and Shashanka et al. (2008). Now, we will discuss the actual difference between NMF and PLSA and complete the analysis initiated in Ding et al. (2008). In Ding et al. (2008) the authors proposed an algorithm for solving approximation (16) using the actual NMF update rules (4) and (5). After the computation of Z and H, they computed the matrices Q_−(i, k) = P(x_1 = i | z = k) and Q_+(k, j) = P(x_2 = j | z = k) and the vector a(k) = P(z = k) (the diagonal of A) as:

Q_− = Z [diag(e^T Z)]^{−1},  Q_+ = [diag(H e)]^{−1} H,  A = diag(e^T Z) diag(H e).   (24)

As the authors showed, the above procedure results in a different limit point than the one derived from update rules (22) and (23). What the authors actually showed in Ding et al. (2008) is that the algorithm converges to one limit point when the factors are normalized at every iteration, so that the constraints Q_−^T e = e, Q_+ e = e and a^T e = 1 are satisfied, and to another limit point when the normalization is applied only once, at the end of the procedure (i.e., at convergence). This is to be expected, since these are two different optimization problems. The first one is the unconstrained NMF optimization problem, with free variables Z and H, whose solution is given by update rules (4) and (5). The second one is the constrained optimization problem (12), with free variables Q_−, Q_+ and a, subject to the additional constraints Q_−^T e = e, Q_+ e = e and a^T e = 1, whose solution is computed by update rules (13), (14) and (15). A similar analysis also holds for the optimization problem (6). That is, even though (12) and (6) are similar, they are not equivalent, and thus they do not converge to the same limit point.

Summarizing, we provided a supplement to the analysis given in Ding et al. (2008) and we showed that, even though the optimization problems solved by (4)–(5), (6) and (12) optimize similar objective functions, they are not equivalent, since they are optimization problems with different sets of variables (for example, in PLSA we search for an additional vector a) and with different constraints, and thus a different search space. Only the constrained NMF given by optimization problem (12) and solved by update rules (13), (14) and (15) is equivalent to PLSA, and both converge to the same limit point.
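To make the distinction concrete, the following NumPy sketch (ours, not from the paper) contrasts PLSA-style EM for the model P(x_1, x_2) ≈ Σ_z P(z) P(x_1|z) P(x_2|z), i.e. updates of the form (22)–(23), with the post-hoc normalization of Eq. (24) applied to an ordinary NMF solution. In general the two reach different limit points, which is precisely the point of this subsection. Function and variable names are ours, and the sketch assumes P is a probability matrix and that Z and H have no all-zero columns or rows.

```python
import numpy as np

def plsa_em(P, K, n_iter=200, eps=1e-12, rng=None):
    """EM for P(x1,x2) ~= sum_z P(z) P(x1|z) P(x2|z); P is an F x L probability matrix."""
    rng = np.random.default_rng(rng)
    F, L = P.shape
    Px1_z = rng.random((F, K)); Px1_z /= Px1_z.sum(axis=0)   # columns sum to one
    Px2_z = rng.random((L, K)); Px2_z /= Px2_z.sum(axis=0)
    Pz = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|x1,x2), shape (F, L, K)
        joint = Pz[None, None, :] * Px1_z[:, None, :] * Px2_z[None, :, :]
        resp = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: reweight by the data and renormalize
        Nz = np.einsum('il,ilk->k', P, resp)                  # unnormalized P^t(z)
        Px1_z = np.einsum('il,ilk->ik', P, resp) / (Nz + eps)
        Px2_z = np.einsum('il,ilk->lk', P, resp) / (Nz + eps)
        Pz = Nz / (Nz.sum() + eps)
    return Px1_z, Px2_z, Pz

def normalize_nmf_factors(Z, H):
    """Post-hoc normalization of Eq. (24): turn X ~= Z H into PLSA-style factors."""
    col = Z.sum(axis=0)          # e^T Z
    row = H.sum(axis=1)          # H e
    Q_minus = Z / col            # candidate P(x1|z): columns sum to one
    Q_plus = H / row[:, None]    # candidate P(x2|z): rows sum to one
    a = col * row                # diagonal of A; sums to one if X was normalized
    return Q_minus, Q_plus, a
```

Running `plsa_em` on a normalized matrix and `normalize_nmf_factors` on the output of the unconstrained NMF updates typically yields different factorizations of the same data, as the analysis above predicts.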

3 Nonnegative Tensor Factorization

In this section we briefly describe elements of multilinear algebra and show how to perform nonnegative tensor factorizations.

3.1 Tensor models and the PARAFAC decomposition

To begin with, let us briefly review the tensor algebra concepts needed hereafter and define the notation that will be used throughout the paper. An n-order tensor is a collection of measurements indexed by n indices, where each index refers to a mode. Accordingly, vectors are first-order tensors and matrices are second-order tensors (Kolda and Bader 2009). Lowercase letters (e.g. x) have already been used to denote scalars in the introduction, while boldface lowercase letters (e.g. x) and boldface capital letters (e.g. X) have denoted vectors and matrices, respectively. Higher-order tensors (of order 3 or higher) are denoted by boldface Euler script (calligraphic) letters (e.g. X). If the ith element of a vector x ∈ R₊^I is denoted by x_i, i = 1, 2, …, I, then the elements of an n-order tensor X are denoted by x_{i_1 i_2 … i_n}, i_l = 1, 2, …, I_l, l = 1, 2, …, n. In the following, we shall focus on nth order tensors with nonnegative elements, i.e. X ∈ R₊^{I_1 × I_2 × … × I_n}. For example, X could be the representation of a database with L objects. Then every database object is a nonnegative tensor of order n − 1, denoted as X_{:…:i_n} ∈ R₊^{I_1 × I_2 × … × I_{n−1}}, i_n = 1, 2, …, L, that is indexed by an (n − 1)-tuple of indices i_1, i_2, …, i_{n−1}. To be more specific, the most natural way to model a facial image database is by a 3rd-order tensor X ∈ R₊^{I_1 × I_2 × I_3}, where I_1 and I_2 refer to the image height and width, respectively, and I_3 = L is the number of images in the database (Shashua and Hazan 2005). A relationship that refers to more than two modes is more naturally represented by a tensor X. Such relationships could be nonnegative affinity measures over an n-tuple of patterns that frequently arise in visual interpretation problems, including 3D multi-body segmentation and illumination-based clustering of human faces (Shashua and Hazan 2005). An n-variate distribution is likewise more naturally modelled as an n-order tensor.

Frequently, transformations of tensors into matrices (l-mode matricization) and of matrices into vectors (vectorization) are needed. The mode-l matricization of a tensor X ∈ R₊^{I_1 × I_2 × … × I_n} maps X to a matrix X_(l) ∈ R^{I_l × M} with M = Π_{m=1, m≠l}^n I_m, such that the tensor element x_{i_1 i_2 … i_n} is mapped to the matrix element x_{i_l j}, where

j = 1 + Σ_{k=1, k≠l}^n (i_k − 1) J_k,  with  J_k = Π_{m=1, m≠l}^{k−1} I_m.
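As a sanity check of the index convention above, here is a small NumPy sketch (ours) of mode-l matricization and its inverse; it relies on the fact that, with this ordering, unfolding amounts to moving mode l to the front and reshaping in Fortran (column-major) order. The function names and the 0-based mode indexing are our own choices.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` matricization X_(mode) of an n-order tensor X (0-based mode index)."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def fold(Xmat, mode, shape):
    """Inverse of `unfold`: rebuild the tensor of the given shape from X_(mode)."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(np.reshape(Xmat, full, order='F'), 0, mode)
```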

The operator vec applied to a matrix stacks the matrix columns into a single vector. The matrix of ones and the identity matrix are denoted by E and I, respectively; a tensor of ones is similarly denoted by E. A tensor which has all elements equal to 0, except those for which all indices are the same, is called superdiagonal. If the nonzero elements are equal to 1, then such a superdiagonal tensor is referred to as the unit superdiagonal tensor I. In the remainder of the paper, we shall also use tensor operators for elementwise (Hadamard) multiplication and division between tensors.

In the following, we shall need several products between vectors and matrices, as well as between tensors and matrices. Let a ∈ R₊^I and b ∈ R₊^J be two nonnegative real-valued vectors. Their outer product yields a matrix C ∈ R₊^{I×J}:

C = a ∘ b,  with elements c_{ij} = a_i b_j.   (25)

Consequently, the outer product of n vectors a_l ∈ R₊^{I_l}, l = 1, 2, …, n, written a_1 ∘ a_2 ∘ … ∘ a_n = ∘_{l=1}^n a_l, yields a tensor A ∈ R₊^{I_1 × I_2 × … × I_n}. The Kronecker product between two matrices A ∈ R₊^{I_1 × M_1} and B ∈ R₊^{I_2 × M_2} is defined as:

A ⊗ B = [ a_{ij} B ]_{i=1,…,I_1; j=1,…,M_1} = [ a_1 ⊗ b_1  a_1 ⊗ b_2 … a_1 ⊗ b_{M_2}  a_2 ⊗ b_1 … a_{M_1} ⊗ b_1 … a_{M_1} ⊗ b_{M_2} ]   (26)

and yields a nonnegative matrix of size I_1 I_2 × M_1 M_2. Accordingly, the Kronecker product of n matrices A_l ∈ R₊^{I_l × M_l}, l = 1, 2, …, n, denoted compactly as ⊗_{l=1}^n A_l, yields a matrix of size (Π_{l=1}^n I_l) × (Π_{l=1}^n M_l). The Khatri-Rao product between two matrices A = [a_1 a_2 … a_K] ∈ R₊^{I×K} and B = [b_1 b_2 … b_K] ∈ R₊^{J×K} results in a matrix of size IJ × K. It is defined as the matching columnwise Kronecker product of the aforementioned matrices, i.e.

A ⊙ B = [a_1 ⊗ b_1  a_2 ⊗ b_2  …  a_K ⊗ b_K].   (27)

If we have n matrices A_l ∈ R₊^{I_l × K}, l = 1, 2, …, n, their Khatri-Rao product is compactly denoted as ⊙_{l=1}^n A_l and yields a nonnegative matrix of size (Π_{l=1}^n I_l) × K.

The two most commonly used tensor decompositions are the PARAFAC/CANDECOMP model (Harshman 1970; Carroll and Chang 1970), which is also equivalent to the decomposition using rank-1 Kruskal tensors (Kruskal 1977), and the Tucker tensor models (Tucker 1966). Another way to perform the decomposition, especially for 3D tensor factorization, is given through the PARAFAC2 model (Smilde et al.), which has been applied to 3D NTF in Cichocki et al. (2007, 2008). An n-order tensor X is of rank at most K if it can be expressed as the sum of K rank-1 Kruskal tensors,

Fig. 1 Visualization of the rank-K approximation of a 3-order tensor using Kruskal tensor notation

i.e., as a sum of K n-fold outer products. Using the PARAFAC tensor model, a tensor X can be decomposed as the sum of K n-fold outer products as:

X ≈ Σ_{l=1}^K ∘_{j=1}^n u_j^l,  i.e.  x_{i_1 … i_n} ≈ Σ_{l=1}^K u_{1,i_1}^l … u_{n,i_n}^l,  1 ≤ i_j ≤ I_j, 1 ≤ j ≤ n,   (28)

with u_j^l ∈ R₊^{I_j} and u_j^l = [u_{j,1}^l, …, u_{j,I_j}^l]^T. That is, NTF aims at finding the best rank-K nonnegative approximation of X with respect to an approximation cost. The NTF factorization using Kruskal tensors is pictorially described in Fig. 1. As can be seen, the tensor X ∈ R₊^{I_1 × I_2 × I_3} is represented as the sum of K outer-product tensors u_1^l ∘ u_2^l ∘ u_3^l. Decompositions like (28) can be calculated using only matrix operations. To do so, let us define the matrices U_j = [u_j^1 … u_j^K], which contain the Kruskal vectors as columns, j = 1, …, n, or equivalently U_j = [u_{j,i}^l] ∈ R₊^{I_j × K}, 1 ≤ i ≤ I_j, 1 ≤ j ≤ n, 1 ≤ l ≤ K. These matrices will be used for defining NTF algorithms using matrix multiplications. Let us introduce the compact notation

U^{⊙(−j)} ≜ U_n ⊙ … ⊙ U_{j+1} ⊙ U_{j−1} ⊙ … ⊙ U_1,   (29)

i.e., the Khatri-Rao product of all factor matrices except U_j. Using the compact notation (29), the nonnegative factorization (28) can be written in matricized form as:

X_(j) ≈ R_(j) = U_j Z_j^T = U_j ( U^{⊙(−j)} )^T,   (30)

where the matrices R_(j) ∈ R₊^{I_j × Π_{i=1, i≠j}^n I_i} and Z_j = U^{⊙(−j)} ∈ R₊^{(Π_{i=1, i≠j}^n I_i) × K} will be helpful in defining algorithms for the above factorization. The NMF problem X ≈ ZH (or x_j ≈ Zh_j) in Lee and Seung (2000) can easily be derived from (28) by selecting Z = U_1 and H = U_2^T. In the rest of the paper, the PARAFAC tensor decomposition and the decomposition using the Kruskal tensor model will be used interchangeably to denote the same tensor decomposition.
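The matricized form (30) is what makes the algorithms implementable with ordinary matrix algebra. Below is a small NumPy sketch (ours, not from the paper) of the Khatri-Rao product and of the reconstruction R_(j) = U_j Z_j^T; it assumes the `unfold` helper sketched earlier, 0-based mode indices, and factor matrices that all share K columns.

```python
import numpy as np
from functools import reduce

def khatri_rao(matrices):
    """Columnwise Kronecker (Khatri-Rao) product of matrices sharing K columns."""
    K = matrices[0].shape[1]
    def kr(A, B):
        # Column l of the result is the Kronecker product of column l of A and of B.
        return (A[:, None, :] * B[None, :, :]).reshape(-1, K)
    return reduce(kr, matrices)

def parafac_reconstruct_unfolded(U, mode):
    """R_(mode) = U[mode] @ Z_mode^T, with Z_mode the Khatri-Rao of all other factors."""
    # Reverse order (U_n, ..., skipping U_mode, ..., U_1) to match the unfolding convention.
    others = [U[j] for j in reversed(range(len(U))) if j != mode]
    return U[mode] @ khatri_rao(others).T
```

For a tensor built exactly as a sum of outer products of the columns of U, `unfold(X, j)` and `parafac_reconstruct_unfolded(U, j)` agree up to floating-point error, which is a convenient check of the ordering conventions.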

14 432 S. Zafeiriou, M. Petrou 3.2 Nonnegative Tensor Factorization algorithms for KL-divergence The first method for NTF was proposed in Paatero and Tapper 994. The NMF methods proposed in Lee and Seung 2000 were generalized using PARAFAC tensor decomposition in Shashua and Hazan Recently, in Kim and Choi 2007 and Mørup et al NTF methods using Tucker decomposition were proposed. Moreover, in Cichocki et al PAFARAC2 models for 3D NTF were also proposed. InHazan et al the authors proposed an NTF algorithm using the KL-divergence for 3-order tensors. NTF with the KL-divergence for arbitrary order tensors was proposed in Zafeiriou 2009a. In the general n-order tensor case, the KL divergence between the given tensor X and the sum of rank- tensors is: D KL X K n = ul = I,...,I n x i...i n ln x i...i n Km= n= i =,...,i n = ui m K n x i...i n +. 3 m= = u m i The optimization problem of this NTF decomposition is: min u m i D KL X K m= n = um subect to u m i 0, i I, n, m K 32 The solution of the above optimization problem was calculated via the use of an auxiliary function. W is an auxiliary function of Y F if W F, F t Y F and W F, F = Y F. IfW is an auxiliary function of Y, then Y is nonincreasing under the update F t = arg min F W F, F t Lee and Seung With the help of the auxiliary function, the update rules for U can be derived. By fixing matrices U,...,U, U +,...,U n, the elements of matrix U are updated by minimizing Y U = D KL X K m= n = um we define function: = D KL X U U l W U t, U t I,...,I n x i i =,...,i n =...i n lnx i...i n x i...i n + I,...,I n x u K i m t n r= u m i r r r = i i =,...,i n =...i n m= K u l t nr= ln i u l i r r ui m t n r= ui m r r r = ln K u l t i r= r = u l i r r + I,...,I n r = T. For matrix U u l t n i K i =,...,i n = m= um i t n r= r = r= r = u l i r r u m i r r. 33

15 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 433 Function 33 is an auxiliary function for 3. A proof of the above statements is given in Zafeiriou 2009a. Let us define the compact notation and n r=,t i I I 2 i = i 2 = u m i r,r um i t n r=,r =,t I I + i = i + = u m i r r I n i n =. 34 u m i t u m i t u m i t u m i + + t u m i n n t. 35 The above calculates the product n r= ui m r r where for r =,..., weusethe estimate of ui m r r at time t while for the rest of them we use the estimate at time t. Having defined the above compact notations the update rules that can guarantee a non increasing behavior of cost function 3 for factors ui m t, are obtained by solving W U t,u t = 0 and are given by: u m i u m i t = u m i t nr=,r = ui m,t i x i...i n K nr= u l ir r,t nr= i ui m,t. 36 The above update rules can be formulated using matrix operations. To do so, let us define notation Z t U n t U t + Ut Ut 37 and [ R t U t Z t ] T. 38 Using the above definitions, the update rules 36 are given by: U t = U t Z t T X R t t EZ. 39 If we allow a total of r iterations, the complexity of the NTF algorithm is O rnk n = I.
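Putting the pieces together, the following NumPy sketch (ours, not the authors' implementation) runs the multiplicative NTF update of Eq. (39), U_j ← U_j ⊛ [ (X_(j) / R_(j)) Z_j ] / (E Z_j), cycling over the modes. It reuses the `unfold`, `khatri_rao` and `generalized_kl` helpers sketched above; the random initialization, fixed iteration count and the small constant guarding divisions are arbitrary choices.

```python
import numpy as np

def ntf_kl(X, K, n_iter=200, eps=1e-12, rng=None):
    """Multiplicative updates for min D_KL(X || sum of K rank-1 nonnegative tensors)."""
    rng = np.random.default_rng(rng)
    U = [rng.random((I, K)) + eps for I in X.shape]
    for _ in range(n_iter):
        for j in range(len(U)):
            others = [U[m] for m in reversed(range(len(U))) if m != j]
            Z = khatri_rao(others)                  # (product of other dims) x K
            Xj = unfold(X, j)                       # I_j x (product of other dims)
            R = U[j] @ Z.T + eps                    # current reconstruction, unfolded
            U[j] *= ((Xj / R) @ Z) / (Z.sum(axis=0) + eps)   # rule (39)
    return U

# The KL cost can be monitored per sweep, e.g.:
# cost = generalized_kl(unfold(X, 0), parafac_reconstruct_unfolded(U, 0))
```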

16 434 S. Zafeiriou, M. Petrou InSra and Dhillon 2006 exponential rules for NMF using the KL divergence were proposed. The exponential rules for arbitrary valence NTF were proposed in Zafeiriou 2009a. For the exponential rules the following problem was considered: K min D KL ui m m= n = um X subect to ui m 0, i I, n, m K. 40 The update rules that guarantee a nonincreasing behavior of the above cost function are the following Zafeiriou 2009a: i nr=,r = u l i u l t i = u l t r r ln exp,t i or in matrix notation: U t = U t exp ln x i...in nm= nr= uir m r,t nr= i u l i r r,t X R t t EZ Z t T NTF with KL divergence as an alternative Csiszar Tusnady procedure and convergence properties In this section we will formally define the proposed NTF with KL divergence, provide a clear probabilistic interpretation and explore its convergence properties. First, let us consider the optimization problem with constraints: min U,...,U n :e T U e=,u 2 T e=e,...,u n T e=e D KL X K n = ul Proposition 2. The NTF optimization problem 43 has a solution.. 43 The proof of the above proposition can be found in Appendix 0. Before starting to explore the properties of NTF with KL Divergence, we should try to find the resemblance between the robust NMF method proposed in Finesso and Sprei 2006 and NTF decomposition. NMF is a special case of NTF when having 2-order tensors i.e. having only U and U 2, thus the NTF optimization problem is exactly the one in Finesso and Sprei 2006 under the constraints e T U e = and U2 T e = e. The generalized optimization problem for NTF using KL divergence is the one expressed by 43. In order to define the robust NTF algorithm we should reformulate the optimization problem using probability matrices. That is, we shall define the probability tensor as

17 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 435 P such that p i i 2...i n 0 and I,I 2,...,I n i =,i 2 =,...,i n = p i i 2...i n =. For convenience, we shall use the notation Pi,...,i n interchangeably with p i...i n. Let p I,...,I n i =,...,i n = x i...i n and w e T U e then we define P p X, Q w U and Q i U i for all i = 2,...,n. As proven in Appendix cost function D KL X K n = ul can be reformulated as: D KL X K = pd KL K n = ul P n = ql + D KL p w. 44 Since number p is known, the minimization of D KL X K n = ul with respect to U,...,U n is equivalent to minimizing D KL P K n = ql with respect to Q,,Q n and minimizing D KL p w with respect to w.forthenew optimization problem, we obtain w opt = p and so U opt = pq opt and U opt Q opt, = 2,...,n. 2 Thus, minimization of D KL X K n = ul = subect to sub- nonnegativity constraints is equivalent to minimizing D KL P K n = ql ect to nonnegativity constraints. This leads to the minimization of the KL-Divergence between finite probability measures: D KL P K n = ql = I,...,I n i =,...,i n = D KL p i,...,i n K n q i l = = I,...,I n p i i =,...,i n =,...,i n log K n= qi l. 45 p i,...,i n Given a probability tensor P and an integer K, the optimization problem is to find Q,...,Q n : min Q,...,Q n :e T Q e=,q 2 T e=e,...,q n T e=e D KL P K n = ql The superscript opt reflects the optimality.

18 436 S. Zafeiriou, M. Petrou We define the function: W + U t, U t = I,...,I n i =,...,i n = I,...,I n K p i...i n i =,...,i n = m= ln p i...i n lnp i...i n qi m t m n r= q ir r r = K qi l t nr= qi l r r r = ln qi l t n qi m t n r= qi m r r r = K qi l t lr=. 47 q ir r r = This function 47 is an auxiliary function for 45. Let Q t w U t and Q t 2 U t 2,, Qt n r= r = q l i r r U t n. Now by substituting the definition of P, Q t,...,q n t into 36 and using the fact that w opt = p, the following update rules for t are obtained for factors q l i : qi l t = q l t i i p i...i n n=2 qi l,t Km= n=,t q m i 48 and for {2,...,n} qi l t = q l t i nr=,r = q l ir r,t i p i...i n Km= nr= qir m r,t n=2 qi l I,...,I n i =,...,i n = p,t i...i n Km= n= qi m,t Update rule 49 can be equivalently written as: qi l t = q l t i nr=,r = qir l,r,t i p i...i n Km= nr= qir m,r,t I i = ql t i, In a matrix notation the above update rules can be written as: P Q t = Q t R t Z t T Zt T

19 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 437 and Q t = Q t P R t Z t T Z t T EQ t 52 where Z t = Q n t Q t + Qt Qt and Rt = Q t Z t The following Theorems guarantee that the limit point of the update rules 48 and 49 exists and is a stationary point of optimization problem 46. Before presenting the convergence theorems we should introduce the following sets. Given a probability tensor P and an integer K, we introduce the following sets: { V R I K I 2 I n + : I ℵ } K Vi, l, i 2,...,i n = P { Q R I K I 2 I n + : Qi, l, i 2,...,i n = Q i, lq 2 i 2, l...q n i n, l } Q,...,Q n 0, e T Q e =, Q2 T e = e,...,qt n e = e { } K O R I I 2 I n + : O = Qi, l, i 2,...,i n for some Q I 53 Theorem 2. Let Q t,...,qt n be generated by algorithm 48, 49 and Q t Q t i, l, i 2,...,i n = Q t i, lq t 2 i 2, l...q t n i n, l the corresponding tensor. Then, Q t i, l converges to limit Q i, l and Q t converges to the limit Q in Q, Q t i, l for = 2,...,n converge to limits Q i, l for all l with I i = Q i, l >0. T. The proof can be found in Appendix 2. The role of tensor Q will be made clear in the next section. Theorem 2.2 If Q,...,Q n is a limit point of the algorithm given by update rules 48 and 49 in the interior of the domain, then it is a stationary point of the obective function D KL 46. If Q,...,Q n is a limit point on the boundary of the domain corresponding to an approximate factorization where none of the columns of Q is zero I i = Q i, l >0 l, then all partial derivatives are nonnegative. D KL Q i,l for =,...,n The proof can be found in Appendix 3. Corollary 2. shows that the limit points of the above optimization problem using the update rules 48 and 49 are all Kuhn-Tucker points.

20 438 S. Zafeiriou, M. Petrou Corollary 2. The limit points of the algorithm with I i = Q i, l > 0 for all l are all the Kuhn-Tucker points for the minimization of D KL under the inequality constraints Q 0,...,Q n 0. The proof of the above Corollary can be found in Appendix 4. 5 A probabilistic interpretation of the optimization problem In this section we explore the properties of optimization problem 46. More precisely we: provide a solid probabilistic interpretation of the optimization problem; propose generalized pythagorean rules; motivate the use of the auxiliary function Setup and exact NTF Let us consider the n + -tuple of random variables Y, X, Y 2,...,Y n, taking values in {,...,I } {,...,K } {,...,I 2 } {,...,I n }. The random variables Y, X, Y 2,...,Y n can be thought of as if they were defined in the canonical measurable space Ω, F, where Ω is the set of all n + -ples i, l, i 2,...,i n and F = 2 Ω i.e., the powerset of Ω. For ω = i, l, i 2,...,i n we have the identity mapping Y, X, Y 2,...,Y n ω = i, l, i 2,...,i n.ifr is a given probabilistic measure on this space, the distribution of the n + -tuple Y, X, Y 2,...,Y n under R is given by tensor R: Ri, l, i 2,...,i n = RY = i, X = l, Y 2 = i 2,...,Y n = i n. 54 Tensors V and Q define probability measures P and Q on Ω, F. Obviously, the sets and I are subsets of the set of all measures on Ω, F. In particular, corresponds to the subset of all measures whose Y = Y,...,Y n marginal coincides with the given P, while I corresponds to the subset of measures under which Y,...,Y n are conditionally independent given X. The first assertion can be proven by definition. The second assertion is proven as follows. First, notice that QY = i, X = l, Y 2 = i 2,...,Y n = i n = Qi, l, i 2,...,i n. Then, by summing over all i with = 2,...,n we have QY = i, X = l = Q i, l, since Q T e = e for = 2,...,n. By summing over all i with =,...,n we have QX = l = I i = Q i, l and thus QY = i X = l = QY =i,x=l QX=l. Moreover, for = 2,...,n and by summing over all i m with m {,...,n} then QY = i X = l = QY =i,x=l QX=l = I i = Q i,lq i,l I i = Q i,l = Q i, l. Finally, QY = i,...,y n = i n X = l = QY =i,x=l,...,y n =i n,y n =i n QX=l = QY = i, X = lqy 2 = i 2 X = l...qy n = i n X = l. Lemma 2. Tensor P admits exact factorization of K rank one tensors iff I = and an approximate NTF as I=.

21 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 439 A proof can be found in Appendix 5. Now we generalize the probabilistic interpretation of NMF to an arbitrary order exact NTF problem. The probability tensor P admits an exact NTF i.e., P = K n = ql iff there exist at least one measure Ω, F whose Y = Y,...,Y n marginal is P and at the same time making Y,...,Y n conditionally independent given X. Proposition 2.2 Given P, function V, Q D KL V Q attains a minimum on Iand it is valid that min D KLP Q = min D KLV Q 55 Q ℵ V,Q I The proof can be found in Appendix 6. Let V opt and Q opt be the tensors satisfying Proposition 2.2. If l 0 such that I,...,I n i =,i 2 =,...,i n = Vopt i, l 0, i 2,...,i n = 0, then all Q opt i, l 0, i 2,...,i n = 0, as well. Equivalently, if l 0, such that I,...,I n i =,...,i n = Qopt i, l 0, i 2,...,i n = 0, then all V opt i, l 0, i 2,...,i n = 0, as well. In both cases the optimal factorization of P has rank less than K and in order to proceed to the factorization, we may omit all the columns from Q,...,Q n that correspond to l Two partial subproblems Now we shall try to solve the equivalent double minimization problem: min V,Q I D KLV Q. 56 We should solve two partial minimization problems. In the first problem, given Q I, we minimize D KL V Q over V, while in the second one, given V we minimize the divergence over Q given V. The unique solution V opt = V opt Q can be easily computed analytically and is given by: V opt i, l, i 2,...,i n = Pi,...,i n K n= Q i, l Qi, l, i 2,...,i n 57 where K Q i, l = K Qi, l, i 2,...,i n. For the second minimization problem, the unique solution is given by: Q opt = I 2,...,I n i 2 =,...,i n = Vi, l, i 2,...,i n 58

22 440 S. Zafeiriou, M. Petrou and for = 2,...,n: Q opt i Vi, l, i 2,...,i n i, l = I,...,I n i =,...,i n = Vi, l, i 2,...,i n 59 These two minimization problems and their solutions have a direct probabilistic interpretation. This interpretation generalizes the discussion in Finesso and Sprei 2006 using arbitrary order tensors instead of order-2 tensors, i.e. matrices. In the first minimization given a distribution Q, which makes the sequence of Y = Y,...,Y n conditionally independent given X, the best solution represents the best set in with the marginal of Y being tensor P. Let V opt be the optimal distribution Y, l, Y 2,...,Y n. Equation 57 can then be interpreted in terms of corresponding measures as: P opt Y = i, X = l, Y 2 = i 2,...,Y n = i n = QX = l Y = i,...,y n = i n Pi, i 2,...,i n 60 Notice that the conditional distributions of X given Y under P opt and Q are the same. In the second optimization problem, one is given a distribution Q, with the marginal of Y given by P and finds the best approximation to it in the set I of distributions which make Y = Y,...,Y n conditionally independent given X. LetQ opt denote the optimal distribution of Y, l, Y 2,...,Y n, Y n. Equations 58 and 59 can be now be interpreted as: Q opt Y = i, X = l = PY = i, X = l 6 and for = 2,...,n Q opt Y = i X = l = PY = i X = l, 62 respectively. That is, the optimal solution Q opt is such that the marginal distribution of X, Y under P and Q opt coincide, as well as, the marginal distributions of Y given X under P and Q opt. Lemma 2.2 For fixed Q and V opt = V opt Q it holds that, for any V D KL V Q = D KL V V opt + D KL V opt Q 63 moreover D KL V opt Q = D KL P O 64

23 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 44 where Oi,...,i n = K Qi, l, i 2,...,l, i n 65 For fixed V and Q opt = Q opt V, it holds that, for any Q I. D KL V Q = D KL V Q opt + D KL Q opt Q. 66 The proof is provided in Appendix 7. Before we produce the other two Pythagorean rules for NTF, we should define the following laws. Let the law of U, V under arbitrary probability measures P and Q by P U,V and Q U,V. The conditional distributions of U given V are denoted by the matrices P U V and Q U V, with the convention P U V i, = PU = V = i and likewise for Q U V. Let us define the mean value operator E Z D KL P O E Z D KL P O = i,...,i n Zi,...,i n Pi,...,i n ln Pi,...,i n. Oi,...,i n 67 Lemma 2.3 It holds that D KL P U,V Q U,V = E P D KL P U V Q U V + D KL P V Q V 68 where D KL P U V Q U V = PU = V log PU = V OU = V 69 If moreover V = V, V 2 and U, V 2 are conditionally independent given V under Q, then the first term in 68 can be written as: E P D KL P U V Q U V = E P D KL P U V P U V + E P D KL P U V Q U V Now we will reinterpret the Pythagorean rules 63, 64 and 66 using probabilistic terms. In that way we generalize the probabilistic interpretation of NMF pythagorean rules in Finesso and Sprei The first optimization problem of 56 can be reinterpreted in probabilistic terms using similar lines as Finesso and Sprei That is, given Q, we are to find its best approximation within. LetQ correspond toagivenq and the P corresponds to a generic V. Choosing now, U = X and V = Y = Y,...,Y n in Lemma 2.3, Eq.68 reads 70 D KL V Q = E P D KL P X Y Q X Y + D KL P Q. 7

24 442 S. Zafeiriou, M. Petrou Moreover, 7 is equivalent to 63. That is, the optimization problem is equivalent to the minimization of E P P X Y Q X Y with respect to V, which is attained with value 0 at P opt with P X Y opt = QX Y and Popt Y = P. That is, for the second optimization problem i.e, minimizing D KL V Q given V we have the following. Given V we are to find its best approximation within Q. LetP correspond to given V and Q correspond to a generic Q I. Choosing U ={Y 2,...,Y n }, V = X and V 2 = Y in Lemma 2.3, and remembering that under any Q Ithe random variables Y, Y 2,...,Y n are conditionally independent given X, Eq.68 combined with 70 now reads: D KL V Q = E P D KL P Y 2,...,Y n X,Y P Y,...,Y n X + E P D KL P Y 2,...,Y n X Q Y 2,...,Y n X +D KL P Y,X Q Y,X = n =2 E P D KL P Y X Q Y X + D KL P Y,X Q Y,X as shown analytically in Appendix 8. In the special case of NMF Finesso and Sprei 2006 i.e., only two random variables Y and Y 2 the above Pythagorean rule is written as D KL V Q = E P D KL P Y 2 X Q Y 2 X + D KL P Y,X Q Y,X. The problem is equivalent to the minimization of n =2 E P D KL P Y X Q Y X and D KL P Y,X Q Y,X with respect to Q I. Both these minima are attained both with value 0 at Q opt with Q Y X opt = PY X for = 2,...,n and Q Y,X opt = PY,X. Note that X have the same distribution under P and Q opt. To derive probabilistically the corresponding generalized Pythagorean rule we notice that: D KL V Q D KL V Q opt = n =2 E P D KL Q Y X opt QY X 72 +D KL Q Y,X opt QY,X. 73 On the right hand side of 73 we can, by conditional independence, replace E Q opt D KL Q Y X opt QY X,for = 2,...,n with E Q opt D KL Q Y X,Y opt Q Y X,Y. By another application of 68, we see that D KL V Q D KL V Q opt = D KL Q opt Q, which is the second Pythagorean rule Alternating minimization procedure As in Finesso and Sprei 2006, we set up the alternating minimization algorithm for obtaining min O D KL P O, where P is a given nonnegative tensor of arbitrary order. In view of Proposition 2.2. we can lift this problem in Ispace. Starting with an arbitrary Q 0 Iwith positive elements, we adopt the following minimization

25 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 443 scheme Q t V t Q t+ V t+ 74 where V t = V opt Q t, Q t+ = Q opt V t.aswecanseeby74 the update gain is independent of the order of the tensors. To relate this algorithm to the one defined by formulae 48 and 49, we combine two steps of the alternative minimization at a time. From the algorithms 74 we get: Q t+ = Q opt V opt Q t. 75 Computing the optimal solutions according to 57, 58 and 59 one gets from there the update rules 48 and 49. The Pythagorean rules allow us to compute easily the gain D KL P O t D KL P O t+ of the algorithm. Proposition 2.3 The update gain at each iteration of the algorithm 75 in terms of the tensors O t is given by: D KL P O t D KL P O t+ = D KL V t V t+ +D KL Q t+ Q t 76 The proof is provided in Appendix 9. If for matrices Q 0,...,Q0 n we have that qi l 0 > 0 then under the iterations in 48 and in 49 we shall have qi l t > 0 with t > 0. If in the tth step the update gain is zero then we will have D KL Q t Q t = 0. Hence, tensors Q t and Q t are identical. From this it follows by summation that Q t = Q t. Moreover, we shall have that Q t+ i, l...q n t+ i n, l = Q t i, l...q t n i n, l, since Q t = Q t, Q t is strictly positive and by summing for all a {2,...,, +,...,n} we have that Q t = Q t for all {2,...,n}. Hence, the update rules strictly decrease the obective function until the algorithm reaches a fixed point. 5.3 Auxiliary functions We consider now the minimization of D KL P O and its lifted version i.e., the minimization of D KL V Q. We shall consider that the problem starts by noting that Q t+ is found by minimizing Q D KL V opt O t O. This strongly motivates the choice of function Q, Q W Q, Q = D KL V opt Q t Q 77 as an auxiliary function for minimizing D KL P O with respect to O.

26 444 S. Zafeiriou, M. Petrou Using the decomposition of the divergence in Eq. 78 we can rewrite W as W Q, Q = D KL V opt Q t Q = D KL Popt Y Q Y +E P opt D KL P X Y opt Q X Y 78 Since P X Y opt = QX Y, and Popt Y = P we write 78 as W Q, Q = D KL P Q Y + E P D KL Q X Y Q X Y 79 From 79 it follows that W Q, Q D KL P O, and that W Q, Q = D KL P O, which are the properties of an auxiliary function for D KL P O. For our problem one can find n auxiliary functions for the original minimization of D KL P K n = ql the same auxiliary functions given in 33. In every auxiliary function for Q, =,...,n we fix all m {,...,n} {}. As in Finesso and Sprei 2006 the auxiliary function W Q, Q can be denoted as W Q,...,Q n, Q n,...,q. The auxiliary function minimization with fixed Qm with m {,...,n} {} can be taken as Q W O Q = W Q,...,Q n, Q,...,Q, Q, Q +,...,Q n 80 Now, we shall generalize the update gains in Finesso and Sprei 2006 and provide the update gains for arbitrary order NTF decompositions. Lemma 2.4 Consider the auxiliary functions W Q, Q and n auxiliary functions W Q,...,Q n, Q,...,Q, Q, Q +,...,Q n. Denote by Q the minimizers of auxiliary functions in all n + cases. Then, the following Lemma holds. D KL P D KL P K K D KL P K n = ql n = ql n = ql = D KL Q Y,X Q Y,X + W Q Q = DKL Q Y,X Q Y,X 8 W Q Q = E P opt D KL Q Y X Q Y X 82 W Q,...,Q n, Q,...,Q n n E Q D KL Q Y X Q Y X 83 =2 The proof of the above can be found in Appendix 20.

27 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 445 Corollary 2.2 The update gain of the multiplicative update rules in 48 and 49 is given by: D KL P O t D KL P O t+ = D KL Q t+y,x Q t Y,X + n =2 E Q t+ D KL Q t+y X Q t Y X +E P D KL Q t X Y Q t+ X Y. 84 In order to prove the above we start by writing: D KL P O t D KL P O t+ = D KL P O t W O t, O t+ +W O t, O t+ D KL P O t+ 85 and then, using 79 and 83 the corollary is proven. Finally, by using 76 we can prove that: D KL V t V t+ = E P Q t X Y t+ X Y Q D KL Q t Q t+ = D KL Q t X Y t+ X Y Q + n E Q t+ D KL Q t+y X Q t Y X 86 =2 6 Probabilistic latent tensor variable analysis In the previous section we studied some of the properties of the NTF optimization problem in 43 and the corresponding algorithm based on multiplicative update rules. In this section we shall try to relate the above algorithm with probabilistic tensor latent component analysis models. In Shashanka et al the authors commented about the possible extension of PLSA and PLVA models the algorithms of which are implemented using update rules similar to NMF using arbitrary order tensors but no algorithm was proposed. Here we shall show that the algorithm for solving 43 can indeed be used for multilinear PLTVA and by introducing additional factors we propose algorithms for PLTVA which are the generalization of the NMF algorithms used in Shashanka et al and Gaussier and Goutte In the following, we generalize both the symmetric and asymmetric PLSA models. 6. The symmetric model In the symmetric PLSA model for the two random variable case we seek for the latent probabilities matrices for Px 2 z and Px, z or Px z. Let x,...,x n be random

28 446 S. Zafeiriou, M. Petrou variables. Then approximation 6 can be generalized as: Px,...,x n z,z=,...,k In an EM manner the above is solved as: Px, z n Px z 87 =2 P t P t x, z n =2 P t x z z x,...,x n = z {,2,...,K } Pt x, z n =2 P t x z 88 and P t x, z = Px,...,x n P t z x,...,x n x 2,...,x n P t x,...,x,x +,...,x n Px,...,x n P t z x,...,x n x z = x,...,x n Px,...,x n P t, = 2,...,n. z x,...,x n 89 We can easily verify that the above update rules 88 and 89 are equivalent to update rules 57, 58 and 59 and also to update rules 48 and 49. Thus, approximation 87 is equivalent to optimization problem 43. Now, we shall expand further the approximation 87 by allowing another factor Pz, which is naturally introduced by expanding Px, z = PzPx z. Thus, factorization 87 becomes: Px,...,x n z,z=,...,k Pz In an EM approach the above problem is solved as: n Px z. 90 = P t P t z n = P t x z z x,...,x n = z {,2,...,K } Pt z n = P t x z 9 and P t x Px,...,x n P t z x,...,x n x z = x,...,x n Px,...,x n P t z x,...,x n P t z = Px,...,x n P t z x,...,x n. 92 x,...,x n
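A direct implementation of the EM updates (91)–(92) for the symmetric model with a latent prior P(z) is straightforward whenever the joint probability tensor fits in memory. The NumPy sketch below (ours, not from the paper) is written for an arbitrary number of modes; the function name, initialization and normalization safeguards are our choices. The text goes on to show that these element-wise updates coincide with the matricized rules derived next.

```python
import numpy as np

def symmetric_pltva(P, K, n_iter=200, eps=1e-12, rng=None):
    """EM for P(x_1,...,x_n) ~= sum_z P(z) prod_j P(x_j|z); P is a joint probability tensor."""
    rng = np.random.default_rng(rng)
    n = P.ndim
    Pz = np.full(K, 1.0 / K)
    Pxz = []
    for I in P.shape:
        M = rng.random((I, K))
        Pxz.append(M / M.sum(axis=0))                 # each column is a distribution over x_j
    for _ in range(n_iter):
        # E-step: responsibilities P(z | x_1,...,x_n), shape P.shape + (K,)
        joint = np.ones(P.shape + (K,)) * Pz
        for j in range(n):
            shape = [1] * n + [K]
            shape[j] = P.shape[j]
            joint = joint * Pxz[j].reshape(shape)
        resp = joint / (joint.sum(axis=-1, keepdims=True) + eps)
        # M-step: reweight by the data tensor and renormalize
        W = P[..., None] * resp
        Nz = W.reshape(-1, K).sum(axis=0)             # unnormalized P^t(z)
        for j in range(n):
            other_axes = tuple(a for a in range(n) if a != j)
            Pxz[j] = W.sum(axis=other_axes) / (Nz + eps)
        Pz = Nz / (Nz.sum() + eps)
    return Pz, Pxz
```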

29 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 447 By substituting 9 into92 we get: x Px,...,x n P t x z = x,...,x n Px,...,x n P t z n r= Px r z,t z {,2,...,K } Pt z n r=,t Px r z P t z n r= Px r z,t z {,2,...,K } Pt z n r=,t Px r z P t z = P t z n = P t x z Px,...,x n x,...,x n z {,2,...,K } Pt z n = P t x z. 93 Now, let P be the probability tensor with p i...i n = Px = i, x 2 = i 2,...,x n = i n and let matrices [Q ] i l = Px = i z = l and vector a with [a] l = Pz = l. Then, approximation 90 is written as: P R = K a l n = ql. 94 In a matricized form the above nonnegative approximation can be written as: P R = Q Z T = Q A Q i T 95 for =,...,n where A = diaga and Z = i Q A. In terms of A it can be written as: vecp Q Q i a. 96 Using tensor unfolding and matricization procedures, the update rules 93 can be written as: Q t = Q t P Q t = Q t E Q t R t Z t T 97 where R t = Q t t Z T and Z t = Q t n... Q t... Q t + Q t.

30 448 S. Zafeiriou, M. Petrou The update rule for the probabilities of the latent variables is: a t = a t Q t Q t i T vecp Q t Qt i a t. 98 We can easily verify that the above update rules are the solution of the following optimization problem: min Q,...,Q n :Q T e=e,e T a= D KL P K a l n = ql. 99 In order to use the above algorithm for the factorization of an arbitrary tensor X K a l u l with w = I,...,I n i =,...,i n = X i,...,i n we should normalize the tensor into P = w X, then use update rules 97. Finally, the factors of the decomposition are given by b opt = w a opt and U opt = Q opt. 6.2 Asymmetric model Now we generalize the asymmetric PLSA model for the arbitrary order tensor case. In the asymmetric PLSA model for the two random variable case we seek for the latent probabilities matrices for Px z and Pz x 2. An example of the asymmetric model decomposition is the application of NMF to facial images. In this case the images are vectorized. The vectors are then normalized in order to sum up to one i.e., in order to form probability mass functions. Then, matrix P =[Px = i x 2 = ] contains as columns the facial images x is the random variable that corresponds to feature dimensionality and x 2 is the random variable that corresponds to number of faces in the dataset. The decomposition 6 can be reformulated as: Px x 2 = Pz x 2 Px z 00 z {,...,K } then using the EM in 8 can be reformulated as: P t P t z x 2 P t x z z x, x 2 = z {,...,K } Pt x zp t z x 2 0

and

$$P^{t+1}(z|x_2) = \frac{\sum_{x_1} P(x_1, x_2, z)}{\sum_{x_1} \sum_{z' \in \{1,\ldots,K\}} P(x_1, x_2, z')} = \frac{\displaystyle\sum_{x_1} P(x_1|x_2)\, \frac{P^{t}(x_1|z)\, P^{t}(z|x_2)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(x_1|z')\, P^{t}(z'|x_2)}}{\displaystyle\sum_{x_1} \sum_{z' \in \{1,\ldots,K\}} P(x_1|x_2)\, \frac{P^{t}(x_1|z')\, P^{t}(z'|x_2)}{\sum_{z'' \in \{1,\ldots,K\}} P^{t}(x_1|z'')\, P^{t}(z''|x_2)}} \quad (102)$$

and

$$P^{t+1}(x_1|z) = \frac{\sum_{x_2} P(x_1, x_2, z)}{\sum_{x_1, x_2} P(x_1, x_2, z)} = \frac{\displaystyle\sum_{x_2} P(x_2)\, P(x_1|x_2)\, \frac{P^{t}(x_1|z)\, P^{t}(z|x_2)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(x_1|z')\, P^{t}(z'|x_2)}}{\displaystyle\sum_{x_1, x_2} P(x_2)\, P(x_1|x_2)\, \frac{P^{t}(x_1|z)\, P^{t}(z|x_2)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(x_1|z')\, P^{t}(z'|x_2)}}. \quad (103)$$

Assume now that the images are not vectorized; then each facial image is represented by a matrix, and all the faces together form a tensor P = [P(x_1 = i_1, x_2 = i_2 | x_3 = i_3)], where x_1 and x_2 are the random variables that correspond to the height and width of the facial images and x_3 is the random variable that corresponds to the number of facial images in the dataset. In the general n-order case we have:

$$P(x_1,\ldots,x_n|x_{n+1}) = \sum_{z \in \{1,\ldots,K\}} P(z|x_{n+1})\, P(x_1,\ldots,x_n|z) = \sum_{z \in \{1,\ldots,K\}} P(z|x_{n+1}) \prod_{j=1}^{n} P(x_j|z). \quad (104)$$

The above model can be solved in an EM manner as:

$$P^{t}(z|x_1,\ldots,x_{n+1}) = \frac{P^{t}(z|x_{n+1}) \prod_{j=1}^{n} P^{t}(x_j|z)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(z'|x_{n+1}) \prod_{j=1}^{n} P^{t}(x_j|z')} \quad (105)$$

and

$$P^{t+1}(z|x_{n+1}) = \frac{\displaystyle\sum_{x_1,\ldots,x_n} P(x_1,\ldots,x_n|x_{n+1})\, \frac{P^{t}(z|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(z'|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z')}}{\displaystyle\sum_{x_1,\ldots,x_n} \sum_{z' \in \{1,\ldots,K\}} P(x_1,\ldots,x_n|x_{n+1})\, \frac{P^{t}(z'|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z')}{\sum_{z'' \in \{1,\ldots,K\}} P^{t}(z''|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z'')}} \quad (106)$$

and

$$P^{t+1}(x_j|z) = \frac{\displaystyle\sum_{x_1,\ldots,x_{j-1},x_{j+1},\ldots,x_{n+1}} P(x_1,\ldots,x_n|x_{n+1})\, P(x_{n+1})\, \frac{P^{t}(z|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(z'|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z')}}{\displaystyle\sum_{x_1,\ldots,x_{n+1}} P(x_1,\ldots,x_n|x_{n+1})\, P(x_{n+1})\, \frac{P^{t}(z|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(z'|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z')}}. \quad (107)$$

Then, by defining the tensor P with elements P(i_1,\ldots,i_{n+1}) = P(x_1 = i_1,\ldots,x_n = i_n | x_{n+1} = i_{n+1}), the matrices Q_j with [Q_j]_{i_j l} = P(x_j = i_j | z = l) for j = 1,\ldots,n, and Q_{n+1} with [Q_{n+1}]_{i_{n+1} l} = P(z = l | x_{n+1} = i_{n+1}), we formulate the following optimization problem:

$$\min_{Q_1,\ldots,Q_{n+1}:\; Q_j^{T} e = e,\; Q_{n+1} e = e} D_{KL}\Big(P \,\Big\|\, \sum_{l=1}^{K} q_l^{(1)} \circ \cdots \circ q_l^{(n+1)}\Big). \quad (108)$$

Using tensor unfolding and matricization procedures, the above update rules can be written as:

$$\tilde{Q}_j^{t+1} = Q_j^{t} \circledast \big[(P_{(j)} \oslash R_{(j)}^{t})\, Z_j^{t}\big], \qquad Q_j^{t+1} = \tilde{Q}_j^{t+1} \oslash \big(E\, \tilde{Q}_j^{t+1}\big) \quad (109)$$

for j = 1,\ldots,n, where R_{(j)}^{t} = Q_j^{t} (Z_j^{t})^{T}, Z_j^{t} = \bar{Q}_{n+1}^{t} \odot Q_n^{t} \odot \cdots \odot Q_{j+1}^{t} \odot Q_{j-1}^{t} \odot \cdots \odot Q_1^{t} and Z_{n+1}^{t} = Q_n^{t} \odot \cdots \odot Q_1^{t}, with \bar{Q}_{n+1}^{t} = B Q_{n+1}^{t} and B = diag(b) an I_{n+1} \times I_{n+1} diagonal matrix with b_l = P(x_{n+1} = l), i.e., the probability of every instance. For matrix Q_{n+1}^{t} we have the following update rules:

$$\tilde{Q}_{n+1}^{t+1} = Q_{n+1}^{t} \circledast \big[(P_{(n+1)} \oslash R_{(n+1)}^{t})\, Z_{n+1}^{t}\big], \qquad Q_{n+1}^{t+1} = \tilde{Q}_{n+1}^{t+1} \oslash \big((\tilde{Q}_{n+1}^{t+1} e)\, e^{T}\big). \quad (110)$$

7 Experimental results

We demonstrate the usefulness of the proposed approach using both simulated and real-life data.
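For concreteness, here is a minimal sketch of the asymmetric EM iterations (105)–(107) of the previous section, specialized to the third-order case (images represented as matrices, with the third mode indexing instances). The function name, initialization, iteration count and epsilon guards are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def asymmetric_pltva_em(P, prior, K, n_iter=200, rng=None):
    """EM sketch of the asymmetric model (104)-(107) for the third-order case:
    P[i1, i2, i3] ~ P(x1=i1, x2=i2 | x3=i3), prior[i3] ~ P(x3=i3).
    Returns Q1 ~ P(x1|z), Q2 ~ P(x2|z), Q3 ~ P(z|x3)."""
    rng = np.random.default_rng(rng)
    I1, I2, I3 = P.shape
    Q1 = rng.random((I1, K))
    Q1 /= Q1.sum(axis=0, keepdims=True)                  # columns sum to one
    Q2 = rng.random((I2, K))
    Q2 /= Q2.sum(axis=0, keepdims=True)
    Q3 = np.full((I3, K), 1.0 / K)                       # rows sum to one

    for _ in range(n_iter):
        # E-step (105): posterior over z, shape (I1, I2, I3, K)
        post = Q1[:, None, None, :] * Q2[None, :, None, :] * Q3[None, None, :, :]
        post /= post.sum(axis=-1, keepdims=True) + 1e-12
        W = P[..., None] * post
        # M-step (106): new P(z | x3), one row per instance x3
        Q3 = W.sum(axis=(0, 1))
        Q3 /= Q3.sum(axis=1, keepdims=True) + 1e-12
        # M-step (107): new P(x_j | z), instances weighted by P(x3)
        Wp = W * prior[None, None, :, None]
        Q1 = Wp.sum(axis=(1, 2))
        Q1 /= Q1.sum(axis=0, keepdims=True) + 1e-12
        Q2 = Wp.sum(axis=(0, 2))
        Q2 /= Q2.sum(axis=0, keepdims=True) + 1e-12
    return Q1, Q2, Q3
```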

7.1 Experiments using simulated data

We demonstrate the power of the PLTVA framework by providing empirical verification that the algorithm given by update rules (97) and (98) can produce better bases, in terms of interpretability and sparseness, than PLSA, and by showing its effect on the success of recreating the underlying model. One dataset that has been used to this end is the Swimmer database. The Swimmer database was used in Donoho and Stodden (2004) to demonstrate a case in which a unique decomposition into parts can be obtained. Some of the images of the Swimmer dataset can be seen in Fig. 2a. The Swimmer image set comprises 256 images, which form a third-order tensor X with two spatial modes and one image-index mode. Each image contains a torso (the invariant part) of 12 pixels in the center and four limbs of 6 pixels that can each be in one of 4 positions. We applied the NMF algorithm given by update rules (3), (4) and (5), which is equivalent to PLSA, to the matrix X_(3), the matricization of tensor X in terms of the number of images. Matrix X_(3) is first normalized so that it sums up to one. The resulting bases, given by the columns of matrix Q, are shown in Fig. 2b. As can be seen, the NMF scheme of Lee and Seung (2000) correctly resolves the local parts but fails on the torso, which appears as a ghost part in all images [this has also been demonstrated in Donoho and Stodden (2004)]. Next, we applied the NTF given by (97) and (98), which is equivalent to PLTVA. The proposed NTF algorithm computed bases given by the outer products q_l^{(1)} \circ q_l^{(2)}, l = 1, ..., K, which constitute a unique factorization and correctly resolve the parts (Fig. 2c). Moreover, we performed eigenanalysis on the bases of NMF (PLSA) and NTF (PLTVA); the mean number of non-zero eigenvalues was 3.92 for NMF and 1 for NTF. This shows that the proposed NTF correctly decomposed the data tensor into rank-one basic tensors, i.e., independent factors given the latent variable.

7.2 Face verification experiments

The experiments conducted with the XM2VTS database used the protocol described in Messer et al. The images were aligned semi-automatically using the eye coordinates of each facial image. The facial images were down-scaled to a lower resolution. Histogram equalization was used for

Fig. 2 Some images of the Swimmer database and the corresponding bases for PLSA and PLTVA: a Swimmer images; b PLSA basis images; c PLTVA basis images
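As a small illustration of the preprocessing described above for the PLSA baseline, the sketch below matricizes an image stack by image index and normalizes it to sum to one before the KL-based factorization is applied. The orientation (images as columns) and the generic dimensions are our own choices; they are not taken from the paper.

```python
import numpy as np

def matricize_and_normalize(X):
    """Given an image stack X of shape (height, width, n_images), return the
    matricization in terms of the number of images (here: one vectorized
    image per column), globally normalized so that its entries sum to one."""
    h, w, n = X.shape
    X3 = X.reshape(h * w, n, order="F")   # column k is image k, vectorized
    return X3 / X3.sum()
```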
