Data Min Knowl Disc (2011) 22 · DOI 10.1007/s…

Nonnegative tensor factorization as an alternative Csiszár–Tusnády procedure: algorithms, convergence, probabilistic interpretations and novel probabilistic tensor latent variable analysis algorithms

Stefanos Zafeiriou · Maria Petrou

Received: 1 May 2009 / Accepted: 8 July 2010 / Published online: 1 August 2010. © The Author(s) 2010

Abstract In this paper we study Nonnegative Tensor Factorization (NTF) based on the Kullback–Leibler (KL) divergence as an alternative Csiszár–Tusnády procedure. We propose new update rules for the aforementioned divergence that are based on multiplicative update rules. The proposed algorithms are built on solid theoretical foundations that guarantee that the limit point of the iterative algorithm corresponds to a stationary solution of the optimization procedure. Moreover, we study the convergence properties of the optimization procedure and we present generalized Pythagorean rules. Furthermore, we provide clear probabilistic interpretations of these algorithms. Finally, we discuss the connections between generalized Probabilistic Tensor Latent Variable Models (PTLVM) and NTF, proposing in that way algorithms for PTLVM for arbitrary multivariate probability mass functions.

Keywords Nonnegative Matrix Factorization · Nonnegative Tensor Factorization · Kullback–Leibler divergence · Probabilistic Latent Semantic Analysis

1 Introduction

Nonnegative Matrix Factorization (NMF) has been proposed for the analysis of nonnegative data in Lee and Seung (1999, 2000). In Lee and Seung (1999), NMF was motivated as a technique for discovering the parts of objects. This was shown when

Responsible editors: Tao Li, Chris Ding, Fei Wang.

S. Zafeiriou (✉) · M. Petrou
Department of Electrical and Electronic Engineering, Imperial College London, Signal Processing and Communication Research Group, South Kensington Campus, London SW7 2AZ, UK
e-mail: s.zafeiriou@imperial.ac.uk

NMF was applied to the decomposition of facial images: it was demonstrated that the derived bases of the decomposition intuitively corresponded to parts of faces. The problem of NMF is as follows: given a nonnegative matrix X ∈ R₊^{F×L} and a prespecified integer constant K, find a matrix Z ∈ R₊^{F×K}, which contains the bases of the decomposition in its columns (Lee and Seung 1999, 2000; Zafeiriou et al. 2006; Kotsia et al. 2007), and a matrix H ∈ R₊^{K×L}, which contains the weights of the decomposition, such that X ≈ ZH. In order to find the approximation X ≈ ZH, two optimization problems were defined in Lee and Seung (2000): the first minimizes the Frobenius norm between X and the decomposition ZH and yields a least squares approximation, while the second resorts to the minimization of the Kullback–Leibler (KL) divergence between X and the decomposition. Updating rules of the iterative schemes used to solve these problems were derived from auxiliary functions. The provided updating rules guaranteed the non-increasing behavior of the cost functions, but they did not guarantee that the limit point corresponds to a local minimum or to a stationary limit point (Gonzalez and Zhang 2005; Lin 2007a,b). In Gonzalez and Zhang (2005) it was numerically demonstrated that the multiplicative update rules in Lee and Seung (1999, 2000) may fail to converge to a stationary point. Recently, the convergence properties of the NMF update rules were explored in Lin (2007a,b) and Finesso and Spreij (2006). More precisely, in Lin (2007a), update rules which guarantee that the limit point of the Least Squares Error (LSE) optimization problem is a stationary point were proposed. In Lin (2007b) an alternative solution for the LSE optimization problem was proposed that was based on the Armijo rule. In both Lin (2007a,b) the LSE was used for measuring the error of the approximation.

Another very useful measure of the error of the nonnegative approximation, with many applications, is the KL divergence. The first algorithm with the KL divergence was proposed in Lee and Seung (2000), where both multiplicative and additive update rules were given. In Sra and Dhillon (2006), using the generalized Bregman distance formulation, an exponential family of update rules was proposed for the KL divergence. In Finesso and Spreij (2006), a convergence analysis for the multiplicative update rules of the KL divergence was provided and NMF was interpreted within a probabilistic framework. Moreover, in Finesso and Spreij (2006), it was proven that the proposed set of multiplicative update rules guarantees that the KL divergence converges to a stationary limit point. Finally, the relation between multiplicative update rules for NMF with the KL divergence and Probabilistic Latent Semantic Analysis (PLSA) was explored in Gaussier and Goutte (2005). Recently, the relationship of NMF approaches with various Probabilistic Latent Variable Models (PLVM) was investigated in Shashanka et al. (2007, 2008) and Smaragdis and Raj (2007).

One disadvantage of NMF is that it cannot describe more than pairwise data relations. The application of NMF to modelling relations, and furthermore to clustering, was initiated in Lee and Seung (1999). In Ding et al. it was shown that there is a close relationship between spectral clustering and NMF. In Shashua et al., the problem of clustering data, given complex relations beyond pairwise relations between data points, was considered.
The n-wise relations between the data points can be modelled by an n-order tensor, where each entry corresponds to an affinity measure, usually nonnegative, over an n-tuple of data points. In that way, NTF models for clustering n-wise relations were motivated.

Another example of the advantage of NTF over NMF can be found when these methods are applied to feature extraction from images. In this case, in order to apply NMF, the object images should be vectorized in order to find their nonnegative decomposition. This vectorization leads to information loss, since the local structure of the images is lost. In order to remedy this drawback of the NMF representation, n-order and 3-order Nonnegative Tensor Factorization (NTF) schemes were proposed (Hazan et al. 2005; Friedlander and Hatz 2006; Shashua and Hazan 2005). In Shashua and Hazan (2005), an n-order tensor factorization method was proposed, based on the Frobenius norm. In Hazan et al. (2005), an image database was represented as a 3-order tensor, i.e. a 3D cube that has the 2D images as slices. Update rules for the factors of the decomposition, which guarantee a nonincreasing behavior of the KL divergence, were proposed. The 3D NTF decomposition was proven to be more suitable for part-based object representation than NMF (Hazan et al. 2005). Examples, such as the decomposition of the Swimmer dataset (Donoho and Stodden 2004), which demonstrate the superiority of 3D NTF over NMF can be found in Hazan et al. (2005). A recent overview of NTF algorithms can be found in Zafeiriou (2009b).

Recently, there has been an increasing interest in NTF algorithms (Kim and Choi 2007; Cichocki et al. 2007, 2008; Mørup et al. 2008; Kim et al. 2008; Zafeiriou 2009a,b). In Kim and Choi (2007), Mørup et al. (2008), and Kim et al. (2008), algorithms for arbitrary order Tucker (1966) factorizations, where data, core and mode matrices are non-negative, were proposed. Multiplicative update rules for all the factors were also given. In order to reduce further the ambiguities of the decomposition, updates that can impose sparseness in any combination of modalities were proposed in Mørup et al. (2008). Moreover, the notion of nonsmoothness (Pascual-Montano et al.) for controlling the sparseness was extended to NTF algorithms in Kim and Choi (2007). Algorithms for 3D NTF using the PARAFAC2 decomposition (Kiers et al. 1999; Bro et al. 1999) were proposed in Cichocki et al. (2007, 2008). Finally, in Zafeiriou (2009a) the NMF method that uses the generalized Bregman divergence in Sra and Dhillon (2006) and the Discriminant-NMF (DNMF) method in Zafeiriou et al. (2006) were generalized to arbitrary nonnegative tensor decompositions.

The tensorization of well-established vector-based algorithms is not an easy procedure and is a very active research field. For example, in De Lathauwer et al. (2000) the Singular Value Decomposition (SVD) was extended to a multilinear SVD, in Lu et al. Principal Component Analysis (PCA) was extended to multilinear PCA, in Yan et al. and Tao et al. Fisher's Linear Discriminant Analysis (FLDA) was extended to various multilinear counterparts, Independent Component Analysis (ICA) was extended to multilinear ICA in Raj and Bovik (2008), and Canonical Correlation Analysis (CCA) to multilinear CCA in Kim and Cipolla.

In this paper, we study NTF using the KL divergence. The KL divergence is one of the most used distances for performing NMF and NTF and is of importance since it provides factorizations that share a lot of similarities with Expectation Maximization (EM) approaches. Moreover, such NMF and NTF approaches, in most cases,

lead to clear probabilistic interpretations. Algorithms for NTF based on the KL divergence were recently proposed in Zafeiriou (2009a). In Zafeiriou (2009a), generalizations of the NTF algorithms in Shashua and Hazan (2005) and of the NMF algorithms in Zafeiriou et al. (2006) and Sra and Dhillon (2006) were proposed. However, in Zafeiriou (2009a) no probabilistic interpretation of the NTF algorithms was given and the properties of the algorithms were not studied. In this paper, we interpret the NTF algorithm as an alternative Csiszár–Tusnády procedure, in a similar manner to Finesso and Spreij (2006), and we explore the convergence properties of the optimization procedure. This analysis leads to updating rules which guarantee that the limit point is stationary. Moreover, we investigate the relationship between Probabilistic Tensor Latent Variable Models (PTLVM) and NTF with the KL divergence, and we propose novel algorithms for PTLVM.

In Shashanka et al. (2008), Gaussier and Goutte (2005), and Ding et al. (2008), the relationship between Probabilistic Latent Variable Analysis (PLVA) models for bivariate distributions and NMF was explored. In Gaussier and Goutte (2005), a first attempt to connect NMF and PLSA was made. In Ding et al. (2008), it was shown that PLSA is similar, but not exactly equivalent, to NMF, and a hybrid algorithm for performing PLSA using NMF was proposed. In Shashanka et al. (2008), it was also noted that the approach can be extended to arbitrary order PTLVM, but no algorithm was proposed that could solve such problems. In this paper, we show the close connection between the algorithms in Finesso and Spreij (2006), Ding et al. (2008), Gaussier and Goutte (2005) and Shashanka et al. (2008), and we generalize them to arbitrary multivariate distributions and to arbitrary order tensors. Summarizing, the novel contributions of this paper are:

- We propose novel nonnegative tensor factorization algorithms based on the KL divergence, by interpreting the optimization problem as an alternative Csiszár–Tusnády procedure.
- We provide clear probabilistic interpretations for nonnegative tensor factorization, and study the convergence and other theoretical properties of the algorithm.
- We propose algorithms for symmetric and asymmetric probabilistic tensor latent variable analysis.
- All the algorithms are expressed using simple matrix operations.

The remainder of this paper is organized as follows. In Sect. 2, we briefly describe the problem of NMF with the KL divergence and comment on the relationship between NMF and PLVM. In Sect. 3, we briefly outline some elements of multilinear algebra and show how NMF algorithms can be extended to arbitrary order NTF. In Sect. 4, robust update rules for NTF with the KL divergence are proposed and their convergence theorems are presented. In Sect. 5, we present a probabilistic interpretation of the optimization problem and explore various properties of this approach. In Sect. 6, we show the equivalence between the proposed NTF algorithm and PTLVMs and propose algorithms for arbitrary order probability tensor decompositions. Experimental results which demonstrate the merits of the proposed approach are presented in Sect. 7. Finally, conclusions are drawn in Sect. 8. For completeness, extensive proofs of the various statements throughout the paper are given in the Appendices.

2 Nonnegative Matrix Factorization to NTF

In this section, we briefly describe how NMF is formulated. In the following, we consider a collection of L nonnegative vectors x_j ∈ R₊^F.

2.1 Nonnegative Matrix Factorization

For two vectors x = [x_1, x_2, …, x_F]^T and q = [q_1, q_2, …, q_F]^T, the KL divergence (or relative entropy) between them is defined as (Lee and Seung 2000):

KL(x ‖ q) = Σ_{i=1}^F ( x_i ln(x_i / q_i) + q_i − x_i ).   (1)

For convenience, we define 0/0 = 0 and 0 ln 0 = 0 (Finesso and Spreij 2006). It can be shown that, in general, the KL divergence is nonnegative and is equal to zero if and only if its two arguments are equal. The basic idea behind NMF is to approximate the object x by a linear combination of the elements of h ∈ R₊^M such that x ≈ Zh, where Z ∈ R₊^{F×M} is a nonnegative matrix the columns of which sum up to one. In order to measure the error of the approximation x ≈ Zh, the divergence KL(x ‖ Zh) may be used (Lee and Seung 2000).

NMF, when applied to the matrix X = [x_1 … x_j … x_L] ∈ R₊^{F×L} = [x_{ij}], aims at finding two matrices Z ∈ R₊^{F×M} = [z_{ik}] and H ∈ R₊^{M×L} = [h_{kj}] such that:

X ≈ ZH.   (2)

The vector x_j after the NMF decomposition can be written as x_j ≈ Zh_j, where h_j is the jth column of H. Thus, the columns of matrix Z define a basis and h_j consists of the corresponding weights. Decomposition (2) induces an approximation error which is the sum of the KL divergences for all x_j, i.e.:

D_KL(X ‖ ZH) = Σ_{j=1}^L KL(x_j ‖ Zh_j) = Σ_{i=1}^F Σ_{j=1}^L ( x_{ij} ln( x_{ij} / Σ_{k=1}^M z_{ik} h_{kj} ) + Σ_{k=1}^M z_{ik} h_{kj} − x_{ij} ),   (3)

which serves as the measure of the cost for factoring X into ZH (Lee and Seung 2000). Factorization (2) is the outcome of the following optimization problem:

min_{Z,H} D_KL(X ‖ ZH)  subject to  z_{ik} ≥ 0, h_{kj} ≥ 0, k = 1, …, M, i = 1, …, F, j = 1, …, L.
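The conventions 0/0 = 0 and 0 ln 0 = 0 matter in practice, since nonnegative data matrices are often sparse. The following is a minimal NumPy sketch (ours, not from the paper) of the generalized KL divergence of Eqs. (1) and (3) between two nonnegative arrays; the function name and the masking strategy are our own implementation choices.

```python
import numpy as np

def generalized_kl(x, q):
    """Generalized KL divergence sum(x*ln(x/q) + q - x) with the 0*ln(0) = 0 convention."""
    x = np.asarray(x, dtype=float)
    q = np.asarray(q, dtype=float)
    # Only accumulate the x*ln(x/q) term where x > 0, implementing 0*ln(0) = 0.
    mask = x > 0
    div = np.sum(x[mask] * np.log(x[mask] / q[mask]))
    return div + q.sum() - x.sum()
```

For matrices X, Z and H, `generalized_kl(X, Z @ H)` evaluates the cost (3) directly.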

NMF has nonnegativity constraints on both the elements of Z and those of H; these nonnegativity constraints permit the combination of multiple basis components in order to represent the original nonnegative vectors using only additions between the different basis components. The following update rules guarantee a nonincreasing behavior of the KL divergence (3) (Lee and Seung 1999, 2000):

z_{ik}^{(t)} = z_{ik}^{(t−1)} ( Σ_{m=1}^L h_{km}^{(t−1)} x_{im} / [Z^{(t−1)} H^{(t−1)}]_{im} ) / ( Σ_{m=1}^L h_{km}^{(t−1)} )   (4)

h_{kj}^{(t)} = h_{kj}^{(t−1)} ( Σ_{l=1}^F z_{lk}^{(t−1)} x_{lj} / [Z^{(t−1)} H^{(t−1)}]_{lj} ) / ( Σ_{l=1}^F z_{lk}^{(t−1)} ).   (5)

2.2 Alternative multiplicative update rules of NMF using the KL divergence and a probabilistic interpretation

In Finesso and Spreij (2006), an alternative algorithm for solving the optimization problem of minimizing (3) under nonnegativity constraints was proposed. That is, the problem was transformed into the factorization of a probability matrix P with P(i, j) ≥ 0 and Σ_{i=1}^F Σ_{j=1}^L P(i, j) = 1. In the following, for denoting the elements of matrix P we shall use either p_{ij} or P(i, j). The problem was formulated as follows:

min_{Q_−, Q_+ : Q_+ e = e, e^T Q_− e = 1} D_KL(P ‖ Q_− Q_+)   (6)

(for notational simplicity, we shall use e to denote a vector of ones of arbitrary dimension). For the above constrained optimization problem, cost function (6) simplifies to:

D_KL(P ‖ Q_− Q_+) = Σ_{i=1}^F Σ_{j=1}^L P(i, j) ln( P(i, j) / Σ_{k=1}^K Q_−(i, k) Q_+(k, j) ).   (7)

The update rules for solving the above optimization problem are:

Q_−^{(t)}(i, k) = Q_−^{(t−1)}(i, k) Σ_{m=1}^L Q_+^{(t−1)}(k, m) P(i, m) / [Q_−^{(t−1)} Q_+^{(t−1)}]_{im}   (8)

and

Q_+^{(t)}(k, j) = Q_+^{(t−1)}(k, j) ( Σ_{i=1}^F Q_−^{(t−1)}(i, k) P(i, j) / [Q_−^{(t−1)} Q_+^{(t−1)}]_{ij} ) / ( Σ_{m=1}^F Σ_{l=1}^L Q_−^{(t−1)}(m, k) Q_+^{(t−1)}(k, l) P(m, l) / [Q_−^{(t−1)} Q_+^{(t−1)}]_{ml} ).   (9)
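Before returning to the normalized formulation, here is a minimal NumPy sketch (ours, not from the paper) of the plain multiplicative updates (4)–(5); the function name, random initialization and the small constant added to avoid division by zero are implementation choices, and the normalized rules (8)–(9) differ from these only in the constraints that are enforced.

```python
import numpy as np

def nmf_kl(X, M, n_iter=200, eps=1e-12, rng=None):
    """Lee-Seung style multiplicative updates for min_{Z,H >= 0} D_KL(X || Z H)."""
    rng = np.random.default_rng(rng)
    F, L = X.shape
    Z = rng.random((F, M)) + eps
    H = rng.random((M, L)) + eps
    for _ in range(n_iter):
        R = Z @ H + eps                                        # current reconstruction
        Z *= (X / R) @ H.T / (H.sum(axis=1) + eps)             # rule (4)
        R = Z @ H + eps
        H *= Z.T @ (X / R) / (Z.sum(axis=0)[:, None] + eps)    # rule (5)
    return Z, H
```

The nonincreasing behavior of the cost (3) along the iterations can be monitored with the `generalized_kl` helper sketched earlier.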

7 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 425 Update rule 9 may be equivalently written as Q t F + k, = Qt + k, i= Q t / i, kpi, N ] [ Q t Qt + Matrices P, Q and Q + are related with X, Z and H through: i Q t l, k. 0 P = e T Xe X, Q = e T Ze Z, Q + = H. The above updating rules not-only guarantee the nonincreasigness of cost function D KL P Q Q + but also guarantee the stationarity of the limit point. By transforming the optimization problem 4 into the factorization of a probabilities matrix, not only helped in proving stationarity, but also revealed a lot of interesting properties and interpretations. The generalization of these properties to arbitrary order tensor factorizations will be explored in this paper. Before proceeding we will briefly comment on an alternative factorization introducing another vector a which is also computed during the procedure. The optimization problem is: min D KL P Q AQ + 2 Q,Q + :Q + e=e,q T e=e,at e= where A = diaga and diaga returns a diagonal matrix that has in its main diagonal the elements of vector a. The corresponding update rules are given by: and Q t i, k = Qt i, k Q t Q i, k = t Q t + k, = Qt Q t [ ] M A t Q t + Pi, m [ km ] A t Q t m= i, k Fm= Q t m, k, [ F + k, i= Q t Q t At ] ik + [ Q t At Q t + im Pi, ] i 3 + k, = Q + t k, Lm= Q t 4 m, k, a t k = a t k + F,L i=, = Pi, Q t + k, Qt k, [ ]. 5 Q t At Q t + i

8 426 S. Zafeiriou, M. Petrou In Gaussier and Goutte 2005, Ding et al. 2008, Shashanka et al. 2007, 2008, and Smaragdis and Ra 2007 the close relationship between NMF with the KL divergence, PLSA and other PLVM approaches was commented. In Gaussier and Goutte 2005, it was shown how NMF problems may be interpreted as PLSA Hofmann 999. In Ding et al the authors showed that NMF and PLSA even though they minimize similar obective functions they do not converge to the same limit point. Moreover, in Shashanka et al. 2007, 2008 the relationship between PLVM and NMF algorithms was further studied. Here we shall try to provide an interpretation optimization problem 6 and update rules 8, 9 and 0 in terms of PLVM. Moreover, we will show that the constrained optimization problem 2 which is solved via update rules 3, 4 and 5 is exactly the same as PLSA algorithm in Hofmann 999. Latent class models, like PLSA, enable one to attribute the observations to hidden or latent factors. The main characteristic of these models is that conditionally independent bivariate data are modelled as belonging to latent classes, such that the random variables within a latent class are independent of one another. Random variables X, X 2 can be thought of as if they were defined in the canonical measurable space Ω, F, where Ω is the set of all pairs i, i 2 and F = 2 Ω i.e., the powerset of Ω. Let P be a probabilistic measure on this space. Then, the distribution of the pair X, X 2 under P is given by matrix P =[p i ]. We shall use the notation Px, x 2 for distribution Px = i, x 2 = = p i. Distribution Px, x 2 is modelled as a mixture, where each component of the mixture is the product of one-dimensional marginal distributions. Let x and x 2 be random variables and let Px, x 2 be their oint probability mass function. The PLSA approximation may be written as: Px, x 2 z {,...,K } Px, zpx 2 z 6 where z is a latent random variable. In an EM Hofmann 999 EM approach we maximize the following functional: D PLSA Px, x 2 Px, zpx 2 z I,I 2 K = Px = i, x 2 = i 2 ln Px =, z = i 3 Px 2 = i 2 z = i 3 i =,i 2 = i 3 = = D KL Px, x 2 Px, zpx 2 z c 7 where c is a constant c = I,I 2 i =,i 2 = Px = i, x 2 = i 2 lnpx = i, x 2 = i 2. Thus, as can be seen is equivalent to minimization of D KL Px, x 2 Px, z. Solving maximization of 7Hofmann 999 we have the following updating rules: P t P t x, zp t x 2 z z x, x 2 = y {,2,...,K } Pt x, yp t x 2 y 8

9 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 427 and P t x, z = a 2 Px, a 2 P t z x, a 2 P t x 2 z = a Pa, x 2 P t z a, x 2 a,a 2 Pa, a 2 P t z a, a 2. 9 Now by substituting 8 into9 wehave: P t x, z = P t x, z P t a 2 z Px, a 2 a 2 y {,2,...,K } Pt x, yp t a 2 y P t x 2 z = P t a Pa, x 2 P t a,z y {,2,...,K } x 2 z Pt a,yp t x 2 y 20 P a,a 2 Pa, a 2 t a,zp t a 2 z = P t a Pa, x 2 x 2 z P t a,yp t a 2 y y {,2,...,K } P t a,z y {,2,...,K } Pt a,yp t x 2 y a P t a, z Then, since F i= Mk= Px = i, z = k = and L = Px 2 = z = k = which are equivalent to e T Q e = and Q + e = e, respectively we can see that Q + i, k = Px = i, z = k and [Q ] T, k = Px 2 =, z = k. Thus, the update rules 8, 9 and 0 are exactly the same as the EM algorithm 20. In case that we that want to calculate the probabilities Pz [which is computed in the PLSA algorithm in Hofmann 999] as well then approximation 2 is reformulated as: Px, x 2 z {,...,K } PzPx zpx 2 z 2 and optimization problem is the maximization of D PLSA Px, x 2 Px, zdiagpz Px 2 z which is the PLSA model proposed in Hofmann 999. The corresponding update rules are given by: P t x z = P t a 2 Px, a 2 x z a,a 2 Pa, a 2 P t x 2 z = P t a Pa, x 2 x 2 z a,a 2 Pa, a 2 P t zp t a 2 z y {,2,...,K } Pt yp t x yp t a 2 y P t zp t a 2 z y {,2,...,K } Pt yp t a yp t a 2 y P t zp t a z y {,2,...,K } Pt yp t a yp t x 2 y P t zp t a zp t a 2 z y {,2,...,K } Pt yp t a yp t a 2 y 22.

and

P^{(t)}(z) = P^{(t−1)}(z) Σ_{a_1, a_2} [ P(a_1, a_2) P^{(t−1)}(a_1 | z) P^{(t−1)}(a_2 | z) / Σ_{y ∈ {1,2,…,K}} P^{(t−1)}(y) P^{(t−1)}(a_1 | y) P^{(t−1)}(a_2 | y) ].   (23)

We observe that Σ_{i=1}^F P(x_1 = i | z = k) = 1, Σ_{j=1}^L P(x_2 = j | z = k) = 1 and Σ_{k=1}^K P(z = k) = 1, which are equivalent to Q_−^T e = e, Q_+ e = e and a^T e = 1, respectively, so that Q_−(i, k) = P(x_1 = i | z = k), [Q_+]^T(j, k) = P(x_2 = j | z = k) and a(k) = P(z = k). Thus, update rules (13), (14) and (15) are exactly the same as the PLSA algorithm given by (22) and (23).

2.3 Relationship between PLSA, the original NMF and the approach proposed in Ding et al. (2008)

The relationship between NMF and PLSA has been commented upon in Gaussier and Goutte (2005), Ding et al. (2008), and Shashanka et al. (2008). Now, we will discuss the actual difference between NMF and PLSA and complete the analysis initiated in Ding et al. (2008). In Ding et al. (2008) the authors proposed an algorithm for solving approximation (16) using the actual NMF update rules (4) and (5). After the computation of Z and H, they computed the matrices Q_−(i, k) = P(x_1 = i | z = k) and Q_+(k, j) = P(x_2 = j | z = k) and the vector a(k) = P(z = k) (the diagonal of A) as:

Q_− = Z [diag(e^T Z)]^{−1},  Q_+ = [diag(H e)]^{−1} H,  A = diag(e^T Z) diag(H e).   (24)

As the authors showed, the above procedure results in a different limit point than the one derived from update rules (22) and (23). What the authors actually showed in Ding et al. (2008) is that the algorithm converges to one limit point when the factors are normalized at every iteration, so that the constraints Q_−^T e = e, Q_+ e = e and a^T e = 1 are satisfied, and to another limit point when the normalization is applied only once, at the end of the procedure (i.e., at convergence). This is to be expected, since these are two different optimization problems. The first one is the unconstrained NMF optimization problem, with free variables Z and H, whose solution is given by update rules (4) and (5). The second one is the constrained optimization problem (12), with free variables Q_−, Q_+ and a, subject to the additional constraints Q_−^T e = e, Q_+ e = e and a^T e = 1, whose solution is computed by update rules (13), (14) and (15). A similar analysis also holds for the optimization problem (6). That is, even though (12) and (6) are similar, they are not equivalent, and thus they do not converge to the same limit point.

Summarizing, we provided a supplement to the analysis given in Ding et al. (2008) and we showed that, even though the optimization problems solved by (4)–(5), (6) and (12) optimize similar objective functions, they are not equivalent, since they are optimization problems with different sets of variables (for example, in PLSA we search for an additional vector a) and with different constraints, and thus a different search space. Only the constrained NMF given by optimization problem (12) and solved by update rules (13), (14) and (15) is equivalent to PLSA, and both converge to the same limit point.
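To make the distinction concrete, the following NumPy sketch (ours, not from the paper) contrasts PLSA-style EM for the model P(x_1, x_2) ≈ Σ_z P(z) P(x_1|z) P(x_2|z), i.e. updates of the form (22)–(23), with the post-hoc normalization of Eq. (24) applied to an ordinary NMF solution. In general the two reach different limit points, which is precisely the point of this subsection. Function and variable names are ours, and the sketch assumes P is a probability matrix and that Z and H have no all-zero columns or rows.

```python
import numpy as np

def plsa_em(P, K, n_iter=200, eps=1e-12, rng=None):
    """EM for P(x1,x2) ~= sum_z P(z) P(x1|z) P(x2|z); P is an F x L probability matrix."""
    rng = np.random.default_rng(rng)
    F, L = P.shape
    Px1_z = rng.random((F, K)); Px1_z /= Px1_z.sum(axis=0)   # columns sum to one
    Px2_z = rng.random((L, K)); Px2_z /= Px2_z.sum(axis=0)
    Pz = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|x1,x2), shape (F, L, K)
        joint = Pz[None, None, :] * Px1_z[:, None, :] * Px2_z[None, :, :]
        resp = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: reweight by the data and renormalize
        Nz = np.einsum('il,ilk->k', P, resp)                  # unnormalized P^t(z)
        Px1_z = np.einsum('il,ilk->ik', P, resp) / (Nz + eps)
        Px2_z = np.einsum('il,ilk->lk', P, resp) / (Nz + eps)
        Pz = Nz / (Nz.sum() + eps)
    return Px1_z, Px2_z, Pz

def normalize_nmf_factors(Z, H):
    """Post-hoc normalization of Eq. (24): turn X ~= Z H into PLSA-style factors."""
    col = Z.sum(axis=0)          # e^T Z
    row = H.sum(axis=1)          # H e
    Q_minus = Z / col            # candidate P(x1|z): columns sum to one
    Q_plus = H / row[:, None]    # candidate P(x2|z): rows sum to one
    a = col * row                # diagonal of A; sums to one if X was normalized
    return Q_minus, Q_plus, a
```

Running `plsa_em` on a normalized matrix and `normalize_nmf_factors` on the output of the unconstrained NMF updates typically yields different factorizations of the same data, as the analysis above predicts.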

3 Nonnegative Tensor Factorization

In this section we briefly describe elements of multilinear algebra and show how to perform nonnegative tensor factorizations.

3.1 Tensor models and the PARAFAC decomposition

To begin with, let us briefly review the tensor algebra concepts needed hereafter and define the notation that will be used throughout the paper. An n-order tensor is a collection of measurements indexed by n indices, where each index refers to a mode. Accordingly, vectors are first-order tensors and matrices are second-order tensors (Kolda and Bader 2009). Lowercase letters (e.g. x) have already been used to denote scalars in the introduction, while boldface lowercase letters (e.g. x) and boldface capital letters (e.g. X) have denoted vectors and matrices, respectively. Higher-order tensors (of order 3 or higher) are denoted by boldface Euler script (calligraphic) letters (e.g. X). If the ith element of a vector x ∈ R₊^I is denoted by x_i, i = 1, 2, …, I, then the elements of an n-order tensor X are denoted by x_{i_1 i_2 … i_n}, i_l = 1, 2, …, I_l, l = 1, 2, …, n. In the following, we shall focus on nth order tensors with nonnegative elements, i.e. X ∈ R₊^{I_1 × I_2 × … × I_n}. For example, X could be the representation of a database with L objects. Then every database object is a nonnegative tensor of order n − 1, denoted as X_{:…:i_n} ∈ R₊^{I_1 × I_2 × … × I_{n−1}}, i_n = 1, 2, …, L, that is indexed by an (n − 1)-tuple of indices i_1, i_2, …, i_{n−1}. To be more specific, the most natural way to model a facial image database is by a 3rd-order tensor X ∈ R₊^{I_1 × I_2 × I_3}, where I_1 and I_2 refer to the image height and width, respectively, and I_3 = L is the number of images in the database (Shashua and Hazan 2005). A relationship that refers to more than two modes is more naturally represented by a tensor X. Such relationships could be nonnegative affinity measures over an n-tuple of patterns that frequently arise in visual interpretation problems, including 3D multi-body segmentation and illumination-based clustering of human faces (Shashua and Hazan 2005). An n-variate distribution is likewise more naturally modelled as an n-order tensor.

Frequently, transformations of tensors into matrices (l-mode matricization) and of matrices into vectors (vectorization) are needed. The mode-l matricization of a tensor X ∈ R₊^{I_1 × I_2 × … × I_n} maps X to a matrix X_(l) ∈ R^{I_l × M} with M = Π_{m=1, m≠l}^n I_m, such that the tensor element x_{i_1 i_2 … i_n} is mapped to the matrix element x_{i_l j}, where

j = 1 + Σ_{k=1, k≠l}^n (i_k − 1) J_k,  with  J_k = Π_{m=1, m≠l}^{k−1} I_m.
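As a sanity check of the index convention above, here is a small NumPy sketch (ours) of mode-l matricization and its inverse; it relies on the fact that, with this ordering, unfolding amounts to moving mode l to the front and reshaping in Fortran (column-major) order. The function names and the 0-based mode indexing are our own choices.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` matricization X_(mode) of an n-order tensor X (0-based mode index)."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order='F')

def fold(Xmat, mode, shape):
    """Inverse of `unfold`: rebuild the tensor of the given shape from X_(mode)."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(np.reshape(Xmat, full, order='F'), 0, mode)
```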

The operator vec applied to a matrix stacks the matrix columns into a single vector. The matrix of ones and the identity matrix are denoted by E and I, respectively; a tensor of ones is similarly denoted by E. A tensor which has all elements equal to 0, except those for which all indices are the same, is called superdiagonal. If the nonzero elements are equal to 1, then such a superdiagonal tensor is referred to as the unit superdiagonal tensor I. In the remainder of the paper, we shall also use tensor operators for elementwise (Hadamard) multiplication and division between tensors.

In the following, we shall need several products between vectors and matrices, as well as between tensors and matrices. Let a ∈ R₊^I and b ∈ R₊^J be two nonnegative real-valued vectors. Their outer product yields a matrix C ∈ R₊^{I×J}:

C = a ∘ b,  with elements c_{ij} = a_i b_j.   (25)

Consequently, the outer product of n vectors a_l ∈ R₊^{I_l}, l = 1, 2, …, n, written a_1 ∘ a_2 ∘ … ∘ a_n = ∘_{l=1}^n a_l, yields a tensor A ∈ R₊^{I_1 × I_2 × … × I_n}. The Kronecker product between two matrices A ∈ R₊^{I_1 × M_1} and B ∈ R₊^{I_2 × M_2} is defined as:

A ⊗ B = [ a_{ij} B ]_{i=1,…,I_1; j=1,…,M_1} = [ a_1 ⊗ b_1  a_1 ⊗ b_2 … a_1 ⊗ b_{M_2}  a_2 ⊗ b_1 … a_{M_1} ⊗ b_1 … a_{M_1} ⊗ b_{M_2} ]   (26)

and yields a nonnegative matrix of size I_1 I_2 × M_1 M_2. Accordingly, the Kronecker product of n matrices A_l ∈ R₊^{I_l × M_l}, l = 1, 2, …, n, denoted compactly as ⊗_{l=1}^n A_l, yields a matrix of size (Π_{l=1}^n I_l) × (Π_{l=1}^n M_l). The Khatri-Rao product between two matrices A = [a_1 a_2 … a_K] ∈ R₊^{I×K} and B = [b_1 b_2 … b_K] ∈ R₊^{J×K} results in a matrix of size IJ × K. It is defined as the matching columnwise Kronecker product of the aforementioned matrices, i.e.

A ⊙ B = [a_1 ⊗ b_1  a_2 ⊗ b_2  …  a_K ⊗ b_K].   (27)

If we have n matrices A_l ∈ R₊^{I_l × K}, l = 1, 2, …, n, their Khatri-Rao product is compactly denoted as ⊙_{l=1}^n A_l and yields a nonnegative matrix of size (Π_{l=1}^n I_l) × K.

The two most commonly used tensor decompositions are the PARAFAC/CANDECOMP model (Harshman 1970; Carroll and Chang 1970), which is also equivalent to the decomposition using rank-1 Kruskal tensors (Kruskal 1977), and the Tucker tensor models (Tucker 1966). Another way to perform the decomposition, especially for 3D tensor factorization, is given through the PARAFAC2 model (Smilde et al.), which has been applied to 3D NTF in Cichocki et al. (2007, 2008). An n-order tensor X is of rank at most K if it can be expressed as the sum of K rank-1 Kruskal tensors,

Fig. 1 Visualization of the rank-K approximation of a 3-order tensor using Kruskal tensor notation

i.e., as a sum of K n-fold outer products. Using the PARAFAC tensor model, a tensor X can be decomposed as the sum of K n-fold outer products as:

X ≈ Σ_{l=1}^K ∘_{j=1}^n u_j^l,  i.e.  x_{i_1 … i_n} ≈ Σ_{l=1}^K u_{1,i_1}^l … u_{n,i_n}^l,  1 ≤ i_j ≤ I_j, 1 ≤ j ≤ n,   (28)

with u_j^l ∈ R₊^{I_j} and u_j^l = [u_{j,1}^l, …, u_{j,I_j}^l]^T. That is, NTF aims at finding the best rank-K nonnegative approximation of X with respect to an approximation cost. The NTF factorization using Kruskal tensors is pictorially described in Fig. 1. As can be seen, the tensor X ∈ R₊^{I_1 × I_2 × I_3} is represented as the sum of K outer-product tensors u_1^l ∘ u_2^l ∘ u_3^l. Decompositions like (28) can be calculated using only matrix operations. To do so, let us define the matrices U_j = [u_j^1 … u_j^K], which contain the Kruskal vectors as columns, j = 1, …, n, or equivalently U_j = [u_{j,i}^l] ∈ R₊^{I_j × K}, 1 ≤ i ≤ I_j, 1 ≤ j ≤ n, 1 ≤ l ≤ K. These matrices will be used for defining NTF algorithms using matrix multiplications. Let us introduce the compact notation

U^{⊙(−j)} ≜ U_n ⊙ … ⊙ U_{j+1} ⊙ U_{j−1} ⊙ … ⊙ U_1,   (29)

i.e., the Khatri-Rao product of all factor matrices except U_j. Using the compact notation (29), the nonnegative factorization (28) can be written in matricized form as:

X_(j) ≈ R_(j) = U_j Z_j^T = U_j ( U^{⊙(−j)} )^T,   (30)

where the matrices R_(j) ∈ R₊^{I_j × Π_{i=1, i≠j}^n I_i} and Z_j = U^{⊙(−j)} ∈ R₊^{(Π_{i=1, i≠j}^n I_i) × K} will be helpful in defining algorithms for the above factorization. The NMF problem X ≈ ZH (or x_j ≈ Zh_j) in Lee and Seung (2000) can easily be derived from (28) by selecting Z = U_1 and H = U_2^T. In the rest of the paper, the PARAFAC tensor decomposition and the decomposition using the Kruskal tensor model will be used interchangeably to denote the same tensor decomposition.
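The matricized form (30) is what makes the algorithms implementable with ordinary matrix algebra. Below is a small NumPy sketch (ours, not from the paper) of the Khatri-Rao product and of the reconstruction R_(j) = U_j Z_j^T; it assumes the `unfold` helper sketched earlier, 0-based mode indices, and factor matrices that all share K columns.

```python
import numpy as np
from functools import reduce

def khatri_rao(matrices):
    """Columnwise Kronecker (Khatri-Rao) product of matrices sharing K columns."""
    K = matrices[0].shape[1]
    def kr(A, B):
        # Column l of the result is the Kronecker product of column l of A and of B.
        return (A[:, None, :] * B[None, :, :]).reshape(-1, K)
    return reduce(kr, matrices)

def parafac_reconstruct_unfolded(U, mode):
    """R_(mode) = U[mode] @ Z_mode^T, with Z_mode the Khatri-Rao of all other factors."""
    # Reverse order (U_n, ..., skipping U_mode, ..., U_1) to match the unfolding convention.
    others = [U[j] for j in reversed(range(len(U))) if j != mode]
    return U[mode] @ khatri_rao(others).T
```

For a tensor built exactly as a sum of outer products of the columns of U, `unfold(X, j)` and `parafac_reconstruct_unfolded(U, j)` agree up to floating-point error, which is a convenient check of the ordering conventions.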

14 432 S. Zafeiriou, M. Petrou 3.2 Nonnegative Tensor Factorization algorithms for KL-divergence The first method for NTF was proposed in Paatero and Tapper 994. The NMF methods proposed in Lee and Seung 2000 were generalized using PARAFAC tensor decomposition in Shashua and Hazan Recently, in Kim and Choi 2007 and Mørup et al NTF methods using Tucker decomposition were proposed. Moreover, in Cichocki et al PAFARAC2 models for 3D NTF were also proposed. InHazan et al the authors proposed an NTF algorithm using the KL-divergence for 3-order tensors. NTF with the KL-divergence for arbitrary order tensors was proposed in Zafeiriou 2009a. In the general n-order tensor case, the KL divergence between the given tensor X and the sum of rank- tensors is: D KL X K n = ul = I,...,I n x i...i n ln x i...i n Km= n= i =,...,i n = ui m K n x i...i n +. 3 m= = u m i The optimization problem of this NTF decomposition is: min u m i D KL X K m= n = um subect to u m i 0, i I, n, m K 32 The solution of the above optimization problem was calculated via the use of an auxiliary function. W is an auxiliary function of Y F if W F, F t Y F and W F, F = Y F. IfW is an auxiliary function of Y, then Y is nonincreasing under the update F t = arg min F W F, F t Lee and Seung With the help of the auxiliary function, the update rules for U can be derived. By fixing matrices U,...,U, U +,...,U n, the elements of matrix U are updated by minimizing Y U = D KL X K m= n = um we define function: = D KL X U U l W U t, U t I,...,I n x i i =,...,i n =...i n lnx i...i n x i...i n + I,...,I n x u K i m t n r= u m i r r r = i i =,...,i n =...i n m= K u l t nr= ln i u l i r r ui m t n r= ui m r r r = ln K u l t i r= r = u l i r r + I,...,I n r = T. For matrix U u l t n i K i =,...,i n = m= um i t n r= r = r= r = u l i r r u m i r r. 33

15 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 433 Function 33 is an auxiliary function for 3. A proof of the above statements is given in Zafeiriou 2009a. Let us define the compact notation and n r=,t i I I 2 i = i 2 = u m i r,r um i t n r=,r =,t I I + i = i + = u m i r r I n i n =. 34 u m i t u m i t u m i t u m i + + t u m i n n t. 35 The above calculates the product n r= ui m r r where for r =,..., weusethe estimate of ui m r r at time t while for the rest of them we use the estimate at time t. Having defined the above compact notations the update rules that can guarantee a non increasing behavior of cost function 3 for factors ui m t, are obtained by solving W U t,u t = 0 and are given by: u m i u m i t = u m i t nr=,r = ui m,t i x i...i n K nr= u l ir r,t nr= i ui m,t. 36 The above update rules can be formulated using matrix operations. To do so, let us define notation Z t U n t U t + Ut Ut 37 and [ R t U t Z t ] T. 38 Using the above definitions, the update rules 36 are given by: U t = U t Z t T X R t t EZ. 39 If we allow a total of r iterations, the complexity of the NTF algorithm is O rnk n = I.
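Putting the pieces together, the following NumPy sketch (ours, not the authors' implementation) runs the multiplicative NTF update of Eq. (39), U_j ← U_j ⊛ [ (X_(j) / R_(j)) Z_j ] / (E Z_j), cycling over the modes. It reuses the `unfold`, `khatri_rao` and `generalized_kl` helpers sketched above; the random initialization, fixed iteration count and the small constant guarding divisions are arbitrary choices.

```python
import numpy as np

def ntf_kl(X, K, n_iter=200, eps=1e-12, rng=None):
    """Multiplicative updates for min D_KL(X || sum of K rank-1 nonnegative tensors)."""
    rng = np.random.default_rng(rng)
    U = [rng.random((I, K)) + eps for I in X.shape]
    for _ in range(n_iter):
        for j in range(len(U)):
            others = [U[m] for m in reversed(range(len(U))) if m != j]
            Z = khatri_rao(others)                  # (product of other dims) x K
            Xj = unfold(X, j)                       # I_j x (product of other dims)
            R = U[j] @ Z.T + eps                    # current reconstruction, unfolded
            U[j] *= ((Xj / R) @ Z) / (Z.sum(axis=0) + eps)   # rule (39)
    return U

# The KL cost can be monitored per sweep, e.g.:
# cost = generalized_kl(unfold(X, 0), parafac_reconstruct_unfolded(U, 0))
```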

16 434 S. Zafeiriou, M. Petrou InSra and Dhillon 2006 exponential rules for NMF using the KL divergence were proposed. The exponential rules for arbitrary valence NTF were proposed in Zafeiriou 2009a. For the exponential rules the following problem was considered: K min D KL ui m m= n = um X subect to ui m 0, i I, n, m K. 40 The update rules that guarantee a nonincreasing behavior of the above cost function are the following Zafeiriou 2009a: i nr=,r = u l i u l t i = u l t r r ln exp,t i or in matrix notation: U t = U t exp ln x i...in nm= nr= uir m r,t nr= i u l i r r,t X R t t EZ Z t T NTF with KL divergence as an alternative Csiszar Tusnady procedure and convergence properties In this section we will formally define the proposed NTF with KL divergence, provide a clear probabilistic interpretation and explore its convergence properties. First, let us consider the optimization problem with constraints: min U,...,U n :e T U e=,u 2 T e=e,...,u n T e=e D KL X K n = ul Proposition 2. The NTF optimization problem 43 has a solution.. 43 The proof of the above proposition can be found in Appendix 0. Before starting to explore the properties of NTF with KL Divergence, we should try to find the resemblance between the robust NMF method proposed in Finesso and Sprei 2006 and NTF decomposition. NMF is a special case of NTF when having 2-order tensors i.e. having only U and U 2, thus the NTF optimization problem is exactly the one in Finesso and Sprei 2006 under the constraints e T U e = and U2 T e = e. The generalized optimization problem for NTF using KL divergence is the one expressed by 43. In order to define the robust NTF algorithm we should reformulate the optimization problem using probability matrices. That is, we shall define the probability tensor as

17 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 435 P such that p i i 2...i n 0 and I,I 2,...,I n i =,i 2 =,...,i n = p i i 2...i n =. For convenience, we shall use the notation Pi,...,i n interchangeably with p i...i n. Let p I,...,I n i =,...,i n = x i...i n and w e T U e then we define P p X, Q w U and Q i U i for all i = 2,...,n. As proven in Appendix cost function D KL X K n = ul can be reformulated as: D KL X K = pd KL K n = ul P n = ql + D KL p w. 44 Since number p is known, the minimization of D KL X K n = ul with respect to U,...,U n is equivalent to minimizing D KL P K n = ql with respect to Q,,Q n and minimizing D KL p w with respect to w.forthenew optimization problem, we obtain w opt = p and so U opt = pq opt and U opt Q opt, = 2,...,n. 2 Thus, minimization of D KL X K n = ul = subect to sub- nonnegativity constraints is equivalent to minimizing D KL P K n = ql ect to nonnegativity constraints. This leads to the minimization of the KL-Divergence between finite probability measures: D KL P K n = ql = I,...,I n i =,...,i n = D KL p i,...,i n K n q i l = = I,...,I n p i i =,...,i n =,...,i n log K n= qi l. 45 p i,...,i n Given a probability tensor P and an integer K, the optimization problem is to find Q,...,Q n : min Q,...,Q n :e T Q e=,q 2 T e=e,...,q n T e=e D KL P K n = ql The superscript opt reflects the optimality.

18 436 S. Zafeiriou, M. Petrou We define the function: W + U t, U t = I,...,I n i =,...,i n = I,...,I n K p i...i n i =,...,i n = m= ln p i...i n lnp i...i n qi m t m n r= q ir r r = K qi l t nr= qi l r r r = ln qi l t n qi m t n r= qi m r r r = K qi l t lr=. 47 q ir r r = This function 47 is an auxiliary function for 45. Let Q t w U t and Q t 2 U t 2,, Qt n r= r = q l i r r U t n. Now by substituting the definition of P, Q t,...,q n t into 36 and using the fact that w opt = p, the following update rules for t are obtained for factors q l i : qi l t = q l t i i p i...i n n=2 qi l,t Km= n=,t q m i 48 and for {2,...,n} qi l t = q l t i nr=,r = q l ir r,t i p i...i n Km= nr= qir m r,t n=2 qi l I,...,I n i =,...,i n = p,t i...i n Km= n= qi m,t Update rule 49 can be equivalently written as: qi l t = q l t i nr=,r = qir l,r,t i p i...i n Km= nr= qir m,r,t I i = ql t i, In a matrix notation the above update rules can be written as: P Q t = Q t R t Z t T Zt T

19 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 437 and Q t = Q t P R t Z t T Z t T EQ t 52 where Z t = Q n t Q t + Qt Qt and Rt = Q t Z t The following Theorems guarantee that the limit point of the update rules 48 and 49 exists and is a stationary point of optimization problem 46. Before presenting the convergence theorems we should introduce the following sets. Given a probability tensor P and an integer K, we introduce the following sets: { V R I K I 2 I n + : I ℵ } K Vi, l, i 2,...,i n = P { Q R I K I 2 I n + : Qi, l, i 2,...,i n = Q i, lq 2 i 2, l...q n i n, l } Q,...,Q n 0, e T Q e =, Q2 T e = e,...,qt n e = e { } K O R I I 2 I n + : O = Qi, l, i 2,...,i n for some Q I 53 Theorem 2. Let Q t,...,qt n be generated by algorithm 48, 49 and Q t Q t i, l, i 2,...,i n = Q t i, lq t 2 i 2, l...q t n i n, l the corresponding tensor. Then, Q t i, l converges to limit Q i, l and Q t converges to the limit Q in Q, Q t i, l for = 2,...,n converge to limits Q i, l for all l with I i = Q i, l >0. T. The proof can be found in Appendix 2. The role of tensor Q will be made clear in the next section. Theorem 2.2 If Q,...,Q n is a limit point of the algorithm given by update rules 48 and 49 in the interior of the domain, then it is a stationary point of the obective function D KL 46. If Q,...,Q n is a limit point on the boundary of the domain corresponding to an approximate factorization where none of the columns of Q is zero I i = Q i, l >0 l, then all partial derivatives are nonnegative. D KL Q i,l for =,...,n The proof can be found in Appendix 3. Corollary 2. shows that the limit points of the above optimization problem using the update rules 48 and 49 are all Kuhn-Tucker points.

20 438 S. Zafeiriou, M. Petrou Corollary 2. The limit points of the algorithm with I i = Q i, l > 0 for all l are all the Kuhn-Tucker points for the minimization of D KL under the inequality constraints Q 0,...,Q n 0. The proof of the above Corollary can be found in Appendix 4. 5 A probabilistic interpretation of the optimization problem In this section we explore the properties of optimization problem 46. More precisely we: provide a solid probabilistic interpretation of the optimization problem; propose generalized pythagorean rules; motivate the use of the auxiliary function Setup and exact NTF Let us consider the n + -tuple of random variables Y, X, Y 2,...,Y n, taking values in {,...,I } {,...,K } {,...,I 2 } {,...,I n }. The random variables Y, X, Y 2,...,Y n can be thought of as if they were defined in the canonical measurable space Ω, F, where Ω is the set of all n + -ples i, l, i 2,...,i n and F = 2 Ω i.e., the powerset of Ω. For ω = i, l, i 2,...,i n we have the identity mapping Y, X, Y 2,...,Y n ω = i, l, i 2,...,i n.ifr is a given probabilistic measure on this space, the distribution of the n + -tuple Y, X, Y 2,...,Y n under R is given by tensor R: Ri, l, i 2,...,i n = RY = i, X = l, Y 2 = i 2,...,Y n = i n. 54 Tensors V and Q define probability measures P and Q on Ω, F. Obviously, the sets and I are subsets of the set of all measures on Ω, F. In particular, corresponds to the subset of all measures whose Y = Y,...,Y n marginal coincides with the given P, while I corresponds to the subset of measures under which Y,...,Y n are conditionally independent given X. The first assertion can be proven by definition. The second assertion is proven as follows. First, notice that QY = i, X = l, Y 2 = i 2,...,Y n = i n = Qi, l, i 2,...,i n. Then, by summing over all i with = 2,...,n we have QY = i, X = l = Q i, l, since Q T e = e for = 2,...,n. By summing over all i with =,...,n we have QX = l = I i = Q i, l and thus QY = i X = l = QY =i,x=l QX=l. Moreover, for = 2,...,n and by summing over all i m with m {,...,n} then QY = i X = l = QY =i,x=l QX=l = I i = Q i,lq i,l I i = Q i,l = Q i, l. Finally, QY = i,...,y n = i n X = l = QY =i,x=l,...,y n =i n,y n =i n QX=l = QY = i, X = lqy 2 = i 2 X = l...qy n = i n X = l. Lemma 2. Tensor P admits exact factorization of K rank one tensors iff I = and an approximate NTF as I=.

21 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 439 A proof can be found in Appendix 5. Now we generalize the probabilistic interpretation of NMF to an arbitrary order exact NTF problem. The probability tensor P admits an exact NTF i.e., P = K n = ql iff there exist at least one measure Ω, F whose Y = Y,...,Y n marginal is P and at the same time making Y,...,Y n conditionally independent given X. Proposition 2.2 Given P, function V, Q D KL V Q attains a minimum on Iand it is valid that min D KLP Q = min D KLV Q 55 Q ℵ V,Q I The proof can be found in Appendix 6. Let V opt and Q opt be the tensors satisfying Proposition 2.2. If l 0 such that I,...,I n i =,i 2 =,...,i n = Vopt i, l 0, i 2,...,i n = 0, then all Q opt i, l 0, i 2,...,i n = 0, as well. Equivalently, if l 0, such that I,...,I n i =,...,i n = Qopt i, l 0, i 2,...,i n = 0, then all V opt i, l 0, i 2,...,i n = 0, as well. In both cases the optimal factorization of P has rank less than K and in order to proceed to the factorization, we may omit all the columns from Q,...,Q n that correspond to l Two partial subproblems Now we shall try to solve the equivalent double minimization problem: min V,Q I D KLV Q. 56 We should solve two partial minimization problems. In the first problem, given Q I, we minimize D KL V Q over V, while in the second one, given V we minimize the divergence over Q given V. The unique solution V opt = V opt Q can be easily computed analytically and is given by: V opt i, l, i 2,...,i n = Pi,...,i n K n= Q i, l Qi, l, i 2,...,i n 57 where K Q i, l = K Qi, l, i 2,...,i n. For the second minimization problem, the unique solution is given by: Q opt = I 2,...,I n i 2 =,...,i n = Vi, l, i 2,...,i n 58

22 440 S. Zafeiriou, M. Petrou and for = 2,...,n: Q opt i Vi, l, i 2,...,i n i, l = I,...,I n i =,...,i n = Vi, l, i 2,...,i n 59 These two minimization problems and their solutions have a direct probabilistic interpretation. This interpretation generalizes the discussion in Finesso and Sprei 2006 using arbitrary order tensors instead of order-2 tensors, i.e. matrices. In the first minimization given a distribution Q, which makes the sequence of Y = Y,...,Y n conditionally independent given X, the best solution represents the best set in with the marginal of Y being tensor P. Let V opt be the optimal distribution Y, l, Y 2,...,Y n. Equation 57 can then be interpreted in terms of corresponding measures as: P opt Y = i, X = l, Y 2 = i 2,...,Y n = i n = QX = l Y = i,...,y n = i n Pi, i 2,...,i n 60 Notice that the conditional distributions of X given Y under P opt and Q are the same. In the second optimization problem, one is given a distribution Q, with the marginal of Y given by P and finds the best approximation to it in the set I of distributions which make Y = Y,...,Y n conditionally independent given X. LetQ opt denote the optimal distribution of Y, l, Y 2,...,Y n, Y n. Equations 58 and 59 can be now be interpreted as: Q opt Y = i, X = l = PY = i, X = l 6 and for = 2,...,n Q opt Y = i X = l = PY = i X = l, 62 respectively. That is, the optimal solution Q opt is such that the marginal distribution of X, Y under P and Q opt coincide, as well as, the marginal distributions of Y given X under P and Q opt. Lemma 2.2 For fixed Q and V opt = V opt Q it holds that, for any V D KL V Q = D KL V V opt + D KL V opt Q 63 moreover D KL V opt Q = D KL P O 64

23 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 44 where Oi,...,i n = K Qi, l, i 2,...,l, i n 65 For fixed V and Q opt = Q opt V, it holds that, for any Q I. D KL V Q = D KL V Q opt + D KL Q opt Q. 66 The proof is provided in Appendix 7. Before we produce the other two Pythagorean rules for NTF, we should define the following laws. Let the law of U, V under arbitrary probability measures P and Q by P U,V and Q U,V. The conditional distributions of U given V are denoted by the matrices P U V and Q U V, with the convention P U V i, = PU = V = i and likewise for Q U V. Let us define the mean value operator E Z D KL P O E Z D KL P O = i,...,i n Zi,...,i n Pi,...,i n ln Pi,...,i n. Oi,...,i n 67 Lemma 2.3 It holds that D KL P U,V Q U,V = E P D KL P U V Q U V + D KL P V Q V 68 where D KL P U V Q U V = PU = V log PU = V OU = V 69 If moreover V = V, V 2 and U, V 2 are conditionally independent given V under Q, then the first term in 68 can be written as: E P D KL P U V Q U V = E P D KL P U V P U V + E P D KL P U V Q U V Now we will reinterpret the Pythagorean rules 63, 64 and 66 using probabilistic terms. In that way we generalize the probabilistic interpretation of NMF pythagorean rules in Finesso and Sprei The first optimization problem of 56 can be reinterpreted in probabilistic terms using similar lines as Finesso and Sprei That is, given Q, we are to find its best approximation within. LetQ correspond toagivenq and the P corresponds to a generic V. Choosing now, U = X and V = Y = Y,...,Y n in Lemma 2.3, Eq.68 reads 70 D KL V Q = E P D KL P X Y Q X Y + D KL P Q. 7

24 442 S. Zafeiriou, M. Petrou Moreover, 7 is equivalent to 63. That is, the optimization problem is equivalent to the minimization of E P P X Y Q X Y with respect to V, which is attained with value 0 at P opt with P X Y opt = QX Y and Popt Y = P. That is, for the second optimization problem i.e, minimizing D KL V Q given V we have the following. Given V we are to find its best approximation within Q. LetP correspond to given V and Q correspond to a generic Q I. Choosing U ={Y 2,...,Y n }, V = X and V 2 = Y in Lemma 2.3, and remembering that under any Q Ithe random variables Y, Y 2,...,Y n are conditionally independent given X, Eq.68 combined with 70 now reads: D KL V Q = E P D KL P Y 2,...,Y n X,Y P Y,...,Y n X + E P D KL P Y 2,...,Y n X Q Y 2,...,Y n X +D KL P Y,X Q Y,X = n =2 E P D KL P Y X Q Y X + D KL P Y,X Q Y,X as shown analytically in Appendix 8. In the special case of NMF Finesso and Sprei 2006 i.e., only two random variables Y and Y 2 the above Pythagorean rule is written as D KL V Q = E P D KL P Y 2 X Q Y 2 X + D KL P Y,X Q Y,X. The problem is equivalent to the minimization of n =2 E P D KL P Y X Q Y X and D KL P Y,X Q Y,X with respect to Q I. Both these minima are attained both with value 0 at Q opt with Q Y X opt = PY X for = 2,...,n and Q Y,X opt = PY,X. Note that X have the same distribution under P and Q opt. To derive probabilistically the corresponding generalized Pythagorean rule we notice that: D KL V Q D KL V Q opt = n =2 E P D KL Q Y X opt QY X 72 +D KL Q Y,X opt QY,X. 73 On the right hand side of 73 we can, by conditional independence, replace E Q opt D KL Q Y X opt QY X,for = 2,...,n with E Q opt D KL Q Y X,Y opt Q Y X,Y. By another application of 68, we see that D KL V Q D KL V Q opt = D KL Q opt Q, which is the second Pythagorean rule Alternating minimization procedure As in Finesso and Sprei 2006, we set up the alternating minimization algorithm for obtaining min O D KL P O, where P is a given nonnegative tensor of arbitrary order. In view of Proposition 2.2. we can lift this problem in Ispace. Starting with an arbitrary Q 0 Iwith positive elements, we adopt the following minimization

25 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 443 scheme Q t V t Q t+ V t+ 74 where V t = V opt Q t, Q t+ = Q opt V t.aswecanseeby74 the update gain is independent of the order of the tensors. To relate this algorithm to the one defined by formulae 48 and 49, we combine two steps of the alternative minimization at a time. From the algorithms 74 we get: Q t+ = Q opt V opt Q t. 75 Computing the optimal solutions according to 57, 58 and 59 one gets from there the update rules 48 and 49. The Pythagorean rules allow us to compute easily the gain D KL P O t D KL P O t+ of the algorithm. Proposition 2.3 The update gain at each iteration of the algorithm 75 in terms of the tensors O t is given by: D KL P O t D KL P O t+ = D KL V t V t+ +D KL Q t+ Q t 76 The proof is provided in Appendix 9. If for matrices Q 0,...,Q0 n we have that qi l 0 > 0 then under the iterations in 48 and in 49 we shall have qi l t > 0 with t > 0. If in the tth step the update gain is zero then we will have D KL Q t Q t = 0. Hence, tensors Q t and Q t are identical. From this it follows by summation that Q t = Q t. Moreover, we shall have that Q t+ i, l...q n t+ i n, l = Q t i, l...q t n i n, l, since Q t = Q t, Q t is strictly positive and by summing for all a {2,...,, +,...,n} we have that Q t = Q t for all {2,...,n}. Hence, the update rules strictly decrease the obective function until the algorithm reaches a fixed point. 5.3 Auxiliary functions We consider now the minimization of D KL P O and its lifted version i.e., the minimization of D KL V Q. We shall consider that the problem starts by noting that Q t+ is found by minimizing Q D KL V opt O t O. This strongly motivates the choice of function Q, Q W Q, Q = D KL V opt Q t Q 77 as an auxiliary function for minimizing D KL P O with respect to O.

26 444 S. Zafeiriou, M. Petrou Using the decomposition of the divergence in Eq. 78 we can rewrite W as W Q, Q = D KL V opt Q t Q = D KL Popt Y Q Y +E P opt D KL P X Y opt Q X Y 78 Since P X Y opt = QX Y, and Popt Y = P we write 78 as W Q, Q = D KL P Q Y + E P D KL Q X Y Q X Y 79 From 79 it follows that W Q, Q D KL P O, and that W Q, Q = D KL P O, which are the properties of an auxiliary function for D KL P O. For our problem one can find n auxiliary functions for the original minimization of D KL P K n = ql the same auxiliary functions given in 33. In every auxiliary function for Q, =,...,n we fix all m {,...,n} {}. As in Finesso and Sprei 2006 the auxiliary function W Q, Q can be denoted as W Q,...,Q n, Q n,...,q. The auxiliary function minimization with fixed Qm with m {,...,n} {} can be taken as Q W O Q = W Q,...,Q n, Q,...,Q, Q, Q +,...,Q n 80 Now, we shall generalize the update gains in Finesso and Sprei 2006 and provide the update gains for arbitrary order NTF decompositions. Lemma 2.4 Consider the auxiliary functions W Q, Q and n auxiliary functions W Q,...,Q n, Q,...,Q, Q, Q +,...,Q n. Denote by Q the minimizers of auxiliary functions in all n + cases. Then, the following Lemma holds. D KL P D KL P K K D KL P K n = ql n = ql n = ql = D KL Q Y,X Q Y,X + W Q Q = DKL Q Y,X Q Y,X 8 W Q Q = E P opt D KL Q Y X Q Y X 82 W Q,...,Q n, Q,...,Q n n E Q D KL Q Y X Q Y X 83 =2 The proof of the above can be found in Appendix 20.

27 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 445 Corollary 2.2 The update gain of the multiplicative update rules in 48 and 49 is given by: D KL P O t D KL P O t+ = D KL Q t+y,x Q t Y,X + n =2 E Q t+ D KL Q t+y X Q t Y X +E P D KL Q t X Y Q t+ X Y. 84 In order to prove the above we start by writing: D KL P O t D KL P O t+ = D KL P O t W O t, O t+ +W O t, O t+ D KL P O t+ 85 and then, using 79 and 83 the corollary is proven. Finally, by using 76 we can prove that: D KL V t V t+ = E P Q t X Y t+ X Y Q D KL Q t Q t+ = D KL Q t X Y t+ X Y Q + n E Q t+ D KL Q t+y X Q t Y X 86 =2 6 Probabilistic latent tensor variable analysis In the previous section we studied some of the properties of the NTF optimization problem in 43 and the corresponding algorithm based on multiplicative update rules. In this section we shall try to relate the above algorithm with probabilistic tensor latent component analysis models. In Shashanka et al the authors commented about the possible extension of PLSA and PLVA models the algorithms of which are implemented using update rules similar to NMF using arbitrary order tensors but no algorithm was proposed. Here we shall show that the algorithm for solving 43 can indeed be used for multilinear PLTVA and by introducing additional factors we propose algorithms for PLTVA which are the generalization of the NMF algorithms used in Shashanka et al and Gaussier and Goutte In the following, we generalize both the symmetric and asymmetric PLSA models. 6. The symmetric model In the symmetric PLSA model for the two random variable case we seek for the latent probabilities matrices for Px 2 z and Px, z or Px z. Let x,...,x n be random

28 446 S. Zafeiriou, M. Petrou variables. Then approximation 6 can be generalized as: Px,...,x n z,z=,...,k In an EM manner the above is solved as: Px, z n Px z 87 =2 P t P t x, z n =2 P t x z z x,...,x n = z {,2,...,K } Pt x, z n =2 P t x z 88 and P t x, z = Px,...,x n P t z x,...,x n x 2,...,x n P t x,...,x,x +,...,x n Px,...,x n P t z x,...,x n x z = x,...,x n Px,...,x n P t, = 2,...,n. z x,...,x n 89 We can easily verify that the above update rules 88 and 89 are equivalent to update rules 57, 58 and 59 and also to update rules 48 and 49. Thus, approximation 87 is equivalent to optimization problem 43. Now, we shall expand further the approximation 87 by allowing another factor Pz, which is naturally introduced by expanding Px, z = PzPx z. Thus, factorization 87 becomes: Px,...,x n z,z=,...,k Pz In an EM approach the above problem is solved as: n Px z. 90 = P t P t z n = P t x z z x,...,x n = z {,2,...,K } Pt z n = P t x z 9 and P t x Px,...,x n P t z x,...,x n x z = x,...,x n Px,...,x n P t z x,...,x n P t z = Px,...,x n P t z x,...,x n. 92 x,...,x n
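A direct implementation of the EM updates (91)–(92) for the symmetric model with a latent prior P(z) is straightforward whenever the joint probability tensor fits in memory. The NumPy sketch below (ours, not from the paper) is written for an arbitrary number of modes; the function name, initialization and normalization safeguards are our choices. The text goes on to show that these element-wise updates coincide with the matricized rules derived next.

```python
import numpy as np

def symmetric_pltva(P, K, n_iter=200, eps=1e-12, rng=None):
    """EM for P(x_1,...,x_n) ~= sum_z P(z) prod_j P(x_j|z); P is a joint probability tensor."""
    rng = np.random.default_rng(rng)
    n = P.ndim
    Pz = np.full(K, 1.0 / K)
    Pxz = []
    for I in P.shape:
        M = rng.random((I, K))
        Pxz.append(M / M.sum(axis=0))                 # each column is a distribution over x_j
    for _ in range(n_iter):
        # E-step: responsibilities P(z | x_1,...,x_n), shape P.shape + (K,)
        joint = np.ones(P.shape + (K,)) * Pz
        for j in range(n):
            shape = [1] * n + [K]
            shape[j] = P.shape[j]
            joint = joint * Pxz[j].reshape(shape)
        resp = joint / (joint.sum(axis=-1, keepdims=True) + eps)
        # M-step: reweight by the data tensor and renormalize
        W = P[..., None] * resp
        Nz = W.reshape(-1, K).sum(axis=0)             # unnormalized P^t(z)
        for j in range(n):
            other_axes = tuple(a for a in range(n) if a != j)
            Pxz[j] = W.sum(axis=other_axes) / (Nz + eps)
        Pz = Nz / (Nz.sum() + eps)
    return Pz, Pxz
```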

29 Nonnegative tensor factorization as an alternative Csiszar Tusnady procedure 447 By substituting 9 into92 we get: x Px,...,x n P t x z = x,...,x n Px,...,x n P t z n r= Px r z,t z {,2,...,K } Pt z n r=,t Px r z P t z n r= Px r z,t z {,2,...,K } Pt z n r=,t Px r z P t z = P t z n = P t x z Px,...,x n x,...,x n z {,2,...,K } Pt z n = P t x z. 93 Now, let P be the probability tensor with p i...i n = Px = i, x 2 = i 2,...,x n = i n and let matrices [Q ] i l = Px = i z = l and vector a with [a] l = Pz = l. Then, approximation 90 is written as: P R = K a l n = ql. 94 In a matricized form the above nonnegative approximation can be written as: P R = Q Z T = Q A Q i T 95 for =,...,n where A = diaga and Z = i Q A. In terms of A it can be written as: vecp Q Q i a. 96 Using tensor unfolding and matricization procedures, the update rules 93 can be written as: Q t = Q t P Q t = Q t E Q t R t Z t T 97 where R t = Q t t Z T and Z t = Q t n... Q t... Q t + Q t.

30 448 S. Zafeiriou, M. Petrou The update rule for the probabilities of the latent variables is: a t = a t Q t Q t i T vecp Q t Qt i a t. 98 We can easily verify that the above update rules are the solution of the following optimization problem: min Q,...,Q n :Q T e=e,e T a= D KL P K a l n = ql. 99 In order to use the above algorithm for the factorization of an arbitrary tensor X K a l u l with w = I,...,I n i =,...,i n = X i,...,i n we should normalize the tensor into P = w X, then use update rules 97. Finally, the factors of the decomposition are given by b opt = w a opt and U opt = Q opt. 6.2 Asymmetric model Now we generalize the asymmetric PLSA model for the arbitrary order tensor case. In the asymmetric PLSA model for the two random variable case we seek for the latent probabilities matrices for Px z and Pz x 2. An example of the asymmetric model decomposition is the application of NMF to facial images. In this case the images are vectorized. The vectors are then normalized in order to sum up to one i.e., in order to form probability mass functions. Then, matrix P =[Px = i x 2 = ] contains as columns the facial images x is the random variable that corresponds to feature dimensionality and x 2 is the random variable that corresponds to number of faces in the dataset. The decomposition 6 can be reformulated as: Px x 2 = Pz x 2 Px z 00 z {,...,K } then using the EM in 8 can be reformulated as: P t P t z x 2 P t x z z x, x 2 = z {,...,K } Pt x zp t z x 2 0

and

$$P^{t+1}(z|x_2) = \frac{\sum_{x_1} P(x_1, x_2, z)}{\sum_{x_1} \sum_{z' \in \{1,\ldots,K\}} P(x_1, x_2, z')} = \frac{\displaystyle\sum_{x_1} P(x_1|x_2)\, \frac{P^{t}(x_1|z)\, P^{t}(z|x_2)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(x_1|z')\, P^{t}(z'|x_2)}}{\displaystyle\sum_{x_1} \sum_{z' \in \{1,\ldots,K\}} P(x_1|x_2)\, \frac{P^{t}(x_1|z')\, P^{t}(z'|x_2)}{\sum_{z'' \in \{1,\ldots,K\}} P^{t}(x_1|z'')\, P^{t}(z''|x_2)}} \quad (102)$$

and

$$P^{t+1}(x_1|z) = \frac{\sum_{x_2} P(x_1, x_2, z)}{\sum_{x_1, x_2} P(x_1, x_2, z)} = \frac{\displaystyle\sum_{x_2} P(x_2)\, P(x_1|x_2)\, \frac{P^{t}(x_1|z)\, P^{t}(z|x_2)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(x_1|z')\, P^{t}(z'|x_2)}}{\displaystyle\sum_{x_1, x_2} P(x_2)\, P(x_1|x_2)\, \frac{P^{t}(x_1|z)\, P^{t}(z|x_2)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(x_1|z')\, P^{t}(z'|x_2)}}. \quad (103)$$

Assume now that the images are not vectorized; then each facial image is represented by a matrix, and all the faces together form a tensor P = [P(x_1 = i_1, x_2 = i_2 | x_3 = i_3)], where x_1 and x_2 are the random variables that correspond to the height and width of the facial images and x_3 is the random variable that corresponds to the number of facial images in the dataset. In the general n-order case we have:

$$P(x_1,\ldots,x_n|x_{n+1}) = \sum_{z \in \{1,\ldots,K\}} P(z|x_{n+1})\, P(x_1,\ldots,x_n|z) = \sum_{z \in \{1,\ldots,K\}} P(z|x_{n+1}) \prod_{j=1}^{n} P(x_j|z). \quad (104)$$

The above model can be solved in an EM manner as:

$$P^{t}(z|x_1,\ldots,x_{n+1}) = \frac{P^{t}(z|x_{n+1}) \prod_{j=1}^{n} P^{t}(x_j|z)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(z'|x_{n+1}) \prod_{j=1}^{n} P^{t}(x_j|z')} \quad (105)$$

and

$$P^{t+1}(z|x_{n+1}) = \frac{\displaystyle\sum_{x_1,\ldots,x_n} P(x_1,\ldots,x_n|x_{n+1})\, \frac{P^{t}(z|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(z'|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z')}}{\displaystyle\sum_{x_1,\ldots,x_n} \sum_{z' \in \{1,\ldots,K\}} P(x_1,\ldots,x_n|x_{n+1})\, \frac{P^{t}(z'|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z')}{\sum_{z'' \in \{1,\ldots,K\}} P^{t}(z''|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z'')}} \quad (106)$$

and

$$P^{t+1}(x_j|z) = \frac{\displaystyle\sum_{x_1,\ldots,x_{j-1},x_{j+1},\ldots,x_{n+1}} P(x_1,\ldots,x_n|x_{n+1})\, P(x_{n+1})\, \frac{P^{t}(z|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(z'|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z')}}{\displaystyle\sum_{x_1,\ldots,x_{n+1}} P(x_1,\ldots,x_n|x_{n+1})\, P(x_{n+1})\, \frac{P^{t}(z|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z)}{\sum_{z' \in \{1,\ldots,K\}} P^{t}(z'|x_{n+1}) \prod_{r=1}^{n} P^{t}(x_r|z')}}. \quad (107)$$

Then, by defining the tensor P with elements P(i_1,\ldots,i_{n+1}) = P(x_1 = i_1,\ldots,x_n = i_n | x_{n+1} = i_{n+1}), the matrices Q_j with [Q_j]_{i_j l} = P(x_j = i_j | z = l) for j = 1,\ldots,n, and Q_{n+1} with [Q_{n+1}]_{i_{n+1} l} = P(z = l | x_{n+1} = i_{n+1}), we formulate the following optimization problem:

$$\min_{Q_1,\ldots,Q_{n+1}:\; Q_j^{T} e = e,\; Q_{n+1} e = e} D_{KL}\Big(P \,\Big\|\, \sum_{l=1}^{K} q_l^{(1)} \circ \cdots \circ q_l^{(n+1)}\Big). \quad (108)$$

Using tensor unfolding and matricization procedures, the above update rules can be written as:

$$\tilde{Q}_j^{t+1} = Q_j^{t} \circledast \big[(P_{(j)} \oslash R_{(j)}^{t})\, Z_j^{t}\big], \qquad Q_j^{t+1} = \tilde{Q}_j^{t+1} \oslash \big(E\, \tilde{Q}_j^{t+1}\big) \quad (109)$$

for j = 1,\ldots,n, where R_{(j)}^{t} = Q_j^{t} (Z_j^{t})^{T}, Z_j^{t} = \bar{Q}_{n+1}^{t} \odot Q_n^{t} \odot \cdots \odot Q_{j+1}^{t} \odot Q_{j-1}^{t} \odot \cdots \odot Q_1^{t} and Z_{n+1}^{t} = Q_n^{t} \odot \cdots \odot Q_1^{t}, with \bar{Q}_{n+1}^{t} = B Q_{n+1}^{t} and B = diag(b) an I_{n+1} \times I_{n+1} diagonal matrix with b_l = P(x_{n+1} = l), i.e., the probability of every instance. For matrix Q_{n+1}^{t} we have the following update rules:

$$\tilde{Q}_{n+1}^{t+1} = Q_{n+1}^{t} \circledast \big[(P_{(n+1)} \oslash R_{(n+1)}^{t})\, Z_{n+1}^{t}\big], \qquad Q_{n+1}^{t+1} = \tilde{Q}_{n+1}^{t+1} \oslash \big((\tilde{Q}_{n+1}^{t+1} e)\, e^{T}\big). \quad (110)$$

7 Experimental results

We demonstrate the usefulness of the proposed approach using both simulated and real-life data.
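For concreteness, here is a minimal sketch of the asymmetric EM iterations (105)–(107) of the previous section, specialized to the third-order case (images represented as matrices, with the third mode indexing instances). The function name, initialization, iteration count and epsilon guards are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def asymmetric_pltva_em(P, prior, K, n_iter=200, rng=None):
    """EM sketch of the asymmetric model (104)-(107) for the third-order case:
    P[i1, i2, i3] ~ P(x1=i1, x2=i2 | x3=i3), prior[i3] ~ P(x3=i3).
    Returns Q1 ~ P(x1|z), Q2 ~ P(x2|z), Q3 ~ P(z|x3)."""
    rng = np.random.default_rng(rng)
    I1, I2, I3 = P.shape
    Q1 = rng.random((I1, K))
    Q1 /= Q1.sum(axis=0, keepdims=True)                  # columns sum to one
    Q2 = rng.random((I2, K))
    Q2 /= Q2.sum(axis=0, keepdims=True)
    Q3 = np.full((I3, K), 1.0 / K)                       # rows sum to one

    for _ in range(n_iter):
        # E-step (105): posterior over z, shape (I1, I2, I3, K)
        post = Q1[:, None, None, :] * Q2[None, :, None, :] * Q3[None, None, :, :]
        post /= post.sum(axis=-1, keepdims=True) + 1e-12
        W = P[..., None] * post
        # M-step (106): new P(z | x3), one row per instance x3
        Q3 = W.sum(axis=(0, 1))
        Q3 /= Q3.sum(axis=1, keepdims=True) + 1e-12
        # M-step (107): new P(x_j | z), instances weighted by P(x3)
        Wp = W * prior[None, None, :, None]
        Q1 = Wp.sum(axis=(1, 2))
        Q1 /= Q1.sum(axis=0, keepdims=True) + 1e-12
        Q2 = Wp.sum(axis=(0, 2))
        Q2 /= Q2.sum(axis=0, keepdims=True) + 1e-12
    return Q1, Q2, Q3
```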

7.1 Experiments using simulated data

We demonstrate the power of the PLTVA framework by providing empirical verification that the algorithm given by update rules (97) and (98) can produce better bases, in terms of interpretability and sparseness, than PLSA, and by showing its effect on the success of recreating the underlying model. One dataset that has been used to this end is the Swimmer database. The Swimmer database was used in Donoho and Stodden (2004) to demonstrate a case in which a unique decomposition into parts can be obtained. Some of the images of the Swimmer dataset can be seen in Fig. 2a. The Swimmer image set comprises 256 images, which form a third-order tensor X with two spatial modes and one image-index mode. Each image contains a torso (the invariant part) of 12 pixels in the center and four limbs of 6 pixels that can each be in one of 4 positions. We applied the NMF algorithm given by update rules (3), (4) and (5), which is equivalent to PLSA, to the matrix X_(3), the matricization of tensor X in terms of the number of images. Matrix X_(3) is first normalized so that it sums up to one. The resulting bases, given by the columns of matrix Q, are shown in Fig. 2b. As can be seen, the NMF scheme of Lee and Seung (2000) correctly resolves the local parts but fails on the torso, which appears as a ghost part in all images [this has also been demonstrated in Donoho and Stodden (2004)]. Next, we applied the NTF given by (97) and (98), which is equivalent to PLTVA. The proposed NTF algorithm computed bases given by the outer products q_l^{(1)} \circ q_l^{(2)}, l = 1, ..., K, which constitute a unique factorization and correctly resolve the parts (Fig. 2c). Moreover, we performed eigenanalysis on the bases of NMF (PLSA) and NTF (PLTVA); the mean number of non-zero eigenvalues was 3.92 for NMF and 1 for NTF. This shows that the proposed NTF correctly decomposed the data tensor into rank-one basic tensors, i.e., independent factors given the latent variable.

7.2 Face verification experiments

The experiments conducted with the XM2VTS database used the protocol described in Messer et al. The images were aligned semi-automatically using the eye coordinates of each facial image. The facial images were down-scaled to a lower resolution. Histogram equalization was used for

Fig. 2 Some images of the Swimmer database and the corresponding bases for PLSA and PLTVA: a Swimmer images; b PLSA basis images; c PLTVA basis images
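As a small illustration of the preprocessing described above for the PLSA baseline, the sketch below matricizes an image stack by image index and normalizes it to sum to one before the KL-based factorization is applied. The orientation (images as columns) and the generic dimensions are our own choices; they are not taken from the paper.

```python
import numpy as np

def matricize_and_normalize(X):
    """Given an image stack X of shape (height, width, n_images), return the
    matricization in terms of the number of images (here: one vectorized
    image per column), globally normalized so that its entries sum to one."""
    h, w, n = X.shape
    X3 = X.reshape(h * w, n, order="F")   # column k is image k, vectorized
    return X3 / X3.sum()
```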
