Nonnegative Matrix Factorization

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
Lee and Seung, 1999
Nonnegative Matrix Factorization (NMF)

Given a nonnegative target matrix $X \in \mathbb{R}^{D \times N}$, determine a 2-factor decomposition
$$X \approx U V^\top,$$
such that the factor matrices $U \in \mathbb{R}^{D \times K}$ and $V \in \mathbb{R}^{N \times K}$ are nonnegative as well.

Low-rank approximation of nonnegative data. Involves the optimization:
$$\min_{U \ge 0,\, V \ge 0} \|X - U V^\top\|^2.$$
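As a minimal illustration of this optimization (not part of the slides), the sketch below uses scikit-learn's NMF, which minimizes the same Frobenius objective; the toy matrix and the choice of rank K = 2 are assumptions made only for the example. Note that scikit-learn writes the factorization as X ≈ WH, so H plays the role of V^T here.

```python
import numpy as np
from sklearn.decomposition import NMF  # minimizes ||X - WH||_F^2 with W, H >= 0

# Hypothetical nonnegative data matrix (rows = observations).
X_demo = np.abs(np.random.RandomState(0).randn(20, 10))

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X_demo)   # shape (20, 2): coefficients (U in slide notation)
H = model.components_             # shape (2, 10): basis, so X_demo ~ W @ H (H = V^T)

print("reconstruction error:", np.linalg.norm(X_demo - W @ H))
```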
Holistic Representation
Parts-Based Representation
Nonnegative Data: (a) document, (b) spectrogram, (c) image, (d) gene expression.
Matrix Factorization for Brain Computer Interface, 2008
Matrix Factorization for Collaborative Filtering, 2009
NMF for Clustering
Nonnegative Matrix Factorization (NMF)

Given a nonnegative matrix $X \in \mathbb{R}^{m \times n}$ (with $X_{ij} \ge 0$ for $i = 1, \dots, m$ and $j = 1, \dots, n$), NMF seeks a decomposition of $X$ of the form
$$X \approx U V^\top,$$
where $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ are restricted to be nonnegative matrices as well.

When the columns of $X$ are treated as data points in $m$-dimensional space, the columns of $U$ are considered basis vectors (or factor loadings) and each row of $V$ is an encoding that represents the extent to which each basis vector is used to reconstruct the corresponding data vector. Alternatively, when the rows of $X$ are data points in $n$-dimensional space, the columns of $V$ correspond to basis vectors and each row of $U$ represents an encoding.
NMF: Least Squares

The LS error function is given by
$$J_{LS} = \|X - U V^\top\|^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \left( X_{ij} - [U V^\top]_{ij} \right)^2.$$

Then, NMF involves the following optimization:
$$\arg\min_{U \ge 0,\, V \ge 0} \|X - U V^\top\|^2.$$

Gradient descent yields
$$U_{ij} \leftarrow U_{ij} + \eta^{U}_{ij} \left( [X V]_{ij} - [U V^\top V]_{ij} \right),$$
$$V_{ij} \leftarrow V_{ij} + \eta^{V}_{ij} \left( [X^\top U]_{ij} - [V U^\top U]_{ij} \right).$$
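A minimal NumPy sketch of this additive gradient step (not from the slides); the data, rank, fixed step size, and iteration count are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: plain additive gradient descent on J_LS (made-up data).
rng = np.random.RandomState(0)
X = np.abs(rng.randn(30, 20))          # nonnegative data, m = 30, n = 20
K = 5
U = np.abs(rng.randn(30, K))
V = np.abs(rng.randn(20, K))
eta = 1e-3                             # fixed step size (assumption)

for _ in range(200):
    U += eta * (X @ V - U @ (V.T @ V))       # gradient step in U
    V += eta * (X.T @ U - V @ (U.T @ U))     # gradient step in V

# Note: with a fixed eta these steps need not keep U, V nonnegative,
# which motivates the multiplicative updates on the next slide.
print("min entry of U:", U.min(), " objective:", np.linalg.norm(X - U @ V.T) ** 2)
```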
Multiplicative Updates

Gradient descent algorithms do not preserve nonnegativity of the elements of $U$ and $V$. Choose learning rates
$$\eta^{U}_{ij} = \frac{U_{ij}}{[U V^\top V]_{ij}}, \qquad \eta^{V}_{ij} = \frac{V_{ij}}{[V U^\top U]_{ij}}.$$

Then we have multiplicative updates:
$$U_{ij} \leftarrow U_{ij} \frac{[X V]_{ij}}{[U V^\top V]_{ij}}, \qquad V_{ij} \leftarrow V_{ij} \frac{[X^\top U]_{ij}}{[V U^\top U]_{ij}}.$$
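A compact NumPy sketch of these multiplicative updates; the data, rank, iteration count, and the small epsilon added to the denominators for numerical safety are assumptions, not part of the slides.

```python
import numpy as np

def nmf_multiplicative(X, K, n_iter=500, eps=1e-12, seed=0):
    """Least-squares NMF via the multiplicative updates above.

    X : (m, n) nonnegative array; returns U (m, K) and V (n, K) with X ~ U @ V.T.
    eps is an assumed safeguard against division by zero.
    """
    rng = np.random.RandomState(seed)
    m, n = X.shape
    U = np.abs(rng.randn(m, K))
    V = np.abs(rng.randn(n, K))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

# Example with made-up data.
X = np.abs(np.random.RandomState(1).randn(40, 25))
U, V = nmf_multiplicative(X, K=5)
print("relative error:", np.linalg.norm(X - U @ V.T) / np.linalg.norm(X))
```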
Multiplicative Updates for NMF: Alternative Derivation

Suppose that the gradient of an error function has a decomposition of the form
$$\nabla J = [\nabla J]^{+} - [\nabla J]^{-},$$
where $[\nabla J]^{+} > 0$ and $[\nabla J]^{-} > 0$. Then the multiplicative update for parameters $\Theta$ has the form
$$\Theta \leftarrow \Theta \odot \left( \frac{[\nabla J]^{-}}{[\nabla J]^{+}} \right)^{\cdot \eta},$$
where the division and exponentiation are element-wise.

Compute derivatives:
$$\nabla_U J = [\nabla_U J]^{+} - [\nabla_U J]^{-} = U V^\top V - X V,$$
$$\nabla_V J = [\nabla_V J]^{+} - [\nabla_V J]^{-} = V U^\top U - X^\top U.$$

Choosing $\eta = 1$ yields
$$U \leftarrow U \odot \frac{X V}{U V^\top V}, \qquad V \leftarrow V \odot \frac{X^\top U}{V U^\top U}.$$
Term-Document Matrix

A term-document matrix $X \in \mathbb{R}^{m \times n}$ is a collection of vector space representations of documents, where rows are terms (words) and columns are documents:
$$X_{ij} = t_{ij} \log\left( \frac{n}{\mathrm{idf}_i} \right),$$
where $t_{ij}$ is the term frequency of word $i$ in document $j$ and $\mathrm{idf}_i$ is the number of documents containing word $i$.
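A small NumPy sketch of this weighting (the toy count matrix is made up): raw term counts are scaled by $\log(n/\mathrm{idf}_i)$, with the document counts computed from the data.

```python
import numpy as np

# Toy term-frequency counts t_ij: rows = 4 terms, columns = 3 documents (made up).
T = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 1, 1],
              [0, 0, 4]], dtype=float)

n = T.shape[1]                      # number of documents
df = (T > 0).sum(axis=1)            # documents containing each term ("idf_i" on the slide)
X = T * np.log(n / df)[:, None]     # X_ij = t_ij * log(n / idf_i)
print(X)
```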
Clustering by Factorization

NMF yields a factorization $X \approx U V^\top$:
$U_{ij}$: the degree to which term $i$ belongs to cluster $j$.
$V_{ij}$: the degree to which document $i$ is associated with cluster $j$.

Document clustering is based on the column vectors of $V$: assign document $i$ to cluster $j^*$ if $j^* = \arg\max_j V_{ij}$.
Document Clustering by NMF

1. Construct a term-document matrix $X$.
2. Perform NMF on $X$, yielding $X \approx U V^\top$.
3. Normalize:
$$U_{ij} \leftarrow \frac{U_{ij}}{\sqrt{\sum_i U_{ij}^2}}, \qquad V_{ij} \leftarrow V_{ij} \sqrt{\sum_i U_{ij}^2}.$$
4. Use $V$ to determine the cluster label for each document (see the sketch below): assign document $d_i$ to cluster $j^*$ if $j^* = \arg\max_j V_{ij}$.
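A NumPy sketch of steps 3-4; the random factors at the bottom are assumptions used only to show the shapes, not real NMF output.

```python
import numpy as np

def nmf_cluster_labels(U, V):
    """Steps 3-4 above: column-normalize U, rescale V accordingly,
    then label each document by its largest encoding (arg max over clusters)."""
    norms = np.sqrt((U ** 2).sum(axis=0))        # sqrt(sum_i U_ij^2) for each column j
    U_normalized = U / norms
    V_rescaled = V * norms
    labels = V_rescaled.argmax(axis=1)           # one cluster index per document (row of V)
    return U_normalized, V_rescaled, labels

# Shape-only demonstration with random factors (assumed).
rng = np.random.RandomState(0)
U, V = np.abs(rng.randn(500, 3)), np.abs(rng.randn(100, 3))   # 500 terms, 100 docs, 3 clusters
_, _, labels = nmf_cluster_labels(U, V)
print(labels[:10])
```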
k-means: Matrix Factorization Perspective

Recall the objective function for k-means clustering:
$$J = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2 = \sum_{i=1}^{N} \sum_{j=1}^{K} V_{ij} \|x_i - \mu_j\|^2, \qquad
V_{ij} = \begin{cases} 1 & \text{if } x_i \in C_j \\ 0 & \text{otherwise.} \end{cases}$$

This can be written as
$$J = \|X - U V^\top\|^2,$$
where $X = [x_1, \dots, x_N]$, $U = [\mu_1, \dots, \mu_K]$ contains the centers (prototype vectors) in its columns, and $V$ is the indicator matrix.
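A quick numerical check of this identity, with made-up data and a fixed hard assignment (both are assumptions for the example):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(2, 10)                               # columns x_1..x_10 in R^2
assign = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])  # hard assignment to K = 3 clusters
V = np.eye(3)[assign]                              # indicator matrix, V_ij = 1 iff x_i in C_j
U = np.stack([X[:, assign == k].mean(axis=1) for k in range(3)], axis=1)  # centers as columns

J_kmeans = sum(np.sum((X[:, i] - U[:, assign[i]]) ** 2) for i in range(10))
J_frob = np.linalg.norm(X - U @ V.T) ** 2
print(J_kmeans, J_frob)                            # the two objectives coincide
```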
NMF: I-Divergence

Consider the I-divergence between the data and the model:
$$J_I = \sum_i \sum_j \left( X_{ij} \log \frac{X_{ij}}{[U V^\top]_{ij}} - X_{ij} + [U V^\top]_{ij} \right).$$

Note that the I-divergence is identical to the Kullback-Leibler divergence when $\sum_i \sum_j X_{ij} = \sum_i \sum_j [U V^\top]_{ij} = 1$.

Multiplicative updates for $U$ and $V$ are determined by minimizing $J_I$ with the nonnegativity constraints $U \ge 0$, $V \ge 0$ satisfied:
$$U_{ij} \leftarrow U_{ij} \frac{\sum_k (X_{ik} / [U V^\top]_{ik}) V_{kj}}{\sum_k V_{kj}}, \qquad
V_{ij} \leftarrow V_{ij} \frac{\sum_k (X_{ki} / [U V^\top]_{ki}) U_{kj}}{\sum_k U_{kj}}.$$

Equivalence between NMF and PLSA was shown by Gaussier and Goutte (SIGIR 2005).
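A NumPy sketch of these I-divergence updates; the data, rank, iteration count, and the small epsilon are assumptions.

```python
import numpy as np

def nmf_i_divergence(X, K, n_iter=500, eps=1e-12, seed=0):
    """Multiplicative updates minimizing the I-divergence J_I above; X ~ U @ V.T."""
    rng = np.random.RandomState(seed)
    m, n = X.shape
    U = np.abs(rng.randn(m, K))
    V = np.abs(rng.randn(n, K))
    for _ in range(n_iter):
        R = X / (U @ V.T + eps)                 # ratio X_ij / [UV^T]_ij
        U *= (R @ V) / (V.sum(axis=0) + eps)    # numerator sums over k, denominator is sum_k V_kj
        R = X / (U @ V.T + eps)
        V *= (R.T @ U) / (U.sum(axis=0) + eps)
    return U, V

X = np.abs(np.random.RandomState(2).randn(30, 20))
U, V = nmf_i_divergence(X, K=4)
print("reconstruction error:", np.linalg.norm(X - U @ V.T))
```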
Weighted NMF

In practice, the data matrix is often incomplete, i.e., some of its entries are missing or unobserved. In order to handle missing entries in the decomposition, we consider an objective function that is a sum of weighted residuals:
$$J = \sum_{i,j} W_{ij} \left( X_{ij} - [U V^\top]_{ij} \right)^2 = \| W \odot (X - U V^\top) \|^2,$$
where $W_{ij}$ are binary weights, i.e.,
$$W_{ij} = \begin{cases} 1 & \text{if } X_{ij} \text{ is observed} \\ 0 & \text{if } X_{ij} \text{ is unobserved.} \end{cases}$$

Multiplicative updates for weighted NMF (WNMF) are given by
$$U \leftarrow U \odot \frac{[W \odot X] V}{[W \odot U V^\top] V}, \qquad
V \leftarrow V \odot \frac{[W \odot X]^\top U}{[W \odot U V^\top]^\top U}.$$
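A NumPy sketch of these weighted updates, with a made-up ratings-style matrix and a random observation mask W (all assumptions made for the example):

```python
import numpy as np

def weighted_nmf(X, W, K, n_iter=1000, eps=1e-12, seed=0):
    """Weighted NMF: multiplicative updates for J = ||W * (X - U V^T)||^2.

    W is a binary mask (1 = observed, 0 = missing); unobserved entries of X are ignored.
    """
    rng = np.random.RandomState(seed)
    m, n = X.shape
    U = np.abs(rng.randn(m, K))
    V = np.abs(rng.randn(n, K))
    for _ in range(n_iter):
        U *= ((W * X) @ V) / ((W * (U @ V.T)) @ V + eps)
        V *= ((W * X).T @ U) / ((W * (U @ V.T)).T @ U + eps)
    return U, V

rng = np.random.RandomState(3)
X = np.abs(rng.randn(6, 6))                  # toy "ratings" (made up)
W = (rng.rand(6, 6) < 0.7).astype(float)     # roughly 70% of entries observed
U, V = weighted_nmf(X, W, K=2)
X_hat = U @ V.T                              # entries where W == 0 act as predictions
```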
Weighted NMF for Collaborative Filtering: a user-item rating matrix (users by items) whose entries are positive, negative, or unobserved.
Weighted NMF for Collaborative Filtering: the user-item rating matrix is factored into a user factor matrix and an item factor matrix over latent movie factors (e.g., comedy, action, horror, thriller); the coefficients for a user combine with the item factors to predict ratings,
$$X_{ij} \approx u_i^\top v_j.$$
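A tiny self-contained illustration of the prediction rule $X_{ij} \approx u_i^\top v_j$; the two latent factors, the user coefficients, and the item factors are all made-up numbers mirroring the figure's narrative.

```python
import numpy as np

# Hypothetical 2 latent factors, e.g. "comedy" and "action" weights.
u_user = np.array([0.9, 0.1])         # assumed coefficients for one user
V_items = np.array([[0.8, 0.2],       # item 1: mostly comedy
                    [0.1, 0.9],       # item 2: mostly action
                    [0.5, 0.5]])      # item 3: mixed
predicted = V_items @ u_user          # predicted rating of each item for this user
print(predicted)
```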