Non-Negative Matrix Factorization

Size: px

Start display at page:

Download "Non-Negative Matrix Factorization"

Leslie Andrews
5 years ago
Views:

1 Chapter 3 Non-Negative Matrix Factorization Part 2: Variations & applications

2 Geometry of NMF 2

3 Geometry of NMF 1 NMF factors.9.8 Data points.7.6 Convex cone.5.4 Projections

4 Sparsity in NMF Skillicorn chapter 8; Berry et al. (27) 4

5 Sparsity: desiderata Sparse factor matrices are often preferred Simpler to interpret (zeroes can be ignored) Agrees with our intuition on parts of whole Faster computations, less space NMF is sometimes claimed to automatically yield sparse factors In practice, this is often not the case 5

6 Enforcing sparsity A common solution: change the target function to minimize Parameters Regularizers 1 2 ka WHk2 F + density(w)+ density(h) How to define sparsity? Naïve: density(w) = 1 nnz(w)/size(w) Non-convex, non-nice to optimize directly 6

7 Frobenius regularizer A.k.a. Tikhonov regularizer Minimize 1 2 Ä ka WHk 2 F + kwk2 F + khk2 F Doesn t help much with sparsity Used to impose smoothness 7

8 ALS-NMF with Frobenius regularizers The update rules for ALS with Frobenius regularizer are W H (AH T )(HH T + ) 1 + (W T W + ) 1 (W T A) + Uses the fact that X + = (X T X) 1 X T if X has full column rank (homework) 8

9 L 1 (Lasso) regularizer Using L 1 based instead of L 2 based regularizer helps obtaining sparse solutions 1 2 ka WHk2 F + P,j W j + P,j H j Larger values of α and β yield sparse solutions (e.g. α, β [.1,.5]) Still no guarantees on sparsity 9

10 ALS-NMF with Lasso The update rules are W (AH T 1 n k )(HH T ) 1 + H (W T W) 1 (W T A 1 k m ) + 1 n k is n-by-k matrix of all 1s Requires columns of W to be normalized to unit L 1 after each update W j W j / P W j 1

11 Hoyer s sparse NMF Hoyer (24) considers the following sparsity function for n-dimensional vector x sp rsity( )= p n k k1 / k k 2 p n 1 sparsity(x) = 1 iff nnz(x) = 1 sparsity(x) = iff x i = x j for all i, j Hoyer 24 11

12 Hoyer s sparse NMF Hoyer s algorithm obtains an NMF using factor matrices with user-defined level of sparsity After every update, the columns of W and rows of H are updated s.t. their L 2 is constant and L 1 is set to desired level of sparsity 12

13 Getting the desired level of sparsity Set s = x + (L 1 i x i )/dim(x) and Z = {} Fix L 1 repeat Set m i = L 1 /(dim(x) size(z)) for i Z and m i = o/w Set s = m + α(s m) where α is s.t. s 2 = L 2 Fix L 2 If s, return s Are we done? Set Z = Z {i : s i < } and s i = for all i Z Set c = ( i s i L 1 )/(dim(x) size(z)) Set s i = s i c for all i Z Truncate negative values and fix L 1 again 13

14 Other forms of NMF 14

15 Normalized NMF Columns of W (and/or rows of H) should be normalized to sum to unity Stability of the solution and interpretability If only W (or H) is normalized, the weights can be pushed to the other matrix To normalize both, use k-by-k diagonal Σ s.t. σ ii = W i 1 (H T ) i 1 Normalized NMF: WΣH 15

16 Semi-orthogonal NMF In semi-orthogonal NMF we restrict H to roworthogonal: minimize A WH F s.t. HH T = I and W and H are nonnegative Solutions are unique (up to permutations) The problem is equivalent to k-means In the sense that the optimal solutions have the same value Ding et al

17 Geometry of semiorthogonal NMF The orthogonal factors span a cone 17

18 NMF and clustering In k-means, we minimize P k j=1 P 2C j j 2 2 = P k j=1 P n =1 G j j 2 2 μ j is the centroid of the jth cluster C j G is n-by-k cluster assignment matrix G ĳ = 1 if i C j and otherwise Equivalently: ka GMk 2 F Type of NMF if A is nonnegative! M is k-by-m containing the centroids as its rows 18

19 Orthogonal tri-factor NMF We can find NMF where both W and H are (column/row) orthogonal Often too restrictive; cannot handle different scales In orthogonal nonnegative tri-factorization we add third non-negative matrix S: minimize A WSH F s.t. W T W = I, HH T = I, and all matrices are non-negative 19

20 More on tri-factorization S does not have to be square W is n-by-k, S is k-by-l, H is l-by-m Different number of row and column factors If orthogonal NMF clusters columns of A, this bi-clusters rows and columns simultaneously 2

21 Computing semiorthogonal NMF If H has to be orthogonal, either update as usual and set after every iteration H [HH T ] 1/2 H ; or update H j H j v t (W T A) j (W T AH T H) j W is updated as usual (w/o constraints) If W needs to be orthogonal, the update rules are changed accordingly 21

22 Computing the orthogonal tri-factorization The update rules for orthogonal trifactorization are H j H j v ut (WS) T A j (WS) T AH T H j W j W j v t (A(SH) T ) j (WW T A(SH) T ) j S j S j v t (W T AH T ) j (W T WSHH T ) j 22

23 Other optimization functions 23

24 Kullback Leibler divergence The Kullback Leibler divergence of Q from P, D KL (P Q), measures the expected number of extra bits required to code samples from P when using a code optimized for Q D KL (PkQ) = P P( ) ln P( ) Q( ) P and Q are probability distributions Non-negative and non-symmetric 24

25 Generalized KL-divergence and matrix factorizations The standard KL-divergence requires P and Q be probability distributions (e.g. i P(i) = 1) The generalized KL-divergence (or I-divergence) removes this requirement: P Ä D GKL (PkQ) = P( ) ln P( ) ä P( )+Q( ) Q( ) In NMF, P = A and P Q = WH : D GKL (AkWH)=,j A j ln A j (WH) j A j +(WH) j 25

26 KL v.s. GKL in NMF KL requires A to be considered as a probability distribution i,j A i,j = 1 (or row/column normalization) WH should be normalized the same way GKL only requires non-negativity But inherently assumes integer data Looses a bit of the probability interpretation 26

27 NMF for GKL The update rules for multiplicative GKL NMF algorithm are H kj H kj P n =1 W k(a j /(WH) j ) P n =1 W k W k W k P m j=1 (A j/(wh) j )H kj P m j=1 H kj The columns of W are normalized to sum to unity after every iteration 27

28 Applications of NMF 28

29 Component interpretation NMF s main sales argument is the component interpretation A w 1 h 1 + w 2 h w k h k Each rank-1 component has parts-ofwhole interpretation Nothing is ever removed 29

30 Hand-written digits Cichocki et al. 29 3

31 Factor interpretation NMF can be seen as a nonnegative mixture of nonnegative factors The factors can capture underlying nonnegative phenomena The nonnegative coefficients potentially help with the interpretation 31

32 Geometric interpretation NMF factors are not (generally) orthogonal They do not create a coordinate system Span a convex cone Projection to the space spanned by the factors can yield odd results Points that are far away in the original space get close in the cone and vice versa 32

33 Separation of various spectra In spectroscopy imaging we often have multiple observations of signals over some spectrum Observations-by-spectrum non-negative matrix The signals constitute an additive mixture of pure signals 33

34 Raman spectroscopy Ground truth Observation Estimated Estimated Cichocki et al

35 Text mining and plsa Consider a document term matrix A a ĳ is the number of times term j appears in document i Can we find these topics automatically? A = Environmet air water pollution democrat republican doc doc doc 3 B 1 11 doc 4@ 8 5 doc Politics 1 C A 35

36 The idea Normalized A that sums to 1 can be considered as a probability distribution P(d, w) = A d,w P(d, )= P k P(k)P(d k)p( k) Model with topics: P( d) = P z P( z)p(z d) 36

37 Generative process Pick a document according to P(d) P [ w d ] P [ z d ] P [ w z ] z economic Select a topic according to P(z d) imports Select a word according to TRADE embargo P(w z) documents d latent concepts z (aspects) terms w (words) 37

38 plsa as NMF In the NMF version of the probabilistic latent semantic analysis (plsa) we are given documents-by-terms matrix A and rank r We have to find n-by-r non-negative W (columns sum to unity) r-by-r diagonal non-negative Σ r-by-m non-negative H (rows sum to unity) Minimizing D GKL (A WΣH) 38

39 Geometry of plsa T. Hofmann Unsupervised learning by probabilistic latent semantic analysis

plsa example air wat pol dem rep air wat pol dem rep.4.3.12.39.

40 plsa example air wat pol dem rep air wat pol dem rep D L R A W Σ H Here, A is normalized How strong the topic is Overall frequency How strong the word is in the in the document? topic? 4

41 NMF algorithm for plsa Compute W and H using GKL NMF algorithm Normalize columns (rows) of W (H) and put the multipliers to Σ Normalize Σ to sum to unity Real implementations would require tempering to avoid over-fitting 41

42 plsa example Source: Thomas Hofmann, Tutorial at ADFOCS 24 42

43 plsa applications Topic modeling Clustering documents and terms Information retrieval Similar to LSA/LSI Generalizes better than LSA But outperformed by Latent Dirichlet Allocation (LDA) 43

44 NMF summary Parts-of-whole interpretation Often easier/more appropriate than SVD Hard to compute and non-unique Local updates (multiplicative, gradient, ALS) Many applications and specific variations Still under very active research 44

45 Literature Berry, Browne, Langville, Pauca & Plemmons (27): Algorithms and applications for approximate nonnegative matrix factorization. Comput. Stat. Data Anal. 52, pp Hoyer (24): Non-negative matrix factorizations with sparseness constraints. J. Mach. Learn. Res. 5, pp Ding, Li, Peng & Park (26): Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD 6, pp Cichocki, Zdunek, Phan & Amari (29): Nonnegative matrix and tensor factorizations. John Wiley & Sons Chichester, UK 45

Data Mining and Matrices

Data Mining and Matrices 6 Non-Negative Matrix Factorization Rainer Gemulla, Pauli Miettinen May 23, 23 Non-Negative Datasets Some datasets are intrinsically non-negative: Counters (e.g., no. occurrences