Non-Negative Matrix Factorization

Size: px

Start display at page:

Download "Non-Negative Matrix Factorization"

Maria Atkins
5 years ago
Views:

1 Chapter 3 Non-Negative Matrix Factorization Part : Introduction & computation

2 Motivating NMF Skillicorn chapter 8; Berry et al. (27) DMM, summer 27 2

3 Reminder A T U Σ V T T Σ, V U 2 Σ 2,2 V = The components of the SVD are not very interpretable DMM, summer 27 3

4 DMM, summer 27 Non-negative factors 4 A = W H + W H W 2 H 2 Forcing the factors to be non-negative can, and often will, improve the interpretability of the factorization

5 The definition DMM, summer 27 5

6 Definition of NMF Given a non-negative matrix A 2 R n m and an integer k, find non-negative matrices W 2 R n k + and is minimized. H 2 R k m + 2 ka WHk2 F such that + DMM, summer 27 6

7 Non-negative rank The non-negative rank of matrix A, rank + (A), is the size of the smallest exact non-negative factorization A = WH rank(a) rank + (A) min{n, m} DMM, summer 27 7

8 Some comments NMF is not unique If X is nonnegative and with nonnegative inverse, then WXX H is equivalent valid decomposition Computing NMF (and non-negative rank) is NP-hard This was open until 28 DMM, summer 27 8

9 DMM, summer 27 Example of nonuniqueness = = = + + +

10 NMF has no order The factors in NMF have no inherent order The first component is no more important than the second is no more important NMF is not hierarchical The factors of rank-(k+) decomposition can be completely different to those of rank-k decomposition DMM, summer 27

11 Example } Rank-.5 } = Rank-2 DMM, summer 27

12 Interpreting NMF DMM, summer 27 2

13 Parts-of-whole NMF works over anti-negative semiring There is no subtraction Each rank- component w i h i explains a part of the whole This can yield to sparse factors DMM, summer 27 3

14 NMF example: faces PCA/SVD Row of original = Row of reconstruction DMM, summer 27 4

15 NMF example: faces NMF Row of original = DMM, summer 27 5

16 NMF example: digits NMF factors correspond to patterns and background A H DMM, summer 27 6

exploration Neuroscience Image understanding Air

17 Some NMF applications Text mining (more later) Bioinformatics Microarray analysis Mineral exploration Neuroscience Image understanding Air pollution research Weather forecasting DMM, summer 27 7

18 Computing NMF DMM, summer 27 8

19 General idea NMF is not convex, but it is biconvex If W is fixed, 2 ka WHk2 F is convex Start from random W and repeat Fix W and update H Fix H and update W until the error doesn t decrease anymore DMM, summer 27 9

20 Notes on the general idea How to create a good random starting point? Is the algorithm robust to initial solutions? How to update W and H? When (and how quickly) has the process converged? Fixed number of iterations? Minimum change in error? DMM, summer 27 2

21 Alternating least squares Without the non-negativity constraint, this is the standard least-squares: w i argmin w wh a i F we can update W AH + and H W + A X + is the pseudo-inverse of X which is LS-optimal The method is called alternating least-squares (ALS) This can introduce negative values DMM, summer 27 2

22 Enforcing non-negativity in ALS Least-squares optimal update of W (or H) with non-negativity constraints is convex optimization problem In theory in P, in practice slow, but subject to much research Simple approach: truncate all negative values to Update W [AH + ] + DMM, summer 27 22

23 The NMF-ALS algorithm. W random(n, k) 2. repeat 2.. H [W + A] W [AH + ] + 3. until convergence DMM, summer 27 23

24 When has there been enough convergence? When the error doesn t change too much A W (k) H (k) F A W (k+) H (k+) F ε After some number of maximum iterations has been achieved Usually, whichever of these two happens first DMM, summer 27 24

25 Gradient descent We can compute the gradient of the error function (with one factor matrix fixed) ƒ (H) = 2 ka WHk2 F = 2 P k Wh k 2 F H j ƒ (H) =(W T A) j (W T WH) j We can move slightly towards the negative gradient How much is the step size and deciding it is a big problem DMM, summer 27 25

26 The NMF gradient descent algorithm. W random(n, k) 2. H random(k, m) 3. repeat H H H ƒ H W W W ƒ W 4. until convergence DMM, summer 27 26

27 Oblique Projected Landweber (OPL) for NMF OPL provides one way to select the step size With H H H ƒ H updates, the convergence radius is 2/λ max (W T W), where λ max is the largest eigenvalue λ max max(rowsums(w T W)) We can set the learning rates to /rowsums(w T W) for a good convergence DMM, summer 27 27

28 The OPL algorithm for updating H. η diag( / rowsums(w T W)) 2. repeat 2.. G W T WH W T A 2.2. H [H ηg] + 3. until a stopping criterion is met (small) number of iterations OR H doesn t change much DMM, summer 27 28

29 Interior Point Gradient (IPG) for NMF In OPL, we might (temporarily) have negative values in W or H In Interior Point Gradient (IPG) algorithm, we set the step sizes so that we never update to negative DMM, summer 27 29

30 The IPG algorithm for updating H. repeat until a stopping criterion is met.. G W T (WH A).2. D H / (W T WH) Gradient Scaling / and * are element-wise.3. P D * G.4. η* vec(p), vec(g) / vec(wp), vec(wp).5. η max{η : H + ηp } Update direction Positive step size Best step size.6. η min{τη, η*}.7. H H + ηp Update DMM, summer 27 3

31 Multiplicative updates The KKT conditions for H in NMF are H ; H A WH 2 /2 H * H A WH 2 /2 = * is element-wise product Substituting H A WH 2 /2 = W T WH W T A one gets H * (W T WH) = H * (W T A) This gives us an update rule for H DMM, summer 27 3

32 The NMF multiplicative updates algorithm. W random(n, k) 2. H random(k, m) 3. repeat 3.. h j h j (W T A) j (W T WH) j j j (AH T ) j (WHH T ) j + 4. until convergence DMM, summer 27 32

33 Notes on multiplicative updates Proposed by Lee & Seung (Nature, 999) Equivalent to gradient descent with dynamic step size Zeros in initial solutions will never turn into non-zeros; non-zeros will never turn into zeros Problems if the correct solution contains zeros DMM, summer 27 33

34 Summary NMF can provide factorizations that are more interpretable than those given by SVD Harder to compute than SVD, but many different approaches Or are they so different In two weeks: Applications & alternations of NMF Stay tuned! DMM, summer 27 34

Data Mining and Matrices

Data Mining and Matrices 6 Non-Negative Matrix Factorization Rainer Gemulla, Pauli Miettinen May 23, 23 Non-Negative Datasets Some datasets are intrinsically non-negative: Counters (e.g., no. occurrences