From Non-Negative Matrix Factorization to Deep Learning
Intuitions and some Math too!
luissarmento@gmail.com
https://www.linkedin.com/in/luissarmento/
October 18, 2017
Introduction
Disclaimer
- I really just wanted to talk about NMF
- I added Deep Learning to the title just to attract people
- Conclusion: Deep Learning actually works!
Relational Data
We are given data about two sets of entities:
- $X = \{x_1, x_2, x_3, \dots, x_m\}$
- $Y = \{y_1, y_2, y_3, \dots, y_n\}$
which may co-occur under some observable event.
We are also given the number of times $c$ (count), $c \in \mathbb{N}_0^{+}$: 0, 1, 2, ..., that $x_i$ and $y_j$ co-occurred under such condition: $(x_i, y_j, c_{ij})$
Relational Data - $(x_i, y_j, c_{ij})$
Typical cases:
- User $x_i$ watched Movie $y_j$ (c = 0/1)
- Customer $x_i$ purchased Product $y_j$ n times
- Web User $x_i$ clicked on Ad/Page $y_j$ (c = 0/1)
- Web User $x_i$ searched for Query $y_j$ c times
But there are also cases where X and Y are the same type (or even the same set) of entities:
- word $x_i$ occurs with word $y_j$ c times
- protein $x_i$ interacts with protein $y_j$ (c = 0/1)
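Relational Data - Representation
We can consider $(x_i, y_j, c_{ij})$ as a list of tuples.
Or, given an appropriate indexing of the entities:
- X: $x_i$ with $i = 1, 2, 3, \dots, m$
- Y: $y_j$ with $j = 1, 2, 3, \dots, n$
we can represent all the observations as a matrix C, with one row per entity in X and one column per entity in Y:
$$C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mn} \end{bmatrix}$$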
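As a minimal sketch of this representation step, the tuples can be accumulated into a dense count matrix. The function name `build_count_matrix`, the 0-based indices, and the sample observations are assumptions made for illustration; they are not part of the original slides.

```python
def build_count_matrix(observations, m, n):
    """Return an m x n count matrix C (list of lists) from (i, j, c) tuples."""
    C = [[0] * n for _ in range(m)]
    for i, j, c in observations:
        C[i][j] += c  # repeated (i, j) pairs accumulate their counts
    return C

# Hypothetical observations: (user index, movie index, count), 0-based.
observations = [(0, 1, 1), (1, 0, 2), (2, 3, 1)]
C = build_count_matrix(observations, m=3, n=4)

# Most cells stay at zero, which is the sparsity the slides point out next.
zeros = sum(row.count(0) for row in C)
```

A dense list of lists is only practical for small m and n; for matrices of the sizes discussed in the talk, a sparse format would normally be used instead.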
Relational Data - Matrix Representation
A few (practical) observations about C:
- it contains count data (non-negative!)
- it probably contains many zeros: most possible events do not happen (count = 0), so it is not a very efficient representation
- it is probably a very large matrix:
  - hundreds × hundreds (e.g. countries × countries)
  - millions × millions (e.g. Amazon users × products)
Interestingly, under some conditions, C can still be used to make recommendations / predictions.
"Users who did this also did that"
Classic way of making predictions using this sort of data.
Each user is represented by a row vector from the matrix C:
- $x_1 = [c_{11}, c_{12}, \dots, c_{1n}]$
- $x_i = [c_{i1}, c_{i2}, \dots, c_{in}]$
- $x_m = [c_{m1}, c_{m2}, \dots, c_{mn}]$
We can perform an all-vs-all comparison among the $x_i$ vectors using some convenient vector similarity or distance measure (e.g. cosine).
For each user $x_i$ we build the list of the top k most similar users: $x_i \rightarrow S_i = [x_a, x_b, x_c, \dots]$
Recommendation strategy: for each user $x_i$, recommend items (e.g. movies, products) that the most similar users $x_j \in S_i$ have already interacted with, but which $x_i$ has not yet interacted with.
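The strategy above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `recommend`, the toy matrix, and the choice of cosine similarity with a plain top-k cut are hypothetical, not taken from the slides.

```python
import numpy as np

def recommend(C, user, k=2):
    """Return (top-k neighbours of `user`, items they touched that `user` has not)."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    norms[norms == 0] = 1.0              # avoid dividing by zero for empty rows
    unit = C / norms                     # rows scaled to unit length
    sims = unit @ unit[user]             # cosine similarity to every user
    sims[user] = -np.inf                 # never pick the user as their own neighbour
    neighbours = np.argsort(-sims)[:k]   # indices of the k most similar users
    seen_by_neighbours = C[neighbours].sum(axis=0) > 0
    unseen_by_user = C[user] == 0
    return neighbours, np.flatnonzero(seen_by_neighbours & unseen_by_user)

# Hypothetical 4 users x 4 movies count matrix.
C = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0]], dtype=float)
neighbours, items = recommend(C, user=0, k=1)
```

Here user 0 is closest to user 1, who has also watched movie 2, so movie 2 is the only recommendation.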
"Users who did this also did that"
Easy to understand, but it breaks on high-dimensional data.
If the space of items is very large, then:
- the probability of overlap between any two vectors is very low
- the probability of the regions of overlap being mostly the same is high
- spurious/meaningless co-occurrences tend to be over-emphasized
Think of the Amazon catalog (millions of products):
- lots of people buy the best-selling dish soap: it's not a meaningful overlap
- the fact that two people bought the same rare French movie DVD does not mean that they should use the same shampoo
A very shallow story
One important "semantic" observation: matrix C tells a very shallow story.
- C merely stores observations: it stores a flat relation (e.g. user $x_{147}$ watched movie $y_{1023}$)
- C tells nothing about the process underlying the observations
Therefore, C can't:
- explain the data in a compact way
- model the underlying data generation process
But we know that there is a process: there is a chain of events that leads to the observations.
We know that
Movies can be categorized in certain genres $g_1, g_2, \dots, g_k$:
- comedy, action, horror, sci-fi, etc.
- a movie can belong to several genres
- we can assume there is a degree of belongingness
- there are far fewer genres (10s) than movies (1000s)
Users have different preferences for / affinities to those genres.
Users typically choose a movie of a genre they like, even if within the genre their choice is rather random.
Any resemblance with reality is pure ML modeling
So, here's a slightly more reasonable story:
1. user $x_i$ has a certain affinity to each movie genre $g_j$
2. user $x_i$ decides to watch a movie of a specific genre $g_k$ (depending on the affinity he/she has to each genre)
3. from all movies $y_i$ of genre $g_k$, the user picks one (with a certain probability depending on the affinity of that movie to that genre)
Instead of the flat "user x watched movie y" relation, we now have two relations:
- user - genres (preferences / affinities)
- movies - genres (belongingness)
And the story goes like: observations: user → genres → movies
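Oh wait!! Did you say relation?
But relations can be stored in matrices! So, in this model we have two additional matrices:
$$U = \begin{bmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{m1} & \cdots & u_{mk} \end{bmatrix} \; (m \text{ users} \times k \text{ genres}) \qquad M = \begin{bmatrix} m_{11} & \cdots & m_{1k} \\ m_{21} & \cdots & m_{2k} \\ \vdots & & \vdots \\ m_{n1} & \cdots & m_{nk} \end{bmatrix} \; (n \text{ movies} \times k \text{ genres})$$
Both U and M are very thin and tall matrices. If k is the number of genres:
- $k \ll m$ (nr users)
- $k \ll n$ (nr movies)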
Putting all the pieces together
So, our story was: observations: user → genres → movies
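Putting all the pieces together
So, our story was: observations: (user → genres) → movies
The (user → genres) part is matrix U (m users × k genres):
$$U = \begin{bmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{m1} & \cdots & u_{mk} \end{bmatrix}$$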
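Putting all the pieces together
So, our story was: observations: user → (genres → movies)
The (genres → movies) part is not matrix M (n movies × k genres). It is its transpose, $M^T$ (k genres × n movies):
$$M = \begin{bmatrix} m_{11} & \cdots & m_{1k} \\ m_{21} & \cdots & m_{2k} \\ \vdots & & \vdots \\ m_{n1} & \cdots & m_{nk} \end{bmatrix} \qquad M^T = \begin{bmatrix} m_{11} & m_{21} & \cdots & m_{n1} \\ \vdots & \vdots & & \vdots \\ m_{1k} & m_{2k} & \cdots & m_{nk} \end{bmatrix}$$
$M^T$ is a short and wide matrix (many columns!).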
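Putting all the pieces together
Then, if we assume linearity, our story, observations: (user → (genres) → movies), can be mapped to $C \approx U M^T$:
$$\underbrace{\begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mn} \end{bmatrix}}_{m \text{ users} \times n \text{ movies}} \approx \underbrace{\begin{bmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{m1} & \cdots & u_{mk} \end{bmatrix}}_{m \text{ users} \times k \text{ genres}} \underbrace{\begin{bmatrix} m_{11} & m_{21} & \cdots & m_{n1} \\ \vdots & \vdots & & \vdots \\ m_{1k} & m_{2k} & \cdots & m_{nk} \end{bmatrix}}_{k \text{ genres} \times n \text{ movies}}$$
In terms of dimensions: $(m \times n) \approx (m \times k)(k \times n)$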
Let's dive deeper on $C \approx U M^T$
- $c_{12}$: how user 1 relates to movie 2 (count)
- $u_{1,[1..k]}$: vector describing the affinity of user 1 to the k genres (row 1 of U)
- $m_{2,[1..k]}$: vector describing the affinity of movie 2 to the k genres (row 2 of M, i.e. column 2 of $M^T$)
$$c_{12} \approx u_{1,[1..k]} \cdot m_{2,[1..k]} = \sum_{g=1}^{k} u_{1g}\, m_{2g}$$
- linear model: looks great!
- the higher the overlap between genres, the higher the counts
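A small numeric sketch of this dot product, assuming made-up affinity values for 2 users, 3 movies and k = 2 genres (all numbers are hypothetical):

```python
import numpy as np

# Hypothetical affinities: rows are users (U) or movies (M), columns are genres.
U = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # 2 users x 2 genres
M = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])            # 3 movies x 2 genres

C_hat = U @ M.T                       # (2 x 2)(2 x 3) -> 2 users x 3 movies
c_12 = U[0] @ M[1]                    # user 1 and movie 2 (0-based rows 0 and 1)
```

Each entry of `C_hat` is the dot product of one user row and one movie row, so `C_hat[0, 1]` and `c_12` are the same number.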
Here is something interesting
U is (m users × k genres) and $M^T$ is (k genres × n movies).
Nr of parameters in:
- U & $M^T$: $m \cdot k + k \cdot n = (m + n) \cdot k$
- C: $m \cdot n$
In practice $k \ll m$ and $k \ll n$
So: $(m + n) \cdot k \ll m \cdot n$
E.g.: 100k users, 10k movies, 200 genres
- C: 100k × 10k = 1000M cells / params
- U & $M^T$: (100 + 10)k × 200 = 22M
- roughly 45× compression (about 50×)
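The arithmetic of this example can be checked directly. The variable names are illustrative only:

```python
m, n, k = 100_000, 10_000, 200   # users, movies, genres from the example

cells_in_C = m * n               # one cell per (user, movie) pair
params_in_UM = (m + n) * k       # m*k entries in U plus k*n entries in M^T
compression = cells_in_C / params_in_UM
```

The ratio comes out at about 45, which is the "about 50×" figure quoted on the slide.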
Less is More
We are explaining $m \cdot n$ observations using only $(m + n) \cdot k$ params.
This is good: we have a compact / lower-dimensional model!
- We are concentrating the information in fewer parameters
- Our model for recommendation is now based on a deeper story!
- The link between users and movies is made via genres
This helps solve the problems related to sparsity! We can try to find the most similar users:
- but instead of using sparse vectors from C (n dims): 0 1 0 1 vs 0 0 1 0
- we use the lower-dimensional dense vectors from U (k dims): 0.2 0.71 vs 0.5 0.02
Less is More (?)
Except we have a little problem!
Less is More (?)
In fact, we have two little problems
Problem #1: List of Genres
For most things we may not have a list of existing "genres".
Maybe for movies we have some sets, but even in that case:
- it is not obvious how to define what those are (committee)
- there are new genres coming in (e.g. "zombie")
- definitions may change over time (concept drift, specialization)
It is not trivial to assign items (users and movies) to "genres":
- typically categories are subject to multiple interpretations
- we may not even have the information to make the assignment
How would we know in advance the profile of each user?
Problem #2: How many Genres are enough?
This is a more technical issue, but the intuition is easy.
Extreme case: we have only 1 defined movie genre:
- movies are either of that genre or not ("undefinable" movies)
- users have affinity to the genre or not ("undefinable" users)
Then, we can only explain simple observation patterns:
$$\begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}$$
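Adding Genres
We can add genres (and continuous levels of belongingness).
Adding just one genre provides more explanatory power:
$$\begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 1 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 2 & 1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 2 & 1 & 2 \end{bmatrix}$$
The observation matrix now has a more complex pattern.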
Adding Genres
We can add even more genres, and this will allow us to explain even more complex patterns.
In fact, if we add enough genres, we are guaranteed to find a factorization that explains the observation matrix.
For example, assume the observation matrix is:
$$\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 2 & 1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 2 & 1 & 2 \end{bmatrix}$$
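Here's an obvious factorization
k = 4:
$$\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 2 & 1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 2 & 1 & 2 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}}_{\text{works as a row selector}} \underbrace{\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 2 & 1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 2 & 1 & 2 \end{bmatrix}}_{\text{replicates the observation matrix}}$$
This is not an interesting factorization: we are using it just to illustrate an intuition.
But notice that two of the rows are equal.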
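So, we don't actually need so many genres
k = 3:
$$\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 2 & 1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 2 & 1 & 2 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}}_{\text{works as a row selector}} \underbrace{\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 2 & 1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}}_{\text{distinct rows of the observation matrix}}$$
- we only need to choose among 3 distinct rows
- row-oriented perspective: the matrix on the right is a basis of rows that can be composed / selected to build each row of the data matrix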
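So, we don't actually need so many genres (II)
k = 3:
$$\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 2 & 1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 2 & 1 & 2 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}}_{\text{basis of columns to be composed}} \underbrace{\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 2 & 1 & 2 \\ 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}}_{\text{coefficients to compose output column vectors}}$$
- we are now composing 3 distinct columns
- column-oriented perspective: the matrix on the left is a basis of columns that can be composed / selected to build each column of the data matrix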
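The k = 3 factorization and both readings of it can be checked numerically. This is just a verification sketch; the names `V`, `W` and `H` anticipate the notation introduced later in the talk.

```python
import numpy as np

V = np.array([[0, 0, 0, 0, 0, 0],
              [0, 1, 0, 2, 1, 2],
              [0, 1, 0, 1, 0, 1],
              [0, 1, 0, 2, 1, 2]])    # observation matrix (4 x 6)
W = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])             # left factor (4 x 3)
H = np.array([[0, 0, 0, 0, 0, 0],
              [0, 1, 0, 2, 1, 2],
              [0, 1, 0, 1, 0, 1]])    # right factor (3 x 6)

exact = np.array_equal(W @ H, V)      # the product reproduces V exactly

row_3 = W[3] @ H       # row-oriented: coefficients in W select/mix rows of H
col_3 = W @ H[:, 3]    # column-oriented: coefficients in H mix columns of W
```

`row_3` rebuilds the last row of V from the rows of H, while `col_3` rebuilds the fourth column of V from the columns of W.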
More generally
In fact, if we set the nr of genres k to:
$$k \geq \min(m, n)$$
with:
- m (number of rows / users)
- n (number of columns / movies)
we are guaranteed to find a factorization that fully explains the observations, even if that factorization is pretty boring.
And?
But the really interesting cases happen when:
$$k \ll \min(m, n)$$
In such cases, we may:
- not be able to fully recover the observation matrix
- but still be able to recover a close enough one
In this case, we get a compact representation of users and movies that does not suffer from the sparsity problems.
So, how do we make this happen?
- if we don't have a list of genres
- if we don't even know how many genres we need
- if we just have a matrix with observations
How do we make this happen???
The Math!!
Intro and Terminology
I am going to jump directly into the math.
- We already went through the intuitions
I am going to follow the notation used in Wikipedia [1].
- This will make any further study using Wikipedia easier
- And makes my life much easier too!
So:
- observation matrix C → V
- user matrix U → W
- movie matrix M → H (strictly, H is k × n, so it plays the role of $M^T$)
Let's forget the user / movies scenario to think abstractly.
- But still think about some intuitions
[1] https://en.wikipedia.org/wiki/non-negative_matrix_factorization
Some notation and conventions
Historically, NMF formulations follow the column-oriented perspective:
- V: data matrix (m × n)
- W: dictionary / features / basis matrix (m × k)
- H: activation / coefficients matrix (k × n)
We build each column of V by using the k coefficients in H to combine the k available columns of W.
And we want, for a given k (number of categories):
$$V \approx WH$$
$V \approx WH$
For a given k, there is a residual:
$$R_k = V - WH$$
which is a matrix of the same dimensions as V.
We wish the residual to be as small as possible under a given matrix norm:
$$\min_{W,H} \|R_k\| = \min_{W,H} \|V - WH\| \quad \text{subject to } W \geq 0,\ H \geq 0$$
For now, the only restrictions on W and H are:
- all elements are non-negative
- their dimensions are m × k and k × n
but we could impose more restrictions (e.g. $W^T W = I$).
Choosing a Norm to Measure the Residual
Matrices are complex objects, so there are many possible ways of sizing them through various norms.
There are many matrix norms, so there are many ways for us to measure the size of the residual matrix $R_k = V - WH$.
The norm we choose will define how we weigh the errors of the approximation WH of V.
And we can make several assumptions about how errors should be measured:
- Is having many small errors better than having any large error?
- Do errors become disproportionately worse with amplitude?
- Do we care about how many errors we make or only about their amplitudes?
This leads to many interesting (and fundamental) questions, but I am going to skip all this!
Frobenius Norm
A very intuitive norm is the Frobenius norm:
$$\|R\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij}^2}$$
It's simply the square root of the sum of the squares of all cells in the matrix.
Two other formulations:
- Matrix operators formulation: $\|R\|_F = \sqrt{\mathrm{trace}(R^T R)}$
- If $\sigma_i(R)$ are the singular values of R: $\|R\|_F = \sqrt{\sum_{i=1}^{\min\{m,n\}} \sigma_i^2(R)}$
$\|R\|_F$ only vanishes if all $r_{ij}^2 = 0$, that is, when $V = WH$.
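The three formulations can be compared numerically. This is a small check on an arbitrary, made-up matrix `R`, using standard NumPy routines:

```python
import numpy as np

# Hypothetical residual matrix; any real matrix would do.
R = np.array([[1.0, -2.0],
              [3.0,  0.5],
              [0.0,  4.0]])

by_cells = np.sqrt(np.sum(R ** 2))                    # sum over all squared cells
by_trace = np.sqrt(np.trace(R.T @ R))                 # trace(R^T R)
singular_values = np.linalg.svd(R, compute_uv=False)
by_svd = np.sqrt(np.sum(singular_values ** 2))        # sum of squared singular values
by_numpy = np.linalg.norm(R, "fro")                   # NumPy's built-in Frobenius norm
```

For this matrix the squared cells add up to 30.25, so all four values agree at 5.5 up to floating-point error.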
But
The square root usually leads to nasty calculus.
Much more convenient is the square of the Frobenius norm:
$$\|R\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij}^2$$
We got rid of the square root, but the minimizers of $\|R\|_F^2$ and $\|R\|_F$ are still the same.
- Remember that we want to minimize $\min_{W,H} \|R_k\|$
The derivatives of $\|R\|_F^2$ only involve differentiating powers of 2:
- we will need to take derivatives to compute the minimizers
- these are easy!
Let's get back to our minimization problem
We want to minimize (now using the square of the Frobenius norm):
$$\min_{W,H} \|R\|_F^2 = \min_{W,H} \|V - WH\|_F^2 \quad \text{subject to } W \geq 0,\ H \geq 0$$
We have two unknowns, W and H.
So we need to find two variables, but we only have one equation.
There is not even a unique solution:
$$WH = (WB)(B^{-1}H)$$
which means that any (complementary) permutation or scaling of W and H will lead to the same result.
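The non-uniqueness can be illustrated with a tiny example. The matrices below are made up; `B` is chosen as a permutation combined with a positive scaling, so that both rewritten factors stay non-negative.

```python
import numpy as np

# Hypothetical non-negative factors (3 x 2 and 2 x 3).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
H = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

# B swaps the two "genres" and rescales them; its inverse is also non-negative.
B = np.array([[0.0, 2.0],
              [4.0, 0.0]])

W2 = W @ B                     # rewritten basis
H2 = np.linalg.inv(B) @ H      # complementary rewrite of the coefficients

same_product = np.allclose(W2 @ H2, W @ H)
different_factors = not np.allclose(W2, W)
```

The product is unchanged even though the factors differ, so the factors themselves are only defined up to such transformations.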
Let's get back to our minimization problem
We want to minimize (now using the square of the Frobenius norm):
$$\min_{W,H} \|R\|_F^2 = \min_{W,H} \|V - WH\|_F^2$$
Interestingly, if we only had one unknown:
- e.g. having W and trying to find H
- e.g. having H and trying to find W
we would fall into a very well-known case: Ordinary Least Squares ("Linear Regression with squared loss"), for which there is a closed-form [2] solution!
[2] I am not showing it here, but this closed-form solution is possible because we have an easy way of computing the derivatives.
Ordinary Least Squares (OLS)
Suppose that we knew W and wanted to find the H that minimizes the error:
$$\min_{H} \|R\|_F^2 = \min_{H} \|V - WH\|_F^2$$
The OLS solution for H would be:
$$H = (W^T W)^{-1} W^T V$$
with $(W^T W)^{-1} W^T$ being the left pseudo-inverse of W (Moore-Penrose inverse).
Ordinary Least Squares (OLS)
Suppose now that we knew H and wanted to find the W that minimizes the error:
$$\min_{W} \|R\|_F^2 = \min_{W} \|V - WH\|_F^2$$
The OLS solution for W would be:
$$W^T = (H H^T)^{-1} H V^T$$
again with $(H H^T)^{-1} H$ being a left pseudo-inverse, this time of $H^T$ (Moore-Penrose inverse).
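Both closed forms can be sketched with NumPy. The helper names `ols_H` and `ols_W` are hypothetical, and the sketch solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is a design choice of this example and not something the slides prescribe. It assumes $W^T W$ and $H H^T$ are invertible.

```python
import numpy as np

def ols_H(V, W):
    """H = (W^T W)^{-1} W^T V, computed by solving (W^T W) H = W^T V."""
    return np.linalg.solve(W.T @ W, W.T @ V)

def ols_W(V, H):
    """W^T = (H H^T)^{-1} H V^T, computed by a linear solve and transposed back."""
    return np.linalg.solve(H @ H.T, H @ V.T).T

# Synthetic check: build V from known factors, then recover each one
# while pretending the other is known.
rng = np.random.default_rng(0)
W_true = rng.random((6, 2))
H_true = rng.random((2, 5))
V = W_true @ H_true

H_hat = ols_H(V, W_true)
W_hat = ols_W(V, H_true)
```

Because V here is exactly `W_true @ H_true`, each solve recovers the missing factor up to floating-point error; with noisy data it would return the least-squares fit instead.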
Ordinary Least Squares (OLS)
This looks very complicated, but it is just:
- matrix transposition
- matrix multiplication
- and matrix inversion of (small) k × k matrices:
  - W is m × k, so $W^T W$ is k × k
  - H is k × n, so $H H^T$ is k × k
It's just a shame that we don't have any of the W or H matrices!
But as with many things in life...
We can fake it until we make it!
And that's ok! No need to feel embarrassed about it!
Alternating Least Squares (ALS)
This is one of a larger class of algorithms that work by making "leaps of faith".
We are going to assume that we know one of the matrices and compute the other using OLS.
ALS Algorithm:
Generate a random $W_0$ with positive entries
Repeat for n = 1, 2, ...:
1. compute $H_n = \mathrm{OLS}(V, W_{n-1})$
2. correct negative entries in $H_n$
3. compute $W_n = \mathrm{OLS}(V, H_n)$
4. correct negative entries in $W_n$
until $\|R\|_F^2$ is small enough or the maximum number of iterations is reached
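The loop above can be sketched as follows. Several details are assumptions of this sketch rather than statements from the slides: negative entries are "corrected" by clipping them to zero, each OLS step uses `np.linalg.lstsq` (which also copes with rank-deficient factors), and the function name `als_nmf` and its parameters are hypothetical.

```python
import numpy as np

def als_nmf(V, k, max_iter=200, tol=1e-9, seed=0):
    """Alternating least squares with clipping; returns (W, H, squared error)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 0.1               # random W_0 with positive entries
    H = np.zeros((k, n))
    err = np.inf
    for _ in range(max_iter):
        # Step 1: least-squares H for the current W.
        H = np.linalg.lstsq(W, V, rcond=None)[0]
        # Step 2: correct negative entries (assumption: set them to zero).
        H = np.clip(H, 0.0, None)
        # Step 3: least-squares W for the current H (solved on the transposes).
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T
        # Step 4: correct negative entries in W.
        W = np.clip(W, 0.0, None)
        err = np.sum((V - W @ H) ** 2)         # squared Frobenius norm of residual
        if err < tol:
            break
    return W, H, err

# The 4 x 6 observation matrix used earlier in the talk, factorized with k = 2.
V = np.array([[0, 0, 0, 0, 0, 0],
              [0, 1, 0, 2, 1, 2],
              [0, 1, 0, 1, 0, 1],
              [0, 1, 0, 2, 1, 2]], dtype=float)
W, H, err = als_nmf(V, k=2)
```

Clipping is a blunt correction: it keeps the factors non-negative but offers no guarantee that the error decreases at every iteration, which is one of the limitations of ALS.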
Alternating Least Squares (ALS)
- It's very fast (+)
  - the heavy operation would be the inversion
  - but we are inverting only k × k matrices
- It is easy to understand and to implement (+)
- Works well in practice (+)
  - i.e. it finds one of the solutions to the optimization problem
- Does not ensure non-negativeness (-)
  - we have to force a correction
- Does not allow control over the structure of W and H (-)
  - sparseness, orthogonality
So, how do we do it better?
There are other ways of computing NMF:
- We can formulate the problem with regularization constraints
  - sparseness, orthogonality, dependencies, blocks
- Use many well-known Gradient Descent strategies
- Better ways of performing updates (multiplicative updates)
- We can scale to big data using stochastic approaches
- There are smart ways of performing computations on GPUs
- There are probabilistic interpretations of NMF
- There are ways of dealing with missing data
- etc.
There are 30 years or more of research in NMF: we are not going to talk about all of it now.
Key Points
NMF is a powerful modeling tool.
- It transforms a set of X, Y observations stored in a matrix into a story
- The story has a new set of k entities in between X and Y
- Users - Movies: User - Genres - Movies
This implies factorizing the observation matrix into two matrices:
- one matrix stores information about X and the k entities
- another matrix stores information about Y and the k entities
We showed that we can compute the factorization; ALS is just one of the ways of performing such a factorization.
Key Points
By factorizing, we build a story that is closer to reality: observations do not just happen.
Our model now includes an underlying (yet unobserved) set of factors that helps explain the observations (e.g. the genres).
This means our predictions are not dependent on specific observations:
- we can ground them on these new unobserved parameters
- they are more explanatory and more stable (changing one observation does not change the underlying parameters too much)
So, this is good!
So
Can we build models that involve even deeper concepts and more intricate relations between them?
Yes. For example, we can add more hierarchy:
Users → Groups of Users → Genres → Groups of Movies → Movies
This would amount to decomposing C into 4 matrices!
We are going deeper!
Is this Deep Learning?
Yes and No, depending on who you ask!
But there is a deep [3] connection between (Non-negative) Matrix Factorization and Deep Learning Multi-Layer Perceptron architectures.
They both try to add explanatory entities in between the inputs and outputs!
[3] No pun intended!
However
That's for another talk!
And now it's time for
Questions!
Thank you!
luissarmento@gmail.com
Connect with me on LinkedIn: https://www.linkedin.com/in/luissarmento/