From Non-Negative Matrix Factorization to Deep Learning


From Non-Negative Matrix Factorization to Deep Learning. Intuitions and some Math too! luissarmento@gmail.com https://www.linkedin.com/in/luissarmento/ October 18, 2017

Introduction

Disclaimer: I really just wanted to talk about NMF. I added Deep Learning to the title just to attract people. Conclusion: Deep Learning actually works!

Relational Data. We are given data about two sets of entities, X = {x_1, x_2, x_3, ..., x_m} and Y = {y_1, y_2, y_3, ..., y_n}, which may co-occur under some observable event. We are also given the number of times c (a count, c in N_0: 0, 1, 2, ...) that x_i and y_j co-occurred under that condition: (x_i, y_j, c_ij).

Relational Data - (x_i, y_j, c_ij). Typical cases: User x_i watched Movie y_j (c = 0/1); Customer x_i purchased Product y_j n times; Web User x_i clicked on Ad/Page y_j (c = 0/1); Web User x_i searched for Query y_j c times. But there are also cases where X and Y are the same type (or even the same set) of entities: word x_i occurs with word y_j c times; protein x_i interacts with protein y_j (c = 0/1).

Relational Data - Representation. We can consider the (x_i, y_j, c_ij) observations as a list of tuples. Or, given an appropriate indexing of the entities, X: x_i with i = 1, 2, 3, ..., m and Y: y_j with j = 1, 2, 3, ..., n, we can represent all the observations as a matrix C:

C = [ c_11 c_12 ... c_1n ]
    [ c_21 c_22 ... c_2n ]
    [ ...  ...  ... ...  ]
    [ c_m1 c_m2 ... c_mn ]

with rows indexed by X (m entities) and columns indexed by Y (n entities).
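
As a concrete illustration (not from the talk), here is a minimal sketch of how such a list of tuples could be turned into a sparse count matrix with SciPy; the triples and the values of m and n are made up for the example.

    import numpy as np
    from scipy.sparse import coo_matrix

    # Hypothetical observations: (row index i, column index j, count c_ij)
    triples = [(0, 2, 1), (1, 0, 3), (3, 2, 2), (4, 1, 1)]
    m, n = 5, 4  # number of X entities (rows) and Y entities (columns)

    rows = [i for i, j, c in triples]
    cols = [j for i, j, c in triples]
    counts = [c for i, j, c in triples]

    # Sparse representation: only the non-zero counts are stored
    C = coo_matrix((counts, (rows, cols)), shape=(m, n))
    print(C.toarray())  # dense view, just for inspection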

Relational Data - Matrix Representation. A few (practical) observations about C: it contains count data (non-negative!); it probably contains many zeros, since most possible events do not happen (count = 0), so it is not a very efficient representation; it is probably a very large matrix: hundreds × hundreds (e.g. countries × countries) or millions × millions (e.g. Amazon users × products). Interestingly, under some conditions, C can still be used to make recommendations / predictions.

Users who did this also did that. The classic way of making predictions using this sort of data. Each user is represented by a row vector from the matrix C: x_1 = [c_11, c_12, ..., c_1n], ..., x_i = [c_i1, c_i2, ..., c_in], ..., x_m = [c_m1, c_m2, ..., c_mn]. We can perform an all-vs-all comparison among the x_i vectors using some convenient vector similarity metric (e.g. cosine). For each user x_i we build the list of the top k most similar users: x_i → S_i = [x_a, x_b, x_c, ...]. Recommendation strategy: for each user x_i, recommend items (e.g. movies, products) that the most similar users x_j in S_i have already interacted with, but which x_i has not yet interacted with (a sketch of this strategy is shown below).
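
Here is a small numpy sketch of this "similar users" strategy, assuming a tiny made-up count matrix; the variable names and the choice of k are purely illustrative.

    import numpy as np

    # Toy count matrix: rows are users, columns are items (made-up data)
    C = np.array([[1, 0, 1, 0],
                  [1, 0, 1, 1],
                  [0, 1, 0, 1],
                  [1, 1, 1, 0]], dtype=float)

    # Cosine similarity between every pair of user (row) vectors
    unit = C / np.linalg.norm(C, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -1.0)               # ignore self-similarity

    k = 2
    top_k = np.argsort(-sim, axis=1)[:, :k]   # indices of the k most similar users

    # Recommend, for user 0, items its neighbours interacted with but user 0 did not
    user = 0
    neighbour_counts = C[top_k[user]].sum(axis=0)
    recommendations = np.where((neighbour_counts > 0) & (C[user] == 0))[0]
    print(top_k[user], recommendations)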

Users who did this also did that. Easy to understand, but it breaks on high-dimensional data. If the space of items is very large, then: the probability of overlap between any two vectors is very low; the probability that the regions of overlap are mostly the same is high; and spurious/meaningless co-occurrences tend to be over-emphasized. Think of the Amazon catalog (millions of products): lots of people buy the best-selling dish soap, so that is not a meaningful overlap; and the fact that two people bought the same rare French movie DVD does not mean that they should use the same shampoo.

A very shallow story. One important semantic observation: matrix C tells a very shallow story. C merely stores observations: it stores a flat relation (e.g. user x_147 watched movie y_1023). C tells nothing about the process underlying the observations. Therefore, C can't explain the data in a compact way, nor model the underlying data generation process. But we know that there is a process: there is a chain of events that leads to the observations.

We know that... Movies can be categorized into certain genres g_1, g_2, ..., g_k: comedy, action, horror, sci-fi, etc. A movie can belong to several genres, and we can assume there is a degree of belongingness. There are far fewer genres (tens) than movies (thousands). Users have different preferences for / affinities to those genres. Users typically choose a movie of a genre they like, even if within the genre their choice is rather random.

Any resemblance to reality is pure ML modeling. So, here's a slightly more reasonable story: 1. user x_i has a certain affinity to each movie genre g_j; 2. user x_i decides to watch a movie of a specific genre g_k (depending on the affinity he/she has to each genre); 3. from all movies y_i of genre g_k the user picks one (with a certain probability depending on the affinity of that movie to that genre). Instead of the flat "user x watched movie y" relation, we now have two relations: user - genres (preferences / affinities) and movies - genres (belongingness). And the story goes like: observations: user → genres → movies.

Oh wait!! Did you say relation? But relations can be stored in matrices! So, in this model we have two additional matrices: U, an m users × k genres matrix (entries u_11 ... u_mk), and M, an n movies × k genres matrix (entries m_11 ... m_nk). Both U and M are very thin and tall matrices. If k is the number of genres: k ≪ m (nr of users) and k ≪ n (nr of movies).

Putting all the pieces together. So, our story was: observations: user → genres → movies.

Putting all the pieces together. So, our story was: observations: (user → genres) → movies. This part is matrix U (m users × k genres):

U = [ u_11 ... u_1k ]
    [ u_21 ... u_2k ]
    [ ...  ... ...  ]
    [ u_m1 ... u_mk ]

Putting all the pieces together. So, our story was: observations: user → (genres → movies). This part is not matrix M (n movies × k genres); it is its transpose, M^T, a short and wide matrix (many columns!) of k genres × n movies:

M^T = [ m_11 m_21 ... m_n1 ]
      [ ...  ...  ... ...  ]
      [ m_1k m_2k ... m_nk ]

Putting all the pieces together. Then, if we assume linearity, our story (observations: user → genres → movies) can be mapped to:

C ≈ U · M^T
(m × n) ≈ (m × k) · (k × n)

where C is the m users × n movies observation matrix, U is the m users × k genres matrix, and M^T is the k genres × n movies matrix.

Let's dive deeper on C ≈ U · M^T. For example: c_12 is how user 1 relates to movie 2 (a count); u_1[1..k] is the vector describing the affinity of user 1 to the k genres; m_2[1..k] is the vector describing the affinity of movie 2 to the k genres; and c_12 ≈ u_1[1..k] · m_2[1..k]. A linear model: looks great! The higher the overlap between genres, the higher the count.
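
To make the linear model concrete, here is a tiny numpy sketch (with made-up U and M) showing that each predicted count is just the dot product between a user's genre vector and a movie's genre vector.

    import numpy as np

    # Made-up factors: 3 users, 4 movies, k = 2 genres
    U = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])          # user-genre affinities (m x k)
    M = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.7, 0.3],
                  [0.1, 0.9]])          # movie-genre affinities (n x k)

    C_hat = U @ M.T                     # predicted counts, shape (m x n)

    # A single entry is the dot product of the corresponding factor vectors
    i, j = 0, 2
    assert np.isclose(C_hat[i, j], U[i] @ M[j])
    print(C_hat)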

Here is something interesting. U is m users × k genres and M^T is k genres × n movies. Number of parameters in U & M^T: m·k + k·n = (m + n)·k. Number of parameters in C: m·n.

In practice k ≪ m and k ≪ n, so U is a thin m × k matrix and M^T is a short k × n matrix, and: (m + n)·k ≪ m·n. E.g. 100k users, 10k movies, 200 genres. C: 100k × 10k = 1000M cells / params. U & M^T: (100k + 10k) × 200 = 22M. Roughly a 45× compression (see the quick check below).
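
The arithmetic behind the compression claim, as a quick sanity check using the numbers from the example above:

    m, n, k = 100_000, 10_000, 200     # users, movies, genres from the example
    cells_C = m * n                    # parameters in the full matrix
    cells_UM = (m + n) * k             # parameters in U and M^T combined
    print(cells_C, cells_UM, cells_C / cells_UM)   # 1_000_000_000, 22_000_000, ~45x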

Less is More. We are explaining m·n observations using only (m + n)·k params. This is good: we have a compact / lower-dimensional model! We are concentrating the information in fewer parameters. Our model for recommendation is now based on a deeper story! The link between users and movies is made via genres. This helps solve the problems related to sparsity! We can still try to find the most similar users, but instead of using sparse vectors from C (n dims), e.g. [0 1 0 ... 1] vs [0 0 1 ... 0], we use the lower-dimensional dense vectors from U (k dims), e.g. [0.2 ... 0.71] vs [0.5 ... 0.02].

Less is More (?) Except we have a little problem!

Less is More (?) In fact, we have two little problems...

Problem #1: List of Genres. For most things we may not have a list of existing genres. Maybe for movies we have some sets, but even in that case: it is not obvious how to define what those are (a committee?); there are new genres coming in (e.g. "zombie"); definitions may change over time (concept drift, specialization). It is also not trivial to assign items (users and movies) to genres: typically categories are subject to multiple interpretations, and we may not even have the information to make the assignment. How would we know in advance the profile of each user?

Problem #2: How many Genres are enough? This is a more technical issue, but the intuition is easy. Extreme case: we have only 1 defined movie genre; movies are either of that genre or not (plus "undefinable" movies), and users either have affinity to the genre or not. Then we can only explain simple observation patterns:

[ 1 ]                     [ 0 0 0 1 1 1 ]
[ 0 ]                     [ 0 0 0 0 0 0 ]
[ 1 ] × [ 0 0 0 1 1 1 ] = [ 0 0 0 1 1 1 ]
[ 0 ]                     [ 0 0 0 0 0 0 ]
[ 1 ]                     [ 0 0 0 1 1 1 ]

Adding Genres. We can add genres (and continuous levels of belongingness). Adding just one more genre provides more explanatory power:

[ 0 1 ]                       [ 0 1 0 1 0 1 ]
[ 0 0 ]   [ 0 0 0 1 1 1 ]     [ 0 0 0 0 0 0 ]
[ 1 1 ] × [ 0 1 0 1 0 1 ]  =  [ 0 1 0 2 1 2 ]
[ 0 1 ]                       [ 0 1 0 1 0 1 ]
[ 1 1 ]                       [ 0 1 0 2 1 2 ]

The observation matrix now has a more complex pattern.

Adding Genres. We can add even more genres, and this allows us to explain even more complex patterns. In fact, if we add enough genres, we are guaranteed to find a factorization that explains the observation matrix. For example, assume the observation matrix is:

[ 0 0 0 0 0 0 ]
[ 0 1 0 2 1 2 ]
[ 0 1 0 1 0 1 ]
[ 0 1 0 2 1 2 ]

Here's an obvious factorization with k = 4:

[ 0 0 0 0 0 0 ]   [ 1 0 0 0 ]   [ 0 0 0 0 0 0 ]
[ 0 1 0 2 1 2 ]   [ 0 1 0 0 ]   [ 0 1 0 2 1 2 ]
[ 0 1 0 1 0 1 ] = [ 0 0 1 0 ] × [ 0 1 0 1 0 1 ]
[ 0 1 0 2 1 2 ]   [ 0 0 0 1 ]   [ 0 1 0 2 1 2 ]

The left factor works as a row selector; the right factor replicates the observation matrix. This is not an interesting factorization, we are using it just to illustrate an intuition. But notice that two of the rows are equal.

So, we don't actually need so many genres: k = 3.

[ 0 0 0 0 0 0 ]   [ 1 0 0 ]
[ 0 1 0 2 1 2 ]   [ 0 1 0 ]   [ 0 0 0 0 0 0 ]
[ 0 1 0 1 0 1 ] = [ 0 0 1 ] × [ 0 1 0 2 1 2 ]
[ 0 1 0 2 1 2 ]   [ 0 1 0 ]   [ 0 1 0 1 0 1 ]

The left factor works as a row selector; the right factor holds the distinct rows of the observation matrix. We only need to choose among 3 distinct rows. Row-oriented perspective: the matrix on the right is a basis of rows that can be composed / selected to build each row of the data matrix.

So, we don't actually need so many genres (II): k = 3. Reading the same factorization column-wise: the matrix on the left (the 4 × 3 selector) is a basis of columns to be composed, and the matrix on the right holds the coefficients used to compose the output column vectors. We are now composing 3 distinct columns. Column-oriented perspective: the matrix on the left is a basis of columns that can be composed / selected to build each column of the data matrix.
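
A quick numpy check of the k = 3 factorization above; the row-oriented and column-oriented perspectives describe this same product.

    import numpy as np

    V = np.array([[0, 0, 0, 0, 0, 0],
                  [0, 1, 0, 2, 1, 2],
                  [0, 1, 0, 1, 0, 1],
                  [0, 1, 0, 2, 1, 2]])

    W = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0]])           # row selector / basis of columns (4 x 3)

    H = np.array([[0, 0, 0, 0, 0, 0],
                  [0, 1, 0, 2, 1, 2],
                  [0, 1, 0, 1, 0, 1]])  # the 3 distinct rows / coefficients (3 x 6)

    assert np.array_equal(W @ H, V)     # exact factorization with k = 3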

More generally... In fact, if we set the nr of genres k so that k ≥ min(m, n), with m the number of rows (users) and n the number of columns (movies), we are guaranteed to find a factorization that fully explains the observations, even if that factorization is pretty boring.

And? But the really interesting cases happen when k ≪ min(m, n). In such cases, we may not be able to fully recover the observation matrix, but still be able to recover a close enough one. In this case, we get a compact representation of users and movies that does not suffer from the sparsity problems.

So, how do we make this happen? If we don't have a list of genres, if we don't even know how many genres we need, if we just have a matrix with observations... how do we make this happen???

The Math!!

Intro and Terminology. I am going to jump directly into the math; we already went through the intuitions. I am going to follow the notation used in Wikipedia [1]. This will make any further study using Wikipedia easier (and makes my life much easier too!). So: observation matrix C → V, user matrix U → W, movie matrix M → H. Let's forget the users/movies scenario and think abstractly, but still keep some intuitions in mind. [1] https://en.wikipedia.org/wiki/Non-negative_matrix_factorization

Some notation and conventions. Historically, NMF formulations follow the column-oriented perspective: V is the data matrix, W is the dictionary / features / basis matrix, and H is the activation / coefficients matrix. We build each column of V by using the k coefficients in H to combine the k available columns of W. And we want, for a given k (number of categories): V ≈ WH.

V ≈ WH. For a given k, there is a residual: R_k = V - WH, which is a matrix of the same dimensions as V. We wish the residual to be as small as possible under a given matrix norm:

min_{W,H} ||R_k|| = min_{W,H} ||V - WH||   subject to W ≥ 0, H ≥ 0

For now, the only restrictions on W and H are that all elements are non-negative and that they are m × k and k × n, but we could impose more restrictions (e.g. W^T W = I).

Choosing a Norm to Measure the Residual. Matrices are complex objects, so there are many possible ways of sizing them through various norms. Since there are many matrix norms, there are many ways for us to measure the size of the residual matrix R_k = V - WH. The norm we choose will define how we weigh the errors of the approximation WH of V. And we can make several assumptions about how errors should be measured: Is having many small errors better than having any large error? Do errors become disproportionately worse with amplitude? Do we care about how many errors we make or only about their amplitudes? This leads to many interesting (and fundamental) questions, but I am going to skip all this!

Frobenius Norm. A very intuitive norm is the Frobenius norm:

||R||_F = sqrt( Σ_{i=1..m} Σ_{j=1..n} r_ij^2 )

It's simply the square root of the sum of the squares of all cells in the matrix. Two other formulations: the matrix-operator formulation, ||R||_F = sqrt( trace(R^T R) ); and, if σ_i(R) are the singular values of R, ||R||_F = sqrt( Σ_{i=1..min{m,n}} σ_i^2(R) ). ||R||_F only vanishes if all r_ij = 0, that is, when V = WH.

But... The square root usually leads to nasty calculus. Much more convenient is the square of the Frobenius norm:

||R||_F^2 = Σ_{i=1..m} Σ_{j=1..n} r_ij^2

We got rid of the square root, but the minimizers of ||R||_F^2 and ||R||_F are still the same. Remember that we want to minimize min_{W,H} ||R_k||. The derivatives of ||R||_F^2 only involve deriving powers of 2: we will need to take derivatives to compute the minimizers, and these are easy!
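
In code, both quantities are one-liners; numpy's default matrix norm is the Frobenius norm (R below is made up, just for illustration).

    import numpy as np

    R = np.array([[1.0, -2.0],
                  [0.0,  3.0]])

    fro = np.linalg.norm(R)        # sqrt(1 + 4 + 0 + 9) = sqrt(14)
    fro_sq = fro ** 2              # the quantity we actually minimize
    assert np.isclose(fro_sq, (R ** 2).sum())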

Let's get back to our minimization problem. We want to minimize (now using the square of the Frobenius norm):

min_{W,H} ||R||_F^2 = min_{W,H} ||V - WH||_F^2   subject to W ≥ 0, H ≥ 0

We have two unknowns, W and H, so we need to find two variables but we only have one equation. There is not even a unique solution: WH = (WB)(B^-1 H), which means that any (complementary) permutation or scaling of W and H leads to the same result.

Let's get back to our minimization problem. We want to minimize (now using the square of the Frobenius norm):

min_{W,H} ||R||_F^2 = min_{W,H} ||V - WH||_F^2

Interestingly, if we only had one unknown, e.g. having W and trying to find H, or having H and trying to find W, we would fall into a very well-known case: Ordinary Least Squares (linear regression with squared loss), for which there is a closed-form solution! [2]

[2] I am not showing it here, but this closed-form solution is possible because we have an easy way of computing the derivatives.

Ordinary Least Squares (OLS). Suppose that we knew W and wanted to find the H that minimizes the error:

min_H ||R||_F^2 = min_H ||V - WH||_F^2

The OLS solution for H would be:

H = (W^T W)^-1 W^T V

with (W^T W)^-1 W^T being the left pseudo-inverse of W (Moore-Penrose inverse).

Ordinary Least Squares (OLS). Suppose now that we knew H and wanted to find the W that minimizes the error:

min_W ||R||_F^2 = min_W ||V - WH||_F^2

The OLS solution for W would be:

W^T = (H H^T)^-1 H V^T

again with (H H^T)^-1 H being the left pseudo-inverse of H (Moore-Penrose inverse).
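
A small numpy sketch of the two closed-form updates; V, W and H here are random, just to check shapes and the formulas, and in practice a least-squares solver is preferred over forming the pseudo-inverse explicitly.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 6, 5, 2
    V = rng.random((m, n))
    W = rng.random((m, k))

    # H = (W^T W)^-1 W^T V, computed as a linear solve instead of an explicit inverse
    H = np.linalg.solve(W.T @ W, W.T @ V)          # shape (k, n)

    # W^T = (H H^T)^-1 H V^T, i.e. W = V H^T (H H^T)^-1
    W_new = np.linalg.solve(H @ H.T, H @ V.T).T    # shape (m, k)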

Ordinary Least Squares (OLS). This looks very complicated, but it is just matrix transposition, matrix multiplication, and matrix inversion of (small) k × k matrices: W is m × k, so W^T W is k × k; H is k × n, so H H^T is k × k. It's just a shame that we don't have either of the W or H matrices! But as with many things in life...

We can fake it until we make it! And that's ok! No need to feel embarrassed about it!

Alternating Least Squares (ALS). This is an algorithm from a larger class of algorithms that work by making leaps of faith: we are going to assume that we know one of the matrices and compute the other using OLS. ALS algorithm: generate a random W_0 with positive entries; then repeat for n = 1, 2, ...: 1. compute H_n = OLS(V, W_{n-1}); 2. correct negative entries in H_n; 3. compute W_n = OLS(V, H_n); 4. correct negative entries in W_n; until ||R||_F^2 is small enough or the maximum number of iterations is reached. A minimal sketch of this loop is shown below.
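
Here is a minimal numpy sketch of this ALS loop, a toy version of the algorithm above rather than a production implementation; the tolerance, iteration cap and toy matrix are arbitrary choices.

    import numpy as np

    def als_nmf(V, k, n_iter=200, tol=1e-6, seed=0):
        """Tiny ALS sketch: alternate least-squares solves, clipping negatives to 0."""
        rng = np.random.default_rng(seed)
        m, n = V.shape
        W = rng.random((m, k))                          # random positive W_0
        for _ in range(n_iter):
            # 1-2: H_n = OLS(V, W_{n-1}), then correct negative entries
            H = np.linalg.lstsq(W, V, rcond=None)[0]    # solves min ||V - W H||
            H = np.clip(H, 0.0, None)
            # 3-4: W_n = OLS(V, H_n), then correct negative entries
            W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T
            W = np.clip(W, 0.0, None)
            if np.linalg.norm(V - W @ H) ** 2 < tol:    # is ||R||_F^2 small enough?
                break
        return W, H

    # Toy usage: factorize a small non-negative matrix with k = 2
    V = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 3.0]])
    W, H = als_nmf(V, k=2)
    print(np.round(W @ H, 2))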

Alternating Least Squares (ALS). It's very fast (+): the heavy operation would be the inversion, but we are inverting only k × k matrices. It is easy to understand and to implement (+). It works well in practice (+), i.e. it finds one solution to the optimization problem. It does not ensure non-negativity (-): we have to force a correction. It does not allow control over the structure of W and H (-): sparseness, orthogonality.

So, how do we do it better? There are other ways of computing NMF. We can formulate the problem with regularization constraints (sparseness, orthogonality, dependencies, blocks). We can use many well-known gradient descent strategies, or better ways of performing the updates (multiplicative updates). We can scale to big data using stochastic approaches. There are smart ways of performing the computations on GPUs. There are probabilistic interpretations of NMF. There are ways of dealing with missing data. Etc. There are 30 years or more of research on NMF; we are not going to talk about all of it now.
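
For instance, scikit-learn ships an off-the-shelf NMF implementation that exposes some of these options (solver, initialization, regularization); a minimal usage sketch on a made-up matrix:

    import numpy as np
    from sklearn.decomposition import NMF

    V = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 3.0]])

    model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
    W = model.fit_transform(V)     # m x k factor
    H = model.components_          # k x n factor
    print(np.round(W @ H, 2), model.reconstruction_err_)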


Key Points. NMF is a powerful modeling tool. It transforms a set of (X, Y) observations stored in a matrix into a story. The story has a new set of k entities in between X and Y: Users - Movies becomes Users - Genres - Movies. This implies factorizing the observation matrix into two matrices: one matrix stores information relating X and the k entities; the other matrix stores information relating Y and the k entities. We showed that we can compute the factorization; ALS is just one of the ways of performing such a factorization.

Key Points. By factorizing, we build a story that is closer to reality: observations do not just happen. Our model now includes an underlying (yet unobserved) set of factors that helps explain the observations (e.g. the genres). This means our predictions are not dependent on specific observations: we can ground them on these new unobserved parameters; they are more explanatory and more stable (changing one observation does not change the underlying parameters too much). So, this is good!

So... Can we build models that involve even deeper concepts and more intricate relations between them? Yes. For example, we can add more hierarchy: Users → Groups of Users → Genres → Groups of Movies → Movies. This would amount to decomposing C into 4 matrices! We are going deeper!

Is this Deep Learning? Yes and no, depending on who you ask! But there is a deep [3] connection between (non-negative) matrix factorization and Deep Learning multi-layer perceptron architectures: they both try to add explanatory entities in between the inputs and the outputs! [3] No pun intended!

However... That's for another talk!

And now it's time for... Questions!

Thank you! luissarmento@gmail.com Connect with me on LinkedIn: https://www.linkedin.com/in/luissarmento/