Case Study 4: Collaborative Filtering
Collaborative Filtering, Matrix Completion, Alternating Least Squares
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Sham Kakade
May 19, 2016
Collaborative Filtering
Goal: find movies of interest to a user, based on the movies watched by that user and by others.
Methods: matrix factorization.
[Figure: example movie posters - Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita. What do I recommend?]
Netflix Prize
Given 100 million ratings on a scale of 1 to 5, predict 3 million held-out ratings to the highest accuracy.
17,770 total movies; 480,189 total users; over 8 billion possible (user, movie) ratings, of which only ~100 million are observed.
How to fill in the blanks? (Figures from Ben Recht.)
Matrix Completion Problem
$X_{uv}$ is known for black cells and unknown for white cells; rows index users, columns index movies.
How do we fill in the missing data?
Interpreting Low-Rank Matrix Completion (aka Matrix Factorization)
$X \approx L R$: each row $L_u$ of $L$ gives user $u$'s weights over $k$ latent topics/genres, and each column $R_v$ of $R$ gives movie $v$'s weights over those topics, so the predicted rating is $\hat r_{uv} = L_u \cdot R_v$.
Identifiability of Factors
$X \approx L R$: if $r_{uv}$ is described by $L_u \cdot R_v$, what happens if we redefine the topics by an invertible $k \times k$ matrix $Q$, i.e. $L' = L Q$ and $R' = Q^{-1} R$?
Then $L' R' = L Q Q^{-1} R = L R$: the predictions are unchanged, so the factors are not identifiable.
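A quick numpy sketch of this non-identifiability (the particular $Q$ below is an arbitrary invertible choice):

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.standard_normal((5, 2))   # user factors: 5 users, rank k = 2
R = rng.standard_normal((2, 4))   # movie factors: 4 movies

Q = np.array([[2.0, 1.0],
              [0.0, 1.0]])        # any invertible k x k matrix
L2, R2 = L @ Q, np.linalg.inv(Q) @ R

print(np.allclose(L @ R, L2 @ R2))  # True: same predictions, different factors
```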
Matrix Completion via Rank Minimization
Given observed values $\{r_{uv} : (u,v) \in \Omega\}$, find a matrix $\hat X$ such that $\hat X_{uv} = r_{uv}$ for all $(u,v) \in \Omega$.
But this is hopelessly underdetermined: any values fill the blanks. Introduce bias toward simple solutions: $\min_{\hat X} \mathrm{rank}(\hat X)$ subject to $\hat X_{uv} = r_{uv}$ on $\Omega$.
Two issues: rank minimization is NP-hard, and ratings are noisy, so exact agreement with the observations is not even desirable.
Approximate Matrix Completion
Minimize squared error over the observed entries: $\min_{\hat X} \sum_{(u,v) \in \Omega} (\hat X_{uv} - r_{uv})^2$ (other loss functions are possible).
Choose rank $k$: write $\hat X = L R$ with $L \in \mathbb{R}^{n \times k}$, $R \in \mathbb{R}^{k \times m}$.
Optimization problem: $\min_{L,R} \sum_{(u,v) \in \Omega} (L_u \cdot R_v - r_{uv})^2$.
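For concreteness, a minimal way to evaluate this objective on a list of observed triples; the data layout (factors stored row-wise, ratings as (u, v, r_uv) tuples) is an illustrative choice:

```python
import numpy as np

def squared_error(L, R, ratings):
    """Sum of (L_u . R_v - r_uv)^2 over observed (u, v, r_uv) triples.
    L: (n_users, k); R: (n_movies, k), so row R[v] is movie v's factor."""
    return sum((L[u] @ R[v] - r) ** 2 for u, v, r in ratings)
```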
Coordinate Descent for Matrix Factorization
Fix the movie factors $R$ and optimize over the user factors $L$.
First observation: with $R$ fixed, the objective decouples across users; each $L_u$ appears only in the terms for user $u$'s ratings, so the users can be solved independently.
Minimizing Over User Factors
For each user $u$: $\min_{L_u} \sum_{v \in \Omega_u} (L_u \cdot R_v - r_{uv})^2$, where $\Omega_u$ is the set of movies rated by $u$.
In matrix form: stack the factors of the rated movies into $R_{\Omega_u}$ and the ratings into a vector $r_u$, giving $\min_{L_u} \| R_{\Omega_u}^\top L_u - r_u \|^2$.
Second observation: this is just least squares. Solve via the normal equations: $L_u = (R_{\Omega_u} R_{\Omega_u}^\top)^{-1} R_{\Omega_u} r_u$.
Coordinate Descent for Matrix Factorization: Alternating Least Squares
Fix movie factors, optimize the user factors: independent least-squares problems over users.
Fix user factors, optimize the movie factors: independent least-squares problems over movies.
The system may be underdetermined (e.g., a user with fewer than $k$ ratings), so add regularization: $\min_{L,R} \sum_{(u,v)\in\Omega}(L_u \cdot R_v - r_{uv})^2 + \lambda_u \|L\|_F^2 + \lambda_v \|R\|_F^2$.
Converges to a local optimum of this (nonconvex) objective, not necessarily a global one.
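A compact sketch of ALS under this regularized objective; the interface and the single shared lambda are illustrative, and a production version would use sparse structures and parallelize the per-user/per-movie solves:

```python
import numpy as np

def als(ratings, n_users, n_movies, k=10, lam=0.1, n_iters=20):
    """Alternating least squares on a list of (u, v, r_uv) triples.
    L: (n_users, k), R: (n_movies, k); a dense, minimal sketch."""
    rng = np.random.default_rng(0)
    L = 0.1 * rng.standard_normal((n_users, k))
    R = 0.1 * rng.standard_normal((n_movies, k))
    by_user = [[] for _ in range(n_users)]
    by_movie = [[] for _ in range(n_movies)]
    for u, v, r in ratings:
        by_user[u].append((v, r))
        by_movie[v].append((u, r))
    for _ in range(n_iters):
        for u, obs in enumerate(by_user):          # ridge least squares per user
            if obs:
                V = R[[v for v, _ in obs]]
                y = np.array([r for _, r in obs])
                L[u] = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ y)
        for v, obs in enumerate(by_movie):         # ridge least squares per movie
            if obs:
                U = L[[u for u, _ in obs]]
                y = np.array([r for _, r in obs])
                R[v] = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ y)
    return L, R
```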
Effect of Regularization
$X \approx L R$: the penalty $\lambda_u \|L\|_F^2 + \lambda_v \|R\|_F^2$ keeps the factors small, makes each least-squares subproblem well-posed even when a user or movie has few ratings, and reduces overfitting to the observed entries.
What you need to know
Matrix completion problem for collaborative filtering.
The completion problem is underdetermined -> introduce bias via a low-rank approximation.
Rank minimization is NP-hard.
Instead, minimize least-squares prediction error on the known values for a given rank.
Must use regularization.
Coordinate descent algorithm = alternating least squares.
Case Study 4: Collaborative Filtering
SGD for Matrix Completion, and More Algorithms (Matrix-Norm Minimization)
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Sham Kakade
May 19, 2016
Stochastic Gradient Descent
Observe one rating $r_{uv}$ at a time.
Gradient from observing $r_{uv}$: with $\epsilon_{uv} = L_u \cdot R_v - r_{uv}$, the term $(L_u \cdot R_v - r_{uv})^2$ plus regularization contributes $\epsilon_{uv} R_v + \lambda L_u$ in $L_u$ and $\epsilon_{uv} L_u + \lambda R_v$ in $R_v$ (constant factors absorbed into the step size).
Updates: $L_u \leftarrow L_u - \eta_t (\epsilon_{uv} R_v + \lambda L_u)$, $R_v \leftarrow R_v - \eta_t (\epsilon_{uv} L_u + \lambda R_v)$.
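A minimal runnable sketch of these updates (interface and hyperparameter values are illustrative):

```python
import numpy as np

def sgd_mf(ratings, n_users, n_movies, k=10, eta=0.01, lam=0.1, n_epochs=10):
    """SGD for matrix factorization on (u, v, r_uv) triples; minimal sketch."""
    rng = np.random.default_rng(0)
    L = 0.1 * rng.standard_normal((n_users, k))
    R = 0.1 * rng.standard_normal((n_movies, k))
    for _ in range(n_epochs):
        rng.shuffle(ratings)                 # visit ratings in random order
        for u, v, r in ratings:
            eps = L[u] @ R[v] - r            # prediction error on this rating
            Lu = L[u].copy()                 # snapshot before the in-place update
            L[u] -= eta * (eps * R[v] + lam * L[u])
            R[v] -= eta * (eps * Lu + lam * R[v])
    return L, R
```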
Local Optima vs. Global Optima
We are solving: $\min_{L,R} \sum_{(u,v)\in\Omega} (L_u \cdot R_v - r_{uv})^2 + \lambda_u\|L\|_F^2 + \lambda_v\|R\|_F^2$, which is nonconvex in $(L, R)$.
We (kind of) wanted to solve: $\min_X \mathrm{rank}(X)$ subject to agreement with the observations, which is NP-hard.
How do these things relate???
Eigenvalue Decompositions for PSD Matrices
Given a (square) symmetric positive semidefinite matrix $X \succeq 0$: $X = \sum_i \lambda_i v_i v_i^\top$ with eigenvalues $\lambda_i \ge 0$.
Thus the rank is the number of nonzero eigenvalues: $\mathrm{rank}(X) = \|\lambda\|_0$.
Approximation: keep only the top-$k$ eigenvalues and eigenvectors.
Property of the trace: $\mathrm{tr}(X) = \sum_i \lambda_i = \|\lambda\|_1$.
Thus, approximate rank minimization ($\ell_0$ on the eigenvalues) by trace minimization ($\ell_1$ on the eigenvalues): $\min_X \mathrm{tr}(X)$ subject to the constraints and $X \succeq 0$.
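A small numpy check of the two facts used here (the matrix is random and PSD by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
X = A @ A.T                        # symmetric PSD, rank 2 by construction

lam = np.linalg.eigvalsh(X)        # real, nonnegative eigenvalues
print(np.isclose(np.trace(X), lam.sum()))   # True: tr(X) = sum of eigenvalues
print(np.sum(lam > 1e-10))                  # 2: rank = # of nonzero eigenvalues
```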
Generalizing the Trace Trick
Non-square matrices have no trace (and no eigendecomposition).
For (square) positive semidefinite matrices, the eigendecomposition gives $\mathrm{tr}(X) = \sum_i \lambda_i$.
For rectangular matrices, use the singular value decomposition: $X = U \Sigma V^\top = \sum_i \sigma_i u_i v_i^\top$ with singular values $\sigma_i \ge 0$, and $\mathrm{rank}(X)$ = number of nonzero $\sigma_i$.
Nuclear norm: $\|X\|_* = \sum_i \sigma_i$, the analogue of the trace for rectangular matrices.
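In numpy, the nuclear norm is exactly the sum of the singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))

sigma = np.linalg.svd(X, compute_uv=False)                # singular values
print(np.isclose(sigma.sum(), np.linalg.norm(X, "nuc")))  # True
```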
Nuclear Norm Minimization
Optimization problem: $\min_X \|X\|_*$ subject to $X_{uv} = r_{uv}$ for all $(u,v) \in \Omega$.
Possible to relax the equality constraints: $\min_X \sum_{(u,v)\in\Omega} (X_{uv} - r_{uv})^2 + \lambda \|X\|_*$.
Both are convex problems! (Solvable by semidefinite programming.)
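A toy sketch of the equality-constrained version using cvxpy (assuming cvxpy is installed; the 4x4 instance and its observed entries are made up for illustration):

```python
import numpy as np
import cvxpy as cp

# Made-up observed entries (u, v, r_uv) of a 4 x 4 ratings matrix.
observed = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 3, 2.0), (3, 0, 1.0)]

X = cp.Variable((4, 4))
constraints = [X[u, v] == r for u, v, r in observed]
problem = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
problem.solve()                    # default solver handles the resulting SDP
print(np.round(X.value, 2))        # minimum-nuclear-norm completion
```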
Nuclear Norm Minimization vs. Direct (Bilinear) Low-Rank Solutions
Nuclear norm minimization: $\min_X \sum_{(u,v)\in\Omega}(X_{uv} - r_{uv})^2 + \lambda \|X\|_*$. Annoying because: $X$ is a full $n \times m$ variable with no explicit low-rank structure, and semidefinite programming does not scale.
Instead: $\min_{L,R} \sum_{(u,v)\in\Omega}(L_u \cdot R_v - r_{uv})^2 + \frac{\lambda}{2}(\|L\|_F^2 + \|R\|_F^2)$. Annoying because: it is nonconvex in $(L, R)$.
But $\|X\|_* = \min_{L,R:\, X = LR} \frac{1}{2}(\|L\|_F^2 + \|R\|_F^2)$. So the bilinear problem is the nuclear-norm problem restricted to rank-$k$ factorizations. And, under certain conditions, its local minima recover the convex solution [Burer, Monteiro 04].
Theory
Suppose the true matrix is exactly low rank; what might we hope for? Exact recovery?
Statistically: from how few observed entries? Computationally: in polynomial time?
Is this possible? Assumptions are needed: the observed entries are sampled at random, and the true matrix is "incoherent" (its mass is spread across entries rather than concentrated in a few).
Analysis of Nuclear Norm
Nuclear norm minimization = convex relaxation of rank minimization.
Theorem [Candes, Recht 08]: if there is a true $n \times n$ matrix of rank $k$, and we observe at least on the order of $k\, n^{1.2} \log n$ random entries of it, then the true matrix is recovered exactly, with high probability, via convex nuclear norm minimization (under certain incoherence conditions).
Alternating Minimization
Alternating minimization is what is used in practice.
It was widely thought to be merely a search heuristic. Does it provably work?
SGD
SGD is also widely used in practice (streaming, fast).
Does it work?
What you need to know
Stochastic gradient descent for matrix factorization.
Norm minimization as a convex relaxation of rank minimization.
Trace norm for PSD matrices; nuclear norm in general.
The intuitive relationship between nuclear norm minimization and direct (bilinear) minimization.
Case Study 4: Collaborative Filtering
Nonnegative Matrix Factorization, Projected Gradient
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox
May 6, 2015
Matrix Factorization Solutions Can Be Unintuitive
Many, many, many applications of matrix factorization. E.g., on text data it can do topic modeling (an alternative to LDA): $X \approx L R$, with $X$ the document-word count matrix, rows of $L$ the documents' topic weights, and rows of $R$ the topics' word weights.
Would like: nonnegative factors, so that topics and their word weights are interpretable.
But: unconstrained least-squares factors can have negative entries, which have no natural interpretation as topic or word weights.
Nonnegative Matrix Factorization
$X \approx L R$, just like before, but with the constraints $L \ge 0$ and $R \ge 0$ (elementwise).
This is a constrained optimization problem: $\min_{L \ge 0,\, R \ge 0} \sum_{(u,v)\in\Omega} (L_u \cdot R_v - r_{uv})^2$.
Many, many, many, many solution methods; we'll check out a simple one.
Recall: Projected Gradient
Standard optimization: want to minimize $f(x)$; use, e.g., gradient updates $x^{(t+1)} \leftarrow x^{(t)} - \eta_t \nabla f(x^{(t)})$.
Constrained optimization: given a convex set $C$ of feasible solutions, want to find minima within $C$: $\min_{x \in C} f(x)$.
Projected gradient: take a gradient step ignoring the constraints, $\tilde x = x^{(t)} - \eta_t \nabla f(x^{(t)})$; then project into the feasible set, $x^{(t+1)} = \Pi_C(\tilde x) = \arg\min_{y \in C} \|y - \tilde x\|$.
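A generic sketch of this scheme; the quadratic example and the nonnegative-orthant projection are illustrative choices:

```python
import numpy as np

def projected_gradient(grad, project, x0, eta=0.1, n_iters=200):
    """Projected gradient descent: gradient step, then project onto C."""
    x = x0
    for _ in range(n_iters):
        x = project(x - eta * grad(x))
    return x

# Example: minimize ||x - b||^2 over C = {x >= 0} (projection = clip at 0).
b = np.array([1.0, -2.0, 3.0])
x_star = projected_gradient(grad=lambda x: 2.0 * (x - b),
                            project=lambda x: np.maximum(x, 0.0),
                            x0=np.zeros(3))
print(x_star)  # approximately [1, 0, 3]
```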
Projected Stochastic Gradient Descent for Nonnegative Matrix Factorization
Gradient step observing $r_{uv}$, ignoring the constraints: $\tilde L_u = L_u - \eta_t(\epsilon_{uv} R_v + \lambda L_u)$ and $\tilde R_v = R_v - \eta_t(\epsilon_{uv} L_u + \lambda R_v)$, exactly as in plain SGD.
Convex set: $C = \{(L, R) : L \ge 0, R \ge 0\}$.
Projection step: zero out the negative entries, $L_u \leftarrow \max(\tilde L_u, 0)$ and $R_v \leftarrow \max(\tilde R_v, 0)$ elementwise.
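The corresponding single-rating update as a sketch (the single shared lambda is an illustrative choice):

```python
import numpy as np

def projected_sgd_step(L, R, u, v, r, eta=0.01, lam=0.1):
    """One projected-SGD update for NMF on a single observed rating r_uv."""
    eps = L[u] @ R[v] - r                 # prediction error
    Lu = L[u].copy()                      # snapshot before updating L[u]
    L[u] = np.maximum(L[u] - eta * (eps * R[v] + lam * L[u]), 0.0)  # step + project
    R[v] = np.maximum(R[v] - eta * (eps * Lu + lam * R[v]), 0.0)
```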
What you need to know
In many applications, we want the factors to be nonnegative.
This corresponds to a constrained optimization problem.
Many possible approaches to solve it, e.g., projected gradient.
Case Study 4: Collaborative Filtering
The Cold-Start Problem
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Sham Kakade
May 19, 2016
Cold-Start Problem
Challenge: the cold-start problem (a new movie or a new user with no ratings).
Methods: use features of the movie/user.
Cold-Start Problem, More Formally
Consider a new user $u$ and predicting that user's ratings, with no previous observations of $u$.
Objective considered so far: $\min_{L,R} \sum_{(u,v)\in\Omega}(L_u \cdot R_v - r_{uv})^2 + \lambda_u\|L\|_F^2 + \lambda_v\|R\|_F^2$.
Optimal user factor: with no observed ratings for $u$, only the regularizer involves $L_u$, so $L_u = 0$.
Predicted user ratings: $\hat r_{uv} = L_u \cdot R_v = 0$ for every movie, which is useless.
An Alternative Formulation
A simpler model for collaborative filtering: we would not have this issue if we assumed all users were identical.
What about for new movies? What if we had side information, e.g., a feature vector $\phi(v)$ for each movie (genre, year, director, ...)?
Fit a linear model: $\hat r_{uv} = w \cdot \phi(v)$. What dimension should $w$ be? The same as $\phi(v)$, shared across all users.
Minimize: $\sum_{(u,v)\in\Omega} (w \cdot \phi(v) - r_{uv})^2 + \lambda \|w\|^2$.
Personalization
If we don't have any observations about a user, use the wisdom of the crowd: this addresses the cold-start problem.
Clearly, though, not all users are the same.
Just as in personalized click prediction, consider a model with global and user-specific parameters: $\hat r_{uv} = (w + w_u) \cdot \phi(v)$.
As we gain more information about the user, forget the crowd.
User Features
In addition to movie features, we may have information about the user: age, gender, location, ...
Combine these with the features of the movie into a joint feature vector $\phi(u, v)$.
Unified linear model: $\hat r_{uv} = (w + w_u) \cdot \phi(u, v)$.
Feature-Based Approach vs. Matrix Factorization
Feature-based approach: the feature representation of users and movies is fixed; can address the cold-start problem.
Matrix factorization approach: the user and movie features (factors) are learned from data; suffers from the cold-start problem.
A unified model gets both: $\hat r_{uv} = L_u \cdot R_v + (w + w_u) \cdot \phi(u, v)$.
Unified Collaborative Filtering via SGD
Gradient step observing $r_{uv}$, with $\epsilon_{uv} = \hat r_{uv} - r_{uv}$:
For $L, R$: $L_u \leftarrow L_u - \eta_t(\epsilon_{uv} R_v + \lambda L_u)$, $R_v \leftarrow R_v - \eta_t(\epsilon_{uv} L_u + \lambda R_v)$.
For $w$ and $w_u$: $w \leftarrow w - \eta_t(\epsilon_{uv} \phi(u,v) + \lambda w)$, $w_u \leftarrow w_u - \eta_t(\epsilon_{uv} \phi(u,v) + \lambda w_u)$.
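Putting the pieces together, one SGD step for the unified model (a sketch; the names and the single shared lambda are illustrative):

```python
import numpy as np

def unified_sgd_step(L, R, w, W_u, phi, u, v, r, eta=0.01, lam=0.1):
    """One SGD update for r_hat = L_u . R_v + (w + w_u) . phi(u, v).
    L: (n_users, k), R: (n_movies, k), w: (d,), W_u: (n_users, d)."""
    f = phi(u, v)                                  # joint user/movie features
    eps = L[u] @ R[v] + (w + W_u[u]) @ f - r       # prediction error
    Lu = L[u].copy()                               # snapshot before updating
    L[u] -= eta * (eps * R[v] + lam * L[u])
    R[v] -= eta * (eps * Lu + lam * R[v])
    w -= eta * (eps * f + lam * w)                 # in-place numpy update
    W_u[u] -= eta * (eps * f + lam * W_u[u])
```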
What you need to know
The cold-start problem.
Feature-based methods for collaborative filtering help address the cold-start problem.
A unified approach combines the two.