Collaborative Filtering: Matrix Completion and Alternating Least Squares

Case Study 4: Collaborative Filtering
Matrix Completion, Alternating Least Squares
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Sham Kakade, May 19, 2016

Collaborative Filtering
Goal: find movies of interest to a user, based on the movies watched by that user and by others. Method: matrix factorization.

[Example slide: given that a user watched Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, and La Dolce Vita, what do I recommend?]

Netflix Prize
Given about 100 million ratings on a scale of 1 to 5, predict 3 million held-out ratings to the highest accuracy. With 17,770 movies and 480,189 users, the full ratings matrix has over 8 billion entries, so the vast majority are blank. How do we fill in the blanks? (Figures from Ben Recht.)

Matrix Completion Problem
X_ij is known for black cells and unknown for white cells; rows index users and columns index movies. How do we fill in the missing data?

Interpreting Low-Rank Matrix Completion (aka Matrix Factorization)
X = L R: each row L_u of L is a k-dimensional factor for user u, each column R_v of R is a k-dimensional factor for movie v, and the rating is modeled as r_uv ~ L_u · R_v.

Identifiability of Factors
X = L R. If r_uv is described by L_u · R_v, what happens if we redefine the topics by an invertible k x k matrix Q, taking L' = L Q and R' = Q^{-1} R? Then L' R' = L R: the fit is unchanged, so the individual factors are not identifiable; only their product is.

Matrix Completion via Rank Minimization
Given observed values r_uv for (u, v) in the observed set Omega, find a matrix X such that X_uv = r_uv on the observed entries. But infinitely many matrices satisfy these constraints, so introduce a bias toward simple solutions: minimize rank(X) subject to them. Two issues: rank minimization is NP-hard, and the observed ratings are noisy, so exact equality constraints are too rigid.

Approximate Matrix Completion
Minimize the squared error on the observed entries (other loss functions are possible). Choose a rank k and parameterize X = L R. The optimization problem:
min over L, R of sum over (u, v) in Omega of (L_u · R_v - r_uv)^2.
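
As a concrete reference, here is this objective as a short NumPy function; the (u, v, rating) triple format, the function name, and storing the movie factors row-wise (so the prediction is L[u] @ R[v]) are illustrative choices, not from the slides.

```python
import numpy as np

def squared_error(L, R, ratings):
    """Squared error of the factorization on the observed entries.

    L: (n_users, k) array of user factors, rows L_u.
    R: (n_movies, k) array of movie factors, rows R_v.
    ratings: iterable of (u, v, r_uv) triples, the observed set Omega.
    """
    return sum((L[u] @ R[v] - r) ** 2 for u, v, r in ratings)
```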

Coordinate Descent for Matrix Factorization
Fix the movie factors, optimize for the user factors. First observation: with R fixed, the objective decomposes into independent subproblems, one per user, since each term involves only a single L_u.

Minimizing Over User Factors
For each user u: minimize over L_u the sum over rated movies v of (L_u · R_v - r_uv)^2. In matrix form this is an ordinary least-squares problem in L_u. Second observation: it has a closed form. Solve by the normal equations, L_u = (sum_v R_v R_v^T)^{-1} (sum_v r_uv R_v), with the sums over the movies v rated by user u.

Coordinate Descent for Matrix Factorization: Alternating Least-Squares
Fix movie factors, optimize for user factors: independent least-squares problems over users. Fix user factors, optimize for movie factors: independent least-squares problems over movies. The system may be underdetermined (e.g., a user with fewer than k ratings makes the normal equations singular); regularization fixes this. Each step decreases the objective, so the procedure converges to a local optimum of a non-convex problem, not necessarily a global one.
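
A minimal ALS sketch in NumPy following these observations; the data layout and the regularization constant lam are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

def als(ratings, n_users, n_movies, k, lam=0.1, n_iters=20):
    """Alternating least squares for matrix completion (sketch).

    ratings: iterable of (u, v, r_uv) triples.
    lam: ridge regularization; also keeps each solve well posed when a
    user or movie has fewer than k observed ratings.
    """
    rng = np.random.default_rng(0)
    L = rng.normal(scale=0.1, size=(n_users, k))
    R = rng.normal(scale=0.1, size=(n_movies, k))
    by_user = [[] for _ in range(n_users)]
    by_movie = [[] for _ in range(n_movies)]
    for u, v, r in ratings:
        by_user[u].append((v, r))
        by_movie[v].append((u, r))
    for _ in range(n_iters):
        for u, obs in enumerate(by_user):    # fix R, one ridge solve per user
            if obs:
                A = np.array([R[v] for v, _ in obs])
                b = np.array([r for _, r in obs])
                L[u] = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)
        for v, obs in enumerate(by_movie):   # fix L, one ridge solve per movie
            if obs:
                A = np.array([L[u] for u, _ in obs])
                b = np.array([r for _, r in obs])
                R[v] = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)
    return L, R
```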

Effect of Regularization
X = L R, with ridge penalties added to the objective: min over L, R of sum over (u, v) in Omega of (L_u · R_v - r_uv)^2 + lambda (||L||_F^2 + ||R||_F^2). This keeps each least-squares step well posed and shrinks the factors of users and movies with few ratings.

What you need to know
The matrix completion problem for collaborative filtering. An over-determined system of observed constraints leads to low-rank approximation rather than exact fitting. Rank minimization is NP-hard; instead, minimize the least-squares prediction error on the known values for a given rank of the matrix. Regularization is necessary. The coordinate descent algorithm is Alternating Least Squares.

Case Study 4: Collaborative Filtering
SGD for Matrix Completion, More Algorithms (Matrix-Norm Minimization)
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Sham Kakade, May 19, 2016

Stochastic Gradient Descent
Observe one rating r_uv at a time. The gradient of the term for r_uv touches only L_u and R_v: writing the error e = L_u · R_v - r_uv, the gradient is proportional to e R_v + lambda L_u with respect to L_u and to e L_u + lambda R_v with respect to R_v. Updates: L_u <- L_u - eta (e R_v + lambda L_u) and R_v <- R_v - eta (e L_u + lambda R_v), with step size eta.
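
A sketch of the update on a single rating, with both gradients computed before either factor is overwritten; eta and lam are illustrative constants.

```python
def sgd_step(L, R, u, v, r_uv, eta=0.01, lam=0.1):
    """One SGD update on a single observed rating r_uv (sketch)."""
    e = L[u] @ R[v] - r_uv            # prediction error on this rating
    grad_Lu = e * R[v] + lam * L[u]   # gradient w.r.t. L_u (up to a factor of 2)
    grad_Rv = e * L[u] + lam * R[v]   # gradient w.r.t. R_v
    L[u] -= eta * grad_Lu             # only the two touched factors move
    R[v] -= eta * grad_Rv
```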

Local Optima vs. Global Optima
We are solving the non-convex bilinear problem: min over L, R of sum over (u, v) in Omega of (L_u · R_v - r_uv)^2. We (kind of) wanted to solve rank minimization subject to matching the observed entries, which is NP-hard. How do these two problems relate?

Eigenvalue Decompositions for PSD Matrices
Given a square, symmetric positive semidefinite matrix X = U Lambda U^T, the eigenvalues satisfy lambda_i >= 0. Thus the rank is the number of nonzero eigenvalues. Approximation: keep only the largest eigenvalues. Property of the trace: trace(X) = sum_i lambda_i, which is a convex function of X. Thus, approximate rank minimization by trace minimization, using the trace as a convex surrogate for rank on PSD matrices.

Generalizing the Trace Trick
Non-square matrices have no trace. For square positive semidefinite matrices we used the eigendecomposition X = U Lambda U^T; for rectangular matrices, use the singular value decomposition X = U Sigma V^T, with singular values sigma_i >= 0. Nuclear norm: ||X||_* = sum_i sigma_i, the analogue of the trace and a convex surrogate for rank(X), which is the number of nonzero singular values.
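
Both quantities fall straight out of the SVD; a small NumPy check (function names are illustrative):

```python
import numpy as np

def nuclear_norm(X):
    """||X||_* = sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

def numerical_rank(X, tol=1e-10):
    """rank(X) = number of singular values above a tolerance."""
    return int((np.linalg.svd(X, compute_uv=False) > tol).sum())
```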

Nuclear Norm Minimization
Optimization problem: min over X of ||X||_* subject to X_uv = r_uv for (u, v) in Omega. It is possible to relax the equality constraints: min over X of sum over (u, v) in Omega of (X_uv - r_uv)^2 + lambda ||X||_*. Both are convex problems! (Solved by semidefinite programming.)
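
For small instances, both convex programs can be written directly in a modeling tool. A sketch assuming the cvxpy package; the function name and the (i, j, value) observation format are illustrative:

```python
import cvxpy as cp

def complete_nuclear(observed, shape, lam=None):
    """Nuclear-norm matrix completion (sketch).

    observed: list of (i, j, value) triples for the known entries.
    lam=None gives the exact-constraint version; otherwise the relaxed,
    penalized version.
    """
    X = cp.Variable(shape)
    if lam is None:
        # Exact version: match the observed entries.
        constraints = [X[i, j] == r for i, j, r in observed]
        prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
    else:
        # Relaxed version: squared error plus nuclear-norm penalty.
        err = sum((X[i, j] - r) ** 2 for i, j, r in observed)
        prob = cp.Problem(cp.Minimize(err + lam * cp.normNuc(X)))
    prob.solve()
    return X.value
```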

Nuclear Norm Minimization vs. Direct (Bilinear) Low-Rank Solutions
Nuclear norm minimization is annoying because the variable is a full n x m matrix, which does not scale. Instead, parameterize directly as X = L R with inner dimension k, which is annoying because the problem becomes non-convex. But every rank-k matrix can be written as L R, so the bilinear form searches over the same set of matrices. And the nuclear norm has the variational form ||X||_* = min over factorizations X = L R of (1/2)(||L||_F^2 + ||R||_F^2), so the regularized bilinear problem matches the nuclear-norm problem. Under certain conditions, local solutions of the bilinear problem are in fact global [Burer, Monteiro 04].

Theory
Suppose the true matrix is exactly low rank; what might we hope for? Exact recovery of the missing entries? With how few observations (statistically)? By an efficient algorithm (computationally)? Is this possible? Yes, under assumptions: the entries are observed uniformly at random, and the true matrix is incoherent (its singular vectors are spread out rather than aligned with the coordinate axes).

Analysis of Nuclear Norm
Nuclear norm minimization is a convex relaxation of rank minimization. Theorem [Candes, Recht 08]: if there is a true n x n matrix of rank k, and we observe at least on the order of k n^{1.2} log n of its entries, sampled uniformly at random, then the true matrix is recovered exactly with high probability via convex nuclear norm minimization, under certain (incoherence) conditions.

Alternating Minimization
Alternating minimization is what is used in practice. It was widely thought to be merely a search heuristic; does it provably work?

SGD
SGD is also widely used in practice (streaming, fast). Does it provably work?

What you need to know
Stochastic gradient descent for matrix factorization. Norm minimization as a convex relaxation of rank minimization: the trace norm for PSD matrices, the nuclear norm in general. The intuitive relationship between nuclear norm minimization and direct (bilinear) minimization.

Case Study 4: Collaborative Filtering
Nonnegative Matrix Factorization, Projected Gradient
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, May 6, 2015

Matrix factorization solutions can be unintuitive
There are many, many applications of matrix factorization. E.g., on text data it can do topic modeling (an alternative to LDA): X = L R. We would like the factors to be nonnegative, so that topics and their weights are interpretable. But unconstrained least-squares factors can have negative entries, which are hard to interpret.

Nonnegative Matrix Factorization
X = L R, just like before, but with the constraint that L >= 0 and R >= 0 elementwise. This is a constrained optimization problem. There are many, many solution methods; we'll check out a simple one.

Recall: Projected Gradient
Standard optimization: to minimize f(x), use, e.g., gradient updates x <- x - eta grad f(x). Constrained optimization: given a convex set C of feasible solutions, find the minimum within C. Projected gradient: take a gradient step ignoring the constraints, then project back into the feasible set, x <- Pi_C(x - eta grad f(x)), where Pi_C(y) is the closest point to y in C.

Projected Stochastic Gradient Descent for Nonnegative Matrix Factorization
Gradient step observing r_uv, ignoring the constraints: the same single-rating update of L_u and R_v as before. Convex set: the nonnegative orthant, L >= 0 and R >= 0. Projection step: projecting onto the nonnegative orthant just clips negative entries to zero.
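
The same single-rating SGD step as earlier, followed by the clipping projection; a sketch with illustrative constants:

```python
import numpy as np

def projected_sgd_step(L, R, u, v, r_uv, eta=0.01, lam=0.1):
    """SGD step on one rating, then projection onto {x : x >= 0}."""
    e = L[u] @ R[v] - r_uv
    Lu_old = L[u].copy()                  # keep old value for R's gradient
    L[u] -= eta * (e * R[v] + lam * L[u])
    R[v] -= eta * (e * Lu_old + lam * R[v])
    # Projection onto the nonnegative orthant is elementwise clipping.
    np.maximum(L[u], 0.0, out=L[u])
    np.maximum(R[v], 0.0, out=R[v])
```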

What you need to know
In many applications we want the factors to be nonnegative. This corresponds to a constrained optimization problem, with many possible solution approaches, e.g., projected gradient.

Case Study 4: Collaborative Filtering
Cold-Start Problem
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Sham Kakade, May 19, 2016

Cold-Start Problem
Challenge: the cold-start problem, a brand-new movie (still in theaters) or a brand-new user. Method: use features of the movie or user.

Cold-Start Problem More Formally
Consider a new user u and predicting that user's ratings, with no previous observations. The objective considered so far: min over L, R of sum over (u, v) in Omega of (L_u · R_v - r_uv)^2 + lambda (||L||_F^2 + ||R||_F^2). With no ratings from user u, the optimal (regularized) user factor is L_u = 0, so the predicted rating L_u · R_v = 0 for every movie: the model can say nothing about a new user.

An Alternative Formulation
A simpler model for collaborative filtering: we would not have this issue if we assumed all users were identical. What if we had side information, a feature vector phi(v) for each movie? (This also helps for new movies, as long as their features are known at release.) Fit a single linear model shared across users, predicting the rating as w · phi(v), where w has the same dimension as the movie feature vector. Minimize: sum over (u, v) in Omega of (w · phi(v) - r_uv)^2 + lambda ||w||^2.
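
Fitting the shared model is a single ridge regression; a sketch assuming the movie features are stacked into a matrix Phi with one row per observed rating (the feature construction itself is outside the slides):

```python
import numpy as np

def fit_global_model(Phi, r, lam=0.1):
    """Ridge solution w = (Phi^T Phi + lam I)^{-1} Phi^T r.

    Phi: (n_obs, d) matrix whose rows are movie feature vectors phi(v);
    r: (n_obs,) vector of the corresponding ratings.
    """
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ r)
```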

Personalization
If we don't have any observations about a user, use the wisdom of the crowd: the shared model addresses the cold-start problem. Clearly, though, not all users are the same. Just as in personalized click prediction, consider a model with global and user-specific parameters, w + w_u. As we gain more information about the user, the user-specific part takes over and we forget the crowd.

User Features
In addition to movie features, we may have information about the user. Combine it with the features of the movie into a joint feature vector phi(u, v). Unified linear model: predict the rating as w · phi(u, v).

Feature-Based Approach vs. Matrix Factorization
Feature-based approach: the feature representations of users and movies are fixed, and it can address the cold-start problem. Matrix factorization approach: user and movie features are learned from data, but it suffers from the cold-start problem. A unified model combines the two: learned factors plus a fixed-feature linear term, predicting L_u · R_v + (w + w_u) · phi(u, v).

Unified Collaborative Filtering via SGD
Gradient step on observing r_uv, with prediction L_u · R_v + (w + w_u) · phi(u, v): for L and R, update L_u and R_v exactly as in matrix-factorization SGD; for w and w_u, update with the same prediction error scaled by phi(u, v), as in personalized linear regression.
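
A sketch of one update, under the assumption that the unified prediction is L_u · R_v + (w + w_u) · phi(u, v) as above; names and constants are illustrative, and w and w_user must be NumPy arrays so the in-place updates take effect:

```python
def unified_sgd_step(L, R, w, w_user, u, v, phi_uv, r_uv, eta=0.01, lam=0.1):
    """One SGD step for the unified model (sketch)."""
    e = L[u] @ R[v] + (w + w_user[u]) @ phi_uv - r_uv
    Lu_old = L[u].copy()                  # use pre-update values in all gradients
    L[u] -= eta * (e * R[v] + lam * L[u])
    R[v] -= eta * (e * Lu_old + lam * R[v])
    w -= eta * (e * phi_uv + lam * w)
    w_user[u] -= eta * (e * phi_uv + lam * w_user[u])
```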

What you need to know
The cold-start problem. Feature-based methods for collaborative filtering, which help address the cold-start problem. The unified approach combining both.