Algorithms for Collaborative Filtering

Similar documents
Restricted Boltzmann Machines for Collaborative Filtering

Collaborative Filtering Applied to Educational Data Mining

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Scaling Neighbourhood Methods

Collaborative Filtering via Ensembles of Matrix Factorizations

Collaborative Filtering. Radek Pelánek

Large-scale Ordinal Collaborative Filtering

Andriy Mnih and Ruslan Salakhutdinov

Matrix Factorization Techniques for Recommender Systems

Using SVD to Recommend Movies

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

A Simple Algorithm for Nuclear Norm Regularized Problems

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let's put it to use!

Recommendation Systems

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

Generative Models for Discrete Data

CS 175: Project in Artificial Intelligence. Slides 4: Collaborative Filtering

Mixed Membership Matrix Factorization

* Matrix Factorization and Recommendation Systems

Recommendation Systems

Collaborative Recommendation with Multiclass Preference Context

A Modified PMF Model Incorporating Implicit Item Associations

Maximum Margin Matrix Factorization for Collaborative Ranking

Collaborative Filtering on Ordinal User Feedback

Matrix Factorization Techniques for Recommender Systems

Mixed Membership Matrix Factorization

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Collaborative topic models: motivations cont

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

Recommendation Systems

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Lessons Learned from the Netflix Contest. Arthur Dunbar

Decoupled Collaborative Ranking

Collaborative Filtering

Gradient Boosting (Continued)

The BigChaos Solution to the Netflix Prize 2008

Graphical Models for Collaborative Filtering

The Pragmatic Theory solution to the Netflix Grand Prize

CS425: Algorithms for Web Scale Data

Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University

Introduction to Logistic Regression

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

Collaborative Filtering: A Machine Learning Perspective

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Ensemble Methods for Machine Learning

Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures

Nonnegative Matrix Factorization

Large-scale Collaborative Ranking in Near-Linear Time

Collaborative Filtering Matrix Completion Alternating Least Squares

Matrix Factorization and Factorization Machines for Recommender Systems

Data Science Mastery Program

LLORMA: Local Low-Rank Matrix Approximation

Modeling User Rating Profiles For Collaborative Filtering

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Data Mining Techniques

Matrix Factorizations: A Tale of Two Norms

6.034 Introduction to Artificial Intelligence

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Data Mining Techniques

1-bit Matrix Completion. PAC-Bayes and Variational Approximation

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Machine Learning Techniques

Dimension Reduction Using Rule Ensemble Machine Learning Methods: A Numerical Study of Three Ensemble Methods

Lecture Notes 10: Matrix Factorization

Probabilistic Neighborhood Selection in Collaborative Filtering Systems

arxiv: v2 [cs.ir] 14 May 2018

Impact of Data Characteristics on Recommender Systems Performance

a Short Introduction

ECS171: Machine Learning

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Matrix and Tensor Factorization from a Machine Learning Perspective

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018

Introduction to Machine Learning (67577) Lecture 7

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January,

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

Scalable Bayesian Matrix Factorization

Midterm exam CS 189/289, Fall 2015

CS246 Final Exam, Winter 2011

arxiv: v1 [cs.lg] 26 Oct 2012

Collaborative Filtering

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Optimization and Gradient Descent

Predicting the Performance of Collaborative Filtering Algorithms

Introduction PCA classic Generative models Beyond and summary. PCA, ICA and beyond

VBM683 Machine Learning

CS6220: DATA MINING TECHNIQUES

Introduction to Machine Learning. Regression. Computer Science, Tel-Aviv University,

ECE 5424: Introduction to Machine Learning

Making Our Cities Safer: A Study In Neighbhorhood Crime Patterns

CS249: ADVANCED DATA MINING

TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Generalization and Regularization

Matrix Factorization and Collaborative Filtering

Joint user knowledge and matrix factorization for recommender systems

Cross-domain recommendation without shared users or items by sharing latent vector distributions

Transcription:

Algorithms for Collaborative Filtering, or: How to Get Half Way to Winning $1 Million from Netflix. Todd Lipcon. Advisor: Prof. Philip Klein

The Real-World Problem. E-commerce sites would like to make personalized product recommendations to customers. Often there is no data about the customers, and/or the products are hard to describe explicitly (e.g. movies, music, art, books).

The Simplified Problem. A regression/prediction problem: given a {user, item} pair, we would like to predict the user's ordinal rating of that item. In the case of the Netflix Prize, we are scored on the squared error of our predictions.
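A minimal sketch of the scoring metric in Python (NumPy); the function name and arrays are illustrative, not from the talk:

```python
import numpy as np

def rmse(predictions, ratings):
    """Root mean squared error, the Netflix Prize scoring metric."""
    predictions = np.asarray(predictions, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    return float(np.sqrt(np.mean((predictions - ratings) ** 2)))
```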

Collaborative Filtering. We do have past preference data for repeat users: purchase history (binary) and ratings (ordinal). We can infer traits of users and items from this preference data.

The Problem, Take 2 (this time with equations). We form a rating matrix $A \in \mathbb{R}^{m \times n}$, where $A_{ij}$ is the rating by user $i$ on item $j$. In Netflix, 99% of the entries of $A$ are missing. We would like to solve

$X^* = \operatorname{argmin}_{X \in \mathbb{R}^{m \times n}} \sum_{(i,j) \in A} (A_{ij} - X_{ij})^2$

subject to some constraint or regularization penalty on $X$ to ensure generalization to new examples.
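A small sketch of this objective restricted to the observed entries, using a boolean mask for the observed set (the toy matrix and names are illustrative):

```python
import numpy as np

# Toy rating matrix: 0 marks a missing entry (in Netflix, ~99% of A is missing).
A = np.array([[5, 0, 3],
              [0, 4, 0],
              [1, 0, 0]], dtype=float)
observed = A > 0  # the set of (i, j) with known ratings

def masked_squared_error(X, A, observed):
    """Sum of (A_ij - X_ij)^2 over observed entries only."""
    return float(np.sum((A[observed] - X[observed]) ** 2))
```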

Item-KNN Approach. We assume that items in the system can be described by their similarity to other items, so we can predict that a given user will assign similar ratings to similar items. We build a list of neighbors for each movie based on the Pearson correlation between their overlapping rating data, and we adjust each correlation by calculating a confidence interval.
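A sketch of the item-item Pearson correlation over overlapping ratings, with `A` as a users-by-items matrix; the confidence-interval adjustment mentioned above is not shown, and the minimum-overlap cutoff is an assumed simplification:

```python
import numpy as np

def item_pearson(A, observed, i, j, min_overlap=5):
    """Pearson correlation between the ratings of items i and j,
    computed only over users who rated both."""
    both = observed[:, i] & observed[:, j]
    if both.sum() < min_overlap:
        return 0.0
    ri, rj = A[both, i], A[both, j]
    ri = ri - ri.mean()
    rj = rj - rj.mean()
    denom = np.sqrt((ri ** 2).sum() * (rj ** 2).sum())
    return float(ri @ rj / denom) if denom > 0 else 0.0
```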

Item-KNN Prediction

            RMSE     NMAE
Netflix     0.9411   0.4274
MovieLens   0.9126   0.4208

State of the art ("Ensemble MMMF") achieves NMAE of 0.4054 on MovieLens. Best Netflix result is RMSE = 0.8831. CineMatch RMSE = 0.9514.

Item-KNN Data Mining. Movies similar to "Finding Nemo (Fullscreen)":

Correlation   Movie Title
0.650         Finding Nemo (Widescreen)
0.316         Monsters, Inc.
0.281         A Bug's Life
0.259         Toy Story
0.256         Toy Story 2
0.251         The Incredibles
0.243         Shrek 2
0.226         Ice Age
0.218         The Lion King: Special Edition
0.215         Harry Potter and the Prisoner of Azkaban

Slide annotations group the neighbors as: same movie; Pixar; CG/kids; animation; kids.

Matrix Factorization. We replace the objective

$X^* = \operatorname{argmin}_{X \in \mathbb{R}^{m \times n}} \sum_{(i,j) \in A} (A_{ij} - X_{ij})^2$

with the factorized objective

$\{U, V\} = \operatorname{argmin}_{U \in \mathbb{R}^{m \times r},\, V \in \mathbb{R}^{n \times r}} \sum_{(i,j) \in A} \left( A_{ij} - (UV^T)_{ij} \right)^2, \qquad X^* = UV^T,$

which is easier to learn due to the smaller number of free parameters: $r(m + n)$ instead of $mn$.

Learning U and V. The factorized objective isn't convex. We'd rather learn the matrices one column at a time, such that each additional column reduces the error as much as possible. Learning the columns u and v is still nonconvex, but works well in practice anyway. Learning the columns incrementally is an example of a gradient boosting machine [Friedman].

Gradient Boosting Matrix Factorization. Pseudocode:

    ∀ (i, j): X_ij = BASELINE(i, j)
    for k = 1 to M:
        ∀ (i, j): X̃_ij = -∂/∂X_ij (A_ij - X_ij)²    (any convex loss function can be substituted here!)
        {u, v} = UVLEARN(X̃)
        U = [U νu];  V = [V νv];  X = X + ν u vᵀ     (ν is the shrinkage parameter; more later)
    end for
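A compact Python sketch of this boosting loop for squared loss, assuming a `uv_learn` routine like the one described on the following slides; the baseline (global mean) and default hyperparameters are illustrative assumptions:

```python
import numpy as np

def gbmf(A, observed, uv_learn, M=50, shrinkage=0.25):
    """Gradient-boosted matrix factorization for squared loss:
    each round fits a rank-1 factor to the current residuals."""
    m, n = A.shape
    X = np.full((m, n), A[observed].mean())   # baseline prediction
    U, V = np.zeros((m, 0)), np.zeros((n, 0))
    for _ in range(M):
        # Residuals on observed entries (proportional to the negative gradient of squared loss).
        residual = np.where(observed, A - X, 0.0)
        u, v = uv_learn(residual, observed)    # best rank-1 fit on observed entries
        U = np.column_stack([U, shrinkage * u])
        V = np.column_stack([V, shrinkage * v])
        X = X + shrinkage * np.outer(u, v)
    return U, V, X
```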

UVLearn Objective. Aim: learn the best rank-1 approximation of the input matrix. Given $\tilde X \in \mathbb{R}^{m \times n}$, solve

$\{u, v\} = \operatorname{argmin}_{u \in \mathbb{R}^m,\, v \in \mathbb{R}^n} \sum_{ij} (\tilde X_{ij} - u_i v_j)^2.$

But this will overfit the training data; it needs regularization!

UVLearn Objective. Given $\tilde X \in \mathbb{R}^{m \times n}$ and regularization parameter $\lambda$, solve

$\{u, v\} = \operatorname{argmin}_{u \in \mathbb{R}^m,\, v \in \mathbb{R}^n} \sum_{ij} (\tilde X_{ij} - u_i v_j)^2 + \lambda \left[ \sum_{i=1}^{m} (u_i - \bar u)^2 + \sum_{j=1}^{n} (v_j - \bar v)^2 \right].$

UVLearn Regularization. Penalization is based on distance from the empirical prior means $\bar u$ and $\bar v$. We simultaneously optimize the parameters $u$ and $v$ along with their changing means $\bar u$ and $\bar v$.

UVLearn Optimization. Stochastic gradient descent partial derivatives:

$\partial E_{ij} / \partial u_i = 2 v_j (u_i v_j - \tilde X_{ij}) + 2\lambda (u_i - \bar u)$
$\partial E_{ij} / \partial v_j = 2 u_i (u_i v_j - \tilde X_{ij}) + 2\lambda (v_j - \bar v)$
$\partial E_{ij} / \partial \bar u = -2\lambda (u_i - \bar u)$
$\partial E_{ij} / \partial \bar v = -2\lambda (v_j - \bar v)$

UVLearn Optimization. Stochastic GD update equations:

$u_i \leftarrow u_i - \eta \left( 2 v_j (u_i v_j - \tilde X_{ij}) + 2\lambda (u_i - \bar u) \right)$
$v_j \leftarrow v_j - \eta \left( 2 u_i (u_i v_j - \tilde X_{ij}) + 2\lambda (v_j - \bar v) \right)$
$\bar u \leftarrow \bar u - \eta \left( -2\lambda (u_i - \bar u) \right)$
$\bar v \leftarrow \bar v - \eta \left( -2\lambda (v_j - \bar v) \right)$

Hessian:

$H(E(u_i, v_j, \bar u, \bar v)) = \begin{bmatrix} 2 v_j^2 + 2\lambda & 4 u_i v_j - 2 \tilde X_{ij} & -2\lambda & 0 \\ 4 u_i v_j - 2 \tilde X_{ij} & 2 u_i^2 + 2\lambda & 0 & -2\lambda \\ -2\lambda & 0 & 2\lambda & 0 \\ 0 & -2\lambda & 0 & 2\lambda \end{bmatrix}$
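A sketch of one possible SGD implementation of these updates; the learning rate, initialization, and fixed epoch count (used here in place of the stopping threshold τ) are assumed simplifications:

```python
import numpy as np

def uv_learn(X_tilde, observed, lam=0.05, eta=0.01, epochs=30):
    """Rank-1 fit u v^T to X_tilde on observed entries by SGD,
    regularizing u and v toward their jointly learned means."""
    m, n = X_tilde.shape
    u = np.full(m, 0.1)
    v = np.full(n, 0.1)
    u_bar, v_bar = u.mean(), v.mean()
    rows, cols = np.nonzero(observed)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = u[i] * v[j] - X_tilde[i, j]
            gu = 2 * v[j] * err + 2 * lam * (u[i] - u_bar)
            gv = 2 * u[i] * err + 2 * lam * (v[j] - v_bar)
            # Update the prior means with their (negative) gradients as well.
            u_bar -= eta * (-2 * lam * (u[i] - u_bar))
            v_bar -= eta * (-2 * lam * (v[j] - v_bar))
            u[i] -= eta * gu
            v[j] -= eta * gv
    return u, v
```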

GBMF Regularization (hyperparameters): UVLearn regularization parameter λ; UVLearn GD stopping threshold τ; gradient boosting shrinkage parameter ν; number of gradient boosting iterations M.

Shrinkage Effects. [Figure: two plots of probe RMSE vs. GBM depth (0-1000). Left: different shrinkage values ν = 0.05, 0.25, 1 with the same UVLearn regularization (λ = 0.029730). Right: different shrinkage values, each with its best UVLearn regularization (λ = 0.029730, 0.059460, 0.084090).]

GBMF Prediction

            RMSE     NMAE
Netflix     0.9134   0.4099
MovieLens   0.8859   0.4051

State of the art ("Ensemble MMMF") achieves NMAE of 0.4054 on MovieLens. Best Netflix result is RMSE = 0.8831. CineMatch RMSE = 0.9514.

Bagged GBMF. Bootstrap aggregation (bagging) is a general variance-reduction framework that does not increase bias [Breiman]. We construct ensembles of 60 GBMF models trained on bootstrap samples of the training set. Ensemble predictions are the mean of member predictions.
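A rough sketch of the bagging wrapper, assuming a `fit_gbmf` callable (e.g. a closure over the `gbmf` sketch above) that returns predictions; bootstrap resampling is done over the observed ratings:

```python
import numpy as np

def bagged_gbmf_predict(A, observed, fit_gbmf, n_models=60, seed=0):
    """Train GBMF models on bootstrap resamples of the observed ratings
    and average their predictions (bagging)."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(observed)
    preds = []
    for _ in range(n_models):
        take = rng.integers(0, len(rows), size=len(rows))  # sample with replacement
        mask = np.zeros_like(observed)
        mask[rows[take], cols[take]] = True
        _, _, X = fit_gbmf(A, mask)   # train one member on the bootstrap sample
        preds.append(X)
    return np.mean(preds, axis=0)
```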

Bagged GBMF Prediction

            RMSE     NMAE
Netflix     ??       ??
MovieLens   0.8850   0.4045

Netflix jobs are still running and getting better. We weren't able to do a good grid search yet, even for MovieLens, because of resource limitations on the SGE cluster. State of the art ("Ensemble MMMF") achieves NMAE of 0.4054 on MovieLens. Best Netflix result is RMSE = 0.8831. CineMatch RMSE = 0.9514.

Combining Predictors. A single data set has many different types of component data (varying sparsity, etc.). A single algorithm (or choice of tuning parameters for an algorithm) may not give the best results for every type of component data. We can do better by smartly blending between different predictors.

Learning Combinations. Train individual predictors on most of the data. Train the combination algorithm on the remaining data, using k-fold cross-validation for model selection. Then retrain the individual predictors on the full data and use the combination predictor learned above.

Learning Combinations. For each entry in the validation set, we create a vector with the movie mean, user mean, movie count, user count, and the predictions of all component predictors. We can then use standard regression techniques to learn the combination function.
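An illustrative sketch of the blend: build the feature vector per validation entry and fit an ordinary least-squares combination (interaction terms, used on the next slide, would be added as pairwise products of these features):

```python
import numpy as np

def blend_features(movie_mean, user_mean, movie_count, user_count, component_preds):
    """Feature vector for one validation entry: simple statistics
    plus the predictions of every component predictor."""
    return np.concatenate(([movie_mean, user_mean, movie_count, user_count],
                           component_preds))

def fit_linear_blend(F, y):
    """Least-squares blend: y ~ intercept + F @ w."""
    F1 = np.column_stack([np.ones(len(F)), F])
    w, *_ = np.linalg.lstsq(F1, y, rcond=None)
    return w
```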

Linear Combination. Use simple least squares regression:

            RMSE     NMAE
Netflix     0.9092   0.4061
MovieLens   0.8817   0.4062

Add 1st-order interaction terms:

            RMSE     NMAE
Netflix     0.9062   0.4049
MovieLens   0.8758   0.4043

Adding 2nd-order interaction terms overfits.

Support Vector Machine Combination. Use ν-SVR (support vector regression) to minimize squared loss. Use support vector ordinal regression to minimize absolute loss. Work still in progress; initial results for NMAE significantly improve on Ensemble MMMF (MovieLens NMAE ~ 0.4000).
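A minimal ν-SVR blending sketch using scikit-learn's `NuSVR`; the library choice, hyperparameters, and toy data are assumptions, since the talk does not name an implementation:

```python
import numpy as np
from sklearn.svm import NuSVR

# F: blend features (one row per validation entry), y: true ratings (toy data here).
rng = np.random.default_rng(0)
F = rng.uniform(1, 5, size=(200, 6))           # e.g. means, counts, component predictions
y = F[:, -2:].mean(axis=1) + rng.normal(0, 0.3, 200)

svr_blend = NuSVR(nu=0.5, C=1.0, kernel="rbf").fit(F, y)
blended = svr_blend.predict(F)                 # use held-out blend features in practice
```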

Other Strong Netflix Approaches. Restricted Boltzmann Machines ("ML @ UToronto" team) [Salakhutdinov, Mnih, and Hinton 2007]. Probabilistic Latent Semantic Analysis ("PLSA" team) [Hofmann 2003]. Gibbs sampling of latent variables (Paul Harrison). Most of the top teams are doing combination of some kind.

Future Work. Better grid search for the bagging experiments. Stochastic Meta-Descent optimization. Continued work on the SVR combination. Conditional rating prediction. Alternative loss functions (GBMF optimizes any convex loss).

Questions?