Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Similar documents
Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Andriy Mnih and Ruslan Salakhutdinov

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Restricted Boltzmann Machines for Collaborative Filtering

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January,

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

CS 175: Project in Artificial Intelligence. Slides 4: Collaborative Filtering

Collaborative Filtering. Radek Pelánek

Matrix Factorization Techniques for Recommender Systems

Using SVD to Recommend Movies

Algorithms for Collaborative Filtering

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

Recommendation Systems

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Large-scale Ordinal Collaborative Filtering

Decoupled Collaborative Ranking

The Pragmatic Theory solution to the Netflix Grand Prize

Matrix Factorization Techniques for Recommender Systems

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let's put it to use!

Adaptive one-bit matrix completion

1-Bit Matrix Completion

Introduction to Machine Learning. Regression. Computer Science, Tel-Aviv University,

Collaborative Filtering

Collaborative Filtering: A Machine Learning Perspective

Lecture Notes 10: Matrix Factorization

Collaborative Filtering Applied to Educational Data Mining

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

* Matrix Factorization and Recommendation Systems

Generative Models for Discrete Data

1-Bit Matrix Completion

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

Nonnegative Matrix Factorization

Introduction to Logistic Regression

Recommender systems, matrix factorization, variable selection and social graph data

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

a Short Introduction

Data Mining Techniques

Matrix Factorization and Collaborative Filtering

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Collaborative Filtering on Ordinal User Feedback

Ad Placement Strategies

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Stochastic Gradient Descent

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze's, linked from

CS425: Algorithms for Web Scale Data

CS246 Final Exam, Winter 2011

Introduction PCA classic Generative models Beyond and summary. PCA, ICA and beyond

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

Logistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017

Recommendation Systems

Collaborative Recommendation with Multiclass Preference Context

A Modified PMF Model Incorporating Implicit Item Associations

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

1-Bit Matrix Completion

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

Generalized Linear Models in Collaborative Filtering

Recommendation Systems

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lessons Learned from the Netflix Contest. Arthur Dunbar

CPSC 340 Assignment 4 (due November 17 ATE)

Expectation Maximization

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Machine Learning, Fall 2012 Homework 2

arxiv: v2 [cs.ir] 14 May 2018

ECE521 week 3: 23/26 January 2017

Binary matrix completion

COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017

Scaling Neighbourhood Methods

HOMEWORK #4: LOGISTIC REGRESSION

CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

Impact of Data Characteristics on Recommender Systems Performance

Machine Learning Gaussian Naïve Bayes Big Picture

Tufts COMP 135: Introduction to Machine Learning

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

© Springer, reprinted with permission.

Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures

CS 6375 Machine Learning

Lecture 2: Logistic Regression and Neural Networks

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

A few applications of the SVD

Sample questions for Fundamentals of Machine Learning 2018

Collaborative Filtering Matrix Completion Alternating Least Squares

Logistic Regression. COMP 527 Danushka Bollegala

Quantum Recommendation Systems

Applied Natural Language Processing

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Lecture 4 Discriminant Analysis, k-nearest Neighbors

Transcription:

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task
László Kozma, Alexander Ilin, Tapani Raiko
first.last@tkk.fi
Helsinki University of Technology, Adaptive Informatics Research Center
Saarbrücken, 19th October 2009

Introduction: Recommender systems
Predict user preferences in order to offer more relevant items (Amazon.com, Netflix, iTunes Genius, Pandora).
Content-based approach: use similarity between items (e.g. book text, genre, title, author, actors in a movie).
Collaborative filtering: predict preferences from previous user-item relationships.

Collaborative filtering: Netflix prize
Netflix data set:
Training set: 100,480,507 ratings from n = 480,189 users on m = 17,770 movies
Probe (test) set: 1,408,395 ratings given for validation
Quiz set: 2,817,131 user/movie pairs with ratings withheld
The train, probe and quiz sets contain ratings from the same users and movies
Ratings range from 1 to 5; the time of voting is provided

Collaborative filtering: Netflix prize
Netflix prize competition:
Score: RMSE on the quiz-set ratings; goal: 10% better RMSE than Cinematch (Netflix's own algorithm trained on the same data: 0.9514)
Leading solution: a blend of many methods (>100): k-NN, factorization (SVD), restricted Boltzmann machines and many others
http://www.netflixprize.com/
Robert M. Bell, Yehuda Koren and Chris Volinsky: The BellKor solution to the Netflix Prize

Collaborative filtering can be formulated as missing-value reconstruction: the n × m rating matrix X contains the observed ratings (values 1-5), with all remaining entries (?) missing.
n = 480,189 users, m = 17,770 movies, 100,480,507 ratings (over 98% missing)

Singular Value Decomposition
Rank-c approximation of the rating matrix X:
    X_{n \times m} \approx A_{n \times c} S_{c \times m}, \qquad \min_{A,S} \|X - AS\|_F^2
Since X is incomplete, the cost function is
    \sum_{(i,j) \in O} (x_{ij} - a_i^T s_j)^2 + \lambda (\|A\|_F^2 + \|S\|_F^2)
where O is the set of observed ratings, a_i^T is the i-th row of A, s_j is the j-th column of S, and the last term is added to avoid overfitting.
Unknown ratings are predicted as \hat{x}_{ij} = a_i^T s_j for (i,j) \notin O.
Code: http://www.cis.hut.fi/alexilin/software
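As an illustration, here is a minimal MATLAB sketch of this regularized factorization, fitted by gradient descent over the observed entries only. The names obs_i, obs_j, x, n, m, c, lambda and eta are assumptions, not part of the original code.

    % Sketch: regularized rank-c factorization of an incomplete matrix.
    % obs_i, obs_j, x are vectors listing the observed entries x_ij of X.
    A = 0.1 * randn(n, c);
    S = 0.1 * randn(c, m);
    for iter = 1:200
        e = x - sum(A(obs_i, :) .* S(:, obs_j)', 2);   % residuals on observed cells
        E = sparse(obs_i, obs_j, e, n, m);             % sparse residual matrix
        A = A + eta * (E * S' - lambda * A);           % gradient step for A
        S = S + eta * (A' * E - lambda * S);           % gradient step for S
    end
    % prediction for a missing rating: xhat_ij = A(i, :) * S(:, j)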

Method: Logistic PCA
Similar to SVD, but we use a non-linearity:
    P(y_{ij} = 1) = \sigma(a_i^T s_j)    (1)
Sigmoid function:
    \sigma(x) = \frac{1}{1 + e^{-x}}    (2)
To avoid having a separate bias term, we fix the last row of S to all ones; the elements in the last column of A then play the role of the bias.
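A brief MATLAB illustration of the bias trick; this is a sketch, and the variables A, S, c, i, j are assumed to exist:

    S(c, :) = 1;                                 % last row of S fixed to all ones
    p_ij = 1 / (1 + exp(-A(i, :) * S(:, j)));    % A(i, c) now acts as a bias for row i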

Method: Binarizing the data
Each rating value (1-5) is encoded on 4 bits:
    1 -> 0000
    2 -> 0001
    3 -> 0011
    4 -> 0111
    5 -> 1111
We create 4 binary matrices from the original matrix of ratings.
Each element tells whether a rating is above or below a threshold.
The binarization scheme could also be used for continuous data.
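A minimal MATLAB sketch of this binarization, assuming a sparse rating matrix X with zeros marking missing entries; the function name binarize_ratings and the small-value encoding of observed zeros (mentioned on the implementation slide) are my assumptions.

    % Sketch: thermometer-code a sparse rating matrix X (values 1-5, 0 = missing)
    % into four sparse binary matrices Y{1}..Y{4}. Bit k is 1 when the rating is
    % above the threshold 5-k, so Y{4} asks "rating >= 2" and Y{1} asks "rating = 5".
    function Y = binarize_ratings(X)
        [i, j, r] = find(X);                % observed entries only
        [n, m] = size(X);
        Y = cell(1, 4);
        for k = 1:4
            bit = double(r > 5 - k);        % 1 above threshold, 0 below
            bit(bit == 0) = 1e-10;          % store observed zeros as a tiny value
            Y{k} = sparse(i, j, bit, n, m); %   so sparsity marks only missing cells
        end
    end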

Method: Logistic PCA
    P(y_{ij} = 1) = \sigma(a_i^T s_j)    (3)
Both A and S are unknown and have to be estimated to fit the available ratings. We express the likelihood as a product of Bernoulli distributions over Y:
    L = \prod_{ij \in O} \sigma(a_i^T s_j)^{y_{ij}} (1 - \sigma(a_i^T s_j))^{1 - y_{ij}}    (4)
where O is the set of observed entries of Y.

Method: Regularization
The maximum-likelihood estimate is prone to overfitting, so we use Gaussian priors for A and S:
    P(s_{kj}) = \mathcal{N}(0, 1), \qquad P(a_{ik}) = \mathcal{N}(0, v_k)
    i = 1, \dots, m; \quad k = 1, \dots, c; \quad j = 1, \dots, n

Method: Learning
We use MAP estimation for the model parameters:
    F = \sum_{ij \in O} \left[ y_{ij} \log \sigma(a_i^T s_j) + (1 - y_{ij}) \log(1 - \sigma(a_i^T s_j)) \right]
        - \sum_{i=1}^{m} \sum_{k=1}^{c} \left[ \frac{1}{2} \log 2\pi v_k + \frac{a_{ik}^2}{2 v_k} \right]
        - \sum_{k=1}^{c} \sum_{j=1}^{n} \left[ \frac{1}{2} \log 2\pi + \frac{s_{kj}^2}{2} \right]    (5)
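A sketch of evaluating the objective (5) in MATLAB for one binary matrix; the function name map_objective and the argument names are assumptions (A is rows-by-c, S is c-by-columns, v is a 1-by-c vector of prior variances, and obs_i, obs_j, y list the observed entries).

    % Sketch of the MAP objective (5); uses implicit expansion (MATLAB R2016b+).
    function F = map_objective(A, S, v, obs_i, obs_j, y)
        z = sum(A(obs_i, :) .* S(:, obs_j)', 2);      % a_i' * s_j for each observed pair
        p = 1 ./ (1 + exp(-z));                       % sigmoid (see the stable version later)
        loglik = sum(y .* log(p) + (1 - y) .* log(1 - p));
        priorA = sum(sum(0.5 * log(2 * pi * v) + (A.^2) ./ (2 * v)));
        priorS = sum(sum(0.5 * log(2 * pi) + (S.^2) / 2));
        F = loglik - priorA - priorS;
    end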

Method: Learning
The derivatives of the log-posterior:
    \frac{\partial F}{\partial a_i} = \sum_{j:\, ij \in O} (y_{ij} - \sigma(a_i^T s_j))\, s_j - \mathrm{diag}(1/v_k)\, a_i    (6)
    \frac{\partial F}{\partial s_j} = \sum_{i:\, ij \in O} (y_{ij} - \sigma(a_i^T s_j))\, a_i - s_j    (7)
Gradient-ascent simultaneous update rules:
    a_i \leftarrow a_i + \frac{\alpha}{m} \frac{\partial F}{\partial a_i}    (8)
    s_j \leftarrow s_j + \frac{\alpha}{n} \frac{\partial F}{\partial s_j}    (9)

Method: Learning
Gradient-ascent update rules:
    a_i \leftarrow a_i + \frac{\alpha}{m} \frac{\partial F}{\partial a_i}    (10)
    s_j \leftarrow s_j + \frac{\alpha}{n} \frac{\partial F}{\partial s_j}    (11)
Scaling: the learning rate \alpha is adapted during training: decrease it by half if a step is unsuccessful, increase it by 20% if it is successful.
The variance parameters are point-estimated to maximize F:
    v_k = \frac{1}{d} \sum_{i=1}^{d} a_{ik}^2    (12)
where d is the number of rows of A.
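A sketch of one possible MATLAB training loop with this adaptive step size and variance re-estimation; map_objective is the sketch above, gradients is a hypothetical helper implementing (6)-(7), and alpha0, max_iter, m, n, obs_i, obs_j, y are assumed.

    % Sketch: gradient ascent with the adaptive learning rate described above.
    alpha = alpha0;
    F_old = map_objective(A, S, v, obs_i, obs_j, y);
    for iter = 1:max_iter
        [dA, dS] = gradients(A, S, v, obs_i, obs_j, y);   % eq. (6)-(7), hypothetical helper
        A_new = A + (alpha / m) * dA;                     % eq. (10)
        S_new = S + (alpha / n) * dS;                     % eq. (11)
        F_new = map_objective(A_new, S_new, v, obs_i, obs_j, y);
        if F_new > F_old                                  % successful step: accept, speed up
            A = A_new;  S = S_new;  F_old = F_new;
            alpha = 1.2 * alpha;
        else                                              % unsuccessful step: reject, slow down
            alpha = 0.5 * alpha;
        end
        v = mean(A.^2, 1);                                % eq. (12): update prior variances
    end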

Method: Prediction
The four binary models give four reconstructions x_1, x_2, x_3, x_4 = \sigma(a_i^T s_j), one per bit. The probability of each rating value is then:
    P_{x=1} = (1 - x_1)(1 - x_2)(1 - x_3)(1 - x_4)    (13)
    P_{x=2} = (1 - x_1)(1 - x_2)(1 - x_3)\, x_4    (14)
    P_{x=3} = (1 - x_1)(1 - x_2)\, x_3 x_4    (15)
    P_{x=4} = (1 - x_1)\, x_2 x_3 x_4    (16)
    P_{x=5} = x_1 x_2 x_3 x_4    (17)
We estimate the rating as the expectation:
    \hat{x} = \frac{1 P_{x=1} + 2 P_{x=2} + 3 P_{x=3} + 4 P_{x=4} + 5 P_{x=5}}{P_{x=1} + P_{x=2} + P_{x=3} + P_{x=4} + P_{x=5}}    (18)
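A minimal MATLAB sketch of this prediction rule; the function name predict_rating is an assumption, and x is a 4-vector of bit probabilities for one user/movie pair.

    % Sketch of equations (13)-(18): expected rating from four bit probabilities.
    function r = predict_rating(x)
        p = zeros(1, 5);
        p(1) = (1 - x(1)) * (1 - x(2)) * (1 - x(3)) * (1 - x(4));
        p(2) = (1 - x(1)) * (1 - x(2)) * (1 - x(3)) * x(4);
        p(3) = (1 - x(1)) * (1 - x(2)) * x(3) * x(4);
        p(4) = (1 - x(1)) * x(2) * x(3) * x(4);
        p(5) = x(1) * x(2) * x(3) * x(4);
        r = sum((1:5) .* p) / sum(p);     % expectation, renormalized as in (18)
    end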

Experiments: MovieLens

Method                        # components c   Train RMS   Test RMS
binary PCA, movies × users    10               0.8956      0.9248
binary PCA, users × movies    10               0.8449      0.9028
                              20               0.8413      0.9053
                              30               0.8577      0.9146
VBPCA, users × movies         10               0.7674      0.8905
                              20               0.7706      0.8892
                              30               0.7696      0.8880

Table: Performance obtained on the MovieLens 100K data

Experiments: MovieLens Predictions on the test set with PCA (x-axis) and logistic PCA (y-axis):

Experiments: Netflix

Method                  Probe RMS   Quiz subset
VBPCA, 50 components    0.9055      0.9070
BinPCA, 20 components   0.9381      0.9392
Blend                   0.9046      0.9062

Table: Performance obtained on the Netflix data

Training and test error: [plots for PCAbin(20), PCAbin(30), PCAbin(45)]

Conclusions
We introduced an algorithm for binary logistic PCA that scales well to very high-dimensional and very sparse data.
We experimented on a large-scale collaborative filtering task.
The method captures different aspects of the data than traditional PCA.
We list some possible improvements in the paper.
Questions?

Experiments: Netflix
[Figures: predicted vs. true ratings for the four bit matrices X1-X4; panels A1-A4; additional plots]

Blending
Linear regression using least-squares error on the known ratings from the probe set:
    [ predictions of method 1 | predictions of method 2 ] [ \alpha_1 ; \alpha_2 ] = [ probe set ratings ]
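A minimal MATLAB sketch of this blending step; the variable names p1, p2 (probe-set predictions), r (probe ratings) and q1, q2 (quiz-set predictions) are assumptions.

    % Sketch: least-squares blending weights fitted on the probe set.
    P = [p1(:), p2(:)];
    alpha = P \ r(:);                                    % [alpha1; alpha2]
    blend_quiz = alpha(1) * q1(:) + alpha(2) * q2(:);    % blended quiz-set predictions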

A real recommender system
Low RMSE
Accuracy on the top few picks
Easy to interpret
Speed of response
Update state (learn online)
Discover novel items

Collaborative filtering

Implementation
Implementation in MATLAB, as an extension of the SVD code from http://www.cis.hut.fi/alexilin/software
The most critical parts (sigmoid(), the product A*S) are implemented in C++ and linked via the mex interface
Use only sparse matrices, and make sure they are never internally converted to full matrices
Example: Y = spfun(@(x) 1./x, Y);
Three different stored values: missing -> 0, observed 0 -> ε, observed 1 -> 1
Minibatches: no need to keep the whole Y in memory at all times; either load it from disk in every iteration or recompute it
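For illustration, a sketch of evaluating the sigmoid only on the stored entries of a sparse matrix with spfun; the variables obs_i, obs_j, A, S, n, m are assumptions carried over from the sketches above.

    % Sketch: sigma(a_i' * s_j) evaluated only on the observed cells.
    Z = sparse(obs_i, obs_j, sum(A(obs_i, :) .* S(:, obs_j)', 2), n, m);
    P = spfun(@(z) 1 ./ (1 + exp(-z)), Z);   % keeps the sparsity pattern of Z
    % Caveat: an inner product that is exactly 0 is dropped from the sparse pattern,
    % the same kind of issue the slide avoids by storing observed zeros of Y as a small epsilon.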

Implementation
We need to compute
    L_1 = \log(\sigma(x)), \qquad L_2 = \log(1 - \sigma(x)), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}
For numerical stability we compute
    if x >= 0:  L_1 = -\log(1 + e^{-x}),  L_2 = -x + L_1
    if x < 0:   L_2 = -\log(1 + e^{x}),   L_1 = x + L_2
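A vectorized MATLAB sketch of this computation; the function name stable_log_sigmoid is an assumption.

    % Sketch: numerically stable log(sigma(x)) and log(1 - sigma(x)).
    function [L1, L2] = stable_log_sigmoid(x)
        L1 = zeros(size(x));
        L2 = zeros(size(x));
        pos = (x >= 0);
        L1(pos)  = -log(1 + exp(-x(pos)));            % exp(-x) <= 1, no overflow
        L2(pos)  = -x(pos) + L1(pos);
        L2(~pos) = -log(1 + exp(x(~pos)));            % exp(x) <= 1, no overflow
        L1(~pos) = x(~pos) + L2(~pos);
    end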

Collaborative filtering: Other ideas
Trouble with the data: spam accounts, several people sharing the same account
Use information about the quiz data set: the data is not missing (entirely) at random
Use external data to identify movies/users
Ratings depend on time, e.g. they are influenced by previously seen movies
If we both hate Titanic, which everyone else liked, that tells more about our similar tastes than if we both liked it
etc.