Collaborative Filtering in Recommender Systems: a Short Introduction

Similar documents
Recommender Systems: Overview and Package rectools. Norm Matloff. Dept. of Computer Science. University of California at Davis.

Quick Introduction to Nonnegative Matrix Factorization

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Collaborative Filtering: A Machine Learning Perspective

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Lecture Notes 10: Matrix Factorization

Using SVD to Recommend Movies

Nonnegative Matrix Factorization

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Recommendation Systems

Collaborative Filtering. Radek Pelánek

1 Singular Value Decomposition and Principal Component

CS281 Section 4: Factor Analysis and PCA

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

Matrix and Tensor Factorization from a Machine Learning Perspective

Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE Artificial Intelligence Grad Project Dr. Debasis Mitra

* Matrix Factorization and Recommendation Systems

CSE 494/598 Lecture-4: Correlation Analysis. **Content adapted from last year's slides

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Collaborative topic models: motivations cont

Data Preprocessing Tasks

Jeffrey D. Ullman Stanford University

Factor Analysis (10/2/13)

Advanced Introduction to Machine Learning

Lecture: Face Recognition and Feature Reduction

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

Lecture 5 Singular value decomposition

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Robustness of Principal Components

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors


CSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year's slides

EE 381V: Large Scale Learning Spring Lecture 16 March 7

Lecture: Face Recognition and Feature Reduction

Online Dictionary Learning with Group Structure Inducing Norms

Rank Selection in Low-rank Matrix Approximations: A Study of Cross-Validation for NMFs

Scaling Neighbourhood Methods

Structured matrix factorizations. Example: Eigenfaces

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Collaborative Filtering

Machine Learning, Fall 2009: Midterm

Decoupled Collaborative Ranking

EECS 275 Matrix Computation

Machine Learning - MT & 14. PCA and MDS

NCDREC: A Decomposability Inspired Framework for Top-N Recommendation

B553 Lecture 5: Matrix Algebra Review

PV211: Introduction to Information Retrieval

Large-scale Ordinal Collaborative Filtering

Latent Semantic Analysis. Hongning Wang

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

Mathematical Formulation of Our Example

Package sgpca. R topics documented: July 6, Type Package. Title Sparse Generalized Principal Component Analysis. Version 1.0.

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing

Conditions for Robust Principal Component Analysis

Principal Component Analysis

Problems. Looks for literal term matches. Problems:

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

L11: Pattern recognition principles

4 Bias-Variance for Ridge Regression (24 points)

Collaborative Filtering Matrix Completion Alternating Least Squares

Dimension Reduction. David M. Blei. April 23, 2012

Recommendation Systems

Matrix Factorization Techniques for Recommender Systems

Notes on Latent Semantic Analysis

Introduction to Information Retrieval

Lecture 6: Methods for high-dimensional problems

Restricted Boltzmann Machines for Collaborative Filtering

Recommendation Systems

1 Data Arrays and Decompositions

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Algorithms for Collaborative Filtering

© Springer, Reprinted with permission.

Spectral k-support Norm Regularization

Some Statistical Aspects of Photometric Redshift Estimation

Probabilistic Latent Semantic Analysis

Machine Learning Practice Page 2 of 2 10/28/13

Collaborative Filtering

Data Science Mastery Program

PCA, Kernel PCA, ICA

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Statistical Machine Learning

Based on slides by Richard Zemel

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Linear Algebra for Machine Learning. Sargur N. Srihari

SQL-Rank: A Listwise Approach to Collaborative Ranking

Automatic Rank Determination in Projective Nonnegative Matrix Factorization

Linear Algebra and Eigenproblems

Data Mining Techniques

Functions as Infinitely Long Column Vectors

Lecture 21: Spectral Learning for Graphical Models

Using Matrix Decompositions in Formal Concept Analysis

Analysis of Spectral Kernel Design based Semi-supervised Learning

Similarity and recommender systems

Transcription:

Collaborative Filtering in Recommender Systems: a Short Introduction

Norm Matloff
Dept. of Computer Science
University of California, Davis
matloff@cs.ucdavis.edu

December 3, 2016

Abstract

There is strong interest in the machine learning community in recommender systems, especially those using collaborative filtering. A rich variety of methods has been proposed. Here we briefly introduce several classes of methods.

1 Overview

Say we have user ratings data, with $Y_{ij}$ denoting the rating user $i$ gives to item $j$, for $i = 1, \ldots, n$ and $j = 1, \ldots, m$. The term "rating" here can be interpreted quite broadly; it could, for instance, be a binary variable indicating whether there is (or will be in the future) a link between two vertices in a graph. For now, though, we will take as a concrete example one of the most famous recommender system data sets, MovieLens (http://grouplens.org/datasets/movielens/), in which users rate films. This is a very active branch of machine learning research, and it has attracted considerable attention in the popular press [6].

Let $M$ denote the matrix of the $Y_{ij}$. Note that most of $M$ will be undefined, as each user typically has rated only a small fraction of the available items. This sparsity of the ratings matrix is a central issue. The goals include matrix completion, i.e. predicting the missing $Y_{ij}$ values, and understanding the underlying processes, i.e. identifying factors affecting ratings. This process is called collaborative filtering, alluding to a collaboration between users in predicting ratings.

In this paper we will first give an overview of collaborative filtering methods in Section 2, then show some of the details. Finally, we introduce our rectools software package in Section 4.

2 Collaborative Filtering

We are concerned here with collaborative filtering, meaning that the prediction for a given $Y_{ij}$ will be informed by the rating patterns of similar users. We will focus on three general categories of such methods. (There is some inconsistency in the literature regarding the definition of collaborative filtering. Some authors use the term only for the nearest-neighbor method, the first of our three categories. Others use it for any method involving user data, which will be our meaning of the term here.)
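To make this setup concrete, here is a minimal R sketch, not part of the original paper, of building the ratings matrix $M$ from (user, item, rating) triples. The file name u.data and its column layout are assumptions, modeled on the MovieLens 100K format; any file of such triples would do.

    # Build the ratings matrix M from (user, item, rating) triples.
    # Assumed format (MovieLens 100K style): user id, item id, rating, timestamp.
    ratings <- read.table("u.data",
                          col.names = c("user", "item", "rating", "timestamp"))

    n <- max(ratings$user)   # number of users
    m <- max(ratings$item)   # number of items

    M <- matrix(NA_real_, nrow = n, ncol = m)       # mostly NA: the sparsity issue
    M[cbind(ratings$user, ratings$item)] <- ratings$rating

    mean(!is.na(M))   # fraction of entries actually rated, typically quite small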

2.1 Quick Summary

Below is a very quick summary of the three categories of methods for finding a predicted value $\hat{Y}_{ij}$ for $Y_{ij}$.

Nearest-neighbor methods: To predict the rating of item $j$ by user $i$, find the other users in the data set who have rated item $j$ and whose rating patterns on other items are similar to those of user $i$. Take $\hat{Y}_{ij}$ to be the average of the ratings of item $j$ among these neighbors. We will refer to this as the k-Nearest Neighbor (kNN) method.

Matrix factorization methods: Assume the matrix $M$ can be approximated as

$$M \approx UV \qquad (1)$$

where the matrices $U$ and $V$, of sizes $n \times k$ and $k \times m$, are calculated from the known $Y_{ij}$ values. Typically $k$ is much smaller than $n$ and $m$. Then set

$$\hat{Y}_{ij} = U_{i \cdot} V_{\cdot j} \qquad (2)$$

the dot product of row $i$ of $U$ and column $j$ of $V$.

Statistical random effects models: The basic model is

$$Y_{ij} = \text{overall mean} + \text{user } i \text{ effect} + \text{item } j \text{ effect} + \text{noise} \qquad (3)$$

The latent user and item effects can be estimated by Maximum Likelihood methods, but the much faster Method of Moments (MM) approach has less stringent assumptions and yields the closed-form expression

$$\hat{Y}_{ij} = Y_{i \cdot} + Y_{\cdot j} - Y_{\cdot \cdot} \qquad (4)$$

where the three quantities on the right are the mean of row $i$, the mean of column $j$ and the overall mean, among the known values in $M$.

The first category is very much in line with statistical thinking, and the third is explicitly so. The second category is not statistical as described, and as typically perceived in the field, but it too can be viewed in this way, as will be seen below.

Note that recommender system (RS) usage involves two stages:

Exploratory stage: Here the RS administrator explores various approaches, e.g. kNN, matrix factorization (MF) and random effects (RE) models, and any associated parameters, such as the rank $k$ of $U$ and $V$ in the MF method. Eventually the administrator settles on some method and parameter values.

Production stage: Here new cases come in continually, and the model chosen in the exploratory stage is applied to these cases for predictions, one by one. The model from the first stage may be updated occasionally with the new data.

3 Details of Collaborative Filtering Methods

3.1 Nearest-Neighbor Methods

To predict the rating of a certain item by a certain user, we compare her ratings vector to those of all the users in our data, finding the ones closest to the given user among those for whom a rating of this item is available. Early work in the field tended to use this approach. Of course, one of the issues that arises is the definition of "closest," a problem especially in view of the fact that most of our data is missing. We will return to this issue below.
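As a rough illustration of the nearest-neighbor idea, the following R sketch, using the matrix M built earlier, predicts a rating as the average among the k nearest raters of item j. The similarity measure used here (mean absolute difference over co-rated items) is just one of many possible choices; this is a sketch under simplifying assumptions, not the exact algorithm of any particular system.

    # k-NN sketch: predict user i's rating of item j from the k most similar
    # users who have rated item j.  Similarity = mean absolute difference over
    # co-rated items (one of many possible definitions of "closest").
    predictKNN <- function(M, i, j, k = 10) {
      raters <- setdiff(which(!is.na(M[, j])), i)     # other users who rated item j
      if (length(raters) == 0) return(mean(M[i, ], na.rm = TRUE))
      dist1 <- function(u) {
        common <- !is.na(M[i, ]) & !is.na(M[u, ])     # items co-rated by users i and u
        if (!any(common)) return(Inf)
        mean(abs(M[i, common] - M[u, common]))
      }
      d <- sapply(raters, dist1)
      nbrs <- raters[order(d)][1:min(k, length(raters))]
      mean(M[nbrs, j], na.rm = TRUE)                  # average the neighbors' ratings of item j
    }

    # Example: predicted rating of item 12 by user 3, using 25 neighbors
    # predictKNN(M, i = 3, j = 12, k = 25)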

3.2 Matrix Factorization Methods

From a statistical point of view, matrix factorization methods for recommender systems amount to a principal components analysis (PCA) of the data, or something similar.

3.2.1 Intuition

Row $i$ of $UV$ is a linear combination of the rows of $V$. Since row $i$ of $UV$ is also the approximation of row $i$ of $M$, the rows of $V$ then form a basis for a space approximating the rows of $M$. In other words:

(a) The rows of $V$ serve as synthetic "typical" users, with each row consisting of the ratings such a synthetic user gives to the various items.

(b) For any real user $i$ in our data set, row $i$ of $U$ gives us weights with which we can express this user's item ratings as a linear combination of the synthetic users.

3.2.2 What Actually Is Done

The digression above was done for the purpose of developing intuition. Now let's look at $M$ itself. Singular value decomposition (SVD) is similar to PCA, writing $M$ as

$$M = QSR \qquad (5)$$

where $Q$ and $R$ are orthogonal, and $S$ is a diagonal matrix consisting of the singular values, which are the square roots of the eigenvalues of $M'M$ (the "variances" in the PCA view). Discarding the small singular values and retaining the $k$ largest, we have matrices $U$ and $V$ such that

$$M \approx UV \qquad (6)$$

for matrices $U$, $n \times k$, and $V$, $k \times m$, of rank $k$. $U$ and $V$ are calculated from the available data, after which the product $UV$ gives us predicted values for the unknown $Y_{ij}$:

$$\hat{Y}_{ij} = U_{i \cdot} V_{\cdot j} \qquad (7)$$

where the "hat" means predicted value, and the notations $i \cdot$ and $\cdot j$ mean row $i$ and column $j$ of the given matrix.

Here $k$ is the analog of the small number of representative users in the intuitive discussion above, as follows. In (6), row $i$ of $M$, the ratings of user $i$, is now given approximately as a linear combination of the rows of $V$ (with the coefficients being row $i$ of $U$). Thus the rows of the matrix $V$ serve as $k$ representative users, in place of our original $n$. (They are actually composites of users.) Similarly, the columns of $U$ serve as $k$ representative films, in place of our original $m$.

The general idea in avoiding overfitting is to use fewer predictors, and here, in this context of predicting ratings, a smaller $k$ fits that description. Thus we hope that $k$ is much less than the rank of $M$. On the other hand, if $k$ is too small, we are underfitting. Finding a good middle ground for $k$ is one of the major challenges in statistics, and it certainly is so here. (Another approach to dealing with overfitting is regularization, in which the matrices $U$ and $V$ are constrained from becoming too large. We do not cover this approach here.)

3.2.3 One More Refinement

Since the matrix $M$ has nonnegative entries, we might wish to find matrices $U$ and $V$ with that same property. This is called nonnegative matrix factorization (NMF). The appeal of NMF here is that it makes the "$k$ representative users" and "$k$ representative films" interpretations above more plausible. NMF is probably the most popular method today. (See [3] for a tutorial showing the use of NMF in other applications. Also, [1] develops conditions under which NMF gives the correct factorization, though in practice this is just assumed, and the result is for exact, not approximate, factorization.)
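To make the SVD-based factorization in (5)-(7) concrete, here is a hedged R sketch using base R's svd(). Filling the missing entries with each user's mean rating before taking the SVD is a crude assumption made only to obtain a complete matrix; practical implementations handle the missingness more carefully, e.g. by iterative or gradient-based fitting.

    # Rank-k SVD sketch: approximate M by UV as in (6), then read off predictions (7).
    # Missing entries are filled with the user's mean rating only to make svd() applicable.
    svdPredict <- function(M, k = 10) {
      rowmeans <- rowMeans(M, na.rm = TRUE)
      Mfill <- M
      for (i in 1:nrow(M)) Mfill[i, is.na(M[i, ])] <- rowmeans[i]
      s <- svd(Mfill)                                  # returns u, d, v with Mfill = u diag(d) v'
      U <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k)     # n x k
      V <- t(s$v[, 1:k])                               # k x m
      U %*% V                                          # hat(Y)_ij is the (i, j) entry
    }

    # Example: rank k = 10 predictions; predicted rating of item 12 by user 3
    # Yhat <- svdPredict(M, k = 10)
    # Yhat[3, 12]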

3.3 Statistical Random Effects Models

A popular extension of matrix factorization methods is to incorporate user and item biases, reflecting the tendency of some users to be more generous in their ratings than others, and the tendency of some items to be rated higher than others. These quantities will be defined below, but for now let's just give them names, $\alpha_i$ for user $i$ and $\beta_j$ for item $j$. The point is that the popular approach is to first subtract them from the known $Y_{ij}$, and then use matrix factorization as above.

Random effects models form a branch of traditional statistical methodology, ironically useful in a modern application such as recommender systems. The model is

$$Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij} \qquad (8)$$

with $\mu$ fixed but unknown, and with the other quantities random, unobservable, and having mean 0. The term "random effects" here refers to the fact that the users in our data are treated as a random sample from some population of users, with a similar treatment for items. The quantity $\mu$ is then the overall mean rating among all possible users and items; $\alpha_i$ is the difference between the mean rating that user $i$ would give to all possible items and $\mu$; and $\beta_j$ is the analogous quantity for item $j$. The $\epsilon$ term accounts for variability not incorporated in the other quantities. The three random quantities are assumed independent over all $i$ and $j$. In the Maximum Likelihood formulation, the random quantities are also assumed to be normally distributed.

According to the history given in [5], random effects models of various kinds were used in the early history of recommender systems, but were rejected on computational grounds when data sets became much larger. More recently, [2] and [5] proposed circumventing these problems by using Method of Moments (MM) estimation. MM matches expressions for population moments (means, variances, etc.) with their sample analogs. In our setting it is quite straightforward to derive closed-form expressions for the estimators:

$$\hat{\mu} = Y_{\cdot \cdot} \qquad (9)$$

$$\hat{\alpha}_i = Y_{i \cdot} - Y_{\cdot \cdot} \qquad (10)$$

$$\hat{\beta}_j = Y_{\cdot j} - Y_{\cdot \cdot} \qquad (11)$$

where $Y_{\cdot \cdot}$ is the overall sample mean in our data, $Y_{i \cdot}$ is the sample mean for user $i$, and $Y_{\cdot j}$ is the sample mean for item $j$. The missing $Y_{ij}$ are then predicted by

$$\hat{Y}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j = Y_{i \cdot} + Y_{\cdot j} - Y_{\cdot \cdot} \qquad (12)$$
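The MM estimates (9)-(12) are simple enough to compute directly from the ratings matrix M; the following short R sketch (an illustration only, not the rectools implementation) does exactly that.

    # Method of Moments estimates for the random effects model (8),
    # computed from the observed (non-NA) entries of M, as in (9)-(11).
    mu.hat    <- mean(M, na.rm = TRUE)                 # overall mean, Y..
    alpha.hat <- rowMeans(M, na.rm = TRUE) - mu.hat    # user effects, Yi. - Y..
    beta.hat  <- colMeans(M, na.rm = TRUE) - mu.hat    # item effects, Y.j - Y..

    # Predictions (12): mu.hat + alpha.hat[i] + beta.hat[j] = Yi. + Y.j - Y..
    Yhat <- outer(alpha.hat, beta.hat, "+") + mu.hat

    # Example: predicted rating of item 12 by user 3
    # Yhat[3, 12]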

4 The rectools Package

rectools is an R package for recommender systems, developed by the present author and Pooja Rajkumar [4], downloadable from https://github.com/pooja-rajkumar/rectools. It offers all three categories of collaborative filtering methodology discussed here, with some novel enhancements and some entirely new methods. The research discussed in Section 3 will be incorporated into this software (some of it is already in the package, in experimental form).

One of the key features of the package is graphics, aimed at model checking and at exploring possible subpopulations. There are also miscellaneous features, such as a "focus group" finder, which identifies typical users.

References

[1] D. Donoho and V. Stodden, When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?, NIPS 2003.

[2] K. Gao and A. Owen, Efficient Moment Calculations for Variance Components in Large Unbalanced Crossed Random Effects Models, technical report, 2016.

[3] N. Matloff, Quick Introduction to Nonnegative Matrix Factorization, http://heather.cs.ucdavis.edu/nmftutorial.pdf, 2016.

[4] P. Rajkumar and N. Matloff, Rectools: an Advanced Recommender System, useR! 2016.

[5] P. Perry, Fast Moment-Based Estimation for Hierarchical Models, JRSS-B, 2016.

[6] C. Thompson, If You Liked This, You're Sure to Love That, New York Times Magazine, November 21, 2008.