Collaborative Filtering: A Machine Learning Perspective

Similar documents
Modeling User Rating Profiles For Collaborative Filtering

Weighted Low Rank Approximations

CS281 Section 4: Factor Analysis and PCA

Using SVD to Recommend Movies

Probabilistic Latent Semantic Analysis

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Machine Learning - MT & 14. PCA and MDS

Principal Component Analysis

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Expectation Maximization

Computational functional genomics

a Short Introduction

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

Low-rank Matrix Completion with Noisy Observations: a Quantitative Comparison

PCA and admixture models

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017

Learning to Learn and Collaborative Filtering

ECE 5984: Introduction to Machine Learning

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Jeffrey D. Ullman Stanford University

Lecture 6 Sept Data Visualization STAT 442 / 890, CM 462

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Nonnegative Matrix Factorization

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Unsupervised Learning

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Principal Component Analysis

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n × n matrix, then A has real eigenvalues and can be written $A = P D P^{-1} = P D P^T$.

PCA, Kernel PCA, ICA

Introduction to Machine Learning


Exercise Sheet 1. 1 Probability revision 1: Student-t as an infinite mixture of Gaussians

Summary of Week 9

Review problems for MA 54, Fall 2004.

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Learning representations

Mixtures of Gaussians. Sargur Srihari

UNIT 6: The singular value decomposition.

Data Mining Techniques

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

Graphical Models for Collaborative Filtering

Advanced Introduction to Machine Learning

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Dimension Reduction and Iterative Consensus Clustering

7 Principal Component Analysis

Singular Value Decomposition

Data Mining Techniques

Latent Variable View of EM. Sargur Srihari

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Lecture 5 Singular value decomposition

Linear Algebra & Geometry why is linear algebra useful in computer vision?

Dimension Reduction. David M. Blei. April 23, 2012

EECS 275 Matrix Computation

* Matrix Factorization and Recommendation Systems

Singular Value Decomposition

Andriy Mnih and Ruslan Salakhutdinov

Spectral k-support Norm Regularization

Final Exam, Linear Algebra, Fall, 2003, W. Stephen Wilson

Dimension Reduction (PCA, ICA, CCA, FLD,

(a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? Solution: dim N(A) ≥ 1, since rank(A) ≤ 3.

Numerical Methods I Singular Value Decomposition

14 Singular Value Decomposition

Data Mining Lecture 4: Covariance, EVD, PCA & SVD

A Simple Algorithm for Nuclear Norm Regularized Problems

Restricted Boltzmann Machines for Collaborative Filtering

Latent Semantic Analysis. Hongning Wang

Let $H = I - 2\frac{uu^T}{u^Tu}$ be a Householder matrix. Then prove that $Hu = (I - 2\frac{uu^T}{u^Tu})u = u - 2u = -u$.

Recommendation Systems

1 Singular Value Decomposition and Principal Component

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

Principal Component Analysis (PCA)

Machine Learning (Spring 2012) Principal Component Analysis

Principal Component Analysis and Linear Discriminant Analysis

Lecture: Face Recognition and Feature Reduction

Linear Methods for Regression. Lijun Zhang

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Language Information Processing, Advanced. Topic Models

IV. Matrix Approximation using Least-Squares

15 Singular Value Decomposition

Clustering VS Classification

Relations Between Adjacency And Modularity Graph Partitioning: Principal Component Analysis vs. Modularity Component Analysis

Lecture Notes 10: Matrix Factorization

K-Means and Gaussian Mixture Models

Lecture: Face Recognition and Feature Reduction

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

CS540 Machine learning Lecture 5

Lecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University

DS-GA 1002 Lecture notes 10 November 23, Linear models

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

Maximum Margin Matrix Factorization

Basic Calculus Review

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

STA141C: Big Data & High Performance Statistical Computing

Transcription:

Collaborative Filtering: A Machine Learning Perspective. Chapter 6: Dimensionality Reduction. Benjamin Marlin. Presenter: Chaitanya Desai.

Topics we'll cover. Dimensionality reduction for rating prediction: Singular Value Decomposition and Principal Component Analysis.

Rating Prediction. Problem description: we have M distinct items and N distinct users in our corpus. Let $r^u_y$ be the rating assigned by user u to item y; thus we have (M-dimensional) rating vectors $r^u$ for the N users. Task: given the partially filled rating vector $r^a$ of an active user a, we want to estimate $\hat{r}^a_y$ for all items y that have not yet been rated by a.

Singular Value Decomposition. Given a data matrix D of size N × M, the SVD of D is $D = U \Sigma V^T$, where U (N × N) and V (M × M) are orthogonal and $\Sigma$ (N × M) is diagonal. The columns of U are eigenvectors of $D D^T$ and the columns of V are eigenvectors of $D^T D$. The diagonal entries of $\Sigma$ are the singular values (the square roots of the corresponding eigenvalues), ordered consistently with the eigenvectors (i.e. the columns of U and V).
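A minimal NumPy sketch of this decomposition (the data matrix D here is random and purely illustrative):

```python
import numpy as np

# Illustrative data matrix D (N x M).
N, M = 6, 4
rng = np.random.default_rng(0)
D = rng.normal(size=(N, M))

# Full SVD: D = U @ Sigma @ V^T
U, s, Vt = np.linalg.svd(D, full_matrices=True)
Sigma = np.zeros((N, M))
Sigma[:M, :M] = np.diag(s)

# The decomposition reconstructs D exactly.
assert np.allclose(D, U @ Sigma @ Vt)
# Columns of V are eigenvectors of D^T D, with eigenvalues equal to the squared singular values.
assert np.allclose(D.T @ D @ Vt.T, Vt.T * s**2)
```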

Low Rank Approximation. Given the solution to the SVD, we know that $\hat{D} = U_k \Sigma_k V_k^T$ is the best rank-k approximation to D under the Frobenius norm. The (squared) Frobenius error is given by $F(D - \hat{D}) = \sum_{n=1}^{N} \sum_{m=1}^{M} (D_{nm} - \hat{D}_{nm})^2$.
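A small illustration of rank-k truncation, again with NumPy (k and the matrix are arbitrary):

```python
import numpy as np

def low_rank_approx(D: np.ndarray, k: int) -> np.ndarray:
    """Best rank-k approximation of D under the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Squared Frobenius error F(D - D_hat), matching the slide's definition.
D = np.random.default_rng(1).normal(size=(6, 4))
D_hat = low_rank_approx(D, k=2)
print(np.sum((D - D_hat) ** 2))
```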

Weighted Low Rank Approximation. D contains many missing values, and the SVD of a matrix is undefined if there are missing values. Solution? Assign 0/1 weights to the elements of D (1 for observed entries, 0 for missing ones). Our goal is to find $\hat{D}$ so that the weighted Frobenius norm is minimized: $F_W(D - \hat{D}) = \sum_{n=1}^{N} \sum_{m=1}^{M} W_{nm} (D_{nm} - \hat{D}_{nm})^2$.
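A direct translation of the weighted criterion (a sketch; the NaN encoding of missing ratings is my own convention, not from the slides):

```python
import numpy as np

def weighted_frobenius(D: np.ndarray, D_hat: np.ndarray, W: np.ndarray) -> float:
    """Weighted squared Frobenius error F_W(D - D_hat); W is 0/1 with 1 = observed."""
    return float(np.sum(W * (D - D_hat) ** 2))

# Example: a small ratings matrix with NaN marking unobserved entries.
D = np.array([[5.0, np.nan, 3.0],
              [np.nan, 4.0, 1.0]])
W = (~np.isnan(D)).astype(float)     # 1 where a rating is observed
D_filled = np.nan_to_num(D)          # missing entries contribute nothing since W is 0 there
D_hat = np.full_like(D_filled, 3.0)  # a trivial constant "reconstruction"
print(weighted_frobenius(D_filled, D_hat, W))
```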

Weighted Low Rank Approximation. Srebro and Jaakkola, in their paper "Weighted Low Rank Approximations", provide two approaches to finding $\hat{D}$: numerical optimization using gradient descent in U and V, and Expectation Maximization (EM).

Generalized EM. Given a joint distribution $p(X, Z \mid \theta)$ over observed variables X and latent variables Z, governed by parameters $\theta$, the goal is to maximize the likelihood function $p(X \mid \theta)$ with respect to $\theta$.
1. Choose an initial setting for the parameters $\theta^{old}$.
2. E step: evaluate $p(Z \mid X, \theta^{old})$.
3. M step: evaluate $\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})$, where $Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)$ is the expectation of the complete-data log-likelihood for a general parameter $\theta$.
4. If the convergence criterion is not satisfied, set $\theta^{old} \leftarrow \theta^{new}$ and return to step 2.

EM for weighted SVD. E step: fill in the missing values of D from the low-rank reconstruction $\hat{D}$, forming a complete matrix X (element-wise products):
(1) $X = W \odot D + (1 - W) \odot \hat{D}$
M step: find the low-rank approximation using standard SVD on X, which is now completely specified:
$[U, \Sigma, V^T] = \mathrm{SVD}(X)$, $\hat{D} = U_k \Sigma_k V_k^T$

The complete algorithm.
Input: R, W, L, K. Output: $\Sigma$, V.
$\hat{R} \leftarrow 0$
while $F_W(R - \hat{R})$ not converged do
    $X \leftarrow W \odot R + (1 - W) \odot \hat{R}$
    $[U, \Sigma, V^T] \leftarrow \mathrm{SVD}(X)$
    $U \leftarrow U_L$, $\Sigma \leftarrow \Sigma_L$, $V \leftarrow V_L$
    $\hat{R} \leftarrow U \Sigma V^T$
    if $L > K$ then reduce L
end while
(The working rank L starts above the target rank K and is gradually reduced until it reaches K.)
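A compact NumPy sketch of this EM loop. The helper name, the linear rank-annealing schedule, and the convention that unobserved entries of R may hold arbitrary values are assumptions of mine:

```python
import numpy as np

def em_weighted_svd(R, W, L, K, max_iters=100, tol=1e-6):
    """EM-style weighted low-rank approximation of R.

    R : (N, M) rating matrix (values where W == 0 are ignored)
    W : (N, M) 0/1 observation matrix
    L : initial working rank (L >= K), annealed down to the target rank K
    """
    R_hat = np.zeros_like(R, dtype=float)
    prev_err = np.inf
    for _ in range(max_iters):
        # E step: fill in missing entries from the current reconstruction.
        X = W * R + (1 - W) * R_hat
        # M step: rank-L SVD of the completed matrix.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        U, s, Vt = U[:, :L], s[:L], Vt[:L, :]
        R_hat = U @ np.diag(s) @ Vt
        if L > K:
            L -= 1                            # anneal the working rank toward K
        err = np.sum(W * (R - R_hat) ** 2)    # weighted Frobenius error
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return np.diag(s), Vt.T, R_hat            # Sigma, V, and the reconstruction
```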

Rating Prediction using the new concept space. Given $\hat{R}$, the rating value for user u and item y is simply $\hat{r}^u_y = \hat{R}_{uy}$. For a new user, map the user's profile into the K-dimensional latent space: if r is the user's rating vector in the original space and l is the user's vector in the latent space, then $r = l \Sigma V^T$ and thus $l = r V \Sigma^{-1}$.

Algorithm (rating vector estimation).
Input: $r^a$, $w^a$, $\Sigma$, V, K. Output: $\hat{r}^a$.
$\hat{r}^a \leftarrow 0$
while $F_{w^a}(r^a - \hat{r}^a)$ not converged do
    $x \leftarrow w^a \odot r^a + (1 - w^a) \odot \hat{r}^a$
    $l^a \leftarrow x V \Sigma^{-1}$
    $\hat{r}^a \leftarrow l^a \Sigma V^T$
end while
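A sketch of this fold-in procedure for a new user, reusing the Sigma and V returned by the em_weighted_svd sketch above (function and variable names are mine):

```python
import numpy as np

def estimate_rating_vector(r_a, w_a, Sigma, V, max_iters=50, tol=1e-8):
    """Iteratively complete a new user's rating vector using the learned latent space.

    r_a   : (M,) partially observed ratings (values where w_a == 0 are ignored)
    w_a   : (M,) 0/1 observation mask
    Sigma : (K, K) diagonal matrix of retained singular values
    V     : (M, K) retained right singular vectors
    """
    Sigma_inv = np.diag(1.0 / np.diag(Sigma))
    r_hat = np.zeros_like(r_a, dtype=float)
    prev_err = np.inf
    for _ in range(max_iters):
        x = w_a * r_a + (1 - w_a) * r_hat   # fill in missing ratings
        l_a = x @ V @ Sigma_inv             # project into the latent space
        r_hat = l_a @ Sigma @ V.T           # reconstruct the full rating vector
        err = np.sum(w_a * (r_a - r_hat) ** 2)
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return r_hat
```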

Results of SVD. Data sets:
EachMovie: collected by the Compaq Systems Research Center over an 18-month period beginning in 1997. The base data set contains 72,916 users, 1,628 movies, and 2,811,983 ratings, on a scale from 1 to 6.
MovieLens: collected by the GroupLens research group at the University of Minnesota. Contains 6,040 users, 3,900 movies, and 1,000,209 ratings collected from users who joined the MovieLens recommendation service in 2000, on a scale from 1 to 5.

Results of SVD (continued). Results are reported as NMAE: $\mathrm{NMAE} = \frac{\mathrm{MAE}}{E[\mathrm{MAE}]}$, where $\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} |\hat{r}^u_y - r^u_y|$ over the N held-out ratings and $E[\mathrm{MAE}]$ is the expected MAE of random predictions. A value greater than 1 indicates performance worse than random.
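A small sketch of the NMAE computation; treating $E[\mathrm{MAE}]$ as the expected error when both prediction and rating are uniform over the rating scale is an assumption of mine:

```python
import numpy as np
from itertools import product

def nmae(predicted, actual, rating_values=(1, 2, 3, 4, 5)):
    """Normalized MAE: MAE divided by the expected MAE of uniformly random predictions."""
    mae = np.mean(np.abs(np.asarray(predicted) - np.asarray(actual)))
    # Expected |x - y| when prediction and rating are both uniform over the scale (1.6 for 1-5).
    expected_mae = np.mean([abs(x - y) for x, y in product(rating_values, repeat=2)])
    return mae / expected_mae

print(nmae([4, 3, 5], [5, 3, 4]))   # MAE = 2/3, so NMAE ~ 0.42
```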

Principal Component Analysis. The idea is to discover latent structure in the data. Let $A = \frac{1}{N-1} D^T D$ be the covariance matrix of D (with the columns of D mean-centered); A(i, j) indicates the covariance between items i and j. We are interested in retaining the K dimensions of highest variance; this is the subspace spanned by the eigenvectors of A with the K largest eigenvalues.
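A brief NumPy sketch of this computation (the mean-centering step is made explicit; the data matrix is illustrative):

```python
import numpy as np

# Illustrative user-by-item data matrix D (N users x M items).
rng = np.random.default_rng(2)
D = rng.normal(size=(100, 8))

# Item-item covariance: A = D_c^T D_c / (N - 1), with mean-centered columns.
D_c = D - D.mean(axis=0)
A = D_c.T @ D_c / (D.shape[0] - 1)

# Eigendecomposition; keep the K directions of highest variance.
K = 2
eigvals, eigvecs = np.linalg.eigh(A)               # ascending eigenvalues for symmetric A
top = eigvecs[:, np.argsort(eigvals)[::-1][:K]]    # M x K projection matrix

# Project users into the K-dimensional latent space.
latent = D_c @ top
print(latent.shape)   # (100, 2)
```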

Rating Prediction with PCA. PCA cannot be applied when D has missing data. Goldberg, Roeder, Gupta, and Perkins propose an algorithm called Eigentaste in their paper "Eigentaste: A Constant Time Collaborative Filtering Algorithm".

Eigentaste
- Pick a set of items, called the gauge set, that all users must rate.
- Map the gauge-set data into a lower-dimensional latent space and retain only the first 2 components.
- Cluster users in this 2-dimensional latent space using divisive hierarchical clustering.
- Compute the mean rating vector $\mu^c$ for each cluster c.
- For a new user, map the user's rating profile $r^a$ for the gauge set into the 2-dimensional concept space and determine which cluster the user belongs to.
- If an item y is not rated, assign it the mean rating vector's value for that item, i.e. $\hat{r}^a_y = \mu^c_y$.
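An end-to-end sketch of this pipeline. Two simplifications are mine: k-means in the 2-D concept space stands in for the paper's recursive rectangular clustering, and the data are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Dense gauge-set ratings (every user rates every gauge item) plus sparse other items.
n_users, n_gauge, n_items = 200, 10, 50
gauge = rng.integers(1, 6, size=(n_users, n_gauge)).astype(float)
other = np.where(rng.random((n_users, n_items)) < 0.3,
                 rng.integers(1, 6, size=(n_users, n_items)), np.nan)

# PCA on the gauge set, keeping the first 2 components.
gauge_c = gauge - gauge.mean(axis=0)
_, _, Vt = np.linalg.svd(gauge_c, full_matrices=False)
concept = gauge_c @ Vt[:2].T                       # users in the 2-D concept space

# Cluster users in concept space and compute per-cluster mean ratings.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(concept)
cluster_means = np.array([np.nanmean(other[km.labels_ == c], axis=0)
                          for c in range(km.n_clusters)])

# New user: rate the gauge set, project, find the cluster, predict unrated items.
new_gauge = rng.integers(1, 6, size=n_gauge).astype(float)
new_concept = (new_gauge - gauge.mean(axis=0)) @ Vt[:2].T
cluster = km.predict(new_concept.reshape(1, -1))[0]
print(cluster_means[cluster][:5])                  # predicted ratings for the first 5 items
```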

Problems with Eigentaste?
- The gauge set must be the same for all users, and all users must rate all gauge items.
- Rating some items is easier and faster than others (e.g. jokes vs. books).
- Selection of the gauge items matters: they should be good discriminators.