Collaborative Filtering. Radek Pelánek

Similar documents
Matrix Factorization Techniques for Recommender Systems

* Matrix Factorization and Recommendation Systems

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Recommendation Systems

CS 175: Project in Artificial Intelligence. Slides 4: Collaborative Filtering

Collaborative Filtering

Recommendation Systems

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Recommendation Systems

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Data Science Mastery Program

Matrix Factorization and Collaborative Filtering

Andriy Mnih and Ruslan Salakhutdinov

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

CS425: Algorithms for Web Scale Data

Using SVD to Recommend Movies

Collaborative Filtering Applied to Educational Data Mining

Generative Models for Discrete Data

Introduction to Logistic Regression

CS425: Algorithms for Web Scale Data

Scaling Neighbourhood Methods

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let's put it to use!

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

Impact of Data Characteristics on Recommender Systems Performance

Decoupled Collaborative Ranking

CS249: ADVANCED DATA MINING

Large-scale Information Processing, Summer Recommender Systems (part 2)

Collaborative Filtering on Ordinal User Feedback

Algorithms for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering

A Modified PMF Model Incorporating Implicit Item Associations

Data Mining Techniques

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

Matrix Factorization Techniques for Recommender Systems

Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Stanford University

Collaborative Recommendation with Multiclass Preference Context

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Domokos Miklós Kelen. Online Recommendation Systems. Eötvös Loránd University. Faculty of Natural Sciences. Advisor:

Introduction to Computational Advertising

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2016

Collaborative Filtering Matrix Completion Alternating Least Squares

Data Mining Techniques

CS246 Final Exam, Winter 2011

Large-scale Ordinal Collaborative Filtering

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties

Content-based Recommendation

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

Department of Computer Science, Guiyang University, Guiyang, Guizhou, China

a Short Introduction

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Collaborative topic models: motivations cont

The Pragmatic Theory solution to the Netflix Grand Prize

Predicting the Performance of Collaborative Filtering Algorithms

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January,

Ranking and Filtering

Assignment 2 (Sol.) Introduction to Machine Learning Prof. B. Ravindran

Linear and Logistic Regression. Dr. Xiaowei Huang

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Learning to Recommend Point-of-Interest with the Weighted Bayesian Personalized Ranking Method in LBSNs

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering

Joint user knowledge and matrix factorization for recommender systems

Statistical Machine Learning

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

Collaborative Filtering with Temporal Dynamics with Using Singular Value Decomposition

Leverage Sparse Information in Predictive Modeling

Gaussian Models

Machine learning for pervasive systems Classification in high-dimensional spaces

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

arxiv: v2 [cs.ir] 14 May 2018

A Simple Algorithm for Nuclear Norm Regularized Problems

Collaborative Filtering

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Matrix and Tensor Factorization from a Machine Learning Perspective


Structured matrix factorizations. Example: Eigenfaces

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Lecture Notes 10: Matrix Factorization

Prediction of Citations for Academic Papers

Mathematical Formulation of Our Example

Latent Semantic Analysis. Hongning Wang

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Click Prediction and Preference Ranking of RSS Feeds

Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks

DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Quick Introduction to Nonnegative Matrix Factorization

Online Advertising is Big Business

Lecture 7: DecisionTrees

Transcription:

Collaborative Filtering Radek Pelánek 2017

Notes on Lecture: the most technical lecture of the course; includes some scary-looking math, but typically with an intuitive interpretation; uses standard machine learning techniques, which are briefly described.

Collaborative Filtering: assumption: users with similar taste in the past will have similar taste in the future; requires only a matrix of ratings; applicable in many domains; widely used in practice.

Basic CF Approach: input: matrix of user-item ratings (with missing values, often very sparse); output: predictions for the missing values.

Netflix Prize: Netflix, video rental company; contest: 10% improvement of the quality of recommendations; prize: 1 million dollars; data: user ID, movie ID, time, rating.

Main CF Techniques: memory based: nearest neighbors (user-based, item-based); model based: latent factors, matrix factorization.

Neighborhood Methods: Illustration (figure from Matrix Factorization Techniques for Recommender Systems)

Latent Factors: Illustration (figure from Matrix Factorization Techniques for Recommender Systems)

Latent Factors: Netflix Data (figure from Matrix Factorization Techniques for Recommender Systems)

Ratings: explicit, e.g., stars (1 to 5 Likert scale); to consider: granularity, multidimensionality; issues: users may not be willing to rate, data sparsity. Implicit: proxy data for quality rating: clicks, page views, time on page. The following applies directly to explicit ratings; modifications may be needed for implicit ratings (or their combination).

Non-personalized Predictions: most popular items; compute the average rating for each item and recommend the items with the highest averages. Problems?

Non-personalized Predictions: averages, issues: number of ratings and uncertainty (an average of 5 from 3 ratings vs. an average of 4.9 from 100 ratings); bias, normalization: some users give systematically higher ratings (specific example later).
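
A common remedy for the uncertainty issue above is to shrink each item's average toward the global mean in proportion to how many ratings it has. A minimal sketch in Python; the damping constant C and the example numbers are assumptions, not from the lecture:

```python
def damped_mean(ratings, global_mean, C=25):
    """Item average shrunk toward the global mean; C acts like C virtual ratings."""
    return (sum(ratings) + C * global_mean) / (len(ratings) + C)

# with a global mean of 3.5, 3 ratings of 5 now score below 100 ratings averaging 4.9
print(damped_mean([5, 5, 5], 3.5))      # ~3.66
print(damped_mean([4.9] * 100, 3.5))    # ~4.62
```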

Exploitation vs Exploration: pure exploitation: always recommend the top items; but what if some other item is actually better and its rating is poorer just due to noise? Exploration: trying items to get more data. How to balance exploration and exploitation?

Multi-armed Bandit: standard model for exploitation vs exploration; arm: (unknown) probabilistic reward; how to choose arms to maximize reward? Well studied, many algorithms (e.g., upper confidence bounds); typical application: advertisements.

Core Idea: do not use just averages; quantify uncertainty (e.g., standard deviation); combine average & uncertainty for decisions. Example: TrueSkill, ranking of players (leaderboards).
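
One concrete way to combine average and uncertainty, in the spirit of the upper confidence bounds mentioned above, is to rank items by their mean rating plus a bonus that shrinks as ratings accumulate. A minimal sketch; the exploration constant c is an assumption:

```python
import math

def ucb_score(ratings, total_ratings, c=2.0):
    """Mean rating plus an uncertainty bonus (UCB1-style), larger for rarely rated items."""
    n = len(ratings)
    mean = sum(ratings) / n
    bonus = c * math.sqrt(math.log(total_ratings) / n)
    return mean + bonus

# the rarely rated item gets a larger bonus, so it is explored more often
print(ucb_score([5, 5, 5], total_ratings=1000))      # large bonus
print(ucb_score([4.9] * 100, total_ratings=1000))    # small bonus
```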

Note on Improving Performance: simple predictors often provide reasonable performance; further improvements are often small but can have a significant impact on behavior (not easy to evaluate; see the evaluation lecture). (source: Introduction to Recommender Systems, Xavier Amatriain)

User-based Nearest Neighbor CF: user Alice, item i not rated by Alice: find users similar to Alice who have rated i and compute an average to predict Alice's rating; recommend the items with the highest predicted ratings.

User-based Nearest Neighbor CF (source: Recommender Systems: An Introduction, slides)

User Similarity: Pearson correlation coefficient (alternatives: e.g., Spearman correlation coefficient, cosine similarity). (source: Recommender Systems: An Introduction, slides)

Pearson Correlation Coefficient: Reminder
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
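
A sketch of this similarity computed over the items two users have both rated; the dictionary data layout and the minimum-overlap check are assumptions, not from the lecture:

```python
import math

def pearson_sim(ratings_a, ratings_b):
    """Pearson correlation between two users, computed over co-rated items.

    ratings_a, ratings_b: dicts mapping item id -> rating.
    """
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0  # too little overlap to estimate similarity
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den_a = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    den_b = math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    if den_a == 0 or den_b == 0:
        return 0.0
    return num / (den_a * den_b)

alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 5}
print(pearson_sim(alice, bob))  # ~0.65
```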

Making Predictions: Naive. $r_{ai}$: rating of user a for item i; neighbors N = the k most similar users; prediction = average of the neighbors' ratings: $pred(a, i) = \frac{\sum_{b \in N} r_{bi}}{|N|}$. Improvements?

Making Predictions: Naive. $r_{ai}$: rating of user a for item i; neighbors N = the k most similar users; prediction = average of the neighbors' ratings: $pred(a, i) = \frac{\sum_{b \in N} r_{bi}}{|N|}$. Improvements? User bias: consider the difference from the average rating, $(r_{bi} - \bar{r}_b)$; user similarities: weighted average with weight $sim(a, b)$.

Making Predictions: $pred(a, i) = \bar{r}_a + \frac{\sum_{b \in N} sim(a, b)\,(r_{bi} - \bar{r}_b)}{\sum_{b \in N} sim(a, b)}$, where $r_{ai}$ is the rating of user a for item i and $\bar{r}_a, \bar{r}_b$ are user averages.
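
A sketch of this prediction, reusing pearson_sim from the earlier snippet; the neighborhood size k, the handling of non-positive similarities, and the data layout are assumptions:

```python
def predict_user_based(ratings, user, item, k=20):
    """Predict a rating for (user, item) from the k most similar users who rated the item.

    ratings: dict mapping user id -> {item id: rating}.
    """
    user_mean = sum(ratings[user].values()) / len(ratings[user])
    # similarity to every other user who has rated the target item
    candidates = [
        (pearson_sim(ratings[user], ratings[b]), b)
        for b in ratings
        if b != user and item in ratings[b]
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    num = den = 0.0
    for sim, b in neighbors:
        if sim <= 0:
            continue  # skip dissimilar neighbors
        b_mean = sum(ratings[b].values()) / len(ratings[b])
        num += sim * (ratings[b][item] - b_mean)
        den += sim
    return user_mean if den == 0 else user_mean + num / den
```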

Improvements: number of co-rated items; agreement on more exotic items is more important; case amplification: more weight to very similar neighbors; neighbor selection.

Item-based Collaborative Filtering: compute similarity between items and use this similarity to predict ratings; more computationally efficient, since often the number of items << the number of users; practical advantage (over user-based filtering): feasible to check the results using intuition.

Item-based Nearest Neighbor CF (source: Recommender Systems: An Introduction, slides)

Cosine Similarity (illustration: Alice's and Bob's ratings as vectors): $\cos(\alpha) = \frac{A \cdot B}{\|A\|\,\|B\|}$

Similarity, Predictions: (adjusted) cosine similarity, similar to Pearson's r, works slightly better; $pred(u, p) = \frac{\sum_{i \in R} sim(i, p)\, r_{ui}}{\sum_{i \in R} sim(i, p)}$; neighborhood size limited (20 to 50).
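
A sketch of the adjusted cosine similarity between two items (each user's mean rating subtracted before the cosine); the data layout mirrors the earlier snippets and is an assumption:

```python
import math

def adjusted_cosine(ratings, item_i, item_j):
    """Cosine similarity between two items after subtracting each user's mean rating.

    ratings: dict mapping user id -> {item id: rating}.
    """
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean_u = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[item_i] - mean_u
            dj = user_ratings[item_j] - mean_u
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / (math.sqrt(den_i) * math.sqrt(den_j))
```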

Notes on Similarity Measures: Pearson's r? (adjusted) cosine similarity? other? There is no fundamental reason for choosing one metric; the choice is mostly based on practical experience and may depend on the application.

Preprocessing: $O(N^2)$ similarity calculations, still large; original article: item-item recommendations by Amazon (2003); calculate similarities in advance (periodic update); supposed to be stable, item relations are not expected to change quickly; reductions (minimum number of co-ratings, etc.).

Matrix Factorization CF: main idea: latent factors of users/items; use these to predict ratings; related to singular value decomposition.

Notes: singular value decomposition (SVD) is a theorem in linear algebra; in the CF context the name SVD is usually used for an approach only slightly related to the SVD theorem; related to latent semantic analysis; introduced during the Netflix Prize in a blog post (Simon Funk): http://sifter.org/~simon/journal/20061211.html

Singular Value Decomposition (Linear Algebra): $X = USV^T$; U, V orthogonal matrices; S diagonal matrix, its diagonal entries are the singular values; low-rank matrix approximation (use only the top k singular values). http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html
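
A minimal sketch of the low-rank approximation with NumPy; the toy ratings matrix and the choice k = 2 are assumptions:

```python
import numpy as np

# toy ratings matrix (users x items), fully observed for illustration
X = np.array([[5, 4, 1, 1],
              [4, 5, 1, 2],
              [1, 1, 5, 4],
              [2, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # keep only the top-k singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(X_k, 2))  # rank-2 approximation of X
```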

SVD CF Interpretation: $X = USV^T$; X: matrix of ratings; U: user-factor strengths; V: item-factor strengths; S: importance of factors.

Latent Factors (figure from Matrix Factorization Techniques for Recommender Systems)


Sidenote: Embeddings, Word2vec

Missing Values: matrix factorization techniques (SVD) work with a full matrix, but ratings form a sparse matrix; solutions: value imputation (expensive, imprecise) or alternative algorithms (greedy, heuristic): gradient descent, alternating least squares.

Notation: u user, i item; $r_{ui}$ rating; $\hat{r}_{ui}$ predicted rating; $b, b_u, b_i$ biases; $q_i, p_u$ latent factor vectors (length k).

Simple Baseline Predictors [note: always use baseline methods in your experiments]: naive: $\hat{r}_{ui} = \mu$, where $\mu$ is the global mean; biases: $\hat{r}_{ui} = \mu + b_u + b_i$, where $b_u, b_i$ are biases (average deviations); some users/items have systematically higher/lower ratings.

Latent Factors (for a while assume centered data without bias): $\hat{r}_{ui} = q_i^T p_u$, a vector (dot) product; user-item interaction via latent factors; illustration (3 factors): user $p_u$ = (0.5, 0.8, 0.3), item $q_i$ = (0.4, 0.1, 0.8).
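
For the illustrated vectors, the predicted (centered) rating is just the dot product:

$$\hat{r}_{ui} = q_i^T p_u = 0.4 \cdot 0.5 + 0.1 \cdot 0.8 + 0.8 \cdot 0.3 = 0.20 + 0.08 + 0.24 = 0.52$$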

Latent Factors: vector multiplication $\hat{r}_{ui} = q_i^T p_u$; user-item interaction via latent factors; we need to find $q_i, p_u$ from the data (cf. content-based techniques); note: $q_i$ and $p_u$ are found at the same time.

Learning Factor Vectors: we want to minimize the squared errors (related to RMSE, more details later): $\min_{q,p} \sum_{(u,i) \in T} (r_{ui} - q_i^T p_u)^2$; with regularization to avoid overfitting (standard machine learning approach): $\min_{q,p} \sum_{(u,i) \in T} (r_{ui} - q_i^T p_u)^2 + \lambda(\|q_i\|^2 + \|p_u\|^2)$. How to find the minimum?

Stochastic Gradient Descent: standard technique in machine learning; greedy, may end up in a local minimum.

Gradient Descent for CF: prediction error $e_{ui} = r_{ui} - q_i^T p_u$; update (in parallel): $q_i := q_i + \gamma(e_{ui} p_u - \lambda q_i)$, $p_u := p_u + \gamma(e_{ui} q_i - \lambda p_u)$; the math behind the equations: gradient = partial derivatives; $\gamma, \lambda$ are constants, set pragmatically: learning rate $\gamma$ (0.005 for Netflix), regularization $\lambda$ (0.02 for Netflix).
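
A sketch of these updates as a training loop; the learning rate and regularization follow the Netflix values above, while the initialization scale, epoch count, and data layout are assumptions:

```python
import random

def train_mf(ratings, n_factors=3, n_epochs=50, gamma=0.005, lam=0.02, seed=0):
    """Stochastic gradient descent for r_ui ~ q_i . p_u (centered data, no biases).

    ratings: iterable of (user, item, rating) triples.
    Returns dicts of user and item factor vectors.
    """
    rng = random.Random(seed)
    ratings = list(ratings)  # local copy, shuffled each epoch
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        rng.shuffle(ratings)
        for u, i, r in ratings:
            e = r - sum(qf * pf for qf, pf in zip(q[i], p[u]))  # prediction error e_ui
            for f in range(n_factors):
                qf, pf = q[i][f], p[u][f]
                q[i][f] += gamma * (e * pf - lam * qf)  # parallel update: use old values
                p[u][f] += gamma * (e * qf - lam * pf)
    return p, q
```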

Advice: if you want to learn/understand gradient descent (and also many other machine learning notions), experiment with linear regression; it can be (simply) approached in many ways: analytic solution, gradient descent, brute-force search; easy to visualize; good for intuitive understanding; relatively easy to derive the equations (one of the examples in IV122 Math & programming).

Advice II recommended sources: Koren, Yehuda, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer 42.8 (2009): 30-37. Koren, Yehuda, and Robert Bell. Advances in collaborative filtering. Recommender Systems Handbook. Springer US, 2011. 145-186.

Adding Bias: predictions: $\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$; function to minimize: $\min_{q,p,b} \sum_{(u,i) \in T} (r_{ui} - \mu - b_u - b_i - q_i^T p_u)^2 + \lambda(\|q_i\|^2 + \|p_u\|^2 + b_u^2 + b_i^2)$.
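
The SGD updates extend in the same pattern; a sketch of the bias updates (not spelled out on the slide, but obtained from the same partial derivatives as above):

$$b_u := b_u + \gamma(e_{ui} - \lambda b_u), \qquad b_i := b_i + \gamma(e_{ui} - \lambda b_i), \qquad \text{where } e_{ui} = r_{ui} - \hat{r}_{ui}$$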

Improvements: additional data sources (implicit ratings); varying confidence levels; temporal dynamics.

Temporal Dynamics: Netflix data (figure from Y. Koren, Collaborative Filtering with Temporal Dynamics)

Temporal Dynamics: Netflix data, jump early in 2004 (figure from Y. Koren, Collaborative Filtering with Temporal Dynamics)

Temporal Dynamics: baseline = behaviour influenced by exterior considerations; interaction = behaviour explained by the match between users and items. (Y. Koren, Collaborative Filtering with Temporal Dynamics)

Results for Netflix Data (figure from Matrix Factorization Techniques for Recommender Systems)

Slope One (Slope One Predictors for Online Rating-Based Collaborative Filtering): predict a rating from the rating of another item plus the average difference between the two items; average over such simple predictions.

Slope One: accurate within reason; easy to implement; updateable on the fly; efficient at query time; expect little from first visitors.
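
A sketch of the (weighted) Slope One prediction described above; the dictionary data layout follows the earlier snippets and is an assumption:

```python
from collections import defaultdict

def slope_one(ratings, user, item):
    """Weighted Slope One prediction of user's rating for item.

    ratings: dict mapping user id -> {item id: rating}.
    """
    # average deviation of `item` from each other item j, with support counts
    dev_sum = defaultdict(float)
    count = defaultdict(int)
    for other in ratings.values():
        if item in other:
            for j, r_j in other.items():
                if j != item:
                    dev_sum[j] += other[item] - r_j
                    count[j] += 1
    num = den = 0.0
    for j, r_uj in ratings[user].items():
        if count[j] > 0:
            dev = dev_sum[j] / count[j]
            num += (r_uj + dev) * count[j]  # weight by number of co-ratings
            den += count[j]
    return None if den == 0 else num / den

data = {
    "alice": {"a": 5, "b": 3},
    "bob":   {"a": 3, "b": 2, "c": 5},
    "carol": {"b": 4, "c": 3},
}
print(slope_one(data, "alice", "c"))  # 5.0
```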

Other CF Techniques: clustering, association rules, classifiers.

Clustering: main idea: cluster similar users; non-personalized predictions ("popularity") for each cluster.

Clustering (source: Introduction to Recommender Systems, Xavier Amatriain)

Clustering: unsupervised machine learning; many algorithms: k-means, the EM algorithm, ...

Clustering: K-means
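
A sketch of cluster-based prediction: group users by k-means on their rating vectors and answer with per-cluster item averages. The imputation with item means (used only for clustering), the number of clusters, and the toy data are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_predictions(R, n_clusters=2, seed=0):
    """Cluster users on their rating vectors; predict with per-cluster item averages.

    R: users x items array with np.nan for missing ratings.
    """
    item_means = np.nanmean(R, axis=0)
    X = np.where(np.isnan(R), item_means, R)  # impute only for clustering
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)

    pred = np.empty_like(R)
    for c in range(n_clusters):
        cluster_means = np.nanmean(R[labels == c], axis=0)
        # fall back to global item means where a cluster has no ratings for an item
        cluster_means = np.where(np.isnan(cluster_means), item_means, cluster_means)
        pred[labels == c] = cluster_means
    return pred

R = np.array([[5, 4, np.nan, 1],
              [4, np.nan, 1, 2],
              [1, 2, 5, np.nan]], dtype=float)
print(cluster_predictions(R))
```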

Association Rules: relationships among items, e.g., common purchases; famous example (google it for more details): beer and diapers; "Customers Who Bought This Item Also Bought..."; advantage: provides explanations, useful for building trust.
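
A toy sketch of the "also bought" idea: rank co-purchased items by the confidence of the rule {item} -> {other}. The basket data, the minimum support threshold, and the confidence-only ranking are assumptions:

```python
from collections import Counter

def also_bought(baskets, item, min_support=2):
    """Rank items by confidence of the rule {item} -> {other} over purchase baskets.

    baskets: list of sets of item ids (toy stand-in for transaction data).
    """
    item_count = sum(1 for b in baskets if item in b)
    pair_counts = Counter()
    for b in baskets:
        if item in b:
            for other in b - {item}:
                pair_counts[other] += 1
    return sorted(
        ((other, n / item_count) for other, n in pair_counts.items() if n >= min_support),
        key=lambda x: -x[1],
    )

baskets = [{"beer", "diapers", "chips"}, {"beer", "diapers"},
           {"beer", "bread"}, {"diapers", "wipes"}]
print(also_bought(baskets, "beer"))  # [('diapers', 0.666...)]
```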

Classifiers: general machine learning techniques; positive/negative classification; train and test sets; logistic regression, support vector machines, decision trees, Bayesian techniques, ...

Limitations of CF: cold start problem; popularity bias: difficult to recommend items from the long tail; impact of noise (e.g., one account used by different people); possibility of attacks.

Cold Start Problem: How to recommend new items? What to recommend to new users?

Cold Start Problem: use another method (non-personalized, content-based, ...) in the initial phase; ask/force users to rate items; use defaults (means); better algorithms, e.g., recursive CF.

Collaborative Filtering: Summary: requires only ratings, widely applicable; neighborhood methods, latent factors; use of machine learning techniques.