Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

Similar documents
Models, Data, Learning Problems

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Collaborative Filtering. Radek Pelánek

Matrix Factorization Techniques for Recommender Systems

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Recommendation Systems

Bayesian Learning (II)

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let s put it to use!

Linear Classifiers IV

Matrix Factorization Techniques for Recommender Systems

Point-of-Interest Recommendations: Learning Potential Check-ins from Friends

Andriy Mnih and Ruslan Salakhutdinov

Recommendation Systems

CS425: Algorithms for Web Scale Data

Recommendation Systems

* Matrix Factorization and Recommendation Systems

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Collaborative Filtering

Algorithms for Collaborative Filtering

Data Mining Techniques

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

Matrix Factorization and Collaborative Filtering

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

CS249: ADVANCED DATA MINING

Generative Models for Discrete Data

Matrix Factorization and Factorization Machines for Recommender Systems

Graphical Models for Collaborative Filtering

Machine Teaching. for Personalized Education, Security, Interactive Machine Learning. Jerry Zhu

Collaborative topic models: motivations cont

Restricted Boltzmann Machines for Collaborative Filtering

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Machine Learning Techniques

Large-scale Collaborative Ranking in Near-Linear Time

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces

6.034 Introduction to Artificial Intelligence

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Language Models. Tobias Scheffer

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Overfitting, Bias / Variance Analysis

Collaborative Filtering Matrix Completion Alternating Least Squares

Presentation in Convex Optimization

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Linear Regression (continued)

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

Nonnegative Matrix Factorization

Maximum Margin Matrix Factorization for Collaborative Ranking

Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso

Neural networks and optimization

APPLICATIONS OF MINING HETEROGENEOUS INFORMATION NETWORKS

Large-scale Ordinal Collaborative Filtering

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

From Non-Negative Matrix Factorization to Deep Learning

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis. Decision Trees

CS 175: Project in Artificial Intelligence. Slides 4: Collaborative Filtering

Decoupled Collaborative Ranking

Big Data Analytics. Lucas Rego Drumond

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA. Tobias Scheffer

Midterm exam CS 189/289, Fall 2015

Optimization Methods for Machine Learning

6.036 midterm review. Wednesday, March 18, 15

arxiv: v2 [cs.ir] 14 May 2018

ECS289: Scalable Machine Learning

Modeling User Rating Profiles For Collaborative Filtering

Data Science Mastery Program

Introduction to Logistic Regression

Kernel methods, kernel SVM and ridge regression

l 1 and l 2 Regularization

CS425: Algorithms for Web Scale Data

Predictive analysis on Multivariate, Time Series datasets using Shapelets

Lecture Notes 10: Matrix Factorization

Mixed Membership Matrix Factorization

Lecture 2 Machine Learning Review

Probabilistic Graphical Models & Applications

Bits of Machine Learning Part 1: Supervised Learning

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Large-scale Information Processing, Summer Recommender Systems (part 2)

Collaborative Filtering on Ordinal User Feedback

Machine Learning And Applications: Supervised Learning-SVM

SQL-Rank: A Listwise Approach to Collaborative Ranking

Using SVD to Recommend Movies

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Collaborative Recommendation with Multiclass Preference Context

Data Mining Techniques

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Mixed Membership Matrix Factorization

Statistical Data Mining and Machine Learning Hilary Term 2016

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Kernel Methods and Support Vector Machines

CSCI567 Machine Learning (Fall 2014)

Collaborative Filtering: A Machine Learning Perspective

1-Bit Matrix Completion

Max Margin-Classifier

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Transcription:

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Recommendation Tobias Scheffer

Recommendation Engines. Recommendation of products, music, contacts, ... Based on user features, item features, and past transactions: sales, reviews, clicks, ... User-specific recommendations, no global ranking of items. Feedback loop: the choice of recommendations influences the available transaction and click data. 2

Netflix Prize. Data analysis challenge, 2006-2009. Netflix made rating data available: 500,000 users, 18,000 movies, 100 million ratings. Challenge: predict ratings that were held back for evaluation; improve by 10% over Netflix's own recommender. Award: $1 million. 3

Problem Setting. Users $U = \{1, \dots, m\}$, items $X = \{1, \dots, m'\}$, ratings $Y = \{(u_1, x_1, y_1), \dots, (u_n, x_n, y_n)\}$. Rating space $y_i \in \Upsilon$, e.g., $\Upsilon = \{-1, +1\}$ or $\Upsilon = \{\star, \dots, \star\star\star\star\star\}$. Loss function $\ell(y_i, y_j)$; e.g., missing a good movie is bad, but watching a terrible movie is worse. Find a rating model $f_\theta: (u, x) \mapsto y$. 4

Problem Setting: Matrix Notation. Users $U = \{1, \dots, m\}$, items $X = \{1, \dots, m'\}$. The ratings form an incomplete matrix whose rows are users and whose columns are items, e.g., for users 1-3 and items 1-3:
$Y = \begin{pmatrix} y_{11} & y_{12} & \\ y_{21} & & y_{23} \\ & & y_{33} \end{pmatrix}$
Rating space $y_i \in \Upsilon$, e.g., $\Upsilon = \{-1, +1\}$ or $\Upsilon = \{\star, \dots, \star\star\star\star\star\}$. Loss function $\ell(y_i, y_j)$. 5

Problem Setting. Model $f_\theta(u, x)$. Find model parameters that minimize the risk: $\theta = \arg\min_\theta \int \ell(y, f_\theta(u, x))\, p(u, x, y)\, du\, dx\, dy$. As usual, $p(u, x, y)$ is unknown, so minimize the regularized empirical risk instead: $\theta = \arg\min_\theta \sum_{i=1}^n \ell(y_i, f_\theta(u_i, x_i)) + \lambda \Omega(\theta)$. 6

Content-Based Recommendation. Idea: a user may like movies that are similar to other movies which they like. Requirement: attributes of items, e.g., tags, genre, actors, director, ... 7

Content-Based Recommendation. Feature space for items, e.g., $\Phi = (\text{comedy}, \text{action}, \text{year}, \text{dir\_tarantino}, \text{dir\_cameron})^T$ and $\phi(\text{Avatar}) = (0, 1, 2009, 0, 1)^T$. 8

Content-Based Recommendation. Users $U = \{1, \dots, m\}$, items $X = \{1, \dots, m'\}$, ratings $Y = \{(u_1, x_1, y_1), \dots, (u_n, x_n, y_n)\}$. Rating space $y_i \in \Upsilon$, e.g., $\Upsilon = \{-1, +1\}$ or $\Upsilon = \{\star, \dots, \star\star\star\star\star\}$. Loss function $\ell(y_i, y_j)$; e.g., missing a good movie is bad, but watching a terrible movie is worse. Feature function for items: $\phi: x \mapsto \mathbb{R}^d$. Find a rating model $f_\theta: (u, x) \mapsto y$. 9

Independent Learning Problems for Users. Minimize the regularized empirical risk $\theta = \arg\min_\theta \sum_{i=1}^n \ell(y_i, f_\theta(u_i, x_i)) + \lambda\Omega(\theta)$. One model per user: $f_{\theta_u}(x) \in \Upsilon$. One learning problem per user: $\theta_u = \arg\min_{\theta_u} \sum_{i: u_i = u} \ell(y_i, f_{\theta_u}(x_i)) + \lambda\Omega(\theta_u)$. 10

Independent Learning Problems for Users. One learning problem per user: $\forall u: \theta_u = \arg\min_{\theta_u} \sum_{i: u_i = u} \ell(y_i, f_{\theta_u}(x_i)) + \lambda\Omega(\theta_u)$. Use any model class and learning mechanism, e.g., $f_{\theta_u}(x_i) = \phi(x_i)^T \theta_u$. Logistic loss + $\ell_2$ regularization: logistic regression. Hinge loss + $\ell_2$ regularization: SVM. Squared loss + $\ell_2$ regularization: ridge regression. 11
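The squared-loss case (one ridge regression per user) is easy to make concrete. Below is a minimal sketch, assuming item feature vectors $\phi(x)$ are given as numpy arrays and ratings as (user, item, rating) triples; the function and variable names are illustrative, not part of the lecture.

```python
import numpy as np

# Sketch: independent per-user models with squared loss + l2 regularization,
# i.e., one closed-form ridge regression per user. Illustrative names.

def fit_per_user_ridge(ratings, item_features, lam=1.0):
    """ratings: list of (u, x, y); item_features: dict x -> numpy feature vector."""
    d = len(next(iter(item_features.values())))
    theta = {}
    for u in {u for u, _, _ in ratings}:
        # collect only the items rated by this user
        Phi_u = np.array([item_features[x] for uu, x, _ in ratings if uu == u])
        y_u = np.array([y for uu, _, y in ratings if uu == u])
        # closed-form ridge solution: (Phi^T Phi + lam I)^-1 Phi^T y
        theta[u] = np.linalg.solve(Phi_u.T @ Phi_u + lam * np.eye(d), Phi_u.T @ y_u)
    return theta

def predict(theta, item_features, u, x):
    return item_features[x] @ theta[u]
```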

Independent Learning Problems for Users. Obvious disadvantages of independent problems: commonalities of users are not exploited, a user does not benefit from ratings given by other users, and recommendations are poor for users who gave few ratings. Rather use a joint prediction model: recommendations for each user should benefit from other users' ratings. 12

Independent Learning Problems. [Figure: parameter vectors $\theta_1, \theta_2$ of independent prediction models for users; the regularizer pulls each of them toward zero.] 13

Joint Learning Problem. [Figure: parameter vectors $\theta_1, \theta_2$ of the users' prediction models; the regularizer pulls them toward a shared, non-zero mean.] 14

Joint Learning Problem. Standard $\ell_2$ regularization follows from the assumption that the model parameters are governed by a normal distribution with mean vector zero. Instead, assume that there is a non-zero population mean vector. 15

Joint Learning Problem. [Figure: graphical model of the hierarchical prior; a population mean vector $\theta^*$ with prior mean $0$ and covariance $\Sigma$ is the common parent of the user-specific parameter vectors $\theta_1, \theta_2, \dots, \theta_m$.] 16

Joint Learning Problem. Population mean vector $\theta^* \sim N(0, \frac{1}{\lambda^*} I)$. User-specific parameter vector: $\theta_u \sim N(\theta^*, \frac{1}{\lambda} I)$. Substitution: $\theta_u = \theta^* + \theta_u'$; now $\theta^*$ and $\theta_u'$ have mean vector zero. Negative log-prior = regularizer: $\Omega(\theta^* + \theta_u') = \lambda^* \|\theta^*\|^2 + \lambda \|\theta_u'\|^2$. 17

Joint Learning Problem. Joint optimization problem: $\min_{\theta_1', \dots, \theta_m', \theta^*} \sum_u \sum_{i: u_i = u} \ell(y_i, f_{\theta_u' + \theta^*}(x_i)) + \lambda \sum_u \Omega(\theta_u') + \lambda^* \Omega(\theta^*)$, with $\theta_u = \theta^* + \theta_u'$. The parameters $\theta_u'$ are independent, $\theta^*$ is shared; hence the $\theta_u$ are coupled. $\lambda$ controls the coupling strength, $\lambda^*$ the global regularization. 18
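For the squared loss and a linear model $f_{\theta^* + \theta_u'}(x) = \phi(x)^T(\theta^* + \theta_u')$, this coupled objective can be minimized, for example, by plain batch gradient descent. A minimal sketch under these assumptions (names such as `fit_joint` are illustrative):

```python
import numpy as np

# Sketch: joint learning with a shared vector theta_star and user-specific
# offsets theta_prime[u]; batch gradient descent on squared loss.

def fit_joint(ratings, item_features, lam=1.0, lam_star=1.0, lr=0.01, epochs=200):
    users = sorted({u for u, _, _ in ratings})
    d = len(next(iter(item_features.values())))
    theta_star = np.zeros(d)
    theta_prime = {u: np.zeros(d) for u in users}
    for _ in range(epochs):
        # gradients of the regularizers lam*||theta'_u||^2 and lam*||theta*||^2
        g_star = 2 * lam_star * theta_star
        g_prime = {u: 2 * lam * theta_prime[u] for u in users}
        # gradients of the squared-loss terms
        for u, x, y in ratings:
            phi = item_features[x]
            err = phi @ (theta_star + theta_prime[u]) - y
            g_star += 2 * err * phi
            g_prime[u] += 2 * err * phi
        theta_star -= lr * g_star
        for u in users:
            theta_prime[u] -= lr * g_prime[u]
    return theta_star, theta_prime
```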

Discussion. Each user benefits from other users' ratings. Does not take into account that users have different tastes: two sci-fi fans may have similar preferences, but a horror-movie fan and a romantic-comedy fan do not. Idea: look at the ratings to determine how similar users are. 19

Collaborative Filtering Idea: People like items that are liked by people who have similar preferences. People who give similar ratings to items probably have similar preferences. This is independent of item features. 20

Collaborative Filtering. Users $U = \{1, \dots, m\}$, items $X = \{1, \dots, m'\}$, ratings $Y = \{(u_1, x_1, y_1), \dots, (u_n, x_n, y_n)\}$. Rating space $y_i \in \Upsilon$, e.g., $\Upsilon = \{-1, +1\}$ or $\Upsilon = \{\star, \dots, \star\star\star\star\star\}$. Loss function $\ell(y_i, y_j)$. Find a rating model $f_\theta: (u, x) \mapsto y$. 21

Collaborative Filtering by Nearest Neighbor. Define a distance function on users: $d(u, u')$. Predicted rating: $f_\theta(u, x) = \frac{1}{k} \sum_{u_i \in k\text{ nearest neighbors of } u} y_{u_i, x}$. The predicted rating is the average rating of the $k$ nearest neighbors in terms of $d(u, u')$. No learning involved; performance hinges on $d(u, u')$. 22

Collaborative Filtering by Nearest Neighbor. Define a distance function on users: $d(u, u') = \sqrt{\frac{1}{m'} \sum_{x=1}^{m'} (y_{u,x} - y_{u',x})^2}$, the Euclidean (root-mean-square) distance between the two users' ratings over all items. Skip items that have not been rated by both users, i.e., average only over the jointly rated items. 23

Extensions. Normalize ratings (subtract the user's mean rating, divide by the user's standard deviation). Weight the influence of neighbors by the inverse of their distance. Weight the influence of neighbors by the number of jointly rated items. With inverse-distance weighting: $f_\theta(u, x) = \frac{\sum_{u_i \in k\text{ NN of } u} \frac{1}{d(u, u_i)} y_{u_i, x}}{\sum_{u_i \in k\text{ NN of } u} \frac{1}{d(u, u_i)}}$. 24

Collaborative Filtering: Example. Items: Matrix, Zombiland, Titanic, Death Proof (columns); users: Alice, Bob, Carol (rows); missing entries left blank:
$Y = \begin{pmatrix} 4 & & 5 & 4 \\ 5 & 5 & 1 & \\ 5 & 3 & 4 & \end{pmatrix}$
How much would Alice enjoy Zombiland? 25

Collaborative Filtering: Example (items M, Z, T, D; users Alice, Bob, Carol; rating matrix $Y$ as above). With $d(u, u') = \sqrt{\frac{1}{m'} \sum_{x=1}^{m'} (y_{u,x} - y_{u',x})^2}$ over the jointly rated items: $d(A, B) = \,?$, $d(A, C) = \,?$, $d(B, C) = \,?$ 26

Collaborative Filtering: Example (items M, Z, T, D; users Alice, Bob, Carol; rating matrix $Y$ as above). With the same distance function: $d(A, B) = 2.9$, $d(A, C) = 1$, $d(B, C) = 1.4$. 27

Collaborative Filtering: Example. Inverse-distance-weighted prediction from the $k = 2$ nearest neighbors: $f_\theta(A, Z) = \frac{\sum_{u_i \in 2\text{ NN of } A} \frac{1}{d(A, u_i)} y_{u_i, Z}}{\sum_{u_i \in 2\text{ NN of } A} \frac{1}{d(A, u_i)}} = \,?$ 28

Collaborative Filtering: Example. $f_\theta(A, Z) = \frac{\sum_{u_i \in 2\text{ NN of } A} \frac{1}{d(A, u_i)} y_{u_i, Z}}{\sum_{u_i \in 2\text{ NN of } A} \frac{1}{d(A, u_i)}} = \frac{\frac{1}{2.9} \cdot 5 + \frac{1}{1} \cdot 3}{\frac{1}{2.9} + \frac{1}{1}} \approx 3.5$. 29
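The computation in this example can be reproduced in a few lines. A minimal sketch, assuming the toy rating matrix reconstructed above with NaN marking missing entries (all names are illustrative); it prints a prediction of roughly 3.5:

```python
import numpy as np

# Sketch: nearest-neighbor collaborative filtering with inverse-distance
# weighting on the toy matrix (rows: Alice, Bob, Carol; columns: M, Z, T, D).
Y = np.array([[4, np.nan, 5, 4],
              [5, 5, 1, np.nan],
              [5, 3, 4, np.nan]])

def dist(y_u, y_v):
    both = ~np.isnan(y_u) & ~np.isnan(y_v)   # jointly rated items only
    return np.sqrt(np.mean((y_u[both] - y_v[both]) ** 2))

def predict_knn(Y, u, x, k=2):
    # neighbors of u that have rated item x, sorted by distance to u
    cands = [v for v in range(len(Y)) if v != u and not np.isnan(Y[v, x])]
    cands.sort(key=lambda v: dist(Y[u], Y[v]))
    nn = cands[:k]
    w = np.array([1.0 / dist(Y[u], Y[v]) for v in nn])
    return w @ Y[nn, x] / w.sum()

print(predict_knn(Y, u=0, x=1))   # Alice's predicted rating for Zombiland, roughly 3.5
```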

Collaborative Filtering: Discussion. K-nearest-neighbor and similar methods are called memory-based approaches. There are no model parameters, and no optimization criterion is being optimized. Each prediction requires an iteration over all training instances, which is impractical! Better to train a model by minimizing an appropriate loss function over a space of model parameters, then use the model to make predictions quickly. 30

Latent Features. Idea: instead of an ad-hoc definition of the distance between users, learn features that actually represent preferences. If, for every user $u$, we had a feature vector $\psi_u$ that describes their preferences, then we could learn parameters $\theta_x$ for item $x$ such that $\theta_x^T \psi_u$ quantifies how much $u$ enjoys $x$. 31

Latent Features. Or, turned around: if, for every item $x$, we had a feature vector $\phi_x$ that characterizes its properties, we could learn parameters $\theta_u$ such that $\theta_u^T \phi_x$ quantifies how much $u$ enjoys $x$. In practice, some user attributes $\psi_u$ and item attributes $\phi_x$ are usually available, but they are insufficient to understand $u$'s preferences and $x$'s relevant properties. 32

Latent Features. Idea: construct user attributes $\psi_u$ and item attributes $\phi_x$ such that the ratings in the training data can be predicted accurately. Decision function: $f_{\Psi,\Phi}(u, x) = \psi_u^T \phi_x$. The prediction is the product of user preferences and item properties. Model parameters: matrix $\Psi$ of user features $\psi_u$ for all users, matrix $\Phi$ of item features $\phi_x$ for all items. 33

Latent Features. Optimization criterion: $(\Psi, \Phi) = \arg\min_{\Psi,\Phi} \sum_{x,u} \ell(y_{u,x}, f_{\Psi,\Phi}(u, x)) + \lambda \left( \sum_u \|\psi_u\|^2 + \sum_x \|\phi_x\|^2 \right)$. The feature vectors of all users and all items are regularized. 34

Latent Features Both item and user features are the solution of an optimization problem. Number of features k has to be set. Meaning of the features is not pre-determined. Sometimes they turn out to be interpretable. 35

Matrix Factorization. Decision function: $f_{\Psi,\Phi}(u, x) = \psi_u^T \phi_x$. In matrix notation: $Y_{\Psi,\Phi} = \Psi \Phi^T$. Matrix elements:
$\begin{pmatrix} y_{11} & \cdots & y_{1m'} \\ \vdots & & \vdots \\ y_{m1} & \cdots & y_{mm'} \end{pmatrix} = \begin{pmatrix} \psi_{11} & \cdots & \psi_{1k} \\ \vdots & & \vdots \\ \psi_{m1} & \cdots & \psi_{mk} \end{pmatrix} \begin{pmatrix} \phi_{11} & \cdots & \phi_{m'1} \\ \vdots & & \vdots \\ \phi_{1k} & \cdots & \phi_{m'k} \end{pmatrix}$ 36
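A tiny numpy illustration of the matrix form (shapes are illustrative): the full matrix of predicted ratings is $\Psi\Phi^T$, and its $(u, x)$ entry is exactly $\psi_u^T \phi_x$.

```python
import numpy as np

m, m_items, k = 3, 4, 2                 # users, items, latent features
rng = np.random.default_rng(0)
Psi = rng.normal(size=(m, k))           # row u holds the latent features psi_u
Phi = rng.normal(size=(m_items, k))     # row x holds the latent features phi_x

Y_hat = Psi @ Phi.T                     # m x m' matrix of predicted ratings
u, x = 0, 1
assert np.isclose(Y_hat[u, x], Psi[u] @ Phi[x])
```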

Matrix Factorization. Decision function in matrix notation (rows of $Y$: Alice, Bob, Carol; columns: M, Z, T, D): the predicted rating of Matrix for Alice is the product of the latent features of Alice (a row of $\Psi$) and the latent features of Matrix (a column of $\Phi^T$). 37

Matrix Factorization. Decision function in matrix notation: each row of $\Psi$ holds the latent features of one user (e.g., Alice, Carol), and each column of $\Phi^T$ holds the latent features of one item (e.g., Matrix, Death Proof). 38

Matrix Factorization. Optimization criterion: $(\Psi, \Phi) = \arg\min_{\Psi,\Phi} \sum_{x,u} \ell(y_{u,x}, f_{\Psi,\Phi}(u, x)) + \lambda (\|\Psi\|^2 + \|\Phi\|^2)$. The criterion is not convex: for instance, multiplying all feature vectors by $-1$ gives an equally good solution, $f_{\Psi,\Phi}(u, x) = \psi_u^T \phi_x = (-\psi_u^T)(-\phi_x)$. Limiting the number of latent features to $k$ restricts the rank of the matrix $Y_{\Psi,\Phi}$. 39

Matrix Factorization. Optimization criterion: $(\Psi, \Phi) = \arg\min_{\Psi,\Phi} \sum_{x,u} \ell(y_{u,x}, f_{\Psi,\Phi}(u, x)) + \lambda (\|\Psi\|^2 + \|\Phi\|^2)$. Optimization by stochastic gradient descent or alternating least squares. 40

Matrix Factorization by Stochastic Gradient Descent. Iterate through the ratings $y_{u,x}$ in the training sample: let $\psi_u \leftarrow \psi_u - \alpha \frac{\partial}{\partial \psi_u} \ell(y_{u,x}, f_{\Psi,\Phi}(u, x))$ and $\phi_x \leftarrow \phi_x - \alpha \frac{\partial}{\partial \phi_x} \ell(y_{u,x}, f_{\Psi,\Phi}(u, x))$, until convergence. Requires a differentiable loss function, e.g., squared loss. 41
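A minimal sketch of this SGD loop for squared loss with $\ell_2$ regularization; the function name, the $(u, x, y)$ triple format, and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

# Sketch: matrix factorization trained by SGD on squared loss with l2
# regularization; ratings is a list of (user_index, item_index, rating).

def mf_sgd(ratings, m_users, m_items, k=10, alpha=0.01, lam=0.1, epochs=50):
    rng = np.random.default_rng(0)
    Psi = 0.1 * rng.normal(size=(m_users, k))   # user latent features
    Phi = 0.1 * rng.normal(size=(m_items, k))   # item latent features
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, x, y = ratings[idx]
            err = Psi[u] @ Phi[x] - y            # gradient factor of 0.5*err^2
            psi_u = Psi[u].copy()                # keep old value for the phi update
            Psi[u] -= alpha * (err * Phi[x] + lam * Psi[u])
            Phi[x] -= alpha * (err * psi_u + lam * Phi[x])
    return Psi, Phi
```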

Matrix Factorization by Alternating Least Squares. For squared loss and parallel architectures. Alternate between two optimization processes: keep $\Phi$ fixed, optimize $\psi_u$ in parallel for all $u$; keep $\Psi$ fixed, optimize $\phi_x$ in parallel for all $x$. 42

Matrix Factorization by Alternating Least Squares. For squared loss and parallel architectures. Alternate between two optimization processes: keep $\Phi$ fixed, optimize $\psi_u$ in parallel for all $u$; keep $\Psi$ fixed, optimize $\phi_x$ in parallel for all $x$. Optimization criterion for $\Psi$: $\psi_u = \arg\min_{\psi_u} \|y_u - \hat y_u\|^2 + \lambda \|\psi_u\|^2 = \arg\min_{\psi_u} \|y_u - \psi_u^T \Phi^T\|^2 + \lambda \|\psi_u\|^2$, where $(\hat y_{u1} \cdots \hat y_{um'}) = (\psi_{u1} \cdots \psi_{uk}) \begin{pmatrix} \phi_{11} & \cdots & \phi_{m'1} \\ \vdots & & \vdots \\ \phi_{1k} & \cdots & \phi_{m'k} \end{pmatrix}$. 43

Matrix Factorization by Alternating Least Squares. For squared loss and parallel architectures. Alternate between two optimization processes: keep $\Phi$ fixed, optimize $\psi_u$ in parallel for all $u$; keep $\Psi$ fixed, optimize $\phi_x$ in parallel for all $x$. Optimization criterion for $\Psi$: $\psi_u = \arg\min_{\psi_u} \|y_u - \psi_u^T \Phi^T\|^2 + \lambda \|\psi_u\|^2$, with closed-form solution $\psi_u = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y_u$. 44

Matrix Factorization by Alternating Least Squares. For squared loss and parallel architectures. Alternate between two optimization processes: keep $\Phi$ fixed, optimize $\psi_u$ in parallel for all $u$; keep $\Psi$ fixed, optimize $\phi_x$ in parallel for all $x$. Optimization criterion for $\Psi$: $\psi_u = \arg\min_{\psi_u} \|y_u - \psi_u^T \Phi^T\|^2 + \lambda \|\psi_u\|^2$, with solution $\psi_u = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y_u$. Optimization criterion for $\Phi$: $\phi_x = \arg\min_{\phi_x} \|y_x - \phi_x^T \Psi^T\|^2 + \lambda \|\phi_x\|^2$, with solution $\phi_x = (\Psi^T \Psi + \lambda I)^{-1} \Psi^T y_x$. 45

Matrix Factorization by Alternating Least Squares. For squared loss and parallel architectures. Initialize $\Psi, \Phi$ randomly. Repeat until convergence: keep $\Phi$ fixed, for all $u$ in parallel: $\psi_u = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y_u$; keep $\Psi$ fixed, for all $x$ in parallel: $\phi_x = (\Psi^T \Psi + \lambda I)^{-1} \Psi^T y_x$. 46
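A minimal sketch of this ALS loop for the squared loss, with each closed-form ridge solve restricted to the observed ratings of the respective user or item; the matrix layout (NaN for missing entries) and all names are illustrative:

```python
import numpy as np

# Sketch: alternating least squares for matrix factorization. R is an m x m'
# rating matrix with NaN for missing entries.

def mf_als(R, k=10, lam=0.1, iters=20):
    m, m_items = R.shape
    observed = ~np.isnan(R)
    rng = np.random.default_rng(0)
    Psi = 0.1 * rng.normal(size=(m, k))
    Phi = 0.1 * rng.normal(size=(m_items, k))
    for _ in range(iters):
        # keep Phi fixed: one ridge solve per user (parallelizable)
        for u in range(m):
            obs = observed[u]
            P = Phi[obs]                          # features of items rated by u
            Psi[u] = np.linalg.solve(P.T @ P + lam * np.eye(k), P.T @ R[u, obs])
        # keep Psi fixed: one ridge solve per item (parallelizable)
        for x in range(m_items):
            obs = observed[:, x]
            Q = Psi[obs]                          # features of users who rated x
            Phi[x] = np.linalg.solve(Q.T @ Q + lam * np.eye(k), Q.T @ R[obs, x])
    return Psi, Phi
```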

Extensions: Biases. Some users just give optimistic or pessimistic ratings; some items are hyped. Decision function: $f_{\Psi,\Phi,B_u,B_x}(u, x) = b_u + b_x + \psi_u^T \phi_x$. Optimization criterion: $(\Psi, \Phi, B_u, B_x) = \arg\min_{\Psi,\Phi,B_u,B_x} \sum_{x,u} \ell(y_{u,x}, f_{\Psi,\Phi,B_u,B_x}(u, x)) + \lambda (\|\Psi\|^2 + \|\Phi\|^2 + \|B_u\|^2 + \|B_x\|^2)$. 47
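The bias terms fit naturally into the SGD sketch shown earlier; a possible extension under the same assumptions (illustrative names, squared loss, $\ell_2$ regularization):

```python
import numpy as np

# Sketch: SGD for matrix factorization with user and item biases b_u, b_x.

def mf_sgd_biased(ratings, m_users, m_items, k=10, alpha=0.01, lam=0.1, epochs=50):
    rng = np.random.default_rng(0)
    Psi = 0.1 * rng.normal(size=(m_users, k))
    Phi = 0.1 * rng.normal(size=(m_items, k))
    b_u = np.zeros(m_users)
    b_x = np.zeros(m_items)
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, x, y = ratings[idx]
            err = b_u[u] + b_x[x] + Psi[u] @ Phi[x] - y
            psi_u = Psi[u].copy()
            Psi[u] -= alpha * (err * Phi[x] + lam * Psi[u])
            Phi[x] -= alpha * (err * psi_u + lam * Phi[x])
            b_u[u] -= alpha * (err + lam * b_u[u])
            b_x[x] -= alpha * (err + lam * b_x[x])
    return Psi, Phi, b_u, b_x
```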

Extensions: Explicit Features. Often, explicit user and item features are available. Concatenate explicit and latent features into the vectors $\psi_u$ and $\phi_x$; the explicit features are fixed, the latent features are free parameters. 48

Extensions: Temporal Dynamics. How much a user likes an item depends on the point in time when the rating takes place: $f_{\Psi,\Phi,B_u,B_x}(u, x, t) = b_u(t) + b_x(t) + \psi_u(t)^T \phi_x$. 49

Summary. Purely content-based recommendation: users don't benefit from other users' ratings. Collaborative filtering by nearest neighbors: fixed definition of the similarity of users; no model parameters, no learning; has to iterate over the data to make a recommendation. Latent factor models, matrix factorization: user preferences and item properties are free parameters, optimized to minimize the discrepancy between inferred and actual ratings. 50