Data Science Mastery Program

Similar documents
Collaborative Filtering. Radek Pelánek

Recommendation Systems

Recommendation Systems

Recommendation Systems

Matrix Factorization and Collaborative Filtering

Scaling Neighbourhood Methods

CS425: Algorithms for Web Scale Data

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Generative Models for Discrete Data

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Matrix Factorization Techniques for Recommender Systems

Collaborative Filtering Matrix Completion Alternating Least Squares

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

a Short Introduction

Collaborative topic models: motivations cont

Collaborative Filtering

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

Algorithms for Collaborative Filtering

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

* Matrix Factorization and Recommendation Systems

Collaborative Topic Modeling for Recommending Scientific Articles

CSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year s slides

CS246 Final Exam, Winter 2011

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let s put it to use!

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

Department of Computer Science, Guiyang University, Guiyang , GuiZhou, China

Andriy Mnih and Ruslan Salakhutdinov

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Quick Introduction to Nonnegative Matrix Factorization

From Non-Negative Matrix Factorization to Deep Learning

Matrix Factorization Techniques for Recommender Systems

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization

Using SVD to Recommend Movies

Collaborative Filtering

CS425: Algorithms for Web Scale Data

Circle-based Recommendation in Online Social Networks

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties

Prediction of Citations for Academic Papers

Ad Placement Strategies

Matrix Factorization and Factorization Machines for Recommender Systems

Decoupled Collaborative Ranking

Data Mining Techniques

Multiclass Classification-1

Click Prediction and Preference Ranking of RSS Feeds

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

Jeffrey D. Ullman Stanford University

Clustering based tensor decomposition

Structured matrix factorizations. Example: Eigenfaces

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

6.034 Introduction to Artificial Intelligence

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Data Mining Recitation Notes Week 3

Cost and Preference in Recommender Systems Junhua Chen LESS IS MORE

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing

Collaborative Recommendation with Multiclass Preference Context

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

Lecture 9: September 28

Joint user knowledge and matrix factorization for recommender systems

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Collaborative Filtering Applied to Educational Data Mining

Recommender Systems: Overview and. Package rectools. Norm Matloff. Dept. of Computer Science. University of California at Davis.

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January,

CS47300: Web Information Search and Management

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

Information Retrieval

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Lecture 5: Web Searching using the SVD

Impact of Data Characteristics on Recommender Systems Performance

CS 175: Project in Artificial Intelligence. Slides 4: Collaborative Filtering

Dimensionality Reduction

Introduction to Data Mining

Point-of-Interest Recommendations: Learning Potential Check-ins from Friends

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

13 Searching the Web with the SVD

Nonnegative Matrix Factorization

CS249: ADVANCED DATA MINING

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

CPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2018

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

Introduction to Information Retrieval

Lecture Notes 10: Matrix Factorization

Text Analytics (Text Mining)

Lecture: Face Recognition and Feature Reduction

Low Rank Matrix Completion Formulation and Algorithm

Collaborative Filtering on Ordinal User Feedback

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

Data Mining Techniques

CS 6375 Machine Learning

Ensemble Methods for Machine Learning

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

Lecture 21: Spectral Learning for Graphical Models

Domokos Miklós Kelen. Online Recommendation Systems. Eötvös Loránd University. Faculty of Natural Sciences. Advisor:

Transcription:

Data Science Mastery Program

Copyright Policy All content included on the Site or third-party platforms as part of the class, such as text, graphics, logos, button icons, images, audio clips, video clips, live streams, digital downloads, data compilations, and software, is the property of BitTiger or its content suppliers and protected by copyright laws. Any attempt to redistribute or resell BitTiger content will result in the appropriate legal action being taken. We thank you in advance for respecting our copyrighted content. For more info see https://www.bittiger.io/termsofuse and https://www.bittiger.io/termsofservice


Outline
- Matrix Factorization in Clustering and Dimensionality Reduction
  - Non-negative Matrix Factorization
  - Singular Value Decomposition
- Recommender Examples
- Approaches in Recommenders
  - Content-Based
  - Collaborative Filtering (Item-Item, User-User)
  - Matrix Factorization (UV Decomposition)

Matrix Factorization

Non-negative Matrix Factorization

Non-negative Matrix Factorization (NMF): factor a matrix V (m x n), where each entry v_ij >= 0, into W (m x r) and H (r x n), with w_ij >= 0 and h_ij >= 0. This cannot be solved analytically, so it is approximated numerically. r is set by the user; r < min(m, n).
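Written as an optimization problem (the slide gives no formula; this is the standard least-squares NMF objective the text describes):

V \approx W H, \qquad \min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^2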

Notice the columns of V are sums of the columns of W, weighted by the corresponding column h_i. NMF is a relatively new way of reducing the dimensionality of data into a linear combination of bases: the columns of W act as the basis, weighted by h_i. The non-negativity constraint makes it unlike the decompositions we've looked at thus far.

Document Clustering with NMF: 500 documents x 10,000 words; V = W H.

W: think of a column of W as a document archetype, where the higher a word's cell value, the higher that word's rank for that latent feature. H: think of a column of H as the original document, where the cell value is the document's rank for a particular latent feature. Recall V: think of reconstituting a particular document as a linear combination of document archetypes, weighted by how important they are. NMF (least-squares objective) is a relaxed form of K-means clustering: W contains the cluster centroids, H contains the cluster membership indicators.

Mechanics - Alternating Least Squares: minimize the squared reconstruction error with respect to W and H, subject to W, H >= 0. Steps: (1) Randomly initialize W and H to the appropriate shapes. (2) Repeat: holding W fixed, update H by minimizing the sum of squared errors, ensuring all H >= 0; holding H fixed, update W by minimizing the sum of squared errors, ensuring all W >= 0. (3) Stop when some threshold is met (decrease in RMSE, number of iterations, etc.). A runnable sketch of this loop follows.
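A minimal sketch of the alternating procedure above. The function name nmf_als and the toy matrix are illustrative, not from the slides, and the clip-to-nonnegative step is a simplification of a proper non-negative least-squares update:

```python
import numpy as np

def nmf_als(V, r, n_iters=200, seed=0):
    """Sketch of NMF via alternating least squares with clipping.

    V : (m, n) nonnegative data matrix
    r : number of latent features, r < min(m, n)
    Returns W (m, r) and H (r, n) with V ~= W @ H.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iters):
        # Holding W fixed, update H by least squares, then clip to keep H >= 0
        H = np.linalg.lstsq(W, V, rcond=None)[0].clip(min=1e-10)
        # Holding H fixed, update W by least squares, then clip to keep W >= 0
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T.clip(min=1e-10)
    return W, H

# Toy "document clustering" usage: 6 documents x 4 words, 2 latent topics
V = np.array([[3, 2, 0, 0],
              [4, 3, 0, 0],
              [0, 0, 5, 4],
              [0, 1, 4, 3],
              [2, 2, 0, 0],
              [0, 0, 3, 3]], dtype=float)
W, H = nmf_als(V, r=2)
print("RMSE:", round(float(np.sqrt(np.mean((V - W @ H) ** 2))), 3))
```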

NMF Algorithm


Popular Applications: computer vision (identifying/classifying objects, generally reducing the feature space of images), document clustering, recommender systems.

http://www.cs.cmu.edu/~02317/slides/lec_7.pdf#page=17

Singular Value Decomposition

Singular Value Decomposition: A [n x m] = U [n x r] L [r x r] (V [m x r])^T. A: n x m matrix (e.g., n documents, m terms). U: n x r matrix (n documents, r concepts). L: r x r diagonal matrix (strength of each 'concept'; r: rank of the matrix). V: m x r matrix (m terms, r concepts).

Singular Value Decomposition m: # of users n: # of items k: # of latent features (also rank of A)

SVD - Properties. THEOREM: it is always possible to decompose a matrix A into A = U L V^T, where U, L, V are unique (*). U, V are column-orthonormal (i.e., their columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix). L: the singular values are positive and sorted in decreasing order.

SVD - Properties: spectral decomposition of the matrix. For the example matrix

A = [1 1 1 0 0]
    [2 2 2 0 0]
    [1 1 1 0 0]
    [5 5 5 0 0]
    [0 0 0 2 2]
    [0 0 0 3 3]
    [0 0 0 1 1]

the decomposition is A = [u1 u2] x diag(l1, l2) x [v1 v2]^T.

SVD - Interpretation: 'documents', 'terms' and 'concepts'. U: document-to-concept similarity matrix. V: term-to-concept similarity matrix. L: its diagonal elements give the strength of each concept. Projection: best axis to project on ('best' = minimum sum of squares of projection errors).

SVD - Example: A = U L V^T. The example document-term matrix (columns: data, information, retrieval, brain, lung):

CS-TR1:  1 1 1 0 0
CS-TR2:  2 2 2 0 0
CS-TR3:  1 1 1 0 0
CS-TR4:  5 5 5 0 0
MED-TR1: 0 0 0 2 2
MED-TR2: 0 0 0 3 3
MED-TR3: 0 0 0 1 1

SVD - Example: A = U L V^T for the CS/MED document-term matrix above (terms: data, info., retrieval, brain, lung):

U (7 x 2):      L (2 x 2):     V^T (2 x 5):
[0.18  0   ]    [9.64  0   ]   [0.58 0.58 0.58 0    0   ]
[0.36  0   ]    [0     5.29]   [0    0    0    0.71 0.71]
[0.18  0   ]
[0.90  0   ]
[0     0.53]
[0     0.80]
[0     0.27]

SVD - Example (continued): U is the document-to-concept similarity matrix; its first column corresponds to the CS-concept and its second to the MD-concept.

SVD - Example (continued): the diagonal of L gives the strength of each concept; 9.64 is the strength of the CS-concept, 5.29 the strength of the MD-concept.

SVD - Example (continued): V^T is the term-to-concept similarity matrix.

SVD - Dimensionality reduction. Q: how exactly is dimensionality reduction done? A: set the smallest singular values to zero. In the example above, the smaller singular value 5.29 is set to zero, keeping only the CS-concept.

SVD - Dimensionality reduction. Reduced matrices:

U' (7 x 1): [0.18, 0.36, 0.18, 0.90, 0, 0, 0]^T
L' (1 x 1): [9.64]
V'^T (1 x 5): [0.58 0.58 0.58 0 0]

SVD - Dimensionality reduction. The reduced matrices reconstruct A approximately:

A ~ [1 1 1 0 0]
    [2 2 2 0 0]
    [1 1 1 0 0]
    [5 5 5 0 0]
    [0 0 0 0 0]
    [0 0 0 0 0]
    [0 0 0 0 0]
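This whole example can be reproduced with numpy. A sketch (the signs of numpy's singular vectors may differ from the slides, but the singular values and the rank-1 reconstruction match):

```python
import numpy as np

# The 7x5 document-term matrix from the SVD example slides
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))        # singular values: 9.64, 5.29, then (near) zeros

# Dimensionality reduction: zero out the smallest singular values (keep k = 1)
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))      # rank-1 approximation; the MED rows become ~0
```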

Intro to Recommenders

Recommenders: Where are recommenders used? What does our dataset look like? What are the high-level approaches to building a recommender (content-based, collaborative filtering, matrix factorization)? How do we evaluate our recommender system? What are the challenges in our recommender system? What are the computational performance concerns?

Recommenders in Industry. Netflix: 2/3 of the movies watched are recommended. Google News: recommendations generate 38% more click-through. Amazon: 35% of sales come from recommendations. Stitch Fix: 100% of their revenue is based on recommendations.

Business Goals What will the user like? What will the user buy? What will the user click?

Data Science Canon: Netflix's $1,000,000 Prize (Oct. 2006 - July 2009). Goal: beat Netflix's own recommender by 10%. Took almost 3 years. The winning team used gradient boosted decision trees over the predictions of 500 other models. Netflix never deployed the winning algorithm.

What are the high-level approaches to building a recommender? Popularity: make the same recommendation to every user, based only on the popularity of an item (e.g. Twitter Moments). Content-based (aka content filtering): predictions are made based on the properties/characteristics of an item; other users' behavior is not considered (e.g. Pandora Radio). Collaborative filtering: only consider past user behavior, not content properties; User-User similarity and Item-Item similarity (e.g. Netflix & Amazon recommendations, Google Ads, Facebook Ads, search, friend recommendations, news feed, trending news, ranking notifications, ranking comments). Matrix factorization methods: find latent features (aka factors).

Content-Based Recommendation

Content-based recommendation: recommendations are based on information about the content of items rather than on other users' opinions/interactions. Use a machine learning algorithm or a heuristic approach to induce a model of the user's preferences from examples, based on a featural description of content. In content-based recommendation, the system tries to recommend items similar to those a given user has liked in the past. A pure content-based recommender system makes recommendations for a user based solely on the profile built up by analyzing the content of items that the user has rated in the past.

What is content? What is the content of an item? It can be explicit attributes or characteristics of the item. For example for a film: Genre: Action / adventure Feature: Bruce Willis Year: 1995 It can also be textual content (title, description, table of content, etc.) Several techniques to compute the distance between two textual documents Can use NLP techniques to extract content features Can be extracted from the signal itself (audio, image)

Content-based Recommendation: common for recommending text-based products (web pages, Usenet news messages, ...). Items to recommend are described by their associated features (e.g. keywords). The user model is structured in a similar way as the content: features/keywords more likely to occur in the preferred documents (lazy approach). Text documents are recommended based on a comparison between their content (words appearing) and the user model (a set of preferred words). The user model can also be a classifier based on whatever technique (Neural Networks, Naïve Bayes, ...).

Advantages of content-based Recommendation: no need for data on other users, so no cold-start or sparsity problems. Able to recommend to users with unique tastes. Able to recommend new and unpopular items (no first-rater problem). Can provide explanations of recommended items by listing the content features that caused an item to be recommended.

Disadvantages of content-based Recommendation: requires content that can be encoded as meaningful features; some kinds of items are not amenable to easy feature extraction (e.g. movies, music). Even for texts, IR techniques cannot capture multimedia information, aesthetic qualities, download time, etc. A positive rating for a page may not be related to the presence of certain keywords. Users' tastes must be represented as a learnable function of these content features. Hard to exploit quality judgments of other users. Easy to overfit (e.g. for a user with few data points we may pigeonhole them).

Clustering in Recommender

Clustering: another way to make recommendations based on past purchases is to cluster customers. Each cluster is assigned typical preferences, based on the preferences of customers who belong to the cluster. Customers within each cluster then receive recommendations computed at the cluster level.

Clustering

Clustering

Clustering. Pros: clustering techniques can be used to work on aggregated data; can also be applied as a first step for shrinking the selection of relevant neighbors in a collaborative filtering algorithm and improving performance; can be used to capture latent similarities between users or items. Cons: recommendations (per cluster) may be less relevant than collaborative filtering (per individual).

Collaborative Filtering Recommender

Collaborative Filtering: user-based vs. item-based. User-based: find similar users (e.g. both users read the same books); books read by one are recommended to the other. Item-based: find similar items (e.g. both items are read by the same users); if a user read one item, recommend the similar one.

Ingredients of Collaborative Filtering: a list of m users and a list of n items. Each user has a list of items with an associated opinion: an explicit opinion (a rating score), or sometimes an implicit one (purchase records or listened-to tracks). An active user for whom the CF prediction task is performed. A metric for measuring similarity between users/items. A method for selecting a subset of neighbors. A method for predicting a rating for items not currently rated by the active user.

General Steps of Collaborative Filtering 1. Identify set of ratings for the target/active user 2. Identify set of users most similar to the target/active user according to a similarity function (neighborhood formation) 3. Identify the products these similar users liked 4. Generate a prediction - rating that would be given by the target user to the product - for each one of these products 5. Based on this predicted rating recommend a set of top N products

What does our dataset look like? Typically, data is a utility (rating) matrix, which captures user preferences/well-being: user ratings of items, user purchase decisions for items. Unrated items are coded as 0 or missing. Most items are unrated, so the matrix is sparse. Use a recommender to determine which attributes users think are important and to predict ratings for unrated items. Better than trusting expert opinion.

What does our dataset look like? Data can be: Explicit: user-provided ratings (1 to 5 stars), user like/non-like. Implicit: infer user-item relationships from behavior; more common; examples: buy/not-buy, view/not-view. To convert implicit to explicit, create a matrix of 1s (yes) and 0s (no).

Example 1: Explicit utility matrix We have explicit ratings, plus a bunch of missing values. What company might have data like this? Btw, we call this the utility matrix.

Example 2: Implicit utility matrix We have implicit feedback, and no missing values. What company might have data like this? Btw, we call this the utility matrix.

Explicit Rating vs. Implicit Feedback the company completely relied on its users rating titles with stars when it began personalization some years ago. At one point, it had over 10 billion 5-star ratings, and more than 50% of all members had rated more than 50 titles. However, over time, Netflix realized that explicit star ratings were less relevant than other signals. Users would rate documentaries with 5 stars, and silly movies with just 3 stars, but still watch silly movies more often than those high-rated documentaries.

Two types of similarity-based Collaborative Filtering. User-based: predict based on similarities between users; performs well, but slow if there are many users; use item-based CF if #Users >> #Items. Item-based: predict based on similarities between items; faster if you precompute item-item similarity. Usually #Users >> #Items, so item-based CF is most popular. Item-based also tends to be more stable: items often sit in only one category (e.g., action films) and are stable over time, while users may like variety or change preferences over time; items usually have more ratings than users, so items have more stable average ratings than users.

User-User similarities We look at all pairs of users and calculate their similarity. How can we calculate the similarity of these row vectors?

Item-Item similarities We look at all pairs of items and calculate their similarity. How can we calculate the similarity of these column vectors?

User-User or Item-Item? Let m = #users, n = #items. We want to compute the similarity of all pairs. What is the algorithmic efficiency of each approach? User-User: O(m^2 n). Item-Item: O(m n^2). Which one is better?

Similarity Metric using Euclidean Distance. What's the range? But we're interested in a similarity, so let's do this instead: What's the range? When would you use this?

Similarity Metric using Pearson Correlation. What's the range? But we're interested in a similarity, so let's do this instead: What's the range? When would you use this?

Similarity Metric using Cosine Similarity. What's the range? But we're interested in a standardized similarity, so let's do this instead: What's the range? When would you use this?

Similarity Metric using Jaccard Index. What's the range? When would you use this?

The Similarity Matrix Pick a similarity metric, create the similarity matrix:
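A sketch of this step, choosing cosine similarity from the metrics above (the toy utility matrix R is made up for illustration, not from the slides):

```python
import numpy as np

# Toy utility matrix: rows = users, columns = items, 0 = unrated
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)

def cosine_similarity_matrix(R):
    """Item-item cosine similarities from the columns of the utility matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for unrated items
    normalized = R / norms
    return normalized.T @ normalized   # (n_items, n_items)

sim = cosine_similarity_matrix(R)
print(np.round(sim, 2))
```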

Item-Item based CF: how to make predictions. Say user u hasn't rated item i. We want to predict the rating that this user would give this item. We order by descending predicted rating for a single user, and recommend the top k items to the user.

How to make predictions (using neighborhoods) This calculation of predicted ratings can be very costly. To mitigate this issue, we will only consider the n most similar items to an item when calculating the prediction.
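A minimal sketch of this neighborhood-based, item-item prediction, continuing the toy R and sim from the previous snippet (the standard similarity-weighted average; predict_rating is an illustrative name, not the course's code):

```python
import numpy as np

def predict_rating(R, sim, user, item, n_neighbors=2):
    """Predict R[user, item] as a similarity-weighted average of the user's
    ratings on the n items most similar to `item` (item-based CF)."""
    rated = np.where(R[user] > 0)[0]                       # items the user has rated
    # Sort the rated items by similarity to the target item, keep the top n
    neighbors = rated[np.argsort(sim[item, rated])[::-1][:n_neighbors]]
    weights = sim[item, neighbors]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ R[user, neighbors] / weights.sum())

# Predict the rating user 0 would give item 2 (unrated in the toy matrix)
print(round(predict_rating(R, sim, user=0, item=2), 2))
```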

How to make predictions How would you modify the prediction formula below for a user-based recommender? Hint: should you compute similarity between users or items?

How do we evaluate our recommender system? Is it possible to do cross-validation like normal? Before we continue, let's review: why do we perform cross-validation? Quick warning: recommenders are inherently hard to validate. There is a lot of discussion in academia (research papers) and industry (Kaggle, Netflix, etc.) about this. There is no ONE answer for all datasets.

Cross-validation of ML models we have seen so far

Cross-validation for recommenders? For this slide, the question marks denote the holdout set (not missing values). We can calculate MSE between the targets and our predictions over the holdout set. (K-fold cross-validation is optional.) Recall: why do we perform cross-validation? Why isn't the method above a true estimate of a recommender's performance in the field? Why would A/B testing be better?
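One way this holdout idea might be implemented (a sketch with illustrative function names, not the course's code): mask a fraction of the observed ratings, train on what remains, then score predictions on the masked cells.

```python
import numpy as np

def holdout_split(R, holdout_frac=0.2, seed=0):
    """Mask a random fraction of the *observed* ratings as a holdout set.
    Returns the training matrix (held-out cells set to 0) and the holdout indices."""
    rng = np.random.default_rng(seed)
    rows, cols = np.where(R > 0)
    idx = rng.choice(len(rows), size=int(holdout_frac * len(rows)), replace=False)
    R_train = R.copy()
    R_train[rows[idx], cols[idx]] = 0
    return R_train, (rows[idx], cols[idx])

def holdout_mse(R, R_pred, holdout):
    """MSE between true ratings and predictions on the held-out cells only."""
    rows, cols = holdout
    return float(np.mean((R[rows, cols] - R_pred[rows, cols]) ** 2))
```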

Alternate way to validate. What's the deal with this? I.e., why might we prefer doing this instead of the more normal cross-validation from the previous slide?

DON'T DO THIS! Why?

Cross-validation of Recommenders

How to deal with cold start? Scenario: a new user signs up. What will our recommender do (assume we're using item-item similarities)? One strategy: force users to rate 5 items as part of the signup process, AND/OR recommend popular items at first. Scenario: a new item is introduced. What will our recommender do (assume we're using item-item similarities)? One strategy: put it in the 'new releases' section until enough users rate it, AND/OR use item metadata if any exists.

How to deal with cold start? Scenario: a new user signs up. What will our recommender do (assume we're YouTube and we're using item popularity to make recommendations)? This really isn't a problem... Scenario: a new item is introduced. What will our recommender do (assume we're YouTube and we're using item popularity to make recommendations)? One strategy: don't use the total number of views as the popularity metric (we'd have a rich-get-richer situation); use something else...

Deploying the recommender In the middle of the night: Compute similarities between all pairs of items. Compute the neighborhood of each item. At request time: Predict scores for candidate items, and make a recommendation.

Matrix Factorization for Recommendation

Matrix Factorization for Recommendation. Recall: an explicit-rating utility matrix is usually VERY sparse. We've previously used SVD to find latent features (aka factors)... Would SVD be good for this sparse utility matrix? (Hint: No!) What's the problem with using SVD on this sparse utility matrix?

Matrix Factorization for Recommendation: UV Decomposition (UVD); UVD via Stochastic Gradient Descent (SGD). Basic system: UVD + SGD... FTW. Intermediate topics: regularization, accounting for biases.

UV Decomposition (UVD). You choose k. UV approximates R by necessity if k is less than the rank of R. Usually choose k << min(n, m). Compute U and V such that R ~= UV, minimizing the squared error: least squares! The objective is written out below.
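The slide's formula did not survive transcription; the standard least-squares objective it describes, summing only over observed ratings (an assumption consistent with the later slides on missing values), is:

\min_{U, V} \; \sum_{(i,j)\,:\, r_{ij}\ \text{observed}} \Big( r_{ij} - \sum_{k} u_{ik} v_{kj} \Big)^2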

UV Decomposition (UVD)

Evaluating the factorization: to evaluate how well the factorization represents the original data, we use RMSE (Root Mean Squared Error).
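For reference, with \Omega the set of rated (or held-out) entries and \hat r_{ij} the reconstructed rating:

\mathrm{RMSE} = \sqrt{\frac{1}{|\Omega|} \sum_{(i,j)\in\Omega} \big( r_{ij} - \hat r_{ij} \big)^2}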

UV Decomposition Algorithm

Evaluating Factorization To get the formulas for the updates, we take the partial derivative of the error formula

Evaluating Factorization. To make predictions of ratings, we multiply the U and V matrices together. So to get a single rating, it's the dot product of one row in U with one column in V. The squared error can then be calculated from the difference between the actual and the predicted rating.

Calculate the gradient Gradient Descent

Updating Formula: Gradient Descent (cont'd)
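The formulas on these slides did not survive transcription; the standard squared error, gradients, and update rules for this setup (learning rate \alpha) are:

e_{ij} = r_{ij} - u_i \cdot v_j, \qquad e_{ij}^2 = \big( r_{ij} - \textstyle\sum_k u_{ik} v_{kj} \big)^2

\frac{\partial e_{ij}^2}{\partial u_{ik}} = -2\, e_{ij}\, v_{kj}, \qquad \frac{\partial e_{ij}^2}{\partial v_{kj}} = -2\, e_{ij}\, u_{ik}

u_{ik} \leftarrow u_{ik} + 2\alpha\, e_{ij}\, v_{kj}, \qquad v_{kj} \leftarrow v_{kj} + 2\alpha\, e_{ij}\, u_{ik}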

Regularization. Since we're now fitting a large parameter set to sparse data, you'll almost certainly need to regularize! Tune lambda: the amount of regularization.
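Putting the update rules and L2 regularization together, a minimal UVD-via-SGD sketch might look like the following (the function name uvd_sgd, the hyperparameter values, and the toy matrix are illustrative assumptions, not the course's code):

```python
import numpy as np

def uvd_sgd(R, k=2, alpha=0.01, lam=0.1, n_epochs=200, seed=0):
    """Sketch of UV decomposition fit by SGD with L2 regularization.
    Only the observed (non-zero) entries of R contribute to the loss,
    so missing values need no imputation."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((k, n_items))
    observed = np.argwhere(R > 0)
    for _ in range(n_epochs):
        rng.shuffle(observed)                  # visit observed ratings in random order
        for i, j in observed:
            err = R[i, j] - U[i] @ V[:, j]
            Ui = U[i].copy()
            # Gradient step on the regularized squared error for this rating
            U[i]    += alpha * (err * V[:, j] - lam * U[i])
            V[:, j] += alpha * (err * Ui      - lam * V[:, j])
    return U, V

# Usage on a small explicit-ratings matrix (0 = missing)
R = np.array([[5, 4, 0, 1],
              [4, 0, 1, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
U, V = uvd_sgd(R, k=2)
print(np.round(U @ V, 2))   # predictions, including the missing cells
```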

Accounting for Biases (let's capture our domain knowledge!). In practice, much of the observed variation in rating values is due to item bias and user bias: some items (e.g. movies) have a tendency to be rated high, some low; some users have a tendency to rate high, some low. We can capture this prior domain knowledge using a few bias terms: the overall bias of the rating by user i for item j, the overall average rating (i.e. the overall bias), user i's average deviation from the overall average, and item j's average deviation from the overall average.

New Prediction: the 4 parts of the prediction of user i rating item j are the average rating, user i's tendency to deviate from the average, item j's tendency to deviate from the average, and the prediction of how user i will interact with item j.

Accounting for Biases (the new cost function). Ratings are now estimated with the bias terms included in the prediction, and the cost function gains matching new parts that regularize those biases; the standard form is written out below.
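The slide's formulas were lost in transcription; the standard biased matrix-factorization form (consistent with the Koren, Bell & Volinsky paper cited later, with \mu the overall average, b_i^{(u)} the user bias, and b_j^{(v)} the item bias) is:

\hat r_{ij} = \mu + b_i^{(u)} + b_j^{(v)} + u_i \cdot v_j

\min \; \sum_{(i,j)\in\Omega} \big( r_{ij} - \mu - b_i^{(u)} - b_j^{(v)} - u_i \cdot v_j \big)^2 \; + \; \lambda \big( \lVert u_i \rVert^2 + \lVert v_j \rVert^2 + (b_i^{(u)})^2 + (b_j^{(v)})^2 \big)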

UVD vs NMF. UVD: by convention, R ~= UV. NMF: by convention, V ~= WH; same as UVD, but with one extra constraint: all values of V, W, and H must be non-negative! So NMF is a specialization of UVD. Both are approximate factorizations, and both optimize to reduce the RSS.

UVD vs NMF (continued) UVD and NMF are both solved using either: Alternating Least Squares (ALS) Stochastic Gradient Descent (SGD)

ALS vs SGD. ALS: parallelizes very well; available in Spark/MLlib; only appropriate for matrices that don't have missing values. SGD: faster (if on a single machine); requires tuning the learning rate; anecdotal evidence of better results; works with missing values.

UVD (or NMF) + SGD FTW! UVD + SGD makes a lot of sense for recommender systems. In fact, UVD + SGD is a best-in-class option for many recommender domains: no need to impute missing values; use regularization to avoid overfitting; optionally include bias terms to communicate prior knowledge; can handle time-dynamics (e.g. change in user preference over time); used by the winning entry in the Netflix challenge.

From the paper "Matrix Factorization Techniques for Recommender Systems": root mean square error over the Netflix dataset using various matrix factorization models. Numbers on the chart denote each model's dimensionality (k). The more refined models perform better (have lower error). Netflix's in-house model achieves RMSE = 0.9514 on this dataset, so even the simple matrix factorization models beat it. Read the paper for details; it's a good read!

Summary

Summary
- Non-negative Matrix Factorization
- Singular Value Decomposition
- Content-Based Recommenders
- Collaborative Filtering Recommenders (Item-Item, User-User)
- Matrix Factorization Recommenders (UV Decomposition)