Data Science Mastery Program

Copyright Policy All content included on the Site or third-party platforms as part of the class, such as text, graphics, logos, button icons, images, audio clips, video clips, live streams, digital downloads, data compilations, and software, is the property of BitTiger or its content suppliers and protected by copyright laws. Any attempt to redistribute or resell BitTiger content will result in the appropriate legal action being taken. We thank you in advance for respecting our copyrighted content. For more info see https://www.bittiger.io/termsofuse and https://www.bittiger.io/termsofservice


Outline
- Matrix Factorization in Clustering and Dimensionality Reduction
  - Non-negative Matrix Factorization
  - Singular Value Decomposition
- Recommender Examples
- Approaches in Recommenders
  - Content-Based
  - Collaborative Filtering
    - Item-Item
    - User-User
  - Matrix Factorization
    - UV Decomposition

Matrix Factorization

Non-negative Matrix Factorization

Non-negative Matrix Factorization (NMF)
Given a matrix V (m x n) where each entry v_ij ≥ 0, factor it as V ≈ W H, where W is m x r and H is r x n, with all w_ij ≥ 0 and h_ij ≥ 0.
- Cannot be solved analytically, so it is approximated numerically.
- r is set by the user; r < min(m, n).

Notice that the columns of V are sums of the columns of W, weighted by the corresponding column h_i of H.
NMF is a relatively new way of reducing the dimensionality of data into a linear combination of bases.
- Columns of W act as the basis, weighted by h_i.
Non-negativity constraint
- Unlike the decompositions we've looked at thus far.

Document Clustering with NMF: 500 documents, 10,000 words. V = W H

W: Think of a column of W as a document archetype, where the higher the word's cell value, the higher the word's rank for that latent feature.
H: Think of a column of H as the original document, where the cell value is the document's rank for a particular latent feature.
Recall V: Think of reconstituting a particular document as a linear combination of "document archetypes", weighted by how important they are.
NMF (least-squares objective) = a relaxed form of K-means clustering: W contains cluster centroids; H contains cluster membership indicators.

Mechanics: Alternating Least Squares (ALS)
Minimize ||V - WH||^2 with respect to W and H, subject to W, H ≥ 0.
Steps:
(1) Randomly initialize W and H to the appropriate shapes.
(2) Repeat the following:
    - Holding W fixed, update H by minimizing the sum of squared errors. Ensure all entries of H are ≥ 0.
    - Holding H fixed, update W by minimizing the sum of squared errors. Ensure all entries of W are ≥ 0.
(3) Stop when some threshold is met: decrease in RMSE, # of iterations, etc.
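The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference implementation: the function name `nmf_als`, the fixed iteration count used as the stopping rule, and the choice to enforce non-negativity by clipping negative entries to zero are all assumptions made here.

```python
import numpy as np

def nmf_als(V, r, n_iter=200, seed=0):
    """Approximate a non-negative V (m x n) as W @ H, with W (m x r) and H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # (1) Random non-negative initialization with the appropriate shapes.
    W = rng.random((m, r))
    H = rng.random((r, n))
    # (2) Alternate the two least-squares updates.
    for _ in range(n_iter):
        # Hold W fixed: least-squares solve for H, then clip negatives to zero.
        H = np.linalg.lstsq(W, V, rcond=None)[0].clip(min=0)
        # Hold H fixed: solve the transposed problem for W, then clip.
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T.clip(min=0)
    # (3) Here the stopping rule is simply a fixed number of iterations.
    return W, H

# Usage: factor a small non-negative matrix into r = 2 latent features.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 5))
W, H = nmf_als(V, r=2)
print(np.linalg.norm(V - W @ H))  # reconstruction error
```

A threshold on the decrease in RMSE could replace the fixed iteration count if early stopping is wanted.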

NMF Algorithm

Popular Applications
- Computer vision: identifying / classifying objects; generally reducing the feature space of images
- Document clustering
- Recommender systems

http://www.cs.cmu.edu/~02317/slides/lec_7.pdf#page=17

Singular Value Decomposition

Singular Value Decomposition
A [n x m] = U [n x r] L [r x r] (V [m x r])^T
- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- L: r x r diagonal matrix (strength of each "concept"; r is the rank of the matrix)
- V: m x r matrix (m terms, r concepts)

Singular Value Decomposition m: # of users n: # of items k: # of latent features (also rank of A)

SVD - Properties
THEOREM: it is always possible to decompose a matrix A into A = U L V^T, where
- U, L, V: unique (*)
- U, V: column orthonormal (i.e., columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix)
- L: singular values are positive, and sorted in decreasing order

SVD - Properties
"Spectral decomposition" of the matrix:
[1 1 1 0 0]
[2 2 2 0 0]
[1 1 1 0 0]
[5 5 5 0 0]
[0 0 0 2 2]
[0 0 0 3 3]
[0 0 0 1 1]
= [u1 u2] x diag(l1, l2) x [v1 v2]^T

SVD - Interpretation
"Documents", "terms" and "concepts":
- U: document-to-concept similarity matrix
- V: term-to-concept similarity matrix
- L: its diagonal elements give the strength of each concept
Projection: best axis to project on ("best" = min sum of squares of projection errors)

SVD - Example
A = U L V^T - example:
Documents   data  information  retrieval  brain  lung
CS-TR1       1        1            1        0      0
CS-TR2       2        2            2        0      0
CS-TR3       1        1            1        0      0
CS-TR4       5        5            5        0      0
MED-TR1      0        0            0        2      2
MED-TR2      0        0            0        3      3
MED-TR3      0        0            0        1      1

SVD - Example
A = U L V^T - example (columns: data, info., retrieval, brain, lung; the first four rows are CS documents, the last three are MD documents):
A =
[1 1 1 0 0]
[2 2 2 0 0]
[1 1 1 0 0]
[5 5 5 0 0]
[0 0 0 2 2]
[0 0 0 3 3]
[0 0 0 1 1]
U =
[0.18 0   ]
[0.36 0   ]
[0.18 0   ]
[0.90 0   ]
[0    0.53]
[0    0.80]
[0    0.27]
L =
[9.64 0   ]
[0    5.29]
V^T =
[0.58 0.58 0.58 0    0   ]
[0    0    0    0.71 0.71]

SVD - Example
In the same decomposition, U is the doc-to-concept similarity matrix: its first column is the CS-concept and its second column is the MD-concept.

SVD - Example
In the same decomposition, the first diagonal entry of L, 9.64, is the strength of the CS-concept.

SVD - Example
In the same decomposition, V is the term-to-concept similarity matrix: the first row of V^T (0.58 0.58 0.58 0 0) corresponds to the CS-concept.

SVD - Dimensionality reduction
Q: how exactly is dimensionality reduction done?
A: set the smallest singular values to zero. In the example, L = diag(9.64, 5.29), so 5.29 is set to zero.

SVD - Dimensionality reduction
Reduced matrices: A ~ u1 x 9.64 x v1^T, with
u1 = [0.18 0.36 0.18 0.90 0 0 0]^T
v1^T = [0.58 0.58 0.58 0 0]

SVD - Dimensionality reduction
Reduced matrices:
[1 1 1 0 0]     [1 1 1 0 0]
[2 2 2 0 0]     [2 2 2 0 0]
[1 1 1 0 0]     [1 1 1 0 0]
[5 5 5 0 0]  ~  [5 5 5 0 0]
[0 0 0 2 2]     [0 0 0 0 0]
[0 0 0 3 3]     [0 0 0 0 0]
[0 0 0 1 1]     [0 0 0 0 0]
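The document-term example can be reproduced with NumPy's `np.linalg.svd`. The sketch below is illustrative only; the variable names (`A_k`, `k`) are chosen here, and the signs of the singular vectors returned by the library may differ from the slides, which does not change the product.

```python
import numpy as np

# Rows: CS-TR1..CS-TR4, MED-TR1..MED-TR3; columns: data, information, retrieval, brain, lung.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

# s holds the singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s[:2], 2))  # the two non-zero concept strengths

# Dimensionality reduction: keep only the k largest singular values.
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))    # the CS block survives, the MED block is zeroed
```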

Intro to Recommenders

Recommenders Where are recommenders used? What does our dataset look like? What are the high-level approaches to building a recommender? Content-based Collaborative filtering Matrix factorization How do we evaluate our recommender system? What are the challenges in our recommender system? What are the computational performance concerns?

Recommenders in Industry
- Netflix: 2/3 of the movies watched are recommended
- Google News: recommendations generate 38% more click-through
- Amazon: 35% of sales come from recommendations
- Stitch Fix: 100% of their revenue is based on recommendations

Business Goals What will the user like? What will the user buy? What will the user click?

Data Science Canon: Netflix's $1,000,000 Prize (Oct. 2006 - July 2009). Goal: Beat Netflix's own recommender by 10%. Took almost 3 years. The winning team used gradient boosted decision trees over the predictions of 500 other models. Netflix never deployed the winning algorithm.

What are the high-level approaches to building a recommender?
- Popularity: Make the same recommendation to every user, based only on the popularity of an item. E.g. Twitter Moments
- Content-based (aka, Content filtering): Predictions are made based on the properties/characteristics of an item. Other users' behavior is not considered. E.g. Pandora Radio
- Collaborative filtering: Only consider past user behavior (not content properties...). User-User similarity: ... Item-Item similarity: ... E.g. Netflix & Amazon Recommendations, Google Ads, Facebook Ads, Search, Friends Rec., News feed, Trending news, Rank Notifications, Rank Comments
- Matrix Factorization Methods: Find latent features (aka, factors)

Content-Based Recommendation

Content-based recommendation
- Recommendations are based on information on the content of items rather than on other users' opinions/interactions.
- Use a machine learning algorithm or a heuristic approach to induce a model of the user's preferences from examples based on a featural description of content.
- In content-based recommendations, the system tries to recommend items similar to those a given user has liked in the past.
- A pure content-based recommender system makes recommendations for a user based solely on the profile built up by analyzing the content of items which that user has rated in the past.

What is content?
What is the content of an item?
- It can be explicit attributes or characteristics of the item. For example, for a film: Genre: Action / adventure; Feature: Bruce Willis; Year: 1995
- It can also be textual content (title, description, table of contents, etc.). There are several techniques to compute the distance between two textual documents, and NLP techniques can be used to extract content features.
- It can be extracted from the signal itself (audio, image).

Content-based Recommendation
- Common for recommending text-based products (web pages, Usenet news messages, ...)
- Items to recommend are "described" by their associated features (e.g. keywords)
- The user model is structured in a "similar" way as the content: features/keywords more likely to occur in the preferred documents (lazy approach)
- Text documents are recommended based on a comparison between their content (words appearing) and the user model (a set of preferred words)
- The user model can also be a classifier based on whatever technique (Neural Networks, Naive Bayes...)

Advantages of content-based Recommendation No need for data on other users. No cold-start or sparsity problems. Able to recommend to users with unique tastes. Able to recommend new and unpopular items No first-rater problem. Can provide explanations of recommended items by listing content-features that caused an item to be recommended

Disadvantages of content-based Recommendation
- Requires content that can be encoded as meaningful features.
- Some kinds of items are not amenable to easy feature extraction methods (e.g. movies, music).
- Even for texts, IR techniques cannot consider multimedia information, aesthetic qualities, download time... If you rate a page positively, the reason may be unrelated to the presence of certain keywords.
- Users' tastes must be represented as a learnable function of these content features.
- Hard to exploit quality judgments of other users.
- Easy to overfit (e.g. for a user with few data points we may "pigeon hole" her).

Clustering in Recommender

Clustering Another way to make recommendations based on past purchases is to cluster customers Each cluster will be assigned typical preferences, based on preferences of customers who belong to the cluster Customers within each cluster will receive recommendations computed at the cluster level

Clustering

Clustering Pros: Clustering techniques can be used to work on aggregated data Can also be applied as a first step for shrinking the selection of relevant neighbors in a collaborative filtering algorithm and improve performance Can be used to capture latent similarities between users or items Cons: Recommendations (per cluster) may be less relevant than collaborative filtering (per individual)

Collaborative Filtering Recommender

Collaborative Filtering
- User-based: find similar users (both users read the same books); what was read by her is recommended to him.
- Item-based: find similar items (both items are read by the same users); a user who read the red item is recommended the green item.

Ingredients of Collaborative Filtering
- A list of m users and a list of n items
- Each user has a list of items with an associated opinion
  - Explicit opinion: a rating score
  - Sometimes the rating is implicit: purchase records or tracks listened to
- An active user for whom the CF prediction task is performed
- A metric for measuring similarity between users/items
- A method for selecting a subset of neighbors
- A method for predicting a rating for items not currently rated by the active user

General Steps of Collaborative Filtering 1. Identify set of ratings for the target/active user 2. Identify set of users most similar to the target/active user according to a similarity function (neighborhood formation) 3. Identify the products these similar users liked 4. Generate a prediction - rating that would be given by the target user to the product - for each one of these products 5. Based on this predicted rating recommend a set of top N products

What does our dataset look like?
Typically, data is a utility (rating) matrix, which captures user preferences/well-being:
- User rating of items
- User purchase decisions for items
- Unrated items are coded as 0 or missing
- Most items are unrated, so the matrix is sparse
Use a recommender to:
- Determine which attributes users think are important
- Predict ratings for unrated items
- Do better than trusting "expert" opinion

What does our dataset look like? Data can be: Explicit: User provided ratings (1 to 5 stars) User like/non-like Implicit: Infer user-item relationships from behavior More common Example: buy/not-buy; view/not-view To convert implicit to explicit, create a matrix of 1s (yes) and 0s (no)
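As a small illustration of the last point, implicit feedback can be turned into a 0/1 utility matrix. The event log, the user and item names, and the variable names below are all made up for this sketch.

```python
import numpy as np

# Hypothetical implicit feedback: (user, item) pairs, e.g. purchases or views.
events = [("alice", "book_a"), ("alice", "book_c"), ("bob", "book_b")]

users = sorted({u for u, _ in events})
items = sorted({i for _, i in events})
user_idx = {u: k for k, u in enumerate(users)}
item_idx = {i: k for k, i in enumerate(items)}

# Rows are users, columns are items; 1 = yes (interaction seen), 0 = no.
utility = np.zeros((len(users), len(items)), dtype=int)
for u, i in events:
    utility[user_idx[u], item_idx[i]] = 1

print(utility)
```

Note that a 0 here means "no interaction observed", which is not the same as a missing value in an explicit-rating matrix.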

Example 1: Explicit utility matrix We have explicit ratings, plus a bunch of missing values. What company might have data like this? Btw, we call this the utility matrix.

Example 2: Implicit utility matrix We have implicit feedback, and no missing values. What company might have data like this? Btw, we call this the utility matrix.

Explicit Rating vs. Implicit Feedback
Netflix completely relied on its users rating titles with stars when it began personalization some years ago. At one point, it had over 10 billion 5-star ratings, and more than 50% of all members had rated more than 50 titles. However, over time, Netflix realized that explicit star ratings were less relevant than other signals. Users would rate documentaries with 5 stars, and silly movies with just 3 stars, but still watch silly movies more often than those high-rated documentaries.

Two types of similarity-based Collaborative Filtering
- User-based: predict based on similarities between users
  - Performs well, but slow if there are many users
  - Use item-based CF if #Users >> #Items
- Item-based: predict based on similarities between items
  - Faster if you precompute item-item similarity
  - Usually #Users >> #Items, so item-based CF is most popular
  - Item-based tends to be more stable: items are often only in one category (e.g., action films) and stable over time, while users may like variety or change preferences over time
  - Items usually have more ratings than users, so items have more stable average ratings than users

User-User similarities We look at all pairs of users and calculate their similarity. How can we calculate the similarity of these row vectors?

Item-Item similarities We look at all pairs of items and calculate their similarity. How can we calculate the similarity of these column vectors?

User-User or Item-Item? Let: m = #users, n = #items. We want to compute the similarity of all pairs. What is the algorithmic efficiency of each approach? User-User: O(m^2 n). Item-Item: O(m n^2). Which one is better?

Similarity Metric using Euclidean Distance What's the range? But we're interested in a similarity, so let's do this instead: What's the range? When to use this?

Similarity Metric using Pearson Correlation What's the range? But we're interested in a similarity, so let's do this instead: What's the range? When to use this?

Similarity Metric using Cosine Similarity What's the range? But we're interested in a standardized similarity, so let's do this instead: What's the range? When to use this?

Similarity Metric using Jaccard Index What's the range? When to use this?
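The formulas from these four slides did not survive in the text, so the sketch below uses common textbook forms as an assumption: 1 / (1 + distance) for Euclidean, and a rescaling of Pearson and cosine from [-1, 1] to [0, 1]. The function names are chosen here for illustration and may not match what the slides showed.

```python
import numpy as np

def euclidean_similarity(a, b):
    # Distance is in [0, inf); this maps it to a similarity in (0, 1].
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def pearson_similarity(a, b):
    # Pearson correlation is in [-1, 1]; rescale it to [0, 1].
    r = np.corrcoef(a, b)[0, 1]
    return 0.5 + 0.5 * r

def cosine_similarity(a, b):
    # Cosine of the angle is in [-1, 1]; rescale it to [0, 1].
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 0.5 + 0.5 * cos

def jaccard_index(set_a, set_b):
    # Size of the intersection over size of the union, in [0, 1].
    return len(set_a & set_b) / len(set_a | set_b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(euclidean_similarity(a, b), pearson_similarity(a, b), cosine_similarity(a, b))
print(jaccard_index({1, 2, 3}, {2, 3, 4}))
```

Jaccard works on sets, so it suits implicit yes/no data; the other three work on rating vectors.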

The Similarity Matrix Pick a similarity metric, create the similarity matrix:
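As one possible instance, an item-item similarity matrix based on plain cosine similarity can be computed in a vectorized way. The function name `item_similarity_matrix` and the toy ratings are assumptions for this sketch; unrated entries are treated as 0.

```python
import numpy as np

def item_similarity_matrix(ratings):
    """ratings: users x items. Returns an items x items cosine-similarity matrix."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0        # avoid dividing by zero for items with no ratings
    unit = ratings / norms         # scale every item column to unit length
    return unit.T @ unit           # entry (i, j) is the cosine between items i and j

ratings = np.array([
    [5, 5, 0],
    [3, 3, 0],
    [0, 0, 4],
], dtype=float)
S = item_similarity_matrix(ratings)
print(np.round(S, 2))
```

For user-user similarity the same function can be applied to the transposed matrix.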

Item-Item based CF: How to make predictions Say user u hasn't rated item i. We want to predict the rating that this user would give this item. We order by descending predicted rating for a single user, and recommend the top k items to the user.

How to make predictions (using neighborhoods) This calculation of predicted ratings can be very costly. To mitigate this issue, we will only consider the n most similar items to an item when calculating the prediction.
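The prediction formula itself is not preserved in the text. A common choice, assumed here, is a similarity-weighted average of the user's own ratings over the neighborhood of the target item. The function name and the toy data are illustrative.

```python
import numpy as np

def predict_rating(ratings, item_sim, user, item, n_neighbors=2):
    """ratings: users x items with 0 = unrated; item_sim: items x items."""
    # The n items most similar to the target item (excluding the item itself).
    order = np.argsort(item_sim[item])[::-1]
    neighborhood = [j for j in order if j != item][:n_neighbors]
    # Of those neighbors, keep the ones this user has actually rated.
    rated = [j for j in neighborhood if ratings[user, j] > 0]
    if not rated:
        return None  # no basis for a prediction
    weights = item_sim[item, rated]
    if weights.sum() == 0:
        return None
    # Similarity-weighted average of the user's ratings on the neighbors.
    return float(weights @ ratings[user, rated] / weights.sum())

item_sim = np.array([
    [1.0, 0.5, 0.8],
    [0.5, 1.0, 0.2],
    [0.8, 0.2, 1.0],
])
ratings = np.array([[5.0, 3.0, 0.0]])
print(predict_rating(ratings, item_sim, user=0, item=2))
```

In a deployed system the neighborhoods would be precomputed, so only the weighted average is evaluated at request time.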

How to make predictions How would you modify the prediction formula below for a user-based recommender? Hint: should you compute similarity between users or items?

How do we evaluate our recommender system? Is it possible to do cross-validation like normal? Before we continue, let's review: Why do we perform cross-validation? Quick warning: Recommenders are inherently hard to validate. There is a lot of discussion in academia (research papers) and industry (Kaggle, Netflix, etc.) about this. There is no ONE answer for all datasets.

Cross-validation of ML models we have seen so far

Cross-validation for recommenders? For this slide, the question marks denote the holdout set (not missing values). We can calculate MSE between the targets and our predictions over the holdout set. (K-fold cross-validation is optional.) Recall: Why do we perform cross-validation? Why isn't the method above a true estimate of a recommender's performance in the field? Why would A/B testing be better?
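A holdout set of this kind can be sketched as follows: hide a random fraction of the known ratings, fit on the rest, and score only the hidden cells. The helper names `holdout_split` and `holdout_mse`, the 20% fraction, and the convention that 0 means unrated are assumptions for illustration.

```python
import numpy as np

def holdout_split(R, frac=0.2, seed=0):
    """Hide a random fraction of the known (non-zero) ratings of R."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(R)
    n_hold = int(frac * len(rows))
    pick = rng.choice(len(rows), size=n_hold, replace=False)
    held = (rows[pick], cols[pick])
    train = R.copy()
    train[held] = 0  # the recommender never sees these ratings
    return train, held

def holdout_mse(R, predictions, held):
    # MSE between the true ratings and the predictions, on held-out cells only.
    return float(np.mean((R[held] - predictions[held]) ** 2))

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
train, held = holdout_split(R)
print(np.count_nonzero(R), np.count_nonzero(train))
```

A model would be fit on `train` and its predicted matrix passed to `holdout_mse`.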

Alternate way to validate What's the deal with this? I.e. Why might we prefer doing this instead of the more normal cross-validation from the previous slide?

DON'T DO THIS! Why?

Cross-validation of Recommenders

How to deal with cold start?
- Scenario: A new user signs up. What will our recommender do (assume we're using item-item similarities)? One strategy: Force users to rate 5 items as part of the signup process AND/OR recommend popular items at first.
- Scenario: A new item is introduced. What will our recommender do (assume we're using item-item similarities)? One strategy: Put it in the "new releases" section until enough users rate it AND/OR use item metadata if any exists.

How to deal with cold start?
- Scenario: A new user signs up. What will our recommender do (assume we're YouTube and we're using item popularity to make recommendations)? This really isn't a problem...
- Scenario: A new item is introduced. What will our recommender do (assume we're YouTube and we're using item popularity to make recommendations)? One strategy: Don't use total number of views as the popularity metric (we'd have a rich-get-richer situation). Use something else...

Deploying the recommender In the middle of the night: Compute similarities between all pairs of items. Compute the neighborhood of each item. At request time: Predict scores for candidate items, and make a recommendation.

Matrix Factorization for Recommendation

Matrix Factorization for Recommendation Recall: An explicit-rating utility matrix is usually VERY sparse. We've previously used SVD to find latent features (aka, factors)... Would SVD be good for this sparse utility matrix? (Hint: No!) What's the problem with using SVD on this sparse utility matrix?

Matrix Factorization for Recommendation UV Decomposition (UVD) UVD via Stochastic Gradient Descent (SGD) Matrix Factorization for Recommendation: Basic system: UVD + SGD... FTW Intermediate topics: regularization accounting for biases

UV Decomposition (UVD) You choose k. UV approximates R by necessity if k is less than the rank of R. Usually choose: k << min(n, m) Compute U and V such that: Least Squares!

UV Decomposition (UVD)

Evaluating factorization To evaluate how well the factorization represents the original data, we use RMSE Root Mean Squared Error

UV Decomposition Algorithm

Evaluating Factorization To get the formulas for the updates, we take the partial derivative of the error formula

Evaluating Factorization To make prediction of ratings, we multiply the U and V matrices together. So to get a single rating, it's the dot product of one row in U with one column in V. Now the squared error can be calculated by:

Calculate the gradient Gradient Descent

Updating Formula Gradient Descent (cont'd)
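The formulas from the gradient slides are not preserved in the text. One standard way to write them, assuming r_ij is an observed entry of R, u_i is row i of U, v_j is column j of V, and alpha is a learning rate (a symbol introduced here for illustration), is:

```latex
\begin{aligned}
\hat{r}_{ij} &= u_i \cdot v_j = \sum_{f=1}^{k} u_{if}\, v_{fj},
  \qquad e_{ij} = r_{ij} - \hat{r}_{ij} \\
\frac{\partial e_{ij}^2}{\partial u_{if}} &= -2\, e_{ij}\, v_{fj},
  \qquad
\frac{\partial e_{ij}^2}{\partial v_{fj}} = -2\, e_{ij}\, u_{if} \\
u_{if} &\leftarrow u_{if} + 2 \alpha\, e_{ij}\, v_{fj},
  \qquad
v_{fj} \leftarrow v_{fj} + 2 \alpha\, e_{ij}\, u_{if}
\end{aligned}
```

Each update moves the parameters against the gradient of the squared error for one observed rating, which is why missing ratings never need to be imputed.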

Regularization Since now we're fitting a large parameter set to sparse data, you'll most certainly need to regularize! Tune lambda: the amount of regularization
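A minimal sketch of UVD fitted by SGD with L2 regularization might look like this. The function names, the default learning rate `lr`, the regularization strength `lam`, the epoch count, and the convention that 0 marks a missing rating are all assumptions for illustration, not values from the course.

```python
import numpy as np

def uvd_sgd(R, k, lr=0.01, lam=0.02, n_epochs=200, seed=0):
    """Fit R ~= U @ V using only the observed (non-zero) entries of R."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(k, n))
    observed = np.argwhere(R > 0)      # (row, col) index pairs of known ratings
    for _ in range(n_epochs):
        rng.shuffle(observed)          # visit the ratings in random order
        for i, j in observed:
            err = R[i, j] - U[i] @ V[:, j]
            u_old = U[i].copy()
            # Gradient step on the squared error plus lam * (|u_i|^2 + |v_j|^2).
            U[i] += lr * (2 * err * V[:, j] - 2 * lam * U[i])
            V[:, j] += lr * (2 * err * u_old - 2 * lam * V[:, j])
    return U, V

def rmse(R, U, V):
    # Root mean squared error over the observed entries only.
    mask = R > 0
    return float(np.sqrt(np.mean((R[mask] - (U @ V)[mask]) ** 2)))

R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
U, V = uvd_sgd(R, k=2)
print(rmse(R, U, V))
print(np.round(U @ V, 2))  # the zeros of R are now filled with predictions
```

In practice `lam` would be tuned on a holdout set rather than fixed.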

Accounting for Biases (let's capture our domain knowledge!)
In practice, much of the observed variation in rating values is due to item bias and user bias:
- Some items (e.g. movies) have a tendency to be rated high, some low.
- Some users have a tendency to rate high, some low.
We can capture this prior domain knowledge using a few bias terms:
- The overall bias of the rating by user i for item j
- The overall average rating (i.e. the overall bias)
- User i's average deviation from the overall average
- Item j's average deviation from the overall average

New Prediction
The 4 parts of a prediction (the prediction of user i rating item j):
- The average rating
- User i's tendency to deviate from the average
- Item j's tendency to deviate from the average
- The prediction of how user i will interact with item j

Accounting for Biases (the new cost function) Ratings are now estimated as: The new cost function, with the biases included: New part! New part!
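The formulas on this slide are not preserved in the text. A common formulation, with symbols chosen here as assumptions (mu for the overall average rating, b_i for user i's deviation, c_j for item j's deviation, lambda for the amount of regularization), is:

```latex
\begin{aligned}
\hat{r}_{ij} &= \mu + b_i + c_j + u_i \cdot v_j \\
\text{cost} &= \sum_{(i,j)\ \text{observed}}
  \Big[ \big( r_{ij} - \mu - b_i - c_j - u_i \cdot v_j \big)^2
  + \lambda \big( \lVert u_i \rVert^2 + \lVert v_j \rVert^2 + b_i^2 + c_j^2 \big) \Big]
\end{aligned}
```

Compared with the unbiased cost, the new parts are the bias terms inside the squared error and the bias terms inside the regularization penalty.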

UVD vs NMF
- UVD: By convention: R ~= UV
- NMF: By convention: V ~= WH. Same as UVD, but with one extra constraint: all values of V, W, and H must be non-negative!
- NMF is a specialization of UVD! Both are approximate factorizations, and both optimize to reduce the RSS.

UVD vs NMF (continued) UVD and NMF are both solved using either: Alternating Least Squares (ALS) Stochastic Gradient Descent (SGD)

ALS vs SGD
- ALS: Parallelizes very well. Available in Spark/MLlib. Only appropriate for matrices that don't have missing values.
- SGD: Faster (if on a single machine). Requires tuning the learning rate. Anecdotal evidence of better results. Works with missing values.

UVD (or NMF) + SGD FTW!
UVD + SGD makes a lot of sense for recommender systems. In fact, UVD + SGD is a best-in-class option for many recommender domains:
- No need to impute missing values.
- Use regularization to avoid overfitting.
- Optionally include bias terms to communicate prior knowledge.
- Can handle time dynamics (e.g. change in user preference over time).
- Used by the winning entry in the Netflix challenge.

From the paper: Matrix Factorization Techniques for Recommender Systems. Root mean square error over the Netflix dataset using various matrix factorization models. Numbers on the chart denote each model's dimensionality (k). The more refined models perform better (have lower error). Netflix's in-house model performs at RMSE=0.9514 on this dataset, so even the simple matrix factorization models are beating it! Read the paper for details; it's a good read!

Summary

Summary Non-negative Matrix Factorization Singular Value Decomposition Content-Based Recommenders Collaborative Filtering Recommenders Item-Item User-User Matrix Factorization Recommenders UV Decomposition