Large-scale Collaborative Ranking in Near-Linear Time

Similar documents
SQL-Rank: A Listwise Approach to Collaborative Ranking

ECS289: Scalable Machine Learning

Large-scale Collaborative Ranking in Near-Linear Time

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning

Decoupled Collaborative Ranking

ECS171: Machine Learning

ECS289: Scalable Machine Learning

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let s put it to use!

Scaling Neighbourhood Methods

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

CS260: Machine Learning Algorithms

PU Learning for Matrix Completion

Convex Optimization Algorithms for Machine Learning in 10 Slides

ECS289: Scalable Machine Learning

STA141C: Big Data & High Performance Statistical Computing

Optimization Methods for Machine Learning

Maximum Margin Matrix Factorization for Collaborative Ranking

Large-scale Linear RankSVM

Large-scale Matrix Factorization. Kijung Shin Ph.D. Student, CSD

Click-Through Rate prediction: TOP-5 solution for the Avazu contest

Restricted Boltzmann Machines for Collaborative Filtering

Lecture 9: September 28

A Simple Algorithm for Nuclear Norm Regularized Problems

Mixed Membership Matrix Factorization

Mixed Membership Matrix Factorization

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Andriy Mnih and Ruslan Salakhutdinov

Nonnegative Matrix Factorization

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Big & Quic: Sparse Inverse Covariance Estimation for a Million Variables

Algorithms for Collaborative Filtering

Ad Placement Strategies

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Using SVD to Recommend Movies

ECS171: Machine Learning

Semestrial Project - Expedia Hotel Ranking

Chapter 2. Optimization. Gradients, convexity, and ALS

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Kaggle.

A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization

Presentation in Convex Optimization

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge

Matrix Factorization Techniques for Recommender Systems

Collaborative Recommendation with Multiclass Preference Context

Binary Classification, Multi-label Classification and Ranking: A Decision-theoretic Approach

Statistical Ranking Problem

A Parallel SGD method with Strong Convergence

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

1 [15 points] Frequent Itemsets Generation With Map-Reduce

Collaborative Filtering Applied to Educational Data Mining

Spectral k-support Norm Regularization

Variables. Cho-Jui Hsieh The University of Texas at Austin. ICML workshop on Covariance Selection Beijing, China June 26, 2014

A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Fast Coordinate Descent methods for Non-Negative Matrix Factorization

Matrix Factorization and Factorization Machines for Recommender Systems

Large-Scale Social Network Data Mining with Multi-View Information. Hao Wang

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems

STA141C: Big Data & High Performance Statistical Computing

Gradient Boosting, Continued

Low Rank Matrix Completion Formulation and Algorithm

Recommendation Systems

Generative Models for Discrete Data

Approximation. Inderjit S. Dhillon Dept of Computer Science UT Austin. SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina.

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Parallel Matrix Factorization for Recommender Systems

Neural Word Embeddings from Scratch

CS-E4830 Kernel Methods in Machine Learning

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

What is Happening Right Now... That Interests Me?

Ranking from Crowdsourced Pairwise Comparisons via Matrix Manifold Optimization

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Circle-based Recommendation in Online Social Networks

* Matrix Factorization and Recommendation Systems

Learning to Recommend Point-of-Interest with the Weighted Bayesian Personalized Ranking Method in LBSNs

Collaborative topic models: motivations cont

Fast Logistic Regression for Text Categorization with Variable-Length N-grams

A Statistical View of Ranking: Midway between Classification and Regression

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Recommendation Systems

a Short Introduction

Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent

CS60021: Scalable Data Mining. Large Scale Machine Learning

Factor Modeling for Advertisement Targeting

Collaborative Filtering via Ensembles of Matrix Factorizations

Collaborative Filtering. Radek Pelánek

STA141C: Big Data & High Performance Statistical Computing

ECS289: Scalable Machine Learning

Undirected Graphical Models

Data Mining Techniques

A Unified Algorithm for One-class Structured Matrix Factorization with Side Information

Graphical Models for Collaborative Filtering

Improved Bounded Matrix Completion for Large-Scale Recommender Systems

Machine Learning Techniques

Transcription:

Large-scale Collaborative Ranking in Near-Linear Time. Liwei Wu, Depts. of Statistics and Computer Science, UC Davis. KDD '17, Halifax, Canada, August 13-17, 2017. Joint work with Cho-Jui Hsieh and James Sharpnack.

Recommender Systems: Netflix Problem

Matrix Factorization Approach: $R \approx WH^T$

Collaborative Filtering: Latent Factor Model

Collaborative Filtering: Matrix Factorization Approach. The latent factor model fits the ratings directly in the objective function: $\min_{W \in \mathbb{R}^{d_1 \times r},\, H \in \mathbb{R}^{d_2 \times r}} \sum_{(i,j)\in\Omega} (R_{ij} - w_i^T h_j)^2 + \frac{\lambda}{2}\big(\|W\|_F^2 + \|H\|_F^2\big)$, where $\Omega = \{(i,j) : R_{ij} \text{ is observed}\}$ and the regularization terms avoid over-fitting. Criticisms: (1) calibration drawback: users have different standards for their ratings; (2) performance is measured by the quality of rating prediction, not by the ranking of items for each user.
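To make the objective concrete, here is a minimal Julia sketch (not the talk's code) that evaluates this regularized matrix-factorization loss over the observed entries; the triple format for Ω and the variable names are assumptions made for illustration.

```julia
# Minimal sketch (assumed names): evaluate the regularized matrix-factorization
# objective over observed entries.
using LinearAlgebra

# W: d1×r user factors, H: d2×r item factors,
# Ω: vector of (i, j, R_ij) triples of observed ratings, λ: regularization weight
function mf_objective(W, H, Ω, λ)
    loss = 0.0
    for (i, j, Rij) in Ω
        loss += (Rij - dot(W[i, :], H[j, :]))^2          # squared error on one observed rating
    end
    return loss + (λ / 2) * (sum(abs2, W) + sum(abs2, H)) # Frobenius-norm regularization
end

# Toy usage with random factors and three observed ratings
W, H = randn(4, 2), randn(5, 2)
Ω = [(1, 2, 4.0), (2, 5, 3.0), (3, 1, 5.0)]
println(mf_objective(W, H, Ω, 0.1))
```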

Collaborative Filtering: Collaborative Ranking Approach. Focus on the ranking of items rather than the ratings in the model. Performance is measured by the ranking order of the top k items for each user. How?

Collaborative Filtering: Collaborative Ranking Approach. Focus on the ranking of items rather than the ratings in the model. Performance is measured by the ranking order of the top k items for each user. How? Given $d_1$ users and $d_2$ movies, each user has a subset of observed movie comparisons. Goal: predict movie rankings for each user.

Collaborative Ranking: Applications. Collaborative ranking can be applied to classical recommender systems. For classical data (e.g., Netflix), we can transform the original ratings into pairwise comparisons, as in the sketch below. With the same data size, a ranking loss outperforms a point-wise loss.
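As an illustration of that transformation, here is a hedged Julia sketch that turns one user's observed ratings into pairwise comparisons; the function and variable names are made up for this example.

```julia
# Hedged sketch: turn one user's observed ratings into pairwise comparisons
# (i, j, k) meaning "user i prefers item j to item k". Names are illustrative.
function pairwise_comparisons(i, ratings)            # ratings: Dict(item => observed rating)
    comps = Tuple{Int,Int,Int}[]
    items = collect(keys(ratings))
    for j in items, k in items
        if ratings[j] > ratings[k]                    # ties generate no comparison
            push!(comps, (i, j, k))
        end
    end
    return comps
end

# Example: user 7 rated three movies; only strict preferences become comparisons.
println(pairwise_comparisons(7, Dict(10 => 5.0, 11 => 3.0, 12 => 3.0)))
```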

Collaborative Ranking: Assumption. Assumption: the underlying scoring matrix is low-rank, $X = UV^T$.

Collaborative Ranking: Model. $\min_{U,V} \sum_{(i,j,k)\in\Omega} l\big(Y_{i,j,k}\,[(UV^T)_{ij} - (UV^T)_{ik}]\big) + \lambda\big(\|U\|_F^2 + \|V\|_F^2\big)$. The loss function $l$ we use is the L2 hinge loss $l(a) = \max(0, 1 - a)^2$. The set $\Omega$ contains triples $(i,j,k)$ such that user $i$ rates item $j$ above item $k$, with $Y_{i,j,k} = 1$. If a user rates $\bar{d}_2$ movies, there are $O(\bar{d}_2^2)$ pairs per user.

Collaborative Ranking: Model. $\min_{U,V} \sum_{(i,j,k)\in\Omega} l\big(Y_{i,j,k}\,[(UV^T)_{ij} - (UV^T)_{ik}]\big) + \lambda\big(\|U\|_F^2 + \|V\|_F^2\big)$. The loss function $l$ we use is the L2 hinge loss $l(a) = \max(0, 1 - a)^2$. The set $\Omega$ contains triples $(i,j,k)$ such that user $i$ rates item $j$ above item $k$, with $Y_{i,j,k} = 1$. If a user rates $\bar{d}_2$ movies, there are $O(\bar{d}_2^2)$ pairs per user. Pros and cons: + better prediction and recommendation; - slow, since it involves $O(|\Omega|) = O(d_1 \bar{d}_2^2)$ loss terms (original matrix completion has only $O(d_1 \bar{d}_2)$ terms).
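A minimal Julia sketch of this objective, assuming Ω is stored as the comparison triples from the earlier example and $Y_{i,j,k} = 1$ by construction:

```julia
# Minimal sketch of the collaborative-ranking objective with the squared
# (L2) hinge loss l(a) = max(0, 1 - a)^2; Y_ijk = 1 by construction of Ω.
using LinearAlgebra

l2hinge(a) = max(0.0, 1.0 - a)^2

# U: d1×r user factors, V: d2×r item factors, Ω: vector of (i, j, k) triples
function ranking_objective(U, V, Ω, λ)
    obj = 0.0
    for (i, j, k) in Ω
        a = dot(U[i, :], V[j, :]) - dot(U[i, :], V[k, :])   # (UV^T)_ij - (UV^T)_ik
        obj += l2hinge(a)
    end
    return obj + λ * (sum(abs2, U) + sum(abs2, V))          # λ(|U|_F^2 + |V|_F^2)
end

# Toy usage
U, V = randn(3, 2), randn(4, 2)
println(ranking_objective(U, V, [(1, 2, 3), (2, 1, 4)], 0.1))
```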

Collaborative Ranking: Existing Methods. State-of-the-art algorithms: CollRank (Park et al., ICML 2015), with a similar problem formulation, and RobiRank (Yun et al., NIPS 2014), which focuses on binary observations. All of them need $O(d_1 \bar{d}_2^2 r)$ time and $O(d_1 \bar{d}_2^2)$ memory, where $d_1$ is the number of users, $\bar{d}_2$ is the average number of ratings per user, and $r$ is the target rank. On the full Netflix dataset, $d_1 \approx 2.5$ million and $\bar{d}_2 \approx 200$; with $r = 100$: time per iteration $d_1 \bar{d}_2^2 r \approx 10^{13}$, memory $d_1 \bar{d}_2^2 \approx 10^{11}$ (about 400 GB). They cannot scale to the full Netflix data and would need days to train.

Our Algorithms: Primal-CR and Primal-CR++. Classical matrix factorization: $O(d_1 \bar{d}_2 r)$ time, $O(d_1 \bar{d}_2)$ memory. Previous collaborative ranking: $O(d_1 \bar{d}_2^2 r)$ time, $O(d_1 \bar{d}_2^2)$ memory. Primal-CR: $O(d_1 \bar{d}_2^2 + d_1 \bar{d}_2 r)$ time, $O(d_1 \bar{d}_2)$ memory. Primal-CR++: $O(d_1 \bar{d}_2 (r + \log \bar{d}_2))$ time, $O(d_1 \bar{d}_2)$ memory.

General Framework: Alternating Minimization. Consider the L2-hinge loss: $f(U, V) = \sum_{(i,j,k)\in\Omega} \max\big(0,\, 1 - Y_{i,j,k}(u_i^T v_j - u_i^T v_k)\big)^2 + \lambda \|U\|_F^2 + \lambda \|V\|_F^2$. Use alternating minimization: for $t = 1, 2, \dots$, fix $V$ and update $U$ by $U \leftarrow \arg\min_U f(U, V)$, then fix $U$ and update $V$ by $V \leftarrow \arg\min_V f(U, V)$.
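The outer loop can be sketched as below. This toy Julia version takes plain gradient steps on each subproblem in place of the Newton/CG subproblem solver the talk uses, and the step size η is an arbitrary choice for illustration.

```julia
# Hedged sketch of the alternating scheme on the pairwise L2-hinge objective.
# Plain gradient steps stand in for the Newton/CG subproblem solver; η is an
# illustrative step size, not a tuned value.
using LinearAlgebra

dl2hinge(a) = a < 1.0 ? -2.0 * (1.0 - a) : 0.0       # derivative of max(0, 1-a)^2

function grad_step!(U, V, Ω, λ, η; update_U::Bool)
    G = 2λ .* (update_U ? U : V)                      # gradient of the regularizer
    for (i, j, k) in Ω
        g = dl2hinge(dot(U[i, :], V[j, :]) - dot(U[i, :], V[k, :]))
        if update_U
            G[i, :] .+= g .* (V[j, :] .- V[k, :])     # da/du_i = v_j - v_k
        else
            G[j, :] .+= g .* U[i, :]                  # da/dv_j =  u_i
            G[k, :] .-= g .* U[i, :]                  # da/dv_k = -u_i
        end
    end
    update_U ? (U .-= η .* G) : (V .-= η .* G)
end

function alternating_minimization!(U, V, Ω, λ; η = 0.01, iters = 20)
    for _ in 1:iters
        grad_step!(U, V, Ω, λ, η; update_U = true)    # fix V, move U
        grad_step!(U, V, Ω, λ, η; update_U = false)   # fix U, move V
    end
    return U, V
end
```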

Update V. Approach: Newton's method + conjugate gradient. Step 1: compute the gradient and Hessian, $g = \nabla f(V) \in \mathbb{R}^{d_2 r}$ and $H = \nabla^2 f(V) \in \mathbb{R}^{d_2 r \times d_2 r}$, and update $V \leftarrow V - \eta H^{-1} g$ ($\eta$ is the step size). However, $H$ is a huge matrix ($d_2 r$ can be millions), so we cannot invert $H$ directly. Instead, use Conjugate Gradient (CG) to iteratively solve $Hx = g$; each CG iteration only needs a matrix-vector product $Hp$. Questions: (1) How to compute $g$? (2) How to compute $Hp$?
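For the CG step, a textbook matrix-free conjugate gradient loop (not the authors' implementation) only needs a callback that returns the Hessian-vector product:

```julia
# Textbook matrix-free conjugate gradient: solve H x = g given only a function
# hv(p) returning the product H*p, as in the Newton step sketched above.
using LinearAlgebra

function conjugate_gradient(hv, g; tol = 1e-8, maxiter = 100)
    x = zeros(length(g))
    r = copy(g)                          # residual g - H*x, with x = 0 initially
    p = copy(r)
    rs = dot(r, r)
    for _ in 1:maxiter
        Hp = hv(p)
        α = rs / dot(p, Hp)
        x .+= α .* p
        r .-= α .* Hp
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return x
end

# Toy check on a small symmetric positive-definite system
A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
println(conjugate_gradient(p -> A * p, b))   # ≈ A \ b
```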

Compute Gradient. How to compute the gradient? $\nabla_V f(V) = \sum_{i=1}^{d_1} \sum_{(j,k)\in\Omega_i} 2\max\big(0,\, 1 - (u_i^T v_j - u_i^T v_k)\big)\,(u_i e_k^T - u_i e_j^T) + \lambda V$. Naive computation takes $O(|\Omega| r) = O(d_1 \bar{d}_2^2 r)$ time, but $d_1 \bar{d}_2^2 r \approx 10^{13}$ for Netflix.

Compute Gradient. How to compute the gradient? $\nabla_V f(V) = \sum_{i=1}^{d_1} \sum_{(j,k)\in\Omega_i} 2\max\big(0,\, 1 - (u_i^T v_j - u_i^T v_k)\big)\,(u_i e_k^T - u_i e_j^T) + \lambda V$. Naive computation takes $O(|\Omega| r) = O(d_1 \bar{d}_2^2 r)$ time, but $d_1 \bar{d}_2^2 r \approx 10^{13}$ for Netflix. Idea (Primal-CR++): fix $k$ and do a linear scan over $j$ after sorting. For simplicity of illustration: assume the observed rating at $k$ is 1, ignore the constant 1 in the L2 hinge loss, and ignore the $u_i e_j^T$ term in the gradient.

Even better way (rethinking the main computation). Consider all the items for a user. Sort the items by their predicted values $u^T v_j$ ($O(\bar{d}_2 \log \bar{d}_2)$ time). Now assume the observed ratings can only be 0 or 1.

Even better way (rethinking the main computation). Look at item 2 and sum over all items with a larger predicted value but a smaller observed rating.

Even better way (rethinking the main computation). Look at item 1 and sum over all items with a larger predicted value but a smaller observed rating.

Even better way (rethinking the main computation). Look at item 6 and sum over all items with a larger predicted value but a smaller observed rating.

The linear-time algorithm for 0/1 ratings. Let $s_k = u^T v_k$ (the current prediction). Maintain the prefix sum at $k$: $P = \sum_{j:\, s_j > s_k,\, R_j = 0} s_j$. Linear-scan algorithm: initially $P = \sum_{j:\, R_j = 0} s_j$; then for $j = \pi(1), \pi(2), \dots$: if $R_j = 0$, update $P \leftarrow P - s_j$; if $R_j = 1$, add $P \cdot u$ to $g_j$.
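A hedged Julia sketch of this scan for the simplified 0/1 case: it returns, for every item rated 1, the prefix sum P of predicted scores of 0-rated items ranked above it. The full Primal-CR++ gradient also tracks counts and the hinge constant, which this illustration omits.

```julia
# Hedged sketch of the linear-scan prefix-sum trick for 0/1 ratings.
# s[j] is the predicted score u'v_j of item j, R[j] in {0, 1} its observed rating.
# Returns, for each item j with R[j] = 1, P_j = sum of s_k over 0-rated k with s_k > s_j.
function prefix_sums_for_positives(s::Vector{Float64}, R::Vector{Int})
    P = sum(s .* (R .== 0))          # initially P = sum of s_j over all 0-rated items
    out = Dict{Int,Float64}()
    for j in sortperm(s)             # scan items in increasing predicted score
        if R[j] == 0
            P -= s[j]                # this 0-rated item no longer lies above later items
        else
            out[j] = P               # remaining 0-rated items all have s_k > s_j
        end
    end
    return out
end

# Toy example with six items
s = [2.1, 0.3, 1.5, -0.4, 0.9, 1.8]
R = [0, 1, 1, 0, 1, 0]
println(prefix_sums_for_positives(s, R))
```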

General case: For T levels of ratings

General case: for T levels of ratings. Each step needs two operations: (1) compute $P_1 + \dots + P_l$ for some $l$; (2) update one of the $P_j$.

General case: for T levels of ratings. Each step needs two operations: (1) compute $P_1 + \dots + P_l$ for some $l$; (2) update one of the $P_j$. Both can be done in $O(\log T)$ time per query using a segment tree or a Fenwick tree!

General case: for T levels of ratings. Each step needs two operations: (1) compute $P_1 + \dots + P_l$ for some $l$; (2) update one of the $P_j$. Both can be done in $O(\log T)$ time per query using a segment tree or a Fenwick tree! Overall time complexity: $d_1 \bar{d}_2 r + d_1 \bar{d}_2 \log \bar{d}_2 \approx 5.1 \times 10^{10}$, compared with $d_1 \bar{d}_2^2 r \approx 10^{13}$ before: a speed-up of nearly three orders of magnitude! For reference, the time complexity of standard matrix factorization is $d_1 \bar{d}_2 r \approx 5 \times 10^{10}$.
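For reference, here is a standard Fenwick (binary indexed) tree over the T rating levels, supporting the two operations named above in O(log T) each; this is a generic textbook implementation, not the authors' code.

```julia
# Generic Fenwick (binary indexed) tree: point update of P_l and prefix-sum
# query P_1 + ... + P_l, both in O(log T). Indices are 1-based rating levels.
mutable struct Fenwick
    tree::Vector{Float64}
end
Fenwick(T::Int) = Fenwick(zeros(T))

function update!(f::Fenwick, l::Int, δ::Float64)   # P_l ← P_l + δ
    while l <= length(f.tree)
        f.tree[l] += δ
        l += l & (-l)                              # jump to the next covering node
    end
end

function prefixsum(f::Fenwick, l::Int)             # returns P_1 + ... + P_l
    s = 0.0
    while l > 0
        s += f.tree[l]
        l -= l & (-l)                              # drop the lowest set bit
    end
    return s
end

# Toy usage with T = 5 rating levels
f = Fenwick(5)
update!(f, 2, 3.0)
update!(f, 4, 1.5)
println(prefixsum(f, 3))   # 3.0
println(prefixsum(f, 5))   # 4.5
```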

Experiments. NDCG@10 measures the quality of the top-10 recommendations: $\text{NDCG@10} = \frac{1}{d_1} \sum_{i=1}^{d_1} \frac{\text{DCG@10}(i, \pi_i)}{\text{DCG@10}(i, \pi_i^*)}$, where $\text{DCG@10}(i, \pi_i) = \sum_{k=1}^{10} \frac{2^{M_{i \pi_i(k)}} - 1}{\log_2(k + 1)}$, $i$ indexes users, $\pi_i$ is the predicted ranking of items for user $i$, and $\pi_i^*$ is the ranking induced by the true ratings $M$. Comparisons: (1) single core, subsampled data; (2) parallelization; (3) single core, full data.
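A hedged single-user Julia sketch of this metric (averaging over users would give the full NDCG@10); the gains 2^rating - 1 and the log2(k+1) discounts follow the formula above, while the function and variable names are assumptions.

```julia
# Hedged sketch of NDCG@10 for one user: relevance of the item at predicted
# rank k is its true rating, and DCG is normalized by the ideal ordering's DCG.
function dcg_at_k(ratings_in_order::Vector{Float64}, k::Int)
    sum((2.0^ratings_in_order[p] - 1.0) / log2(p + 1) for p in 1:min(k, length(ratings_in_order)))
end

# scores: predicted scores for one user's test items; ratings: their true ratings
function ndcg_at_10(scores::Vector{Float64}, ratings::Vector{Float64})
    predicted_order = ratings[sortperm(scores; rev = true)]   # ratings ordered by predicted score
    ideal_order     = sort(ratings; rev = true)               # ratings ordered by true rating
    return dcg_at_k(predicted_order, 10) / dcg_at_k(ideal_order, 10)
end

# Toy example for a single user: the predicted order matches the true order, so NDCG = 1
println(ndcg_at_10([0.9, 0.1, 0.5, 0.7], [5.0, 1.0, 3.0, 4.0]))
```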

Comparisons: Single Core, Subsampled Data. We subsampled each user to have exactly 200 ratings in the training set and used the rest of the ratings as the test set, since previous approaches cannot scale up. Users with fewer than 200 + 10 ratings were not included. Datasets: MovieLens1M (6,040 × 3,952) and Netflix (2,649,430 × 17,771).

Comparisons: Parallelization. Our algorithm scales well in the multi-core shared-memory setting (see the comparisons for 1, 4, and 8 cores). Primal-CR is still much faster than CollRank when 8 cores are used. Dataset: MovieLens10M (71,567 × 65,134).

Comparisons: Single Core, Full Data. Due to the $O(|\Omega| r)$ complexity, existing algorithms always sub-sample a limited number of pairs per user. Our algorithm is the first ranking-based algorithm that can scale to the full Netflix data set on a single machine, without sub-sampling. A natural question: does using more training data help us predict and recommend better?

Comparisons: Single Core, Full Data. Due to the $O(|\Omega| r)$ complexity, existing algorithms always sub-sample a limited number of pairs per user. Our algorithm is the first ranking-based algorithm that can scale to the full Netflix data set on a single machine, without sub-sampling. A natural question: does using more training data help us predict and recommend better? The answer is yes!

Comparisons: Single Core, Full Data. Randomly choose 10 ratings per user as test data and, out of the remaining ratings, randomly choose up to C ratings per user as training data. Users with fewer than 20 ratings were not included. We compare NDCG@10 for C = 100, 200, and $\bar{d}_2$ (all available ratings). Datasets: MovieLens1M (6,040 × 3,952) and Netflix (2,649,430 × 17,771).
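A small Julia sketch of this per-user split, with hypothetical helper names: 10 ratings go to the test set, up to C of the remaining ratings go to the training set, and users with fewer than 20 ratings are skipped.

```julia
# Hedged sketch of the per-user train/test split described above.
using Random

# item_ids: the items one user has rated; C: cap on training ratings per user
function split_user(item_ids::Vector{Int}, C::Int)
    length(item_ids) < 20 && return nothing        # skip users with fewer than 20 ratings
    shuffled = shuffle(item_ids)
    test  = shuffled[1:10]                         # 10 held-out test ratings
    train = shuffled[11:min(10 + C, end)]          # up to C training ratings
    return train, test
end

# Example: a user with 35 rated items and C = 20
println(split_user(collect(1:35), 20))
```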

Conclusions. We show that collaborative ranking (CR) can be used to replace matrix factorization in recommender systems. We show that CR can be solved efficiently, with almost the same time complexity as matrix factorization. We show that it is always best to use all available pairwise comparisons when possible (subsampling gives suboptimal recommendations for the top k items). Julia code: https://github.com/wuliwei9278/ml-1m. C++ code (2x faster than the Julia code, with support for multithreading): https://github.com/wuliwei9278/primalcr. Thank you!