Large-scale Matrix Factorization Kijung Shin Ph.D. Student, CSD

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 2/99

Roadmap Matrix Factorization (review) << Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 3/99

Matrix Factorization: Problem. Given: V, an n-by-m matrix, possibly with missing values; r, a target rank (a scalar), usually r << n and r << m. Large-scale Matrix Factorization (by Kijung Shin) 4/99

Matrix Factorization: Problem. Given: V, an n-by-m matrix, possibly with missing values; r, a target rank (a scalar), usually r << n and r << m. Find: W, an n-by-r matrix, and H, an r-by-m matrix, both without missing values. Large-scale Matrix Factorization (by Kijung Shin) 5/99

Matrix Factorization: Problem. Goal: WH ≈ V. Large-scale Matrix Factorization (by Kijung Shin) 6/99

Loss Function. $L(V, W, H) = \sum_{(i,j) \in Z} (V_{ij} - [WH]_{ij})^2 + \cdots$ (the regularization term is on the next slide), where Z is the set of indices of non-missing entries, i.e., $(i,j) \in Z$ iff $V_{ij}$ is not missing; $V_{ij}$ is the (i,j)-th entry of V and $[WH]_{ij}$ is the (i,j)-th entry of WH. Goal: to make WH similar to V. Large-scale Matrix Factorization (by Kijung Shin) 7/99

Loss Function (cont.). $L(V, W, H) = \sum_{(i,j) \in Z} (V_{ij} - [WH]_{ij})^2 + \lambda(\|W\|_F^2 + \|H\|_F^2)$, where $\lambda$ is the regularization parameter. Goal: to prevent overfitting (by making the entries of W and H close to zero). Frobenius norm: $\|W\|_F^2 = \sum_{i=1}^{n} \sum_{k=1}^{r} W_{ik}^2$. Large-scale Matrix Factorization (by Kijung Shin) 8/99
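
To make the loss concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that evaluates the regularized loss above; the names V, W, H, and lam are assumptions, and missing entries of V are marked with NaN.

import numpy as np

def mf_loss(V, W, H, lam):
    """Regularized squared loss over the non-missing entries of V.
    V: (n, m) array with np.nan marking missing entries; W: (n, r); H: (r, m)."""
    Z = ~np.isnan(V)                          # mask of non-missing entries
    E = np.where(Z, V - W @ H, 0.0)           # residuals on observed entries only
    return np.sum(E ** 2) + lam * (np.sum(W ** 2) + np.sum(H ** 2))

# Tiny usage example: a 3-by-4 matrix with missing entries, rank 2
V = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [np.nan, 1.0, 5.0, 4.0]])
rng = np.random.default_rng(0)
W, H = rng.standard_normal((3, 2)), rng.standard_normal((2, 4))
print(mf_loss(V, W, H, lam=0.1))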

Algorithms How can we minimize this loss function L? Stochastic Gradient Descent: SGD (covered in the last lecture) Alternating Least Square: ALS (covered today) Cyclic Coordinate Descent: CCD++ (covered today) Are these algorithms parallelizable? Yes, all of them! Large-scale Matrix Factorization (by Kijung Shin) 9/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD << Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 10/99

Distributed SGD (DSGD) Large-scale matrix factorization with distributed stochastic gradient descent (KDD 2011) Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis Large-scale Matrix Factorization (by Kijung Shin) 11/99

Stochastic GD for MF (review). Let $W = [W_{1:}; \ldots; W_{n:}]$ (rows) and $H = [H_{:1}, \ldots, H_{:m}]$ (columns). $L(V, W, H)$ is the sum of a local loss over the non-missing entries: $L(V, W, H) = \sum_{(i,j) \in Z} L(V_{ij}, W_{i:}, H_{:j})$, where the loss at each non-missing entry $V_{ij}$ is $L(V_{ij}, W_{i:}, H_{:j}) = (V_{ij} - W_{i:} H_{:j})^2 + \lambda \left( \frac{\|W_{i:}\|^2}{N_{i:}} + \frac{\|H_{:j}\|^2}{N_{:j}} \right)$. Here $W_{i:} H_{:j} = [WH]_{ij}$, and $N_{i:}$ ($N_{:j}$) is the number of non-missing entries in the i-th row (j-th column) of V. Large-scale Matrix Factorization (by Kijung Shin) 12/99

Stochastic GD for MF (cont.) Stochastic Gradient Descent (SGD) for MF repeat until convergence randomly shuffle the non-missing entries of V for each non-missing entry: perform an SGD step on it Large-scale Matrix Factorization (by Kijung Shin) 13/99

Stochastic GD for MF (cont.). An SGD step on each non-missing entry $V_{ij}$: read $W_{i:}$ and $H_{:j}$; compute the gradient of $L(V_{ij}, W_{i:}, H_{:j})$; update $W_{i:}$ and $H_{:j}$. Detailed update rules were covered in the last lecture. Large-scale Matrix Factorization (by Kijung Shin) 14/99
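
The slide defers the detailed update rules to the previous lecture; below is a minimal sketch of a standard SGD step for the per-entry loss above, with the regularizer weighted by 1/N as in DSGD. The step size eps and the counts N_row and N_col are my notation, not taken from the slides.

import numpy as np

def sgd_step(Vij, Wi, Hj, N_row, N_col, lam, eps):
    """One SGD update on a single observed entry V_ij.
    Wi: row W_i: (length r); Hj: column H_:j (length r); both updated in place.
    N_row, N_col: numbers of observed entries in row i and column j."""
    err = Vij - Wi @ Hj                            # prediction error on this entry
    grad_W = -2.0 * err * Hj + 2.0 * lam * Wi / N_row
    grad_H = -2.0 * err * Wi + 2.0 * lam * Hj / N_col
    Wi -= eps * grad_W
    Hj -= eps * grad_H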

Simple Parallel SGD for MF Parameter Mixing: MSGD entries of V are distributed across multiple machines Machine 1 Machine 2 Machine 3 V V V Large-scale Matrix Factorization (by Kijung Shin) 15/99

Simple Parallel SGD for MF (cont.) Parameter Mixing: MSGD entries of V are distributed across multiple machines Map step: each machine runs SGD independently on the assigned entries until convergence Machine 1 Machine 2 Machine 3 H H H W W W V V V Large-scale Matrix Factorization (by Kijung Shin) 16/99

Simple Parallel SGD for MF (cont.). Parameter Mixing: MSGD. Map step: each machine runs an independent instance of SGD on its subset of the data until convergence. Reduce step: average the results (i.e., W and H). Problem: it does not converge to correct solutions; there is no guarantee that the average reduces the loss function L. Large-scale Matrix Factorization (by Kijung Shin) 17/99

Simple Parallel SGD for MF (cont.). Iterative Parameter Mixing: ISGD. Entries of V are distributed across multiple machines. Repeat until convergence: Map step: each machine runs SGD independently on the assigned entries for some time. Reduce step: average the results (i.e., W and H) and broadcast the averaged results. Problem: slow convergence, since it still has the averaging step. Large-scale Matrix Factorization (by Kijung Shin) 18/99

Interchangeability. How can we avoid the averaging step? Let different machines update different entries of W and H. Two entries $V_{ij}$ and $V_{kl}$ of V are interchangeable if they share neither a row nor a column: $V_{ij}$ and $V_{il}$ (same row) are not interchangeable, while $V_{ij}$ and $V_{kl}$ (different rows and columns) are interchangeable. Large-scale Matrix Factorization (by Kijung Shin) 19/99

Interchangeability (cont.). If $V_{ij}$ and $V_{kl}$ are interchangeable, SGD steps on $V_{ij}$ and $V_{kl}$ can be performed safely in parallel: one machine reads $W_{i:}$ and $H_{:j}$, computes the gradient of $L(V_{ij}, W_{i:}, H_{:j})$, and updates $W_{i:}$ and $H_{:j}$, while another reads $W_{k:}$ and $H_{:l}$, computes the gradient of $L(V_{kl}, W_{k:}, H_{:l})$, and updates $W_{k:}$ and $H_{:l}$. No conflicts! Large-scale Matrix Factorization (by Kijung Shin) 20/99

Distributed SGD. Block W, H, and V: split W into row blocks W(1), W(2), W(3), split H into column blocks H(1), H(2), H(3), and split V into the corresponding 3-by-3 grid of blocks V(1,1), ..., V(3,3). Large-scale Matrix Factorization (by Kijung Shin) 21/99

Distributed SGD (cont.). Block W, H, and V. Repeat until convergence: for a set of interchangeable blocks of V, (1) run SGD on the blocks in parallel (no conflict between machines), and (2) merge the results (i.e., W and H); averaging is not needed. Blocks on a diagonal, e.g., V(1,1), V(2,2), V(3,3), are interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 22/99

Interchangeable Blocks. Example of interchangeable blocks: V(1,1), V(2,2), and V(3,3) share neither rows nor columns. Machine 1 processes V(1,1) with W(1) and H(1); Machine 2 processes V(2,2) with W(2) and H(2); Machine 3 processes V(3,3) with W(3) and H(3). Blocks are interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 23/99

Interchangeable Blocks (cont.). Example of interchangeable blocks: V(1,2), V(2,3), and V(3,1). Machine 1 processes V(1,2) with W(1) and H(2); Machine 2 processes V(2,3) with W(2) and H(3); Machine 3 processes V(3,1) with W(3) and H(1). Blocks are interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 24/99

Interchangeable Blocks (cont.). Example of interchangeable blocks: V(1,3), V(2,1), and V(3,2). Machine 1 processes V(1,3) with W(1) and H(3); Machine 2 processes V(2,1) with W(2) and H(1); Machine 3 processes V(3,2) with W(3) and H(2). Blocks are interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 25/99

Interchangeable Blocks (cont.). Blocks that are not interchangeable are not used in DSGD: e.g., V(1,3), V(2,3), and V(3,2); V(1,3) and V(2,3) share the columns of H(3), so these blocks are not interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 26/99

More details of DSGD. How should we choose sets of interchangeable blocks? In each iteration, we choose sets of interchangeable blocks so that every block of V is used exactly once, but the order and grouping are random. SGD within each block: every non-missing entry in the block is used exactly once, but the order is random. Large-scale Matrix Factorization (by Kijung Shin) 27/99
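
One simple way to realize such a schedule is sketched below under my own assumptions (a d-by-d blocking with one block per machine per sub-epoch): a random permutation fixes the initial block assignment, and cyclic shifts of it give d sets of pairwise-interchangeable blocks that together cover every block of V exactly once per iteration.

import numpy as np

def dsgd_schedule(d, rng):
    """Yield d sub-epochs; each is a list of (row_block, col_block) pairs that
    share no rows or columns, and together they cover every block once."""
    perm = rng.permutation(d)                  # random initial column assignment
    for s in range(d):                         # d sub-epochs per iteration
        yield [(i, int((perm[i] + s) % d)) for i in range(d)]

rng = np.random.default_rng(42)
for stratum in dsgd_schedule(3, rng):
    print(stratum)    # e.g. [(0, 1), (1, 2), (2, 0)]: no shared rows or columns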

More details of DSGD (cont.). Use the bold driver heuristic to set the step size (learning rate): after each iteration, increase the step size if the loss decreased, and decrease it if the loss increased. Implemented on Hadoop and in R/Snowfall (Snowfall: a package for parallel R programs, https://cran.r-project.org/web/packages/snowfall/index.html). Large-scale Matrix Factorization (by Kijung Shin) 28/99
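
A minimal sketch of the bold driver rule just described; the growth and shrink factors (1.05 and 0.5) are common choices but are my assumption, not values stated in the slides.

def bold_driver(step, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Adjust the step size after an iteration: grow it slightly when the loss
    decreased, cut it sharply when the loss increased."""
    return step * grow if curr_loss < prev_loss else step * shrink

# Usage after each iteration: step = bold_driver(step, prev_loss, curr_loss)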

Pros & Cons of DSGD. Pros: faster convergence than MSGD and ISGD (no averaging step); memory efficiency: each machine needs to maintain only a single block of W and a single block of H in memory. Cons: many hyperparameters: the step size (ε) in addition to the regularization parameter (λ) and the rank (r). Large-scale Matrix Factorization (by Kijung Shin) 29/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS << Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 30/99

Alternating Least Square Large-scale Parallel Collaborative Filtering for the Netflix Prize (AAIM 2008) Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan Large-scale Matrix Factorization (by Kijung Shin) 31/99

Main Idea behind ALS How hard is matrix factorization? i.e., finding argmin W,H L(V, W, H) Solving the entire problem is difficult L V, W, H is a non-convex function of W and H many local optima finding a global optimum is NP-hard Large-scale Matrix Factorization (by Kijung Shin) 32/99

Main Idea behind ALS (cont.). How hard is solving a smaller problem? Specifically, finding argmin_H L(V, W, H) while fixing W to its current value. Much easier! L(V, W, H) is a convex function of H (once W is fixed): one local optimum, which is also globally optimal. Moreover, a closed-form solution exists, so we can directly compute the global optimum. Large-scale Matrix Factorization (by Kijung Shin) 33/99

Updating H. Finding $\mathrm{argmin}_H L(V, W, H)$ while fixing W to its current value. This problem is solved by the following update rule, derived from the first-order conditions $\frac{\partial L}{\partial H_{kj}} = 0$ for all k, j: for $j = 1, \ldots, m$, $H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$, where $I_r$ is the r-by-r identity matrix and $Z_{:j}$ is the set of row indices of non-missing entries in the j-th column of V. Large-scale Matrix Factorization (by Kijung Shin) 34/99

Updating W. Likewise, finding $\mathrm{argmin}_W L(V, W, H)$ while fixing H to its current value. This problem is solved by the following update rule, derived from the first-order conditions $\frac{\partial L}{\partial W_{ik}} = 0$ for all i, k: for $i = 1, \ldots, n$, $W_{i:}^T \leftarrow \left( \sum_{j \in Z_{i:}} H_{:j} H_{:j}^T + \lambda I_r \right)^{-1} \sum_{j \in Z_{i:}} V_{ij} H_{:j}$, where $Z_{i:}$ is the set of column indices of non-missing entries in the i-th row of V. Large-scale Matrix Factorization (by Kijung Shin) 35/99

Alternating Least Square (ALS). Randomly initialize W and H; repeat until convergence: update W while fixing H to its current value, then update H while fixing W to its current value. Each step never increases L(V, W, H): W is updated to the W that minimizes L(V, W, H) for the current H, and H is updated to the H that minimizes L(V, W, H) for the current W, so L(V, W, H) monotonically decreases until convergence. Large-scale Matrix Factorization (by Kijung Shin) 36/99
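
Combining the two closed-form updates with the alternating loop gives the following minimal single-machine NumPy sketch (my own illustration, not the authors' code); missing entries of V are assumed to be marked with NaN and lam is the regularization parameter λ.

import numpy as np

def als(V, r, lam, n_iters=20, seed=0):
    """Alternating Least Squares for V ~= W @ H with missing entries (NaN)."""
    n, m = V.shape
    Z = ~np.isnan(V)                               # observed-entry mask
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, r))
    H = rng.standard_normal((r, m))
    I = np.eye(r)
    for _ in range(n_iters):
        for i in range(n):                         # update W row by row, H fixed
            cols = np.flatnonzero(Z[i])            # Z_{i:}
            Hs = H[:, cols]
            W[i] = np.linalg.solve(Hs @ Hs.T + lam * I, Hs @ V[i, cols])
        for j in range(m):                         # update H column by column, W fixed
            rows = np.flatnonzero(Z[:, j])         # Z_{:j}
            Ws = W[rows]
            H[:, j] = np.linalg.solve(Ws.T @ Ws + lam * I, Ws.T @ V[rows, j])
    return W, H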

Parallel ALS: Idea. Recall the update rule for the j-th column of H: $H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$. To perform it, W and the j-th column of V are read, and the j-th column of H is written. Large-scale Matrix Factorization (by Kijung Shin) 37/99

Parallel ALS: Idea (cont.). Potentially every entry of W needs to be read, but only the j-th column of V needs to be read, and H does not need to be read; therefore, the columns of H can be updated independently, in parallel. Large-scale Matrix Factorization (by Kijung Shin) 38/99

Updating H in Parallel A toy example with 3 machines 1. Divide V column-wise into 3 pieces W V (1) V (2) V (3) Large-scale Matrix Factorization (by Kijung Shin) 39/99

Updating H in Parallel (cont.). 2. Distribute the pieces across the machines. 3. Broadcast W to every machine, so that Machine i holds W and V(i). Large-scale Matrix Factorization (by Kijung Shin) 40/99

Updating H in Parallel (cont.). 4. Compute the corresponding columns of H in parallel: Machine i computes H(i) from W and V(i). Large-scale Matrix Factorization (by Kijung Shin) 41/99

Updating H in Parallel (cont.). 5. Broadcast the part of H that each machine computed, so that every machine holds the full H = [H(1), H(2), H(3)]. Large-scale Matrix Factorization (by Kijung Shin) 42/99
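
To make the column-independence concrete, here is a minimal sketch (my own, using Python's multiprocessing pool as a stand-in for the machines) that applies the closed-form update to every column of H in parallel; V, W, and lam are assumed as before, with NaN marking missing entries.

import numpy as np
from multiprocessing import Pool

def solve_column(args):
    """Closed-form ALS update for one column H_:j, given W and the j-th column of V."""
    Vj, W, lam = args
    rows = np.flatnonzero(~np.isnan(Vj))           # Z_{:j}
    Ws = W[rows]
    A = Ws.T @ Ws + lam * np.eye(W.shape[1])
    b = Ws.T @ Vj[rows]
    return np.linalg.solve(A, b)

def update_H_parallel(V, W, lam, n_workers=3):
    """Update every column of H independently; call under `if __name__ == "__main__"`."""
    with Pool(n_workers) as pool:
        cols = pool.map(solve_column, [(V[:, j], W, lam) for j in range(V.shape[1])])
    return np.column_stack(cols)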

Parallel ALS: Idea. Recall the update rule for the i-th row of W: $W_{i:}^T \leftarrow \left( \sum_{j \in Z_{i:}} H_{:j} H_{:j}^T + \lambda I_r \right)^{-1} \sum_{j \in Z_{i:}} V_{ij} H_{:j}$. To perform it, H and the i-th row of V are read, and the i-th row of W is written. Large-scale Matrix Factorization (by Kijung Shin) 43/99

Parallel ALS: Idea (cont.). Potentially every entry of H needs to be read, but only the i-th row of V needs to be read, and W does not need to be read; therefore, the rows of W can be updated independently, in parallel. Large-scale Matrix Factorization (by Kijung Shin) 44/99

Updating W in Parallel A toy example with 3 machines 1. Divide V row-wise into 3 pieces H V (1) V (2) V (3) Large-scale Matrix Factorization (by Kijung Shin) 45/99

Updating W in Parallel (cont.). 2. Distribute the pieces across the machines. 3. Broadcast H to every machine, so that Machine i holds H and V(i). Large-scale Matrix Factorization (by Kijung Shin) 46/99

Updating W in Parallel (cont.). 4. Compute the corresponding rows of W in parallel: Machine i computes W(i) from H and V(i). Large-scale Matrix Factorization (by Kijung Shin) 47/99

Updating W in Parallel (cont.). 5. Broadcast the part of W that each machine computed, so that every machine holds the full W = [W(1); W(2); W(3)]. Large-scale Matrix Factorization (by Kijung Shin) 48/99

Pros & Cons of ALS. Pros: fewer hyperparameters: no step size is required. Cons: high computational cost: e.g., the matrix inversion in $H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$ takes $O(r^3)$ time. Large-scale Matrix Factorization (by Kijung Shin) 49/99

Pros & Cons of ALS (cont.). Cons (cont.): high memory requirement: while updating W (or H), each machine maintains the entire H (or W) in memory. Large-scale Matrix Factorization (by Kijung Shin) 50/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ << Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 51/99

Cyclic Coordinate Descent Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems (ICDM 2012) Best Paper Award Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon Large-scale Matrix Factorization (by Kijung Shin) 52/99

Matrix Factorization: Revisit. Goal: V (n-by-m) is approximated by the product of W (n-by-r) and H (r-by-m). Large-scale Matrix Factorization (by Kijung Shin) 53/99

Matrix Factorization: Revisit. Goal: V is approximated by the product of W and H. Entrywise, $V_{ij} \approx [WH]_{ij} = \sum_{k=1}^{r} W_{ik} H_{kj} = W_{i1} H_{1j} + \cdots + W_{ir} H_{rj}$. Large-scale Matrix Factorization (by Kijung Shin) 54/99

Matrix Factorization: Revisit. Equivalently, V is approximated by the sum of r rank-1 matrices: $V \approx W_{:1} H_{1:} + \cdots + W_{:r} H_{r:}$, where $W_{:k}$ is the k-th column of W and $H_{k:}$ is the k-th row of H. Since $[W_{:k} H_{k:}]_{ij} = W_{ik} H_{kj}$, we have $V_{ij} \approx [W_{:1} H_{1:}]_{ij} + \cdots + [W_{:r} H_{r:}]_{ij} = W_{i1} H_{1j} + \cdots + W_{ir} H_{rj}$. Large-scale Matrix Factorization (by Kijung Shin) 55/99
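
A two-line NumPy check of this rank-1 view (my own illustration): the product W @ H equals the sum of the outer products of the columns of W with the rows of H.

import numpy as np

rng = np.random.default_rng(0)
W, H = rng.standard_normal((4, 3)), rng.standard_normal((3, 5))
rank1_sum = sum(np.outer(W[:, k], H[k, :]) for k in range(3))   # W_:1 H_1: + W_:2 H_2: + W_:3 H_3:
print(np.allclose(W @ H, rank1_sum))    # True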

Cyclic Coordinate Descent: CCD++. Repeat until convergence: for k = 1, ..., r, update the k-th column of W and the k-th row of H while fixing the other entries of W and H. Consider a toy example with rank 3. Step 1: update $W_{:1}$ and $H_{1:}$. Large-scale Matrix Factorization (by Kijung Shin) 56/99

Order of Updates in CCD++. Step 2: update $W_{:2}$ and $H_{2:}$. Step 3: update $W_{:3}$ and $H_{3:}$. Large-scale Matrix Factorization (by Kijung Shin) 57/99

Performing one update. Let $W_{:1}$ and $H_{1:}$ be the column and row being updated. Moving the other rank-1 terms to the other side leaves the residual matrix R, which is approximated by $W_{:1} H_{1:}$ alone. Large-scale Matrix Factorization (by Kijung Shin) 58/99

Performing one update (cont.). Updating $W_{:1}$ and $H_{1:}$ is equivalent to factorizing the residual matrix R with rank 1, where $R = V - W_{:2} H_{2:} - W_{:3} H_{3:} = V - W_{:1} H_{1:} - W_{:2} H_{2:} - W_{:3} H_{3:} + W_{:1} H_{1:} = V - WH + W_{:1} H_{1:}$. Large-scale Matrix Factorization (by Kijung Shin) 59/99

Performing one update (cont.). Factorizing R with rank 1 can be performed using ALS; the update rules can be derived from those of ALS. For $j = 1, \ldots, m$: $H_{kj} \leftarrow \frac{\sum_{i \in Z_{:j}} R_{ij} W_{ik}}{\sum_{i \in Z_{:j}} W_{ik}^2 + \lambda}$. For $i = 1, \ldots, n$: $W_{ik} \leftarrow \frac{\sum_{j \in Z_{i:}} R_{ij} H_{kj}}{\sum_{j \in Z_{i:}} H_{kj}^2 + \lambda}$. Large-scale Matrix Factorization (by Kijung Shin) 60/99

Pseudocode of CCD++. Pseudocode (simple): randomly initialize W and H; repeat until convergence: for k = 1, ..., r: $R \leftarrow V - WH + W_{:k} H_{k:}$ (recomputed repeatedly!); update $W_{:k}$ and $H_{k:}$ (rank-1 factorization of R). Pseudocode (detailed): randomly initialize W and H; $R \leftarrow V - WH$ (maintained!); repeat until convergence: for k = 1, ..., r: $R \leftarrow R + W_{:k} H_{k:}$; update $W_{:k}$ and $H_{k:}$ (rank-1 factorization of R); $R \leftarrow R - W_{:k} H_{k:}$. Large-scale Matrix Factorization (by Kijung Shin) 61/99
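
Below is a minimal single-machine NumPy sketch of the detailed pseudocode (my own illustration, not the authors' implementation); missing entries of V are marked with NaN, the residual R is kept only on observed positions, and one pass of the rank-1 ALS rules from the previous slide is used per k.

import numpy as np

def ccdpp(V, r, lam, n_iters=20, seed=0):
    """CCD++: cyclic rank-1 updates against a maintained residual matrix R."""
    n, m = V.shape
    Z = ~np.isnan(V)                               # observed-entry mask
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, r))
    H = rng.standard_normal((r, m))
    R = np.where(Z, V - W @ H, 0.0)                # residual on observed entries
    for _ in range(n_iters):
        for k in range(r):
            R += Z * np.outer(W[:, k], H[k, :])    # add the old rank-1 term back
            for j in range(m):                     # rank-1 update of H_k: (no inversion)
                rows = np.flatnonzero(Z[:, j])
                H[k, j] = (R[rows, j] @ W[rows, k]) / (W[rows, k] @ W[rows, k] + lam)
            for i in range(n):                     # rank-1 update of W_:k (no inversion)
                cols = np.flatnonzero(Z[i])
                W[i, k] = (R[i, cols] @ H[k, cols]) / (H[k, cols] @ H[k, cols] + lam)
            R -= Z * np.outer(W[:, k], H[k, :])    # subtract the new rank-1 term
    return W, H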

Parallel CCD++. Updating $H_{k:}$ in parallel: 1. divide R column-wise into pieces R(1), R(2), R(3); 2. distribute the pieces across the machines; 3. broadcast $W_{:k}$ to every machine. Large-scale Matrix Factorization (by Kijung Shin) 62/99

Parallel CCD++ (cont.). Updating $H_{k:}$ in parallel (cont.): 4. compute the corresponding entries of $H_{k:}$ in parallel; Machine i computes its part of $H_{k:}$ from R(i) and $W_{:k}$. Large-scale Matrix Factorization (by Kijung Shin) 63/99

Parallel CCD++ (cont.). Updating $H_{k:}$ in parallel (cont.): 5. send the computed entries to the other machines, so that every machine holds the full $H_{k:}$. Large-scale Matrix Factorization (by Kijung Shin) 64/99

Pros & Cons of CCD++. Pros: low computational cost: the update rules $H_{kj} \leftarrow \frac{\sum_{i \in Z_{:j}} R_{ij} W_{ik}}{\sum_{i \in Z_{:j}} W_{ik}^2 + \lambda}$ and $W_{ik} \leftarrow \frac{\sum_{j \in Z_{i:}} R_{ij} H_{kj}}{\sum_{j \in Z_{i:}} H_{kj}^2 + \lambda}$ do not need matrix inversion. Large-scale Matrix Factorization (by Kijung Shin) 65/99

Pros & Cons of CCD++. Pros (cont.): low memory requirements: each machine needs to maintain only one row of H (or one column of W) in memory at a time, instead of the entire H (or W) as in ALS. Large-scale Matrix Factorization (by Kijung Shin) 66/99

Pros & Cons of CCD++. Cons: slow for dense matrices with many non-missing entries: to update R, every non-missing entry needs to be read and rewritten, which is especially slow if R is stored on disk. Large-scale Matrix Factorization (by Kijung Shin) 67/99

Roadmap Problem (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments << Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 68/99

Experimental Settings. Datasets: rating matrices in which rows are users, columns are movies (or songs), entries are ratings, and many entries are missing. Large-scale Matrix Factorization (by Kijung Shin) 69/99

Experimental Settings (cont.). Datasets: Netflix (2.7M users, 18K movies, 100M ratings), Yahoo! Music (1M users, 625K songs, 256M ratings), and MovieLens (71K users, 65K movies, 10M ratings). Large-scale Matrix Factorization (by Kijung Shin) 70/99

Experimental Settings (cont.). Split of datasets: training set: about 80% of the non-missing entries; test set: about 20% of the non-missing entries. Evaluation metric: test RMSE (root-mean-square error): $\mathrm{RMSE}(V, W, H) = \sqrt{ \frac{1}{|Z_{test}|} \sum_{(i,j) \in Z_{test}} (V_{ij} - [WH]_{ij})^2 }$, where $Z_{test}$ is the set of indices of test entries, $V_{ij}$ is the true rating, and $[WH]_{ij}$ is the estimated rating. Large-scale Matrix Factorization (by Kijung Shin) 71/99
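
A small NumPy sketch of the test-RMSE computation above (my own illustration); Z_test is assumed to be given as parallel arrays of row and column indices of the held-out entries.

import numpy as np

def test_rmse(V, W, H, test_rows, test_cols):
    """Root-mean-square error over the held-out entries (i, j) in Z_test."""
    pred = np.einsum('ik,ki->i', W[test_rows], H[:, test_cols])   # [WH]_ij for each test entry
    err = V[test_rows, test_cols] - pred
    return np.sqrt(np.mean(err ** 2))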

Experimental Settings (cont.) Machines and implementations Shared-memory setting: 8 cores OpenMP in C++ Distributed setting: up to 20 machines MPI in C++ Large-scale Matrix Factorization (by Kijung Shin) 72/99

Convergence Speed Shared-memory setting with 8 cores CCD++ decreases Test RMSE fastest Figure 3(b) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 73/99

Convergence Speed (cont.). Shared-memory setting with 8 cores: CCD++ converges fastest in terms of test RMSE. Figure 3(c) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 74/99

Convergence Speed (cont.) Shared-memory setting with 8 cores CCD++ converges fastest Figure 3(a) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 75/99

Speedup in Shared Memory. CCD++ and ALS show near-linear scalability with the number of cores; DSGD suffers from high cache-miss rates due to its randomness, and the cache-miss rate increases as more cores are used. Figure 4 of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 76/99

Speedup in Distributed Settings All algorithms show similar speedups Figure 6(b) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 77/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization << Conclusions Large-scale Matrix Factorization (by Kijung Shin) 78/99

Tensors. An N-way tensor is an N-dimensional array: a 1-way tensor $(t_1, \ldots, t_{I_1})$ is a vector; a 2-way tensor $(t_{i_1 i_2})$ with $i_1 = 1, \ldots, I_1$ and $i_2 = 1, \ldots, I_2$ is a matrix; a 3-way tensor has three indices. Large-scale Matrix Factorization (by Kijung Shin) 79/99

Tensor Factorization: Problem. Given: X, an I-by-J-by-K tensor, possibly with missing values; R, a target rank (a scalar), usually R << I, J, K. Large-scale Matrix Factorization (by Kijung Shin) 80/99

Tensor Factorization: Problem. Given: X, an I-by-J-by-K tensor (i.e., a 3-dimensional array), possibly with missing values; R, a target rank (a scalar), usually R << I, J, K. Find: U, an I-by-R matrix; H, a J-by-R matrix; and W, a K-by-R matrix, all without missing values. Large-scale Matrix Factorization (by Kijung Shin) 81/99

Tensor Factorization: Problem. Goal: $X \approx [UHW]$, where $[UHW]_{ijk} = \sum_{r=1}^{R} U_{ir} H_{jr} W_{kr}$. Large-scale Matrix Factorization (by Kijung Shin) 82/99

Loss Function. $L(X, U, H, W) = \sum_{(i,j,k) \in Z} (X_{ijk} - [UHW]_{ijk})^2 + \cdots$ (the regularization term is on the next slide), where Z is the set of indices of non-missing entries, i.e., $(i,j,k) \in Z$ iff $X_{ijk}$ is not missing; $X_{ijk}$ is the (i,j,k)-th entry of X and $[UHW]_{ijk}$ is the (i,j,k)-th entry of [UHW]. Goal: to make [UHW] similar to X. Large-scale Matrix Factorization (by Kijung Shin) 83/99

Loss Function (cont.). $L(X, U, H, W) = \sum_{(i,j,k) \in Z} (X_{ijk} - [UHW]_{ijk})^2 + \lambda(\|U\|_F^2 + \|H\|_F^2 + \|W\|_F^2)$, where $\lambda$ is the regularization parameter. Goal: to prevent overfitting (by making the entries of U, H, and W close to zero). Frobenius norm: $\|U\|_F^2 = \sum_{i=1}^{I} \sum_{r=1}^{R} U_{ir}^2$. Large-scale Matrix Factorization (by Kijung Shin) 84/99

SGD for TF. $L(X, U, H, W)$ is the sum of a local loss over the non-missing entries: $L(X, U, H, W) = \sum_{(i,j,k) \in Z} L(X_{ijk}, U_{i:}, H_{j:}, W_{k:})$. An SGD step on each non-missing entry $X_{ijk}$: read $U_{i:}$, $H_{j:}$, and $W_{k:}$; compute the gradient of $L(X_{ijk}, U_{i:}, H_{j:}, W_{k:})$; update $U_{i:}$, $H_{j:}$, and $W_{k:}$. Large-scale Matrix Factorization (by Kijung Shin) 85/99
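
Analogously to the matrix case, here is a minimal sketch of one SGD step for a squared-error per-entry tensor loss with a simple λ-regularizer; the exact regularization weighting and the step size eps are my assumptions, not taken from the slides.

import numpy as np

def tf_sgd_step(Xijk, Ui, Hj, Wk, lam, eps):
    """One SGD update on a single observed tensor entry X_ijk.
    Ui, Hj, Wk: factor rows U_i:, H_j:, W_k: (length R), updated in place."""
    err = Xijk - np.sum(Ui * Hj * Wk)             # X_ijk - sum_r U_ir H_jr W_kr
    grad_U = -2.0 * err * (Hj * Wk) + 2.0 * lam * Ui
    grad_H = -2.0 * err * (Ui * Wk) + 2.0 * lam * Hj
    grad_W = -2.0 * err * (Ui * Hj) + 2.0 * lam * Wk
    Ui -= eps * grad_U
    Hj -= eps * grad_H
    Wk -= eps * grad_W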

Parallel Algorithms for TF. Parallel algorithms for MF are extended to TF: DSGD → FlexiFaCT [BKP+14]; ALS → [SPK16]; CCD++ → CDTF [SK14]. FlexiFaCT [BKP+14] additionally supports coupled tensors. Large-scale Matrix Factorization (by Kijung Shin) 86/99

Applications. Context-aware recommendation [KABO10]: ratings form a users-by-movies-by-contexts tensor; a missing rating is estimated as $X_{ijk} \approx [UHW]_{ijk}$. Large-scale Matrix Factorization (by Kijung Shin) 87/99

Applications (cont.). Context-aware recommendation [KABO10]: ratings form a users-by-restaurants-by-contexts tensor; a missing rating is estimated as $X_{ijk} \approx [UHW]_{ijk}$. Large-scale Matrix Factorization (by Kijung Shin) 88/99

Applications (cont.). Video restoration [LMWY13]: pixel values form a tensor indexed by x-axis, y-axis, and frame; a missing or corrupted pixel is estimated as $X_{ijk} \approx [UHW]_{ijk}$. Large-scale Matrix Factorization (by Kijung Shin) 89/99

Applications (cont.) Video Restoration [LMWY13] Corrupted frame Restored frame Large-scale Matrix Factorization (by Kijung Shin) 90/99

Applications (cont.). Personalized web search [SZL+05]: preferences form a users-by-query-keywords-by-pages tensor; a missing preference is estimated as $X_{ijk} \approx [UHW]_{ijk}$. Large-scale Matrix Factorization (by Kijung Shin) 91/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions << Large-scale Matrix Factorization (by Kijung Shin) 92/99

Conclusions Matrix Factorization (MF) Parallel algorithms for MF Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Tensor Factorization (TF) Extension of MF to tensors Applications: context-aware recommendation, video restoration, personalized web search, etc. Large-scale Matrix Factorization (by Kijung Shin) 93/99

Questions? These slides are available at: www.cs.cmu.edu/~kijungs/etc/10-405.pdf Email: kijungs@cs.cmu.edu Large-scale Matrix Factorization (by Kijung Shin) 94/99

References [SZL+05] Jian-Tao Sun, Hua-Jun Zeng, Huan Liu, Yuchang Lu, and Zheng Chen. "CubeSVD: a novel approach to personalized web search." WWW 2005. [ZWSP08] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. "Large-scale parallel collaborative filtering for the Netflix Prize." AAIM 2008. [KABO10] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. "Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering." RecSys 2010. [GNHS11] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. "Large-scale matrix factorization with distributed stochastic gradient descent." KDD 2011. [DKA11] Daniel M. Dunlavy, Tamara G. Kolda, and Evrim Acar. "Temporal link prediction using matrix and tensor factorizations." TKDD 2011. Large-scale Matrix Factorization (by Kijung Shin) 95/99

References (cont.) [YHS+12] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems." ICDM 2012. [LMWY13] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. "Tensor completion for estimating missing values in visual data." TPAMI 2013. [BKP+14] Alex Beutel, Abhimanu Kumar, Evangelos E. Papalexakis, Partha Pratim Talukdar, Christos Faloutsos, and Eric P. Xing. "FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop." SDM 2014. [SPK16] Shaden Smith, Jongsoo Park, and George Karypis. "An exploration of optimization algorithms for high performance tensor completion." High Performance Computing, Networking, Storage and Analysis (SC), 2016. [SK14] Kijung Shin and U Kang. "Distributed methods for high-dimensional and large-scale tensor factorization." ICDM 2014. Large-scale Matrix Factorization (by Kijung Shin) 96/99

Closed-Form Solution of ALS. Let $W = [W_{1:}; \ldots; W_{n:}]$ and $H = [H_{:1}, \ldots, H_{:m}]$. Then $L(V, W, H) = \sum_{(i,j) \in Z} (V_{ij} - W_{i:} H_{:j})^2 + \lambda \left( \sum_{i=1}^{n} \|W_{i:}\|^2 + \sum_{j=1}^{m} \|H_{:j}\|^2 \right) = \sum_{(i,j) \in Z} \left( V_{ij} - \sum_{k=1}^{r} W_{ik} H_{kj} \right)^2 + \lambda \left( \sum_{i=1}^{n} \sum_{k=1}^{r} W_{ik}^2 + \sum_{j=1}^{m} \sum_{k=1}^{r} H_{kj}^2 \right)$. Large-scale Matrix Factorization (by Kijung Shin) 97/99

Closed-Form Solution of ALS (cont.). Setting $\frac{1}{2} \frac{\partial L}{\partial H_{kj}} = 0$ for all k, j (first-order conditions): $-\sum_{i \in Z_{:j}} (V_{ij} - W_{i:} H_{:j}) W_{ik} + \lambda H_{kj} = 0$ for all k, j, where $Z_{:j}$ is the set of rows of non-missing entries in the j-th column of V. Rearranging: $\sum_{i \in Z_{:j}} (W_{i:} H_{:j}) W_{ik} + \lambda H_{kj} = \sum_{i \in Z_{:j}} V_{ij} W_{ik}$ for all k, j. Large-scale Matrix Factorization (by Kijung Shin) 98/99

Closed-Form Solution of ALS (cont.). Stacking the r conditions on $H_{1j}, \ldots, H_{rj}$: $\left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right) H_{:j} = \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$ for all j, where $I_r$ is the r-by-r identity matrix. Therefore, $H_{:j} = \left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$ for all j. Large-scale Matrix Factorization (by Kijung Shin) 99/99