Large-scale Matrix Factorization Kijung Shin Ph.D. Student, CSD

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 2/99

Roadmap Matrix Factorization (review) << Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 3/99

Matrix Factorization: Problem. Given: V, an n-by-m matrix, possibly with missing values; r, a target rank (a scalar), usually r << n and r << m. Large-scale Matrix Factorization (by Kijung Shin) 4/99

Matrix Factorization: Problem. Given: V, an n-by-m matrix, possibly with missing values; r, a target rank (a scalar), usually r << n and r << m. Find: W, an n-by-r matrix, and H, an r-by-m matrix, both without missing values. Large-scale Matrix Factorization (by Kijung Shin) 5/99

Matrix Factorization: Problem. Goal: WH ≈ V. Large-scale Matrix Factorization (by Kijung Shin) 6/99

Loss Function. $L(V, W, H) = \sum_{(i,j) \in Z} (V_{ij} - [WH]_{ij})^2 + \cdots$ (the regularization term is on the next slide), where Z is the set of indices of non-missing entries, i.e., $(i,j) \in Z$ iff $V_{ij}$ is not missing; $V_{ij}$ is the (i,j)-th entry of V and $[WH]_{ij}$ is the (i,j)-th entry of WH. Goal: to make WH similar to V. Large-scale Matrix Factorization (by Kijung Shin) 7/99

Loss Function (cont.). $L(V, W, H) = \sum_{(i,j) \in Z} (V_{ij} - [WH]_{ij})^2 + \lambda(\|W\|_F^2 + \|H\|_F^2)$, where $\lambda$ is the regularization parameter. Goal: to prevent overfitting (by making the entries of W and H close to zero). Frobenius norm: $\|W\|_F^2 = \sum_{i=1}^{n} \sum_{k=1}^{r} W_{ik}^2$. Large-scale Matrix Factorization (by Kijung Shin) 8/99
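
To make the loss concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) that evaluates the regularized loss above; the names V, W, H, and lam are assumptions, and missing entries of V are marked with NaN.

import numpy as np

def mf_loss(V, W, H, lam):
    """Regularized squared loss over the non-missing entries of V.
    V: (n, m) array with np.nan marking missing entries; W: (n, r); H: (r, m)."""
    Z = ~np.isnan(V)                          # mask of non-missing entries
    E = np.where(Z, V - W @ H, 0.0)           # residuals on observed entries only
    return np.sum(E ** 2) + lam * (np.sum(W ** 2) + np.sum(H ** 2))

# Tiny usage example: a 3-by-4 matrix with missing entries, rank 2
V = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [np.nan, 1.0, 5.0, 4.0]])
rng = np.random.default_rng(0)
W, H = rng.standard_normal((3, 2)), rng.standard_normal((2, 4))
print(mf_loss(V, W, H, lam=0.1))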

Algorithms How can we minimize this loss function L? Stochastic Gradient Descent: SGD (covered in the last lecture) Alternating Least Square: ALS (covered today) Cyclic Coordinate Descent: CCD++ (covered today) Are these algorithms parallelizable? Yes, all of them! Large-scale Matrix Factorization (by Kijung Shin) 9/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD << Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 10/99

Distributed SGD (DSGD) Large-scale matrix factorization with distributed stochastic gradient descent (KDD 2011) Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis Large-scale Matrix Factorization (by Kijung Shin) 11/99

Stochastic GD for MF (review). Let $W = [W_{1:}; \ldots; W_{n:}]$ (rows) and $H = [H_{:1}, \ldots, H_{:m}]$ (columns). $L(V, W, H)$ is the sum of a local loss over the non-missing entries: $L(V, W, H) = \sum_{(i,j) \in Z} L(V_{ij}, W_{i:}, H_{:j})$, where the loss at each non-missing entry $V_{ij}$ is $L(V_{ij}, W_{i:}, H_{:j}) = (V_{ij} - W_{i:} H_{:j})^2 + \lambda \left( \frac{\|W_{i:}\|^2}{N_{i:}} + \frac{\|H_{:j}\|^2}{N_{:j}} \right)$. Here $W_{i:} H_{:j} = [WH]_{ij}$, and $N_{i:}$ ($N_{:j}$) is the number of non-missing entries in the i-th row (j-th column) of V. Large-scale Matrix Factorization (by Kijung Shin) 12/99

Stochastic GD for MF (cont.) Stochastic Gradient Descent (SGD) for MF repeat until convergence randomly shuffle the non-missing entries of V for each non-missing entry: perform an SGD step on it Large-scale Matrix Factorization (by Kijung Shin) 13/99

Stochastic GD for MF (cont.). An SGD step on each non-missing entry $V_{ij}$: read $W_{i:}$ and $H_{:j}$; compute the gradient of $L(V_{ij}, W_{i:}, H_{:j})$; update $W_{i:}$ and $H_{:j}$. Detailed update rules were covered in the last lecture. Large-scale Matrix Factorization (by Kijung Shin) 14/99
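
The slide defers the detailed update rules to the previous lecture; below is a minimal sketch of a standard SGD step for the per-entry loss above, with the regularizer weighted by 1/N as in DSGD. The step size eps and the counts N_row and N_col are my notation, not taken from the slides.

import numpy as np

def sgd_step(Vij, Wi, Hj, N_row, N_col, lam, eps):
    """One SGD update on a single observed entry V_ij.
    Wi: row W_i: (length r); Hj: column H_:j (length r); both updated in place.
    N_row, N_col: numbers of observed entries in row i and column j."""
    err = Vij - Wi @ Hj                            # prediction error on this entry
    grad_W = -2.0 * err * Hj + 2.0 * lam * Wi / N_row
    grad_H = -2.0 * err * Wi + 2.0 * lam * Hj / N_col
    Wi -= eps * grad_W
    Hj -= eps * grad_H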

Simple Parallel SGD for MF Parameter Mixing: MSGD entries of V are distributed across multiple machines Machine 1 Machine 2 Machine 3 V V V Large-scale Matrix Factorization (by Kijung Shin) 15/99

Simple Parallel SGD for MF (cont.) Parameter Mixing: MSGD entries of V are distributed across multiple machines Map step: each machine runs SGD independently on the assigned entries until convergence Machine 1 Machine 2 Machine 3 H H H W W W V V V Large-scale Matrix Factorization (by Kijung Shin) 16/99

Simple Parallel SGD for MF (cont.). Parameter Mixing: MSGD. Map step: each machine runs an independent instance of SGD on its subset of the data until convergence. Reduce step: average the results (i.e., W and H). Problem: it does not converge to correct solutions; there is no guarantee that the average reduces the loss function L. Large-scale Matrix Factorization (by Kijung Shin) 17/99

Simple Parallel SGD for MF (cont.). Iterative Parameter Mixing: ISGD. Entries of V are distributed across multiple machines. Repeat until convergence: Map step: each machine runs SGD independently on the assigned entries for some time. Reduce step: average the results (i.e., W and H) and broadcast the averaged results. Problem: slow convergence, since it still has the averaging step. Large-scale Matrix Factorization (by Kijung Shin) 18/99

Interchangeability. How can we avoid the averaging step? Let different machines update different entries of W and H. Two entries $V_{ij}$ and $V_{kl}$ of V are interchangeable if they share neither a row nor a column: $V_{ij}$ and $V_{il}$ (same row) are not interchangeable, while $V_{ij}$ and $V_{kl}$ (different rows and columns) are interchangeable. Large-scale Matrix Factorization (by Kijung Shin) 19/99

Interchangeability (cont.). If $V_{ij}$ and $V_{kl}$ are interchangeable, SGD steps on $V_{ij}$ and $V_{kl}$ can be performed safely in parallel: one machine reads $W_{i:}$ and $H_{:j}$, computes the gradient of $L(V_{ij}, W_{i:}, H_{:j})$, and updates $W_{i:}$ and $H_{:j}$, while another reads $W_{k:}$ and $H_{:l}$, computes the gradient of $L(V_{kl}, W_{k:}, H_{:l})$, and updates $W_{k:}$ and $H_{:l}$. No conflicts! Large-scale Matrix Factorization (by Kijung Shin) 20/99

Distributed SGD. Block W, H, and V: split W into row blocks W(1), W(2), W(3), split H into column blocks H(1), H(2), H(3), and split V into the corresponding 3-by-3 grid of blocks V(1,1), ..., V(3,3). Large-scale Matrix Factorization (by Kijung Shin) 21/99

Distributed SGD (cont.). Block W, H, and V. Repeat until convergence: for a set of interchangeable blocks of V, (1) run SGD on the blocks in parallel (no conflict between machines), and (2) merge the results (i.e., W and H); averaging is not needed. Blocks on a diagonal, e.g., V(1,1), V(2,2), V(3,3), are interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 22/99

Interchangeable Blocks. Example of interchangeable blocks: V(1,1), V(2,2), and V(3,3) share neither rows nor columns. Machine 1 processes V(1,1) with W(1) and H(1); Machine 2 processes V(2,2) with W(2) and H(2); Machine 3 processes V(3,3) with W(3) and H(3). Blocks are interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 23/99

Interchangeable Blocks (cont.). Example of interchangeable blocks: V(1,2), V(2,3), and V(3,1). Machine 1 processes V(1,2) with W(1) and H(2); Machine 2 processes V(2,3) with W(2) and H(3); Machine 3 processes V(3,1) with W(3) and H(1). Blocks are interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 24/99

Interchangeable Blocks (cont.). Example of interchangeable blocks: V(1,3), V(2,1), and V(3,2). Machine 1 processes V(1,3) with W(1) and H(3); Machine 2 processes V(2,1) with W(2) and H(1); Machine 3 processes V(3,2) with W(3) and H(2). Blocks are interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 25/99

Interchangeable Blocks (cont.). Blocks that are not interchangeable are not used in DSGD: e.g., V(1,3), V(2,3), and V(3,2); V(1,3) and V(2,3) share the columns of H(3), so these blocks are not interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 26/99

More details of DSGD. How should we choose sets of interchangeable blocks? In each iteration, we choose sets of interchangeable blocks so that every block of V is used exactly once, but the order and grouping are random. SGD within each block: every non-missing entry in the block is used exactly once, but the order is random. Large-scale Matrix Factorization (by Kijung Shin) 27/99
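
One simple way to realize such a schedule is sketched below under my own assumptions (a d-by-d blocking with one block per machine per sub-epoch): a random permutation fixes the initial block assignment, and cyclic shifts of it give d sets of pairwise-interchangeable blocks that together cover every block of V exactly once per iteration.

import numpy as np

def dsgd_schedule(d, rng):
    """Yield d sub-epochs; each is a list of (row_block, col_block) pairs that
    share no rows or columns, and together they cover every block once."""
    perm = rng.permutation(d)                  # random initial column assignment
    for s in range(d):                         # d sub-epochs per iteration
        yield [(i, int((perm[i] + s) % d)) for i in range(d)]

rng = np.random.default_rng(42)
for stratum in dsgd_schedule(3, rng):
    print(stratum)    # e.g. [(0, 1), (1, 2), (2, 0)]: no shared rows or columns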

More details of DSGD (cont.). Use the bold driver heuristic to set the step size (learning rate): after each iteration, increase the step size if the loss decreased, and decrease it if the loss increased. Implemented on Hadoop and in R/Snowfall (Snowfall: a package for parallel R programs, https://cran.r-project.org/web/packages/snowfall/index.html). Large-scale Matrix Factorization (by Kijung Shin) 28/99
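
A minimal sketch of the bold driver rule just described; the growth and shrink factors (1.05 and 0.5) are common choices but are my assumption, not values stated in the slides.

def bold_driver(step, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Adjust the step size after an iteration: grow it slightly when the loss
    decreased, cut it sharply when the loss increased."""
    return step * grow if curr_loss < prev_loss else step * shrink

# Usage after each iteration: step = bold_driver(step, prev_loss, curr_loss)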

Pros & Cons of DSGD. Pros: faster convergence than MSGD and ISGD (no averaging step); memory efficiency: each machine needs to maintain only a single block of W and a single block of H in memory. Cons: many hyperparameters: the step size (ε) in addition to the regularization parameter (λ) and the rank (r). Large-scale Matrix Factorization (by Kijung Shin) 29/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS << Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 30/99

Alternating Least Square Large-scale Parallel Collaborative Filtering for the Netflix Prize (AAIM 2008) Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan Large-scale Matrix Factorization (by Kijung Shin) 31/99

Main Idea behind ALS How hard is matrix factorization? i.e., finding argmin W,H L(V, W, H) Solving the entire problem is difficult L V, W, H is a non-convex function of W and H many local optima finding a global optimum is NP-hard Large-scale Matrix Factorization (by Kijung Shin) 32/99

Main Idea behind ALS (cont.). How hard is solving a smaller problem? Specifically, finding argmin_H L(V, W, H) while fixing W to its current value. Much easier! L(V, W, H) is a convex function of H (once W is fixed): one local optimum, which is also globally optimal. Moreover, a closed-form solution exists, so we can directly compute the global optimum. Large-scale Matrix Factorization (by Kijung Shin) 33/99

Updating H. Finding $\mathrm{argmin}_H L(V, W, H)$ while fixing W to its current value. This problem is solved by the following update rule, derived from the first-order conditions $\frac{\partial L}{\partial H_{kj}} = 0$ for all k, j: for $j = 1, \ldots, m$, $H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$, where $I_r$ is the r-by-r identity matrix and $Z_{:j}$ is the set of row indices of non-missing entries in the j-th column of V. Large-scale Matrix Factorization (by Kijung Shin) 34/99

Updating W. Likewise, finding $\mathrm{argmin}_W L(V, W, H)$ while fixing H to its current value. This problem is solved by the following update rule, derived from the first-order conditions $\frac{\partial L}{\partial W_{ik}} = 0$ for all i, k: for $i = 1, \ldots, n$, $W_{i:}^T \leftarrow \left( \sum_{j \in Z_{i:}} H_{:j} H_{:j}^T + \lambda I_r \right)^{-1} \sum_{j \in Z_{i:}} V_{ij} H_{:j}$, where $Z_{i:}$ is the set of column indices of non-missing entries in the i-th row of V. Large-scale Matrix Factorization (by Kijung Shin) 35/99

Alternating Least Square (ALS). Randomly initialize W and H; repeat until convergence: update W while fixing H to its current value, then update H while fixing W to its current value. Each step never increases L(V, W, H): W is updated to the W that minimizes L(V, W, H) for the current H, and H is updated to the H that minimizes L(V, W, H) for the current W, so L(V, W, H) monotonically decreases until convergence. Large-scale Matrix Factorization (by Kijung Shin) 36/99
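
Combining the two closed-form updates with the alternating loop gives the following minimal single-machine NumPy sketch (my own illustration, not the authors' code); missing entries of V are assumed to be marked with NaN and lam is the regularization parameter λ.

import numpy as np

def als(V, r, lam, n_iters=20, seed=0):
    """Alternating Least Squares for V ~= W @ H with missing entries (NaN)."""
    n, m = V.shape
    Z = ~np.isnan(V)                               # observed-entry mask
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, r))
    H = rng.standard_normal((r, m))
    I = np.eye(r)
    for _ in range(n_iters):
        for i in range(n):                         # update W row by row, H fixed
            cols = np.flatnonzero(Z[i])            # Z_{i:}
            Hs = H[:, cols]
            W[i] = np.linalg.solve(Hs @ Hs.T + lam * I, Hs @ V[i, cols])
        for j in range(m):                         # update H column by column, W fixed
            rows = np.flatnonzero(Z[:, j])         # Z_{:j}
            Ws = W[rows]
            H[:, j] = np.linalg.solve(Ws.T @ Ws + lam * I, Ws.T @ V[rows, j])
    return W, H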

Parallel ALS: Idea. Recall the update rule for the j-th column of H: $H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$. To perform it, W and the j-th column of V are read, and the j-th column of H is written. Large-scale Matrix Factorization (by Kijung Shin) 37/99

Parallel ALS: Idea (cont.). Potentially every entry of W needs to be read, but only the j-th column of V needs to be read, and H does not need to be read; therefore, the columns of H can be updated independently, in parallel. Large-scale Matrix Factorization (by Kijung Shin) 38/99

Updating H in Parallel A toy example with 3 machines 1. Divide V column-wise into 3 pieces W V (1) V (2) V (3) Large-scale Matrix Factorization (by Kijung Shin) 39/99

Updating H in Parallel (cont.). 2. Distribute the pieces across the machines. 3. Broadcast W to every machine, so that Machine i holds W and V(i). Large-scale Matrix Factorization (by Kijung Shin) 40/99

Updating H in Parallel (cont.). 4. Compute the corresponding columns of H in parallel: Machine i computes H(i) from W and V(i). Large-scale Matrix Factorization (by Kijung Shin) 41/99

Updating H in Parallel (cont.). 5. Broadcast the part of H that each machine computed, so that every machine holds the full H = [H(1), H(2), H(3)]. Large-scale Matrix Factorization (by Kijung Shin) 42/99
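
To make the column-independence concrete, here is a minimal sketch (my own, using Python's multiprocessing pool as a stand-in for the machines) that applies the closed-form update to every column of H in parallel; V, W, and lam are assumed as before, with NaN marking missing entries.

import numpy as np
from multiprocessing import Pool

def solve_column(args):
    """Closed-form ALS update for one column H_:j, given W and the j-th column of V."""
    Vj, W, lam = args
    rows = np.flatnonzero(~np.isnan(Vj))           # Z_{:j}
    Ws = W[rows]
    A = Ws.T @ Ws + lam * np.eye(W.shape[1])
    b = Ws.T @ Vj[rows]
    return np.linalg.solve(A, b)

def update_H_parallel(V, W, lam, n_workers=3):
    """Update every column of H independently; call under `if __name__ == "__main__"`."""
    with Pool(n_workers) as pool:
        cols = pool.map(solve_column, [(V[:, j], W, lam) for j in range(V.shape[1])])
    return np.column_stack(cols)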

Parallel ALS: Idea. Recall the update rule for the i-th row of W: $W_{i:}^T \leftarrow \left( \sum_{j \in Z_{i:}} H_{:j} H_{:j}^T + \lambda I_r \right)^{-1} \sum_{j \in Z_{i:}} V_{ij} H_{:j}$. To perform it, H and the i-th row of V are read, and the i-th row of W is written. Large-scale Matrix Factorization (by Kijung Shin) 43/99

Parallel ALS: Idea (cont.). Potentially every entry of H needs to be read, but only the i-th row of V needs to be read, and W does not need to be read; therefore, the rows of W can be updated independently, in parallel. Large-scale Matrix Factorization (by Kijung Shin) 44/99

Updating W in Parallel A toy example with 3 machines 1. Divide V row-wise into 3 pieces H V (1) V (2) V (3) Large-scale Matrix Factorization (by Kijung Shin) 45/99

Updating W in Parallel (cont.). 2. Distribute the pieces across the machines. 3. Broadcast H to every machine, so that Machine i holds H and V(i). Large-scale Matrix Factorization (by Kijung Shin) 46/99

Updating W in Parallel (cont.). 4. Compute the corresponding rows of W in parallel: Machine i computes W(i) from H and V(i). Large-scale Matrix Factorization (by Kijung Shin) 47/99

Updating W in Parallel (cont.). 5. Broadcast the part of W that each machine computed, so that every machine holds the full W = [W(1); W(2); W(3)]. Large-scale Matrix Factorization (by Kijung Shin) 48/99

Pros & Cons of ALS. Pros: fewer hyperparameters: no step size is required. Cons: high computational cost: e.g., the matrix inversion in $H_{:j} \leftarrow \left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$ takes $O(r^3)$ time. Large-scale Matrix Factorization (by Kijung Shin) 49/99

Pros & Cons of ALS (cont.). Cons (cont.): high memory requirement: while updating W (or H), each machine maintains the entire H (or W) in memory. Large-scale Matrix Factorization (by Kijung Shin) 50/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ << Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 51/99

Cyclic Coordinate Descent Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems (ICDM 2012) Best Paper Award Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon Large-scale Matrix Factorization (by Kijung Shin) 52/99

Matrix Factorization: Revisit. Goal: V (n-by-m) is approximated by the product of W (n-by-r) and H (r-by-m). Large-scale Matrix Factorization (by Kijung Shin) 53/99

Matrix Factorization: Revisit. Goal: V is approximated by the product of W and H. Entrywise, $V_{ij} \approx [WH]_{ij} = \sum_{k=1}^{r} W_{ik} H_{kj} = W_{i1} H_{1j} + \cdots + W_{ir} H_{rj}$. Large-scale Matrix Factorization (by Kijung Shin) 54/99

Matrix Factorization: Revisit. Equivalently, V is approximated by the sum of r rank-1 matrices: $V \approx W_{:1} H_{1:} + \cdots + W_{:r} H_{r:}$, where $W_{:k}$ is the k-th column of W and $H_{k:}$ is the k-th row of H. Since $[W_{:k} H_{k:}]_{ij} = W_{ik} H_{kj}$, we have $V_{ij} \approx [W_{:1} H_{1:}]_{ij} + \cdots + [W_{:r} H_{r:}]_{ij} = W_{i1} H_{1j} + \cdots + W_{ir} H_{rj}$. Large-scale Matrix Factorization (by Kijung Shin) 55/99
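
A two-line NumPy check of this rank-1 view (my own illustration): the product W @ H equals the sum of the outer products of the columns of W with the rows of H.

import numpy as np

rng = np.random.default_rng(0)
W, H = rng.standard_normal((4, 3)), rng.standard_normal((3, 5))
rank1_sum = sum(np.outer(W[:, k], H[k, :]) for k in range(3))   # W_:1 H_1: + W_:2 H_2: + W_:3 H_3:
print(np.allclose(W @ H, rank1_sum))    # True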

Cyclic Coordinate Descent: CCD++. Repeat until convergence: for k = 1, ..., r, update the k-th column of W and the k-th row of H while fixing the other entries of W and H. Consider a toy example with rank 3. Step 1: update $W_{:1}$ and $H_{1:}$. Large-scale Matrix Factorization (by Kijung Shin) 56/99

Order of Updates in CCD++. Step 2: update $W_{:2}$ and $H_{2:}$. Step 3: update $W_{:3}$ and $H_{3:}$. Large-scale Matrix Factorization (by Kijung Shin) 57/99

Performing one update. Let $W_{:1}$ and $H_{1:}$ be the column and row being updated. Moving the other rank-1 terms to the other side leaves the residual matrix R, which is approximated by $W_{:1} H_{1:}$ alone. Large-scale Matrix Factorization (by Kijung Shin) 58/99

Performing one update (cont.). Updating $W_{:1}$ and $H_{1:}$ is equivalent to factorizing the residual matrix R with rank 1, where $R = V - W_{:2} H_{2:} - W_{:3} H_{3:} = V - W_{:1} H_{1:} - W_{:2} H_{2:} - W_{:3} H_{3:} + W_{:1} H_{1:} = V - WH + W_{:1} H_{1:}$. Large-scale Matrix Factorization (by Kijung Shin) 59/99

Performing one update (cont.). Factorizing R with rank 1 can be performed using ALS; the update rules can be derived from those of ALS. For $j = 1, \ldots, m$: $H_{kj} \leftarrow \frac{\sum_{i \in Z_{:j}} R_{ij} W_{ik}}{\sum_{i \in Z_{:j}} W_{ik}^2 + \lambda}$. For $i = 1, \ldots, n$: $W_{ik} \leftarrow \frac{\sum_{j \in Z_{i:}} R_{ij} H_{kj}}{\sum_{j \in Z_{i:}} H_{kj}^2 + \lambda}$. Large-scale Matrix Factorization (by Kijung Shin) 60/99

Pseudocode of CCD++. Pseudocode (simple): randomly initialize W and H; repeat until convergence: for k = 1, ..., r: $R \leftarrow V - WH + W_{:k} H_{k:}$ (recomputed repeatedly!); update $W_{:k}$ and $H_{k:}$ (rank-1 factorization of R). Pseudocode (detailed): randomly initialize W and H; $R \leftarrow V - WH$ (maintained!); repeat until convergence: for k = 1, ..., r: $R \leftarrow R + W_{:k} H_{k:}$; update $W_{:k}$ and $H_{k:}$ (rank-1 factorization of R); $R \leftarrow R - W_{:k} H_{k:}$. Large-scale Matrix Factorization (by Kijung Shin) 61/99
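
Below is a minimal single-machine NumPy sketch of the detailed pseudocode (my own illustration, not the authors' implementation); missing entries of V are marked with NaN, the residual R is kept only on observed positions, and one pass of the rank-1 ALS rules from the previous slide is used per k.

import numpy as np

def ccdpp(V, r, lam, n_iters=20, seed=0):
    """CCD++: cyclic rank-1 updates against a maintained residual matrix R."""
    n, m = V.shape
    Z = ~np.isnan(V)                               # observed-entry mask
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, r))
    H = rng.standard_normal((r, m))
    R = np.where(Z, V - W @ H, 0.0)                # residual on observed entries
    for _ in range(n_iters):
        for k in range(r):
            R += Z * np.outer(W[:, k], H[k, :])    # add the old rank-1 term back
            for j in range(m):                     # rank-1 update of H_k: (no inversion)
                rows = np.flatnonzero(Z[:, j])
                H[k, j] = (R[rows, j] @ W[rows, k]) / (W[rows, k] @ W[rows, k] + lam)
            for i in range(n):                     # rank-1 update of W_:k (no inversion)
                cols = np.flatnonzero(Z[i])
                W[i, k] = (R[i, cols] @ H[k, cols]) / (H[k, cols] @ H[k, cols] + lam)
            R -= Z * np.outer(W[:, k], H[k, :])    # subtract the new rank-1 term
    return W, H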

Parallel CCD++. Updating $H_{k:}$ in parallel: 1. divide R column-wise into pieces R(1), R(2), R(3); 2. distribute the pieces across the machines; 3. broadcast $W_{:k}$ to every machine. Large-scale Matrix Factorization (by Kijung Shin) 62/99

Parallel CCD++ (cont.). Updating $H_{k:}$ in parallel (cont.): 4. compute the corresponding entries of $H_{k:}$ in parallel; Machine i computes its part of $H_{k:}$ from R(i) and $W_{:k}$. Large-scale Matrix Factorization (by Kijung Shin) 63/99

Parallel CCD++ (cont.). Updating $H_{k:}$ in parallel (cont.): 5. send the computed entries to the other machines, so that every machine holds the full $H_{k:}$. Large-scale Matrix Factorization (by Kijung Shin) 64/99

Pros & Cons of CCD++. Pros: low computational cost: the update rules $H_{kj} \leftarrow \frac{\sum_{i \in Z_{:j}} R_{ij} W_{ik}}{\sum_{i \in Z_{:j}} W_{ik}^2 + \lambda}$ and $W_{ik} \leftarrow \frac{\sum_{j \in Z_{i:}} R_{ij} H_{kj}}{\sum_{j \in Z_{i:}} H_{kj}^2 + \lambda}$ do not need matrix inversion. Large-scale Matrix Factorization (by Kijung Shin) 65/99

Pros & Cons of CCD++. Pros (cont.): low memory requirements: each machine needs to maintain only one row of H (or one column of W) in memory at a time, instead of the entire H (or W) as in ALS. Large-scale Matrix Factorization (by Kijung Shin) 66/99

Pros & Cons of CCD++. Cons: slow for dense matrices with many non-missing entries: to update R, every non-missing entry needs to be read and rewritten, which is especially slow if R is stored on disk. Large-scale Matrix Factorization (by Kijung Shin) 67/99

Roadmap Problem (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments << Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 68/99

Experimental Settings. Datasets: rating matrices in which rows are users, columns are movies (or songs), entries are ratings, and many entries are missing. Large-scale Matrix Factorization (by Kijung Shin) 69/99

Experimental Settings (cont.). Datasets: Netflix (2.7M users, 18K movies, 100M ratings), Yahoo! Music (1M users, 625K songs, 256M ratings), and MovieLens (71K users, 65K movies, 10M ratings). Large-scale Matrix Factorization (by Kijung Shin) 70/99

Experimental Settings (cont.). Split of datasets: training set: about 80% of the non-missing entries; test set: about 20% of the non-missing entries. Evaluation metric: test RMSE (root-mean-square error): $\mathrm{RMSE}(V, W, H) = \sqrt{ \frac{1}{|Z_{test}|} \sum_{(i,j) \in Z_{test}} (V_{ij} - [WH]_{ij})^2 }$, where $Z_{test}$ is the set of indices of test entries, $V_{ij}$ is the true rating, and $[WH]_{ij}$ is the estimated rating. Large-scale Matrix Factorization (by Kijung Shin) 71/99
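
A small NumPy sketch of the test-RMSE computation above (my own illustration); Z_test is assumed to be given as parallel arrays of row and column indices of the held-out entries.

import numpy as np

def test_rmse(V, W, H, test_rows, test_cols):
    """Root-mean-square error over the held-out entries (i, j) in Z_test."""
    pred = np.einsum('ik,ki->i', W[test_rows], H[:, test_cols])   # [WH]_ij for each test entry
    err = V[test_rows, test_cols] - pred
    return np.sqrt(np.mean(err ** 2))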

Experimental Settings (cont.) Machines and implementations Shared-memory setting: 8 cores OpenMP in C++ Distributed setting: up to 20 machines MPI in C++ Large-scale Matrix Factorization (by Kijung Shin) 72/99

Convergence Speed Shared-memory setting with 8 cores CCD++ decreases Test RMSE fastest Figure 3(b) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 73/99

Convergence Speed (cont.). Shared-memory setting with 8 cores: CCD++ converges fastest in terms of test RMSE. Figure 3(c) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 74/99

Convergence Speed (cont.) Shared-memory setting with 8 cores CCD++ converges fastest Figure 3(a) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 75/99

Speedup in Shared Memory. CCD++ and ALS show near-linear scalability with the number of cores; DSGD suffers from high cache-miss rates due to its randomness, and the cache-miss rate increases as more cores are used. Figure 4 of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 76/99

Speedup in Distributed Settings All algorithms show similar speedups Figure 6(b) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 77/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization << Conclusions Large-scale Matrix Factorization (by Kijung Shin) 78/99

Tensors. An N-way tensor is an N-dimensional array: a 1-way tensor $(t_1, \ldots, t_{I_1})$ is a vector; a 2-way tensor $(t_{i_1 i_2})$ with $i_1 = 1, \ldots, I_1$ and $i_2 = 1, \ldots, I_2$ is a matrix; a 3-way tensor has three indices. Large-scale Matrix Factorization (by Kijung Shin) 79/99

Tensor Factorization: Problem. Given: X, an I-by-J-by-K tensor, possibly with missing values; R, a target rank (a scalar), usually R << I, J, K. Large-scale Matrix Factorization (by Kijung Shin) 80/99

Tensor Factorization: Problem. Given: X, an I-by-J-by-K tensor (i.e., a 3-dimensional array), possibly with missing values; R, a target rank (a scalar), usually R << I, J, K. Find: U, an I-by-R matrix; H, a J-by-R matrix; and W, a K-by-R matrix, all without missing values. Large-scale Matrix Factorization (by Kijung Shin) 81/99

Tensor Factorization: Problem. Goal: $X \approx [UHW]$, where $[UHW]_{ijk} = \sum_{r=1}^{R} U_{ir} H_{jr} W_{kr}$. Large-scale Matrix Factorization (by Kijung Shin) 82/99

Loss Function. $L(X, U, H, W) = \sum_{(i,j,k) \in Z} (X_{ijk} - [UHW]_{ijk})^2 + \cdots$ (the regularization term is on the next slide), where Z is the set of indices of non-missing entries, i.e., $(i,j,k) \in Z$ iff $X_{ijk}$ is not missing; $X_{ijk}$ is the (i,j,k)-th entry of X and $[UHW]_{ijk}$ is the (i,j,k)-th entry of [UHW]. Goal: to make [UHW] similar to X. Large-scale Matrix Factorization (by Kijung Shin) 83/99

Loss Function (cont.). $L(X, U, H, W) = \sum_{(i,j,k) \in Z} (X_{ijk} - [UHW]_{ijk})^2 + \lambda(\|U\|_F^2 + \|H\|_F^2 + \|W\|_F^2)$, where $\lambda$ is the regularization parameter. Goal: to prevent overfitting (by making the entries of U, H, and W close to zero). Frobenius norm: $\|U\|_F^2 = \sum_{i=1}^{I} \sum_{r=1}^{R} U_{ir}^2$. Large-scale Matrix Factorization (by Kijung Shin) 84/99

SGD for TF. $L(X, U, H, W)$ is the sum of a local loss over the non-missing entries: $L(X, U, H, W) = \sum_{(i,j,k) \in Z} L(X_{ijk}, U_{i:}, H_{j:}, W_{k:})$. An SGD step on each non-missing entry $X_{ijk}$: read $U_{i:}$, $H_{j:}$, and $W_{k:}$; compute the gradient of $L(X_{ijk}, U_{i:}, H_{j:}, W_{k:})$; update $U_{i:}$, $H_{j:}$, and $W_{k:}$. Large-scale Matrix Factorization (by Kijung Shin) 85/99
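
Analogously to the matrix case, here is a minimal sketch of one SGD step for a squared-error per-entry tensor loss with a simple λ-regularizer; the exact regularization weighting and the step size eps are my assumptions, not taken from the slides.

import numpy as np

def tf_sgd_step(Xijk, Ui, Hj, Wk, lam, eps):
    """One SGD update on a single observed tensor entry X_ijk.
    Ui, Hj, Wk: factor rows U_i:, H_j:, W_k: (length R), updated in place."""
    err = Xijk - np.sum(Ui * Hj * Wk)             # X_ijk - sum_r U_ir H_jr W_kr
    grad_U = -2.0 * err * (Hj * Wk) + 2.0 * lam * Ui
    grad_H = -2.0 * err * (Ui * Wk) + 2.0 * lam * Hj
    grad_W = -2.0 * err * (Ui * Hj) + 2.0 * lam * Wk
    Ui -= eps * grad_U
    Hj -= eps * grad_H
    Wk -= eps * grad_W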

Parallel Algorithms for TF. Parallel algorithms for MF are extended to TF: DSGD → FlexiFaCT [BKP+14]; ALS → [SPK16]; CCD++ → CDTF [SK14]. FlexiFaCT [BKP+14] additionally supports coupled tensors. Large-scale Matrix Factorization (by Kijung Shin) 86/99

Applications. Context-aware recommendation [KABO10]: ratings form a users-by-movies-by-contexts tensor; a missing rating is estimated as $X_{ijk} \approx [UHW]_{ijk}$. Large-scale Matrix Factorization (by Kijung Shin) 87/99

Applications (cont.). Context-aware recommendation [KABO10]: ratings form a users-by-restaurants-by-contexts tensor; a missing rating is estimated as $X_{ijk} \approx [UHW]_{ijk}$. Large-scale Matrix Factorization (by Kijung Shin) 88/99

Applications (cont.). Video restoration [LMWY13]: pixel values form a tensor indexed by x-axis, y-axis, and frame; a missing or corrupted pixel is estimated as $X_{ijk} \approx [UHW]_{ijk}$. Large-scale Matrix Factorization (by Kijung Shin) 89/99

Applications (cont.) Video Restoration [LMWY13] Corrupted frame Restored frame Large-scale Matrix Factorization (by Kijung Shin) 90/99

Applications (cont.). Personalized web search [SZL+05]: preferences form a users-by-query-keywords-by-pages tensor; a missing preference is estimated as $X_{ijk} \approx [UHW]_{ijk}$. Large-scale Matrix Factorization (by Kijung Shin) 91/99

Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions << Large-scale Matrix Factorization (by Kijung Shin) 92/99

Conclusions Matrix Factorization (MF) Parallel algorithms for MF Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Tensor Factorization (TF) Extension of MF to tensors Applications: context-aware recommendation, video restoration, personalized web search, etc. Large-scale Matrix Factorization (by Kijung Shin) 93/99

Questions? These slides are available at: www.cs.cmu.edu/~kijungs/etc/10-405.pdf Email: kijungs@cs.cmu.edu Large-scale Matrix Factorization (by Kijung Shin) 94/99

References [SZL+05] Jian-Tao Sun, Hua-Jun Zeng, Huan Liu, Yuchang Lu, and Zheng Chen. "CubeSVD: a novel approach to personalized web search." WWW 2005. [ZWSP08] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. "Large-scale parallel collaborative filtering for the Netflix Prize." AAIM 2008. [KABO10] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. "Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering." RecSys 2010. [GNHS11] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. "Large-scale matrix factorization with distributed stochastic gradient descent." KDD 2011. [DKA11] Daniel M. Dunlavy, Tamara G. Kolda, and Evrim Acar. "Temporal link prediction using matrix and tensor factorizations." TKDD 2011. Large-scale Matrix Factorization (by Kijung Shin) 95/99

References (cont.) [YHS+12] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems." ICDM 2012. [LMWY13] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. "Tensor completion for estimating missing values in visual data." TPAMI 2013. [BKP+14] Alex Beutel, Abhimanu Kumar, Evangelos E. Papalexakis, Partha Pratim Talukdar, Christos Faloutsos, and Eric P. Xing. "FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop." SDM 2014. [SPK16] Shaden Smith, Jongsoo Park, and George Karypis. "An exploration of optimization algorithms for high performance tensor completion." High Performance Computing, Networking, Storage and Analysis (SC), 2016. [SK14] Kijung Shin and U Kang. "Distributed methods for high-dimensional and large-scale tensor factorization." ICDM 2014. Large-scale Matrix Factorization (by Kijung Shin) 96/99

Closed-Form Solution of ALS. Let $W = [W_{1:}; \ldots; W_{n:}]$ and $H = [H_{:1}, \ldots, H_{:m}]$. Then $L(V, W, H) = \sum_{(i,j) \in Z} (V_{ij} - W_{i:} H_{:j})^2 + \lambda \left( \sum_{i=1}^{n} \|W_{i:}\|^2 + \sum_{j=1}^{m} \|H_{:j}\|^2 \right) = \sum_{(i,j) \in Z} \left( V_{ij} - \sum_{k=1}^{r} W_{ik} H_{kj} \right)^2 + \lambda \left( \sum_{i=1}^{n} \sum_{k=1}^{r} W_{ik}^2 + \sum_{j=1}^{m} \sum_{k=1}^{r} H_{kj}^2 \right)$. Large-scale Matrix Factorization (by Kijung Shin) 97/99

Closed-Form Solution of ALS (cont.). Setting $\frac{1}{2} \frac{\partial L}{\partial H_{kj}} = 0$ for all k, j (first-order conditions): $-\sum_{i \in Z_{:j}} (V_{ij} - W_{i:} H_{:j}) W_{ik} + \lambda H_{kj} = 0$ for all k, j, where $Z_{:j}$ is the set of rows of non-missing entries in the j-th column of V. Rearranging: $\sum_{i \in Z_{:j}} (W_{i:} H_{:j}) W_{ik} + \lambda H_{kj} = \sum_{i \in Z_{:j}} V_{ij} W_{ik}$ for all k, j. Large-scale Matrix Factorization (by Kijung Shin) 98/99

Closed-Form Solution of ALS (cont.). Stacking the r conditions on $H_{1j}, \ldots, H_{rj}$: $\left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right) H_{:j} = \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$ for all j, where $I_r$ is the r-by-r identity matrix. Therefore, $H_{:j} = \left( \sum_{i \in Z_{:j}} W_{i:}^T W_{i:} + \lambda I_r \right)^{-1} \sum_{i \in Z_{:j}} V_{ij} W_{i:}^T$ for all j. Large-scale Matrix Factorization (by Kijung Shin) 99/99