Large-scale Matrix Factorization. Kijung Shin Ph.D. Student, CSD

1 Large-scale Matrix Factorization Kijung Shin Ph.D. Student, CSD

2 Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 2/99

3 Roadmap Matrix Factorization (review) << Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 3/99

4 Matrix Factorization: Problem Given: V: n by m matrix, possibly with missing values; r: target rank (a scalar), usually r << n and r << m. [figure: n-by-m matrix V with missing entries] Large-scale Matrix Factorization (by Kijung Shin) 4/99

5 Matrix Factorization: Problem Given: V: n by m matrix, possibly with missing values; r: target rank (a scalar), usually r << n and r << m. Find: W: n by r matrix and H: r by m matrix, both without missing values. Large-scale Matrix Factorization (by Kijung Shin) 5/99

6 Matrix Factorization: Problem Goal: WH ≈ V. [figure: V approximated by the product W H] Large-scale Matrix Factorization (by Kijung Shin) 6/99

7 Loss Function L(V, W, H) = Σ_{(i,j)∈Z} (V_ij − [WH]_ij)² + …, where Z is the set of indices of non-missing entries, i.e., (i, j) ∈ Z ⟺ V_ij is not missing; V_ij is the (i, j)-th entry of V and [WH]_ij is the (i, j)-th entry of WH. Goal: to make WH similar to V. Large-scale Matrix Factorization (by Kijung Shin) 7/99

8 Loss Function (cont.) L(V, W, H) = Σ_{(i,j)∈Z} (V_ij − [WH]_ij)² + λ(‖W‖_F² + ‖H‖_F²), where λ is the regularization parameter. Goal: to prevent overfitting (by keeping the entries of W and H close to zero). Frobenius norm: ‖W‖_F² = Σ_{i=1}^{n} Σ_{k=1}^{r} W_ik². Large-scale Matrix Factorization (by Kijung Shin) 8/99
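A minimal NumPy sketch of this loss (not from the slides), assuming missing entries of V are marked with NaN; names such as mf_loss are illustrative:

```python
import numpy as np

def mf_loss(V, W, H, lam):
    """Regularized squared loss over the non-missing entries of V.

    V: (n, m) array with np.nan for missing entries; W: (n, r); H: (r, m)."""
    mask = ~np.isnan(V)                      # Z: indices of non-missing entries
    diff = np.where(mask, V - W @ H, 0.0)    # residuals on observed entries only
    return np.sum(diff ** 2) + lam * (np.sum(W ** 2) + np.sum(H ** 2))

# toy usage
rng = np.random.default_rng(0)
V = rng.random((4, 5)); V[0, 1] = np.nan
W, H = rng.random((4, 2)), rng.random((2, 5))
print(mf_loss(V, W, H, lam=0.1))
```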

9 Algorithms How can we minimize this loss function L? Stochastic Gradient Descent: SGD (covered in the last lecture) Alternating Least Square: ALS (covered today) Cyclic Coordinate Descent: CCD++ (covered today) Are these algorithms parallelizable? Yes, all of them! Large-scale Matrix Factorization (by Kijung Shin) 9/99

10 Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD << Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 10/99

11 Distributed SGD (DSGD) Large-scale matrix factorization with distributed stochastic gradient descent (KDD 2011) Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis Large-scale Matrix Factorization (by Kijung Shin) 11/99

12 Stochastic GD for MF (review) Let W = [W_1:; …; W_n:] (stacked rows) and H = [H_:1, …, H_:m] (stacked columns). L(V, W, H) is a sum of losses over the non-missing entries: L(V, W, H) = Σ_{(i,j)∈Z} L(V_ij, W_i:, H_:j), where the loss at each non-missing entry V_ij is L(V_ij, W_i:, H_:j) = (V_ij − W_i: H_:j)² + λ(‖W_i:‖²/N_i: + ‖H_:j‖²/N_:j). Here W_i: H_:j = [WH]_ij, and N_i: (resp. N_:j) is the number of non-missing entries in the i-th row (resp. j-th column) of V. Large-scale Matrix Factorization (by Kijung Shin) 12/99

13 Stochastic GD for MF (cont.) Stochastic Gradient Descent (SGD) for MF repeat until convergence randomly shuffle the non-missing entries of V for each non-missing entry: perform an SGD step on it Large-scale Matrix Factorization (by Kijung Shin) 13/99

14 Stochastic GD for MF (cont.) An SGD step on each non-missing entry V_ij: read W_i: and H_:j, compute the gradient of L(V_ij, W_i:, H_:j), and update W_i: and H_:j. Detailed update rules were covered in the last lecture. [figure: an SGD step on V_ij reads and writes only W_i: and H_:j] Large-scale Matrix Factorization (by Kijung Shin) 14/99
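The detailed rules are in the previous lecture; the sketch below is only a standard SGD step for the per-entry loss defined on slide 12 (names and the calling convention are illustrative assumptions):

```python
import numpy as np

def sgd_step(V, W, H, i, j, N_row, N_col, lr, lam):
    """One SGD step on entry V[i, j] for the per-entry loss
    (V_ij - W_i: H_:j)^2 + lam*(||W_i:||^2/N_i: + ||H_:j||^2/N_:j)."""
    err = V[i, j] - W[i, :] @ H[:, j]                      # residual on this entry
    grad_w = -2 * err * H[:, j] + 2 * lam * W[i, :] / N_row[i]
    grad_h = -2 * err * W[i, :] + 2 * lam * H[:, j] / N_col[j]
    W[i, :] -= lr * grad_w
    H[:, j] -= lr * grad_h
```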

15 Simple Parallel SGD for MF Parameter Mixing: MSGD entries of V are distributed across multiple machines Machine 1 Machine 2 Machine 3 V V V Large-scale Matrix Factorization (by Kijung Shin) 15/99

16 Simple Parallel SGD for MF (cont.) Parameter Mixing: MSGD entries of V are distributed across multiple machines Map step: each machine runs SGD independently on the assigned entries until convergence Machine 1 Machine 2 Machine 3 H H H W W W V V V Large-scale Matrix Factorization (by Kijung Shin) 16/99

17 Simple Parallel SGD for MF (cont.) Parameter Mixing: MSGD Map step: each machine runs an independent instance of SGD on its assigned subset of the data until convergence Reduce step: average the results (i.e., W and H) across machines Problem: does not converge to correct solutions; there is no guarantee that the average reduces the loss function L Large-scale Matrix Factorization (by Kijung Shin) 17/99

18 Simple Parallel SGD for MF (cont.) Iterative Parameter Mixing: ISGD entries of V are distributed across multiple machines Repeat until convergence Map step: each machine runs SGD independently on the assigned entries for some time Reduce step: average results (i.e., W and H) broadcast averaged results Problem: slow convergence still has the averaging step Large-scale Matrix Factorization (by Kijung Shin) 18/99

19 Interchangeability How can we avoid the averaging step? Let different machines update different entries of W and H. Two entries V_ij and V_kl of V are interchangeable if they share neither a row nor a column. [figure: V_ij and V_kl in different rows and columns are interchangeable; V_ij and V_il in the same row are not] Large-scale Matrix Factorization (by Kijung Shin) 19/99

20 Interchangeability (cont.) If V_ij and V_kl are interchangeable, SGD steps on V_ij and V_kl can be parallelized safely: one machine reads W_i: and H_:j, computes the gradient of L(V_ij, W_i:, H_:j), and updates W_i: and H_:j, while another reads W_k: and H_:l, computes the gradient of L(V_kl, W_k:, H_:l), and updates W_k: and H_:l; no conflicts! Large-scale Matrix Factorization (by Kijung Shin) 20/99

21 Distributed SGD Block W, H, and V H H (1) H (2) H (3) W (1) V (1,1) V (1,2) V (1,3) W V W (2) V (2,1) V (2,2) V (2,3) W (3) V (3,1) V (3,2) V (3,3) Large-scale Matrix Factorization (by Kijung Shin) 21/99

22 Distributed SGD (cont.) Block W, H, and V. Repeat until convergence: for each set of interchangeable blocks of V, 1. run SGD on the blocks in parallel (no conflict between machines), 2. merge the results (i.e., W and H); averaging is not needed. [figure: blocks V (1,1), V (2,2), V (3,3) on a diagonal are interchangeable] Large-scale Matrix Factorization (by Kijung Shin) 22/99

23 Interchangeable Blocks Example of interchangeable blocks H (1) H (2) H (3) H (1) H (2) W (1) V (1,1) W (1) V (1,1) W (2) V (2,2) W (2) V (2,2) Machine 1 Machine 2 H (3) W (3) V (3,3) Blocks are interchangeable! W (3) V (3,3) Machine 3 Large-scale Matrix Factorization (by Kijung Shin) 23/99

24 Interchangeable Blocks (cont.) Example of interchangeable blocks H (1) H (2) H (3) H (2) H (3) W (1) V (1,2) W (1) V (1,2) W (2) V (2,3) W (2) V (2,3) Machine 1 Machine 2 H (1) W (3) V (3,1) Blocks are interchangeable! W (3) V (3,1) Machine 3 Large-scale Matrix Factorization (by Kijung Shin) 24/99

25 Interchangeable Blocks (cont.) Example of interchangeable blocks H (1) H (2) H (3) H (3) H (1) W (1) V (1,3) W (1) V (1,3) W (2) V (2,1) W (2) V (2,1) Machine 1 Machine 2 H (2) W (3) V (3,2) Blocks are interchangeable! W (3) V (3,2) Machine 3 Large-scale Matrix Factorization (by Kijung Shin) 25/99

26 Interchangeable Blocks (cont.) Non-interchangeable blocks (not used in DSGD): V (1,3), V (2,3), V (3,2). Blocks V (1,3) and V (2,3) share the column block H (3), so the blocks are not interchangeable! Large-scale Matrix Factorization (by Kijung Shin) 26/99

27 More details of DSGD How should we choose sets of interchangeable blocks? In each iteration, we choose the sets of interchangeable blocks so that every block of V is used exactly once (e.g., the three diagonal patterns above), but the order and grouping are random. SGD within each block: every non-missing entry in the block is used exactly once, but the order is random. Large-scale Matrix Factorization (by Kijung Shin) 27/99
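A minimal sketch of one DSGD iteration, assuming V has already been split into d × d blocks and that sgd_on_block stands in for plain SGD restricted to one block; the inner loop over b is the part that would run on d machines in parallel:

```python
import numpy as np

def dsgd_epoch(V_blocks, W_blocks, H_blocks, sgd_on_block, d, rng):
    """One DSGD iteration over a d x d blocking of V.

    V_blocks[b][q] holds block V^(b,q); W_blocks[b], H_blocks[q] are the
    corresponding factor blocks.  Each stratum below is a set of
    interchangeable blocks covering every block of V exactly once per epoch."""
    col_perm = rng.permutation(d)                  # random grouping of blocks
    for s in rng.permutation(d):                   # random order of strata
        # stratum s: blocks (b, q) with q = col_perm[(b + s) mod d], a "diagonal"
        for b in range(d):                         # parallelizable, no conflicts
            q = col_perm[(b + s) % d]
            sgd_on_block(V_blocks[b][q], W_blocks[b], H_blocks[q])
```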

28 More details of DSGD (cont.) Use the bold driver heuristic to set the step size (or learning rate): after each iteration, increase the step size if the loss decreased, and decrease it if the loss increased. Implemented on Hadoop and in R/Snowfall (Snowfall: a package for parallel R programs). Large-scale Matrix Factorization (by Kijung Shin) 28/99
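A sketch of the bold-driver schedule; the growth and shrink factors below (5% up, 50% down) are common choices and only illustrative, not necessarily the values used in the paper:

```python
def bold_driver(step, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Increase the step size after a successful iteration (loss went down),
    cut it sharply after an unsuccessful one (loss went up)."""
    return step * grow if curr_loss < prev_loss else step * shrink
```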

29 Pros & Cons of DSGD Pros: Faster convergence than MSGD and ISGD no averaging step Memory efficiency: each machine needs to maintain a single block of W and a single block of H in memory H (1) H (2) H (3) W (1) V (1,1) W (2) V (2,2) W (3) V (3,3) Machine 1 Machine 2 Machine 3 Cons: Many hyperparameters: step size (ε) in addition to regularization parameter (λ) and rank (r) Large-scale Matrix Factorization (by Kijung Shin) 29/99

30 Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS << Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 30/99

31 Alternating Least Square Large-scale Parallel Collaborative Filtering for the Netflix Prize (AAIM 2008) Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan Large-scale Matrix Factorization (by Kijung Shin) 31/99

32 Main Idea behind ALS How hard is matrix factorization? i.e., finding argmin W,H L(V, W, H) Solving the entire problem is difficult L V, W, H is a non-convex function of W and H many local optima finding a global optimum is NP-hard Large-scale Matrix Factorization (by Kijung Shin) 32/99

33 Main Idea behind ALS (cont.) How hard is solving a smaller problem? Specifically, finding argmin_H L(V, W, H) while fixing W to its current value. Much easier! L(V, W, H) is a convex function of H (once W is fixed): one local optimum, which is also globally optimal. Moreover, there is a closed-form solution, so we can directly compute the global optimum. Large-scale Matrix Factorization (by Kijung Shin) 33/99

34 Updating H Finding argmin_H L(V, W, H) while fixing W to its current value. This problem is solved by the following update rule: for j = 1 … m, H_:j ← (Σ_{i∈Z_:j} W_i: W_i:ᵀ + λI_k)⁻¹ (Σ_{i∈Z_:j} V_ij W_i:), where Z_:j is the set of row indices of non-missing entries in the j-th column of V and I_k is the k-by-k identity matrix. The rule is derived from ∂L/∂H_kj = 0, ∀k, j (first-order condition). Large-scale Matrix Factorization (by Kijung Shin) 34/99
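A NumPy sketch of this column update, assuming NaN marks missing entries of V; np.linalg.solve is used instead of forming the inverse explicitly:

```python
import numpy as np

def update_H_column(V, W, j, lam):
    """Closed-form ALS update for column H_:j with W fixed,
    using only the rows i with V[i, j] observed (Z_:j)."""
    rows = ~np.isnan(V[:, j])                # Z_:j
    Wj = W[rows, :]                          # (|Z_:j|, r)
    vj = V[rows, j]                          # (|Z_:j|,)
    r = W.shape[1]
    A = Wj.T @ Wj + lam * np.eye(r)          # Σ W_i: W_i:^T + λI
    b = Wj.T @ vj                            # Σ V_ij W_i:
    return np.linalg.solve(A, b)             # new H_:j
```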

35 Updating W Likewise, finding argmin_W L(V, W, H) while fixing H to its current value. This problem is solved by the following update rule: for i = 1 … n, W_i: ← (Σ_{j∈Z_i:} H_:j H_:jᵀ + λI_k)⁻¹ (Σ_{j∈Z_i:} V_ij H_:j), where Z_i: is the set of column indices of non-missing entries in the i-th row of V. The rule is derived from ∂L/∂W_ik = 0, ∀i, k (first-order condition). Large-scale Matrix Factorization (by Kijung Shin) 35/99

36 Alternating Least Square (ALS) Randomly initialize W and H; repeat until convergence: update W while fixing H to its current value, then update H while fixing W to its current value. Each step never increases L(V, W, H): W is updated to the best W that minimizes L(V, W, H) for the current H, and H is updated to the best H that minimizes L(V, W, H) for the current W, so L(V, W, H) monotonically decreases until convergence. Large-scale Matrix Factorization (by Kijung Shin) 36/99
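Putting the two closed-form updates together, a compact self-contained ALS sketch (NaN marks missing entries; the function name and defaults are illustrative):

```python
import numpy as np

def als(V, r, lam, n_iters=20, seed=0):
    """Alternating least squares for matrix factorization with missing values."""
    n, m = V.shape
    rng = np.random.default_rng(seed)
    W, H = rng.random((n, r)), rng.random((r, m))
    obs = ~np.isnan(V)
    for _ in range(n_iters):
        for i in range(n):                                # update W row by row
            cols = obs[i, :]
            Hi = H[:, cols]
            A = Hi @ Hi.T + lam * np.eye(r)
            W[i, :] = np.linalg.solve(A, Hi @ V[i, cols])
        for j in range(m):                                # update H column by column
            rows = obs[:, j]
            Wj = W[rows, :]
            A = Wj.T @ Wj + lam * np.eye(r)
            H[:, j] = np.linalg.solve(A, Wj.T @ V[rows, j])
    return W, H
```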

37 Parallel ALS: Idea Recall the update rule for the j-th column of H: H_:j ← (Σ_{i∈Z_:j} W_i: W_i:ᵀ + λI_k)⁻¹ (Σ_{i∈Z_:j} V_ij W_i:). [figure: updating H_:j reads W and the j-th column of V, and writes only H_:j] Large-scale Matrix Factorization (by Kijung Shin) 37/99

38 Parallel ALS: Idea (cont.) Potentially every entry of W needs to be read; only the j-th column of V needs to be read; H does not need to be read. So we can update the columns of H independently in parallel. Large-scale Matrix Factorization (by Kijung Shin) 38/99

39 Updating H in Parallel A toy example with 3 machines 1. Divide V column-wise into 3 pieces W V (1) V (2) V (3) Large-scale Matrix Factorization (by Kijung Shin) 39/99

40 Updating H in Parallel (cont.) 2. Distribute the pieces across the machines 3. W is broadcast to every machine Machine 1 Machine 2 Machine 3 W V (1) V (2) W V (1) W V (2) W V (3) V (3) Large-scale Matrix Factorization (by Kijung Shin) 40/99

41 Updating H in Parallel (cont.) 4. Compute the corresponding columns of H in parallel Machine 1 Machine 2 Machine 3 New! New! New! H (1) H (2) H (3) W V (1) V (2) W V (1) W V (2) W V (3) V (3) Large-scale Matrix Factorization (by Kijung Shin) 41/99

42 Updating H in Parallel (cont.) 5. Broadcast the part of H that each machine computes H (1) H (2) H (3) H (1) H (2) H (3) H (1) H (2) H (3) W V (1) W V (2) W V (3) Machine 1 Machine 2 Machine 3 Large-scale Matrix Factorization (by Kijung Shin) 42/99

43 Parallel ALS: Idea Recall the update rule for the i-th row of W: W_i: ← (Σ_{j∈Z_i:} H_:j H_:jᵀ + λI_k)⁻¹ (Σ_{j∈Z_i:} V_ij H_:j). [figure: updating W_i: reads H and the i-th row of V, and writes only W_i:] Large-scale Matrix Factorization (by Kijung Shin) 43/99

44 Parallel ALS: Idea (cont.) Potentially every entry of H needs to be read; only the i-th row of V needs to be read; W does not need to be read. So we can update the rows of W independently in parallel. Large-scale Matrix Factorization (by Kijung Shin) 44/99

45 Updating W in Parallel A toy example with 3 machines 1. Divide V row-wise into 3 pieces H V (1) V (2) V (3) Large-scale Matrix Factorization (by Kijung Shin) 45/99

46 Updating W in Parallel (cont.) 2. Distribute the pieces across the machines 3. H is broadcast to every machine H V (1) V (2) V (3) H H H V (1) V (2) V (3) Machine 1 Machine 2 Machine 3 Large-scale Matrix Factorization (by Kijung Shin) 46/99

47 Updating W in Parallel (cont.) 4. Compute the corresponding rows of W in parallel H H H New! New! New! W (1) W (2) V (3) V (1) V (2) V (3) Machine 1 Machine 2 Machine 3 Large-scale Matrix Factorization (by Kijung Shin) 47/99

48 Updating W in Parallel (cont.) 5. Broadcast the part of W that each machine computes W (1) W (1) W (1) W (2) H H H W (2) W (2) W (3) W (3) W (3) V (1) V (2) V (3) Machine 1 Machine 2 Machine 3 Large-scale Matrix Factorization (by Kijung Shin) 48/99

49 Pros & Cons of ALS Pros: Fewer hyperparameters: no step size is required. Cons: High computational cost: e.g., the matrix inversion in H_:j ← (Σ_{i∈Z_:j} W_i: W_i:ᵀ + λI_k)⁻¹ (Σ_{i∈Z_:j} V_ij W_i:) takes O(r³). Large-scale Matrix Factorization (by Kijung Shin) 49/99

50 Pros & Cons of ALS (cont.) Cons (cont.): High memory requirement: while updating W (or H), each machine maintains the entire H (or W) in memory. Machine 1: W, V (1); Machine 2: W, V (2); Machine 3: W, V (3). Large-scale Matrix Factorization (by Kijung Shin) 50/99

51 Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ << Experiments Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 51/99

52 Cyclic Coordinate Descent Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems (ICDM 2012) Best Paper Award Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon Large-scale Matrix Factorization (by Kijung Shin) 52/99

53 Matrix Factorization: Revisit Goal: V is approximated by the product of W and H m r m n??? n H r? V W Large-scale Matrix Factorization (by Kijung Shin) 53/99

54 Matrix Factorization: Revisit Goal: V is approximated by the product of W and H: V_ij ≈ [WH]_ij = Σ_{k=1}^{r} W_ik H_kj = W_i1 H_1j + … + W_ir H_rj. Large-scale Matrix Factorization (by Kijung Shin) 54/99

55 Matrix Factorization: Revisit Equivalently, V is approximated by the sum of r rank-1 matrices: V ≈ W_:1 H_1: + … + W_:r H_r:, where W_:k is the k-th column of W and H_k: is the k-th row of H. Since [W_:k H_k:]_ij = W_ik H_kj, we have V_ij ≈ [W_:1 H_1:]_ij + … + [W_:r H_r:]_ij = W_i1 H_1j + … + W_ir H_rj. Large-scale Matrix Factorization (by Kijung Shin) 55/99
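A quick NumPy check of this rank-1 view (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
W, H = rng.random((4, 3)), rng.random((3, 5))    # n=4, r=3, m=5

# WH as a sum of r rank-1 matrices W_:k H_k:
rank1_sum = sum(np.outer(W[:, k], H[k, :]) for k in range(3))
print(np.allclose(W @ H, rank1_sum))             # True
```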

56 Cyclic Coordinate Descent: CCD++ repeat until convergence for k = 1 r update the k-th column of W and the k-th row of H while fixing the other entries of W and H Consider a toy example with rank 3 Step 1: H 1: + + H 2: H 3: V W :1 W :2 W :3 Large-scale Matrix Factorization (by Kijung Shin) 56/99

57 Step 2: Order of Updates in CCD H 1: H 2: H 3: V Step 3: W :1 W :2 W :3 + + H 1: H 2: H 3: V W :1 W :2 W :3 Large-scale Matrix Factorization (by Kijung Shin) 57/99

58 Performing one update Let W_:1 and H_1: be the column and row to be updated. [figure: moving the other rank-1 terms to the left side gives the residual matrix R = V − W_:2 H_2: − W_:3 H_3:] Large-scale Matrix Factorization (by Kijung Shin) 58/99

59 Performing one update (cont.) Updating W_:1 and H_1: amounts to factorizing the residual matrix R with rank 1, where R = V − W_:2 H_2: − W_:3 H_3: = (V − W_:1 H_1: − W_:2 H_2: − W_:3 H_3:) + W_:1 H_1: = V − WH + W_:1 H_1:. Large-scale Matrix Factorization (by Kijung Shin) 59/99

60 Performing one update (cont.) Factorizing R with rank 1 can be performed using ALS with the following rules (the update rules can be derived from those in ALS): For j = 1 … m: H_kj ← (Σ_{i∈Z_:j} R_ij W_ik) / (Σ_{i∈Z_:j} W_ik² + λ). For i = 1 … n: W_ik ← (Σ_{j∈Z_i:} R_ij H_kj) / (Σ_{j∈Z_i:} H_kj² + λ). Large-scale Matrix Factorization (by Kijung Shin) 60/99
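A NumPy sketch of these two element-wise rules for a single k, assuming the residual R uses NaN for missing entries:

```python
import numpy as np

def rank1_update(R, W, H, k, lam):
    """One CCD++ rank-1 pass: refit column W_:k and row H_k: against R."""
    obs = ~np.isnan(R)
    Rz = np.where(obs, R, 0.0)               # observed residuals, 0 elsewhere
    obs_f = obs.astype(float)
    # H_kj <- (Σ_i R_ij W_ik) / (Σ_i W_ik^2 + λ) over observed i in column j
    H[k, :] = (Rz.T @ W[:, k]) / (obs_f.T @ (W[:, k] ** 2) + lam)
    # W_ik <- (Σ_j R_ij H_kj) / (Σ_j H_kj^2 + λ) over observed j in row i
    W[:, k] = (Rz @ H[k, :]) / (obs_f @ (H[k, :] ** 2) + lam)
```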

61 Pseudocode of CCD++ Pseudocode (simple): randomly initialize W and H; repeat until convergence: for k = 1 … r: R ← V − WH + W_:k H_k: (recomputed every time!), then update W_:k and H_k: (rank-1 factorization of R). Pseudocode (detailed): randomly initialize W and H; R ← V − WH (maintained incrementally!); repeat until convergence: for k = 1 … r: R ← R + W_:k H_k:, update W_:k and H_k: (rank-1 factorization of R), then R ← R − W_:k H_k:. Large-scale Matrix Factorization (by Kijung Shin) 61/99
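A self-contained sketch of the detailed pseudocode (NaN marks missing entries; for brevity it does a single H/W alternation per k, whereas CCD++ as described in the paper may run several inner iterations per k):

```python
import numpy as np

def ccdpp(V, r, lam, n_iters=20, seed=0):
    """CCD++ for matrix factorization with missing values (NaN)."""
    n, m = V.shape
    rng = np.random.default_rng(seed)
    W, H = rng.random((n, r)), rng.random((r, m))
    obs = (~np.isnan(V)).astype(float)
    R = np.where(obs > 0, V - W @ H, 0.0)            # residual, kept incrementally
    for _ in range(n_iters):
        for k in range(r):
            R += np.outer(W[:, k], H[k, :]) * obs    # add old rank-1 term back
            # rank-1 refit of R against W_:k and H_k:
            H[k, :] = (R.T @ W[:, k]) / (obs.T @ (W[:, k] ** 2) + lam)
            W[:, k] = (R @ H[k, :]) / (obs @ (H[k, :] ** 2) + lam)
            R -= np.outer(W[:, k], H[k, :]) * obs    # subtract new rank-1 term
    return W, H
```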

62 Parallel CCD++ Updating H k: in parallel 1. divide R column-wise 2. distribute the pieces across the machines 3. W :k is broadcast to every machine H k: Machine 1 Machine 2 Machine 3 R (1) R (2) R (3) R(1) R(2) R(3) W :k W :k W :k W K: Large-scale Matrix Factorization (by Kijung Shin) 62/99

63 Parallel CCD++ (cont.) Updating H k: in parallel (cont.) 4. compute the corresponding entries of H k: in parallel Machine 1 Machine 2 Machine 3 New! H 1 k: New! H 2 k: New! H 3 k: R(1) R(2) R(3) W :k W :k W :k Large-scale Matrix Factorization (by Kijung Shin) 63/99

64 Parallel CCD++ (cont.) Updating H k: in parallel (cont.) 5. send the computed entries to the other machines. Machine 1 Machine 2 Machine 3 H k: H k: H k: R(1) R(2) R(3) W :k W :k W :k Large-scale Matrix Factorization (by Kijung Shin) 64/99

65 Pros & Cons of CCD++ Pros: Low computational cost: the update rules need no matrix inversion: H_kj ← (Σ_{i∈Z_:j} R_ij W_ik) / (Σ_{i∈Z_:j} W_ik² + λ), W_ik ← (Σ_{j∈Z_i:} R_ij H_kj) / (Σ_{j∈Z_i:} H_kj² + λ). Large-scale Matrix Factorization (by Kijung Shin) 65/99

66 Pros & Cons of CCD++ Pros (cont.): Low memory requirements each machine needs to maintain one row of H (or one column of W) in memory at a time instead of entire H (or W) Machine 1 in ALS Machine 1 in CCD++ W V (1) R (1) W :k Large-scale Matrix Factorization (by Kijung Shin) 66/99

67 Pros & Cons of CCD++ Cons: Slow for dense matrices with many non-missing entries To update R, every non-missing entry needs to be read and rewritten. This is especially slow if R is stored on disk. Large-scale Matrix Factorization (by Kijung Shin) 67/99

68 Roadmap Problem (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments << Extension to Tensor Factorization Conclusions Large-scale Matrix Factorization (by Kijung Shin) 68/99

69 Experimental Settings Datasets: user-by-movie (or song) rating matrices with missing entries. [figure: users × movies rating matrix with ? marking missing ratings] Large-scale Matrix Factorization (by Kijung Shin) 69/99

70 Experimental Settings (cont.) Datasets: (2.7M users, 18K movies, 100M ratings) (1M users, 625K songs, 256M ratings) (71K users, 65K movies, 10M ratings) Large-scale Matrix Factorization (by Kijung Shin) 70/99

71 Experimental Settings (cont.) Split of datasets: training set: about 80% of the non-missing entries; test set: about 20% of the non-missing entries. Evaluation metric: test RMSE (root-mean-square error), RMSE(V, W, H) = sqrt( (1/|Z_test|) Σ_{(i,j)∈Z_test} (V_ij − [WH]_ij)² ), where Z_test is the index set of test entries, V_ij is the true rating, and [WH]_ij is the estimated rating. Large-scale Matrix Factorization (by Kijung Shin) 71/99
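A small sketch of this metric, assuming the test entries are given as (i, j, rating) triples:

```python
import numpy as np

def test_rmse(test_entries, W, H):
    """Root-mean-square error over held-out test entries.

    test_entries: iterable of (i, j, rating) triples (Z_test)."""
    sq_err = [(v - W[i, :] @ H[:, j]) ** 2 for i, j, v in test_entries]
    return np.sqrt(np.mean(sq_err))
```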

72 Experimental Settings (cont.) Machines and implementations Shared-memory setting: 8 cores OpenMP in C++ Distributed setting: up to 20 machines MPI in C++ Large-scale Matrix Factorization (by Kijung Shin) 72/99

73 Convergence Speed Shared-memory setting with 8 cores CCD++ decreases Test RMSE fastest Figure 3(b) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 73/99

74 Convergence Speed (cont.) Shared-memory setting with 8 cores CCD++ converges fastest in Test RMSE Figure 3(c) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 74/99

75 Convergence Speed (cont.) Shared-memory setting with 8 cores CCD++ converges fastest Figure 3(a) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 75/99

76 Speedup in Shared Memory CCD++ and ALS show near-linear speedup; DSGD suffers from high cache-miss rates due to its randomness, and the cache-miss rate increases as more cores are used. Figure 4 of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 76/99

77 Speedup in Distributed Settings All algorithms show similar speedups Figure 6(b) of [YHS+12] Large-scale Matrix Factorization (by Kijung Shin) 77/99

78 Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization << Conclusions Large-scale Matrix Factorization (by Kijung Shin) 78/99

79 Tensors An N-way tensor is an N-dimensional array: a 1-way tensor (t_1, …, t_{I_1}) is a vector; a 2-way tensor (t_{i_1 i_2} with i_1 = 1, …, I_1 and i_2 = 1, …, I_2) is a matrix; a 3-way tensor is a 3-dimensional array. Large-scale Matrix Factorization (by Kijung Shin) 79/99

80 Tensor Factorization: Problem Given: X: I by J by K tensor, possibly with missing values; R: target rank (a scalar), usually R << I, J, K. [figure: I × J × K tensor X with missing entries] Large-scale Matrix Factorization (by Kijung Shin) 80/99

81 Tensor Factorization: Problem Given: X: I by J by K tensor (i.e., 3-dim array), possibly with missing values; R: target rank (a scalar), usually R << I, J, K. Find: U: I by R matrix, H: J by R matrix, W: K by R matrix, all without missing values. Large-scale Matrix Factorization (by Kijung Shin) 81/99

82 Tensor Factorization: Problem Goal: X ≈ [UHW], where [UHW]_ijk = Σ_{r=1}^{R} U_ir H_jr W_kr. Large-scale Matrix Factorization (by Kijung Shin) 82/99

83 Loss Function L(X, U, H, W) = Σ_{(i,j,k)∈Z} (X_ijk − [UHW]_ijk)² + …, where Z is the set of indices of non-missing entries, i.e., (i, j, k) ∈ Z ⟺ X_ijk is not missing; X_ijk is the (i, j, k)-th entry of X and [UHW]_ijk is the (i, j, k)-th entry of [UHW]. Goal: to make [UHW] similar to X. Large-scale Matrix Factorization (by Kijung Shin) 83/99

84 Loss Function (cont.) L(X, U, H, W) = Σ_{(i,j,k)∈Z} (X_ijk − [UHW]_ijk)² + λ(‖U‖_F² + ‖H‖_F² + ‖W‖_F²), where λ is the regularization parameter. Goal: to prevent overfitting (by keeping the entries of U, H, and W close to zero). Frobenius norm: ‖U‖_F² = Σ_{i=1}^{I} Σ_{r=1}^{R} U_ir². Large-scale Matrix Factorization (by Kijung Shin) 84/99

85 SGD for TF L(X, U, H, W) is a sum of losses over the non-missing entries: L(X, U, H, W) = Σ_{(i,j,k)∈Z} L(X_ijk, U_i:, H_j:, W_k:). An SGD step on each non-missing entry X_ijk: read U_i:, H_j:, and W_k:, compute the gradient of L(X_ijk, U_i:, H_j:, W_k:), and update U_i:, H_j:, and W_k:. Large-scale Matrix Factorization (by Kijung Shin) 85/99
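A sketch of one such SGD step, assuming a plain L2-regularized squared loss on the entry (the exact per-entry regularizer is not spelled out on this slide, so this is an illustration only):

```python
import numpy as np

def sgd_step_tf(x_ijk, U, H, W, i, j, k, lr, lam):
    """One SGD step on tensor entry X[i, j, k] for the model
    X_ijk ≈ Σ_r U_ir * H_jr * W_kr, with L2 regularization."""
    err = x_ijk - np.sum(U[i, :] * H[j, :] * W[k, :])   # residual on this entry
    gU = -2 * err * H[j, :] * W[k, :] + 2 * lam * U[i, :]
    gH = -2 * err * U[i, :] * W[k, :] + 2 * lam * H[j, :]
    gW = -2 * err * U[i, :] * H[j, :] + 2 * lam * W[k, :]
    U[i, :] -= lr * gU
    H[j, :] -= lr * gH
    W[k, :] -= lr * gW
```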

86 Parallel Algorithms for TF Parallel algorithms for MF have been extended to TF: DSGD → FlexiFaCT [BKP+14]; ALS → [SPK16]; CCD++ → CDTF [SK17]. Large-scale Matrix Factorization (by Kijung Shin) 86/99

87 Applications Context-aware recommendation [KABO10]: a users × movies × contexts rating tensor; a missing rating is estimated as X_ijk ≈ [UHW]_ijk. Large-scale Matrix Factorization (by Kijung Shin) 87/99

88 Applications (cont.) Context-aware recommendation [KABO10]: the same idea applies to a users × restaurants × contexts rating tensor; a missing rating is estimated as X_ijk ≈ [UHW]_ijk. Large-scale Matrix Factorization (by Kijung Shin) 88/99

89 Applications (cont.) Video restoration [LMWY13]: a video is an x-axis × y-axis × frames tensor of pixels; a missing or corrupted pixel is estimated as X_ijk ≈ [UHW]_ijk. Large-scale Matrix Factorization (by Kijung Shin) 89/99

90 Applications (cont.) Video Restoration [LMWY13] Corrupted frame Restored frame Large-scale Matrix Factorization (by Kijung Shin) 90/99

91 Applications (cont.) Personalized web search [SZL+05]: a users × query keywords × web pages preference tensor; a missing preference is estimated as X_ijk ≈ [UHW]_ijk. Large-scale Matrix Factorization (by Kijung Shin) 91/99

92 Roadmap Matrix Factorization (review) Algorithms Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Experiments Extension to Tensor Factorization Conclusions << Large-scale Matrix Factorization (by Kijung Shin) 92/99

93 Conclusions Matrix Factorization (MF) Parallel algorithms for MF Distributed SGD: DSGD Alternating Least Square: ALS Cyclic Coordinate Descent: CCD++ Tensor Factorization (TF) Extension of MF to tensors Applications: context-aware recommendation, video restoration, personalized web search, etc. Large-scale Matrix Factorization (by Kijung Shin) 93/99

94 Questions? These slides are available at: Large-scale Matrix Factorization (by Kijung Shin) 94/99

95 References [SZL+05] Jian-Tao Sun, Hua-Jun Zeng, Huan Liu, Yuchang Lu, and Zheng Chen. "CubeSVD: a novel approach to personalized web search." WWW 2005 [ZWSP08] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. "Large-scale parallel collaborative filtering for the Netflix prize." AAIM 2008 [KABO10] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. "Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering." RecSys 2010 [GNHS11] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. "Large-scale matrix factorization with distributed stochastic gradient descent." KDD 2011 [DKA11] Daniel M. Dunlavy, Tamara G. Kolda, and Evrim Acar. "Temporal link prediction using matrix and tensor factorizations." TKDD 2011 Large-scale Matrix Factorization (by Kijung Shin) 95/99

96 References (cont.) [YHS+12] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems." ICDM 2012 [LMWY13] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. "Tensor completion for estimating missing values in visual data." TPAMI 2013 [BKP+14] Alex Beutel, Abhimanu Kumar, Evangelos E. Papalexakis, Partha Pratim Talukdar, Christos Faloutsos, and Eric P. Xing. "FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop." SDM 2014 [SPK16] Shaden Smith, Jongsoo Park, and George Karypis. "An exploration of optimization algorithms for high performance tensor completion." SC 2016 [SK17] Kijung Shin and U Kang. "Distributed Methods for High-dimensional and Large-scale Tensor Factorization." ICDM 2014 Large-scale Matrix Factorization (by Kijung Shin) 96/99

97 Closed-Form Solution of ALS Let W = [W_1:; …; W_n:] and H = [H_:1, …, H_:m]. Then L(V, W, H) = Σ_{(i,j)∈Z} (V_ij − W_i: H_:j)² + λ(Σ_{i=1}^{n} ‖W_i:‖² + Σ_{j=1}^{m} ‖H_:j‖²) = Σ_{(i,j)∈Z} (V_ij − Σ_{k=1}^{r} W_ik H_kj)² + λ(Σ_{i=1}^{n} Σ_{k=1}^{r} W_ik² + Σ_{j=1}^{m} Σ_{k=1}^{r} H_kj²). Large-scale Matrix Factorization (by Kijung Shin) 97/99

98 Closed-Form Solution of ALS (cont.) Setting (1/2) ∂L/∂H_kj = 0, ∀k, j (first-order conditions) gives −Σ_{i∈Z_:j} (V_ij − W_i: H_:j) W_ik + λH_kj = 0, ∀k, j ⟺ Σ_{i∈Z_:j} (W_i: H_:j) W_ik + λH_kj = Σ_{i∈Z_:j} V_ij W_ik, ∀k, j, where Z_:j is the set of row indices of non-missing entries in the j-th column of V. Large-scale Matrix Factorization (by Kijung Shin) 98/99

99 Closed-Form Solution of ALS (cont.) Stacking the r conditions on H_1j, …, H_rj gives (Σ_{i∈Z_:j} W_i: W_i:ᵀ + λI_k) H_:j = Σ_{i∈Z_:j} V_ij W_i:, ∀j (I_k: k-by-k identity matrix) ⟺ H_:j = (Σ_{i∈Z_:j} W_i: W_i:ᵀ + λI_k)⁻¹ Σ_{i∈Z_:j} V_ij W_i:, ∀j. Large-scale Matrix Factorization (by Kijung Shin) 99/99
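A small numerical sanity check of this closed-form update (illustrative, assuming a fully observed j-th column for simplicity): the solved H_:j should zero the gradient of the regularized loss restricted to that column.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, lam = 6, 3, 0.1
W = rng.random((n, r))
v = rng.random(n)                                   # j-th column of V, fully observed

# closed-form solution for H_:j
h = np.linalg.solve(W.T @ W + lam * np.eye(r), W.T @ v)

# gradient of Σ_i (v_i - W_i: h)^2 + λ ||h||^2 at the solution
grad = -2 * W.T @ (v - W @ h) + 2 * lam * h
print(np.allclose(grad, 0))                         # True
```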

Distributed Methods for High-dimensional and Large-scale Tensor Factorization

Distributed Methods for High-dimensional and Large-scale Tensor Factorization Distributed Methods for High-dimensional and Large-scale Tensor Factorization Kijung Shin Dept. of Computer Science and Engineering Seoul National University Seoul, Republic of Korea koreaskj@snu.ac.kr

More information

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent KDD 2011 Rainer Gemulla, Peter J. Haas, Erik Nijkamp and Yannis Sismanis Presenter: Jiawen Yao Dept. CSE, UT Arlington 1 1

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016 Paper presentations and final project proposal Send me the names of your group member (2 or 3 students) before October 15 (this Friday)

More information

FLEXIFACT: Scalable Flexible Factorization of Coupled Tensors on Hadoop

FLEXIFACT: Scalable Flexible Factorization of Coupled Tensors on Hadoop FLEXIFACT: Scalable Flexible Factorization of Coupled Tensors on Hadoop Alex Beutel Department of Computer Science abeutel@cs.cmu.edu Evangelos Papalexakis Department of Computer Science epapalex@cs.cmu.edu

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 6 Structured and Low Rank Matrices Section 6.3 Numerical Optimization Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at

More information

Low Rank Matrix Completion Formulation and Algorithm

Low Rank Matrix Completion Formulation and Algorithm 1 2 Low Rank Matrix Completion and Algorithm Jian Zhang Department of Computer Science, ETH Zurich zhangjianthu@gmail.com March 25, 2014 Movie Rating 1 2 Critic A 5 5 Critic B 6 5 Jian 9 8 Kind Guy B 9

More information

Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems

Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon Department of Computer Science University of Texas

More information

Parallel Matrix Factorization for Recommender Systems

Parallel Matrix Factorization for Recommender Systems Under consideration for publication in Knowledge and Information Systems Parallel Matrix Factorization for Recommender Systems Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon Department of

More information

Large-scale Collaborative Ranking in Near-Linear Time

Large-scale Collaborative Ranking in Near-Linear Time Large-scale Collaborative Ranking in Near-Linear Time Liwei Wu Depts of Statistics and Computer Science UC Davis KDD 17, Halifax, Canada August 13-17, 2017 Joint work with Cho-Jui Hsieh and James Sharpnack

More information

Matrix Factorization and Factorization Machines for Recommender Systems

Matrix Factorization and Factorization Machines for Recommender Systems Talk at SDM workshop on Machine Learning Methods on Recommender Systems, May 2, 215 Chih-Jen Lin (National Taiwan Univ.) 1 / 54 Matrix Factorization and Factorization Machines for Recommender Systems Chih-Jen

More information

FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop

FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop Alex Beutel abeutel@cs.cmu.edu Partha Pratim Talukdar ppt@cs.cmu.edu Abhimanu Kumar abhimank@cs.cmu.edu Christos Faloutsos christos@cs.cmu.edu

More information

Using SVD to Recommend Movies

Using SVD to Recommend Movies Michael Percy University of California, Santa Cruz Last update: December 12, 2009 Last update: December 12, 2009 1 / Outline 1 Introduction 2 Singular Value Decomposition 3 Experiments 4 Conclusion Last

More information

Scaling Neighbourhood Methods

Scaling Neighbourhood Methods Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

DFacTo: Distributed Factorization of Tensors

DFacTo: Distributed Factorization of Tensors DFacTo: Distributed Factorization of Tensors Joon Hee Choi Electrical and Computer Engineering Purdue University West Lafayette IN 47907 choi240@purdue.edu S. V. N. Vishwanathan Statistics and Computer

More information

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 27, 2015 Outline Linear regression Ridge regression and Lasso Time complexity (closed form solution) Iterative Solvers Regression Input: training

More information

Recommender System for Yelp Dataset CS6220 Data Mining Northeastern University

Recommender System for Yelp Dataset CS6220 Data Mining Northeastern University Recommender System for Yelp Dataset CS6220 Data Mining Northeastern University Clara De Paolis Kaluza Fall 2016 1 Problem Statement and Motivation The goal of this work is to construct a personalized recommender

More information

Collaborative Filtering Matrix Completion Alternating Least Squares

Collaborative Filtering Matrix Completion Alternating Least Squares Case Study 4: Collaborative Filtering Collaborative Filtering Matrix Completion Alternating Least Squares Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 19, 2016

More information

Polynomial optimization methods for matrix factorization

Polynomial optimization methods for matrix factorization Polynomial optimization methods for matrix factorization Po-Wei Wang Machine Learning Department Carnegie Mellon University Pittsburgh, PA 53 poweiw@cs.cmu.edu Chun-Liang Li Machine Learning Department

More information

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015 Matrix Factorization In Recommender Systems Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015 Table of Contents Background: Recommender Systems (RS) Evolution of Matrix

More information

Matrix and Tensor Factorization from a Machine Learning Perspective

Matrix and Tensor Factorization from a Machine Learning Perspective Matrix and Tensor Factorization from a Machine Learning Perspective Christoph Freudenthaler Information Systems and Machine Learning Lab, University of Hildesheim Research Seminar, Vienna University of

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

Improved Bounded Matrix Completion for Large-Scale Recommender Systems

Improved Bounded Matrix Completion for Large-Scale Recommender Systems Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-7) Improved Bounded Matrix Completion for Large-Scale Recommender Systems Huang Fang, Zhen Zhang, Yiqun

More information

Distributed Stochastic ADMM for Matrix Factorization

Distributed Stochastic ADMM for Matrix Factorization Distributed Stochastic ADMM for Matrix Factorization Zhi-Qin Yu Shanghai Key Laboratory of Scalable Computing and Systems Department of Computer Science and Engineering Shanghai Jiao Tong University, China

More information

Fast Coordinate Descent methods for Non-Negative Matrix Factorization

Fast Coordinate Descent methods for Non-Negative Matrix Factorization Fast Coordinate Descent methods for Non-Negative Matrix Factorization Inderjit S. Dhillon University of Texas at Austin SIAM Conference on Applied Linear Algebra Valencia, Spain June 19, 2012 Joint work

More information

High Performance Parallel Tucker Decomposition of Sparse Tensors

High Performance Parallel Tucker Decomposition of Sparse Tensors High Performance Parallel Tucker Decomposition of Sparse Tensors Oguz Kaya INRIA and LIP, ENS Lyon, France SIAM PP 16, April 14, 2016, Paris, France Joint work with: Bora Uçar, CNRS and LIP, ENS Lyon,

More information

Collaborative Filtering. Radek Pelánek

Collaborative Filtering. Radek Pelánek Collaborative Filtering Radek Pelánek 2017 Notes on Lecture the most technical lecture of the course includes some scary looking math, but typically with intuitive interpretation use of standard machine

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

CS260: Machine Learning Algorithms

CS260: Machine Learning Algorithms CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {

More information

CSC321 Lecture 2: Linear Regression

CSC321 Lecture 2: Linear Regression CSC32 Lecture 2: Linear Regression Roger Grosse Roger Grosse CSC32 Lecture 2: Linear Regression / 26 Overview First learning algorithm of the course: linear regression Task: predict scalar-valued targets,

More information

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

Principal Component Analysis (PCA) for Sparse High-Dimensional Data AB Principal Component Analysis (PCA) for Sparse High-Dimensional Data Tapani Raiko, Alexander Ilin, and Juha Karhunen Helsinki University of Technology, Finland Adaptive Informatics Research Center Principal

More information

A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization

A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization Wei-Sheng Chin, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin Department of Computer Science National Taiwan University, Taipei,

More information

Supplemental Material for TKDE

Supplemental Material for TKDE Supplemental Material for TKDE-05-05-035 Kijung Sh, Lee Sael, U Kang PROPOSED METHODS Proofs of Update Rules In this section, we present the proofs of the update rules Section 35 of the ma paper Specifically,

More information

Clustering based tensor decomposition

Clustering based tensor decomposition Clustering based tensor decomposition Huan He huan.he@emory.edu Shihua Wang shihua.wang@emory.edu Emory University November 29, 2017 (Huan)(Shihua) (Emory University) Clustering based tensor decomposition

More information

Recommendation Systems

Recommendation Systems Recommendation Systems Popularity Recommendation Systems Predicting user responses to options Offering news articles based on users interests Offering suggestions on what the user might like to buy/consume

More information

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task Binary Principal Component Analysis in the Netflix Collaborative Filtering Task László Kozma, Alexander Ilin, Tapani Raiko first.last@tkk.fi Helsinki University of Technology Adaptive Informatics Research

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Chapter 2. Optimization. Gradients, convexity, and ALS

Chapter 2. Optimization. Gradients, convexity, and ALS Chapter 2 Optimization Gradients, convexity, and ALS Contents Background Gradient descent Stochastic gradient descent Newton s method Alternating least squares KKT conditions 2 Motivation We can solve

More information

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007 Recommender Systems Dipanjan Das Language Technologies Institute Carnegie Mellon University 20 November, 2007 Today s Outline What are Recommender Systems? Two approaches Content Based Methods Collaborative

More information

IBM Research Report. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

IBM Research Report. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent RJ10481 (A1103-008) March 16, 2011 Computer Science IBM Research Report Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent Rainer Gemulla Max-Planck-Institut für Informatik Saarbrücken,

More information

Matrix Factorization Techniques for Recommender Systems

Matrix Factorization Techniques for Recommender Systems Matrix Factorization Techniques for Recommender Systems Patrick Seemann, December 16 th, 2014 16.12.2014 Fachbereich Informatik Recommender Systems Seminar Patrick Seemann Topics Intro New-User / New-Item

More information

Collaborative Filtering

Collaborative Filtering Case Study 4: Collaborative Filtering Collaborative Filtering Matrix Completion Alternating Least Squares Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin

More information

A Modified PMF Model Incorporating Implicit Item Associations

A Modified PMF Model Incorporating Implicit Item Associations A Modified PMF Model Incorporating Implicit Item Associations Qiang Liu Institute of Artificial Intelligence College of Computer Science Zhejiang University Hangzhou 31007, China Email: 01dtd@gmail.com

More information

arxiv: v6 [cs.na] 5 Dec 2017

arxiv: v6 [cs.na] 5 Dec 2017 Fast, Accurate, and calable Method for parse Coupled Matrix-Tensor Factorization Dongjin Choi eoul National University skywalker5@snu.ac.kr Jun-Gi Jang eoul National University elnino9158@gmail.com U Kang

More information

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos Must-read Material 15-826: Multimedia Databases and Data Mining Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. Technical Report SAND2007-6702, Sandia National Laboratories,

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

SQL-Rank: A Listwise Approach to Collaborative Ranking

SQL-Rank: A Listwise Approach to Collaborative Ranking SQL-Rank: A Listwise Approach to Collaborative Ranking Liwei Wu Depts of Statistics and Computer Science UC Davis ICML 18, Stockholm, Sweden July 10-15, 2017 Joint work with Cho-Jui Hsieh and James Sharpnack

More information

CSC 576: Variants of Sparse Learning

CSC 576: Variants of Sparse Learning CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in

More information

Generative Models for Discrete Data

Generative Models for Discrete Data Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

More information

Decoupled Collaborative Ranking

Decoupled Collaborative Ranking Decoupled Collaborative Ranking Jun Hu, Ping Li April 24, 2017 Jun Hu, Ping Li WWW2017 April 24, 2017 1 / 36 Recommender Systems Recommendation system is an information filtering technique, which provides

More information

Nesterov-based Alternating Optimization for Nonnegative Tensor Completion: Algorithm and Parallel Implementation

Nesterov-based Alternating Optimization for Nonnegative Tensor Completion: Algorithm and Parallel Implementation Nesterov-based Alternating Optimization for Nonnegative Tensor Completion: Algorithm and Parallel Implementation Georgios Lourakis and Athanasios P. Liavas School of Electrical and Computer Engineering,

More information

Fast Nonnegative Matrix Factorization with Rank-one ADMM

Fast Nonnegative Matrix Factorization with Rank-one ADMM Fast Nonnegative Matrix Factorization with Rank-one Dongjin Song, David A. Meyer, Martin Renqiang Min, Department of ECE, UCSD, La Jolla, CA, 9093-0409 dosong@ucsd.edu Department of Mathematics, UCSD,

More information

Large-scale Collaborative Ranking in Near-Linear Time

Large-scale Collaborative Ranking in Near-Linear Time Large-scale Collaborative Ranking in Near-Linear Time Liwei Wu Depts. of Statistics and Computer Science University of California, Davis liwu@ucdavis.edu ABSTRACT In this paper, we consider the Collaborative

More information

Algorithms for Collaborative Filtering

Algorithms for Collaborative Filtering Algorithms for Collaborative Filtering or How to Get Half Way to Winning $1million from Netflix Todd Lipcon Advisor: Prof. Philip Klein The Real-World Problem E-commerce sites would like to make personalized

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Rank Selection in Low-rank Matrix Approximations: A Study of Cross-Validation for NMFs

Rank Selection in Low-rank Matrix Approximations: A Study of Cross-Validation for NMFs Rank Selection in Low-rank Matrix Approximations: A Study of Cross-Validation for NMFs Bhargav Kanagal Department of Computer Science University of Maryland College Park, MD 277 bhargav@cs.umd.edu Vikas

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

PU Learning for Matrix Completion

PU Learning for Matrix Completion Cho-Jui Hsieh Dept of Computer Science UT Austin ICML 2015 Joint work with N. Natarajan and I. S. Dhillon Matrix Completion Example: movie recommendation Given a set Ω and the values M Ω, how to predict

More information

Faloutsos, Tong ICDE, 2009

Faloutsos, Tong ICDE, 2009 Large Graph Mining: Patterns, Tools and Case Studies Christos Faloutsos Hanghang Tong CMU Copyright: Faloutsos, Tong (29) 2-1 Outline Part 1: Patterns Part 2: Matrix and Tensor Tools Part 3: Proximity

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Least Mean Squares Regression

Least Mean Squares Regression Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method

More information

This ensures that we walk downhill. For fixed λ not even this may be the case.

This ensures that we walk downhill. For fixed λ not even this may be the case. Gradient Descent Objective Function Some differentiable function f : R n R. Gradient Descent Start with some x 0, i = 0 and learning rate λ repeat x i+1 = x i λ f(x i ) until f(x i+1 ) ɛ Line Search Variant

More information

CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu

CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu Feature engineering is hard 1. Extract informative features from domain knowledge

More information

Matrices: 2.1 Operations with Matrices

Matrices: 2.1 Operations with Matrices Goals In this chapter and section we study matrix operations: Define matrix addition Define multiplication of matrix by a scalar, to be called scalar multiplication. Define multiplication of two matrices,

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

Big & Quic: Sparse Inverse Covariance Estimation for a Million Variables

Big & Quic: Sparse Inverse Covariance Estimation for a Million Variables for a Million Variables Cho-Jui Hsieh The University of Texas at Austin NIPS Lake Tahoe, Nevada Dec 8, 2013 Joint work with M. Sustik, I. Dhillon, P. Ravikumar and R. Poldrack FMRI Brain Analysis Goal:

More information

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang Relational Stacked Denoising Autoencoder for Tag Recommendation Hao Wang Dept. of Computer Science and Engineering Hong Kong University of Science and Technology Joint work with Xingjian Shi and Dit-Yan

More information

Using R for Iterative and Incremental Processing

Using R for Iterative and Incremental Processing Using R for Iterative and Incremental Processing Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Robert Schreiber UC Berkeley and HP Labs UC BERKELEY Big Data, Complex Algorithms PageRank (Dominant

More information

Multiverse Recommendation: N-dimensional Tensor Factorization for Context-aware Collaborative Filtering

Multiverse Recommendation: N-dimensional Tensor Factorization for Context-aware Collaborative Filtering Multiverse Recommendation: N-dimensional Tensor Factorization for Context-aware Collaborative Filtering Alexandros Karatzoglou Telefonica Research Barcelona, Spain alexk@tid.es Xavier Amatriain Telefonica

More information

Andriy Mnih and Ruslan Salakhutdinov

Andriy Mnih and Ruslan Salakhutdinov MATRIX FACTORIZATION METHODS FOR COLLABORATIVE FILTERING Andriy Mnih and Ruslan Salakhutdinov University of Toronto, Machine Learning Group 1 What is collaborative filtering? The goal of collaborative

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org J. Leskovec,

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

CSC 411 Lecture 6: Linear Regression

CSC 411 Lecture 6: Linear Regression CSC 411 Lecture 6: Linear Regression Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 06-Linear Regression 1 / 37 A Timely XKCD UofT CSC 411: 06-Linear Regression

More information

Matrix Completion: Fundamental Limits and Efficient Algorithms

Matrix Completion: Fundamental Limits and Efficient Algorithms Matrix Completion: Fundamental Limits and Efficient Algorithms Sewoong Oh PhD Defense Stanford University July 23, 2010 1 / 33 Matrix completion Find the missing entries in a huge data matrix 2 / 33 Example

More information

Data Mining and Matrices

Data Mining and Matrices Data Mining and Matrices 6 Non-Negative Matrix Factorization Rainer Gemulla, Pauli Miettinen May 23, 23 Non-Negative Datasets Some datasets are intrinsically non-negative: Counters (e.g., no. occurrences

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Scalable Nonnegative Matrix Factorization with Block-wise Updates

Scalable Nonnegative Matrix Factorization with Block-wise Updates Scalable Nonnegative Matrix Factorization with Block-wise Updates Jiangtao Yin 1, Lixin Gao 1, and Zhongfei (Mark) Zhang 2 1 University of Massachusetts Amherst, Amherst, MA 01003, USA 2 Binghamton University,

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Stochastic Gradient Descent Stochastic

More information

Personalized Ranking for Non-Uniformly Sampled Items

Personalized Ranking for Non-Uniformly Sampled Items JMLR: Workshop and Conference Proceedings 18:231 247, 2012 Proceedings of KDD-Cup 2011 competition Personalized Ranking for Non-Uniformly Sampled Items Zeno Gantner Lucas Drumond Christoph Freudenthaler

More information

Tensor Decomposition with Smoothness (ICML2017)

Tensor Decomposition with Smoothness (ICML2017) Tensor Decomposition with Smoothness (ICML2017) Masaaki Imaizumi 1 Kohei Hayashi 2,3 1 Institute of Statistical Mathematics 2 National Institute of Advanced Industrial Science and Technology 3 RIKEN Center

More information

Stochastic Gradient Descent. CS 584: Big Data Analytics

Stochastic Gradient Descent. CS 584: Big Data Analytics Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration

More information

Notes on AdaGrad. Joseph Perla 2014

Notes on AdaGrad. Joseph Perla 2014 Notes on AdaGrad Joseph Perla 2014 1 Introduction Stochastic Gradient Descent (SGD) is a common online learning algorithm for optimizing convex (and often non-convex) functions in machine learning today.

More information

Matrix Factorization and Collaborative Filtering

Matrix Factorization and Collaborative Filtering 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Matrix Factorization and Collaborative Filtering MF Readings: (Koren et al., 2009)

More information

A Distributed Solver for Kernelized SVM

A Distributed Solver for Kernelized SVM and A Distributed Solver for Kernelized Stanford ICME haoming@stanford.edu bzhe@stanford.edu June 3, 2015 Overview and 1 and 2 3 4 5 Support Vector Machines and A widely used supervised learning model,

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University 2018 EE448, Big Data Mining, Lecture 10 Recommender Systems Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html Content of This Course Overview of

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

Motivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning

Motivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning Convex Optimization Lecture 15 - Gradient Descent in Machine Learning Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 21 Today s Lecture 1 Motivation 2 Subgradient Method 3 Stochastic

More information

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Crowdsourcing via Tensor Augmentation and Completion (TAC)

Crowdsourcing via Tensor Augmentation and Completion (TAC) Crowdsourcing via Tensor Augmentation and Completion (TAC) Presenter: Yao Zhou joint work with: Dr. Jingrui He - 1 - Roadmap Background Related work Crowdsourcing based on TAC Experimental results Conclusion

More information

Vector, Matrix, and Tensor Derivatives

Vector, Matrix, and Tensor Derivatives Vector, Matrix, and Tensor Derivatives Erik Learned-Miller The purpose of this document is to help you learn to take derivatives of vectors, matrices, and higher order tensors (arrays with three dimensions

More information

A Quick Tour of Linear Algebra and Optimization for Machine Learning

A Quick Tour of Linear Algebra and Optimization for Machine Learning A Quick Tour of Linear Algebra and Optimization for Machine Learning Masoud Farivar January 8, 2015 1 / 28 Outline of Part I: Review of Basic Linear Algebra Matrices and Vectors Matrix Multiplication Operators

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 2 Often, our data can be represented by an m-by-n matrix. And this matrix can be closely approximated by the product of two matrices that share a small common dimension

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information