ECS289: Scalable Machine Learning


1 ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 11, 2016

2 Paper presentations and final project proposal
- Send me the names of your group members (2 or 3 students) before October 15 (this Friday).
- Send me the paper you want to present before October 21. Choose a paper from ICML, NIPS, KDD, ICDM, AISTATS, or JMLR.
- Final project proposal (one page): due October 30 (11:59pm).

3 Outline
- Matrix Completion (Background)
- Alternating Least Squares (ALS)
- Stochastic Gradient method (SG)
- Coordinate Descent (CD)

4 Recommender Systems

5-6 Matrix Factorization Approach: $A \approx W H^T$

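7 Matrix Factorization Approach
$$\min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}} \; \sum_{(i,j) \in \Omega} \left( A_{ij} - w_i^T h_j \right)^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)$$
- $\Omega = \{ (i, j) \mid A_{ij} \text{ is observed} \}$
- The regularization terms $\lambda ( \|W\|_F^2 + \|H\|_F^2 )$ are there to avoid over-fitting.
- Matrix factorization maps users and items to a latent feature space $\mathbb{R}^k$: the $i$-th user corresponds to the $i$-th row of $W$, $w_i^T$; the $j$-th item corresponds to the $j$-th row of $H$, $h_j^T$.
- $w_i^T h_j$ measures the interaction between the $i$-th user and the $j$-th item.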
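The objective above can be evaluated in a few lines. The following is a minimal sketch, assuming NumPy, a dense array `A`, and a boolean array `mask` that marks the observed set $\Omega$; the function name `mf_objective` and the toy sizes are illustrative choices, not part of the lecture.

```python
import numpy as np

def mf_objective(A, mask, W, H, lam):
    """Squared error over observed entries plus Frobenius-norm regularization."""
    residual = (A - W @ H.T) * mask          # zero out unobserved entries
    return np.sum(residual ** 2) + lam * (np.sum(W ** 2) + np.sum(H ** 2))

# Toy data (assumed sizes): A is exactly rank k, so the true factors fit it perfectly.
rng = np.random.default_rng(0)
m, n, k = 4, 5, 2
W_true = rng.standard_normal((m, k))
H_true = rng.standard_normal((n, k))
A = W_true @ H_true.T
mask = rng.random((m, n)) < 0.6              # True where A_ij is observed

loss_no_reg = mf_objective(A, mask, W_true, H_true, lam=0.0)
loss_reg = mf_objective(A, mask, W_true, H_true, lam=0.1)
```

With the true factors the data term vanishes, so `loss_reg` consists only of the regularization penalty.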

8 Latent Feature Space


10 Other Factorizations
Nonnegative Matrix Factorization:
$$\min_{W \ge 0,\; H \ge 0} \; \|A - W H^T\|_F^2 + \lambda \|W\|_F^2 + \lambda \|H\|_F^2$$
- Each entry is positive
- $A$ is either fully or partially observed
- Goal: find nonnegative latent factors

11 NMF vs PCA

12 Optimization for Matrix Completion: Alternating Least Squares

13 Properties of the Objective Function
- Nonconvex problem (why?)
- Example: $f(x, y) = \frac{1}{2} (xy - 1)^2$
- $\nabla f(0, 0) = 0$, but clearly $[0, 0]$ is not a global optimum: $f(0, 0) = \frac{1}{2}$, while $f(1, 1) = 0$.
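A quick numerical check of this example, written as a small sketch with hand-derived gradients. The points $(1, 1)$ and $(-1, -1)$ are chosen here for illustration: both are global minimizers, and their midpoint is the origin, which shows directly that the function cannot be convex.

```python
import numpy as np

def f(x, y):
    return 0.5 * (x * y - 1.0) ** 2

def grad_f(x, y):
    r = x * y - 1.0
    return np.array([r * y, r * x])          # (df/dx, df/dy)

grad_at_origin = grad_f(0.0, 0.0)            # stationary point
value_at_origin = f(0.0, 0.0)
value_at_minimum = f(1.0, 1.0)

# Convexity would require f(midpoint) <= average of the endpoint values.
midpoint_value = f(0.0, 0.0)                 # midpoint of (1, 1) and (-1, -1)
average_value = 0.5 * (f(1.0, 1.0) + f(-1.0, -1.0))
```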

14 ALS: Alternating Least Squares
Objective function:
$$\min_{W, H} \; \frac{1}{2} \sum_{(i,j) \in \Omega} \left( A_{ij} - (W H^T)_{ij} \right)^2 + \frac{\lambda}{2} \|W\|_F^2 + \frac{\lambda}{2} \|H\|_F^2 := f(W, H)$$
Iteratively fix either $H$ or $W$ and optimize the other:
Input: partially observed matrix $A$, initial values of $W$, $H$
For $t = 1, 2, \dots$
    Fix $W$ and update $H$: $H \leftarrow \arg\min_H f(W, H)$
    Fix $H$ and update $W$: $W \leftarrow \arg\min_W f(W, H)$

15-16 ALS: Alternating Least Squares
Define:
- $\Omega_j := \{ i \mid (i, j) \in \Omega \}$
- $w_i$: the $i$-th row of $W$; $h_j$: the $j$-th row of $H$
The subproblem:
$$\arg\min_H \; \frac{1}{2} \sum_{(i,j) \in \Omega} \left( A_{ij} - (W H^T)_{ij} \right)^2 + \frac{\lambda}{2} \|H\|_F^2 \;=\; \arg\min_H \; \sum_{j=1}^{n} \underbrace{\left( \frac{1}{2} \sum_{i \in \Omega_j} \left( A_{ij} - w_i^T h_j \right)^2 + \frac{\lambda}{2} \|h_j\|^2 \right)}_{\text{ridge regression problem}}$$
- $n$ ridge regression problems, each with $k$ variables
- Time complexity: $O(|\Omega| k^2 + n k^3)$
- Easy to parallelize ($n$ independent ridge regression subproblems)
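The alternating scheme can be sketched as follows, assuming NumPy, a dense `A`, and a boolean `mask` for $\Omega$. The helper names (`objective`, `update_rows`, `als`), the random initialization, and the toy problem sizes are assumptions made for illustration. Each ridge subproblem is solved through its normal equations $(W_{\Omega_j}^T W_{\Omega_j} + \lambda I) h_j = W_{\Omega_j}^T a_j$.

```python
import numpy as np

def objective(A, mask, W, H, lam):
    residual = (A - W @ H.T) * mask
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(H ** 2))

def update_rows(A, mask, F, lam):
    """With factor F fixed, solve one ridge regression per column of A."""
    k = F.shape[1]
    out = np.zeros((A.shape[1], k))
    for j in range(A.shape[1]):
        idx = np.flatnonzero(mask[:, j])     # observed rows in column j (Omega_j)
        Fj = F[idx]
        out[j] = np.linalg.solve(Fj.T @ Fj + lam * np.eye(k), Fj.T @ A[idx, j])
    return out

def als(A, mask, k, lam=0.1, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.standard_normal((m, k))
    H = rng.standard_normal((n, k))
    history = [objective(A, mask, W, H, lam)]
    for _ in range(iters):
        H = update_rows(A, mask, W, lam)         # fix W, update H
        W = update_rows(A.T, mask.T, H, lam)     # fix H, update W
        history.append(objective(A, mask, W, H, lam))
    return W, H, history

# Toy rank-3 problem with roughly half of the entries observed (assumed sizes).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((15, 3)).T
mask = rng.random(A.shape) < 0.5
W, H, history = als(A, mask, k=3)
```

Because each half-step minimizes $f$ exactly over one factor, the recorded objective values should never increase. The loop over `j` in `update_rows` is the part that parallelizes, since the $n$ subproblems are independent.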

17 ALS: Alternating Least Squares
[Figure, two panels: a $3 \times 3$ matrix with entries $A_{11}, \dots, A_{33}$, its rows labeled $w_1^T, w_2^T, w_3^T$ and its columns indexed by $H^T$.]

18 Optimization for Matrix Completion: Stochastic Gradient Method

19-20 Stochastic Gradient Method
- $n_i^W$: number of nonzeros in the $i$-th row of $A$
- $n_j^H$: number of nonzeros in the $j$-th column of $A$
Decompose the problem into $|\Omega|$ components:
$$f(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} \left( A_{ij} - w_i^T h_j \right)^2 + \frac{\lambda}{2} \|W\|_F^2 + \frac{\lambda}{2} \|H\|_F^2$$
$$= \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \underbrace{\left( \frac{|\Omega|}{2} \left( A_{ij} - w_i^T h_j \right)^2 + \frac{\lambda |\Omega|}{2 n_i^W} \|w_i\|^2 + \frac{\lambda |\Omega|}{2 n_j^H} \|h_j\|^2 \right)}_{f_{i,j}(W, H)}$$
The gradient of each component:
$$\nabla_{w_i} f_{i,j}(W, H) = |\Omega| \left( (w_i^T h_j - A_{ij}) h_j + \frac{\lambda}{n_i^W} w_i \right)$$
$$\nabla_{h_j} f_{i,j}(W, H) = |\Omega| \left( (w_i^T h_j - A_{ij}) w_i + \frac{\lambda}{n_j^H} h_j \right)$$
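The decomposition can be checked numerically: averaging the components $f_{i,j}$ over $\Omega$ should reproduce $f$. This is a small sketch on assumed random data. It forces every row and column to contain at least one observed entry, because otherwise the regularization on an unobserved row or column would be missing from the sum.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, lam = 5, 4, 2, 0.3
A = rng.standard_normal((m, n))
W = rng.standard_normal((m, k))
H = rng.standard_normal((n, k))

mask = rng.random((m, n)) < 0.7
mask[np.arange(m), np.arange(m) % n] = True   # every row has an observed entry
mask[np.arange(n) % m, np.arange(n)] = True   # every column has an observed entry

rows, cols = np.nonzero(mask)
n_omega = len(rows)
n_w = mask.sum(axis=1)                         # n_i^W
n_h = mask.sum(axis=0)                         # n_j^H

full = (0.5 * np.sum(((A - W @ H.T) * mask) ** 2)
        + 0.5 * lam * (np.sum(W ** 2) + np.sum(H ** 2)))

def f_ij(i, j):
    err = A[i, j] - W[i] @ H[j]
    return n_omega * (0.5 * err ** 2
                      + lam / (2 * n_w[i]) * (W[i] @ W[i])
                      + lam / (2 * n_h[j]) * (H[j] @ H[j]))

averaged = sum(f_ij(i, j) for i, j in zip(rows, cols)) / n_omega
```

The regularizer splits correctly because row $i$ appears in exactly $n_i^W$ components, each carrying a $1 / n_i^W$ share of $\frac{\lambda}{2} \|w_i\|^2$.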

21-22 Stochastic Gradient Method
SG algorithm:
Input: partially observed matrix $A$, initial values of $W$, $H$
For $t = 1, 2, \dots$
    Randomly pick a pair $(i, j) \in \Omega$
    $w_i \leftarrow \left( 1 - \frac{\eta_t \lambda}{n_i^W} \right) w_i - \eta_t (w_i^T h_j - A_{ij}) h_j$
    $h_j \leftarrow \left( 1 - \frac{\eta_t \lambda}{n_j^H} \right) h_j - \eta_t (w_i^T h_j - A_{ij}) w_i$
(The constant factor $|\Omega|$ in the component gradients is absorbed into the step size $\eta_t$.)
Time complexity: $O(k)$ per iteration; $O(|\Omega| k)$ for one pass over all observed entries.
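The SG updates can be sketched as below, assuming NumPy and a boolean `mask` for $\Omega$. Several choices here are assumptions rather than part of the slides: a constant step size `eta` instead of a schedule $\eta_t$, visiting the observed entries in a fresh random permutation each epoch instead of sampling with replacement, small random initialization, and the names `sgd_mf` and `rmse`.

```python
import numpy as np

def rmse(A, mask, W, H):
    """Root-mean-square error over the observed entries only."""
    return float(np.sqrt(np.mean((A - W @ H.T)[mask] ** 2)))

def sgd_mf(A, mask, k, lam=0.05, eta=0.05, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = 0.1 * rng.standard_normal((m, k))
    H = 0.1 * rng.standard_normal((n, k))
    rows, cols = np.nonzero(mask)
    n_w = np.maximum(mask.sum(axis=1), 1)      # n_i^W (guard against empty rows)
    n_h = np.maximum(mask.sum(axis=0), 1)      # n_j^H (guard against empty columns)
    history = [rmse(A, mask, W, H)]
    for _ in range(epochs):
        for p in rng.permutation(len(rows)):   # one pass over Omega: O(|Omega| k)
            i, j = rows[p], cols[p]
            err = W[i] @ H[j] - A[i, j]
            w_old = W[i].copy()                # both updates use the old w_i, h_j
            W[i] = (1 - eta * lam / n_w[i]) * W[i] - eta * err * H[j]
            H[j] = (1 - eta * lam / n_h[j]) * H[j] - eta * err * w_old
        history.append(rmse(A, mask, W, H))
    return W, H, history

# Toy rank-3 problem with roughly half of the entries observed (assumed sizes).
rng = np.random.default_rng(42)
A = rng.standard_normal((30, 3)) @ rng.standard_normal((20, 3)).T
mask = rng.random(A.shape) < 0.5
W, H, history = sgd_mf(A, mask, k=3)
```

Each inner step touches only one row of `W` and one row of `H`, which is where the $O(k)$ per-iteration cost comes from.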

23 Stochastic Gradient Method
[Figure, two panels: a $3 \times 3$ matrix with entries $A_{11}, \dots, A_{33}$, its rows labeled $w_1^T, w_2^T, w_3^T$ and its columns labeled $h_1, h_2, h_3$.]

24 Optimization for Matrix Completion: Distributed Stochastic Gradient Descent (DSGD)

25 How to parallelize SG?
Two SG updates on $(i_1, j_1)$ and $(i_2, j_2)$ at the same time:
- $(i_1, j_1)$: update $w_{i_1}$ and $h_{j_1}$
- $(i_2, j_2)$: update $w_{i_2}$ and $h_{j_2}$
A conflict happens when $i_1 = i_2$ or $j_1 = j_2$.
How to avoid conflicts?
Gemulla et al., Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. In KDD 2011.

26-28 DSGD: Distributed SGD [Gemulla et al., 2011]
[Figure sequence: $A$ partitioned into $3 \times 3$ blocks $A_{11}, \dots, A_{33}$, with row factors $w_1^T, w_2^T, w_3^T$, column factors $h_1, h_2, h_3$, and processors $P_1, P_2, P_3$.]
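One way to read the block partition is as a schedule in which no two processors ever share a row block or a column block, so their SG updates cannot conflict. The sketch below builds such a schedule by cyclic shifts. This particular ordering and the function name `dsgd_schedule` are assumptions for illustration; the paper by Gemulla et al. may select and order the block sets differently.

```python
def dsgd_schedule(num_workers):
    """For sub-epoch s, worker p processes block (p, (p + s) mod P).

    Blocks are indexed from 0, so block (0, 0) corresponds to A_11.
    """
    P = num_workers
    return [[(p, (p + s) % P) for p in range(P)] for s in range(P)]

schedule = dsgd_schedule(3)
for s, blocks in enumerate(schedule):
    print(f"sub-epoch {s}: " + ", ".join(f"P{p + 1} -> A_{r + 1}{c + 1}"
                                         for p, (r, c) in enumerate(blocks)))
```

Within a sub-epoch every worker owns a distinct set of rows of $W$ and rows of $H$; after $P$ sub-epochs all $P^2$ blocks have been visited once.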

29 Optimization for Matrix Completion: Coordinate Descent

30 Coordinate Descent
Update one variable at a time:
$$w_{it} \leftarrow \frac{\sum_{j \in \Omega_i} \left( A_{ij} - w_i^T h_j + w_{it} h_{jt} \right) h_{jt}}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}$$
- The subproblem is just a univariate quadratic problem.
- $\Omega_i = \{ j : (i, j) \in \Omega \}$
- Each update can be done in $O(|\Omega_i|)$ time.
Update sequence:
- Item/user-wise update: pick a user $i$ or an item $j$, then update the $i$-th row of $W$ or the $j$-th row of $H$.
- Feature-wise update: pick a feature index $t \in \{1, \dots, k\}$, then update the $t$-th column of $W$ and of $H$ alternately.
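The single-variable update can be sketched and sanity-checked as follows, assuming NumPy and the ALS-style objective $\frac{1}{2} \sum_{(i,j) \in \Omega} (A_{ij} - w_i^T h_j)^2 + \frac{\lambda}{2} (\|W\|_F^2 + \|H\|_F^2)$. The function names and toy data are illustrative. For brevity the residuals are recomputed from scratch, which costs $O(|\Omega_i| k)$; the $O(|\Omega_i|)$ cost presumably relies on keeping the residuals up to date between updates.

```python
import numpy as np

def objective(A, mask, W, H, lam):
    residual = (A - W @ H.T) * mask
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(H ** 2))

def cd_update_w(A, mask, W, H, lam, i, t):
    """Closed-form minimizer of the objective over the single entry W[i, t]."""
    idx = np.flatnonzero(mask[i])              # Omega_i
    resid = A[i, idx] - H[idx] @ W[i]          # A_ij - w_i^T h_j for j in Omega_i
    h_t = H[idx, t]
    numerator = np.sum((resid + W[i, t] * h_t) * h_t)
    return numerator / (lam + np.sum(h_t ** 2))

rng = np.random.default_rng(3)
m, n, k, lam = 6, 5, 2, 0.2
A = rng.standard_normal((m, n))
mask = rng.random((m, n)) < 0.7
W = rng.standard_normal((m, k))
H = rng.standard_normal((n, k))

i, t = 2, 1
W_new = W.copy()
W_new[i, t] = cd_update_w(A, mask, W, H, lam, i, t)
obj_before = objective(A, mask, W, H, lam)
obj_after = objective(A, mask, W_new, H, lam)
```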

31-43 Feature-wise Update: CCD++
[Figure sequence: feature-wise updates when T = 2; final panel: Netflix with $k = 40$.]
Cycle through the $k$ feature dimensions.
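A feature-wise scheme in the spirit of CCD++ can be sketched as below, assuming NumPy and dense arrays. It keeps a residual matrix on the observed entries and, for each feature $t$, alternates the closed-form coordinate updates for the $t$-th column of $W$ and of $H$. The parameter `inner` (number of alternations per feature), the initialization, and all names are assumptions for illustration and are not taken from the slides or from the reference implementation.

```python
import numpy as np

def ccdpp_sketch(A, mask, k, lam=0.1, outer=10, inner=2, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    M = mask.astype(float)
    W = np.zeros((m, k))
    H = 0.1 * rng.standard_normal((n, k))
    R = (A - W @ H.T) * M                       # residual on observed entries

    def obj():
        return 0.5 * np.sum(R ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(H ** 2))

    history = [obj()]
    for _ in range(outer):
        for t in range(k):                      # cycle through the k features
            u, v = W[:, t].copy(), H[:, t].copy()
            R_hat = R + np.outer(u, v) * M      # residual with feature t removed
            for _ in range(inner):
                u = (R_hat @ v) / (lam + M @ v ** 2)      # update t-th column of W
                v = (R_hat.T @ u) / (lam + M.T @ u ** 2)  # update t-th column of H
            W[:, t], H[:, t] = u, v
            R = R_hat - np.outer(u, v) * M      # put the updated feature back
        history.append(obj())
    return W, H, history

# Toy rank-3 problem with roughly half of the entries observed (assumed sizes).
rng = np.random.default_rng(7)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((15, 3)).T
mask = rng.random(A.shape) < 0.5
W, H, history = ccdpp_sketch(A, mask, k=3)
```

Each column update is an exact minimization over that column, so the recorded objective values should be non-increasing.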

44 Related papers
- Alternating minimization and SGD, as used in the Netflix Prize: Koren et al., Matrix Factorization Techniques for Recommender Systems. Computer.
- Coordinate descent: Yu et al., Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. ICDM.
- DSGD (block partition for distributed SGD): Gemulla et al., Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. KDD, 2011.
- NOMAD (improved distributed SGD): NOMAD: Non-locking, stochastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion. VLDB.
- Software: LIBPMF (multi-core), LIBMF (multi-core), GraphLab, NOMAD (multi-core and distributed), MLlib (Spark), (more?)

45 Provable Guarantees for the Non-convex Form
- ALS with re-sampling can recover the underlying low-rank matrix: Jain et al., Low-rank matrix completion using alternating minimization. STOC.
- ALS without re-sampling, with a similar guarantee: Sun and Luo, Guaranteed Matrix Completion via Non-convex Factorization. FOCS.
- Ge et al., Matrix Completion has No Spurious Local Minimum. NIPS.

46 Coming up Next class: other matrix completion topics Questions?
