Lecture: Matrix Completion

1 1/56 Lecture: Matrix Completion Acknowledgement: these slides are based on Prof. Jure Leskovec's and Prof. Emmanuel Candès's lecture notes

2 Recommendation systems References: 2/56

3 The Netflix Prize 3/56 Training data: 100 million ratings, 480,000 users, 17,770 movies; 6 years of data. Test data: last few ratings of each user (2.8 million). Evaluation criterion: root mean squared error (RMSE), $\sqrt{\tfrac{1}{N}\sum_{(x,i)}(\hat r_{xi} - r_{xi})^2}$ over the N test ratings, where $\hat r_{xi}$ and $r_{xi}$ are the predicted and true rating of user x on movie i. Netflix's own system, Cinematch, provided the baseline RMSE; the competition offered a $1 million prize for a 10% improvement on Cinematch.

4 Netflix: evaluation 4/56

5 Collaborative Filtering: weighted sum model 5/56
$\hat r_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})$
Baseline estimate for $r_{xi}$: $b_{xi} = \mu + b_x + b_i$, where $\mu$ is the overall mean rating, $b_x$ is the rating deviation of user x, i.e. (avg. rating of user x) - $\mu$, and $b_i$ = (avg. rating of movie i) - $\mu$. We sum over all movies j that are similar to i and were rated by x. $w_{ij}$ is the interpolation weight (some real number); we allow $\sum_{j \in N(i;x)} w_{ij} \neq 1$. $w_{ij}$ models the interaction between pairs of movies (it does not depend on user x). $N(i;x)$: set of movies rated by user x that are similar to movie i.

6 Finding weights w_ij? 6/56 Find $w_{ij}$ such that they work well on known (user, item) ratings:
$\min_{w}\ F(w) := \sum_x \Big( b_{xi} + \sum_{j \in N(i;x)} w_{ij}(r_{xj} - b_{xj}) - r_{xi} \Big)^2$
This is an unconstrained optimization of a quadratic function:
$\frac{\partial F(w)}{\partial w_{ij}} = 2 \sum_x \Big( b_{xi} + \sum_{k \in N(i;x)} w_{ik}(r_{xk} - b_{xk}) - r_{xi} \Big)(r_{xj} - b_{xj}) = 0 \quad \text{for } j \in N(i;x) \text{ and all } i, x.$
Setting the gradient to zero is equivalent to solving a system of linear equations. Alternatively, use the steepest gradient descent method, $w^{k+1} = w^k - \tau \nabla F(w^k)$, or the conjugate gradient method.
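
As a minimal illustration of the gradient-descent option, the sketch below (Python/numpy) minimizes a least-squares objective of exactly this quadratic form on hypothetical toy data; the bookkeeping that maps users and movies into the matrix A and vector y is an assumption for the example, not the real recommender pipeline.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))   # row per user x: residuals r_xj - b_xj for the neighbours j of movie i
y = rng.standard_normal(200)         # per user x: residual r_xi - b_xi

def F(w):
    return np.sum((A @ w - y) ** 2)

w = np.zeros(10)
tau = 0.5 / np.linalg.norm(A, 2) ** 2        # = 1/L, where L = 2*||A||_2^2 is the gradient's Lipschitz constant
for _ in range(500):
    grad = 2 * A.T @ (A @ w - y)             # gradient of the quadratic objective
    w -= tau * grad

# sanity check against the linear-system (normal equations) solution
w_exact = np.linalg.lstsq(A, y, rcond=None)[0]
print(F(w), F(w_exact))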

7 Latent factor models 7/56 SVD on Netflix data: $R \approx Q P^T$. For now let's assume we can approximate the rating matrix R as a product of thin matrices Q and $P^T$. R has missing entries, but let's ignore that for now! Basically, we want the reconstruction error to be small on the known ratings, and we don't care about the values of the missing ones.

8 Ratings as products of factors 8/56 How to estimate the missing rating of user x for item i? $\hat r_{xi} = q_i p_x^T = \sum_f q_{if}\, p_{xf}$, where $q_i$ is row i of Q and $p_x$ is column x of $P^T$ (equivalently, row x of P).

9 SVD - Properties 9/56 Theorem (SVD): If A is a real m-by-n matrix, then there exist $U = [u_1, \ldots, u_m] \in R^{m \times m}$ and $V = [v_1, \ldots, v_n] \in R^{n \times n}$ such that $U^T U = I$, $V^T V = I$ and $U^T A V = \mathrm{diag}(\sigma_1, \ldots, \sigma_p) \in R^{m \times n}$, $p = \min(m, n)$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$.
Eckart & Young, 1936: Let the SVD of $A \in R^{m \times n}$ be given as in the theorem above. If $k < r = \mathrm{rank}(A)$ and $A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$, then $\min_{\mathrm{rank}(B)=k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}$.

10 What is SVD? 10/56 A: input data matrix; U: left singular vectors; V: right singular vectors; Σ: singular values. SVD gives the minimum reconstruction error (SSE):
$\min_{U,V,\Sigma}\ \sum_{ij}\big(A_{ij} - [U\Sigma V^T]_{ij}\big)^2$
In our case, SVD on Netflix data: $R \approx Q P^T$, i.e., $A = R$, $Q = U$, $P^T = \Sigma V^T$. But we are not done yet: R has missing entries!
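
A minimal numpy sketch of this rank-k, SSE-optimal approximation (random data; the sizes and k are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation (Eckart-Young)

sse = np.sum((A - A_k) ** 2)                  # equals the sum of the squared discarded singular values
print(sse, np.sum(s[k:] ** 2))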

11 Latent factor models 11/56 Minimize SSE on training data! Use specialized methods to find P, Q such that $\hat r_{xi} = q_i p_x^T$:
$\min_{P,Q}\ \sum_{(i,x) \in \text{training}} (r_{xi} - q_i p_x^T)^2$
We don't require the columns of P, Q to be orthogonal or of unit length; P, Q map users/movies to a latent space. Add regularization:
$\min_{P,Q}\ \sum_{(i,x) \in \text{training}} (r_{xi} - q_i p_x^T)^2 + \lambda \Big[ \sum_x \|p_x\|_2^2 + \sum_i \|q_i\|_2^2 \Big]$
$\lambda$ is called the regularization parameter.

12 Gradient descent method 12/56
$\min_{P,Q}\ F(P, Q) := \sum_{(i,x) \in \text{training}} (r_{xi} - q_i p_x^T)^2 + \lambda \Big[ \sum_x \|p_x\|_2^2 + \sum_i \|q_i\|_2^2 \Big]$
Gradient descent: initialize P and Q (e.g. using SVD, pretending the missing ratings are 0), then iterate
$P^{k+1} \leftarrow P^k - \tau \nabla_P F(P^k, Q^k)$, $Q^{k+1} \leftarrow Q^k - \tau \nabla_Q F(P^k, Q^k)$,
where $(\nabla_Q F)_{if} = \sum_x -2(r_{xi} - q_i p_x^T)\,p_{xf} + 2\lambda q_{if}$ (the sum runs over the users x with $(i,x)$ in the training set). Here $q_{if}$ is entry f of row $q_i$ of matrix Q. Computing full gradients is slow when the dimension is huge.

13 Stochastic gradient descent method 13/56 Observation: let $q_{if}$ be entry f of row $q_i$ of matrix Q. Then
$(\nabla_Q F)_{if} = \sum_{x,i} \big( -2(r_{xi} - q_i p_x^T)\,p_{xf} + 2\lambda q_{if} \big) = \sum_{x,i} \big(\nabla_Q F(r_{xi})\big)_{if}$
$(\nabla_P F)_{xf} = \sum_{x,i} \big( -2(r_{xi} - q_i p_x^T)\,q_{if} + 2\lambda p_{xf} \big) = \sum_{x,i} \big(\nabla_P F(r_{xi})\big)_{xf}$
Stochastic gradient descent: instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step:
$P \leftarrow P - \tau \nabla_P F(r_{xi})$,  $Q \leftarrow Q - \tau \nabla_Q F(r_{xi})$.
This needs more steps, but each step is computed much faster.
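
A minimal numpy sketch of this SGD update on synthetic ratings (sizes, sampling rate, step size and λ are arbitrary choices; biases are omitted):

import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, f = 50, 40, 5
P_true = rng.standard_normal((n_users, f))
Q_true = rng.standard_normal((n_movies, f))
R = Q_true @ P_true.T                          # r_xi stored as R[i, x]
omega = [(i, x) for i in range(n_movies) for x in range(n_users) if rng.random() < 0.3]

P = 0.1 * rng.standard_normal((n_users, f))    # rows p_x
Q = 0.1 * rng.standard_normal((n_movies, f))   # rows q_i
tau, lam = 0.01, 0.1

for epoch in range(30):
    rng.shuffle(omega)
    for i, x in omega:
        err = R[i, x] - Q[i] @ P[x]            # r_xi - q_i p_x^T
        grad_q = -2 * err * P[x] + 2 * lam * Q[i]
        grad_p = -2 * err * Q[i] + 2 * lam * P[x]
        Q[i] -= tau * grad_q                   # Q <- Q - tau * grad_Q F(r_xi)
        P[x] -= tau * grad_p                   # P <- P - tau * grad_P F(r_xi)

print(np.mean([(R[i, x] - Q[i] @ P[x]) ** 2 for i, x in omega]))   # training SSE per rating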

14 Latent factor models with biases 14/56 Predicted model: $\hat r_{xi} = \mu + b_x + b_i + q_i p_x^T$, where $\mu$ is the overall mean rating, $b_x$ the bias for user x, and $b_i$ the bias for movie i. New model:
$\min_{P,Q,b_x,b_i}\ \sum_{(i,x) \in \text{training}} \big(r_{xi} - (\mu + b_x + b_i + q_i p_x^T)\big)^2 + \lambda \Big[ \sum_x \|p_x\|_2^2 + \sum_i \|q_i\|_2^2 + \sum_x b_x^2 + \sum_i b_i^2 \Big]$
Both the biases $b_x, b_i$ and the interactions $q_i, p_x$ are treated as parameters (we estimate them). One can also add time dependence to the biases: $\hat r_{xi} = \mu + b_x(t) + b_i(t) + q_i p_x^T$.

15 Netflix: performance 15/56

16 Netflix: performance 16/56

17 General matrix completion 17/56

18 18/56 Matrix completion Matrix $M \in R^{n_1 \times n_2}$; we observe a subset of its entries. Can we guess the missing entries?

19 19/56 Which algorithm? Hope: only one low-rank matrix is consistent with the sampled entries. Recovery by minimum complexity:
minimize $\mathrm{rank}(X)$ subject to $X_{ij} = M_{ij}$, $(i,j) \in \Omega$
Problem: this is NP-hard (doubly exponential in n?).

20 Nuclear-norm minimization 20/56 Singular value decomposition:
$X = \sum_{k=1}^r \sigma_k u_k v_k^T$
$\{\sigma_k\}$: singular values, $\{u_k\}, \{v_k\}$: singular vectors. Nuclear norm ($\sigma_i(X)$ is the i-th largest singular value of X):
$\|X\|_* = \sum_{i=1}^n \sigma_i(X)$
Heuristic: minimize $\|X\|_*$ subject to $X_{ij} = M_{ij}$, $(i,j) \in \Omega$. This is a convex relaxation of the rank minimization program.
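
A small sketch of this heuristic with the CVXPY modeling package (the synthetic low-rank M, the sampling rate and the sizes are arbitrary choices for illustration):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 30, 30, 3
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))   # rank-r ground truth
P = (rng.random((n1, n2)) < 0.5).astype(float)                    # indicator of the sample set Omega

X = cp.Variable((n1, n2))
prob = cp.Problem(cp.Minimize(cp.normNuc(X)),                     # minimize ||X||_*
                  [cp.multiply(P, X) == P * M])                   # X_ij = M_ij on Omega
prob.solve()

print(np.linalg.norm(X.value - M) / np.linalg.norm(M))            # relative recovery error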

21 Connections with compressed sensing 21/56 General setup:
Rank minimization: minimize $\mathrm{rank}(X)$ subject to $\mathcal{A}(X) = b$. Convex relaxation: minimize $\|X\|_*$ subject to $\mathcal{A}(X) = b$.
Suppose $X = \mathrm{diag}(x)$, $x \in R^n$. Then $\mathrm{rank}(X) = \sum_i 1(x_i \neq 0) = \|x\|_{\ell_0}$ and $\|X\|_* = \sum_i |x_i| = \|x\|_{\ell_1}$, so the two problems become:
Rank minimization: minimize $\|x\|_{\ell_0}$ subject to $Ax = b$. Convex relaxation: minimize $\|x\|_{\ell_1}$ subject to $Ax = b$.
This is compressed sensing!

22 Correspondence 22/56 (Table from Recht, Parrilo, Fazel (08); vector case vs. matrix case)
parsimony concept: cardinality | rank
Hilbert space norm: Euclidean | Frobenius
sparsity-inducing norm: $\ell_1$ | nuclear
dual norm: $\ell_\infty$ | operator norm
additivity: disjoint support | orthogonal row and column spaces
convex optimization: linear programming | semidefinite programming

23 Semidefinite programming (SDP) 23/56 A special class of convex optimization problems; a relatively natural extension of linear programming (LP); efficient numerical solvers (interior point methods).
LP ($x \in R^n$): minimize $\langle c, x \rangle$ subject to $Ax = b$, $x \ge 0$.
SDP ($X \in R^{n \times n}$): minimize $\langle C, X \rangle$ subject to $\langle A_k, X \rangle = b_k$, $X \succeq 0$.
Standard inner product: $\langle C, X \rangle = \mathrm{trace}(C^T X)$.

24 SOCP/SDP Duality 24/56
(P) $\min\ c^T x$ s.t. $Ax = b$, $x \succeq_Q 0$;   (D) $\max\ b^T y$ s.t. $A^T y + s = c$, $s \succeq_Q 0$.
(P) $\min\ \langle C, X \rangle$ s.t. $\langle A_i, X \rangle = b_i$, $i = 1, \ldots, m$, $X \succeq 0$;   (D) $\max\ b^T y$ s.t. $\sum_i y_i A_i + S = C$, $S \succeq 0$.
Strong duality: if $p^* > -\infty$ and (P) is strictly feasible, then (D) is feasible and $p^* = d^*$; if $d^* < +\infty$ and (D) is strictly feasible, then (P) is feasible and $p^* = d^*$; if (P) and (D) both have strictly feasible solutions, then both have optimal solutions.

25 Semidefinite program 25/56
(D) $\min\ b^T y$ s.t. $y_1 A_1 + \cdots + y_m A_m \succeq C$, with $A_i, C \in S^k$; the multiplier is a matrix $X \in S^k$.
Lagrangian: $L(y, X) = b^T y - \langle X,\ y_1 A_1 + \cdots + y_m A_m - C \rangle$.
Dual function: $g(X) = \inf_y L(y, X) = \langle C, X \rangle$ if $\langle A_i, X \rangle = b_i$ for all i, and $-\infty$ otherwise.
The dual of (D) is therefore $\min\ \langle C, X \rangle$ s.t. $\langle A_i, X \rangle = b_i$, $X \succeq 0$, and $p^* = d^*$ if the primal SDP is strictly feasible.

26 SDP Relaxation of Maxcut 26/56
$\min\ x^T W x$ s.t. $x_i^2 = 1$
is a nonconvex problem; its feasible set contains $2^n$ discrete points. Interpretation: partition $\{1, \ldots, n\}$ into two sets; $W_{ij}$ is the cost of assigning i, j to the same set and $-W_{ij}$ the cost of assigning them to different sets. Its Lagrangian dual is
$\max\ -\mathbf{1}^T \nu$ s.t. $W + \mathrm{diag}(\nu) \succeq 0$,
which is in turn dual to the SDP relaxation
$\min\ \mathrm{Tr}(WX)$ s.t. $X_{ii} = 1$, $X \succeq 0$.
Dual function:
$g(\nu) = \inf_x \big( x^T W x + \sum_i \nu_i (x_i^2 - 1) \big) = \inf_x\ x^T (W + \mathrm{diag}(\nu)) x - \mathbf{1}^T \nu = -\mathbf{1}^T \nu$ if $W + \mathrm{diag}(\nu) \succeq 0$, and $-\infty$ otherwise.

27 Positive semidefinite unknown: SDP formulation 27/56 Suppose the unknown matrix X is positive semidefinite. Then
$\min\ \sum_{i=1}^n \sigma_i(X)$ s.t. $X_{ij} = M_{ij}$, $(i,j) \in \Omega$, $X \succeq 0$   is equivalent to   $\min\ \mathrm{trace}(X)$ s.t. $X_{ij} = M_{ij}$, $(i,j) \in \Omega$, $X \succeq 0$.
Trace heuristic: Mesbahi & Papavassilopoulos (1997), Beck & D'Andrea (1998).

28 General SDP formulation 28/56 Let $X \in R^{m \times n}$. For a given norm $\|\cdot\|$, the dual norm $\|\cdot\|_d$ is defined as
$\|X\|_d := \sup\{\langle X, Y \rangle : Y \in R^{m \times n},\ \|Y\| \le 1\}$.
The nuclear norm and the spectral norm are dual to each other: $\|X\| := \sigma_1(X)$, $\|X\|_* = \sum_i \sigma_i(X)$. Hence
(P) $\max_Y\ \langle X, Y \rangle$ s.t. $\|Y\|_2 \le 1$
is equivalent (up to a factor 2) to
$\max_Y\ 2\langle X, Y \rangle$ s.t. $\begin{bmatrix} I_m & Y \\ Y^T & I_n \end{bmatrix} \succeq 0$,
whose SDP dual is
$\min_Z\ \Big\langle Z, \begin{bmatrix} 0 & X \\ X^T & 0 \end{bmatrix} \Big\rangle$ s.t. $Z_1 = I_m$, $Z_2 = I_n$, $Z = \begin{bmatrix} Z_1 & Z_3 \\ Z_3^T & Z_2 \end{bmatrix} \succeq 0$.

29 General SDP formulation 29/56 The Lagrangian dual problem is
$\max_{W_1, W_2}\ \min_{Z \succeq 0}\ \Big\langle Z, \begin{bmatrix} 0 & X \\ X^T & 0 \end{bmatrix} \Big\rangle + \langle Z_1 - I_m, W_1 \rangle + \langle Z_2 - I_n, W_2 \rangle.$
By strong duality, after a scaling of 1/2 and a change of variables X to -X, one obtains
(D) minimize $\tfrac{1}{2}\big(\mathrm{trace}(W_1) + \mathrm{trace}(W_2)\big)$ subject to $\begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} \succeq 0$.
Optimization variables: $W_1 \in R^{n_1 \times n_1}$, $W_2 \in R^{n_2 \times n_2}$. See Proposition 2.1 in "Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization", Benjamin Recht, Maryam Fazel, Pablo A. Parrilo.

30 General SDP formulation 30/56 Nuclear norm minimization:
$\min\ \|X\|_*$ s.t. $\mathcal{A}(X) = b$;   dual: $\max\ b^T y$ s.t. $\|\mathcal{A}^*(y)\| \le 1$.
SDP reformulation:
$\min\ \tfrac{1}{2}(\mathrm{trace}(W_1) + \mathrm{trace}(W_2))$ s.t. $\mathcal{A}(X) = b$, $\begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} \succeq 0$;
dual: $\max\ b^T y$ s.t. $\begin{bmatrix} I & \mathcal{A}^*(y) \\ (\mathcal{A}^*(y))^T & I \end{bmatrix} \succeq 0$.

31 Matrix recovery 31/56 Consider
$M = \sum_{k=1}^2 \sigma_k u_k u_k^T$,  $u_1 = (e_1 + e_2)/\sqrt{2}$,  $u_2 = (e_1 - e_2)/\sqrt{2}$.
This M is zero outside its top-left 2-by-2 block, so almost every sampled entry is zero and carries no information: it cannot be recovered from a small set of entries.

32 32/56 Rank-1 matrix $M = xy^T$, $M_{ij} = x_i y_j$. If a single row (or column) is not sampled, recovery is not possible. What happens for almost all sampling sets? Let Ω be a subset of m entries selected uniformly at random.

33 33/56 References
Jian-Feng Cai, Emmanuel Candès, Zuowei Shen, A singular value thresholding algorithm for matrix completion.
Shiqian Ma, Donald Goldfarb, Lifeng Chen, Fixed point and Bregman iterative methods for matrix rank minimization.
Zaiwen Wen, Wotao Yin, Yin Zhang, Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm.
Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont, Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data.
Zhaosong Lu, Smooth optimization approach for sparse covariance selection.

34 34/56 Matrix Rank Minimization Given $X \in R^{m \times n}$, $\mathcal{A}: R^{m \times n} \to R^p$, $b \in R^p$, we consider the matrix rank minimization problem
$\min\ \mathrm{rank}(X)$ s.t. $\mathcal{A}(X) = b$,
the matrix completion problem
$\min\ \mathrm{rank}(X)$ s.t. $X_{ij} = M_{ij}$, $(i,j) \in \Omega$,
and the nuclear norm minimization problem
$\min\ \|X\|_*$ s.t. $\mathcal{A}(X) = b$,
where $\|X\|_* = \sum_i \sigma_i$ and $\sigma_i$ is the i-th singular value of X.

35 Quadratic penalty framework 35/56 Unconstrained nuclear norm minimization:
$\min\ F(X) := \mu \|X\|_* + \tfrac{1}{2}\|\mathcal{A}(X) - b\|_2^2$
Optimality condition: $0 \in \mu\, \partial\|X^*\|_* + \mathcal{A}^*(\mathcal{A}(X^*) - b)$, where $\partial\|X\|_* = \{UV^T + W : U^T W = 0,\ WV = 0,\ \|W\|_2 \le 1\}$.
Linearization approach (g is the gradient of $\tfrac{1}{2}\|\mathcal{A}(X) - b\|_2^2$):
$X^{k+1} := \arg\min_X\ \mu\|X\|_* + \langle g^k, X - X^k \rangle + \tfrac{1}{2\tau}\|X - X^k\|_F^2 = \arg\min_X\ \mu\|X\|_* + \tfrac{1}{2\tau}\|X - (X^k - \tau g^k)\|_F^2$

36 Matrix Shrinkage Operator 36/56 For a matrix $Y \in R^{m \times n}$ with SVD $Y = U \mathrm{Diag}(\sigma) V^T$, consider
$\min_{X \in R^{m \times n}}\ \nu \|X\|_* + \tfrac{1}{2}\|X - Y\|_F^2$
The optimal solution is
$X^* := S_\nu(Y) = U \mathrm{Diag}(s_\nu(\sigma)) V^T$,
where the thresholding operator $s_\nu$ acts componentwise: $s_\nu(x) := \bar x$ with $\bar x_i = x_i - \nu$ if $x_i - \nu > 0$, and $0$ otherwise.
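
A direct numpy sketch of $S_\nu$ (the names shrink and nu are mine):

import numpy as np

def shrink(Y, nu):
    # singular value shrinkage S_nu(Y): soft-threshold the singular values of Y by nu
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - nu, 0.0)) @ Vt

Y = np.random.default_rng(0).standard_normal((6, 4))
print(np.linalg.svd(shrink(Y, 0.5), compute_uv=False))   # each singular value reduced by 0.5 (floored at 0)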

37 Fixed Point Method (Proximal gradient method) 37/56 Fixed point iterative scheme:
$Y^k = X^k - \tau \mathcal{A}^*(\mathcal{A}(X^k) - b)$,  $X^{k+1} = S_{\tau\mu}(Y^k)$.
Lemma: the matrix shrinkage operator is non-expansive, i.e., $\|S_\nu(Y_1) - S_\nu(Y_2)\|_F \le \|Y_1 - Y_2\|_F$.
Theorem: the sequence $\{X^k\}$ generated by the fixed point iterations converges to some $X^* \in \mathcal{X}^*$, where $\mathcal{X}^*$ is the optimal solution set.
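
A minimal numpy sketch of this scheme for matrix completion, where $\mathcal{A}$ is the sampling operator $P_\Omega$ (so $\mathcal{A}^*(\mathcal{A}(X) - b) = P_\Omega(X) - b$); the sizes, μ, τ and iteration count are arbitrary choices, and shrink is the singular value shrinkage from the previous sketch, repeated so this snippet stands alone:

import numpy as np

def shrink(Y, nu):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - nu, 0.0)) @ Vt

rng = np.random.default_rng(0)
n, r = 60, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # low-rank ground truth
mask = rng.random((n, n)) < 0.4                                  # observed entries Omega
b = mask * M                                                     # P_Omega(M)

mu, tau = 0.1, 1.0                                               # tau <= 1 suffices since ||P_Omega|| = 1
X = np.zeros((n, n))
for _ in range(300):
    Y = X - tau * (mask * X - b)     # gradient step on 0.5 * ||P_Omega(X) - b||_F^2
    X = shrink(Y, tau * mu)          # proximal (shrinkage) step

print(np.linalg.norm(X - M) / np.linalg.norm(M))                 # relative error on all entries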

38 38/56 SVT Linearized Bregman method:
$V^{k+1} := V^k - \tau \mathcal{A}^*(\mathcal{A}(X^k) - b)$,  $X^{k+1} := S_{\tau\mu}(V^{k+1})$.
It converges to the solution of $\min\ \tau\|X\|_* + \tfrac{1}{2}\|X\|_F^2$ s.t. $\mathcal{A}(X) = b$.

39 Accelerated proximal gradient (APG) method 39/56 Complexity of the fixed point method:
$F(X^k) - F(X^*) \le \dfrac{L_f \|X^0 - X^*\|^2}{2k}$
APG algorithm ($t^1 = t^0 = 1$):
$Y^k = X^k + \dfrac{t^{k-1} - 1}{t^k}(X^k - X^{k-1})$,
$G^k = Y^k - (\tau^k)^{-1} \mathcal{A}^*(\mathcal{A}(Y^k) - b)$,
$X^{k+1} = S_{\mu/\tau^k}(G^k)$,  $t^{k+1} = \dfrac{1 + \sqrt{1 + 4(t^k)^2}}{2}$.
Complexity: $F(X^k) - F(X^*) \le \dfrac{2 L_f \|X^0 - X^*\|^2}{(k+1)^2}$

40 Low-rank factorization model 40/56 Find a low-rank matrix W so that $\|P_\Omega(W - M)\|_F^2$, i.e. the distance between W and $\{Z \in R^{m \times n} : Z_{ij} = M_{ij},\ (i,j) \in \Omega\}$, is minimized. Any matrix $W \in R^{m \times n}$ with $\mathrm{rank}(W) \le K$ can be expressed as $W = XY$, where $X \in R^{m \times K}$ and $Y \in R^{K \times n}$. New model:
$\min_{X,Y,Z}\ \tfrac{1}{2}\|XY - Z\|_F^2$ s.t. $Z_{ij} = M_{ij}$, $(i,j) \in \Omega$
Advantage: SVD is no longer needed! Related work: the solver OptSpace, based on optimization on manifolds.

41 41/56 Nonlinear Gauss-Seidel scheme First variant of alternating minimization:
$X_+ \leftarrow ZY^T(YY^T)^\dagger$,  $Y_+ \leftarrow (X_+^T X_+)^\dagger X_+^T Z$,  $Z_+ \leftarrow X_+ Y_+ + P_\Omega(M - X_+ Y_+)$.
Let $P_A$ be the orthogonal projection onto the range space $\mathcal{R}(A)$. Then
$X_+ Y_+ = \big(X_+(X_+^T X_+)^\dagger X_+^T\big) Z = P_{X_+} Z.$
One can verify that $\mathcal{R}(X_+) = \mathcal{R}(ZY^T)$, so
$X_+ Y_+ = P_{ZY^T} Z = ZY^T (Y Z^T Z Y^T)^\dagger (Y Z^T) Z.$
Idea: modify $X_+$ or $Y_+$ to obtain the same product $X_+ Y_+$.

42 41/56 Nonlinear Gauss-Seidel scheme Second variant of alternating minimization:
$X_+ \leftarrow ZY^T$,  $Y_+ \leftarrow (X_+^T X_+)^\dagger X_+^T Z$,  $Z_+ \leftarrow X_+ Y_+ + P_\Omega(M - X_+ Y_+)$.
Third variant of alternating minimization:
$V = \mathrm{orth}(ZY^T)$,  $X_+ \leftarrow V$,  $Y_+ \leftarrow V^T Z$,  $Z_+ \leftarrow X_+ Y_+ + P_\Omega(M - X_+ Y_+)$.
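
A minimal numpy sketch of the third variant, with orth realized by a reduced QR factorization (the sizes, rank K, sampling rate and iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
m, n, K = 80, 60, 4
M = rng.standard_normal((m, K)) @ rng.standard_normal((K, n))   # rank-K ground truth
mask = rng.random((m, n)) < 0.4                                  # observed entries Omega

Z = mask * M                                                     # start from P_Omega(M)
Y = rng.standard_normal((K, n))
for _ in range(200):
    V, _ = np.linalg.qr(Z @ Y.T)     # V = orth(Z Y^T), an m-by-K orthonormal basis
    X = V                            # X_+ <- V
    Y = V.T @ Z                      # Y_+ <- V^T Z
    W = X @ Y
    Z = W + mask * (M - W)           # Z_+ <- X_+ Y_+ + P_Omega(M - X_+ Y_+)

print(np.linalg.norm(X @ Y - M) / np.linalg.norm(M))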

43 42/56 Sparse and low-rank matrix separation Given a matrix M, we want to find a low-rank matrix W and a sparse matrix E so that $W + E = M$. Convex approximation (robust PCA):
$\min_{W,E}\ \|W\|_* + \mu\|E\|_1$ s.t. $W + E = M$

44 Video separation 43/56 Partition the video into moving and static parts

45 ADMM 44/56 Convex approximation: $\min_{W,E}\ \|W\|_* + \mu\|E\|_1$ s.t. $W + E = M$. Augmented Lagrangian function:
$L(W, E, \Lambda) := \|W\|_* + \mu\|E\|_1 + \langle \Lambda, W + E - M \rangle + \tfrac{1}{2\beta}\|W + E - M\|_F^2$
Alternating direction augmented Lagrangian method:
$W^{j+1} := \arg\min_W\ L(W, E^j, \Lambda^j)$,
$E^{j+1} := \arg\min_E\ L(W^{j+1}, E, \Lambda^j)$,
$\Lambda^{j+1} := \Lambda^j + \tfrac{\gamma}{\beta}(W^{j+1} + E^{j+1} - M)$.

46 W-subproblem 45/56
$W^{j+1} := \arg\min_W\ L(W, E^j, \Lambda^j) = \arg\min_W\ \|W\|_* + \tfrac{1}{2\beta}\|W - (M - E^j - \beta\Lambda^j)\|_F^2 = S_\beta(M - E^j - \beta\Lambda^j) := U \mathrm{Diag}(s_\beta(\sigma)) V^T$,
where the SVD is $M - E^j - \beta\Lambda^j = U \mathrm{Diag}(\sigma) V^T$ and the thresholding operator $s_\nu$ acts componentwise: $s_\nu(x) := \bar x$ with $\bar x_i = x_i - \nu$ if $x_i - \nu > 0$, and $0$ otherwise.

47 E-subproblem 46/56
$E^{j+1} := \arg\min_E\ L(W^{j+1}, E, \Lambda^j) = \arg\min_E\ \|E\|_1 + \tfrac{1}{2\beta\mu}\|E - (M - W^{j+1} - \beta\Lambda^j)\|_F^2 = s_{\beta\mu}(M - W^{j+1} - \beta\Lambda^j)$,
where the scalar soft-thresholding operator (applied componentwise) is
$s_\nu(y) := \arg\min_x\ \nu|x| + \tfrac{1}{2}(x - y)^2 = y - \nu\,\mathrm{sgn}(y)$ if $|y| > \nu$, and $0$ otherwise.
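
A compact numpy sketch of the full ADMM loop for this W/E splitting (synthetic data; the choices of β, μ, γ and the iteration count are mine and arbitrary):

import numpy as np

def svd_shrink(Y, nu):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - nu, 0.0)) @ Vt

def soft(Y, nu):
    return np.sign(Y) * np.maximum(np.abs(Y) - nu, 0.0)

rng = np.random.default_rng(0)
n, r = 60, 3
W_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))          # low-rank part
E_true = (rng.random((n, n)) < 0.05) * 10 * rng.standard_normal((n, n))     # sparse spikes
M = W_true + E_true

mu = 1.0 / np.sqrt(n)            # a common weight for the l1 term
beta, gamma = 0.1, 1.0
W = np.zeros((n, n)); E = np.zeros((n, n)); Lam = np.zeros((n, n))
for _ in range(200):
    W = svd_shrink(M - E - beta * Lam, beta)      # W-subproblem: singular value shrinkage
    E = soft(M - W - beta * Lam, beta * mu)       # E-subproblem: entrywise soft-thresholding
    Lam = Lam + (gamma / beta) * (W + E - M)      # multiplier update

print(np.linalg.norm(W - W_true) / np.linalg.norm(W_true))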

48 Low-rank factorization model for matrix separation 47/56 Consider the model
$\min_{Z,S}\ \|S\|_1$ s.t. $Z + S = D$, $\mathrm{rank}(Z) \le K$.
With the low-rank factorization $Z = UV$ (and S eliminated via $S = D - Z$) this becomes
$\min_{U,V,Z}\ \|Z - D\|_1$ s.t. $UV - Z = 0$.
If only the entries $D_{ij}$, $(i,j) \in \Omega$, are given and $P_\Omega(D)$ denotes the projection of D onto Ω, the new model is
$\min_{U,V,Z}\ \|P_\Omega(Z - D)\|_1$ s.t. $UV - Z = 0$.
Advantage: SVD is no longer needed!

49 ADMM 48/56 Consider $\min_{U,V,Z}\ \|P_\Omega(Z - D)\|_1$ s.t. $UV - Z = 0$. Introduce the augmented Lagrangian function
$L_\beta(U, V, Z, \Lambda) = \|P_\Omega(Z - D)\|_1 + \langle \Lambda, UV - Z \rangle + \tfrac{\beta}{2}\|UV - Z\|_F^2$
Alternating direction augmented Lagrangian framework (Bregman):
$U^{j+1} := \arg\min_{U \in R^{m \times k}}\ L_\beta(U, V^j, Z^j, \Lambda^j)$,
$V^{j+1} := \arg\min_{V \in R^{k \times n}}\ L_\beta(U^{j+1}, V, Z^j, \Lambda^j)$,
$Z^{j+1} := \arg\min_{Z \in R^{m \times n}}\ L_\beta(U^{j+1}, V^{j+1}, Z, \Lambda^j)$,
$\Lambda^{j+1} := \Lambda^j + \gamma\beta(U^{j+1} V^{j+1} - Z^{j+1})$.

50 ADMM subproblems 49/56 Let $B = Z - \Lambda/\beta$. Then $U_+ = BV^T(VV^T)^\dagger$ and $V_+ = (U_+^T U_+)^\dagger U_+^T B$. Since $U_+ V_+ = U_+(U_+^T U_+)^\dagger U_+^T B = P_{U_+} B$, one can instead use
$Q := \mathrm{orth}(BV^T)$,  $U_+ = Q$,  $V_+ = Q^T B$.
Variable Z:
$P_\Omega(Z_+) = P_\Omega\Big( S\big(U_+ V_+ - D + \tfrac{\Lambda}{\beta},\ \tfrac{1}{\beta}\big) + D \Big)$,
$P_{\Omega^c}(Z_+) = P_{\Omega^c}\Big( U_+ V_+ + \tfrac{\Lambda}{\beta} \Big)$,
where $S(\cdot, \nu)$ denotes componentwise soft-thresholding.

51 Nonnegative matrix factorization completion 50/56 Model problem:
$\min\ \|P_\Omega(XY - M)\|_F$ s.t. $X \ge 0$, $Y \ge 0$,
reformulated as
$\min\ \tfrac{1}{2}\|XY - Z\|_F^2$ s.t. $X = U$, $Y = V$, $U \ge 0$, $V \ge 0$, $P_\Omega(Z - M) = 0$.
Augmented Lagrangian function:
$L_A(X, Y, Z, U, V, \Lambda, \Pi) = \tfrac{1}{2}\|XY - Z\|_F^2 + \Lambda \cdot (X - U) + \Pi \cdot (Y - V) + \tfrac{\alpha}{2}\|X - U\|_F^2 + \tfrac{\beta}{2}\|Y - V\|_F^2$,
where $A \cdot B := \sum_{i,j} a_{ij} b_{ij}$.

52 51/56 ADMM
$X_{k+1} = (Z_k Y_k^T + \alpha U_k - \Lambda_k)(Y_k Y_k^T + \alpha I)^{-1}$,  (3a)
$Y_{k+1} = (X_{k+1}^T X_{k+1} + \beta I)^{-1}(X_{k+1}^T Z_k + \beta V_k - \Pi_k)$,  (3b)
$Z_{k+1} = X_{k+1} Y_{k+1} + P_\Omega(M - X_{k+1} Y_{k+1})$,  (3c)
$U_{k+1} = P_+(X_{k+1} + \Lambda_k / \alpha)$,  (3d)
$V_{k+1} = P_+(Y_{k+1} + \Pi_k / \beta)$,  (3e)
$\Lambda_{k+1} = \Lambda_k + \gamma\alpha(X_{k+1} - U_{k+1})$,  (3f)
$\Pi_{k+1} = \Pi_k + \gamma\beta(Y_{k+1} - V_{k+1})$,  (3g)
where $\gamma \in (0, 1.618)$ and $(P_+(A))_{ij} = \max\{A_{ij}, 0\}$.
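
A minimal numpy sketch of updates (3a)-(3g) on synthetic nonnegative data (K, α, β, γ, the sampling rate and the iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
m, n, K = 60, 50, 4
M = rng.random((m, K)) @ rng.random((K, n))      # nonnegative rank-K ground truth
mask = rng.random((m, n)) < 0.5                  # observed entries Omega

alpha = beta = 1.0
gamma = 1.6                                      # gamma in (0, 1.618)
X, Y = rng.random((m, K)), rng.random((K, n))
U, V = X.copy(), Y.copy()
Z = mask * M
Lam, Pi = np.zeros((m, K)), np.zeros((K, n))
I_K = np.eye(K)

for _ in range(300):
    X = (Z @ Y.T + alpha * U - Lam) @ np.linalg.inv(Y @ Y.T + alpha * I_K)   # (3a)
    Y = np.linalg.inv(X.T @ X + beta * I_K) @ (X.T @ Z + beta * V - Pi)      # (3b)
    W = X @ Y
    Z = W + mask * (M - W)                                                   # (3c)
    U = np.maximum(X + Lam / alpha, 0.0)                                     # (3d)
    V = np.maximum(Y + Pi / beta, 0.0)                                       # (3e)
    Lam = Lam + gamma * alpha * (X - U)                                      # (3f)
    Pi = Pi + gamma * beta * (Y - V)                                         # (3g)

print(np.linalg.norm(mask * (X @ Y - M)) / np.linalg.norm(mask * M))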

53 Sparse covariance selection (A. d'Aspremont) 52/56 We estimate a covariance matrix Σ from empirical data and infer independence relationships between variables. Given m + 1 observations $x_i \in R^n$ on n random variables, compute the sample covariance
$S := \tfrac{1}{m+1} \sum_{i=1}^{m+1} (x_i - \bar x)(x_i - \bar x)^T$.
Choose a symmetric subset I of matrix coefficients and denote by J its complement. Choose a covariance matrix $\hat\Sigma$ such that $\hat\Sigma_{ij} = S_{ij}$ for all $(i,j) \in I$ and $(\hat\Sigma^{-1})_{ij} = 0$ for all $(i,j) \in J$.
Benefits: maximum entropy, maximum likelihood, existence and uniqueness. Applications: gene expression data, speech recognition and finance.

54 Maximum likelihood estimation 53/56 Consider the penalized estimation problem
$\max_{X \in S^n}\ \log\det X - \mathrm{Tr}(SX) - \rho\,\|X\|_0$
and its convex relaxation
$\max_{X \in S^n}\ \log\det X - \mathrm{Tr}(SX) - \rho\,\|X\|_1$
(here $\|X\|_1$ is the entrywise $\ell_1$ norm), whose dual problem is
$\max\ \log\det W$ s.t. $\|W - S\|_\infty \le \rho$.
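
A small CVXPY sketch of the convex relaxation (random data; n, the number of samples and ρ are arbitrary choices; the $\ell_1$ penalty is applied entrywise):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n = 10
data = rng.standard_normal((200, n))
S = np.cov(data, rowvar=False)            # empirical covariance
rho = 0.1

X = cp.Variable((n, n), symmetric=True)   # estimate of the inverse covariance
obj = cp.Maximize(cp.log_det(X) - cp.trace(S @ X) - rho * cp.sum(cp.abs(X)))
cp.Problem(obj).solve()

print(np.sum(np.abs(X.value) > 1e-4))     # number of (numerically) nonzero entries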

55 APG 54/56 Zhaosong Lu (smooth optimization approach for sparse covariance selection) considers
$\max\ \log\det X - \mathrm{Tr}(SX) - \rho\|X\|_1$ s.t. $X \in \mathcal{X} := \{X \in S^n : \beta I \succeq X \succeq \alpha I\}$,
which is equivalent to (with $\mathcal{U} := \{U \in S^n : |U_{ij}| \le 1\ \forall i,j\}$)
$\max_{X \in \mathcal{X}}\ \min_{U \in \mathcal{U}}\ \log\det X - \langle S + \rho U, X \rangle$.
Let $f(U) := \max_{X \in \mathcal{X}}\ \log\det X - \langle S + \rho U, X \rangle$. Since $\log\det X$ is strongly concave on $\mathcal{X}$, $f(U)$ is continuously differentiable and $\nabla f(U)$ is Lipschitz continuous with $L = \rho\beta^2$. Therefore, APG can be applied to the dual problem $\min_{U \in \mathcal{U}} f(U)$.

56 Extension 55/56 Consider
$\max_{x \in X}\ g(x) := \min_{u \in U} \phi(x, u)$.
Assume $\phi(x, u)$ is a continuous function which is strictly concave in $x \in X$ for every fixed $u \in U$, and convex and differentiable in $u \in U$ for every fixed $x \in X$. Then $f(u) := \max_{x \in X} \phi(x, u)$ is differentiable, $\nabla f(u)$ is Lipschitz continuous, and the primal problem and the dual problem $\min_{u \in U} f(u)$ are both solvable and have the same optimal value. Nesterov's smooth minimization approach can then be applied to the dual.

57 56/56 Nesterov's smoothing technique Consider $\max_{x \in X}\ \min_{u \in U} \phi(x, u)$. Question: what if the assumptions above do not hold? Add a strictly convex function $\mu d(u)$ to the objective and define the smoothed function
$g_\mu(x) := \min_{u \in U}\ \phi(x, u) + \mu d(u)$.
Then $g_\mu(x)$ is differentiable, and Nesterov's smooth minimization can be applied. Complexity of finding an ε-suboptimal point: $O(1/\epsilon)$ iterations. Other smoothing techniques?
