ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
Low-rank matrix recovery via convex relaxations
Yuejie Chi
Department of Electrical and Computer Engineering
Spring 2018
1/67
Outline
- Low-rank matrix completion and recovery
- Nuclear norm minimization (this lecture)
  - RIP and low-rank matrix recovery
  - Matrix completion
  - Algorithms for nuclear norm minimization
- Non-convex methods (next lecture)
  - Spectral methods
  - (Projected) gradient descent
2/67
Low-rank matrix completion and recovery: motivation 3/67
Motivation 1: recommendation systems

[Figure: a highly incomplete user-by-movie rating matrix, most entries unknown]

Netflix challenge: Netflix provides highly incomplete ratings from about 0.5 million users for 17,770 movies

How to predict unseen user ratings for movies?
4/67
In general, we cannot infer missing ratings

[Figure: a rating matrix in which most entries are unknown]

Underdetermined system (more unknowns than observations)
5/67
... unless rating matrix has other structure

[Figure: rating matrix approximated by a product of a few user factors and movie factors]

A few factors explain most of the data → low-rank approximation

How to exploit (approx.) low-rank structure in prediction?
6/67
Motivation 2: sensor localization

- $n$ sensors / points $x_j \in \mathbb{R}^3$, $j = 1, \ldots, n$
- Observe partial information about pairwise squared distances
  $D_{i,j} = \|x_i - x_j\|_2^2 = \|x_i\|_2^2 + \|x_j\|_2^2 - 2 x_i^\top x_j$
- Want to infer distance between every pair of nodes
7/67
Motivation 2: sensor localization

Introduce
$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n \times 3}$

Then the distance matrix $D = [D_{i,j}]_{1 \le i,j \le n}$ can be written as
$D = d^2 e^\top + e (d^2)^\top - \underbrace{2 X X^\top}_{\text{low rank}}$
where $d^2 := [\|x_1\|_2^2, \ldots, \|x_n\|_2^2]^\top$ and $e := [1, \ldots, 1]^\top$

$\mathrm{rank}(D) \le 5 \ll n$ → low-rank matrix completion
8/67
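A quick numerical check of this low-rank structure (a NumPy sketch; the variable names are ours, not from the lecture): the two outer products contribute rank 1 each and $XX^\top$ contributes rank 3, so the full distance matrix has rank at most 5.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    X = rng.standard_normal((n, 3))      # n points in R^3
    d2 = np.sum(X**2, axis=1)            # squared norms ||x_i||^2
    e = np.ones(n)

    # D = d^2 e^T + e (d^2)^T - 2 X X^T  (squared pairwise distances)
    D = np.outer(d2, e) + np.outer(e, d2) - 2 * X @ X.T

    print(np.linalg.matrix_rank(D))      # 5, regardless of n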
Motivation 3: structure from motion

Structure from motion: reconstruct 3D scene geometry (structure) and camera poses (motion) from multiple images taken at unknown camera viewpoints

Given multiple images and a few correspondences between image features, how to estimate locations of 3D points?
9/67
Motivation 3: structure from motion

Tomasi and Kanade's factorization: consider $n$ 3D points in $m$ different 2D frames
- $x_{i,j} \in \mathbb{R}^{2 \times 1}$: location of the $j$th point in the $i$th frame
- $x_{i,j} = \underbrace{P_i}_{\text{projection matrix} \in \mathbb{R}^{2 \times 3}} \underbrace{s_j}_{\text{3D position} \in \mathbb{R}^3}$

Matrix of all 2D locations:
$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix} = \underbrace{\begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} \begin{bmatrix} s_1 & \cdots & s_n \end{bmatrix}}_{\text{low-rank factorization}} \in \mathbb{R}^{2m \times n}$,
so $\mathrm{rank}(X) = 3$.

Due to occlusions, there are many missing entries in the matrix $X$. Goal: can we complete the missing entries?
10/67
Motivation 4: missing phase problem

Detectors record intensities of diffracted rays: electric field $x(t_1, t_2)$ → Fourier transform $\hat{x}(f_1, f_2)$ [Fig. credit: Stanford SLAC]

Intensity of electric field:
$|\hat{x}(f_1, f_2)|^2 = \left| \int x(t_1, t_2) e^{-i 2\pi (f_1 t_1 + f_2 t_2)} \, dt_1 \, dt_2 \right|^2$

Phase retrieval: recover signal $x(t_1, t_2)$ from intensity $|\hat{x}(f_1, f_2)|^2$
11/67
A discrete-time model: solving quadratic systems

[Figure: a vector $x$, a matrix $A$, and the intensity measurements $y = |Ax|^2$]

Solve for $x \in \mathbb{C}^n$ in $m$ quadratic equations
$y_k = |\langle a_k, x \rangle|^2, \quad k = 1, \ldots, m$
or $y = |Ax|^2$, where $|z|^2 := [|z_1|^2, \ldots, |z_m|^2]^\top$
12/67
An equivalent view: low-rank factorization

Lifting: introduce $X = xx^*$ to linearize the constraints
$y_k = |a_k^* x|^2 = a_k^* (xx^*) a_k \;\Longrightarrow\; y_k = \langle a_k a_k^*, X \rangle \qquad (6.1)$

find $X \succeq 0$
s.t. $y_k = \langle a_k a_k^*, X \rangle, \quad k = 1, \ldots, m$
$\mathrm{rank}(X) = 1$
13/67
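A small numerical sanity check of the lifting identity (6.1) (a NumPy sketch under our own naming; nothing here is prescribed by the lecture): the quadratic measurements of $x$ coincide with linear measurements of the lifted matrix $X = xx^*$.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 5, 8
    x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

    y = np.abs(A @ x) ** 2               # quadratic measurements y_k = |a_k^* x|^2
    X = np.outer(x, x.conj())            # lifted rank-1 matrix X = x x^*

    # linear-in-X form: y_k = <a_k a_k^*, X> = (k-th row of A) X (k-th row of A)^H
    y_lifted = np.real(np.einsum('ki,ij,kj->k', A, X, A.conj()))
    print(np.allclose(y, y_lifted))      # True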
The list continues
- system identification and time series analysis
- spatial-temporal data: low-rank due to correlations, e.g. MRI video, network traffic, ...
- face recognition
- quantum state tomography
- community detection
- ...
14/67
Low-rank matrix completion and recovery: setup and algorithms 15/67
Setup

- Consider $M \in \mathbb{R}^{n \times n}$ (square case for simplicity)
- $\mathrm{rank}(M) = r \ll n$
- The thin singular value decomposition (SVD) of $M$:
  $M = \underbrace{U \Sigma V^\top}_{(2n - r) r \text{ degrees of freedom}} = \sum_{i=1}^r \sigma_i u_i v_i^\top$
  where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ contains all singular values $\{\sigma_i\}$, and $U := [u_1, \cdots, u_r]$, $V := [v_1, \cdots, v_r]$ consist of the singular vectors
16/67
Low-rank matrix completion

Observed entries: $M_{i,j}$, $(i,j) \in \underbrace{\Omega}_{\text{sampling set}}$

An operator $\mathcal{P}_\Omega$: orthogonal projection onto the subspace of matrices supported on $\Omega$
$[\mathcal{P}_\Omega(X)]_{i,j} = \begin{cases} X_{i,j}, & \text{if } (i,j) \in \Omega \\ 0, & \text{otherwise} \end{cases}$

Completion via rank minimization
$\text{minimize}_X \ \mathrm{rank}(X) \quad \text{s.t.} \ \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M)$
17/67
More general: low-rank matrix recovery

Linear measurements: $y_i = \langle A_i, M \rangle = \mathrm{Tr}(A_i^\top M)$, $i = 1, \ldots, m$

An operator form, with $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$:
$y = \mathcal{A}(M) := \begin{bmatrix} \langle A_1, M \rangle \\ \vdots \\ \langle A_m, M \rangle \end{bmatrix}$

Recovery via rank minimization
$\text{minimize}_X \ \mathrm{rank}(X) \quad \text{s.t.} \ y = \mathcal{A}(X)$
18/67
Nuclear norm minimization 19/67
Convex relaxation

Low-rank matrix completion:
$\text{minimize}_X \ \underbrace{\mathrm{rank}(X)}_{\text{nonconvex}} \quad \text{s.t.} \ \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M)$

Low-rank matrix recovery:
$\text{minimize}_X \ \underbrace{\mathrm{rank}(X)}_{\text{nonconvex}} \quad \text{s.t.} \ \mathcal{A}(X) = \mathcal{A}(M)$

Question: what is a convex surrogate for $\mathrm{rank}(\cdot)$?
20/67
Analogy with sparse recovery

For a given matrix $X \in \mathbb{R}^{n \times n}$, take the full SVD of $X$:
$X = \hat{U} \hat{\Sigma} \hat{V}^\top = [u_1, \cdots, u_n] \, \underbrace{\mathrm{diag}(\sigma_1, \ldots, \sigma_n)}_{\mathrm{diag}(\sigma)} \, [v_1, \cdots, v_n]^\top$

Then
$\mathrm{rank}(X) = \sum_{i=1}^n \mathbf{1}\{\sigma_i \neq 0\} = \|\sigma\|_0$

Convex relaxation: from $\|\sigma\|_0$ to $\|\sigma\|_1$?
21/67
Nuclear norm

Definition 6.1. The nuclear norm of $X$ is
$\|X\|_* := \sum_{i=1}^n \underbrace{\sigma_i(X)}_{i\text{th largest singular value}}$

- Nuclear norm is a counterpart of the $\ell_1$ norm for the rank function
- Equivalence among different norms ($r = \mathrm{rank}(X)$):
  $\|X\| \le \|X\|_F \le \|X\|_* \le \sqrt{r}\, \|X\|_F \le r\, \|X\|$
  where $\|X\| = \sigma_1(X)$ and $\|X\|_F = \left( \sum_{i=1}^n \sigma_i^2(X) \right)^{1/2}$
22/67
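These inequalities are easy to verify numerically; below is a short NumPy sketch (our own code, not from the slides) on a random rank-$r$ matrix.

    import numpy as np

    rng = np.random.default_rng(2)
    n, r = 50, 4
    X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r matrix

    s = np.linalg.svd(X, compute_uv=False)   # singular values, descending
    spec = s[0]                              # spectral norm ||X||
    fro = np.sqrt(np.sum(s**2))              # Frobenius norm ||X||_F
    nuc = np.sum(s)                          # nuclear norm ||X||_*

    # ||X|| <= ||X||_F <= ||X||_* <= sqrt(r) ||X||_F <= r ||X||
    print(spec <= fro <= nuc <= np.sqrt(r) * fro <= r * spec)      # True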
Tightness of relaxation

Recall: the $\ell_1$ norm ball is the convex hull of 1-sparse, unit-norm vectors.

[Figure: (a) $\ell_1$ norm ball; (b) nuclear norm ball]

Fact 6.2. The nuclear norm ball $\{X : \|X\|_* \le 1\}$ is the convex hull of rank-1, unit-norm matrices $\{uv^\top : \|u\|_2 = \|v\|_2 = 1\}$.
23/67
Additivity of nuclear norm

Fact 6.3. Let $A$ and $B$ be matrices of the same dimensions. If $AB^\top = 0$ and $A^\top B = 0$, then $\|A + B\|_* = \|A\|_* + \|B\|_*$.

- If the row (resp. column) spaces of $A$ and $B$ are orthogonal, then $\|A + B\|_* = \|A\|_* + \|B\|_*$
- Similar to the $\ell_1$ norm: when $x$ and $y$ have disjoint support,
  $\|x + y\|_1 = \|x\|_1 + \|y\|_1$,
  which is a key to studying $\ell_1$-min under RIP.
24/67
Proof of Fact 6.3

Suppose $A = U_A \Sigma_A V_A^\top$ and $B = U_B \Sigma_B V_B^\top$, which gives
$AB^\top = 0 \;\Rightarrow\; V_A^\top V_B = 0; \qquad A^\top B = 0 \;\Rightarrow\; U_A^\top U_B = 0$

Thus, one can write
$A = [U_A, U_B, U_C] \begin{bmatrix} \Sigma_A & & \\ & 0 & \\ & & 0 \end{bmatrix} [V_A, V_B, V_C]^\top, \qquad B = [U_A, U_B, U_C] \begin{bmatrix} 0 & & \\ & \Sigma_B & \\ & & 0 \end{bmatrix} [V_A, V_B, V_C]^\top$

and hence
$\|A + B\|_* = \left\| [U_A, U_B] \begin{bmatrix} \Sigma_A & \\ & \Sigma_B \end{bmatrix} [V_A, V_B]^\top \right\|_* = \|A\|_* + \|B\|_*$
25/67
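A numerical check of Fact 6.3 (a NumPy sketch of our own): we build $A$ and $B$ with orthogonal column spaces and orthogonal row spaces by splitting one orthonormal basis between them.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 8
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # column-space basis
    P, _ = np.linalg.qr(rng.standard_normal((n, n)))   # row-space basis
    A = Q[:, :3] @ rng.standard_normal((3, 3)) @ P[:, :3].T
    B = Q[:, 3:5] @ rng.standard_normal((2, 2)) @ P[:, 3:5].T

    nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()
    print(np.allclose(A @ B.T, 0), np.allclose(A.T @ B, 0))   # True True
    print(np.isclose(nuc(A + B), nuc(A) + nuc(B)))            # True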
Dual norm

Definition 6.4 (Dual norm). For a given norm $\|\cdot\|_A$, the dual norm is defined as
$\|X\|_A^* := \max \{ \langle X, Y \rangle : \|Y\|_A \le 1 \}$

- $\ell_1$ norm ↔ dual ↔ $\ell_\infty$ norm
- $\ell_2$ norm ↔ dual ↔ $\ell_2$ norm
- Frobenius norm ↔ dual ↔ Frobenius norm
- nuclear norm ↔ dual ↔ spectral norm
26/67
Schur complement

Given a block matrix
$D = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} \in \mathbb{R}^{(p+q) \times (p+q)}$,
where $A \in \mathbb{R}^{p \times p}$, $B \in \mathbb{R}^{p \times q}$ and $C \in \mathbb{R}^{q \times q}$, the Schur complement of the block $A$ (assume it is invertible) in $D$ is given as
$C - B^\top A^{-1} B$.

Fact 6.5. $D \succeq 0 \iff A \succeq 0, \ C - B^\top A^{-1} B \succeq 0$
27/67
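A quick numerical illustration of Fact 6.5 (our own NumPy sketch): for a positive definite block matrix, the leading block and the Schur complement are both positive semidefinite.

    import numpy as np

    rng = np.random.default_rng(4)
    p, q = 3, 2
    G = rng.standard_normal((p + q, p + q))
    D = G @ G.T + 0.1 * np.eye(p + q)        # a positive definite block matrix

    A, B, C = D[:p, :p], D[:p, p:], D[p:, p:]
    schur = C - B.T @ np.linalg.inv(A) @ B   # Schur complement of A in D

    eigmin = lambda M: np.linalg.eigvalsh(M).min()
    print(eigmin(D) >= 0, eigmin(A) >= 0, eigmin(schur) >= 0)   # True True True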
Representing nuclear norm via SDP

Since the spectral norm is the dual norm of the nuclear norm,
$\|X\|_* = \max \{ \langle X, Y \rangle : \|Y\| \le 1 \}$

The constraint is equivalent to
$\|Y\| \le 1 \iff Y^\top Y \preceq I \overset{\text{Schur complement}}{\iff} \begin{bmatrix} I & Y \\ Y^\top & I \end{bmatrix} \succeq 0$

Fact 6.6.
$\|X\|_* = \max_Y \left\{ \langle X, Y \rangle \;:\; \begin{bmatrix} I & Y \\ Y^\top & I \end{bmatrix} \succeq 0 \right\}$

Fact 6.7 (Dual characterization).
$\|X\|_* = \min_{W_1, W_2} \left\{ \tfrac{1}{2} \mathrm{Tr}(W_1) + \tfrac{1}{2} \mathrm{Tr}(W_2) \;:\; \begin{bmatrix} W_1 & X \\ X^\top & W_2 \end{bmatrix} \succeq 0 \right\}$

Optimal point: $W_1 = U \Sigma U^\top$, $W_2 = V \Sigma V^\top$ (where $X = U \Sigma V^\top$)
28/67
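A NumPy sketch (ours) checking the optimal point claimed in Fact 6.7: with $W_1 = U\Sigma U^\top$ and $W_2 = V\Sigma V^\top$, the objective equals $\|X\|_*$ and the block matrix is PSD, since it factors as $[U;V]\,\Sigma\,[U;V]^\top$.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 6
    X = rng.standard_normal((n, n))
    U, s, Vt = np.linalg.svd(X)

    W1 = U @ np.diag(s) @ U.T                # W1 = U Sigma U^T
    W2 = Vt.T @ np.diag(s) @ Vt              # W2 = V Sigma V^T
    obj = 0.5 * np.trace(W1) + 0.5 * np.trace(W2)

    blk = np.block([[W1, X], [X.T, W2]])     # [W1 X; X^T W2]
    print(np.isclose(obj, s.sum()))                    # objective equals ||X||_*
    print(np.linalg.eigvalsh(blk).min() >= -1e-9)      # PSD (up to round-off)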
Aside: dual of a semidefinite program

(primal) $\text{minimize}_X \ \langle C, X \rangle$ s.t. $\langle A_i, X \rangle = b_i, \ 1 \le i \le m$; $X \succeq 0$

(dual) $\text{maximize}_y \ b^\top y$ s.t. $\sum_{i=1}^m y_i A_i + S = C$; $S \succeq 0$

Exercise: use this to verify Fact 6.7
29/67
Nuclear norm minimization via SDP

Nuclear norm minimization
$\hat{M} = \mathrm{argmin}_X \ \|X\|_* \quad \text{s.t.} \ y = \mathcal{A}(X)$

This is solvable via SDP
$\text{minimize}_{X, W_1, W_2} \ \tfrac{1}{2} \mathrm{Tr}(W_1) + \tfrac{1}{2} \mathrm{Tr}(W_2)$
$\text{s.t.} \ y = \mathcal{A}(X), \quad \begin{bmatrix} W_1 & X \\ X^\top & W_2 \end{bmatrix} \succeq 0$
30/67
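For small problems this program can be handed to an off-the-shelf solver. Below is a sketch using CVXPY (assuming the cvxpy package and its normNuc atom, which is reduced internally to an SDP of the above form when targeting conic solvers; problem sizes and names are ours), specialized to matrix completion:

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(6)
    n, r = 10, 2
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r truth
    Omega = (rng.random((n, n)) < 0.7).astype(float)               # sampling mask

    X = cp.Variable((n, n))
    prob = cp.Problem(cp.Minimize(cp.normNuc(X)),                  # nuclear norm
                      [cp.multiply(Omega, X) == cp.multiply(Omega, M)])
    prob.solve()
    print(np.linalg.norm(X.value - M) / np.linalg.norm(M))  # small if recovery works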
Rank minimization vs cardinality minimization

[Fig. credit: Fazel et al. '10]
31/67
Proximal algorithm

In the presence of noise, one needs to solve
$\text{minimize}_X \ \tfrac{1}{2} \|y - \mathcal{A}(X)\|_2^2 + \lambda \|X\|_*$
which can be solved via proximal methods.

Algorithm 6.1 Proximal gradient methods
for $t = 0, 1, \ldots$:
$X_{t+1} = \mathrm{prox}_{\mu_t \lambda \|\cdot\|_*} \left( X_t - \mu_t \mathcal{A}^* (\mathcal{A}(X_t) - y) \right)$
where $\mu_t$: step size / learning rate
32/67
Proximal operator for nuclear norm

Proximal operator:
$\mathrm{prox}_{\lambda \|\cdot\|_*}(X) = \arg\min_Z \left\{ \tfrac{1}{2} \|Z - X\|_F^2 + \lambda \|Z\|_* \right\} = U \mathcal{T}_\lambda(\Sigma) V^\top$
where the SVD of $X$ is $X = U \Sigma V^\top$ with $\Sigma = \mathrm{diag}(\{\sigma_i\})$, and
$\mathcal{T}_\lambda(\Sigma) = \mathrm{diag}(\{(\sigma_i - \lambda)_+\})$ (singular value thresholding)
33/67
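A minimal NumPy sketch of singular value thresholding and the proximal gradient iteration of Algorithm 6.1, specialized to matrix completion (so $\mathcal{A} = \mathcal{A}^* = \mathcal{P}_\Omega$); the step size, $\lambda$, and iteration count are illustrative choices of ours:

    import numpy as np

    def svt(X, tau):
        """Proximal operator of tau * nuclear norm: soft-threshold singular values."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

    rng = np.random.default_rng(7)
    n, r = 40, 3
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    Omega = rng.random((n, n)) < 0.5         # observed index set

    lam, mu = 1.0, 1.0                       # regularization and step size
    X = np.zeros((n, n))
    for t in range(300):
        grad = np.where(Omega, X - M, 0.0)   # A*(A(X) - y) for A = P_Omega
        X = svt(X - mu * grad, mu * lam)     # proximal gradient step

    print(np.linalg.norm(X - M) / np.linalg.norm(M))   # relative error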
Accelerated proximal gradient

Algorithm 6.2 Accelerated proximal gradient methods
for $t = 0, 1, \ldots$:
$X_{t+1} = \mathrm{prox}_{\mu_t \lambda \|\cdot\|_*} \left( Z_t - \mu_t \mathcal{A}^* (\mathcal{A}(Z_t) - y) \right)$
$Z_{t+1} = X_{t+1} + \underbrace{\alpha_t \left( X_{t+1} - X_t \right)}_{\text{momentum term}}$
where $\mu_t$: step size / learning rate, $\alpha_t$: momentum parameter.

- Convergence rate: $O\left( \frac{1}{\sqrt{\epsilon}} \right)$ iterations to reach $\epsilon$-accuracy.
- Per-iteration cost is a (partial) SVD.
34/67
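A sketch of Algorithm 6.2 for matrix completion (ours), using the common simplified FISTA-style momentum schedule $\alpha_t = t/(t+3)$; the slides leave $\alpha_t$ unspecified, so this is one standard choice among several:

    import numpy as np

    def svt(X, tau):
        """Proximal operator of tau * nuclear norm."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

    rng = np.random.default_rng(8)
    n, r = 40, 3
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    Omega = rng.random((n, n)) < 0.5

    lam, mu = 1.0, 1.0
    X_prev = X = Z = np.zeros((n, n))
    for t in range(300):
        grad = np.where(Omega, Z - M, 0.0)            # A*(A(Z) - y) for A = P_Omega
        X_prev, X = X, svt(Z - mu * grad, mu * lam)   # proximal step at Z
        Z = X + (t / (t + 3)) * (X - X_prev)          # momentum term

    print(np.linalg.norm(X - M) / np.linalg.norm(M))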
Frank-Wolfe for nuclear norm minimization

Consider the constrained problem:
$\text{minimize}_X \ \tfrac{1}{2} \|y - \mathcal{A}(X)\|_2^2 \quad \text{s.t.} \ \|X\|_* \le \tau$,
which can be solved via the conditional gradient method (Frank & Wolfe, 1956).

More generally, consider the problem:
$\text{minimize}_\beta \ f(\beta) \quad \text{s.t.} \ \beta \in \mathcal{C}$,
where $f(\beta)$ is smooth and convex, and $\mathcal{C}$ is a convex set.

Recall projected gradient descent:
$\beta_{t+1} = \mathcal{P}_{\mathcal{C}} \left( \beta_t - \mu_t \nabla f(\beta_t) \right)$
35/67
Conditional gradient method

Algorithm 6.3 Frank-Wolfe
for $t = 0, 1, \ldots$:
$s_t = \mathrm{argmin}_{s \in \mathcal{C}} \ \langle \nabla f(\beta_t), s \rangle$
$\beta_{t+1} = (1 - \gamma_t) \beta_t + \gamma_t s_t$
where $\gamma_t := \frac{2}{t+2}$ is the (default) step size.

- The first step is a constrained minimization of the linear approximation of $f$ at $\beta_t$;
- The second step controls how much we move towards $s_t$.
36/67
Figure illustration Figure credit: Jaggi 2011 37/67
Frank-Wolfe for nuclear norm minimization

Algorithm 6.4 Frank-Wolfe for nuclear norm minimization
for $t = 0, 1, \ldots$:
$S_t = \mathrm{argmin}_{\|S\|_* \le \tau} \ \langle \nabla f(X_t), S \rangle$
$X_{t+1} = (1 - \gamma_t) X_t + \gamma_t S_t$
where $\gamma_t := \frac{2}{t+2}$ is the (default) step size. (Homework)

Note that $\nabla f(X_t) = \mathcal{A}^* (\mathcal{A}(X_t) - y)$, and
$S_t = \tau \cdot \mathrm{argmin}_{\|S\|_* \le 1} \ \langle \nabla f(X_t), S \rangle = -\tau \, u v^\top$,
where $u, v$ are the top left and right singular vectors of $\nabla f(X_t)$.
38/67
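A NumPy sketch of Algorithm 6.4 for matrix completion (our own implementation; a full SVD stands in for the cheaper top-singular-vector computation one would use in practice, e.g. power iterations or scipy.sparse.linalg.svds):

    import numpy as np

    rng = np.random.default_rng(9)
    n, r = 40, 3
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    Omega = rng.random((n, n)) < 0.5

    tau = np.linalg.svd(M, compute_uv=False).sum()   # oracle choice of radius
    X = np.zeros((n, n))
    for t in range(500):
        grad = np.where(Omega, X - M, 0.0)           # grad f(X) = A*(A(X) - y)
        U, s, Vt = np.linalg.svd(grad)
        S = -tau * np.outer(U[:, 0], Vt[0])          # S_t = -tau u v^T (rank-1 atom)
        gamma = 2.0 / (t + 2)
        X = (1 - gamma) * X + gamma * S              # convex combination update

    print(np.linalg.norm(X - M) / np.linalg.norm(M))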
Further comments on Frank-Wolfe
- Extremely low per-iteration cost (only top singular vectors are needed);
- Every iteration is a rank-1 update;
- Convergence rate: $O\left( \frac{1}{\epsilon} \right)$ iterations to reach $\epsilon$-accuracy, which can be very slow;
- Various ways to speed up; active research area.
39/67
RIP and low-rank matrix recovery 40/67
RIP for low-rank matrices

Almost parallel results to compressed sensing...

Definition 6.8. The $r$-restricted isometry constants $\delta_r^{\mathrm{ub}}(\mathcal{A})$ and $\delta_r^{\mathrm{lb}}(\mathcal{A})$ are the smallest quantities s.t.
$(1 - \delta_r^{\mathrm{lb}}) \|X\|_F \le \|\mathcal{A}(X)\|_2 \le (1 + \delta_r^{\mathrm{ub}}) \|X\|_F, \quad \forall X : \mathrm{rank}(X) \le r$

(One can also define RIP w.r.t. $\|\cdot\|_F^2$ rather than $\|\cdot\|_F$.)
41/67
RIP and low-rank matrix recovery

Theorem 6.9 (Recht, Fazel, Parrilo '10; Candes, Plan '11). Suppose $\mathrm{rank}(M) = r$. For any fixed integer $K > 0$, if
$\frac{1 + \delta_{Kr}^{\mathrm{ub}}}{1 - \delta_{(2+K)r}^{\mathrm{lb}}} < \sqrt{\frac{K}{2}}$,
then nuclear norm minimization is exact.

- Can be easily extended to account for noise and imperfect structural assumptions
42/67
Geometry of nuclear norm ball

Level set of the nuclear norm ball:
$\left\{ (x, y, z) : \left\| \begin{bmatrix} x & y \\ y & z \end{bmatrix} \right\|_* \le 1 \right\}$

[Fig. credit: Candes '14]
43/67
Some notation

- Recall $M = U \Sigma V^\top$
- Let $T$ be the span of matrices of the form (called the tangent space)
  $T = \{ U X^\top + Y V^\top : X, Y \in \mathbb{R}^{n \times r} \}$
- Let $\mathcal{P}_T$ be the orthogonal projection onto $T$:
  $\mathcal{P}_T(X) = U U^\top X + X V V^\top - U U^\top X V V^\top$
- Its complement $\mathcal{P}_{T^\perp} = \mathcal{I} - \mathcal{P}_T$:
  $\mathcal{P}_{T^\perp}(X) = (I - U U^\top) X (I - V V^\top)$
- $M \mathcal{P}_{T^\perp}(X)^\top = 0$ and $M^\top \mathcal{P}_{T^\perp}(X) = 0$
44/67
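These projections are straightforward to code; the NumPy sketch below (our own helper names) verifies the decomposition and the orthogonality properties used in the proof that follows.

    import numpy as np

    rng = np.random.default_rng(10)
    n, r = 12, 3
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    M = U @ np.diag(rng.uniform(1, 2, r)) @ V.T

    def P_T(X):
        """Projection onto the tangent space T at M = U Sigma V^T."""
        return U @ U.T @ X + X @ V @ V.T - U @ U.T @ X @ V @ V.T

    def P_Tperp(X):
        """Projection onto the orthogonal complement of T."""
        return (np.eye(n) - U @ U.T) @ X @ (np.eye(n) - V @ V.T)

    H = rng.standard_normal((n, n))
    print(np.allclose(P_T(H) + P_Tperp(H), H))   # H = P_T(H) + P_Tperp(H)
    print(np.allclose(M @ P_Tperp(H).T, 0))      # M P_Tperp(H)^T = 0
    print(np.allclose(M.T @ P_Tperp(H), 0))      # M^T P_Tperp(H) = 0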
Proof of Theorem 6.9

Suppose $X = M + H$ is feasible and obeys $\|M + H\|_* \le \|M\|_*$. The goal is to show that $H = 0$ under RIP.

The key is to decompose $H$ into
$H = H_0 + \underbrace{H_1 + H_2 + \ldots}_{H_c}$
- $H_0 = \mathcal{P}_T(H)$ (rank at most $2r$)
- $H_c = \mathcal{P}_{T^\perp}(H)$ (obeying $M H_c^\top = 0$ and $M^\top H_c = 0$)
- $H_1$: best rank-($Kr$) approximation of $H_c$
- $H_2$: best rank-($Kr$) approximation of $H_c - H_1$
- ... ($K$ is a constant)
45/67
Proof of Theorem 6.9

Informally, the proof proceeds by showing that
1. $H_0$ dominates $\sum_{i \ge 2} H_i$ (by objective function)
2. (converse) $\sum_{i \ge 2} H_i$ dominates $H_0 + H_1$ (by RIP + feasibility)

These can happen simultaneously only when $H = 0$
46/67
Proof of Theorem 6.9

Step 1 (which does not rely on RIP). Show that
$\sum_{j \ge 2} \|H_j\|_F \le \|H_0\|_* / \sqrt{Kr}. \qquad (6.2)$

This follows immediately by combining the following 2 observations:

(i) Since $M + H$ is assumed to be a better estimate:
$\|M\|_* \ge \|M + H\|_* \ge \|M + H_c\|_* - \|H_0\|_* \overset{\text{Fact 6.3 } (M H_c^\top = 0 \text{ and } M^\top H_c = 0)}{=} \|M\|_* + \|H_c\|_* - \|H_0\|_* \qquad (6.3)$
$\Longrightarrow \|H_c\|_* \le \|H_0\|_* \qquad (6.4)$

(ii) Since the nonzero singular values of $H_{j-1}$ dominate those of $H_j$ ($j \ge 2$):
$\|H_j\|_F \le \sqrt{Kr}\, \|H_j\| \le \sqrt{Kr} \left[ \|H_{j-1}\|_* / (Kr) \right] = \|H_{j-1}\|_* / \sqrt{Kr}$
$\Longrightarrow \sum_{j \ge 2} \|H_j\|_F \le \frac{1}{\sqrt{Kr}} \sum_{j \ge 1} \|H_j\|_* = \frac{1}{\sqrt{Kr}} \|H_c\|_* \qquad (6.5)$
47/67
Proof of Theorem 6.9

Step 2 (using feasibility + RIP). Show that $\exists \rho < \sqrt{K/2}$ s.t.
$\|H_0 + H_1\|_F \le \rho \sum_{j \ge 2} \|H_j\|_F \qquad (6.6)$

If this claim holds, then
$\|H_0 + H_1\|_F \le \rho \sum_{j \ge 2} \|H_j\|_F \overset{(6.2)}{\le} \rho \frac{1}{\sqrt{Kr}} \|H_0\|_* \le \rho \frac{1}{\sqrt{Kr}} \sqrt{2r}\, \|H_0\|_F = \rho \sqrt{\frac{2}{K}} \|H_0\|_F \le \rho \sqrt{\frac{2}{K}} \|H_0 + H_1\|_F. \qquad (6.7)$

This cannot hold with $\rho < \sqrt{K/2}$ unless $\underbrace{H_0 + H_1 = 0}_{\text{equivalently, } H_0 = H_1 = 0}$
48/67
Proof of Theorem 6.9

We now prove (6.6). To connect $H_0 + H_1$ with $\sum_{j \ge 2} H_j$, we use feasibility:
$\mathcal{A}(H) = 0 \iff \mathcal{A}(H_0 + H_1) = -\sum_{j \ge 2} \mathcal{A}(H_j)$,
which taken collectively with RIP yields
$(1 - \delta_{(2+K)r}^{\mathrm{lb}}) \|H_0 + H_1\|_F \le \|\mathcal{A}(H_0 + H_1)\|_2 = \Big\| \sum_{j \ge 2} \mathcal{A}(H_j) \Big\|_2 \le \sum_{j \ge 2} \|\mathcal{A}(H_j)\|_2 \le (1 + \delta_{Kr}^{\mathrm{ub}}) \sum_{j \ge 2} \|H_j\|_F$

This establishes (6.6) as long as $\rho := \frac{1 + \delta_{Kr}^{\mathrm{ub}}}{1 - \delta_{(2+K)r}^{\mathrm{lb}}} < \sqrt{\frac{K}{2}}$.
49/67
Gaussian sampling operators satisfy RIP

If the entries of $\{A_i\}_{i=1}^m$ are i.i.d. $\mathcal{N}(0, 1/m)$, then
$\delta_{5r}(\mathcal{A}) < \frac{\sqrt{3} - \sqrt{2}}{\sqrt{3} + \sqrt{2}}$
with high probability, provided that $m \gtrsim nr$ (near-optimal sample size)

This satisfies the assumption of Theorem 6.9 with $K = 3$
50/67
Precise phase transition

Using the statistical dimension machinery, one can locate a precise phase transition (Amelunxen, Lotz, McCoy & Tropp '13): nuclear norm minimization
works if $m > \mathrm{stat\text{-}dim}\left( \mathcal{D}(\|\cdot\|_*, X) \right)$
fails if $m < \mathrm{stat\text{-}dim}\left( \mathcal{D}(\|\cdot\|_*, X) \right)$
where $\mathrm{stat\text{-}dim}\left( \mathcal{D}(\|\cdot\|_*, X) \right) \approx n^2 \psi\left( \frac{r}{n} \right)$ and
$\psi(\rho) := \inf_{\tau \ge 0} \left\{ \rho + (1 - \rho) \left[ \rho (1 + \tau^2) + (1 - \rho) \int_\tau^2 (u - \tau)^2 \frac{\sqrt{4 - u^2}}{\pi} \, du \right] \right\}$
51/67
Numerical phase transition (n = 30) Figure credit: Amelunxen, Lotz, McCoy, & Tropp 13 52/67
Sampling operators that do NOT satisfy RIP Unfortunately, many sampling operators fail to satisfy RIP (e.g. none of the 4 motivating examples in this lecture satisfies RIP) 53/67
Matrix completion 54/67
Sampling operators for matrix completion

Observation operator (projection onto matrices supported on $\Omega$):
$Y = \mathcal{P}_\Omega(M)$, where $(i,j) \in \Omega$ with prob. $p$ (random sampling)

$\mathcal{P}_\Omega$ does NOT satisfy RIP when $p \ll 1$! For example, for the rank-1 matrix
$M = e_1 e_1^\top = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & \\ 0 & 0 & \cdots & 0 \end{bmatrix}$,
with high probability $(1,1) \notin \Omega$, so that
$\|\mathcal{P}_\Omega(M)\|_F = 0$, or equivalently, $\frac{1 + \delta_K^{\mathrm{ub}}}{1 - \delta_{2+K}^{\mathrm{lb}}} = \infty$
55/67
Which sampling pattern?

[Figure: a sampling pattern in which an entire row and column are unobserved]

If some rows/columns are not sampled, recovery is impossible.
56/67
Which low-rank matrices can we recover?

Compare the following rank-1 matrices:
$\underbrace{e_1 e_1^\top}_{\text{hard}} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & & \ddots & \\ 0 & 0 & \cdots & 0 \end{bmatrix}$ vs. $\underbrace{\mathbf{1} \mathbf{1}^\top}_{\text{easy}} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & & \ddots & \\ 1 & 1 & \cdots & 1 \end{bmatrix}$

Column / row spaces cannot be aligned with canonical basis vectors
57/67
Coherence

Definition 6.10. The coherence parameter $\mu$ of $M = U \Sigma V^\top$ is the smallest quantity s.t.
$\max_i \|U^\top e_i\|_2^2 \le \frac{\mu r}{n}$ and $\max_i \|V^\top e_i\|_2^2 \le \frac{\mu r}{n}$

- $\mu \ge 1$ (since $\sum_{i=1}^n \|U^\top e_i\|_2^2 = \|U\|_F^2 = r$)
- $\mu = 1$ if all entries of $U$ and $V$ have magnitude $1/\sqrt{n}$, e.g. $u = \frac{1}{\sqrt{n}} \mathbf{1}$ when $r = 1$ (most incoherent)
- $\mu = \frac{n}{r}$ if some $e_i$ lies in the column span of $U$ (most coherent)
58/67
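Computing $\mu$ is a few lines of NumPy; the sketch below (our code, our helper name) contrasts an incoherent random matrix with the coherent spike example from the previous slide.

    import numpy as np

    def coherence(M):
        """Coherence parameter of M = U Sigma V^T (Definition 6.10)."""
        U, s, Vt = np.linalg.svd(M)
        r = int(np.sum(s > 1e-10 * s[0]))        # numerical rank
        U, V = U[:, :r], Vt[:r].T
        n = M.shape[0]
        lev_U = np.max(np.sum(U**2, axis=1))     # max_i ||U^T e_i||^2
        lev_V = np.max(np.sum(V**2, axis=1))
        return max(lev_U, lev_V) * n / r

    rng = np.random.default_rng(11)
    n, r = 100, 2
    M_easy = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    M_hard = np.zeros((n, n)); M_hard[0, 0] = 1.0   # e_1 e_1^T

    print(coherence(M_easy))    # modest (grows only logarithmically with n)
    print(coherence(M_hard))    # n/r with r = 1: maximally coherent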
Performance guarantee

Theorem 6.11 (Candes & Recht '09; Candes & Tao '10; Gross '11; ...). Nuclear norm minimization is exact and unique with high probability, provided that
$m \gtrsim \mu n r \log^2 n$

- This result is optimal up to a logarithmic factor
- Established via a RIPless theory
59/67
Numerical performance of nuclear-norm minimization n = 50 Fig. credit: Candes, Recht 09 60/67
Subgradient of nuclear norm

Subdifferential (set of subgradients) of $\|\cdot\|_*$ at $M$ is
$\partial \|M\|_* = \left\{ U V^\top + W : \mathcal{P}_T(W) = 0, \ \|W\| \le 1 \right\}$

- Does not depend on the singular values of $M$
- $Z \in \partial \|M\|_*$ iff $\mathcal{P}_T(Z) = U V^\top$ and $\|\mathcal{P}_{T^\perp}(Z)\| \le 1$
61/67
KKT condition

Lagrangian:
$\mathcal{L}(X, \Lambda) = \|X\|_* + \langle \Lambda, \mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M) \rangle = \|X\|_* + \langle \mathcal{P}_\Omega(\Lambda), X - M \rangle$

When $M$ is the minimizer, the KKT condition reads
$0 \in \partial_X \mathcal{L}(X, \Lambda) \big|_{X = M}$
$\iff \exists \Lambda \ \text{s.t.} \ -\mathcal{P}_\Omega(\Lambda) \in \partial \|M\|_*$
$\iff \exists W \ \text{s.t.} \ U V^\top + W \ \text{is supported on} \ \Omega, \ \mathcal{P}_T(W) = 0, \ \text{and} \ \|W\| \le 1$
62/67
Optimality condition via dual certificate

A slightly stronger condition than KKT guarantees uniqueness:

Lemma 6.12. $M$ is the unique minimizer of nuclear norm minimization if
- the sampling operator $\mathcal{P}_\Omega$ restricted to $T$ is injective, i.e.
  $\mathcal{P}_\Omega(H) \neq 0$ for any nonzero $H \in T$;
- $\exists W$ s.t. $U V^\top + W$ is supported on $\Omega$, $\mathcal{P}_T(W) = 0$, and $\|W\| < 1$
63/67
Proof of Lemma 6.12

Let $M + H$ be feasible, so that $\mathcal{P}_\Omega(H) = 0$. For any $W_0$ obeying $\|W_0\| \le 1$ and $\mathcal{P}_T(W_0) = 0$, $U V^\top + W_0$ is a subgradient of $\|\cdot\|_*$ at $M$, and hence
$\|M + H\|_* \ge \|M\|_* + \langle U V^\top + W_0, H \rangle$
$= \|M\|_* + \langle U V^\top + W, H \rangle + \langle W_0 - W, H \rangle$
$= \|M\|_* + \langle \underbrace{U V^\top + W}_{\text{supported on } \Omega}, \mathcal{P}_\Omega(H) \rangle + \langle W_0 - W, \mathcal{P}_{T^\perp}(H) \rangle$
$= \|M\|_* + \langle W_0 - W, \mathcal{P}_{T^\perp}(H) \rangle$

If we take $W_0$ s.t. $\langle W_0, \mathcal{P}_{T^\perp}(H) \rangle = \|\mathcal{P}_{T^\perp}(H)\|_*$, then
$\|M + H\|_* \ge \|M\|_* + \|\mathcal{P}_{T^\perp}(H)\|_* - \|W\| \|\mathcal{P}_{T^\perp}(H)\|_* = \|M\|_* + (1 - \|W\|) \|\mathcal{P}_{T^\perp}(H)\|_* > \|M\|_*$
unless $\mathcal{P}_{T^\perp}(H) = 0$. But if $\mathcal{P}_{T^\perp}(H) = 0$, then $H \in T$ and $H = 0$ by injectivity. Thus $\|M + H\|_* > \|M\|_*$ for any feasible $H \neq 0$, concluding the proof.
64/67
Constructing dual certificates

- Use the golfing scheme to produce an approximate dual certificate (Gross '11)
- Think of it as an iterative algorithm (with sample splitting) to solve the KKT system
65/67
Reference [1] Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, B. Recht, M. Fazel, P. Parrilo, SIAM Review, 2010. [2] Exact matrix completion via convex optimization, E. Candes, and B. Recht, Foundations of Computational Mathematics, 2009 [3] Matrix rank minimization with applications, M. Fazel, Ph.D. Thesis, 2002. [4] The power of convex relaxation: Near-optimal matrix completion, E. Candes, and T. Tao, 2010. [5] Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements, E. Candes, and Y. Plan, IEEE Transactions on Information Theory, 2011. [6] Shape and motion from image streams under orthography: a factorization method, C. Tomasi and T. Kanade, International Journal of Computer Vision, 1992. 66/67
Reference [7] An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems, K.-C. Toh and S. Yun, Pacific Journal of Optimization, 2010. [8] Topics in random matrix theory, T. Tao, American Mathematical Society, 2012. [9] A singular value thresholding algorithm for matrix completion, J. Cai, E. Candes, Z. Shen, SIAM Journal on Optimization, 2010. [10] Recovering low-rank matrices from few coefficients in any basis, D. Gross, IEEE Transactions on Information Theory, 2011. [11] Incoherence-optimal matrix completion, Y. Chen, IEEE Transactions on Information Theory, 2015. [12] Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, M. Jaggi, ICML, 2013. 67/67