ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
Low-rank matrix recovery via nonconvex optimization

Yuejie Chi, Department of Electrical and Computer Engineering, Spring 2018
Outline

- Low-rank matrix completion and recovery (last lecture): nuclear norm minimization; RIP and low-rank matrix recovery; matrix completion; algorithms for nuclear norm minimization
- Nonconvex methods (this lecture): global landscape; spectral methods; (projected) gradient descent
Why nonconvex?

Consider completing an n × n matrix M with rank r:

    minimize_X ‖P_Ω(X − M)‖_F²  s.t. rank(X) ≤ r,

where r ≪ n.

- The size of the observation set Ω is about nr · polylog(n);
- the degrees of freedom in X are about nr.

Question: can we develop algorithms whose computational and memory complexity are nearly linear in n? In particular, we don't even want to store the matrix X, which takes n² storage. A nonconvex approach stores and updates a low-dimensional representation of X throughout the execution of the algorithm.
Convex vs. nonconvex

    minimize_x f(x)

[Figure: a convex landscape with a single basin vs. a nonconvex landscape with multiple valleys and saddles.]
Prelude: low-rank matrix approximation, an optimization perspective
Low-rank matrix approximation / PCA

Given M ∈ R^{n×n} (not necessarily low-rank), solve the best rank-r approximation problem:

    M_r = argmin_X ‖X − M‖_F²  s.t. rank(X) ≤ r.

This is a nonconvex optimization problem, yet tractable: the solution is given by the Eckart–Young theorem. Write the SVD of M as M = Σ_{i=1}^n σ_i u_i v_i^T, with the σ_i in descending order; then

    M_r = Σ_{i=1}^r σ_i u_i v_i^T.
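The Eckart–Young theorem is easy to verify numerically. A minimal NumPy sketch (the random test matrix `M` is an arbitrary stand-in, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3

# A generic (full-rank) matrix and its SVD.
M = rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(M)

# Best rank-r approximation: keep the top-r singular triplets.
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Eckart-Young: the optimal Frobenius error equals the energy
# in the discarded singular values.
err = np.linalg.norm(M - M_r, "fro")
opt = np.sqrt(np.sum(s[r:] ** 2))
```

Here `err` and `opt` agree to machine precision; no rank-r matrix can do better in Frobenius norm.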
Optimization viewpoint

Let us factorize X = UV^T, where U, V ∈ R^{n×r}. Our problem is equivalent to

    minimize_{U,V} f(U, V) := ‖UV^T − M‖_F².

- The sizes of U and V are O(nr), much smaller than the n² size of X.
- Identifiability issues: for any orthonormal R ∈ R^{r×r} and any α ≠ 0, we have UV^T = (αUR)(α^{-1}VR)^T. Hence if (U, V) is a global minimizer, so is (αUR, α^{-1}VR).

Question: what does the landscape of f(U, V) look like? (We already know its global minima.)
The PSD case

For simplicity, consider the PSD case: let M be PSD, so that M = Σ_{i=1}^n σ_i u_i u_i^T, and let X = UU^T with U ∈ R^{n×r}. We are interested in the landscape of

    f(U) := (1/4)‖UU^T − M‖_F².

Identifiability: for any orthonormal R ∈ R^{r×r}, UU^T = (UR)(UR)^T. To make the exposition even simpler, set r = 1:

    f(u) = (1/4)‖uu^T − M‖_F².
Take f(x) = (1/4)‖xx^T − 11^T‖_F², i.e., M = [1 1; 1 1]. Good news: benign landscape. Global optima: x = ±[1, 1]^T; strict saddle: x = [0, 0]^T. No spurious local minima.
Critical points

Definition 7.1. A first-order critical point (stationary point) u satisfies ∇f(u) = 0.

(Figure credit: Li et al.)
Critical points of f(u)

Any critical point u ∈ R^n satisfies

    ∇f(u) = (uu^T − M)u = 0  ⟺  Mu = ‖u‖₂² u,

so either u aligns with an eigenvector of M, or u = 0. Since Mu_i = σ_i u_i, the set of critical points is

    {0} ∪ {±√σ_i · u_i, i = 1, ..., n}.
Categorization of critical points

We need to examine the Hessian:

    ∇²f(u) = 2uu^T + ‖u‖₂² I − M.

Plugging in the non-zero critical points ũ_k := √σ_k · u_k:

    ∇²f(ũ_k) = 2σ_k u_k u_k^T + σ_k I − M
             = 2σ_k u_k u_k^T + σ_k Σ_{i=1}^n u_i u_i^T − Σ_{i=1}^n σ_i u_i u_i^T
             = Σ_{i≠k} (σ_k − σ_i) u_i u_i^T + 2σ_k u_k u_k^T.

Assume first that σ_1 > σ_2 > ... > σ_n > 0:
- k = 1: ∇²f(ũ_1) ≻ 0, a local minimum;
- 1 < k ≤ n: λ_min(∇²f(ũ_k)) < 0 and λ_max(∇²f(ũ_k)) > 0, a strict saddle;
- u = 0: ∇²f(0) = −M ≺ 0, a local maximum.

Under the weaker assumption σ_1 > σ_2 ≥ ... ≥ σ_n ≥ 0, the conclusions are the same except that ∇²f(ũ_1) ⪰ 0, and u = 0 becomes a strict saddle, since λ_min(∇²f(0)) = −σ_1 < 0.
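This classification can be confirmed numerically by evaluating the gradient and the Hessian eigenvalues at the critical points (a sketch with a synthetic PSD `M` whose eigenvalues are chosen by hand; not code from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# PSD matrix with distinct positive eigenvalues sigma_1 > ... > sigma_n > 0.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
M = Q @ np.diag(sigma) @ Q.T

def grad(u):
    # Gradient of f(u) = (1/4) ||uu^T - M||_F^2.
    return (np.outer(u, u) - M) @ u

def hess(u):
    # Hessian: 2 uu^T + ||u||^2 I - M.
    return 2 * np.outer(u, u) + np.dot(u, u) * np.eye(n) - M

u1 = np.sqrt(sigma[0]) * Q[:, 0]    # sqrt(sigma_1) u_1: global minimum
u2 = np.sqrt(sigma[1]) * Q[:, 1]    # sqrt(sigma_2) u_2: strict saddle

g1, g2 = np.linalg.norm(grad(u1)), np.linalg.norm(grad(u2))
ev1 = np.linalg.eigvalsh(hess(u1))  # eigenvalues in ascending order
ev2 = np.linalg.eigvalsh(hess(u2))
```

Both gradients vanish; the Hessian at √σ₁·u₁ is positive definite (its smallest eigenvalue is σ₁ − σ₂), while at √σ₂·u₂ it has eigenvalues of both signs.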
Summary

For f(U) := (1/4)‖UU^T − M‖_F² with U ∈ R^{n×r}, if σ_r > σ_{r+1}:
- all local minima are global: any local minimizer U consists of the top-r eigenvectors (appropriately scaled) up to an orthonormal transformation;
- strict saddle points: every other stationary point is a strict saddle.
Undersampled regime

Consider linear measurements

    y = A(M) ∈ R^m,  m ≪ n²,

where M = U_0 U_0^T ∈ R^{n×n} is rank-r, r ≪ n, and PSD (for simplicity). The loss function we consider is

    f(U) := (1/4)‖A(UU^T − M)‖₂².

If E[A*A] = I, then E[f(U)] = (1/4)‖UU^T − M‖_F². Does f(U) inherit the benign landscape?
Landscape preservation under RIP

Recall the definition of the RIP:

Definition 7.2. The rank-r restricted isometry constant δ_r is the smallest quantity such that

    (1 − δ_r)‖X‖_F² ≤ ‖A(X)‖₂² ≤ (1 + δ_r)‖X‖_F²  for all X with rank(X) ≤ r.

Theorem 7.3 (Bhojanapalli et al. 2016, Ge et al. 2017). If A satisfies the RIP with δ_{2r} < 1/10, then f(U) satisfies:
- all local minima are global: any local minimum U of f(U) satisfies UU^T = M;
- strict saddle points: any critical point U that is not a local minimum satisfies λ_min(∇²f(U)) ≤ −(2/5)σ_r.
Proof of Theorem 7.3 when r = 1

Without loss of generality, assume M = u_0 u_0^T and σ_1 = 1.

Step 1: characterize the critical points:

    ∇f(u) = Σ_{i=1}^m ⟨A_i, uu^T − u_0u_0^T⟩ A_i u = 0.

Step 2: examine the Hessian at the critical points:

    ∇²f(u) = Σ_{i=1}^m [⟨A_i, uu^T − u_0u_0^T⟩ A_i + 2(A_i u)(A_i u)^T].
Proof of Theorem 7.3 when r = 1 (continued)

Assume u is first-order critical, and consider the direction Δ := u − u_0. Then

    Δ^T ∇²f(u) Δ = Σ_{i=1}^m [⟨A_i, uu^T − u_0u_0^T⟩⟨A_i, ΔΔ^T⟩ + 2⟨A_i, Δu^T⟩²]
                 = (1/2) Σ_{i=1}^m [⟨A_i, ΔΔ^T⟩² − 3⟨A_i, uu^T − u_0u_0^T⟩²],

where the second equality uses the first-order optimality condition Σ_i ⟨A_i, uu^T − u_0u_0^T⟩ A_i u = 0.
Proof of Theorem 7.3 when r = 1 (continued)

By the RIP,

    Δ^T ∇²f(u) Δ = (1/2) Σ_{i=1}^m [⟨A_i, (u − u_0)(u − u_0)^T⟩² − 3⟨A_i, uu^T − u_0u_0^T⟩²]
                 ≤ (1/2) [(1 + δ)‖(u − u_0)(u − u_0)^T‖_F² − 3(1 − δ)‖uu^T − u_0u_0^T‖_F²]
                 ≤ (1/2) [2(1 + δ) − 3(1 − δ)] ‖uu^T − u_0u_0^T‖_F²
                 = −(1/2)(1 − 5δ) ‖uu^T − u_0u_0^T‖_F²,

where we used ‖(u − u_0)(u − u_0)^T‖_F² ≤ 2‖uu^T − u_0u_0^T‖_F² (choosing the sign of u_0 so that u^T u_0 ≥ 0). When δ < 1/5, this is strictly negative unless uu^T = u_0u_0^T: every critical point other than ±u_0 has a negative-curvature direction.
Landscape without RIP

In matrix completion, we regularize the loss function to promote incoherent solutions: set

    Q(U) = Σ_{i=1}^n (‖e_i^T U‖₂ − α)₊⁴,

where α is a regularization parameter and z₊ = max{z, 0}. Consider the loss function

    f(U) = (1/p)‖P_Ω(UU^T − M)‖_F² + λ Q(U),

where λ is a regularization parameter. Adding Q(U) does not affect the global optimizer if α is set properly.
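The regularized objective is straightforward to implement; a hedged sketch (the names `mc_loss`, `alpha`, `lam` are ours, and the planted `U_star` is a synthetic example):

```python
import numpy as np

def mc_loss(U, M, mask, p, alpha, lam):
    """Regularized MC loss: data fit on observed entries plus a fourth-power
    penalty on rows of U whose norm exceeds alpha (promotes incoherence)."""
    resid = mask * (U @ U.T - M)                 # P_Omega(UU^T - M)
    fit = np.linalg.norm(resid, "fro") ** 2 / p
    row_norms = np.linalg.norm(U, axis=1)        # ||e_i^T U||_2
    reg = np.sum(np.maximum(row_norms - alpha, 0.0) ** 4)
    return fit + lam * reg

rng = np.random.default_rng(2)
n, r, p = 40, 2, 0.5
U_star = rng.standard_normal((n, r)) / np.sqrt(n)
M = U_star @ U_star.T
mask = rng.random((n, n)) < p

# With alpha above every row norm of U_star, the penalty vanishes at the
# truth, so the regularizer does not move the global optimizer.
alpha = np.max(np.linalg.norm(U_star, axis=1)) + 0.1
loss_at_truth = mc_loss(U_star, M, mask, p, alpha, lam=1.0)
loss_small_alpha = mc_loss(U_star, M, mask, p, alpha=0.0, lam=1.0)
```

With a properly chosen α the loss at the ground truth is zero; shrinking α to 0 makes the penalty bite even at the truth, illustrating why α must be set by the incoherence level.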
MC has no spurious local minima

Theorem 7.4 (Ge et al., 2016). If p ≳ µ⁴r⁶ log n / n, α² = Θ(µrσ_1/n) and λ = Θ(n/(µr)), then with probability at least 1 − n⁻¹:
- all local minima are global: any local minimum U of f(U) satisfies UU^T = M;
- all saddle points that are not local minima are strict saddles.

Saddle-point-escaping algorithms can thus be used to guarantee convergence to local minima, which in this problem are global minima. Constructing such algorithms is an active research area: (perturbed) gradient descent, trust-region methods, etc.
Spectral methods: a one-shot approach
Setup

Consider M ∈ R^{n×n} (square case for simplicity) with rank(M) = r ≪ n. The thin singular value decomposition (SVD) of M is

    M = UΣV^T = Σ_{i=1}^r σ_i u_i v_i^T,

which has (2n − r)r degrees of freedom, where Σ = diag(σ_1, ..., σ_r) contains all singular values, and U := [u_1, ..., u_r], V := [v_1, ..., v_r] consist of the singular vectors.
Signal + noise

Each entry (i, j) belongs to Ω independently with probability p. One can decompose the observation P_Ω(M) as

    (1/p) P_Ω(M) = M (signal) + [(1/p) P_Ω(M) − M] (noise).

The noise has mean zero: E[(1/p) P_Ω(M)] = M.
Low-rank denoising

    (1/p) P_Ω(M) = M (low-rank signal) + E,  where E := (1/p) P_Ω(M) − M (zero-mean noise).

Algorithm 7.1 (Spectral method): set M̂ to the best rank-r approximation of (1/p) P_Ω(M).

The rank-r approximation can be computed via power methods or Lanczos methods, which only require multiplying vectors by the sparse matrix (1/p) P_Ω(M); we never need to form it as a dense matrix.
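Algorithm 7.1 in a few lines of NumPy (a sketch; for large n one would replace the dense SVD below with a Lanczos-type partial SVD, and the random rank-r `M` is a synthetic example):

```python
import numpy as np

def spectral_estimate(Y_obs, p, r):
    """Best rank-r approximation of the rescaled observation P_Omega(M)/p."""
    U, s, Vt = np.linalg.svd(Y_obs / p)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(3)
n, r, p = 300, 3, 0.5

# Random rank-r truth; each entry observed independently with probability p.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n)) / np.sqrt(n)
mask = rng.random((n, n)) < p

M_hat = spectral_estimate(mask * M, p, r)
rel_err = np.linalg.norm(M_hat - M, "fro") / np.linalg.norm(M, "fro")
raw_err = np.linalg.norm(mask * M / p - M, "fro") / np.linalg.norm(M, "fro")
```

Truncating to rank r removes most of the sampling noise: `rel_err` is a fraction of `raw_err`, the error of the rescaled observation itself.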
Histograms of the singular values of P_Ω(M)

[Figure: histogram of the singular values of P_Ω(M) for a random rank-3 matrix M. Fig. credit: Keshavan, Montanari, Oh '10.]
Performance of spectral methods

Theorem 7.5 (Keshavan, Montanari, Oh '10). Suppose the number of observed entries m obeys m ≳ n log n. Then

    ‖M̂ − M‖_F / ‖M‖_F ≲ ν √(nr log² n / m),  where ν := n · max_{i,j}|M_{i,j}| / ‖M‖_F.

- ν reflects whether the energy of M is spread out; for µ-incoherent matrices, |M_{i,j}| ≲ µrσ_1/n, so ν is bounded.
- When m ≳ ν² nr log² n, the estimate M̂ is very close to the truth.¹
- Since the degrees of freedom are about nr, this is a nearly-optimal sample complexity for incoherent matrices.

¹ The logarithmic factor can be improved.
Perturbation bounds

To ensure M̂ is a good estimate, it suffices to control the noise E.

Lemma 7.6. Suppose rank(M) = r. For any perturbation E,

    ‖P_r(M + E) − M‖ ≤ 2‖E‖,   ‖P_r(M + E) − M‖_F ≤ 2√(2r) ‖E‖,

where P_r(X) denotes the best rank-r approximation of X and ‖·‖ the spectral norm.
A primer on matrix perturbation theory

Lemma 7.7 (Weyl's inequality, 1912). Let M and E be n × n matrices. Then

    |σ_k(M + E) − σ_k(M)| ≤ ‖E‖,  k = 1, ..., n.

Proof: invoke the Courant–Fischer minimax characterization

    σ_k(A) = max_{S: dim(S)=k} min_{0 ≠ v ∈ S} ‖Av‖₂ / ‖v‖₂.
Proof of Lemma 7.6

By the triangle inequality and Weyl's inequality,

    ‖P_r(M + E) − M‖ ≤ ‖P_r(M + E) − (M + E)‖ + ‖(M + E) − M‖
                     = σ_{r+1}(M + E) + ‖E‖
                     ≤ σ_{r+1}(M) + ‖E‖ + ‖E‖ = 2‖E‖,

since σ_{r+1}(M) = 0. The second inequality of Lemma 7.6 follows since both P_r(M + E) and M are rank-r, so their difference has rank at most 2r, whence ‖P_r(M + E) − M‖_F ≤ √(2r) ‖P_r(M + E) − M‖.
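Both Lemma 7.6 and Weyl's inequality can be verified numerically; a sketch (the rank-r `M` and perturbation `E` below are arbitrary synthetic choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 80, 4

def best_rank_r(X, r):
    U, s, Vt = np.linalg.svd(X)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

A = rng.standard_normal((n, r))
M = A @ A.T                               # exactly rank r
E = 0.5 * rng.standard_normal((n, n))     # arbitrary perturbation

opnorm = lambda X: np.linalg.norm(X, 2)   # spectral norm

# Weyl: each singular value moves by at most ||E||.
s_pert = np.linalg.svd(M + E, compute_uv=False)
s_true = np.linalg.svd(M, compute_uv=False)
weyl_gap = np.max(np.abs(s_pert - s_true))

# Lemma 7.6: the rank-r truncation of M + E stays within 2||E|| of M
# in spectral norm, and within 2 sqrt(2r) ||E|| in Frobenius norm.
M_hat = best_rank_r(M + E, r)
spec_err = opnorm(M_hat - M)
fro_err = np.linalg.norm(M_hat - M, "fro")
```

Both bounds are deterministic, so they hold for every draw, not just on average.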
Controlling the noise

Recall that the entries of E = (1/p) P_Ω(M) − M are zero-mean and independent. A bit of random matrix theory:

Lemma 7.8 (Chapter 2.3, Tao '12). Suppose X ∈ R^{n×n} is a random symmetric matrix obeying
- {X_{i,j} : i < j} are independent,
- E[X_{i,j}] = 0 and Var[X_{i,j}] ≤ 1,
- max_{i,j} |X_{i,j}| ≤ √n.

Then ‖X‖ ≲ √(n log n).
Proof of Theorem 7.5

Consider the rescaled zero-mean matrix Ẽ := (√p / (µ/n)) E, where µ/n := max_{i,j}|M_{i,j}|; note that by incoherence,

    |M_{i,j}| = |e_i^T UΣV^T e_j| ≤ ‖U^T e_i‖₂ σ_1 ‖V^T e_j‖₂ ≲ µ_0 r σ_1 / n.

Then

    Var[Ẽ_{i,j}] = (p / (µ/n)²) · ((1 − p)/p) M_{i,j}² = (1 − p)(M_{i,j} / (µ/n))² ≤ 1,
    |Ẽ_{i,j}| ≤ (√p / (µ/n)) · |M_{i,j}| / p ≤ 1/√p ≤ √n  (when p ≥ 1/n).

Lemma 7.8 then tells us that if p ≳ log n / n,

    ‖Ẽ‖ ≲ √(n log n)  ⟹  ‖E‖ ≲ (µ / (n√p)) √(n log n) = µ √(log n / (np)).

This, together with Lemma 7.6 and the fact m ≍ pn², establishes Theorem 7.5.
Gradient methods: iterative refinements
Iterative methods: an overview

    minimize_{U,V} f(U, V) := ‖P_Ω(UV^T − M)‖_F².

Gradient descent (our focus):

    U_{t+1} = P_U[U_t − η_t ∇_U f(U_t, V_t)],
    V_{t+1} = P_V[V_t − η_t ∇_V f(U_t, V_t)],

where η_t is the step size and P_U, P_V denote Euclidean projections onto some constraint sets.

Alternating minimization: optimize over U and V alternately while fixing the other; each subproblem is convex:

    U_{t+1} = argmin_U f(U, V_t),   V_{t+1} = argmin_V f(U_{t+1}, V).
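Alternating minimization is easy to sketch for this asymmetric model: each half-step decouples into one small least-squares problem per row (a hedged sketch; the helper `als_step` and the SVD-based initialization are our choices, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(6)
n, r, p = 60, 2, 0.6

# Rank-r ground truth M = U* V*^T, entries observed w.p. p.
U_star = rng.standard_normal((n, r))
V_star = rng.standard_normal((n, r))
M = U_star @ V_star.T
mask = rng.random((n, n)) < p

def als_step(V, M, mask):
    """One alternating-minimization half-step: with V fixed, each row of
    the optimal U solves a small least-squares problem over the observed
    entries in the corresponding row of M."""
    U = np.zeros((M.shape[0], V.shape[1]))
    for j in range(M.shape[0]):
        obs = mask[j]
        U[j], *_ = np.linalg.lstsq(V[obs], M[j, obs], rcond=None)
    return U

# Initialize V from the top-r right singular vectors of P_Omega(M)/p.
_, s, Vt = np.linalg.svd(mask * M / p)
V = Vt[:r].T * np.sqrt(s[:r])
for _ in range(30):
    U = als_step(V, M, mask)          # update U with V fixed
    V = als_step(U, M.T, mask.T)      # update V with U fixed

rel_err = np.linalg.norm(U @ V.T - M, "fro") / np.linalg.norm(M, "fro")
```

Each subproblem is an ordinary least squares, so every half-step decreases the loss; in this noiseless, well-sampled setup the iterates recover M to high accuracy.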
Gradient descent for matrix completion

    minimize_{X ∈ R^{n×r}} f(X) = Σ_{(j,k) ∈ Ω} (e_j^T XX^T e_k − M_{j,k})²

Algorithm 7.2 (Gradient descent for MC)
Input: Y = [Y_{j,k}]_{1≤j,k≤n}, r, p.
Spectral initialization: let U^0 Σ^0 (U^0)^T be the rank-r eigendecomposition of M^0 := (1/p) P_Ω(Y), and set X^0 = U^0 (Σ^0)^{1/2}.
Gradient updates: for t = 0, 1, 2, ..., T − 1:

    X^{t+1} = X^t − η_t ∇f(X^t).
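A minimal NumPy rendering of Algorithm 7.2 for the PSD model M = X⋆X⋆^T (the step-size constant, iteration count, and symmetric sampling below are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, p = 100, 2, 0.5

X_star = rng.standard_normal((n, r))
M = X_star @ X_star.T
mask = rng.random((n, n)) < p
mask = mask | mask.T                       # symmetric sample set

Y = mask * M                               # observed entries P_Omega(M)

# Spectral initialization: rank-r eigendecomposition of P_Omega(Y)/p.
w, V = np.linalg.eigh(Y / p)
top = np.argsort(w)[::-1][:r]
X = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# Gradient updates on f(X) = sum_{(j,k) in Omega} ((XX^T - M)_{jk})^2;
# the gradient is 2 (R + R^T) X with R = P_Omega(XX^T - M).
eta = 0.1 / np.max(w[top])                 # step size on the order of 1/sigma_max
for _ in range(400):
    R = mask * (X @ X.T - M)
    X = X - eta * 2 * (R + R.T) @ X

rel_err = np.linalg.norm(X @ X.T - M, "fro") / np.linalg.norm(M, "fro")
```

In this noiseless, well-sampled regime vanilla gradient descent with spectral initialization converges linearly, in line with Theorem 7.9 on the next slide.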
Gradient descent for matrix completion: theory

Define the optimal transform from the t-th iterate X^t to X⋆ as

    Q^t := argmin_{R ∈ O^{r×r}} ‖X^t R − X⋆‖_F.

Theorem 7.9 (Ma et al., 2017). Suppose M = X⋆X⋆^T is rank-r, incoherent and well-conditioned. Vanilla GD (with spectral initialization) achieves

    ‖X^t Q^t − X⋆‖_F ≲ ρ^t µr (1/√(np)) ‖X⋆‖_F,
    ‖X^t Q^t − X⋆‖ ≲ ρ^t µr (1/√(np)) ‖X⋆‖,  (spectral)
    ‖X^t Q^t − X⋆‖_{2,∞} ≲ ρ^t µr √(log n / (np)) ‖X⋆‖_{2,∞},  (incoherence)

where ρ = 1 − σ_min η / 5 < 1, provided the step size satisfies η ≍ 1/σ_max and the sample complexity satisfies n²p ≳ µ³nr³ log³ n.

In short: linear convergence of X^t(X^t)^T to M in Frobenius, spectral and entrywise norms.
Numerical evidence for noiseless data

Figure 7.1: Relative error of X^t(X^t)^T (measured in ‖·‖_F, ‖·‖, ‖·‖_∞) vs. iteration count for matrix completion, where n = 1000, r = 10, p = 0.1, with a constant step size η_t.
Numerical evidence for noisy data

Set SNR := ‖M‖_F² / (n²σ²).

Figure 7.2: Squared relative error of the estimate X̂ (measured in ‖·‖_F, ‖·‖, ‖·‖_{2,∞}) and of M̂ = X̂X̂^T (measured in ‖·‖_∞) vs. SNR, where n = 500, r = 10, p = 0.1, with a constant step size η_t.
Restricted strong convexity and smoothness

Lemma 7.10 (restricted strong convexity and smoothness). Suppose that n²p ≥ Cκ²µrn log n for some constant C > 0. Then with high probability, the Hessian ∇²f(X) obeys

    vec(V)^T ∇²f(X) vec(V) ≥ (σ_min/2) ‖V‖_F²  (restricted strong convexity)
    ‖∇²f(X)‖ ≤ (5/2) σ_max  (smoothness)

for all X and all V = Y H_Y − Z with H_Y := argmin_{R ∈ O^{r×r}} ‖YR − Z‖_F, satisfying

    ‖X − X⋆‖_{2,∞} ≤ ε ‖X⋆‖_{2,∞}  (incoherence region),   ‖Z − X⋆‖ ≤ δ ‖X⋆‖,

where ε ≲ 1/√(κ³µr log² n) and δ ≲ 1/κ.
Linear convergence: induction I

Given the definition of Q^{t+1}, we have

    ‖X^{t+1} Q^{t+1} − X⋆‖_F ≤ ‖X^{t+1} Q^t − X⋆‖_F
      (i)  = ‖[X^t − η ∇f(X^t)] Q^t − X⋆‖_F
      (ii) = ‖X^t Q^t − η ∇f(X^t Q^t) − X⋆‖_F
      (iii)= ‖X^t Q^t − η ∇f(X^t Q^t) − (X⋆ − η ∇f(X⋆))‖_F =: α,

where (i) follows from the GD update rule, (ii) from the identity ∇f(X^t R) = ∇f(X^t) R for any R ∈ O^{r×r}, and (iii) from ∇f(X⋆) = 0.
Linear convergence: induction II

The fundamental theorem of calculus reveals

    vec(X^t Q^t − η ∇f(X^t Q^t) − (X⋆ − η ∇f(X⋆)))
      = vec(X^t Q^t − X⋆) − η vec(∇f(X^t Q^t) − ∇f(X⋆))
      = (I_{nr} − η A) vec(X^t Q^t − X⋆),   (7.1)

where A := ∫₀¹ ∇²f(X(τ)) dτ and X(τ) := X⋆ + τ(X^t Q^t − X⋆). Taking the squared Euclidean norm of both sides of (7.1) leads to

    α² = vec(X^t Q^t − X⋆)^T (I_{nr} − ηA)² vec(X^t Q^t − X⋆)
       ≤ ‖X^t Q^t − X⋆‖_F² + η² ‖A‖² ‖X^t Q^t − X⋆‖_F² − 2η vec(X^t Q^t − X⋆)^T A vec(X^t Q^t − X⋆).   (7.2)
Linear convergence: induction III

Based on the incoherence of X⋆ and X^t, for all τ ∈ [0, 1],

    ‖X(τ) − X⋆‖_{2,∞} ≤ ‖X^t Q^t − X⋆‖_{2,∞} ≤ Cµr √(log n / (np)) ‖X⋆‖_{2,∞}  (incoherence hypothesis).

Taking X = X(τ), Y = X^t and Z = X⋆ in Lemma 7.10, one can verify its assumptions given n²p ≳ κ³µ³r³n log³ n. Hence,

    vec(X^t Q^t − X⋆)^T A vec(X^t Q^t − X⋆) ≥ (σ_min/2) ‖X^t Q^t − X⋆‖_F²  and  ‖A‖ ≤ (5/2) σ_max.
Linear convergence: induction IV

Substituting these two inequalities into (7.2) yields

    α² ≤ (1 + (25/4) η² σ_max² − σ_min η) ‖X^t Q^t − X⋆‖_F² ≤ (1 − (σ_min/2) η) ‖X^t Q^t − X⋆‖_F²

as long as 0 < η ≤ (2σ_min)/(25σ_max²), which further implies that

    α ≤ (1 − (σ_min/4) η) ‖X^t Q^t − X⋆‖_F.

The incoherence hypothesis is important for fast convergence: the fact that X^t stays incoherent throughout the execution is called implicit regularization, and can be established by a leave-one-out analysis [Ma et al., 2017].
References

[1] R. Sun and Z.-Q. Luo, "Guaranteed matrix completion via non-convex factorization," IEEE Transactions on Information Theory, 2016.
[2] C. Davis and W. Kahan, "The rotation of eigenvectors by a perturbation," SIAM Journal on Numerical Analysis, 1970.
[3] R. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," IEEE Transactions on Information Theory, 2010.
[4] Y. Chen and M. Wainwright, "Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees," arXiv preprint, 2015.
[5] C. Ma, K. Wang, Y. Chi, and Y. Chen, "Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution," arXiv preprint, 2017.
[6] R. Ge, C. Jin, and Y. Zheng, "No spurious local minima in nonconvex low rank problems: A unified geometric analysis," ICML, 2017.
[7] X. Li et al., "Symmetry, saddle points, and global optimization landscape of nonconvex matrix factorization," arXiv preprint, 2016.
[8] T. Tao, Topics in Random Matrix Theory, American Mathematical Society, 2012.
[9] Y. Chen and Y. Chi, "Harnessing structures in big data via guaranteed low-rank matrix estimation," arXiv preprint, 2018.
[10] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, "How to escape saddle points efficiently," arXiv preprint, 2017.
More informationSingular Value Decomposition
Chapter 6 Singular Value Decomposition In Chapter 5, we derived a number of algorithms for computing the eigenvalues and eigenvectors of matrices A R n n. Having developed this machinery, we complete our
More informationSparse Parameter Estimation: Compressed Sensing meets Matrix Pencil
Sparse Parameter Estimation: Compressed Sensing meets Matrix Pencil Yuejie Chi Departments of ECE and BMI The Ohio State University Colorado School of Mines December 9, 24 Page Acknowledgement Joint work
More informationGuaranteed Rank Minimization via Singular Value Projection: Supplementary Material
Guaranteed Rank Minimization via Singular Value Projection: Supplementary Material Prateek Jain Microsoft Research Bangalore Bangalore, India prajain@microsoft.com Raghu Meka UT Austin Dept. of Computer
More informationConditions for Robust Principal Component Analysis
Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and
More informationarxiv: v3 [math.oc] 19 Oct 2017
Gradient descent with nonconvex constraints: local concavity determines convergence Rina Foygel Barber and Wooseok Ha arxiv:703.07755v3 [math.oc] 9 Oct 207 0.7.7 Abstract Many problems in high-dimensional
More informationMinimizing the Difference of L 1 and L 2 Norms with Applications
1/36 Minimizing the Difference of L 1 and L 2 Norms with Department of Mathematical Sciences University of Texas Dallas May 31, 2017 Partially supported by NSF DMS 1522786 2/36 Outline 1 A nonconvex approach:
More informationMatrix completion: Fundamental limits and efficient algorithms. Sewoong Oh Stanford University
Matrix completion: Fundamental limits and efficient algorithms Sewoong Oh Stanford University 1 / 35 Low-rank matrix completion Low-rank Data Matrix Sparse Sampled Matrix Complete the matrix from small
More informationAnalysis of Greedy Algorithms
Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm
More informationOverparametrization for Landscape Design in Non-convex Optimization
Overparametrization for Landscape Design in Non-convex Optimization Jason D. Lee University of Southern California September 19, 2018 The State of Non-Convex Optimization Practical observation: Empirically,
More informationarxiv: v1 [stat.ml] 1 Mar 2015
Matrix Completion with Noisy Entries and Outliers Raymond K. W. Wong 1 and Thomas C. M. Lee 2 arxiv:1503.00214v1 [stat.ml] 1 Mar 2015 1 Department of Statistics, Iowa State University 2 Department of Statistics,
More informationMatrix Completion from Noisy Entries
Matrix Completion from Noisy Entries Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh Abstract Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations
More informationFast and Robust Phase Retrieval
Fast and Robust Phase Retrieval Aditya Viswanathan aditya@math.msu.edu CCAM Lunch Seminar Purdue University April 18 2014 0 / 27 Joint work with Yang Wang Mark Iwen Research supported in part by National
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple
More informationGI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil
GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection
More informationResearch Statement Qing Qu
Qing Qu Research Statement 1/4 Qing Qu (qq2105@columbia.edu) Today we are living in an era of information explosion. As the sensors and sensing modalities proliferate, our world is inundated by unprecedented
More informationExact Low-rank Matrix Recovery via Nonconvex M p -Minimization
Exact Low-rank Matrix Recovery via Nonconvex M p -Minimization Lingchen Kong and Naihua Xiu Department of Applied Mathematics, Beijing Jiaotong University, Beijing, 100044, People s Republic of China E-mail:
More informationDeep Linear Networks with Arbitrary Loss: All Local Minima Are Global
homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima
More informationNew Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit
New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence
More informationEE 381V: Large Scale Learning Spring Lecture 16 March 7
EE 381V: Large Scale Learning Spring 2013 Lecture 16 March 7 Lecturer: Caramanis & Sanghavi Scribe: Tianyang Bai 16.1 Topics Covered In this lecture, we introduced one method of matrix completion via SVD-based
More informationCOMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017
COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY
More informationCSC 576: Variants of Sparse Learning
CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in
More informationTighter Low-rank Approximation via Sampling the Leveraged Element
Tighter Low-rank Approximation via Sampling the Leveraged Element Srinadh Bhojanapalli The University of Texas at Austin bsrinadh@utexas.edu Prateek Jain Microsoft Research, India prajain@microsoft.com
More informationIV. Matrix Approximation using Least-Squares
IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that
More informationLecture 9: Numerical Linear Algebra Primer (February 11st)
10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template
More informationSparse and Low Rank Recovery via Null Space Properties
Sparse and Low Rank Recovery via Null Space Properties Holger Rauhut Lehrstuhl C für Mathematik (Analysis), RWTH Aachen Convexity, probability and discrete structures, a geometric viewpoint Marne-la-Vallée,
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationDense Error Correction for Low-Rank Matrices via Principal Component Pursuit
Dense Error Correction for Low-Rank Matrices via Principal Component Pursuit Arvind Ganesh, John Wright, Xiaodong Li, Emmanuel J. Candès, and Yi Ma, Microsoft Research Asia, Beijing, P.R.C Dept. of Electrical
More informationIncremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method
Incremental Reshaped Wirtinger Flow and Its Connection to Kaczmarz Method Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yingbin Liang Department of EECS Syracuse
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional
More informationLecture Notes 10: Matrix Factorization
Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To
More informationOnline Generalized Eigenvalue Decomposition: Primal Dual Geometry and Inverse-Free Stochastic Optimization
Online Generalized Eigenvalue Decomposition: Primal Dual Geometry and Inverse-Free Stochastic Optimization Xingguo Li Zhehui Chen Lin Yang Jarvis Haupt Tuo Zhao University of Minnesota Princeton University
More informationData Mining Lecture 4: Covariance, EVD, PCA & SVD
Data Mining Lecture 4: Covariance, EVD, PCA & SVD Jo Houghton ECS Southampton February 25, 2019 1 / 28 Variance and Covariance - Expectation A random variable takes on different values due to chance The
More informationRecent Developments in Compressed Sensing
Recent Developments in Compressed Sensing M. Vidyasagar Distinguished Professor, IIT Hyderabad m.vidyasagar@iith.ac.in, www.iith.ac.in/ m vidyasagar/ ISL Seminar, Stanford University, 19 April 2018 Outline
More informationLecture Notes 9: Constrained Optimization
Optimization-based data analysis Fall 017 Lecture Notes 9: Constrained Optimization 1 Compressed sensing 1.1 Underdetermined linear inverse problems Linear inverse problems model measurements of the form
More informationFunctional Analysis Review
Outline 9.520: Statistical Learning Theory and Applications February 8, 2010 Outline 1 2 3 4 Vector Space Outline A vector space is a set V with binary operations +: V V V and : R V V such that for all
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More information