ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference


ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
Low-rank matrix recovery via convex relaxations
Yuejie Chi, Department of Electrical and Computer Engineering
Spring 2018

Outline
- Low-rank matrix completion and recovery
- Nuclear norm minimization (this lecture)
  - RIP and low-rank matrix recovery
  - Matrix completion
  - Algorithms for nuclear norm minimization
- Non-convex methods (next lecture)
  - Spectral methods
  - (Projected) gradient descent

Low-rank matrix completion and recovery: motivation

Motivation 1: recommendation systems
Netflix challenge: Netflix provides highly incomplete ratings from about 0.5 million users for 17,770 movies. How to predict unseen user ratings for movies?

In general, we cannot infer missing ratings: the system is underdetermined (more unknowns than observations).

... unless the rating matrix has other structure: a few factors explain most of the data, i.e. the rating matrix admits an accurate low-rank approximation. How to exploit the (approximately) low-rank structure in prediction?

Motivation 2: sensor localization
$n$ sensors / points $x_j \in \mathbb{R}^3$, $j = 1, \ldots, n$. Observe partial information about pairwise distances
$$D_{i,j} = \|x_i - x_j\|^2 = \|x_i\|^2 + \|x_j\|^2 - 2 x_i^\top x_j$$
Want to infer the distance between every pair of nodes.

Motivation 2: sensor localization
Introduce
$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix} \in \mathbb{R}^{n\times 3},$$
then the distance matrix $D = [D_{i,j}]_{1\le i,j\le n}$ can be written as
$$D = d^2 e^\top + e\,(d^2)^\top - 2XX^\top \qquad \text{(low rank)},$$
where $d^2 := [\|x_1\|^2, \ldots, \|x_n\|^2]^\top$. Hence $\mathrm{rank}(D) \ll n$ (in fact $\mathrm{rank}(D) \le 5$), and the task becomes low-rank matrix completion.
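
To make the low-rank structure concrete, here is a minimal numpy sketch (not from the lecture; dimensions and variable names are illustrative) that builds the squared-distance matrix from random sensor positions and checks that its rank stays small:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.standard_normal((n, 3))           # sensor positions x_1, ..., x_n in R^3

d2 = np.sum(X**2, axis=1, keepdims=True)  # column vector of squared norms ||x_i||^2
e = np.ones((n, 1))

# D = d^2 e^T + e (d^2)^T - 2 X X^T  gives the squared pairwise distances
D = d2 @ e.T + e @ d2.T - 2 * X @ X.T

print(np.linalg.matrix_rank(D))           # at most 1 + 1 + 3 = 5, far smaller than n
```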

Motivation 3: structure from motion
Structure from motion: reconstruct 3D scene geometry (structure) and camera poses (motion) from multiple images taken from unknown camera viewpoints. Given multiple images and a few correspondences between image features, how to estimate the locations of the 3D points?

Motivation 3: structure from motion
Tomasi and Kanade's factorization: consider $n$ 3D points in $m$ different 2D frames, with $x_{i,j} \in \mathbb{R}^{2\times 1}$ the location of the $j$th point in the $i$th frame:
$$x_{i,j} = P_i s_j,$$
where $P_i \in \mathbb{R}^{2\times 3}$ is a projection matrix and $s_j \in \mathbb{R}^{3}$ is the 3D position. The matrix of all 2D locations admits a low-rank factorization with $\mathrm{rank}(X) = 3$:
$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix} = \begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} \begin{bmatrix} s_1 & \cdots & s_n \end{bmatrix} \in \mathbb{R}^{2m\times n}$$
Due to occlusions, there are many missing entries in the matrix $X$. Goal: can we complete the missing entries?

Motivation 4: missing phase problem
Detectors record intensities of diffracted rays: electric field $x(t_1, t_2)$ $\xrightarrow{\text{Fourier transform}}$ $\hat{x}(f_1, f_2)$. (Fig. credit: Stanford SLAC)
Intensity of the electric field:
$$|\hat{x}(f_1, f_2)|^2 = \left| \int x(t_1, t_2)\, e^{-i2\pi(f_1 t_1 + f_2 t_2)}\, \mathrm{d}t_1\, \mathrm{d}t_2 \right|^2$$
Phase retrieval: recover the signal $x(t_1, t_2)$ from the intensity $|\hat{x}(f_1, f_2)|^2$.

A discrete-time model: solving quadratic systems
[figure: a small numerical example showing $A$, $x$, $Ax$, and $y = |Ax|^2$]
Solve for $x \in \mathbb{C}^n$ from $m$ quadratic equations
$$y_k = |\langle a_k, x\rangle|^2, \quad k = 1, \ldots, m, \qquad \text{or} \qquad y = |Ax|^2,$$
where $|z|^2 := \{|z_1|^2, \ldots, |z_m|^2\}$.

An equivalent view: low-rank factorization
Lifting: introduce $X = xx^*$ to linearize the constraints:
$$y_k = |a_k^* x|^2 = a_k^* (xx^*) a_k \quad \Longrightarrow \quad y_k = a_k^* X a_k \tag{6.1}$$
$$\text{find} \quad X \succeq 0 \quad \text{s.t.} \quad y_k = \langle a_k a_k^*, X\rangle, \ k = 1, \ldots, m, \qquad \mathrm{rank}(X) = 1$$
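
A quick numerical check of the lifting identity (6.1), as an illustrative sketch with made-up dimensions: for $X = xx^*$, each quadratic measurement becomes linear in $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 20
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)           # unknown signal in C^n
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))  # k-th row stores a_k^*

y = np.abs(A @ x) ** 2                    # quadratic measurements y_k = |a_k^* x|^2

X_lift = np.outer(x, x.conj())            # lifted variable X = x x^*
y_lin = np.real(np.einsum('ki,ij,kj->k', A, X_lift, A.conj()))      # a_k^* X a_k

print(np.allclose(y, y_lin))              # True: the measurements are linear in X
```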

The list continues
System identification and time series analysis; spatio-temporal data (low-rank due to correlations), e.g. MRI video, network traffic, ...; face recognition; quantum state tomography; community detection; ...

Low-rank matrix completion and recovery: setup and algorithms

Setup
Consider $M \in \mathbb{R}^{n\times n}$ (square case for simplicity) with $\mathrm{rank}(M) = r \ll n$. The thin singular value decomposition (SVD) of $M$ is
$$M = U\Sigma V^\top = \sum_{i=1}^r \sigma_i u_i v_i^\top \qquad \big((2n-r)r \text{ degrees of freedom}\big)$$
where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ contains all singular values $\{\sigma_i\}$, and $U := [u_1, \ldots, u_r]$, $V := [v_1, \ldots, v_r]$ consist of the singular vectors.

Low-rank matrix completion
Observed entries: $M_{i,j}$, $(i,j) \in \Omega$, where $\Omega$ is the sampling set. Define the operator $\mathcal{P}_\Omega$, the orthogonal projection onto the subspace of matrices supported on $\Omega$:
$$[\mathcal{P}_\Omega(X)]_{i,j} = \begin{cases} X_{i,j}, & (i,j) \in \Omega \\ 0, & \text{otherwise} \end{cases}$$
Completion via rank minimization:
$$\underset{X}{\text{minimize}} \quad \mathrm{rank}(X) \qquad \text{s.t.} \quad \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M)$$
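
As a small illustration (not from the slides), $\mathcal{P}_\Omega$ can be represented by a boolean mask; the sketch below applies it and checks that it is an (idempotent) orthogonal projection.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 0.4
M = rng.standard_normal((n, 2)) @ rng.standard_normal((2, n))  # rank-2 ground truth

Omega = rng.random((n, n)) < p         # each entry observed independently w.p. p

def P_Omega(X, Omega):
    """Orthogonal projection onto the subspace of matrices supported on Omega."""
    return np.where(Omega, X, 0.0)

Y = P_Omega(M, Omega)                  # observed data
print(np.allclose(P_Omega(Y, Omega), Y))   # idempotent: P_Omega(P_Omega(M)) = P_Omega(M)
```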

More general: low-rank matrix recovery
Linear measurements: $y_i = \langle A_i, M\rangle = \mathrm{Tr}(A_i^\top M)$, $i = 1, \ldots, m$. In operator form, with $\mathcal{A} : \mathbb{R}^{n\times n} \to \mathbb{R}^m$:
$$y = \mathcal{A}(M) := \begin{bmatrix} \langle A_1, M\rangle \\ \vdots \\ \langle A_m, M\rangle \end{bmatrix}$$
Recovery via rank minimization:
$$\underset{X}{\text{minimize}} \quad \mathrm{rank}(X) \qquad \text{s.t.} \quad y = \mathcal{A}(X)$$

Nuclear norm minimization

Convex relaxation
Low-rank matrix completion:
$$\underset{X}{\text{minimize}} \quad \underbrace{\mathrm{rank}(X)}_{\text{nonconvex}} \qquad \text{s.t.} \quad \mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M)$$
Low-rank matrix recovery:
$$\underset{X}{\text{minimize}} \quad \underbrace{\mathrm{rank}(X)}_{\text{nonconvex}} \qquad \text{s.t.} \quad \mathcal{A}(X) = \mathcal{A}(M)$$
Question: what is a convex surrogate for $\mathrm{rank}(\cdot)$?

Analogy with sparse recovery
For a given matrix $X \in \mathbb{R}^{n\times n}$, take the full SVD of $X$:
$$X = \hat{U}\hat{\Sigma}\hat{V}^\top = [u_1, \ldots, u_n]\,\mathrm{diag}(\sigma)\,[v_1, \ldots, v_n]^\top.$$
Then
$$\mathrm{rank}(X) = \sum_{i=1}^n \mathbb{1}\{\sigma_i \neq 0\} = \|\sigma\|_0.$$
Convex relaxation: from $\|\sigma\|_0$ to $\|\sigma\|_1$?

Nuclear norm
Definition 6.1. The nuclear norm of $X$ is
$$\|X\|_* := \sum_{i=1}^n \sigma_i(X),$$
where $\sigma_i(X)$ is the $i$th largest singular value. The nuclear norm is the counterpart of the $\ell_1$ norm for the rank function.
Equivalence among different norms ($r = \mathrm{rank}(X)$):
$$\|X\| \le \|X\|_F \le \|X\|_* \le \sqrt{r}\,\|X\|_F \le r\,\|X\|,$$
where $\|X\| = \sigma_1(X)$ and $\|X\|_F = \big(\sum_{i=1}^n \sigma_i^2(X)\big)^{1/2}$.
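
A small numpy check of Definition 6.1 and the chain of norm equivalences, as an illustrative sketch on a random rank-$r$ matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 30, 4
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r matrix

s = np.linalg.svd(X, compute_uv=False)          # singular values, sorted descending
spec, fro, nuc = s[0], np.sqrt(np.sum(s**2)), np.sum(s)

# ||X|| <= ||X||_F <= ||X||_* <= sqrt(r) ||X||_F <= r ||X||
print(spec <= fro <= nuc <= np.sqrt(r) * fro <= r * spec)      # True
```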

Tightness of relaxation
Recall: the $\ell_1$ norm ball is the convex hull of 1-sparse, unit-norm vectors. [(a) $\ell_1$ norm ball; (b) nuclear norm ball]
Fact 6.2. The nuclear norm ball $\{X : \|X\|_* \le 1\}$ is the convex hull of the rank-1, unit-norm matrices, i.e. matrices $uv^\top$ obeying $\|uv^\top\| = 1$.

Additivity of nuclear norm
Fact 6.3. Let $A$ and $B$ be matrices of the same dimensions. If $AB^\top = 0$ and $A^\top B = 0$, then $\|A + B\|_* = \|A\|_* + \|B\|_*$. Equivalently: if the row (resp. column) spaces of $A$ and $B$ are orthogonal, then $\|A + B\|_* = \|A\|_* + \|B\|_*$.
Similar to the $\ell_1$ norm: when $x$ and $y$ have disjoint supports, $\|x + y\|_1 = \|x\|_1 + \|y\|_1$, which is a key to studying $\ell_1$ minimization under RIP.

Proof of Fact 6.3
Suppose $A = U_A\Sigma_A V_A^\top$ and $B = U_B\Sigma_B V_B^\top$, which gives
$$AB^\top = 0 \iff V_A^\top V_B = 0, \qquad A^\top B = 0 \iff U_A^\top U_B = 0.$$
Thus, one can write
$$A = [U_A, U_B, U_C] \begin{bmatrix}\Sigma_A & & \\ & 0 & \\ & & 0\end{bmatrix} [V_A, V_B, V_C]^\top, \qquad B = [U_A, U_B, U_C] \begin{bmatrix}0 & & \\ & \Sigma_B & \\ & & 0\end{bmatrix} [V_A, V_B, V_C]^\top,$$
and hence
$$\|A + B\|_* = \left\| [U_A, U_B] \begin{bmatrix}\Sigma_A & \\ & \Sigma_B\end{bmatrix} [V_A, V_B]^\top \right\|_* = \|A\|_* + \|B\|_*.$$

Dual norm
Definition 6.4 (Dual norm). For a given norm $\|\cdot\|_A$, the dual norm is defined as
$$\|X\|_{A^*} := \max\{\langle X, Y\rangle : \|Y\|_A \le 1\}.$$
Dual pairs: the $\ell_1$ norm is dual to the $\ell_\infty$ norm; the $\ell_2$ norm is dual to the $\ell_2$ norm; the Frobenius norm is dual to the Frobenius norm; the nuclear norm is dual to the spectral norm.

Schur complement
Given a block matrix
$$D = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} \in \mathbb{R}^{(p+q)\times(p+q)},$$
where $A \in \mathbb{R}^{p\times p}$, $B \in \mathbb{R}^{p\times q}$ and $C \in \mathbb{R}^{q\times q}$, the Schur complement of the block $A$ (assumed invertible) in $D$ is given by $C - B^\top A^{-1}B$.
Fact 6.5. $D \succeq 0 \iff A \succeq 0,\ C - B^\top A^{-1}B \succeq 0$.
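
A quick numerical sanity check of Fact 6.5 (an illustrative sketch; the random test matrices are my own):

```python
import numpy as np

rng = np.random.default_rng(7)
p, q = 4, 3
A = rng.standard_normal((p, p)); A = A @ A.T + np.eye(p)   # make A positive definite
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, q)); C = C @ C.T + 5 * np.eye(q)

D = np.block([[A, B], [B.T, C]])
schur = C - B.T @ np.linalg.inv(A) @ B

is_psd = lambda X: np.all(np.linalg.eigvalsh(X) >= -1e-9)
print(is_psd(D) == (is_psd(A) and is_psd(schur)))          # the two sides of Fact 6.5 agree
```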

Representing nuclear norm via SDP
Since the spectral norm is the dual norm of the nuclear norm,
$$\|X\|_* = \max\{\langle X, Y\rangle : \|Y\| \le 1\}.$$
The constraint $\|Y\| \le 1$ is equivalent to $Y^\top Y \preceq I$, which by the Schur complement is equivalent to $\begin{bmatrix} I & Y \\ Y^\top & I\end{bmatrix} \succeq 0$.
Fact 6.6.
$$\|X\|_* = \max_{Y}\left\{ \langle X, Y\rangle \ \middle|\ \begin{bmatrix} I & Y \\ Y^\top & I \end{bmatrix} \succeq 0 \right\}$$
Fact 6.7 (Dual characterization).
$$\|X\|_* = \min_{W_1, W_2}\left\{ \frac{1}{2}\mathrm{Tr}(W_1) + \frac{1}{2}\mathrm{Tr}(W_2) \ \middle|\ \begin{bmatrix} W_1 & X \\ X^\top & W_2 \end{bmatrix} \succeq 0 \right\}$$
Optimal point: $W_1 = U\Sigma U^\top$, $W_2 = V\Sigma V^\top$ (where $X = U\Sigma V^\top$).

Aside: dual of semidefinite program
$$\text{(primal)} \quad \underset{X}{\text{minimize}}\ \langle C, X\rangle \quad \text{s.t.}\ \langle A_i, X\rangle = b_i,\ 1 \le i \le m,\ X \succeq 0$$
$$\text{(dual)} \quad \underset{y}{\text{maximize}}\ b^\top y \quad \text{s.t.}\ \sum_{i=1}^m y_i A_i + S = C,\ S \succeq 0$$
Exercise: use this to verify Fact 6.7.

Nuclear norm minimization via SDP
Nuclear norm minimization:
$$\hat{M} = \underset{X}{\operatorname{argmin}} \ \|X\|_* \quad \text{s.t.} \quad y = \mathcal{A}(X)$$
This is solvable via SDP:
$$\underset{X, W_1, W_2}{\text{minimize}} \ \frac{1}{2}\mathrm{Tr}(W_1) + \frac{1}{2}\mathrm{Tr}(W_2) \quad \text{s.t.} \quad y = \mathcal{A}(X), \ \begin{bmatrix} W_1 & X \\ X^\top & W_2\end{bmatrix} \succeq 0$$
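
As a prototype (assuming the cvxpy package, which is not part of the lecture), the nuclear norm minimization problem can be written down directly; cvxpy reformulates the nuclear-norm objective into an SDP of the form above under the hood. The sketch specializes the constraint $y = \mathcal{A}(X)$ to the matrix completion case $\mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(M)$ for concreteness.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, r, p = 20, 2, 0.5
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-r ground truth
mask = (rng.random((n, n)) < p).astype(float)                   # sampling set Omega

X = cp.Variable((n, n))
constraints = [cp.multiply(mask, X) == cp.multiply(mask, M)]    # P_Omega(X) = P_Omega(M)
prob = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
prob.solve()

print(np.linalg.norm(X.value - M, 'fro') / np.linalg.norm(M, 'fro'))  # small on success
```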

Rank minimization vs. cardinality minimization (Fig. credit: Fazel et al. '10)

Proximal algorithm
In the presence of noise, one needs to solve
$$\underset{X}{\text{minimize}} \quad \frac{1}{2}\|y - \mathcal{A}(X)\|_2^2 + \lambda\|X\|_*,$$
which can be solved via proximal methods.
Algorithm 6.1 (Proximal gradient method). For $t = 0, 1, \ldots$:
$$X^{t+1} = \mathrm{prox}_{\mu_t\lambda}\Big(X^t - \mu_t\,\mathcal{A}^*\big(\mathcal{A}(X^t) - y\big)\Big)$$
where $\mu_t$ is the step size / learning rate.

Proximal operator for nuclear norm
$$\mathrm{prox}_{\lambda}(X) = \underset{Z}{\operatorname{argmin}}\left\{\frac{1}{2}\|Z - X\|_F^2 + \lambda\|Z\|_*\right\} = U\,\mathcal{T}_\lambda(\Sigma)\,V^\top$$
where the SVD of $X$ is $X = U\Sigma V^\top$ with $\Sigma = \mathrm{diag}(\{\sigma_i\})$, and $\mathcal{T}_\lambda(\Sigma) = \mathrm{diag}(\{(\sigma_i - \lambda)_+\})$ (singular value soft-thresholding).
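
A minimal numpy implementation of this singular value soft-thresholding operator (an illustrative sketch, with a hypothetical function name):

```python
import numpy as np

def prox_nuclear(X, lam):
    """prox_{lam ||.||_*}(X): soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```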

Accelerated proximal gradient
Algorithm 6.2 (Accelerated proximal gradient method). For $t = 0, 1, \ldots$:
$$X^{t+1} = \mathrm{prox}_{\mu_t\lambda}\Big(Z^t - \mu_t\,\mathcal{A}^*\big(\mathcal{A}(Z^t) - y\big)\Big)$$
$$Z^{t+1} = X^{t+1} + \underbrace{\alpha_t\big(X^{t+1} - X^t\big)}_{\text{momentum term}}$$
where $\mu_t$ is the step size / learning rate and $\alpha_t$ is the momentum parameter. Convergence rate: $O\big(\tfrac{1}{\sqrt{\epsilon}}\big)$ iterations to reach $\epsilon$-accuracy. The per-iteration cost is a (partial) SVD.
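
Below is a hedged sketch of Algorithm 6.2 specialized to matrix completion ($\mathcal{A} = \mathcal{P}_\Omega$, so a constant step size $\mu_t = 1$ is valid); the momentum schedule $\alpha_t = t/(t+3)$ is a common FISTA-style choice, not something prescribed by the slides.

```python
import numpy as np

def prox_nuc(X, lam):
    """Singular value soft-thresholding (prox of lam * nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def apg_complete(Y, mask, lam, mu=1.0, iters=300):
    """min_X 0.5 * ||P_Omega(X) - Y||_F^2 + lam * ||X||_*  via accelerated proximal gradient."""
    X = np.zeros_like(Y)
    Z = X.copy()
    for t in range(iters):
        grad = mask * (Z - Y)                        # A^*(A(Z) - y) for A = P_Omega
        X_new = prox_nuc(Z - mu * grad, mu * lam)    # proximal step
        Z = X_new + (t / (t + 3.0)) * (X_new - X)    # momentum extrapolation
        X = X_new
    return X

# usage: X_hat = apg_complete(mask * M, mask, lam=0.1)
```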

Frank-Wolfe for nuclear norm minimization
Consider the constrained problem
$$\underset{X}{\text{minimize}} \quad \frac{1}{2}\|y - \mathcal{A}(X)\|_2^2 \quad \text{s.t.} \quad \|X\|_* \le \tau,$$
which can be solved via the conditional gradient method (Frank-Wolfe, 1956). More generally, consider the problem
$$\underset{\beta}{\text{minimize}} \quad f(\beta) \quad \text{s.t.} \quad \beta \in \mathcal{C},$$
where $f$ is smooth and convex, and $\mathcal{C}$ is a convex set. Recall projected gradient descent:
$$\beta^{t+1} = \mathcal{P}_\mathcal{C}\big(\beta^t - \mu_t\nabla f(\beta^t)\big)$$

Conditional gradient method
Algorithm 6.3 (Frank-Wolfe). For $t = 0, 1, \ldots$:
$$s^t = \underset{s \in \mathcal{C}}{\operatorname{argmin}}\ \nabla f(\beta^t)^\top s, \qquad \beta^{t+1} = (1 - \gamma_t)\beta^t + \gamma_t s^t,$$
where $\gamma_t := 2/(t+2)$ is the (default) step size. The first step is a constrained minimization of a linear approximation of $f$ at $\beta^t$; the second step controls how much we move towards $s^t$.

Figure illustration (Figure credit: Jaggi 2011)

Frank-Wolfe for nuclear norm minimization
Algorithm 6.4 (Frank-Wolfe for nuclear norm minimization). For $t = 0, 1, \ldots$:
$$S^t = \underset{\|S\|_* \le \tau}{\operatorname{argmin}}\ \langle\nabla f(X^t), S\rangle, \qquad X^{t+1} = (1 - \gamma_t)X^t + \gamma_t S^t,$$
where $\gamma_t := 2/(t+2)$ is the (default) step size. (Homework) Note that $\nabla f(X^t) = \mathcal{A}^*\big(\mathcal{A}(X^t) - y\big)$, and
$$S^t = \tau \underset{\|S\|_* \le 1}{\operatorname{argmin}}\ \langle\nabla f(X^t), S\rangle = -\tau\, uv^\top,$$
where $u, v$ are the top left and right singular vectors of $\nabla f(X^t)$.
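
A sketch of Algorithm 6.4 for matrix completion ($\mathcal{A} = \mathcal{P}_\Omega$), using scipy's partial SVD to obtain the top singular pair of the gradient; the concrete choices (zero initialization, iteration count) are mine, not the slides'.

```python
import numpy as np
from scipy.sparse.linalg import svds

def frank_wolfe_complete(Y, mask, tau, iters=500):
    """min_X 0.5 * ||P_Omega(X) - Y||_F^2  s.t.  ||X||_* <= tau  (Frank-Wolfe)."""
    X = np.zeros_like(Y)
    for t in range(iters):
        grad = mask * (X - Y)                  # gradient of the smooth objective
        u, s, vt = svds(grad, k=1)             # top singular pair of the gradient
        S = -tau * np.outer(u[:, 0], vt[0])    # linear minimizer over the nuclear norm ball
        gamma = 2.0 / (t + 2.0)                # default step size
        X = (1 - gamma) * X + gamma * S        # rank-1 update
    return X
```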

Further comments on Frank-Wolfe
- Extremely low per-iteration cost (only the top singular vectors are needed);
- every iteration is a rank-1 update;
- convergence rate: $O(\tfrac{1}{\epsilon})$ iterations to reach $\epsilon$-accuracy, which can be very slow.
Various ways to speed it up; an active research area.

RIP and low-rank matrix recovery

RIP for low-rank matrices
Almost parallel results to compressed sensing...
Definition 6.8. The $r$-restricted isometry constants $\delta_r^{\mathrm{ub}}(\mathcal{A})$ and $\delta_r^{\mathrm{lb}}(\mathcal{A})$ are the smallest quantities such that
$$(1 - \delta_r^{\mathrm{lb}})\|X\|_F \le \|\mathcal{A}(X)\|_2 \le (1 + \delta_r^{\mathrm{ub}})\|X\|_F, \qquad \forall X : \mathrm{rank}(X) \le r.$$
(One can also define RIP w.r.t. $\|\cdot\|_F^2$ rather than $\|\cdot\|_F$.)

RIP and low-rank matrix recovery
Theorem 6.9 (Recht, Fazel, Parrilo '10; Candes, Plan '11). Suppose $\mathrm{rank}(M) = r$. For any fixed integer $K > 0$, if
$$\frac{1 + \delta_{Kr}^{\mathrm{ub}}}{1 - \delta_{(2+K)r}^{\mathrm{lb}}} < \sqrt{\frac{K}{2}},$$
then nuclear norm minimization is exact. This can easily be extended to account for noise and imperfect structural assumptions.

Geometry of nuclear norm ball
Level set of the nuclear norm ball: $\left\|\begin{bmatrix} x & y \\ y & z\end{bmatrix}\right\|_* \le 1$. (Fig. credit: Candes '14)

Some notation
Recall $M = U\Sigma V^\top$. Let $T$ be the span of matrices of the form (called the tangent space)
$$T = \{UX^\top + YV^\top : X, Y \in \mathbb{R}^{n\times r}\}.$$
Let $\mathcal{P}_T$ be the orthogonal projection onto $T$:
$$\mathcal{P}_T(X) = UU^\top X + XVV^\top - UU^\top XVV^\top.$$
Its complement $\mathcal{P}_{T^\perp} = \mathcal{I} - \mathcal{P}_T$:
$$\mathcal{P}_{T^\perp}(X) = (I - UU^\top)X(I - VV^\top),$$
so that $M\,\mathcal{P}_{T^\perp}(X)^\top = 0$ and $M^\top\mathcal{P}_{T^\perp}(X) = 0$.
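
A small numpy sketch (illustrative) of $\mathcal{P}_T$ and $\mathcal{P}_{T^\perp}$, verifying that they decompose an arbitrary matrix and that $M$ is annihilated as claimed:

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 12, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, V = U[:, :r], Vt[:r].T

def P_T(X):        # projection onto the tangent space T
    return U @ U.T @ X + X @ V @ V.T - U @ U.T @ X @ V @ V.T

def P_Tperp(X):    # projection onto the orthogonal complement of T
    return (np.eye(n) - U @ U.T) @ X @ (np.eye(n) - V @ V.T)

H = rng.standard_normal((n, n))
print(np.allclose(P_T(H) + P_Tperp(H), H))                                   # T and T-perp decompose R^{n x n}
print(np.allclose(M @ P_Tperp(H).T, 0) and np.allclose(M.T @ P_Tperp(H), 0)) # M is annihilated
```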

Proof of Theorem 6.9
Suppose $X = M + H$ is feasible and obeys $\|M + H\|_* \le \|M\|_*$. The goal is to show that $H = 0$ under RIP. The key is to decompose $H$ into
$$H = H_0 + \underbrace{H_1 + H_2 + \cdots}_{H_c}$$
where
- $H_0 = \mathcal{P}_T(H)$ (rank at most $2r$);
- $H_c = \mathcal{P}_{T^\perp}(H)$ (obeying $MH_c^\top = 0$ and $M^\top H_c = 0$);
- $H_1$: best rank-$(Kr)$ approximation of $H_c$;
- $H_2$: best rank-$(Kr)$ approximation of $H_c - H_1$; ... ($K$ is a constant).

Proof of Theorem 6.9
Informally, the proof proceeds by showing that
1. $H_0$ dominates $\sum_{i\ge2} H_i$ (by the objective function);
2. (converse) $\sum_{i\ge2} H_i$ dominates $H_0 + H_1$ (by RIP + feasibility).
These can happen simultaneously only when $H = 0$.

Proof of Theorem 6.9
Step 1 (which does not rely on RIP). Show that
$$\sum_{j\ge2}\|H_j\|_F \le \|H_0\|_*/\sqrt{Kr}. \tag{6.2}$$
This follows immediately by combining the following two observations:
(i) Since $M + H$ is assumed to be a better estimate,
$$\|M\|_* \ge \|M + H\|_* \ge \|M + H_c\|_* - \|H_0\|_* = \|M\|_* + \|H_c\|_* - \|H_0\|_* \quad \text{(Fact 6.3: } MH_c^\top = 0 \text{ and } M^\top H_c = 0\text{)} \tag{6.3}$$
$$\Longrightarrow \quad \|H_c\|_* \le \|H_0\|_*. \tag{6.4}$$
(ii) Since the nonzero singular values of $H_{j-1}$ dominate those of $H_j$ ($j \ge 2$),
$$\|H_j\|_F \le \sqrt{Kr}\,\|H_j\| \le \sqrt{Kr}\,\big[\|H_{j-1}\|_*/(Kr)\big] = \|H_{j-1}\|_*/\sqrt{Kr}$$
$$\Longrightarrow \quad \sum_{j\ge2}\|H_j\|_F \le \frac{1}{\sqrt{Kr}}\sum_{j\ge2}\|H_{j-1}\|_* = \frac{1}{\sqrt{Kr}}\|H_c\|_*. \tag{6.5}$$

Proof of Theorem 6.9
Step 2 (using feasibility + RIP). Show that there exists $\rho < \sqrt{K/2}$ such that
$$\|H_0 + H_1\|_F \le \rho\sum_{j\ge2}\|H_j\|_F. \tag{6.6}$$
If this claim holds, then
$$\|H_0 + H_1\|_F \overset{(6.6)}{\le} \rho\sum_{j\ge2}\|H_j\|_F \overset{(6.2)}{\le} \rho\frac{1}{\sqrt{Kr}}\|H_0\|_* \le \rho\frac{\sqrt{2r}}{\sqrt{Kr}}\|H_0\|_F = \rho\sqrt{\frac{2}{K}}\|H_0\|_F \le \rho\sqrt{\frac{2}{K}}\|H_0 + H_1\|_F. \tag{6.7}$$
This cannot hold with $\rho < \sqrt{K/2}$ unless $H_0 + H_1 = 0$ (equivalently, $H_0 = H_1 = 0$).

Proof of Theorem 6.9
We now prove (6.6). To connect $H_0 + H_1$ with $\sum_{j\ge2}H_j$, we use feasibility:
$$\mathcal{A}(H) = 0 \quad \Longrightarrow \quad \mathcal{A}(H_0 + H_1) = -\sum_{j\ge2}\mathcal{A}(H_j),$$
which taken collectively with RIP yields
$$\big(1 - \delta_{(2+K)r}^{\mathrm{lb}}\big)\|H_0 + H_1\|_F \le \big\|\mathcal{A}(H_0 + H_1)\big\|_2 = \Big\|\sum_{j\ge2}\mathcal{A}(H_j)\Big\|_2 \le \sum_{j\ge2}\big\|\mathcal{A}(H_j)\big\|_2 \le \big(1 + \delta_{Kr}^{\mathrm{ub}}\big)\sum_{j\ge2}\|H_j\|_F.$$
This establishes (6.6) as long as $\rho := \frac{1 + \delta_{Kr}^{\mathrm{ub}}}{1 - \delta_{(2+K)r}^{\mathrm{lb}}} < \sqrt{\frac{K}{2}}$.

Gaussian sampling operators satisfy RIP
If the entries of $\{A_i\}_{i=1}^m$ are i.i.d. $\mathcal{N}(0, 1/m)$, then
$$\delta_{5r}(\mathcal{A}) < \frac{\sqrt{3} - \sqrt{2}}{\sqrt{3} + \sqrt{2}}$$
with high probability, provided that $m \gtrsim nr$ (near-optimal sample size). This satisfies the assumption of Theorem 6.9 with $K = 3$.

Precise phase transition
Using statistical dimension machinery, we can locate the precise phase transition (Amelunxen, Lotz, McCoy & Tropp '13): nuclear norm minimization
$$\begin{cases} \text{works} & \text{if } m > \mathrm{stat\text{-}dim}\big(\mathcal{D}(\|\cdot\|_*, X)\big) \\ \text{fails} & \text{if } m < \mathrm{stat\text{-}dim}\big(\mathcal{D}(\|\cdot\|_*, X)\big)\end{cases}$$
where $\mathrm{stat\text{-}dim}\big(\mathcal{D}(\|\cdot\|_*, X)\big) \approx n^2\,\psi\big(\tfrac{r}{n}\big)$ and
$$\psi(\rho) = \inf_{\tau\ge0}\left\{\rho + (1-\rho)\left[\rho(1+\tau^2) + (1-\rho)\int_\tau^2 (u-\tau)^2\,\frac{\sqrt{4-u^2}}{\pi}\,\mathrm{d}u\right]\right\}$$

Numerical phase transition ($n = 30$). (Figure credit: Amelunxen, Lotz, McCoy & Tropp '13)

Sampling operators that do NOT satisfy RIP
Unfortunately, many sampling operators fail to satisfy RIP (e.g. none of the 4 motivating examples in this lecture satisfies RIP).

Matrix completion

Sampling operators for matrix completion
Observation operator (projection onto matrices supported on $\Omega$): $Y = \mathcal{P}_\Omega(M)$, where $(i,j) \in \Omega$ with probability $p$ (random sampling).
$\mathcal{P}_\Omega$ does NOT satisfy RIP when $p \ll 1$! For example, take $M = e_1e_1^\top$ (a rank-1 matrix with a single nonzero entry): with high probability that entry is not sampled, so $\|\mathcal{P}_\Omega(M)\|_F = 0$, or equivalently, $\frac{1 + \delta_K^{\mathrm{ub}}}{1 - \delta_{2+K}^{\mathrm{lb}}} = \infty$.

Which sampling pattern?
Consider a sampling pattern in which an entire row (or column) is unobserved: if some rows/columns are not sampled, recovery is impossible.

Which low-rank matrices can we recover?
Compare the following rank-1 matrices: $e_1e_1^\top$ (a single nonzero entry; hard) vs. $\mathbf{1}\mathbf{1}^\top$ (the all-ones matrix; easy). The column/row spaces cannot be aligned with the canonical basis vectors.

Coherence
Definition 6.10. The coherence parameter $\mu$ of $M = U\Sigma V^\top$ is the smallest quantity such that
$$\max_i \|U^\top e_i\|^2 \le \frac{\mu r}{n} \quad \text{and} \quad \max_i \|V^\top e_i\|^2 \le \frac{\mu r}{n}.$$
- $\mu \ge 1$ (since $\sum_{i=1}^n \|U^\top e_i\|^2 = \|U\|_F^2 = r$);
- $\mu = 1$ if $U = V = \frac{1}{\sqrt{n}}\mathbf{1}$ (most incoherent);
- $\mu = \frac{n}{r}$ if $e_i \in \mathrm{span}(U)$ for some $i$ (most coherent).
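
A small numpy sketch (illustrative) of Definition 6.10; for random factors the coherence is modest, while for $e_1e_1^\top$ it attains the maximum $n/r$:

```python
import numpy as np

def coherence(M, r):
    """Coherence parameter mu of a rank-r matrix M (Definition 6.10)."""
    n = M.shape[0]
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    mu_U = (n / r) * np.max(np.sum(U**2, axis=1))   # (n/r) * max_i ||U^T e_i||^2
    mu_V = (n / r) * np.max(np.sum(V**2, axis=1))
    return max(mu_U, mu_V)

rng = np.random.default_rng(6)
n, r = 100, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
print(coherence(M, r))        # modest for random factors, far from the worst case n/r
```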

Performance guarantee
Theorem 6.11 (Candes & Recht '09, Candes & Tao '10, Gross '11, ...). Nuclear norm minimization is exact and unique with high probability, provided that
$$m \gtrsim \mu n r \log^2 n.$$
This result is optimal up to a logarithmic factor, and is established via a RIPless theory.

Numerical performance of nuclear norm minimization ($n = 50$). (Fig. credit: Candes & Recht '09)

Subgradient of nuclear norm
The subdifferential (set of subgradients) of $\|\cdot\|_*$ at $M$ is
$$\partial\|M\|_* = \big\{UV^\top + W : \mathcal{P}_T(W) = 0,\ \|W\| \le 1\big\}.$$
It does not depend on the singular values of $M$. Equivalently, $Z \in \partial\|M\|_*$ iff $\mathcal{P}_T(Z) = UV^\top$ and $\|\mathcal{P}_{T^\perp}(Z)\| \le 1$.

KKT condition
Lagrangian:
$$\mathcal{L}(X, \Lambda) = \|X\|_* + \langle\Lambda, \mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M)\rangle = \|X\|_* + \langle\mathcal{P}_\Omega(\Lambda), X - M\rangle.$$
When $M$ is a minimizer, the KKT condition reads
$$0 \in \partial_X\mathcal{L}(X, \Lambda)\big|_{X=M} \iff \exists\Lambda \ \text{s.t.}\ -\mathcal{P}_\Omega(\Lambda) \in \partial\|M\|_*$$
$$\iff \exists W \ \text{s.t.}\ UV^\top + W \text{ is supported on } \Omega,\ \mathcal{P}_T(W) = 0,\ \text{and}\ \|W\| \le 1.$$

Optimality condition via dual certificate
A slightly stronger condition than KKT guarantees uniqueness:
Lemma 6.12. $M$ is the unique minimizer of nuclear norm minimization if
- the sampling operator $\mathcal{P}_\Omega$ restricted to $T$ is injective, i.e. $\mathcal{P}_\Omega(H) \neq 0$ for all nonzero $H \in T$; and
- there exists $W$ such that $UV^\top + W$ is supported on $\Omega$, $\mathcal{P}_T(W) = 0$, and $\|W\| < 1$.

Proof of Lemma 6.12
For any $W_0$ obeying $\|W_0\| \le 1$ and $\mathcal{P}_T(W_0) = 0$, the matrix $UV^\top + W_0$ is a subgradient of $\|\cdot\|_*$ at $M$, so for any feasible $M + H$ (which obeys $\mathcal{P}_\Omega(H) = 0$) one has
$$\|M + H\|_* \ge \|M\|_* + \langle UV^\top + W_0, H\rangle = \|M\|_* + \langle UV^\top + W, H\rangle + \langle W_0 - W, H\rangle$$
$$= \|M\|_* + \langle UV^\top + W, \mathcal{P}_\Omega(H)\rangle + \langle W_0 - W, \mathcal{P}_{T^\perp}(H)\rangle = \|M\|_* + \langle W_0 - W, \mathcal{P}_{T^\perp}(H)\rangle.$$
If we take $W_0$ such that $\langle W_0, \mathcal{P}_{T^\perp}(H)\rangle = \|\mathcal{P}_{T^\perp}(H)\|_*$, then
$$\|M + H\|_* \ge \|M\|_* + \|\mathcal{P}_{T^\perp}(H)\|_* - \|W\|\,\|\mathcal{P}_{T^\perp}(H)\|_* = \|M\|_* + (1 - \|W\|)\,\|\mathcal{P}_{T^\perp}(H)\|_* > \|M\|_*$$
unless $\mathcal{P}_{T^\perp}(H) = 0$. But if $\mathcal{P}_{T^\perp}(H) = 0$, then $H \in T$ and $H = 0$ by injectivity. Thus $\|M + H\|_* > \|M\|_*$ for any feasible $H \neq 0$, concluding the proof.

Constructing dual certificates
Use the golfing scheme to produce an approximate dual certificate (Gross '11). Think of it as an iterative algorithm (with sample splitting) to solve the KKT condition.

Reference
[1] "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," B. Recht, M. Fazel, P. Parrilo, SIAM Review, 2010.
[2] "Exact matrix completion via convex optimization," E. Candes and B. Recht, Foundations of Computational Mathematics, 2009.
[3] "Matrix rank minimization with applications," M. Fazel, Ph.D. Thesis, 2002.
[4] "The power of convex relaxation: Near-optimal matrix completion," E. Candes and T. Tao, 2010.
[5] "Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements," E. Candes and Y. Plan, IEEE Transactions on Information Theory, 2011.
[6] "Shape and motion from image streams under orthography: a factorization method," C. Tomasi and T. Kanade, International Journal of Computer Vision, 1992.

Reference
[7] "An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems," K. C. Toh and S. Yun, Pacific Journal of Optimization, 2010.
[8] "Topics in random matrix theory," T. Tao, American Mathematical Society, 2012.
[9] "A singular value thresholding algorithm for matrix completion," J. Cai, E. Candes, Z. Shen, SIAM Journal on Optimization, 2010.
[10] "Recovering low-rank matrices from few coefficients in any basis," D. Gross, IEEE Transactions on Information Theory, 2011.
[11] "Incoherence-optimal matrix completion," Y. Chen, IEEE Transactions on Information Theory, 2015.
[12] "Revisiting Frank-Wolfe: Projection-free sparse convex optimization," M. Jaggi, ICML, 2013.