CS 229r: Algorithms for Big Data, Fall 2013
Prof. Jelani Nelson
Lecture 20, November 7, 2013
Scribe: Yun William Yu

1 Introduction

Today we're going to go through the analysis of matrix completion. First, though, let's go through the history of prior work on this problem. Recall the setup and model:

Matrix completion setup:

- Want to recover $M \in \mathbb{R}^{n_1 \times n_2}$, under the assumption that $\mathrm{rank}(M) = r$, where $r$ is small.
- Model: only some small subset of the entries $(M_{ij})_{(i,j)\in\Omega}$ is revealed, where $\Omega \subset [n_1]\times[n_2]$, $|\Omega| = m \ll n_1 n_2$: $m$ times we sample $(i,j)$ uniformly at random and insert it into $\Omega$ (so $\Omega$ is a multiset).

Note that the same results hold if we choose entries without replacement, but it's easier to analyze this way. In fact, one can show that if recovery works with replacement, then recovery also works without replacement, which makes sense because without replacement you'd only be seeing more information about $M$.

We recover $M$ by Nuclear Norm Minimization (NNM): solve the program
$$\min_X \|X\|_* \quad \text{s.t.} \quad \forall (i,j)\in\Omega,\ X_{ij} = M_{ij}.$$
(A small numerical sketch of this program appears at the end of this section.)

- [Recht, Fazel, Parrilo '10] [RFP10] was the first to give some rigorous guarantees for NNM. As you'll see on the pset, you can actually solve this program in polynomial time since it's an instance of what's known as a semidefinite program.
- [Candès, Recht '09] [CR09] was the first paper to show provable guarantees for NNM applied to matrix completion.
- There were some quantitative improvements (in the parameters) in two papers: [Candès, Tao '09] [CT10] and [Keshavan, Montanari, Oh '09] [KMO10].
- Today we're going to cover an even newer analysis given in [Recht, 2011] [Rec11], which has a couple of advantages. First, it has the laxest of all the conditions. Second, it's also the simplest of all the analyses in the papers. Thus, it's really better in every way there is. The approach of [Rec11] was inspired by work in quantum tomography [GLF+10]. A more general theorem than the one proven in class today was later proven by Gross [Gross].
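To make the program above concrete, here is a minimal sketch of NNM for matrix completion in Python using cvxpy. This is an editor's illustration, not code from the lecture or the papers; the choice of library, the dimensions, and the sampling density are all arbitrary assumptions.

```python
# Nuclear norm minimization (NNM) for matrix completion: a small illustrative sketch.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n1, n2, r = 20, 20, 2
M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))  # rank-r target

m = 300  # number of revealed entries, sampled with replacement (Omega is a multiset)
rows = rng.integers(0, n1, size=m)
cols = rng.integers(0, n2, size=m)

X = cp.Variable((n1, n2))
constraints = [X[int(i), int(j)] == M[i, j] for i, j in zip(rows, cols)]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()

print("relative error:", np.linalg.norm(X.value - M) / np.linalg.norm(M))
```

With enough revealed entries of a low-rank, incoherent matrix, the printed relative error should be close to zero; with too few entries the solver simply returns some other completion of minimum nuclear norm.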

2 Theorem Statement

We're almost ready to formally state the main theorem, but we need a couple of definitions first.

Definition 1. Let $M = U\Sigma V^\top$ be the singular value decomposition. (Note that $U \in \mathbb{R}^{n_1\times r}$ and $V \in \mathbb{R}^{n_2\times r}$.)

Definition 2. Define the incoherence of the subspace $U$ as $\mu(U) = \frac{n_1}{r}\max_i \|P_U e_i\|^2$, where $P_U$ is the projection onto $U$. Similarly, the incoherence of $V$ is $\mu(V) = \frac{n_2}{r}\max_i \|P_V e_i\|^2$, where $P_V$ is the projection onto $V$.

Definition 3. $\mu_0 \stackrel{\text{def}}{=} \max\{\mu(U), \mu(V)\}$.

Definition 4. $\mu_1 \stackrel{\text{def}}{=} \|UV^\top\|_\infty \sqrt{n_1 n_2 / r}$, where $\|UV^\top\|_\infty$ is the largest magnitude of an entry of $UV^\top$.

Theorem 1. If $m \gtrsim \max\{\mu_1^2, \mu_0\}\, n_2\, r \log^2(n_2)$, then with high probability $M$ is the unique solution to the semidefinite program $\min \|X\|_*$ s.t. $\forall (i,j)\in\Omega$, $X_{ij} = M_{ij}$.

Note that $1 \le \mu_0 \le \frac{n_2}{r}$. The way $\mu_0$ can be $\frac{n_2}{r}$ is if a standard basis vector appears in a column of $V$, and the way $\mu_0$ can get all the way down to $1$ is the best-case scenario where all the entries of $V$ are like $\pm 1/\sqrt{n_2}$ and all the entries of $U$ are like $\pm 1/\sqrt{n_1}$; for example, if you took a Fourier matrix and cut off some of its columns. Thus, the condition on $m$ is a good bound if the matrix has low incoherence.

One might wonder about the necessity of all the funny terms in the condition on $m$. Unfortunately, [Candès, Tao '09] [CT10] showed that $m \gtrsim \mu_0\, n_2\, r \log(n_2)$ is needed: if you want to have any decent chance of recovering $M$ over the random choice of $\Omega$ using this SDP, then you need to sample at least that many entries. The condition isn't completely tight because of the square in the log factor and the dependence on $\mu_1^2$. However, you can show that $\mu_1^2 \le \mu_0^2 r$.

Just like in compressed sensing, there are also some iterative algorithms to recover $M$, but we're not going to analyze them in class. For example, there is the SpaRSA algorithm given in [Wright, Nowak, Figueiredo '09] [WNF09] (thanks to Ben Recht for pointing this out to me). That algorithm roughly looks as follows when one wants to minimize $\|AX - M\|_F^2 + \mu\|X\|_*$ (a small numerical sketch follows below). Pick $X_0$ and a step size $t$, and iterate (a)-(d) some number of times:

(a) $Z = X_k - t\,A^\top(AX_k - M)$
(b) $[U, \mathrm{diag}(s), V] = \mathrm{svd}(Z)$
(c) $r = \max(s - \mu t, 0)$
(d) $X_{k+1} = U\,\mathrm{diag}(r)\,V^\top$

As an aside, trace-norm minimization is actually tolerant to noise, but I'm not going to cover that.
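Here is a minimal numpy sketch of the iteration (a)-(d), specialized to matrix completion, where the linear map $A$ is the sampling operator (keep observed entries, zero the rest). This is an editor's illustration: the step size, the regularization parameter $\mu$, the iteration count, and the helper names are arbitrary assumptions, not values from [WNF09].

```python
# SpaRSA-style proximal gradient iteration for matrix completion (illustrative sketch).
import numpy as np

def sparsa_step(X, M_obs, mask, t, mu):
    # (a) gradient step on the data-fit term ||mask * (X - M)||_F^2
    Z = X - t * mask * (X - M_obs)
    # (b) SVD of the intermediate point
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # (c) soft-threshold the singular values
    r = np.maximum(s - mu * t, 0.0)
    # (d) reassemble the next iterate
    return (U * r) @ Vt

rng = np.random.default_rng(1)
n1, n2, rank = 50, 60, 3
M = rng.standard_normal((n1, rank)) @ rng.standard_normal((rank, n2))
mask = rng.random((n1, n2)) < 0.4   # observed entries
M_obs = mask * M

X = np.zeros((n1, n2))
for _ in range(500):
    X = sparsa_step(X, M_obs, mask, t=1.0, mu=0.1)
print("relative error:", np.linalg.norm(X - M) / np.linalg.norm(M))
```

Each pass through the loop is exactly one application of (a)-(d); the soft-thresholding in (c)-(d) is the proximal operator of $\mu t\,\|\cdot\|_*$.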

3 Analysis

The way the analysis is going to go is that we're going to condition on lots of good events all happening, and if those good events happen, then the minimization works. The way I'm going to structure the proof is: I'll first state what all those events are, then I'll show why those events make the minimization work, and finally I'll bound the probability of those events not happening.

3.1 Background and more notation

Before I do that, I want to say one thing about the trace norm. How many people are familiar with dual norms? How many people have heard of the Hahn-Banach theorem? OK, good.

Definition 5. $\langle A, B\rangle \stackrel{\text{def}}{=} \mathrm{Tr}(A^\top B) = \sum_{i,j} A_{ij}B_{ij}$.

Claim 1. The dual of the trace norm is the operator norm:
$$\|A\|_* = \sup_{B\,:\,\|B\|\le 1} \langle A, B\rangle.$$

This makes sense because the dual of $\ell_1$ for vectors is $\ell_\infty$, and this sort of looks like that: the trace norm and the operator norm are respectively the $\ell_1$ and $\ell_\infty$ norms of the singular value vector. More rigorously, we can prove the claim by proving the inequality in both directions. One direction is not so hard, but the other requires the following lemma.

Lemma 1.
$$\underbrace{\|A\|_*}_{(1)} \;=\; \underbrace{\min_{X,Y\,:\,A=XY} \|X\|_F\,\|Y\|_F}_{(2)} \;=\; \underbrace{\min_{X,Y\,:\,A=XY} \tfrac12\left(\|X\|_F^2 + \|Y\|_F^2\right)}_{(3)}.$$

Proof of lemma. (2) $\le$ (3): AM-GM inequality: $xy \le \frac12(x^2+y^2)$.

(3) $\le$ (1): We basically just need to exhibit an $X$ and $Y$ that give something which is at most the trace norm. Write the SVD $A = U\Sigma V^\top$ and set $X = U\Sigma^{1/2}$, $Y = \Sigma^{1/2}V^\top$. (In general, given $f:\mathbb{R}_+\to\mathbb{R}_+$, one defines $f(\Sigma)$ by applying $f$ to each diagonal entry of $\Sigma$.) You can easily check that $XY = A$ and that $\|X\|_F^2 = \|Y\|_F^2 = \mathrm{Tr}(\Sigma)$ is exactly the trace norm, so $\frac12(\|X\|_F^2 + \|Y\|_F^2) = \|A\|_*$.

(1) $\le$ (2): Let $X, Y$ be any matrices such that $A = XY$. Then
$$\|A\|_* = \|XY\|_* = \sup_{\substack{\{a_i\}\ \text{orthonormal basis}\\ \{b_i\}\ \text{orthonormal basis}}} \sum_i \langle XY a_i, b_i\rangle,$$
which can be seen to be true by letting $a_i = v_i$ and $b_i = u_i$ (from the SVD), where we get equality. Continuing,
$$\sum_i \langle XY a_i, b_i\rangle = \sum_i \langle Y a_i, X^\top b_i\rangle \le \sum_i \|Y a_i\|\,\|X^\top b_i\| \le \Big(\sum_i \|Y a_i\|^2\Big)^{1/2}\Big(\sum_i \|X^\top b_i\|^2\Big)^{1/2}$$
by Cauchy-Schwarz (applied twice), and the last quantity equals $\|X\|_F\,\|Y\|_F$ because $\{a_i\}$, $\{b_i\}$ are orthonormal bases and the Frobenius norm is rotationally invariant. $\square$
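As a quick numerical sanity check of Claim 1 and Lemma 1 (an editor's illustration, not part of the original notes): for a random $A$, the inner product $\langle A, UV^\top\rangle$ and the factorization $X = U\Sigma^{1/2}$, $Y = \Sigma^{1/2}V^\top$ both reproduce the trace norm.

```python
# Check: trace norm = <A, U V^T> = ||X||_F ||Y||_F for X = U S^(1/2), Y = S^(1/2) V^T.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

trace_norm = s.sum()
inner = np.trace((U @ Vt).T @ A)          # <A, B> with B = U V^T (operator norm 1)
X = U * np.sqrt(s)                        # U Sigma^{1/2}
Y = np.sqrt(s)[:, None] * Vt              # Sigma^{1/2} V^T
print(trace_norm, inner, np.linalg.norm(X) * np.linalg.norm(Y))  # three equal numbers
```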

Proof of claim. Part 1: $\|A\|_* \le \sup_{\|B\|\le 1}\langle A, B\rangle$.

We show this by writing $A = U\Sigma V^\top$ and taking $B = \sum_i u_i v_i^\top$. Then $\|B\| = 1$ and $\langle A, B\rangle = \mathrm{Tr}(\Sigma) = \|A\|_*$, so the quantity on the right is at least the trace norm.

As an aside, in general this is how dual norms are defined: given a norm $\|\cdot\|_X$, the dual norm is defined by $\|Z\|_{X^*} = \sup_{\|Y\|_X \le 1}\langle Z, Y\rangle$. In this case, we're proving that the dual of the operator norm is the trace norm. Or, for example, the dual norm of the Schatten-$p$ norm is the Schatten-$q$ norm where $\frac1p + \frac1q = 1$. As another aside, if $X$ is a normed space, then the dual space $X^*$ is the set of bounded linear functionals $\lambda : X \to \mathbb{R}$, with dual norm $\|\lambda\|_{X^*} = \sup_{\|y\|_X \le 1}\lambda(y)$. One can then map $x \in X$ into $(X^*)^*$ by the evaluation map $f : X \to (X^*)^*$: for $\lambda \in X^*$, $f(x)(\lambda) = \lambda(x)$. Then $f$ is injective and the norms of $x$ and $f(x)$ are equal by the Hahn-Banach theorem, though $f$ need not be surjective (in the case where it is, $X$ is called a reflexive Banach space). You can learn more on Wikipedia if you want, or take a functional analysis class.

Part 2: $\langle A, B\rangle \le \|A\|_*$ for every $B$ with $\|B\| \le 1$. We show this using the lemma. Write $A = XY$ such that $\|A\|_* = \|X\|_F\,\|Y\|_F$ (the lemma guarantees that such $X$ and $Y$ exist). Write the SVD $B = \sum_i \sigma_i b_i a_i^\top$ with $\sigma_i \le 1$ for all $i$. Then, using a similar argument to last time,
$$\langle A, B\rangle = \Big\langle XY,\ \sum_i \sigma_i b_i a_i^\top\Big\rangle = \sum_i \sigma_i \langle Y a_i, X^\top b_i\rangle \le \sum_i \|Y a_i\|\,\|X^\top b_i\| \le \|X\|_F\,\|Y\|_F = \|A\|_*,$$
which concludes the proof of the claim. $\square$

Recall that the set of $n_1\times n_2$ matrices is itself a vector space. I'm going to decompose that vector space into a subspace $T$ and the orthogonal complement of $T$ by defining the following projection operators:
$$P_{T^\perp}(Z) \stackrel{\text{def}}{=} (I - P_U)\,Z\,(I - P_V), \qquad P_T(Z) \stackrel{\text{def}}{=} Z - P_{T^\perp}(Z).$$
So basically, the matrices in $T^\perp$ are the matrices that can be written as a sum of rank-1 matrices $a_i b_i^\top$ where the $a_i$'s are orthogonal to all the $u$'s and the $b_i$'s are orthogonal to all the $v$'s.
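A minimal numpy sketch of these two projections (an editor's illustration; `split_T` is a made-up helper name, and $U$, $V$ are assumed to have orthonormal columns, as in the SVD of $M$):

```python
# Projections onto T and T-perp, given orthonormal U (n1 x r) and V (n2 x r).
import numpy as np

def split_T(Z, U, V):
    """Return (P_T(Z), P_Tperp(Z)) for the subspace T determined by U and V."""
    PU = U @ U.T                          # projection onto col(U)
    PV = V @ V.T                          # projection onto col(V)
    n1, n2 = Z.shape
    Z_perp = (np.eye(n1) - PU) @ Z @ (np.eye(n2) - PV)   # P_{T^perp}(Z)
    return Z - Z_perp, Z_perp                            # P_T(Z) = Z - P_{T^perp}(Z)

# Tiny check that the two pieces add back up to Z:
rng = np.random.default_rng(5)
U, _ = np.linalg.qr(rng.standard_normal((8, 2)))
V, _ = np.linalg.qr(rng.standard_normal((9, 2)))
Z = rng.standard_normal((8, 9))
ZT, ZTp = split_T(Z, U, V)
print(np.allclose(ZT + ZTp, Z))   # True
```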

Also define $R_\Omega(Z)$ as the matrix that keeps only the entries of $Z$ that are in $\Omega$, each multiplied by its multiplicity in $\Omega$. If you think of the operator $R_\Omega : \mathbb{R}^{n_1 n_2} \to \mathbb{R}^{n_1 n_2}$ as a matrix, it is a diagonal matrix with the multiplicities of the entries of $\Omega$ on the diagonal.

3.2 Good events

With high probability (probability $1 - \frac{1}{\mathrm{poly}(n_2)}$, where you can make the $\frac{1}{\mathrm{poly}(n_2)}$ factor decay as fast as you want by increasing the constant in front of $m$), all of the following events happen:

1. $\left\|\frac{n_1 n_2}{m} P_T R_\Omega P_T - P_T\right\| \lesssim \sqrt{\frac{\mu_0\, r\, (n_1+n_2)\log(n_2)}{m}} \le \frac12$ (this is a deviation inequality from the expectation, over the randomness coming from $\Omega$; note $\mathbb{E}\,\frac{n_1 n_2}{m} R_\Omega = I$).

2. $\left\|\left(\frac{n_1 n_2}{m} R_\Omega - I\right)Z\right\| \lesssim \sqrt{\frac{n_1 n_2 \log(n_1+n_2)}{m}}\;\|Z\|_\infty$ for a fixed matrix $Z$ (another deviation inequality from the expectation; here $\|Z\|_\infty$ denotes the largest entry magnitude).

3. If $Z \in T$, then $\left\|\frac{n_1 n_2}{m} P_T R_\Omega(Z) - Z\right\|_\infty \lesssim \sqrt{\frac{\mu_0\, r\, n_2 \log(n_2)}{m}}\;\|Z\|_\infty$.

4. $\|R_\Omega\| \lesssim \log(n_2)$.
   This one is actually really easy (also the shortest): it's just balls and bins (see the short simulation after this list). We've already said that $R_\Omega$ is a diagonal matrix, so the operator norm is just the largest diagonal entry. Imagine we have $m$ balls and we're throwing them independently at random into the $n_1 n_2$ bins, namely the diagonal entries, and this is just how loaded the maximum bin is. In particular, $m < n_1 n_2$, or else we wouldn't be doing matrix completion since we'd essentially have the whole matrix. In general, when you throw $t$ balls into $t$ bins, the maximum load by the Chernoff bound is at most $O(\log t)$. In fact, it's $O(\log t / \log\log t)$, but who cares, since that would save us an extra $\log\log$ somewhere. Actually, I'm not even sure it would save us that, since there are other logs that come into play.

5. There exists $Y$ in $\mathrm{range}(R_\Omega)$ such that
   (5a) $\|P_T(Y) - UV^\top\|_F \le \sqrt{\frac{r}{2n_2}}$
   (5b) $\|P_{T^\perp}(Y)\| < \frac12$.
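A quick simulation of event (4), the balls-and-bins bound on $\|R_\Omega\|$ (an editor's illustration with arbitrary parameters): the largest multiplicity of any sampled entry is the largest diagonal entry of $R_\Omega$, and it typically comes out to be of the same order as $\log(n_2)$.

```python
# Event (4): ||R_Omega|| is the maximum multiplicity over sampled entries (balls and bins).
import numpy as np

rng = np.random.default_rng(3)
n1, n2, m = 200, 200, 20_000
idx = rng.integers(0, n1 * n2, size=m)         # m samples with replacement from [n1] x [n2]
counts = np.bincount(idx, minlength=n1 * n2)   # multiplicities = diagonal entries of R_Omega
print("||R_Omega|| =", counts.max(), " log(n2) =", round(np.log(n2), 2))
```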

3.3 Recovery conditioned on good events

Now that we've stated all these things, let's show that they imply that trace norm minimization actually works. We want to make sure that
$$\mathop{\mathrm{arg\,min}}_{X\,:\,R_\Omega(X) = R_\Omega(M)} \|X\|_*$$
is unique and equal to $M$.

Let $Z \in \ker(R_\Omega)$ with $Z \neq 0$; we want to show $\|M + Z\|_* > \|M\|_*$. First we want to argue that $\|P_T(Z)\|_F$ cannot be big.

Lemma 2. $\|P_T(Z)\|_F \le \sqrt{\frac{n_2}{2r}}\;\|P_{T^\perp}(Z)\|_F$.

Proof. Since $Z \in \ker(R_\Omega)$, the triangle inequality gives
$$0 = \|R_\Omega(Z)\|_F \ge \|R_\Omega(P_T(Z))\|_F - \|R_\Omega(P_{T^\perp}(Z))\|_F,$$
so $\|R_\Omega(P_T(Z))\|_F \le \|R_\Omega(P_{T^\perp}(Z))\|_F$.

Also, since the diagonal entries of $R_\Omega$ are nonnegative integers (so $R_\Omega^2 \succeq R_\Omega$),
$$\|R_\Omega(P_T(Z))\|_F^2 = \langle R_\Omega P_T Z,\ R_\Omega P_T Z\rangle \ge \langle P_T Z,\ R_\Omega P_T Z\rangle = \langle P_T Z,\ P_T R_\Omega P_T\, P_T Z\rangle$$
$$= \frac{m}{n_1 n_2}\|P_T Z\|_F^2 + \Big\langle P_T Z,\ \Big(P_T R_\Omega P_T - \frac{m}{n_1 n_2}P_T\Big)P_T Z\Big\rangle \ge \frac{m}{n_1 n_2}\Big(1 - \Big\|\frac{n_1 n_2}{m}P_T R_\Omega P_T - P_T\Big\|\Big)\|P_T Z\|_F^2 \ge \frac{m}{2 n_1 n_2}\|P_T Z\|_F^2,$$
using event (1) in the last step.

Also we have
$$\|R_\Omega(P_{T^\perp}(Z))\|_F^2 \le \|R_\Omega\|^2\,\|P_{T^\perp}(Z)\|_F^2 \lesssim \log^2(n_2)\,\|P_{T^\perp}(Z)\|_F^2$$
by event (4).

To summarize: combining all the inequalities together,
$$\|P_T(Z)\|_F \lesssim \sqrt{\frac{n_1 n_2}{m}}\,\log(n_2)\,\|P_{T^\perp}(Z)\|_F,$$
and then, making use of our choice of $m$, this is at most $\sqrt{\frac{n_2}{2r}}\,\|P_{T^\perp}(Z)\|_F$. $\square$

Pick $U_\perp, V_\perp$ such that $\langle U_\perp V_\perp^\top, P_{T^\perp}(Z)\rangle = \|P_{T^\perp}(Z)\|_*$ and such that $[U, U_\perp]$, $[V, V_\perp]$ are orthogonal matrices. We know from Claim 1 that the trace norm is exactly the sup, over all matrices $B$ with $\|B\| \le 1$, of the inner product; the $B$ that achieves the sup has all singular values equal to 1, and we can take $B = U_\perp V_\perp^\top$, because $P_{T^\perp}(Z)$ lives in the orthogonal space, so $B$ should also live in the orthogonal space.

Now we have a long chain of inequalities to show that the trace norm of any $M + Z$ is greater than the trace norm of $M$:

$$\|M + Z\|_* \ge \langle UV^\top + U_\perp V_\perp^\top,\; M + Z\rangle \qquad \text{by Claim 1, since } \|UV^\top + U_\perp V_\perp^\top\| = 1$$
$$= \|M\|_* + \langle UV^\top + U_\perp V_\perp^\top,\; Z\rangle \qquad \text{since } \langle UV^\top, M\rangle = \|M\|_* \text{ and } M \perp U_\perp V_\perp^\top$$
$$= \|M\|_* + \langle UV^\top + U_\perp V_\perp^\top - Y,\; Z\rangle \qquad \text{since } Z \in \ker(R_\Omega) \text{ and } Y \in \mathrm{range}(R_\Omega)$$
$$= \|M\|_* + \langle UV^\top - P_T(Y),\; P_T(Z)\rangle + \langle U_\perp V_\perp^\top - P_{T^\perp}(Y),\; P_{T^\perp}(Z)\rangle \qquad \text{decomposition into } T \text{ and } T^\perp$$
$$\ge \|M\|_* - \|UV^\top - P_T(Y)\|_F\,\|P_T(Z)\|_F + \|P_{T^\perp}(Z)\|_* - \|P_{T^\perp}(Y)\|\,\|P_{T^\perp}(Z)\|_*,$$
where the last step uses Cauchy-Schwarz ($\langle x, y\rangle \le \|x\|_2\|y\|_2$) on the $T$ part, uses $\langle U_\perp V_\perp^\top, P_{T^\perp}(Z)\rangle = \|P_{T^\perp}(Z)\|_*$ by our choice of $U_\perp V_\perp^\top$, and uses the norm inequality $\langle A, B\rangle \le \|A\|\,\|B\|_*$ on the remaining term.

But note that the trace norm is always at least as big as the Frobenius norm, so $\|P_{T^\perp}(Z)\|_* \ge \|P_{T^\perp}(Z)\|_F$. We want to ensure that the positive term is strictly bigger than the two negative terms. By condition (5b), $\|P_{T^\perp}(Y)\|\,\|P_{T^\perp}(Z)\|_* < \frac12\|P_{T^\perp}(Z)\|_*$. By condition (5a) and Lemma 2, $\|UV^\top - P_T(Y)\|_F\,\|P_T(Z)\|_F \le \sqrt{\frac{r}{2n_2}}\,\|P_T(Z)\|_F \le \frac12\|P_{T^\perp}(Z)\|_F$. (Note also that $P_{T^\perp}(Z) \neq 0$: otherwise Lemma 2 would force $P_T(Z) = 0$ and hence $Z = 0$.) Thus, back to the main chain:
$$\|M + Z\|_* > \|M\|_* - \sqrt{\frac{r}{2n_2}}\,\|P_T(Z)\|_F + \frac12\|P_{T^\perp}(Z)\|_F \ge \|M\|_*.$$
Hence, when all of the good events hold, minimizing the trace norm recovers $M$.

3.4 Probability of good events holding

Unfortunately, we do not have enough time to go through the full analysis. We might overflow some of this into the next lecture, but for now let's introduce the noncommutative Bernstein inequality we use to get conditions (1) and (2). As an aside, I tend to call all of these inequalities Chernoff inequalities, since they're all quite similar, but this one really should have a different name, because the proof of this matrix Bernstein inequality is very different from the proof of the ordinary Chernoff bound.

Theorem 2 (Noncommutative Bernstein inequality). Suppose $X_1, \ldots, X_N$ are independent random matrices of the same dimensions with $\mathbb{E} X_i = 0$, and suppose
1. $\|X_i\| \le M$ for all $i$, with probability 1;
2. $\sigma_i^2 = \max\{\|\mathbb{E}\,X_i X_i^\top\|,\ \|\mathbb{E}\,X_i^\top X_i\|\}$.
Then, for some constant $C > 0$,
$$\mathbb{P}\left(\Big\|\sum_{i=1}^N X_i\Big\| > \lambda\right) \le (n_1 + n_2)\cdot\max\left\{\exp\left(-\frac{C\lambda^2}{\sum_i \sigma_i^2}\right),\ \exp\left(-\frac{C\lambda}{M}\right)\right\}.$$

As mentioned, conditions (2) and (3) were deviation inequalities from an expectation, so we can get them using Bernstein on random matrices over the distribution of $\Omega$ (subtracting out the expectation where appropriate to make the summands mean zero). As an additional aside, conditions (4), (5), and (1) were used in the proofs above; however, we only need conditions (2) and (3) to show (5). Next time, if we have time, we might say something about proving (5).
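As a small illustration of the scale that this bound predicts (an editor's sketch, not from the notes): take $X_i = \varepsilon_i\, e_{a_i} e_{b_i}^\top$ with independent random signs and uniformly random positions, so $\|X_i\| = 1$ and $\sum_i \sigma_i^2 = N/\min(n_1, n_2)$; the operator norm of the sum then concentrates at roughly the $\sqrt{\sum_i \sigma_i^2 \cdot \log(n_1+n_2)}$ scale.

```python
# Matrix Bernstein illustration: sums of random sign * standard-basis matrices.
import numpy as np

rng = np.random.default_rng(4)
n1, n2, N, trials = 40, 60, 5000, 20
norms = []
for _ in range(trials):
    S = np.zeros((n1, n2))
    rows = rng.integers(0, n1, size=N)
    cols = rng.integers(0, n2, size=N)
    signs = rng.choice([-1.0, 1.0], size=N)
    np.add.at(S, (rows, cols), signs)      # S = sum_i eps_i * e_{a_i} e_{b_i}^T
    norms.append(np.linalg.norm(S, 2))     # operator norm of the sum

sigma2 = N / min(n1, n2)                   # sum_i sigma_i^2 for this ensemble
print("max ||sum X_i|| over trials:", round(max(norms), 1))
print("sqrt(sigma2 * log(n1+n2))  :", round(float(np.sqrt(sigma2 * np.log(n1 + n2))), 1))
```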

4 Concluding remarks

Why would you think of trace norm minimization as a way of solving matrix completion? Analogously, why would you use $\ell_1$ minimization for compressed sensing? In some sense these two questions are very similar: rank is the support size of the singular value vector, and the trace norm is the $\ell_1$ norm of the singular value vector, so the two settings are very analogous. $\ell_1$ minimization seems like a natural choice since, among all the $\ell_p$ norms, it is the closest convex function to the support size (and being convex allows us to solve the program in polynomial time).

References

[CR09] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 (2009), no. 6, 717-772.

[CT10] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory 56 (2010), no. 5, 2053-2080.

[Gross] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory 57 (2011), 1548-1566.

[GLF+10] David Gross, Yi-Kai Liu, Steven T. Flammia, Stephen Becker, and Jens Eisert. Quantum state tomography via compressed sensing. Physical Review Letters 105 (2010), no. 15, 150401.

[KMO10] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research 11 (2010), 2057-2078.

[Rec11] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research 12 (2011), 3413-3430.

[RFP10] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52 (2010), no. 3, 471-501.

[WNF09] Stephen J. Wright, Robert D. Nowak, and Mário A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing 57 (2009), no. 7, 2479-2493.