Lower Bounds for Quantized Matrix Completion

Size: px

Start display at page:

Download "Lower Bounds for Quantized Matrix Completion"

Dominic Nelson
5 years ago
Views:

1 Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, Mark A. Davenport School of Elec. & Cop. Engineering Georgia Institute of Technology Atlanta, GA Eail: Ewout van den Berg Departent of Statistics Stanford University Stanford, CA Eail: Abstract In this paper we consider the proble of -bit atrix copletion, where instead of observing a subset of the real-valued entries of a atrix M, we obtain a sall nuber of binary (-bit) easureents generated according to a probability distribution deterined by the real-valued entries of M. The central question we ask is whether or not it is possible to obtain an accurate estiate of M fro this data. In general this would see ipossible, however, it has recently been shown in [] that under certain assuptions it is possible to recover M by optiizing a siple convex progra. In this paper we provide lower bounds showing that these estiates are near-optial. I. INTRODUCTION The proble of recovering a atrix fro an incoplete sapling of its entries also known as atrix copletion arises in a wide variety of practical situations. In any of these settings, however, the observations are not only incoplete, but also highly quantized, often even to a single bit. In this paper we consider a statistical odel for such data where instead of observing a real-valued entry as in the original atrix copletion proble, we are now only able to see a positive or negative rating. This binary output is generated according to a probability distribution which is paraeterized by the corresponding entry of the unknown low-rank atrix M. A natural question in this context is: Given observations of this for, can we recover the underlying atrix? The first theoretical accuracy guarantees in this context were recently established by the current authors in []. Inspired by recent theoretical results on unquantized atrix copletion, [] establishes that O(rd) binary observations are sufficient to accurately recover a d d, rank-r atrix by convex prograing. In this paper we review the upper bounds of [], and provide lower bounds showing that these upper bounds are nearly optial. First, however, we provide a brief review of the existing results on atrix copletion. A. Matrix copletion Matrix copletion arises in a wide variety of practical contexts, including collaborative filtering [2], syste identification [3], sensor localization [4 6], rank aggregation [7], and any ore. There is now a strong base of theoretical results concerning atrix copletion [8 ]; a typical result is that a generic d d atrix of rank r can be exactly recovered fro O(r d polylog(d)) randoly chosen entries. Although these results are quite ipressive, there is an iportant gap between the stateent of the proble as considered in the atrix copletion literature and any of the ost coon applications discussed therein. As an exaple, consider collaborative filtering and the now-faous Netflix proble. In this setting, we assue that there is soe unknown atrix whose entries each represent a rating for a particular user on a particular ovie. Since any user will rate only a sall subset of possible ovies, we are only able to observe a sall fraction of the total entries in the atrix, and our goal is to infer the unseen ratings fro the observed ones. If the rating atrix is low-rank, then this would see to be the exact proble studied in the atrix copletion literature. However, there is a subtle difference: the theory developed in this literature generally assues that observations consist of (possibly noisy) continuous-valued entries of the atrix, whereas in the Netflix proble the observations are quantized to the set of integers between and 5. If we believe that it is possible for a user s true rating for a particular ovie to be, for exaple, 4.5, then we ust account for the ipact of this quantization noise on our recovery. One could potentially treat quantization siply as a for of bounded noise, but this is unsatisfying because the ratings aren t just quantized there are also hard liits placed on the iniu and axiu allowable ratings. (Why should we suppose that a ovie given a rating of 5 could not have a true underlying rating of 6 or 7 or 0?) The inadequacy of standard atrix copletion techniques in dealing with this effect is particularly pronounced when we consider recoender systes where each rating consists of a single bit representing a positive or negative rating (consider for exaple rating usic on Pandora, the relevance of advertiseents on Hulu, or posts on Reddit or MathOverflow). In such a case, the assuptions ade in the existing theory of atrix copletion do not apply, standard algoriths are ill-posed, and a new theory is required. B. Outline and notation The reainder of the paper is organized as follows. In Section II we provide an overview of the -bit atrix copletion proble, introducing the observation odel, describing the assuptions necessary for -bit atrix copletion to be wellposed, and suarizing the upper bounds established in []. In Section III we describe a series of lower bounds that show that the upper bounds presented in Section II are nearly optial.

2 Finally, Section IV concludes with a discussion of future work. We now provide a brief suary of soe of the key notation used in this paper. We use [d] to denote the set of integers,..., d}. We use capital boldface to denote a atrix (e.g., M) and standard text to denote its entries (e.g., M i,j ). We let M denote the operator nor of M, M F = i,j M i,j 2 denote the Frobenius nor of M, M denote the nuclear or Schatten- nor of M (the su of the singular values), and M = ax i,j M i,j denote the entry-wise infinity-nor of M. For two scalars p, q [0, ], the Hellinger distance, d 2 H(p, q) := ( p q) 2 + ( p q) 2 is a standard notion of distance between two binary probability distributions. We allow the Hellinger distance to act on atrices in a natural way: for atrices P, Q [0, ] d d2, we define d 2 H(P, Q) := d 2 d d H(P i,j, Q i,j ). 2 II. THE -BIT MATRIX COMPLETION PROBLEM A. Observation odel We now introduce the -bit observation odel considered in []. Given a atrix M R d d2, a subset of indices Ω [d ] [d 2 ], and a differentiable function f : R [0, ], we observe Y i,j = + w. p. f(m i,j ), w. p. f(m i,j ) for (i, j) Ω. () When the function f behaves like a cuulative distribution function, the odel () is equivalent to the following natural latent variable odel. Suppose that Z is a rando atrix with i.i.d. entries, and choose f(x) := P ( Z, x ). Then we can rewrite () as + if M i,j + Z i,j 0 Y i,j = for (i, j) Ω. (2) if M i,j + Z i,j < 0 As exaples, we consider two natural choices for f (or equivalently, for Z). Exaple (Logistic regression/logistic noise). The logistic regression odel, which is coon in statistics, is captured by () with f(x) = ex +e and by (2) with Z x i,j i.i.d. according to the standard logistic distribution. Exaple 2 (Probit regression/gaussian noise). The probit regression odel is captured by () by setting f(x) = Φ( x/σ) = Φ(x/σ) where Φ is the cuulative distribution function of a standard Gaussian and by (2) with Z i,j i.i.d. according to a ean-zero Gaussian distribution with variance σ 2. As has been iportant in previous work on atrix copletion, in order to show that -bit atrix copletion is feasible we will assue that Ω is chosen at rando with E Ω =. Specifically, Ω follows a binoial odel in which each entry i,j d d 2, (i, j) [d ] [d 2 ] is included in Ω with probability independently. However, our lower bounds will not require this assuption, and hold for arbitrary sets Ω. B. Assuptions The ajority of the literature on atrix copletion assues that the first r singular values of M are nonzero and the reainder are exactly zero. However, in any applications the singular values instead exhibit only a gradual decay towards zero. Thus, in this paper we allow a relaxation of the assuption that M has rank exactly r. Instead, we assue that M α rd d 2, where α is a paraeter left to be deterined, but which will often be of constant order. In other words, the singular values of M belong to a scaled l ball. In copressed sensing, belonging to an l p ball with p (0, ] is a coon relaxation of exact sparsity; in atrix copletion, the nuclear-nor ball (or Schatten- ball) plays an analogous role. The particular choice of scaling, α rd d 2, arises fro the following considerations. Suppose that each entry of M is bounded in agnitude by α and that rank(m) r. Then M r M F rd d 2 M α rd d 2. Thus, the assuption that M α rd d 2 is a relaxation of the conditions that rank(m) r and M α. The condition that M α essentially eans that the probability of seeing a + or does not depend on the diension. It is also a way of enforcing that M should not be too spiky ; this is an iportant assuption in order to ake the recovery of M well-posed (for exaple, see [2]). Finally, soe assuptions ust be ade on f for recovery of M to be feasible. We define two quantities L α and β α which control the steepness and flatness of f, respectively: and f (x) L α := sup x α f(x)( f(x)) (3) f(x)( f(x)) β α := sup x α (f (x)) 2. (4) We will restrict our attention to f such that L α and β α are well-defined. In particular, we assue that f and f are nonzero in [ α, α]. This assuption is fairly ild for exaple, it includes the logistic and probit odels (as we will see below in Reark ). The quantity L α appears only in the upper bounds, but it is generally well behaved. The quantity β α appears both in the upper and lower bounds. Intuitively, it controls the flatness of f in the interval [ α, α] the flatter f is, the larger β α is. It is clear that soe dependence on β α is necessary. Indeed, if f is perfectly flat, then the agnitudes of the entries of M cannot be recovered. For exaple, if M is a rank- atrix, i.e., we can write M = uv T for soe pair of vectors u and v, then without noise, M is indistinguishable fro M = ũṽ T, where ũ and ṽ have the sae sign patterns and u and v, respectively. Of course, when α is a fixed constant and f is a fixed function, both L α and β α are bounded by fixed constants independent of the diension.

3 C. Upper bounds for -bit atrix copletion In this section, we review the upper bounds of [], which have two goals. The first goal is to accurately recover M itself, and the second is to accurately recover the distribution of Y given by f(m). Both of these results are obtained by axiizing the log-likelihood function of the optiization variable X given our observations subject to a set of convex constraints. In any cases, including the logistic and probit odels, the log-likelihood function is concave, and hence recovery aounts to solving a convex progra. Theore. Assue that M α d d 2 r and M α. Suppose that Ω is chosen at rando following the binoial odel of Section II-A with E Ω =. Suppose that Y is generated as in (). Let L α and β α be as in (3) and (4). If (d + d 2 ) log(d d 2 ), then there is an algorith which outputs M such that with probability at least C /(d +d 2 ), d d 2 M M 2 F C α r(d + d 2 ) with C α := C 2 αl α β α. Above, C and C 2 are absolute constants. Reark (Recovery in the logistic and probit odels). The logistic odel satisfies the hypotheses of Theore with β α = (+eα ) 2 e e α and L α α =. The probit odel has β α c σ 2 e α2 2σ 2 and L α c 2 α σ + σ where we can take c = π and c 2 = 8. In particular, in the probit odel the bound in (5) reduces to d d M M 2 F Cα (α + σ) e r(d + d 2 ) 2σ 2. (6) 2 Hence, when σ < α, increasing the size of the noise leads to significantly iproved error bounds this is not an artifact of the proof. We will see in Section III that the exponential dependence on α in the logistic odel (and on α/σ in the probit odel) is intrinsic to the proble. Intuitively we should expect this since for such odels, as M grows large, we essentially revert to the noiseless setting where estiation of M is ipossible. Furtherore, in Section III we will also see that when α (or α/σ) is bounded by a constant, the error bound (5) is optial up to a constant factor. Fortunately, in any applications, one would expect α to be sall, and in particular to have little, if any, dependence on the diension. In any situations, we ight not be interested in the underlying atrix M, but rather in deterining the distribution of the unknown entries of Y. For exaple, in recoender systes, a natural question would be to deterine the likelihood that a user would enjoy a particular unrated ite. Surprisingly, this distribution ay be accurately recovered without any restriction on the infinity-nor of M. This ay be unexpected Strictly speaking, f(m) [0, ] d d 2 is siply a atrix of scalars, but these scalars iplicitly define the distribution of Y, so we will soeties abuse notation slightly and refer to f(m) as the distribution of Y. (5) to those failiar with the atrix copletion literature in which non-spikiness constraints see to be unavoidable. In fact, we will show in Section III that the bound in Theore 2 is nearoptial even under the added constraint that M α, it would be ipossible to estiate f(m) significantly ore accurately. Theore 2. Assue that M α d d 2 r. Suppose that Ω is chosen at rando following the binoial odel of Section II-A with E Ω =. Suppose that Y is generated as in (), and let L = li α L α. If (d + d 2 ) log(d d 2 ), then there is an algorith which outputs M such that with probability at least C /(d + d 2 ), d 2 H(f( M), r(d + d 2 ) f(m)) C 2 αl. (7) Above, C and C 2 are absolute constants. While L = for the logistic odel, the astute reader will have noticed that for the probit odel L is unbounded that is, L α tends to as α. L would also be unbounded for the case where f(x) takes values of or 0 outside of soe range (as would be the case in (2) if the distribution of the noise had copact support). Fortunately, one can recover a result for these cases by enforcing an infinity-nor constraint []. III. LOWER BOUNDS The upper bounds in Theores and 2 show that we ay recover M or f(m) fro incoplete, binary easureents. In this section, we discuss the extent to which these theores are optial. We give three theores, all proved using inforation theoretic ethods, which show that these results are nearly tight, even when soe of the assuptions are relaxed. Theore 3 gives a lower bound to nearly atch the upper bound on the error in recovering M derived in Theore. Theore 4 copares our upper bounds to those available without discretization and shows that very little is lost when discretizing to a single bit. Finally, Theore 5 gives a lower bound atching, up to a constant factor, the upper bound on the error in recovering the distribution f(m) given in Theore 2. Theore 5 also shows that Theore 2 does not suffer by dropping the canonical spikiness constraint. Our lower bounds require a few assuptions, so before we delve into the bounds theselves, we briefly argue that these assuptions are innocuous. For notational siplicity, we will assue without loss of generality that d d 2 throughout this section. Siilarly, without loss of generality (since we can always adjust f to account for rescaling M), we assue that α. Next, we require that the paraeters be sufficiently large so that rd C 0 (8) for an absolute constant C 0. Note that we could replace this with a sipler, but still ild, condition that d > C 0. Finally, we also require that r c where c is a constant and that r O(d 2 / ), where O( ) hides paraeters (which ay differ in each theore) that we ake explicit below. This last

4 assuption siply eans that we are in the situation where r is significantly saller than d and d 2, i.e., the atrix is of approxiately low rank. In the following, let K = M : M α } rd d 2, M α (9) denote the set of atrices whose recovery is guaranteed by Theore. A. Recovery fro -bit easureents First, we show that the error in Theore is nearly optial: there is no algorith which can approxiate M uch better than the algoriths in []. Theore 3. Fix α, r, d, and d 2 to be such that r 4 and (8) holds. Let β α be defined as in (4), and suppose that f (x) is decreasing for x > 0. Let Ω be any subset of [d ] [d 2 ] with cardinality, and let Y be as in (). Consider any algorith which, for any M K, takes as input Y i,j for (i, j) Ω and returns M. Then, there exists M K such that with probability at least 3/4, d d 2 M M 2 F in C, C 2 α β 3 4 α } rd (0) as long as the right-hand side of (0) exceeds r /d 2. Above, C and C 2 are absolute constants. 2 The requireent that the right-hand side of (0) be larger than r /d 2 is satisfied as long as r O(d 2 / ). In particular, it is satisfied whenever in, β 0 } d 2 r C 3 for a fixed constant C 3. Note also that in the latent variable odel in (2), f (x) is siply the probability density of Z i,j. Thus, the requireent that f (x) be decreasing is siply that the probability density have decreasing tails. One can easily check that this is satisfied for the logistic and probit odels. Note that if α is bounded by a constant and f is fixed (in which case β α and β α are bounded by a constant), then the lower bound of Theore 3 atches the upper bound given in (5) up to a constant. When α is not treated as a constant, the bounds differ by a factor of β α. In the logistic odel β α e α and so this aounts to the difference between e α/2 and e α. The probit odel has a siilar change in the constant of the exponent. B. Recovery fro unquantized easureents Next we show that, surprisingly, very little is lost by discretizing to a single bit. In Theore 4, we consider an unquantized version of the latent variable odel in (2) with Gaussian noise. That is, let Z be a atrix of i.i.d. Gaussian rando variables, and suppose that the noisy entries 2 Here and in the theores below, the choice of 3/4 in the probability bound is arbitrary, and can be adjusted at the cost of changing C 0 in (8) and C and C 2. Siilarly, β 3 4 α can be replaced by β ( ɛ)α for any ɛ > 0. M i,j + Z i,j are observed directly, without discretization. In this setting, we give a lower bound that still nearly atches the upper bound given in Theore, up to the β α ter. Theore 4. Fix α, r, d, and d 2 to be such that r and (8) holds. Let Ω be any subset of [d ] [d 2 ] with cardinality, and let Z be a d d 2 atrix with i.i.d. Gaussian entries with variance σ 2. Consider any algorith which, for any M K, takes as input Y i,j = M i,j + Z i,j for (i, j) Ω and returns M. Then, there exists M K such that with probability at least 3/4, d d 2 M M 2 F in } rd C, C 2 ασ () as long as the right-hand side of () exceeds r /d 2. Above, C and C 2 are absolute constants. The requireent that the right-hand side of () be larger than r /d 2 is satisfied whenever r C 3 in, σ 2 }d 2 for a fixed constant C 3. Following Reark, the lower bound given in () atches the upper bound proven in Theore up to a constant, as long as α/σ is bounded by a constant. In other words: When the signal-to-noise ratio is constant, alost nothing is lost by quantizing to a single bit. Perhaps it is not particularly surprising that -bit quantization induces little loss of inforation in the regie where the noise is coparable to the underlying quantity we wish to estiate however, what is soewhat of a surprise is that a siple convex progra can successfully recover all of the inforation contained in these -bit easureents. Before proceeding, we also briefly note that our Theore 4 is soewhat siilar to Theore 3 in [2]. The authors in [2] consider slightly different sets K: these sets are ore restrictive in the sense that it is required that α 32 log n and less restrictive because the nuclear-nor constraint ay be replaced by a general Schatten-p nor constraint. It was iportant for us to allow α = O() in order to copare with our upper bounds due to the exponential dependence of β α on α in Theore for the probit odel. This led to soe new challenges in the proof. Finally, it is also noteworthy that our stateents hold for arbitrary sets Ω, while the arguent in [2] is only valid for a rando choice of Ω. C. Recovery of the distribution fro -bit easureents Finally, we address the optiality of Theore 2. We show that under ild conditions on f, any algorith that recovers the distribution f(m) ust yield an estiate whose Hellinger distance deviates fro the true distribution by an aount proportional to α rd d 2 /, atching the upper bound of (7) up to a constant. Notice that the lower bound holds even if the algorith is proised that M α, which the upper bound did not require.

5 Theore 5. Fix α, r, d, and d 2 to be such that r 4 and (8) holds. Let L be defined as in (3), and suppose that f (x) c and c f(x) c for x [, ], for soe constants c, c > 0. Let Ω be any subset of [d ] [d 2 ] with cardinality, and let Y be as in (). Consider any algorith which, for any M K, takes as input Y i,j for (i, j) Ω and returns M. Then, there exists M K such that with probability at least 3/4, d 2 H(f(M), f( M)) in C, C 2 α L } rd (2) as long as the right-hand side of (2) exceeds r /d 2. Above, C and C 2 are constants that depend on c, c. The requireent that the right-hand side of (2) be larger than r /d 2 is satisfied whenever } d2 r C 3 in, for a constant C 3 that depends only on c, c. Note also that the condition that f and f be well-behaved in the interval [, ] is satisfied for the logistic odel with c = /4 and c = +e Siilarly, we ay take c = and c = 0.59 in the probit odel. D. Proof sketches All of our lower bounds follow the sae general outline, and rest on Fano s inequality and the following lea. Lea. Let K be defined as in (9), let γ be such that r r γ is an integer, and suppose that 2 γ d 2. There is a set X K with ( ) rd2 X exp 6γ 2 L 2 with the following properties: ) For all X X, each entry has X i,j = αγ. 2) For all X (i), X (j) X, i j, X (i) X (j) 2 F > α2 γ 2 d d 2. 2 The proof of Lea is probabilistic. Briefly, we consider drawing each the atrices X (i) independently, as follows: we draw independent sign flips (appropriately noralized) to fill an appropriately sized block of X (i). To fill out the rest of X (i), we siply repeat this block. It will turn out that the expected distance E X (i) X (j) 2 F is large for distinct i, j, and so the proof rests on showing that these distances (which are rando variables) are sufficiently concentrated that the probability of choosing a set with the required iniu distance is greater than zero. With Lea in hand, the proofs of our lower bounds follow a siilar fraework, with different paraeter settings in Lea. Suppose that a recovery algorith exists, which approxiately recovers M or f(m) with few observations Y Ω. We will iagine running the recovery algorith on a randoly chosen eleent X (i) fro the set X. If the algorith is successful, we will arrange the paraeters so that it recovers the correct X (i) to within a Frobenius distance at ost half the iniu distance of X. Thus, we ay use this algorith to recover X (i) exactly, with good probability. On the other hand, we will show that the utual inforation between the correct choice of X (i) and the output of the algorith cannot be too large. Thus, Fano s inequality will iply that such recovery is ipossible, and thus no such algorith exists. The interested reader is referred to the preprint [] for ore details about the proofs. IV. CONCLUSION Many of the applications of atrix copletion consider discrete data, soeties consisting of binary easureents. This work addresses such situations. However, atrix copletion fro noiseless binary easureents is extreely ill-posed, even if one collects a binary easureent fro all of the atrix entries. Fortunately, when there are soe stochastic variations (noise) in the proble, atrix reconstruction becoes well-posed, and in [] it was shown that the unknown atrix can be accurately and efficiently recovered fro binary easureents. In this work, we focus on lower bounds, and show that the aforeentioned upper bounds are nearly optial, even when soe restrictions are dropped. ACKNOWLEDGMENT This work was partially supported by NSF grants DMS , DMS and CCF REFERENCES [] M. Davenport, Y. Plan, E. van den Berg, and M. Wootters, -bit atrix copletion, Arxiv preprint arxiv: , 202. [2] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry, Using collaborative filtering to weave an inforation tapestry, Co. ACM, vol. 35, no. 2, pp. 6 70, 992. [3] Z. Liu and L. Vandenberghe, Interior-point ethod for nuclear nor approxiation with application to syste identification, SIAM J. Matrix Analysis and Applications, vol. 3, no. 3, pp , [4] P. Biswas, T.-C. Lian, T.-C. Wang, and Y. Ye, Seidefinite prograing based algoriths for sensor network localization, ACM Trans. Sen. Netw., vol. 2, no. 2, pp , [5] A. Singer, A reark on global positioning fro local distances, Proc. Natl. Acad. Sci., vol. 05, no. 28, pp , [6] A. Singer and M. Cucuringu, Uniqueness of low-rank atrix copletion by rigidity theory, SIAM J. Matrix Analysis and Applications, vol. 3, no. 4, pp , 200. [7] D. Gleich and L.-H. Li, Rank aggregation via nuclear nor iniization, in Proc. ACM SIGKDD Int. Conf. on Knowledge, Discovery, and Data Mining (KDD), San Diego, CA, Aug. 20. [8] D. Gross, Recovering low-rank atrices fro few coefficients in any basis, IEEE Trans. Infor. Theory, vol. 57, no. 3, pp , 20. [9] E. Candès and B. Recht, Exact atrix copletion via convex optiization, Found. Coput. Math., vol. 9, no. 6, pp , [0] E. Candès and Y. Plan, Matrix copletion with noise, Proc. IEEE, vol. 98, no. 6, pp , 200. [] B. Recht, A sipler approach to atrix copletion, J. Machine Learning Research, vol. 2, pp , 20. [2] S. Negahban and M. Wainwright, Restricted strong convexity and weighted atrix copletion: Optial bounds with noise, J. Machine Learning Research, vol. 3, pp , 202.

1-Bit matrix completion

Information and Inference: A Journal of the IMA 2014) 3, 189 223 doi:10.1093/imaiai/iau006 Advance Access publication on 11 July 2014 1-Bit matrix completion Mark A. Davenport School of Electrical and