
Computational and statistical tradeoffs via convex relaxation

Venkat Chandrasekaran^a and Michael I. Jordan^b

^a Departments of Computing and Mathematical Sciences and Electrical Engineering, California Institute of Technology, Pasadena, CA 91125; and ^b Departments of Statistics and Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720

Contributed by Michael I. Jordan, February 5, 2013 (sent for review November 7, 2012)

Modern massive datasets create a fundamental problem at the intersection of the computational and statistical sciences: how to provide guarantees on the quality of statistical inference given bounds on computational resources, such as time or space. Our approach to this problem is to define a notion of algorithmic weakening, in which a hierarchy of algorithms is ordered by both computational efficiency and statistical efficiency, allowing the growing strength of the data at scale to be traded off against the need for sophisticated processing. We illustrate this approach in the setting of denoising problems, using convex relaxation as the core inferential tool. Hierarchies of convex relaxations have been widely used in theoretical computer science to yield tractable approximation algorithms for many computationally intractable tasks. In the current paper, we show how to endow such hierarchies with a statistical characterization and thereby obtain concrete tradeoffs relating algorithmic runtime to amount of data.

convex geometry | convex optimization | high-dimensional statistics

The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend computer science and statistics.
That classical perspectives from these fields are not adequate to address emerging problems in Big Data is apparent from their sharply divergent nature at an elementary level: In computer science, the growth of the number of data points is a source of complexity that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of simplicity in that inferences are generally stronger and asymptotic results can be invoked. In classical statistics, where one considers the increase in inferential accuracy as the number of data points grows, there is little or no consideration of computational complexity. Indeed, if one imposes the additional constraint, as is prevalent in real-world applications, that a certain level of inferential accuracy must be achieved within a limited time budget, classical theory provides no guidance as to how to design an inferential strategy. [Note that classical statistics contains a branch known as sequential analysis that does discuss methods that stop collecting data points after a target error level has been reached (e.g., ref. ), but this is different from the computational complexity guarantees (the number of steps that a computational procedure requires) that are our focus.] In classical computer science, practical solutions to large-scale problems are often framed in terms of approximations to idealized problems; however, even when such approximations are sought, they are rarely expressed in terms of the coin of the realm of the theory of inference: the statistical risk function. Thus, there is little or no consideration of the idea that computation can be simplified in large datasets because of the enhanced inferential power in the data. In general, in computer science, datasets are not viewed formally as a resource on a par with time and space (such that the more of the resource, the better).
On intuitive grounds, it is not implausible that strategies can be designed to yield monotonically improving risk as data accumulate, even in the face of a time budget. In particular, if an algorithm simply ignores all future data once a time budget is exhausted, statistical risk will not increase (under various assumptions that may not be desirable in practical applications). Alternatively, one might allow linear growth in the time budget (e.g., in a real-time setting) and attempt to achieve such growth via a subsampling strategy in which some fraction of the data are dropped. Executing such a strategy may be difficult, however, in that the appropriate fraction depends on the risk function, and thus on a mathematical analysis that may be difficult to carry out. Moreover, subsampling is a limited strategy for controlling computational complexity. More generally, one would like to consider some notion of algorithm weakening, where as data accumulate, one can back off to simpler algorithmic strategies that nonetheless achieve a desired risk. The challenge is to do this in a theoretically sound manner.

We base our approach to this problem on the notion of a time-data complexity class. In particular, we define a class TD(t(p), n(p), ε(p)) of parameter estimation problems in which a p-dimensional parameter underlying an unknown population can be estimated with a risk of ε(p), given n(p) independent and identically distributed (i.i.d.) samples, using an inference procedure with runtime t(p). Our definition parallels the definition of the time-space (TISP) complexity class in computational complexity theory for describing algorithmic tradeoffs between time and space resources (). In this formalization, classical results in estimation theory can be viewed as emphasizing the tradeoffs between the second and third parameters (amount of data and risk). Our focus in this paper is to fix ε(p) to some desired level of accuracy and to investigate the tradeoffs between the first two parameters, namely, runtime and dataset size.
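Before any sophisticated processing enters the picture, the basic currency of the tradeoff above is that risk falls as data accumulate. The following Monte Carlo sketch (not from the paper; names and parameter choices are illustrative) checks this for the simplest estimator, the unconstrained sample mean, whose risk scales as σ²p/n:

```python
import numpy as np

def sample_mean_risk(p, n, sigma, trials=200, seed=0):
    """Monte Carlo estimate of E||x* - ybar||^2 for the sample mean of
    n i.i.d. observations y_i = x* + sigma * z_i (here x* = 0, z_i ~ N(0, I_p))."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        y = sigma * rng.standard_normal((n, p))  # observations around x* = 0
        ybar = y.mean(axis=0)                    # data aggregation step
        errs.append(float(np.sum(ybar ** 2)))    # squared l2 error
    return float(np.mean(errs))

# Risk decays like sigma^2 * p / n: data act as a resource, with accuracy
# purchased directly by sample size even with no further processing.
risk_small_n = sample_mean_risk(p=100, n=10, sigma=1.0)
risk_large_n = sample_mean_risk(p=100, n=100, sigma=1.0)
```

Here `risk_small_n` is close to 100/10 = 10 and `risk_large_n` close to 100/100 = 1, matching the σ²p/n scaling.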
Although classical statistics gave little consideration to computational complexity, computational issues have come increasingly to the fore in modern high-dimensional statistics (3), where the number of parameters p is relatively large and the number of data points n is relatively small. In this setting, methods based on convex optimization have been emphasized (particularly methods based on ℓ1 penalties). This is due, in part, to the favorable analytical properties of convex functions and convex sets, but also to the fact that such methods tend to have favorable computational scaling. However, the treatment of computation has remained informal, with little attempt to characterize tradeoffs between computation time and estimation quality. In our work, we aim explicitly at such tradeoffs in the setting in which both n and p are large.

Significance

The growth in the size and scope of datasets in science and technology has created a need for foundational perspectives on data analysis that blend computer science and statistics. Specifically, the core challenge with massive datasets is that of guaranteeing improved accuracy of an analysis procedure as data accrue, even in the face of a time budget. We address this problem via a notion of algorithmic weakening, whereby as data scale, the procedure backs off to cheaper algorithms, leveraging the growing inferential strength of the data to ensure that a desired level of accuracy is achieved within the computational budget.

Author contributions: V.C. and M.I.J. designed research, performed research, and wrote the paper. The authors declare no conflict of interest. Freely available online through the PNAS open access option. To whom correspondence should be addressed. E-mail: jordan@eecs.berkeley.edu. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1302293110/-/DCSupplemental.

PNAS | Published online March 2013 | E1181–E1190

To develop a notion of algorithm weakening that combines computational and statistical considerations, we consider estimation procedures for which we can characterize the computational benefits, as well as the loss in estimation performance, due to the use of weaker algorithms. Reflecting the fact that the space of all algorithms is poorly understood, we retain the focus on convex optimization from high-dimensional statistics, but we consider parameterized hierarchies of optimization procedures in which a form of algorithm weakening is obtained by using successively weaker outer approximations to convex sets. Such convex relaxations have been widely used to give efficient approximation algorithms for intractable problems in computer science (4). As we will discuss, a precise characterization of both the estimation performance and the computational complexity of using a particular relaxation of a convex set can be obtained by appealing to convex geometry and to results on the complexity of solving convex programs. Specifically, the tighter relaxations in these families offer better approximation quality (and better estimation performance in our context) but are computationally more complex. On the other hand, the weaker relaxations are computationally more tractable and can provide the same estimation performance as the tighter ones, but with access to more data. In this manner, convex relaxations provide a principled mechanism for weakening inference algorithms so as to reduce the runtime in processing larger datasets.

To demonstrate explicit tradeoffs in high-dimensional, large-scale inference, we focus, for simplicity and concreteness, on estimation in sequence models (5):

    y = x* + σz,  [1]

where σ > 0, the noise vector z ∈ R^p is standard normal, and the unknown parameter x* belongs to a known subset S ⊆ R^p.
The objective is to estimate x* based on n independent observations {y_i}_{i=1}^n of y. This denoising setup has a long history and has been at the center of some remarkable results in the high-dimensional setting over the past two decades, beginning with the papers of Donoho and Johnstone (6, 7). The estimators discussed next proceed by first computing the sample mean ȳ = (1/n) Σ_{i=1}^n y_i and then using ȳ as input to a suitable convex program. Of course, this is equivalent to a denoising problem in which the noise variance is σ²/n and we are given just one sample. The reason why we consider the more elaborate two-step procedure is to account more accurately both for data aggregation and for subsequent processing in our runtime calculations. Indeed, in a real-world setting, one is typically faced with a massive dataset in unaggregated form, and when both p and n may be large, summarizing the data before any further processing can itself be an expensive computation. As will be seen in concrete calculations of time-data tradeoffs, the number of operations corresponding to data aggregation is sometimes comparable to, or even larger than, the number of operations required for subsequent processing in a massive data setting.

To estimate x*, we consider the following natural shrinkage estimator, given by a projection of the sample mean ȳ onto a convex set C that is an outer approximation to S (i.e., S ⊆ C):

    x̂_n(C) = argmin_{x ∈ R^p} ‖ȳ − x‖²_{ℓ2}  s.t.  x ∈ C.  [2]

We study the estimation performance of a family of shrinkage estimators {x̂_n(C_i)} that use as the convex constraint one of a sequence of convex outer approximations {C_i} with C_1 ⊇ C_2 ⊇ ⋯ ⊇ S. Given the same number of samples, using a weaker relaxation, such as C_1, leads to an estimator with a larger risk than would result from using a tighter relaxation, such as C_2. On the other hand, given access to more data samples, the weaker approximations provide the same estimation guarantees as the tighter ones.
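As a concrete instance of the estimator [2] (an illustration, not an example worked in the paper): when C is the scaled ℓ1 ball, the Euclidean projection of the sample mean can be computed directly by the standard sort-and-threshold algorithm, with no general-purpose convex solver. A minimal numpy sketch, in which the function names are our own:

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {x : ||x||_1 <= radius},
    via the standard sort-based soft-thresholding algorithm."""
    if np.abs(v).sum() <= radius:
        return v.copy()  # already feasible
    u = np.sort(np.abs(v))[::-1]            # magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.max(ks[u - (css - radius) / ks > 0])
    theta = (css[rho - 1] - radius) / rho   # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def shrinkage_estimator(samples, radius=1.0):
    """x_hat_n(C) of [2] for C = {x : ||x||_1 <= radius}:
    aggregate the data into the sample mean, then project onto C."""
    ybar = samples.mean(axis=0)
    return project_l1_ball(ybar, radius)
```

The two-step structure (aggregation, then projection) mirrors the runtime accounting discussed above: the mean costs O(np) operations and the projection O(p log p).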
In settings in which computing a weaker approximation is more tractable than computing a tighter one, a natural computation/sample tradeoff arises. We characterize this tradeoff in a number of stylized examples, motivated by problems such as collaborative filtering, learning an ordering of a collection of random variables, and inference in networks. More broadly, this paper highlights the role of computation in estimation by jointly studying both the computational and statistical aspects of high-dimensional inference. Such an understanding is of particular interest in modern inferential tasks in data-rich settings. Furthermore, an observation from our examples on time-data tradeoffs is that, in many contexts, one does not need too many extra data samples to go from a computationally inefficient estimator based on a tight relaxation to an extremely efficient estimator based on a weaker relaxation. Consequently, in application domains in which obtaining more data is not too expensive, it may be preferable to acquire more data, with the upshot being that the computational infrastructure can be relatively less sophisticated. We should note that we investigate only one algorithm-weakening mechanism, namely, convex relaxation, and one class of statistical estimation problems, namely, denoising in a high-dimensional sequence model. There is reason to believe, however, that the principles described in this paper are relevant more generally. Convex optimization-based procedures are used in a variety of large-scale data analysis tasks (3, 8), and it is likely to be interesting to explore hierarchies of convex relaxations in such tasks. In addition, there are a number of potentially interesting mechanisms beyond convex relaxation for weakening inference procedures, such as dimensionality reduction or other forms of data quantization, and approaches based on clustering or coresets. We discuss these and other research directions in Conclusions.
Related Work

A number of papers have considered computational and sample complexity tradeoffs in the setting of learning binary classifiers. Specifically, several authors have described settings under which speedups in the running time of a classifier learning algorithm are possible given a substantial increase in dataset size (9–12). In contrast, in the denoising setup considered in this paper, several of our examples of time-data tradeoffs demonstrate significant computational speedups with just a constant factor increase in dataset size. Another attempt in the binary classifier learning setting, building on earlier work on classifier learning in data-rich problems (13), has shown that modest improvements in runtime (of constant factors) may be possible with access to more data by using the stochastic gradient descent method (14). Time-data tradeoffs have also been characterized in Boolean network training from time series data (15), but the computational speedups offered there are from exponential-time algorithms to slightly faster but still exponential-time algorithms. Two recent papers (16, 17) have considered time-data tradeoffs in sparse principal component analysis (PCA) (18) and in biclustering, in which one wishes to estimate the support of a sparse eigenvector that contains most of the energy of a matrix. We also study time-data tradeoffs for this problem, but from a denoising perspective rather than one of estimating the support of the leading sparse eigenvector. In our discussion of Example 3 (Time-Data Tradeoffs), we describe the differences between our problem setup and these latter two papers (16, 17). Finally, a recent paper (19) studies time-data tradeoffs in model selection problems by investigating procedures that operate within a computational budget. As a general contrast to all these previous results, a major contribution of the present paper is the demonstration of the efficacy of convex relaxation as a powerful algorithm-weakening mechanism for processing massive datasets in a broad range of settings.
Paper Outline

The main sections of this paper proceed in the following sequence. The next section describes a framework for formally stating results on time-data tradeoffs. We then provide some background on convex optimization and relaxations of convex sets. Following this, we investigate in detail the denoising problem [1] and characterize the risk obtained when one employs a convex programming estimator of the type [2]. Subsequently, we give several examples of time-data tradeoffs in concrete denoising problems. Finally, we conclude with a discussion of directions for further research.

Formally Stating Time-Data Tradeoffs

In this section, we describe a framework in which to state results on computational and statistical tradeoffs in estimation. Our discussion is relevant to general parameter estimation problems and inference procedures; one may keep in mind the denoising problem [1] for concreteness. Consider a sequence of estimation problems indexed by the dimension p of the parameter to be estimated. Fix a risk function ε(p) that specifies the desired error of an estimator. For example, in the denoising problem [1], the error of an estimator of the form [2] may be specified as the worst-case mean squared error taken over all elements of the set S (i.e., sup_{x* ∈ S} E[‖x* − x̂_n(C)‖²_{ℓ2}]). One can informally view an estimation algorithm that achieves a risk of ε(p) by processing n(p) samples with runtime t(p) as a point on a two-dimensional plot as shown in Fig. 1, with one axis representing the runtime and the other representing the sample complexity. To be precise, the axes in the plot index functions (of p) that represent runtime and number of samples, but we do not emphasize such formalities and rather use these plots to provide a useful qualitative comparison of inference algorithms. In Fig. 1, procedure A requires fewer samples than procedure C to achieve the same error, but this reduction in sample complexity comes at the expense of a larger runtime. Procedure B has both a larger sample complexity and a larger runtime than procedure C; thus, it is strictly dominated by procedure C. Given an error function ε(p), there is a lower bound on the number of samples n(p) required to achieve this error using any computational procedure [i.e., no constraints on t(p)]; such information-theoretic or minimax risk lower bounds correspond to vertical lines in the plot in Fig. 1.
Characterizing these fundamental limits on sample complexity has been a traditional focus in the estimation theory literature, with a fairly complete set of results available in many settings. One can imagine asking for similar lower bounds on the computational side, corresponding to horizontal lines in the plot in Fig. 1: Given a desired risk ε(p) and access to an unbounded number of samples, what is a nontrivial lower bound on the runtime t(p) of any inference algorithm that achieves a risk of ε(p)? Such complexity-theoretic lower bounds are significantly harder to obtain, and they remain a central open problem in computational complexity theory.

Fig. 1. Tradeoff between the runtime and sample complexity in a stylized parameter estimation problem. Here, the risk is assumed to be fixed to some desired level, and the points in the plot refer to different algorithms that require a certain runtime and a certain number of samples to achieve the desired risk. The vertical and horizontal lines refer to lower bounds in sample complexity and in runtime, respectively.

This research landscape informs the qualitative nature of the statements on time-data tradeoffs we make in this paper. First, we will not attempt to prove combined lower bounds involving n(p) and t(p) jointly, as is traditionally done in the characterization of tradeoffs between physical quantities; this is because obtaining a lower bound just on t(p) remains a substantial challenge. Hence, our time-data tradeoff results on the use of more efficient algorithms for larger datasets refer to a reduction in the upper bounds on runtimes of estimation procedures with increases in dataset size. Second, in any setting in which there is a computational cost associated with touching each data sample and in which the samples are exchangeable, there is a sample threshold beyond which it is computationally more efficient to throw away excess data samples than to process them in any form. This observation suggests that there is a floor, as in Fig. 1
with procedures E, F, G, and H, beyond which additional data do not lead to a reduction in runtime. Precisely characterizing this sample threshold is generally very hard because it depends on difficult-to-obtain computational lower bounds for estimation tasks, as well as on the particular space of estimation algorithms that one may use. We comment further on this point when we consider concrete examples of time-data tradeoffs.

To state our results concerning time-data tradeoffs formally, we define a resource class constrained by runtime and sample complexity as follows.

Definition 1: Consider a sequence of parameter estimation problems indexed by the dimension p of the space of parameters that index an underlying population. This sequence of estimation problems belongs to a time-data class TD(t(p), n(p), ε(p)) if there exists an inference procedure for the sequence of problems with runtime upper-bounded by t(p), with the number of i.i.d. samples processed bounded by n(p), and which achieves a risk bounded by ε(p).

We note that our definition of a time-data resource class parallels the time-space resource classes considered in complexity theory (). In that literature, TISP(t(p), s(p)) denotes a class of problems of input size p that can be solved by some algorithm using t(p) operations and s(p) units of space. With this formalism, classical minimax bounds can be stated as follows. Given some function n(p) for the number of samples, suppose a parameter estimation problem has a minimax risk of ε_minimax(p) [which depends on the function n(p)]. If an estimator achieving a risk of ε_minimax(p) is computable with runtime t(p), this estimation problem then lies in TD(t(p), n(p), ε_minimax(p)). Thus, the emphasis is fundamentally on the relationship between n(p) and ε_minimax(p), without much focus on the computational procedure that achieves the minimax risk bound.
Our interest in this paper is to fix the risk ε(p) = ε_desired(p) to be equal to some desired level of accuracy and to investigate the tradeoffs between t(p) and n(p) so that a parameter estimation problem lies in TD(t(p), n(p), ε_desired(p)).

Convex Relaxation

In this section, we describe the particular algorithmic toolbox on which we focus, namely, convex programs. Convex optimization methods offer a powerful framework for statistical inference owing to the broad class of estimators that can be effectively modeled as convex programs. Furthermore, the theory of convex analysis is useful both for characterizing the statistical properties of convex programming-based estimators and for developing methods to compute such estimators efficiently. Most importantly from our viewpoint, convex optimization methods provide a principled and general framework for algorithm weakening based on relaxations of convex sets. We briefly discuss in this section the key ideas from this literature that are relevant to this paper. A notion central to the geometric viewpoint adopted here is that of a convex cone, which is a convex set that is closed under nonnegative linear combinations.

Representation of Convex Sets. Convex programs refer to a class of optimization problems in which we seek to minimize a convex

function over a convex constraint set (8). For example, linear programming (LP) and semidefinite programming (SDP) are two prominent subclasses in which linear functions are minimized over constraint sets given by affine spaces intersecting the nonnegative orthant (in LP) and the positive semidefinite cone (in SDP). Roughly speaking, convex programs are tractable to solve computationally if the convex objective function can be computed efficiently and if membership of an arbitrary point in the convex constraint set can be certified efficiently; we will informally refer to this latter operation as computing the convex constraint set. (More precisely, one requires an efficient separation oracle that responds YES if the point is in the convex set and otherwise provides a hyperplane that separates the point from the convex set.) It is then clear that the main computational bottleneck associated with solving convex programs of the form [2] is the efficiency of computing the constraint sets. A central insight from the literature on convex optimization is that the complexity of computing a convex set is closely linked to how efficiently the set can be represented. Specifically, if a convex set can be expressed as the intersection of a small number of basic or elementary convex sets, each of which is tractable to compute, then the original convex set is also tractable to compute, and one can, in turn, optimize over this set efficiently. Examples of basic convex sets include affine spaces or cones, such as the nonnegative orthant and the cone of positive semidefinite matrices. Indeed, a canonical method with which to represent a convex set is to express the set as the intersection of a cone and an affine space. In what follows, we consider such conic representations of convex sets in R^p.

Definition 2: Let C ⊆ R^p be a convex set, and let K ⊆ R^p be a convex cone.
C is then said to be K-representable if C can be expressed as follows for A ∈ R^{m×p}, b ∈ R^m:

    C = {x | x ∈ K, Ax = b}.  [3]

Such a representation of C is called a K-representation. Informally, if K is the nonnegative orthant (or the semidefinite cone), we refer to the resulting representations as LP representations (or SDP representations), following commonly used terminology in the literature. A virtue of conic representations of convex sets based on the orthant or the semidefinite cone is that these representations lead to a numerical recipe for solving convex optimization problems of the form [2] via a natural associated barrier penalty (). The computational complexity of these procedures is polynomial in the dimension of the cone, and we discuss runtimes for specific instances in our discussion of concrete examples of time-data tradeoffs.

Example 1: The (p − 1)-dimensional simplex is an example of an LP-representable set:

    Δ_{p−1} = {x | 1′x = 1, x ≥ 0},  [4]

where 1 ∈ R^p is the all-ones vector. The simplex is the set of probability vectors in R^p. The next example is an SDP-representable set that is commonly encountered both in optimization and in statistics.

Example 2: The elliptope, or the set of correlation matrices, in the space of m × m symmetric matrices is defined as follows:

    E_{m×m} = {X | X ⪰ 0, X_ii = 1 ∀ i}.  [5]

Conic representations are somewhat limited in their modeling capacity, and an important generalization is obtained by considering lifted representations. In particular, the notion of lift-and-project plays a critical role in many examples of efficient representations of convex sets. The lift-and-project concept is simple: We wish to express a convex set C ⊆ R^p as the projection of a convex set C′ ⊆ R^{p′} in some higher dimensional space (i.e., p′ > p). The complexity of solving the associated convex programs is now a function of the lifting dimension p′. Thus, lift-and-project techniques are useful if p′ is not too much larger than p and if C′ has an efficient representation in the higher dimensional space R^{p′}.
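To make the LP-representation idea in Example 1 concrete, the sketch below (an illustration of ours, not a computation from the paper) minimizes a linear function over the simplex by handing its conic description — one equality constraint plus the nonnegative orthant — directly to an off-the-shelf LP solver:

```python
import numpy as np
from scipy.optimize import linprog

# Minimize <c, x> over the probability simplex
# Delta = {x : 1'x = 1, x >= 0}, an LP-representable set [4].
c = np.array([2.0, -1.0, 0.5])
res = linprog(
    c,
    A_eq=np.ones((1, 3)),          # 1'x = 1  (affine part of the representation)
    b_eq=[1.0],
    bounds=[(0, None)] * 3,        # x >= 0   (conic part: the nonnegative orthant)
)
# A linear function over a polytope is minimized at a vertex; for the
# simplex the vertices are the standard basis vectors, so the solver
# returns the basis vector selecting the smallest entry of c.
```

Here `res.x` is (0, 1, 0) and `res.fun` equals −1, the smallest entry of `c`.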
Lift-and-project provides a very powerful representation tool, as seen in the following example.

Example 3: The cross-polytope is the unit ball of the ℓ1-norm:

    B_{ℓ1} = { x ∈ R^p | Σ_i |x_i| ≤ 1 }.

The ℓ1-norm has been the focus of much attention recently in statistical model selection and feature selection due to its sparsity-inducing properties (, ). Although the cross-polytope has only 2p vertices, a direct specification in terms of linear constraints involves 2^p inequalities:

    B_{ℓ1} = { x ∈ R^p | Σ_i z_i x_i ≤ 1 ∀ z ∈ {−1, +1}^p }.

However, we can obtain a tractable representation by lifting to R^{2p} and then projecting onto the first p coordinates:

    B_{ℓ1} = { x ∈ R^p | ∃ z ∈ R^p s.t. −z_i ≤ x_i ≤ z_i ∀ i, Σ_i z_i ≤ 1 }.

Note that in R^{2p}, with the additional variables z, we have only 2p + 1 inequalities. Another example of a polytope that requires many inequalities in a direct description is the permutahedron (23), the convex hull of all the permutations of the vector [1, ..., p] ∈ R^p. In fact, the permutahedron requires exponentially many linear inequalities in a direct description, whereas a lifted representation involves O(p log(p)) additional variables and about O(p log(p)) inequalities in the higher dimensional space (24). We refer the reader to the literature on conic representations for other examples (ref. 25 and references therein), including lifted semidefinite representations.

Hierarchies of Convex Relaxations. In many cases of interest, convex sets may not have tractable representations. Lifted representations in such cases have lifting dimensions that are superpolynomially large in the dimension of the original convex set; thus, the associated numerical techniques lead to intractable computational procedures that have superpolynomial runtime with respect to the dimension of the original set. A prominent example of a convex set that is difficult to compute is the cut polytope:

    CUT_{m×m} = conv{ mm′ | m ∈ {−1, +1}^m }.  [6]

Rank-one signed matrices and their convex combinations are of interest in collaborative filtering and clustering problems (see Time-Data Tradeoffs).
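The lifted description of the cross-polytope in Example 3 can be exercised directly: the sketch below (our own illustration; the function name is hypothetical) optimizes a linear function over B_{ℓ1} using the 2p + 1 lifted inequalities in place of the 2^p direct ones:

```python
import numpy as np
from scipy.optimize import linprog

def minimize_over_l1_ball(c):
    """Minimize <c, x> over B_l1 = {x : sum_i |x_i| <= 1} via the lifted
    LP representation with variables (x, z):
        x_i - z_i <= 0,  -x_i - z_i <= 0  (for all i),  sum_i z_i <= 1,
    i.e., 2p + 1 inequalities rather than the 2^p of a direct description."""
    p = len(c)
    obj = np.concatenate([c, np.zeros(p)])       # z does not enter the objective
    I = np.eye(p)
    A_ub = np.block([[I, -I],                    #  x - z <= 0
                     [-I, -I],                   # -x - z <= 0
                     [np.zeros((1, p)), np.ones((1, p))]])  # sum z <= 1
    b_ub = np.zeros(2 * p + 1)
    b_ub[-1] = 1.0
    bounds = [(None, None)] * p + [(0, None)] * p
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:p]                             # project onto the first p coordinates

# The minimizer is a vertex of the cross-polytope: the signed unit vector
# at the largest-magnitude entry of c, with sign opposite to that entry.
```

For example, with c = (1, −3, 2) the minimizer is the vertex (0, 1, 0), achieving the value −max_i |c_i| = −3.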
There is no known tractable representation of the cut polytope; lifted linear or semidefinite representations have lifting dimensions that are superpolynomial in size. Such computational issues have led to a large literature on approximating intractable convex sets by tractable ones. For the purposes of this paper, and following the dominant trend in the literature, we focus on outer approximations. For example, the elliptope [5] is an outer relaxation of the cut polytope, and it has been used in approximation algorithms for intractable combinatorial optimization problems, such as finding the maximum-weight cut in a graph (26). More generally, one can imagine a hierarchy of increasingly tighter approximations {C_i} of a convex set C:

    C ⊆ ⋯ ⊆ C_3 ⊆ C_2 ⊆ C_1.

There exist several mechanisms for deriving such hierarchies, and we describe three frameworks here. In the first framework, which was developed by Sherali and Adams (27), the set C is assumed to be polyhedral, and each element of the family {C_i} is also polyhedral. Specifically, each C_i is expressed via a lifted LP representation. Tighter approximations are obtained by resorting to larger sized lifts, such that the lifting dimension increases with the level i in the hierarchy. The second framework is similar in spirit to the first one, but the set C is now a convex basic, closed semialgebraic set and the approximations {C_i} are given by lifted SDP representations. [A basic, closed semialgebraic set is the collection of solutions of a

system of polynomial equations and polynomial inequalities (28).] Again, the lifting dimension increases with the level i in the hierarchy. This method was initially pioneered by Parrilo (29, 30) and by Lasserre (31), and it was studied in greater detail subsequently by Gouveia et al. (32). The first and second frameworks are similar in spirit in that tighter approximations are obtained via lifted representations with successively larger lifting dimensions. The third framework that we mention here is qualitatively different from the first two. Suppose C is a convex set that has a K-representation; by successively weakening the cone K itself, one obtains increasingly weaker approximations to C. Specifically, we consider the setting in which the cone K is a hyperbolicity cone (33). Such cones have rich geometric and algebraic structure, and their boundary is given in terms of the vanishing of hyperbolic polynomials. They include the orthant and the semidefinite cone as special cases. We do not go into further technical details and formal definitions of these cones here; instead, we refer the interested reader to the work of Renegar (33). The main idea is that one can obtain a family of relaxations {K_i} of a hyperbolicity cone K ⊆ R^p, where each K_i is a convex cone (in fact, hyperbolic) and is a subset of R^p:

    K ⊆ ⋯ ⊆ K_3 ⊆ K_2 ⊆ K_1.  [7]

These outer conic approximations are obtained by taking certain derivatives of the hyperbolic polynomial used to define the original cone K (more details are provided in ref. 33). One then constructs a hierarchy of approximations {C_i} to C by replacing the cone K in the representation of C by the family of conic approximations {K_i}. From [3] and [7], it is clear that the approximations {C_i} so defined satisfy C ⊆ ⋯ ⊆ C_3 ⊆ C_2 ⊆ C_1.
The important point in these three frameworks is that the family of approximations {C_i} obtained in each case is ordered both by approximation quality and by computational complexity; that is, the weaker approximations in the hierarchy are also the ones that are more tractable to compute. This observation leads to an algorithm-weakening mechanism that is useful for processing larger datasets more coarsely. As demonstrated concretely in the next section, the estimator [2] based on a weaker approximation to C can provide the same statistical performance as one based on a stronger approximation to C, provided that the former estimator is evaluated with more data. The upshot is that the first estimator is more tractable to compute than the second. Thus, we obtain a technique for reducing the runtime required to process a larger dataset.

Estimation via Convex Optimization

In this section, we investigate the statistical properties of the estimator [2] for the denoising problem [1]. The signal set S in [1] differs based on the application of interest. For example, S may be the set of sparse vectors in a fixed basis, which could correspond to the problem of denoising sparse vectors in wavelet bases (6). The signal set S may be the set of low-rank matrices, which leads to problems of collaborative filtering (34). Finally, S may be a set of permutation matrices corresponding to rankings over a collection of items. Our analysis in this section is general, and it is applicable to these and other settings (concrete examples are provided in Time-Data Tradeoffs). In some denoising problems, one is interested in noise models other than Gaussian. We comment on the performance of the estimator [2] in settings with non-Gaussian noise, although we primarily focus on the Gaussian case for simplicity.

Convex Programming Estimators. To analyze the performance of the estimator [2], we introduce a few concepts from convex analysis (35).
Given a closed convex set C ⊆ R^p and a point a ∈ C, we define the tangent cone at a with respect to C as

T_C(a) = cone{b − a | b ∈ C}.   [8]

Here, cone(·) refers to the conic hull of a set, obtained by taking nonnegative linear combinations of elements of the set. The cone T_C(a) is the set of directions to points in C from the point a. The polar K* ⊆ R^p of a cone K ⊆ R^p is the cone

K* = {h ∈ R^p | ⟨h, d⟩ ≤ 0 for all d ∈ K}.

The normal cone N_C(a) at a with respect to the convex set C is the polar cone of the tangent cone T_C(a):

N_C(a) = T_C(a)*.   [9]

Thus, the normal cone consists of vectors that form an obtuse angle with every vector in the tangent cone T_C(a). Both the tangent and normal cones are convex cones. A key quantity that will appear in our error bounds is the following notion of the complexity or size of a tangent cone.

Definition 3: The Gaussian squared-complexity of a set D ⊆ R^p is defined as

g(D) = E[ sup_{a ∈ D} ⟨a, g⟩² ],

where the expectation is with respect to g ∼ N(0, I_{p×p}). This quantity is closely related to the Gaussian complexity of a set (36, 37), which involves no squaring of the term inside the expectation. The Gaussian squared-complexity shares many properties in common with the Gaussian complexity, and we describe those that are relevant to this paper in the next subsection. Specifically, we discuss methods to estimate this quantity for sets D that have some structure. With these definitions, and letting B^p_{ℓ2} denote the ℓ2 ball in R^p, we have the following result on the error between x̂_n(C) and x*.

Proposition 4. For x* ∈ S ⊆ R^p and with C ⊆ R^p convex such that S ⊆ C, we have the error bound

E‖x* − x̂_n(C)‖²_{ℓ2} ≤ (σ²/n) g( T_C(x*) ∩ B^p_{ℓ2} ).

Proof: We have that ȳ = x* + (σ/√n) z. We begin by establishing a bound that is derived by conditioning on z = z̃. Subsequently, taking expectations concludes the proof. We have from the optimality conditions (35) of the convex program [] that

x* + (σ/√n) z̃ − x̂_n(C)|_{z=z̃} ∈ N_C( x̂_n(C)|_{z=z̃} ).

Here, x̂_n(C)|_{z=z̃} represents the optimal value of [] conditioned on z = z̃.
Because x* ∈ S ⊆ C, we have that x* − x̂_n(C)|_{z=z̃} ∈ T_C( x̂_n(C)|_{z=z̃} ). Because the normal and tangent cones are polar to each other, we have that

⟨ x* + (σ/√n) z̃ − x̂_n(C)|_{z=z̃} , x* − x̂_n(C)|_{z=z̃} ⟩ ≤ 0.

It then follows that

‖x* − x̂_n(C)|_{z=z̃}‖²_{ℓ2} ≤ (σ/√n) ⟨ x̂_n(C)|_{z=z̃} − x* , z̃ ⟩
  = (σ/√n) ‖x̂_n(C)|_{z=z̃} − x*‖_{ℓ2} ⟨ (x̂_n(C)|_{z=z̃} − x*) / ‖x̂_n(C)|_{z=z̃} − x*‖_{ℓ2} , z̃ ⟩
  ≤ (σ/√n) ‖x̂_n(C)|_{z=z̃} − x*‖_{ℓ2} sup_{d ∈ T_C(x*), ‖d‖_{ℓ2} ≤ 1} ⟨ d, z̃ ⟩.

Dividing both sides by ‖x̂_n(C)|_{z=z̃} − x*‖_{ℓ2}, squaring both sides, and finally taking expectations completes the proof.

Note that the basic structure of the error bound provided by the estimator [] in fact holds for an arbitrary distribution on the noise z, with the Gaussian squared-complexity suitably modified. However, we focus for the rest of this paper on the Gaussian case, z ∼ N(0, I_{p×p}). To summarize in words, the mean squared error is bounded by the noise variance times the Gaussian squared-complexity of the normalized tangent cone with respect to C at the true parameter x*. Essentially, it measures the amount of noise restricted to the tangent cone, which is intuitively reasonable because only the noise that moves one away from x* in a feasible direction in C must contribute toward the error. Therefore, if the convex constraint set C is sharp at x* so that the cone T_C(x*) is narrow, the error is then small. At the other extreme, if the constraint set C = R^p, the error is then σ²p/n, as one would expect.

STATISTICS | PNAS PLUS — Chandrasekaran and Jordan — PNAS, published online March 2013.

Although Proposition 4 is useful and indeed will suffice for the purposes of demonstrating time-data tradeoffs in the next section, there are a couple of shortcomings in the result as stated. First, suppose the signal set S is contained in a ball around the origin, with the radius of the ball being small relative to the noise variance σ²/n. In such a setting, the estimator x̂ = 0 leads to a smaller mean squared error than one would obtain from Proposition 4. Second, and somewhat more subtly, suppose that one does not have a perfect bound on the size of the signal set. For concreteness, consider a setting in which S is a set of sparse vectors with bounded ℓ1 norm, in which case a good choice for the constraint set C in the estimator [] is an appropriately scaled ℓ1 ball. However, if we do not know the ℓ1 norm of x* a priori, we may then end up using a constraint set C such that x* does not belong to C (hence, x* is an infeasible solution) or such that x* lies strictly in the interior of C [hence, T_C(x*) is all of R^p]. Both these situations are undesirable because they limit the applicability of Proposition 4 and provide very loose error bounds. The following result addresses these shortcomings by weakening the assumptions of Proposition 4.

Proposition 5. Let x* ∈ S ⊆ R^p, and let C ⊆ R^p be a convex set. Suppose there exists a point x̃ ∈ C such that C − x̃ = Q₁ + Q₂, with Q₁, Q₂ lying in orthogonal subspaces of R^p and Q₁ ⊆ αB^p_{ℓ2} for some α ≥ 0.
We then have that

E‖x* − x̂_n(C)‖²_{ℓ2} ≤ 6[ (σ²/n) g( cone(Q₂) ∩ B^p_{ℓ2} ) + ‖x* − x̃‖²_{ℓ2} + α² ].

Here, cone(Q₂) is the conic hull of Q₂. The proof of this result is presented in SI Appendix. A number of remarks are in order here. With respect to the first shortcoming in Proposition 4 stated above, if C is chosen such that S ⊆ C, one can set x̃ = x*, Q₂ = {0}, and Q₁ = C − x̃ in Proposition 5 and can readily obtain a bound that scales only with the diameter of the convex constraint set C. With regard to the second shortcoming in Proposition 4 described above, if a point x̃ ∈ C near x* has a narrow tangent cone T_C(x̃), one can then provide an error bound with respect to g(T_C(x̃)) with an extra additive term that depends on ‖x* − x̃‖_{ℓ2}; this is done by setting Q₂ = C − x̃ and Q₁ = {0} (thus, α = 0) in Proposition 5. More generally, Proposition 5 incorporates both of these improvements in a single error bound with respect to an arbitrary point x̃ ∈ C; thus, one can further optimize the error bound over the choice of x̃ ∈ C (as well as the choice of the decomposition into Q₁ and Q₂).

Properties and Computation of Gaussian Squared-Complexity. We record some properties of the Gaussian squared-complexity that are subsequently useful when we demonstrate concrete time-data tradeoffs. It is clear that g(·) is monotonic with respect to set nesting [i.e., g(D₁) ≤ g(D₂) for sets D₁ ⊆ D₂]. If D is a subspace, one can then check that g(D ∩ B^p_{ℓ2}) = dim(D). To estimate squared-complexities of families of cones, one can imagine appealing to techniques similar to those used for estimating Gaussian complexities of sets (36, 37). Most prominent among these are arguments based on covering number and metric entropy bounds. However, these arguments are frequently not sharp and introduce extraneous log-factors in the resulting error bounds. In a recent paper by Chandrasekaran et al. (38), sharp upper bounds on the Gaussian complexities of normalized cones have been established for families of cones of interest in a class of linear inverse problems.
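The subspace identity and the duality-based computation of g(·) can both be checked by simulation. A quick sketch (dimensions and trial counts are our illustrative choices): for a k-dimensional coordinate subspace intersected with the unit ball, the supremum of ⟨a, g⟩ is the norm of the Gaussian's projection onto the subspace, so the Monte Carlo estimate should concentrate near k; for the nonnegative orthant, the supremum over the normalized cone equals the distance from g to the polar (nonpositive) orthant, whose expected square is p/2.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, trials = 40, 7, 4000

# Monte Carlo estimate of g(D) for D = (k-dim coordinate subspace) ∩ (unit l2 ball):
# sup_{a in D} <a, g> = norm of the projection of g onto the subspace, so g(D) ≈ k.
G = rng.normal(size=(trials, p))
sup_vals = np.linalg.norm(G[:, :k], axis=1)      # projection onto first k coordinates
g_subspace = np.mean(sup_vals ** 2)

# For the nonnegative orthant K, sup_{d in K ∩ B} <d, g> = ||max(g, 0)||,
# which equals dist(g, K*) with K* the nonpositive orthant; E[dist^2] = p/2.
sup_orthant = np.linalg.norm(np.maximum(G, 0.0), axis=1)
g_orthant = np.mean(sup_orthant ** 2)
```

With 4,000 trials, both estimates land within a small fraction of their exact values (k = 7 and p/2 = 20, respectively).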
The (square of the) Gaussian complexity can be upper-bounded by the Gaussian squared-complexity g(D) via Jensen's inequality:

[ E sup_{d ∈ D} ⟨d, g⟩ ]² ≤ g(D),

where g is a standard normal vector. In fact, most of the bounds in the paper by Chandrasekaran et al. (38) were obtained by bounding g(D); thus, they are directly relevant to our setting. In the rest of this section, we present the bounds on g(D) from the paper by Chandrasekaran et al. (38) that will be used in this paper, deferring to that paper for proofs in most cases. In some cases, the proofs do require modifications with respect to their counterparts in the paper by Chandrasekaran et al. (38), and for these cases, we give full proofs in SI Appendix. The first result, proved in ref. 38, is a direct consequence of convex duality and provides a fruitful general technique to compute sharp estimates of Gaussian squared-complexities. Let dist(a, D) denote the ℓ2 distance from a point a to the set D.

Lemma. Let K ⊆ R^p be a convex cone, and let K* ⊆ R^p be its polar. We then have for any a ∈ R^p that

sup_{d ∈ K ∩ B^p_{ℓ2}} ⟨d, a⟩ = dist(a, K*).

Therefore, we have the following result as a simple corollary.

Corollary 6. Let K ⊆ R^p be a convex cone, and let K* ⊆ R^p be its polar. For g ∼ N(0, I_{p×p}), we have that

g( K ∩ B^p_{ℓ2} ) = E[ dist(g, K*)² ].

Based on the duality result of the Lemma and Corollary 6, one can compute the following sharp bounds on the Gaussian squared-complexities of tangent cones with respect to the ℓ1 norm and nuclear norm balls. These are especially relevant when one wishes to estimate sparse signals or low-rank matrices; in these settings, the ℓ1 and nuclear norm balls serve as useful constraint sets for denoising because the tangent cones with respect to these sets at sparse vectors and at low-rank matrices are particularly narrow. Both of these results, proved in ref. 38, are used when we describe time-data tradeoffs.

Proposition 7. Let x ∈ R^p be a vector containing s nonzero entries.
Let T be the tangent cone at x with respect to an ℓ1 norm ball scaled so that x lies on the boundary of the ball (i.e., a scaling of the unit ℓ1 norm ball by a factor ‖x‖_{ℓ1}). Then,

g( T ∩ B^p_{ℓ2} ) ≤ 2s log(p/s) + (5/4)s.

Next, we state a result, proved in ref. 38, for low-rank matrices and the nuclear norm ball.

Proposition 8. Let X ∈ R^{m1×m2} be a matrix of rank r. Let T be the tangent cone at X with respect to a nuclear norm ball scaled so that X lies on the boundary of the ball (i.e., a scaling of the unit nuclear norm ball by a factor equal to the nuclear norm of X). Then,

g( T ∩ B^{m1 m2}_{ℓ2} ) ≤ 3r(m1 + m2 − r).

Next, we state and prove a result that allows us to estimate Gaussian squared-complexities of general cones. The bound is based on the volume of the dual of the cone of interest, and the proof involves an appeal to Gaussian isoperimetry (39). A similar result on Gaussian complexities of cones (without the square) was proved by Chandrasekaran et al. (38), but that result does not directly imply our statement, and we therefore give a complete self-contained proof in SI Appendix. The volume of a cone is assumed to be normalized (between 0 and 1); thus, we consider the relative fraction of a unit Euclidean sphere that is covered by a cone.

Proposition 9. Let K ⊆ R^p be a cone such that its polar K* ⊆ R^p has a normalized volume of μ ≥ 4 exp{−p/20}. For μ ≤ 1/(4e), we have that

g( K ∩ B^p_{ℓ2} ) ≤ 20 log( 1/(4μ) ).

If a cone is narrow, its polar will then be wide, leading to a large value of μ, and hence a small quantity on the right-hand side of the bound. This result leads to bounds on Gaussian squared-complexity in settings in which one can easily obtain estimates of volumes. One setting in which such estimates are easily obtained is the case of tangent cones with respect to vertex-transitive polytopes. We recall that a vertex-transitive polytope (3) is one in which there exists a symmetry of the polytope for each pair of vertices mapping the two vertices isomorphically to each other. Roughly speaking, all the vertices in such polytopes are the same. Some examples include the cross-polytope (the ℓ1 norm ball), the simplex (4), the hypercube (the ℓ∞ norm ball), and many polytopes generated by the action of groups (4). We will see many examples of such polytopes in our examples on time-data tradeoffs; thus, we will appeal to the following corollary repeatedly.

Corollary 10. Suppose that P ⊆ R^p is a vertex-transitive polytope with v vertices, and let x be a vertex of this polytope. If 4e ≤ v ≤ (1/4) exp{p/20}, then

g( T_P(x) ∩ B^p_{ℓ2} ) ≤ 20 log(v/4).

Proof: The normal cones at the vertices of P partition R^p. If the polytope is vertex-transitive, the normal cones are then all equivalent to each other (up to orthogonal transformations). Consequently, the (normalized) volume of the normal cone at any vertex is 1/v. Because the normal cone at a vertex is polar to the tangent cone, we have the desired result from Proposition 9.

Time-Data Tradeoffs

Preliminaries. We now turn our attention to giving examples of time-data tradeoffs in denoising problems via convex relaxation.
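Several of the examples below rest on Corollary 10's counting argument, whose key step is that the normal cones at the v vertices of a vertex-transitive polytope partition R^p with equal normalized volume 1/v. A quick empirical check for the cross-polytope (the ℓ1 ball, with v = 2p vertices ±e_i); the experiment and its parameters are our illustrative choices: a standard Gaussian lands in the normal cone of the vertex whose coordinate has the largest magnitude (with matching sign), so each vertex should be selected with frequency about 1/(2p).

```python
import numpy as np

rng = np.random.default_rng(2)
p, trials = 5, 200000
v = 2 * p                       # number of vertices ±e_i of the cross-polytope

G = rng.normal(size=(trials, p))
# Vertex whose normal cone contains g: argmax_i |g_i|, taken with the sign of g_i.
idx = np.argmax(np.abs(G), axis=1)
sign = np.sign(G[np.arange(trials), idx])
vertex = idx * 2 + (sign > 0)   # encode the 2p vertices as 0 .. 2p-1

counts = np.bincount(vertex.astype(int), minlength=v)
freqs = counts / trials         # each should be close to 1/v
```

All empirical frequencies cluster tightly around 1/v, as the partition-and-symmetry argument predicts.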
As described previously, we must set a desired risk to realize a time-data tradeoff; in the examples in the rest of this section, we will fix the desired risk to be equal to 1, independent of the problem dimension. Thus, these denoising problems belong to the time-data complexity classes TD(t(p), n(p), 1) for different runtime constraints t(p) and sample budgets n(p). The following corollary gives the number of samples required to obtain a mean squared error of 1 via convex optimization in our denoising setup. (In each of our time-data tradeoff examples, the signal sets S and the associated convex relaxations are symmetric so that no point in S is distinguished; therefore, without loss of generality, we compute the risk by considering the mean squared error at an arbitrary point in S rather than taking a supremum over x* ∈ S.)

Corollary 11. For x* ∈ S and with S ⊆ C for a closed, convex set C, if

n ≥ σ² g( T_C(x*) ∩ B^p_{ℓ2} ),

then E[ ‖x* − x̂_n(C)‖²_{ℓ2} ] ≤ 1.

Proof: The result follows by a rearrangement of the terms in the bound in Proposition 4.

This corollary states that if we have access to a dataset with n samples, we can then use any convex constraint set C such that the term on the right-hand side in the corollary is smaller than n. Recalling that larger constraint sets C lead to larger tangent cones T_C, we observe that if n is large, one can potentially use very weak (and computationally inexpensive) relaxations and still obtain a risk of 1. This observation, combined with the important point that the hierarchies of convex relaxations described previously are simultaneously ordered both by approximation quality and by computational tractability, allows us to realize a time-data tradeoff by using convex relaxation as an algorithm-weakening mechanism. A simple demonstration is provided in Fig. We further consider settings with σ = 1 and in which our signal sets S ⊆ R^p consist of elements that have Euclidean norm on the order of √p (measured from the centroid of S).
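The corollary above converts complexity bounds into sample budgets: with σ = 1 and target risk 1, any n at least the squared-complexity of the normalized tangent cone suffices. A sketch of this arithmetic (the helper names are ours; the constants come from the ℓ1 and nuclear-norm tangent-cone bounds of ref. 38, as stated in Propositions 7 and 8):

```python
import math

def g_sparse(p, s):
    """Upper bound on the Gaussian squared-complexity of the tangent cone
    to a scaled l1 ball at a vector with s nonzero entries (cf. Proposition 7)."""
    return 2 * s * math.log(p / s) + 1.25 * s

def g_lowrank(m1, m2, r):
    """Upper bound for the tangent cone to a scaled nuclear norm ball
    at a rank-r matrix of size m1 x m2 (cf. Proposition 8)."""
    return 3 * r * (m1 + m2 - r)

def samples_needed(complexity, sigma2=1.0):
    """Corollary-style rule: n >= sigma^2 * complexity gives mean squared error <= 1."""
    return math.ceil(sigma2 * complexity)

p = 10000
n_sparse = samples_needed(g_sparse(p, s=10))          # 10-sparse vector in R^p
n_lowrank = samples_needed(g_lowrank(100, 100, r=1))  # rank-1 100x100 matrix
```

For these dimensions, a few hundred samples suffice under either constraint set, even though the ambient dimension is 10,000.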
In such regimes, the James–Stein shrinkage estimator (4) offers about the same level of performance as the maximum-likelihood estimator, and both of these are outperformed in statistical risk by nonlinear estimators of the form [] based on convex optimization. Finally, we briefly remark on the runtimes of our estimators. The runtime for each of the procedures below is calculated by adding the number of operations required to compute the sample mean ȳ and the number of operations required to solve [] to some accuracy. Hence, if the number of samples used is n and if f_C(p) denotes the number of operations required to project ȳ onto C, the total runtime is then np + f_C(p). Thus, the number of samples enters the runtime calculations as just an additive term. As we process larger datasets, the first term in this calculation becomes larger, but this increase is offset by a more substantial decrease in the second term due to the use of a computationally tractable convex relaxation. We note that such a runtime calculation extends to more general inference problems in which one employs estimators of the form [] but with different loss functions in the objective; specifically, the runtime is calculated as above so long as the loss function depends only on some sufficient statistic computed from the data. If the loss function is instead of the form Σ_{i=1}^n ℓ(x; y_i) and it cannot be summarized via a sufficient statistic of the data {y_i}_{i=1}^n, the number of samples then enters the runtime computation in a multiplicative manner as θ(n) f_C(p) for some function θ(·).

Example 1: Denoising Signed Matrices. We consider the problem of recovering signed matrices corrupted by noise:

S = { a aᵀ | a ∈ {−1, +1}^{√p} }.

We have a ∈ R^{√p}, so that S ⊆ R^p. Inferring such signals is of interest in collaborative filtering, where one wishes to approximate matrices as the sum of a small number of rank-one signed matrices (34). Such matrices may represent, for example, the movie preferences of users as in the Netflix problem.
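For this signal set, one tractable constraint set discussed below is the nuclear norm ball scaled by √p, onto which projection is an SVD followed by a projection of the singular values onto an ℓ1 ball. A self-contained sketch of the whole denoising pipeline (dimensions, noise level, sample size, and the sort-based projection routine are our illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
sp = 20                      # sqrt(p); signals are sp x sp signed matrices
p, sigma, n = sp * sp, 1.0, 500

# Signal x* = a a^T with a in {-1,+1}^sp, so ||x*||_F = sqrt(p).
a = rng.choice([-1.0, 1.0], size=sp)
x_star = np.outer(a, a)

# Sufficient statistic: sample mean of n noisy observations.
y_bar = x_star + (sigma / np.sqrt(n)) * rng.normal(size=(sp, sp))

def project_l1(v, radius):
    """Euclidean projection of a nonnegative, descending vector onto
    {u >= 0, sum(u) <= radius} (standard sort-based projection)."""
    if v.sum() <= radius:
        return v
    css = np.cumsum(v)
    k = np.arange(1, len(v) + 1)
    rho = np.max(np.where(v - (css - radius) / k > 0)[0])
    theta = (css[rho] - radius) / (rho + 1)
    return np.maximum(v - theta, 0.0)

# Denoise: project y_bar onto the nuclear norm ball of radius sqrt(p).
U, s, Vt = np.linalg.svd(y_bar)
x_hat = U @ np.diag(project_l1(s, np.sqrt(p))) @ Vt

err_raw = np.sum((y_bar - x_star) ** 2)   # about sigma^2 * p / n
err_hat = np.sum((x_hat - x_star) ** 2)   # much smaller: noise in the tangent cone
```

Because x* lies in the constraint set and Euclidean projection onto a convex set is nonexpansive toward its elements, the projected estimate is never worse than the raw average ȳ, and in this regime it is substantially better.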
The tightest convex constraint that one could use in this case is C = conv(S), which is the cut polytope [6]. To obtain a risk of 1 with this constraint, one requires n = c₁√p samples, by applying Corollary 11 and Corollary 10 based on the symmetry of the cut polytope. The cut polytope is generally intractable to compute. Hence, the best-known algorithms to project onto C would require a runtime that is superpolynomial in p. Consequently, the total runtime of this algorithm is c₁p^{1.5} + superpoly(p).

Fig. (Left) Signal set S consisting of x*. (Center) Two convex constraint sets C₁ and C₂, where C₁ is the convex hull of S and C₂ is a relaxation that is more efficiently computable than C₁. (Right) The tangent cone T_{C₁}(x*) is contained inside the tangent cone T_{C₂}(x*). Consequently, the Gaussian squared-complexity g(T_{C₁}(x*) ∩ B_{ℓ2}) is smaller than the complexity g(T_{C₂}(x*) ∩ B_{ℓ2}), so that the estimator x̂_n(C₁) requires fewer samples than the estimator x̂_n(C₂) for a risk of at most 1.

A commonly used tractable relaxation of the cut polytope is the elliptope [5]. By computing the Gaussian squared-complexity of the tangent cones at rank-one signed matrices with respect to this set, it is possible to show that n = c₂√p leads to a risk of 1 (with c₂ > c₁). Furthermore, interior point-based convex optimization algorithms for solving [] that exploit the special structure of the elliptope require O(p^{2.5}) operations (8, 4, 43). (The exponent is a result of the manner in which we define our signal set, so that a rank-one signed matrix lives in R^p.) Hence, the total runtime of this procedure is c₂p^{1.5} + O(p^{2.5}). Finally, an even weaker relaxation of the cut polytope than the elliptope is the unit ball of the nuclear norm scaled by a factor of √p; one can verify that the elements of S lie on the boundary of this set and are, in fact, extreme points. Appealing to Proposition 8 (using the fact that the elements of S are rank-one matrices) and Corollary 11, we conclude that n = c₃√p samples provide a mean squared error of 1 (with c₃ > c₂). Projecting onto the scaled nuclear norm ball can be done by computing a singular value decomposition (SVD) and then truncating the sequence of singular values, in descending order, when their cumulative sum exceeds √p (in effect, projecting the vector of singular values onto an ℓ1 ball of size √p). This operation requires O(p^{1.5}) operations; thus, the total runtime is c₃p^{1.5} + O(p^{1.5}). To summarize, the cut-matrix denoising problem lives in the time-data classes TD(superpoly(p), c₁√p, 1), TD(O(p^{2.5}), c₂√p, 1), and TD(O(p^{1.5}), c₃√p, 1), with constants c₁ < c₂ < c₃.

Example 2: Ordering Variables.
In many data analysis tasks, one is given a collection of variables that are suitably ordered so that the population covariance is banded. Under such a constraint, thresholding the entries of the empirical covariance matrix based on their distance from the diagonal has been shown to be a powerful method for estimation in the high-dimensional setting (44). However, if an ordering of the variables is not known a priori, one must jointly learn an ordering for the variables and estimate their underlying covariance. As a stylized version of this variable ordering problem, let M ∈ R^{√p×√p} be a known tridiagonal matrix (with Euclidean norm O(√p)) and consider the following signal set:

S = { Π M Πᵀ | Π is a √p × √p permutation matrix }.

The matrix M here is to be viewed as a covariance matrix. Thus, the corresponding denoising problem [] is that we wish to estimate a covariance matrix in the absence of knowledge of the ordering of the underlying variables. In a real-world scenario, one might wish to consider covariance matrices M that belong to some class of banded matrices and then construct S as done here, but we stick with the case of a fixed M for simplicity. Furthermore, the noise in a practical setting is better modeled as coming from a Wishart distribution; again, we focus on the Gaussian case for simplicity. The tightest convex constraint set that one could use in this case is the convex hull of S, which is generally intractable to compute for arbitrary matrices M. For example, if one were able to compute this set in polynomial time for any tridiagonal matrix M, one would be able to solve the intractable longest path problem (45) (finding the longest path between any two vertices in a graph) in polynomial time. With this convex constraint set, we find using Corollary 11 and Corollary 10 that n = c₁√p log(p) samples would lead to a risk of 1. This follows from the fact that conv(S) is a vertex-transitive polytope with about (√p)! vertices.
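An efficiently computable alternative to this intractable convex hull, used in this example, is the ℓ1 ball scaled by the ℓ1 norm of M, onto which projection costs O(p log p) via sorting. A self-contained sketch of the resulting denoiser (the particular tridiagonal M, dimensions, and sample size are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
sp = 10                          # sqrt(p)
p, sigma, n = sp * sp, 1.0, 400

# A fixed tridiagonal "covariance" M and a hidden reordering of the variables.
M = (np.diag(np.full(sp, 2.0))
     + np.diag(np.ones(sp - 1), 1)
     + np.diag(np.ones(sp - 1), -1))
perm = rng.permutation(sp)
x_star = M[np.ix_(perm, perm)]   # Pi M Pi^T

y_bar = x_star + (sigma / np.sqrt(n)) * rng.normal(size=(sp, sp))

def project_l1_ball(v, radius):
    """O(p log p) sort-based Euclidean projection onto {u : ||u||_1 <= radius}."""
    u = np.abs(v).ravel()
    if u.sum() <= radius:
        return v
    w = np.sort(u)[::-1]
    css = np.cumsum(w)
    k = np.arange(1, w.size + 1)
    rho = np.max(np.where(w - (css - radius) / k > 0)[0])
    theta = (css[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Denoise by projecting onto the l1 ball scaled by ||M||_1 (x* lies on its boundary).
radius = np.abs(M).sum()
x_hat = project_l1_ball(y_bar, radius)

err_raw = np.sum((y_bar - x_star) ** 2)   # about sigma^2 * p / n
err_hat = np.sum((x_hat - x_star) ** 2)
```

As in the cut-matrix example, x* belongs to the constraint set, so the projection can only reduce the error of the raw average ȳ.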
Thus, the total runtime is c₁p^{1.5} log(p) + superpoly(p). An efficiently computable relaxation of conv(S) is a scaled ℓ1 ball (scaled by the ℓ1 norm of M). Appealing to Proposition 7 on tangent cones with respect to the ℓ1 ball and to Corollary 11, we find that n = c₂√p log(p) samples suffice to provide a risk of 1. In applying Proposition 7, we note that M is assumed to be tridiagonal, and therefore has O(√p) nonzero entries. The runtime of this procedure is c₂p^{1.5} log(p) + O(p log(p)). Thus, the variable ordering denoising problem belongs to TD(superpoly(p), c₁√p log(p), 1) and to TD(O(p^{1.5} log(p)), c₂√p log(p), 1), with constants c₁ < c₂.

Example 3: Sparse PCA and Network Activity Identification. As our third example, we consider sparse PCA (8), in which one wishes to learn from samples a sparse eigenvector that contains most of the energy of a covariance matrix. As a simplified version of this problem, one can imagine a matrix M ∈ R^{√p×√p} with entries equal to √p/k in the top-left k × k block and zeros elsewhere (so that the Euclidean norm of M is √p), and with S defined as:

S = { Π M Πᵀ | Π is a √p × √p permutation matrix }.

In addition to sparse PCA, such signal sets are of interest in identifying activity in noisy networks (46) as well as in related combinatorial optimization problems, such as the planted clique problem (45). In the sparse PCA context, Amini and Wainwright (6) study time-data tradeoffs by investigating the sample complexities of two procedures: a simple one based on thresholding and a more sophisticated one based on semidefinite programming. Kolar et al. (46) investigate the sample complexities of a number of procedures ranging from a combinatorial search method to thresholding and sparse SVD. We note that the time-data tradeoffs studied in these two papers (6, 46) relate to the problem of learning the support of the leading sparse eigenvector; in contrast, in our setup, the objective is simply to denoise an element of S.
Furthermore, although the Gaussian noise setting is of interest in some of these domains, in a more realistic sparse PCA problem (e.g., the one considered in ref. 6), the noise is Wishart rather than Gaussian as considered here. Nevertheless, we stick with our stylized problem setting because it provides some useful insights on time-data tradeoffs. Finally, the size of the block k ∈ {1, ..., √p} depends on the application of interest, and it is typically far from the extremes 1 and √p. We will consider the case k = p^{1/4} for concreteness. [This setting is an interesting threshold case in the planted clique context (47–49), where k = p^{1/4} is the square root of the number of nodes of the graph represented by M (viewed as an adjacency matrix).] As usual, the tightest convex constraint set one can use in this setting is the convex hull of S, which is generally intractable to compute; an efficient characterization of this polytope would lead to an efficient solution of the intractable planted-clique problem (finding a fully connected subgraph inside a larger graph). Using this convex constraint set gives an estimator that requires about n = O(p^{1/4} log(p)) samples to produce a risk-1 estimate. We obtain this threshold by appealing to Corollary 11 and to Corollary 10, as well as to the observation that conv(S) is a vertex-transitive polytope with about (√p)^{p^{1/4}} vertices. Thus, the overall runtime is O(p^{5/4} log(p)) + superpoly(p). A convex relaxation of conv(S) is the nuclear norm ball scaled by a factor of √p, so that the elements of S lie on the boundary. From Proposition 8 (observing that the elements of S are rank-one matrices) and Corollary 11, we have that n = c√p samples give a risk-1 estimate with this procedure. As computed in the example with cut matrices, the overall runtime of this nuclear norm procedure is c p^{1.5} + O(p^{1.5}). In conclusion, the denoising version of sparse PCA lies in TD(superpoly(p), O(p^{1/4} log(p)), 1) and in TD(O(p^{1.5}), O(√p), 1).

Example 4: Estimating Matchings.
As our final example, we consider signals that represent the set of all perfect matchings in the complete graph. A matching is any subset of edges of a graph such that no node of the graph is incident to more than one edge in the subset, and a perfect matching is a subset of edges in which every node is incident to exactly one edge in the subset. Graph matchings arise in a range of inference problems, such as in chemical structure


More information

Elementary theory of L p spaces

Elementary theory of L p spaces CHAPTER 3 Elementary theory of L saces 3.1 Convexity. Jensen, Hölder, Minkowski inequality. We begin with two definitions. A set A R d is said to be convex if, for any x 0, x 1 2 A x = x 0 + (x 1 x 0 )

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Linear diophantine equations for discrete tomography

Linear diophantine equations for discrete tomography Journal of X-Ray Science and Technology 10 001 59 66 59 IOS Press Linear diohantine euations for discrete tomograhy Yangbo Ye a,gewang b and Jiehua Zhu a a Deartment of Mathematics, The University of Iowa,

More information

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables Imroved Bounds on Bell Numbers and on Moments of Sums of Random Variables Daniel Berend Tamir Tassa Abstract We rovide bounds for moments of sums of sequences of indeendent random variables. Concentrating

More information

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction GOOD MODELS FOR CUBIC SURFACES ANDREAS-STEPHAN ELSENHANS Abstract. This article describes an algorithm for finding a model of a hyersurface with small coefficients. It is shown that the aroach works in

More information

Analysis of some entrance probabilities for killed birth-death processes

Analysis of some entrance probabilities for killed birth-death processes Analysis of some entrance robabilities for killed birth-death rocesses Master s Thesis O.J.G. van der Velde Suervisor: Dr. F.M. Sieksma July 5, 207 Mathematical Institute, Leiden University Contents Introduction

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

Some results of convex programming complexity

Some results of convex programming complexity 2012c12 $ Ê Æ Æ 116ò 14Ï Dec., 2012 Oerations Research Transactions Vol.16 No.4 Some results of convex rogramming comlexity LOU Ye 1,2 GAO Yuetian 1 Abstract Recently a number of aers were written that

More information

MATH 2710: NOTES FOR ANALYSIS

MATH 2710: NOTES FOR ANALYSIS MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite

More information

On split sample and randomized confidence intervals for binomial proportions

On split sample and randomized confidence intervals for binomial proportions On slit samle and randomized confidence intervals for binomial roortions Måns Thulin Deartment of Mathematics, Usala University arxiv:1402.6536v1 [stat.me] 26 Feb 2014 Abstract Slit samle methods have

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

A Note on Guaranteed Sparse Recovery via l 1 -Minimization

A Note on Guaranteed Sparse Recovery via l 1 -Minimization A Note on Guaranteed Sarse Recovery via l -Minimization Simon Foucart, Université Pierre et Marie Curie Abstract It is roved that every s-sarse vector x C N can be recovered from the measurement vector

More information

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

On Wald-Type Optimal Stopping for Brownian Motion

On Wald-Type Optimal Stopping for Brownian Motion J Al Probab Vol 34, No 1, 1997, (66-73) Prerint Ser No 1, 1994, Math Inst Aarhus On Wald-Tye Otimal Stoing for Brownian Motion S RAVRSN and PSKIR The solution is resented to all otimal stoing roblems of

More information

arxiv: v1 [stat.ml] 10 Mar 2016

arxiv: v1 [stat.ml] 10 Mar 2016 Global and Local Uncertainty Princiles for Signals on Grahs Nathanael Perraudin, Benjamin Ricaud, David I Shuman, and Pierre Vandergheynst March, 206 arxiv:603.03030v [stat.ml] 0 Mar 206 Abstract Uncertainty

More information

SECTION 5: FIBRATIONS AND HOMOTOPY FIBERS

SECTION 5: FIBRATIONS AND HOMOTOPY FIBERS SECTION 5: FIBRATIONS AND HOMOTOPY FIBERS In this section we will introduce two imortant classes of mas of saces, namely the Hurewicz fibrations and the more general Serre fibrations, which are both obtained

More information

Approximate Dynamic Programming for Dynamic Capacity Allocation with Multiple Priority Levels

Approximate Dynamic Programming for Dynamic Capacity Allocation with Multiple Priority Levels Aroximate Dynamic Programming for Dynamic Caacity Allocation with Multile Priority Levels Alexander Erdelyi School of Oerations Research and Information Engineering, Cornell University, Ithaca, NY 14853,

More information

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK Towards understanding the Lorenz curve using the Uniform distribution Chris J. Stehens Newcastle City Council, Newcastle uon Tyne, UK (For the Gini-Lorenz Conference, University of Siena, Italy, May 2005)

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

Location of solutions for quasi-linear elliptic equations with general gradient dependence

Location of solutions for quasi-linear elliptic equations with general gradient dependence Electronic Journal of Qualitative Theory of Differential Equations 217, No. 87, 1 1; htts://doi.org/1.14232/ejqtde.217.1.87 www.math.u-szeged.hu/ejqtde/ Location of solutions for quasi-linear ellitic equations

More information

Asymptotically Optimal Simulation Allocation under Dependent Sampling

Asymptotically Optimal Simulation Allocation under Dependent Sampling Asymtotically Otimal Simulation Allocation under Deendent Samling Xiaoing Xiong The Robert H. Smith School of Business, University of Maryland, College Park, MD 20742-1815, USA, xiaoingx@yahoo.com Sandee

More information

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation Uniformly best wavenumber aroximations by satial central difference oerators: An initial investigation Vitor Linders and Jan Nordström Abstract A characterisation theorem for best uniform wavenumber aroximations

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

Lilian Markenzon 1, Nair Maria Maia de Abreu 2* and Luciana Lee 3

Lilian Markenzon 1, Nair Maria Maia de Abreu 2* and Luciana Lee 3 Pesquisa Oeracional (2013) 33(1): 123-132 2013 Brazilian Oerations Research Society Printed version ISSN 0101-7438 / Online version ISSN 1678-5142 www.scielo.br/oe SOME RESULTS ABOUT THE CONNECTIVITY OF

More information

arxiv: v2 [stat.me] 3 Nov 2014

arxiv: v2 [stat.me] 3 Nov 2014 onarametric Stein-tye Shrinkage Covariance Matrix Estimators in High-Dimensional Settings Anestis Touloumis Cancer Research UK Cambridge Institute University of Cambridge Cambridge CB2 0RE, U.K. Anestis.Touloumis@cruk.cam.ac.uk

More information

#A37 INTEGERS 15 (2015) NOTE ON A RESULT OF CHUNG ON WEIL TYPE SUMS

#A37 INTEGERS 15 (2015) NOTE ON A RESULT OF CHUNG ON WEIL TYPE SUMS #A37 INTEGERS 15 (2015) NOTE ON A RESULT OF CHUNG ON WEIL TYPE SUMS Norbert Hegyvári ELTE TTK, Eötvös University, Institute of Mathematics, Budaest, Hungary hegyvari@elte.hu François Hennecart Université

More information

Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Application

Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Application BULGARIA ACADEMY OF SCIECES CYBEREICS AD IFORMAIO ECHOLOGIES Volume 9 o 3 Sofia 009 Positive Definite Uncertain Homogeneous Matrix Polynomials: Analysis and Alication Svetoslav Savov Institute of Information

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

arxiv:cond-mat/ v2 25 Sep 2002

arxiv:cond-mat/ v2 25 Sep 2002 Energy fluctuations at the multicritical oint in two-dimensional sin glasses arxiv:cond-mat/0207694 v2 25 Se 2002 1. Introduction Hidetoshi Nishimori, Cyril Falvo and Yukiyasu Ozeki Deartment of Physics,

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

POINTS ON CONICS MODULO p

POINTS ON CONICS MODULO p POINTS ON CONICS MODULO TEAM 2: JONGMIN BAEK, ANAND DEOPURKAR, AND KATHERINE REDFIELD Abstract. We comute the number of integer oints on conics modulo, where is an odd rime. We extend our results to conics

More information

Finding a sparse vector in a subspace: linear sparsity using alternating directions

Finding a sparse vector in a subspace: linear sparsity using alternating directions IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Finding a sarse vector in a subsace: linear sarsity using alternating directions Qing Qu Student Member IEEE Ju Sun Student Member IEEE and John Wright

More information

Chater Matrix Norms and Singular Value Decomosition Introduction In this lecture, we introduce the notion of a norm for matrices The singular value de

Chater Matrix Norms and Singular Value Decomosition Introduction In this lecture, we introduce the notion of a norm for matrices The singular value de Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A Dahleh George Verghese Deartment of Electrical Engineering and Comuter Science Massachuasetts Institute of Technology c Chater Matrix Norms

More information

HENSEL S LEMMA KEITH CONRAD

HENSEL S LEMMA KEITH CONRAD HENSEL S LEMMA KEITH CONRAD 1. Introduction In the -adic integers, congruences are aroximations: for a and b in Z, a b mod n is the same as a b 1/ n. Turning information modulo one ower of into similar

More information

On the asymptotic sizes of subset Anderson-Rubin and Lagrange multiplier tests in linear instrumental variables regression

On the asymptotic sizes of subset Anderson-Rubin and Lagrange multiplier tests in linear instrumental variables regression On the asymtotic sizes of subset Anderson-Rubin and Lagrange multilier tests in linear instrumental variables regression Patrik Guggenberger Frank Kleibergeny Sohocles Mavroeidisz Linchun Chen\ June 22

More information

CHAPTER 2: SMOOTH MAPS. 1. Introduction In this chapter we introduce smooth maps between manifolds, and some important

CHAPTER 2: SMOOTH MAPS. 1. Introduction In this chapter we introduce smooth maps between manifolds, and some important CHAPTER 2: SMOOTH MAPS DAVID GLICKENSTEIN 1. Introduction In this chater we introduce smooth mas between manifolds, and some imortant concets. De nition 1. A function f : M! R k is a smooth function if

More information

Additive results for the generalized Drazin inverse in a Banach algebra

Additive results for the generalized Drazin inverse in a Banach algebra Additive results for the generalized Drazin inverse in a Banach algebra Dragana S. Cvetković-Ilić Dragan S. Djordjević and Yimin Wei* Abstract In this aer we investigate additive roerties of the generalized

More information

Hotelling s Two- Sample T 2

Hotelling s Two- Sample T 2 Chater 600 Hotelling s Two- Samle T Introduction This module calculates ower for the Hotelling s two-grou, T-squared (T) test statistic. Hotelling s T is an extension of the univariate two-samle t-test

More information

On Z p -norms of random vectors

On Z p -norms of random vectors On Z -norms of random vectors Rafa l Lata la Abstract To any n-dimensional random vector X we may associate its L -centroid body Z X and the corresonding norm. We formulate a conjecture concerning the

More information

Two pairs of families of polyhedral norms versus

Two pairs of families of polyhedral norms versus Math. Program., Ser. A DOI 0.007/s007-05-0899-9 FULL LENGTH PAPER Two airs of families of olyhedral norms versus l -norms: roximity and alications in otimization Jun-ya Gotoh Stan Uryasev 2 Received: 27

More information

Sharp gradient estimate and spectral rigidity for p-laplacian

Sharp gradient estimate and spectral rigidity for p-laplacian Shar gradient estimate and sectral rigidity for -Lalacian Chiung-Jue Anna Sung and Jiaing Wang To aear in ath. Research Letters. Abstract We derive a shar gradient estimate for ositive eigenfunctions of

More information

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)]

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)] LECTURE 7 NOTES 1. Convergence of random variables. Before delving into the large samle roerties of the MLE, we review some concets from large samle theory. 1. Convergence in robability: x n x if, for

More information

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS #A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,

More information

Finite Mixture EFA in Mplus

Finite Mixture EFA in Mplus Finite Mixture EFA in Mlus November 16, 2007 In this document we describe the Mixture EFA model estimated in Mlus. Four tyes of deendent variables are ossible in this model: normally distributed, ordered

More information

Convexification of Generalized Network Flow Problem with Application to Power Systems

Convexification of Generalized Network Flow Problem with Application to Power Systems 1 Convexification of Generalized Network Flow Problem with Alication to Power Systems Somayeh Sojoudi and Javad Lavaei + Deartment of Comuting and Mathematical Sciences, California Institute of Technology

More information

Generalized Coiflets: A New Family of Orthonormal Wavelets

Generalized Coiflets: A New Family of Orthonormal Wavelets Generalized Coiflets A New Family of Orthonormal Wavelets Dong Wei, Alan C Bovik, and Brian L Evans Laboratory for Image and Video Engineering Deartment of Electrical and Comuter Engineering The University

More information

ECE 534 Information Theory - Midterm 2

ECE 534 Information Theory - Midterm 2 ECE 534 Information Theory - Midterm Nov.4, 009. 3:30-4:45 in LH03. You will be given the full class time: 75 minutes. Use it wisely! Many of the roblems have short answers; try to find shortcuts. You

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

Introduction to MVC. least common denominator of all non-identical-zero minors of all order of G(s). Example: The minor of order 2: 1 2 ( s 1)

Introduction to MVC. least common denominator of all non-identical-zero minors of all order of G(s). Example: The minor of order 2: 1 2 ( s 1) Introduction to MVC Definition---Proerness and strictly roerness A system G(s) is roer if all its elements { gij ( s)} are roer, and strictly roer if all its elements are strictly roer. Definition---Causal

More information

Multiplicative group law on the folium of Descartes

Multiplicative group law on the folium of Descartes Multilicative grou law on the folium of Descartes Steluţa Pricoie and Constantin Udrişte Abstract. The folium of Descartes is still studied and understood today. Not only did it rovide for the roof of

More information

NONLINEAR OPTIMIZATION WITH CONVEX CONSTRAINTS. The Goldstein-Levitin-Polyak algorithm

NONLINEAR OPTIMIZATION WITH CONVEX CONSTRAINTS. The Goldstein-Levitin-Polyak algorithm - (23) NLP - NONLINEAR OPTIMIZATION WITH CONVEX CONSTRAINTS The Goldstein-Levitin-Polya algorithm We consider an algorithm for solving the otimization roblem under convex constraints. Although the convexity

More information

t 0 Xt sup X t p c p inf t 0

t 0 Xt sup X t p c p inf t 0 SHARP MAXIMAL L -ESTIMATES FOR MARTINGALES RODRIGO BAÑUELOS AND ADAM OSȨKOWSKI ABSTRACT. Let X be a suermartingale starting from 0 which has only nonnegative jums. For each 0 < < we determine the best

More information

On Doob s Maximal Inequality for Brownian Motion

On Doob s Maximal Inequality for Brownian Motion Stochastic Process. Al. Vol. 69, No., 997, (-5) Research Reort No. 337, 995, Det. Theoret. Statist. Aarhus On Doob s Maximal Inequality for Brownian Motion S. E. GRAVERSEN and G. PESKIR If B = (B t ) t

More information

s v 0 q 0 v 1 q 1 v 2 (q 2) v 3 q 3 v 4

s v 0 q 0 v 1 q 1 v 2 (q 2) v 3 q 3 v 4 Discrete Adative Transmission for Fading Channels Lang Lin Λ, Roy D. Yates, Predrag Sasojevic WINLAB, Rutgers University 7 Brett Rd., NJ- fllin, ryates, sasojevg@winlab.rutgers.edu Abstract In this work

More information

Positivity, local smoothing and Harnack inequalities for very fast diffusion equations

Positivity, local smoothing and Harnack inequalities for very fast diffusion equations Positivity, local smoothing and Harnack inequalities for very fast diffusion equations Dedicated to Luis Caffarelli for his ucoming 60 th birthday Matteo Bonforte a, b and Juan Luis Vázquez a, c Abstract

More information

Hidden Predictors: A Factor Analysis Primer

Hidden Predictors: A Factor Analysis Primer Hidden Predictors: A Factor Analysis Primer Ryan C Sanchez Western Washington University Factor Analysis is a owerful statistical method in the modern research sychologist s toolbag When used roerly, factor

More information

A generalization of Amdahl's law and relative conditions of parallelism

A generalization of Amdahl's law and relative conditions of parallelism A generalization of Amdahl's law and relative conditions of arallelism Author: Gianluca Argentini, New Technologies and Models, Riello Grou, Legnago (VR), Italy. E-mail: gianluca.argentini@riellogrou.com

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

The Algebraic Structure of the p-order Cone

The Algebraic Structure of the p-order Cone The Algebraic Structure of the -Order Cone Baha Alzalg Abstract We study and analyze the algebraic structure of the -order cones We show that, unlike oularly erceived, the -order cone is self-dual for

More information

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i Comuting with Haar Functions Sami Khuri Deartment of Mathematics and Comuter Science San Jose State University One Washington Square San Jose, CA 9519-0103, USA khuri@juiter.sjsu.edu Fax: (40)94-500 Keywords:

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

Multi-Operation Multi-Machine Scheduling

Multi-Operation Multi-Machine Scheduling Multi-Oeration Multi-Machine Scheduling Weizhen Mao he College of William and Mary, Williamsburg VA 3185, USA Abstract. In the multi-oeration scheduling that arises in industrial engineering, each job

More information

Journal of Mathematical Analysis and Applications

Journal of Mathematical Analysis and Applications J. Math. Anal. Al. 44 (3) 3 38 Contents lists available at SciVerse ScienceDirect Journal of Mathematical Analysis and Alications journal homeage: www.elsevier.com/locate/jmaa Maximal surface area of a

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

On a class of Rellich inequalities

On a class of Rellich inequalities On a class of Rellich inequalities G. Barbatis A. Tertikas Dedicated to Professor E.B. Davies on the occasion of his 60th birthday Abstract We rove Rellich and imroved Rellich inequalities that involve

More information

Developing A Deterioration Probabilistic Model for Rail Wear

Developing A Deterioration Probabilistic Model for Rail Wear International Journal of Traffic and Transortation Engineering 2012, 1(2): 13-18 DOI: 10.5923/j.ijtte.20120102.02 Develoing A Deterioration Probabilistic Model for Rail Wear Jabbar-Ali Zakeri *, Shahrbanoo

More information