A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares

Journal of Machine Learning Research 17 (2016) 1-31. Submitted 8/15; Published 11/16.

Garvesh Raskutti (raskutti@stat.wisc.edu)
Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA

Michael W. Mahoney (mmahoney@stat.berkeley.edu)
International Computer Science Institute and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720, USA

Editor: Mehryar Mohri

Abstract

We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^{n}$, sketching algorithms use a sketching matrix, $S \in \mathbb{R}^{r \times n}$, where $r \ll n$. Then, rather than solving the LS problem using the full data $(X, Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input $X$ and $Y$, and instead it has been assumed that the data $(X, Y)$ are fixed and worst-case (WC). Prior results show that, when using sketching matrices such as random projections and leverage-score sampling algorithms, with $p \lesssim r \ll n$, the WC error is the same as solving the original problem, up to a small constant. From a statistical perspective, we typically consider the mean-squared error performance of randomized sketching algorithms, when data $(X, Y)$ are generated according to a statistical linear model $Y = X\beta + \epsilon$, where $\epsilon$ is a noise process. In this paper, we provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling sketching algorithms.
Among other results, we show that the RE can be upper bounded when $p \lesssim r \ll n$ while the PE typically requires the sample size $r$ to be substantially larger. Lower bounds developed in subsequent results show that our upper bounds on PE cannot be improved.

Keywords: algorithmic leveraging, randomized linear algebra, sketching, random projection, statistical leverage, statistical efficiency

A preliminary version of this paper appeared as Raskutti and Mahoney (2014, 2015).
©2016 Garvesh Raskutti and Michael W. Mahoney.

1. Introduction

Recent work in large-scale data analysis has focused on developing so-called sketching algorithms: given a data set and an objective function of interest, construct a small "sketch" of the full data set, e.g., by using random sampling or random projection methods, and use that sketch as a surrogate to perform computations of interest for the full data set (see Mahoney (2011) for a review). Most effort in this area has adopted an algorithmic perspective, whereby one shows that, when the sketches are constructed appropriately, one can obtain answers that are approximately as good as the exact answer for the input data at hand, in less time than would be required to compute an exact answer for the data at hand. In statistics, however, one is often more interested in how well a procedure performs relative to a hypothesized model than how well it performs on the particular data set at hand. Thus an important question to consider is whether the insights from the algorithmic perspective of sketching carry over to the statistical setting.

Accordingly, in this paper, we develop a unified approach that considers both the statistical perspective as well as the algorithmic perspective on recently-developed randomized sketching algorithms, and we provide bounds on two statistical objectives for several types of random projection and random sampling sketching algorithms.

1.1 Overview of the Problem

The problem we consider in this paper is the ordinary least-squares (LS or OLS) problem: given as input a matrix $X \in \mathbb{R}^{n \times p}$ of observed features or covariates and a vector $Y \in \mathbb{R}^{n}$ of observed responses, return as output a vector $\beta_{OLS}$ that solves the following optimization problem:

$$\beta_{OLS} = \arg\min_{\beta \in \mathbb{R}^{p}} \|Y - X\beta\|_2^2. \qquad (1)$$

We will assume that $n$ and $p$ are both very large, with $n \gg p$, and for simplicity we will assume $\mathrm{rank}(X) = p$, e.g., to ensure a unique full-dimensional solution. The OLS solution, $\beta_{OLS} = (X^T X)^{-1} X^T Y$, has a number of well-known desirable statistical properties (Chatterjee and Hadi, 1988); and it is also well-known that the running time or computational complexity for this problem is $O(np^2)$ (Golub and Loan, 1996).$^{2}$ For many modern applications, however, $n$ may be on the order of $10^6$ to $10^9$ and $p$ may be on the order of $10^3$ to $10^4$, and thus computing the exact LS solution with traditional $O(np^2)$ methods can be computationally challenging. This, coupled with the observation that approximate answers often suffice for downstream applications, has led to a large body of work on developing fast approximation algorithms to the LS problem (Mahoney, 2011).

One very popular approach to reducing computation is to perform LS on a carefully-constructed sketch of the full data set. That is, rather than computing a LS estimator from Problem (1) from the full data $(X, Y)$, generate sketched data $(SX, SY)$ where $S \in \mathbb{R}^{r \times n}$, with $r \ll n$, is a "sketching matrix," and then compute a LS estimator from the following sketched problem:

$$\beta_{S} \in \arg\min_{\beta \in \mathbb{R}^{p}} \|SY - SX\beta\|_2^2. \qquad (2)$$

2. That is, $O(np^2)$ time suffices to compute the LS solution from Problem (1) for arbitrary or worst-case input, with, e.g., the Cholesky Decomposition on the normal equations, with a QR decomposition, or with the Singular Value Decomposition (Golub and Loan, 1996).
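As a rough illustration of Problems (1) and (2), the following NumPy sketch solves the full LS problem and a sketched version and compares their residuals on the full data. The dense Gaussian choice of $S$, the problem sizes, the random seed, and the helper names `ols` and `sketched_ols` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def ols(X, Y):
    """Exact LS solution of Problem (1) using a standard LS solver."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def sketched_ols(X, Y, r, rng):
    """LS solution of the sketched Problem (2) for one illustrative S.

    S is an r x n matrix of i.i.d. Gaussian entries (an assumed choice);
    the 1/sqrt(r) scaling does not change the minimizer.
    """
    n = X.shape[0]
    S = rng.standard_normal((r, n)) / np.sqrt(r)
    return np.linalg.lstsq(S @ X, S @ Y, rcond=None)[0]

rng = np.random.default_rng(0)
n, p, r = 2000, 10, 200                  # p < r << n
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
Y = X @ beta + rng.standard_normal(n)    # linear model with unit-variance noise

beta_ols = ols(X, Y)
beta_s = sketched_ols(X, Y, r, rng)

# Residuals are always evaluated on the full data (X, Y).
res_ols = np.sum((Y - X @ beta_ols) ** 2)
res_s = np.sum((Y - X @ beta_s) ** 2)
ratio = res_s / res_ols                  # never below 1, since beta_ols is optimal
print(f"residual ratio (sketched / exact): {ratio:.4f}")
```

The sketched solve only touches an $r \times p$ system, which is where the computational saving comes from; forming $SX$ with a dense $S$, as done here for simplicity, is itself expensive and is the step that fast projections or sampling are meant to avoid.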

Once the sketching operation has been performed, the additional computational complexity of $\beta_S$ is $O(rp^2)$, i.e., simply call a traditional LS solver on the sketched problem. Thus, when using a sketching algorithm, two criteria are important: first, ensure the accuracy of the sketched LS estimator is comparable to, e.g., not much worse than, the performance of the original LS estimator; and second, ensure that computing and applying the sketching matrix $S$ is not too computationally intensive, e.g., that it is faster than solving the original problem exactly.

1.2 Prior Results

Random sampling and random projections provide two approaches to construct sketching matrices $S$ that satisfy both of these criteria and that have received attention recently in the computer science community. Very loosely speaking, a random projection matrix $S$ is a dense matrix,$^{3}$ where each entry is a mean-zero bounded-variance Gaussian or Rademacher random variable, although other constructions based on randomized Hadamard transformations are also of interest; and a random sampling matrix $S$ is a very sparse matrix that has exactly 1 non-zero entry (which typically equals one multiplied by a rescaling factor) in each row, where that one non-zero can be chosen uniformly, non-uniformly based on hypotheses about the data, or non-uniformly based on empirical statistics of the data such as the leverage scores of the matrix $X$. In particular, note that a sketch constructed from an $r \times n$ random projection matrix $S$ consists of $r$ linear combinations of most or all of the rows of $(X, Y)$, and a sketch constructed from a random sampling matrix $S$ consists of $r$ typically-rescaled rows of $(X, Y)$. Random projection algorithms have received a great deal of attention more generally, largely due to their connections with the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) and its extensions; and random sampling algorithms have received a great deal of attention, largely due to their use in large-scale data analysis applications (Mahoney and Drineas, 2009).

A detailed overview of random projection and random sampling algorithms for matrix problems may be found in the recent monograph of Mahoney (2011). Here, we briefly summarize the most relevant aspects of the theory.

In terms of running time guarantees, the running time bottleneck for random projection algorithms for the LS problem is the application of the projection to the input data, i.e., actually performing the matrix-matrix multiplication to implement the projection and compute the sketch. By using fast Hadamard-based random projections, however, Drineas et al. (2011) developed a random projection algorithm that runs on arbitrary or worst-case input in $o(np^2)$ time. (See Drineas et al. (2011) for a precise statement of the running time.) As for random sampling, it is trivial to implement uniform random sampling, but it is very easy to show examples of input data on which uniform sampling performs very poorly. On the other hand, Drineas et al. (2006b, 2012) have shown that if the random sampling is performed with respect to nonuniform importance sampling probabilities that depend on the empirical statistical leverage scores of the input matrix $X$, i.e., the diagonal entries of the hat matrix $H = X(X^T X)^{-1} X^T$, then one obtains a random sampling algorithm that achieves much better results for arbitrary or worst-case input.

3. The reader should, however, be aware of recently-developed input-sparsity time random projection methods (Clarkson and Woodruff, 2013; Meng and Mahoney, 2013; Nelson and Huy, 2013).
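To make the definition of leverage scores concrete, the following sketch computes them exactly as the squared row norms of an orthonormal basis for the column space of $X$ and checks the result against the diagonal of the hat matrix $H = X(X^T X)^{-1} X^T$. The use of a thin QR factorization, the synthetic data, and the function name `leverage_scores` are assumptions of this example; this is the exact computation, not the fast approximation of Drineas et al. (2012).

```python
import numpy as np

def leverage_scores(X):
    """Exact leverage scores of X: squared row norms of an orthonormal basis.

    If X = QR with Q having orthonormal columns, then H = Q Q^T, so the
    i-th diagonal entry of H is the squared norm of the i-th row of Q.
    """
    Q, _ = np.linalg.qr(X, mode="reduced")
    return np.sum(Q ** 2, axis=1)

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.standard_normal((n, p))
X[0] *= 50.0                      # make the first row an unusually influential sample

lev = leverage_scores(X)

# Reference computation straight from the definition of the hat matrix.
H = X @ np.linalg.solve(X.T @ X, X.T)

print("sum of leverage scores:", lev.sum())       # equals p up to rounding
print("largest leverage score at row:", lev.argmax())
```

Forming the full $n \times n$ hat matrix is done here only to check the definition; for large $n$ one would work with the thin factor alone.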

Leverage scores have a long history in robust statistics and experimental design. In the robust statistics community, samples with high leverage scores are typically flagged as potential outliers (see, e.g., Chatterjee and Hadi (2006, 1988); Hampel et al. (1986); Hoaglin and Welsch (1978); Huber and Ronchetti (1981)). In the experimental design community, samples with high leverage have been shown to improve overall efficiency, provided that the underlying statistical model is accurate (see, e.g., Royall (1970); Zavlavsky et al. (2008)). This should be contrasted with their use in theoretical computer science. From the algorithmic perspective of worst-case analysis, which was adopted by Drineas et al. (2011) and Drineas et al. (2012), samples with high leverage tend to contain the most important information for subsampling/sketching, and thus it is beneficial for worst-case analysis to bias the random sample to include samples with large statistical leverage scores or to rotate to a random basis where the leverage scores are approximately uniformized.

The running-time bottleneck for this leverage-based random sampling algorithm is the computation of the leverage scores of the input data; and the obvious well-known algorithm for this involves $O(np^2)$ time to perform a QR decomposition to compute an orthogonal basis for $X$ (Golub and Loan, 1996). By using fast Hadamard-based random projections, however, Drineas et al. (2012) showed that one can compute approximate QR decompositions and thus approximate leverage scores in $o(np^2)$ time, and (based on previous work (Drineas et al., 2006b)) this immediately implies a leverage-based random sampling algorithm that runs on arbitrary or worst-case input in $o(np^2)$ time (Drineas et al., 2012). Readers interested in the practical performance of these randomized algorithms should consult Blendenpik (Avron et al., 2010) or LSRN (Meng et al., 2014).

In terms of accuracy guarantees, both Drineas et al. (2011) and Drineas et al. (2012) prove that their respective random projection and leverage-based random sampling LS sketching algorithms each achieve the following worst-case (WC) error guarantee: for any arbitrary $(X, Y)$,

$$\|Y - X\beta_S\|_2^2 \le (1 + \kappa)\,\|Y - X\beta_{OLS}\|_2^2, \qquad (3)$$

with high probability for some pre-specified error parameter $\kappa \in (0, 1)$.$^{4}$ This $1 + \kappa$ relative-error guarantee$^{5}$ is extremely strong, and it is applicable to arbitrary or worst-case input. That is, whereas in statistics one typically assumes a model, e.g., a standard linear model on $Y$,

$$Y = X\beta + \epsilon, \qquad (4)$$

where $\beta \in \mathbb{R}^{p}$ is the true parameter and $\epsilon \in \mathbb{R}^{n}$ is a standardized noise vector, with $E[\epsilon] = 0$ and $E[\epsilon\epsilon^T] = I_{n \times n}$, in Drineas et al. (2011) and Drineas et al. (2012) no statistical model is assumed on $X$ and $Y$, and thus the running time and quality-of-approximation bounds apply to any arbitrary $(X, Y)$ input data.

4. The quantity $\|\beta_S - \beta_{OLS}\|_2^2$ is also bounded by Drineas et al. (2011) and Drineas et al. (2012).
5. The nonstandard parameter $\kappa$ is used here for the error parameter since $\epsilon$ is used below to refer to the noise or error process.

1.3 Our Approach and Main Results

In this paper, we adopt a statistical perspective on these randomized sketching algorithms, and we address the following fundamental questions. First, under a standard linear model, e.g., as given in Eqn. (4), what properties of a sketching matrix $S$ are sufficient to ensure low statistical error, e.g., mean-squared error? Second, how do existing random projection algorithms and leverage-based random sampling algorithms perform by this statistical measure? Third, how does this relate to the properties of a sketching matrix $S$ that are sufficient to ensure low worst-case error, e.g., of the form of Eqn. (3), as has been established previously in Drineas et al. (2011, 2012); Mahoney (2011)? We address these related questions in a number of steps.

In Section 2, we will present a framework for evaluating the algorithmic and statistical properties of randomized sketching methods in a unified manner; and we will show that providing worst-case error bounds of the form of Eqn. (3) and providing bounds on two related statistical objectives boil down to controlling different structural properties of how the sketching matrix $S$ interacts with the left singular subspace of the design matrix. In particular, we will consider the oblique projection matrix, $\Pi^U_S = U(SU)^{\dagger}S$, where $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of a matrix and $U$ is the left singular matrix of $X$. This framework will allow us to draw a comparison between the worst-case error and two related statistical efficiency criteria, the statistical prediction efficiency (PE) (which is based on the prediction error $E[\|X(\hat\beta - \beta)\|_2^2]$ and which is given in Eqn. (7) below) and the statistical residual efficiency (RE) (which is based on the residual error $E[\|Y - X\hat\beta\|_2^2]$ and which is given in Eqn. (8) below); and it will allow us to provide sufficient conditions that any sketching matrix $S$ must satisfy in order to achieve performance guarantees for these two statistical objectives.

In Section 3, we will present our main theoretical results, which consist of bounds for these two statistical quantities for variants of random sampling and random projection sketching algorithms. In particular, we provide upper bounds on the PE and RE (as well as the worst-case WC) for four sketching schemes: (1) an approximate leverage-based random sampling algorithm, as is analyzed by Drineas et al. (2012); (2) a variant of leverage-based random sampling, where the random samples are not re-scaled prior to their inclusion in the sketch, as is considered by Ma et al. (2014, 2015); (3) a vanilla random projection algorithm, where $S$ is a random matrix containing i.i.d. Gaussian or Rademacher random variables, as is popular in statistics and scientific computing; and (4) a random projection algorithm, where $S$ is a random Hadamard-based random projection, as analyzed in Boutsidis and Gittens (2013). For sketching schemes (1), (3), and (4), our upper bounds for each of the two measures of statistical efficiency are identical up to constants; and they show that the RE scales as $1 + p/r$, while the PE scales as $n/r$. In particular, this means that it is possible to obtain good bounds for the RE when $p \lesssim r \ll n$ (in a manner similar to the sampling complexity of the WC bounds); but in order to obtain even near-constant bounds for PE, $r$ must be at least of constant order compared to $n$. We then present a lower bound developed in subsequent work by Pilanci and Wainwright (2014) which shows that under general conditions on $S$, our upper bound of $n/r$ for PE cannot be improved. For the sketching scheme (2), we show, on the other hand, that under the strong assumption that there are $k$ "large" leverage scores and the remaining $n - k$ are "small," then the WC scales as $1 + p/r$, the RE scales as $1 + pk/(nr)$, and the PE scales as $k/r$. That is, sharper bounds are possible for leverage-score sampling without re-scaling in the statistical setting, but much stronger assumptions are needed on the input data.

In Section 4, we will supplement our theoretical results by presenting our main empirical results, which consist of an evaluation of the complementary properties of random sampling versus random projection methods. Our empirical results support our theoretical results, and they also show that for $r$ larger than $p$ but much closer to $p$ than $n$, projection-based methods tend to out-perform sampling-based methods, while for $r$ significantly larger than $p$, our leverage-based sampling methods perform slightly better. In Section 5, we will provide a brief discussion and conclusion, and we provide proofs of our main results in the Appendix.

1.4 Additional Related Work

Very recently Ma et al. (2014) considered statistical aspects of leverage-based sampling algorithms (called algorithmic leveraging in Ma et al. (2014)). Assuming a standard linear model on $Y$ of the form of Eqn. (4), the authors developed first-order Taylor approximations to the statistical relative efficiency of different estimators computed with leverage-based sampling algorithms, and they verified the quality of those approximations with computations on real and synthetic data. Taken as a whole, their results suggest that, if one is interested in the statistical performance of these randomized sketching algorithms, then there are nontrivial trade-offs that are not taken into account by standard worst-case analysis. Their approach, however, does not immediately apply to random projections or other more general sketching matrices. Further, the realm of applicability of the first-order Taylor approximation was not precisely quantified, and they left open the question of structural characterizations of random sketching matrices that were sufficient to ensure good statistical properties on the sketched data. We address these issues in this paper.

After the appearance of the original technical report version of this paper (Raskutti and Mahoney, 2014), we were made aware of subsequent work by Pilanci and Wainwright (2014), who also consider a statistical perspective on sketching. Amongst other results, they develop a lower bound which confirms that using a single randomized sketching matrix $S$ cannot achieve a PE better than $n/r$. This lower bound complements our upper bounds developed in this paper.
Their main focus is to use this insight to develop an iterative sketching scheme which yields bounds on the PE when an $r \times n$ sketch is applied repeatedly.

2. General Framework and Structural Results

In this section, we develop a framework that allows us to view the algorithmic and statistical perspectives on LS problems from a common perspective. We then use this framework to show that existing worst-case bounds as well as our novel statistical bounds for the mean-squared errors can be expressed in terms of different structural conditions on how the sketching matrix $S$ interacts with the data $(X, Y)$.

2.1 A Statistical-Algorithmic Framework

Recall that we are given as input a data set, $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^{n}$, and the objective function of interest is the standard LS objective, as given in Eqn. (1). Since we are assuming, without loss of generality, that $\mathrm{rank}(X) = p$, we have that

$$\beta_{OLS} = X^{\dagger} Y = (X^T X)^{-1} X^T Y, \qquad (5)$$

where $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of a matrix, and where the second equality follows since $\mathrm{rank}(X) = p$.

To present our framework and objectives, let $S \in \mathbb{R}^{r \times n}$ denote an arbitrary sketching matrix. That is, although we will be most interested in sketches constructed from random sampling or random projection operations, for now we let $S$ be any $r \times n$ matrix. Then, we are interested in analyzing the performance of objectives characterizing the quality of a "sketched" LS objective, as given in Eqn. (2), where again we are interested in solutions of the form

$$\beta_S = (SX)^{\dagger} SY. \qquad (6)$$

(We emphasize that this does not in general equal $((SX)^T SX)^{-1}(SX)^T SY$, since the inverse will not exist if the sketching process does not preserve rank.) Our goal here is to compare the performance of $\beta_S$ to $\beta_{OLS}$. We will do so by considering three related performance criteria, two of a statistical flavor, and one of a more algorithmic or worst-case flavor.

From a statistical perspective, it is common to assume a standard linear model on $Y$,

$$Y = X\beta + \epsilon,$$

where we remind the reader that $\beta \in \mathbb{R}^{p}$ is the true parameter and $\epsilon \in \mathbb{R}^{n}$ is a standardized noise vector, with $E[\epsilon] = 0$ and $E[\epsilon\epsilon^T] = I_{n \times n}$. From this statistical perspective, we will consider the following two criteria.

The first statistical criterion we consider is the prediction efficiency (PE), defined as follows:

$$C_{PE}(S) = \frac{E[\|X(\beta - \beta_S)\|_2^2]}{E[\|X(\beta - \beta_{OLS})\|_2^2]}, \qquad (7)$$

where the expectation $E[\cdot]$ is taken over the random noise $\epsilon$.

The second statistical criterion we consider is the residual efficiency (RE), defined as follows:

$$C_{RE}(S) = \frac{E[\|Y - X\beta_S\|_2^2]}{E[\|Y - X\beta_{OLS}\|_2^2]}, \qquad (8)$$

where, again, the expectation $E[\cdot]$ is taken over the random noise $\epsilon$.

Recall that the standard relative statistical efficiency for two estimators $\beta_1$ and $\beta_2$ is defined as $\mathrm{eff}(\beta_1, \beta_2) = \mathrm{Var}(\beta_1)/\mathrm{Var}(\beta_2)$, where $\mathrm{Var}(\cdot)$ denotes the variance of the estimator (see e.g., Lehmann (1998)). For the PE, we have replaced the variance of each estimator by the mean-squared prediction error.
For the RE, we use the term residual since for any estimator $\hat\beta$, $Y - X\hat\beta$ are the residuals for estimating $Y$.

From an algorithmic perspective, there is no noise process $\epsilon$. Instead, $X$ and $Y$ are arbitrary, and $\beta$ is simply computed from Eqn. (5). To draw a parallel with the usual statistical generative process, however, and to understand better the relationship between various objectives, consider defining $Y$ in terms of $X$ by the following "linear model":

$$Y = X\beta + \epsilon,$$

where $\beta \in \mathbb{R}^{p}$ and $\epsilon \in \mathbb{R}^{n}$. Importantly, $\beta$ and $\epsilon$ here represent different quantities than in the usual statistical setting. Rather than $\epsilon$ representing a noise process and $\beta$ representing a true parameter that is observed through a noisy $Y$, here in the algorithmic setting, we will take advantage of the rank-nullity theorem in linear algebra to relate $X$ and $Y$.$^{6}$ To define a "worst case model" $Y = X\beta + \epsilon$ for the algorithmic setting, one can view the "noise" process $\epsilon$ to consist of any vector that lies in the null-space of $X^T$. Then, since the choice of $\beta \in \mathbb{R}^{p}$ is arbitrary, one can construct any arbitrary or worst-case input data $Y$. From this algorithmic case, we will consider the following criterion.

The algorithmic criterion we consider is the worst-case (WC) error, defined as follows:

$$C_{WC}(S) = \sup_{Y} \frac{\|Y - X\beta_S\|_2^2}{\|Y - X\beta_{OLS}\|_2^2}. \qquad (9)$$

This criterion is worst-case since we take a supremum over $Y$, and it is the performance criterion that is analyzed in Drineas et al. (2011) and Drineas et al. (2012), as bounded in Eqn. (3).

Writing $Y$ as $X\beta + \epsilon$, where $X^T\epsilon = 0$, the WC error can be re-expressed as:

$$C_{WC}(S) = \sup_{Y = X\beta + \epsilon,\; X^T\epsilon = 0} \frac{\|Y - X\beta_S\|_2^2}{\|Y - X\beta_{OLS}\|_2^2}.$$

Hence, in the worst-case algorithmic setup, we take a supremum over $\epsilon$, where $X^T\epsilon = 0$, whereas in the statistical setup, we take an expectation over $\epsilon$ where $E[\epsilon] = 0$.

Before proceeding, several other comments about this algorithmic-statistical framework and our objectives are worth mentioning.

The most important distinction between the algorithmic approach and the statistical approach is how the data is assumed to be generated. For the statistical approach, $(X, Y)$ are assumed to be generated by a standard Gaussian linear model and the goal is to estimate a true parameter $\beta$, while for the algorithmic approach $(X, Y)$ are not assumed to follow any statistical model and the goal is to do prediction on $Y$ rather than estimate a true parameter $\beta$. Since ordinary least-squares is often run in the context of solving a statistical inference problem, we believe this distinction is important, and we focus in this article more on the implications for the statistical perspective.

From the perspective of our two linear models, we have that $\beta_{OLS} = \beta + (X^T X)^{-1} X^T \epsilon$.
In the statistical setting, since $E[\epsilon\epsilon^T] = I_{n \times n}$, it follows that $\beta_{OLS}$ is a random variable with $E[\beta_{OLS}] = \beta$ and $E[(\beta - \beta_{OLS})(\beta - \beta_{OLS})^T] = (X^T X)^{-1}$. In the algorithmic setting, on the other hand, since $X^T\epsilon = 0$, it follows that $\beta_{OLS} = \beta$.

$C_{RE}(S)$ is a statistical analogue of the worst-case algorithmic objective $C_{WC}(S)$, since both consider the ratio of the metrics $\|Y - X\beta_S\|_2^2 / \|Y - X\beta_{OLS}\|_2^2$. The difference is that a $\sup$ over $Y$ in the algorithmic setting is replaced by an expectation over noise $\epsilon$ in the statistical setting. A natural question is whether there is an algorithmic analogue of $C_{PE}(S)$. Such a performance metric would be:

$$\sup_{Y} \frac{\|X(\beta - \beta_S)\|_2^2}{\|X(\beta - \beta_{OLS})\|_2^2}, \qquad (10)$$

where $\beta$ is the projection of $Y$ on to the range space of $X^T$. However, since $\beta_{OLS} = \beta + (X^T X)^{-1} X^T \epsilon$ and since $X^T\epsilon = 0$, $\beta_{OLS} = \beta$ in the algorithmic setting, the denominator of Eqn. (10) equals zero, and thus the objective in Eqn. (10) is not well-defined. The "difficulty" of computing or approximating this objective parallels our results below that show that approximating $C_{PE}(S)$ is much more challenging (in terms of the number of samples needed) than approximating $C_{RE}(S)$.

In the algorithmic setting, the sketching matrix $S$ and the objective $C_{WC}(S)$ can depend on $X$ and $Y$ in any arbitrary way, but in the following we consider only sketching matrices that are either independent of both $X$ and $Y$ or depend only on $X$ (e.g., via the statistical leverage scores of $X$). In the statistical setting, $S$ is allowed to depend on $X$, but not on $Y$, as any dependence of $S$ on $Y$ might introduce correlation between the sketching matrix and the noise variable $\epsilon$. Removing this restriction is of interest, especially since one can obtain WC bounds of the form Eqn. (3) by constructing $S$ by randomly sampling according to an importance sampling distribution that depends on the influence scores of the $(X, Y)$ pair, which are essentially the leverage scores of the matrix $X$ augmented with $Y$ as an additional column.

Both $C_{PE}(S)$ and $C_{RE}(S)$ are qualitatively related to quantities analyzed by Ma et al. (2014, 2015). In addition, $C_{WC}(S)$ is qualitatively similar to $\mathrm{Cov}(\hat\beta \mid Y)$ in Ma et al. (2014, 2015), since in the algorithmic setting $Y$ is treated as fixed; and $C_{RE}(S)$ is qualitatively similar to $\mathrm{Cov}(\hat\beta)$ in Ma et al. (2014, 2015), since in the statistical setting $Y$ is treated as random and coming from a linear model. That being said, the metrics and results we present in this paper are not directly comparable to those of Ma et al. (2014, 2015) since, e.g., they had a slightly different setup than we have here, and since they used a first-order Taylor approximation while we do not.

6. The rank-nullity theorem asserts that given any matrix $X \in \mathbb{R}^{n \times p}$ and vector $Y \in \mathbb{R}^{n}$, there exists a unique decomposition $Y = X\beta + \epsilon$, where $\beta$ is the projection of $Y$ on to the range space of $X^T$ and $\epsilon = Y - X\beta$ lies in the null-space of $X^T$ (Meyer, 2000).

2.2 Structural Results on Sketching Matrices

We are now ready to develop structural conditions characterizing how the sketching matrix $S$ interacts with the data matrix $X$ that will allow us to provide upper bounds for the quantities $C_{WC}(S)$, $C_{PE}(S)$, and $C_{RE}(S)$. To do so, recall that given the data matrix $X$, we can express the singular value decomposition of $X$ as $X = U\Sigma V^T$, where $U \in \mathbb{R}^{n \times p}$ is an orthogonal matrix, i.e., $U^T U = I_{p \times p}$. In addition, we can define the oblique projection matrix

$$\Pi^U_S := U(SU)^{\dagger} S. \qquad (11)$$

Note that if $\mathrm{rank}(SX) = p$, then $\Pi^U_S$ can be expressed as $\Pi^U_S = U(U^T S^T S U)^{-1} U^T S^T S$, since $U^T S^T S U$ is invertible. Importantly however, depending on the properties of $X$ and how $S$ is constructed, it can easily happen that $\mathrm{rank}(SX) < p$, even if $\mathrm{rank}(X) = p$.

Given this setup, we can now state the following lemma, the proof of which may be found in Section A.1. This lemma characterizes how $C_{WC}(S)$, $C_{PE}(S)$, and $C_{RE}(S)$ depend on different structural properties of $\Pi^U_S$ and $SU$.

Lemma 1 For the algorithmic setting,

$$C_{WC}(S) = 1 + \sup_{\delta \in \mathbb{R}^{p},\; U^T\epsilon = 0} \left[ \frac{\|(I_p - (SU)^{\dagger}(SU))\delta\|_2^2}{\|\epsilon\|_2^2} + \frac{\|\Pi^U_S \epsilon\|_2^2}{\|\epsilon\|_2^2} \right].$$

For the statistical setting,

$$C_{PE}(S) = \frac{\|(I_p - (SU)^{\dagger}SU)\Sigma V^T \beta\|_2^2}{p} + \frac{\|\Pi^U_S\|_F^2}{p},$$

and

$$C_{RE}(S) = 1 + \frac{\|(I_p - (SU)^{\dagger}SU)\Sigma V^T \beta\|_2^2}{n - p} + \frac{\|\Pi^U_S\|_F^2 - p}{n - p} = 1 + \frac{C_{PE}(S) - 1}{n/p - 1}.$$

Several points are worth making about Lemma 1.

For all 3 criteria, the term which involves $(SU)^{\dagger}SU$ is a "bias" term that is non-zero in the case that $\mathrm{rank}(SU) < p$. For $C_{PE}(S)$ and $C_{RE}(S)$, the term corresponds exactly to the statistical bias; and if $\mathrm{rank}(SU) = p$, meaning that $S$ is a rank-preserving sketching matrix, then the bias term equals 0, since $(SU)^{\dagger}SU = I_p$. In practice, if $r$ is chosen smaller than $p$ or larger than but very close to $p$, it may happen that $\mathrm{rank}(SU) < p$, in which case this bias is incurred.

The final equality $C_{RE}(S) = 1 + \frac{C_{PE}(S) - 1}{n/p - 1}$ shows that in general it is much more difficult (in terms of the number of samples needed) to obtain bounds on $C_{PE}(S)$ than $C_{RE}(S)$, since $C_{RE}(S)$ re-scales $C_{PE}(S)$ by $p/n$, which is much less than 1. This will be reflected in the main results below, where the scaling of $C_{RE}(S)$ will be a factor of $p/n$ smaller than $C_{PE}(S)$. In general, it is significantly more difficult to bound $C_{PE}(S)$, since $\|X(\beta - \beta_{OLS})\|_2^2$ equals $p$ in expectation, whereas $\|Y - X\beta_{OLS}\|_2^2$ equals $n - p$ in expectation, and so there is much less margin for error in approximating $C_{PE}(S)$.

In the algorithmic or worst-case setting, $\sup_{\epsilon \in \mathbb{R}^{n} \setminus \{0\},\; \Pi^U \epsilon = 0} \frac{\|\Pi^U_S \epsilon\|_2^2}{\|\epsilon\|_2^2}$ is the relevant quantity, whereas in the statistical setting $\|\Pi^U_S\|_F^2$ is the relevant quantity. The Frobenius norm enters in the statistical setting because we are taking an average over homoscedastic noise, and so the $\ell_2$ norm of the eigenvalues of $\Pi^U_S$ needs to be controlled. On the other hand, in the algorithmic or worst-case setting, the worst direction in the null-space of $U^T$ needs to be controlled, and thus the spectral norm enters.
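The relation between the two statistical criteria can be checked numerically in the zero-bias case $\mathrm{rank}(SU) = p$. The sketch below assumes a dense Gaussian $S$ and small synthetic dimensions (both choices are illustrative, not from the paper). It uses $E[\epsilon\epsilon^T] = I$ to evaluate the expectations in closed form: $E\|X(\beta - \beta_S)\|_2^2 = \|\Pi^U_S\|_F^2$, $E\|Y - X\beta_S\|_2^2 = \|I - \Pi^U_S\|_F^2$, $E\|X(\beta - \beta_{OLS})\|_2^2 = p$, and $E\|Y - X\beta_{OLS}\|_2^2 = n - p$, and then compares the directly computed RE with $1 + (C_{PE}(S) - 1)/(n/p - 1)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 300, 5, 40
X = rng.standard_normal((n, p))

# Left singular vectors of X (n x p, orthonormal columns).
U, _, _ = np.linalg.svd(X, full_matrices=False)

# Illustrative sketching matrix (assumed): dense Gaussian, r x n.
S = rng.standard_normal((r, n))
SU = S @ U
rank_SU = np.linalg.matrix_rank(SU)      # equals p here, so the bias term vanishes

# Oblique projection Pi = U (SU)^+ S, an n x n matrix.
Pi = U @ np.linalg.pinv(SU) @ S

# Prediction efficiency with zero bias: ||Pi||_F^2 / p.
c_pe = np.linalg.norm(Pi, "fro") ** 2 / p

# Residual efficiency computed directly from its definition (zero bias):
# E||Y - X beta_S||^2 = ||I - Pi||_F^2 and E||Y - X beta_OLS||^2 = n - p.
c_re_direct = np.linalg.norm(np.eye(n) - Pi, "fro") ** 2 / (n - p)

# Residual efficiency obtained from the prediction efficiency.
c_re_from_pe = 1.0 + (c_pe - 1.0) / (n / p - 1.0)

print(f"C_PE = {c_pe:.4f}")
print(f"C_RE (direct) = {c_re_direct:.6f}, C_RE (from C_PE) = {c_re_from_pe:.6f}")
```

The two RE values agree to rounding error, while the PE is larger by roughly the factor $n/p$ in its excess over 1, which is the gap between the two criteria discussed above.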
3. Main Theoretical Results

In this section, we provide upper bounds for $C_{WC}(S)$, $C_{PE}(S)$, and $C_{RE}(S)$, where $S$ corresponds to random sampling and random projection matrices. In particular, we provide upper bounds for 4 sketching matrices: (1) a vanilla leverage-based random sampling algorithm from Drineas et al. (2012); (2) a variant of leverage-based random sampling, where the random samples are not re-scaled prior to their inclusion in the sketch; (3) a vanilla random projection algorithm, where $S$ is a random matrix containing i.i.d. sub-Gaussian random variables; and (4) a random projection algorithm, where $S$ is a random Hadamard-based random projection, as analyzed in Boutsidis and Gittens (2013).

3.1 Random Sampling Methods

Here, we consider random sampling algorithms. To do so, first define a random sampling matrix $\tilde{S} \in \mathbb{R}^{r \times n}$ as follows: $\tilde{S}_{ij} \in \{0, 1\}$ for all $(i, j)$ and $\sum_{j=1}^{n} \tilde{S}_{ij} = 1$, where each row has an independent multinomial distribution with probabilities $(p_i)_{i=1}^{n}$. The matrix of cross-leverage scores is defined as $L = UU^T \in \mathbb{R}^{n \times n}$, and $\ell_i = L_{ii}$ denotes the leverage score corresponding to the $i$-th sample. Note that the leverage scores satisfy $\sum_{i=1}^{n} \ell_i = \mathrm{trace}(L) = p$ and $0 \le \ell_i \le 1$.

The sampling probability distribution we consider, $(p_i)_{i=1}^{n}$, is of the form $p_i = (1 - \theta)\frac{\ell_i}{p} + \theta q_i$, where $\{q_i\}_{i=1}^{n}$, satisfying $0 \le q_i \le 1$ and $\sum_{i=1}^{n} q_i = 1$, is an arbitrary probability distribution, and $0 \le \theta < 1$. In other words, it is a convex combination of a leverage-based distribution and another arbitrary distribution. Note that for $\theta = 0$, the probabilities are proportional to the leverage scores, whereas for $\theta = 1$, the probabilities follow $\{q_i\}_{i=1}^{n}$.

We consider two sampling matrices, one where the random sampling matrix is re-scaled, as in Drineas et al. (2011), and one in which no re-scaling takes place. In particular, let $S_{NR} = \tilde{S}$ denote the random sampling matrix (where the subscript $NR$ denotes the fact that no re-scaling takes place). The re-scaled sampling matrix is $S_R = \tilde{S}W \in \mathbb{R}^{r \times n}$, where $W \in \mathbb{R}^{n \times n}$ is a diagonal re-scaling matrix, with $[W]_{jj} = \sqrt{\frac{1}{r p_j}}$ and $W_{ji} = 0$ for $j \ne i$. The quantity $\frac{1}{p_j}$ is the re-scaling factor.

In this case, we have the following result, the proof of which may be found in Section B.1.

Theorem 1 For $S = S_R$, there exist constants $C$ and $C'$ such that if $r \ge \frac{Cp}{1 - \theta}\log\left(\frac{C'p}{1 - \theta}\right)$, then $\mathrm{rank}(S_R U) = p$ and:

$$C_{WC}(S_R) \le 1 + 12\,\frac{p}{r},$$
$$C_{PE}(S_R) \le 44\,\frac{n}{r},$$
$$C_{RE}(S_R) \le 1 + 44\,\frac{p}{r},$$

with probability at least 0.7.

Several things are worth noting about this result. First, note that both $C_{WC}(S_R) - 1$ and $C_{RE}(S_R) - 1$ scale as $\frac{p}{r}$; thus, it is possible to obtain high-quality performance guarantees for ordinary least squares, as long as $\frac{p}{r} \to 0$, e.g., if $r$ is only slightly larger than $p$. On the other hand, $C_{PE}(S_R)$ scales as $\frac{n}{r}$, meaning $r$ needs to be close to $n$ to provide similar performance guarantees. Next, note that all of the upper bounds apply to any data matrix $X$, without assuming any additional structure on $X$. Finally, note that when $\theta = 1$, which corresponds to sampling the rows based on $\{q_i\}_{i=1}^{n}$, all the upper bounds are $\infty$. Our simulations also reveal that uniform sampling generally performs more poorly than leverage-score based approaches under the linear models we consider.
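The construction of the two sampling matrices can be sketched as follows. This is an illustration under stated assumptions: leverage scores are computed exactly with a thin QR factorization, the mixing distribution is taken to be uniform, the re-scaling uses weights $1/\sqrt{r p_j}$ on the sampled columns, and the function names `sampling_probs` and `sampling_sketch` are invented for this example. The matrices are stored densely only to mirror the notation; in practice one would simply index the sampled rows.

```python
import numpy as np

def sampling_probs(X, theta=0.0, q=None):
    """Mixture of leverage-based probabilities (weight 1 - theta) and q (weight theta)."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X, mode="reduced")
    lev = np.sum(Q ** 2, axis=1)             # exact leverage scores, summing to p
    if q is None:
        q = np.full(n, 1.0 / n)              # uniform mixing distribution (assumed default)
    return (1.0 - theta) * lev / p + theta * q

def sampling_sketch(probs, r, rng, rescale=True):
    """r x n sampling matrix with one non-zero per row, chosen i.i.d. from probs."""
    n = probs.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=probs)
    S = np.zeros((r, n))
    S[np.arange(r), idx] = 1.0               # no re-scaling: entries in {0, 1}
    if rescale:
        S = S * (1.0 / np.sqrt(r * probs))   # column j scaled by 1 / sqrt(r * p_j)
    return S

rng = np.random.default_rng(3)
n, p, r = 1000, 5, 100
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + rng.standard_normal(n)

probs = sampling_probs(X, theta=0.1)
S_R = sampling_sketch(probs, r, rng, rescale=True)
S_NR = sampling_sketch(probs, r, rng, rescale=False)

# Sketched LS estimate from the re-scaled sample.
beta_s = np.linalg.lstsq(S_R @ X, S_R @ Y, rcond=None)[0]
```

With this scaling, each diagonal entry of $S_R^T S_R$ has expectation 1, so the re-scaled sketch preserves squared norms on average, whereas the unscaled matrix does not.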

An important practical point is the following: the distribution $\{q_i\}_{i=1}^{n}$ does not enter the results. This allows us to consider different distributions. An obvious choice is uniform, i.e., $q_i = \frac{1}{n}$ (see e.g., Ma et al. (2014, 2015)). Another important example is that of approximate leverage-score sampling, as developed in Drineas et al. (2012). (The running time of the main algorithm of Drineas et al. (2012) is $o(np^2)$, and thus this reduces computation compared with the use of exact leverage scores, which take $O(np^2)$ time to compute.) Let $(\tilde{\ell}_i)_{i=1}^{n}$ denote the approximate leverage scores developed by the procedure in Drineas et al. (2012). Based on Theorem 2 in Drineas et al. (2012), $|\ell_i - \tilde{\ell}_i| \le \epsilon$ where $0 < \epsilon < 1$ for $r$ appropriately chosen. Now, using $p_i = \frac{\tilde{\ell}_i}{p}$, $p_i$ can be re-expressed as $p_i = (1 - \epsilon)\frac{\ell_i}{p} + \epsilon q_i$ where $(q_i)_{i=1}^{n}$ is a distribution (unknown since we only have a bound on the approximate leverage scores). Hence, the performance bounds achieved by approximate leveraging are analogous to those achieved by adding $\epsilon$ multiplied by a uniform or other arbitrary distribution.

Next, we consider the leverage-score estimator without re-scaling $S_{NR}$. In order to develop nontrivial bounds on $C_{WC}(S_{NR})$, $C_{PE}(S_{NR})$, and $C_{RE}(S_{NR})$, we need to make a strong assumption on the leverage-score distribution on $X$. To do so, we define the following.

Definition 1 (k-heavy hitter leverage distribution) A sequence of leverage scores $(\ell_i)_{i=1}^{n}$ is a k-heavy hitter leverage score distribution if there exist constants $c, C > 0$ such that for $1 \le i \le k$, $\frac{cp}{k} \le \ell_i \le \frac{Cp}{k}$ and for the remaining $n - k$ leverage scores, $\sum_{i=k+1}^{n} \ell_i \le \frac{3}{4}$.

The interpretation of a k-heavy hitter leverage distribution is one in which only $k$ samples in $X$ contain the majority of the leverage score mass. In the simulations below, we provide examples of synthetic matrices $X$ where the majority of the mass is in the largest leverage scores. The parameter $k$ acts as a measure of non-uniformity, in that the smaller the $k$, the more non-uniform are the leverage scores. The k-heavy hitter leverage distribution allows us to model highly non-uniform leverage scores.
In this case, we have the following result, the proof of which may be found in Section B.2.

Theorem 2 For $S = S_{NR}$, with $\theta = 0$ and assuming a $k$-heavy hitter leverage distribution, there exist constants $c_1$ and $c_2$ such that if $r \ge c_1 p \log(c_2 p)$, then with probability at least 0.6, $\mathrm{rank}(S_{NR}) = p$ and:
$$C_{WC}(S_{NR}) \le 1 + \frac{44 C^2}{c^2}\,\frac{p}{r}, \qquad C_{PE}(S_{NR}) \le \frac{44 C^4}{c^2}\,\frac{k}{r}, \qquad C_{RE}(S_{NR}) \le 1 + \frac{44 C^4}{c^2}\,\frac{pk}{nr}.$$

Notice that when $k \ll n$, the bounds in Theorem 2 on $C_{PE}(S_{NR})$ and $C_{RE}(S_{NR})$ are significantly sharper than the bounds in Theorem 1 on $C_{PE}(S_R)$ and $C_{RE}(S_R)$. Hence not re-scaling has the potential to provide sharper bounds in the statistical setting. However, a stronger assumption on $X$ is needed for this result.
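The difference between sampling with and without re-scaling can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`leverage_scores`, `sampling_probabilities`, `sampled_rows`, `sketched_ls`) are hypothetical, rows are drawn i.i.d. with replacement, the re-scaled sketch is assumed to weight a sampled row $i$ by $1/\sqrt{r p_i}$, and the sketch without re-scaling is assumed to leave sampled rows unweighted.

```python
import numpy as np

def leverage_scores(X):
    """Exact leverage scores l_i = ||U_(i)||_2^2 from a thin SVD of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U ** 2, axis=1)

def sampling_probabilities(lev, theta=0.0, q=None):
    """p_i = (1 - theta) * l_i / p + theta * q_i; q defaults to uniform.

    lev.sum() equals p when X has full column rank.
    """
    n = lev.shape[0]
    q = np.full(n, 1.0 / n) if q is None else np.asarray(q, dtype=float)
    return (1.0 - theta) * lev / lev.sum() + theta * q

def sampled_rows(X, Y, probs, r, rescale, rng):
    """Return (SX, SY) for a row-sampling sketch with r i.i.d. draws.

    rescale=True  : assumed form of S_R, rows weighted by 1/sqrt(r * p_i).
    rescale=False : assumed form of S_NR, rows left unweighted.
    """
    idx = rng.choice(X.shape[0], size=r, replace=True, p=probs)
    w = 1.0 / np.sqrt(r * probs[idx]) if rescale else np.ones(r)
    return w[:, None] * X[idx], w * Y[idx]

def sketched_ls(SX, SY):
    """Least-squares solution computed from the sketched data only."""
    beta, *_ = np.linalg.lstsq(SX, SY, rcond=None)
    return beta

# Illustrative run on synthetic data (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
n, p, r = 500, 4, 60
X = rng.standard_normal((n, p))
beta = np.arange(1.0, p + 1.0)
Y = X @ beta + 0.1 * rng.standard_normal(n)
probs = sampling_probabilities(leverage_scores(X))
beta_R = sketched_ls(*sampled_rows(X, Y, probs, r, True, rng))
beta_NR = sketched_ls(*sampled_rows(X, Y, probs, r, False, rng))
```

Working with the sampled rows directly avoids forming the $r \times n$ matrix $S$; the two estimates differ only through the row weights, which is the distinction that Theorems 1 and 2 quantify.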

3.2 Random Projection Methods

Here, we consider two random projection algorithms, one based on a sub-Gaussian projection matrix and the other based on a Hadamard projection matrix. To do so, define $[S_{SGP}]_{ij} = X_{ij}$, where $(X_{ij})_{1 \le i \le r,\, 1 \le j \le n}$ are i.i.d. sub-Gaussian random variables with $E[X_{ij}] = 0$, variance $E[X_{ij}^2] = \sigma^2$ and sub-Gaussian parameter 1. In this case, we have the following result, the proof of which may be found in Section B.3.

Theorem 3 For any matrix $X$, there exists a constant $c$ such that if $r \ge c \log n$, then with probability greater than 0.7, it holds that $\mathrm{rank}(S_{SGP}) = p$ and that:
$$C_{WC}(S_{SGP}) \le 1 + 11\,\frac{p}{r}, \qquad C_{PE}(S_{SGP}) \le 44\left(1 + \frac{n}{r}\right), \qquad C_{RE}(S_{SGP}) \le 1 + 44\,\frac{p}{r}.$$

Notice that the bounds in Theorem 3 for $S_{SGP}$ are equivalent to the bounds in Theorem 1 for $S_R$, except that $r$ is required only to be larger than $O(\log n)$ rather than $O(p \log p)$. Hence for smaller values of $r$, random sub-Gaussian projections are more stable than leverage-score sampling based approaches. This reflects the fact that, to a first-order approximation, leverage-score sampling performs as well as performing a smooth projection.

Next, we consider the randomized Hadamard projection matrix. In particular, $S_{Had} = S_{Unif} H D$, where $H \in \mathbb{R}^{n \times n}$ is the standard Hadamard matrix (see e.g., Hedayat and Wallis (1978)), $S_{Unif} \in \mathbb{R}^{r \times n}$ is an $r \times n$ uniform sampling matrix, and $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix with random equiprobable $\pm 1$ entries. In this case, we have the following result, the proof of which may be found in Section B.4.

Theorem 4 For any matrix $X$, there exists a constant $c$ such that if $r \ge c\, p \log n\,(\log p + \log\log n)$, then with probability greater than 0.8, it holds that $\mathrm{rank}(S_{Had}) = p$ and that:
$$C_{WC}(S_{Had}) \le 1 + 40\log(np)\,\frac{p}{r}, \qquad C_{PE}(S_{Had}) \le 40\log(np)\left(1 + \frac{n}{r}\right), \qquad C_{RE}(S_{Had}) \le 1 + 40\log(np)\left(1 + \frac{p}{r}\right).$$

Notice that the bounds in Theorem 4 for $S_{Had}$ are equivalent to the bounds in Theorem 1 for $S_R$, up to a constant and a $\log(np)$ factor. As discussed in Drineas et al.
(2011), the Hadamard transformation makes the leverage scores of $X$ approximately uniform (up to a $\log(np)$ factor), which is why the performance is similar to the sub-Gaussian projection (which also tends to make the leverage scores of $X$ approximately uniform). We suspect that the additional $\log(np)$ factor is an artifact of the analysis since we use an entry-wise concentration bound; using more sophisticated techniques, we believe that the $\log(np)$ factor can be removed. The probabilities of 0.6, 0.7, and 0.8 with which the upper bounds hold in the four Theorems above are an artifact of the concentration bounds used in the proofs and can be improved at the expense of weaker constants (e.g., 40) in front of the efficiency bounds. As

we show in the next section, the upper bound of $\frac{n}{r}$ on $C_{PE}(S)$ for $S = S_R$, $S_{SGP}$ and $S_{Had}$ cannot be improved up to a constant, while for $S = S_{NR}$ the upper bound of $\frac{k}{r}$ cannot be improved.

3.3 Lower Bounds

Subsequent to the dissemination of the original version of this paper (Raskutti and Mahoney, 2014), Pilanci and Wainwright (2014), amongst other results, develop lower bounds on the numerator in $C_{PE}(S)$. This proves that our upper bounds on $C_{PE}(S)$ cannot be improved. We re-state Theorem 1 (Example 1) in Pilanci and Wainwright (2014) in a way that makes it most comparable to our results.

Theorem 5 (Theorem 1 in Pilanci and Wainwright (2014)) For any sketching matrix satisfying $\|E[S^T (S S^T)^{-1} S]\|_{op} \le \eta\,\frac{r}{n}$, any estimator based on $(SX, SY)$ satisfies the lower bound with probability greater than $1/2$:
$$C_{PE}(S) \ge \frac{n}{128\,\eta\, r}.$$

Pilanci and Wainwright (2014) show that for $S = S_R$, $S_{SGP}$ and $S_{Had}$, $\|E[S^T (S S^T)^{-1} S]\|_{op} \le c\,\frac{r}{n}$ where $c$ is a constant, and hence $\eta = c$ and the lower bound matches our upper bounds up to a constant. On the other hand, for $S = S_{NR}$, it is straightforward to show that $\|E[S^T (S S^T)^{-1} S]\|_{op} \le c\,\frac{r}{k}$ for some constant $c$, and hence $\eta = c\,\frac{n}{k}$ and the lower bound scales as $\frac{k}{r}$, to match the upper bound on $C_{PE}(S_{NR})$ from Theorem 2. This is why we are able to prove a tighter upper bound when the matrix $X$ has highly non-uniform leverage scores. Importantly, this proves that $C_{PE}(S)$ is a quantity that is more challenging to control than $C_{RE}(S)$ and $C_{WC}(S)$ when only a single sketch is used. Using this insight, Pilanci and Wainwright (2014) show that by using a particular iterative Hessian sketch, $C_{PE}(S)$ can be controlled up to a constant. In addition to providing a lower bound on the PE using a sketching matrix just once, Pilanci and Wainwright (2014) also develop a new iterative sketching scheme in which sketching matrices are used repeatedly and which can reduce the PE significantly. Once again, the probability of 0.5 can be improved by making the constant of 128 less tight. Finally, in prior related work, Lu and Foster (2014); Lu et al.
(2013) show that the rate $1 + \frac{p}{r}$ may be achieved for the PE using the estimator $\tilde{\beta} = ((SX)^T (SX))^{-1} X^T Y$. This estimator is related to the ridge regression estimator since sketches or random projections are applied only in the computation of the $X^T X$ matrix and not $X^T Y$. Since both $X^T Y$ and $(SX)^T (SX)$ have small dimension, this estimator has significant computational benefits. However, this estimator does not violate the lower bound in Pilanci and Wainwright (2014) since it is not based on the sketches $(SX, SY)$ but instead uses $(SX, X^T Y)$.

4. Empirical Results

In this section, we present the results of an empirical evaluation, illustrating the results of our theory. We will compare the following 6 sketching matrices.

(1) $S = S_R$ - random leverage-score sampling with re-scaling.

(2) $S = S_{NR}$ - random leverage-score sampling without re-scaling.
(3) $S = S_{Unif}$ - random uniform sampling (each sample drawn independently with probability $1/n$).
(4) $S = S_{Sh}$ - random leverage-score sampling with re-scaling and with $\theta = 0.1$.
(5) $S = S_{GP}$ - Gaussian projection matrices.
(6) $S = S_{Had}$ - Hadamard projections.

To compare the methods and see how they perform on inputs with different leverage scores, we generate test matrices using a method outlined in Ma et al. (2014, 2015). Set $n = 1024$ (to ensure, for simplicity, an integer power of 2 for the Hadamard transform) and $p = 50$, and let the number of samples drawn with replacement, $r$, be varied. $X$ is then generated based on a t-distribution with different choices of $\nu$ to reflect different uniformity of leverage scores. Each row of $X$ is selected independently with distribution $X_i \sim t_\nu(\Sigma)$, where $\Sigma$ corresponds to an auto-regressive model and $\nu$ is the degrees of freedom. The 3 values of $\nu$ presented here are $\nu = 1$ (highly non-uniform), $\nu = 2$ (moderately non-uniform), and $\nu = 10$ (very uniform). See Figure 1 for a plot showing how $\nu$ influences the uniformity of the leverage scores. For each setting, the simulation is repeated 100 times in order to average over both the randomness in the sampling and, in the statistical setting, the randomness over $y$.

[Figure 1: Ordered leverage scores for different values of $\nu$ (a) and cumulative sum of ordered leverage scores for different values of $\nu$ (b). Curves: $\nu = 10$, $\nu = 2$, $\nu = 1$, Uniform. Horizontal axis: Index; vertical axes: Leverage score (a), Cumulative leverage mass (b).]

Note that a natural comparison can be drawn between the parameter $\nu$ and the parameter $k$ in the $k$-heavy hitter definition. If we want to find the value $k$ such that 90% of the leverage mass is captured, then for $\nu = 1$, $k \approx 100$; for $\nu = 2$, $k \approx 700$; and for $\nu = 10$, $k \approx 900$, according to Figure 1(b). Hence the smaller $\nu$, the smaller $k$, since the leverage scores are more non-uniform.
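A rough sketch of this data-generating setup, and of how the value of $k$ capturing 90% of the leverage mass can be read off numerically, is given below. Several details are assumptions made only for illustration: the auto-regressive covariance is taken to be $\Sigma_{ij} = 2 \times 0.5^{|i-j|}$ (the text above only says that $\Sigma$ corresponds to an auto-regressive model), the multivariate t rows are generated as Gaussian rows divided by $\sqrt{W/\nu}$ with $W \sim \chi^2_\nu$, and the helper names are hypothetical.

```python
import numpy as np

def ar_covariance(p, rho=0.5, scale=2.0):
    """Assumed AR-type covariance: Sigma_ij = scale * rho^|i-j|."""
    idx = np.arange(p)
    return scale * rho ** np.abs(idx[:, None] - idx[None, :])

def multivariate_t_rows(n, Sigma, nu, rng):
    """n i.i.d. rows from t_nu(Sigma): Gaussian rows over sqrt(chi2_nu / nu)."""
    p = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, p)) @ L.T
    W = rng.chisquare(nu, size=n)
    return Z / np.sqrt(W / nu)[:, None]

def leverage_scores(X):
    """Exact leverage scores from a thin SVD of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U ** 2, axis=1)

def k_for_mass(lev, mass=0.9):
    """Smallest k such that the k largest leverage scores hold `mass` of the total."""
    ordered = np.sort(lev)[::-1]
    cum = np.cumsum(ordered) / ordered.sum()
    return int(np.searchsorted(cum, mass) + 1)

rng = np.random.default_rng(0)
n, p = 1024, 50
Sigma = ar_covariance(p)
k_by_nu = {}
for nu in (1, 2, 10):
    X = multivariate_t_rows(n, Sigma, nu, rng)
    k_by_nu[nu] = k_for_mass(leverage_scores(X))
```

Under these assumptions `k_by_nu` should be smallest for $\nu = 1$ and largest for $\nu = 10$; the exact values depend on the assumed $\Sigma$ and on the random seed, so they need not match the values read from Figure 1(b).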

We first compare the sketching methods in the statistical setting by comparing $C_{PE}(S)$. In Figure 2, we plot the average $C_{PE}(S)$ for the 6 subsampling approaches outlined above, averaged over 100 samples, for larger values of $r$ between 300 and 1000. In addition, in Figure 3, we include a table of results for smaller values of $r$ between 80 and 200, to get a sense of the performance when $r$ is close to $p$. Observe that in the large $r$ setting, $S_{NR}$ is clearly the best approach, out-performing $S_R$, especially for $\nu = 1$. For small $r$, projection-based methods work better, especially for $\nu = 1$, since they tend to uniformize the leverage scores. In addition, $S_{Sh}$ is superior compared to $S_{NR}$ when $r$ is small, especially when $\nu = 1$. We do not plot $C_{RE}(S)$ as it is simply a re-scaled $C_{PE}(S)$ by Lemma 1.

[Figure 2: Relative prediction efficiency $C_{PE}(S)$ for large $r$. Panels: (a) $\nu = 10$, (b) $\nu = 2$, (c) $\nu = 1$. Curves: $S_R$, $S_{NR}$, $S_{Unif}$, $S_{Sh}$, $S_{GP}$, $S_{Had}$. Horizontal axis: $r$ from 300 to 1000.]

(a) ν = 10
r      S_R       S_NR      S_Unif     S_Sh      S_GP    S_Had
80     39.8      38.7      37.3       39.8      35.5    40.0
90     27.8      27.2      27.2       27.2      26.1    27.4
100    22.4      22.1      22.7       22.2      2.3     23.1
200    8.33      7.88      8.5        7.62      8.20    8.4

(b) ν = 2
r      S_R       S_NR      S_Unif     S_Sh      S_GP    S_Had
80     70.7      60.1      1.05×10^2  73.3      36.5    39.8
90     45.6      34.6      66.2       44.7      26.7    28.2
100    35.1      25.9      52.8       33.8      22.1    22.5
200    9.82      5.54      5.3        9.7       7.59    7.8

(c) ν = 1
r      S_R       S_NR      S_Unif     S_Sh      S_GP    S_Had
80     4.4×10^4  3.1×10^4  7.0×10^3   1.4×10^4  34.2    40.0
90     1.5×10^4  7.0×10^3  5.2×10^3   1.0×10^4  26.0    28.7
100    1.8×10^4  3.6×10^3  3.9×10^3   3.4×10^3  22.7    24.8
200    2.0×10^2  34.0      5.2×10^2   3.6×10^2  7.94    7.84

Figure 3: Relative prediction efficiency $C_{PE}(S)$ for small $r$.

Overall, $S_{NR}$, $S_R$, and $S_{Sh}$ compare very favorably to $S_{Unif}$, which is consistent with Theorem 2, since samples with higher leverage scores tend to reduce the mean-squared

error. Furthermore, $S_R$ (which, recall, involves re-scaling) only increases the mean-squared error, which is again consistent with the theoretical results. The effects are more apparent as the leverage score distribution becomes more non-uniform (i.e., for $\nu = 1$).

The theoretical upper bounds in Theorems 1-4 suggest that $C_{PE}(S)$ is of the order $\frac{n}{r}$, independent of the leverage scores of $X$, for $S = S_R$ as well as $S = S_{Had}$ and $S_{GP}$. On the other hand, the simulations suggest that for highly non-uniform leverage scores, $C_{PE}(S_R)$ is higher than when the leverage scores are uniform, whereas for $S = S_{Had}$ and $S_{GP}$, the non-uniformity of the leverage scores does not significantly affect the bounds. The reason that $S_{Had}$ and $S_{GP}$ are not significantly affected by the leverage-score distribution is that the Hadamard and Gaussian projections have the effect of making the leverage scores of any matrix uniform (Drineas et al., 2011). The reason for the apparent disparity when $S = S_R$ is that the theoretical bounds use Markov's inequality, which is a crude concentration bound. We suspect that a more refined analysis involving the bounded difference inequality would reflect that non-uniform leverage scores result in a larger $C_{PE}(S_R)$.

[Figure 4: Worst-case relative error $C_{WC}(S)$ for large $r$. Panels: (a) $\nu = 10$, (b) $\nu = 2$, (c) $\nu = 1$. Curves: $S_R$, $S_{NR}$, $S_{Sh}$, $S_{GP}$, $S_{Had}$. Horizontal axis: $r$ from 300 to 1000.]

Finally, Figures 4 and 5 provide a comparison of the worst-case relative error $C_{WC}(S)$ for large and small ($r > 200$ and $r \le 200$, respectively) values of $r$. Observe that, in general, the values of $C_{WC}(S)$ are much closer to 1 than those of $C_{PE}(S)$ for all choices of $S$. This reflects the $\frac{n}{p}$ scaling difference between the bounds. Interestingly, Figures 4 and 5 indicate that $S_{NR}$ still tends to out-perform $S_R$ in general; however, the difference is not as significant as in the statistical setting.

5. Discussion and Conclusion

In this paper, we developed a framework for analyzing algorithmic and statistical criteria for general sketching matrices $S \in \mathbb{R}^{r \times n}$ applied to the least-squares objective. As our analysis makes clear, the algorithmic and statistical criteria depend on different properties of the oblique projection matrix $\Pi_S^U = U (SU)^\dagger S$, where $U$ is the left singular matrix for $X$. In particular, the algorithmic criterion (WC) depends on the quantity $\sup_{U^T \epsilon = 0} \frac{\|\Pi_S^U \epsilon\|_2}{\|\epsilon\|_2}$, since in that case the data may be arbitrary and worst-case, whereas the two statistical criteria (RE and PE) depend on $\|\Pi_S^U\|_F$, since in that case the data follow a linear model with homogeneous noise variance.
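The dependence of the three criteria on $\Pi_S^U$ can be checked numerically for a given sketch. The sketch below is illustrative only and rests on several assumptions: it uses the characterizations from Lemma 1 in the case $\mathrm{rank}(SU) = p$ (so that the term involving $I - (SU)^\dagger SU$ vanishes), it evaluates the supremum over $\epsilon$ with $U^T \epsilon = 0$ as the spectral norm of $\Pi_S^U (I - U U^T)$, the $1/\sqrt{r}$ scaling of both sketches is an arbitrary normalization, and all function names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def gaussian_sketch(r, n, rng):
    """r x n matrix with i.i.d. N(0, 1/r) entries."""
    return rng.standard_normal((r, n)) / np.sqrt(r)

def hadamard_sketch(r, n, rng):
    """S_Unif H D: random signs, Hadamard transform, r uniformly sampled rows."""
    signs = rng.choice([-1.0, 1.0], size=n)
    rows = rng.choice(n, size=r, replace=True)
    return hadamard(n)[rows] * signs / np.sqrt(r)

def criteria(S, X):
    """Return (C_WC, C_PE, C_RE) computed from Pi = U (SU)^+ S, assuming rank(SU) = p."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    n, p = U.shape
    SU = S @ U
    if np.linalg.matrix_rank(SU) < p:
        raise ValueError("rank(SU) < p: the simplified formulas do not apply")
    Pi = U @ np.linalg.pinv(SU) @ S
    c_pe = np.linalg.norm(Pi, "fro") ** 2 / p
    c_re = 1.0 + (c_pe - 1.0) / (n / p - 1.0)
    # sup over eps with U^T eps = 0 of ||Pi eps|| / ||eps|| = ||Pi (I - U U^T)||_2
    c_wc = 1.0 + np.linalg.norm(Pi - (Pi @ U) @ U.T, 2) ** 2
    return c_wc, c_pe, c_re

rng = np.random.default_rng(0)
n, p, r = 256, 5, 64
X = rng.standard_normal((n, p))
wc_gp, pe_gp, re_gp = criteria(gaussian_sketch(r, n, rng), X)
wc_had, pe_had, re_had = criteria(hadamard_sketch(r, n, rng), X)
```

With $S$ equal to the identity all three quantities reduce to 1, which gives a quick sanity check; for $r \ll n$ one would expect the PE value to be much larger than the WC and RE values, in line with the discussion above.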

(a) ν = 10
r      S_R       S_NR       S_Unif    S_Sh      S_GP    S_Had
80     2.82      2.78       2.94      2.82      2.74    2.89
90     2.34      2.32       2.40      2.37      2.24    2.33
100    2.04      2.0        2.09      2.06      2.02    2.03
200    1.33      .3         1.36      1.32      1.34    1.34

(b) ν = 2
r      S_R       S_NR       S_Unif    S_Sh      S_GP    S_Had
80     4.46      3.69       5.7       4.33      2.8     2.99
90     3.25      2.85       4.29      3.8       2.27    2.28
100    2.70      2.20       3.52      2.6       2.06    2.0
200    1.43      1.22       1.70      1.42      1.34    1.36

(c) ν = 1
r      S_R       S_NR       S_Unif    S_Sh      S_GP    S_Had
80     6.0×10^9  6.0×10^10  2.1×10^5  6.9×10^2  2.64    2.85
90     1.7×10^5  3.7×10^4   2.4×10^5  5.0×10^2  2.35    2.35
100    9.1×10^4  4.5×10^4   1.3×10^5  1.8×10^2  2.07    2.2
200    2.1×10^2  70.0       1.6×10^4  8.0       1.34    1.35

Figure 5: Worst-case relative error $C_{WC}(S)$ for small $r$.

Using our framework, we develop upper bounds for 3 performance criteria applied to 4 sketching schemes. Our upper bounds reveal that in the regime where $p < r \ll n$, our sketching schemes achieve optimal performance up to constants, in terms of WC and RE. On the other hand, the PE scales as $\frac{n}{r}$, meaning $r$ needs to be close to (or greater than) $n$ for good performance. Subsequent lower bounds in Pilanci and Wainwright (2014) show that this upper bound cannot be improved, but subsequent work by Pilanci and Wainwright (2014) as well as Lu and Foster (2014); Lu et al. (2013) provide alternative, more sophisticated sketching approaches to deal with these challenges. Our simulation results reveal that when $r$ is very close to $p$, projection-based approaches tend to out-perform sampling-based approaches, since projection-based approaches tend to be more stable in that regime.

There are numerous ways in which the framework and results from this paper can be extended. Firstly, there is a large literature that presents a number of different approaches to sketching. Since our framework provides general conditions to assess the statistical and algorithmic performance of sketching matrices, a natural and straightforward extension would be to use our framework to compare other sketching matrices. Another natural extension is to determine whether aspects of the framework can be adapted to other statistical models and problems of interest (e.g., generalized linear models, covariance estimation, PCA, etc.).
Finally, another important direction is to compare the stability and robustness properties of different sketching matrices. Our current analysis assumes a known linear model, and it is unclear how the sketching matrices behave under model mis-specification.

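Acknowledgement. We would like to thank the Statistical and Applied Mathematical Sciences Institute and the members of its various working groups for helpful discussions.

Appendix A. Auxiliary Results

In this section, we provide proofs of Lemma 1 and an intermediate result we will later use to prove the main theorems.

A.1 Proof of Lemma 1

Recall that $X = U \Sigma V^T$, where $U \in \mathbb{R}^{n \times p}$, $\Sigma \in \mathbb{R}^{p \times p}$ and $V \in \mathbb{R}^{p \times p}$ denote the left singular matrix, diagonal singular value matrix and right singular matrix respectively. First we show that $\|Y - X\beta_{OLS}\|_2^2 = \|\epsilon\|_2^2$. To do so, observe that
$$\|Y - X\beta_{OLS}\|_2^2 = \|Y - U \Sigma V^T \beta_{OLS}\|_2^2,$$
and set $\delta_{OLS} = \Sigma V^T \beta_{OLS}$. It follows that $\delta_{OLS} = U^T Y$. Hence
$$\|Y - X\beta_{OLS}\|_2^2 = \|Y - \Pi_U Y\|_2^2,$$
where $\Pi_U = U U^T$. For every $Y \in \mathbb{R}^n$, there exist a unique $\delta \in \mathbb{R}^p$ and $\epsilon \in \mathbb{R}^n$ such that $U^T \epsilon = 0$ and $Y = U\delta + \epsilon$. Hence
$$\|Y - X\beta_{OLS}\|_2^2 = \|(I_{n \times n} - \Pi_U)\epsilon\|_2^2 = \|\epsilon\|_2^2,$$
where the final equality holds since $\Pi_U \epsilon = 0$.

Now we analyze $\|Y - X\beta_S\|_2^2$. Observe that
$$\|Y - X\beta_S\|_2^2 = \|Y - \Pi_S^U Y\|_2^2,$$
where $\Pi_S^U = U (SU)^\dagger S$. Since $Y = U\delta + \epsilon$, it follows that
$$\|Y - X\beta_S\|_2^2 = \|U(I_{p \times p} - (SU)^\dagger SU)\delta + (I_{n \times n} - \Pi_S^U)\epsilon\|_2^2 = \|(I_{p \times p} - (SU)^\dagger SU)\delta\|_2^2 + \|(I_{n \times n} - \Pi_S^U)\epsilon\|_2^2 = \|(I_{p \times p} - (SU)^\dagger SU)\delta\|_2^2 + \|\epsilon\|_2^2 + \|\Pi_S^U \epsilon\|_2^2.$$
Therefore, for all $Y$:
$$\frac{\|Y - X\beta_S\|_2^2}{\|Y - X\beta_{OLS}\|_2^2} = 1 + \frac{\|(I_{p \times p} - (SU)^\dagger SU)\delta\|_2^2 + \|\Pi_S^U \epsilon\|_2^2}{\|\epsilon\|_2^2},$$
where $U^T \epsilon = 0$. Taking a supremum over $Y$, and consequently over $\epsilon$ and $\delta$, completes the proof for $C_{WC}(S)$.

Now we turn to the proof for $C_{PE}(S)$. First note that
$$E[\|X(\beta_{OLS} - \beta)\|_2^2] = E[\|U U^T Y - U \Sigma V^T \beta\|_2^2].$$
Under the linear model $Y = U \Sigma V^T \beta + \epsilon$,
$$E[\|X(\beta_{OLS} - \beta)\|_2^2] = E[\|\Pi_U \epsilon\|_2^2].$$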

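Since $E[\epsilon \epsilon^T] = I_{n \times n}$, it follows that
$$E[\|X(\beta_{OLS} - \beta)\|_2^2] = E[\|\Pi_U \epsilon\|_2^2] = \|\Pi_U\|_F^2 = p.$$
For $\beta_S$, we have that
$$E[\|X(\beta_S - \beta)\|_2^2] = E[\|\Pi_S^U Y - U \Sigma V^T \beta\|_2^2] = E[\|\Pi_S^U \epsilon - U(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2] = \|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + E[\|\Pi_S^U \epsilon\|_2^2] = \|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + \|\Pi_S^U\|_F^2,$$
where the cross term vanishes because $E[\epsilon] = 0$. Hence
$$C_{PE}(S) = \frac{\|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + \|\Pi_S^U\|_F^2}{p},$$
as stated.

For $C_{RE}(S)$, the mean-squared errors for $\delta_{OLS}$ and $\delta_S$ are
$$E[\|Y - X\beta_{OLS}\|_2^2] = E[\|(I - \Pi_U)\epsilon\|_2^2] = \|I - \Pi_U\|_F^2 = n - p,$$
and
$$E[\|Y - X\beta_S\|_2^2] = \|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + E[\|(I - \Pi_S^U)\epsilon\|_2^2]$$
$$= \|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + \mathrm{trace}((I - \Pi_S^U)^T (I - \Pi_S^U))$$
$$= \|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + \mathrm{trace}(I) - 2\,\mathrm{trace}(\Pi_S^U) + \|\Pi_S^U\|_F^2$$
$$= \|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + n - 2p + \|\Pi_S^U\|_F^2$$
$$= \|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + n - p + \|\Pi_S^U\|_F^2 - p.$$
Hence,
$$C_{RE}(S) = \frac{n - p + \|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + \|\Pi_S^U\|_F^2 - p}{n - p} = 1 + \frac{\|(I_{p \times p} - (SU)^\dagger SU)\Sigma V^T \beta\|_2^2 + \|\Pi_S^U\|_F^2 - p}{n - p} = 1 + \frac{C_{PE}(S) - 1}{n/p - 1}.$$

A.2 Intermediate Result

In order to provide a convenient way to parameterize our upper bounds for $C_{WC}(S)$, $C_{PE}(S)$, and $C_{RE}(S)$, we introduce the following three structural conditions on $S$. Let $\sigma_{\min}(A)$ denote the minimum non-zero singular value of a matrix $A$. The first condition is that there exists an $\alpha(S) > 0$ such that
$$\sigma_{\min}(SU) \ge \alpha(S). \qquad (2)$$
The second condition is that there exists a $\beta(S)$ such that
$$\sup_{\epsilon,\; U^T \epsilon = 0} \frac{\|U^T S^T S \epsilon\|_2}{\|\epsilon\|_2} \le \beta(S). \qquad (3)$$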