Minimax rates for memory-bounded sparse linear regression

Size: px
Start display at page:

Download "Minimax rates for memory-bounded sparse linear regression"

Transcription

1 JMLR: Workshop an Conference Proceeings vol 40:1 4, 015 Minimax rates for memory-boune sparse linear regression Jacob Steinhart Stanfor University, Department of Computer Science John Duchi Stanfor University, Department of Statistics Abstract We establish a minimax lower boun of Ω k Bɛ) on the sample size neee to estimate parameters in a k-sparse linear regression of imension uner memory restrictions to B bits, where ɛ is the l parameter error. When the covariance of the regressors is the ientity matrix, we also provie an algorithm that uses boun hols in a more general communication-boune setting, where instea of a memory boun, at most B bits of information are allowe to be aaptively) communicate about each sample. 1. Introuction ÕB + k) bits an requires Õ k Bɛ ) observations to achieve error ɛ. Our lower The growth in size an scope of atasets unerscores a nee for techniques that balance between multiple practical esierata in statistical inference an learning, as classical tools are insufficient for proviing such traeoffs. In this work, we buil on a nascent theory of resource-constraine learning one in which researchers have investigate privacy Kasiviswanathan et al., 011; Duchi et al., 013), communication an memory Zhang et al., 013; Garg et al., 014; Shamir, 014), an computational constraints Berthet an Rigollet, 013) an stuy bouns on the performance of regression estimators uner memory an communication constraints. In particular, we provie minimax upper an lower bouns that exhibit a traeoff between accuracy an available resources for a broa class of regression problems, which have to this point been ifficult to characterize. To situate our work, we briefly review existing results on statistical learning with resource constraints. There is a consierable literature on computation an learning, beginning with Valiant s work 1984) on probably approximately correct PAC) learning, which separates concept learning in poynomial versus non-polynomial settings. More recent work shows uner natural complexitytheoretic assumptions) that restriction to polynomial-time proceures increases sample complexity for several problems, incluing sparse principal components analysis Berthet an Rigollet, 013) classification problems Daniely et al., 013, 014), an, pertinent to this work, sparse linear regression Zhang et al., 014; Nataraan, 1995). Showing fine-graine traeoffs in this setting has been challenging, however. Researchers have given more explicit guarantees in other resourceconstraine learning problems; Duchi et al. 013) stuy estimation uner a privacy constraint that the statistician shoul never be able to infer more than a small amount about any iniviual in a sample. For several tasks, incluing mean estimation an regression, they establish matching upper an lower bouns exhibiting a traeoff between privacy an statistical accuracy. In the communicationlimite setting, Zhang et al. 013) stuy problems where ata are store on multiple machines, an the goal is to minimize communication between machines an a centralize fusion center. In the full version of that paper Duchi et al., 014), they show that mean-square error for -imensional c 015 J. Steinhart & J. Duchi.

2 STEINHARDT DUCHI mean estimation has lower-boun Ω Bmn logm) ) for m machines, each with n observations, an B bits of communication per machine. In concurrent work, Garg et al. 014) provie ientical results. Perhaps most relate to our work, Shamir 014) consiers communication constraints in the online learning setting, where at most B bits of information about an example can be transmitte to the algorithm before it moves on to the next example. In this case, any memory-boune algorithm is communication-boune, an so communication lower bouns imply memory lower bouns Alon et al., 1999). For -imensional 1-sparse i.e. only a single non-zero component) mean estimation problems, Shamir 014) shows that Ω/B) observations are necessary for parameter recovery, while without memory constraints, a sample of size Olog ) is sufficient. He also shows that for certain principal component analyses, a sample of size Ω 4 /B ) is necessary for estimation, while without constraints) O log 3 )) observations suffice Outline of results In this work, we also focus on the online setting. We work in a regression moel, in which we wish to estimate an unknown -imensional parameter vector w R, an we observe an i.i.. sequence X i), Y i) ) R R such that Y i) = w, X i) + ɛ i), 1) where ɛ i) is mean-zero noise inepenent of X i) with Varɛ i) ) σ. We focus on the case that w is k-sparse, meaning that it has at most k non-zero entries, that is, w 0 k. We consier resource-constraine proceures that may store information about the ith observation in a B i -bit vector Z i) {0, 1} B i. After n observations, the proceure outputs an estimate ŵ of w. We have two types of resource constraints: Communication constraints Z i) is a measurable function of X i), Y i), an the history Z 1:i 1), an the estimate ŵ is a measurable function of Z 1:n) = Z 1),..., Z n) ). Memory constraints Z i) is a measurable function of X i), Y i), an Z i 1), an ŵ is a measurable function of only Z n). Note that the communication constraints o not allow for arbitrary communication, but rather only one-pass communication protocols in which all information is communicate from left to right. Any memory-constraine proceure is also communication-constraine, so we prove our lower bouns in the weaker less restrictive) communication-constraine setting, later proviing an algorithm for the memory-constraine setting which then carries through to the communicationconstraine setting). Our two main results Theorems an 3) are that, for a variety of regression problems with observation noise σ, the mean-square error satisfies subect to certain assumptions on the covariates X an noise ɛ) { } k Ω1) σ min Bn, 1 inf ŵ sup E [ ŵ w ] Õ1) σ max w W { } k Bn, k, ) Bn where the infimum is over all communication/memory-constraine proceures using B bits per iteration, an the supremum is over a set W of k-sparse vectors to be efine later. The lower boun ) shows that any B-bit communication-constraine proceure achieving square error ɛ

3 MEMORY-BOUNDED SPARSE REGRESSION k requires Ω σ ɛ B ) observations. In the absence of resource constraints, stanar results on sparse regression e.g. Wainwright, 009) show that Θ 1 ɛ k log ) observations suffice; moreover, algorithms which require only Õ) memory can essentially match this boun Agarwal et al., 01; Steinhart et al., 014). Thus, for any communication buget B α with α < 1, we lose exponentially in the imension. The upper boun also has sharp epenence on B an, though worse σ4 epenence on accuracy an variance: it shows that Õ ) observations suffice. 1.. Summary of techniques k ɛ B Before continuing, we summarize the techniques use to prove the bouns ). The high-level structure of the lower boun follows that of Arias-Castro et al. 013), who show lower bouns for sparse regression with an aaptive esign matrix. They aapt Assoua s metho 1983) to reuce the analysis to bouning the mutual information between a single observation an the parameter vector w. In our case, even this is challenging; rather than choosing a esign matrix, the proceure chooses an arbitrary B-bit message. We control such proceures using two ieas. First, we focus on the fully sparse k = 1) case, showing a quantitative ata processing inequality, base off of techniques evelope by Duchi et al. 013), Zhang et al. 013), an Shamir 014). By analyzing certain carefully constructe likelihoo ratio bouns, we show, roughly, that if the pair X, Y ) R R provies information I about the parameter w, then any B-bit message Z base on X, Y ) provies information at most BI / about w ; see Section for these results. After this ata processing boun for k = 1, in Section 3 we evelop a irect sum theorem, base on the ieas of Garg et al. 014) an a common tool in the communication complexity literature, which shows that fining a k-sparse vector w is as har as fining k inepenent 1-sparse vectors. While our lower boun buils off of several prior results, the regression setting introuces a subtle an challenging technical issue: both the irect sum technique an prior quantitative ata processing inequalities in communication settings rely strongly on the assumption that each observation consists of inepenent ranom variables. As Y necessarily epens on X in the moel 1), we have strong coupling of our observations. To aress this, we evelop a technique that allows us to treat Y as part of the communication ; instea of Z communicating information about X an Y, now Y an Z communicate information about X. The effective number of bits of communication increases by an analogue of) the entropy of Y. For the upper boun ), in Section 4 we show how to implement l 1 -regularize ual averaging Xiao, 010) using a count sketch ata structure Charikar et al., 00), which space-efficiently counts elements in a ata stream. We use the count sketch structure to maintain a coarse estimate of moel parameters, while also exactly storing a small active set of at most k coorinates. These combine estimates allow us to implement ual averaging when the l 1 -regularization is sufficiently large; the necessary amount of l 1 -regularization is inversely proportional to the amount of memory neee by the count sketch structure, leaing to a traeoff between memory an statistical efficiency. Notation We perpetrate several abuses. Superscripts inex iterations an subscripts imensions, so X i) enotes the coorinate of the ith X vector. 
We let D kl P Q) enote the KL-ivergence between P an Q, an use upper case to enote unboun variables an lower case to enote fixe conitioning variables, e.g., D kl P X z) QX z)) enotes the KL ivergence of the istributions of X uner P an Q conitional on Z = z, an IX; Z) is the mutual information between X an Z. We use stanar big-oh notation, where Õ ) an Ω ) enote bouns holing up to polylogarithmic factors. We let [n] = {1,..., n}. 3

4 STEINHARDT DUCHI. Lower boun when k = 1 We begin our analysis in a restricte an simpler setting than the full one we consier in the sequel, proviing a lower boun for estimation in the regression moel 1) when w is 1-sparse. Our lower boun hols even in the presence of certain types of sie information, which is important for our later analysis. We begin with a formal escription of the setting. Let r > 0 an let W be uniformly ranom in the set { r, 0, r}, where we constrain W 0 = 1, i.e. only a single coorinate of W is non-zero. At iteration i, the proceure observes the triple X i), ξ i), Y i) ) { 1, 1} Ξ R. Here X i) is an i.i.. sequence of vectors uniform on { 1, 1}, ξ i) is sie information inepenent of X 1:n) an Z 1:i 1) which is necessary to consier for our irect sum argument later), an we observe Y i) = W X i) + ξ i) + ɛ i), 3) where ɛ i) are i.i.. mean-zero Laplace istribute with Varɛ) = σ. Aitionally, we let P 0 be a null istribution in which we set Y i) = ξ i) + rs i) + ɛ i), where s i) { 1, 1} is an i.i.. sequence of Raemacher variables; note that P 0 is constructe to have the same marginal over Y i) as the true istribution. We present our lower boun in terms of average communication, efine as B 0 ef = 1 n I P0 X i) ; Y i), Z i) Z 1:i 1) ), 4) where I P0 enotes mutual information uner P 0. With this setup in place, we have: Theorem 1 Let δ = r/σ. Uner the setting above, for any estimator Ŵ Z1),..., Z n) ) of the ranom vector W, we have 1 [ ] E Ŵ W r e δ e δ 1) B 0 n. To interpret Theorem 1, we stuy B 0 uner the assumption that the proceure simply observes pairs X i), Y i) ) from the regression moel 1), that is, ξ i) 0, an we may store/communicate only B bits. By the chain rule for information an the fact that conitioning reuces entropy, we then have I P0 X i) ; Y i), Z i) Z 1:i 1) ) = I P0 X i) ; Y i) Z 1:i 1) ) + I P0 X i) ; Z i) Y i), Z 1:i 1) ) i) = I P0 X i) ; Z i) Y i), Z 1:i 1) ) H P0 Z i) ) B, where i) uses the fact that X i) an Y i) are inepenent uner P 0. Now consier the signal-tonoise ratio δ = r/σ, which we may vary by scaling the raius r. Noting that e δ e δ 1) δ for δ 1/3, Theorem 1 implies the lower boun [ ] E Ŵ W σ δ sup 0 δ 1/ ) δ Bn { σ 1 64 min 3 Bn, 1 9 where we have chosen δ = min{1/3, /3Bn)}. Thus, to achieve mean-square error smaller than σ, we require n /B observations. This contrasts with the unboune memory case, where n scales only logarithmically in the imension if we use the Lasso algorithm Wainwright, 009). }, 4

5 MEMORY-BOUNDED SPARSE REGRESSION.1. Proof of Theorem 1 We prove the theorem using an extension of Assoua s metho, which transforms the minimax lower bouning problem into one of multiple hypothesis tests Assoua, 1983; Arias-Castro et al., 013; Shamir, 014). We exten a few results of Arias-Castro et al. an Shamir to apply in our slightly more complex setting. We introuce a bit of aitional notation before continuing. Let J {±1, ±,..., ±} be a ranom inex an sign) corresponing to the vector W, so that J = means that W = r an J = means that W = r. Let P +, P enote the oint istribution of all variables conitione on J = +, respectively, an let P 0 be the null istribution as efine earlier. Throughout, we use lower-case p to enote the ensity of P with respect to a fixe base measure. Overview. We prove Theorem 1 in three steps. First, we show that the minimax estimation error Ŵ W can be lower-boune in terms of the recovery error P[Ĵ J] Lemma 1). Next, we show that recovering J is ifficult unless Z 1:n) contains substantial information about J Lemma ). Finally, we reach the crux of our argument, which is to establish a strong ata processing inequality Lemma 4 an equation 9)), which shows that, for the Markov chain W X, Y ) Z, the mutual information IW ; Z) egraes by a factor of /B from the information IX, Y ; Z); this shows that classical minimax bouns increase by a factor of /B in our setting. Estimation to testing to information. Lemma 1 For any estimator Ŵ, there is an estimator Ĵ such that We begin by bouning square error by testing error: [ ] E Ŵ W r ] [Ĵ P J. 5) We next state a lower boun on the probability of error in a hypothesis test. ef Lemma Let J Uniform{±1,..., ±}) an K ± = D kl P0 Z 1:n) ) P ± Z 1:n) ) ). Then for any estimator ĴZ1:n) ) PĴZ1:n) ) J) 1 1 ) 1 K + K + ). 6) 4 The proofs of Lemmas 1 an are in Sec. A.1 an A., respectively. We next provie upper bouns on K ± from Lemma, focusing on the + term as the term is symmetric. By the chain rule, ) D kl P 0 Z 1:n) ) P + Z 1:n) ) = =1 ) D kl P 0 Z i) z 1:i 1) ) P + Z i) z 1:i 1) ) P 0 z 1:i 1) ). For notational simplicity, introuce the shorthan Z = Z i) an ẑ = z 1:i 1) ; we will focus on bouning D kl P 0 Z ẑ) P + Z ẑ)) for a fixe i an ẑ = z 1:i 1). A strong ata-processing inequality. We now relate the ivergence between P 0 Z) an P + Z) to that for the istributions of X, Y ) i.e., the ivergence if there were no memory or communication constraints). As in Theorem 1, let δ = r/σ be the signal to noise ratio. Note that px, ξ, y ẑ) = px, ξ ẑ)py x, ξ, ẑ) = px, ξ)py x, ξ, ẑ) 5

6 STEINHARDT DUCHI for p = p 0, p ±, as the pair x, ξ) is inepenent of the inex J an the past store) ata ẑ. Thus log p + x, ξ, y ẑ) log p 0 x, ξ, y ẑ) = log p + y x, ξ, ẑ) log p 0 y x, ξ, ẑ) δ, 7) as the istribution of p + y x, ξ, ẑ, s) is a mean shift of at most r relative to p 0 y x, ξ, s), an both istributions are Laplaceσ/ ) about their mean recall that s is the Raemacher variable in the efinition of p 0 ; marginalizing it out as in 7) can only bring the ensities closer). Leveraging 7), we obtain the following lemma, which bouns the KL-ivergence in terms of the χ -ivergence. Lemma 3 For any inex an past ẑ, p+ z ẑ) p 0 z ẑ) D kl P 0 Z ẑ) P + Z ẑ)) µz) e δ p+ z ẑ) p 0 z ẑ) µz). p + z ẑ) p 0 z ẑ) Proof The first inequality is the stanar boun of KL-ivergence in terms of χ -ivergence cf. Tsybakov, 009, Lemma.7). The secon follows from inequality 7), as p + z ẑ) = p + z ξ, x, y, ẑ)p + x, ξ, y ẑ) = p 0 z ξ, x, y, ẑ)p + x, ξ, y ẑ) e δ p 0 z ξ, x, y, ẑ)p 0 x, ξ, y ẑ) = e δ p 0 z ẑ), which gives the esire result. Picking up from Lemma 3, we analyze p + z ẑ) p 0 z ẑ) in Lemma 4, which is the key technical lemma in this section all results so far, while non-trivial, are stanar results in the literature). Lemma 4 Information contraction) For any ẑ an z, we have p + z ẑ) p 0 z ẑ) e δ 1)p 0 z ẑ) D kl P 0 X y, z, ẑ) P 0 X y, ẑ))p 0 y z, ẑ). Lemma 4 is prove in Sec. A.3. The intuition behin the proof is that, since p + y ẑ) = p 0 y ẑ), the only way for z to istinguish between 0 an + is by storing information about x information about x is useless since it oesn t affect the istribution over y). The amount of information about x is measure by the KL ivergence term in the boun; the e δ 1 term appears because even when x is known, the istributions over y iffer by a factor of at most e δ. Proving Theorem 1. Combining Lemmas 3 an 4, we boun D kl P 0 Z ẑ) P + Z ẑ)) in terms of an average KL-ivergence as follows; we have D kl P 0 Z ẑ) P + Z ẑ)) i) e δ p+ z ẑ) p 0 z ẑ) µz) p 0 z ẑ) ii) e δ e δ 1) D kl P 0 X y, z, ẑ) P 0 X y, ẑ))p 0 y z, ẑ)) P 0 z ẑ) iii) e δ e δ 1) = e δ e δ 1) I P0 X ; Z Y, Ẑ = ẑ), D kl P 0 X y, z, ẑ) P 0 X y, ẑ)) P 0 y, z ẑ) 6

7 MEMORY-BOUNDED SPARSE REGRESSION where the final equality is the efinition of conitional mutual information. Step i) follows from Lemma 3, step ii) by the strong information contraction of Lemma 4, an step iii) as a consequence of Jensen. By noting that IA; B C) + IA; C) = IA; B, C) for any ranom variables A, B, C, the final information quantity is boune by IX ; Z, Y Ẑ = ẑ), whence we obtain D kl P 0 Z ẑ) P + Z ẑ)) e δ e δ 1) I P0 X ; Z, Y Ẑ = ẑ). 8) By construction, the X are inepenent even given Ẑ), yieling the oint information boun 1 D kl P 0 Z ẑ) P + Z ẑ)) P 0 ẑ) eδ e δ 1) =1 = eδ e δ 1) I P0 X ; Z, Y Ẑ = ẑ)p 0ẑ) =1 I p0 X ; Z, Y Ẑ) =1 eδ e δ 1) I p0 X; Z, Y Ẑ). 9) Returning to the full ivergence in inequality 6), we sum over inices i = 1,..., n to obtain 1 4 ) ) D kl P 0 Z 1:n) ) P Z 1:n) ) + D kl P 0 Z 1:n) ) P + Z 1:n) ) =1 eδ e δ 1) Hence, by Lemma, P[Ĵ J] 1 which proves Theorem 1. [ ] E Ŵ W r 3. Lower boun for general k I P0 X i) ; Z i), Y i) Z 1:i 1) ) = eδ e δ 1) B0 n. e δ e δ 1) B 0 n 1. Applying Lemma 1, we have e δ e δ 1) B 0 n, Theorem 1 provies a lower boun on the memory-constraine minimax risk when k = 1. We can exten to general k using a so-calle irect-sum approach e.g. Braverman, 01; Garg et al., 014). To o so, we efine a istribution that is a bit ifferent from the stanar moel 1). Let W { r/ k, 0, r/ k} be a -imensional vector, whose coorinates we split into k contiguous blocks, each of size at least k. Within a block, with probability 1 all coorinates are zero, an otherwise we choose a single coorinate uniformly at ranom within the block) to have value ±r/ k. We enote this istribution by P. As before, we let each X i) { 1, 1} be a ranom sign vector. We now efine the noise process. Let ɛ i) l choose ɛ i) l i.i.. an uniformly from {±r/ k} if W l ɛ i) 0 Laplaceσ/ ). At iteration i, then, we observe = 0 for all i if Wl 0,, where = 0, an set ɛ i) = ɛ i) 0 + k l=1 ɛi) l Y i) = W X i) + ɛ i). 10) 7

8 STEINHARDT DUCHI The key iea that allows us to exten our techniques from the previous section to obtain a lower boun for general k N as oppose to k = 1) is that we can ecompose Y as [ ] [ Y i) = Wl X i) l + ɛ i) l + W l X i) l + ] ɛ i) l + ɛ i) 0. l l We can then reuce to the case when k = 1 by letting W = Wl, the lth block, ξi) = W l X i) l +, an ɛ i) = ɛ i) 0. Of course, our proceure is not allowe to know W l or ɛ l, but any l l ɛi) l lower boun in which a proceure observes these is only stronger than one in which the proceure oes not. By showing that estimation of the lth block is still challenging in this moel, we obtain our irect sum result, that is, that estimating each of the k blocks is ifficult in a memory-restricte setting. Thus, Theorem 1 gives us: Proposition 1 Let l k be the size of the lth block, let B l ef = 1 n I P X i) l ; Z i), Y i) Z 1:i 1), Wl, W l ), 11) an let ν = r/σ. Then for any communication-constraine estimator Ŵl of W [ E Ŵl Wl ] r 4k 1 B ) l n e ν/ k e ν/ k 1). 1) l Proof The key is to relate P to the istributions consiere in Section, for which we alreay have results. While this may seem extraneous, using P is crucial for allowing our bouns to tensorize in Theorem. First focusing on the setting k = 1 as in Theorem 1, let P 0 be the null istribution efine in the beginning of Section, an let P be the oint istribution of W, Z, X, Y when W is rawn uniformly from the 1-sparse vectors in { r, 0, r}. Letting P = 1 P 0 + P ), we have for any i [n] I P Z 1:i 1), W ) = 1 I P Z 1:i 1), W = 0) + 1 I P Z 1:i 1), W, W 0) Thus, in the setting of Theorem 1, if we efine = 1 I P 0 Z 1:i 1) ) + 1 I P Z 1:i 1), W ) 1 I P 0 Z 1:i 1) ). B ef = 1 n I P X i) ; Y i), Z i) Z 1:i 1), W ), we have B 0 B. Moreover, we have P 1 P, an so we obtain that E P [ Ŵ W ] 1 E [ Ŵ P W ], where the secon expectation is the risk boune in Theorem 1. Couple with the efinition of B, this implies that for k = 1, we have ] E P [ Ŵ W 1 r 1 e δ e δ 1) B ) 0 n r 1 4 e δ e δ 1) Bn ). 13) l, 8

9 MEMORY-BOUNDED SPARSE REGRESSION Now we show how to give a similar result, focusing on the k > 1 case, for a single block l uner the istribution P. Inee, we have that P W l ) = 1 P Wl, W l 0, W l ) + 1 P W l, W l = 0). Let B l = 1 n n I P Xi) l ; Z i), Y i) Z 1:i 1), Wl, W l = w l ) be the average mutual information conitione on the realization W l = w l. Then the lower boun 13), couple with the iscussion preceing this proposition, implies [ ] E P Ŵl Wl W l = w l r 1 4k e δ e δ 1) B l n ), 14) l where we have use δ = ν/ k. We must remove the conitioning in 14). To that en, note that ) 1 B l P w l ) B l P w l ) = B 1 l by Jensen. Integrating 14) completes the proof. Extening this proposition, we arrive at our final lower boun, which hols for any k. Theorem Let ν = r/σ an assume that ν k/3. Assume that Z i) consists of B i 1 bits, an efine B = 1 n n B i. For any communication-constraine estimator Ŵ of W, ) [ ] E Ŵ W r 1 4 9Bnν + 4nν log1 + ν ) k. /k To prove Theorem, we sum 1) over l from 1 to k; the main work is to show that k B l=1 l from Proposition 1 is at most slightly larger than the bit constraint B. The intuition is that Y i), being a single scalar, only as a small amount of information on top of Z i). The full proof is in Sec. A.4. We en with a few remarks on the implications of Theorem for asymptotic rates of convergence. Fixing the variance parameter σ an number of observations n, we choose the size parameter r to optimize ν. { To satisfy the assumptions } of the theorem, we must have r σ k/7. Choosing r = σ 8 min k /k 36Bn, k 9, e 9 4 B 1, we are guarantee that 4 log1 + ν ) 9B, an also that 9Bnν +4nν log1+ν ) k /k = 1 4, whence Theorem implies the lower boun [ ] E Ŵ W σ 18 min { k /k 36Bn, k } ) 9, e 9 k 4 B 1 σ min Bn, 1. That is, we require at least an average of B = Ω/ log)) bits of communication or memory) per roun of our proceure to achieve the optimal unconstraine) estimation rate of Θ σ k log)/n ). 4. An algorithm an upper boun We now provie an algorithm for the setting when the memory buget satisfies B Ω1) max{k log, k log n}. It is no loss of generality to assume that the buget is at least this high, as otherwise we cannot even represent the optimal vector w R to high accuracy. Before giving the proceure, we enumerate the amittely somewhat restrictive) assumptions uner which it operates. As before, all proofs are in the supplement. Assumption A The vectors X i) an noise variables ɛ i) are inepenent an are rawn i.i.. Aitionally, they satisfy E[X] = 0, E[ɛ] = 0, an CovX) = I, an X satisfies X ρ with probability 1. Moreover, ɛ is σ-sub-exponential, meaning that E[ ɛ k ] k!)σ k for all k. 9

10 STEINHARDT DUCHI In aition, we assume without further mention as in Theorems 1 an that w is k-sparse, meaning w 0 k, an that we know a boun r such that w r. In the construction we gave in the lower boun, we assume ρ = 1 an r = νσ = Omin{1, k/bn}σ). The strongest of our assumptions is that CovX) = I ; letting U = supp w, we can weaken this to E[X U X U ] = 0 an CovX U) γi for some γ > 0, but we omit this for simplicity in exposition. Further weakenings of this assumption in online settings appear possible but challenging Agarwal et al., 01; Steinhart et al., 014). We now escribe a memory-boune proceure for performing an analogue of regularize ual averaging RDA; Xiao 010)). In RDA, one receives a sequence f i of loss functions, maintaining a vector θ i) of graient sums, an at iteration i, performs the following two-step upate: w i) = arg min w W { θ i), w + ψw) } an θ i+1) = θ i) + g i), where g i) f i w i) ). 15) The set W is a close convex constraint set, an the function ψ is a strongly-convex regularizing function; Xiao 010) establishes convergence of this proceure for several functions ψ. We apply a variant of RDA 15) to the sequence of losses f i w) = 1 y i) w x i)). Our proceure separates the coorinates into two sets; most coorinates are in the first set, store in a compresse representation amitting approximate recovery. The accuracy of this representation is too low for accurate optimization but is high enough to etermine which coorinates are in suppw ). We then track these few) important coorinates more accurately. For compression we use a count sketch CS) ata structure Charikar et al., 00), which has two parameters: an accuracy ɛ, an a failure probability δ. The CS stores an approximation x to a vector x by maintaining a low-imensional proection Ax of x an supports two operations: Upatev), which replaces x with x + v. Query), which returns an approximation ˆx to x. Let Cɛ, δ) enote the CS ata structure with parameters ɛ an δ. It satisfies the following: Proposition Gilbert an Inyk 010), Theorem ) Let 0 < ɛ, δ < 1 an k 1 ɛ. The ata structure Cɛ, δ) can perform Upatev) an Query) in O log n logn/δ) δ ) time an stores O ɛ ) real numbers. If ˆx is the output of Query), then ˆx x ɛ x x top k hols uniformly over the course of all n upates with probability 1 δ, where x top k enotes x with only its k largest entries in absolute value) kept non-zero. Base on this proposition, we implement a version of l 1 -regularize regularize ual averaging, which we present in Algorithm 1. We show subsequently that this proceure with high probability) correctly implements ual averaging, so we can leverage known convergence guarantees to give an upper boun on the minimax rate of convergence for memory-boune proceures. Letting w i) be the parameter vector in iteration i, the algorithm tracks three quantities: first, a i) count-sketch approximation θ coarse to θ i) ef = i <i f i wi ) ), which requires O 1 ɛ log n δ ) memory i) we specify ɛ presently); secon, a set Ũ of active coorinates; an thir, an approximation θ fine to θ i) supporte on Ũ. We also track a running average ŵ of w1:i), which we use at the en as a parameter estimate; this requires at most Ũ numbers. In the algorithm, we use the soft-thresholing an proection operators, given by T c x) ef = [signx ) [ x c] + ] =1 an P r x) ef = { x rx/ x if x r otherwise. 10

11 MEMORY-BOUNDED SPARSE REGRESSION Algorithm 1 Low-memory l 1 -RDA for regression Algorithm parameters: c,, R, ɛ, δ. Initialize Cɛ, δ), ŵ 0, Ũ for i = 1 to n o w i) P r ηt c n θ i) fine )) an ŵ 1 i wi) + i 1 i ŵ Preict ỹ i) = w i) ) x i) an compute graient g i) = y i) ỹ i) )x i) Call Upateg i) i+1) ) an set θ coarse Query) for Ũ o θ i+1) i) fine, θ fine, + gi) en for for Ũ o if θ i+1) coarse, c ) n then i+1) A to Ũ an set θ fine, elsẽ θ i+1) fine, 0 en if en for en for return ŵ θ i+1) coarse, i) We remark that in Algorithm 1, we nee track only ŵ, θ fine, an the count sketch ata structure, so the memory usage in real numbers) is boune by ŵ 0 + θ i) fine 0, plus the size of the count sketch structure. We will see later that the size of this structure is roughly inversely proportional to the egree of l 1 -regularization. Our main result concerns the convergence of Algorithm 1. Theorem 3 Let Assumption A an the moel 1) hol. With appropriate setting of the constants c,, r, ɛ specifie in the proof), for buget B [k, ], Algorithm 1 uses at most ÕB) bits an achieves risk E [ { }) k ŵ w ] = Õ max rρσ Bn, r ρ k. Bn To compare Theorem 3 with our lower bouns, assume that k Bn 1 an set ρ = 1 an r = νσ; this matches the setting of our lower bouns in Theorems 1 an. Then rρσ = νσ k = bn σ, an we have Ω σ Bn) k infŵ E [ ŵ w ] Õ σ Bn) k. In particular, at least for some non-trivial regimes, our upper an lower bouns match to polylogarithmic factors. This oes not imply that our algorithm is optimal: if the raius r is fixe as n grows, we expect the optimal error to ecay as 1 1 n rather than the n rate in Theorem 3. One reason to believe the optimal rate is 1 n is that it is attainable when k = 1: simply split the coorinates into batches of size ÕB) an process the batches one at a time; since suppw ) has size 1, it is always fully containe in one of the batches. Analysis of Algorithm 1 Define s to be the iteration where is ae to Ũ or if this never happens). Also let i ba be the first iteration where θ coarse i) θ i) > n, an efine { θs ) a = coarse, θs ) : s < i ba, 0 : s i ba. 11

12 STEINHARDT DUCHI The vector a tracks the offset between θ fine an θ, while being clippe to ensure that a n. Our key result is that Algorithm 1 implements an instantiation of RDA: Lemma 5 Let θ i), w i) ) be the sequence of iterates prouce by RDA 15) with regularizer ψw) = 1 η w + c n w 1 + a w, where suppw) is constraine to lie in U an have l -norm at most r. Suppose that for some G, G 1 n max max n U θi), ɛ k)g, an c + G. Also assume ɛ 1 k. Then, with probability 1 δ, w = w. Let Reg ef = n f iw i) ) f i w ) enote the regret of the RDA proceure in Lemma 5, an let E ef = n f iw ) = n ɛi) ) be the empirical loss of w. A mostly stanar analysis yiels: Lemma 6 Assume ɛ 1 k. Also suppose that n max max U θi) ρ Reg + E) log/δ). 16) Then, letting V ef = log/δ) ɛ k)) for short-han, the regret Reg satisfies Reg Rρ k ) + V E + kr ρ 4 + 4V ) 17) for appropriately chosen c an, which moreover satisfy conitions of Lemma 5. Lemma 6 is useful because 1 n E [Reg] can be shown to upper-boun E [ ŵ w ]. To wrap up, we nee to eal with a few etails. First, we nee to show that 16) hols with high probability: Lemma 7 With probability 1 δ, θ i) ρ Reg + E) log/δ) for all i n an U. Secon, we nee a high probability boun on E; we prove the following Lemma in Section A.10. Lemma 8 Let p 1. There are constants K p satisfying K p 6p for p 5 an K p /p as p such that for any t 0, [ ] P ɛ i) ) σ n + 4K p σ 3 ) p n + n 1/p p t. t Substituting p = max{5, log 1 δ }, we have, for a constant C 7e an with probability 1 δ, E = ɛ i)) nσ + Cσ log 1 δ [ n + n 1/5 log 1 ]. 18) δ Combining Lemmas 7 an 8 with Lemma 6, we then have the following with probability 1 3δ: Reg Rρσ k ) + V n + C log1/δ)[ n + n 1/5 log 1/δ)] + kr ρ 4 + 4V ). 1

13 MEMORY-BOUNDED SPARSE REGRESSION To interpret this, remember that V = log/δ) ɛ k)) an that the count-sketch structure stores O logn/δ)/ɛ) bits; also, we nee ɛ 1 k for Proposition to hol. As long as B k, we can take ɛ = O 1/B) using ÕB) bits) an have V = Õ /B). The entire first term ) above is thus Õ kn Rρσ B, while the secon is Õ R ρ k ) B. This essentially yiels Theorem 3; the full proof is in Section A.5. From real numbers to bits. In the above, we analyze a proceure that stores ÕB) real numbers. In fact, each number only requires Õ1) bits of precision for the algorithm to run correctly. A etaile argument for this is given in Section B. Acknowlegments. The anonymous reviewers were instrumental in improving this paper. JS was supporte by an NSF Grauate Research Fellowship an a Fannie & John Hertz Fellowship. References A. Agarwal, S. Negahban, an M. Wainwright. Stochastic optimization an sparse statistical recovery: An optimal algorithm for high imensions. In Avances in Neural Information Processing Systems 5, 01. N. Alon, Y. Matias, an M. Szegey. The space complexity of approximating the frequency moments. Journal of Computer an System Sciences, 581): , E. Arias-Castro, E. J. Canes, an M. A. Davenport. On the funamental limits of aaptive sensing. Information Theory, IEEE Transactions on, 591):47 481, 013. P. Assoua. Deux remarques sur l estimation. Comptes renus es séances e l Acaémie es sciences. Série 1, Mathématique, 963): , Q. Berthet an P. Rigollet. Complexity theoretic lower bouns for sparse principal component etection. In Proceeings of the Twenty Sixth Annual Conference on Computational Learning Theory, 013. M. Braverman. Interactive information complexity. In Proceeings of the Fourty-Fourth Annual ACM Symposium on the Theory of Computing, 01. L. Breiman. Probability. Society for Inustrial an Applie Mathematics, 199. M. Charikar, K. Chen, an M. Farach-Colton. Fining frequent items in ata streams. In Automata, Languages an Programming, pages Springer, 00. A. Daniely, N. Linial, an S. Shalev-Shwartz. More ata spees up training time in learning halfspaces over sparse vectors. In Proceeings of the Twenty Sixth Annual Conference on Computational Learning Theory, 013. A. Daniely, N. Linial, an S. Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceeings of the Fourty-Sixth Annual ACM Symposium on the Theory of Computing,

14 STEINHARDT DUCHI V. H. e la Peña an E. Giné. Decoupling: From Depenence to Inepenence. Springer, J. C. Duchi, M. I. Joran, an M. J. Wainwright. Local privacy an statistical minimax rates. In Founations of Computer Science FOCS), 013 IEEE 54th Annual Symposium on, pages IEEE, 013. J. C. Duchi, M. I. Joran, M. J. Wainwright, an Y. Zhang. Information-theoretic lower bouns for istribute statistical estimation with communication constraints. arxiv: [cs.it], 014. A. Garg, T. Ma, an H. Nguyen. On communication cost of istribute statistical estimation an imensionality. In Avances in Neural Information Processing Systems 7, pages , 014. A. Gilbert an P. Inyk. Sparse recovery using sparse matrices. Proceeings of the IEEE, 986): , 010. S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhonikova, an A. Smith. What can we learn privately? SIAM Journal on Computing, 403):793 86, 011. B. K. Nataraan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 4 ):7 34, S. Shalev-Shwartz. Online learning an online convex optimization. Founations an Trens in Machine Learning, 4): , 011. O. Shamir. Funamental limits of online an istribute algorithms for statistical learning an estimation. In Neural Information Processing Systems 8, 014. J. Steinhart, S. Wager, an P. Liang. The statistics of streaming sparse regression. arxiv preprint arxiv: , 014. A. B. Tsybakov. Introuction to Nonparametric Estimation. Springer, 009. L. G. Valiant. A theory of the learnable. Communications of the ACM, 711): , M. J. Wainwright. Sharp threshols for high-imensional an noisy sparsity recovery using l 1 - constraine quaratic programming Lasso). IEEE Transactions on Information Theory, 555): 183 0, 009. L. Xiao. Dual averaging methos for regularize stochastic learning an online optimization. Journal of Machine Learning Research, 11: , 010. Y. Zhang, J. Duchi, M. Joran, an M. J. Wainwright. Information-theoretic lower bouns for istribute statistical estimation with communication constraints. In Avances in Neural Information Processing Systems, pages , 013. Y. Zhang, M. J. Wainwright, an M. I. Joran. Lower bouns on the performance of polynomialtime algorithms for sparse linear regression. In Conference on Learning Theory COLT),

15 MEMORY-BOUNDED SPARSE REGRESSION Appenix A. Deferre proofs A.1. Proof of Lemma 1 Given an estimator Ŵ, efine an estimator Ĵ by letting Ĵ be the inex of the largest coorinate of Ŵ an letting signĵ) be the sign of that coorinate ties can be broken arbitrarily). We claim that Ŵ W r I[Ĵ J]. If Ĵ = J, then either signj) = signĵ), in which case I[Ĵ J] = 0, or else signj) signĵ), in which case Ŵ W W J = r. In either case, the result hols. Therefore, we turn out attention to the case that that Ĵ J. Then Ŵ W Ŵ Ĵ W Ĵ ) + Ŵ J W J ) Ŵ Ĵ + Ŵ J r) i) Ŵ J + Ŵ J r) r, where i) is because Ĵ inexes the largest coorinate of Ŵ by construction. So, the claime inequality hols, an the esire result follows by taking expectations. A.. Proof of Lemma We note that P[Ĵ J] = 1 P[Ĵ = J] = 1 1 P + Ĵ = +) + P Ĵ = ) =1 i) = 1 1 ) 1 ii) iii) iv) 1 1 ) ) ) 1 1 ) =1 ) ) P + Ĵ = +) P 0Ĵ = +) + P Ĵ = ) P 0Ĵ = ) P + Ĵ = +) P 0Ĵ = +) + P Ĵ = ) P 0Ĵ = ) =1 P + P 0 T V + P P 0 T V = P + P 0 T V + P P 0 T V =1 D kl P 0 P +J ) + D kl P 0 P ), =1 as was to be shown. Here i) uses the fact that =1 P 0Ĵ = +) + P 0Ĵ = ) = 1, ii) uses the variational form of TV istance, iii) is Cauchy-Schwarz, an iv) is Pinsker s inequality. 15

16 STEINHARDT DUCHI A.3. Proof of Lemma 4 We first make three observations, each of which relies on the specific structure of our problem. First, we have p + z x, y, ẑ) = p 0 z x, y, ẑ) 19a) since in both cases X are i.i.. ranom sign vectors, Z i) is X i), ξ i), Y i), Z 1:i 1) )-measurable, an ξ i) is inepenent of X. Seconly, we have p + y ẑ) = p 0 y ẑ) 19b) by construction of p 0. Finally, we have the inequality p + x, y ẑ) p 0 x, y ẑ) = p + x, y ẑ) p 0 x, y ẑ) 1 p 0x, y ẑ) e δ 1)p 0 x, y ẑ), 19c) as the ratio between the quantities as note by inequality 7) is boune by e δ. Therefore, expaning the istance between p + z) an p 0 z), we have p + z ẑ) p 0 z ẑ) = p + z x, y, ẑ)p + x, y ẑ) p 0 z x, y, ẑ)p 0 x, y ẑ)) i) = p 0 z x, y, ẑ)p + x, y ẑ) P 0 x, y ẑ)) ii) = p 0 z x, y, ẑ) p 0 z y, ẑ))p + x, y ẑ) P 0 x, y ẑ)), where step i) follows from the inepenence equality 19a) an step ii) because p 0 z y, ẑ) is constant with respect to x an p + y ẑ) = p 0 y ẑ) by 19b). Next, by inequality 19c) we have that P + x, y ẑ) P 0 x, y ẑ)) e δ 1)P 0 x, y ẑ), 1) whence we have the further upper boun p + z ẑ) p 0 z ẑ) i) e δ 1) p 0 z x, y, ẑ) p 0 z y, ẑ) P 0 x, y ẑ) ii) = e δ P 0 x, y z, ẑ)p 0 z ẑ) 1) P 0y z, ẑ)p 0 z ẑ) P 0 x, y ẑ) P 0 y ẑ) P 0x, y ẑ) = e δ 1)p 0 z ẑ) P 0 x, y z, ẑ) P 0 x, y ẑ) P 0y z, ẑ) P 0 y ẑ) = e δ 1)p 0 z ẑ) P 0 x y, z, ẑ) P 0 x y, ẑ) P 0 y z, ẑ) iii) e δ 1)p 0 z ẑ) D kl P 0 X y, z, ẑ) P 0 X y, ẑ))p 0 y z, ẑ), where i) is by 0) an 1), ii) is by Bayes rule, an iii) is by Pinsker s inequality. 0) 16

17 MEMORY-BOUNDED SPARSE REGRESSION A.4. Proof of Theorem First, we observe that e ν/ k e ν/ k 1) ν /k for ν k/3, so we may replace the lower boun in Proposition 1 with [ ] E Ŵl Wl r 1 4k 4 B l nν ). ) k l Now, by inequality ) an Jensen s inequality, we have E P [ ] Ŵ W = k l=1 E P [ ] Ŵl Wl r 4k r 4 r 4 l 1 4nν ) B l k /k k=1 1 4nν k ) B k 3 l /k l=1 1 4nν k k /k l=1 B l. We woul thus like to boun k B l=1 l. Next note that by the inepenence of the X i) l an the chain rule for mutual information, we have that k I P X i) l ; Z i), Y i) Z 1:i 1), W ) I P X i) ; Z i), Y i) Z 1:i 1), W ) l=1 = I P X i) ; Y i) Z 1:i 1), W ) + I P X i) ; Z i) Y i), Z 1:i 1), W ). 3) We can upper boun the final term in expression 3) by HZ i) ) B i. In aition, the first term on the right han sie of 3) satisfies I P X i) ; Y i) Z 1:i 1), W ) = hy i) Z 1:i 1), W ) hy i) X i), Z 1:i 1), W ) 1 logπe VarY i) )) hlaplaceσ/ )), where h enotes ifferential entropy an we have use that the normal istribution maximizes entropy for a given variance. Using that hlaplaceσ/ )) = 1 + log σ), inequality 3) thus implies I P X i) ; Z i), Y i) Z 1:i 1), W ) B i + 1 logπe Var[Y i) ]) 1 log σ) = B i + 1 logπeσ + R )) 1 loge σ ) = B i + 1 log π σ + R ) eσ 9 8 B i + 1 log 1 + ν ). In the last step we use 1 logπ/e) B i an R /σ 8R /σ = ν. Using the efinition 11) of B l an the preceing boun, we obtain that k l=1 B l 9/8)B + 1 log 1 + ν ). Using k l=1 B l 9/8)B + 1 log1 + ν ) gives the result. 17

18 STEINHARDT DUCHI A.5. Proof of Theorem 3 To prove Theorem 3, we first state the following more precise theorem, whose proof is given in Sec. A.6: Theorem A Let ɛ 1 k. With probability 1 3δ, we have Reg Rρσ k ) + V n + C log1/δ)[ n + n 1/5 log 1/δ)] + kr ρ 4 + 4V ). 4) In particular, let: δ = k R ρ 1σ n ŵ ef = 1 n w 1) + + w n)). Then we have the minimax boun: E [ ŵ w ] Rρσ k ) + V n + C log1/δ)[ n + n n 1/5 log 1/δ)] + kr ρ 6 + 4V )). Base on this theorem, we have E [ ŵ w ] ÕRρσ1 + V ) k/n + R ρ 1 + V )k/n). Also, by Proposition, the count sketch structure stores Õ1/ɛ) numbers; in aition, only Ok) numbers are neee to store ŵ an θ fine. Thus as long as b Ωk) so that ɛ 1 k ), we have ef ɛ Õ1/b). Next recall that V = log/δ) ɛ k)) = Õ1 + /b). We assume that b, so we therefore have E [ ŵ w ] Õ Rρσ ) k/bn + R ρ k/bn )) k = Õ max Rρσ bn, R ρ k, bn as was to be shown. A.6. Proof of Theorem A The count-sketch guarantee Proposition ), as well as Lemmas 7 an 8, each hol with probability 1 δ, so with probability 1 3δ all 3 results will hol, in which case Lemma 6 establishes the state regret boun by plugging in the boun on E from Lemma 8 to 17). To prove the boun on E [ ŵ w ], first note that [ ] E f i w i) ) = 1 ɛ i) + w i) w ) x i)) = 1 E [ ɛ i)) ] ) + w i) w, 18

19 MEMORY-BOUNDED SPARSE REGRESSION where the secon equality follows because Cov[X i) ] = I. Also note that, by convexity, ŵ w 1 n n wi) w. Therefore, any boun on E [ n f i w i) ) ] implies a boun on E[ ŵ w ]; in particular, [ ] E[ ŵ w ] n E f i w i) ) ɛ i)). The main ifficulty is that we have a boun that hols with high probability, an we nee to make it hol in expectation. To accomplish this, we simply nee to show that i f i is not too large even when the high probability boun fails. To o this, first note that we have the boun f i w i) ) = 1 ɛ i) + w i) w ) x i)) i) 1 + δ 0 ɛ i)) 1 + ii) 1 + δ δ 0 ) w i) w ) x i)) ɛ i)) δ 0 ) R ρ, where i) is by Young s inequality an ii) is by Cauchy-Schwarz an the fact that w i) w R. Thus, we have the boun We also straightforwarly have the boun ) f i w i) ) n 1 + 1δ0 R ρ δ 0 ɛ i)). f i w i) ) Reg + 1 ɛ i)). Combining these yiels ) ) f i w i) ) min Reg, n 1 + 1δ0 R ρ δ 0 ɛ i)). 5) [ ef ɛ Let σ 0 = E i) ) ] an let Reg 0 enote the right-han-sie of 4). Then, taking expectations of 5) an using the fact that Reg Reg 0 with probability 1 3δ, we have 1 σ 0 + E [ ŵ w 1 3δ) ]) Reg n 0 + 6δ 1 + 1δ0 ) R ρ δ 0 σ 0, 19

20 STEINHARDT DUCHI which implies that E [ ) ŵ w ] 1 3δ) Reg n 0 + 1δ 1 + 1δ0 R ρ + δ 0 σ0 n Reg δ R ρ + δ 0 σ0 δ 0 = Rρσ k ) + V n + C log1/δ)[ n + n n 1/5 log 1/δ)] + kr ρ 4 + 4V ) ) + 4 δ δ 0 R ρ + δ 0 σ 0. Note also that σ 0 σ. Setting δ 0 = kr ρ σ n an δ = kδ 0 boun. A.7. Proof of Lemma 5 1n = k R ρ 1σ n, we obtain the esire We procee by contraiction. Let i be the minimal inex where w i) w i), an let be a particular coorinate where the relation fails. Note that, by minimality of i, we have θ i) = θ i). We split into cases base on how i relates to s. w i) Case 1: i < s. By construction, we have w i) = 0 for i < s. Therefore, we must have 0. In particular, this implies that θ i) + a > c n. But a n by construction, so we must have θ i) > c ) n. Since θ i) = θ i) an θ coarse i) θ i) n, we have θ i) coarse, > c ) n; thus i s, which is a contraiction. Case : i s. Note that w i) = S R ηt c n θ i) fine, )) = S R ηt c n θ i) + θ s ) coarse, θs ) ))) = S R ηt c n θ i) + θ s ) coarse, θs ) ))), while w i) = S R ηt c n θ i) + a )). Therefore, w i) w i) implies that a θ s ) coarse, θs ), which means that s i ba. Hence, i i ba as well. This implies that there is an iteration where θ i ba) coarse θ iba) > n, such that the RDA algorithm an Algorithm 1 make the same preictions for all i < i ba. We will focus the rest of the proof on showing this is impossible. Since RDA an Algorithm 1 make the same preictions for i < i ba, we in particular have θ i ba) = θ i ba). Now, using the count sketch guarantee, we have θ i ba) coarse θ i ba) ɛ θ i ba) θ i ba) top k = ɛ θ i ba) θ i ba ) top k. But since U has size k, we have the boun θ i ba) θ i ba ) top k U θ i ba) G n k). 0

21 MEMORY-BOUNDED SPARSE REGRESSION Putting these together, we have θ i ba) coarse θ i ba) ɛ k)g n n, which contraicts the assumption that θ i ba) coarse θ iba) > n. Since both cases lea to a contraiction, Algorithm 1 an the RDA proceure must match. Moreover, the conition c + G ensures that no coorinate outsie of U is ae to the active set Ũ, which completes the proof. A.8. Proof of Lemma 6 By the general regret boun for mirror escent Theorem.15 1 of Shalev-Shwartz 011)), Reg = f i w) f i w ) 1 η w + c n w 1 + a w + η z i) U 1 η R + c n + a ) w 1 + η x i) U f i w i) ) 1 η R + c + ) nkr + kηρ fw i) ) 1 η R + c + ) nkr + kηρ E + Reg). Reg+E) log/δ) Now note that, by the conition of the proposition, we can take G = ρ n ; hence ɛ k)reg+e) log/δ) Reg+E) log/δ) setting = ρ n, c = ρ n 1 + ) ɛ k) will satisfy the conitions of Lemma 5. Recalling the efinition of V above, we thus have the boun Re-arranging, we have: Reg 1 η R + krρv Reg + E + kηρ Reg + E). 1 kηρ ) Reg R η + krρv Reg + E + kηρ E. Now set η = min R ρ, 1 ); note that the secon term in the min allows us to ivie through ke kρ by 1 kηρ while only losing a factor of. We then have Reg Rρ ) k E + V Reg + E + kr ρ, where the first term is the boun when η = R/ρ ke an the secon term upper-bouns R η case that η = 1/kρ. Solving the quaratic, we have which completes the proof. Reg Rρ k + V ) E + kr ρ 4 + 4V ), in the 1. Note that Shalev-Shwartz 011) refers to the proceure as online mirror escent rather than regularize ual averaging, but it is actually the same proceure.. More precisely, we use the fact that if x a + bx + c for a, b, c 0, then x a + b + c. 1

22 STEINHARDT DUCHI A.9. Proof of Lemma 7 Note that θ i+1) θ i) = z i) = y i) w i) ) x i) )x i) = w w i) ) x i) + ɛ i) )x i). Since suppw w i) ) U an U, this is a zero-mean ranom variable since Cov[X i) ] = I), an so θ i) is a martingale ifference sequence. Moreover, letting F i be the sigma-algebra generate by the first i 1 samples, we have that [ ) ) ] E exp λ θ i+1) θ i) ρ λ f i w i) ) F i [ = E exp λx i) y i) w i) ) x i)) ) ] ρ λ f i w i) ) F i [ = E exp λx i) y i) w i) ) x i))) ] F i [ i) 1 E exp λ ρ y i) w i) ) x i)) ) ] ρ λ f i w i) ) F i = 1. Here i) is obtaine by marginalizing over x i), using the fact that xi) is inepenent of both y i) an w i) ) x i) since suppw ) U an suppw i) ) U), together with the fact that x i) ρ an hence is sub-gaussian. Consequently, for any λ, Y i = exp λθ i+1) ρ λ ) i k=1 f kw k) ) is a supermartingale. For any t, efine the stopping criterion Y i t. Then, by Doob s optional stopping theorem Corollary 5.11 of Breiman 199)) applie to this stoppe process, together with Markov s inequality, we have P [max n Y i t] 1 t. Bouning i k=1 f k by n k=1 f k an inverting, for any fixe an λ, with probability 1 δ, we have n max θi) λρ f i w i) ) + 1 λ log1/δ). Union-bouning over = 1,..., an ±θ changes the log ) term to log/δ), yieling the high-probability-boun n max θi) λρ f i w i) ) + 1 λ log/δ). Using the equality n f iw i) ) = E +Reg an choosing the optimal value of λ yiels the esire result. Remark 1 Note that λ is set aaptively base on E + Reg, which epens on the ranomness in θ 1:n) an thus coul be problematic. However, later in our argument we en up with an upper boun on E + Reg that epens only on the problem parameters; if we set λ base on this upper boun, our proofs still go through, an we eliminate the epenence of λ on θ.

23 MEMORY-BOUNDED SPARSE REGRESSION A.10. Proof of Lemma 8 We prove the inequality using stanar symmetrization inequalities. First, we use the Høffmann- Jorgensen inequality e.g. e la Peña an Giné, 1999, Theorem 1..3)), which states that given symmetric ranom variables U i an S n = n U i an some t 0 is such that P [ S n t 0 ] 1/8, then for any p > 0, [ p] 1/p E U i [ ] ) 1/p ) K p t 0 + E max U i p, where K p 1+ 4 p p + 1 exp logp + 1). i n p 6) Notably, for any p 5 we have K p 6p, for any p 1 we have K p 18p, an K p /p as p. Now, recall that ɛ i) is sub-exponential, meaning that there is some σ such that E [ ɛ i) p] 1/p σp for all p 1. Letting r i {±1} be i.i.. Raemacher variables an V i = ɛ i) ) be shorthan, an immeiate symmetrization inequality e.g. e la Peña an Giné, 1999, Lemma 1..6) gives that [ ] [ ] [ p] P V i σ n + t P V i E [V i ] + t t p E V i E [V i ]) [ p] p t p E r i V i. Now we apply the Høffmann-Jorgensen inequality 6) to U i = r i V i, which are symmetric an satisfy [ ] P r i V i t 0 1 t E [ ] r i V i = 1 0 t E Vi 4nσ4 0 t, 0 so that taking t 0 = 6σ n an applying inequality 6) yiels [ p] E r i V i Kp p 6σ [ ] ) 1/p p n + E max V p i n i. Moreover, we have that [ ] E max V p i n i E [V p i ] ne [ɛ i)p] nσ p p) p. In particular, we have [ ] ) p P V i σ n + 4K p σ t 4K p σ K p t pσ p 3 n + n 1/p p ) p 3 ) p n + n 1/p p =. t 7) Taking p = log 1 δ 3 an t = e ) n + n log 1 1 δ log 1 δ gives the esire result. 3

24 STEINHARDT DUCHI Appenix B. From real numbers to bits In Section 4, we analyze a proceure that store ÕB) real numbers. We now argue that each number only nees to be store to polynomial precision, an so requires only Õ1) bits. Algorithm 1 stores the quantities Ũ, ŵ, θ fine, an the count sketch structure θ coarse ). The set Ũ requires Ok log ) bits. To hanle θ coarse an θ fine, we ranomly an unbiasely) roun each g i) to the nearest multiple of M for some large integer M; for instance, a value of M will be roune to 4 M with probability 0.9 an to 5 M with probability 0.1. Since this yiels an unbiase estimate of gi), the RDA proceure will still work, with the overall regret boun increasing only slightly essentially, by kn in expectation). M If M ef = max i [n], [] g i), then the number of bits neee to represent each coorinate of θ fine as well as each number in the count sketch structure) is OlognMM )). But g i) = x i) f i w i) ) ρ E + Reg), which is polynomial in n, so M is polynomial in n. In aition, for M growing polynomially in n, the increase of kn in the regret boun becomes negligibly M small. Hence, we can take nmm to be polynomial in n, thus requiring Õ1) bits per coorinate to represent θ coarse an θ fine. Finally, to hanle ŵ, we eterministically roun to the nearest multiple of 1 M. Since ŵ oes not affect any choices in the algorithm, the accumulate error per coorinate n after n steps is at most M, which is again negligible for large M. Since each coorinate of ŵ is at most R, we can store them each with Õ1) bits, as well. 4

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Least-Squares Regression on Sparse Spaces

Least-Squares Regression on Sparse Spaces Least-Squares Regression on Sparse Spaces Yuri Grinberg, Mahi Milani Far, Joelle Pineau School of Computer Science McGill University Montreal, Canaa {ygrinb,mmilan1,jpineau}@cs.mcgill.ca 1 Introuction

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Self-normalized Martingale Tail Inequality

Self-normalized Martingale Tail Inequality Online-to-Confience-Set Conversions an Application to Sparse Stochastic Banits A Self-normalize Martingale Tail Inequality The self-normalize martingale tail inequality that we present here is the scalar-value

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
