Minimax rates for memory-bounded sparse linear regression

Size: px
Start display at page:

Download "Minimax rates for memory-bounded sparse linear regression"

Transcription

1 JMLR: Workshop an Conference Proceeings vol 40:1 4, 015 Minimax rates for memory-boune sparse linear regression Jacob Steinhart Stanfor University, Department of Computer Science John Duchi Stanfor University, Department of Statistics Abstract We establish a minimax lower boun of Ω k Bɛ) on the sample size neee to estimate parameters in a k-sparse linear regression of imension uner memory restrictions to B bits, where ɛ is the l parameter error. When the covariance of the regressors is the ientity matrix, we also provie an algorithm that uses boun hols in a more general communication-boune setting, where instea of a memory boun, at most B bits of information are allowe to be aaptively) communicate about each sample. 1. Introuction ÕB + k) bits an requires Õ k Bɛ ) observations to achieve error ɛ. Our lower The growth in size an scope of atasets unerscores a nee for techniques that balance between multiple practical esierata in statistical inference an learning, as classical tools are insufficient for proviing such traeoffs. In this work, we buil on a nascent theory of resource-constraine learning one in which researchers have investigate privacy Kasiviswanathan et al., 011; Duchi et al., 013), communication an memory Zhang et al., 013; Garg et al., 014; Shamir, 014), an computational constraints Berthet an Rigollet, 013) an stuy bouns on the performance of regression estimators uner memory an communication constraints. In particular, we provie minimax upper an lower bouns that exhibit a traeoff between accuracy an available resources for a broa class of regression problems, which have to this point been ifficult to characterize. To situate our work, we briefly review existing results on statistical learning with resource constraints. There is a consierable literature on computation an learning, beginning with Valiant s work 1984) on probably approximately correct PAC) learning, which separates concept learning in poynomial versus non-polynomial settings. More recent work shows uner natural complexitytheoretic assumptions) that restriction to polynomial-time proceures increases sample complexity for several problems, incluing sparse principal components analysis Berthet an Rigollet, 013) classification problems Daniely et al., 013, 014), an, pertinent to this work, sparse linear regression Zhang et al., 014; Nataraan, 1995). Showing fine-graine traeoffs in this setting has been challenging, however. Researchers have given more explicit guarantees in other resourceconstraine learning problems; Duchi et al. 013) stuy estimation uner a privacy constraint that the statistician shoul never be able to infer more than a small amount about any iniviual in a sample. For several tasks, incluing mean estimation an regression, they establish matching upper an lower bouns exhibiting a traeoff between privacy an statistical accuracy. In the communicationlimite setting, Zhang et al. 013) stuy problems where ata are store on multiple machines, an the goal is to minimize communication between machines an a centralize fusion center. In the full version of that paper Duchi et al., 014), they show that mean-square error for -imensional c 015 J. Steinhart & J. Duchi.

2 STEINHARDT DUCHI mean estimation has lower-boun Ω Bmn logm) ) for m machines, each with n observations, an B bits of communication per machine. In concurrent work, Garg et al. 014) provie ientical results. Perhaps most relate to our work, Shamir 014) consiers communication constraints in the online learning setting, where at most B bits of information about an example can be transmitte to the algorithm before it moves on to the next example. In this case, any memory-boune algorithm is communication-boune, an so communication lower bouns imply memory lower bouns Alon et al., 1999). For -imensional 1-sparse i.e. only a single non-zero component) mean estimation problems, Shamir 014) shows that Ω/B) observations are necessary for parameter recovery, while without memory constraints, a sample of size Olog ) is sufficient. He also shows that for certain principal component analyses, a sample of size Ω 4 /B ) is necessary for estimation, while without constraints) O log 3 )) observations suffice Outline of results In this work, we also focus on the online setting. We work in a regression moel, in which we wish to estimate an unknown -imensional parameter vector w R, an we observe an i.i.. sequence X i), Y i) ) R R such that Y i) = w, X i) + ɛ i), 1) where ɛ i) is mean-zero noise inepenent of X i) with Varɛ i) ) σ. We focus on the case that w is k-sparse, meaning that it has at most k non-zero entries, that is, w 0 k. We consier resource-constraine proceures that may store information about the ith observation in a B i -bit vector Z i) {0, 1} B i. After n observations, the proceure outputs an estimate ŵ of w. We have two types of resource constraints: Communication constraints Z i) is a measurable function of X i), Y i), an the history Z 1:i 1), an the estimate ŵ is a measurable function of Z 1:n) = Z 1),..., Z n) ). Memory constraints Z i) is a measurable function of X i), Y i), an Z i 1), an ŵ is a measurable function of only Z n). Note that the communication constraints o not allow for arbitrary communication, but rather only one-pass communication protocols in which all information is communicate from left to right. Any memory-constraine proceure is also communication-constraine, so we prove our lower bouns in the weaker less restrictive) communication-constraine setting, later proviing an algorithm for the memory-constraine setting which then carries through to the communicationconstraine setting). Our two main results Theorems an 3) are that, for a variety of regression problems with observation noise σ, the mean-square error satisfies subect to certain assumptions on the covariates X an noise ɛ) { } k Ω1) σ min Bn, 1 inf ŵ sup E [ ŵ w ] Õ1) σ max w W { } k Bn, k, ) Bn where the infimum is over all communication/memory-constraine proceures using B bits per iteration, an the supremum is over a set W of k-sparse vectors to be efine later. The lower boun ) shows that any B-bit communication-constraine proceure achieving square error ɛ

3 MEMORY-BOUNDED SPARSE REGRESSION k requires Ω σ ɛ B ) observations. In the absence of resource constraints, stanar results on sparse regression e.g. Wainwright, 009) show that Θ 1 ɛ k log ) observations suffice; moreover, algorithms which require only Õ) memory can essentially match this boun Agarwal et al., 01; Steinhart et al., 014). Thus, for any communication buget B α with α < 1, we lose exponentially in the imension. The upper boun also has sharp epenence on B an, though worse σ4 epenence on accuracy an variance: it shows that Õ ) observations suffice. 1.. Summary of techniques k ɛ B Before continuing, we summarize the techniques use to prove the bouns ). The high-level structure of the lower boun follows that of Arias-Castro et al. 013), who show lower bouns for sparse regression with an aaptive esign matrix. They aapt Assoua s metho 1983) to reuce the analysis to bouning the mutual information between a single observation an the parameter vector w. In our case, even this is challenging; rather than choosing a esign matrix, the proceure chooses an arbitrary B-bit message. We control such proceures using two ieas. First, we focus on the fully sparse k = 1) case, showing a quantitative ata processing inequality, base off of techniques evelope by Duchi et al. 013), Zhang et al. 013), an Shamir 014). By analyzing certain carefully constructe likelihoo ratio bouns, we show, roughly, that if the pair X, Y ) R R provies information I about the parameter w, then any B-bit message Z base on X, Y ) provies information at most BI / about w ; see Section for these results. After this ata processing boun for k = 1, in Section 3 we evelop a irect sum theorem, base on the ieas of Garg et al. 014) an a common tool in the communication complexity literature, which shows that fining a k-sparse vector w is as har as fining k inepenent 1-sparse vectors. While our lower boun buils off of several prior results, the regression setting introuces a subtle an challenging technical issue: both the irect sum technique an prior quantitative ata processing inequalities in communication settings rely strongly on the assumption that each observation consists of inepenent ranom variables. As Y necessarily epens on X in the moel 1), we have strong coupling of our observations. To aress this, we evelop a technique that allows us to treat Y as part of the communication ; instea of Z communicating information about X an Y, now Y an Z communicate information about X. The effective number of bits of communication increases by an analogue of) the entropy of Y. For the upper boun ), in Section 4 we show how to implement l 1 -regularize ual averaging Xiao, 010) using a count sketch ata structure Charikar et al., 00), which space-efficiently counts elements in a ata stream. We use the count sketch structure to maintain a coarse estimate of moel parameters, while also exactly storing a small active set of at most k coorinates. These combine estimates allow us to implement ual averaging when the l 1 -regularization is sufficiently large; the necessary amount of l 1 -regularization is inversely proportional to the amount of memory neee by the count sketch structure, leaing to a traeoff between memory an statistical efficiency. Notation We perpetrate several abuses. Superscripts inex iterations an subscripts imensions, so X i) enotes the coorinate of the ith X vector. 
We let D kl P Q) enote the KL-ivergence between P an Q, an use upper case to enote unboun variables an lower case to enote fixe conitioning variables, e.g., D kl P X z) QX z)) enotes the KL ivergence of the istributions of X uner P an Q conitional on Z = z, an IX; Z) is the mutual information between X an Z. We use stanar big-oh notation, where Õ ) an Ω ) enote bouns holing up to polylogarithmic factors. We let [n] = {1,..., n}. 3

4 STEINHARDT DUCHI. Lower boun when k = 1 We begin our analysis in a restricte an simpler setting than the full one we consier in the sequel, proviing a lower boun for estimation in the regression moel 1) when w is 1-sparse. Our lower boun hols even in the presence of certain types of sie information, which is important for our later analysis. We begin with a formal escription of the setting. Let r > 0 an let W be uniformly ranom in the set { r, 0, r}, where we constrain W 0 = 1, i.e. only a single coorinate of W is non-zero. At iteration i, the proceure observes the triple X i), ξ i), Y i) ) { 1, 1} Ξ R. Here X i) is an i.i.. sequence of vectors uniform on { 1, 1}, ξ i) is sie information inepenent of X 1:n) an Z 1:i 1) which is necessary to consier for our irect sum argument later), an we observe Y i) = W X i) + ξ i) + ɛ i), 3) where ɛ i) are i.i.. mean-zero Laplace istribute with Varɛ) = σ. Aitionally, we let P 0 be a null istribution in which we set Y i) = ξ i) + rs i) + ɛ i), where s i) { 1, 1} is an i.i.. sequence of Raemacher variables; note that P 0 is constructe to have the same marginal over Y i) as the true istribution. We present our lower boun in terms of average communication, efine as B 0 ef = 1 n I P0 X i) ; Y i), Z i) Z 1:i 1) ), 4) where I P0 enotes mutual information uner P 0. With this setup in place, we have: Theorem 1 Let δ = r/σ. Uner the setting above, for any estimator Ŵ Z1),..., Z n) ) of the ranom vector W, we have 1 [ ] E Ŵ W r e δ e δ 1) B 0 n. To interpret Theorem 1, we stuy B 0 uner the assumption that the proceure simply observes pairs X i), Y i) ) from the regression moel 1), that is, ξ i) 0, an we may store/communicate only B bits. By the chain rule for information an the fact that conitioning reuces entropy, we then have I P0 X i) ; Y i), Z i) Z 1:i 1) ) = I P0 X i) ; Y i) Z 1:i 1) ) + I P0 X i) ; Z i) Y i), Z 1:i 1) ) i) = I P0 X i) ; Z i) Y i), Z 1:i 1) ) H P0 Z i) ) B, where i) uses the fact that X i) an Y i) are inepenent uner P 0. Now consier the signal-tonoise ratio δ = r/σ, which we may vary by scaling the raius r. Noting that e δ e δ 1) δ for δ 1/3, Theorem 1 implies the lower boun [ ] E Ŵ W σ δ sup 0 δ 1/ ) δ Bn { σ 1 64 min 3 Bn, 1 9 where we have chosen δ = min{1/3, /3Bn)}. Thus, to achieve mean-square error smaller than σ, we require n /B observations. This contrasts with the unboune memory case, where n scales only logarithmically in the imension if we use the Lasso algorithm Wainwright, 009). }, 4

5 MEMORY-BOUNDED SPARSE REGRESSION.1. Proof of Theorem 1 We prove the theorem using an extension of Assoua s metho, which transforms the minimax lower bouning problem into one of multiple hypothesis tests Assoua, 1983; Arias-Castro et al., 013; Shamir, 014). We exten a few results of Arias-Castro et al. an Shamir to apply in our slightly more complex setting. We introuce a bit of aitional notation before continuing. Let J {±1, ±,..., ±} be a ranom inex an sign) corresponing to the vector W, so that J = means that W = r an J = means that W = r. Let P +, P enote the oint istribution of all variables conitione on J = +, respectively, an let P 0 be the null istribution as efine earlier. Throughout, we use lower-case p to enote the ensity of P with respect to a fixe base measure. Overview. We prove Theorem 1 in three steps. First, we show that the minimax estimation error Ŵ W can be lower-boune in terms of the recovery error P[Ĵ J] Lemma 1). Next, we show that recovering J is ifficult unless Z 1:n) contains substantial information about J Lemma ). Finally, we reach the crux of our argument, which is to establish a strong ata processing inequality Lemma 4 an equation 9)), which shows that, for the Markov chain W X, Y ) Z, the mutual information IW ; Z) egraes by a factor of /B from the information IX, Y ; Z); this shows that classical minimax bouns increase by a factor of /B in our setting. Estimation to testing to information. Lemma 1 For any estimator Ŵ, there is an estimator Ĵ such that We begin by bouning square error by testing error: [ ] E Ŵ W r ] [Ĵ P J. 5) We next state a lower boun on the probability of error in a hypothesis test. ef Lemma Let J Uniform{±1,..., ±}) an K ± = D kl P0 Z 1:n) ) P ± Z 1:n) ) ). Then for any estimator ĴZ1:n) ) PĴZ1:n) ) J) 1 1 ) 1 K + K + ). 6) 4 The proofs of Lemmas 1 an are in Sec. A.1 an A., respectively. We next provie upper bouns on K ± from Lemma, focusing on the + term as the term is symmetric. By the chain rule, ) D kl P 0 Z 1:n) ) P + Z 1:n) ) = =1 ) D kl P 0 Z i) z 1:i 1) ) P + Z i) z 1:i 1) ) P 0 z 1:i 1) ). For notational simplicity, introuce the shorthan Z = Z i) an ẑ = z 1:i 1) ; we will focus on bouning D kl P 0 Z ẑ) P + Z ẑ)) for a fixe i an ẑ = z 1:i 1). A strong ata-processing inequality. We now relate the ivergence between P 0 Z) an P + Z) to that for the istributions of X, Y ) i.e., the ivergence if there were no memory or communication constraints). As in Theorem 1, let δ = r/σ be the signal to noise ratio. Note that px, ξ, y ẑ) = px, ξ ẑ)py x, ξ, ẑ) = px, ξ)py x, ξ, ẑ) 5

6 STEINHARDT DUCHI for p = p 0, p ±, as the pair x, ξ) is inepenent of the inex J an the past store) ata ẑ. Thus log p + x, ξ, y ẑ) log p 0 x, ξ, y ẑ) = log p + y x, ξ, ẑ) log p 0 y x, ξ, ẑ) δ, 7) as the istribution of p + y x, ξ, ẑ, s) is a mean shift of at most r relative to p 0 y x, ξ, s), an both istributions are Laplaceσ/ ) about their mean recall that s is the Raemacher variable in the efinition of p 0 ; marginalizing it out as in 7) can only bring the ensities closer). Leveraging 7), we obtain the following lemma, which bouns the KL-ivergence in terms of the χ -ivergence. Lemma 3 For any inex an past ẑ, p+ z ẑ) p 0 z ẑ) D kl P 0 Z ẑ) P + Z ẑ)) µz) e δ p+ z ẑ) p 0 z ẑ) µz). p + z ẑ) p 0 z ẑ) Proof The first inequality is the stanar boun of KL-ivergence in terms of χ -ivergence cf. Tsybakov, 009, Lemma.7). The secon follows from inequality 7), as p + z ẑ) = p + z ξ, x, y, ẑ)p + x, ξ, y ẑ) = p 0 z ξ, x, y, ẑ)p + x, ξ, y ẑ) e δ p 0 z ξ, x, y, ẑ)p 0 x, ξ, y ẑ) = e δ p 0 z ẑ), which gives the esire result. Picking up from Lemma 3, we analyze p + z ẑ) p 0 z ẑ) in Lemma 4, which is the key technical lemma in this section all results so far, while non-trivial, are stanar results in the literature). Lemma 4 Information contraction) For any ẑ an z, we have p + z ẑ) p 0 z ẑ) e δ 1)p 0 z ẑ) D kl P 0 X y, z, ẑ) P 0 X y, ẑ))p 0 y z, ẑ). Lemma 4 is prove in Sec. A.3. The intuition behin the proof is that, since p + y ẑ) = p 0 y ẑ), the only way for z to istinguish between 0 an + is by storing information about x information about x is useless since it oesn t affect the istribution over y). The amount of information about x is measure by the KL ivergence term in the boun; the e δ 1 term appears because even when x is known, the istributions over y iffer by a factor of at most e δ. Proving Theorem 1. Combining Lemmas 3 an 4, we boun D kl P 0 Z ẑ) P + Z ẑ)) in terms of an average KL-ivergence as follows; we have D kl P 0 Z ẑ) P + Z ẑ)) i) e δ p+ z ẑ) p 0 z ẑ) µz) p 0 z ẑ) ii) e δ e δ 1) D kl P 0 X y, z, ẑ) P 0 X y, ẑ))p 0 y z, ẑ)) P 0 z ẑ) iii) e δ e δ 1) = e δ e δ 1) I P0 X ; Z Y, Ẑ = ẑ), D kl P 0 X y, z, ẑ) P 0 X y, ẑ)) P 0 y, z ẑ) 6

7 MEMORY-BOUNDED SPARSE REGRESSION where the final equality is the efinition of conitional mutual information. Step i) follows from Lemma 3, step ii) by the strong information contraction of Lemma 4, an step iii) as a consequence of Jensen. By noting that IA; B C) + IA; C) = IA; B, C) for any ranom variables A, B, C, the final information quantity is boune by IX ; Z, Y Ẑ = ẑ), whence we obtain D kl P 0 Z ẑ) P + Z ẑ)) e δ e δ 1) I P0 X ; Z, Y Ẑ = ẑ). 8) By construction, the X are inepenent even given Ẑ), yieling the oint information boun 1 D kl P 0 Z ẑ) P + Z ẑ)) P 0 ẑ) eδ e δ 1) =1 = eδ e δ 1) I P0 X ; Z, Y Ẑ = ẑ)p 0ẑ) =1 I p0 X ; Z, Y Ẑ) =1 eδ e δ 1) I p0 X; Z, Y Ẑ). 9) Returning to the full ivergence in inequality 6), we sum over inices i = 1,..., n to obtain 1 4 ) ) D kl P 0 Z 1:n) ) P Z 1:n) ) + D kl P 0 Z 1:n) ) P + Z 1:n) ) =1 eδ e δ 1) Hence, by Lemma, P[Ĵ J] 1 which proves Theorem 1. [ ] E Ŵ W r 3. Lower boun for general k I P0 X i) ; Z i), Y i) Z 1:i 1) ) = eδ e δ 1) B0 n. e δ e δ 1) B 0 n 1. Applying Lemma 1, we have e δ e δ 1) B 0 n, Theorem 1 provies a lower boun on the memory-constraine minimax risk when k = 1. We can exten to general k using a so-calle irect-sum approach e.g. Braverman, 01; Garg et al., 014). To o so, we efine a istribution that is a bit ifferent from the stanar moel 1). Let W { r/ k, 0, r/ k} be a -imensional vector, whose coorinates we split into k contiguous blocks, each of size at least k. Within a block, with probability 1 all coorinates are zero, an otherwise we choose a single coorinate uniformly at ranom within the block) to have value ±r/ k. We enote this istribution by P. As before, we let each X i) { 1, 1} be a ranom sign vector. We now efine the noise process. Let ɛ i) l choose ɛ i) l i.i.. an uniformly from {±r/ k} if W l ɛ i) 0 Laplaceσ/ ). At iteration i, then, we observe = 0 for all i if Wl 0,, where = 0, an set ɛ i) = ɛ i) 0 + k l=1 ɛi) l Y i) = W X i) + ɛ i). 10) 7

8 STEINHARDT DUCHI The key iea that allows us to exten our techniques from the previous section to obtain a lower boun for general k N as oppose to k = 1) is that we can ecompose Y as [ ] [ Y i) = Wl X i) l + ɛ i) l + W l X i) l + ] ɛ i) l + ɛ i) 0. l l We can then reuce to the case when k = 1 by letting W = Wl, the lth block, ξi) = W l X i) l +, an ɛ i) = ɛ i) 0. Of course, our proceure is not allowe to know W l or ɛ l, but any l l ɛi) l lower boun in which a proceure observes these is only stronger than one in which the proceure oes not. By showing that estimation of the lth block is still challenging in this moel, we obtain our irect sum result, that is, that estimating each of the k blocks is ifficult in a memory-restricte setting. Thus, Theorem 1 gives us: Proposition 1 Let l k be the size of the lth block, let B l ef = 1 n I P X i) l ; Z i), Y i) Z 1:i 1), Wl, W l ), 11) an let ν = r/σ. Then for any communication-constraine estimator Ŵl of W [ E Ŵl Wl ] r 4k 1 B ) l n e ν/ k e ν/ k 1). 1) l Proof The key is to relate P to the istributions consiere in Section, for which we alreay have results. While this may seem extraneous, using P is crucial for allowing our bouns to tensorize in Theorem. First focusing on the setting k = 1 as in Theorem 1, let P 0 be the null istribution efine in the beginning of Section, an let P be the oint istribution of W, Z, X, Y when W is rawn uniformly from the 1-sparse vectors in { r, 0, r}. Letting P = 1 P 0 + P ), we have for any i [n] I P Z 1:i 1), W ) = 1 I P Z 1:i 1), W = 0) + 1 I P Z 1:i 1), W, W 0) Thus, in the setting of Theorem 1, if we efine = 1 I P 0 Z 1:i 1) ) + 1 I P Z 1:i 1), W ) 1 I P 0 Z 1:i 1) ). B ef = 1 n I P X i) ; Y i), Z i) Z 1:i 1), W ), we have B 0 B. Moreover, we have P 1 P, an so we obtain that E P [ Ŵ W ] 1 E [ Ŵ P W ], where the secon expectation is the risk boune in Theorem 1. Couple with the efinition of B, this implies that for k = 1, we have ] E P [ Ŵ W 1 r 1 e δ e δ 1) B ) 0 n r 1 4 e δ e δ 1) Bn ). 13) l, 8

9 MEMORY-BOUNDED SPARSE REGRESSION Now we show how to give a similar result, focusing on the k > 1 case, for a single block l uner the istribution P. Inee, we have that P W l ) = 1 P Wl, W l 0, W l ) + 1 P W l, W l = 0). Let B l = 1 n n I P Xi) l ; Z i), Y i) Z 1:i 1), Wl, W l = w l ) be the average mutual information conitione on the realization W l = w l. Then the lower boun 13), couple with the iscussion preceing this proposition, implies [ ] E P Ŵl Wl W l = w l r 1 4k e δ e δ 1) B l n ), 14) l where we have use δ = ν/ k. We must remove the conitioning in 14). To that en, note that ) 1 B l P w l ) B l P w l ) = B 1 l by Jensen. Integrating 14) completes the proof. Extening this proposition, we arrive at our final lower boun, which hols for any k. Theorem Let ν = r/σ an assume that ν k/3. Assume that Z i) consists of B i 1 bits, an efine B = 1 n n B i. For any communication-constraine estimator Ŵ of W, ) [ ] E Ŵ W r 1 4 9Bnν + 4nν log1 + ν ) k. /k To prove Theorem, we sum 1) over l from 1 to k; the main work is to show that k B l=1 l from Proposition 1 is at most slightly larger than the bit constraint B. The intuition is that Y i), being a single scalar, only as a small amount of information on top of Z i). The full proof is in Sec. A.4. We en with a few remarks on the implications of Theorem for asymptotic rates of convergence. Fixing the variance parameter σ an number of observations n, we choose the size parameter r to optimize ν. { To satisfy the assumptions } of the theorem, we must have r σ k/7. Choosing r = σ 8 min k /k 36Bn, k 9, e 9 4 B 1, we are guarantee that 4 log1 + ν ) 9B, an also that 9Bnν +4nν log1+ν ) k /k = 1 4, whence Theorem implies the lower boun [ ] E Ŵ W σ 18 min { k /k 36Bn, k } ) 9, e 9 k 4 B 1 σ min Bn, 1. That is, we require at least an average of B = Ω/ log)) bits of communication or memory) per roun of our proceure to achieve the optimal unconstraine) estimation rate of Θ σ k log)/n ). 4. An algorithm an upper boun We now provie an algorithm for the setting when the memory buget satisfies B Ω1) max{k log, k log n}. It is no loss of generality to assume that the buget is at least this high, as otherwise we cannot even represent the optimal vector w R to high accuracy. Before giving the proceure, we enumerate the amittely somewhat restrictive) assumptions uner which it operates. As before, all proofs are in the supplement. Assumption A The vectors X i) an noise variables ɛ i) are inepenent an are rawn i.i.. Aitionally, they satisfy E[X] = 0, E[ɛ] = 0, an CovX) = I, an X satisfies X ρ with probability 1. Moreover, ɛ is σ-sub-exponential, meaning that E[ ɛ k ] k!)σ k for all k. 9

10 STEINHARDT DUCHI In aition, we assume without further mention as in Theorems 1 an that w is k-sparse, meaning w 0 k, an that we know a boun r such that w r. In the construction we gave in the lower boun, we assume ρ = 1 an r = νσ = Omin{1, k/bn}σ). The strongest of our assumptions is that CovX) = I ; letting U = supp w, we can weaken this to E[X U X U ] = 0 an CovX U) γi for some γ > 0, but we omit this for simplicity in exposition. Further weakenings of this assumption in online settings appear possible but challenging Agarwal et al., 01; Steinhart et al., 014). We now escribe a memory-boune proceure for performing an analogue of regularize ual averaging RDA; Xiao 010)). In RDA, one receives a sequence f i of loss functions, maintaining a vector θ i) of graient sums, an at iteration i, performs the following two-step upate: w i) = arg min w W { θ i), w + ψw) } an θ i+1) = θ i) + g i), where g i) f i w i) ). 15) The set W is a close convex constraint set, an the function ψ is a strongly-convex regularizing function; Xiao 010) establishes convergence of this proceure for several functions ψ. We apply a variant of RDA 15) to the sequence of losses f i w) = 1 y i) w x i)). Our proceure separates the coorinates into two sets; most coorinates are in the first set, store in a compresse representation amitting approximate recovery. The accuracy of this representation is too low for accurate optimization but is high enough to etermine which coorinates are in suppw ). We then track these few) important coorinates more accurately. For compression we use a count sketch CS) ata structure Charikar et al., 00), which has two parameters: an accuracy ɛ, an a failure probability δ. The CS stores an approximation x to a vector x by maintaining a low-imensional proection Ax of x an supports two operations: Upatev), which replaces x with x + v. Query), which returns an approximation ˆx to x. Let Cɛ, δ) enote the CS ata structure with parameters ɛ an δ. It satisfies the following: Proposition Gilbert an Inyk 010), Theorem ) Let 0 < ɛ, δ < 1 an k 1 ɛ. The ata structure Cɛ, δ) can perform Upatev) an Query) in O log n logn/δ) δ ) time an stores O ɛ ) real numbers. If ˆx is the output of Query), then ˆx x ɛ x x top k hols uniformly over the course of all n upates with probability 1 δ, where x top k enotes x with only its k largest entries in absolute value) kept non-zero. Base on this proposition, we implement a version of l 1 -regularize regularize ual averaging, which we present in Algorithm 1. We show subsequently that this proceure with high probability) correctly implements ual averaging, so we can leverage known convergence guarantees to give an upper boun on the minimax rate of convergence for memory-boune proceures. Letting w i) be the parameter vector in iteration i, the algorithm tracks three quantities: first, a i) count-sketch approximation θ coarse to θ i) ef = i <i f i wi ) ), which requires O 1 ɛ log n δ ) memory i) we specify ɛ presently); secon, a set Ũ of active coorinates; an thir, an approximation θ fine to θ i) supporte on Ũ. We also track a running average ŵ of w1:i), which we use at the en as a parameter estimate; this requires at most Ũ numbers. In the algorithm, we use the soft-thresholing an proection operators, given by T c x) ef = [signx ) [ x c] + ] =1 an P r x) ef = { x rx/ x if x r otherwise. 10

11 MEMORY-BOUNDED SPARSE REGRESSION Algorithm 1 Low-memory l 1 -RDA for regression Algorithm parameters: c,, R, ɛ, δ. Initialize Cɛ, δ), ŵ 0, Ũ for i = 1 to n o w i) P r ηt c n θ i) fine )) an ŵ 1 i wi) + i 1 i ŵ Preict ỹ i) = w i) ) x i) an compute graient g i) = y i) ỹ i) )x i) Call Upateg i) i+1) ) an set θ coarse Query) for Ũ o θ i+1) i) fine, θ fine, + gi) en for for Ũ o if θ i+1) coarse, c ) n then i+1) A to Ũ an set θ fine, elsẽ θ i+1) fine, 0 en if en for en for return ŵ θ i+1) coarse, i) We remark that in Algorithm 1, we nee track only ŵ, θ fine, an the count sketch ata structure, so the memory usage in real numbers) is boune by ŵ 0 + θ i) fine 0, plus the size of the count sketch structure. We will see later that the size of this structure is roughly inversely proportional to the egree of l 1 -regularization. Our main result concerns the convergence of Algorithm 1. Theorem 3 Let Assumption A an the moel 1) hol. With appropriate setting of the constants c,, r, ɛ specifie in the proof), for buget B [k, ], Algorithm 1 uses at most ÕB) bits an achieves risk E [ { }) k ŵ w ] = Õ max rρσ Bn, r ρ k. Bn To compare Theorem 3 with our lower bouns, assume that k Bn 1 an set ρ = 1 an r = νσ; this matches the setting of our lower bouns in Theorems 1 an. Then rρσ = νσ k = bn σ, an we have Ω σ Bn) k infŵ E [ ŵ w ] Õ σ Bn) k. In particular, at least for some non-trivial regimes, our upper an lower bouns match to polylogarithmic factors. This oes not imply that our algorithm is optimal: if the raius r is fixe as n grows, we expect the optimal error to ecay as 1 1 n rather than the n rate in Theorem 3. One reason to believe the optimal rate is 1 n is that it is attainable when k = 1: simply split the coorinates into batches of size ÕB) an process the batches one at a time; since suppw ) has size 1, it is always fully containe in one of the batches. Analysis of Algorithm 1 Define s to be the iteration where is ae to Ũ or if this never happens). Also let i ba be the first iteration where θ coarse i) θ i) > n, an efine { θs ) a = coarse, θs ) : s < i ba, 0 : s i ba. 11

12 STEINHARDT DUCHI The vector a tracks the offset between θ fine an θ, while being clippe to ensure that a n. Our key result is that Algorithm 1 implements an instantiation of RDA: Lemma 5 Let θ i), w i) ) be the sequence of iterates prouce by RDA 15) with regularizer ψw) = 1 η w + c n w 1 + a w, where suppw) is constraine to lie in U an have l -norm at most r. Suppose that for some G, G 1 n max max n U θi), ɛ k)g, an c + G. Also assume ɛ 1 k. Then, with probability 1 δ, w = w. Let Reg ef = n f iw i) ) f i w ) enote the regret of the RDA proceure in Lemma 5, an let E ef = n f iw ) = n ɛi) ) be the empirical loss of w. A mostly stanar analysis yiels: Lemma 6 Assume ɛ 1 k. Also suppose that n max max U θi) ρ Reg + E) log/δ). 16) Then, letting V ef = log/δ) ɛ k)) for short-han, the regret Reg satisfies Reg Rρ k ) + V E + kr ρ 4 + 4V ) 17) for appropriately chosen c an, which moreover satisfy conitions of Lemma 5. Lemma 6 is useful because 1 n E [Reg] can be shown to upper-boun E [ ŵ w ]. To wrap up, we nee to eal with a few etails. First, we nee to show that 16) hols with high probability: Lemma 7 With probability 1 δ, θ i) ρ Reg + E) log/δ) for all i n an U. Secon, we nee a high probability boun on E; we prove the following Lemma in Section A.10. Lemma 8 Let p 1. There are constants K p satisfying K p 6p for p 5 an K p /p as p such that for any t 0, [ ] P ɛ i) ) σ n + 4K p σ 3 ) p n + n 1/p p t. t Substituting p = max{5, log 1 δ }, we have, for a constant C 7e an with probability 1 δ, E = ɛ i)) nσ + Cσ log 1 δ [ n + n 1/5 log 1 ]. 18) δ Combining Lemmas 7 an 8 with Lemma 6, we then have the following with probability 1 3δ: Reg Rρσ k ) + V n + C log1/δ)[ n + n 1/5 log 1/δ)] + kr ρ 4 + 4V ). 1

13 MEMORY-BOUNDED SPARSE REGRESSION To interpret this, remember that V = log/δ) ɛ k)) an that the count-sketch structure stores O logn/δ)/ɛ) bits; also, we nee ɛ 1 k for Proposition to hol. As long as B k, we can take ɛ = O 1/B) using ÕB) bits) an have V = Õ /B). The entire first term ) above is thus Õ kn Rρσ B, while the secon is Õ R ρ k ) B. This essentially yiels Theorem 3; the full proof is in Section A.5. From real numbers to bits. In the above, we analyze a proceure that stores ÕB) real numbers. In fact, each number only requires Õ1) bits of precision for the algorithm to run correctly. A etaile argument for this is given in Section B. Acknowlegments. The anonymous reviewers were instrumental in improving this paper. JS was supporte by an NSF Grauate Research Fellowship an a Fannie & John Hertz Fellowship. References A. Agarwal, S. Negahban, an M. Wainwright. Stochastic optimization an sparse statistical recovery: An optimal algorithm for high imensions. In Avances in Neural Information Processing Systems 5, 01. N. Alon, Y. Matias, an M. Szegey. The space complexity of approximating the frequency moments. Journal of Computer an System Sciences, 581): , E. Arias-Castro, E. J. Canes, an M. A. Davenport. On the funamental limits of aaptive sensing. Information Theory, IEEE Transactions on, 591):47 481, 013. P. Assoua. Deux remarques sur l estimation. Comptes renus es séances e l Acaémie es sciences. Série 1, Mathématique, 963): , Q. Berthet an P. Rigollet. Complexity theoretic lower bouns for sparse principal component etection. In Proceeings of the Twenty Sixth Annual Conference on Computational Learning Theory, 013. M. Braverman. Interactive information complexity. In Proceeings of the Fourty-Fourth Annual ACM Symposium on the Theory of Computing, 01. L. Breiman. Probability. Society for Inustrial an Applie Mathematics, 199. M. Charikar, K. Chen, an M. Farach-Colton. Fining frequent items in ata streams. In Automata, Languages an Programming, pages Springer, 00. A. Daniely, N. Linial, an S. Shalev-Shwartz. More ata spees up training time in learning halfspaces over sparse vectors. In Proceeings of the Twenty Sixth Annual Conference on Computational Learning Theory, 013. A. Daniely, N. Linial, an S. Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceeings of the Fourty-Sixth Annual ACM Symposium on the Theory of Computing,

14 STEINHARDT DUCHI V. H. e la Peña an E. Giné. Decoupling: From Depenence to Inepenence. Springer, J. C. Duchi, M. I. Joran, an M. J. Wainwright. Local privacy an statistical minimax rates. In Founations of Computer Science FOCS), 013 IEEE 54th Annual Symposium on, pages IEEE, 013. J. C. Duchi, M. I. Joran, M. J. Wainwright, an Y. Zhang. Information-theoretic lower bouns for istribute statistical estimation with communication constraints. arxiv: [cs.it], 014. A. Garg, T. Ma, an H. Nguyen. On communication cost of istribute statistical estimation an imensionality. In Avances in Neural Information Processing Systems 7, pages , 014. A. Gilbert an P. Inyk. Sparse recovery using sparse matrices. Proceeings of the IEEE, 986): , 010. S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhonikova, an A. Smith. What can we learn privately? SIAM Journal on Computing, 403):793 86, 011. B. K. Nataraan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 4 ):7 34, S. Shalev-Shwartz. Online learning an online convex optimization. Founations an Trens in Machine Learning, 4): , 011. O. Shamir. Funamental limits of online an istribute algorithms for statistical learning an estimation. In Neural Information Processing Systems 8, 014. J. Steinhart, S. Wager, an P. Liang. The statistics of streaming sparse regression. arxiv preprint arxiv: , 014. A. B. Tsybakov. Introuction to Nonparametric Estimation. Springer, 009. L. G. Valiant. A theory of the learnable. Communications of the ACM, 711): , M. J. Wainwright. Sharp threshols for high-imensional an noisy sparsity recovery using l 1 - constraine quaratic programming Lasso). IEEE Transactions on Information Theory, 555): 183 0, 009. L. Xiao. Dual averaging methos for regularize stochastic learning an online optimization. Journal of Machine Learning Research, 11: , 010. Y. Zhang, J. Duchi, M. Joran, an M. J. Wainwright. Information-theoretic lower bouns for istribute statistical estimation with communication constraints. In Avances in Neural Information Processing Systems, pages , 013. Y. Zhang, M. J. Wainwright, an M. I. Joran. Lower bouns on the performance of polynomialtime algorithms for sparse linear regression. In Conference on Learning Theory COLT),

15 MEMORY-BOUNDED SPARSE REGRESSION Appenix A. Deferre proofs A.1. Proof of Lemma 1 Given an estimator Ŵ, efine an estimator Ĵ by letting Ĵ be the inex of the largest coorinate of Ŵ an letting signĵ) be the sign of that coorinate ties can be broken arbitrarily). We claim that Ŵ W r I[Ĵ J]. If Ĵ = J, then either signj) = signĵ), in which case I[Ĵ J] = 0, or else signj) signĵ), in which case Ŵ W W J = r. In either case, the result hols. Therefore, we turn out attention to the case that that Ĵ J. Then Ŵ W Ŵ Ĵ W Ĵ ) + Ŵ J W J ) Ŵ Ĵ + Ŵ J r) i) Ŵ J + Ŵ J r) r, where i) is because Ĵ inexes the largest coorinate of Ŵ by construction. So, the claime inequality hols, an the esire result follows by taking expectations. A.. Proof of Lemma We note that P[Ĵ J] = 1 P[Ĵ = J] = 1 1 P + Ĵ = +) + P Ĵ = ) =1 i) = 1 1 ) 1 ii) iii) iv) 1 1 ) ) ) 1 1 ) =1 ) ) P + Ĵ = +) P 0Ĵ = +) + P Ĵ = ) P 0Ĵ = ) P + Ĵ = +) P 0Ĵ = +) + P Ĵ = ) P 0Ĵ = ) =1 P + P 0 T V + P P 0 T V = P + P 0 T V + P P 0 T V =1 D kl P 0 P +J ) + D kl P 0 P ), =1 as was to be shown. Here i) uses the fact that =1 P 0Ĵ = +) + P 0Ĵ = ) = 1, ii) uses the variational form of TV istance, iii) is Cauchy-Schwarz, an iv) is Pinsker s inequality. 15

16 STEINHARDT DUCHI A.3. Proof of Lemma 4 We first make three observations, each of which relies on the specific structure of our problem. First, we have p + z x, y, ẑ) = p 0 z x, y, ẑ) 19a) since in both cases X are i.i.. ranom sign vectors, Z i) is X i), ξ i), Y i), Z 1:i 1) )-measurable, an ξ i) is inepenent of X. Seconly, we have p + y ẑ) = p 0 y ẑ) 19b) by construction of p 0. Finally, we have the inequality p + x, y ẑ) p 0 x, y ẑ) = p + x, y ẑ) p 0 x, y ẑ) 1 p 0x, y ẑ) e δ 1)p 0 x, y ẑ), 19c) as the ratio between the quantities as note by inequality 7) is boune by e δ. Therefore, expaning the istance between p + z) an p 0 z), we have p + z ẑ) p 0 z ẑ) = p + z x, y, ẑ)p + x, y ẑ) p 0 z x, y, ẑ)p 0 x, y ẑ)) i) = p 0 z x, y, ẑ)p + x, y ẑ) P 0 x, y ẑ)) ii) = p 0 z x, y, ẑ) p 0 z y, ẑ))p + x, y ẑ) P 0 x, y ẑ)), where step i) follows from the inepenence equality 19a) an step ii) because p 0 z y, ẑ) is constant with respect to x an p + y ẑ) = p 0 y ẑ) by 19b). Next, by inequality 19c) we have that P + x, y ẑ) P 0 x, y ẑ)) e δ 1)P 0 x, y ẑ), 1) whence we have the further upper boun p + z ẑ) p 0 z ẑ) i) e δ 1) p 0 z x, y, ẑ) p 0 z y, ẑ) P 0 x, y ẑ) ii) = e δ P 0 x, y z, ẑ)p 0 z ẑ) 1) P 0y z, ẑ)p 0 z ẑ) P 0 x, y ẑ) P 0 y ẑ) P 0x, y ẑ) = e δ 1)p 0 z ẑ) P 0 x, y z, ẑ) P 0 x, y ẑ) P 0y z, ẑ) P 0 y ẑ) = e δ 1)p 0 z ẑ) P 0 x y, z, ẑ) P 0 x y, ẑ) P 0 y z, ẑ) iii) e δ 1)p 0 z ẑ) D kl P 0 X y, z, ẑ) P 0 X y, ẑ))p 0 y z, ẑ), where i) is by 0) an 1), ii) is by Bayes rule, an iii) is by Pinsker s inequality. 0) 16

17 MEMORY-BOUNDED SPARSE REGRESSION A.4. Proof of Theorem First, we observe that e ν/ k e ν/ k 1) ν /k for ν k/3, so we may replace the lower boun in Proposition 1 with [ ] E Ŵl Wl r 1 4k 4 B l nν ). ) k l Now, by inequality ) an Jensen s inequality, we have E P [ ] Ŵ W = k l=1 E P [ ] Ŵl Wl r 4k r 4 r 4 l 1 4nν ) B l k /k k=1 1 4nν k ) B k 3 l /k l=1 1 4nν k k /k l=1 B l. We woul thus like to boun k B l=1 l. Next note that by the inepenence of the X i) l an the chain rule for mutual information, we have that k I P X i) l ; Z i), Y i) Z 1:i 1), W ) I P X i) ; Z i), Y i) Z 1:i 1), W ) l=1 = I P X i) ; Y i) Z 1:i 1), W ) + I P X i) ; Z i) Y i), Z 1:i 1), W ). 3) We can upper boun the final term in expression 3) by HZ i) ) B i. In aition, the first term on the right han sie of 3) satisfies I P X i) ; Y i) Z 1:i 1), W ) = hy i) Z 1:i 1), W ) hy i) X i), Z 1:i 1), W ) 1 logπe VarY i) )) hlaplaceσ/ )), where h enotes ifferential entropy an we have use that the normal istribution maximizes entropy for a given variance. Using that hlaplaceσ/ )) = 1 + log σ), inequality 3) thus implies I P X i) ; Z i), Y i) Z 1:i 1), W ) B i + 1 logπe Var[Y i) ]) 1 log σ) = B i + 1 logπeσ + R )) 1 loge σ ) = B i + 1 log π σ + R ) eσ 9 8 B i + 1 log 1 + ν ). In the last step we use 1 logπ/e) B i an R /σ 8R /σ = ν. Using the efinition 11) of B l an the preceing boun, we obtain that k l=1 B l 9/8)B + 1 log 1 + ν ). Using k l=1 B l 9/8)B + 1 log1 + ν ) gives the result. 17

18 STEINHARDT DUCHI A.5. Proof of Theorem 3 To prove Theorem 3, we first state the following more precise theorem, whose proof is given in Sec. A.6: Theorem A Let ɛ 1 k. With probability 1 3δ, we have Reg Rρσ k ) + V n + C log1/δ)[ n + n 1/5 log 1/δ)] + kr ρ 4 + 4V ). 4) In particular, let: δ = k R ρ 1σ n ŵ ef = 1 n w 1) + + w n)). Then we have the minimax boun: E [ ŵ w ] Rρσ k ) + V n + C log1/δ)[ n + n n 1/5 log 1/δ)] + kr ρ 6 + 4V )). Base on this theorem, we have E [ ŵ w ] ÕRρσ1 + V ) k/n + R ρ 1 + V )k/n). Also, by Proposition, the count sketch structure stores Õ1/ɛ) numbers; in aition, only Ok) numbers are neee to store ŵ an θ fine. Thus as long as b Ωk) so that ɛ 1 k ), we have ef ɛ Õ1/b). Next recall that V = log/δ) ɛ k)) = Õ1 + /b). We assume that b, so we therefore have E [ ŵ w ] Õ Rρσ ) k/bn + R ρ k/bn )) k = Õ max Rρσ bn, R ρ k, bn as was to be shown. A.6. Proof of Theorem A The count-sketch guarantee Proposition ), as well as Lemmas 7 an 8, each hol with probability 1 δ, so with probability 1 3δ all 3 results will hol, in which case Lemma 6 establishes the state regret boun by plugging in the boun on E from Lemma 8 to 17). To prove the boun on E [ ŵ w ], first note that [ ] E f i w i) ) = 1 ɛ i) + w i) w ) x i)) = 1 E [ ɛ i)) ] ) + w i) w, 18

19 MEMORY-BOUNDED SPARSE REGRESSION where the secon equality follows because Cov[X i) ] = I. Also note that, by convexity, ŵ w 1 n n wi) w. Therefore, any boun on E [ n f i w i) ) ] implies a boun on E[ ŵ w ]; in particular, [ ] E[ ŵ w ] n E f i w i) ) ɛ i)). The main ifficulty is that we have a boun that hols with high probability, an we nee to make it hol in expectation. To accomplish this, we simply nee to show that i f i is not too large even when the high probability boun fails. To o this, first note that we have the boun f i w i) ) = 1 ɛ i) + w i) w ) x i)) i) 1 + δ 0 ɛ i)) 1 + ii) 1 + δ δ 0 ) w i) w ) x i)) ɛ i)) δ 0 ) R ρ, where i) is by Young s inequality an ii) is by Cauchy-Schwarz an the fact that w i) w R. Thus, we have the boun We also straightforwarly have the boun ) f i w i) ) n 1 + 1δ0 R ρ δ 0 ɛ i)). f i w i) ) Reg + 1 ɛ i)). Combining these yiels ) ) f i w i) ) min Reg, n 1 + 1δ0 R ρ δ 0 ɛ i)). 5) [ ef ɛ Let σ 0 = E i) ) ] an let Reg 0 enote the right-han-sie of 4). Then, taking expectations of 5) an using the fact that Reg Reg 0 with probability 1 3δ, we have 1 σ 0 + E [ ŵ w 1 3δ) ]) Reg n 0 + 6δ 1 + 1δ0 ) R ρ δ 0 σ 0, 19

20 STEINHARDT DUCHI which implies that E [ ) ŵ w ] 1 3δ) Reg n 0 + 1δ 1 + 1δ0 R ρ + δ 0 σ0 n Reg δ R ρ + δ 0 σ0 δ 0 = Rρσ k ) + V n + C log1/δ)[ n + n n 1/5 log 1/δ)] + kr ρ 4 + 4V ) ) + 4 δ δ 0 R ρ + δ 0 σ 0. Note also that σ 0 σ. Setting δ 0 = kr ρ σ n an δ = kδ 0 boun. A.7. Proof of Lemma 5 1n = k R ρ 1σ n, we obtain the esire We procee by contraiction. Let i be the minimal inex where w i) w i), an let be a particular coorinate where the relation fails. Note that, by minimality of i, we have θ i) = θ i). We split into cases base on how i relates to s. w i) Case 1: i < s. By construction, we have w i) = 0 for i < s. Therefore, we must have 0. In particular, this implies that θ i) + a > c n. But a n by construction, so we must have θ i) > c ) n. Since θ i) = θ i) an θ coarse i) θ i) n, we have θ i) coarse, > c ) n; thus i s, which is a contraiction. Case : i s. Note that w i) = S R ηt c n θ i) fine, )) = S R ηt c n θ i) + θ s ) coarse, θs ) ))) = S R ηt c n θ i) + θ s ) coarse, θs ) ))), while w i) = S R ηt c n θ i) + a )). Therefore, w i) w i) implies that a θ s ) coarse, θs ), which means that s i ba. Hence, i i ba as well. This implies that there is an iteration where θ i ba) coarse θ iba) > n, such that the RDA algorithm an Algorithm 1 make the same preictions for all i < i ba. We will focus the rest of the proof on showing this is impossible. Since RDA an Algorithm 1 make the same preictions for i < i ba, we in particular have θ i ba) = θ i ba). Now, using the count sketch guarantee, we have θ i ba) coarse θ i ba) ɛ θ i ba) θ i ba) top k = ɛ θ i ba) θ i ba ) top k. But since U has size k, we have the boun θ i ba) θ i ba ) top k U θ i ba) G n k). 0

21 MEMORY-BOUNDED SPARSE REGRESSION Putting these together, we have θ i ba) coarse θ i ba) ɛ k)g n n, which contraicts the assumption that θ i ba) coarse θ iba) > n. Since both cases lea to a contraiction, Algorithm 1 an the RDA proceure must match. Moreover, the conition c + G ensures that no coorinate outsie of U is ae to the active set Ũ, which completes the proof. A.8. Proof of Lemma 6 By the general regret boun for mirror escent Theorem.15 1 of Shalev-Shwartz 011)), Reg = f i w) f i w ) 1 η w + c n w 1 + a w + η z i) U 1 η R + c n + a ) w 1 + η x i) U f i w i) ) 1 η R + c + ) nkr + kηρ fw i) ) 1 η R + c + ) nkr + kηρ E + Reg). Reg+E) log/δ) Now note that, by the conition of the proposition, we can take G = ρ n ; hence ɛ k)reg+e) log/δ) Reg+E) log/δ) setting = ρ n, c = ρ n 1 + ) ɛ k) will satisfy the conitions of Lemma 5. Recalling the efinition of V above, we thus have the boun Re-arranging, we have: Reg 1 η R + krρv Reg + E + kηρ Reg + E). 1 kηρ ) Reg R η + krρv Reg + E + kηρ E. Now set η = min R ρ, 1 ); note that the secon term in the min allows us to ivie through ke kρ by 1 kηρ while only losing a factor of. We then have Reg Rρ ) k E + V Reg + E + kr ρ, where the first term is the boun when η = R/ρ ke an the secon term upper-bouns R η case that η = 1/kρ. Solving the quaratic, we have which completes the proof. Reg Rρ k + V ) E + kr ρ 4 + 4V ), in the 1. Note that Shalev-Shwartz 011) refers to the proceure as online mirror escent rather than regularize ual averaging, but it is actually the same proceure.. More precisely, we use the fact that if x a + bx + c for a, b, c 0, then x a + b + c. 1

22 STEINHARDT DUCHI A.9. Proof of Lemma 7 Note that θ i+1) θ i) = z i) = y i) w i) ) x i) )x i) = w w i) ) x i) + ɛ i) )x i). Since suppw w i) ) U an U, this is a zero-mean ranom variable since Cov[X i) ] = I), an so θ i) is a martingale ifference sequence. Moreover, letting F i be the sigma-algebra generate by the first i 1 samples, we have that [ ) ) ] E exp λ θ i+1) θ i) ρ λ f i w i) ) F i [ = E exp λx i) y i) w i) ) x i)) ) ] ρ λ f i w i) ) F i [ = E exp λx i) y i) w i) ) x i))) ] F i [ i) 1 E exp λ ρ y i) w i) ) x i)) ) ] ρ λ f i w i) ) F i = 1. Here i) is obtaine by marginalizing over x i), using the fact that xi) is inepenent of both y i) an w i) ) x i) since suppw ) U an suppw i) ) U), together with the fact that x i) ρ an hence is sub-gaussian. Consequently, for any λ, Y i = exp λθ i+1) ρ λ ) i k=1 f kw k) ) is a supermartingale. For any t, efine the stopping criterion Y i t. Then, by Doob s optional stopping theorem Corollary 5.11 of Breiman 199)) applie to this stoppe process, together with Markov s inequality, we have P [max n Y i t] 1 t. Bouning i k=1 f k by n k=1 f k an inverting, for any fixe an λ, with probability 1 δ, we have n max θi) λρ f i w i) ) + 1 λ log1/δ). Union-bouning over = 1,..., an ±θ changes the log ) term to log/δ), yieling the high-probability-boun n max θi) λρ f i w i) ) + 1 λ log/δ). Using the equality n f iw i) ) = E +Reg an choosing the optimal value of λ yiels the esire result. Remark 1 Note that λ is set aaptively base on E + Reg, which epens on the ranomness in θ 1:n) an thus coul be problematic. However, later in our argument we en up with an upper boun on E + Reg that epens only on the problem parameters; if we set λ base on this upper boun, our proofs still go through, an we eliminate the epenence of λ on θ.

23 MEMORY-BOUNDED SPARSE REGRESSION A.10. Proof of Lemma 8 We prove the inequality using stanar symmetrization inequalities. First, we use the Høffmann- Jorgensen inequality e.g. e la Peña an Giné, 1999, Theorem 1..3)), which states that given symmetric ranom variables U i an S n = n U i an some t 0 is such that P [ S n t 0 ] 1/8, then for any p > 0, [ p] 1/p E U i [ ] ) 1/p ) K p t 0 + E max U i p, where K p 1+ 4 p p + 1 exp logp + 1). i n p 6) Notably, for any p 5 we have K p 6p, for any p 1 we have K p 18p, an K p /p as p. Now, recall that ɛ i) is sub-exponential, meaning that there is some σ such that E [ ɛ i) p] 1/p σp for all p 1. Letting r i {±1} be i.i.. Raemacher variables an V i = ɛ i) ) be shorthan, an immeiate symmetrization inequality e.g. e la Peña an Giné, 1999, Lemma 1..6) gives that [ ] [ ] [ p] P V i σ n + t P V i E [V i ] + t t p E V i E [V i ]) [ p] p t p E r i V i. Now we apply the Høffmann-Jorgensen inequality 6) to U i = r i V i, which are symmetric an satisfy [ ] P r i V i t 0 1 t E [ ] r i V i = 1 0 t E Vi 4nσ4 0 t, 0 so that taking t 0 = 6σ n an applying inequality 6) yiels [ p] E r i V i Kp p 6σ [ ] ) 1/p p n + E max V p i n i. Moreover, we have that [ ] E max V p i n i E [V p i ] ne [ɛ i)p] nσ p p) p. In particular, we have [ ] ) p P V i σ n + 4K p σ t 4K p σ K p t pσ p 3 n + n 1/p p ) p 3 ) p n + n 1/p p =. t 7) Taking p = log 1 δ 3 an t = e ) n + n log 1 1 δ log 1 δ gives the esire result. 3

24 STEINHARDT DUCHI Appenix B. From real numbers to bits In Section 4, we analyze a proceure that store ÕB) real numbers. We now argue that each number only nees to be store to polynomial precision, an so requires only Õ1) bits. Algorithm 1 stores the quantities Ũ, ŵ, θ fine, an the count sketch structure θ coarse ). The set Ũ requires Ok log ) bits. To hanle θ coarse an θ fine, we ranomly an unbiasely) roun each g i) to the nearest multiple of M for some large integer M; for instance, a value of M will be roune to 4 M with probability 0.9 an to 5 M with probability 0.1. Since this yiels an unbiase estimate of gi), the RDA proceure will still work, with the overall regret boun increasing only slightly essentially, by kn in expectation). M If M ef = max i [n], [] g i), then the number of bits neee to represent each coorinate of θ fine as well as each number in the count sketch structure) is OlognMM )). But g i) = x i) f i w i) ) ρ E + Reg), which is polynomial in n, so M is polynomial in n. In aition, for M growing polynomially in n, the increase of kn in the regret boun becomes negligibly M small. Hence, we can take nmm to be polynomial in n, thus requiring Õ1) bits per coorinate to represent θ coarse an θ fine. Finally, to hanle ŵ, we eterministically roun to the nearest multiple of 1 M. Since ŵ oes not affect any choices in the algorithm, the accumulate error per coorinate n after n steps is at most M, which is again negligible for large M. Since each coorinate of ŵ is at most R, we can store them each with Õ1) bits, as well. 4

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Least-Squares Regression on Sparse Spaces

Least-Squares Regression on Sparse Spaces Least-Squares Regression on Sparse Spaces Yuri Grinberg, Mahi Milani Far, Joelle Pineau School of Computer Science McGill University Montreal, Canaa {ygrinb,mmilan1,jpineau}@cs.mcgill.ca 1 Introuction

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Self-normalized Martingale Tail Inequality

Self-normalized Martingale Tail Inequality Online-to-Confience-Set Conversions an Application to Sparse Stochastic Banits A Self-normalize Martingale Tail Inequality The self-normalize martingale tail inequality that we present here is the scalar-value

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
