arxiv: v2 [stat.ml] 25 Feb 2015


Competing with the Empirical Risk Minimizer in a Single Pass

Roy Frostig (1), Rong Ge (2), Sham M. Kakade (2), and Aaron Sidford (3)

arXiv:1412.6606v2 [stat.ML] 25 Feb 2015

(1) Stanford University, rf@cs.stanford.edu
(2) Microsoft Research, New England, rongge@microsoft.com, skakade@microsoft.com
(3) MIT, sidford@mit.edu

Abstract

In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data (commonly referred to as either the empirical risk minimizer (ERM) or the M-estimator) is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space usage. We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties:

1. The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
2. The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors.
3. The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.
4. The algorithm is easily parallelizable.

Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM.

1 Introduction

Consider the following optimization problem:

$$\min_{w \in S} P(w), \quad \text{where } P(w) \stackrel{\mathrm{def}}{=} \mathbb{E}_{\psi \sim D}[\psi(w)] \tag{1}$$

and $D$ is a distribution over convex functions from a Euclidean space $S$ to $\mathbb{R}$ (e.g. $S = \mathbb{R}^d$ in the finite dimensional setting). Let $w_*$ be a minimizer of $P$ and suppose we observe the functions

$\psi_1, \psi_2, \ldots, \psi_N$ independently sampled from $D$. Our objective is to compute an estimator $\hat w_N$ so that the expected error (or, equivalently, the excess risk)

$$\mathbb{E}[P(\hat w_N) - P(w_*)]$$

is small, where the expectation is over the estimator $\hat w_N$ (which depends on the sampled functions).

Stochastic approximation algorithms, such as stochastic gradient descent (SGD) (Robbins and Monro, 1951), are the most widely used in practice, due to their ease of implementation and their efficiency with regards to runtime and memory.

Without consideration for computational constraints, we often wish to compute the empirical risk minimizer (ERM; or, equivalently, the M-estimator):

$$\hat w_N^{\mathrm{ERM}} \in \operatorname*{argmin}_{w \in S} \frac{1}{N} \sum_{i=1}^{N} \psi_i(w). \tag{2}$$

In the context of statistical modeling, the ERM is the maximum likelihood estimator (MLE). Under certain regularity conditions, and under correct model specification [1], the MLE is asymptotically efficient, in that no unbiased estimator can have a lower variance in the limit (see Lehmann and Casella (1998); van der Vaart (2000)) [2]. Analogous arguments have been made in the stochastic approximation setting, where we do not necessarily have a statistical model of the distribution $D$ (see Kushner and Yin (2003)).

The question we aim to address is as follows. Consider the ratio:

$$\frac{\mathbb{E}[P(\hat w_N^{\mathrm{ERM}}) - P(w_*)]}{\mathbb{E}[P(\hat w_N) - P(w_*)]}. \tag{3}$$

We seek an algorithm to compute $\hat w_N$ in which: (1) under sufficient regularity conditions, this ratio approaches 1 on every problem $D$ and (2) it does so quickly, at a rate quantifiable in terms of the number of samples, the dependence on the initial error (and other relevant quantities), and the computational time and space usage.

1.1 This work

Under certain smoothness assumptions on $\psi$ and strong convexity assumptions on $P$ (applicable to linear and logistic regression, generalized linear models, smoothed Huber losses, and various other M-estimation problems), we provide an algorithm where:
1. The algorithm achieves the same statistical rate of convergence as the ERM on every problem, even considering constant factors, and we quantify the sample size at which this occurs.
2. The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
3. The algorithm decreases the standard notion of initial error at a super-polynomial rate [3].
4. The algorithm is trivially parallelizable (see Remark 3).

[1] A well specified statistical model is one where the data is generated under some model in the parametric class. See the linear regression Section 3.1.
[2] However, note that biased estimators, such as the James-Stein estimator, can outperform the MLE (Lehmann and Casella, 1998).
[3] A function is super-polynomial if it grows faster than any polynomial.
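As a toy illustration of what "competing with the ERM in a single pass" means, consider scalar mean estimation with the loss $\psi_i(w) = \frac{1}{2}(w - x_i)^2$. This example is an assumption made for illustration and is not taken from the paper: here the ERM is the sample mean, and SGD with step size $1/t$ reproduces it exactly while storing only one number. The function names below are hypothetical.

```python
import random

def erm_mean(samples):
    """ERM for psi_i(w) = 0.5 * (w - x_i)**2 is the sample mean (needs all data)."""
    return sum(samples) / len(samples)

def streaming_mean(stream):
    """Single pass, O(1) memory: SGD on psi_t with step size 1/t."""
    w = 0.0
    for t, x in enumerate(stream, start=1):
        grad = w - x      # gradient of 0.5 * (w - x)**2 at w
        w -= grad / t     # step size 1/t reproduces the running mean
    return w

def excess_risk(w, true_mean):
    """P(w) - P(w_*) for P(w) = E[0.5 * (w - x)**2], whose minimizer is E[x]."""
    return 0.5 * (w - true_mean) ** 2

rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(10_000)]
w_erm = erm_mean(data)
w_stream = streaming_mean(iter(data))
```

In this special case the streaming estimate and the ERM coincide, so the ratio (3) equals 1. For general losses no such closed form exists, which is the gap the paper's algorithm addresses.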

Algorithm / analysis | Problem | Step size | Initial error dependence | Parallelizable | Finite-sample analysis
Polyak and Juditsky (1992) | general | decaying: $1/n^c$ | ? | ? | no
Polyak and Juditsky (1992) / Dieuleveut and Bach (2014) | linear regression | constant | $\Omega(1/n^2)$ | ? | yes
This work: Streaming SVRG | general | constant | $1/n^{\omega(1)}$ | yes | yes

Table 1: Comparison of known streaming algorithms which achieve a constant competitive ratio to the ERM. Polyak and Juditsky (1992) is an SGD algorithm with iterate averaging. Concurrent to and independently from our work, Dieuleveut and Bach (2014) provide a finite-sample analysis for SGD with averaging in the linear regression problem setting (where the learning rate can be taken as constant). In the "problem" column, "general" indicates problems under the regularity assumptions herein. Polyak and Juditsky (1992) require the step size to decay with the sample size $n$, as $1/n^c$ with $c$ strictly in the range $1/2 < c < 1$. The dependence on $c$ in a finite-sample analysis is unclear (and tuning the decay of learning rates is often undesirable in practice). The initial error is $P(w_0) - P(w_*)$, where $w_0$ is the starting point of the algorithm. We seek algorithms in which the initial error dependence is significantly lower in order, and we write $1/n^{\omega(1)}$ to indicate that it can be driven down to an arbitrarily low-order polynomial. See Remark 3 with regard to parallelization.

Table 1 compares previous (and concurrent) algorithms that enjoy the first two guarantees; this work is the first with a finite-sample analysis handling the more general class of problems. Our algorithm is a variant of the stochastic variance reduced gradient procedure of Johnson and Zhang (2013).

Importantly, we quantify how fast we obtain a rate comparable to that of the ERM. For the case of linear regression, we have non-trivial guarantees when the sample size $N$ is larger than a constant times what can be interpreted as a condition number, $\kappa = L/\mu$, where $\mu$ is a strong convexity parameter of $P$ and where $L$ is a smoothness parameter of each $\psi$.
Critically, after $N$ is larger than $\kappa$, the initial error is divided by a factor that can be larger than any polynomial in $N/\kappa$.

Finally, in order to address this question on a per-problem basis, we provide both upper and lower bounds for the rate of convergence of the ERM.

1.2 Related work

Stochastic optimization dates back to the work of Robbins and Monro (1951) and has seen much subsequent work (Kushner and Clark, 1978; Kushner and Yin, 2003; Nemirovski and Yudin, 1983). More recently, questions of how to quantify and compare rates of estimation procedures, with implications to machine learning problems in the streaming and large dataset settings, have been raised and discussed several times (see Bottou and Bousquet (2008); Agarwal and Bottou (2014)).

Stochastic approximation. The pioneering work of Polyak and Juditsky (1992) and Ruppert (1988) provides an asymptotically optimal streaming algorithm, by averaging the iterates of an SGD procedure. It is unclear how quickly these algorithms converge to the rate of the ERM in

finite sample; the relevant dependencies, such as the dependence on the initial error (that is, $P(w_0) - P(w_*)$ where $w_0$ is the starting point of the algorithm), are not specified. In particular, they characterize the limiting distribution of $\sqrt{N}(\hat w_N - w_*)$, essentially arguing that the variance of the iterate-averaging procedure matches the asymptotic distribution of the ERM (see Kushner and Yin (2003)).

In a series of papers, Bach and Moulines (2011), Bach and Moulines (2013), Dieuleveut and Bach (2014), and Defossez and Bach (2015) provide non-asymptotic analysis of the same averaging schemes. Of these, for the specific case of linear least-squares regression, Dieuleveut and Bach (2014) and Defossez and Bach (2015) provide rates which are competitive with the ERM, concurrently with and independently of the results presented herein. The work in Bach and Moulines (2011) and Bach and Moulines (2013) either does not achieve the ERM rate or has a dependence on the initial error which is not lower in order; it is rather in Dieuleveut and Bach (2014) and Defossez and Bach (2015) that dependence on the initial error decaying as $1/N^2$ is shown. For the special case of least squares, one could adapt the algorithm and guarantees of Dieuleveut and Bach (2014); Defossez and Bach (2015), by replacing global averaging with random restarts, to obtain super-polynomial rates (results comparable to ours when specializing to linear regression). For more general problems, it is unclear how such an adaptation would work: using constant step sizes alone may not suffice. In contrast, as shown in Table 1, our algorithm is identical for a wide variety of cases and does not need decaying rates (whose choices may be difficult in practice).

We should also note that much work has characterized rates of convergence under various assumptions on $P$ and $\psi$ different than our own. Our case of interest is when $P$ is strongly convex.
For such $P$, the rates of convergence of many algorithms are $O(1/N)$, often achieved by averaging the iterates in some way (Nemirovski et al., 2009; Juditsky and Nesterov, 2010; Rakhlin et al., 2012; Hazan and Kale, 2014). These results do not achieve a constant competitive ratio, for a variety of reasons (they have leading-order dependencies on various quantities, including the initial error along with strong convexity and smoothness parameters). Solely in terms of the dependence on the sample size $N$, these rates are known to be optimal (Nemirovski and Yudin, 1983; Nesterov, 2004; Agarwal et al., 2012).

Empirical risk minimization (M-estimation). In statistics, it is classically argued that the MLE, under certain restrictions, is an asymptotically efficient estimator for well-specified statistical models (Lehmann and Casella, 1998; van der Vaart, 2000). Analogously, in an optimization context, applicable to mis-specified models, similar asymptotic arguments have been made: under certain restrictions, the asymptotically optimal estimator is one which has a limiting variance that is equivalent to that of the ERM (Anbar, 1971; Fabian, 1973; Kushner and Clark, 1978).

With regards to finite-sample rates, Agarwal et al. (2012) provide information-theoretic lower bounds (for any strategy) for certain stochastic convex optimization problems. This result does not imply our bounds as they do not consider the same smoothness assumptions on $\psi$. For the special case of linear least-squares regression, there are several upper bounds (for instance, Caponnetto and De Vito (2007); Hsu et al. (2014)). Recently, Shamir (2014) provides lower bounds specifically for the least-squares estimator, applicable under model mis-specification, and sharp only for specific problems.

Linearly convergent optimization (and approaches based on doubling). There are numerous algorithms for optimizing sums of convex functions that converge linearly, i.e. that depend

only logarithmically on the target precision. Notably, several recently developed such algorithms are applicable in the setting where the sample size $N$ becomes large, due to their stochastic nature (Strohmer and Vershynin, 2009; Le Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013). These procedures minimize a sum of $N$ losses in time (near to) linear in $N$, provided $N$ is sufficiently large relative to the dimension and the condition number.

Naively, one could attempt to use one of these algorithms to directly compute the ERM. Such an attempt poses two difficulties. First, we would need to prove concentration results for the empirical function $\hat P_N(w) = \frac{1}{N} \sum_{i=1}^{N} \psi_i(w)$; in order to argue that these algorithms perform well in linear time with respect to the objective $P$, one must relate the condition number of $\hat P_N(w)$ to the condition number of $P(w)$. Second, we would need new generalization analysis in order to relate the in-sample error $\epsilon_N(\hat w_N)$, where $\epsilon_N(w) \stackrel{\mathrm{def}}{=} \hat P_N(w) - \min_{w'} \hat P_N(w')$, to the generalization error $\mathbb{E}[P(\hat w_N) - P(w_*)]$. To use existing generalization analyses would demand that $\epsilon_N(\hat w_N) = \Omega(1/N)$, but the algorithms in question all require at least $\log N$ passes of the data (furthermore scaled by other problem-dependent factors) to achieve such an in-sample error. Hence, this approach would not immediately describe the generalization error obtained in time linear in $N$. Finally, it requires that the entire observed data sample, constituting the sum, be stored in memory.

A second natural question is: can one naively use a doubling trick with an extant algorithm to compete with the ERM? By this we mean to iteratively run such a linearly convergent optimization algorithm, on increasingly larger subsets of the data, with the hope of cutting the error at each iteration by a constant fraction, eventually down to that of the ERM. There are two points to note for this approach.
First, the approach is not implementable in a streaming model, as one would eventually have to run the algorithm on a constant fraction of the entire dataset, thus essentially holding the entire dataset in memory. Second, proving such an algorithm succeeds would similarly involve the aforementioned type of generalization argument.

We conjecture that the tight generalization arguments described above are attainable, although with a somewhat involved analysis. For linear regression, the bounds in Hsu et al. (2014) may suffice. More generally, we believe the detailed ERM analysis provided herein could be used.

In contrast, the statistical convergence analysis of our single-pass algorithm is self-contained and does not go through any generalization arguments about the ERM. In fact, it avoids matrix concentration arguments entirely.

Comparison to related work. To our knowledge, this work provides the first streaming algorithm guaranteed to have a rate that approaches that of the ERM (under certain regularity assumptions on $D$), where the initial error is decreased at a super-polynomial rate. The previous work, in the general case that we consider, only provides asymptotic convergence guarantees (Polyak and Juditsky, 1992). For the special case of linear least-squares regression, the concurrent and independent work presented in Dieuleveut and Bach (2014) and Defossez and Bach (2015) also converges to the rate of the ERM, with a lower-order dependence on the initial error of $\Omega(1/N^2)$. Furthermore, even if we ignored memory constraints and focused solely on computational complexity, our algorithm compares favorably to using state-of-the-art algorithms for minimizing sums of functions (such as the linearly convergent algorithms in Le Roux et al. (2012); Shalev-Shwartz and Zhang (2013); Johnson and Zhang (2013)); as discussed above, obtaining a convergence rate with these algorithms would entail some further generalization analysis.
It would be interesting if one could quantify an approach of restarting the algorithm of Polyak and Juditsky (1992) to obtain guarantees comparable to our streaming algorithm. Such an analysis

could be delicate in settings other than linear regression, as their learning rates do not decay too quickly or too slowly (they must decay strictly faster than $1/\sqrt{N}$, yet more slowly than $1/N$). In contrast, our algorithm takes a constant learning rate to obtain its constant competitive ratio. Furthermore, our algorithm is easily parallelizable and its analysis, we believe, is relatively transparent.

1.3 Organization

Section 2 summarizes our main results, and Section 3 provides applications to a few standard statistical models. Section 4 provides the main technical claims for our algorithm, Streaming SVRG (Algorithm 1). Section 5 provides finite-sample rates for the ERM, along with proofs for these rates. The Appendix contains various technical lemmas and proofs of our corollaries.

2 Main results

This section summarizes our main results, as corollaries of more general theorems provided later. After providing our assumptions in Section 2.1, Section 2.2 provides the algorithm, along with performance guarantees. Then Section 2.3 provides upper and lower bounds of the statistical rate of the empirical risk minimizer.

First, a few preliminaries and definitions are needed. Denote $\|x\|_M^2 \stackrel{\mathrm{def}}{=} x^\top M x$ for a vector $x$ and a matrix $M$ of appropriate dimensions. Denote $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ as the maximal and minimal eigenvalues of a matrix $M$. Let $I$ denote the identity matrix. Also, for positive semidefinite symmetric matrices $A$ and $B$, $A \preceq B$ if and only if $x^\top A x \le x^\top B x$ for all $x$.

Throughout, define $\sigma^2$ as:

$$\sigma^2 \stackrel{\mathrm{def}}{=} \mathbb{E}_{\psi \sim D}\left[ \frac{1}{2} \|\nabla \psi(w_*)\|^2_{(\nabla^2 P(w_*))^{-1}} \right]. \tag{4}$$

This quantity governs the precise (problem dependent) convergence rate of the ERM. Namely, under certain restrictions on $D$, we have

$$\lim_{N \to \infty} \frac{\mathbb{E}[P(\hat w_N^{\mathrm{ERM}}) - P(w_*)]}{\sigma^2 / N} = 1. \tag{5}$$

This limiting rate is well-established in asymptotic statistics (see, for instance, van der Vaart (2000)), whereas Section 2.3 provides upper and lower bounds on this rate for finite sample sizes.
Analogous to the Cramér-Rao lower bound, under certain restrictions, $\sigma^2/N$ is the asymptotically efficient rate for stochastic approximation problems (Anbar, 1971; Fabian, 1973; Kushner and Yin, 2003) [4].

The problem dependent rate of $\sigma^2/N$ sets the benchmark. Statistically, we hope to achieve a leading order dependency of $\sigma^2/N$ quickly, with rapidly-decaying dependence on the initial error.

2.1 Assumptions

We now provide two assumptions under which we analyze the convergence rate of our streaming algorithm, Algorithm 1. Our first assumption is relatively standard. It provides upper and lower quadratic approximations (the lower approximation is on the full objective $P$).

[4] Though, as with Cramér-Rao, this may be improvable with biased estimators.

Assumption 2.1. Suppose that:

1. The objective $P$ is twice differentiable.
2. (Strong convexity) The objective $P$ is $\mu$-strongly convex, i.e. for all $w, w' \in S$,

$$P(w) \ge P(w') + \nabla P(w')^\top (w - w') + \frac{\mu}{2} \|w - w'\|_2^2. \tag{6}$$

3. (Smoothness) Each loss $\psi$ is $L$-smooth (with probability one), i.e. for all $w, w' \in S$,

$$\psi(w) \le \psi(w') + \nabla \psi(w')^\top (w - w') + \frac{L}{2} \|w - w'\|_2^2. \tag{7}$$

Our results in fact hold under a slightly weaker version of this assumption (see Remark 9). Define:

$$\kappa \stackrel{\mathrm{def}}{=} \frac{L}{\mu}. \tag{8}$$

The quantity $\kappa$ can be interpreted as the condition number of the optimization objective (1). The following definition quantifies a global bound on the Hessian.

Definition 2.1 ($\alpha$-bounded Hessian). Let $\alpha$ be the smallest value (if it exists) such that for all $w \in S$, $\nabla^2 P(w_*) \preceq \alpha \nabla^2 P(w)$.

Under Assumption 2.1, we have $\alpha \le \kappa$, because $L$-smoothness implies $\nabla^2 P(w_*) \preceq L I$ and $\mu$-strong convexity implies $\mu I \preceq \nabla^2 P(w)$. However, $\alpha$ could be much smaller. For instance, $\alpha = 1$ in linear regression, whereas $\kappa$ is the maximum to minimum eigenvalue ratio of the design matrix.

Our second assumption offers a stronger, local relationship on the objective's Hessian, namely self-concordance. A function is self-concordant if its third-order derivative is bounded by a multiple of its second-order derivative. Formally, $f : \mathbb{R} \to \mathbb{R}$ is $M$ self-concordant if and only if $f$ is convex and $|f'''(x)| \le M f''(x)^{3/2}$. A multivariate function $f : \mathbb{R}^d \to \mathbb{R}$ is $M$ self-concordant if and only if its restriction to any line is $M$ self-concordant.

Assumption 2.2 (Self-concordance). Suppose that:

1. $P$ is $M$-self concordant (or that the weaker condition in Equation (30) holds).
2. The following kurtosis condition holds:

$$\frac{\mathbb{E}_{\psi \sim D}\left[ \|\nabla \psi(w_*)\|_2^4 \right]}{\left( \mathbb{E}_{\psi \sim D}\left[ \|\nabla \psi(w_*)\|_2^2 \right] \right)^2} \le C.$$

Note that these two assumptions are also standard assumptions in the analysis of the two phases of Newton's method (aside from the kurtosis condition): the first phase of Newton's method gets close to the minimizer quickly (based on a global strong convexity assumption) and the second phase obtains quadratic convergence (based on local curvature assumptions on how fast the local Hessian changes, e.g. self-concordance).
Moreover, our proof of the streaming algorithm follows a similar structure; we use Assumption 2.1 to analyze the progress of our algorithm when the current point is far away from optimality and Assumption 2.2 when it is close.
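To make the gap between $\alpha$ and $\kappa$ concrete, consider a generic strongly convex quadratic objective. This is a sketch under stated assumptions: the matrix $A$ and vector $b$ are illustrative symbols that do not appear elsewhere in the paper. The Hessian is constant in $w$, so the condition of Definition 2.1 holds with $\alpha = 1$, while $\kappa$ can still be large:

```latex
P(w) = \tfrac{1}{2} w^\top A w - b^\top w, \quad A \succ 0
\;\Longrightarrow\;
\nabla^2 P(w) = A \ \text{for all } w ,
\\[4pt]
\nabla^2 P(w_*) = A \preceq 1 \cdot \nabla^2 P(w)
\;\Longrightarrow\; \alpha = 1 ,
\qquad
\kappa = \frac{L}{\mu} \ge \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} .
```

The last inequality uses $\mu \le \lambda_{\min}(A)$ and, since $\mathbb{E}[\nabla^2 \psi] = A$ with each $\psi$ being $L$-smooth, $\lambda_{\max}(A) \le L$.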

Algorithm 1 Streaming Stochastic Variance Reduced Gradient (Streaming SVRG)

input: Initial point $\tilde w_0$, batch sizes $\{k_0, k_1, \ldots\}$, update frequency $m$, learning rate $\eta$, smoothness $L$
for each stage $s = 0, 1, 2, \ldots$ do
    Sample $\tilde\psi_1, \ldots, \tilde\psi_{k_s}$ from $D$ and compute the estimate
        $\widehat{\nabla P}(\tilde w_s) = \frac{1}{k_s} \sum_{i \in [k_s]} \nabla \tilde\psi_i(\tilde w_s)$.    (9)
    Sample $\tilde m$ uniformly at random from $\{1, 2, \ldots, m\}$.
    $w_0 \leftarrow \tilde w_s$
    for $t = 0, 1, \ldots, \tilde m - 1$ do
        Sample $\psi_t$ from $D$ and set
            $w_{t+1} \leftarrow w_t - \frac{\eta}{L} \left( \nabla \psi_t(w_t) - \nabla \psi_t(\tilde w_s) + \widehat{\nabla P}(\tilde w_s) \right)$.    (10)
    end for
    $\tilde w_{s+1} \leftarrow w_{\tilde m}$
end for

2.2 Algorithm

Here we describe a streaming algorithm and provide its convergence guarantees. Algorithm 1 is inspired by the Stochastic Variance Reduced Gradient (SVRG) algorithm of Johnson and Zhang (2013) for minimizing a strongly convex sum of smooth losses. The algorithm follows a simple framework that proceeds in stages. In each stage $s$ we draw $k_s$ samples independently at random from $D$ and use these samples to obtain an estimate of the gradient of $P$ at the current point, $\tilde w_s$ (9). This stable gradient, denoted $\widehat{\nabla P}(\tilde w_s)$, is then used to decrease the variance of a gradient descent procedure. For each of $\tilde m$ steps (where $\tilde m$ is chosen uniformly at random from $\{1, 2, \ldots, m\}$), we draw a sample $\psi$ from $D$ and take a step opposite to its gradient at the current point, plus a zero-bias correction given by $\nabla \psi(\tilde w_s) - \widehat{\nabla P}(\tilde w_s)$ (see (10)).

The remainder of this section shows that, for suitable choices of $k_s$ and $m$, Algorithm 1 achieves desirable convergence rates under the aforementioned assumptions.

Remark 1 (Generalizing SVRG). Note that Algorithm 1 is a generalization of SVRG. In particular if we chose $k_s = \infty$, i.e. if $\widehat{\nabla P}(\tilde w_s) = \nabla P(\tilde w_s)$, then our algorithm coincides with the SVRG algorithm of Johnson and Zhang (2013). Also, note that Johnson and Zhang (2013) do not make use of any self-concordance assumptions.

Remark 2 (Non-conformance to stochastic first-order oracle models). Algorithm 1 is not implementable in the standard stochastic first-order oracle model, e.g. that which is assumed in order to obtain the lower bounds in Nemirovski and Yudin (1983) and Agarwal et al. (2012).
Streaming SVRG computes the gradient of the randomly drawn $\psi$ at two points, while the oracle model only allows gradient queries at one point.

We have the following algorithmic guarantee under only Assumption 2.1, which is a corollary of Theorem 4.1 (also see the Appendix).
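The stage structure of Algorithm 1, with the batch gradient estimate (9) and the corrected step (10), can be sketched in code as follows. This is a minimal illustration under assumptions, not the authors' implementation: it hard-codes the squared loss, the `sample` callback and the toy data model are hypothetical, and the parameter values are chosen for the demonstration rather than taken from the corollaries.

```python
import random

def grad(w, x, y):
    """Gradient of the squared loss psi(w) = (y - <w, x>)**2."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2.0 * residual * xi for xi in x]

def streaming_svrg(sample, w0, batch_sizes, m, eta, L, rng):
    """Run one stage per entry of batch_sizes; sample() draws a fresh (x, y)."""
    w_tilde = list(w0)
    for k_s in batch_sizes:
        # Equation (9): average k_s fresh gradients at the anchor w_tilde.
        g_hat = [0.0] * len(w_tilde)
        for _ in range(k_s):
            g = grad(w_tilde, *sample())
            g_hat = [a + b / k_s for a, b in zip(g_hat, g)]
        m_tilde = rng.randint(1, m)  # uniform on {1, ..., m}
        w = list(w_tilde)
        for _ in range(m_tilde):
            x, y = sample()  # each inner sample is queried at two points
            g_w, g_anchor = grad(w, x, y), grad(w_tilde, x, y)
            # Equation (10): variance-reduced step with step size eta / L.
            w = [wi - (eta / L) * (a - b + c)
                 for wi, a, b, c in zip(w, g_w, g_anchor, g_hat)]
        w_tilde = w
    return w_tilde

# Hypothetical toy problem: y = <w_true, x> + noise, x uniform on [-1, 1]^2.
rng = random.Random(0)
w_true = [1.0, -2.0]

def sample():
    x = [rng.uniform(-1.0, 1.0) for _ in w_true]
    y = sum(a * b for a, b in zip(w_true, x)) + rng.gauss(0.0, 0.1)
    return x, y

w_hat = streaming_svrg(sample, [0.0, 0.0],
                       batch_sizes=[50 * 2 ** s for s in range(6)],
                       m=400, eta=0.2, L=4.0, rng=rng)
```

Every sample is drawn once and then discarded, so memory use is a few vectors of the dimension of $w$. The batch sizes grow geometrically while `m` stays fixed, mirroring the schedule $k_s = b\,k_{s-1}$ described in the text.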

9 Corollary. Convergence under α-bounded Hessians). Suppose Assumption. holds. Fix w 0 R d. For p and b 3, set η =, m = 0bp+ κ 0b p+ η, 0 = 0ακb p+, and s = b s. Denote: s def s = τ + m) τ=0 s is an upper bound on the number of samples drawn up to the end of stage s). Let ŵ s be the parameter returned at iteration s by Algorithm. For s b p +6p κ and so s > p + 6p), we have E[P ŵ s ) P w )] + 4 ) ασ + b s ) P w 0 ) P w ) ) p When α = such as for least squares regression), the above bound achieves the ERM rate of σ / up to a constant factor, which can be driven to one, as discussed later). Furthermore, under self-concordance, we can drive the competitive ratio 3) down from α to arbitrarily near to. The following is a corollary of Theorem 4. also see the Appendix): Corollary. Convergence under self-concordance). Suppose Assumptions. and. hold. Consider w 0 R d. For p and b 3, set η =, m = 0bp+ κ 0b p+ η, 0 = max{400κ b p+3, 0C} = def max{bmκ, 0C}, and s = b s. Denote s = s τ=0 s + m) an upper bound on the number of samples drawn up to the end of stage s). Let ŵ s be the parameter returned at iteration s by Algorithm. Then: ) E[P ŵ s ) P w )] + 5 σ b s ) { κσ b s min, s ακ s Mσ+) 0 ) p/ } + P w0 ) P w ) s 0 ) p+ Remar 3 Implementation and parallelization). ote that Algorithm is simple to implement and requires little space. In each iteration, the space usage is linear in the size of a single sample along with needing to count to s and m). Furthermore, the algorithm is easily parallelizable once we have run enough stages. In both Theorem 4. and Theorem 4. as s increases s grows geometrically, whereas m remains constant. Hence, the majority of the computation time is spent averaging the gradient, i.e. 9), which is easily parallelizable. ote that the constants in the parameter settings for the Algorithm have not been optimized. 
Furthermore, we have not attempted to fully optimize the time it takes the algorithm to enter the second phase (in which self-concordance is relevant), and we conjecture that the algorithm in fact enjoys even better dependencies. Our emphasis is on an analysis that is flexible in that it allows for a variety of assumptions in driving the competitive ratio to 1 (as is done in the case of logistic regression in Section 3, where we use a slight variant of self-concordance).

Before providing statistical rates for the ERM, let us remark that the above achieves super-polynomial convergence rates and that the competitive ratio can be driven to 1 (recall that $\sigma^2/N$ is the rate of the ERM).

Remark 4 (Linear convergence and super-polynomial convergence). If the ratio $\gamma$ between $P(\tilde w_0) - P(w_*)$ and $\sigma^2$ is known approximately (within a multiplicative factor), we can let $k_s = k_0$

for $\log_b \gamma$ iterations, then start increasing $k_s = b\,k_{s-1}$. This way, in the first $\log_b \gamma$ iterations, $\mathbb{E}[P(\hat w_{N_s}) - P(w_*)]$ is decreasing geometrically. Furthermore, even without knowing the ratio $\gamma$, we can obtain a super-polynomial rate of convergence by setting the parameters as we specify in the next remark. (The dependence on the initial error will then be $2^{-\Omega(\log N / \log\log N)^2}$.)

Remark 5 (Driving the ratio to 1). By choosing $b$ sufficiently large, the competitive ratio (3) can be made close to 1 (on every problem). Furthermore, we can ensure this constant goes to 1 by altering the parameter choices adaptively: let $k_s = 4^s (s!)\, k_0$, and let $\eta_s = \eta / 2^s$, $m_s = m \cdot 4^s$. Intuitively, $k_s$ grows so fast that $\lim_{s \to \infty} k_s / N_s = 1$; $\eta_s$ and $m_s$ are also changing fast enough so the initial error vanishes very quickly.

2.3 Competing with the ERM

Now we provide a finite-sample characterization of the rate of convergence of the ERM under regularity conditions. This essentially gives the numerator of (3), allowing us to compare the rate of the ERM against the rate achieved by Streaming SVRG. We provide the more general result in Theorem 5.1; this section focuses on a corollary.

In the following, we constrain the domain $S$; so the ERM, as defined in (2), is taken over this restricted set. Further discussion appears in Theorem 5.1 and the comments thereafter.

Corollary 2.3 (of Theorem 5.1). Suppose $\psi_1, \psi_2, \ldots, \psi_N$ are an independently drawn sample from $D$. Assume the following regularity conditions hold; see Theorem 5.1 for weaker conditions.

1. $S$ is compact.
2. $\psi$ is convex (with probability one).
3. $w_*$ is an interior point of $S$, and $\nabla^2 P(w_*)$ exists and is positive definite.
4. (Smoothness) Assume the first, second, and third derivatives of $\psi$ exist and are uniformly bounded on $S$.

Then, for the ERM $\hat w_N^{\mathrm{ERM}}$ (as defined in (2)), we have

$$\lim_{N \to \infty} \frac{\mathbb{E}[P(\hat w_N^{\mathrm{ERM}}) - P(w_*)]}{\sigma^2 / N} = 1.$$

In particular, the following lower and upper bounds hold.
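With problem dependent constants $C_0$ and $C_1$ (polynomial in the relevant quantities, as specified in Theorem 5.1), we have for all $p \ge 2$, if $N$ satisfies $\frac{p \log dN}{N} \le C_0$, then

$$\left( 1 - C_1 \sqrt{\frac{p \log dN}{N}} \right) \frac{\sigma^2}{N} \;\le\; \mathbb{E}[P(\hat w_N^{\mathrm{ERM}}) - P(w_*)] \;\le\; \left( 1 + C_1 \sqrt{\frac{p \log dN}{N}} \right) \frac{\sigma^2}{N} + \frac{\max_{w \in S} \left( P(w) - P(w_*) \right)}{N^p}.$$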

3 Applications: one pass learning and generalization

This section provides applications to a few standard statistical models, in part providing a benchmark for comparison on concrete problems. For the widely studied problem of least-squares regression, we also instantiate upper and lower bounds for the ERM. The applications in this section can be extended to include generalized linear models, some M-estimation problems, and other loss functions (e.g. the Huber loss).

3.1 Linear least-squares regression

In linear regression, the goal is to minimize the (possibly $\ell_2$-regularized) squared loss $\psi_{X,Y}(w) = (Y - w^\top X)^2 + \lambda \|w\|_2^2$ for a random data point $(X, Y) \in \mathbb{R}^d \times \mathbb{R}$. The objective (1) is

$$P(w) = \mathbb{E}_{X,Y \sim D}\left[ (Y - w^\top X)^2 \right] + \lambda \|w\|_2^2. \tag{11}$$

3.1.1 Upper bound for the algorithm

Using that $\alpha = 1$, the following corollary illustrates that Algorithm 1 achieves the rate of the ERM.

Corollary 3.1 (Least-squares performance of streaming SVRG). Suppose that $\|X\|^2 \le L$. Define $\mu = \lambda + \lambda_{\min}(\Sigma)$. Using the parameter settings of Theorem 2.1 and supposing that $N \ge b^{p^2 + 6p} \kappa$,

$$\mathbb{E}[P(\hat w_N) - P(w_*)] \le \left( \left( 1 + \frac{4}{b} \right) \frac{\sigma}{\sqrt{N}} + \sqrt{\frac{P(\tilde w_0) - P(w_*)}{(N/\kappa)^p}} \right)^2.$$

Remark 6 (When $N \le \kappa$). If the sample size is less than $\kappa$ and $\lambda = 0$, there exist distributions on $X$ in which the ERM is not unique (as the sample matrix $\frac{1}{N} \sum_i X_i X_i^\top$ will not be invertible, with reasonable probability, on these distributions by construction).

Remark 7 (When do the streaming SVRG bounds become meaningful?). Algorithm 1 is competitive with the performance of the ERM when the sample size $N$ is slightly larger than a constant times $\kappa$. In particular, as the sample size $N$ grows larger than $\kappa$, the initial error is decreased at an arbitrary polynomial rate in $N/\kappa$.

Let us consider a few special cases. First, consider the unregularized setting where $\lambda = 0$. Assume also that the least-squares problem is well-specified. That is, $Y = w_*^\top X + \eta$ where $\mathbb{E}[\eta] = 0$ and $\mathbb{E}[\eta^2] = \sigma^2_{\mathrm{noise}}$. Define $\Sigma = \mathbb{E}[X X^\top]$. Here, we have

$$\sigma^2 = \mathbb{E}\,\|\eta X\|^2_{\Sigma^{-1}} = d\,\sigma^2_{\mathrm{noise}}. \tag{12}$$

In other words, Corollary 3.1 recovers the classical rate in this case.
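The identity $\mathbb{E}\,\|\eta X\|^2_{\Sigma^{-1}} = d\,\sigma^2_{\mathrm{noise}}$ in (12) can be checked numerically. The sketch below is a Monte Carlo estimate under assumptions chosen for illustration (Gaussian coordinates, a diagonal $\Sigma$, and noise independent of $X$); none of these choices come from the paper, and the function name is hypothetical.

```python
import random

def sigma_sq_estimate(variances, noise_std, n, rng):
    """Monte Carlo estimate of E[eta^2 * X^T Sigma^{-1} X] for diagonal Sigma."""
    total = 0.0
    for _ in range(n):
        x = [rng.gauss(0.0, v ** 0.5) for v in variances]
        eta = rng.gauss(0.0, noise_std)  # noise drawn independently of X
        quad = sum(xi * xi / v for xi, v in zip(x, variances))  # X^T Sigma^{-1} X
        total += eta * eta * quad
    return total / n

rng = random.Random(1)
variances = [0.5, 1.0, 4.0]  # diagonal of Sigma, so d = 3
noise_std = 0.7
estimate = sigma_sq_estimate(variances, noise_std, 200_000, rng)
expected = len(variances) * noise_std ** 2  # d * sigma_noise^2
```

The estimate does not depend on how ill-conditioned $\Sigma$ is: the $\Sigma^{-1}$ norm whitens $X$, so only the dimension $d$ and the noise level enter.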
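In the mis-specified case, where we do not assume the aforementioned model is correct (i.e. $\mathbb{E}[Y \mid X]$ may not equal $w_*^\top X$), define $\hat Y(X) = w_*^\top X$, and we have

$$\sigma^2 = \mathbb{E}\left[ (Y - \hat Y(X))^2 \|X\|^2_{\Sigma^{-1}} \right] \tag{13}$$
$$= \mathbb{E}\left[ (Y - \mathbb{E}[Y \mid X])^2 \|X\|^2_{\Sigma^{-1}} \right] + \mathbb{E}\left[ (\mathbb{E}[Y \mid X] - \hat Y(X))^2 \|X\|^2_{\Sigma^{-1}} \right] \tag{14}$$
$$= \mathbb{E}\left[ \mathrm{var}(Y \mid X)\, \|X\|^2_{\Sigma^{-1}} \right] + \mathbb{E}\left[ \mathrm{bias}(X)^2\, \|X\|^2_{\Sigma^{-1}} \right] \tag{15}$$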

where the last equality exposes the effects of the approximation error:

$$\mathrm{var}(Y \mid X) \stackrel{\mathrm{def}}{=} \mathbb{E}\left[ (Y - \mathbb{E}[Y \mid X])^2 \mid X \right] \quad \text{and} \quad \mathrm{bias}(X) \stackrel{\mathrm{def}}{=} \mathbb{E}[Y \mid X] - \hat Y(X). \tag{16}$$

In the regularized setting (a.k.a. ridge regression), also not necessarily well-specified, we have

$$\sigma^2 = \mathbb{E}\left[ \|(Y - \hat Y(X)) X + \lambda w_*\|^2_{(\Sigma + \lambda I)^{-1}} \right]. \tag{17}$$

3.1.2 Statistical upper and lower bounds

For comparison, the following corollary (of Theorem 5.1) provides lower and upper bounds for the statistical rate of the ERM.

Corollary 3.2 (Least-squares ERM bounds). Suppose that $\|X\|^2_{(\Sigma + \lambda I)^{-1}} \le \kappa$ and the dimension is $d$ (in the infinite dimensional setting, we may take $d$ to be the intrinsic dimension, as per Remark 10). Let $c$ be an appropriately chosen universal constant. For all $p > 0$, if $\frac{p \log dN}{N} \le \frac{c}{\kappa}$, then

$$\mathbb{E}[P(\hat w_N^{\mathrm{ERM}}) - P(w_*)] \ge \left( 1 - c \sqrt{\frac{\kappa p \log dN}{N}} \right) \frac{\sigma^2}{N} - \frac{\sqrt{\mathbb{E}[Z^4]}}{N^{p/2}}$$

where $Z = \|\nabla \psi(w_*)\|_{(\nabla^2 P(w_*))^{-1}} = \|(Y - w_*^\top X) X + \lambda w_*\|_{(\Sigma + \lambda I)^{-1}}$.

For an upper bound, we have two cases:

(Unregularized case) Suppose $\lambda = 0$. Assume that we constrain the ERM to lie in some compact set $S$ (and supposing $w_* \in S$). Then for all $p > 0$, if $\frac{p \log dN}{N} \le \frac{c}{\kappa}$, we have

$$\mathbb{E}[P(\hat w_N^{\mathrm{ERM}}) - P(w_*)] \le \left( 1 + c \sqrt{\frac{\kappa p \log dN}{N}} \right) \frac{\sigma^2}{N} + \frac{\max_{w \in S} \left( P(w) - P(w_*) \right)}{N^p}.$$

(Regularized case) Suppose $\lambda > 0$. Then for all $p > 0$, if $\frac{p \log dN}{N} \le \frac{c}{\kappa}$, we have

) κp log d E[P ŵ ERM σ λmaxσ+λ) ) P w )] + c + p λ σ

(this last equation follows from a modification of the argument in Equation (37)).

Remark 8 (ERM comparisons). Interestingly, for the upper bound (when $\lambda = 0$), we see no way to avoid constraining the ERM to lie in some compact set; this allows us to bound the loss $P$ in the event of some extremely low probability failure (see Theorem 5.1). The ERM upper bound has a term comparable to the initial error of our algorithm. In contrast, the lower bound is for the usual unconstrained least-squares estimator.

3.2 Logistic regression

In (binary) logistic regression, we have a distribution on $(X, Y) \in \mathbb{R}^d \times \{0, 1\}$. For any $w$, define

$$\Pr(Y = y \mid w, X) \stackrel{\mathrm{def}}{=} \frac{\exp(y X^\top w)}{1 + \exp(X^\top w)} \tag{18}$$

for $X \in \mathbb{R}^d$ and $y \in \{0, 1\}$. We do not assume the best fit model $w_*$ is correct. The loss function is taken to be the regularized log likelihood $\psi_{X,y}(w) = -\log \Pr(Y \mid w, X) + \frac{\lambda}{2} \|w\|_2^2$ and the objective (1) instantiates as the negative expected (regularized) log likelihood $P(w) = \mathbb{E}[-\log \Pr(Y \mid w, X)] + \frac{\lambda}{2} \|w\|_2^2$. Define $\hat Y(X) = \Pr(Y = 1 \mid w_*, X)$ and $\Sigma = \nabla^2 P(w_*) = \mathbb{E}[\hat Y(X)(1 - \hat Y(X)) X X^\top] + \lambda I$. Analogous to the least-squares case, we can interpret $\hat Y(X)$ as the conditional expectation of $Y$ under the (possibly mis-specified) best fit model. With this notation, $\sigma^2$ is similar to its instantiation under regularized least-squares (Equation (17)):

$$\sigma^2 = \mathbb{E}\left[ \frac{1}{2} \|(Y - \hat Y(X)) X + \lambda w_*\|^2_{\Sigma^{-1}} \right]. \tag{19}$$

Under this definition of $\sigma^2$, by Theorem 2.2 together with the following defined quantities, the single-pass estimator of Algorithm 1 achieves a rate competitive with that of the ERM:

Corollary 3.3 (Logistic regression performance). Suppose that $\|X\|^2 \le L$. Define $\mu = \lambda$ and $M = \alpha\, \mathbb{E}\left[ \|X\|^3_{(\nabla^2 P(w_*))^{-1}} \right]$. Under parameters from Theorem 2.2, we have

) E[P ŵ ) P w )] + 5 σ b ) { κσ b min, Mσ+) 0 ) p/ } + P w0 ) P w ) 0 ) p+

The corollary uses Lemma 10, a straightforward lemma to handle self-concordance for logistic regression, which is included for completeness. See Bach (2010) for techniques for analyzing the self-concordance of logistic regression.

4 Analysis of Streaming SVRG

Here we analyze Algorithm 1. Section 4.1 provides useful common lemmas. Section 4.2 uses these lemmas to characterize the behavior of the Algorithm. These are then used to prove convergence in terms of both $\alpha$-bounded Hessians (Section 4.3) and $M$-self-concordance (Section 4.4).

4.1 Common lemmas

Our first lemma is a consequence of smoothness. It is the same observation made in Johnson and Zhang (2013).

Lemma 1. If $\psi$ is smooth (with probability one), then

$$\mathbb{E}_{\psi \sim D}\left[ \|\nabla \psi(w) - \nabla \psi(w_*)\|_2^2 \right] \le 2L \left( P(w) - P(w_*) \right). \tag{20}$$

Remark 9 (A weaker smoothness assumption). Instead of the smoothness Assumption 2.1 in Equation (7), it suffices to directly assume (38) and still have all results hold as presented.
In doing so, we incur an additional factor of 2, as in this case we have $\nabla^2 P(w_*) \preceq 2LI$ (by Lemma 9). For further explanation see Appendix A.

Proof. For an $L$-smooth function $f : \mathbb{R}^d \to \mathbb{R}$, we have

$$f(w) - \min_{w'} f(w') \ge \frac{1}{2L}\,\|\nabla f(w)\|_2^2. \tag{21}$$

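To see this, observe that

$$\min_{w'} f(w') \le \min_{\eta} f\big(w - \eta\nabla f(w)\big) \le \min_{\eta}\Big( f(w) - \eta\,\|\nabla f(w)\|_2^2 + \frac{\eta^2 L}{2}\,\|\nabla f(w)\|_2^2 \Big) = f(w) - \frac{1}{2L}\,\|\nabla f(w)\|_2^2$$

using the definition of $L$-smoothness. Now define:

$$g(w) = \psi(w) - \psi(w_*) - (w - w_*)^\top \nabla\psi(w_*). \tag{22}$$

Since $\psi$ is $L$-smooth (with probability one), $g$ is $L$-smooth (with probability one), and it follows that:

$$\|\nabla\psi(w) - \nabla\psi(w_*)\|_2^2 = \|\nabla g(w)\|_2^2 \le 2L\big(g(w) - \min_{w'} g(w')\big) \le 2L\big(g(w) - g(w_*)\big) = 2L\big(\psi(w) - \psi(w_*) - (w - w_*)^\top\nabla\psi(w_*)\big)$$

where the second step follows from smoothness. The proof is completed by taking expectations and noting that $\mathbb{E}[\nabla\psi(w_*)] = \nabla P(w_*) = 0$.

Our second lemma bounds the variance of $\psi \sim \mathcal{D}$ in the $(\nabla^2 P(w_*))^{-1}$ norm.

Lemma 2. Suppose Assumption 2.1 holds. Let $w \in \mathbb{R}^d$ and let $\psi \sim \mathcal{D}$. Then

$$\mathbb{E}\,\|\nabla\psi(w) - \nabla P(w)\|^2_{(\nabla^2 P(w_*))^{-1}} \le 2\Big(\sqrt{\kappa\,(P(w) - P(w_*))} + \sigma\Big)^2. \tag{23}$$

Proof. For random vectors $a$ and $b$, we have

$$\mathbb{E}\|a + b\|^2 = \mathbb{E}\|a\|^2 + 2\,\mathbb{E}[a^\top b] + \mathbb{E}\|b\|^2 \le \mathbb{E}\|a\|^2 + 2\sqrt{\mathbb{E}\|a\|^2\,\mathbb{E}\|b\|^2} + \mathbb{E}\|b\|^2 = \Big(\sqrt{\mathbb{E}\|a\|^2} + \sqrt{\mathbb{E}\|b\|^2}\Big)^2.$$

Consequently,

$$\mathbb{E}\,\|\nabla\psi(w) - \nabla P(w)\|^2_{(\nabla^2 P(w_*))^{-1}} \le \Big(\sqrt{\mathbb{E}\,\|\nabla\psi(w) - \nabla\psi(w_*) - \nabla P(w)\|^2_{(\nabla^2 P(w_*))^{-1}}} + \sqrt{\mathbb{E}\,\|\nabla\psi(w_*)\|^2_{(\nabla^2 P(w_*))^{-1}}}\Big)^2 \le \Big(\sqrt{\tfrac{1}{\mu}\,\mathbb{E}\,\|\nabla\psi(w) - \nabla\psi(w_*) - \nabla P(w)\|_2^2} + \sqrt{2}\,\sigma\Big)^2$$

where the last step uses $\mu I \preceq \nabla^2 P(w_*)$ and the definition of $\sigma^2$. Observe that $\mathbb{E}[\nabla\psi(w) - \nabla\psi(w_*)] = \nabla P(w) - \nabla P(w_*) = \nabla P(w)$. Applying Lemma 1 and, for random $a$, that $\mathbb{E}\|a - \mathbb{E}a\|^2 \le \mathbb{E}\|a\|^2$, we have

$$\mathbb{E}\,\|\nabla\psi(w) - \nabla\psi(w_*) - \nabla P(w)\|_2^2 \le \mathbb{E}\,\|\nabla\psi(w) - \nabla\psi(w_*)\|_2^2 \le 2L\,(P(w) - P(w_*)).$$

Combining and using the definition of $\kappa$ yields the result.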
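As an illustrative sanity check of Lemma 1 (not part of the original analysis), the sketch below evaluates both sides of its inequality for one-dimensional least-squares losses $\psi_i(w) = \frac{1}{2}(x_i w - y_i)^2$, with $\mathcal{D}$ taken to be the uniform distribution over a finite sample. The helper name `lemma1_gap`, the synthetic data, and the choice $L = \max_i x_i^2$ are assumptions made for this example only.

```python
import random


def lemma1_gap(xs, ys, w):
    """Return (lhs, rhs) of Lemma 1 for psi_i(w) = 0.5 * (x_i * w - y_i)**2 in 1-D.

    lhs = E |psi'(w) - psi'(w_star)|^2 and rhs = 2 L (P(w) - P(w_star)),
    where the expectation is the uniform average over the sample.
    """
    n = len(xs)
    # Minimizer of the average loss P(w) = (1/n) * sum_i psi_i(w).
    w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    # Each psi_i has second derivative x_i**2, so the largest one is a valid L.
    L = max(x * x for x in xs)

    def grad(x, y, v):
        return x * (x * v - y)

    def P(v):
        return sum(0.5 * (x * v - y) ** 2 for x, y in zip(xs, ys)) / n

    lhs = sum((grad(x, y, w) - grad(x, y, w_star)) ** 2 for x, y in zip(xs, ys)) / n
    rhs = 2.0 * L * (P(w) - P(w_star))
    return lhs, rhs


# Hypothetical synthetic data; any sample with some nonzero x_i would do.
rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(50)]
ys = [rng.gauss(0.0, 1.0) for _ in range(50)]
for w in (-3.0, 0.0, 0.5, 4.0):
    lhs, rhs = lemma1_gap(xs, ys, w)
    assert lhs <= rhs + 1e-9  # the inequality of Lemma 1
```

In this quadratic case both sides are multiples of $(w - w_*)^2$, so the check reduces to $\frac{1}{n}\sum_i x_i^4 \le \max_i x_i^2 \cdot \frac{1}{n}\sum_i x_i^2$.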

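4.2 Progress of the algorithm

The following bounds the progress of one step of Algorithm 1.

Lemma 3. Suppose Assumption 2.1 holds, $\widetilde{w}_0 \in \mathbb{R}^d$, and $\widetilde\psi_1, \dots, \widetilde\psi_k$ are functions from $\mathbb{R}^d \to \mathbb{R}$. Suppose $\psi_0, \dots, \psi_{m-1}$ are sampled independently from $\mathcal{D}$. Set $w_0 = \widetilde{w}_0$ and for $t \in \{0, 1, \dots, m-1\}$, set:

$$w_{t+1} := w_t - \frac{\eta}{L}\Big(\nabla\psi_t(w_t) - \nabla\psi_t(\widetilde{w}_0) + \frac{1}{k}\sum_{i\in[k]}\nabla\widetilde\psi_i(\widetilde{w}_0)\Big)$$

for some $\eta > 0$. Define:

$$\Delta := \frac{1}{k}\sum_{i\in[k]}\nabla\widetilde\psi_i(\widetilde{w}_0) - \nabla P(\widetilde{w}_0).$$

For all $t$ let $\alpha_t$ be such that

$$P(w_*) \ge P(w_t) + (w_* - w_t)^\top\nabla P(w_t) + \frac{1}{2\alpha_t}\,\|w_t - w_*\|^2_{\nabla^2 P(w_*)} \tag{24}$$

(note that such an $\alpha_t$ exists by Assumption 2.1, as $\alpha_t \le \kappa$). Then for all $t$ we have

$$\mathbb{E}\,L\,\|w_{t+1} - w_*\|_2^2 \le \mathbb{E}\Big[L\,\|w_t - w_*\|_2^2 - 2\eta(1 - 4\eta)\big(P(w_t) - P(w_*)\big) + 8\eta^2\big(P(\widetilde{w}_0) - P(w_*)\big) + (\alpha_t\eta + 2\eta^2)\,\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}\Big]. \tag{25}$$

Proof. Letting

$$g_t(w) = \nabla\psi_t(w) - \nabla\psi_t(\widetilde{w}_0) + \frac{1}{k}\sum_{i\in[k]}\nabla\widetilde\psi_i(\widetilde{w}_0)$$

and recalling the definition of $w_{t+1}$ and $\Delta$, we have

$$\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|w_{t+1} - w_*\|_2^2 = \mathbb{E}_{\psi_t\sim\mathcal{D}}\,\Big\|w_t - w_* - \frac{\eta}{L}\,g_t(w_t)\Big\|_2^2 = \mathbb{E}_{\psi_t\sim\mathcal{D}}\Big[\|w_t - w_*\|_2^2 - \frac{2\eta}{L}(w_t - w_*)^\top g_t(w_t) + \frac{\eta^2}{L^2}\,\|g_t(w_t)\|_2^2\Big]$$

$$= \|w_t - w_*\|_2^2 - \frac{2\eta}{L}(w_t - w_*)^\top\big(\nabla P(w_t) + \Delta\big) + \frac{\eta^2}{L^2}\,\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|g_t(w_t)\|_2^2. \tag{26}$$

Now by (24) we know that

$$-(w_t - w_*)^\top\nabla P(w_t) \le -\big(P(w_t) - P(w_*)\big) - \frac{1}{2\alpha_t}\,\|w_t - w_*\|^2_{\nabla^2 P(w_*)}. \tag{27}$$

Using Cauchy-Schwarz and that $2ab \le a^2 + b^2$ for scalar $a$ and $b$, we have

$$-(w_t - w_*)^\top\Delta \le \frac{1}{2\alpha_t}\,\|w_t - w_*\|^2_{\nabla^2 P(w_*)} + \frac{\alpha_t}{2}\,\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}. \tag{28}$$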

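Furthermore

$$\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|g_t(w_t)\|_2^2 = \mathbb{E}_{\psi_t\sim\mathcal{D}}\,\Big\|\nabla\psi_t(w_t) - \nabla\psi_t(\widetilde{w}_0) + \frac{1}{k}\sum_{i\in[k]}\nabla\widetilde\psi_i(\widetilde{w}_0)\Big\|_2^2$$

$$= \mathbb{E}_{\psi_t\sim\mathcal{D}}\,\big\|\big(\nabla\psi_t(w_t) - \nabla\psi_t(w_*)\big) - \big(\nabla\psi_t(\widetilde{w}_0) - \nabla\psi_t(w_*) - \nabla P(\widetilde{w}_0)\big) + \Delta\big\|_2^2$$

$$\le 2\,\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\big\|\big(\nabla\psi_t(w_t) - \nabla\psi_t(w_*)\big) - \big(\nabla\psi_t(\widetilde{w}_0) - \nabla\psi_t(w_*) - \nabla P(\widetilde{w}_0)\big)\big\|_2^2 + 2\,\|\Delta\|_2^2$$

$$\le 4\,\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|\nabla\psi_t(w_t) - \nabla\psi_t(w_*)\|_2^2 + 4\,\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|\nabla\psi_t(\widetilde{w}_0) - \nabla\psi_t(w_*) - \nabla P(\widetilde{w}_0)\|_2^2 + 2\,\|\Delta\|_2^2$$

$$\le 4\,\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|\nabla\psi_t(w_t) - \nabla\psi_t(w_*)\|_2^2 + 4\,\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|\nabla\psi_t(\widetilde{w}_0) - \nabla\psi_t(w_*)\|_2^2 + 2\,\|\Delta\|_2^2$$

where we have used that $\mathbb{E}[\nabla\psi_t(\widetilde{w}_0) - \nabla\psi_t(w_*) - \nabla P(\widetilde{w}_0)] = 0$ and $\mathbb{E}\|a - \mathbb{E}a\|^2 \le \mathbb{E}\|a\|^2$. Applying Lemma 1 and using $\nabla^2 P(w_*) \preceq LI$ yields

$$\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|g_t(w_t)\|_2^2 \le 8L\big(P(w_t) - P(w_*)\big) + 8L\big(P(\widetilde{w}_0) - P(w_*)\big) + 2L\,\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}. \tag{29}$$

Combining (26), (27), (28), and (29) yields

$$\mathbb{E}_{\psi_t\sim\mathcal{D}}\,\|w_{t+1} - w_*\|_2^2 \le \|w_t - w_*\|_2^2 - \frac{2\eta}{L}(1 - 4\eta)\big(P(w_t) - P(w_*)\big) + \frac{8\eta^2}{L}\big(P(\widetilde{w}_0) - P(w_*)\big) + \Big(\frac{\alpha_t\eta}{L} + \frac{2\eta^2}{L}\Big)\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}},$$

and multiplying both sides by $L$ yields the result.

Finally we bound the progress of one stage of Algorithm 1.

Lemma 4. Under the same assumptions as Lemma 3, for $\widetilde{m}$ chosen uniformly at random in $\{1, \dots, m\}$ and $\widetilde{w}_1 := w_{\widetilde{m}}$, we have

$$\mathbb{E}[P(\widetilde{w}_1) - P(w_*)] \le \frac{1}{1 - 4\eta}\Big[\Big(\frac{\kappa}{m\eta} + 4\eta\Big)\big(P(\widetilde{w}_0) - P(w_*)\big) + \mathbb{E}\Big[\frac{\alpha_{\widetilde{m}} + 2\eta}{2}\,\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}\Big]\Big]$$

where we are conditioning on $\widetilde{w}_0$ and $\widetilde\psi_1, \dots, \widetilde\psi_k$.

Proof. Taking an unconditional expectation with respect to $\{\psi_t\}$ and summing (25) from Lemma 3 from $t = m - 1$ down to $t = 0$ yields

$$L\,\mathbb{E}\,\|w_m - w_*\|_2^2 \le L\,\|w_0 - w_*\|_2^2 - 2\eta(1 - 4\eta)\sum_{t=0}^{m-1}\mathbb{E}\big(P(w_t) - P(w_*)\big) + 8m\eta^2\,\mathbb{E}\big(P(\widetilde{w}_0) - P(w_*)\big) + \sum_{t=0}^{m-1}\mathbb{E}\Big[(\alpha_t\eta + 2\eta^2)\,\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}\Big].$$

By strong convexity, $\|w_0 - w_*\|_2^2 \le \frac{2}{\mu}\big(P(\widetilde{w}_0) - P(w_*)\big)$, and a little manipulation yields that: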

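$$2\eta(1 - 4\eta)\,\frac{1}{m}\sum_{t=0}^{m-1}\mathbb{E}\big(P(w_t) - P(w_*)\big) \le \Big(\frac{2\kappa}{m} + 8\eta^2\Big)\big(P(\widetilde{w}_0) - P(w_*)\big) + \frac{1}{m}\sum_{t=0}^{m-1}\mathbb{E}\Big[(\alpha_t\eta + 2\eta^2)\,\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}\Big].$$

Rearranging terms and applying the definition of $\widetilde{w}_1$ then yields the result.

4.3 With α-bounded Hessians

Here we prove the progress made by Algorithm 1 in a single stage under only Assumption 2.1.

Theorem 4.1 (Stage progress with α-bounded Hessians). Under Assumption 2.1, for Algorithm 1, we have for all $s$:

$$\mathbb{E}[P(\widetilde{w}_{s+1}) - P(w_*)] \le \frac{1}{1 - 4\eta}\Big[\Big(\frac{\kappa}{m\eta} + 4\eta\Big)\mathbb{E}[P(\widetilde{w}_s) - P(w_*)] + \frac{\alpha + 2\eta}{k_s}\Big(\sqrt{\kappa\,\mathbb{E}[P(\widetilde{w}_s) - P(w_*)]} + \sigma\Big)^2\Big].$$

Proof. By definition of $\alpha$, we have $\alpha_t \le \alpha$ for all $t$ in Lemma 4 and therefore

$$\mathbb{E}[P(\widetilde{w}_{s+1}) - P(w_*)] \le \frac{1}{1 - 4\eta}\Big[\Big(\frac{\kappa}{m\eta} + 4\eta\Big)\mathbb{E}[P(\widetilde{w}_s) - P(w_*)] + \frac{\alpha + 2\eta}{2}\,\mathbb{E}\Big[\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}\Big]\Big].$$

Now using that the $\widetilde\psi_i$ are independent and that $\mathbb{E}[\nabla\widetilde\psi_i(\widetilde{w}_s)] = \nabla P(\widetilde{w}_s)$, we have

$$\mathbb{E}\Big[\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}\Big] = \frac{1}{k_s}\,\mathbb{E}_{\psi\sim\mathcal{D}}\Big[\|\nabla\psi(\widetilde{w}_s) - \nabla P(\widetilde{w}_s)\|^2_{(\nabla^2 P(w_*))^{-1}}\Big]$$

$$\le \frac{2}{k_s}\,\mathbb{E}\Big[\kappa\big(P(\widetilde{w}_s) - P(w_*)\big) + 2\sigma\sqrt{\kappa\big(P(\widetilde{w}_s) - P(w_*)\big)} + \sigma^2\Big]$$

$$\le \frac{2}{k_s}\Big[\kappa\,\mathbb{E}[P(\widetilde{w}_s) - P(w_*)] + 2\sigma\sqrt{\kappa\,\mathbb{E}[P(\widetilde{w}_s) - P(w_*)]} + \sigma^2\Big] = \frac{2}{k_s}\Big(\sqrt{\kappa\,\mathbb{E}[P(\widetilde{w}_s) - P(w_*)]} + \sigma\Big)^2$$

where we have also used Lemma 2 and Jensen's inequality.

4.4 With M-self-concordance

Our main result in the self-concordant case follows.

Theorem 4.2 (Convergence under self-concordance). Suppose Assumptions 2.1 and 2.2 hold. Under Algorithm 1, for $\eta \le \frac{1}{8}$, 0C, and all $s$, we have E[P w s+ ) P w )] [ ) κ 4η mη + 4η E[P w s ) P w )] + Mσκ + 9κ) E[P w s ) P w )] + + η + 0Mσκ ) ) ] σ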
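The update of Lemma 3 and the randomly stopped stage of Lemma 4 can be sketched as follows. This is a minimal illustration on a one-dimensional least-squares stream, not the authors' implementation: the names `svrg_stage` and `make_sampler`, the doubling batch schedule, and all numeric settings are assumptions chosen for the example.

```python
import random


def svrg_stage(w_tilde, sample_grad, k, m, eta, L, rng):
    """Run one stage started at the anchor point w_tilde and return the new iterate.

    sample_grad(rng) must return the gradient function of one freshly drawn loss.
    """
    # Anchor gradient: (1/k) * sum_i grad psi~_i(w_tilde) on k fresh samples.
    anchor_fns = [sample_grad(rng) for _ in range(k)]
    anchor = sum(g(w_tilde) for g in anchor_fns) / k
    # Number of inner steps, uniform on {1, ..., m}.
    m_tilde = rng.randint(1, m)
    w = w_tilde
    for _ in range(m_tilde):
        g = sample_grad(rng)  # fresh psi_t, used once and then discarded
        # Variance-corrected step: grad psi_t(w) - grad psi_t(w_tilde) + anchor.
        w = w - (eta / L) * (g(w) - g(w_tilde) + anchor)
    return w


def make_sampler(w_true=2.0, noise=0.1):
    """Hypothetical data stream: y = w_true * x + Gaussian noise, loss 0.5 * (x*w - y)**2."""
    def sample_grad(rng):
        x = rng.uniform(0.5, 1.5)
        y = w_true * x + rng.gauss(0.0, noise)
        return lambda v: x * (x * v - y)
    return sample_grad


def run(stages=10, seed=1):
    rng = random.Random(seed)
    sampler = make_sampler()
    w = 0.0
    for s in range(stages):
        # Assumed schedule: batch size doubles each stage; eta <= 1/8; L = 1.5**2.
        w = svrg_stage(w, sampler, k=20 * 2 ** s, m=50, eta=0.1, L=2.25, rng=rng)
    return w


w_final = run()
```

Each sample is touched once and only the current iterate, the anchor point, and the anchor gradient are carried forward, which is the single-pass, constant-memory behavior the analysis is about. The sketch stores the `k` anchor gradient functions in a list for readability; a streaming implementation would accumulate their sum instead.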

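The proof utilizes the following lemmas. First, we show how self-concordance implies that there is a better effective strong convexity parameter in the $\nabla^2 P(w_*)$ norm when we are close to $w_*$.

Lemma 5. If $P$ is $M$-self-concordant, then

$$P(w_*) \ge P(w_t) + (w_* - w_t)^\top\nabla P(w_t) + \frac{\|w_t - w_*\|^2_{\nabla^2 P(w_*)}}{2\big(1 + M\,\|w_t - w_*\|_{\nabla^2 P(w_*)}\big)^2}. \tag{30}$$

Proof. First we use the following property of self-concordant functions: if $f$ is $M$-self-concordant, then

$$f(t) \ge f(0) + t f'(0) + \frac{4}{M^2}\Big(\frac{tM}{2}\sqrt{f''(0)} - \ln\Big(1 + \frac{tM}{2}\sqrt{f''(0)}\Big)\Big).$$

Apply this property to the function $P$ restricted to the line between $w_t$ and $w_*$, where the 0 point is at $w_t$ and $t = \|w_t - w_*\|_2$, so that $t\sqrt{f''(0)} = \|w_t - w_*\|_{\nabla^2 P(w_t)}$. Then we have

$$P(w_*) \ge P(w_t) + (w_* - w_t)^\top\nabla P(w_t) + \frac{4}{M^2}\Big(\frac{M}{2}\,\|w_t - w_*\|_{\nabla^2 P(w_t)} - \ln\Big(1 + \frac{M}{2}\,\|w_t - w_*\|_{\nabla^2 P(w_t)}\Big)\Big).$$

In order to convert the $\nabla^2 P(w_t)$ norm to the $\nabla^2 P(w_*)$ norm, we use another property of self-concordant functions:

$$f''(t) \ge \frac{f''(0)}{\big(1 + \frac{tM}{2}\sqrt{f''(0)}\big)^2}.$$

Again we restrict to the line between $w_*$ and $w_t$, where the 0 point corresponds to $w_*$ and $t = \|w_t - w_*\|_2$, and we get

$$\|w_t - w_*\|_{\nabla^2 P(w_t)} \ge \frac{\|w_t - w_*\|_{\nabla^2 P(w_*)}}{1 + \frac{M}{2}\,\|w_t - w_*\|_{\nabla^2 P(w_*)}}.$$

Now consider the function $h(x) = x - \ln(1 + x)$. The function has the following two properties: when $x \ge 0$, $h(x)$ is monotone and $h(x) \ge \frac{x^2}{2(1 + x)}$. This claim can be verified directly by taking derivatives. Therefore, writing $r = \|w_t - w_*\|_{\nabla^2 P(w_*)}$,

$$h\Big(\frac{M}{2}\,\|w_t - w_*\|_{\nabla^2 P(w_t)}\Big) \ge h\Big(\frac{\frac{M}{2}r}{1 + \frac{M}{2}r}\Big) \ge \frac{\frac{M^2}{4}\,r^2\big/\big(1 + \frac{M}{2}r\big)^2}{2\Big(1 + \frac{\frac{M}{2}r}{1 + \frac{M}{2}r}\Big)} = \frac{M^2 r^2}{8\big(1 + \frac{M}{2}r\big)\big(1 + Mr\big)} \ge \frac{M^2 r^2}{8\,(1 + Mr)^2}.$$

Multiplying by $\frac{4}{M^2}$ gives the last term of (30). This concludes the proof.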
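The two elementary facts used at the end of the proof of Lemma 5 can be checked numerically. The sketch below is only a sanity check on a grid, not a proof; the function names and grid values are assumptions made for the example. It checks that $h(x) = x - \ln(1+x)$ dominates $x^2 / (2(1+x))$ for $x \ge 0$, and that the self-concordant term dominates the quadratic term appearing in the statement of the lemma.

```python
import math


def h(x):
    """h(x) = x - ln(1 + x), computed with log1p for accuracy near zero."""
    return x - math.log1p(x)


def h_lower_bound(x):
    """The lower bound x^2 / (2 (1 + x)) claimed for x >= 0."""
    return x * x / (2.0 * (1.0 + x))


def self_concordant_term(r, M):
    """(4 / M^2) * h(y) with y = (M r / 2) / (1 + M r / 2)."""
    y = 0.5 * M * r / (1.0 + 0.5 * M * r)
    return 4.0 / (M * M) * h(y)


def quadratic_term(r, M):
    """r^2 / (2 (1 + M r)^2), the last term in the statement of Lemma 5."""
    return r * r / (2.0 * (1.0 + M * r) ** 2)


# Check h(x) >= x^2 / (2 (1 + x)) on a grid of x in [0, 20].
for i in range(0, 2001):
    x = i / 100.0
    assert h(x) >= h_lower_bound(x) - 1e-12

# Check the resulting bound for several hypothetical values of M and r.
for M in (0.5, 1.0, 2.0, 4.0):
    for j in range(1, 1001):
        r = j / 100.0
        assert self_concordant_term(r, M) >= quadratic_term(r, M)
```

The gap between the two sides grows with $Mr$, which reflects that the quadratic lower bound is only tight close to $w_*$.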

19 Essentially, this means when w t w P w is small the effective strong convexity in ) P w ) is small. In particular, { α t min α, + M ) } { w t w P w ) min κ, + M ) } w t w P w ) Thus we need to bound the residual error w t w P w ). Lemma 6 Crude residual error bound). Suppose the same assumptions in Lemma 3 hold and that η 8. Then for all t, we have E w t w P w ) 3κP w 0) P w )) + 6κ P w )) Proof. Since α t κ and by Lemma 3 we have EL w t+ w E [ L w t w η 4η) P w t ) P w )) + 8η P w 0 ) P w )) + κη + η ) P w )) ] Using that by strong convexity P w t ) P w ) µ w t w we have [ EL w t+ w E η ) L w t w ] + η P w 0 ) P w )) + ηκ κ P w )) Solving for the maximum value of L w t w in this recurrence we have, for all t, EL w t w 3κ ) η P w 0 ) P w )) + ηκ η P w )) Using that P w ) LI yields the result. Finally, we end up needing to bound higher moments of the error from. For this we provide two technical lemmas. Lemma 7. Suppose Assumption. and. hold. For ψ i sampled independently, we have 4 E ψ i w ) + C ) ) σ i P w )) Proof. By Assumption. we have 4 E ψ i w ) = 4 [ i P w )) ) ) ] E ψ D ψw ) 4 P w )) + 3 ) E ψ D ψ i w ) P w )) 3 ) + C ) 4 E ψ D ψ i w ) P w )) Recalling the definition of σ yields the result. 9

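Lemma 8. Suppose $a$ is a random variable such that $\mathbb{E}[a^4] \le C\,(\mathbb{E}[a^2])^2$, $b$ is a random variable, and $c$ is a constant. We have

$$\mathbb{E}[a^2\min\{b^2, c\}] \le 2\,\mathbb{E}[a^2]\,\sqrt{C\,c\,\mathbb{E}[b^2]}. \tag{31}$$

Proof. Let $E_1$ be the indicator variable for the event $a^2 \le T\,\mathbb{E}[a^2]$, where $T$ is chosen later. Let $E_2 = 1 - E_1$. On one hand, we have $\mathbb{E}[a^2 E_2]\;T\,\mathbb{E}[a^2] \le \mathbb{E}[a^4]$, therefore $\mathbb{E}[a^2 E_2] \le \frac{C}{T}\,\mathbb{E}[a^2]$. On the other hand, $\mathbb{E}[\min\{b^2, c\}\,a^2 E_1] \le \mathbb{E}[b^2 a^2 E_1] \le T\,\mathbb{E}[a^2]\,\mathbb{E}[b^2]$. Combining these two cases we have:

$$\mathbb{E}[\min\{b^2, c\}\,a^2] = \mathbb{E}[\min\{b^2, c\}\,a^2 E_1] + \mathbb{E}[\min\{b^2, c\}\,a^2 E_2] \le c\,\mathbb{E}[a^2 E_2] + \mathbb{E}[b^2 a^2 E_1] \le \frac{c\,C}{T}\,\mathbb{E}[a^2] + T\,\mathbb{E}[a^2]\,\mathbb{E}[b^2] = 2\,\mathbb{E}[a^2]\,\sqrt{c\,C\,\mathbb{E}[b^2]}.$$

In the last step we chose $T = \sqrt{\frac{c\,C}{\mathbb{E}[b^2]}}$ to balance the terms.

Using these lemmas, we are ready to provide the proof.

Proof of Theorem 4.2. We analyze stage $s$ of the algorithm. Let us define the variance term (A) as

$$(A) = \mathbb{E}\Big[\frac{\alpha_{\widetilde{m}} + 2\eta}{2}\,\|\Delta\|^2_{(\nabla^2 P(w_*))^{-1}}\Big].$$

Our main goal in the proof is to bound (A). First, for all $\alpha$, $x$, $y$ and positive semidefinite $H$ we have

$$\mathbb{E}\big[\alpha\,\|x + y\|_H^2\big] = \mathbb{E}\,\big\|H^{1/2}\sqrt{\alpha}\,x + H^{1/2}\sqrt{\alpha}\,y\big\|_2^2 \le \Big(\sqrt{\mathbb{E}\,\|\sqrt{\alpha}\,H^{1/2}x\|_2^2} + \sqrt{\mathbb{E}\,\|\sqrt{\alpha}\,H^{1/2}y\|_2^2}\Big)^2 = \Big(\sqrt{\mathbb{E}\,\alpha\,\|x\|_H^2} + \sqrt{\mathbb{E}\,\alpha\,\|y\|_H^2}\Big)^2. \tag{32}$$

By the definition of $\Delta$ we have

$$(A) \le \Big(\sqrt{(B)} + \sqrt{(C)}\Big)^2$$

where (B) and (C) are defined below. Using that $\mathbb{E}\,\|a - \mathbb{E}[a]\|_H^2 \le \mathbb{E}\,\|a\|_H^2$, Lemma 1, and the strong convexity of $P$, we have

$$(B) = \mathbb{E}\Big[\frac{\alpha_{\widetilde{m}} + 2\eta}{2}\,\Big\|\frac{1}{k_s}\sum_i\big(\nabla\widetilde\psi_i(\widetilde{w}_s) - \nabla\widetilde\psi_i(w_*)\big) - \nabla P(\widetilde{w}_s)\Big\|^2_{(\nabla^2 P(w_*))^{-1}}\Big] \le \frac{\kappa + 2\eta}{2}\cdot\frac{2\kappa}{k_s}\,\mathbb{E}[P(\widetilde{w}_s) - P(w_*)] \le \frac{2\kappa^2}{k_s}\,\mathbb{E}[P(\widetilde{w}_s) - P(w_*)].$$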

21 We use that min{a, b + c} b + min{a, c} for positive a, b, and c) by Lemma, the definition of σ, as well as 3) ) α m + η C) = E ψ i w ) i P w )) ησ + E min{κ, + M w t w P w )) } ψ i w ) i P w )) = ησ + E min{ κ, + M w t w P w )} ψ i w ) i P w )) ησ + E ψ i w ) + min{ κ, M w t w P w )} ψ i w ) ησ + σ i + D) ) i P w )) where D) is defined below. Using Lemma 6 and the independence of the different types of ψ D) = E min{κ, M w t w P w } ) ψ i w ) i P w )) { )} E min κ, M 3κ P w s ) P w ) + 6κ P w )) ψ i w ) 3κ M σ E[P w s ) P w )] + κ E) where E) is defined below. Using urtosis, E ψ i w ) i 4 P w )) 4σ /). i P w ))

22 By Lemma 7 and applying Lemma 8 we have { } E) E min, 6κM P w )) ψ i w ) i E min, κm ψ i w s ) ψ i w ) P w s ) i ψ i w ) i P w )) E min, κm ψ i w s ) ψ i w ) P w s ) + 70κM σ ) i P w )) P w )) + P w )) ψ i w ) i ψ i w ) 4σ 4 4κ M P w s) P w ) σ + 70κM 4 κmσ 96κ P w ) s) P w ) σ + 70κM by manipulation of constants ) σ 6κM + 96κ[P w ) s) P w )] σ + 70κM since a b a + b ) σ 00M κ + 96κ[P w s) P w )] ) i P w )) P w )) Using that x + y x + y this implies κ A) E[P w s) P w )] + ησ σ + + ) D) κ E[P ws ) P w )] + σ η + σ + ) D) κ E[P ws ) P w )] + σ η + σ + 3κ M σ E[P w s ) P w )] + κ ) ) σ 00M κ + 96κ[P w s) P w )] κ + Mσκ + 7κ) E[P w s ) P w )] + + η + 0Mσκ ) ) σ Using this bound in Lemma 4 then yields the result.

23 5 Empirical ris minimization M-estimation) for smooth functions We now provide finite-sample rates for the ERM. We tae the domain S to be compact in ) see Remar ). Throughout this section, define: for a matrix A of appropriate dimensions). A = P w )) / A P w )) / Theorem 5.. Suppose ψ, ψ,... are an independently drawn sample from D. Assume:. Convexity of ψ) Assume that ψ is convex with probability one).. Smoothness of ψ) Assume that ψ is smooth in the following sense: the first, second, and third derivatives exist at all interior points of S with probability one). 3. Regularity Conditions) Suppose a) S is compact so P w) is bounded on S). b) w is an interior point of S. c) P w ) is positive definite and, thus, is invertible). d) There exists a neighborhood B of w and a constant L 3, such that with probability one) ψw) is L 3 -Lipschitz, namely ψw) ψw ) L 3 w w P w ), for w, w in this neighborhood. 4. Concentration at w ) Suppose ψw ) P w )) L and ψw ) L hold with probability one. Suppose the dimension d is finite or, in the infinite dimensional setting, the intrinsic dimension is bounded, as in Remar 0). Then: lim E[P ŵ ERM ) P w )] σ / In particular, the following lower and upper bounds hold. Define = ε := c L L 3 + ) p log d L where c is an appropriately chosen universal constant. Also, let c be another appropriately chosen universal constant. { c min L, L L 3, diameterb) L }, then ε ) σ where Z = P w ) not compact. We have that for all p, if is large enough so that P w )) E[Z 4 ] p/ E[P ŵ ERM ) P w )] + ε ) σ + max w S P w) P w )) p p log d and so E[Z 4 ] L. The lower bound above holds even if S is 3

24 Remar 0 Infinite dimensional setting). Define M = ψw ) P w ) and d = TrEM ) λ maxem ), which we assume to be finite. Here can replace d with d in the theorems. See Lemma. Remar Compactness of S). The lower bound holds even if S is not compact. For the upper bound, the proof technique uses the compactness of S to bound the contribution to the expected regret due to a low probability) failure event that the ERM may not lie in the ball B or even the interior of S). If P is regularized then this last term can be improved, as S need not be compact. The basic idea of the proof follows that of Hsu et al. 04), along with various arguments based on Taylor s theorem. Proof. Throughout the proof use ŵ to denote the ERM ŵ ERM. Define: P w) = ψ i w) which is convex as it is the average of convex functions. Throughout the proof we tae t = cp logd) in the tail probability bounds in Appendix D for some universal constant c). This implies a probability of error less than p. For all w B, the empirical function P w) is L 3 -Lipschitz. In Lemma in Appendix D, we may tae v L as all eigenvalues of of P w ) are one, under the choice of norm). Using Lemma in Appendix D, for w B, we have: P w) P w ) P w) P w ) + P w ) P w ) L p log d L 3 w w P w ) + c for some other) universal constant c. ow we see to ensure that P w) is a constant spectral approximation to P w ). By choosing a sufficiently smaller ball B choose B to have radius of min{/0l 3 ), diameterb)}), the first term can be made small for w B. Also, for sufficiently large, the second term can be made arbitrarily small smaller than /0), which occurs if p log d c L. Hence, for such large enough, we have for w B : i= 33) P w) P w ) P w) 34) Suppose is at least this large from now on. ow let us show that ŵ B, with high probability, for sufficiently large. 
By Taylor s theorem, for all w in the interior of S, there exists a w, between w and w, such that: P w) = P w ) + P w ) w w ) + w w ) P w)w w ) Hence, for all w B and if Equation 34 holds, P w) P w ) = P w ) w w ) + w w P w) P w ) w w ) + 4 w w P w ) w w P w ) P w ) P w )) + ) 4 w w P w ) 4

25 Observe that if the right hand side is positive for some w B, then w is not a local minimum. Also, since P w ) 0, for a sufficiently small value of P w ), all points on the boundary of B will have values greater than that of w. Hence, we must have a local minimum of P w) that is strictly inside B for large enough). We can ensure this local minimum condition is achieved by choosing an large enough so that p log c min { L L 3, diameterb) L }, using Lemma and our bound on the diameter of B ). By convexity, we have that this is the global minimum, ŵ, and so ŵ B for large enough. Assume now that is this large from here on. For the ERM, 0 = P ŵ ). Again, by Taylor s theorem if ŵ is an interior point, we have: 0 = P ŵ ) = P w ) + P w )ŵ w ) for some w between w and ŵ. ow observe that w is in B since, for large enough, ŵ B ). Thus, ŵ w = P w )) P w ) 35) where the invertibility is guaranteed by Equation 34 and the positive definiteness of P w ). Using Lemma in Appendix D, ŵ w P w ) P w )) / P w )) P w )) / P w ) P w )) cl for some universal constant c. Again, by Taylor s theorem, we have that for some z : p log d 36) P ŵ ) P w ) = ŵ w ) P z )ŵ w ) where z is between w and ŵ. Observe that both w and z are between ŵ and w, which implies w w and z w since ŵ w. By Equations 33 and 36 and the tail inequalities in Appendix D), Define: P w ) P w ) c L L 3 + ) p log d L P z ) P w ) L 3 z w P w ) cl L 3 p log d Here the universal constant c is chosen so that: and ε = c L L 3 + ) p log d L ε ) P w ) P z ) + ε ) P w ) ε ) P w ) P w ) + ε ) P w ) using standard matrix perturbation results). 5

26 Define: For a lower bound, observe that: M, = P w )) / P w )) P w )) / M, = P w )) / P z ) P w )) / P ŵ ) P w ) λ minm, ) ŵ w P w ) = λ minm, ) P w )ŵ w ) λ minm, )) λ min M, ) P w )ŵ w ) = λ minm, )) λ min M, ) P w ) P w )) P w ) P w )) P w )) P w )) where we have used the ERM expression in Equation 35. Let IE) be the indicator that the desired previous events hold, which we can ensure with probability greater than c p. We have: E[P ŵ ) P w )] E[P ŵ ) P w ))IE)] [λ E min M, )) λ min M, ) P ] w ) IE) P w )) c ε ) [ ] E P w ) IE) P w )) = c ε ) [ ] E P w ) Inot E)) P w )) = c ε ) σ [ ]) E P w ) Inot E) P w )) [ ] c ε )σ E P w ) Inot E) P w )) for a universal constant c ). ow define the random variable Z = P w ) P w )) failure event probability of less than, for any z p 0, we have: E [ Z Inot E) ] = E [ Z Inot E)IZ z 0 ) ] + E [ Z Inot E)IZ z 0 ) ] z 0 E [Inot E)] + E [ Z IZ z 0 ) ] z 0 p + E [Z Z z 0 p + E[Z4 ] z 0 E[Z 4 ] p/ z 0 ] 6. With a

where we have chosen $z_0 = N^{p/2}\sqrt{\mathbb{E}[Z^4]}$.

For an upper bound:

$$\mathbb{E}[P(\widehat{w}_N) - P(w_*)] = \mathbb{E}[(P(\widehat{w}_N) - P(w_*))\,I(E)] + \mathbb{E}[(P(\widehat{w}_N) - P(w_*))\,I(\text{not } E)] \tag{37}$$

$$\le \mathbb{E}[(P(\widehat{w}_N) - P(w_*))\,I(E)] + \frac{\max_{w\in S}\big(P(w) - P(w_*)\big)}{N^p}$$

since the probability of not $E$ is less than $1/N^p$. For an upper bound of the first term, observe that:

$$\mathbb{E}[(P(\widehat{w}_N) - P(w_*))\,I(E)] \le \frac{1}{2}\,\mathbb{E}\Big[\big(\lambda_{\max}(M_{1,N})\big)^2\,\lambda_{\max}(M_{2,N})\,\|\nabla\widehat{P}_N(w_*)\|^2_{(\nabla^2 P(w_*))^{-1}}\,I(E)\Big]$$

$$\le (1 + c'\varepsilon_N)\,\frac{1}{2}\,\mathbb{E}\Big[\|\nabla\widehat{P}_N(w_*)\|^2_{(\nabla^2 P(w_*))^{-1}}\,I(E)\Big] \le (1 + c'\varepsilon_N)\,\frac{1}{2}\,\mathbb{E}\Big[\|\nabla\widehat{P}_N(w_*)\|^2_{(\nabla^2 P(w_*))^{-1}}\Big] = (1 + c'\varepsilon_N)\,\frac{\sigma^2}{N}.$$

This completes the proof (using a different universal constant $c$ in $\varepsilon_N$).

Acknowledgments

The authors would like to thank Jonathan Kelner, Yin Tat Lee, and Boaz Barak for helpful discussion. Part of this work was done while RF and AS were at Microsoft Research, New England, and another part done while AS was visiting the Simons Institute for the Theory of Computing, UC Berkeley. This work was partially supported by NSF awards and 09, NSF Graduate Research Fellowship grant no. 374).

References

A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. Technical report, arXiv, 2014.

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5), May 2012.

D. Anbar. On Optimal Estimation Methods Using Stochastic Approximation Procedures. University of California, 1971.

F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384-414, 2010.

F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Neural Information Processing Systems (NIPS), 2011.


More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Stochastic gradient descent on Riemannian manifolds

Stochastic gradient descent on Riemannian manifolds Stochastic gradient descent on Riemannian manifolds Silvère Bonnabel 1 Centre de Robotique - Mathématiques et systèmes Mines ParisTech SMILE Seminar Mines ParisTech Novembre 14th, 2013 1 silvere.bonnabel@mines-paristech

More information

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes A Piggybacing Design Framewor for Read-and Download-efficient Distributed Storage Codes K V Rashmi, Nihar B Shah, Kannan Ramchandran, Fellow, IEEE Department of Electrical Engineering and Computer Sciences

More information

Introduction: The Perceptron

Introduction: The Perceptron Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007

Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007 MIT OpenCourseWare http://ocw.mit.edu 18.409 Topics in Theoretical Computer Science: An Algorithmist's Toolkit Fall 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal Sub-Sampled Newton Methods for Machine Learning Jorge Nocedal Northwestern University Goldman Lecture, Sept 2016 1 Collaborators Raghu Bollapragada Northwestern University Richard Byrd University of Colorado

More information

1 Regression with High Dimensional Data

1 Regression with High Dimensional Data 6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:

More information

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 11 Luca Trevisan February 29, 2016 U.C. Berkeley CS294: Spectral Methods and Expanders Handout Luca Trevisan February 29, 206 Lecture : ARV In which we introduce semi-definite programming and a semi-definite programming relaxation of sparsest

More information

Inequalities Relating Addition and Replacement Type Finite Sample Breakdown Points

Inequalities Relating Addition and Replacement Type Finite Sample Breakdown Points Inequalities Relating Addition and Replacement Type Finite Sample Breadown Points Robert Serfling Department of Mathematical Sciences University of Texas at Dallas Richardson, Texas 75083-0688, USA Email:

More information

Gradient Descent. Dr. Xiaowei Huang

Gradient Descent. Dr. Xiaowei Huang Gradient Descent Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Three machine learning algorithms: decision tree learning k-nn linear regression only optimization objectives are discussed,

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Maximization of Submodular Set Functions

Maximization of Submodular Set Functions Northeastern University Department of Electrical and Computer Engineering Maximization of Submodular Set Functions Biomedical Signal Processing, Imaging, Reasoning, and Learning BSPIRAL) Group Author:

More information

arxiv: v4 [math.oc] 24 Apr 2017

arxiv: v4 [math.oc] 24 Apr 2017 Finding Approximate ocal Minima Faster than Gradient Descent arxiv:6.046v4 [math.oc] 4 Apr 07 Naman Agarwal namana@cs.princeton.edu Princeton University Zeyuan Allen-Zhu zeyuan@csail.mit.edu Institute

More information

DECENTRALIZED algorithms are used to solve optimization

DECENTRALIZED algorithms are used to solve optimization 5158 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 19, OCTOBER 1, 016 DQM: Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Aryan Mohtari, Wei Shi, Qing Ling,

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić August 15, 2015 Abstract We consider distributed optimization problems

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Stochastic and Adversarial Online Learning without Hyperparameters

Stochastic and Adversarial Online Learning without Hyperparameters Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

Introduction to Machine Learning (67577) Lecture 7

Introduction to Machine Learning (67577) Lecture 7 Introduction to Machine Learning (67577) Lecture 7 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Solving Convex Problems using SGD and RLM Shai Shalev-Shwartz (Hebrew

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Variable Metric Stochastic Approximation Theory

Variable Metric Stochastic Approximation Theory Variable Metric Stochastic Approximation Theory Abstract We provide a variable metric stochastic approximation theory. In doing so, we provide a convergence theory for a large class of online variable

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

Jim Lambers MAT 610 Summer Session Lecture 2 Notes

Jim Lambers MAT 610 Summer Session Lecture 2 Notes Jim Lambers MAT 610 Summer Session 2009-10 Lecture 2 Notes These notes correspond to Sections 2.2-2.4 in the text. Vector Norms Given vectors x and y of length one, which are simply scalars x and y, the

More information

Linear Regression (continued)

Linear Regression (continued) Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machine Learning Lecturer: Philippe Rigollet Lecture 3 Scribe: Mina Karzand Oct., 05 Previously, we analyzed the convergence of the projected gradient descent algorithm. We proved

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić February 7, 2017 Abstract We consider distributed optimization

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

Least Sparsity of p-norm based Optimization Problems with p > 1

Least Sparsity of p-norm based Optimization Problems with p > 1 Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from

More information