Stochastic Subgradient Methods


Lingjie Weng, Yutian Chen
Bren School of Information and Computer Science, University of California, Irvine
{wengl,

Abstract

Stochastic subgradient methods play an important role in machine learning. We introduce the concepts of subgradient methods and stochastic subgradient methods in this project, discuss their convergence conditions, and weigh their strengths and weaknesses against those of their competitors. We demonstrate the application of (stochastic) subgradient methods to machine learning with a running example of training support vector machines (SVMs) throughout this report.

1 Introduction

We have explored a variety of optimization methods in class (e.g., generalized descent methods, Newton's method, interior-point methods), all of which apply to differentiable functions. However, many functions that arise in practice may be non-differentiable at certain points. A common example is the absolute value function f(x) = |x|, or its multivariate extension, the \ell_1-norm f(x) = \|x\|_1 = \sum_{i=1}^n |x_i|. We therefore aim to discuss a generalization of the gradient to non-differentiable convex functions.

In many applications the exact subgradient can be difficult to calculate because of errors in measurements, uncertainty in the data, or because the computation is intractable. However, it is usually possible to obtain a noisy estimate of the subgradient. In this case we can use the noisy estimate in place of the true value in the subgradient method, which is then called the stochastic subgradient method. We will show in this report that many convergence results still apply to the stochastic version, and that in expectation the solution achieves the same order of convergence rate as the exact subgradient method, often with much less computation. The example of training SVMs with a stochastic subgradient method illustrates its advantage over deterministic counterparts. It also shows the close connection between stochastic optimization and machine learning theory. This part of the report is mostly based on the lecture notes of Stephen Boyd [1] and a tutorial on stochastic optimization at ICML 2010 [8].

2 Subgradient Methods

2.1 Definition

A vector g \in R^n is a subgradient of a convex function f at x if

    f(y) \ge f(x) + g^T (y - x)  for all y.   (1)

If f is convex and differentiable, then its gradient at x is a subgradient. But a subgradient can exist even when f is not differentiable at x, as illustrated in Figure 1. The same example shows that there can be more than one subgradient of a function f at a point x. The set of all subgradients at x is called the subdifferential, and is denoted by \partial f(x).
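For concreteness, a minimal Python sketch that picks one particular subgradient of the \ell_1-norm and numerically checks the defining inequality (1) at randomly sampled points (the helper names are hypothetical):

    import numpy as np

    def l1_subgradient(x):
        """One subgradient of f(x) = ||x||_1: g_i = sign(x_i) when x_i != 0,
        and any value in [-1, 1] (here 0) when x_i == 0."""
        return np.sign(x)

    def check_subgradient_inequality(f, g, x, n_trials=1000, scale=5.0):
        """Numerically verify f(y) >= f(x) + g^T (y - x) for randomly drawn y."""
        rng = np.random.default_rng(0)
        for _ in range(n_trials):
            y = x + scale * rng.standard_normal(x.shape)
            if f(y) < f(x) + g @ (y - x) - 1e-12:
                return False
        return True

    x = np.array([1.5, 0.0, -2.0])
    f = lambda v: np.abs(v).sum()
    print(check_subgradient_inequality(f, l1_subgradient(x), x))  # True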

Figure 1: At x_1, the convex function f is differentiable, and g_1 (the derivative of f at x_1) is the unique subgradient at x_1. At the point x_2, f is not differentiable; at this point f has many subgradients, two of which, g_2 and g_3, are shown. Taken from [1].

2.2 Convergence of subgradient methods

Subgradient methods are iterative methods for solving convex minimization problems. Let f : R^n \to R be a convex function with domain R^n. A classical subgradient method iterates

    x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)},   (2)

where x^{(k)} denotes the kth iterate, g^{(k)} denotes any subgradient of f at x^{(k)}, and \alpha_k > 0 is the kth step size. Thus, at each iteration of the subgradient method, we take a step in the direction of a negative subgradient. Recall that a subgradient of f at x is any vector g that satisfies the inequality f(y) \ge f(x) + g^T (y - x) for all y. It may happen that g^{(k)} is not a descent direction for f at x^{(k)}. We therefore maintain a running value f_{best} that keeps track of the lowest objective function value found so far. At each step we set

    f^{(k)}_{best} = \min\{ f^{(k-1)}_{best}, f(x^{(k)}) \}   (3)

and set i^{(k)}_{best} = k if f(x^{(k)}) = f^{(k)}_{best}, i.e., if x^{(k)} is the best point found so far. (In a descent method there is no need to do this, since the current point is always the best one so far.) Then the best objective value found in k iterations is

    f^{(k)}_{best} = \min\{ f(x^{(1)}), \ldots, f(x^{(k)}) \}.   (4)

Subgradient methods are guaranteed to converge to within some range of the optimal value under certain assumptions. We assume (1) there is a minimizer of f, say x^*; (2) the norm of the subgradients is bounded, i.e., there is a G such that \|g^{(k)}\|_2 \le G for all k, which is the case if f satisfies the Lipschitz condition

    |f(u) - f(v)| \le G \|u - v\|_2   (5)

for all u, v, because then \|g\|_2 \le G for any g \in \partial f(x) and any x; and (3) a number R is known that satisfies R \ge \|x^{(1)} - x^*\|_2. Under these three assumptions, the subgradient algorithm is guaranteed to converge to within some range of the optimal value, i.e., we have

    \lim_{k \to \infty} f^{(k)}_{best} - f^* < \epsilon,   (6)

where f^* denotes the optimal value of the function f. The number \epsilon is a function of the step size, so different step size rules give different convergence results. The following are five basic step size rules.

(a) Constant step size, \alpha_k = \alpha.

(b) Constant step length, \alpha_k = \gamma / \|g^{(k)}\|_2, which gives \|x^{(k+1)} - x^{(k)}\|_2 = \gamma.

(c) Square summable but not summable step sizes, i.e., any step sizes satisfying

    \alpha_k \ge 0,  \sum_{k=1}^{\infty} \alpha_k^2 < \infty,  \sum_{k=1}^{\infty} \alpha_k = \infty.   (7)

(d) Nonsummable diminishing step sizes, i.e., any step sizes satisfying

    \alpha_k \ge 0,  \lim_{k \to \infty} \alpha_k = 0,  \sum_{k=1}^{\infty} \alpha_k = \infty.   (8)

(e) Nonsummable diminishing step lengths, i.e.,

    \alpha_k = \gamma_k / \|g^{(k)}\|_2,   (9)

where \gamma_k \ge 0, \lim_{k \to \infty} \gamma_k = 0, and \sum_{k=1}^{\infty} \gamma_k = \infty.

For all five rules, the step sizes are determined off-line, before the method is iterated; they do not depend on preceding iterations. This off-line property of subgradient methods differs from the on-line step size rules used in descent methods for differentiable functions: many methods for minimizing differentiable functions satisfy Wolfe's sufficient conditions for convergence, where step sizes typically depend on the current point and the current search direction. Terminating the subgradient method when

    (R^2 + G^2 \sum_{i=1}^k \alpha_i^2) / (2 \sum_{i=1}^k \alpha_i) \le \epsilon

(the left-hand side is an upper bound on f^{(k)}_{best} - f^*) is really, really slow. In fact, up to the present, the subgradient method has suffered from the lack of a definite termination criterion, so subgradient methods are usually used for problems in which very high accuracy is not required.

2.3 Subgradient methods for the unconstrained problem

Let f : R^n \to R be a convex function. We seek to solve the unconstrained problem

    p^* = \min_x f(x),  x^* \in \arg\min_{x \in R^n} f(x).   (10)

For differentiable f the optimality condition reads

    f(x^*) = \min_{x \in R^n} f(x)  \Leftrightarrow  0 = \nabla f(x^*).   (11)

For non-differentiable f the optimality condition generalizes to

    f(x^*) = \min_{x \in R^n} f(x)  \Leftrightarrow  0 \in \partial f(x^*),   (12)

which follows from the definition of the subgradient: f(x) \ge f(x^*) + 0^T (x - x^*) for all x \in R^n. The pseudocode of the subgradient method for an unconstrained convex function f is shown in Algorithm 1.

Algorithm 1 Subgradient method for minimizing unconstrained f(x)
    Initialize x^{(0)}, k = 0
    repeat
        compute a subgradient g^{(k)} \in \partial f(x^{(k)})
        x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}
    until termination condition is satisfied
x^{(k)}: kth iterate, \alpha_k > 0: kth step size, g^{(k)}: any subgradient of f at x^{(k)}.

Let us take the L_2-norm SVM learning problem as an example. Consider the primal form of the SVM:

    \min_{w, \xi}  \|w\|^2 + C \sum_{i=1}^m \xi_i
    s.t.  \langle w, x_i \rangle y_i \ge 1 - \xi_i,  i = 1, \ldots, m,
          \xi_i \ge 0,  i = 1, \ldots, m.   (13)
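A minimal Python sketch of Algorithm 1, with the best-point bookkeeping of equation (4), a square-summable-but-not-summable step size, and a toy \ell_1 objective chosen purely for illustration:

    import numpy as np

    def subgradient_method(subgrad, f, x0, step_size, n_iters=1000):
        """Basic subgradient method (Algorithm 1) with best-point tracking.
        subgrad(x) returns any subgradient of f at x; step_size(k) returns alpha_k."""
        x = np.asarray(x0, dtype=float)
        x_best, f_best = x.copy(), f(x)
        for k in range(1, n_iters + 1):
            x = x - step_size(k) * subgrad(x)
            if f(x) < f_best:                        # not a descent method, so keep
                x_best, f_best = x.copy(), f(x)      # the best iterate seen so far
        return x_best, f_best

    # Example: minimize the nondifferentiable f(x) = ||x - c||_1 with alpha_k = 1/k.
    c = np.array([1.0, -2.0, 3.0])
    x_best, f_best = subgradient_method(lambda x: np.sign(x - c),
                                        lambda x: np.abs(x - c).sum(),
                                        np.zeros(3), lambda k: 1.0 / k)
    print(x_best, f_best)   # x_best is close to c, f_best close to 0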

Table 1: Subgradients of the function F(w, \xi).

Figure 2: The value of f^{(t)}_{best} - f^* versus iteration number t, panels (a) and (b). Taken from [10].

Note that \xi_i \ge \max\{0, 1 - \langle w, x_i \rangle y_i\} and the equality is attained at the optimum, i.e., we can set

    \xi_i = \max\{0, 1 - \langle w, x_i \rangle y_i\} = l(\langle w, x_i \rangle y_i)   (14)

(which is usually called the hinge loss). We can therefore reformulate the problem as the unconstrained convex problem

    \min_w  \|w\|^2 + C \sum_{i=1}^m l(\langle w, x_i \rangle y_i).   (15)

The first term is differentiable and the second term is a point-wise maximum, so this formulation fits the subgradient framework. The corresponding subgradients are shown in Table 1, where b is the coefficient of the constant term of x.

For this L_2-norm SVM learning problem, we considered the constant step length rule \alpha_t = \gamma / \|g^{(t)}\|_2 and the diminishing step size rule \alpha_t = \gamma / \sqrt{t}. The objective converges to a suboptimal value, \lim_{t \to \infty} f^{(t)}_{best} - f^* \le \gamma G / 2, for the first step size rule, while it converges to the optimal value, \lim_{t \to \infty} f^{(t)}_{best} - f^* = 0, for the second. Figures 2(a) and 2(b) show the convergence of f^{(t)}_{best} - f^* for different values of \gamma. The figures reveal a trade-off: a larger \gamma gives faster convergence, but a larger final suboptimality. The non-monotonicity and slow convergence of f(x^{(t)}) are also clearly visible in these plots.
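To make these subgradients concrete, a minimal sketch assuming a data matrix X whose rows are the x_i, labels y in {-1, +1}, and no bias term:

    import numpy as np

    def svm_objective(w, X, y, C):
        """Unconstrained L2-norm SVM objective (15): ||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)."""
        margins = y * (X @ w)
        return w @ w + C * np.maximum(0.0, 1.0 - margins).sum()

    def svm_subgradient(w, X, y, C):
        """One subgradient of (15): 2w - C * sum over margin-violating examples of y_i x_i.
        For examples with margin exactly 1, any value between 0 and -y_i x_i is valid;
        here we simply take 0 (hence the strict inequality below)."""
        margins = y * (X @ w)
        violated = margins < 1.0
        return 2.0 * w - C * (y[violated, None] * X[violated]).sum(axis=0)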

Figure 3: Constrained SVM: the value of f^{(t)}_{best} - f^* versus iteration number t, panels (a) and (b). Taken from [10].

2.4 Subgradient methods for the constrained problem

One extension of the subgradient method is the projected subgradient method, which solves the constrained optimization problem

    minimize  f(x)  subject to  x \in C,   (16)

where C is a convex set. The projected subgradient method uses the iteration

    x^{(k+1)} = P( x^{(k)} - \alpha_k g^{(k)} ),   (17)

where P is the (Euclidean) projection onto C. The pseudocode of the projected subgradient method for a constrained convex function f is shown in Algorithm 2.

Algorithm 2 Subgradient method for minimizing constrained f(x)
    Initialize x^{(0)}, k = 0
    repeat
        compute a subgradient g^{(k)} \in \partial f(x^{(k)})
        \tilde{x}^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}
        x^{(k+1)} = P( \tilde{x}^{(k+1)} )
    until termination condition is satisfied
x^{(k)}: kth iterate, \alpha_k > 0: kth step size, g^{(k)}: any subgradient of f at x^{(k)}.

Consider the same SVM learning example,

    \min_w  \|w\|^2 + C \sum_{i=1}^m l(\langle w, x_i \rangle y_i).   (18)

An alternative formulation of the same problem is

    \min_w  \frac{1}{m} \sum_{i=1}^m l(\langle w, x_i \rangle y_i)   s.t.  \|w\| \le B;   (19)

note that the regularization constant C > 0 is replaced here by a less abstract constant B > 0. To implement the projected subgradient method we need only a projection operator,

    P(w) = w  if \|w\| \le B,  and  P(w) = B w / \|w\|  otherwise,   (20)

so after the update w := w - \alpha_k g^{(k)} we project w back onto the ball whenever \|w\| > B.

For the reformulated L_2-norm SVM learning problem, we also plotted the convergence results for the constant step length rule and the diminishing step size rule. Figures 3(a) and 3(b) show the convergence of f^{(t)}_{best} - f^* for different values of \gamma. Comparing these two plots with Figures 2(a) and 2(b), respectively, we found that this approach converges much faster.
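A minimal sketch of Algorithm 2 applied to problem (19), under the same assumptions as before (data matrix X with rows x_i, labels y in {-1, +1}, no bias term):

    import numpy as np

    def project_l2_ball(w, B):
        """Euclidean projection of w onto the ball {w : ||w||_2 <= B}, as in (20)."""
        norm = np.linalg.norm(w)
        return w if norm <= B else (B / norm) * w

    def projected_subgradient_svm(X, y, B, step_size, n_iters=1000):
        """Projected subgradient method (Algorithm 2) for problem (19):
        minimize the average hinge loss subject to ||w||_2 <= B."""
        m, d = X.shape
        w = np.zeros(d)
        for k in range(1, n_iters + 1):
            margins = y * (X @ w)
            violated = margins < 1.0
            g = -(y[violated, None] * X[violated]).sum(axis=0) / m  # subgradient of the average hinge loss
            w = project_l2_ball(w - step_size(k) * g, B)
        return w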

The subgradient method can also be extended to solve the general inequality constrained problem

    minimize  f_0(x)  subject to  f_i(x) \le 0,  i = 1, \ldots, m,   (21)

where the f_i are convex. The algorithm takes the same form,

    x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)},   (22)

where \alpha_k > 0 is a step size and g^{(k)} is a subgradient of the objective or of one of the constraint functions at x^{(k)}. More specifically, we take

    g^{(k)} \in \partial f_0(x^{(k)})  if f_i(x^{(k)}) \le 0 for all i = 1, \ldots, m,
    g^{(k)} \in \partial f_j(x^{(k)})  for some j such that f_j(x^{(k)}) > 0, otherwise,   (23)

where \partial f denotes the subdifferential of f. In other words: if the current point is feasible, the algorithm uses an objective subgradient, as if the problem were unconstrained; if the current point is infeasible, the algorithm chooses a subgradient of any violated constraint. In this generalized version of the subgradient algorithm, the iterates can be (and often are) infeasible. In contrast, the iterates of the projected subgradient method (and the basic subgradient algorithm) are always feasible. So when the current point is infeasible, we can use an over-projection step length, as we would when solving convex inequalities. If we know (or can estimate) f^*, we can use Polyak's step length when the current point is feasible. Thus our step lengths are chosen as

    \alpha_k = (f_0(x^{(k)}) - f^*) / \|g^{(k)}\|_2^2   if x^{(k)} is feasible,
    \alpha_k = (f_i(x^{(k)}) + \epsilon) / \|g^{(k)}\|_2^2   if x^{(k)} is infeasible,   (24)

where \epsilon is a small positive margin and i is the index of the most violated inequality in the case where x^{(k)} is infeasible.

3 Stochastic Subgradient Methods

When the subgradient of the objective function is difficult to compute exactly, for reasons such as errors in measurements or intractability of the computation, we can use a noisy estimate of the subgradient for optimization. With a proper choice of the step size, we can guarantee convergence with probability 1, or even stronger conclusions. Stochastic methods also apply when the objective function itself is difficult to compute exactly. In the following subsections we introduce the stochastic subgradient descent (SD) method, its convergence conditions, and stochastic programming.

3.1 Stochastic Subgradient Methods

For a convex function f : R^n \to R, \tilde{g} is called a noisy (unbiased) subgradient of f at x \in dom f if g = E \tilde{g} \in \partial f(x), i.e.,

    f(z) \ge f(x) + (E \tilde{g})^T (z - x)  for all z \in dom f.   (25)

If x is also a random variable, then we say that \tilde{g} is a noisy subgradient of f at x (which is random) if E(\tilde{g} | x) \in \partial f(x). We can consider the noisy subgradient as the sum of g and a random variable with zero mean. The noise can come from error in computing the true subgradient, error in Monte Carlo evaluation of a function defined as an expected value, or measurement error.

The stochastic subgradient method is almost the same as the regular subgradient method, except that it uses a noisy estimate instead of the true subgradient and allows a more limited set of step size rules. The pseudocode of the stochastic subgradient method for minimizing a convex function f without constraints is shown in Algorithm 3. As in the regular subgradient method, f(x^{(k)}) can increase during the algorithm, so we keep track of the best point found so far and the associated function value

    f^{(k)}_{best} = \min\{ f(x^{(1)}), \ldots, f(x^{(k)}) \}.   (26)
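A toy sketch of a noisy unbiased subgradient in the sense of (25), obtained here by adding zero-mean Gaussian noise to a true subgradient of f(x) = \|x - c\|_1 and plugging it into the subgradient iteration with \alpha_k = 1/k (the objective and the noise model are chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    c = np.array([1.0, -2.0, 3.0])

    def noisy_subgradient(x, sigma=0.5):
        """Noisy unbiased subgradient of f(x) = ||x - c||_1: the true subgradient
        sign(x - c) plus zero-mean Gaussian noise, so E[g_tilde] is in the subdifferential."""
        return np.sign(x - c) + sigma * rng.standard_normal(x.shape)

    x = np.zeros(3)
    for k in range(1, 5001):
        x = x - (1.0 / k) * noisy_subgradient(x)   # alpha_k = 1/k
    print(x)   # close to the minimizer c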

Algorithm 3 Stochastic subgradient method for minimizing f(x)
    Initialize x^{(0)}, k = 0
    repeat
        obtain a noisy subgradient \tilde{g}^{(k)} of f at x^{(k)}
        x^{(k+1)} = x^{(k)} - \alpha_k \tilde{g}^{(k)}
    until termination condition is satisfied
x^{(k)}: kth iterate, \alpha_k > 0: kth step size, \tilde{g}^{(k)}: noisy subgradient of f at x^{(k)}.

We now show a very basic convergence result for the stochastic subgradient method. Assume there is an x^* that minimizes f, and a G for which E \|\tilde{g}^{(k)}\|_2^2 \le G^2 for all k. We also assume that B satisfies E \|x^{(1)} - x^*\|_2^2 \le B^2. It can be proved that

    E f^{(k)}_{best} - f^*  \le  (B^2 + G^2 \sum_{i=1}^k \alpha_i^2) / (2 \sum_{i=1}^k \alpha_i),   (27)

so we get convergence in expectation,

    \lim_{k \to \infty} E f^{(k)}_{best} = f^*,   (28)

for various step size rules, such as square-summable-but-not-summable sequences (e.g., \alpha_k = 1/k) and nonsummable diminishing sequences (e.g., \alpha_k = 1/\sqrt{k}). Using Markov's inequality, we obtain convergence in probability, that is, for any \epsilon > 0,

    \lim_{k \to \infty} Prob( f^{(k)}_{best} \ge f^* + \epsilon ) = 0.   (29)

More sophisticated methods can be used to show almost sure convergence.

3.2 Applying SD to SVM

Let us again take the SVM learning problem as an example. Consider the norm-constrained formulation of the SVM:

    \min_{\|w\|_2 \le B} F(w) = \min_{\|w\|_2 \le B} \frac{1}{m} \sum_{i=1}^m l(\langle w, x_i \rangle y_i).   (30)

The objective function F(w) is the average hinge loss over the training set. The regular subgradient descent algorithm, referred to as the batch subgradient method, computes the subgradient as the average of the subgradients of the loss function over every data point, i.e.,

    g^{(k)} = \frac{1}{m} \sum_{i=1}^m \partial_w l(\langle w^{(k)}, x_i \rangle y_i),   (31)

and then projects the new parameter vector back onto the ball of radius B at each step, as w^{(k+1)} := B w^{(k+1)} / \|w^{(k+1)}\|, if the new vector falls outside the ball.

In contrast, we consider the stochastic version, Algorithm 3, with the extra projection step above. At each iteration, the stochastic subgradient is computed by randomly drawing a sample from the data points, x_{i(k)}, and computing the noisy subgradient as

    \tilde{g}^{(k)} = \partial_w l(\langle w^{(k)}, x_{i(k)} \rangle y_{i(k)}).   (32)

It is easy to see that E \tilde{g}^{(k)} = g^{(k)}. In this example, we choose the step size rule \alpha_k = B / (G \sqrt{k}), and the final solution is computed as \bar{w} = \frac{1}{k} \sum_{j=1}^k w^{(j)}. It can be proved that for batch subgradient descent with such a choice of step size, we achieve the suboptimality

    F(w^{(k)}) - F(w^*) \le O( G B / \sqrt{k} ),   (33)

where w^* is the optimal solution.
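A minimal sketch of this stochastic scheme, combining the single-sample update (32), the projection onto the ball \|w\|_2 \le B, the step size \alpha_k = B/(G\sqrt{k}), and the averaged output \bar{w} (same data assumptions as before):

    import numpy as np

    def sd_svm(X, y, B, G, n_iters=10000, seed=0):
        """Stochastic subgradient method for problem (30): draw one random example per
        step, take a hinge-loss subgradient step (32) with alpha_k = B / (G * sqrt(k)),
        project back onto {w : ||w||_2 <= B}, and return the averaged iterate."""
        rng = np.random.default_rng(seed)
        m, d = X.shape
        w = np.zeros(d)
        w_avg = np.zeros(d)
        for k in range(1, n_iters + 1):
            i = rng.integers(m)
            g = -y[i] * X[i] if y[i] * (X[i] @ w) < 1.0 else np.zeros(d)
            w = w - (B / (G * np.sqrt(k))) * g
            norm = np.linalg.norm(w)
            if norm > B:
                w = (B / norm) * w                 # projection onto the norm ball
            w_avg += (w - w_avg) / k               # running average of the iterates
        return w_avg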

This is the best possible convergence rate using only F(w) and \partial F(w). For SD we can get the same guarantee in expectation of the objective function:

    E[ F(\bar{w}^{(k)}) ] - F(w^*) \le O( G B / \sqrt{k} ).   (34)

Taking into consideration that the batch method takes m times as many operations as the stochastic method for the same guarantee on the error, we come to an intuitive conclusion: if we are only taking simple gradient steps, it is better to be stochastic. Formally speaking, to guarantee an error bound of \epsilon in F(w), the runtime of SD is O( (G^2 B^2 / \epsilon^2) d ), while the runtime of the batch subgradient method is O( (G^2 B^2 / \epsilon^2) m d ), where d is the dimensionality of the feature vector. The stochastic method achieves the optimal runtime among methods that only use gradients and only assume the Lipschitz condition. The difference in speed is significant for high dimensional problems.

We should notice that the error bound of the stochastic method is defined in expectation. The convergence rate as a function of update steps for individual realizations of the method is usually slower than for the batch method, due to the noise; the slow convergence acts as a smoothing factor that reduces the variance in the stochastic subgradients. We could consider a mixed method that draws a subset of the data points from {x_i} and computes the average of their subgradients as the stochastic subgradient. This method, usually called the mini-batch method, provides a tradeoff between the computational burden at each step and the variance of the estimate (which is relevant to the convergence rate). Also, the runtime of SD is optimal only among first order methods. Second order methods, such as Newton's method, can use much fewer steps than SD for the same error bound, although they may take much more time per step.

3.3 Stochastic Programming

We have discussed in the previous subsections the situation where the subgradient is difficult or expensive to compute and we use a noisy estimate to optimize the objective function. Now we turn to the case where the objective function itself is difficult to compute. A stochastic programming problem has the form

    minimize  E f_0(x, \omega)  subject to  E f_i(x, \omega) \le 0,  i = 1, \ldots, m,   (35)

where x \in R^n is the optimization variable and \omega is a random variable. If f_i(x, \omega) is convex in x for each \omega, the problem is a convex stochastic programming problem; in this case the objective and constraint functions are convex.

Stochastic programming can be used to model a variety of robust design or decision problems with uncertain data. Although the basic form above involves only expectations or average values, some tricks can be used to capture other measures of the probability distributions of f_i(x, \omega). We can replace an objective or constraint term E f(x, \omega) with E \Phi(f(x, \omega)), where \Phi is a convex increasing function. For example, with \Phi(u) = \max(u, 0), we can form a constraint of the form

    E f_i(x, \omega)_+ \le \epsilon,   (36)

where \epsilon is a positive parameter and (\cdot)_+ denotes the positive part. Here E f_i(x, \omega)_+ has a simple interpretation as the expected violation of the ith constraint. It is also possible to combine the constraints, using a single constraint of the form

    E \max( f_1(x, \omega)_+, \ldots, f_m(x, \omega)_+ ) \le \epsilon,   (37)

whose left-hand side can be interpreted as the expected worst violation (over all constraints).

Problem (35) is associated with another optimization problem that replaces the random variable by its expected value, which is sometimes called the certainty equivalent of the original stochastic program,

    minimize  f_0(x, E\omega)  subject to  f_i(x, E\omega) \le 0,  i = 1, \ldots, m.   (38)

By Jensen's inequality, we get

    E f_i(x, \omega) \ge f_i(x, E\omega).   (39)

This tells us that the constraint set of the certainty equivalent problem is larger than that of the original stochastic problem (35), and its objective is smaller. It follows that the optimal value of the certainty equivalent problem gives a lower bound on the optimal value of the stochastic problem (35). (It can be a poor bound, of course.)
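A tiny numerical check of the Jensen bound (39), using one convex function and one distribution chosen purely for illustration:

    import numpy as np

    # f(x, w) = max(0, x + w) is convex; take x fixed and w ~ N(0, 1), so E[w] = 0.
    rng = np.random.default_rng(0)
    x = 0.3
    omega = rng.standard_normal(100000)
    lhs = np.maximum(0.0, x + omega).mean()   # Monte Carlo estimate of E f(x, omega)
    rhs = max(0.0, x + 0.0)                   # f(x, E omega), the certainty equivalent value
    print(lhs, rhs, lhs >= rhs)               # the expectation dominates, as in (39)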

The objective function f(x) = \int F(x, \omega) p(\omega) d\omega is usually not easy to compute exactly, especially for a function with a high dimensional variable. However, in many applications we can approximately compute f using Monte Carlo methods. Assume we generate M independent identically distributed (iid) samples \omega_1, \ldots, \omega_M. We can compute an estimate of f as

    \hat{f}(x) = \frac{1}{M} \sum_{i=1}^M F(x, \omega_i).   (40)

Furthermore, if we calculate a subgradient \partial_x F(x, \omega_i) for each sample \omega_i, it can be proved that their average, \tilde{g}(x) = \frac{1}{M} \sum_{i=1}^M \partial_x F(x, \omega_i), gives an unbiased noisy estimate of a subgradient of f(x), i.e.,

    E \tilde{g}(x) = E \partial_x F(x, \omega) \in \partial_x E F(x, \omega) = \partial f(x).   (41)

Therefore, we can use the stochastic subgradient descent method described in subsection 3.1 to solve the optimization problem (35). This result is independent of M; we can even take M = 1. In this case \tilde{g} = \partial_x F(x, \omega_1); in other words, we simply generate one sample \omega_1 and use a subgradient of F for that value of \omega. In this case \tilde{g} could hardly be called a good approximation of a subgradient of f, but its mean is a subgradient, so it is a valid noisy unbiased subgradient. On the other hand, we can take M to be large. In this case \tilde{g} is a random vector with mean g and very small variance, i.e., it is a good estimate of g, a subgradient of f at x.

3.4 Applying Stochastic Programming to SVM

Stochastic programming is closely related to another training setting in machine learning: minimizing the generalization error. Since the purpose of machine learning is to learn a model that generalizes to unobserved data, it makes more sense to minimize the generalization error E_p[l(w, z)] rather than the error on the empirical distribution, \frac{1}{M} \sum_{i=1}^M l(w, z_i) (sometimes called the empirical risk), where p is the true distribution of the data z, and z_i is the ith of the M data items in the training set. In fact, optimizing the empirical risk often leads to over-fitting on the training set. From this perspective, we would like to formulate the machine learning problem in the generic learning setting [9] (taking SVM training as an example again) as

    \min_{\|w\|_2 \le B} F(w) = \min_{\|w\|_2 \le B} E_z f(w, z)   (42)
                             = \min_{\|w\|_2 \le B} E_{(x,y)} l(\langle w, x \rangle y).   (43)

In general, training a model by minimizing the generalization error (42) is a stochastic programming problem. Since the loss function l of the SVM is convex, problem (43) is also a convex stochastic programming problem.

We compare two approaches to problem (43). One approach is called sample average approximation (SAA) [4, 7, 5]. When the mth sample, (x_m, y_m), is collected, SAA optimizes the empirical loss

    \min_w \hat{F}(w) = \min_w \frac{1}{m} \sum_{i=1}^m l(\langle w, x_i \rangle y_i)   (44)

by any optimization method, such as SD, batch subgradient descent, or Newton's method. This corresponds to the empirical risk minimization method we mentioned above if we consider the set of samples as the training set. Analysis of the error bound for SAA is typically based on the uniform concentration theorem. The other approach is called stochastic approximation (SA) [6]. SA uses the simple stochastic subgradient descent method to directly minimize F(w). Each time a new sample is collected, a noisy subgradient is computed as \tilde{g}^{(k)} = \partial_w l(\langle w^{(k)}, x_k \rangle y_k).
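A minimal sketch of the Monte Carlo subgradient estimator of (40)-(41), with hypothetical helpers sample_omega and subgrad_F; M = 1 recovers the single-sample update used by SA above:

    import numpy as np

    def monte_carlo_subgradient(subgrad_F, x, sample_omega, M, rng=None):
        """Unbiased noisy subgradient of f(x) = E_omega[F(x, omega)], as in (41):
        average subgradients of F(x, .) over M iid samples of omega. M = 1 gives a
        very noisy but still valid estimate; a larger M reduces the variance."""
        rng = np.random.default_rng() if rng is None else rng
        omegas = [sample_omega(rng) for _ in range(M)]
        return sum(subgrad_F(x, w) for w in omegas) / M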

Figure 4: Illustration of the relationships between the initialization w^{(0)}, the outputs of SAA/SA, \hat{w} / \bar{w}^{(m)}, and the optimal solution w^*. Taken from [8].

Then w^{(k)} is updated based on that weak estimate of a subgradient of F(w^{(k)}). Since this is a typical stochastic subgradient descent method applied to a convex optimization problem, the conclusions about convergence rates for general stochastic methods also apply here.

We state the conclusions for SAA and SA below without proofs. Let \hat{w} be the optimal solution of problem (44). By the uniform concentration theorem we can find an upper bound on the error, L(\hat{w}) - L(w^*) \le O( \sqrt{ G^2 B^2 / m } ). Thus, to guarantee L( \hat{w}^{(m)}_{SAA} ) - L(w^*) \le \epsilon, we need a sample of size m = \Omega( G^2 B^2 / \epsilon^2 ), which leads to the conclusion that the runtime needed to guarantee an error of at most \epsilon is \Omega( m d ) = \Omega( (G^2 B^2 / \epsilon^2) d ). Notice that this result applies to any optimization algorithm.

In contrast, for SA, after one pass over the m data items the generalization error of \bar{w}^{(m)}_{SA} is upper bounded by L( \bar{w}^{(m)}_{SA} ) \le L(w^*) + O( \sqrt{ G^2 B^2 / m } ). So to guarantee that L( \bar{w}^{(m)}_{SA} ) - L(w^*) \le \epsilon, the sample size would be m = O( G^2 B^2 / \epsilon^2 ), and consequently the runtime for an error gap of \epsilon is O( m d ) = O( (G^2 B^2 / \epsilon^2) d ).

SA guarantees the best possible generalization error with the optimal runtime in theory. The relationship between the outputs of SAA/SA and the optimal solution is shown in Figure 4. However, we should notice that SA achieves the optimal generalization error only up to a constant factor, and in practice SAA still seems to be better than SA in terms of generalization error. Nevertheless, stochastic gradient descent is usually considered the fastest algorithm for solving many machine learning problems.

4 Discussion

The subgradient method is very simple and general: it applies directly to nondifferentiable f, and any subgradient from \partial f(x^{(k)}) is sufficient. It is guaranteed to converge for properly selected step sizes. However, compared with the ordinary gradient method, convergence can be much slower, as it usually requires a lot of iterations. There is no monotonic convergence, since it is not a descent method, so a line search cannot be used to find the optimal \alpha_k. Besides, there is no good stopping criterion.

The subgradient method is part of a large class of methods known as first-order methods, which are based on subgradients or closely related objects. In contrast, second-order methods (like interior-point methods or Newton's method) typically use Hessian information; each iteration is expensive (both in flops and memory), but convergence is obtained very quickly. There are several general approaches that can be used to speed up subgradient methods. Localization methods such as cutting-plane and ellipsoid methods are typically much faster than basic subgradient methods. Another general approach is to base the update on some conic combination of previously evaluated subgradients. For example, in bundle methods, the update direction is found as the least-norm convex combination of some previous subgradients. Furthermore, the convergence can be improved by keeping a memory of past steps,

    x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} + \beta_k ( x^{(k)} - x^{(k-1)} ),   (45)

where \alpha_k and \beta_k are positive. We can interpret the second term as a memory term.
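A minimal sketch of the update (45) with a memory term; alpha and beta are assumed to be callables returning the positive sequences \alpha_k and \beta_k:

    import numpy as np

    def subgradient_with_memory(subgrad, x0, alpha, beta, n_iters=1000):
        """Subgradient method with a memory term, equation (45):
        x_{k+1} = x_k - alpha_k * g_k + beta_k * (x_k - x_{k-1})."""
        x_prev = np.asarray(x0, dtype=float)
        x = x_prev.copy()
        for k in range(1, n_iters + 1):
            x_next = x - alpha(k) * subgrad(x) + beta(k) * (x - x_prev)
            x_prev, x = x, x_next
        return x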

Such algorithms have state, whereas the basic subgradient method is stateless. Conjugate gradient methods have a similar form.

The stochastic variant of the subgradient method relaxes the conditions for applying the subgradient method: exact computation of the subgradient is no longer necessary. This usually slows down the convergence as a function of update steps, due to the noise in the descent direction, but it can save a lot of time at each step because a noisy estimate can be easy to obtain. Moreover, achieving the same order of convergence rate as the deterministic descent methods, whether in minimizing the empirical error or the generalization error, makes SD a very competitive tool in machine learning. Stochastic learning is in fact closely related to online learning, apart from a few differences such as the objective function (online learning minimizes the empirical regret) and the assumption on the observations (adversarial vs. iid) [8]. There are also some other learning algorithms that enjoy similar or even better advantages than SD, including stochastic coordinate descent [3] and stochastic second order methods [2].

References

[1] S. Boyd. Stochastic subgradient methods. Lecture slides and notes for EE364b, Stanford University, Spring quarter.

[2] R. Byrd, G. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in unconstrained optimization. Technical report, Northwestern University.

[3] C. Hsieh, K. Chang, C. Lin, S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning. ACM.

[4] A. J. Kleywegt, A. Shapiro, and T. H. de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12, February.

[5] E. Plambeck, B. Fu, S. Robinson, and R. Suri. Sample-path optimization of convex stochastic performance functions. Mathematical Programming, 75(2).

[6] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3).

[7] R. Y. Rubinstein and A. Shapiro. Optimization of static simulation models by the score function method. Math. Comput. Simul., 32, October.

[8] N. Srebro and A. Tewari. Stochastic optimization: ICML 2010 tutorial, July.

[9] V. Vapnik. The Nature of Statistical Learning Theory.

[10] F. Vojtech. A short tutorial on subgradient methods.
