Some Relevant Topics in Optimization

Some Relevant Topics in Optimization Stephen Wright University of Wisconsin-Madison IPAM, July 2012 Stephen Wright (UW-Madison) Optimization IPAM, July 2012 1 / 118

1 Introduction/Overview 2 Gradient Methods 3 Stochastic Gradient Methods for Convex Minimization 4 Sparse and Regularized Optimization 5 Decomposition Methods 6 Augmented Lagrangian Methods and Splitting Stephen Wright (UW-Madison) Optimization IPAM, July 2012 2 / 118

Introduction
Learning from data leads naturally to optimization formulations. Typical ingredients of a learning problem include:
- A collection of training data, from which we want to learn to make inferences about future data.
- A parametrized model, whose parameters can in principle be determined from training data + prior knowledge.
- An objective that captures prediction errors on the training data and deviation from prior knowledge or desirable structure.
Other typical properties of learning problems are a huge underlying data set, and a requirement for solutions of only low-medium accuracy.
Formulation as an optimization problem can be difficult and controversial. However, there are several important paradigms in which the issue is well settled (e.g. Support Vector Machines, Logistic Regression, Recommender Systems).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 3 / 118

Optimization Formulations
There is a wide variety of optimization formulations for machine learning problems, but several common issues and structures arise in many cases.
Imposing structure: can include regularization functions in the objective or constraints, e.g. ‖x‖_1 to induce sparsity in the vector x; the nuclear norm ‖X‖_* (sum of singular values) to induce low rank in X.
Objective: can be derived from Bayesian statistics + maximum likelihood criterion. Can incorporate prior knowledge.
Objectives f have distinctive properties in several applications:
- Partially separable: f(x) = Σ_{e ∈ E} f_e(x_e), where each x_e is a subvector of x, and each term f_e corresponds to a single item of data.
- Sometimes possible to compute subvectors of the gradient ∇f at proportionately lower cost than the full gradient.
These two properties are often combined: in partially separable f, the subvector x_e is often small.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 4 / 118

Examples: Partially Separable Structure
1. SVM with hinge loss:
   f(w) = C Σ_{i=1}^N max(1 − y_i (w^T x_i), 0) + (1/2)‖w‖²,
where the variable vector w contains feature weights, the x_i are feature vectors, y_i = ±1 are labels, and C > 0 is a parameter.
2. Matrix completion. Given a k × n matrix M with entries (u, v) ∈ E specified, seek L (k × r) and R (n × r) such that M ≈ L R^T:
   min_{L,R} Σ_{(u,v) ∈ E} { (L_u R_v^T − M_{uv})² + µ_u ‖L_u‖_F² + µ_v ‖R_v‖_F² }.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 5 / 118

Examples: Partially Separable Structure
3. Regularized logistic regression (2 classes):
   f(w) = (1/N) Σ_{i=1}^N log(1 + exp(−y_i w^T x_i)) + µ‖w‖_1.
4. Logistic regression (M classes): y_ij = 1 if data point i is in class j; y_ij = 0 otherwise. w_[j] is the subvector of w for class j.
   f(w) = −(1/N) Σ_{i=1}^N [ Σ_{j=1}^M y_ij (w_[j]^T x_i) − log( Σ_{j=1}^M exp(w_[j]^T x_i) ) ] + Σ_{j=1}^M ‖w_[j]‖_2².
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 6 / 118
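
The following NumPy sketch is my own illustration (not from the slides): it evaluates the two-class regularized logistic-regression objective above and the gradient of its smooth part, which takes the form −(1/N) X^T u with u_i = y_i / (1 + exp(y_i w^T x_i)) used on the next slide. The names X, y, w, mu and the synthetic data are assumptions.

    import numpy as np

    def logistic_objective(w, X, y, mu):
        # f(w) = (1/N) sum_i log(1 + exp(-y_i w^T x_i)) + mu * ||w||_1
        margins = y * (X @ w)                        # y_i * w^T x_i, shape (N,)
        return np.mean(np.log1p(np.exp(-margins))) + mu * np.sum(np.abs(w))

    def logistic_grad_smooth(w, X, y):
        # gradient of the smooth part: -(1/N) X^T u, with u_i = y_i / (1 + exp(y_i w^T x_i))
        margins = y * (X @ w)
        u = y / (1.0 + np.exp(margins))
        return -(X.T @ u) / X.shape[0]

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    y = np.sign(X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200))
    w = np.zeros(10)
    print(logistic_objective(w, X, y, mu=0.01), np.linalg.norm(logistic_grad_smooth(w, X, y)))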

Examples: Partial Gradient Structure
1. Dual, nonlinear SVM:
   min_α (1/2) α^T K α − 1^T α   s.t. 0 ≤ α ≤ C·1, y^T α = 0,
where K_ij = y_i y_j k(x_i, x_j), with k(·,·) a kernel function. Subvectors of the gradient Kα − 1 can be updated and maintained economically.
2. Logistic regression (again): the gradient of the log-likelihood function is (1/N) X^T u, where
   u_i = y_i / (1 + exp(y_i w^T x_i)), i = 1, 2, ..., N.
If w is sparse, it may be cheap to evaluate u, which is dense. Then, evaluation of the partial gradient [∇f(x)]_G may be cheap.
Partitioning of x may also arise naturally from problem structure, parallel implementation, or administrative reasons (e.g. decentralized control). (Block) Coordinate Descent methods that exploit this property have been successful. (More tomorrow.)
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 7 / 118

Batch vs Incremental
Considering the partially separable form
   f(x) = Σ_{e ∈ E} f_e(x_e),
the size |E| of the training set can be very large. Practical considerations, and differing requirements for solution accuracy, lead to a fundamental divide in algorithmic strategy.
Incremental: Select a single e at random, evaluate ∇f_e(x_e), and take a step in this direction. (Note that E(∇f_e(x_e)) = |E|^{-1} ∇f(x).) Stochastic Approximation (SA).
Batch: Select a subset of data Ẽ ⊂ E, and minimize the function f̃(x) = Σ_{e ∈ Ẽ} f_e(x_e). Sample-Average Approximation (SAA).
Minibatch is a kind of compromise: aggregate the e into small groups, consisting of 10 or 100 individual terms, and apply incremental algorithms to the redefined summation. (Gives lower-variance gradient estimates.)
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 8 / 118
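
The unbiasedness claim in parentheses is easy to check numerically. This small sketch (my own; the finite sum f(x) = Σ_e 0.5 (a_e^T x − b_e)² and all names are illustrative) compares the full gradient with averaged single-term and minibatch estimates.

    import numpy as np

    rng = np.random.default_rng(1)
    A, b = rng.standard_normal((500, 20)), rng.standard_normal(500)
    x = rng.standard_normal(20)

    full_grad = A.T @ (A @ x - b)            # gradient of f(x) = sum_e 0.5*(a_e^T x - b_e)^2

    def sampled_grad(batch_size):
        idx = rng.integers(0, A.shape[0], size=batch_size)
        return A[idx].T @ (A[idx] @ x - b[idx]) / batch_size   # minibatch-averaged gradient

    # Averaging many single-term estimates approaches |E|^{-1} * full_grad.
    est = np.mean([sampled_grad(1) for _ in range(20000)], axis=0)
    print(np.linalg.norm(est - full_grad / A.shape[0]))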

Background: Optimization and Machine Learning
A long history of connections. Examples:
Back-propagation for neural networks was recognized in the 1980s or earlier as an incremental gradient method.
The Support Vector Machine was formulated as a linear and quadratic program in the late 1980s. Duality allowed formulation of the nonlinear SVM as a convex QP. From the late 1990s, many specialized optimization methods were applied: interior-point, coordinate descent / decomposition, cutting-plane, stochastic gradient.
Stochastic gradient: originally Robbins-Monro (1951). Optimizers in Russia developed algorithms from 1980 onwards. Rediscovered by the machine learning community around 2004 (Bottou, LeCun). Parallel and independent work in the ML and Optimization communities until 2009. Intense research continues.
Connections are now stronger than ever, with much collaborative and crossover activity.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 9 / 118

Gradient Methods
min f(x), with smooth convex f. Usually assume
   µI ⪯ ∇²f(x) ⪯ LI for all x, with 0 ≤ µ ≤ L.
(L is thus a Lipschitz constant on the gradient ∇f.) µ > 0 gives the strongly convex case, in which
   f(y) − f(x) − ∇f(x)^T (y − x) ≥ (1/2) µ ‖y − x‖².
(Mostly assume ‖·‖ := ‖·‖_2.) Define the conditioning κ := L/µ.
Sometimes discuss the convex quadratic f:
   f(x) = (1/2) x^T A x, where µI ⪯ A ⪯ LI.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 10 / 118

What's the Setup?
Assume in this part of the talk that we can evaluate f and ∇f at each iterate x_i. But we are interested in extending to a broader class of problems:
- nonsmooth f;
- ∇f not available; only an estimate of the gradient (or subgradient) is available;
- impose a constraint x ∈ Ω for some simple set Ω (e.g. ball, box, simplex);
- a nonsmooth regularization term may be added to the objective f.
Focus on algorithms that can be adapted to these circumstances.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 11 / 118

Steepest Descent
   x_{k+1} = x_k − α_k ∇f(x_k), for some α_k > 0.
Different ways to identify an appropriate α_k:
1. Hard: Interpolating scheme with safeguarding to identify an approximate minimizing α_k.
2. Easy: Backtracking. Try ᾱ, ᾱ/2, ᾱ/4, ᾱ/8, ... until a sufficient decrease in f is obtained.
3. Trivial: Don't test for function decrease. Use rules based on L and µ.
Traditional analysis for 1 and 2: usually yields global convergence at an unspecified rate. The greedy strategy of getting good decrease from the current search direction is appealing, and may lead to better practical results.
Analysis for 3: focuses on convergence rate, and leads to accelerated multistep methods.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 12 / 118

Line Search
Works for nonconvex f also. Seek α_k that satisfies the Wolfe conditions: sufficient decrease in f,
   f(x_k − α_k ∇f(x_k)) ≤ f(x_k) − c_1 α_k ‖∇f(x_k)‖²,   (0 < c_1 ≪ 1)
while not being too small (significant increase in the directional derivative):
   ∇f(x_{k+1})^T ∇f(x_k) ≤ c_2 ‖∇f(x_k)‖²,   (c_1 < c_2 < 1).
Can show that for convex f, accumulation points x̄ of {x_k} are stationary: ∇f(x̄) = 0. (Optimal, when f is convex.)
Can do a one-dimensional line search for α_k, taking minima of quadratics or cubics that interpolate the function and gradient information at the last two values tried. Use brackets to ensure steady convergence. Often find a suitable α within 3 attempts. (See e.g. Ch. 3 of Nocedal & Wright, 2006.)
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 13 / 118

Backtracking
Try α_k = ᾱ, ᾱ/2, ᾱ/4, ᾱ/8, ... until the sufficient decrease condition is satisfied. (No need to check the second Wolfe condition: the value of α_k thus identified is within striking distance of a value that's too large, so it is not too short.)
These methods are widely used in many applications, but they don't work on nonsmooth problems, when subgradients replace gradients, or when f is not available.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 14 / 118
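
A minimal sketch (mine, under the assumption that f and ∇f can be evaluated) of steepest descent with the backtracking rule just described; the test problem and the constants ᾱ = 1 and c_1 = 10^{-4} are illustrative.

    import numpy as np

    def grad_descent_backtracking(f, grad, x0, alpha_bar=1.0, c1=1e-4, max_iter=200):
        x = x0.copy()
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < 1e-8:
                break
            alpha = alpha_bar
            # try alpha_bar, alpha_bar/2, alpha_bar/4, ... until sufficient decrease
            while f(x - alpha * g) > f(x) - c1 * alpha * (g @ g) and alpha > 1e-12:
                alpha *= 0.5
            x = x - alpha * g
        return x

    # Example: convex quadratic f(x) = 0.5 x^T A x with kappa = 50.
    A = np.diag(np.linspace(1.0, 50.0, 10))
    x = grad_descent_backtracking(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, np.ones(10))
    print(np.linalg.norm(x))        # close to the minimizer x* = 0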

Constant (Short) Steplength
By elementary use of Taylor's theorem, obtain
   f(x_{k+1}) ≤ f(x_k) − α_k (1 − (α_k/2) L) ‖∇f(x_k)‖_2².
For α_k ≡ 1/L, this gives
   f(x_{k+1}) ≤ f(x_k) − (1/(2L)) ‖∇f(x_k)‖_2²,
thus
   ‖∇f(x_k)‖² ≤ 2L [f(x_k) − f(x_{k+1})].
By summing from k = 0 to k = N, and telescoping the sum, we have
   Σ_{k=0}^N ‖∇f(x_k)‖² ≤ 2L [f(x_0) − f(x_{N+1})].
(It follows that ∇f(x_k) → 0 if f is bounded below.)
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 15 / 118

Rate Analysis
Another elementary use of Taylor's theorem shows that
   ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖² − α_k (2/L − α_k) ‖∇f(x_k)‖²,
so that {‖x_k − x*‖} is decreasing. Define for convenience: Δ_k := f(x_k) − f(x*). By convexity, have
   Δ_k ≤ ∇f(x_k)^T (x_k − x*) ≤ ‖∇f(x_k)‖ ‖x_k − x*‖ ≤ ‖∇f(x_k)‖ ‖x_0 − x*‖.
From the previous page (subtracting f(x*) from both sides of the inequality), and using the inequality above, we have
   Δ_{k+1} ≤ Δ_k − (1/(2L)) ‖∇f(x_k)‖² ≤ Δ_k − (1/(2L ‖x_0 − x*‖²)) Δ_k².
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 16 / 118

Weakly convex: 1/k sublinear; Strongly convex: linear
Take reciprocals of both sides and manipulate (using (1 − ε)^{-1} ≥ 1 + ε):
   1/Δ_{k+1} ≥ 1/Δ_k + 1/(2L‖x_0 − x*‖²) ≥ 1/Δ_0 + (k + 1)/(2L‖x_0 − x*‖²),
which yields
   f(x_{k+1}) − f(x*) ≤ 2L ‖x_0 − x*‖² / (k + 1).
The classic 1/k convergence rate!
By assuming µ > 0, can set α_k ≡ 2/(µ + L) and get a linear (geometric) rate, much better than sublinear in the long run:
   ‖x_k − x*‖² ≤ ((L − µ)/(L + µ))^{2k} ‖x_0 − x*‖² = (1 − 2/(κ + 1))^{2k} ‖x_0 − x*‖².
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 17 / 118

Since by Taylor's theorem we have
   Δ_k = f(x_k) − f(x*) ≤ (L/2) ‖x_k − x*‖²,
it follows immediately that
   f(x_k) − f(x*) ≤ (L/2) (1 − 2/(κ + 1))^{2k} ‖x_0 − x*‖².
Note: A geometric / linear rate is generally much better than any sublinear (1/k or 1/k²) rate.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 18 / 118

The 1/k² Speed Limit
Nesterov (2004) gives a simple example of a smooth function for which no method that generates iterates of the form x_{k+1} = x_k − α_k ∇f(x_k) can converge at a rate faster than 1/k², at least for its first n/2 iterations. Note that x_{k+1} ∈ x_0 + span(∇f(x_0), ∇f(x_1), ..., ∇f(x_k)).
Take A to be the n × n tridiagonal matrix with 2 on the diagonal and −1 on the sub- and superdiagonals, take e_1 = (1, 0, ..., 0)^T, and set
   f(x) = (1/2) x^T A x − e_1^T x.
The solution has x*(i) = 1 − i/(n + 1). If we start at x_0 = 0, each ∇f(x_k) has nonzeros only in its first k entries. Hence, x_{k+1}(i) = 0 for i = k + 1, k + 2, ..., n. Can show
   f(x_k) − f* ≥ 3L ‖x_0 − x*‖² / (32 (k + 1)²).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 19 / 118
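
The construction is easy to reproduce. The sketch below (mine; n = 50 is illustrative) builds the tridiagonal A and e_1, runs the short-step gradient iteration from x_0 = 0, and counts the nonzero entries of x_k, which grow by at most one per iteration; this is what caps the progress of any such method.

    import numpy as np

    n = 50
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)      # tridiag(-1, 2, -1)
    e1 = np.zeros(n); e1[0] = 1.0
    grad = lambda x: A @ x - e1                               # gradient of 0.5 x^T A x - e1^T x
    L = np.linalg.eigvalsh(A).max()                           # Lipschitz constant of the gradient

    x = np.zeros(n)
    for k in range(1, 11):
        x = x - (1.0 / L) * grad(x)
        print(k, np.count_nonzero(np.abs(x) > 1e-14))         # support size grows by at most 1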

Exact minimizing α_k: Faster rate?
Take α_k to be the exact minimizer of f along −∇f(x_k). Does this yield a better rate of linear convergence?
Consider the convex quadratic f(x) = (1/2) x^T A x. (Thus x* = 0 and f(x*) = 0.) Here κ is the condition number of A. We have ∇f(x_k) = A x_k. The exact minimizing α_k is
   α_k = arg min_α f(x_k − α A x_k) = (x_k^T A² x_k) / (x_k^T A³ x_k),
which lies in the interval [1/L, 1/µ]. Thus
   f(x_{k+1}) = f(x_k) − (1/2) (x_k^T A² x_k)² / (x_k^T A³ x_k),
so, defining z_k := A x_k, we have
   [f(x_{k+1}) − f(x*)] / [f(x_k) − f(x*)] = 1 − ‖z_k‖⁴ / ((z_k^T A^{-1} z_k)(z_k^T A z_k)).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 20 / 118

Use the Kantorovich inequality:
   (z^T A z)(z^T A^{-1} z) ≤ ((L + µ)² / (4Lµ)) ‖z‖⁴.
Thus
   f(x_{k+1}) − f(x*) ≤ [1 − 4Lµ/(L + µ)²] [f(x_k) − f(x*)] = (1 − 2/(κ + 1))² [f(x_k) − f(x*)],
and so
   f(x_k) − f(x*) ≤ (1 − 2/(κ + 1))^{2k} [f(x_0) − f(x*)].
No improvement in the linear rate over constant steplength.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 21 / 118

The slow linear rate is typical! Not just a pessimistic bound! Stephen Wright (UW-Madison) Optimization IPAM, July 2012 22 / 118

Multistep Methods: Heavy-Ball
Enhance the search direction by including a contribution from the previous step. Consider first constant step lengths:
   x_{k+1} = x_k − α ∇f(x_k) + β (x_k − x_{k−1}).
Analyze by defining a composite iterate vector:
   w_k := (x_k − x*, x_{k−1} − x*).
Thus
   w_{k+1} = B w_k + o(‖w_k‖),   B := [ −α∇²f(x*) + (1 + β)I   −βI ; I   0 ].
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 23 / 118

B has the same eigenvalues as
   [ −αΛ + (1 + β)I   −βI ; I   0 ],   Λ = diag(λ_1, λ_2, ..., λ_n),
where the λ_i are the eigenvalues of ∇²f(x*). Choose α, β to explicitly minimize the max eigenvalue of B, and obtain
   α = (4/L) · 1/(1 + 1/√κ)²,   β = (1 − 2/(√κ + 1))².
Leads to linear convergence for ‖x_k − x*‖ with rate approximately 1 − 2/(√κ + 1).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 24 / 118
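
A minimal sketch (mine) of the constant-step heavy-ball iteration with α and β set from L and µ as above; the diagonal quadratic test problem is illustrative.

    import numpy as np

    def heavy_ball(grad, x0, L, mu, max_iter=500):
        kappa = L / mu
        alpha = 4.0 / (L * (1 + 1 / np.sqrt(kappa)) ** 2)
        beta = (1 - 2 / (np.sqrt(kappa) + 1)) ** 2
        x_prev, x = x0.copy(), x0.copy()
        for _ in range(max_iter):
            x_next = x - alpha * grad(x) + beta * (x - x_prev)
            x_prev, x = x, x_next
        return x

    A = np.diag(np.linspace(1.0, 100.0, 20))      # mu = 1, L = 100
    x = heavy_ball(lambda x: A @ x, np.ones(20), L=100.0, mu=1.0)
    print(np.linalg.norm(x))                      # close to the minimizer x* = 0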

Summary: Linear Convergence, Strictly Convex f
Steepest descent: linear rate approx (1 − 2/κ);
Heavy-ball: linear rate approx (1 − 2/√κ).
Big difference! To reduce ‖x_k − x*‖ by a factor ε, need k large enough that
   (1 − 2/κ)^k ≤ ε  ⟸  k ≥ (κ/2) |log ε|   (steepest descent)
   (1 − 2/√κ)^k ≤ ε  ⟸  k ≥ (√κ/2) |log ε|   (heavy-ball)
A factor of √κ difference; e.g. if κ = 100, need 10 times fewer steps.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 25 / 118

Conjugate Gradient
Basic step is
   x_{k+1} = x_k + α_k p_k,   p_k = −∇f(x_k) + γ_k p_{k−1}.
We can identify it with heavy-ball by setting β_k = α_k γ_k / α_{k−1}. However, CG can be implemented in a way that doesn't require knowledge (or estimation) of L and µ:
Choose α_k to (approximately) minimize f along p_k;
Choose γ_k by a variety of formulae (Fletcher-Reeves, Polak-Ribiere, etc.), all of which are equivalent if f is convex quadratic, e.g. γ_k = ‖∇f(x_k)‖² / ‖∇f(x_{k−1})‖².
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 26 / 118

CG, cont'd.
Nonlinear CG: variants include Fletcher-Reeves, Polak-Ribiere, Hestenes. Restarting periodically with p_k = −∇f(x_k) is a useful feature (e.g. every n iterations, or when p_k is not a descent direction).
For f quadratic, convergence analysis is based on the eigenvalues of A and Chebyshev polynomials, min-max arguments. Get
- finite termination in as many iterations as there are distinct eigenvalues;
- asymptotic linear convergence with rate approx 1 − 2/√κ. (Like heavy-ball.)
See e.g. Chap. 5 of Nocedal & Wright (2006) and refs therein.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 27 / 118
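
A sketch (mine, with illustrative parameters) of a nonlinear CG loop of the kind just described: the Fletcher-Reeves choice of γ_k, a backtracking choice of α_k, and restarts either periodically or when p_k is not a descent direction.

    import numpy as np

    def nonlinear_cg(f, grad, x0, max_iter=200, restart=20):
        x = x0.copy()
        g = grad(x)
        p = -g
        for k in range(max_iter):
            if np.linalg.norm(g) < 1e-8:
                break
            alpha = 1.0                           # backtracking along p
            while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
                alpha *= 0.5
            x_new = x + alpha * p
            g_new = grad(x_new)
            gamma = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves formula
            p = -g_new + gamma * p
            if (k + 1) % restart == 0 or g_new @ p >= 0:
                p = -g_new                        # restart with steepest descent
            x, g = x_new, g_new
        return x

    A = np.diag(np.linspace(1.0, 100.0, 30))
    x = nonlinear_cg(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, np.ones(30))
    print(np.linalg.norm(x))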

Accelerated First-Order Methods
Accelerate the rate to 1/k² for the weakly convex case, while retaining the linear rate (related to κ) for the strongly convex case. Nesterov (1983, 2004) describes a method that requires κ.
0: Choose x_0, α_0 ∈ (0, 1); set y_0 ← x_0.
k: x_{k+1} ← y_k − (1/L) ∇f(y_k);   (*short-step gradient*)
   solve for α_{k+1} ∈ (0, 1): α_{k+1}² = (1 − α_{k+1}) α_k² + α_{k+1}/κ;
   set β_k = α_k (1 − α_k) / (α_k² + α_{k+1});
   set y_{k+1} ← x_{k+1} + β_k (x_{k+1} − x_k).
Still works for the weakly convex case (κ = ∞).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 28 / 118

[Figure: geometry of the accelerated method, showing the points x_k, y_k, x_{k+1}, y_{k+1}, x_{k+2}, y_{k+2}]
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 29 / 118

Convergence Results: Nesterov
If α_0 ≥ 1/√κ, have
   f(x_k) − f(x*) ≤ c_1 min( (1 − 1/√κ)^k, 4L/(√L + c_2 k)² ),
where the constants c_1 and c_2 depend on x_0, α_0, L.
Linear convergence at the heavy-ball rate in the strongly convex case, otherwise 1/k².
In the special case of α_0 = 1/√κ, this scheme yields
   α_k ≡ 1/√κ,   β_k ≡ 1 − 2/(√κ + 1).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 30 / 118

FISTA
(Beck & Teboulle 2007.) Similar to the above, but with a fairly short and elementary analysis (though still not very intuitive).
0: Choose x_0; set y_1 = x_0, t_1 = 1.
k: x_k ← y_k − (1/L) ∇f(y_k);
   t_{k+1} ← (1 + √(1 + 4t_k²)) / 2;
   y_{k+1} ← x_k + ((t_k − 1)/t_{k+1}) (x_k − x_{k−1}).
For (weakly) convex f, converges with f(x_k) − f(x*) ~ 1/k². When L is not known, increase an estimate of L until it's big enough.
Beck & Teboulle (2010) do the convergence analysis in 2-3 pages: elementary, but technical.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 31 / 118

A Non-Monotone Gradient Method: Barzilai-Borwein
(Barzilai & Borwein 1988.) BB is a gradient method, but with an unusual choice of α_k. Allows f to increase (sometimes dramatically) on some steps. Explicitly, we have
   x_{k+1} = x_k − α_k ∇f(x_k),
where
   α_k := arg min_α ‖s_k − α z_k‖²,   s_k := x_k − x_{k−1},   z_k := ∇f(x_k) − ∇f(x_{k−1}),
that is,
   α_k = s_k^T z_k / (z_k^T z_k).
Note that for convex quadratic f = (1/2) x^T A x, we have
   α_k = s_k^T A s_k / (s_k^T A² s_k) ∈ [L^{-1}, µ^{-1}].
Hence, can view BB as a kind of quasi-Newton method, with the Hessian approximated by α_k^{-1} I.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 32 / 118

Comparison: BB vs Greedy Steepest Descent Stephen Wright (UW-Madison) Optimization IPAM, July 2012 33 / 118

Many BB Variants
- Can use α_k = s_k^T s_k / (s_k^T z_k) in place of α_k = s_k^T z_k / (z_k^T z_k);
- alternate between these two formulae;
- calculate α_k as above and hold it constant for 2, 3, or 5 successive steps;
- take α_k to be the exact steepest descent step from the previous iteration.
Nonmonotonicity appears essential to performance. Some variants get global convergence by requiring a sufficient decrease in f over the worst of the last 10 iterates.
The original 1988 analysis in BB's paper is nonstandard and illuminating (just for a 2-variable quadratic). In fact, most analyses of BB and related methods are nonstandard, and consider only special cases. The precursor of such analyses is Akaike (1959). More recently, see Ascher, Dai, Fletcher, Hager and others.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 34 / 118
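
A sketch (mine) of the plain BB iteration with the α_k = s_k^T z_k / z_k^T z_k choice and no safeguard or nonmonotone test; the first step size and the quadratic test problem are illustrative.

    import numpy as np

    def barzilai_borwein(grad, x0, alpha0=0.02, max_iter=200):
        x_prev = x0.copy()
        g_prev = grad(x_prev)
        x = x_prev - alpha0 * g_prev              # one ordinary gradient step to start
        for _ in range(max_iter):
            g = grad(x)
            s, z = x - x_prev, g - g_prev
            if z @ z < 1e-16:
                break
            alpha = (s @ z) / (z @ z)             # BB step; f may increase on some steps
            x_prev, g_prev = x, g
            x = x - alpha * g
        return x

    A = np.diag(np.linspace(1.0, 50.0, 30))
    x = barzilai_borwein(lambda x: A @ x, np.ones(30))
    print(np.linalg.norm(x))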

Primal-Dual Averaging (see Nesterov 2009)
Basic step:
   x_{k+1} = arg min_x (1/(k+1)) Σ_{i=0}^k [f(x_i) + ∇f(x_i)^T (x − x_i)] + (γ/√k) ‖x − x_0‖²
           = arg min_x ḡ_k^T x + (γ/√k) ‖x − x_0‖²,
where ḡ_k := Σ_{i=0}^k ∇f(x_i)/(k+1) is the averaged gradient.
The last term is always centered at the first iterate x_0. Gradient information is averaged over all steps, with equal weights. γ is constant; results can be sensitive to this value.
The approach still works for convex nondifferentiable f, where ∇f(x_i) is replaced by a vector from the subdifferential ∂f(x_i).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 35 / 118

Convergence Properties
Nesterov proves convergence for the averaged iterates:
   x̄_{k+1} = (1/(k+1)) Σ_{i=0}^k x_i.
Provided the iterates and the solution x* lie within some ball of radius D around x_0, we have
   f(x̄_{k+1}) − f(x*) ≤ C/√k,
where C depends on D, a uniform bound on ‖∇f(x)‖, and γ (the coefficient of the stabilizing term).
Note: there's averaging in both the primal (x_i) and dual (∇f(x_i)) spaces.
Generalizes easily and robustly to the case in which only estimated gradients or subgradients are available. (Averaging smooths the errors in the individual gradient estimates.)
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 36 / 118
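
In the unconstrained case the subproblem on the previous slide has a closed form, x_{k+1} = x_0 − (√k/(2γ)) ḡ_k. The following sketch (mine; γ, the subgradient oracle, and the test function f(x) = ‖x − 1‖_1 are all illustrative) runs that recursion together with primal averaging of the iterates.

    import numpy as np

    def dual_averaging(subgrad, x0, gamma=1.0, max_iter=1000):
        x = x0.copy()
        gbar = np.zeros_like(x0)
        x_avg = np.zeros_like(x0)
        for k in range(1, max_iter + 1):
            g = subgrad(x)
            gbar += (g - gbar) / k                          # running average of (sub)gradients
            x = x0 - (np.sqrt(k) / (2 * gamma)) * gbar      # closed-form subproblem solution
            x_avg += (x - x_avg) / k                        # averaged iterates
        return x_avg

    subgrad = lambda x: np.sign(x - 1.0)                    # a subgradient of ||x - 1||_1
    print(np.round(dual_averaging(subgrad, np.zeros(5)), 2))   # approaches the vector of ones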

Extending to the Constrained Case: x ∈ Ω
How do these methods change when we require x ∈ Ω, with Ω closed and convex?
Some algorithms and theory stay much the same, provided we can involve Ω explicitly in the subproblems.
Example: Primal-Dual Averaging for min_{x ∈ Ω} f(x):
   x_{k+1} = arg min_{x ∈ Ω} ḡ_k^T x + (γ/√k) ‖x − x_0‖²,
where ḡ_k := Σ_{i=0}^k ∇f(x_i)/(k+1). When Ω is a box, this subproblem is easy to solve.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 37 / 118

Example: Nesterov's Constant Step Scheme for min_{x ∈ Ω} f(x)
Requires just one calculation to be changed from the unconstrained version.
0: Choose x_0, α_0 ∈ (0, 1); set y_0 ← x_0, q ← 1/κ = µ/L.
k: x_{k+1} ← arg min_{y ∈ Ω} (1/2) ‖y − [y_k − (1/L) ∇f(y_k)]‖_2²;
   solve for α_{k+1} ∈ (0, 1): α_{k+1}² = (1 − α_{k+1}) α_k² + q α_{k+1};
   set β_k = α_k (1 − α_k) / (α_k² + α_{k+1});
   set y_{k+1} ← x_{k+1} + β_k (x_{k+1} − x_k).
Convergence theory is unchanged.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 38 / 118

Regularized Optimization (More Later)
FISTA can be applied with minimal changes to the regularized problem
   min_x f(x) + τ ψ(x),
where f is convex and smooth, ψ is convex and simple but usually nonsmooth, and τ is a positive parameter. Simply replace the gradient step by
   x_k = arg min_x (L/2) ‖x − [y_k − (1/L) ∇f(y_k)]‖² + τ ψ(x).
(This is the shrinkage step; when ψ ≡ 0 or ψ = ‖·‖_1, it can be solved cheaply.) More on this later.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 39 / 118
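
A sketch (mine) combining the FISTA recursion from the earlier slide with the shrinkage step above for ψ = ‖·‖_1: the subproblem reduces to componentwise soft-thresholding at level τ/L. The LASSO-style synthetic data and parameter values are assumptions.

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def fista_l1(grad_f, L, tau, x0, max_iter=500):
        x_prev, y, t = x0.copy(), x0.copy(), 1.0
        for _ in range(max_iter):
            x = soft_threshold(y - grad_f(y) / L, tau / L)   # shrinkage (prox) step
            t_next = 0.5 * (1 + np.sqrt(1 + 4 * t * t))
            y = x + ((t - 1) / t_next) * (x - x_prev)        # momentum step
            x_prev, t = x, t_next
        return x_prev

    # Example: 0.5*||Ax - b||^2 + tau*||x||_1 with a sparse ground truth.
    rng = np.random.default_rng(2)
    A = rng.standard_normal((100, 300))
    x_true = np.zeros(300); x_true[:5] = 3.0
    b = A @ x_true + 0.01 * rng.standard_normal(100)
    L = np.linalg.norm(A, 2) ** 2                            # largest eigenvalue of A^T A
    x = fista_l1(lambda x: A.T @ (A @ x - b), L, tau=0.1, x0=np.zeros(300))
    print(np.count_nonzero(np.abs(x) > 1e-6))                # a sparse solution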

Further Reading
1. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, 2004.
2. A. Beck and M. Teboulle, Gradient-based methods with application to signal recovery problems, in press, 2010. (See Teboulle's web site.)
3. B. T. Polyak, Introduction to Optimization, Optimization Software Inc., 1987.
4. J. Barzilai and J. M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis, 8, pp. 141-148, 1988.
5. Y. Nesterov, Primal-dual subgradient methods for convex programs, Mathematical Programming, Series B, 120, pp. 221-259, 2009.
6. J. Nocedal and S. Wright, Numerical Optimization, 2nd ed., Springer, 2006.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 40 / 118

Stochastic Gradient Methods
Still deal with (weakly or strongly) convex f. But change the rules:
- Allow f nonsmooth.
- Can't get function values f(x).
- At any feasible x, have access only to an unbiased estimate of an element of the subgradient ∂f.
Common settings are
   f(x) = E_ξ F(x, ξ),
where ξ is a random vector with distribution P over a set Ξ, and also the special case
   f(x) = Σ_{i=1}^m f_i(x),
where each f_i is convex and nonsmooth.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 41 / 118

Applications
This setting is useful for machine learning formulations. Given data x_i ∈ R^n and labels y_i = ±1, i = 1, 2, ..., m, find w that minimizes
   τ ψ(w) + Σ_{i=1}^m ℓ(w; x_i, y_i),
where ψ is a regularizer, τ > 0 is a parameter, and ℓ is a loss. For linear classifiers/regressors, have the specific form ℓ(w^T x_i, y_i).
Example: SVM with hinge loss ℓ(w^T x_i, y_i) = max(1 − y_i (w^T x_i), 0) and ψ = ‖·‖_1 or ψ = ‖·‖_2².
Example: Logistic regression: ℓ(w^T x_i, y_i) = log(1 + exp(−y_i w^T x_i)). In the regularized version may have ψ(w) = ‖w‖_1.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 42 / 118

Subgradients
For each x in the domain of f, g is a subgradient of f at x if
   f(z) ≥ f(x) + g^T (z − x),   for all z ∈ dom f.
The right-hand side is a supporting hyperplane. The set of subgradients is called the subdifferential, denoted by ∂f(x). When f is differentiable at x, have ∂f(x) = {∇f(x)}.
We have strong convexity with modulus µ > 0 if
   f(z) ≥ f(x) + g^T (z − x) + (1/2) µ ‖z − x‖²,   for all x, z ∈ dom f with g ∈ ∂f(x).
Generalizes the assumption ∇²f(x) ⪰ µI made earlier for smooth functions.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 43 / 118

[Figure: a convex function f with supporting hyperplanes at a point x]
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 44 / 118

Classical Stochastic Approximation
Denote by G(x, ξ) the subgradient estimate generated at x. For unbiasedness need E_ξ G(x, ξ) ∈ ∂f(x).
Basic SA scheme: at iteration k, choose ξ_k i.i.d. according to distribution P, choose some α_k > 0, and set
   x_{k+1} = x_k − α_k G(x_k, ξ_k).
Note that x_{k+1} depends on all random variables up to iteration k, i.e. ξ_[k] := {ξ_1, ξ_2, ..., ξ_k}.
When f is strongly convex, the analysis of convergence of E(‖x_k − x*‖²) is fairly elementary; see Nemirovski et al (2009).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 45 / 118
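
A sketch (mine, with synthetic data) of the basic SA scheme with the classical step α_k = 1/(kµ) on a strongly convex least-squares problem; here G(x, ξ) is the gradient of a single randomly drawn term, which is an unbiased estimate of ∇f(x).

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 2000, 10
    A = rng.standard_normal((n, d))
    x_star = rng.standard_normal(d)
    b = A @ x_star
    mu = np.linalg.eigvalsh(A.T @ A / n).min()    # strong convexity modulus of f

    # f(x) = (1/(2n)) ||Ax - b||^2;  G(x, i) = a_i (a_i^T x - b_i) is unbiased for grad f(x).
    x = np.zeros(d)
    for k in range(1, 50001):
        i = rng.integers(n)
        x -= (A[i] * (A[i] @ x - b[i])) / (k * mu)
    print(np.linalg.norm(x - x_star))             # E||x_k - x*||^2 decays like 1/k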

Rate: 1/k
Define a_k = (1/2) E(‖x_k − x*‖²). Assume there is M > 0 such that E(‖G(x, ξ)‖²) ≤ M² for all x of interest. Thus
   (1/2) ‖x_{k+1} − x*‖_2² = (1/2) ‖x_k − α_k G(x_k, ξ_k) − x*‖_2²
      = (1/2) ‖x_k − x*‖_2² − α_k (x_k − x*)^T G(x_k, ξ_k) + (1/2) α_k² ‖G(x_k, ξ_k)‖².
Taking expectations, get
   a_{k+1} ≤ a_k − α_k E[(x_k − x*)^T G(x_k, ξ_k)] + (1/2) α_k² M².
For the middle term, have
   E[(x_k − x*)^T G(x_k, ξ_k)] = E_{ξ_[k−1]} E_{ξ_k}[(x_k − x*)^T G(x_k, ξ_k) | ξ_[k−1]] = E_{ξ_[k−1]} (x_k − x*)^T g_k,
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 46 / 118

... where
   g_k := E_{ξ_k}[G(x_k, ξ_k) | ξ_[k−1]] ∈ ∂f(x_k).
By strong convexity, have
   (x_k − x*)^T g_k ≥ f(x_k) − f(x*) + (1/2) µ ‖x_k − x*‖² ≥ µ ‖x_k − x*‖².
Hence by taking expectations, we get E[(x_k − x*)^T g_k] ≥ 2µ a_k. Then, substituting above, we obtain
   a_{k+1} ≤ (1 − 2µα_k) a_k + (1/2) α_k² M².
When α_k ≡ 1/(kµ), a neat inductive argument (below) reveals the 1/k rate:
   a_k ≤ Q/(2k),   for Q := max(‖x_1 − x*‖², M²/µ²).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 47 / 118

Proof: Clearly true for k = 1. Otherwise:
   a_{k+1} ≤ (1 − 2µα_k) a_k + (1/2) α_k² M²
      = (1 − 2/k) a_k + M²/(2k²µ²)
      ≤ (1 − 2/k) Q/(2k) + Q/(2k²)
      = ((k − 2)/(2k²) + 1/(2k²)) Q
      = ((k − 1)/(2k²)) Q
      = ((k² − 1)/k²) · Q/(2(k + 1))
      ≤ Q/(2(k + 1)),
as claimed.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 48 / 118

But... What if we don't know µ? Or if µ = 0?
The choice α_k = 1/(kµ) requires strong convexity, with knowledge of the modulus µ. An underestimate of µ can greatly degrade the performance of the method (see example in Nemirovski et al. 2009).
Now describe a Robust Stochastic Approximation approach, which has a rate 1/√k (in function value convergence), works for weakly convex nonsmooth functions, and is not sensitive to the choice of parameters in the step length. This is the approach that generalizes to mirror descent.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 49 / 118

Robust SA
At iteration k:
- set x_{k+1} = x_k − α_k G(x_k, ξ_k) as before;
- set
   x̄_k = (Σ_{i=1}^k α_i x_i) / (Σ_{i=1}^k α_i).
For any θ > 0 (not critical), choose step lengths to be
   α_k = θ / (M √k).
Then f(x̄_k) converges to f(x*) in expectation with rate approximately (log k)/k^{1/2}. The choice of θ is not critical.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 50 / 118
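
A sketch (mine) of the Robust SA recipe: steps α_k = θ/(M√k) together with the weighted average x̄_k of the iterates. The values of θ and M, the noise model, and the nonsmooth test function f(x) = ‖x − 1‖_1 are illustrative guesses.

    import numpy as np

    def robust_sa(subgrad_est, x0, theta=1.0, M=1.0, max_iter=10000):
        x = x0.copy()
        x_bar = np.zeros_like(x0)
        alpha_sum = 0.0
        for k in range(1, max_iter + 1):
            alpha = theta / (M * np.sqrt(k))
            x = x - alpha * subgrad_est(x)
            alpha_sum += alpha
            x_bar += (alpha / alpha_sum) * (x - x_bar)   # running weighted average of iterates
        return x_bar

    rng = np.random.default_rng(4)
    subgrad_est = lambda x: np.sign(x - 1.0) + 0.1 * rng.standard_normal(x.size)
    print(np.round(robust_sa(subgrad_est, np.zeros(5)), 2))   # close to the vector of ones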

Analysis of Robust SA
The analysis is again elementary. As above (using i instead of k), have:
   α_i E[(x_i − x*)^T g_i] ≤ a_i − a_{i+1} + (1/2) α_i² M².
By convexity of f, and g_i ∈ ∂f(x_i):
   f(x*) ≥ f(x_i) + g_i^T (x* − x_i),
thus
   α_i E[f(x_i) − f(x*)] ≤ a_i − a_{i+1} + (1/2) α_i² M²,
so by summing over iterates i = 1, 2, ..., k, telescoping, and using a_{k+1} > 0:
   Σ_{i=1}^k α_i E[f(x_i) − f(x*)] ≤ a_1 + (1/2) M² Σ_{i=1}^k α_i².
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 51 / 118

Thus dividing by Σ_{i=1}^k α_i:
   E[ (Σ_{i=1}^k α_i f(x_i)) / (Σ_{i=1}^k α_i) ] − f(x*) ≤ (a_1 + (1/2) M² Σ_{i=1}^k α_i²) / (Σ_{i=1}^k α_i).
By convexity, we have
   f(x̄_k) ≤ (Σ_{i=1}^k α_i f(x_i)) / (Σ_{i=1}^k α_i),
so obtain the fundamental bound:
   E[f(x̄_k) − f(x*)] ≤ (a_1 + (1/2) M² Σ_{i=1}^k α_i²) / (Σ_{i=1}^k α_i).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 52 / 118

By substituting α_i = θ/(M√i), we obtain
   E[f(x̄_k) − f(x*)] ≤ (a_1 + (1/2) θ² Σ_{i=1}^k 1/i) / ((θ/M) Σ_{i=1}^k 1/√i)
      ≤ (a_1 + θ² log(k + 1)) / ((θ/M) √k)
      = M [a_1/θ + θ log(k + 1)] k^{-1/2}.
That's it!
Other variants: constant stepsizes α_k for a fixed budget of iterations; periodic restarting; averaging just over the recent iterates. All can be analyzed with the basic bound above.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 53 / 118

Constant Step Size
We can also get rates of approximately 1/k for the strongly convex case, without performing iterate averaging and without requiring an accurate estimate of µ. The tricks are to (a) define the desired threshold for a_k in advance and (b) use a constant step size.
Recall the bound from a few slides back, and set α_k ≡ α:
   a_{k+1} ≤ (1 − 2µα) a_k + (1/2) α² M².
Define the limiting value a_∞ by
   a_∞ = (1 − 2µα) a_∞ + (1/2) α² M².
Take the difference of the two expressions above:
   (a_{k+1} − a_∞) ≤ (1 − 2µα)(a_k − a_∞),
from which it follows that {a_k} decreases monotonically to a_∞, and
   (a_k − a_∞) ≤ (1 − 2µα)^k (a_0 − a_∞).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 54 / 118

Constant Step Size, continued
Rearrange the expression for a_∞ to obtain
   a_∞ = α M² / (4µ).
From the previous slide, we thus have
   a_k ≤ (1 − 2µα)^k (a_0 − a_∞) + a_∞ ≤ (1 − 2µα)^k a_0 + α M²/(4µ).
Given a threshold ε > 0, we aim to find α and K such that a_k ≤ ε for all k ≥ K. We ensure that both terms on the right-hand side of the expression above are less than ε/2. The right values are:
   α := 2εµ/M²,   K := (M²/(4εµ²)) log(2a_0/ε).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 55 / 118
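
A worked numeric instance of these formulas (the values of µ, M, a_0, and ε below are invented for illustration):

    import numpy as np

    mu, M, a0, eps = 0.1, 5.0, 10.0, 1e-3
    alpha = 2 * eps * mu / M**2                        # alpha := 2*eps*mu / M^2
    K = (M**2 / (4 * eps * mu**2)) * np.log(2 * a0 / eps)
    print(alpha, int(np.ceil(K)))                      # step size 8e-6, roughly 6.2e6 iterations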

Constant Step Size, continued
Clearly the choice of α guarantees that the second term is less than ε/2. For the first term, we obtain K from an elementary argument:
   (1 − 2µα)^k a_0 ≤ ε/2
   ⟺ k log(1 − 2µα) ≤ −log(2a_0/ε)
   ⟸ k (−2µα) ≤ −log(2a_0/ε)   (since log(1 + x) ≤ x)
   ⟺ k ≥ (1/(2µα)) log(2a_0/ε),
from which the result follows, by substituting for α in the right-hand side.
If µ is underestimated by a factor of β, we undervalue α by the same factor, and K increases by a factor of 1/β. (Easy modification of the analysis above.) Underestimating µ gives a mild performance penalty.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 56 / 118

Constant Step Size: Summary
PRO: avoids averaging, 1/k sublinear convergence, insensitive to underestimates of µ.
CON: need to estimate probably unknown quantities: besides µ, we need M (to get α) and a_0 (to get K).
We use a constant step size in the parallel SG approach Hogwild!, to be described later.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 57 / 118

Mirror Descent
The step from x_k to x_{k+1} can be viewed as the solution of a subproblem:
   x_{k+1} = arg min_z G(x_k, ξ_k)^T (z − x_k) + (1/(2α_k)) ‖z − x_k‖_2²,
a linear estimate of f plus a prox-term. This provides a route to handling constrained problems, regularized problems, and alternative prox-functions.
For the constrained problem min_{x ∈ Ω} f(x), simply add the restriction z ∈ Ω to the subproblem above. In some cases (e.g. when Ω is a box), the subproblem is still easy to solve.
We may use other prox-functions in place of (1/2)‖z − x‖_2² above. Such alternatives may be particularly well suited to particular constraint sets Ω.
Mirror Descent is the term used for such generalizations of the SA approaches above.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 58 / 118

Mirror Descent, cont'd
Given the constraint set Ω, choose a norm ‖·‖ (not necessarily Euclidean). Define the distance-generating function ω to be a strongly convex function on Ω with modulus 1 with respect to ‖·‖, that is,
   (ω'(x) − ω'(z))^T (x − z) ≥ ‖x − z‖²,   for all x, z ∈ Ω,
where ω'(·) denotes an element of the subdifferential.
Now define the prox-function V(x, z) as follows:
   V(x, z) = ω(z) − ω(x) − ω'(x)^T (z − x).
This is also known as the Bregman distance. We can use it in the subproblem in place of (1/2)‖·‖_2²:
   x_{k+1} = arg min_{z ∈ Ω} G(x_k, ξ_k)^T (z − x_k) + (1/α_k) V(x_k, z).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 59 / 118

Bregman distance is the deviation from linearity:
[Figure: ω plotted against its linearization at x, with V(x, z) the gap between them at z]
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 60 / 118

Bregman Distances: Examples
For any Ω, we can use ω(x) := (1/2)‖x‖_2², leading to the prox-function V(x, z) = (1/2)‖x − z‖_2².
For the simplex
   Ω = {x ∈ R^n : x ≥ 0, Σ_{i=1}^n x_i = 1},
we can use instead the 1-norm ‖·‖_1, and choose ω to be the entropy function
   ω(x) = Σ_{i=1}^n x_i log x_i,
leading to Bregman distance
   V(x, z) = Σ_{i=1}^n z_i log(z_i/x_i).
These are the two most useful cases. Convergence results for SA can be generalized to mirror descent.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 61 / 118
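
With the entropy prox-function, the mirror-descent subproblem on the simplex has the closed form z_i ∝ x_i exp(−α_k g_i), an exponentiated-gradient (multiplicative) update. The sketch below (mine; the objective, step size, and data are illustrative) applies it to minimizing a smooth function over the simplex.

    import numpy as np

    def mirror_descent_simplex(grad, x0, alpha=0.1, max_iter=2000):
        x = x0.copy()
        for _ in range(max_iter):
            w = x * np.exp(-alpha * grad(x))   # closed form of the entropic prox subproblem
            x = w / w.sum()                    # iterates stay in the simplex automatically
        return x

    # Example: minimize f(x) = 0.5 ||x - c||^2 over the simplex.
    c = np.array([0.7, 0.2, 0.05, 0.05, -0.3])
    x = mirror_descent_simplex(lambda x: x - c, np.ones(5) / 5)
    print(np.round(x, 3), x.sum())             # approaches the projection of c onto the simplex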

Incremental Gradient
(See e.g. Bertsekas (2011) and references therein.) Finite sums:
   f(x) = Σ_{i=1}^m f_i(x).
Step k typically requires choice of one index i_k ∈ {1, 2, ..., m} and evaluation of ∇f_{i_k}(x_k). The components i_k are selected sometimes randomly, sometimes cyclically. (The latter option does not exist in the setting f(x) := E_ξ F(x; ξ).)
There are incremental versions of the heavy-ball method:
   x_{k+1} = x_k − α_k ∇f_{i_k}(x_k) + β (x_k − x_{k−1}).
Approach like dual averaging: assume a cyclic choice of i_k, and approximate ∇f(x_k) by the average of ∇f_i(x) over the last m iterates:
   x_{k+1} = x_k − (α_k/m) Σ_{l=1}^m ∇f_{i_{k−l+1}}(x_{k−l+1}).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 62 / 118

Achievable Accuracy
Consider the basic incremental method: x_{k+1} = x_k − α_k ∇f_{i_k}(x_k). How close can f(x_k) come to f(x*) deterministically (not just in expectation)?
Bertsekas (2011) obtains results for constant steps α_k ≡ α:
- cyclic choice of i_k: lim inf_k f(x_k) ≤ f(x*) + αβm²c²/2;
- random choice of i_k: lim inf_k f(x_k) ≤ f(x*) + αβmc²/2;
where β is close to 1 and c is a bound on the Lipschitz constants for the f_i. (Bertsekas actually proves these results in the more general context of regularized optimization; see below.)
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 63 / 118

Applications to SVM
SA techniques have an obvious application to linear SVM classification. In fact, they were proposed in this context and analyzed independently by researchers in the ML community for some years.
Codes: SGD (Bottou), PEGASOS (Shalev-Shwartz et al, 2007).
Tutorial: "Stochastic Optimization for Machine Learning," N. Srebro and A. Tewari, ICML 2010, for many more details on the connections between stochastic optimization and machine learning.
Related work: Zinkevich (ICML, 2003) on online convex programming. Aims to approximately minimize the average of a sequence of convex functions, presented sequentially. No i.i.d. assumption, regret-based analysis. Takes steplengths of size O(k^{-1/2}) in the gradient ∇f_k(x_k) of the latest convex function. Average regret is O(k^{-1/2}).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 64 / 118

Parallel Stochastic Approximation
Several approaches have been tried for parallel stochastic approximation.
Dual Averaging: average gradient estimates evaluated in parallel on different cores. Requires message passing / synchronization (Dekel et al, 2011; Duchi et al, 2010).
Round-Robin: cores evaluate ∇f_i in parallel and update the centrally stored x in round-robin fashion. Requires synchronization (Langford et al, 2009).
Asynchronous: Hogwild!: each core grabs the centrally-stored x and evaluates ∇f_e(x_e) for some random e, then writes the updates back into x (Niu, Ré, Recht, Wright, NIPS, 2011).
Hogwild!: Each processor runs independently:
1. Sample e from E;
2. Read the current state of x;
3. For v in e: x_v ← x_v − α [∇f_e(x_e)]_v.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 65 / 118
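
A toy sketch (mine) of the Hogwild! loop above: several threads share one parameter vector and write sparse updates without any locking. Python threads under the GIL are not a faithful model of true lock-free shared memory, so this only illustrates the structure of the algorithm; the sparse least-squares terms f_e and all data are synthetic.

    import numpy as np
    import threading

    rng = np.random.default_rng(5)
    n, d, nnz = 5000, 1000, 10
    rows = [rng.choice(d, size=nnz, replace=False) for _ in range(n)]   # support of each f_e
    vals = [rng.standard_normal(nnz) for _ in range(n)]
    x_true = rng.standard_normal(d)
    b = np.array([v @ x_true[r] for r, v in zip(rows, vals)])

    x = np.zeros(d)                # shared, centrally stored iterate
    alpha = 0.05                   # constant step size, as discussed earlier

    def worker(num_updates, seed):
        local_rng = np.random.default_rng(seed)
        for _ in range(num_updates):
            e = local_rng.integers(n)                 # sample a term f_e
            r, v = rows[e], vals[e]
            g = (v @ x[r] - b[e]) * v                 # gradient of f_e touches only x[r]
            x[r] -= alpha * g                         # unsynchronized sparse write

    threads = [threading.Thread(target=worker, args=(20000, s)) for s in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))   # relative error after the runs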

Hogwild! Convergence
Updates can be old by the time they are applied, but we assume a bound τ on their age. Niu et al (2011) analyze the case in which the update is applied to just one v ∈ e, but this can be extended easily to update the full edge e, provided this is done atomically.
Processors can overwrite each other's work, but sparsity of ∇f_e helps the updates to not interfere too much.
The analysis of Niu et al (2011) was recently simplified and generalized by Richtarik (2012). In addition to L, µ, M, D_0 defined above, also define quantities that capture the size and interconnectivity of the subvectors x_e:
- ρ_e = |{e' : e' ∩ e ≠ ∅}|: the number of indices e' such that x_e and x_{e'} have common components;
- ρ = Σ_{e ∈ E} ρ_e / |E|²: the average rate of overlap between subvectors.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 66 / 118

Hogwild! Convergence
(Richtarik 2012, for full atomic update of index e.) Given ε ∈ (0, D_0/L), we have
   min_{0 ≤ j ≤ k} E(f(x_j) − f(x*)) ≤ ε,
for
   α = µε / ((1 + 2τρ) L M² |E|²)
and k ≥ K, where
   K = ((1 + 2τρ) L M² |E|²) / (µ² ε) · log(2LD_0/ε).
Broadly, this recovers the sublinear 1/k convergence rate seen in regular SGD, with the delay τ and the overlap measure ρ both appearing linearly.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 67 / 118

Hogwild! Performance Hogwild! compared with averaged gradient (AIG) and round-robin (RR). Experiments run on a 12-core machine. (10 cores used for gradient evaluations, 2 cores for data shuffling.) Stephen Wright (UW-Madison) Optimization IPAM, July 2012 68 / 118

Hogwild! Performance Stephen Wright (UW-Madison) Optimization IPAM, July 2012 69 / 118

Extensions
To improve scalability, could restrict write access:
- break x into blocks; assign one block per processor; allow a processor to update only components in its block;
- share blocks by periodically writing to a central repository, or gossiping between processors.
Analysis in progress. Le et al (2012) (featured recently in the NY Times) implemented an algorithm like this on 16,000 cores.
Another useful tool for splitting problems and coordinating information between processors is the Alternating Direction Method of Multipliers (ADMM).
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 70 / 118

Further Reading
1. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19, pp. 1574-1609, 2009.
2. D. P. Bertsekas, Incremental gradient, subgradient, and proximal methods for convex optimization: A survey, Chapter 4 in Optimization for Machine Learning, S. Nowozin, S. Sra, and S. J. Wright (eds), 2011.
3. A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, I and II, Chapters 5 and 6 in Optimization for Machine Learning (2011).
4. O. L. Mangasarian and M. Solodov, Serial and parallel backpropagation convergence via nonmonotone perturbed minimization, Optimization Methods and Software, 4 (1994), pp. 103-116.
5. D. Blatt, A. O. Hero, and H. Gauchman, A convergent incremental gradient method with constant step size, SIAM Journal on Optimization, 18 (2008), pp. 29-51.
6. Niu, F., Recht, B., Ré, C., and Wright, S. J., Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, NIPS 24, 2011.
Stephen Wright (UW-Madison) Optimization IPAM, July 2012 71 / 118