Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions
1 Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions

Arkadi Nemirovski
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

Joint research with Anatoli Juditsky (Joseph Fourier University, Grenoble, France), Guanghui Lan (ISyE, University of Florida), and Alexander Shapiro (ISyE, Georgia Tech)

BFG 09, Leuven, September 14-18, 2009
2 Overview

- Convex Stochastic Minimization and Convex-Concave Stochastic Saddle Point problems
- Classical Stochastic Approximation
- Robust Stochastic Approximation & Stochastic Mirror Prox: construction, efficiency estimates, how it works
- Acceleration via Randomization: Matrix Game & $\ell_1$-minimization with uniform fit; $\ell_1$-minimization with Least Squares fit; minimizing maxima of convex polynomials over a simplex
3 Convex Stochastic Minimization Problem

The basic Convex Stochastic Minimization problem is
$$\min_{x\in X} f(x) \qquad (SM)$$
- $X \subset \mathbb{R}^n$: convex compact set
- $f(x): X \to \mathbb{R}$: convex and Lipschitz continuous objective given by a Stochastic Oracle. At the $i$-th call to the oracle, the input being $x \in X$, the oracle returns a stochastic gradient $G(x,\xi_i) \in \mathbb{R}^n$ of $f$, with independent $\xi_i \sim P$, $i=1,2,\dots$, and
$$\mathbb{E}_{\xi\sim P}\{G(x,\xi)\} \in \partial f(x) \quad \forall x\in X.$$

In many cases (but not always!) $G(x,\xi) \in \partial_x F(x,\xi)$, where $F(x,\xi)$ is convex in $x$. Here (SM) is the problem of minimizing the expectation $f(x) = \mathbb{E}\{F(x,\xi)\}$ of a convex in $x$ integrand $F(x,\xi)$ given by a first order oracle.
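The stochastic-oracle contract above (unbiased subgradients built from fresh i.i.d. draws $\xi_i\sim P$) can be sketched in a few lines. The integrand $F(x,\xi)=|\xi^Tx|$ below is an illustrative choice, not one from the talk; for it, $f(x)=\mathbb{E}|\xi^Tx|$ and $G(x,\xi)=\mathrm{sign}(\xi^Tx)\,\xi$ is a valid stochastic subgradient.

```python
import numpy as np

# Sketch of a Stochastic Oracle for (SM), for the illustrative integrand
# F(x, xi) = |xi^T x|, xi ~ N(0, I).  G(x, xi) is a subgradient of F(., xi)
# at x, so E_xi G(x, xi) is a subgradient of f(x) = E F(x, xi).
def stochastic_oracle(x, rng):
    xi = rng.standard_normal(x.shape)   # fresh xi ~ P at every call
    return np.sign(xi @ x) * xi         # G(x, xi)

rng = np.random.default_rng(7)
x = np.array([1.0, 0.0])
# empirical check of unbiasedness: for this F, the exact gradient of f at
# x = (1, 0) is (E|xi_1|, 0) = (sqrt(2/pi), 0) ~ (0.7979, 0)
g_mean = np.mean([stochastic_oracle(x, rng) for _ in range(20000)], axis=0)
```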
4 Convex-Concave Stochastic Saddle Point Problem

The Convex Stochastic Minimization problem is a particular case of the Convex-Concave Stochastic Saddle Point problem
$$\min_{x\in X}\max_{y\in Y} f(x,y) \qquad (SP)$$
- $Z = X\times Y \subset \mathbb{R}^{n_x}\times\mathbb{R}^{n_y} = \mathbb{R}^n$: convex compact set
- $f(x,y): X\times Y\to\mathbb{R}$: convex in $x\in X$, concave in $y\in Y$, Lipschitz continuous function given by a Stochastic Oracle. At the $i$-th call to the oracle, the input being $z=(x,y)\in Z$, the Stochastic Oracle returns a stochastic derivative $G(z,\xi_i) = [G_x(z,\xi_i); G_y(z,\xi_i)] \in \mathbb{R}^{n_x}\times\mathbb{R}^{n_y}$ of $f$, with independent $\xi_i\sim P$, $i=1,2,\dots$, and
$$F(x,y) := \mathbb{E}_{\xi\sim P}\{G(x,y,\xi)\} \in \partial_x f(x,y) \times \partial_y(-f(x,y)).$$
5 Classical Stochastic Approximation

$$\min_{x\in X} f(x)\ \ (SM)\qquad \min_{x\in X}\max_{y\in Y} f(x,y)\ \ (SP)$$

The very first method for solving (SM) is the classical Stochastic Approximation (Robbins & Monro '51), which mimics the usual Subgradient Descent:
$$x_{t+1} = \arg\min_{u\in X}\left\{\tfrac12\|u - x_t\|_2^2 + \gamma_t\langle G(x_t,\xi_t), u\rangle\right\},\qquad \gamma_t = \frac{a}{t},$$
that is, $x_{t+1}$ is the closest to $x_t - \gamma_t G(x_t,\xi_t)$ point of $X$.

When $f$ is strongly convex on $X$, SA with properly chosen $a$ ensures that $\mathbb{E}\{\|x_t - \arg\min_X f\|_2^2\} \le O(1/t)$. If, in addition, $\arg\min_X f \in \operatorname{int} X$ and $f'$ is Lipschitz continuous, then also $\mathbb{E}\{f(x_t) - \min_X f\} \le O(1/t)$. A similar construction, with similar results, works for (SP) with strongly convex-concave $f$.
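A minimal sketch of the Robbins-Monro recursion above with the short steps $\gamma_t=a/t$. The strongly convex test problem (quadratic objective over a box, additive Gaussian gradient noise) and all parameters are illustrative assumptions, not from the talk:

```python
import numpy as np

def classical_sa(grad_oracle, project, x0, a, n_steps, rng):
    """Classical SA: x_{t+1} = Proj_X( x_t - (a/t) * G(x_t, xi_t) )."""
    x = x0.copy()
    for t in range(1, n_steps + 1):
        g = grad_oracle(x, rng)          # stochastic gradient G(x_t, xi_t)
        x = project(x - (a / t) * g)     # short step gamma_t = a/t
    return x                             # last search point, no averaging

# toy strongly convex problem: f(x) = 0.5*||x||^2 over the box [-1,1]^5,
# stochastic gradient = x + noise (hypothetical test problem)
rng = np.random.default_rng(0)
oracle = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
proj = lambda u: np.clip(u, -1.0, 1.0)
x_final = classical_sa(oracle, proj, np.ones(5), a=1.0, n_steps=5000, rng=rng)
```

With the modulus of strong convexity known (here it is 1, so $a=1$ is a proper choice), the last iterate approaches the minimizer at the $O(1/t)$ rate; misjudging that modulus is exactly what makes this scheme fragile.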
6 Classical Stochastic Approximation (continued)

Good news on Classical SA: with smooth strongly convex $f$ attaining its minimum in $\operatorname{int} X$, the rate of convergence $\mathbb{E}\{f(x_t)-\min_X f\} \le O(1/t)$ is the best allowed under the circumstances by the Laws of Statistics.

Bad news on Classical SA: it requires strong convexity of the objective $f$ and is highly sensitive to the scale factor $a$ in the stepsize rule $\gamma_t = a/t$. To choose $a$ properly, we need to know the modulus of strong convexity of $f$, and overestimating this modulus may dramatically slow down the convergence. The Stochastic Optimization community commonly treats SA as a highly unstable and impractical optimization technique.
7 Robust Stochastic Approximation

$$x_{t+1} = \arg\min_{u\in X}\left\{\tfrac12\|u-x_t\|_2^2 + \gamma_t\langle G(x_t,\xi_t),u\rangle\right\},\quad \gamma_t=\frac{a}{t}\qquad (CSA)$$

It was discovered long ago [Nem.&Yudin '79] that SA admits a robust version which does not require strong convexity of the objective in (SM). To make SA robust, it suffices
- to replace the short steps $\gamma_t = a/t$ with long steps $\gamma_t = a/\sqrt{t}$, and
- to add to (CSA) the averaging $\bar x_t = \left[\sum_{\tau=1}^t \gamma_\tau\right]^{-1}\sum_{\tau=1}^t \gamma_\tau x_\tau$ and to treat, as approximate solutions, the averages $\bar x_t$ rather than the search points $x_t$.

Another important improvement (in its rudimentary form going back to [Nem.&Yudin '79]) is to replace the quadratic prox term $\tfrac12\|u-x_t\|_2^2$ in (CSA) with a more general prox term, which allows one to adjust, to some extent, the method to the geometry of the problem of interest.
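The two modifications above (long steps plus averaging of the search points) can be sketched directly. The test problem, a convex but not strongly convex objective $f(x)=\mathbb{E}|x-\xi|$ minimized at the median, is an illustrative assumption:

```python
import numpy as np

def robust_sa(grad_oracle, project, x0, a, n_steps, rng):
    """Robust SA: long steps gamma_t = a/sqrt(t) plus weighted averaging."""
    x = x0.copy()
    num = np.zeros_like(x)
    den = 0.0
    for t in range(1, n_steps + 1):
        g = grad_oracle(x, rng)
        gamma = a / np.sqrt(t)           # long step
        x = project(x - gamma * g)
        num += gamma * x                 # accumulate gamma-weighted iterates
        den += gamma
    return num / den                     # report the average, not x itself

# toy convex (NOT strongly convex) problem: f(x) = E|x - xi|, xi ~ N(0,1),
# componentwise, minimized at the median 0 over X = [-2,2]^3 (illustrative)
rng = np.random.default_rng(1)
oracle = lambda x, rng: np.sign(x - rng.standard_normal(x.shape))
proj = lambda u: np.clip(u, -2.0, 2.0)
x_bar = robust_sa(oracle, proj, np.full(3, 2.0), a=0.5, n_steps=20000, rng=rng)
```

No strong convexity modulus enters the stepsize, which is the point: the averaged iterate converges at the $O(1/\sqrt{N})$ rate with only a scale parameter $a$ to choose.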
8 Robust Stochastic Approximation [Jud.&Lan&Nem.&Shap. '09]

$$\min_{x\in X}\max_{y\in Y} f(x,y),\qquad G(x,y,\xi):\ \mathbb{E}_\xi\{G(x,y,\xi)\}\in \partial_x f(x,y)\times\partial_y[-f(x,y)]$$

Setup for RSA: a norm $\|\cdot\|$ on $\mathbb{R}^n \supset Z = X\times Y$ and a convex continuous distance generating function $\omega(z): Z\to\mathbb{R}$ such that the set $Z^o = \{z\in Z: \partial\omega(z)\neq\emptyset\}$ is convex, and $\omega\big|_{Z^o}$ is $C^1$ and strongly convex:
$$\forall z,z'\in Z^o:\ \langle \omega'(z)-\omega'(z'),\, z-z'\rangle \ge \alpha\|z-z'\|^2\qquad [\alpha>0]$$

Robust Stochastic Approximation:
$$z_1 = z_\omega := \arg\min_Z \omega(\cdot)$$
$$z_{t+1} = \arg\min_{u\in Z}\left\{[\omega(u) - \langle\omega'(z_t),u\rangle] + \gamma_t\langle G(z_t,\xi_t),u\rangle\right\}$$
$$\bar z_t = \left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\sum_{\tau=1}^t\gamma_\tau z_\tau$$
9 Stochastic Mirror Prox [Jud.&Nem.&Tauvel '08]

$$\min_{x\in X}\max_{y\in Y} f(x,y),\qquad G(x,y,\xi):\ \mathbb{E}_\xi\{G(x,y,\xi)\}\in\partial_x f(x,y)\times\partial_y[-f(x,y)]$$

Stochastic Mirror Prox algorithm:
$$z_1 = z_\omega := \arg\min_Z\omega(\cdot)$$
$$w_t = \arg\min_{u\in Z}\left\{[\omega(u)-\langle\omega'(z_t),u\rangle] + \gamma_t\langle G(z_t,\xi_{2t-1}),u\rangle\right\}$$
$$z_{t+1} = \arg\min_{u\in Z}\left\{[\omega(u)-\langle\omega'(z_t),u\rangle] + \gamma_t\langle G(w_t,\xi_{2t}),u\rangle\right\}$$
$$\bar z_t = \left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\sum_{\tau=1}^t\gamma_\tau w_\tau$$
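The two prox-steps per iteration (an extrapolation step at $z_t$, then the actual update using the oracle at $w_t$) can be sketched on a bilinear saddle point $\min_x\max_y y^TAx$ over two simplexes with the entropy setup. The noisy oracle, matrix, stepsize, and iteration count below are illustrative assumptions, not the tuned choices of the talk:

```python
import numpy as np

def smp_bilinear(A, n_steps, gamma, rng, noise):
    """Stochastic Mirror Prox for min_x max_y y^T A x over two simplexes,
    entropy setup; additive Gaussian noise simulates the stochastic oracle."""
    m, n = A.shape
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    xs, ys, total = np.zeros(n), np.zeros(m), 0.0

    def prox(z, g):                       # entropy prox-step on a simplex
        w = z * np.exp(-gamma * g)
        return w / w.sum()

    def oracle(x, y):                     # noisy F(z) = [grad_x f; -grad_y f]
        gx = A.T @ y + noise * rng.standard_normal(n)
        gy = -(A @ x) + noise * rng.standard_normal(m)
        return gx, gy

    for _ in range(n_steps):
        gx, gy = oracle(x, y)             # first oracle call, at z_t
        wx, wy = prox(x, gx), prox(y, gy) # extrapolation point w_t
        gx, gy = oracle(wx, wy)           # second oracle call, at w_t
        x, y = prox(x, gx), prox(y, gy)   # update z_{t+1} from z_t
        xs += gamma * wx                  # average the w_t, not the z_t
        ys += gamma * wy
        total += gamma
    return xs / total, ys / total

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 30))
xb, yb = smp_bilinear(A, n_steps=2000, gamma=0.05, rng=rng, noise=0.1)
gap = float((A @ xb).max() - (A.T @ yb).min())   # ErrSad duality gap
```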
10 Saddle Point accuracy measure

$$\min_{x\in X}\max_{y\in Y} f(x,y)\qquad (SP)$$

Accuracy measure for (SP):
$$\mathrm{ErrSad}(x,y) = \max_{y'\in Y} f(x,y') - \min_{x'\in X} f(x',y).$$

Explanation: (SP) gives rise to two optimization programs with equal optimal values:
- Primal: $\min_{x\in X}\bar f(x)$, $\bar f(x)=\max_{y\in Y}f(x,y)$
- Dual: $\max_{y\in Y}\underline f(y)$, $\underline f(y)=\min_{x\in X}f(x,y)$

$\mathrm{ErrSad}(x,y)$ is the duality gap: the sum of non-optimalities of $x$ and $y$ as approximate solutions to the respective problems.
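For a bilinear game on simplexes the duality gap above is computable in closed form, since $\max_{y'}f(x,y')=\max_i(Ax)_i$ and $\min_{x'}f(x',y)=\min_j(A^Ty)_j$. A small sketch on a hypothetical $2\times 2$ game:

```python
import numpy as np

def err_sad_bilinear(A, x, y):
    """Duality gap ErrSad(x,y) for f(x,y) = y^T A x over two simplexes."""
    return float((A @ x).max() - (A.T @ y).min())

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])               # symmetric matching-pennies-like game
y = np.array([0.5, 0.5])
gap_opt = err_sad_bilinear(A, np.array([0.5, 0.5]), y)   # uniform is optimal
gap_bad = err_sad_bilinear(A, np.array([1.0, 0.0]), y)   # pure strategy is not
```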
11 RSA & SMP: Efficiency Estimates

$$\min_{x\in X}\max_{y\in Y}f(x,y)\ \ (SP),\qquad G(x,y,\xi):\ \mathbb{E}_\xi\{G(x,y,\xi)\}\in\partial_x f(x,y)\times\partial_y[-f(x,y)]$$

Theorem: Assume that $\mathbb{E}\{\|G(z,\xi)\|_*^2\}\le M^2$ for all $z\in Z=X\times Y$, and let $N\ge 1$. The $N$-step RSA with properly chosen constant steps $\gamma_t=\gamma$, $1\le t\le N$, ensures that
$$\mathbb{E}\{\mathrm{ErrSad}(\bar z_N)\}\le O(1)\,\Theta M/\sqrt{N},\qquad \Theta=\sqrt{2\max_{z\in Z}[\omega(z)-\omega(z_\omega)-\langle\omega'(z_\omega),z-z_\omega\rangle]/\alpha}.$$

With light-tail $\|G(z,\xi)\|_*/M$, the probabilities of large deviations in RSA admit exponential bounds:
$$\mathbb{E}\{\exp\{\|G(z,\xi)\|_*^2/M^2\}\}\le\exp\{1\}\ \forall z\in Z\ \Rightarrow\ \forall\Omega>1:\ \mathrm{Prob}\{\mathrm{ErrSad}(\bar z_N) > O(1)\Omega\Theta M/\sqrt{N}\}\le 2\exp\{-\Omega\}.$$

Similar results hold true for SMP.
12 Good Geometry case

Let $\operatorname{rint} Z \subset \operatorname{rint} Z^+$, where $Z^+$ is the direct product of $K$ basic blocks:
- $K_b$ unit Euclidean balls $B_i \subset E_i = \mathbb{R}^{n_i}$;
- $K_s$ spectahedrons $S_j \subset F_j$, where $F_j$ are spaces of symmetric $k_j\times k_j$ matrices of a given block-diagonal structure with the Frobenius inner product, and $S_j$ is the set of all positive semidefinite matrices from $F_j$ with unit trace.

A good setup for RSA/SMP is given by
- the norm $\|z\| = \left[\sum_{i=1}^{K_b}\|z^i\|_2^2 + \sum_{j=1}^{K_s}\|z_j\|_{\mathrm{tr}}^2\right]^{1/2}$, where $z^i$ are the ball components of $z$, $z_j$ the spectahedron components, and $\|\cdot\|_{\mathrm{tr}}$ the trace norm of a symmetric matrix;
- the dgf $\omega(z) = \sum_{i=1}^{K_b}\tfrac12\|z^i\|_2^2 + \sum_{j=1}^{K_s}\mathrm{Entropy}(z_j)$, where $\mathrm{Entropy}(u) = \sum_l \lambda_l(u)\ln\lambda_l(u)$, $\lambda_l(u)$ being the eigenvalues of a symmetric matrix $u$.
13 Good Geometry case (continued)

With the above setup, we have
$$\Theta = O(1)\sqrt{K_b + \sum_{j=1}^{K_s}\ln k_j}\ \Rightarrow\ \mathbb{E}\{\mathrm{ErrSad}(\bar z_N)\}\le O(1)\Big[K_b+\sum_{j=1}^{K_s}\ln k_j\Big]^{1/2} M/\sqrt N$$

Good news: When $K=O(1)$, the rate of convergence is nearly dimension-independent and, moreover, nearly the best allowed under the circumstances by the Laws of Statistics.

The computational cost of a step is dominated by the costs of a call to the SO and a prox-step $g\mapsto \arg\min_{z\in Z}\left[\langle g,z\rangle+\omega(z)\right]$:
- When $Z = Z^+$, the cost $C$ of a prox-step is $C=O(1)\big(\sum_i \dim(E_i)+\sum_j C_j\big)$, where $C_j$ is the cost of the eigenvalue decomposition of a matrix $A\in F_j$.
- When $Z$ is a simple product of $K$ balls and simplexes, we have $C=O(1)\dim Z$.
14 How it works: Convex Stochastic Programming

$$\min_{x\in X\subset\mathbb{R}^n} f(x):=\mathbb{E}_{\xi\sim P}\{F(x,\xi)\},\qquad G(x,\xi)\in\partial_x F(x,\xi),\ \mathbb{E}_{\xi\sim P}\{\|G(x,\xi)\|_*^2\}\le M^2\qquad (SM)$$

Till now, the workhorse of Convex Stochastic Minimization is the Sample Average Approximation method. In SAA, problem (SM) is replaced with its empirical counterpart
$$\min_x \hat f_N(x),\qquad \hat f_N(x)=\frac1N\sum_{i=1}^N F(x,\xi_i)\qquad [\xi_1,\dots,\xi_N:\ \text{sample drawn from } P].$$

In the Good Geometry case, the SAA sample size $N$ resulting, with probability close to 1, in an $\epsilon$-solution to (SM) is $O(1)\,n(M/\epsilon)^2$ [Shap. '03], which is by factor $O(n)$ worse than the number of steps of RSA/SMP yielding a solution of the same quality.
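The SAA construction above is just "draw the sample once, then minimize the empirical mean". A minimal sketch, with an illustrative integrand $F(x,\xi)=|x-\xi|$ that is not from the talk:

```python
import numpy as np

def saa_objective(F, xi_sample):
    """Sample Average Approximation: replace f(x) = E F(x, xi) by the
    empirical mean over a sample xi_1..xi_N drawn once in advance."""
    def f_N(x):
        return float(np.mean([F(x, xi) for xi in xi_sample]))
    return f_N

rng = np.random.default_rng(3)
sample = rng.standard_normal(1000)                    # xi_1..xi_N, N = 1000
f_N = saa_objective(lambda x, xi: abs(x - xi), sample)
```

Note the cost asymmetry this makes explicit: one evaluation of $\hat f_N$ (or its subgradient) touches all $N$ sample points, i.e., it is as expensive as $N$ stochastic-oracle calls of RSA/SMP.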
15 How it works (continued)

$$\min_{x\in X\subset\mathbb{R}^n} f(x):=\mathbb{E}_{\xi\sim P}\{F(x,\xi)\}\ \ (SM)\qquad \min_{x\in X}\hat f_N(x)=\frac1N\sum_{i=1}^N F(x,\xi_i)\ \ (SAA)$$

A single call to the first order oracle for $\hat f_N$ is as costly as $N$ calls to the SO, so RSA and SMP are much cheaper computationally than SAA. When $Z$ is simple, the overall computational effort in the $N$-step RSA/SMP is of the same order as the effort to compute a single subgradient of $\hat f_N(\cdot)$!

Our experimental results consistently demonstrate that while the solutions built by RSA and SAA with a common sample size $N$ are of nearly the same quality, RSA outperforms SAA by orders of magnitude in terms of running time.
16 How it works: Typical experiment

$$f_* = \min_x\left\{f(x)=\mathbb{E}_{\xi\sim\mathcal N(a,I_n)}\{\phi(\xi^Tx)\}:\ x\ge 0,\ \textstyle\sum_i x_i = 1\right\},\qquad \|a\|=1$$
$\phi(\cdot)$: piecewise linear convex loss function.

[Table comparing SAA, N-RSA, and E-RSA on $n=1000$ and $n=5000$ instances, reporting $f(\bar z_{4000})$, the gap $f(\bar z_{4000})-f_*$, and CPU time; the numeric entries were lost in transcription.]

- N-RSA: $\|\cdot\|=\|\cdot\|_1$, $\omega(x)=\sum_i x_i\ln x_i$ $\Rightarrow$ $\mathbb{E}\{f(\bar z_N)-f_*\}\le O(1)\sqrt{\ln n/N}$
- E-RSA: $\|\cdot\|=\|\cdot\|_2$, $\omega(x)=\tfrac12 x^Tx$ $\Rightarrow$ $\mathbb{E}\{f(\bar z_N)-f_*\}\le O(1)\sqrt{n/N}$
17 SMP vs RSA

$$\min_{x\in X}\max_{y\in Y}f(x,y)\ \ (SP),\qquad F(x,y):=\mathbb{E}_\xi\{G(x,y,\xi)\}\in\partial_x f(x,y)\times\partial_y[-f(x,y)]$$

Question: What, if any, are the advantages of SMP as compared to RSA?
Answer: SMP can utilize the fact that some components of $F(x,y)$ are regular and well-observable.

Assumption A:
$$\|F(z)-F(z')\|_*\le L\|z-z'\|\ \ \forall z,z'\in Z=X\times Y,\qquad \mathbb{E}_\xi\{\|G(z,\xi)-F(z)\|_*^2\}\le\sigma^2$$

Fact: Under A, the quantity $M^2=\sup_{z\in Z}\mathbb{E}_\xi\{\|G(z,\xi)\|_*^2\}$ can be as large as $O(1)(\Theta L+\sigma)^2$, so the efficiency of RSA cannot be better than
$$\mathbb{E}\{\mathrm{ErrSad}(\bar z^N_{\mathrm{RSA}})\}\le O(1)\,\Theta\,\frac{\Theta L+\sigma}{\sqrt N}$$
even when $\sigma=0$, i.e., $f\in C^{1,1}(L)$ and there is no noise at all!
18 SMP vs RSA (continued)

$$\min_{x\in X}\max_{y\in Y}f(x,y),\qquad F(x,y):=\mathbb{E}_\xi\{G(x,y,\xi)\}\in\partial_x f(x,y)\times\partial_y[-f(x,y)]$$
$$\|F(z)-F(z')\|_*\le L\|z-z'\|,\qquad \mathbb{E}_\xi\{\|G(z,\xi)-F(z)\|_*^2\}\le\sigma^2$$

Theorem: The $N$-step SMP with properly chosen constant stepsizes ensures that
$$\mathbb{E}\{\mathrm{ErrSad}(\bar z^N_{\mathrm{SMP}})\}\le O(1)\,\Theta\left[\frac{\Theta L}{N}+\frac{\sigma}{\sqrt N}\right]\qquad (!)$$
When $\mathbb{E}_\xi\{\exp\{\|G(z,\xi)-F(z)\|_*^2/\sigma^2\}\}\le\exp\{1\}$, for all $\Omega>0$ one has
$$\mathrm{Prob}\left\{\mathrm{ErrSad}(\bar z^N_{\mathrm{SMP}})>O(1)\left[\Theta\Big[\frac{\Theta L}{N}+\frac{\sigma}{\sqrt N}\Big]+\frac{\Omega\Theta\sigma}{\sqrt N}\right]\right\}\le\exp\{-\Omega^2/3\}+\exp\{-\Omega N\}$$

Good news: When $Z$ is the product of $O(1)$ high-dimensional balls, (!) is the best efficiency estimate achievable under the circumstances.
19 Applying SMP: Stochastic Feasibility problem

We want to solve a feasible system of simple and difficult convex constraints:
$$S_\mu(x)\le 0,\ 1\le\mu\le p;\qquad D_\nu(x)\le 0,\ 1\le\nu\le q;\qquad x\in X=\{x\in\mathbb{R}^n:\|x\|_2\le 1\}$$
- $S_\mu$ are given by a precise First Order oracle and are smooth:
$$\|S'_\mu(x)\|_2\le L_\mu,\qquad \|S'_\mu(x)-S'_\mu(x')\|_2\le \sqrt{\ln(p+q)}\,L_\mu\|x-x'\|_2.$$
- $D_\nu$ are given by a Stochastic Oracle. At the $i$-th call to the SO, $x\in X$ being the input, the SO returns unbiased estimates $h_\nu(x,\xi_i)$, $H_\nu(x,\xi_i)$ of $D_\nu(x)$, $D'_\nu(x)$, with independent $\xi_i\sim P$ and
$$\mathbb{E}_\xi\Big\{\max_\nu \frac{h_\nu^2(x,\xi)}{M_\nu^2}\Big\}\le 1,\qquad \max_\nu \mathbb{E}_\xi\Big\{\frac{\|H_\nu(x,\xi)\|_2^2}{M_\nu^2}\Big\}\le \ln(p+q).$$
20 Stochastic Feasibility problem (continued)

$$S_\mu(x)\le 0,\ 1\le\mu\le p;\quad D_\nu(x)\le 0,\ 1\le\nu\le q;\quad x\in X=\{x\in\mathbb{R}^n:\|x\|_2\le 1\}$$
$$S_\mu:\ \|S'_\mu\|_2\le L_\mu,\quad \|S'_\mu(x)-S'_\mu(x')\|_2\le\sqrt{\ln(p+q)}\,L_\mu\|x-x'\|_2$$
$$D_\nu(x)=\mathbb{E}_\xi\{h_\nu(x,\xi)\},\ \mathbb{E}_\xi\{\max_\nu h_\nu^2(x,\xi)/M_\nu^2\}\le 1;\quad D'_\nu(x)=\mathbb{E}_\xi\{H_\nu(x,\xi)\},\ \max_\nu\mathbb{E}_\xi\{\|H_\nu(x,\xi)\|_2^2/M_\nu^2\}\le\ln(p+q)$$

Properly applying the $N$-step SMP to the Saddle Point problem
$$\min_{x\in X}\ \max_{y\ge 0,\ \sum_i y_i=1}\left[\sum_{\mu=1}^p y_\mu\alpha_\mu S_\mu(x)+\sum_{\nu=1}^q y_{p+\nu}\alpha_{p+\nu}D_\nu(x)\right]$$
we build a solution $x^N\in X$ such that
$$S_\mu(x^N)\le\sqrt{\ln(p+q)}\,\frac{L_\mu}{N}\,\zeta,\qquad D_\nu(x^N)\le\sqrt{\ln(p+q)}\,\frac{M_\nu}{\sqrt N}\,\zeta,\qquad \zeta\ge 0,\ \mathbb{E}\{\zeta\}\le O(1).$$
21 Acceleration via Randomization

When solving large-scale well-structured convex optimization problems, e.g., the $\ell_1$ minimization problem
$$\min_x\{\|x\|_1:\|Ax-b\|\le\delta\}$$
with a large-scale and possibly dense matrix $A$, polynomial time methods, which generate high accuracy solutions, become prohibitively time consuming; when medium accuracy solutions are sought, our best choice is computationally cheap first order methods.

There are interesting and important cases when first order methods can be accelerated significantly by randomization: passing from computationally demanding deterministic first order oracles to their computationally cheap stochastic counterparts.
22 Example 1: Matrix Game

$$\min_{x\in\Delta_n}\max_{y\in\Delta_m} f(x,y):=y^TAx,\qquad \Delta_k=\{u\in\mathbb{R}^k_+:\textstyle\sum_i u_i=1\}\qquad (MG)$$
$$F(x,y):=[f'_x(x,y);-f'_y(x,y)]=[A^Ty;-Ax],\qquad \|A\|:=\max_{i,j}|A_{ij}|$$

When $A$ is large and dense, solving (MG) by polynomial time algorithms is prohibitively time consuming. The best deterministic First Order algorithms find an $\epsilon$-solution to (MG) in $O(1)\sqrt{\ln(m+n)}\,\|A\|/\epsilon$ steps, with $O(1)$ computations of $F(\cdot)$ per step. The total effort is $O(1)\sqrt{\ln(m+n)}\,mn\,\|A\|/\epsilon$ a.o. (arithmetic operations).

The matrix-vector multiplications $y\mapsto A^Ty$, $x\mapsto Ax$ required to compute $F(x,y)$ are easy to randomize: to build an unbiased estimate of $Bu=\sum_j u_jB_j$ ($B_j$: columns of $B$), draw at random a column index $\jmath$, $\mathrm{Prob}\{\jmath=j\}=|u_j|/\|u\|_1$, and return $\|u\|_1\,\mathrm{sign}(u_\jmath)B_\jmath$.
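The randomized matrix-vector product just described reads one column per call instead of all $mn$ entries. A minimal sketch, with the matrix and averaging count chosen only to verify unbiasedness empirically:

```python
import numpy as np

def random_column_estimate(B, u, rng):
    """Unbiased estimate of B @ u: draw column index j with probability
    |u_j| / ||u||_1 and return ||u||_1 * sign(u_j) * B[:, j]."""
    p = np.abs(u)
    s = p.sum()                          # ||u||_1
    j = rng.choice(len(u), p=p / s)      # one column read, not a full product
    return s * np.sign(u[j]) * B[:, j]

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 8))
u = rng.standard_normal(8)
# average many single-column estimates; the mean should approach B @ u
est = np.mean([random_column_estimate(B, u, rng) for _ in range(20000)], axis=0)
exact = B @ u
```

Each call costs $O(m)$ instead of $O(mn)$, which is the source of the sublinear-time behaviour claimed on the next slide.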
23 Matrix Game (continued)

$$\min_{x\in\Delta_n}\max_{y\in\Delta_m}y^TAx,\qquad\|A\|:=\max_{i,j}|A_{ij}|\qquad (MG)$$

Equipping (MG) with a Stochastic Oracle and applying RSA, we get, with confidence $\ge 1-\beta$, an $\epsilon$-solution to (MG) in $O(1)\ln(mn/\beta)(\|A\|/\epsilon)^2$ steps. A step reduces to extracting from $A$ a randomly chosen row and column, plus $O(m+n)$ a.o. overhead, so an $\epsilon$-solution costs $O(1)\ln(mn/\beta)(m+n)(\|A\|/\epsilon)^2$ a.o.

With $\epsilon$, $\|A\|$ fixed and $m=O(n)$ large, RSA
- outperforms by orders of magnitude the best known deterministic alternatives;
- exhibits sublinear time behaviour: an $\epsilon$-solution is found by inspecting just $O(1)\ln(mn/\beta)(m+n)(\|A\|/\epsilon)^2$ of the total of $mn$ data entries, cf. Grigoriadis & Khachiyan '95.
24 Application: $\ell_1$ minimization with uniform fit

Numerous sparsity-oriented Signal Processing problems reduce to (small series of) problems
$$\min_u\{\|Au-b\|_p:\|u\|_1\le 1\}\qquad[A:\ m\times n]\qquad(\ell_1)$$
When $p=\infty$, $(\ell_1)$ reduces to the Matrix Game
$$\min_{x=[u;v]\in\Delta_{2n}}\ \max_{y=[p;q]\in\Delta_{2m}}\ y^T\begin{bmatrix}A-b\mathbf 1^T & -A-b\mathbf 1^T\\[2pt] -A+b\mathbf 1^T & A+b\mathbf 1^T\end{bmatrix}[u;v]$$
so RSA can be used to solve $\ell_1$-minimization problems. When $m$, $n$ are really large (like $10^4$ and more) and $A$ is a general-type dense analytically given matrix, $(\ell_1)$ becomes prohibitively difficult for all known deterministic algorithms, and randomized algorithms become a kind of "last resort".
25 How it works

$$\mathrm{Opt}=\min_u\{\|Au-b\|_\infty:\|u\|_1\le 1\}$$
- $A$: dense analytically given $m\times n$ matrix
- $b$: $\|Au_*-b\|_\infty\le\delta=1.\mathrm{e}{-3}$ with a 16-sparse $u_*$, $\|u_*\|_1=1$

[Table comparing DMP and RSA over several sizes $m\times n$, reporting the errors $\|u_*-\tilde u\|_1$, $\|u_*-\tilde u\|_2$, $\|u_*-\tilde u\|_\infty$, CPU time in seconds, and the (equivalent) number of matrix-vector multiplications; the numeric entries were lost in transcription.]

DMP: Deterministic Mirror Prox; $\tilde u$: resulting solution; last column: (equivalent) number of matrix-vector multiplications.
26 How it works (continued)

[Figure: $\ell_1$-recovery. Blue: true signal; Cyan: MDSA recovery.]
27 Example 2: $\ell_1$-minimization with Least Squares fit

The Least Squares $\ell_1$ minimization problem
$$\min_u\{\|Au-b\|_2:\|u\|_1\le 1\}\qquad[A:\ m\times n]$$
reduces to the Saddle Point problem
$$\min_{x\in\Delta_n}\ \max_{\|y\|_2\le 1}\ y^T[Ax-b]\qquad[F(x,y)=[A^Ty;\,b-Ax]]$$

Randomizing the matrix-vector multiplications and applying RSA, we get, with confidence $\ge 1-\beta$, an $\epsilon$-solution with the overall effort of
$$O(1)\ln(mn/\beta)\big(\sqrt m\,\|C\|_2/\epsilon\big)^2\ \text{a.o.}\qquad(*)$$
where $\|C\|_2$ is the maximum of the $\|\cdot\|_2$-norms of the columns of $C=[A,b]$.

Bad news: the factor $m$ in $(*)$ reflects the high potential noisiness of the unbiased estimate $y\mapsto\|y\|_1\,\mathrm{sign}(y_\imath)[A_{\imath,:}]^T$ of $A^Ty$ on $\{y:\|y\|_2\le 1\}$: it may happen that $\big\|\,\|y\|_1\,\mathrm{sign}(y_\imath)[A_{\imath,:}]^T\big\|_2\approx\sqrt m\,\|C\|_2$.
28 $\ell_1$-minimization with Least Squares fit (continued)

$$\min_{x\in\Delta_n}\ \max_{\|y\|_2\le 1}\ y^T[Ax-b]\qquad[F(x,y)=[A^Ty;\,b-Ax]]$$

Remedy: the randomized preprocessing $[A,b]\mapsto UD[A,b]$, with an orthogonal $U$, $|U_{ij}|\le O(m^{-1/2})$, and a random diagonal $D$ with i.i.d. $D_{ii}\sim\mathrm{Uniform}\{-1;1\}$, results in an equivalent problem, preserves $\|C\|_2$, and reduces the noise in the estimate of $A^Ty$: now, with confidence $\ge 1-\beta$, for all $y$ with $\|y\|_2\le 1$,
$$\big\|\,\|y\|_1\,\mathrm{sign}(y_\imath)[A_{\imath,:}]^T\big\|_2\le O(1)\sqrt{\ln(mn/\beta)}\,\|C\|_2.$$
After preprocessing, RSA finds, with confidence $\ge 1-\beta$, an $\epsilon$-solution to the problem at the cost of $O(1)\ln^2(mn/\beta)(\|C\|_2/\epsilon)^2$ a.o. With a properly chosen $U$ (e.g., a Hadamard or DFT matrix), the preprocessing costs just $O(1)mn\ln m$ a.o.
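The remedy above can be sketched with a normalized Hadamard matrix as $U$ (so $|U_{ij}|=m^{-1/2}$) and random signs as $D$; the adversarial matrix with one huge row is an illustrative worst case for row sampling, not data from the talk:

```python
import numpy as np

def hadamard(m):
    """Normalized Hadamard matrix (m a power of 2): orthogonal, |U_ij| = m^{-1/2}."""
    H = np.array([[1.0]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(m)

def randomized_preprocess(A, b, rng):
    """[A, b] -> U D [A, b]: an equivalent least-squares problem with the
    same column norms, but with the rows 'spread out' so that single-row
    estimates of A^T y are far less noisy."""
    m = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=m)            # i.i.d. random signs
    U = hadamard(m)
    M = U @ (D[:, None] * np.hstack([A, b[:, None]]))
    return M[:, :-1], M[:, -1]

rng = np.random.default_rng(5)
A = np.zeros((8, 3))
A[0] = 10.0                        # one huge row: worst case for row sampling
b = rng.standard_normal(8)
A2, b2 = randomized_preprocess(A, b, rng)
row_spread = float(np.abs(A2).max())   # entries now O(m^{-1/2}) * column norm
```

Since $UD$ is orthogonal, column 2-norms (hence $\|C\|_2$) are preserved exactly, while the largest entry drops from 10 to $10/\sqrt 8\approx 3.54$.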
29 Example 3: Minimizing the maximum of convex polynomials over a simplex/$\ell_1$-ball

Consider the optimization problem
$$\min_{x\in\Delta_n}\max_{1\le k\le m}p_k(x),\qquad p_k(x)=\sum_{l=0}^d A_{kl}[x,\dots,x]\qquad (*)$$
- $A_{kl}[x^1,\dots,x^l]$: symmetric $l$-linear forms; $p_k(\cdot)$: convex.

We can reduce $(*)$ to the Saddle Point problem
$$\min_{x\in\Delta_n}\ \max_{y\in\Delta_m}\ \sum_k y_k p_k(x)$$
and randomize, in the aforementioned manner, the computation of $p_k(\cdot)$, $p'_k(\cdot)$: to build unbiased estimates $\tilde p_k(x)$, $\tilde P_k(x)$ of $p_k(x)$ and $P_k(x)=p'_k(x)$, $x\in\Delta_n$, we generate at random i.i.d. indices $j_1,\dots,j_d$, $\mathrm{Prob}\{j=\jmath\}=x_\jmath$, and set
$$\tilde p_k=\sum_{l=0}^d A_{kl}[e_{j_1},\dots,e_{j_l}],\qquad[\tilde P_k]_j=\sum_{l=1}^d l\,A_{kl}[e_{j_1},\dots,e_{j_{l-1}},e_j],\ 1\le j\le n.$$
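The value part of this randomization can be sketched for a single degree-2 polynomial $p(x)=A_0+A_1[x]+A_2[x,x]$: draw $j_1,j_2$ with probabilities $x_j$ and read one coefficient per degree. The polynomial data below are illustrative, not from the talk:

```python
import numpy as np

def poly_estimate(coeff_tensors, x, rng):
    """Unbiased estimate of p(x) = sum_{l=0}^d A_l[x,...,x] on the simplex:
    draw i.i.d. indices j_1..j_d with Prob{j = jj} = x_jj and read the single
    coefficients A_l[e_{j_1},...,e_{j_l}]."""
    d = len(coeff_tensors) - 1
    idx = rng.choice(len(x), size=d, p=x)           # j_1, ..., j_d
    est = coeff_tensors[0]                          # constant term A_0
    for l in range(1, d + 1):
        est += coeff_tensors[l][tuple(idx[:l])]     # one coefficient per degree
    return est

# toy degree-2 polynomial p(x) = 1 + c^T x + x^T Q x  (illustrative data)
rng = np.random.default_rng(6)
n = 5
c = rng.standard_normal(n)
Q = rng.standard_normal((n, n))
x = np.full(n, 1.0 / n)                             # a point of the simplex
exact = float(1 + c @ x + x @ Q @ x)
avg = float(np.mean([poly_estimate([1.0, c, Q], x, rng) for _ in range(40000)]))
```

Unbiasedness follows from the independence of the indices: $\mathbb{E}\,Q[j_1,j_2]=x^TQx$ and $\mathbb{E}\,c[j_1]=c^Tx$.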
30 Minimizing maximum of convex polynomials (continued)

$$\min_{x\in\Delta_n}\max_{1\le k\le m}\sum_{l=0}^d A_{kl}[x,\dots,x]\qquad (*)$$
- $\mathcal A$: maximum magnitude of the coefficients in the forms $A_{kl}[\cdot,\dots,\cdot]$

Applying RSA, we find, with confidence $\ge 1-\beta$, an $\epsilon$-solution to $(*)$ at the cost of $O(1)\ln(mn/\beta)\,d^2(\mathcal A/\epsilon)^2$ steps, with a step reducing to extracting $O(1)d(m+n)$ coefficients of the forms $A_{kl}[\cdot,\dots,\cdot]$, given the addresses $k,l,j_1,\dots,j_l$ of the coefficients, plus an $O(m+dn)$-a.o. overhead.
More informationApplications of Linear Programming
Applications of Linear Programming lecturer: András London University of Szeged Institute of Informatics Department of Computational Optimization Lecture 9 Non-linear programming In case of LP, the goal
More informationStochastic Gradient Descent with Only One Projection
Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine
More informationVincent Guigues FGV/EMAp, Rio de Janeiro, Brazil 1.1 Comparison of the confidence intervals from Section 3 and from [18]
Online supplementary material for the paper: Multistep stochastic mirror descent for risk-averse convex stochastic programs based on extended polyhedral risk measures Vincent Guigues FGV/EMAp, -9 Rio de
More informationE5295/5B5749 Convex optimization with engineering applications. Lecture 5. Convex programming and semidefinite programming
E5295/5B5749 Convex optimization with engineering applications Lecture 5 Convex programming and semidefinite programming A. Forsgren, KTH 1 Lecture 5 Convex optimization 2006/2007 Convex quadratic program
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationLecture 3: Huge-scale optimization problems
Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) March 9, 2012 Yu. Nesterov () Huge-scale optimization problems 1/32March 9, 2012 1
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More information10. Unconstrained minimization
Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationConstrained optimization
Constrained optimization DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Compressed sensing Convex constrained
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationRandom coordinate descent algorithms for. huge-scale optimization problems. Ion Necoara
Random coordinate descent algorithms for huge-scale optimization problems Ion Necoara Automatic Control and Systems Engineering Depart. 1 Acknowledgement Collaboration with Y. Nesterov, F. Glineur ( Univ.
More informationSGD and Randomized projection algorithms for overdetermined linear systems
SGD and Randomized projection algorithms for overdetermined linear systems Deanna Needell Claremont McKenna College IPAM, Feb. 25, 2014 Includes joint work with Eldar, Ward, Tropp, Srebro-Ward Setup Setup
More informationProximal Methods for Optimization with Spasity-inducing Norms
Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology
More informationFast proximal gradient methods
L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient
More informationPrimal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions
Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More informationPrimal-dual subgradient methods for convex problems
Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different
More informationLecture Note 5: Semidefinite Programming for Stability Analysis
ECE7850: Hybrid Systems:Theory and Applications Lecture Note 5: Semidefinite Programming for Stability Analysis Wei Zhang Assistant Professor Department of Electrical and Computer Engineering Ohio State
More informationLimited Memory Kelley s Method Converges for Composite Convex and Submodular Objectives
Limited Memory Kelley s Method Converges for Composite Convex and Submodular Objectives Madeleine Udell Operations Research and Information Engineering Cornell University Based on joint work with Song
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationAdaptive Restarting for First Order Optimization Methods
Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation
More informationGradient methods for minimizing composite functions
Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum
More informationWe describe the generalization of Hazan s algorithm for symmetric programming
ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,
More informationNon-asymptotic confidence bounds for the optimal value of a stochastic program
on-asymptotic confidence bounds for the optimal value of a stochastic program Vincent Guigues, Anatoli Juditsky, Arkadi emirovski To cite this version: Vincent Guigues, Anatoli Juditsky, Arkadi emirovski.
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationImproving the Convergence of Back-Propogation Learning with Second Order Methods
the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible
More informationAn accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems
An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract
More informationarxiv: v1 [math.oc] 5 Dec 2014
FAST BUNDLE-LEVEL TYPE METHODS FOR UNCONSTRAINED AND BALL-CONSTRAINED CONVEX OPTIMIZATION YUNMEI CHEN, GUANGHUI LAN, YUYUAN OUYANG, AND WEI ZHANG arxiv:141.18v1 [math.oc] 5 Dec 014 Abstract. It has been
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More information[2] (a) Develop and describe the piecewise linear Galerkin finite element approximation of,
269 C, Vese Practice problems [1] Write the differential equation u + u = f(x, y), (x, y) Ω u = 1 (x, y) Ω 1 n + u = x (x, y) Ω 2, Ω = {(x, y) x 2 + y 2 < 1}, Ω 1 = {(x, y) x 2 + y 2 = 1, x 0}, Ω 2 = {(x,
More informationSUPPORT VECTOR MACHINES VIA ADVANCED OPTIMIZATION TECHNIQUES. Eitan Rubinstein
SUPPORT VECTOR MACHINES VIA ADVANCED OPTIMIZATION TECHNIQUES Research Thesis Submitted in partial fulfillment of the requirements for the degree of Master of Science in Operations Research and System Analysis
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationADVANCES IN CONVEX OPTIMIZATION: CONIC PROGRAMMING. Arkadi Nemirovski
ADVANCES IN CONVEX OPTIMIZATION: CONIC PROGRAMMING Arkadi Nemirovski Georgia Institute of Technology and Technion Israel Institute of Technology Convex Programming solvable case in Optimization Revealing
More informationSemidefinite Programming
Semidefinite Programming Basics and SOS Fernando Mário de Oliveira Filho Campos do Jordão, 2 November 23 Available at: www.ime.usp.br/~fmario under talks Conic programming V is a real vector space h, i
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More information