Primal-dual subgradient methods for convex problems
Yu. Nesterov

March 2002, September 2005 (after revision)

Abstract

In this paper we present a new approach for constructing subgradient schemes for different types of nonsmooth problems with convex structure. Our methods are primal-dual since they are always able to generate a feasible approximation to the optimum of an appropriately formulated dual problem. Besides other advantages, this useful feature provides the methods with a reliable stopping criterion. The proposed schemes differ from the classical approaches (divergent-series methods, mirror descent methods) by the presence of two control sequences. The first sequence is responsible for aggregating the support functions in the dual space, and the second one establishes a dynamically updated scale between the primal and dual spaces. This additional flexibility allows us to guarantee boundedness of the sequence of primal test points even in the case of an unbounded feasible set (however, we always assume uniform boundedness of the subgradients). We present variants of subgradient schemes for nonsmooth convex minimization, minimax problems, saddle point problems, variational inequalities, and stochastic optimization. In all situations our methods are proved to be optimal from the viewpoint of worst-case black-box lower complexity bounds.

Keywords: convex optimization, subgradient methods, non-smooth optimization, minimax problems, saddle points, variational inequalities, stochastic optimization, black-box methods, lower complexity bounds.

CORE Discussion Paper # 2005/67. Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium; nesterov@core.ucl.ac.be. This paper presents research results of the Belgian Program on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office. The scientific responsibility rests with its author.
1 Introduction

1.1 Prehistory

The results presented in this paper are not very new. Most of them were obtained by the author in 2001-2002. However, a further purification of the developed framework led to rather surprising results related to the smoothing technique. Namely, in [3] it was shown that many nonsmooth convex minimization problems with an appropriate explicit structure can be solved up to absolute accuracy ϵ in O(1/ϵ) iterations of special gradient-type methods. Recall that the exact lower complexity bound for any black-box subgradient scheme was established on the level of O(1/ϵ²) iterations (see [ ], or [2], Section 3.1, for a more recent exposition). Thus, in [3] it was shown that the gradient-type methods of Structural Optimization outperform the black-box ones by an order of magnitude. At that moment, the author had the illusion that the importance of the black-box approach in Convex Optimization would irreversibly vanish and that this approach would finally be completely replaced by other ones based on a clever use of the problem's structure (interior-point methods, smoothing, etc.). This explains why the results included in this paper were not published at the time. However, the developments of the last years have clearly demonstrated that in some situations the black-box methods are irreplaceable. Indeed, the structure of a convex problem may be too complex for constructing a good self-concordant barrier or for applying a smoothing technique. Note also that optimization schemes are sometimes employed for modelling certain adjustment processes in real-life systems. In this situation, we are not free in selecting the type of optimization scheme or the choice of its parameters. However, the results on convergence and the rate of convergence of the corresponding methods remain interesting. These considerations encouraged the author to publish the above-mentioned results on primal-dual subgradient methods for nonsmooth convex problems.
Note that some elements of the developed technique were used by the author later on in different papers related to the smoothing approach (see [3], [4], [5]). Hence, by proper referencing we have tried to shorten this paper.

1.2 Motivation

Historically, a subgradient method with constant step was the first numerical scheme suggested for approaching a solution to the optimization problem

  min_x { f(x) : x ∈ R^n }   (1.1)

with nonsmooth convex objective function f (see [9] for historical remarks). In a convergent variant of this method [7, 8] we need to choose in advance a sequence of steps {λ_k}_{k=0}^∞ satisfying the divergent-series rule:

  λ_k > 0,  λ_k → 0,  Σ_{k=0}^∞ λ_k = ∞.   (1.2)

(This unexpected observation also resulted in a series of papers of different authors related to improved methods for variational inequalities [10], [5], and [ ].)
Then, for k ≥ 0 we can iterate:

  x_{k+1} = x_k − λ_k g_k,  k ≥ 0,   (1.3)

with some g_k ∈ ∂f(x_k). In another variant of this scheme, the step direction is normalized:

  x_{k+1} = x_k − λ_k g_k / ‖g_k‖_2,  k ≥ 0,   (1.4)

where ‖·‖_2 denotes the standard Euclidean norm in R^n, introduced by the inner product ⟨x, y⟩ = Σ_{j=1}^n x^(j) y^(j), x, y ∈ R^n.

Since the objective function f is nonsmooth, we cannot expect its subgradients to vanish in a neighborhood of an optimal solution to (1.1). Hence, the condition λ_k → 0 is necessary for convergence of the processes (1.3) or (1.4). On the other hand, assuming that ‖g_k‖_2 ≤ L, for any x ∈ R^n we get from (1.3)

  ‖x − x_{k+1}‖_2² = ‖x − x_k‖_2² − 2λ_k ⟨g_k, x − x_k⟩ + λ_k² ‖g_k‖_2²
                   ≤ ‖x − x_k‖_2² − 2λ_k ⟨g_k, x − x_k⟩ + λ_k² L².

Hence, for any x ∈ R^n with (1/2)‖x − x_0‖_2² ≤ D,

  f(x) ≥ l_k(x) def= Σ_{i=0}^k λ_i [f(x_i) + ⟨g_i, x − x_i⟩] / Σ_{i=0}^k λ_i
       ≥ { Σ_{i=0}^k λ_i f(x_i) − D − (1/2) L² Σ_{i=0}^k λ_i² } / Σ_{i=0}^k λ_i.   (1.5)

Thus, denoting

  f*_D = min_x { f(x) : (1/2)‖x − x_0‖_2² ≤ D },
  f̄_k = Σ_{i=0}^k λ_i f(x_i) / Σ_{i=0}^k λ_i,
  ω_k = ( 2D + L² Σ_{i=0}^k λ_i² ) / ( 2 Σ_{i=0}^k λ_i ),

we conclude that f̄_k − f*_D ≤ ω_k. Note that the conditions (1.2) are necessary and sufficient for ω_k → 0.

In the above analysis, convergence of the process (1.3) is based on the fact that the derivative of l_k(x), the lower linear model of the objective function, is vanishing. This model is very important since it also provides us with a reliable stopping criterion. However, examining the structure of the linear function l_k(·), we can observe a very strange feature:

  New subgradients enter the model with decreasing weights.   (1.6)

This feature contradicts common sense. It also contradicts the general principles of iterative schemes, in accordance with which new information is more important than old. Thus, something is wrong. Unfortunately, in our situation a simple treatment is hardly
possible: we have seen that decreasing weights (steps) are necessary for convergence of the primal sequence {x_k}_{k=0}^∞.

The above contradiction served as a point of departure for the developments presented in this paper. The proposed alternative looks quite natural. Indeed, we have seen that in the primal space it is necessary to have a vanishing sequence of steps, but in the dual space (the space of linear functions) we would like to apply non-decreasing weights. Consequently, we need two different sequences of parameters, each of which is responsible for certain processes in the primal and dual spaces.

The idea to relate the primal minimization sequence to a master process existing in the dual space is not new. It was implemented first in the mirror descent methods (see [ ], and [5, 6] for newer versions and historical remarks). However, the divergent series somehow penetrated this approach too. Therefore, in the Euclidean situation the mirror descent method coincides with the subgradient one and, consequently, shares the drawback (1.6). In our approach, we managed to improve the situation. Of course, it was impossible to improve the dependence of the complexity estimates on ϵ. However, the improvement in the constant is significant: the best step-size strategy in [5], [6] gives the same (infinite) constant in the estimate of the rate of convergence as the worst strategy for the new schemes (see Section 8 for discussion).

In this paper we consider primal-dual subgradient schemes. It seems that this intrinsic feature of all subgradient methods has not yet been recognized explicitly. Consider, for example, the scheme (1.3). From inequality (1.5), it is clear that

  f*_D ≥ f̂_k(D) def= min_x { l_k(x) : (1/2)‖x − x_0‖_2² ≤ D }.

Note that the value f̂_k(D) can be easily computed. On the other hand, in Convex Optimization there is only one way to get a lower bound for the optimal value of a minimization problem. For that, we need to find a feasible solution to a certain dual problem.
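The classical divergent-series process (1.3) and the estimate f̄_k − f*_D ≤ ω_k can be checked numerically. Below is a minimal sketch; the test function f(x) = ‖x‖₁ and the step constants are illustrative choices, not taken from the paper.

```python
import numpy as np

def subgrad(x):                          # an element of df(x) for f(x) = ||x||_1
    return np.sign(x) + (x == 0)

x = np.array([2.0, -1.0])
x0 = x.copy()
L = np.sqrt(2.0)                         # ||g||_2 <= sqrt(n) for this f
D = 0.5 * np.dot(x0, x0)                 # x* = 0 satisfies (1/2)||x* - x0||^2 <= D
num = den = lam2 = 0.0
for k in range(5000):
    lam = 1.0 / (L * np.sqrt(k + 1))     # steps satisfying the rule (1.2)
    num += lam * np.sum(np.abs(x))       # accumulate lam_i * f(x_i)
    den += lam
    lam2 += lam * lam
    x = x - lam * subgrad(x)             # iteration (1.3)

f_bar = num / den                        # weighted average of f(x_i)
omega = (2 * D + L**2 * lam2) / (2 * den)
print(f_bar, omega)                      # f_bar - f*_D is bounded by omega
```

Here f*_D = 0, so the printed average is itself bounded by ω_k, which decays like O(log k / √k) for this step-size choice.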
Thus, computability of f̂_k(D) implies that we are able to point out a dual solution. And convergence of the primal-dual gap f̄_k − f̂_k(D) to zero implies that the dual solution approaches the optimal one. 2 Below, we will discuss in detail the meaning of the dual solutions generated by the proposed schemes.

Finally, note that the subgradient schemes proposed in this paper are different from the standard search methods. 3 For example, for problem (1.1) we suggest to use the following process:

  x_{k+1} = arg min_{x∈R^n} { (1/(k+1)) Σ_{i=0}^k [f(x_i) + ⟨g_i, x − x_i⟩] + μ_k ‖x − x_0‖_2² },   (1.7)

where μ_k = O(1/√k) → 0. Note that in this scheme no artificial elements are needed to ensure the boundedness of the sequence of test points.

2 Depending on the available information on the structure of the problem, this dual problem can be posed in different forms. Therefore, we sometimes prefer to call such problems the adjoint ones.
3 We mean the methods described in terms of steps in the primal space. The alternative is the mirror descent approach [ ], in accordance with which the main process is arranged in the dual space.
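For the unconstrained Euclidean case, the auxiliary problem in (1.7) has a closed-form solution, x_{k+1} = x_0 − (1/(2 μ_k (k+1))) Σ_{i=0}^k g_i, so the process is cheap to run. A minimal sketch, with an illustrative test function f(x) = max_j |x^(j)| and μ_k = 1/√(k+1) (both assumptions, not from the paper):

```python
import numpy as np

def f(x):
    return np.max(np.abs(x))

def subgrad(x):                          # one subgradient of f at x
    j = int(np.argmax(np.abs(x)))
    g = np.zeros_like(x)
    g[j] = 1.0 if x[j] >= 0 else -1.0
    return g

x0 = np.array([3.0, -2.0, 1.0])
x, s = x0.copy(), np.zeros_like(x0)
x_sum = np.zeros_like(x0)
N = 20000
for k in range(N):
    x_sum += x
    s += subgrad(x)                      # aggregated subgradients (weight 1)
    mu = 1.0 / np.sqrt(k + 1.0)          # mu_k = O(1/sqrt(k)) -> 0
    x = x0 - s / (2.0 * mu * (k + 1))    # closed-form minimizer in (1.7)

f_avg = f(x_sum / N)
print(f_avg)                             # approaches min f = 0
```

Note that, exactly as claimed in the text, no projection or step-size truncation is needed: the quadratic term alone keeps the test points bounded.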
1.3 Contents

The paper is organized as follows. In Section 2 we consider a general scheme of Dual Averaging (DA) and prove the upper bounds for the corresponding gap functions. These bounds will be specified later on for the particular problem classes. We also give two main variants of DA-methods, the method of Simple Dual Averages (SDA) and the method of Weighted Dual Averages. In Section 3 we apply DA-methods to minimization over simple sets. In Section 3.1 we consider different forms of DA-methods as applied to a general minimization problem. The main goal of Section 3 is to show that all DA-methods are primal-dual. For that, we always consider a kind of dual problem and point out the sequence which converges to its solution. In Section 3.2 this is done for the general unconstrained minimization problem. In Section 3.3 we do that for minimax problems. It is interesting that the approximations of the dual multipliers in this case are proportional to the number of times the corresponding functional components were active during the minimization process (compare with [3]). Finally, in Section 3.4 we consider primal-dual schemes for solving minimization problems with simple functional constraints. We manage to obtain for such problems an approximate primal-dual solution despite the fact that usually even the complexity of computing the value of the dual objective function is comparable with the complexity of the initial problem. In Section 4 we apply the SDA-method to saddle point problems, and in Section 5 it is applied to variational inequalities. It is important that for solvable problems the SDA-method generates a bounded sequence even for an unbounded feasible set. In Section 6 we consider a stochastic version of the SDA-method. The output of this method can be seen as a random point (repeated runs of the algorithm give different results).
However, we prove that the expected value of the objective function at the output points converges to the optimal value of the stochastic optimization problem with a certain rate. Finally, in Section 7 we consider two applications of DA-methods to modelling. In Section 7.1 it is shown that a natural adjustment strategy (which we call balanced development) can be seen as an implementation of a DA-method. Hence, it is possible to prove that in the limit this process converges to a solution of some optimization problem. In Section 7.2, for a multi-period optimization problem we compare the efficiency of static strategies based on preliminary planning with a certain dynamic adjustment process (based on the SDA-method). It is clear that the dynamic adjustment is computationally much cheaper. On the other hand, we show that its results are comparable with the results of preliminary planning based on complete knowledge of the future. We conclude the paper with a short discussion of the results in Section 8. In the Appendix we put the proof of the only result on strongly convex functions for which we did not find an appropriate reference in the comprehensive monographs [8] and [7].

1.4 Notations and generalities

Let E be a finite-dimensional real vector space and E* be its dual. We denote the value of a linear function s ∈ E* at x ∈ E by ⟨s, x⟩. For measuring distances in E, let us fix some (primal) norm ‖·‖. This norm defines a system of primal balls:

  B_r(x) = { y ∈ E : ‖y − x‖ ≤ r }.
The dual norm on E* is introduced, as usual, by

  ‖s‖* = max_x { ⟨s, x⟩ : x ∈ B_1(0) },  s ∈ E*.

Let Q be a closed convex set in E. Assume that we know a prox-function d(x) of the set Q. This means that d(x) is a continuous function with domain belonging to Q, which is strongly convex on Q with respect to ‖·‖:

  d(αx + (1−α)y) ≤ αd(x) + (1−α)d(y) − (1/2)σα(1−α)‖x − y‖²,  x, y ∈ Q, α ∈ [0, 1],   (1.8)

where σ > 0 is the convexity parameter. Denote by x_0 the prox-center of the set Q:

  x_0 = arg min_x { d(x) : x ∈ Q }.   (1.9)

Without loss of generality, we assume that d(x_0) = 0. In view of Lemma 6 in the Appendix, the prox-center is well-defined and

  d(x) ≥ (1/2)σ‖x − x_0‖²,  x ∈ Q.

An important example is the standard Euclidean norm:

  ‖x‖ = ‖x‖_2 = ⟨x, x⟩^{1/2},  d(x) = (1/2)‖x − x_0‖_2²,  σ = 1,

with some x_0 ∈ Q. For another example, consider the ℓ_1-norm:

  ‖x‖ = ‖x‖_1 = Σ_{i=1}^n |x^(i)|.   (1.10)

Define Q = Δ_n def= { x ∈ R^n_+ : Σ_{i=1}^n x^(i) = 1 }. Then the entropy function

  d(x) = ln n + Σ_{i=1}^n x^(i) ln x^(i)

is strongly convex on Q with σ = 1 and x_0 = (1/n, ..., 1/n)^T (see Lemma 3 in [3]).

2 Main algorithmic schemes

Let Q be a closed convex set in E endowed with a prox-function d(x). We allow Q to be unbounded (for example, Q ≡ E). For our analysis we need to define two support-type functions of the set Q:

  ξ_D(s) = max_x { ⟨s, x − x_0⟩ : x ∈ Q, d(x) ≤ D },
  V_β(s) = max_x { ⟨s, x − x_0⟩ − βd(x) : x ∈ Q },   (2.1)

where D ≥ 0 and β > 0 are some parameters. The first function is the usual support function of the set

  F_D = { x ∈ Q : d(x) ≤ D }.
The second one is a proximal-type approximation of the support function of the set Q. Since d(·) is strongly convex, for any positive D and β we have dom ξ_D = dom V_β = E*. Note that both functions are nonnegative. Let us mention some properties of the function V_β(·). If β_2 ≥ β_1 > 0, then for any s ∈ E* we have

  V_{β_2}(s) ≤ V_{β_1}(s).   (2.2)

Note that the level of smoothness of the function V_β(·) is controlled by the parameter β.

Lemma 1. The function V_β(·) is convex and differentiable on E*. Moreover, its gradient is Lipschitz continuous with constant 1/(βσ):

  ‖∇V_β(s_1) − ∇V_β(s_2)‖ ≤ (1/(βσ)) ‖s_1 − s_2‖*,  s_1, s_2 ∈ E*.   (2.3)

For any s ∈ E*, the point π_β(s) belongs to Q and

  ∇V_β(s) = π_β(s) − x_0,  π_β(s) def= arg max_{x∈Q} { ⟨s, x − x_0⟩ − βd(x) }.   (2.4)

(This is a particular case of Theorem 1 in [3].)

As a trivial corollary of (2.3) we get the following inequality:

  V_β(s + δ) ≤ V_β(s) + ⟨δ, ∇V_β(s)⟩ + (1/(2σβ)) ‖δ‖*²,  s, δ ∈ E*.   (2.5)

Note that in view of definition (1.9) we have π_β(0) = x_0. This implies V_β(0) = 0 and ∇V_β(0) = 0. Thus, in this case inequality (2.5) with s = 0 yields

  V_β(δ) ≤ (1/(2σβ)) ‖δ‖*²,  δ ∈ E*.   (2.6)

In the sequel, we assume that the set Q is simple enough for computing the vector π_β(s) exactly. For our analysis we need the following relation between the functions (2.1).

Lemma 2. For any s ∈ E* and β ≥ 0 we have

  ξ_D(s) ≤ βD + V_β(s).   (2.7)

Proof: Indeed,

  ξ_D(s) = max_{x∈Q} { ⟨s, x − x_0⟩ : d(x) ≤ D }
         = max_{x∈Q} min_{β≥0} { ⟨s, x − x_0⟩ + β[D − d(x)] }
         ≤ min_{β≥0} max_{x∈Q} { ⟨s, x − x_0⟩ + β[D − d(x)] }
         ≤ βD + V_β(s).  □

Consider now the sequences

  X_k = {x_i}_{i=0}^k ⊂ Q,  G_k = {g_i}_{i=0}^k ⊂ E*,  Λ_k = {λ_i}_{i=0}^k ⊂ R_+.
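For the entropy prox-function on the simplex (the second example above), the mapping π_β(s) of (2.4) is available in closed form: maximizing ⟨s, x⟩ − βd(x) over Δ_n gives the softmax x^(i) ∝ exp(s^(i)/β). A minimal sketch of this computation:

```python
import numpy as np

def prox_entropy(s, beta):
    # pi_beta(s) = argmax_{x in simplex} { <s, x> - beta * d(x) } for the
    # entropy prox-function d(x) = ln n + sum_i x_i ln x_i: a softmax.
    z = s / beta
    z -= z.max()                  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

s = np.array([1.0, 2.0, 3.0])
x = prox_entropy(s, beta=0.5)
print(x.sum())                    # the result lies in the simplex
```

As β grows, the mapping flattens toward the prox-center x_0 = (1/n, ..., 1/n)^T, which is exactly the smoothing role of β described above.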
Typically, the test points x_i and the weights λ_i are generated by some algorithmic scheme, and the points g_i are computed by a black-box oracle G(·) related to a specific convex problem:

  g_i = G(x_i),  i ≥ 0.

In this paper we consider only the problem instances for which there exists a point x* ∈ Q satisfying the condition

  ⟨g, x − x*⟩ ≥ 0,  g = G(x),  x ∈ Q.   (2.8)

We call such x* the primal solution of our problem. At the same time, we are going to approximate a certain dual solution by means of the observed subgradients. In the sequel, we will explain the actual sense of this dual object for each particular problem class. We are going to approximate the primal and dual solutions of our problem using the following aggregates:

  S_{k+1} = Σ_{i=0}^k λ_i,  x̂_{k+1} = (1/S_{k+1}) Σ_{i=0}^k λ_i x_i,
  s_{k+1} = Σ_{i=0}^k λ_i g_i,  ŝ_{k+1} = (1/S_{k+1}) s_{k+1},   (2.9)

with x̂_0 = x_0 and s_0 = 0. As we will see later, the quality of the test sequence X_k can be naturally described by the following gap function:

  δ_k(D) = max_x { Σ_{i=0}^k λ_i ⟨g_i, x_i − x⟩ : x ∈ F_D },  D ≥ 0.   (2.10)

Using notation (2.9), we get an explicit representation of the gap:

  δ_k(D) = Σ_{i=0}^k λ_i ⟨g_i, x_i − x_0⟩ + ξ_D(−s_{k+1}).   (2.11)

Sometimes we will use an upper gap function

  Δ_k(β, D) = βD + Σ_{i=0}^k λ_i ⟨g_i, x_i − x_0⟩ + V_β(−s_{k+1})
            = Σ_{i=0}^k λ_i ⟨g_i, x_i − π_β(−s_{k+1})⟩ + β ( D − d(π_β(−s_{k+1})) ).   (2.12)

In view of (2.7) and (2.11), for any non-negative D and β we have

  δ_k(D) ≤ Δ_k(β, D).   (2.13)

Since Q is a simple set, the values of the gap functions can be easily computed. Note that for some D these values can be negative. However, if a solution x* of our problem exists (in the sense of (2.8)), then for D ≥ d(x*) (⇒ x* ∈ F_D),
the value δ_k(D) is non-negative independently of the sequences X_k, Λ_k and G_k involved in its definition.

Consider now the generic scheme of Dual Averaging (DA-scheme).

  Initialization: Set s_0 = 0 ∈ E*. Choose β_0 > 0.
  Iteration (k ≥ 0):
    1. Compute g_k = G(x_k).
    2. Choose λ_k > 0. Set s_{k+1} = s_k + λ_k g_k.
    3. Choose β_{k+1} ≥ β_k. Set x_{k+1} = π_{β_{k+1}}(−s_{k+1}).   (2.14)

Theorem 1. Let the sequences X_k, G_k and Λ_k be generated by (2.14). Then:

1. For any k ≥ 0 and D ≥ 0 we have:

  δ_k(D) ≤ Δ_k(β_{k+1}, D) ≤ β_{k+1} D + (1/(2σ)) Σ_{i=0}^k (λ_i²/β_i) ‖g_i‖*².   (2.15)

2. Assume that a solution x* in the sense (2.8) exists. Then

  (1/2) σ ‖x_{k+1} − x*‖² ≤ d(x*) + (1/(2σβ_{k+1})) Σ_{i=0}^k (λ_i²/β_i) ‖g_i‖*².   (2.16)

3. Assume that x* is an interior solution: B_r(x*) ⊆ F_D for some positive r and D. Then

  ‖ŝ_{k+1}‖* ≤ (1/(r S_{k+1})) [ β_{k+1} D + (1/(2σ)) Σ_{i=0}^k (λ_i²/β_i) ‖g_i‖*² ].   (2.17)

Proof: 1. In view of (2.13), we need to prove only the second inequality in (2.15). By the rules of the scheme (2.14), for i ≥ 1 we obtain

  V_{β_{i+1}}(−s_{i+1}) ≤ V_{β_i}(−s_{i+1})   (by (2.2))
    ≤ V_{β_i}(−s_i) − λ_i ⟨g_i, ∇V_{β_i}(−s_i)⟩ + (λ_i²/(2σβ_i)) ‖g_i‖*²   (by (2.5))
    = V_{β_i}(−s_i) + λ_i ⟨g_i, x_0 − x_i⟩ + (λ_i²/(2σβ_i)) ‖g_i‖*²   (by (2.4)).

Thus,

  λ_i ⟨g_i, x_i − x_0⟩ ≤ V_{β_i}(−s_i) − V_{β_{i+1}}(−s_{i+1}) + (λ_i²/(2σβ_i)) ‖g_i‖*²,  i = 1, ..., k.
The summation of all these inequalities results in

  Σ_{i=1}^k λ_i ⟨g_i, x_i − x_0⟩ ≤ V_{β_1}(−s_1) − V_{β_{k+1}}(−s_{k+1}) + (1/(2σ)) Σ_{i=1}^k (λ_i²/β_i) ‖g_i‖*².   (2.18)

But in view of (2.6),

  V_{β_1}(−s_1) ≤ (λ_0²/(2σβ_1)) ‖g_0‖*² ≤ (λ_0²/(2σβ_0)) ‖g_0‖*².

Thus, (2.18) results in (2.15).

2. Let us assume that x* exists. Then, by (2.8) and the above estimates,

  0 ≤ Σ_{i=0}^k λ_i ⟨g_i, x_i − x*⟩
    ≤ ⟨s_{k+1}, x_0 − x*⟩ − V_{β_{k+1}}(−s_{k+1}) + (1/(2σ)) Σ_{i=0}^k (λ_i²/β_i) ‖g_i‖*²
    = ⟨s_{k+1}, x_{k+1} − x*⟩ + β_{k+1} d(x_{k+1}) + (1/(2σ)) Σ_{i=0}^k (λ_i²/β_i) ‖g_i‖*²   (by (2.1))
    ≤ β_{k+1} d(x*) − (1/2) β_{k+1} σ ‖x_{k+1} − x*‖² + (1/(2σ)) Σ_{i=0}^k (λ_i²/β_i) ‖g_i‖*²   (by (9.5)),

and that is (2.16).

3. Let us now assume that x* is an interior solution. Then

  δ_k(D) = max_x { Σ_{i=0}^k λ_i ⟨g_i, x_i − x⟩ : x ∈ F_D } ≥ max_x { Σ_{i=0}^k λ_i ⟨g_i, x_i − x⟩ : x ∈ B_r(x*) }
         = Σ_{i=0}^k λ_i ⟨g_i, x_i − x*⟩ + r ‖s_{k+1}‖* ≥ r S_{k+1} ‖ŝ_{k+1}‖*.

Thus, (2.17) follows from (2.15).  □

The form of inequalities (2.15) and (2.16) suggests some natural strategies for choosing the parameters β_i in the scheme (2.14). Let us define the following sequence:

  β̂_0 = β̂_1 = 1,  β̂_{i+1} = β̂_i + 1/β̂_i,  i ≥ 1.   (2.19)

The advantage of this sequence is justified by the following relation:

  β̂_{k+1} = Σ_{i=0}^k 1/β̂_i,  k ≥ 0.

Thus, this sequence can be used for balancing the terms appearing in the right-hand side of inequality (2.15). The growth of the sequence can be estimated as follows.

Lemma 3.

  √(2k−1) ≤ β̂_k ≤ 1/(1+√3) + √(2k−1),  k ≥ 1.   (2.20)

Proof: From (2.19) we have β̂_1 = 1 and β̂²_{k+1} = β̂²_k + 2 + 1/β̂²_k ≥ β̂²_k + 2 for k ≥ 1. This gives the first inequality in (2.20). On the other hand, since the function β + 1/β is increasing for β ≥ 1 and β̂_k ≥ 1, the second inequality in (2.20) can be justified by induction.  □

We will consider two main strategies for choosing λ_i in (2.14):
  Simple averages: λ_k = 1.
  Weighted averages: λ_k = 1/‖g_k‖*.

Let us write down the corresponding algorithmic schemes in an explicit form. Both theorems below are straightforward consequences of Theorem 1.

Method of Simple Dual Averages (SDA):

  Initialization: Set s_0 = 0 ∈ E*. Choose γ > 0.
  Iteration (k ≥ 0):
    1. Compute g_k = G(x_k). Set s_{k+1} = s_k + g_k.
    2. Choose β_{k+1} = γ β̂_{k+1}. Set x_{k+1} = π_{β_{k+1}}(−s_{k+1}).   (2.21)

Theorem 2. Assume that ‖g_k‖* ≤ L, k ≥ 0. For method (2.21) we have S_{k+1} = k+1 and

  δ_k(D) ≤ β̂_{k+1} ( γD + L²/(2σγ) ).

Moreover, if a solution x* in the sense (2.8) exists, then the scheme (2.21) generates a bounded sequence:

  ‖x_k − x*‖² ≤ (2/σ) d(x*) + L²/(σ²γ²),  k ≥ 0.

Method of Weighted Dual Averages (WDA):

  Initialization: Set s_0 = 0 ∈ E*. Choose ρ > 0.
  Iteration (k ≥ 0):
    1. Compute g_k = G(x_k). Set s_{k+1} = s_k + g_k/‖g_k‖*.
    2. Choose β_{k+1} = β̂_{k+1}/(ρ√σ). Set x_{k+1} = π_{β_{k+1}}(−s_{k+1}).   (2.22)

Theorem 3. Assume that ‖g_k‖* ≤ L, k ≥ 0. For method (2.22) we have S_{k+1} ≥ (k+1)/L and

  δ_k(D) ≤ (β̂_{k+1}/√σ) ( D/ρ + ρ/2 ).

Moreover, if a solution x* in the sense (2.8) exists, then the scheme (2.22) generates a bounded sequence:

  ‖x_k − x*‖² ≤ (1/σ) ( 2d(x*) + ρ² ).

In the next sections we show how to apply the above results to different classes of problems with convex structure.
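The SDA-scheme (2.21) can be sketched in a few lines for the Euclidean setup d(x) = (1/2)‖x − x_0‖₂², σ = 1, Q = R^n, where the prox-mapping has the closed form π_β(−s) = x_0 − s/β. The test function f(x) = ‖x‖₁ below is an illustrative choice, not from the paper.

```python
import numpy as np

def G(x):                                  # subgradient oracle for f(x) = ||x||_1
    return np.sign(x) + (x == 0)

x0 = np.array([2.0, -1.5, 0.5])
L = np.sqrt(3.0)                           # ||g||_2 <= sqrt(n) for this f
D = 0.5 * np.dot(x0, x0)                   # d(x*) = D for x* = 0
gamma = L / np.sqrt(2.0 * D)               # the optimal choice gamma = L/sqrt(2*sigma*D)

beta_hat = 1.0                             # beta_hat_1 = 1, sequence (2.19)
s = np.zeros_like(x0)
x, x_sum = x0.copy(), np.zeros_like(x0)
N = 5000
for k in range(N):
    x_sum += x
    s += G(x)                              # lambda_k = 1 (simple averages)
    x = x0 - s / (gamma * beta_hat)        # x_{k+1} = pi_{gamma*beta_hat}(-s)
    beta_hat += 1.0 / beta_hat             # advance the sequence (2.19)

x_avg = x_sum / N                          # the aggregate x_hat of (2.9)
print(np.sum(np.abs(x_avg)))               # approaches min ||x||_1 = 0
```

By Theorem 2 the averaged value is within (β̂_{k+1}/(k+1))(γD + L²/(2γ)) ≈ L√(2D)·√(2/k) of the optimum, and the iterates stay bounded even though Q is the whole space.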
3 Minimization over simple sets

3.1 General minimization problem

Consider the following minimization problem:

  min_x { f(x) : x ∈ Q },   (3.1)

where f is a convex function defined on E and Q is a closed convex set. Recall that we assume Q to be simple, which means computability of exact optimal solutions to both minimization problems in (2.1). In order to solve problem (3.1), we need a black-box oracle which is able to compute a subgradient of the objective function at any test point: 4

  G(x) ∈ ∂f(x),  x ∈ E.

Then we have the following interpretation of the gap function δ_k(D). Denote

  l_k(x) = (1/S_{k+1}) Σ_{i=0}^k λ_i [f(x_i) + ⟨g_i, x − x_i⟩],  (g_i ∈ ∂f(x_i)),
  f̂_k(D) = min_x { l_k(x) : x ∈ F_D },  f*_D = min_x { f(x) : x ∈ F_D }.

Since f(·) is convex, in view of definition (2.10) of the gap function we have

  (1/S_{k+1}) δ_k(D) = (1/S_{k+1}) Σ_{i=0}^k λ_i f(x_i) − f̂_k(D) ≥ f(x̂_{k+1}) − f*_D.   (3.2)

Thus, we can justify the rate of convergence of methods (2.21) and (2.22) as applied to problem (3.1). In the estimates below we assume that ‖g‖* ≤ L for all g ∈ ∂f(x), x ∈ Q.

1. Simple averages. In view of Theorem 2 and inequalities (2.20), (3.2) we have

  f(x̂_{k+1}) − f*_D ≤ (β̂_{k+1}/(k+1)) ( γD + L²/(2σγ) ).   (3.3)

Note that the parameters D and L are not used explicitly in this scheme. However, their estimates are needed for choosing a reasonable value of γ. The optimal choice of γ is as follows:

  γ = L/√(2σD).

In the method of Simple Dual Averages (SDA), the accumulated lower linear model is remarkably simple:

  l_k(x) = (1/(k+1)) Σ_{i=0}^k [f(x_i) + ⟨g_i, x − x_i⟩].

4 Usually, such an oracle can also compute the value of the objective function. However, we do not need it in the algorithm.
Thus, its algorithmic scheme looks as follows:

  x_{k+1} = arg min_{x∈Q} { (1/(k+1)) Σ_{i=0}^k [f(x_i) + ⟨g_i, x − x_i⟩] + μ_k d(x) },  k ≥ 0,   (3.4)

where {μ_k}_{k=0}^∞ is a sequence of positive scaling parameters. In accordance with the rules of (2.21), we choose μ_k = β_{k+1}/(k+1). However, the rate of convergence of the method (3.4) remains on the same level for any μ_k = O(1/√k). Note that in (3.4) we do not actually need the function values.

Note that for method (3.4) we do not need Q to be bounded. For example, we can have Q = E. Nevertheless, if a solution x* of problem (3.1) does exist, then by Theorem 2 the generated sequence {x_k}_{k=0}^∞ is bounded. This feature may look surprising since μ_k → 0 and no special caution is taken in (3.4) to ensure the boundedness of the sequence.

2. Weighted averages. In view of Theorem 3 and inequalities (2.20), (3.2) we have

  f(x̂_{k+1}) − f*_D ≤ (β̂_{k+1} L/((k+1)√σ)) ( D/ρ + ρ/2 ).   (3.5)

As in (2.21), the parameters D and L are not used explicitly in method (2.22). But now, in order to choose a reasonable value of ρ, we need a reasonable estimate only for D. The optimal choice of ρ is as follows:

  ρ = √(2D).

3.2 Primal-dual problem

For the sake of notation, in this section we assume that the prox-center of the set Q is at the origin:

  x_0 = arg min_{x∈Q} d(x) = 0 ∈ E.   (3.6)

Let us assume that an optimal solution x* of problem (3.1) exists. Then, choosing D ≥ d(x*), we can rewrite this problem in the equivalent form:

  f* = min_x { f(x) : x ∈ F_D }.   (3.7)

Consider the conjugate function

  f*(s) = sup_x [ ⟨s, x⟩ − f(x) ].   (3.8)

Note that for any x ∈ E we have

  ∂f(x) ⊆ dom f*.   (3.9)

Since dom f = E, a converse representation to (3.8) is always valid:

  f(x) = max_s [ ⟨s, x⟩ − f*(s) : s ∈ dom f* ],  x ∈ E.

Hence, we can rewrite the problem (3.7) in a dual form:

  f* = min_{x∈F_D} max_{s∈dom f*} [ ⟨s, x⟩ − f*(s) ] = max_{s∈dom f*} min_{x∈F_D} [ ⟨s, x⟩ − f*(s) ]
     = max_{s∈dom f*} [ −ξ_D(−s) − f*(s) ].
Thus, we come to the following dual problem:

  −f* = min_s [ f*(s) + ξ_D(−s) : s ∈ dom f* ].   (3.10)

As usual, it is worthwhile to unify (3.1) and (3.10) in a single primal-dual problem:

  0 = min_{x,s} [ ψ_D(x, s) def= f(x) + f*(s) + ξ_D(−s) : x ∈ Q, s ∈ dom f* ].   (3.11)

Let us show that DA-methods converge to a solution of this problem.

Theorem 4. Let the pair (x̂_{k+1}, ŝ_{k+1}) be defined by (2.9) with sequences X_k, G_k and Λ_k generated by method (2.14) for problem (3.7). Then x̂_{k+1} ∈ Q, ŝ_{k+1} ∈ dom f*, and

  ψ_D(x̂_{k+1}, ŝ_{k+1}) ≤ (1/S_{k+1}) δ_k(D).   (3.12)

Proof: Indeed, the point x̂_{k+1} is feasible since it is a convex combination of feasible points. In view of inclusion (3.9), the same arguments also work for ŝ_{k+1}. Finally,

  ψ_D(x̂_{k+1}, ŝ_{k+1}) = ξ_D(−ŝ_{k+1}) + f(x̂_{k+1}) + f*(ŝ_{k+1})
    ≤ (1/S_{k+1}) [ ξ_D(−s_{k+1}) + Σ_{i=0}^k λ_i ( f(x_i) + f*(g_i) ) ]   (by (2.9) and convexity)
    = (1/S_{k+1}) [ ξ_D(−s_{k+1}) + Σ_{i=0}^k λ_i ⟨g_i, x_i⟩ ]   (by (3.8))
    = (1/S_{k+1}) δ_k(D)   (by (2.11), (3.6)).  □

Thus, for the corresponding methods, the right-hand sides of inequalities (3.3), (3.5) establish the rate of convergence of the primal-dual function in (3.12) to zero. In the case of an interior solution x*, the optimal solution of the dual problem (3.10) is attained at 0 ∈ E*. In this case the rate of convergence of ŝ_{k+1} to the origin is given by (2.17).

In this section, we have shown the abilities of DA-methods (2.14) on the general primal-dual problem (3.11). However, as we will see in the next sections, these schemes can provide us with much more detailed dual information. For that we need to employ the available information on the structure of the minimization problem.

3.3 Minimax problems

Consider the following variant of problem (3.1):

  min_x { f(x) = max_{1≤j≤p} f_j(x) : x ∈ Q },   (3.13)

where f_j(·), j = 1, ..., p, are convex functions defined on E. Of course, this problem admits a dual representation. Let us fix D ≥ d(x*). Recall that we denote by Δ_p the standard simplex in R^p:

  Δ_p = { y ∈ R^p_+ : Σ_{j=1}^p y^(j) = 1 }.
Then

  f* = min_{x∈Q} max_{1≤j≤p} f_j(x) = min_{x∈F_D} max_{y∈Δ_p} Σ_{j=1}^p y^(j) f_j(x) = max_{y∈Δ_p} min_{x∈F_D} Σ_{j=1}^p y^(j) f_j(x).

Thus, defining ϕ_D(y) = min_{x∈F_D} Σ_{j=1}^p y^(j) f_j(x), we obtain the dual problem

  f* = max_{y∈Δ_p} ϕ_D(y).   (3.14)

Let us show that the DA-schemes generate also an approximation to the optimal solution of the dual problem. For that, we need to employ the structure of the oracle G for the objective function in (3.13). Note that in our case

  ∂f(x) = Conv { ∂f_j(x) : j ∈ I(x) },  I(x) = { j : f_j(x) = f(x) }.

Thus, for any g_k in the method (2.14) as applied to the problem (3.13) we can define a vector y_k ∈ Δ_p such that

  y_k^(j) = 0, j ∉ I(x_k),  g_k = Σ_{j∈I(x_k)} y_k^(j) g_{k,j},

where g_{k,j} ∈ ∂f_j(x_k) for j ∈ I(x_k). Denote

  ŷ_{k+1} = (1/S_{k+1}) Σ_{i=0}^k λ_i y_i.   (3.15)

Theorem 5. Let the pair (x̂_{k+1}, ŷ_{k+1}) be defined by the sequences X_k, G_k and Λ_k generated by method (2.14) for problem (3.13). Then this pair is primal-dual feasible and

  0 ≤ f(x̂_{k+1}) − ϕ_D(ŷ_{k+1}) ≤ (1/S_{k+1}) δ_k(D).   (3.16)

Proof: Indeed, the pair (x̂_{k+1}, ŷ_{k+1}) is feasible in view of the convexity of the primal-dual set Q × Δ_p. Further, denote F(x) = (f_1(x), ..., f_p(x))^T ∈ R^p. Then in view of (3.15), for any k ≥ 0 we have

  ⟨g_k, x_k − x⟩ = Σ_{j∈I(x_k)} y_k^(j) ⟨g_{k,j}, x_k − x⟩ ≥ Σ_{j∈I(x_k)} y_k^(j) [ f_j(x_k) − f_j(x) ]
               = ⟨y_k, F(x_k) − F(x)⟩ = f(x_k) − ⟨y_k, F(x)⟩.

Therefore

  δ_k(D) = max_{x∈F_D} Σ_{i=0}^k λ_i ⟨g_i, x_i − x⟩ ≥ max_{x∈F_D} Σ_{i=0}^k λ_i [ f(x_i) − ⟨y_i, F(x)⟩ ]
         = Σ_{i=0}^k λ_i f(x_i) − S_{k+1} ϕ_D(ŷ_{k+1}) ≥ S_{k+1} [ f(x̂_{k+1}) − ϕ_D(ŷ_{k+1}) ].  □
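The dual aggregate (3.15) with λ_i = 1 and vertex vectors y_i = e_{j_i} is just a frequency count of active components. A minimal sketch on a toy instance (two linear components, a_1 = 1, a_2 = −1, so f(x) = |x| on Q = R with Euclidean prox-function and y* = (1/2, 1/2); all constants are illustrative, not from the paper):

```python
import numpy as np

a = np.array([1.0, -1.0])                 # gradients of the two linear components
x0, gamma = 2.0, 1.0
x, s = x0, 0.0
m = np.zeros(2, dtype=int)                # counters of active components
beta_hat = 1.0                            # sequence (2.19)
N = 5000
for k in range(N):
    jk = int(np.argmax(a * x))            # an active (maximal) component
    m[jk] += 1                            # frequency counter
    s += a[jk]                            # aggregate g_{k,j_k}
    x = x0 - s / (gamma * beta_hat)       # prox-step, closed form for d = (1/2)(x-x0)^2
    beta_hat += 1.0 / beta_hat

y_hat = m / N                             # dual approximation via (3.15)
print(y_hat)                              # approaches the dual solution (1/2, 1/2)
```

This is exactly the interpretation proved above: each dual multiplier is approximated by the fraction of iterations on which the corresponding component was detected as maximal.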
Let us write down an explicit form of the SDA-method as applied to problem (3.13). Denote by e_j the j-th coordinate vector in R^p.

  Initialization: Set l_0(x) ≡ 0, m_0 = 0 ∈ Z^p.
  Iteration (k ≥ 0):
    1. Choose any j_k : f_{j_k}(x_k) = f(x_k).
    2. Set l_{k+1}(x) = (k/(k+1)) l_k(x) + (1/(k+1)) [ f(x_k) + ⟨g_{k,j_k}, x − x_k⟩ ].
    3. Compute x_{k+1} = arg min_{x∈Q} { l_{k+1}(x) + (γ β̂_{k+1}/(k+1)) d(x) }.
    4. Update m_{k+1} = m_k + e_{j_k}.
  Output: x̂_{k+1} = (1/(k+1)) Σ_{i=0}^k x_i,  ŷ_{k+1} = (1/(k+1)) m_{k+1}.   (3.17)

In (3.17), each entry of the optimal dual vector is approximated by the frequency of detecting the corresponding functional component as the maximal one.

Note that the SDA-method can also be applied to a continuous minimax problem. In this case the objective function has the following form:

  f(x) = max_y { ⟨y, F(x)⟩ : y ∈ Q_d },

where Q_d is a bounded closed convex set in R^p. For this problem an approximation to the optimal dual solution is obtained as an average of the vectors y(x_k) defined by

  y(x) ∈ Arg max_y { ⟨y, F(x)⟩ : y ∈ Q_d }.

(Note that for computing y(x) we need to maximize a linear function over a convex set.) The corresponding modifications of the method (3.17) are straightforward.

3.4 Problems with simple functional constraints

Let us assume that the set Q in problem (3.1) has the following structure:

  Q = { x ∈ Q̄ : Ax = b, F(x) ≤ 0 },   (3.18)

where Q̄ is a closed convex set in E, dim E = n, b ∈ R^m, A is an m×n matrix, and F(x) : Q̄ → R^p is a component-wise convex function. Usually, depending on the importance of the corresponding functional components, we can freely decide whether they should be hidden
in the set Q̄ or not. In any case, the representation of the set Q in the form (3.18) is not unique; therefore we call the corresponding dual problem the adjoint one.

Let us introduce dual variables u ∈ R^m and v ∈ R^p_+. Assume that an optimal solution x* of (3.1), (3.18) exists and D ≥ d(x*). Denote

  ϕ_D(u, v) = min_x { f(x) + ⟨u, b − Ax⟩ + ⟨v, F(x)⟩ : x ∈ Q̄, d(x) ≤ D }.   (3.19)

In this definition d(x) is still a prox-function of the set Q̄ with prox-center x_0 ∈ Q̄. Then the adjoint problem to (3.1), (3.18) consists in

  max_{u,v} { ϕ_D(u, v) : u ∈ R^m, v ∈ R^p_+ }.   (3.20)

Note that the complexity of computing the objective function of this problem can be comparable with the complexity of the initial problem (3.1). Nevertheless, we will see that the DA-methods (2.14) are able to approach its optimal solutions. Let us consider two possibilities.

1. Let us fix D ≥ d(x*). Denote by û_k, v̂_k the optimal multipliers for the essential constraints in the following optimization problem:

  (1/S_{k+1}) δ_k(D) = max_{x∈Q̄} { (1/S_{k+1}) Σ_{i=0}^k λ_i ⟨g_i, x_i − x⟩ : Ax = b, F(x) ≤ 0, d(x) ≤ D }
                     = max_{x∈Q̄, d(x)≤D} min_{u∈R^m, v∈R^p_+} L̂(x, u, v),   (3.21)

  L̂(x, u, v) = (1/S_{k+1}) Σ_{i=0}^k λ_i ⟨g_i, x_i − x⟩ + ⟨u, Ax − b⟩ − ⟨v, F(x)⟩.

And let x̂_k be its optimal solution. Then for any x ∈ Q̄ with d(x) ≤ D we have

  ⟨ −(1/S_{k+1}) Σ_{i=0}^k λ_i g_i + A^T û_k − Σ_{j=1}^p v̂_k^(j) ψ_j, x − x̂_k ⟩ ≤ 0

with some ψ_j ∈ ∂F^(j)(x̂_k), j = 1, ..., p, and Σ_{j=1}^p v̂_k^(j) F^(j)(x̂_k) = 0. Therefore, since f and F are convex, for any such x we get

  f(x) + ⟨û_k, b − Ax⟩ + ⟨v̂_k, F(x)⟩
    ≥ (1/S_{k+1}) Σ_{i=0}^k λ_i [ f(x_i) + ⟨g_i, x − x_i⟩ ] + ⟨û_k, b − Ax⟩ + Σ_{j=1}^p v̂_k^(j) [ F^(j)(x̂_k) + ⟨ψ_j, x − x̂_k⟩ ]
    ≥ (1/S_{k+1}) Σ_{i=0}^k λ_i [ f(x_i) + ⟨g_i, x̂_k − x_i⟩ ] = (1/S_{k+1}) [ Σ_{i=0}^k λ_i f(x_i) − δ_k(D) ].
Thus, we have proved that

  0 ≤ f(x̂_{k+1}) − ϕ_D(û_k, v̂_k) ≤ (1/S_{k+1}) Σ_{i=0}^k λ_i f(x_i) − ϕ_D(û_k, v̂_k) ≤ (1/S_{k+1}) δ_k(D).   (3.22)

Hence, the quality of this primal-dual pair can be estimated by Item 1 of Theorem 1.

2. The above suggestion to form an approximate solution to the adjoint problem (3.20) from the dual multipliers of problem (3.21) has two drawbacks. First of all, it is necessary to guarantee that the auxiliary parameter D is sufficiently big. Secondly, problem (3.21) needs additional computational efforts. Let us show that an approximate solution to problem (3.20) can be obtained for free, using the objects necessary for finding the point π_{β_{k+1}}(−s_{k+1}). Denote by ū_k, v̄_k the optimal multipliers for the functional constraints in the following optimization problem:

  x_{k+1} = π_{β_{k+1}}(−s_{k+1}) = arg max_{x∈Q̄} { −(1/S_{k+1}) [ Σ_{i=0}^k λ_i ⟨g_i, x⟩ + β_{k+1} d(x) ] : Ax = b, F(x) ≤ 0 }
          = arg max_{x∈Q̄} min_{u∈R^m, v∈R^p_+} L(x, u, v),

  L(x, u, v) = −(1/S_{k+1}) [ Σ_{i=0}^k λ_i ⟨g_i, x⟩ + β_{k+1} d(x) ] + ⟨u, Ax − b⟩ − ⟨v, F(x)⟩.

Then for any x ∈ Q̄ we have

  ⟨ −(1/S_{k+1}) [ Σ_{i=0}^k λ_i g_i + β_{k+1} d' ] + A^T ū_k − Σ_{j=1}^p v̄_k^(j) ψ_j, x − x_{k+1} ⟩ ≤ 0,

with some d' ∈ ∂d(x_{k+1}) and ψ_j ∈ ∂F^(j)(x_{k+1}), j = 1, ..., p, and Σ_{j=1}^p v̄_k^(j) F^(j)(x_{k+1}) = 0. Therefore, for all such x we obtain

  f(x) + ⟨ū_k, b − Ax⟩ + ⟨v̄_k, F(x)⟩
    ≥ (1/S_{k+1}) Σ_{i=0}^k λ_i [ f(x_i) + ⟨g_i, x − x_i⟩ ] + ⟨ū_k, b − Ax⟩ + Σ_{j=1}^p v̄_k^(j) [ F^(j)(x_{k+1}) + ⟨ψ_j, x − x_{k+1}⟩ ]
    ≥ (1/S_{k+1}) { Σ_{i=0}^k λ_i [ f(x_i) + ⟨g_i, x_{k+1} − x_i⟩ ] + β_{k+1} ⟨d', x_{k+1} − x⟩ }
    ≥ (1/S_{k+1}) { Σ_{i=0}^k λ_i f(x_i) − β_{k+1} d(x) − [ Σ_{i=0}^k λ_i ⟨g_i, x_i − x_{k+1}⟩ − β_{k+1} d(x_{k+1}) ] }.   (3.23)
Since in definition (3.19) we need $d(x) \le D$, the estimate (3.23) results in

    \varphi_D\Big(\tfrac{\bar u_k}{\sum_{i=0}^k \lambda_i}, \tfrac{\bar v_k}{\sum_{i=0}^k \lambda_i}\Big)
      \ge \frac{1}{\sum_{i=0}^k \lambda_i}\Big(\sum_{i=0}^k \lambda_i f(x_i) - \beta_{k+1} D - \Big[\sum_{i=0}^k \lambda_i \langle g_i, x_i - x_{k+1}\rangle - \beta_{k+1} d(x_{k+1})\Big]\Big).   (3.24)

Hence, the quality of this primal-dual pair can be estimated by Item 1 of Theorem 1.

4 Saddle point problems

Consider the general saddle point problem:

    \min_{u \in U}\ \max_{v \in V}\ f(u,v),   (4.1)

where $U$ is a closed convex set in $E_u$, $V$ is a closed convex set in $E_v$, the function $f(\cdot, v)$ is convex in the first argument on $E_u$ for any $v \in V$, and the function $f(u, \cdot)$ is concave in the second argument on $E_v$ for any $u \in U$. For the set $U$ we assume existence of a prox-function $d_u(\cdot)$ with prox-center $u_0$, which is strongly convex on $U$ with respect to the norm $\|\cdot\|_u$ with convexity parameter $\sigma_u$. For the set $V$ we introduce similar assumptions.

Note that the point $x^* = (u^*, v^*) \in Q \overset{\mathrm{def}}{=} U \times V$ is a solution to problem (4.1) if and only if

    f(u^*, v) \le f(u^*, v^*) \le f(u, v^*) \qquad \forall u \in U,\ v \in V.

Since for any $x \overset{\mathrm{def}}{=} (u,v) \in Q$ we have

    f(u, v^*) \le f(u,v) + \langle g_v, v^* - v\rangle, \qquad g_v \in \partial_v f(u,v),
    f(u^*, v) \ge f(u,v) + \langle g_u, u^* - u\rangle, \qquad g_u \in \partial_u f(u,v),

we conclude that

    \langle g_u, u - u^*\rangle + \langle g_v, v^* - v\rangle \ \ge\ f(u, v^*) - f(u^*, v) \ \ge\ 0   (4.2)

for any $g = (g_u, g_v)$ from $\partial_u f(u,v) \times \partial_v f(u,v)$. Thus, the oracle

    G:\ x = (u,v) \in Q\ \mapsto\ g(x) = (g_u(x), -g_v(x))   (4.3)

satisfies condition (2.8).

Further, let us fix some $\alpha \in (0,1)$. Then we can introduce the following prox-function of the set $Q$:

    d(x) = \alpha\, d_u(u) + (1-\alpha)\, d_v(v).

In this case, the prox-center of $Q$ is $x_0 = (u_0, v_0)$. Defining the norm for $E = E_u \times E_v$ by

    \|x\| = \big[\alpha \sigma_u \|u\|_u^2 + (1-\alpha)\sigma_v \|v\|_v^2\big]^{1/2},

we get for the function $d(\cdot)$ the convexity parameter $\sigma = 1$. Note that the norm for the dual space $E^* = E_u^* \times E_v^*$ is defined now as

    \|g\|_* = \Big[\frac{1}{\alpha \sigma_u}\|g_u\|_{u,*}^2 + \frac{1}{(1-\alpha)\sigma_v}\|g_v\|_{v,*}^2\Big]^{1/2}.
Assuming now the partial subdifferentials of $f$ to be uniformly bounded,

    \|g_u\|_{u,*} \le L_u, \qquad \|g_v\|_{v,*} \le L_v, \qquad g \in \partial_u f(u,v) \times \partial_v f(u,v),\ (u,v) \in Q,

we obtain the following bound for the answers of the oracle $G$:

    \|g\|_*^2 \ \le\ L^2 \overset{\mathrm{def}}{=} \frac{L_u^2}{\alpha \sigma_u} + \frac{L_v^2}{(1-\alpha)\sigma_v}.   (4.4)

Finally, if $d_u(u^*) \le D_u$ and $d_v(v^*) \le D_v$, then $x^* = (u^*, v^*) \in F_D$ with

    D = \alpha D_u + (1-\alpha) D_v.   (4.5)

Let us now apply to (4.1) the SDA-method (2.2). In accordance with Theorem 2, we get the following bound:

    \frac{\delta_k(D)}{k+1} \ \le\ \frac{\hat\beta_{k+1}}{k+1}\,\Omega, \qquad \Omega \overset{\mathrm{def}}{=} \gamma D + \frac{L^2}{2\gamma},

with $L$ and $D$ defined by (4.4) and (4.5) respectively. With the optimal choice $\gamma = L(2D)^{-1/2}$ we obtain

    \Omega = \big[2 L^2 D\big]^{1/2}
           = \Big[2\Big(\frac{L_u^2}{\alpha\sigma_u} + \frac{L_v^2}{(1-\alpha)\sigma_v}\Big)\big(\alpha D_u + (1-\alpha) D_v\big)\Big]^{1/2}
           = \Big[2\Big(\frac{L_u^2 D_u}{\sigma_u} + \frac{L_v^2 D_v}{\sigma_v} + \frac{\alpha L_v^2 D_u}{(1-\alpha)\sigma_v} + \frac{(1-\alpha) L_u^2 D_v}{\alpha\sigma_u}\Big)\Big]^{1/2}.

Minimizing the latter expression in $\alpha$, we obtain

    \Omega = \sqrt{2}\,\Big( L_u \sqrt{\tfrac{D_u}{\sigma_u}} + L_v \sqrt{\tfrac{D_v}{\sigma_v}}\Big).

Thus, we have shown that under an appropriate choice of parameters we can ensure for the SDA-method the following rate of convergence for the gap function:

    \frac{\delta_k(D)}{k+1} \ \le\ \frac{\hat\beta_{k+1}}{k+1}\,\sqrt{2}\,\Big( L_u \sqrt{\tfrac{D_u}{\sigma_u}} + L_v \sqrt{\tfrac{D_v}{\sigma_v}}\Big).   (4.6)

It remains to show that the vanishing gap results in approaching the solution of the saddle point problem (4.1). Let us introduce two auxiliary functions:

    \varphi_{D_v}(u) = \max_{v \in V}\{ f(u,v) : d_v(v) \le D_v \}, \qquad
    \psi_{D_u}(v) = \min_{u \in U}\{ f(u,v) : d_u(u) \le D_u \}.

In view of our assumptions, $\varphi_{D_v}(\cdot)$ is convex on $U$ and $\psi_{D_u}(\cdot)$ is concave on $V$. Moreover, for any $u \in U$ and $v \in V$ we have

    \psi_{D_u}(v) \ \le\ f^* \ \le\ \varphi_{D_v}(u),

where $f^* = f(u^*, v^*)$.
Theorem 6 Let the point $\hat x_{k+1} = (\hat u_{k+1}, \hat v_{k+1})$ be defined by (2.9) with the sequences $X_k$, $G_k$ and $\Lambda_k$ being generated by method (2.4) for problem (4.1) with the oracle $G$ defined by (4.3). Then $\hat x_{k+1} \in Q$ and

    0 \ \le\ \varphi_{D_v}(\hat u_{k+1}) - \psi_{D_u}(\hat v_{k+1}) \ \le\ \frac{\delta_k(D)}{\sum_{i=0}^k \lambda_i}.   (4.7)

Proof: Indeed, since $f(\cdot, v)$ is convex and $f(u, \cdot)$ is concave,

    \tau_k \overset{\mathrm{def}}{=} \max_{u \in U}\Big\{ \sum_{i=0}^k \lambda_i \langle g_u(u_i, v_i), u_i - u\rangle : d_u(u) \le D_u \Big\}
      \ge \max_{u \in U}\Big\{ \sum_{i=0}^k \lambda_i [f(u_i, v_i) - f(u, v_i)] : d_u(u) \le D_u \Big\}
      \ge \sum_{i=0}^k \lambda_i f(u_i, v_i) - \Big(\sum_{i=0}^k \lambda_i\Big) \min_{u \in U}\{ f(u, \hat v_{k+1}) : d_u(u) \le D_u \}
      = \sum_{i=0}^k \lambda_i f(u_i, v_i) - \Big(\sum_{i=0}^k \lambda_i\Big)\,\psi_{D_u}(\hat v_{k+1}).

Similarly,

    \sigma_k \overset{\mathrm{def}}{=} \max_{v \in V}\Big\{ \sum_{i=0}^k \lambda_i \langle g_v(u_i, v_i), v - v_i\rangle : d_v(v) \le D_v \Big\}
      \ge \max_{v \in V}\Big\{ \sum_{i=0}^k \lambda_i [f(u_i, v) - f(u_i, v_i)] : d_v(v) \le D_v \Big\}
      \ge \Big(\sum_{i=0}^k \lambda_i\Big) \max_{v \in V}\{ f(\hat u_{k+1}, v) : d_v(v) \le D_v \} - \sum_{i=0}^k \lambda_i f(u_i, v_i)
      = \Big(\sum_{i=0}^k \lambda_i\Big)\,\varphi_{D_v}(\hat u_{k+1}) - \sum_{i=0}^k \lambda_i f(u_i, v_i).

Thus,

    \Big(\sum_{i=0}^k \lambda_i\Big)\big[\varphi_{D_v}(\hat u_{k+1}) - \psi_{D_u}(\hat v_{k+1})\big]
      \ \le\ \tau_k + \sigma_k
      \ \le\ \max_{x \in Q}\Big\{ \sum_{i=0}^k \lambda_i \langle g(x_i), x_i - x\rangle : d(x) \le \alpha D_u + (1-\alpha) D_v \Big\} = \delta_k(D).

Note that we did not assume boundedness of the sets $U$ and $V$. The parameter $D$ from inequality (4.7) does not appear in DA-method (2.4) in explicit form.

5 Variational inequalities

Let $Q$ be a closed convex set endowed, as usual, with a prox-function $d(x)$. Consider a continuous operator $V(\cdot): Q \to E^*$, which is monotone on $Q$:

    \langle V(x) - V(y), x - y\rangle \ \ge\ 0 \qquad \forall x, y \in Q.
In this section we are interested in the variational inequality problem (VI):

    Find $x^* \in Q$:\quad \langle V(x), x - x^*\rangle \ \ge\ 0 \qquad \forall x \in Q.   (5.1)

Sometimes this problem is called a weak variational inequality. Since $V(\cdot)$ is continuous and monotone, problem (5.1) is equivalent to its strong variant:

    Find $x^* \in Q$:\quad \langle V(x^*), x - x^*\rangle \ \ge\ 0 \qquad \forall x \in Q.   (5.2)

We assume that for this problem there exists at least one solution $x^*$. Let us fix $D > d(x^*)$. A standard measure of quality of an approximate solution to (5.1) is the restricted merit function

    f_D(x) = \max_{y \in Q}\{ \langle V(y), x - y\rangle : d(y) \le D \}.   (5.3)

Let us present two important facts related to this function (our exposition is partly taken from Section 2 in [5]).

Lemma 4 The function $f_D(x)$ is well defined and convex on $E$. For any $x \in Q$ we have $f_D(x) \ge 0$, and $f_D(x^*) = 0$. Conversely, if $f_D(\hat x) = 0$ for some $\hat x \in Q$ with $d(\hat x) < D$, then $\hat x$ is a solution to problem (5.1).

Proof: Indeed, the function $f_D(x)$ is well defined since the set $F_D$ is bounded. It is convex in $x$ as a maximum of a parametric family of linear functions. If $x \in F_D$, then taking $y = x$ in (5.3) we guarantee $f_D(x) \ge 0$. If $x \in Q$ and $d(x) > D$, consider the point $y \in Q$ satisfying the condition

    y = \alpha x + (1-\alpha) x^*, \qquad d(y) = D,

for certain $\alpha \in (0,1)$. Then $y \in F_D$ and, since $x - y = \frac{1-\alpha}{\alpha}(y - x^*)$,

    \langle V(y), x - y\rangle = \frac{1-\alpha}{\alpha}\,\langle V(y), y - x^*\rangle \ \ge\ 0.

Thus, $f_D(x) \ge 0$ for all $x \in Q$. On the other hand, since $x^*$ is a solution to (5.1), we have $\langle V(y), x^* - y\rangle \le 0$ for all $y$ from $F_D$. Hence, since $x^* \in F_D$, we get $f_D(x^*) = 0$.

Finally, let $f_D(\hat x) = 0$ for some $\hat x \in Q$ with $d(\hat x) < D$. This means that $\hat x$ is a solution of the weak variational inequality $\langle V(y), y - \hat x\rangle \ge 0$ posed for all $y \in F_D$. Since $V(y)$ is continuous, we conclude that $\hat x$ is a solution to the strong variational inequality

    \langle V(\hat x), y - \hat x\rangle \ \ge\ 0 \qquad \forall y \in F_D.

Hence, the minimum of the linear function $\ell(y) = \langle V(\hat x), y\rangle$, $y \in F_D$, is attained at $y = \hat x$. Moreover, at this point the constraint $d(y) \le D$ is not active. Thus, this constraint can be removed, and we conclude that $\hat x$ is a solution to (5.2) and, consequently, to (5.1).

Let us use for problem (5.1) the oracle $G: x \mapsto V(x)$.
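With this oracle, the dual-averaging iteration for a VI is easy to sketch numerically. Everything problem-specific below is an illustrative assumption rather than the paper's setting: the operator $V(x) = x - a$ (monotone, with known solution $x^* = a$), the set $Q$ a Euclidean ball, the prox-function $d(x) = \|x\|^2/2$ (so $\pi_\beta(-s)$ reduces to the Euclidean projection of $-s/\beta$ onto $Q$), and the simple-dual-averages parameters $\gamma$ and $\hat\beta_k$.

```python
import numpy as np

# Sketch of the simple-dual-averages iteration for a monotone VI.
# Assumptions (not from the paper): Q = {x : ||x|| <= R}, d(x) = ||x||^2 / 2
# (prox-center x_0 = 0, sigma = 1), and V(x) = x - a, so the solution is x* = a.

def project_ball(z, R):
    """Euclidean projection onto {x : ||x|| <= R}."""
    nz = np.linalg.norm(z)
    return z if nz <= R else (R / nz) * z

def sda_vi(V, dim, R=2.0, gamma=1.0, iters=20000):
    s = np.zeros(dim)        # accumulated oracle answers s_k
    beta_hat = 1.0           # beta_hat_1 = 1, then beta_hat_{k+1} = beta_hat_k + 1/beta_hat_k
    x = np.zeros(dim)        # x_0 = prox-center
    x_sum = np.zeros(dim)
    for _ in range(iters):
        x_sum += x                                    # x_k enters the averaged test point
        s += V(x)                                     # s_{k+1} = s_k + V(x_k)
        x = project_ball(-s / (gamma * beta_hat), R)  # x_{k+1} = pi_{gamma beta_hat}( -s_{k+1} )
        beta_hat += 1.0 / beta_hat
    return x_sum / iters                              # averaged point \hat x

a = np.array([0.5, 0.3])
x_hat = sda_vi(lambda x: x - a, dim=2)
print(np.linalg.norm(x_hat - a))   # small: the averaged point approaches x* = a
```

Note that only the averaged point converges here; the individual iterates merely stay bounded, in accordance with the remark after (5.5) below.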
Lemma 5 For any $k \ge 0$ we have

    f_D(\hat x_{k+1}) \ \le\ \frac{\delta_k(D)}{\sum_{i=0}^k \lambda_i}.   (5.4)
Proof: Indeed, since $V(x)$ is a monotone operator,

    f_D(\hat x_{k+1}) = \max_{x}\{ \langle V(x), \hat x_{k+1} - x\rangle : x \in F_D \}
      = \max_{x}\Big\{ \frac{1}{\sum_{i=0}^k \lambda_i}\sum_{i=0}^k \lambda_i \langle V(x), x_i - x\rangle : x \in F_D \Big\}
      \le \max_{x}\Big\{ \frac{1}{\sum_{i=0}^k \lambda_i}\sum_{i=0}^k \lambda_i \langle V(x_i), x_i - x\rangle : x \in F_D \Big\}
      \le \frac{\delta_k(D)}{\sum_{i=0}^k \lambda_i}.

Let us assume that $\|V(x)\|_* \le L$ for all $x \in Q$. Then, for example, by Theorem 2 we conclude that the SDA-method (2.2), as applied to problem (5.1), converges with the following rate:

    f_D(\hat x_{k+1}) \ \le\ \frac{\hat\beta_{k+1}}{k+1}\Big(\gamma D + \frac{L^2}{2\sigma\gamma}\Big).   (5.5)

Note that the sequence $\{x_k\}_{k \ge 0}$ remains bounded even for an unbounded feasible set $Q$:

    \|x_k - x^*\|^2 \ \le\ \frac{2}{\sigma}\, d(x^*) + \frac{L^2}{\sigma^2 \gamma^2}.   (5.6)

6 Stochastic optimization

Let $\Xi$ be a probability space endowed with a probability measure, and let $Q$ be a closed convex set in $E$ endowed with a prox-function $d(x)$. We are given a cost function $f: Q \times \Xi \to \mathbb{R}$. This mapping defines a family of random variables $f(x, \xi)$ on $\Xi$. We assume that the expectation in $\xi$, $E_\xi(f(x,\xi))$, is well defined for all $x \in Q$. Our problem of interest is the stochastic optimization problem:

    \min_x \{ \varphi(x) \equiv E_\xi(f(x,\xi)) : x \in Q \}.   (6.1)

We assume that problem (6.1) is solvable and denote by $x^*$ one of its solutions, with $\varphi^* \overset{\mathrm{def}}{=} \varphi(x^*)$. In this section we make the standard convexity assumption.

Assumption 1 For any $\xi \in \Xi$, the function $f(\cdot, \xi)$ is convex and subdifferentiable on $Q$.

Then the function $\varphi(x)$ is convex on $Q$.

Note that computation of the exact value of $\varphi(x)$ may not be numerically possible, even when the distribution of $\xi$ is known. Indeed, if $\Xi \subset \mathbb{R}^m$, then computing $\hat\varphi$ such that $|\hat\varphi - \varphi(x)| \le \hat\epsilon$ may involve up to $O\big(\hat\epsilon^{-m}\big)$ computations of $f(x,\xi)$ for different $\xi \in \Xi$. Hence, the standard deterministic notion of an approximate solution to problem (6.1) becomes useless. However, an alternative definition of a solution to (6.1) is possible.

Indeed, it is clear that no solution concept can exclude failures in actual implementation. This is the very nature of decision-making under uncertainty: we do not have full control over the consequences of a given decision. In this context, there is no reason to require that the computed solution be the result of some deterministic process. A random computational scheme would be equally acceptable if the solution meets some probabilistic criterion. Namely, let $\bar x$ be a random output of some random algorithm. It appears that it is enough to assume that this output is good only on average:

    E(\varphi(\bar x)) - \varphi^* \ \le\ \epsilon.   (6.2)

Then, as was shown in [6], any random optimization algorithm which is able to generate such a random output for any $\epsilon > 0$ can be transformed into an algorithm generating random solutions with an appropriate level of confidence. In [6] this transformation was justified for a stochastic version of the subgradient method (1.3) (which is known to possess the feature (6.2)). The goal of this section is to prove that a stochastic version of the SDA-method (2.2), as applied to problem (6.1), is also able to produce a random variable $\bar x \in Q$ satisfying condition (6.2).

In order to justify convergence of the method we need more assumptions.

Assumption 2 At any point $x \in Q$ and for any realization of the random vector $\xi \in \Xi$, the stochastic oracle $G(x,\xi)$ always returns a predefined answer $f'(x,\xi) \in \partial_x f(x,\xi)$. Moreover, the answers of the oracle are uniformly bounded:

    \|f'(x,\xi)\|_* \ \le\ L, \qquad x \in Q,\ \xi \in \Xi.

Let us write down a stochastic modification of the SDA-method in the same vein as [6]. In this section we adopt the notation $\bar\tau$ for a particular result of a drawing of a random variable $\tau$. Thus, all $\bar\xi_k$ in the scheme below denote observations of the corresponding i.i.d. random vectors $\xi_k$, which are identical copies of the random vector $\xi$ defined in (6.1).

Method of Stochastic Simple Averages (SSA)

Initialization: Set $s_0 = 0 \in E^*$. Choose $\gamma > 0$.

Iteration ($k \ge 0$):

1. Generate $\bar\xi_k$ and compute $\bar g_k = f'(\bar x_k, \bar\xi_k)$.   (6.3)

2. Set $\bar s_{k+1} = \bar s_k + \bar g_k$, $\ \bar x_{k+1} = \pi_{\gamma\hat\beta_{k+1}}(-\bar s_{k+1})$.

Denote by $\xi^k$, $k \ge 0$, the collection of random variables $(\xi_0, \ldots, \xi_k)$.
The SSA-method (6.3) defines two sequences of random vectors

    s_{k+1} = s_{k+1}(\xi^k) \in E^*, \qquad x_{k+1} = x_{k+1}(\xi^k) \in Q, \qquad k \ge 0,

with deterministic $s_0$ and $x_0$. Note that their implementations $\bar s_{k+1}$ and $\bar x_{k+1}$ are different for different runs of the SSA-method.
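The behaviour of one run of the SSA-method can be sketched numerically. The problem below is an illustrative assumption: minimize $\varphi(x) = E_\xi|x - \xi|$ over $Q = [-2, 2]$ with $\xi \sim \mathrm{Uniform}(0,1)$, prox-function $d(x) = x^2/2$ (so $\pi_\beta(-s)$ is a clipping of $-s/\beta$), and a fixed random seed; the optimal solution is the median $x^* = 0.5$.

```python
import numpy as np

# Sketch of the SSA scheme on an assumed toy problem:
# minimize phi(x) = E_xi |x - xi| over Q = [-2, 2], xi ~ Uniform(0, 1).
# Prox-function d(x) = x^2 / 2 (prox-center 0, sigma = 1); the solution is the median 0.5.

rng = np.random.default_rng(0)

def ssa(iters=20000, gamma=1.0):
    s, beta_hat, x = 0.0, 1.0, 0.0   # s_0 = 0, beta_hat_1 = 1, x_0 = prox-center
    x_sum = 0.0
    for _ in range(iters):
        x_sum += x                                               # x_k enters the average
        xi = rng.uniform(0.0, 1.0)                               # draw xi_k
        g = float(np.sign(x - xi))                               # g_k = f'(x_k, xi_k), |g_k| <= 1 = L
        s += g                                                   # s_{k+1} = s_k + g_k
        x = float(np.clip(-s / (gamma * beta_hat), -2.0, 2.0))   # x_{k+1} = pi_{gamma beta_hat}(-s_{k+1})
        beta_hat += 1.0 / beta_hat                               # beta_hat_{k+1} = beta_hat_k + 1/beta_hat_k
    return x_sum / iters                                         # averaged point

x_hat = ssa()
print(x_hat)   # close to the median 0.5
```

Different seeds give different drawings $\bar x_{k+1}$, but, in the spirit of (6.2), the averaged point is good in expectation.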
Thus, for each particular run of this method we can define a gap function

    \bar\delta_k(D) = \max_x\Big\{ \sum_{i=0}^k \langle f'(\bar x_i, \bar\xi_i), \bar x_i - x\rangle : x \in F_D \Big\}, \qquad D \ge 0,   (6.4)

which can be seen as a drawing of the random variable $\delta_k(D)$. Since $\|\bar g_k\|_* \le L$, Theorem 2 results in the following uniform bound:

    \frac{\bar\delta_k(D)}{k+1} \ \le\ \frac{\hat\beta_{k+1}}{k+1}\Big(\gamma D + \frac{L^2}{2\sigma\gamma}\Big).   (6.5)

Let us now choose $D$ big enough: $D \ge d(x^*)$. Then

    \frac{\bar\delta_k(D)}{k+1} \ \ge\ \frac{1}{k+1}\sum_{i=0}^k \langle f'(\bar x_i, \bar\xi_i), \bar x_i - x^*\rangle
      \ \ge\ \frac{1}{k+1}\sum_{i=0}^k \big[f(\bar x_i, \bar\xi_i) - f(x^*, \bar\xi_i)\big].

Since all $\xi_i$ are i.i.d. random vectors, on average we have

    \frac{1}{k+1}\, E_{\xi^k}\big(\delta_k(D)\big) \ \ge\ \frac{1}{k+1}\sum_{i=0}^k E_{\xi^k}\big(f(x_i, \xi_i)\big) - \varphi^*.   (6.6)

Note that the random vector $x_i$ does not depend on $\xi_j$, $i \le j \le k$. Therefore

    E_{\xi^k}\big(f(x_i, \xi_i)\big) = E_{\xi^{i-1}}\Big(E_{\xi_i}\big(f(x_i, \xi_i)\big)\Big) = E_{\xi^{i-1}}\big(\varphi(x_i)\big),

and we conclude that

    \frac{1}{k+1}\sum_{i=0}^k E_{\xi^k}\big(f(x_i, \xi_i)\big) = \frac{1}{k+1}\sum_{i=0}^k E_{\xi^k}\big(\varphi(x_i)\big)
      \ \ge\ E_{\xi^k}\Big(\varphi\Big(\frac{1}{k+1}\sum_{i=0}^k x_i\Big)\Big).

Thus, in view of inequalities (6.5) and (6.6), we have proved the following theorem.

Theorem 7 Let the random sequence $\{x_k\}_{k \ge 0}$ be defined by the SSA-method (6.3). Then for any $k \ge 0$ we have

    E_{\xi^k}\Big(\varphi\Big(\frac{1}{k+1}\sum_{i=0}^k x_i\Big)\Big) - \varphi^* \ \le\ \frac{\hat\beta_{k+1}}{k+1}\Big(\gamma\, d(x^*) + \frac{L^2}{2\sigma\gamma}\Big).

7 Applications in modelling

7.1 Algorithm of balanced development

It appears that a variant of the DA-algorithm (2.4) admits a natural interpretation as a certain rational dynamic investment strategy. Let us discuss an example of such a model.

Assume we are going to develop our property by an a priori given sequence $\Lambda = \{\lambda_k\}_{k \ge 0}$ of consecutive investments of capital. This money can be spent for buying certain products with unitary prices $p^{(j)} > 0$, $j = 1, \ldots, N$.
All products are perfectly divisible; they are available on the market in unlimited quantities. Thus, if for investment $k$ we buy a quantity $X_k^{(j)} \ge 0$ of each product $j$, $j = 1, \ldots, N$, then the balance equation can be written as

    X_k = \lambda_k x_k, \qquad \langle p, x_k\rangle = 1, \qquad x_k \ge 0,   (7.1)

where the vectors $p = (p^{(1)}, \ldots, p^{(N)})^T$ and $x_k = (x_k^{(1)}, \ldots, x_k^{(N)})^T$ are both from $\mathbb{R}^N$.

In order to measure the progress of our property (or the efficiency of our investments) we introduce a set of $m$ characteristics (features). Let us assume that after $k \ge 0$ investments we are able to measure exactly the accumulated level $s_k^{(i)}$, $i = 1, \ldots, m$, of each feature $i$ in our property. On the other hand, we have a set of standards $b^{(i)} > 0$, $i = 1, \ldots, m$, which are treated as lower bounds on the concentration of each feature in a perfectly balanced unit of such a property. Finally, let us assume that each product $j$ is described by a vector

    a_j = (a_j^{(1)}, \ldots, a_j^{(m)})^T, \qquad j = 1, \ldots, N,

with entries being the unitary concentrations of the corresponding characteristics. Introducing the matrix $A = (a_1, \ldots, a_N) \in \mathbb{R}^{m \times N}$, we can describe the dynamics of the accumulated features as follows:

    s_{k+1} = s_k + \lambda_k g_k, \qquad g_k = A x_k, \qquad k \ge 0.   (7.2)

In accordance with our system of standards, the strategy of optimal balanced development can be found from the following optimization problem:

    \tau^* \overset{\mathrm{def}}{=} \max_{x, \tau}\{ \tau : Ax \ge \tau b,\ \langle p, x\rangle = 1,\ x \ge 0 \}.   (7.3)

Indeed, denote by $x^*$ its optimal solution. Then, choosing $x_k \equiv x^*$ in (7.2), we get

    s_k = \Big(\sum_{i=0}^{k-1} \lambda_i\Big) A x^* \ \ge\ \Big(\sum_{i=0}^{k-1} \lambda_i\Big) \tau^* b.

Thus, the efficiency of such an investment strategy is the maximal possible, and its quality does not even depend on the schedule of the payments. However, note that in order to apply the above strategy it is necessary to solve in advance the linear programming problem (7.3), which can be very complicated because of its high dimension. What can be done if this computation appears to be too difficult? Do we have some simple tools for approaching the optimal strategy in real life? Apparently, the latter question has a positive answer.
First of all, let us represent problem (7.3) in a dual form. From the Lagrangian

    L(x, \tau, y, \lambda) = \tau + \langle T y, Ax - \tau b\rangle + \lambda\,(1 - \langle p, x\rangle),

where $T$ is a diagonal matrix with $T^{(i,i)} = \frac{1}{b^{(i)}}$, $i = 1, \ldots, m$, and the corresponding saddle-point problem

    \max_{\tau;\ x \ge 0}\ \min_{\lambda;\ y \ge 0}\ L(x, \tau, y, \lambda),

we come to the following dual representation:

    \tau^* = \min_{y, \lambda}\{ \lambda : A^T T y \le \lambda p,\ y \in \Delta_m \},

where $\Delta_m$ is the standard simplex in $\mathbb{R}^m$.
Or, introducing the function

    \varphi(y) = \max_{1 \le j \le N} \frac{1}{p^{(j)}}\,\langle a_j, T y\rangle,

we obtain

    \tau^* = \min_y\{ \varphi(y) : y \in \Delta_m \}.   (7.4)

Let us apply to problem (7.4) a variant of the DA-scheme (2.4). For the feasible set of this problem, $\Delta_m$, we choose the entropy prox-function

    d(y) = \ln m + \sum_{i=1}^m y^{(i)} \ln y^{(i)}

with prox-center $y_0 = (\frac{1}{m}, \ldots, \frac{1}{m})^T$. Note that $d(y)$ is strongly convex on $\Delta_m$ in the $\ell_1$-norm (1.10) with convexity parameter $\sigma = 1$, and for any $y \in \Delta_m$ we have

    d(y) \ \le\ \ln m.   (7.5)

Moreover, with this prox-function the computation of the projection $\pi_\beta(s)$ is very cheap:

    \pi_\beta(s)^{(i)} = e^{s^{(i)}/\beta}\,\Big[\sum_{j=1}^m e^{s^{(j)}/\beta}\Big]^{-1}, \qquad i = 1, \ldots, m,   (7.6)

(see, for example, Lemma 4 in [3]). In the algorithm below, the sequence $\Lambda$ is the same as in (7.2).

Algorithm of Balanced Development (BD)

Initialization: Set $s_0 = 0 \in \mathbb{R}^m$. Choose $\beta_0 > 0$.

Iteration ($k \ge 0$):

1. Form the set $J_k = \big\{ j : \frac{1}{p^{(j)}}\langle a_j, T y_k\rangle = \varphi(y_k) \big\}$.

2. Choose any $x_k \ge 0$, $\langle p, x_k\rangle = 1$, with $x_k^{(j)} = 0$ for $j \notin J_k$.   (7.7)

3. Update $s_{k+1} = s_k + \lambda_k g_k$ with $g_k = A x_k$.

4. Choose $\beta_{k+1} \ge \beta_k$ and for $i = 1, \ldots, m$ define

    y_{k+1}^{(i)} = e^{-s_{k+1}^{(i)}/(\beta_{k+1} b^{(i)})}\,\Big[\sum_{j=1}^m e^{-s_{k+1}^{(j)}/(\beta_{k+1} b^{(j)})}\Big]^{-1}.

Note that in this scheme $T g_k \in \partial \varphi(y_k)$ and

    y_{k+1} = \pi_{\beta_{k+1}}\Big(-T \sum_{j=0}^k \lambda_j g_j\Big).
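Computationally, (7.6) is the classical softmax mapping. The following minimal sketch implements it; subtracting the maximal component before exponentiation is a standard numerical-stability device (an implementation assumption, not part of the formula):

```python
import numpy as np

# Sketch of the entropy projection (7.6): pi_beta(s) is the softmax of s / beta
# on the standard simplex.  The max-shift avoids overflow without changing the result.

def pi_beta(s, beta):
    z = np.asarray(s, dtype=float) / beta
    w = np.exp(z - z.max())        # shift by the max component: same ratios, no overflow
    return w / w.sum()

s = np.array([1.0, 2.0, 3.0])
for beta in (10.0, 1.0, 0.01):
    print(beta, pi_beta(s, beta))
# As beta decreases, pi_beta(s) concentrates on the maximal component of s:
# the "soft" choice approaches the deterministic maximum.
```

Step 4 of the BD-algorithm above evaluates the same mapping (with the components of $s$ rescaled by the standards $b$), so each $y_{k+1}$ lies on $\Delta_m$ by construction.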
Therefore it is indeed a variant of (2.4), and in view of Theorem 1 and (7.5) we have

    \frac{1}{\sum_{i=0}^k \lambda_i}\sum_{i=0}^k \lambda_i \varphi(y_i) - \tau^*
      \ \le\ \frac{1}{\sum_{i=0}^k \lambda_i}\Big[\beta_{k+1} \ln m + \frac{L^2}{2}\sum_{i=0}^k \frac{\lambda_i^2}{\beta_i}\Big],
    \qquad
    L = \max_{1 \le j \le N} \frac{1}{p^{(j)}}\,\|T a_j\|_\infty = \max_{1 \le j \le N,\ 1 \le i \le m} \frac{a_j^{(i)}}{p^{(j)} b^{(i)}}.   (7.8)

The BD-algorithm (7.7) admits the following interpretation. Its first three steps describe a natural investment strategy. Indeed, the algorithm updates a system of personal prices for the characteristics:

    u_k \overset{\mathrm{def}}{=} T y_k, \qquad k \ge 0.

By these prices, for any product $j$ we can compute an estimate $\langle a_j, u_k\rangle$ of its utility. Therefore, in Step 1 we form a subset of available products with the best quality-price ratio:

    J_k = \Big\{ j : \frac{1}{p^{(j)}}\langle a_j, u_k\rangle = \max_{1 \le i \le N} \frac{1}{p^{(i)}}\langle a_i, u_k\rangle \Big\}.

In Step 2 we share a unit of budget in an arbitrary way among the best products. And in Step 3 we buy the products within the budget $\lambda_k$ and update the vector of accumulated features.

A non-trivial part of the interpretation is related to Step 4, which describes the dependence of the personal prices, developed by the end of period $k+1$, on the vector of accumulated characteristics $s_{k+1}$ observed during this period. For this interpretation we need to explain first the meaning of (7.6). Relations (7.6) appear in the so-called logit variant of the discrete choice model (see, for example, [2]). In this model we need to choose one of $m$ variants with utilities $s^{(i)}$, $i = 1, \ldots, m$. The utilities are observed with additive random errors $\epsilon^{(i)}$, which are i.i.d. in accordance with the double-exponential distribution:

    F(\tau) = \mathrm{Prob}\big[\epsilon^{(i)} \le \tau\big] = \exp\big\{-e^{-\gamma - \tau/\beta}\big\}, \qquad i = 1, \ldots, m,

where $\gamma$ is Euler's constant and $\beta > 0$ is a tolerance parameter. Then the expected observation of the utility function is given by

    U_\beta(s) = E\Big(\max_{1 \le i \le m}\big[s^{(i)} + \epsilon^{(i)}\big]\Big).

It can be proved that $U_\beta(s) = \beta\, d_*(s/\beta) + \mathrm{const}$ with

    d_*(s) = \ln\Big(\sum_{i=1}^m e^{s^{(i)}}\Big).

Therefore we have the following important expressions for the so-called choice probabilities:

    \mathrm{Prob}\Big[s^{(i)} + \epsilon^{(i)} = \max_{1 \le j \le m}\big[s^{(j)} + \epsilon^{(j)}\big]\Big]
      = \nabla U_\beta(s)^{(i)} = \nabla d_*(s/\beta)^{(i)}
      = e^{s^{(i)}/\beta}\,\Big[\sum_{j=1}^m e^{s^{(j)}/\beta}\Big]^{-1}, \qquad i = 1, \ldots, m

(compare with (7.6)). The smaller $\beta$ is, the closer our choice strategy is to the deterministic maximum.

Using the logit model, we can adopt the following behavioral explanation of the rules of Step 4 in (7.7). During the period $k+1$ we regularly compare the levels of the accumulated features in the relative scale defined by the vector of minimal standards $b$. Each audit detects the worst characteristic, the one for which the corresponding relative level is minimal. The above computations are done with additive errors⁵, which satisfy the logit model. By the end of the period we obtain the frequencies $y_{k+1}^{(i)}$ of detecting the level of the corresponding characteristic as the worst one. Then we apply the following prices:

    u_{k+1}^{(i)} = \frac{y_{k+1}^{(i)}}{b^{(i)}}, \qquad i = 1, \ldots, m.   (7.9)

Note that the conditions for convergence of the BD-method can be derived from the right-hand side of inequality (7.8). Assume, for example, that we have a sequence of equal investments $\lambda$. Further, during each period we compare the average accumulation of the characteristics using the logit model with tolerance parameter $\mu$. Hence, in (7.7) we take

    \beta_{k+1} = \mu(k+1), \qquad \lambda_k = \lambda, \qquad k \ge 0.

Defining $\beta_0 = \mu$, we can estimate the right-hand side of inequality (7.8) as follows:

    \frac{\mu}{\lambda}\ln m + \frac{\lambda L^2}{2\mu}\cdot\frac{2 + \ln k}{k+1} \ \to\ \frac{\mu}{\lambda}\ln m, \qquad k \to \infty.

Thus, the quality of our property stabilizes at the level $\tau^* + \frac{\mu}{\lambda}\ln m$. In order to have convergence to the optimal balance, the inaccuracy level $\mu$ of the audit testing must gradually go to zero.

7.2 Preliminary planning versus dynamic adjustment

Let us consider a multi-period optimization problem in the following form. For $N+1$ consecutive periods, $k = 0, \ldots, N$, our expenses are defined by different convex objective functions $f_0(x), \ldots, f_N(x)$, $x \in Q$, where $Q$ is a closed convex set. Thus, if we choose to apply for period $k$ some strategy $x_k \in Q$, then the total expenses are given by

    \Psi(X) = \sum_{k=0}^N f_k(x_k), \qquad X = \{x_k\}_{k=0}^N.

In this model there are several possibilities for defining a rational strategy.
1. If all functions $\{f_k(x)\}_{k=0}^N$ are known in advance, then we can define an optimal dynamic strategy $X^*$ as

    x_k = x_k^* \overset{\mathrm{def}}{=} \arg\min_x\{ f_k(x) : x \in Q \}, \qquad k = 0, \ldots, N.   (7.10)

⁵ Note that these errors can occur in a natural way, due to inexact data, for example. However, even if the actual data and measurements are exact, we need to introduce an artificial randomization of the results.
More informationDuality Theory of Constrained Optimization
Duality Theory of Constrained Optimization Robert M. Freund April, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 The Practical Importance of Duality Duality is pervasive
More informationStochastic model-based minimization under high-order growth
Stochastic model-based minimization under high-order growth Damek Davis Dmitriy Drusvyatskiy Kellie J. MacPhee Abstract Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More information12. Interior-point methods
12. Interior-point methods Convex Optimization Boyd & Vandenberghe inequality constrained minimization logarithmic barrier function and central path barrier method feasibility and phase I methods complexity
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More informationThe Frank-Wolfe Algorithm:
The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology
More informationCoordinate Update Algorithm Short Course Subgradients and Subgradient Methods
Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n
More informationACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING
ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING YANGYANG XU Abstract. Motivated by big data applications, first-order methods have been extremely
More informationOn the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems
MATHEMATICS OF OPERATIONS RESEARCH Vol. 35, No., May 010, pp. 84 305 issn 0364-765X eissn 156-5471 10 350 084 informs doi 10.187/moor.1090.0440 010 INFORMS On the Power of Robust Solutions in Two-Stage
More informationLecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem
Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Michael Patriksson 0-0 The Relaxation Theorem 1 Problem: find f := infimum f(x), x subject to x S, (1a) (1b) where f : R n R
More informationSome Properties of the Augmented Lagrangian in Cone Constrained Optimization
MATHEMATICS OF OPERATIONS RESEARCH Vol. 29, No. 3, August 2004, pp. 479 491 issn 0364-765X eissn 1526-5471 04 2903 0479 informs doi 10.1287/moor.1040.0103 2004 INFORMS Some Properties of the Augmented
More informationDual and primal-dual methods
ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method
More informationAlgorithmic models of market equilibrium
CORE DISCUSSION PAPER 2013/66 Algorithmic models of market equilibrium Yu. Nesterov and V. Shikhman November 26, 2013 Abstract In this paper we suggest a new framework for constructing mathematical models
More informationOn the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean
On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean Renato D.C. Monteiro B. F. Svaiter March 17, 2009 Abstract In this paper we analyze the iteration-complexity
More informationConvex Optimization and Modeling
Convex Optimization and Modeling Duality Theory and Optimality Conditions 5th lecture, 12.05.2010 Jun.-Prof. Matthias Hein Program of today/next lecture Lagrangian and duality: the Lagrangian the dual
More informationOnline First-Order Framework for Robust Convex Optimization
Online First-Order Framework for Robust Convex Optimization Nam Ho-Nguyen 1 and Fatma Kılınç-Karzan 1 1 Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA, 15213, USA. July 20, 2016;
More informationAn accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems
Optimization Methods and Software ISSN: 1055-6788 (Print) 1029-4937 (Online) Journal homepage: http://www.tandfonline.com/loi/goms20 An accelerated non-euclidean hybrid proximal extragradient-type algorithm
More informationLecture 24 November 27
EE 381V: Large Scale Optimization Fall 01 Lecture 4 November 7 Lecturer: Caramanis & Sanghavi Scribe: Jahshan Bhatti and Ken Pesyna 4.1 Mirror Descent Earlier, we motivated mirror descent as a way to improve
More informationMathematics for Economists
Mathematics for Economists Victor Filipe Sao Paulo School of Economics FGV Metric Spaces: Basic Definitions Victor Filipe (EESP/FGV) Mathematics for Economists Jan.-Feb. 2017 1 / 34 Definitions and Examples
More informationSubgradient methods for huge-scale optimization problems
Subgradient methods for huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) May 24, 2012 (Edinburgh, Scotland) Yu. Nesterov Subgradient methods for huge-scale problems 1/24 Outline 1 Problems
More informationMore First-Order Optimization Algorithms
More First-Order Optimization Algorithms Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 3, 8, 3 The SDM
More informationA convergence result for an Outer Approximation Scheme
A convergence result for an Outer Approximation Scheme R. S. Burachik Engenharia de Sistemas e Computação, COPPE-UFRJ, CP 68511, Rio de Janeiro, RJ, CEP 21941-972, Brazil regi@cos.ufrj.br J. O. Lopes Departamento
More informationSolving generalized semi-infinite programs by reduction to simpler problems.
Solving generalized semi-infinite programs by reduction to simpler problems. G. Still, University of Twente January 20, 2004 Abstract. The paper intends to give a unifying treatment of different approaches
More informationSparse Regularization via Convex Analysis
Sparse Regularization via Convex Analysis Ivan Selesnick Electrical and Computer Engineering Tandon School of Engineering New York University Brooklyn, New York, USA 29 / 66 Convex or non-convex: Which
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationIntroduction to Nonlinear Stochastic Programming
School of Mathematics T H E U N I V E R S I T Y O H F R G E D I N B U Introduction to Nonlinear Stochastic Programming Jacek Gondzio Email: J.Gondzio@ed.ac.uk URL: http://www.maths.ed.ac.uk/~gondzio SPS
More informationSelected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018
Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected
More informationarxiv: v1 [math.oc] 13 Dec 2018
A NEW HOMOTOPY PROXIMAL VARIABLE-METRIC FRAMEWORK FOR COMPOSITE CONVEX MINIMIZATION QUOC TRAN-DINH, LIANG LING, AND KIM-CHUAN TOH arxiv:8205243v [mathoc] 3 Dec 208 Abstract This paper suggests two novel
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationContraction Methods for Convex Optimization and Monotone Variational Inequalities No.16
XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of
More informationDouglas-Rachford splitting for nonconvex feasibility problems
Douglas-Rachford splitting for nonconvex feasibility problems Guoyin Li Ting Kei Pong Jan 3, 015 Abstract We adapt the Douglas-Rachford DR) splitting method to solve nonconvex feasibility problems by studying
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More informationJanuary 29, Introduction to optimization and complexity. Outline. Introduction. Problem formulation. Convexity reminder. Optimality Conditions
Olga Galinina olga.galinina@tut.fi ELT-53656 Network Analysis Dimensioning II Department of Electronics Communications Engineering Tampere University of Technology, Tampere, Finl January 29, 2014 1 2 3
More informationA Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming
A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose
More informationLAGRANGIAN TRANSFORMATION IN CONVEX OPTIMIZATION
LAGRANGIAN TRANSFORMATION IN CONVEX OPTIMIZATION ROMAN A. POLYAK Abstract. We introduce the Lagrangian Transformation(LT) and develop a general LT method for convex optimization problems. A class Ψ of
More informationOptimality Conditions for Constrained Optimization
72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)
More informationarxiv: v1 [math.oc] 5 Dec 2014
FAST BUNDLE-LEVEL TYPE METHODS FOR UNCONSTRAINED AND BALL-CONSTRAINED CONVEX OPTIMIZATION YUNMEI CHEN, GUANGHUI LAN, YUYUAN OUYANG, AND WEI ZHANG arxiv:141.18v1 [math.oc] 5 Dec 014 Abstract. It has been
More informationMaximal Monotone Inclusions and Fitzpatrick Functions
JOTA manuscript No. (will be inserted by the editor) Maximal Monotone Inclusions and Fitzpatrick Functions J. M. Borwein J. Dutta Communicated by Michel Thera. Abstract In this paper, we study maximal
More informationON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction
J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties
More informationConvex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version
Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com
More informationMini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization
Noname manuscript No. (will be inserted by the editor) Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Saeed Ghadimi Guanghui Lan Hongchao Zhang the date of
More informationLagrange duality. The Lagrangian. We consider an optimization program of the form
Lagrange duality Another way to arrive at the KKT conditions, and one which gives us some insight on solving constrained optimization problems, is through the Lagrange dual. The dual is a maximization
More informationHomework 4. Convex Optimization /36-725
Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationPrimal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming
Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro
More informationRegret bounded by gradual variation for online convex optimization
Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date
More informationComputing closest stable non-negative matrices
Computing closest stable non-negative matrices Yu. Nesterov and V.Yu. Protasov August 22, 2017 Abstract Problem of finding the closest stable matrix for a dynamical system has many applications. It is
More informationWE consider an undirected, connected network of n
On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been
More informationAssignment 1: From the Definition of Convexity to Helley Theorem
Assignment 1: From the Definition of Convexity to Helley Theorem Exercise 1 Mark in the following list the sets which are convex: 1. {x R 2 : x 1 + i 2 x 2 1, i = 1,..., 10} 2. {x R 2 : x 2 1 + 2ix 1x
More informationSubgradient. Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes. definition. subgradient calculus
1/41 Subgradient Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes definition subgradient calculus duality and optimality conditions directional derivative Basic inequality
More information