The Proximal Gradient Method


Underlying Space: In this chapter, with the exception of Section 10.9, $\mathbb{E}$ is a Euclidean space, meaning a finite dimensional space endowed with an inner product $\langle\cdot,\cdot\rangle$ and the Euclidean norm $\|\cdot\|=\sqrt{\langle\cdot,\cdot\rangle}$.

10.1 The Composite Model

In this chapter we will be mostly concerned with the composite model

$$\min_{x\in\mathbb{E}}\{F(x)\equiv f(x)+g(x)\},\tag{10.1}$$

where we assume the following.

Assumption 10.1.

(A) $g:\mathbb{E}\to(-\infty,\infty]$ is proper closed and convex.

(B) $f:\mathbb{E}\to(-\infty,\infty]$ is proper and closed, $\mathrm{dom}(f)$ is convex, $\mathrm{dom}(g)\subseteq\mathrm{int}(\mathrm{dom}(f))$, and $f$ is $L_f$-smooth over $\mathrm{int}(\mathrm{dom}(f))$.

(C) The optimal set of problem (10.1) is nonempty and denoted by $X^*$. The optimal value of the problem is denoted by $F_{\mathrm{opt}}$.

Three special cases of the general model (10.1) are gathered in the following example.

Example 10.2.

Smooth unconstrained minimization. If $g\equiv 0$ and $\mathrm{dom}(f)=\mathbb{E}$, then (10.1) reduces to the unconstrained smooth minimization problem
$$\min_{x\in\mathbb{E}} f(x),$$
where $f:\mathbb{E}\to\mathbb{R}$ is an $L_f$-smooth function.

Copyright 2017 Society for Industrial and Applied Mathematics

Convex constrained smooth minimization. If $g=\delta_C$, where $C$ is a nonempty closed and convex set, then (10.1) amounts to the problem of minimizing a differentiable function over a nonempty closed and convex set:
$$\min_{x\in C} f(x),$$
where here $f$ is $L_f$-smooth over $\mathrm{int}(\mathrm{dom}(f))$ and $C\subseteq\mathrm{int}(\mathrm{dom}(f))$.

$l_1$-regularized minimization. Taking $g(x)=\lambda\|x\|_1$ for some $\lambda>0$, (10.1) amounts to the $l_1$-regularized problem
$$\min_{x\in\mathbb{E}}\{f(x)+\lambda\|x\|_1\}$$
with $f$ being an $L_f$-smooth function over the entire space $\mathbb{E}$.

10.2 The Proximal Gradient Method

To understand the idea behind the method for solving (10.1) we are about to study, we begin by revisiting the projected gradient method for solving (10.1) in the case where $g=\delta_C$ with $C$ being a nonempty closed and convex set. In this case, the problem takes the form
$$\min\{f(x):x\in C\}.\tag{10.2}$$
The general update step of the projected gradient method for solving (10.2) takes the form
$$x^{k+1}=P_C\left(x^k-t_k\nabla f(x^k)\right),$$
where $t_k$ is the stepsize at iteration $k$. It is easy to verify that the update step can be also written as (see also Section 9.1 for a similar discussion on the projected subgradient method)
$$x^{k+1}=\operatorname*{argmin}_{x\in C}\left\{f(x^k)+\langle\nabla f(x^k),x-x^k\rangle+\frac{1}{2t_k}\|x-x^k\|^2\right\}.$$
That is, the next iterate is the minimizer over $C$ of the sum of the linearization of the smooth part around the current iterate plus a quadratic prox term.

Back to the more general model (10.1), it is natural to generalize the above idea and to define the next iterate as the minimizer of the sum of the linearization of $f$ around $x^k$, the nonsmooth function $g$, and a quadratic prox term:
$$x^{k+1}=\operatorname*{argmin}_{x\in\mathbb{E}}\left\{f(x^k)+\langle\nabla f(x^k),x-x^k\rangle+g(x)+\frac{1}{2t_k}\|x-x^k\|^2\right\}.\tag{10.3}$$
After some simple algebraic manipulation and cancellation of constant terms, we obtain that (10.3) can be rewritten as
$$x^{k+1}=\operatorname*{argmin}_{x\in\mathbb{E}}\left\{t_k g(x)+\frac{1}{2}\left\|x-\left(x^k-t_k\nabla f(x^k)\right)\right\|^2\right\},$$
which by the definition of the proximal operator is the same as
$$x^{k+1}=\operatorname{prox}_{t_k g}\left(x^k-t_k\nabla f(x^k)\right).$$
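In code, one iteration of the scheme just derived is a gradient step followed by a prox step. Below is a minimal sketch; the concrete quadratic $f$, the nonnegativity constraint playing the role of $g=\delta_C$, and all names are illustrative assumptions for the demo, not taken from the text.

```python
import numpy as np

def prox_grad_step(x, grad_f, prox_g, t):
    """One proximal gradient update: x+ = prox_{t g}(x - t * grad_f(x))."""
    return prox_g(x - t * grad_f(x), t)

# Illustrative instance: f(x) = 0.5*||A x - b||^2, g = indicator of the
# nonnegative orthant, so prox_{t g} is the projection onto x >= 0
# (i.e., the projected gradient method).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([-2.0, 3.0])
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, t: np.maximum(v, 0.0)   # projection; independent of t here

L_f = np.linalg.norm(A.T @ A, 2)           # a Lipschitz constant of grad f (= 4)
x = np.array([1.0, 1.0])
for _ in range(200):
    x = prox_grad_step(x, grad_f, prox_g, 1.0 / L_f)
# The constrained minimizer is (0, 3): the unconstrained solution is (-1, 3),
# and the first coordinate is clipped at the boundary of the orthant.
```

With $g=\delta_C$ the prox step is exactly the projection $P_C$, recovering the projected gradient method as a special case.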

The above method is called the proximal gradient method, as it consists of a gradient step followed by a proximal mapping. From now on, we will take the stepsizes as $t_k=\frac{1}{L_k}$, leading to the following description of the method.

The Proximal Gradient Method

Initialization: pick $x^0\in\mathrm{int}(\mathrm{dom}(f))$.
General step: for any $k=0,1,2,\ldots$ execute the following steps:
(a) pick $L_k>0$;
(b) set $x^{k+1}=\operatorname{prox}_{\frac{1}{L_k}g}\left(x^k-\frac{1}{L_k}\nabla f(x^k)\right)$.

The general update step of the proximal gradient method can be compactly written as
$$x^{k+1}=T^{f,g}_{L_k}(x^k),$$
where $T^{f,g}_L:\mathrm{int}(\mathrm{dom}(f))\to\mathbb{E}$ ($L>0$) is the so-called prox-grad operator defined by
$$T^{f,g}_L(x)\equiv\operatorname{prox}_{\frac{1}{L}g}\left(x-\frac{1}{L}\nabla f(x)\right).$$
When the identities of $f$ and $g$ are clear from the context, we will often omit the superscripts $f,g$ and write $T_L(\cdot)$ instead of $T^{f,g}_L(\cdot)$.

Later on, we will consider two stepsize strategies, constant and backtracking, where the meaning of backtracking slightly changes under the different settings that will be considered, and hence several backtracking procedures will be defined.

Example 10.3. The table below presents the explicit update step of the proximal gradient method when applied to the three particular models discussed in Example 10.2. The exact assumptions on the models are described in Example 10.2.

Model | Update step | Name of method
$\min_{x\in\mathbb{E}} f(x)$ | $x^{k+1}=x^k-t_k\nabla f(x^k)$ | gradient
$\min_{x\in C} f(x)$ | $x^{k+1}=P_C\left(x^k-t_k\nabla f(x^k)\right)$ | projected gradient
$\min_{x\in\mathbb{E}}\{f(x)+\lambda\|x\|_1\}$ | $x^{k+1}=\mathcal{T}_{\lambda t_k}\left(x^k-t_k\nabla f(x^k)\right)$ | ISTA

The third method is known as the iterative shrinkage-thresholding algorithm (ISTA) in the literature, since at each iteration a soft-thresholding operation (also known as "shrinkage") is performed.[54]

[54] Here we use the facts that $\operatorname{prox}_{t g_0}=I$, $\operatorname{prox}_{t\delta_C}=P_C$, and $\operatorname{prox}_{t\lambda\|\cdot\|_1}=\mathcal{T}_{\lambda t}$, where $g_0(x)\equiv 0$.
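The ISTA row of the table can be made concrete in a few lines. The sketch below is an illustrative instance (the identity matrix is chosen only so that the answer is checkable by hand, since then $x^*_i=\mathcal{T}_\lambda(b_i)$); it implements the update $x^{k+1}=\mathcal{T}_{\lambda t_k}(x^k-t_k\nabla f(x^k))$ with the constant stepsize $t_k=1/L_f$.

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding T_tau(v) = sign(v)*max(|v|-tau, 0): the prox of tau*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, num_iters=500):
    """ISTA for min 0.5*||A x - b||^2 + lam*||x||_1 with constant stepsize 1/L_f."""
    L_f = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of the gradient
    t = 1.0 / L_f
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = soft_threshold(x - t * A.T @ (A @ x - b), lam * t)
    return x

# Tiny instance with a known answer: A = I, so x*_i = soft_threshold(b_i, lam).
b = np.array([3.0, -0.5, 1.0])
x = ista(np.eye(3), b, lam=1.0)
# x should be close to (2.0, 0.0, 0.0)
```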

10.3 Analysis of the Proximal Gradient Method — The Nonconvex Case[55]

Sufficient Decrease

To establish the convergence of the proximal gradient method, we will prove a sufficient decrease lemma for composite functions.

Lemma 10.4 (sufficient decrease lemma). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. Let $F=f+g$ and $T_L\equiv T^{f,g}_L$. Then for any $x\in\mathrm{int}(\mathrm{dom}(f))$ and $L\in\left(\frac{L_f}{2},\infty\right)$ the following inequality holds:
$$F(x)-F(T_L(x))\ge\frac{L-\frac{L_f}{2}}{L^2}\left\|G^{f,g}_L(x)\right\|^2,\tag{10.4}$$
where $G^{f,g}_L:\mathrm{int}(\mathrm{dom}(f))\to\mathbb{E}$ is the operator defined by $G^{f,g}_L(x)=L(x-T_L(x))$ for all $x\in\mathrm{int}(\mathrm{dom}(f))$.

Proof. For the sake of simplicity, we use the shorthand notation $x^+=T_L(x)$. By the descent lemma (Lemma 5.7), we have that
$$f(x^+)\le f(x)+\langle\nabla f(x),x^+-x\rangle+\frac{L_f}{2}\|x-x^+\|^2.\tag{10.5}$$
By the second prox theorem (Theorem 6.39), since $x^+=\operatorname{prox}_{\frac{1}{L}g}\left(x-\frac{1}{L}\nabla f(x)\right)$, we have
$$\left\langle x-\frac{1}{L}\nabla f(x)-x^+,x-x^+\right\rangle\le\frac{1}{L}g(x)-\frac{1}{L}g(x^+),$$
from which it follows that
$$\langle\nabla f(x),x^+-x\rangle\le -L\|x^+-x\|^2+g(x)-g(x^+),$$
which, combined with (10.5), yields
$$f(x^+)+g(x^+)\le f(x)+g(x)+\left(-L+\frac{L_f}{2}\right)\|x^+-x\|^2.$$
Hence, taking into account the definitions of $x^+$, $G^{f,g}_L(x)$ and the identities $F(x)=f(x)+g(x)$, $F(x^+)=f(x^+)+g(x^+)$, the desired result follows.

The Gradient Mapping

The operator $G^{f,g}_L$ that appears in the right-hand side of (10.4) is an important mapping that can be seen as a generalization of the notion of the gradient.

Definition 10.5 (gradient mapping). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. Then the gradient mapping is the operator

[55] The analysis of the proximal gradient method in Sections 10.3 and 10.4 mostly follows the presentation of Beck and Teboulle in [18] and [19].

$G^{f,g}_L:\mathrm{int}(\mathrm{dom}(f))\to\mathbb{E}$ defined by
$$G^{f,g}_L(x)\equiv L\left(x-T^{f,g}_L(x)\right)$$
for any $x\in\mathrm{int}(\mathrm{dom}(f))$.

When the identities of $f$ and $g$ are clear from the context, we will use the notation $G_L$ instead of $G^{f,g}_L$. With the terminology of the gradient mapping, the update step of the proximal gradient method can be rewritten as
$$x^{k+1}=x^k-\frac{1}{L_k}G_{L_k}(x^k).$$
In the special case where $L=L_f$, the sufficient decrease inequality (10.4) takes a simpler form.

Corollary 10.6. Under the setting of Lemma 10.4, the following inequality holds for any $x\in\mathrm{int}(\mathrm{dom}(f))$:
$$F(x)-F(T_{L_f}(x))\ge\frac{1}{2L_f}\left\|G_{L_f}(x)\right\|^2.$$

The next result shows that the gradient mapping is a generalization of the usual gradient operator $x\mapsto\nabla f(x)$ in the sense that they coincide when $g\equiv 0$ and that, for a general $g$, the points in which the gradient mapping vanishes are the stationary points of the problem of minimizing $f+g$. Recall (see Definition 3.73) that a point $x^*\in\mathrm{dom}(g)$ is a stationary point of problem (10.1) if and only if $-\nabla f(x^*)\in\partial g(x^*)$ and that this condition is a necessary optimality condition for local optimal points (see Theorem 3.72).

Theorem 10.7. Let $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1 and let $L>0$. Then

(a) $G^{f,g_0}_L(x)=\nabla f(x)$ for any $x\in\mathrm{int}(\mathrm{dom}(f))$, where $g_0(x)\equiv 0$;

(b) for $x^*\in\mathrm{int}(\mathrm{dom}(f))$, it holds that $G^{f,g}_L(x^*)=0$ if and only if $x^*$ is a stationary point of problem (10.1).

Proof. (a) Since $\operatorname{prox}_{\frac{1}{L}g_0}(y)=y$ for all $y\in\mathbb{E}$, it follows that
$$G^{f,g_0}_L(x)=L\left(x-T^{f,g_0}_L(x)\right)=L\left(x-\operatorname{prox}_{\frac{1}{L}g_0}\left(x-\frac{1}{L}\nabla f(x)\right)\right)=L\left(x-\left(x-\frac{1}{L}\nabla f(x)\right)\right)=\nabla f(x).$$
(b) $G^{f,g}_L(x^*)=0$ if and only if $x^*=\operatorname{prox}_{\frac{1}{L}g}\left(x^*-\frac{1}{L}\nabla f(x^*)\right)$. By the second prox theorem (Theorem 6.39), the latter relation holds if and only if
$$x^*-\frac{1}{L}\nabla f(x^*)-x^*\in\frac{1}{L}\partial g(x^*),$$

that is, if and only if $-\nabla f(x^*)\in\partial g(x^*)$, which is exactly the condition for stationarity.

If in addition $f$ is convex, then stationarity is a necessary and sufficient optimality condition (Theorem 3.72(b)), which leads to the following corollary.

Corollary 10.8 (necessary and sufficient optimality condition under convexity). Let $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1, and let $L>0$. Suppose that in addition $f$ is convex. Then for $x^*\in\mathrm{dom}(g)$, $G^{f,g}_L(x^*)=0$ if and only if $x^*$ is an optimal solution of problem (10.1).

We can think of the quantity $\|G_L(x)\|$ as an "optimality measure" in the sense that it is always nonnegative, and equal to zero if and only if $x$ is a stationary point. The next result establishes important monotonicity properties of $\|G_L(x)\|$ w.r.t. the parameter $L$.

Theorem 10.9 (monotonicity of the gradient mapping). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1 and let $G_L\equiv G^{f,g}_L$. Suppose that $L_1\ge L_2>0$. Then
$$\|G_{L_1}(x)\|\ge\|G_{L_2}(x)\|\tag{10.6}$$
and
$$\frac{\|G_{L_1}(x)\|}{L_1}\le\frac{\|G_{L_2}(x)\|}{L_2}\tag{10.7}$$
for any $x\in\mathrm{int}(\mathrm{dom}(f))$.

Proof. Recall that by the second prox theorem (Theorem 6.39), for any $v,w\in\mathbb{E}$ and $L>0$, the following inequality holds:
$$\left\langle v-\operatorname{prox}_{\frac{1}{L}g}(v),\operatorname{prox}_{\frac{1}{L}g}(v)-w\right\rangle\ge\frac{1}{L}g\left(\operatorname{prox}_{\frac{1}{L}g}(v)\right)-\frac{1}{L}g(w).$$
Plugging $L=L_1$, $v=x-\frac{1}{L_1}\nabla f(x)$, and $w=\operatorname{prox}_{\frac{1}{L_2}g}\left(x-\frac{1}{L_2}\nabla f(x)\right)=T_{L_2}(x)$ into the last inequality, it follows that
$$\left\langle x-\frac{1}{L_1}\nabla f(x)-T_{L_1}(x),T_{L_1}(x)-T_{L_2}(x)\right\rangle\ge\frac{1}{L_1}g(T_{L_1}(x))-\frac{1}{L_1}g(T_{L_2}(x)),$$
or
$$\left\langle\frac{1}{L_1}G_{L_1}(x)-\frac{1}{L_1}\nabla f(x),\frac{1}{L_2}G_{L_2}(x)-\frac{1}{L_1}G_{L_1}(x)\right\rangle\ge\frac{1}{L_1}g(T_{L_1}(x))-\frac{1}{L_1}g(T_{L_2}(x)).$$
Exchanging the roles of $L_1$ and $L_2$ yields the following inequality:
$$\left\langle\frac{1}{L_2}G_{L_2}(x)-\frac{1}{L_2}\nabla f(x),\frac{1}{L_1}G_{L_1}(x)-\frac{1}{L_2}G_{L_2}(x)\right\rangle\ge\frac{1}{L_2}g(T_{L_2}(x))-\frac{1}{L_2}g(T_{L_1}(x)).$$
Multiplying the first inequality by $L_1$ and the second by $L_2$ and adding them, we obtain
$$\left\langle G_{L_1}(x)-G_{L_2}(x),\frac{1}{L_2}G_{L_2}(x)-\frac{1}{L_1}G_{L_1}(x)\right\rangle\ge 0,$$

which after some expansion of terms can be seen to be the same as
$$\frac{1}{L_1}\|G_{L_1}(x)\|^2+\frac{1}{L_2}\|G_{L_2}(x)\|^2\le\left(\frac{1}{L_1}+\frac{1}{L_2}\right)\langle G_{L_1}(x),G_{L_2}(x)\rangle.$$
Using the Cauchy–Schwarz inequality, we obtain that
$$\frac{1}{L_1}\|G_{L_1}(x)\|^2+\frac{1}{L_2}\|G_{L_2}(x)\|^2\le\left(\frac{1}{L_1}+\frac{1}{L_2}\right)\|G_{L_1}(x)\|\cdot\|G_{L_2}(x)\|.\tag{10.8}$$
Note that if $G_{L_2}(x)=0$, then by the last inequality, $G_{L_1}(x)=0$, implying that in this case the inequalities (10.6) and (10.7) hold trivially. Assume then that $G_{L_2}(x)\neq 0$ and define $t=\frac{\|G_{L_1}(x)\|}{\|G_{L_2}(x)\|}$. Then, by (10.8),
$$\frac{1}{L_1}t^2-\left(\frac{1}{L_1}+\frac{1}{L_2}\right)t+\frac{1}{L_2}\le 0.$$
Since the roots of the quadratic function on the left-hand side of the above inequality are $t=1,\frac{L_1}{L_2}$, we obtain that $1\le t\le\frac{L_1}{L_2}$, showing that
$$\|G_{L_2}(x)\|\le\|G_{L_1}(x)\|\le\frac{L_1}{L_2}\|G_{L_2}(x)\|.$$

A straightforward result of the nonexpansivity of the prox operator and the $L_f$-smoothness of $f$ over $\mathrm{int}(\mathrm{dom}(f))$ is that $G_L(\cdot)$ is Lipschitz continuous with constant $2L+L_f$. Indeed, for any $x,y\in\mathrm{int}(\mathrm{dom}(f))$,
$$\begin{aligned}\|G_L(x)-G_L(y)\|&=\left\|L\left(x-\operatorname{prox}_{\frac{1}{L}g}\left(x-\tfrac{1}{L}\nabla f(x)\right)\right)-L\left(y-\operatorname{prox}_{\frac{1}{L}g}\left(y-\tfrac{1}{L}\nabla f(y)\right)\right)\right\|\\&\le L\|x-y\|+L\left\|\operatorname{prox}_{\frac{1}{L}g}\left(x-\tfrac{1}{L}\nabla f(x)\right)-\operatorname{prox}_{\frac{1}{L}g}\left(y-\tfrac{1}{L}\nabla f(y)\right)\right\|\\&\le L\|x-y\|+L\left\|\left(x-\tfrac{1}{L}\nabla f(x)\right)-\left(y-\tfrac{1}{L}\nabla f(y)\right)\right\|\\&\le 2L\|x-y\|+\|\nabla f(x)-\nabla f(y)\|\\&\le(2L+L_f)\|x-y\|.\end{aligned}$$
In particular, for $L=L_f$, we obtain the inequality
$$\|G_{L_f}(x)-G_{L_f}(y)\|\le 3L_f\|x-y\|.$$
The above discussion is summarized in the following lemma.

Lemma 10.10 (Lipschitz continuity of the gradient mapping). Let $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. Let $G_L=G^{f,g}_L$. Then

(a) $\|G_L(x)-G_L(y)\|\le(2L+L_f)\|x-y\|$ for any $x,y\in\mathrm{int}(\mathrm{dom}(f))$;

(b) $\|G_{L_f}(x)-G_{L_f}(y)\|\le 3L_f\|x-y\|$ for any $x,y\in\mathrm{int}(\mathrm{dom}(f))$.
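A small numerical illustration of the gradient mapping as an optimality measure may help here; the quadratic $f$ and the $l_1$ choice of $g$ below are assumptions for the demo, not from the text. The mapping vanishes at a minimizer (by Theorem 10.7/Corollary 10.8) and is nonzero elsewhere.

```python
import numpy as np

def gradient_mapping(x, grad_f, prox_g, L):
    """G_L(x) = L*(x - T_L(x)), with T_L(x) = prox_{(1/L) g}(x - (1/L) grad_f(x))."""
    return L * (x - prox_g(x - grad_f(x) / L, 1.0 / L))

# Demo instance: f(x) = 0.5*||x - c||^2, g = lam*||.||_1, whose minimizer is
# x* = soft_threshold(c, lam).
c = np.array([2.0, 0.3, -1.5])
lam = 1.0
grad_f = lambda x: x - c
prox_g = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam * t, 0.0)

x_star = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)   # (1.0, 0.0, -0.5)
g_at_star = gradient_mapping(x_star, grad_f, prox_g, L=1.0)
g_elsewhere = gradient_mapping(np.zeros(3), grad_f, prox_g, L=1.0)
# g_at_star is (numerically) zero; g_elsewhere is not.
```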

Lemma 10.11 below shows that when $f$ is assumed to be convex and $L_f$-smooth over the entire space, then the operator $\frac{3}{4L_f}G_{L_f}$ is firmly nonexpansive. A direct consequence is that $G_{L_f}$ is Lipschitz continuous with constant $\frac{4L_f}{3}$.

Lemma 10.11 (firm nonexpansivity of $\frac{3}{4L_f}G_{L_f}$). Let $f$ be a convex and $L_f$-smooth function ($L_f>0$), and let $g:\mathbb{E}\to(-\infty,\infty]$ be a proper closed and convex function. Then

(a) the gradient mapping $G_{L_f}\equiv G^{f,g}_{L_f}$ satisfies the relation
$$\langle G_{L_f}(x)-G_{L_f}(y),x-y\rangle\ge\frac{3}{4L_f}\|G_{L_f}(x)-G_{L_f}(y)\|^2\tag{10.9}$$
for any $x,y\in\mathbb{E}$;

(b) $\|G_{L_f}(x)-G_{L_f}(y)\|\le\frac{4L_f}{3}\|x-y\|$ for any $x,y\in\mathbb{E}$.

Proof. Part (b) is a direct consequence of (a) and the Cauchy–Schwarz inequality. We will therefore prove (a). To simplify the presentation, we will use the notation $L=L_f$. By the firm nonexpansivity of the prox operator (Theorem 6.42(a)), it follows that for any $x,y\in\mathbb{E}$,
$$\left\langle T_L(x)-T_L(y),\left(x-\tfrac{1}{L}\nabla f(x)\right)-\left(y-\tfrac{1}{L}\nabla f(y)\right)\right\rangle\ge\|T_L(x)-T_L(y)\|^2,$$
where $T_L\equiv T^{f,g}_L$ is the prox-grad mapping. Since $T_L=I-\frac{1}{L}G_L$, we obtain that
$$\left\langle\left(x-\tfrac{1}{L}G_L(x)\right)-\left(y-\tfrac{1}{L}G_L(y)\right),\left(x-\tfrac{1}{L}\nabla f(x)\right)-\left(y-\tfrac{1}{L}\nabla f(y)\right)\right\rangle\ge\left\|\left(x-\tfrac{1}{L}G_L(x)\right)-\left(y-\tfrac{1}{L}G_L(y)\right)\right\|^2,$$
which is the same as
$$\left\langle\left(x-\tfrac{1}{L}G_L(x)\right)-\left(y-\tfrac{1}{L}G_L(y)\right),\left(G_L(x)-\nabla f(x)\right)-\left(G_L(y)-\nabla f(y)\right)\right\rangle\ge 0.$$
Therefore,
$$\langle G_L(x)-G_L(y),x-y\rangle\ge\frac{1}{L}\|G_L(x)-G_L(y)\|^2+\langle\nabla f(x)-\nabla f(y),x-y\rangle-\frac{1}{L}\langle G_L(x)-G_L(y),\nabla f(x)-\nabla f(y)\rangle.$$
Since $f$ is $L$-smooth, it follows from Theorem 5.8 (equivalence between (i) and (iv)) that
$$\langle\nabla f(x)-\nabla f(y),x-y\rangle\ge\frac{1}{L}\|\nabla f(x)-\nabla f(y)\|^2.$$
Consequently,
$$L\langle G_L(x)-G_L(y),x-y\rangle\ge\|G_L(x)-G_L(y)\|^2+\|\nabla f(x)-\nabla f(y)\|^2-\langle G_L(x)-G_L(y),\nabla f(x)-\nabla f(y)\rangle.$$

From the Cauchy–Schwarz inequality we get
$$L\langle G_L(x)-G_L(y),x-y\rangle\ge\|G_L(x)-G_L(y)\|^2+\|\nabla f(x)-\nabla f(y)\|^2-\|G_L(x)-G_L(y)\|\cdot\|\nabla f(x)-\nabla f(y)\|.\tag{10.10}$$
By denoting $\alpha=\|G_L(x)-G_L(y)\|$ and $\beta=\|\nabla f(x)-\nabla f(y)\|$, the right-hand side of (10.10) reads as $\alpha^2+\beta^2-\alpha\beta$ and satisfies
$$\alpha^2+\beta^2-\alpha\beta=\frac{3}{4}\alpha^2+\left(\frac{\alpha}{2}-\beta\right)^2\ge\frac{3}{4}\alpha^2,$$
which, combined with (10.10), yields the inequality
$$L\langle G_L(x)-G_L(y),x-y\rangle\ge\frac{3}{4}\|G_L(x)-G_L(y)\|^2.$$
Thus, (10.9) holds.

The next result shows a different kind of a monotonicity property of the gradient mapping norm under the setting of Lemma 10.11: the norm of the gradient mapping does not increase if a prox-grad step is employed on its argument.

Lemma 10.12 (monotonicity of the norm of the gradient mapping w.r.t. the prox-grad operator).[56] Let $f$ be a convex and $L_f$-smooth function ($L_f>0$), and let $g:\mathbb{E}\to(-\infty,\infty]$ be a proper closed and convex function. Then for any $x\in\mathbb{E}$,
$$\|G_{L_f}(T_{L_f}(x))\|\le\|G_{L_f}(x)\|,$$
where $G_{L_f}\equiv G^{f,g}_{L_f}$ and $T_{L_f}\equiv T^{f,g}_{L_f}$.

Proof. Let $x\in\mathbb{E}$. We will use the shorthand notation $x^+=T_{L_f}(x)$. By Theorem 5.8 (equivalence between (i) and (iv)), it follows that
$$\|\nabla f(x^+)-\nabla f(x)\|^2\le L_f\langle\nabla f(x^+)-\nabla f(x),x^+-x\rangle.\tag{10.11}$$
Denoting $a=\nabla f(x^+)-\nabla f(x)$ and $b=x^+-x$, inequality (10.11) can be rewritten as $\|a\|^2\le L_f\langle a,b\rangle$, which is the same as
$$\left\|\frac{1}{L_f}a-\frac{1}{2}b\right\|^2\le\frac{1}{4}\|b\|^2,$$
and as
$$\left\|\frac{1}{L_f}a-\frac{1}{2}b\right\|\le\frac{1}{2}\|b\|.$$
Using the triangle inequality,
$$\left\|\frac{1}{L_f}a-b\right\|\le\left\|\frac{1}{L_f}a-\frac{1}{2}b\right\|+\frac{1}{2}\|b\|\le\|b\|.$$

[56] Lemma 10.12 is a minor variation of Lemma 2.4 from Necoara and Patrascu [88].

Plugging the expressions for $a$ and $b$ into the above inequality, we obtain that
$$\left\|\left(x-\frac{1}{L_f}\nabla f(x)\right)-\left(x^+-\frac{1}{L_f}\nabla f(x^+)\right)\right\|\le\|x^+-x\|.$$
Combining the above inequality with the nonexpansivity of the prox operator (Theorem 6.42(b)), we finally obtain
$$\begin{aligned}\|G_{L_f}(T_{L_f}(x))\|&=\|G_{L_f}(x^+)\|=L_f\|x^+-T_{L_f}(x^+)\|\\&=L_f\left\|\operatorname{prox}_{\frac{1}{L_f}g}\left(x-\tfrac{1}{L_f}\nabla f(x)\right)-\operatorname{prox}_{\frac{1}{L_f}g}\left(x^+-\tfrac{1}{L_f}\nabla f(x^+)\right)\right\|\\&\le L_f\left\|\left(x-\tfrac{1}{L_f}\nabla f(x)\right)-\left(x^+-\tfrac{1}{L_f}\nabla f(x^+)\right)\right\|\\&\le L_f\|x^+-x\|=L_f\|T_{L_f}(x)-x\|=\|G_{L_f}(x)\|,\end{aligned}$$
which is the desired result.

Convergence of the Proximal Gradient Method — The Nonconvex Case

We will now analyze the convergence of the proximal gradient method under the validity of Assumption 10.1. Note that we do not assume at this stage that $f$ is convex. The two stepsize strategies that will be considered are constant and backtracking.

Constant. $L_k=\bar L\in\left(\frac{L_f}{2},\infty\right)$ for all $k$.

Backtracking procedure B1. The procedure requires three parameters $(s,\gamma,\eta)$, where $s>0$, $\gamma\in(0,1)$, and $\eta>1$. The choice of $L_k$ is done as follows. First, $L_k$ is set to be equal to the initial guess $s$. Then, while
$$F(x^k)-F(T_{L_k}(x^k))<\frac{\gamma}{L_k}\|G_{L_k}(x^k)\|^2,$$
we set $L_k:=\eta L_k$. In other words, $L_k$ is chosen as $L_k=s\eta^{i_k}$, where $i_k$ is the smallest nonnegative integer for which the condition
$$F(x^k)-F(T_{s\eta^{i_k}}(x^k))\ge\frac{\gamma}{s\eta^{i_k}}\|G_{s\eta^{i_k}}(x^k)\|^2$$
is satisfied.

Remark 10.13. Note that the backtracking procedure is finite under Assumption 10.1. Indeed, plugging $x=x^k$ and $L=L_k$ into (10.4), we obtain
$$F(x^k)-F(T_{L_k}(x^k))\ge\frac{L_k-\frac{L_f}{2}}{L_k^2}\|G_{L_k}(x^k)\|^2.\tag{10.12}$$

If $L_k\ge\frac{L_f}{2(1-\gamma)}$, then $\frac{L_k-\frac{L_f}{2}}{L_k}\ge\gamma$, and hence, by (10.12), the inequality
$$F(x^k)-F(T_{L_k}(x^k))\ge\frac{\gamma}{L_k}\|G_{L_k}(x^k)\|^2$$
holds, implying that the backtracking procedure must end when $L_k\ge\frac{L_f}{2(1-\gamma)}$. We can also compute an upper bound on $L_k$: either $L_k$ is equal to $s$, or the backtracking procedure is invoked, meaning that $\frac{L_k}{\eta}$ did not satisfy the backtracking condition, which by the above discussion implies that $\frac{L_k}{\eta}<\frac{L_f}{2(1-\gamma)}$, so that $L_k<\frac{\eta L_f}{2(1-\gamma)}$. To summarize, in the backtracking procedure B1, the parameter $L_k$ satisfies
$$L_k\le\max\left\{s,\frac{\eta L_f}{2(1-\gamma)}\right\}.\tag{10.13}$$

The convergence of the proximal gradient method in the nonconvex case is heavily based on the sufficient decrease lemma (Lemma 10.4). We begin with the following lemma showing that consecutive function values of the sequence generated by the proximal gradient method decrease by at least a constant times the squared norm of the gradient mapping.

Lemma 10.14 (sufficient decrease of the proximal gradient method). Suppose that Assumption 10.1 holds. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize defined by $L_k=\bar L\in\left(\frac{L_f}{2},\infty\right)$ or with a stepsize chosen by the backtracking procedure B1 with parameters $(s,\gamma,\eta)$, where $s>0,\gamma\in(0,1),\eta>1$. Then for any $k\ge 0$,
$$F(x^k)-F(x^{k+1})\ge M\|G_d(x^k)\|^2,\tag{10.14}$$
where
$$M=\begin{cases}\dfrac{\bar L-\frac{L_f}{2}}{\bar L^2},&\text{constant stepsize},\\[2mm]\dfrac{\gamma}{\max\left\{s,\frac{\eta L_f}{2(1-\gamma)}\right\}},&\text{backtracking},\end{cases}\tag{10.15}$$
and
$$d=\begin{cases}\bar L,&\text{constant stepsize},\\ s,&\text{backtracking}.\end{cases}\tag{10.16}$$

Proof. The result for the constant stepsize setting follows by plugging $L=\bar L$ and $x=x^k$ into (10.4). As for the case where the backtracking procedure is used, by its definition we have
$$F(x^k)-F(x^{k+1})\ge\frac{\gamma}{L_k}\|G_{L_k}(x^k)\|^2\ge\frac{\gamma}{\max\left\{s,\frac{\eta L_f}{2(1-\gamma)}\right\}}\|G_{L_k}(x^k)\|^2,$$
where the last inequality follows from the upper bound on $L_k$ given in (10.13). The result for the case where the backtracking procedure is invoked now follows by

the monotonicity property of the gradient mapping (Theorem 10.9) along with the bound $L_k\ge s$, which imply the inequality $\|G_{L_k}(x^k)\|\ge\|G_s(x^k)\|$.

We are now ready to prove the convergence of the norm of the gradient mapping to zero and that limit points of the sequence generated by the method are stationary points of problem (10.1).

Theorem 10.15 (convergence of the proximal gradient method — nonconvex case). Suppose that Assumption 10.1 holds and let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) either with a constant stepsize defined by $L_k=\bar L\in\left(\frac{L_f}{2},\infty\right)$ or with a stepsize chosen by the backtracking procedure B1 with parameters $(s,\gamma,\eta)$, where $s>0$, $\gamma\in(0,1)$, and $\eta>1$. Then

(a) the sequence $\{F(x^k)\}_{k\ge 0}$ is nonincreasing. In addition, $F(x^{k+1})<F(x^k)$ if and only if $x^k$ is not a stationary point of (10.1);

(b) $G_d(x^k)\to 0$ as $k\to\infty$, where $d$ is given in (10.16);

(c)
$$\min_{n=0,1,\ldots,k}\|G_d(x^n)\|\le\sqrt{\frac{F(x^0)-F_{\mathrm{opt}}}{M(k+1)}},\tag{10.17}$$
where $M$ is given in (10.15);

(d) all limit points of the sequence $\{x^k\}_{k\ge 0}$ are stationary points of problem (10.1).

Proof. (a) By Lemma 10.14 we have that
$$F(x^k)-F(x^{k+1})\ge M\|G_d(x^k)\|^2,\tag{10.18}$$
from which it readily follows that $F(x^k)\ge F(x^{k+1})$. If $x^k$ is not a stationary point of problem (10.1), then $G_d(x^k)\neq 0$, and hence, by (10.18), $F(x^k)>F(x^{k+1})$. If $x^k$ is a stationary point of problem (10.1), then $G_{L_k}(x^k)=0$, from which it follows that $x^{k+1}=x^k-\frac{1}{L_k}G_{L_k}(x^k)=x^k$, and consequently $F(x^k)=F(x^{k+1})$.

(b) Since the sequence $\{F(x^k)\}_{k\ge 0}$ is nonincreasing and bounded below, it converges. Thus, in particular, $F(x^k)-F(x^{k+1})\to 0$ as $k\to\infty$, which, combined with (10.18), implies that $\|G_d(x^k)\|\to 0$ as $k\to\infty$.

(c) Summing the inequality (10.18) over $n=0,1,\ldots,k$, we obtain
$$F(x^0)-F(x^{k+1})=\sum_{n=0}^{k}\left(F(x^n)-F(x^{n+1})\right)\ge M\sum_{n=0}^{k}\|G_d(x^n)\|^2\ge M(k+1)\min_{n=0,1,\ldots,k}\|G_d(x^n)\|^2.$$
Using the fact that $F(x^{k+1})\ge F_{\mathrm{opt}}$, the inequality (10.17) follows.

(d) Let $\bar x$ be a limit point of $\{x^k\}_{k\ge 0}$. Then there exists a subsequence $\{x^{k_j}\}_{j\ge 0}$ converging to $\bar x$. For any $j\ge 0$,
$$\|G_d(\bar x)\|\le\|G_d(\bar x)-G_d(x^{k_j})\|+\|G_d(x^{k_j})\|\le(2d+L_f)\|x^{k_j}-\bar x\|+\|G_d(x^{k_j})\|,\tag{10.19}$$
where Lemma 10.10(a) was used in the second inequality. Since the right-hand side of (10.19) goes to 0 as $j\to\infty$, it follows that $G_d(\bar x)=0$, which by Theorem 10.7(b) implies that $\bar x$ is a stationary point of problem (10.1).

10.4 Analysis of the Proximal Gradient Method — The Convex Case

The Fundamental Prox-Grad Inequality

The analysis of the proximal gradient method in the case where $f$ is convex is based on the following key inequality (which actually does not assume that $f$ is convex).

Theorem 10.16 (fundamental prox-grad inequality). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. For any $x\in\mathbb{E}$, $y\in\mathrm{int}(\mathrm{dom}(f))$, and $L>0$ satisfying
$$f(T_L(y))\le f(y)+\langle\nabla f(y),T_L(y)-y\rangle+\frac{L}{2}\|T_L(y)-y\|^2,\tag{10.20}$$
it holds that
$$F(x)-F(T_L(y))\ge\frac{L}{2}\|x-T_L(y)\|^2-\frac{L}{2}\|x-y\|^2+\ell_f(x,y),\tag{10.21}$$
where
$$\ell_f(x,y)=f(x)-f(y)-\langle\nabla f(y),x-y\rangle.$$

Proof. Consider the function
$$\varphi(u)=f(y)+\langle\nabla f(y),u-y\rangle+g(u)+\frac{L}{2}\|u-y\|^2.$$
Since $\varphi$ is an $L$-strongly convex function and $T_L(y)=\operatorname{argmin}_{u\in\mathbb{E}}\varphi(u)$, it follows by Theorem 5.25(b) that
$$\varphi(x)-\varphi(T_L(y))\ge\frac{L}{2}\|x-T_L(y)\|^2.\tag{10.22}$$
Note that by (10.20),
$$\varphi(T_L(y))=f(y)+\langle\nabla f(y),T_L(y)-y\rangle+\frac{L}{2}\|T_L(y)-y\|^2+g(T_L(y))\ge f(T_L(y))+g(T_L(y))=F(T_L(y)),$$
and thus (10.22) implies that for any $x\in\mathbb{E}$,
$$\varphi(x)\ge F(T_L(y))+\frac{L}{2}\|x-T_L(y)\|^2.$$

Plugging the expression for $\varphi(x)$ into the above inequality, we obtain
$$f(y)+\langle\nabla f(y),x-y\rangle+g(x)+\frac{L}{2}\|x-y\|^2\ge F(T_L(y))+\frac{L}{2}\|x-T_L(y)\|^2,$$
which is the same as the desired result:
$$F(x)-F(T_L(y))\ge\frac{L}{2}\|x-T_L(y)\|^2-\frac{L}{2}\|x-y\|^2+f(x)-f(y)-\langle\nabla f(y),x-y\rangle.$$

Remark 10.17. Obviously, by the descent lemma, (10.20) is satisfied for $L=L_f$, and hence, for any $x\in\mathbb{E}$ and $y\in\mathrm{int}(\mathrm{dom}(f))$, the inequality
$$F(x)-F(T_{L_f}(y))\ge\frac{L_f}{2}\|x-T_{L_f}(y)\|^2-\frac{L_f}{2}\|x-y\|^2+\ell_f(x,y)$$
holds.

A direct consequence of Theorem 10.16 is another version of the sufficient decrease lemma (Lemma 10.4). This is accomplished by substituting $y=x$ in the fundamental prox-grad inequality.

Corollary 10.18 (sufficient decrease lemma — second version). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. For any $x\in\mathrm{int}(\mathrm{dom}(f))$ for which
$$f(T_L(x))\le f(x)+\langle\nabla f(x),T_L(x)-x\rangle+\frac{L}{2}\|T_L(x)-x\|^2,$$
it holds that
$$F(x)-F(T_L(x))\ge\frac{1}{2L}\|G_L(x)\|^2.$$

Stepsize Strategies in the Convex Case

When $f$ is also convex, we will consider, as in the nonconvex case, both constant and backtracking stepsize strategies. The backtracking procedure, which we will refer to as "backtracking procedure B2," will be slightly different than the one considered in the nonconvex case, and it will aim to find a constant $L_k$ satisfying
$$f(x^{k+1})\le f(x^k)+\langle\nabla f(x^k),x^{k+1}-x^k\rangle+\frac{L_k}{2}\|x^{k+1}-x^k\|^2.\tag{10.23}$$
In the special case where $g\equiv 0$, the proximal gradient method reduces to the gradient method $x^{k+1}=x^k-\frac{1}{L_k}\nabla f(x^k)$, and condition (10.23) reduces to
$$f(x^k)-f(x^{k+1})\ge\frac{1}{2L_k}\|\nabla f(x^k)\|^2,$$
which is similar to the sufficient decrease condition described in Lemma 10.4, and this is why condition (10.23) can also be viewed as a "sufficient decrease condition."
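For the $g\equiv 0$ case just described, the reduction of the condition to a sufficient decrease inequality for the pure gradient step can be checked numerically. The quadratic below is an illustrative assumption for the demo, not from the text.

```python
import numpy as np

# Check that for the gradient step x+ = x - (1/L)*grad_f(x) with L = L_f,
# the sufficient decrease inequality f(x) - f(x+) >= ||grad_f(x)||^2 / (2L)
# holds (it is guaranteed by the descent lemma).
Q = np.diag([1.0, 3.0, 5.0])            # f(x) = 0.5*<x, Qx>, so L_f = 5
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = 5.0

x = np.array([1.0, -2.0, 0.5])
x_plus = x - grad(x) / L
decrease = f(x) - f(x_plus)             # actual decrease in f
threshold = np.dot(grad(x), grad(x)) / (2 * L)
# decrease >= threshold must hold.
```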

Constant. $L_k=L_f$ for all $k$.

Backtracking procedure B2. The procedure requires two parameters $(s,\eta)$, where $s>0$ and $\eta>1$. Define $L_{-1}=s$. At iteration $k$ ($k\ge 0$) the choice of $L_k$ is done as follows. First, $L_k$ is set to be equal to $L_{k-1}$. Then, while
$$f(T_{L_k}(x^k))>f(x^k)+\langle\nabla f(x^k),T_{L_k}(x^k)-x^k\rangle+\frac{L_k}{2}\|T_{L_k}(x^k)-x^k\|^2,$$
we set $L_k:=\eta L_k$. In other words, $L_k$ is chosen as $L_k=L_{k-1}\eta^{i_k}$, where $i_k$ is the smallest nonnegative integer for which the condition
$$f(T_{L_{k-1}\eta^{i_k}}(x^k))\le f(x^k)+\langle\nabla f(x^k),T_{L_{k-1}\eta^{i_k}}(x^k)-x^k\rangle+\frac{L_{k-1}\eta^{i_k}}{2}\|T_{L_{k-1}\eta^{i_k}}(x^k)-x^k\|^2$$
is satisfied.

Remark 10.19 (upper and lower bounds on $L_k$). Under Assumption 10.1 and by the descent lemma (Lemma 5.7), it follows that both stepsize rules ensure that the sufficient decrease condition (10.23) is satisfied at each iteration. In addition, the constants $L_k$ that the backtracking procedure B2 produces satisfy the following bounds for all $k\ge 0$:
$$s\le L_k\le\max\{\eta L_f,s\}.\tag{10.24}$$
The inequality $s\le L_k$ is obvious. To understand the inequality $L_k\le\max\{\eta L_f,s\}$, note that there are two options. Either $L_k=s$, or $L_k>s$, and in the latter case there exists an index $k'\le k$ for which the inequality (10.23) was not satisfied at iteration $k'$ with $\frac{L_k}{\eta}$ replacing $L_{k'}$. By the descent lemma, this implies in particular that $\frac{L_k}{\eta}<L_f$, and we have thus shown that $L_k\le\max\{\eta L_f,s\}$. We also note that the bounds on $L_k$ can be rewritten as
$$\beta L_f\le L_k\le\alpha L_f,$$
where
$$\alpha=\begin{cases}1,&\text{constant},\\ \max\left\{\eta,\frac{s}{L_f}\right\},&\text{backtracking},\end{cases}\qquad\beta=\begin{cases}1,&\text{constant},\\ \frac{s}{L_f},&\text{backtracking}.\end{cases}\tag{10.25}$$

Remark 10.20 (monotonicity of the proximal gradient method). Since condition (10.23) holds for both stepsize rules, for any $k\ge 0$, we can invoke the fundamental prox-grad inequality (10.21) with $y=x=x^k$, $L=L_k$ and obtain the inequality
$$F(x^k)-F(x^{k+1})\ge\frac{L_k}{2}\|x^k-x^{k+1}\|^2,$$
which in particular implies that $F(x^k)\ge F(x^{k+1})$, meaning that the method produces a nonincreasing sequence of function values.
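The procedure B2 loop can be sketched directly from its description. The helper names and the tiny test problem below are assumptions for illustration; the loop itself mirrors the "while the condition fails, multiply by $\eta$" rule above.

```python
import numpy as np

def backtracking_b2(x, f, grad_f, prox_g, L_prev, eta):
    """Sketch of backtracking procedure B2: starting from the previous constant,
    increase L by factors of eta until the descent-lemma-type condition holds
    at T_L(x); returns (T_L(x), L)."""
    L = L_prev
    gx = grad_f(x)
    while True:
        T = prox_g(x - gx / L, 1.0 / L)            # T_L(x)
        d = T - x
        if f(T) <= f(x) + np.dot(gx, d) + (L / 2) * np.dot(d, d):
            return T, L
        L *= eta

# Illustrative check: f(x) = 0.5*||x||^2 (L_f = 1), g = 0 (prox is the identity).
# Starting from L_prev = 0.25 with eta = 2, the loop must stop at L = 1.
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
prox_g = lambda v, t: v
T, L = backtracking_b2(np.array([2.0, -1.0]), f, grad_f, prox_g, L_prev=0.25, eta=2.0)
```

Note that, as stated in Remark 10.19, restarting each iteration from $L_{k-1}$ (rather than from $s$) makes the produced constants nondecreasing.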

Convergence Analysis in the Convex Case

We will assume in addition to Assumption 10.1 that $f$ is convex. We begin by establishing an $O(1/k)$ rate of convergence of the generated sequence of function values to the optimal value. Such a rate of convergence is called a sublinear rate. This is of course an improvement over the $O(1/\sqrt{k})$ rate that was established for the projected subgradient and mirror descent methods. It is also not particularly surprising that an improved rate of convergence can be established, since additional properties are assumed on the objective function.

Theorem 10.21 ($O(1/k)$ rate of convergence of proximal gradient). Suppose that Assumption 10.1 holds and that in addition $f$ is convex. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$ or the backtracking procedure B2. Then for any $x^*\in X^*$ and $k\ge 1$,
$$F(x^k)-F_{\mathrm{opt}}\le\frac{\alpha L_f\|x^0-x^*\|^2}{2k},\tag{10.26}$$
where $\alpha=1$ in the constant stepsize setting and $\alpha=\max\left\{\eta,\frac{s}{L_f}\right\}$ if the backtracking rule is employed.

Proof. For any $n\ge 0$, substituting $L=L_n$, $x=x^*$, and $y=x^n$ in the fundamental prox-grad inequality (10.21) and taking into account the fact that in both stepsize rules condition (10.20) is satisfied, we obtain
$$\frac{2}{L_n}\left(F(x^*)-F(x^{n+1})\right)\ge\|x^*-x^{n+1}\|^2-\|x^*-x^n\|^2+\frac{2}{L_n}\ell_f(x^*,x^n)\ge\|x^*-x^{n+1}\|^2-\|x^*-x^n\|^2,$$
where the convexity of $f$ was used in the last inequality. Summing the above inequality over $n=0,1,\ldots,k-1$ and using the bound $L_n\le\alpha L_f$ for all $n\ge 0$ (see Remark 10.19), we obtain
$$\frac{2}{\alpha L_f}\sum_{n=0}^{k-1}\left(F(x^*)-F(x^{n+1})\right)\ge\|x^*-x^k\|^2-\|x^*-x^0\|^2.$$
Thus,
$$\sum_{n=0}^{k-1}\left(F(x^{n+1})-F_{\mathrm{opt}}\right)\le\frac{\alpha L_f}{2}\|x^*-x^0\|^2-\frac{\alpha L_f}{2}\|x^*-x^k\|^2\le\frac{\alpha L_f}{2}\|x^*-x^0\|^2.$$
By the monotonicity of $\{F(x^n)\}_{n\ge 0}$ (see Remark 10.20), we can conclude that
$$k\left(F(x^k)-F_{\mathrm{opt}}\right)\le\sum_{n=0}^{k-1}\left(F(x^{n+1})-F_{\mathrm{opt}}\right)\le\frac{\alpha L_f}{2}\|x^*-x^0\|^2.$$
Consequently,
$$F(x^k)-F_{\mathrm{opt}}\le\frac{\alpha L_f\|x^*-x^0\|^2}{2k}.$$
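The $O(1/k)$ bound of Theorem 10.21 can be checked numerically. The random least-squares instance below is an assumption for the demo (the optimal value and solution are approximated by running the method to high accuracy, which only makes the checked gap smaller, so the check remains conservative).

```python
import numpy as np

# Numerical sanity check (not a proof) of the O(1/k) bound of Theorem 10.21 on
# a small l1-regularized least-squares instance, constant stepsize 1/L_f (alpha = 1).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
L_f = np.linalg.norm(A.T @ A, 2)

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def prox_grad_step(x):
    v = x - A.T @ (A @ x - b) / L_f
    return np.sign(v) * np.maximum(np.abs(v) - lam / L_f, 0.0)   # ISTA step

x0 = np.zeros(5)
x, vals = x0.copy(), []
for _ in range(2000):
    x = prox_grad_step(x)
    vals.append(F(x))
F_opt, x_star = vals[-1], x   # high-accuracy surrogates for the optimum

# F(x^k) - F_opt should stay below L_f*||x0 - x*||^2 / (2k) for every k.
bound_holds = all(
    vals[k - 1] - F_opt <= L_f * np.sum((x0 - x_star) ** 2) / (2 * k) + 1e-9
    for k in range(1, 101)
)
```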

Remark 10.22. Note that we did not utilize in the proof of Theorem 10.21 the fact that procedure B2 produces a nondecreasing sequence of constants $\{L_k\}_{k\ge 0}$. This implies in particular that the monotonicity of this sequence of constants is not essential, and we can actually prove the same convergence rate for any backtracking procedure that guarantees the validity of condition (10.23) and the bound $L_k\le\alpha L_f$.

We can also prove that the generated sequence is Fejér monotone, from which convergence of the sequence to an optimal solution readily follows.

Theorem 10.23 (Fejér monotonicity of the sequence generated by the proximal gradient method). Suppose that Assumption 10.1 holds and that in addition $f$ is convex. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$ or the backtracking procedure B2. Then for any $x^*\in X^*$ and $k\ge 0$,
$$\|x^{k+1}-x^*\|\le\|x^k-x^*\|.\tag{10.27}$$

Proof. We will repeat some of the arguments used in the proof of Theorem 10.21. Substituting $L=L_k$, $x=x^*$, and $y=x^k$ in the fundamental prox-grad inequality (10.21) and taking into account the fact that in both stepsize rules condition (10.20) is satisfied, we obtain
$$\frac{2}{L_k}\left(F(x^*)-F(x^{k+1})\right)\ge\|x^*-x^{k+1}\|^2-\|x^*-x^k\|^2+\frac{2}{L_k}\ell_f(x^*,x^k)\ge\|x^*-x^{k+1}\|^2-\|x^*-x^k\|^2,$$
where the convexity of $f$ was used in the last inequality. The result (10.27) now follows by the inequality $F(x^*)-F(x^{k+1})\le 0$.

Thanks to the Fejér monotonicity property, we can now establish the convergence of the sequence generated by the proximal gradient method.

Theorem 10.24 (convergence of the sequence generated by the proximal gradient method). Suppose that Assumption 10.1 holds and that in addition $f$ is convex. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$ or the backtracking procedure B2. Then the sequence $\{x^k\}_{k\ge 0}$ converges to an optimal solution of problem (10.1).

Proof. By Theorem 10.23, the sequence is Fejér monotone w.r.t. $X^*$. Therefore, by Theorem 8.16, to show convergence to a point in $X^*$, it is enough to show that any limit point of the sequence $\{x^k\}_{k\ge 0}$ is necessarily in $X^*$. Let then $\tilde x$ be a limit point of the sequence. Then there exists a subsequence $\{x^{k_j}\}_{j\ge 0}$ converging to $\tilde x$. By Theorem 10.21,
$$F(x^{k_j})\to F_{\mathrm{opt}}\ \text{as}\ j\to\infty.\tag{10.28}$$
Since $F$ is closed, it is also lower semicontinuous, and hence $F(\tilde x)\le\lim_{j\to\infty}F(x^{k_j})=F_{\mathrm{opt}}$, implying that $\tilde x\in X^*$.
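Fejér monotonicity is easy to observe numerically on an instance where $x^*$ is known in closed form. The separable problem below is an assumption for the demo: with $f(x)=\frac{1}{2}\langle x,Dx\rangle-\langle c,x\rangle$ and $g=\lambda\|\cdot\|_1$, each coordinate decouples and $x^*_i=\mathcal{T}_\lambda(c_i)/d_i$.

```python
import numpy as np

# Check that the distances ||x^k - x*|| never increase (Fejer monotonicity)
# and that the iterates converge to x*.
d = np.array([1.0, 4.0])
c = np.array([3.0, -8.0])
lam = 1.0
L_f = 4.0                                          # max d_i
soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
x_star = soft(c, lam) / d                          # closed form: (2.0, -1.75)

x = np.zeros(2)
dists = [np.linalg.norm(x - x_star)]
for _ in range(100):
    x = soft(x - (d * x - c) / L_f, lam / L_f)     # prox-grad step, L_k = L_f
    dists.append(np.linalg.norm(x - x_star))
# dists is nonincreasing and x ends up (numerically) at x_star.
```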

To derive a complexity result for the proximal gradient method, we will assume that $\|x^0-x^*\|\le R$ for some $x^*\in X^*$ and some constant $R>0$; for example, if $\mathrm{dom}(g)$ is bounded, then $R$ might be taken as its diameter. By inequality (10.26) it follows that in order to obtain an $\varepsilon$-optimal solution of problem (10.1), it is enough to require that
$$\frac{\alpha L_f R^2}{2k}\le\varepsilon,$$
which is the same as
$$k\ge\frac{\alpha L_f R^2}{2\varepsilon}.$$
Thus, to obtain an $\varepsilon$-optimal solution, an order of $\frac{1}{\varepsilon}$ iterations is required, which is an improvement of the result for the projected subgradient method in which an order of $\frac{1}{\varepsilon^2}$ iterations is needed (see, for example, Theorem 8.18). We summarize the above observations in the following theorem.

Theorem 10.25 (complexity of the proximal gradient method). Under the setting of Theorem 10.21, for any $k$ satisfying
$$k\ge\frac{\alpha L_f R^2}{2\varepsilon},$$
it holds that $F(x^k)-F_{\mathrm{opt}}\le\varepsilon$, where $R$ is an upper bound on $\|x^*-x^0\|$ for some $x^*\in X^*$.

In the nonconvex case (meaning when $f$ is not necessarily convex), an $O(1/\sqrt{k})$ rate of convergence of the norm of the gradient mapping was established in Theorem 10.15(c). We will now show that with the additional convexity assumption on $f$, this rate can be improved to $O(1/k)$.

Theorem 10.26 ($O(1/k)$ rate of convergence of the minimal norm of the gradient mapping). Suppose that Assumption 10.1 holds and that in addition $f$ is convex. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$ or the backtracking procedure B2. Then for any $x^*\in X^*$ and $k\ge 1$,
$$\min_{n=0,1,\ldots,k}\|G_{\alpha L_f}(x^n)\|\le\frac{2\alpha^{1.5}L_f\|x^0-x^*\|}{\sqrt{\beta}\,k},\tag{10.29}$$
where $\alpha=\beta=1$ in the constant stepsize setting and $\alpha=\max\left\{\eta,\frac{s}{L_f}\right\}$, $\beta=\frac{s}{L_f}$ if the backtracking rule is employed.

Proof. By the sufficient decrease lemma (Corollary 10.18), for any $n\ge 0$,
$$F(x^n)-F(x^{n+1})=F(x^n)-F(T_{L_n}(x^n))\ge\frac{1}{2L_n}\|G_{L_n}(x^n)\|^2.\tag{10.30}$$
By Theorem 10.9 and the fact that $\beta L_f\le L_n\le\alpha L_f$ (see Remark 10.19), it follows that
$$\frac{1}{2L_n}\|G_{L_n}(x^n)\|^2=\frac{L_n}{2}\left(\frac{\|G_{L_n}(x^n)\|}{L_n}\right)^2\ge\frac{\beta L_f}{2}\left(\frac{\|G_{\alpha L_f}(x^n)\|}{\alpha L_f}\right)^2=\frac{\beta}{2\alpha^2 L_f}\|G_{\alpha L_f}(x^n)\|^2.\tag{10.31}$$

Therefore, combining (10.30) and (10.31),
$$F(x^n)-F_{\mathrm{opt}}\ge F(x^{n+1})-F_{\mathrm{opt}}+\frac{\beta}{2\alpha^2 L_f}\|G_{\alpha L_f}(x^n)\|^2.\tag{10.32}$$
Let $p$ be a positive integer. Summing (10.32) over $n=p,p+1,\ldots,2p-1$ yields
$$F(x^p)-F_{\mathrm{opt}}\ge F(x^{2p})-F_{\mathrm{opt}}+\frac{\beta}{2\alpha^2 L_f}\sum_{n=p}^{2p-1}\|G_{\alpha L_f}(x^n)\|^2.\tag{10.33}$$
By Theorem 10.21,
$$F(x^p)-F_{\mathrm{opt}}\le\frac{\alpha L_f\|x^0-x^*\|^2}{2p},$$
which, combined with the fact that $F(x^{2p})-F_{\mathrm{opt}}\ge 0$ and (10.33), implies
$$\frac{\beta p}{2\alpha^2 L_f}\min_{n=0,1,\ldots,2p-1}\|G_{\alpha L_f}(x^n)\|^2\le\frac{\beta}{2\alpha^2 L_f}\sum_{n=p}^{2p-1}\|G_{\alpha L_f}(x^n)\|^2\le\frac{\alpha L_f\|x^0-x^*\|^2}{2p}.$$
Thus,
$$\min_{n=0,1,\ldots,2p-1}\|G_{\alpha L_f}(x^n)\|^2\le\frac{\alpha^3 L_f^2\|x^0-x^*\|^2}{\beta p^2}\tag{10.34}$$
and also
$$\min_{n=0,1,\ldots,2p}\|G_{\alpha L_f}(x^n)\|^2\le\frac{\alpha^3 L_f^2\|x^0-x^*\|^2}{\beta p(p+1)}.\tag{10.35}$$
We conclude (treating odd and even $k$ via (10.34) and (10.35), respectively) that for any $k\ge 1$,
$$\min_{n=0,1,\ldots,k}\|G_{\alpha L_f}(x^n)\|^2\le\frac{4\alpha^3 L_f^2\|x^0-x^*\|^2}{\beta k^2},$$
and (10.29) follows by taking square roots.

When we assume further that $f$ is $L_f$-smooth over the entire space $\mathbb{E}$, we can use Lemma 10.12 to obtain an improved result in the case of a constant stepsize.

Theorem 10.27 ($O(1/k)$ rate of convergence of the norm of the gradient mapping under the constant stepsize rule). Suppose that Assumption 10.1 holds and that in addition $f$ is convex and $L_f$-smooth over $\mathbb{E}$. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$. Then for any $x^*\in X^*$ and $k\ge 0$,

(a) $\|G_{L_f}(x^{k+1})\|\le\|G_{L_f}(x^k)\|$;

(b) $\|G_{L_f}(x^k)\|\le\frac{2L_f\|x^0-x^*\|}{k+1}$.

Proof. Invoking Lemma 10.12 with $x=x^k$, we obtain (a). Part (b) now follows by substituting $\alpha=\beta=1$ in the result of Theorem 10.26 and noting that by part (a), $\|G_{L_f}(x^k)\|=\min_{n=0,1,\ldots,k}\|G_{L_f}(x^n)\|$.
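Part (a) of Theorem 10.27 can be watched directly: along the iterates, the norms $\|G_{L_f}(x^k)\|=L_f\|x^k-T_{L_f}(x^k)\|=L_f\|x^k-x^{k+1}\|$ never increase. The separable instance below is an assumed demo problem, not from the text.

```python
import numpy as np

# Observe the monotone decay of ||G_{L_f}(x^k)|| along proximal gradient
# iterations on f(x) = 0.5*<x, Dx> - <c, x>, g = lam*||x||_1.
d = np.array([1.0, 4.0])
c = np.array([3.0, -8.0])
lam, L_f = 1.0, 4.0
soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
step = lambda x: soft(x - (d * x - c) / L_f, lam / L_f)    # T_{L_f}(x)

x = np.zeros(2)
g_norms = []
for _ in range(60):
    x_next = step(x)
    g_norms.append(L_f * np.linalg.norm(x - x_next))       # ||G_{L_f}(x^k)||
    x = x_next
# g_norms is nonincreasing and tends to zero.
```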

10.5 The Proximal Point Method

Consider the problem

$$\min_{x\in\mathbb{E}} g(x), \qquad(10.36)$$

where $g:\mathbb{E}\to(-\infty,\infty]$ is a proper closed and convex function. Problem (10.36) is actually a special case of the composite problem (10.1) with $f\equiv0$. The update step of the proximal gradient method in this case takes the form $x^{k+1}=\operatorname{prox}_{\frac{1}{L}g}(x^k)$. Taking $L=\frac{1}{c}$ for some $c>0$, we obtain the proximal point method.

The Proximal Point Method

Initialization: pick $x^0\in\mathbb{E}$ and $c>0$.
General step ($k\ge0$): $x^{k+1}=\operatorname{prox}_{cg}(x^k)$.

The proximal point method is actually not a practical algorithm, since the general step asks to minimize the function $g(x)+\frac{1}{2c}\|x-x^k\|^2$, which in general is as hard to accomplish as solving the original problem of minimizing $g$. Since the proximal point method is a special case of the proximal gradient method, we can deduce its main convergence results from the corresponding results on the proximal gradient method. Specifically, since the smooth part $f\equiv0$ is $0$-smooth, we can take any constant stepsize to guarantee convergence, and Theorems 10.21 and 10.24 imply the following result.

Theorem 10.28 (convergence of the proximal point method). Let $g:\mathbb{E}\to(-\infty,\infty]$ be a proper closed and convex function. Assume that the problem

$$\min_{x\in\mathbb{E}} g(x)$$

has a nonempty optimal set $X^*$, and let the optimal value be given by $g_{\rm opt}$. Let $\{x^k\}_{k\ge0}$ be the sequence generated by the proximal point method with parameter $c>0$. Then

(a) $g(x^k)-g_{\rm opt} \le \frac{\|x^0-x^*\|^2}{2ck}$ for any $x^*\in X^*$ and $k\ge1$;

(b) the sequence $\{x^k\}_{k\ge0}$ converges to some point in $X^*$.

10.6 Convergence of the Proximal Gradient Method: The Strongly Convex Case

In the case where $f$ is assumed to be $\sigma$-strongly convex for some $\sigma>0$, the sublinear rate of convergence can be improved into a linear rate of convergence, meaning a rate of the form $O(q^k)$ for some $q\in(0,1)$. Throughout the analysis of the strongly convex case we denote the unique optimal solution of problem (10.1) by $x^*$.
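As a concrete toy instance (my own illustration, not from the book), take $g(x)=\|x\|_1$, whose prox is the soft-thresholding operator. Each proximal point step then shrinks every component toward zero by $c$, so the iterates reach the minimizer $x^*=0$ after finitely many steps:

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# proximal point method on g(x) = ||x||_1, whose minimizer is x* = 0;
# prox_{c g} is soft thresholding with parameter c (data below are illustrative)
c = 0.1
x = np.array([1.0, -0.35, 2.4])
for _ in range(50):
    x = soft_threshold(x, c)   # x^{k+1} = prox_{cg}(x^k)

assert np.allclose(x, 0.0)     # g(x^k) -> g_opt = 0, consistent with Theorem 10.28
```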

Theorem 10.29 (linear rate of convergence of the proximal gradient method: the strongly convex case). Suppose that Assumption 10.1 holds and that in addition $f$ is $\sigma$-strongly convex ($\sigma>0$). Let $\{x^k\}_{k\ge0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge0$ or the backtracking procedure B2. Let

$$\alpha=\begin{cases}1, & \text{constant stepsize},\\[2pt] \max\left\{\eta,\frac{s}{L_f}\right\}, & \text{backtracking}.\end{cases}$$

Then for any $k\ge0$,

(a) $\|x^{k+1}-x^*\|^2 \le \left(1-\frac{\sigma}{\alpha L_f}\right)\|x^k-x^*\|^2$;

(b) $\|x^k-x^*\|^2 \le \left(1-\frac{\sigma}{\alpha L_f}\right)^k\|x^0-x^*\|^2$;

(c) $F(x^{k+1})-F_{\rm opt} \le \frac{\alpha L_f}{2}\left(1-\frac{\sigma}{\alpha L_f}\right)^{k+1}\|x^0-x^*\|^2$.

Proof. Plugging $L=L_k$, $x=x^*$, and $y=x^k$ into the fundamental prox-grad inequality (10.21) and taking into account the fact that in both stepsize rules condition (10.20) is satisfied, we obtain

$$F(x^*)-F(x^{k+1}) \ge \frac{L_k}{2}\|x^*-x^{k+1}\|^2 - \frac{L_k}{2}\|x^*-x^k\|^2 + \ell_f(x^*,x^k).$$

Since $f$ is $\sigma$-strongly convex, it follows by Theorem 5.24(ii) that

$$\ell_f(x^*,x^k) = f(x^*)-f(x^k)-\langle\nabla f(x^k),x^*-x^k\rangle \ge \frac{\sigma}{2}\|x^*-x^k\|^2.$$

Thus,

$$F(x^*)-F(x^{k+1}) \ge \frac{L_k}{2}\|x^*-x^{k+1}\|^2 - \frac{L_k-\sigma}{2}\|x^*-x^k\|^2. \qquad(10.37)$$

Since $x^*$ is a minimizer of $F$, $F(x^*)-F(x^{k+1})\le0$, and hence, by (10.37) and the fact that $L_k\le\alpha L_f$ (see Remark 10.19),

$$\|x^{k+1}-x^*\|^2 \le \left(1-\frac{\sigma}{L_k}\right)\|x^k-x^*\|^2 \le \left(1-\frac{\sigma}{\alpha L_f}\right)\|x^k-x^*\|^2,$$

establishing part (a). Part (b) follows immediately by (a). To prove (c), note that by (10.37),

$$F(x^{k+1})-F_{\rm opt} \le \frac{L_k-\sigma}{2}\|x^k-x^*\|^2 - \frac{L_k}{2}\|x^{k+1}-x^*\|^2 \le \frac{\alpha L_f}{2}\left(1-\frac{\sigma}{\alpha L_f}\right)\|x^k-x^*\|^2 \le \frac{\alpha L_f}{2}\left(1-\frac{\sigma}{\alpha L_f}\right)^{k+1}\|x^0-x^*\|^2,$$

where part (b) was used in the last inequality.
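The contraction in part (a) can be observed numerically. In the sketch below (synthetic data of my own choosing, not the book's example), $f(x)=\frac12\|Ax-b\|^2$ with a tall full-column-rank $A$, so $f$ is $\sigma$-strongly convex with $\sigma=\lambda_{\min}(A^TA)$, and each proximal gradient step with constant stepsize $1/L_f$ contracts the distance to the (numerically approximated) unique minimizer by at least $\sqrt{1-\sigma/L_f}$:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((80, 20))      # tall matrix -> f strongly convex (illustrative data)
b = rng.standard_normal(80)
lam = 0.5
H = A.T @ A
Lf = np.linalg.norm(H, 2)              # = lambda_max(H), smoothness constant
sigma = np.linalg.eigvalsh(H)[0]       # = lambda_min(H) > 0, strong convexity constant

def step(x):
    # proximal gradient step on f + lam*||.||_1 with stepsize 1/Lf
    return soft_threshold(x - A.T @ (A @ x - b) / Lf, lam / Lf)

# approximate the unique minimizer x* by running the (linearly convergent) method long
xstar = np.zeros(20)
for _ in range(5000):
    xstar = step(xstar)

# verify the contraction ||x^{k+1}-x*|| <= sqrt(1 - sigma/Lf) * ||x^k-x*||
q = np.sqrt(1.0 - sigma / Lf)
x = np.ones(20)
for _ in range(30):
    d_old = np.linalg.norm(x - xstar)
    x = step(x)
    assert np.linalg.norm(x - xstar) <= q * d_old + 1e-7
```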

Theorem 10.29 immediately implies that in the strongly convex case, the proximal gradient method requires an order of $\log\left(\frac{1}{\varepsilon}\right)$ iterations to obtain an $\varepsilon$-optimal solution.

Theorem 10.30 (complexity of the proximal gradient method: the strongly convex case). Under the setting of Theorem 10.29, for any $k\ge1$ satisfying

$$k \ge \alpha\kappa\log\left(\frac{1}{\varepsilon}\right)+\alpha\kappa\log\left(\frac{\alpha L_f R^2}{2}\right),$$

it holds that $F(x^k)-F_{\rm opt}\le\varepsilon$, where $R$ is an upper bound on $\|x^0-x^*\|$ and $\kappa=\frac{L_f}{\sigma}$.

Proof. Let $k\ge1$. By Theorem 10.29 and the definition of $\kappa$, a sufficient condition for the inequality $F(x^k)-F_{\rm opt}\le\varepsilon$ to hold is that

$$\frac{\alpha L_f}{2}\left(1-\frac{1}{\alpha\kappa}\right)^k R^2 \le \varepsilon,$$

which is the same as

$$-k\log\left(1-\frac{1}{\alpha\kappa}\right) \ge \log\left(\frac{1}{\varepsilon}\right)+\log\left(\frac{\alpha L_f R^2}{2}\right). \qquad(10.38)$$

Since $\log(1-x)\le -x$ for any$^{57}$ $x\le1$, it follows that a sufficient condition for (10.38) to hold is that

$$k\cdot\frac{1}{\alpha\kappa} \ge \log\left(\frac{1}{\varepsilon}\right)+\log\left(\frac{\alpha L_f R^2}{2}\right),$$

namely, that

$$k \ge \alpha\kappa\log\left(\frac{1}{\varepsilon}\right)+\alpha\kappa\log\left(\frac{\alpha L_f R^2}{2}\right).$$

10.7 The Fast Proximal Gradient Method (FISTA)

10.7.1 The Method

The proximal gradient method achieves an $O(1/k)$ rate of convergence in function values to the optimal value. In this section we will show how to accelerate the method in order to obtain a rate of $O(1/k^2)$ in function values. The method is known as "the fast proximal gradient method," but we will also refer to it as "FISTA," which is an acronym for "fast iterative shrinkage-thresholding algorithm"; see Example 10.37 for further explanations. The method was devised and analyzed by Beck and Teboulle in the paper [18], from which the convergence analysis is taken. We will assume that $f$ is convex and that it is $L_f$-smooth, meaning that it is $L_f$-smooth over the entire space $\mathbb{E}$. We gather all the required properties in the following assumption.

$^{57}$The inequality also holds for $x=1$ since in that case the left-hand side is $-\infty$.
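The two complexity bounds are easy to compare numerically. The helper functions below are illustrative (the names and the sample constants are my own choices); they evaluate the iteration counts prescribed by Theorem 10.25 and Theorem 10.30 for given $L_f$, $\sigma$, $R$, and $\varepsilon$, showing how much milder the $O(\log(1/\varepsilon))$ dependence is:

```python
import math

def pg_iters(Lf, R, eps, alpha=1.0):
    # O(1/eps) bound of Theorem 10.25: k >= alpha*Lf*R^2 / (2*eps)
    return math.ceil(alpha * Lf * R**2 / (2 * eps))

def pg_iters_strongly_convex(Lf, sigma, R, eps, alpha=1.0):
    # O(log(1/eps)) bound of Theorem 10.30 with condition number kappa = Lf/sigma
    kappa = Lf / sigma
    return math.ceil(alpha * kappa * (math.log(1 / eps) + math.log(alpha * Lf * R**2 / 2)))

# illustrative numbers: Lf = 100, R = 10, eps = 1e-4, sigma = 1
print(pg_iters(100, 10, 1e-4))                     # 50000000
print(pg_iters_strongly_convex(100, 1, 10, 1e-4))  # 1773
```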

Assumption 10.31.

(A) $g:\mathbb{E}\to(-\infty,\infty]$ is proper closed and convex.

(B) $f:\mathbb{E}\to\mathbb{R}$ is $L_f$-smooth and convex.

(C) The optimal set of problem (10.1) is nonempty and denoted by $X^*$. The optimal value of the problem is denoted by $F_{\rm opt}$.

The description of FISTA now follows.

FISTA

Input: $(f,g,x^0)$, where $f$ and $g$ satisfy properties (A) and (B) in Assumption 10.31 and $x^0\in\mathbb{E}$.
Initialization: set $y^0=x^0$ and $t_0=1$.
General step: for any $k=0,1,2,\ldots$ execute the following steps:

(a) pick $L_k>0$;

(b) set $x^{k+1}=\operatorname{prox}_{\frac{1}{L_k}g}\left(y^k-\frac{1}{L_k}\nabla f(y^k)\right)$;

(c) set $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;

(d) compute $y^{k+1}=x^{k+1}+\left(\frac{t_k-1}{t_{k+1}}\right)(x^{k+1}-x^k)$.

As usual, we will consider two options for the choice of $L_k$: constant and backtracking. The backtracking procedure for choosing the stepsize is referred to as "backtracking procedure B3" and is identical to procedure B2 with the sole difference that it is invoked on the vector $y^k$ rather than on $x^k$.

Constant. $L_k\equiv L_f$ for all $k$.

Backtracking procedure B3. The procedure requires two parameters $(s,\eta)$, where $s>0$ and $\eta>1$. Define $L_{-1}=s$. At iteration $k$ ($k\ge0$) the choice of $L_k$ is done as follows: First, $L_k$ is set to be equal to $L_{k-1}$. Then, while (recall that $T_L(y)\equiv T_L^{f,g}(y)=\operatorname{prox}_{\frac{1}{L}g}\left(y-\frac{1}{L}\nabla f(y)\right)$)

$$f(T_{L_k}(y^k)) > f(y^k)+\langle\nabla f(y^k),T_{L_k}(y^k)-y^k\rangle+\frac{L_k}{2}\|T_{L_k}(y^k)-y^k\|^2,$$

we set $L_k:=\eta L_k$. In other words, the stepsize is chosen as $L_k=L_{k-1}\eta^{i_k}$, where $i_k$ is the smallest nonnegative integer for which the condition

$$f(T_{L_{k-1}\eta^{i_k}}(y^k)) \le f(y^k)+\langle\nabla f(y^k),T_{L_{k-1}\eta^{i_k}}(y^k)-y^k\rangle+\frac{L_{k-1}\eta^{i_k}}{2}\|T_{L_{k-1}\eta^{i_k}}(y^k)-y^k\|^2$$

is satisfied.
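A minimal sketch of FISTA with backtracking procedure B3, on an $\ell_1$-regularized least squares instance with made-up data (the problem, sizes, and parameter values are illustrative assumptions, not the book's): $L_k$ starts at $L_{k-1}$ and is multiplied by $\eta$ until the descent condition holds at $y^k$, so the accepted value never exceeds $\max\{s,\eta L_f\}$.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 50))
b = rng.standard_normal(30)
lam = 1.0
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
F = lambda x: f(x) + lam * np.sum(np.abs(x))

def T(L, y, g):
    # prox-grad operator T_L(y) = prox_{(1/L)g}(y - grad f(y)/L)
    return soft_threshold(y - g / L, lam / L)

s, eta = 1.0, 2.0
L = s                                   # L_{-1} = s
x = np.zeros(50)
y = x.copy()
t = 1.0
for _ in range(200):
    g = grad(y)
    # backtracking procedure B3: increase L until the descent condition holds at y^k
    while True:
        z = T(L, y, g)
        if f(z) <= f(y) + g @ (z - y) + (L / 2) * np.sum((z - y) ** 2):
            break
        L *= eta
    x_new = z
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x_new + ((t - 1) / t_new) * (x_new - x)
    x, t = x_new, t_new
```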

In both stepsize rules, the following inequality is satisfied for any $k\ge0$:

$$f(T_{L_k}(y^k)) \le f(y^k)+\langle\nabla f(y^k),T_{L_k}(y^k)-y^k\rangle+\frac{L_k}{2}\|T_{L_k}(y^k)-y^k\|^2. \qquad(10.39)$$

Remark 10.32. Since the backtracking procedure B3 is identical to the B2 procedure (only employed on $y^k$), the arguments of Remark 10.19 are still valid, and we have that

$$\beta L_f \le L_k \le \alpha L_f,$$

where $\alpha$ and $\beta$ are given in (10.25).

The next lemma shows an important lower bound on the sequence $\{t_k\}_{k\ge0}$ that will be used in the convergence proof.

Lemma 10.33. Let $\{t_k\}_{k\ge0}$ be the sequence defined by

$$t_0=1,\qquad t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2},\quad k\ge0.$$

Then $t_k\ge\frac{k+2}{2}$ for all $k\ge0$.

Proof. The proof is by induction on $k$. Obviously, for $k=0$, $t_0=1\ge\frac{0+2}{2}$. Suppose that the claim holds for $k$, meaning $t_k\ge\frac{k+2}{2}$. We will prove that $t_{k+1}\ge\frac{k+3}{2}$. By the recursive relation defining the sequence and the induction assumption,

$$t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2} \ge \frac{1+\sqrt{4t_k^2}}{2} = \frac{1+2t_k}{2} \ge \frac{1+(k+2)}{2} = \frac{k+3}{2}.$$

10.7.2 Convergence Analysis of FISTA

Theorem 10.34 ($O(1/k^2)$ rate of convergence of FISTA). Suppose that Assumption 10.31 holds. Let $\{x^k\}_{k\ge0}$ be the sequence generated by FISTA for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge0$ or the backtracking procedure B3. Then for any $x^*\in X^*$ and $k\ge1$,

$$F(x^k)-F_{\rm opt} \le \frac{2\alpha L_f\|x^0-x^*\|^2}{(k+1)^2},$$

where $\alpha=1$ in the constant stepsize setting and $\alpha=\max\left\{\eta,\frac{s}{L_f}\right\}$ if the backtracking rule is employed.

Proof. Let $k\ge1$. Substituting $x=t_k^{-1}x^*+(1-t_k^{-1})x^k$, $y=y^k$, and $L=L_k$ in the fundamental prox-grad inequality (10.21), taking into account that inequality

(10.39) is satisfied and that $f$ is convex, we obtain that

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k)-F(x^{k+1}) \ge \frac{L_k}{2}\|x^{k+1}-t_k^{-1}x^*-(1-t_k^{-1})x^k\|^2 - \frac{L_k}{2}\|y^k-t_k^{-1}x^*-(1-t_k^{-1})x^k\|^2$$
$$= \frac{L_k}{2t_k^2}\|t_kx^{k+1}-(x^*+(t_k-1)x^k)\|^2 - \frac{L_k}{2t_k^2}\|t_ky^k-(x^*+(t_k-1)x^k)\|^2. \qquad(10.40)$$

By the convexity of $F$,

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k) \le t_k^{-1}F(x^*)+(1-t_k^{-1})F(x^k).$$

Therefore, using the notation $v_n\equiv F(x^n)-F_{\rm opt}$ for any $n\ge0$,

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k)-F(x^{k+1}) \le (1-t_k^{-1})(F(x^k)-F(x^*))-(F(x^{k+1})-F(x^*)) = (1-t_k^{-1})v_k-v_{k+1}. \qquad(10.41)$$

On the other hand, using the relation $y^k=x^k+\left(\frac{t_{k-1}-1}{t_k}\right)(x^k-x^{k-1})$,

$$t_ky^k-(x^*+(t_k-1)x^k) = t_kx^k+(t_{k-1}-1)(x^k-x^{k-1})-(x^*+(t_k-1)x^k) = t_{k-1}x^k-(x^*+(t_{k-1}-1)x^{k-1}). \qquad(10.42)$$

Combining (10.40), (10.41), and (10.42), we obtain that

$$(t_k^2-t_k)v_k - t_k^2v_{k+1} \ge \frac{L_k}{2}\|u^{k+1}\|^2-\frac{L_k}{2}\|u^k\|^2,$$

where we use the notation $u^n\equiv t_{n-1}x^n-(x^*+(t_{n-1}-1)x^{n-1})$ for any $n\ge1$. By the update rule of $t_{k+1}$, we have $t_k^2-t_k=t_{k-1}^2$, and hence

$$\frac{2}{L_k}t_{k-1}^2v_k - \frac{2}{L_k}t_k^2v_{k+1} \ge \|u^{k+1}\|^2-\|u^k\|^2.$$

Since $L_k\ge L_{k-1}$, we can conclude that

$$\frac{2}{L_{k-1}}t_{k-1}^2v_k - \frac{2}{L_k}t_k^2v_{k+1} \ge \|u^{k+1}\|^2-\|u^k\|^2.$$

Thus,

$$\|u^{k+1}\|^2+\frac{2}{L_k}t_k^2v_{k+1} \le \|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k,$$

and hence, for any $k\ge1$,

$$\|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|u^1\|^2+\frac{2}{L_0}t_0^2v_1 = \|x^1-x^*\|^2+\frac{2}{L_0}(F(x^1)-F_{\rm opt}). \qquad(10.43)$$

Substituting $x=x^*$, $y=y^0$, and $L=L_0$ in the fundamental prox-grad inequality (10.21), taking into account the convexity of $f$, yields

$$\frac{2}{L_0}(F(x^*)-F(x^1)) \ge \|x^1-x^*\|^2-\|y^0-x^*\|^2,$$

which, along with the fact that $y^0=x^0$, implies the bound

$$\|x^1-x^*\|^2+\frac{2}{L_0}(F(x^1)-F_{\rm opt}) \le \|x^0-x^*\|^2.$$

Combining the last inequality with (10.43), we get

$$\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|x^0-x^*\|^2.$$

Thus, using the bound $L_{k-1}\le\alpha L_f$, the definition of $v_k$, and Lemma 10.33,

$$F(x^k)-F_{\rm opt} \le \frac{L_{k-1}\|x^0-x^*\|^2}{2t_{k-1}^2} \le \frac{2\alpha L_f\|x^0-x^*\|^2}{(k+1)^2}.$$

Remark 10.35 (alternative choice for $t_k$). A close inspection of the proof of Theorem 10.34 reveals that the result is correct if $\{t_k\}_{k\ge0}$ is any sequence satisfying the following two properties for any $k\ge0$: (a) $t_k\ge\frac{k+2}{2}$; (b) $t_{k+1}^2-t_{k+1}\le t_k^2$. The choice $t_k=\frac{k+2}{2}$ also satisfies these two properties. The validity of (a) is obvious; to show (b), note that

$$t_{k+1}^2-t_{k+1} = t_{k+1}(t_{k+1}-1) = \frac{k+3}{2}\cdot\frac{k+1}{2} = \frac{k^2+4k+3}{4} \le \frac{k^2+4k+4}{4} = \left(\frac{k+2}{2}\right)^2 = t_k^2.$$

Remark 10.36. Note that FISTA has an $O(1/k^2)$ rate of convergence in function values, while the proximal gradient method has an $O(1/k)$ rate of convergence. This improvement was achieved despite the fact that the dominant computational steps at each iteration of both methods are essentially the same: one gradient evaluation and one prox computation.

10.7.3 Examples

Example 10.37. Consider the following model, which was already discussed in Example 10.2:

$$\min_{x\in\mathbb{R}^n} f(x)+\lambda\|x\|_1,$$

where $\lambda>0$ and $f:\mathbb{R}^n\to\mathbb{R}$ is assumed to be convex and $L_f$-smooth. The update formula of the proximal gradient method with constant stepsize $\frac{1}{L_f}$ has the form

$$x^{k+1} = \mathcal{T}_{\frac{\lambda}{L_f}}\left(x^k-\frac{1}{L_f}\nabla f(x^k)\right).$$

As was already noted in Example 10.3, since at each iteration one shrinkage/soft-thresholding operation is performed, this method is also known as the iterative shrinkage-thresholding algorithm (ISTA). The general update step of the accelerated proximal gradient method discussed in this section takes the following form:

(a) set $x^{k+1}=\mathcal{T}_{\frac{\lambda}{L_f}}\left(y^k-\frac{1}{L_f}\nabla f(y^k)\right)$;

(b) set $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;

(c) compute $y^{k+1}=x^{k+1}+\left(\frac{t_k-1}{t_{k+1}}\right)(x^{k+1}-x^k)$.

The above scheme truly deserves to be called "fast iterative shrinkage-thresholding algorithm" (FISTA), since it is an accelerated method that performs a thresholding step at each iteration. In this book we adopt the convention and use the acronym FISTA as the name of the fast proximal gradient method for a general nonsmooth part $g$.

Example 10.38 ($\ell_1$-regularized least squares). As a special instance of Example 10.37, consider the problem

$$\min_{x\in\mathbb{R}^n} \frac{1}{2}\|Ax-b\|_2^2+\lambda\|x\|_1, \qquad(10.44)$$

where $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^m$, and $\lambda>0$. The problem fits model (10.1) with $f(x)=\frac{1}{2}\|Ax-b\|_2^2$ and $g(x)=\lambda\|x\|_1$. The function $f$ is $L_f$-smooth with $L_f=\|A^TA\|_{2,2}=\lambda_{\max}(A^TA)$ (see Example 5.2). The update step of FISTA has the following form:

(a) set $x^{k+1}=\mathcal{T}_{\frac{\lambda}{L_k}}\left(y^k-\frac{1}{L_k}A^T(Ay^k-b)\right)$;

(b) set $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;

(c) compute $y^{k+1}=x^{k+1}+\left(\frac{t_k-1}{t_{k+1}}\right)(x^{k+1}-x^k)$.

The update step of the proximal gradient method, which in this case is the same as ISTA, is

$$x^{k+1}=\mathcal{T}_{\frac{\lambda}{L_k}}\left(x^k-\frac{1}{L_k}A^T(Ax^k-b)\right).$$

The stepsizes in both methods can be chosen to be the constant $L_k\equiv\lambda_{\max}(A^TA)$. To illustrate the difference in the actual performance of ISTA and FISTA, we generated an instance of the problem with $\lambda=1$. The components of $A$ were independently generated using a standard normal distribution. The true vector is $x_{\rm true}=e_3-e_7$, and $b$ was chosen as $b=Ax_{\rm true}$. We ran 200 iterations of ISTA and FISTA in order to solve problem (10.44) with initial vector $x^0=e$, the vector of all ones. It is well known that the $\ell_1$-norm element in the objective function is a regularizer that promotes sparsity, and we thus expect that the optimal solution of (10.44) will be close to the true sparse vector $x_{\rm true}$. The distances to optimality in terms of function values of the sequences generated by the two methods as a function of the iteration index are plotted in Figure 10.1, where it is apparent that FISTA is far superior to ISTA. In Figure 10.2 we plot the vectors that were obtained by the two methods. Obviously, the solution produced by 200 iterations of FISTA is much closer to the optimal solution (which is very close to $e_3-e_7$) than the solution obtained after 200 iterations of ISTA.
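The experiment can be reproduced in spirit with a few lines of NumPy. The sketch below uses synthetic stand-in data (the matrix sizes and the random seed are my own choices; the book's exact instance is not reproduced) and compares the objective values reached by ISTA and FISTA after the same budget of 200 iterations:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# synthetic stand-in for the experiment in Example 10.38 (sizes are illustrative)
rng = np.random.default_rng(3)
m, n = 100, 110
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[2], x_true[6] = 1.0, -1.0        # e_3 - e_7 in 1-based indexing
b = A @ x_true
lam = 1.0
L = np.linalg.norm(A.T @ A, 2)          # constant stepsize parameter lambda_max(A^T A)
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def ista(x0, iters):
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

def fista(x0, iters):
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

x0 = np.ones(n)                          # the vector of all ones, as in the text
xi, xf = ista(x0, 200), fista(x0, 200)
```

After the same iteration budget, `F(xf)` (FISTA) is expected to be far closer to the optimal value than `F(xi)` (ISTA), mirroring Figure 10.1.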

Figure 10.1. Results of 200 iterations of ISTA and FISTA on an $\ell_1$-regularized least squares problem.

Figure 10.2. Solutions obtained by ISTA (left) and FISTA (right).

10.7.4 MFISTA

FISTA is not a monotone method, meaning that the sequence of function values it produces is not necessarily nonincreasing. It is possible to define a monotone version of FISTA, which we call MFISTA,$^{58}$ and which is a descent method that at the same time preserves the same rate of convergence as FISTA.

$^{58}$MFISTA and its convergence analysis are from the work of Beck and Teboulle [17].

MFISTA

Input: $(f,g,x^0)$, where $f$ and $g$ satisfy properties (A) and (B) in Assumption 10.31 and $x^0\in\mathbb{E}$.
Initialization: set $y^0=x^0$ and $t_0=1$.
General step: for any $k=0,1,2,\ldots$ execute the following steps:

(a) pick $L_k>0$;

(b) set $z^k=\operatorname{prox}_{\frac{1}{L_k}g}\left(y^k-\frac{1}{L_k}\nabla f(y^k)\right)$;

(c) choose $x^{k+1}\in\mathbb{E}$ such that $F(x^{k+1})\le\min\{F(z^k),F(x^k)\}$;

(d) set $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;

(e) compute $y^{k+1}=x^{k+1}+\left(\frac{t_k}{t_{k+1}}\right)(z^k-x^{k+1})+\left(\frac{t_k-1}{t_{k+1}}\right)(x^{k+1}-x^k)$.

Remark 10.39. The choice $x^{k+1}\in\operatorname{argmin}\{F(x):x\in\{x^k,z^k\}\}$ is a very simple rule ensuring the condition $F(x^{k+1})\le\min\{F(z^k),F(x^k)\}$. We also note that the convergence established in Theorem 10.40 only requires the condition $F(x^{k+1})\le F(z^k)$.

The convergence result of MFISTA, whose proof is a minor adjustment of the proof of Theorem 10.34, is given below.

Theorem 10.40 ($O(1/k^2)$ rate of convergence of MFISTA). Suppose that Assumption 10.31 holds. Let $\{x^k\}_{k\ge0}$ be the sequence generated by MFISTA for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge0$ or the backtracking procedure B3. Then for any $x^*\in X^*$ and $k\ge1$,

$$F(x^k)-F_{\rm opt} \le \frac{2\alpha L_f\|x^0-x^*\|^2}{(k+1)^2},$$

where $\alpha=1$ in the constant stepsize setting and $\alpha=\max\left\{\eta,\frac{s}{L_f}\right\}$ if the backtracking rule is employed.

Proof. Let $k\ge1$. Substituting $x=t_k^{-1}x^*+(1-t_k^{-1})x^k$, $y=y^k$, and $L=L_k$ in the fundamental prox-grad inequality (10.21), taking into account that inequality (10.39) is satisfied and that $f$ is convex, we obtain that

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k)-F(z^k) \ge \frac{L_k}{2}\|z^k-t_k^{-1}x^*-(1-t_k^{-1})x^k\|^2 - \frac{L_k}{2}\|y^k-t_k^{-1}x^*-(1-t_k^{-1})x^k\|^2$$
$$= \frac{L_k}{2t_k^2}\|t_kz^k-(x^*+(t_k-1)x^k)\|^2 - \frac{L_k}{2t_k^2}\|t_ky^k-(x^*+(t_k-1)x^k)\|^2. \qquad(10.45)$$

By the convexity of $F$,

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k) \le t_k^{-1}F(x^*)+(1-t_k^{-1})F(x^k).$$

Therefore, using the notation $v_n\equiv F(x^n)-F_{\rm opt}$ for any $n\ge0$ and the fact that $F(x^{k+1})\le F(z^k)$, it follows that

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k)-F(z^k) \le (1-t_k^{-1})(F(x^k)-F(x^*))-(F(x^{k+1})-F(x^*)) = (1-t_k^{-1})v_k-v_{k+1}. \qquad(10.46)$$

On the other hand, using the relation $y^k=x^k+\left(\frac{t_{k-1}}{t_k}\right)(z^{k-1}-x^k)+\left(\frac{t_{k-1}-1}{t_k}\right)(x^k-x^{k-1})$, we have

$$t_ky^k-(x^*+(t_k-1)x^k) = t_{k-1}z^{k-1}-(x^*+(t_{k-1}-1)x^{k-1}). \qquad(10.47)$$

Combining (10.45), (10.46), and (10.47), we obtain that

$$(t_k^2-t_k)v_k - t_k^2v_{k+1} \ge \frac{L_k}{2}\|u^{k+1}\|^2-\frac{L_k}{2}\|u^k\|^2,$$

where we use the notation $u^n\equiv t_{n-1}z^{n-1}-(x^*+(t_{n-1}-1)x^{n-1})$ for any $n\ge1$. By the update rule of $t_{k+1}$, we have $t_k^2-t_k=t_{k-1}^2$, and hence

$$\frac{2}{L_k}t_{k-1}^2v_k-\frac{2}{L_k}t_k^2v_{k+1} \ge \|u^{k+1}\|^2-\|u^k\|^2.$$

Since $L_k\ge L_{k-1}$, we can conclude that

$$\frac{2}{L_{k-1}}t_{k-1}^2v_k-\frac{2}{L_k}t_k^2v_{k+1} \ge \|u^{k+1}\|^2-\|u^k\|^2.$$

Thus,

$$\|u^{k+1}\|^2+\frac{2}{L_k}t_k^2v_{k+1} \le \|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k,$$

and hence, for any $k\ge1$,

$$\|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|u^1\|^2+\frac{2}{L_0}t_0^2v_1 = \|z^0-x^*\|^2+\frac{2}{L_0}(F(x^1)-F_{\rm opt}). \qquad(10.48)$$

Substituting $x=x^*$, $y=y^0$, and $L=L_0$ in the fundamental prox-grad inequality (10.21), taking into account the convexity of $f$, yields

$$\frac{2}{L_0}(F(x^*)-F(z^0)) \ge \|z^0-x^*\|^2-\|y^0-x^*\|^2,$$

which, along with the facts that $y^0=x^0$ and $F(x^1)\le F(z^0)$, implies the bound

$$\|z^0-x^*\|^2+\frac{2}{L_0}(F(x^1)-F_{\rm opt}) \le \|x^0-x^*\|^2.$$

Combining the last inequality with (10.48), we get

$$\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|x^0-x^*\|^2.$$

Thus, using the bound $L_{k-1}\le\alpha L_f$, the definition of $v_k$, and Lemma 10.33,

$$F(x^k)-F_{\rm opt} \le \frac{L_{k-1}\|x^0-x^*\|^2}{2t_{k-1}^2} \le \frac{2\alpha L_f\|x^0-x^*\|^2}{(k+1)^2}.$$
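A minimal MFISTA sketch (with illustrative data of my own, not the book's), using the simple monotone rule of Remark 10.39, $x^{k+1}\in\operatorname{argmin}\{F(x):x\in\{x^k,z^k\}\}$, and the constant stepsize $L_k\equiv L_f$; by construction the recorded function values never increase:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# MFISTA on l1-regularized least squares (made-up data), monotone rule of Remark 10.39
rng = np.random.default_rng(4)
A = rng.standard_normal((40, 60))
b = rng.standard_normal(40)
lam = 1.0
L = np.linalg.norm(A.T @ A, 2)          # constant stepsize parameter L_k = L_f
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(60)
y = x.copy()
t = 1.0
vals = [F(x)]
for _ in range(100):
    z = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)   # z^k = T_L(y^k)
    x_new = z if F(z) <= F(x) else x                          # monotone choice of x^{k+1}
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x_new + (t / t_new) * (z - x_new) + ((t - 1) / t_new) * (x_new - x)
    x, t = x_new, t_new
    vals.append(F(x))
```

Unlike plain FISTA, the sequence `vals` is nonincreasing, while the $O(1/k^2)$ rate of Theorem 10.40 is retained.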


More information

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1 EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex

More information

c 2013 Society for Industrial and Applied Mathematics

c 2013 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 3, No., pp. 109 115 c 013 Society for Industrial and Applied Mathematics AN ACCELERATED HYBRID PROXIMAL EXTRAGRADIENT METHOD FOR CONVEX OPTIMIZATION AND ITS IMPLICATIONS TO SECOND-ORDER

More information

An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1

An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1 An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1 Nobuo Yamashita 2, Christian Kanzow 3, Tomoyui Morimoto 2, and Masao Fuushima 2 2 Department of Applied

More information

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS Fixed Point Theory, 15(2014), No. 2, 399-426 http://www.math.ubbcluj.ro/ nodeacj/sfptcj.html PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS ANDRZEJ CEGIELSKI AND RAFA

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

Lecture 5: September 15

Lecture 5: September 15 10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 15 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Di Jin, Mengdi Wang, Bin Deng Note: LaTeX template courtesy of UC Berkeley EECS

More information

Hedy Attouch, Jérôme Bolte, Benar Svaiter. To cite this version: HAL Id: hal

Hedy Attouch, Jérôme Bolte, Benar Svaiter. To cite this version: HAL Id: hal Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods Hedy Attouch, Jérôme Bolte, Benar Svaiter To cite

More information

Optimal Newton-type methods for nonconvex smooth optimization problems

Optimal Newton-type methods for nonconvex smooth optimization problems Optimal Newton-type methods for nonconvex smooth optimization problems Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint June 9, 20 Abstract We consider a general class of second-order iterations

More information

Primal and Dual Predicted Decrease Approximation Methods

Primal and Dual Predicted Decrease Approximation Methods Primal and Dual Predicted Decrease Approximation Methods Amir Beck Edouard Pauwels Shoham Sabach March 22, 2017 Abstract We introduce the notion of predicted decrease approximation (PDA) for constrained

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Proximal gradient methods

Proximal gradient methods ELE 538B: Large-Scale Optimization for Data Science Proximal gradient methods Yuxin Chen Princeton University, Spring 08 Outline Proximal gradient descent for composite functions Proximal mapping / operator

More information

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES Fenghui Wang Department of Mathematics, Luoyang Normal University, Luoyang 470, P.R. China E-mail: wfenghui@63.com ABSTRACT.

More information

Functions. Chapter Continuous Functions

Functions. Chapter Continuous Functions Chapter 3 Functions 3.1 Continuous Functions A function f is determined by the domain of f: dom(f) R, the set on which f is defined, and the rule specifying the value f(x) of f at each x dom(f). If f is

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

On the acceleration of the double smoothing technique for unconstrained convex optimization problems

On the acceleration of the double smoothing technique for unconstrained convex optimization problems On the acceleration of the double smoothing technique for unconstrained convex optimization problems Radu Ioan Boţ Christopher Hendrich October 10, 01 Abstract. In this article we investigate the possibilities

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

Lecture 6 : Projected Gradient Descent

Lecture 6 : Projected Gradient Descent Lecture 6 : Projected Gradient Descent EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Consider the following update. x l+1 = Π C (x l α f(x l )) Theorem Say f : R d R is (m, M)-strongly

More information

Stochastic model-based minimization under high-order growth

Stochastic model-based minimization under high-order growth Stochastic model-based minimization under high-order growth Damek Davis Dmitriy Drusvyatskiy Kellie J. MacPhee Abstract Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

On proximal-like methods for equilibrium programming

On proximal-like methods for equilibrium programming On proximal-lie methods for equilibrium programming Nils Langenberg Department of Mathematics, University of Trier 54286 Trier, Germany, langenberg@uni-trier.de Abstract In [?] Flam and Antipin discussed

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

GENERALIZED CANTOR SETS AND SETS OF SUMS OF CONVERGENT ALTERNATING SERIES

GENERALIZED CANTOR SETS AND SETS OF SUMS OF CONVERGENT ALTERNATING SERIES Journal of Applied Analysis Vol. 7, No. 1 (2001), pp. 131 150 GENERALIZED CANTOR SETS AND SETS OF SUMS OF CONVERGENT ALTERNATING SERIES M. DINDOŠ Received September 7, 2000 and, in revised form, February

More information

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Math. Program., Ser. B 2013) 140:125 161 DOI 10.1007/s10107-012-0629-5 FULL LENGTH PAPER Gradient methods for minimizing composite functions Yu. Nesterov Received: 10 June 2010 / Accepted: 29 December

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

Lecture 5: September 12

Lecture 5: September 12 10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 12 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Barun Patra and Tyler Vuong Note: LaTeX template courtesy of UC Berkeley EECS

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Lecture 25: Subgradient Method and Bundle Methods April 24

Lecture 25: Subgradient Method and Bundle Methods April 24 IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente May 3, 200 Abstract In this paper we prove that direct search of directional type shares the worst case complexity bound of steepest descent when sufficient

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Convex Functions and Optimization

Convex Functions and Optimization Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

Convex Analysis and Economic Theory AY Elementary properties of convex functions

Convex Analysis and Economic Theory AY Elementary properties of convex functions Division of the Humanities and Social Sciences Ec 181 KC Border Convex Analysis and Economic Theory AY 2018 2019 Topic 6: Convex functions I 6.1 Elementary properties of convex functions We may occasionally

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Primal and Dual Variables Decomposition Methods in Convex Optimization

Primal and Dual Variables Decomposition Methods in Convex Optimization Primal and Dual Variables Decomposition Methods in Convex Optimization Amir Beck Technion - Israel Institute of Technology Haifa, Israel Based on joint works with Edouard Pauwels, Shoham Sabach, Luba Tetruashvili,

More information

Local strong convexity and local Lipschitz continuity of the gradient of convex functions

Local strong convexity and local Lipschitz continuity of the gradient of convex functions Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems Optimization Methods and Software ISSN: 1055-6788 (Print) 1029-4937 (Online) Journal homepage: http://www.tandfonline.com/loi/goms20 An accelerated non-euclidean hybrid proximal extragradient-type algorithm

More information