The Proximal Gradient Method


Underlying Space: In this chapter, with the exception of Section 10.9, $\mathbb{E}$ is a Euclidean space, meaning a finite dimensional space endowed with an inner product $\langle\cdot,\cdot\rangle$ and the Euclidean norm $\|\cdot\|=\sqrt{\langle\cdot,\cdot\rangle}$.

10.1 The Composite Model

In this chapter we will be mostly concerned with the composite model

$$\min_{x\in\mathbb{E}}\{F(x)\equiv f(x)+g(x)\},\tag{10.1}$$

where we assume the following.

Assumption 10.1.

(A) $g:\mathbb{E}\to(-\infty,\infty]$ is proper closed and convex.

(B) $f:\mathbb{E}\to(-\infty,\infty]$ is proper and closed, $\mathrm{dom}(f)$ is convex, $\mathrm{dom}(g)\subseteq\mathrm{int}(\mathrm{dom}(f))$, and $f$ is $L_f$-smooth over $\mathrm{int}(\mathrm{dom}(f))$.

(C) The optimal set of problem (10.1) is nonempty and denoted by $X^*$. The optimal value of the problem is denoted by $F_{\mathrm{opt}}$.

Three special cases of the general model (10.1) are gathered in the following example.

Example 10.2.

Smooth unconstrained minimization. If $g\equiv 0$ and $\mathrm{dom}(f)=\mathbb{E}$, then (10.1) reduces to the unconstrained smooth minimization problem
$$\min_{x\in\mathbb{E}} f(x),$$
where $f:\mathbb{E}\to\mathbb{R}$ is an $L_f$-smooth function.

Copyright 2017 Society for Industrial and Applied Mathematics

Convex constrained smooth minimization. If $g=\delta_C$, where $C$ is a nonempty closed and convex set, then (10.1) amounts to the problem of minimizing a differentiable function over a nonempty closed and convex set:
$$\min_{x\in C} f(x),$$
where here $f$ is $L_f$-smooth over $\mathrm{int}(\mathrm{dom}(f))$ and $C\subseteq\mathrm{int}(\mathrm{dom}(f))$.

$l_1$-regularized minimization. Taking $g(x)=\lambda\|x\|_1$ for some $\lambda>0$, (10.1) amounts to the $l_1$-regularized problem
$$\min_{x\in\mathbb{E}}\{f(x)+\lambda\|x\|_1\}$$
with $f$ being an $L_f$-smooth function over the entire space $\mathbb{E}$.

10.2 The Proximal Gradient Method

To understand the idea behind the method for solving (10.1) we are about to study, we begin by revisiting the projected gradient method for solving (10.1) in the case where $g=\delta_C$ with $C$ being a nonempty closed and convex set. In this case, the problem takes the form
$$\min\{f(x):x\in C\}.\tag{10.2}$$
The general update step of the projected gradient method for solving (10.2) takes the form
$$x^{k+1}=P_C\left(x^k-t_k\nabla f(x^k)\right),$$
where $t_k$ is the stepsize at iteration $k$. It is easy to verify that the update step can be also written as (see also Section 9.1 for a similar discussion on the projected subgradient method)
$$x^{k+1}=\operatorname*{argmin}_{x\in C}\left\{f(x^k)+\langle\nabla f(x^k),x-x^k\rangle+\frac{1}{2t_k}\|x-x^k\|^2\right\}.$$
That is, the next iterate is the minimizer over $C$ of the sum of the linearization of the smooth part around the current iterate plus a quadratic prox term.

Back to the more general model (10.1), it is natural to generalize the above idea and to define the next iterate as the minimizer of the sum of the linearization of $f$ around $x^k$, the nonsmooth function $g$, and a quadratic prox term:
$$x^{k+1}=\operatorname*{argmin}_{x\in\mathbb{E}}\left\{f(x^k)+\langle\nabla f(x^k),x-x^k\rangle+g(x)+\frac{1}{2t_k}\|x-x^k\|^2\right\}.\tag{10.3}$$
After some simple algebraic manipulation and cancellation of constant terms, we obtain that (10.3) can be rewritten as
$$x^{k+1}=\operatorname*{argmin}_{x\in\mathbb{E}}\left\{t_k g(x)+\frac{1}{2}\left\|x-\left(x^k-t_k\nabla f(x^k)\right)\right\|^2\right\},$$
which by the definition of the proximal operator is the same as
$$x^{k+1}=\operatorname{prox}_{t_k g}\left(x^k-t_k\nabla f(x^k)\right).$$
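In code, one iteration of the scheme just derived is a gradient step followed by a prox step. Below is a minimal sketch; the concrete quadratic $f$, the nonnegativity constraint playing the role of $g=\delta_C$, and all names are illustrative assumptions for the demo, not taken from the text.

```python
import numpy as np

def prox_grad_step(x, grad_f, prox_g, t):
    """One proximal gradient update: x+ = prox_{t g}(x - t * grad_f(x))."""
    return prox_g(x - t * grad_f(x), t)

# Illustrative instance: f(x) = 0.5*||A x - b||^2, g = indicator of the
# nonnegative orthant, so prox_{t g} is the projection onto x >= 0
# (i.e., the projected gradient method).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([-2.0, 3.0])
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, t: np.maximum(v, 0.0)   # projection; independent of t here

L_f = np.linalg.norm(A.T @ A, 2)           # a Lipschitz constant of grad f (= 4)
x = np.array([1.0, 1.0])
for _ in range(200):
    x = prox_grad_step(x, grad_f, prox_g, 1.0 / L_f)
# The constrained minimizer is (0, 3): the unconstrained solution is (-1, 3),
# and the first coordinate is clipped at the boundary of the orthant.
```

With $g=\delta_C$ the prox step is exactly the projection $P_C$, recovering the projected gradient method as a special case.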

The above method is called the proximal gradient method, as it consists of a gradient step followed by a proximal mapping. From now on, we will take the stepsizes as $t_k=\frac{1}{L_k}$, leading to the following description of the method.

The Proximal Gradient Method

Initialization: pick $x^0\in\mathrm{int}(\mathrm{dom}(f))$.
General step: for any $k=0,1,2,\ldots$ execute the following steps:
(a) pick $L_k>0$;
(b) set $x^{k+1}=\operatorname{prox}_{\frac{1}{L_k}g}\left(x^k-\frac{1}{L_k}\nabla f(x^k)\right)$.

The general update step of the proximal gradient method can be compactly written as
$$x^{k+1}=T^{f,g}_{L_k}(x^k),$$
where $T^{f,g}_L:\mathrm{int}(\mathrm{dom}(f))\to\mathbb{E}$ ($L>0$) is the so-called prox-grad operator defined by
$$T^{f,g}_L(x)\equiv\operatorname{prox}_{\frac{1}{L}g}\left(x-\frac{1}{L}\nabla f(x)\right).$$
When the identities of $f$ and $g$ are clear from the context, we will often omit the superscripts $f,g$ and write $T_L(\cdot)$ instead of $T^{f,g}_L(\cdot)$.

Later on, we will consider two stepsize strategies, constant and backtracking, where the meaning of backtracking slightly changes under the different settings that will be considered, and hence several backtracking procedures will be defined.

Example 10.3. The table below presents the explicit update step of the proximal gradient method when applied to the three particular models discussed in Example 10.2. The exact assumptions on the models are described in Example 10.2.

Model | Update step | Name of method
$\min_{x\in\mathbb{E}} f(x)$ | $x^{k+1}=x^k-t_k\nabla f(x^k)$ | gradient
$\min_{x\in C} f(x)$ | $x^{k+1}=P_C\left(x^k-t_k\nabla f(x^k)\right)$ | projected gradient
$\min_{x\in\mathbb{E}}\{f(x)+\lambda\|x\|_1\}$ | $x^{k+1}=\mathcal{T}_{\lambda t_k}\left(x^k-t_k\nabla f(x^k)\right)$ | ISTA

The third method is known as the iterative shrinkage-thresholding algorithm (ISTA) in the literature, since at each iteration a soft-thresholding operation (also known as "shrinkage") is performed.[54]

[54] Here we use the facts that $\operatorname{prox}_{t g_0}=I$, $\operatorname{prox}_{t\delta_C}=P_C$, and $\operatorname{prox}_{t\lambda\|\cdot\|_1}=\mathcal{T}_{\lambda t}$, where $g_0(x)\equiv 0$.
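The ISTA row of the table can be made concrete in a few lines. The sketch below is an illustrative instance (the identity matrix is chosen only so that the answer is checkable by hand, since then $x^*_i=\mathcal{T}_\lambda(b_i)$); it implements the update $x^{k+1}=\mathcal{T}_{\lambda t_k}(x^k-t_k\nabla f(x^k))$ with the constant stepsize $t_k=1/L_f$.

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding T_tau(v) = sign(v)*max(|v|-tau, 0): the prox of tau*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, num_iters=500):
    """ISTA for min 0.5*||A x - b||^2 + lam*||x||_1 with constant stepsize 1/L_f."""
    L_f = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of the gradient
    t = 1.0 / L_f
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = soft_threshold(x - t * A.T @ (A @ x - b), lam * t)
    return x

# Tiny instance with a known answer: A = I, so x*_i = soft_threshold(b_i, lam).
b = np.array([3.0, -0.5, 1.0])
x = ista(np.eye(3), b, lam=1.0)
# x should be close to (2.0, 0.0, 0.0)
```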

10.3 Analysis of the Proximal Gradient Method — The Nonconvex Case[55]

Sufficient Decrease

To establish the convergence of the proximal gradient method, we will prove a sufficient decrease lemma for composite functions.

Lemma 10.4 (sufficient decrease lemma). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. Let $F=f+g$ and $T_L\equiv T^{f,g}_L$. Then for any $x\in\mathrm{int}(\mathrm{dom}(f))$ and $L\in\left(\frac{L_f}{2},\infty\right)$ the following inequality holds:
$$F(x)-F(T_L(x))\ge\frac{L-\frac{L_f}{2}}{L^2}\left\|G^{f,g}_L(x)\right\|^2,\tag{10.4}$$
where $G^{f,g}_L:\mathrm{int}(\mathrm{dom}(f))\to\mathbb{E}$ is the operator defined by $G^{f,g}_L(x)=L(x-T_L(x))$ for all $x\in\mathrm{int}(\mathrm{dom}(f))$.

Proof. For the sake of simplicity, we use the shorthand notation $x^+=T_L(x)$. By the descent lemma (Lemma 5.7), we have that
$$f(x^+)\le f(x)+\langle\nabla f(x),x^+-x\rangle+\frac{L_f}{2}\|x-x^+\|^2.\tag{10.5}$$
By the second prox theorem (Theorem 6.39), since $x^+=\operatorname{prox}_{\frac{1}{L}g}\left(x-\frac{1}{L}\nabla f(x)\right)$, we have
$$\left\langle x-\frac{1}{L}\nabla f(x)-x^+,x-x^+\right\rangle\le\frac{1}{L}g(x)-\frac{1}{L}g(x^+),$$
from which it follows that
$$\langle\nabla f(x),x^+-x\rangle\le -L\|x^+-x\|^2+g(x)-g(x^+),$$
which, combined with (10.5), yields
$$f(x^+)+g(x^+)\le f(x)+g(x)+\left(-L+\frac{L_f}{2}\right)\|x^+-x\|^2.$$
Hence, taking into account the definitions of $x^+$, $G^{f,g}_L(x)$ and the identities $F(x)=f(x)+g(x)$, $F(x^+)=f(x^+)+g(x^+)$, the desired result follows.

The Gradient Mapping

The operator $G^{f,g}_L$ that appears in the right-hand side of (10.4) is an important mapping that can be seen as a generalization of the notion of the gradient.

Definition 10.5 (gradient mapping). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. Then the gradient mapping is the operator

[55] The analysis of the proximal gradient method in Sections 10.3 and 10.4 mostly follows the presentation of Beck and Teboulle in [18] and [19].

$G^{f,g}_L:\mathrm{int}(\mathrm{dom}(f))\to\mathbb{E}$ defined by
$$G^{f,g}_L(x)\equiv L\left(x-T^{f,g}_L(x)\right)$$
for any $x\in\mathrm{int}(\mathrm{dom}(f))$.

When the identities of $f$ and $g$ are clear from the context, we will use the notation $G_L$ instead of $G^{f,g}_L$. With the terminology of the gradient mapping, the update step of the proximal gradient method can be rewritten as
$$x^{k+1}=x^k-\frac{1}{L_k}G_{L_k}(x^k).$$
In the special case where $L=L_f$, the sufficient decrease inequality (10.4) takes a simpler form.

Corollary 10.6. Under the setting of Lemma 10.4, the following inequality holds for any $x\in\mathrm{int}(\mathrm{dom}(f))$:
$$F(x)-F(T_{L_f}(x))\ge\frac{1}{2L_f}\left\|G_{L_f}(x)\right\|^2.$$

The next result shows that the gradient mapping is a generalization of the usual gradient operator $x\mapsto\nabla f(x)$ in the sense that they coincide when $g\equiv 0$ and that, for a general $g$, the points in which the gradient mapping vanishes are the stationary points of the problem of minimizing $f+g$. Recall (see Definition 3.73) that a point $x^*\in\mathrm{dom}(g)$ is a stationary point of problem (10.1) if and only if $-\nabla f(x^*)\in\partial g(x^*)$ and that this condition is a necessary optimality condition for local optimal points (see Theorem 3.72).

Theorem 10.7. Let $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1 and let $L>0$. Then

(a) $G^{f,g_0}_L(x)=\nabla f(x)$ for any $x\in\mathrm{int}(\mathrm{dom}(f))$, where $g_0(x)\equiv 0$;

(b) for $x^*\in\mathrm{int}(\mathrm{dom}(f))$, it holds that $G^{f,g}_L(x^*)=0$ if and only if $x^*$ is a stationary point of problem (10.1).

Proof. (a) Since $\operatorname{prox}_{\frac{1}{L}g_0}(y)=y$ for all $y\in\mathbb{E}$, it follows that
$$G^{f,g_0}_L(x)=L\left(x-T^{f,g_0}_L(x)\right)=L\left(x-\operatorname{prox}_{\frac{1}{L}g_0}\left(x-\frac{1}{L}\nabla f(x)\right)\right)=L\left(x-\left(x-\frac{1}{L}\nabla f(x)\right)\right)=\nabla f(x).$$
(b) $G^{f,g}_L(x^*)=0$ if and only if $x^*=\operatorname{prox}_{\frac{1}{L}g}\left(x^*-\frac{1}{L}\nabla f(x^*)\right)$. By the second prox theorem (Theorem 6.39), the latter relation holds if and only if
$$x^*-\frac{1}{L}\nabla f(x^*)-x^*\in\frac{1}{L}\partial g(x^*),$$

that is, if and only if $-\nabla f(x^*)\in\partial g(x^*)$, which is exactly the condition for stationarity.

If in addition $f$ is convex, then stationarity is a necessary and sufficient optimality condition (Theorem 3.72(b)), which leads to the following corollary.

Corollary 10.8 (necessary and sufficient optimality condition under convexity). Let $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1, and let $L>0$. Suppose that in addition $f$ is convex. Then for $x^*\in\mathrm{dom}(g)$, $G^{f,g}_L(x^*)=0$ if and only if $x^*$ is an optimal solution of problem (10.1).

We can think of the quantity $\|G_L(x)\|$ as an "optimality measure" in the sense that it is always nonnegative, and equal to zero if and only if $x$ is a stationary point. The next result establishes important monotonicity properties of $\|G_L(x)\|$ w.r.t. the parameter $L$.

Theorem 10.9 (monotonicity of the gradient mapping). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1 and let $G_L\equiv G^{f,g}_L$. Suppose that $L_1\ge L_2>0$. Then
$$\|G_{L_1}(x)\|\ge\|G_{L_2}(x)\|\tag{10.6}$$
and
$$\frac{\|G_{L_1}(x)\|}{L_1}\le\frac{\|G_{L_2}(x)\|}{L_2}\tag{10.7}$$
for any $x\in\mathrm{int}(\mathrm{dom}(f))$.

Proof. Recall that by the second prox theorem (Theorem 6.39), for any $v,w\in\mathbb{E}$ and $L>0$, the following inequality holds:
$$\left\langle v-\operatorname{prox}_{\frac{1}{L}g}(v),\operatorname{prox}_{\frac{1}{L}g}(v)-w\right\rangle\ge\frac{1}{L}g\left(\operatorname{prox}_{\frac{1}{L}g}(v)\right)-\frac{1}{L}g(w).$$
Plugging $L=L_1$, $v=x-\frac{1}{L_1}\nabla f(x)$, and $w=\operatorname{prox}_{\frac{1}{L_2}g}\left(x-\frac{1}{L_2}\nabla f(x)\right)=T_{L_2}(x)$ into the last inequality, it follows that
$$\left\langle x-\frac{1}{L_1}\nabla f(x)-T_{L_1}(x),T_{L_1}(x)-T_{L_2}(x)\right\rangle\ge\frac{1}{L_1}g(T_{L_1}(x))-\frac{1}{L_1}g(T_{L_2}(x)),$$
or
$$\left\langle\frac{1}{L_1}G_{L_1}(x)-\frac{1}{L_1}\nabla f(x),\frac{1}{L_2}G_{L_2}(x)-\frac{1}{L_1}G_{L_1}(x)\right\rangle\ge\frac{1}{L_1}g(T_{L_1}(x))-\frac{1}{L_1}g(T_{L_2}(x)).$$
Exchanging the roles of $L_1$ and $L_2$ yields the following inequality:
$$\left\langle\frac{1}{L_2}G_{L_2}(x)-\frac{1}{L_2}\nabla f(x),\frac{1}{L_1}G_{L_1}(x)-\frac{1}{L_2}G_{L_2}(x)\right\rangle\ge\frac{1}{L_2}g(T_{L_2}(x))-\frac{1}{L_2}g(T_{L_1}(x)).$$
Multiplying the first inequality by $L_1$ and the second by $L_2$ and adding them, we obtain
$$\left\langle G_{L_1}(x)-G_{L_2}(x),\frac{1}{L_2}G_{L_2}(x)-\frac{1}{L_1}G_{L_1}(x)\right\rangle\ge 0,$$

which after some expansion of terms can be seen to be the same as
$$\frac{1}{L_1}\|G_{L_1}(x)\|^2+\frac{1}{L_2}\|G_{L_2}(x)\|^2\le\left(\frac{1}{L_1}+\frac{1}{L_2}\right)\langle G_{L_1}(x),G_{L_2}(x)\rangle.$$
Using the Cauchy–Schwarz inequality, we obtain that
$$\frac{1}{L_1}\|G_{L_1}(x)\|^2+\frac{1}{L_2}\|G_{L_2}(x)\|^2\le\left(\frac{1}{L_1}+\frac{1}{L_2}\right)\|G_{L_1}(x)\|\cdot\|G_{L_2}(x)\|.\tag{10.8}$$
Note that if $G_{L_2}(x)=0$, then by the last inequality, $G_{L_1}(x)=0$, implying that in this case the inequalities (10.6) and (10.7) hold trivially. Assume then that $G_{L_2}(x)\neq 0$ and define $t=\frac{\|G_{L_1}(x)\|}{\|G_{L_2}(x)\|}$. Then, by (10.8),
$$\frac{1}{L_1}t^2-\left(\frac{1}{L_1}+\frac{1}{L_2}\right)t+\frac{1}{L_2}\le 0.$$
Since the roots of the quadratic function on the left-hand side of the above inequality are $t=1,\frac{L_1}{L_2}$, we obtain that $1\le t\le\frac{L_1}{L_2}$, showing that
$$\|G_{L_2}(x)\|\le\|G_{L_1}(x)\|\le\frac{L_1}{L_2}\|G_{L_2}(x)\|.$$

A straightforward result of the nonexpansivity of the prox operator and the $L_f$-smoothness of $f$ over $\mathrm{int}(\mathrm{dom}(f))$ is that $G_L(\cdot)$ is Lipschitz continuous with constant $2L+L_f$. Indeed, for any $x,y\in\mathrm{int}(\mathrm{dom}(f))$,
$$\begin{aligned}\|G_L(x)-G_L(y)\|&=\left\|L\left(x-\operatorname{prox}_{\frac{1}{L}g}\left(x-\tfrac{1}{L}\nabla f(x)\right)\right)-L\left(y-\operatorname{prox}_{\frac{1}{L}g}\left(y-\tfrac{1}{L}\nabla f(y)\right)\right)\right\|\\&\le L\|x-y\|+L\left\|\operatorname{prox}_{\frac{1}{L}g}\left(x-\tfrac{1}{L}\nabla f(x)\right)-\operatorname{prox}_{\frac{1}{L}g}\left(y-\tfrac{1}{L}\nabla f(y)\right)\right\|\\&\le L\|x-y\|+L\left\|\left(x-\tfrac{1}{L}\nabla f(x)\right)-\left(y-\tfrac{1}{L}\nabla f(y)\right)\right\|\\&\le 2L\|x-y\|+\|\nabla f(x)-\nabla f(y)\|\\&\le(2L+L_f)\|x-y\|.\end{aligned}$$
In particular, for $L=L_f$, we obtain the inequality
$$\|G_{L_f}(x)-G_{L_f}(y)\|\le 3L_f\|x-y\|.$$
The above discussion is summarized in the following lemma.

Lemma 10.10 (Lipschitz continuity of the gradient mapping). Let $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. Let $G_L=G^{f,g}_L$. Then

(a) $\|G_L(x)-G_L(y)\|\le(2L+L_f)\|x-y\|$ for any $x,y\in\mathrm{int}(\mathrm{dom}(f))$;

(b) $\|G_{L_f}(x)-G_{L_f}(y)\|\le 3L_f\|x-y\|$ for any $x,y\in\mathrm{int}(\mathrm{dom}(f))$.
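A small numerical illustration of the gradient mapping as an optimality measure may help here; the quadratic $f$ and the $l_1$ choice of $g$ below are assumptions for the demo, not from the text. The mapping vanishes at a minimizer (by Theorem 10.7/Corollary 10.8) and is nonzero elsewhere.

```python
import numpy as np

def gradient_mapping(x, grad_f, prox_g, L):
    """G_L(x) = L*(x - T_L(x)), with T_L(x) = prox_{(1/L) g}(x - (1/L) grad_f(x))."""
    return L * (x - prox_g(x - grad_f(x) / L, 1.0 / L))

# Demo instance: f(x) = 0.5*||x - c||^2, g = lam*||.||_1, whose minimizer is
# x* = soft_threshold(c, lam).
c = np.array([2.0, 0.3, -1.5])
lam = 1.0
grad_f = lambda x: x - c
prox_g = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam * t, 0.0)

x_star = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)   # (1.0, 0.0, -0.5)
g_at_star = gradient_mapping(x_star, grad_f, prox_g, L=1.0)
g_elsewhere = gradient_mapping(np.zeros(3), grad_f, prox_g, L=1.0)
# g_at_star is (numerically) zero; g_elsewhere is not.
```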

Lemma 10.11 below shows that when $f$ is assumed to be convex and $L_f$-smooth over the entire space, then the operator $\frac{3}{4L_f}G_{L_f}$ is firmly nonexpansive. A direct consequence is that $G_{L_f}$ is Lipschitz continuous with constant $\frac{4L_f}{3}$.

Lemma 10.11 (firm nonexpansivity of $\frac{3}{4L_f}G_{L_f}$). Let $f$ be a convex and $L_f$-smooth function ($L_f>0$), and let $g:\mathbb{E}\to(-\infty,\infty]$ be a proper closed and convex function. Then

(a) the gradient mapping $G_{L_f}\equiv G^{f,g}_{L_f}$ satisfies the relation
$$\langle G_{L_f}(x)-G_{L_f}(y),x-y\rangle\ge\frac{3}{4L_f}\|G_{L_f}(x)-G_{L_f}(y)\|^2\tag{10.9}$$
for any $x,y\in\mathbb{E}$;

(b) $\|G_{L_f}(x)-G_{L_f}(y)\|\le\frac{4L_f}{3}\|x-y\|$ for any $x,y\in\mathbb{E}$.

Proof. Part (b) is a direct consequence of (a) and the Cauchy–Schwarz inequality. We will therefore prove (a). To simplify the presentation, we will use the notation $L=L_f$. By the firm nonexpansivity of the prox operator (Theorem 6.42(a)), it follows that for any $x,y\in\mathbb{E}$,
$$\left\langle T_L(x)-T_L(y),\left(x-\tfrac{1}{L}\nabla f(x)\right)-\left(y-\tfrac{1}{L}\nabla f(y)\right)\right\rangle\ge\|T_L(x)-T_L(y)\|^2,$$
where $T_L\equiv T^{f,g}_L$ is the prox-grad mapping. Since $T_L=I-\frac{1}{L}G_L$, we obtain that
$$\left\langle\left(x-\tfrac{1}{L}G_L(x)\right)-\left(y-\tfrac{1}{L}G_L(y)\right),\left(x-\tfrac{1}{L}\nabla f(x)\right)-\left(y-\tfrac{1}{L}\nabla f(y)\right)\right\rangle\ge\left\|\left(x-\tfrac{1}{L}G_L(x)\right)-\left(y-\tfrac{1}{L}G_L(y)\right)\right\|^2,$$
which is the same as
$$\left\langle\left(x-\tfrac{1}{L}G_L(x)\right)-\left(y-\tfrac{1}{L}G_L(y)\right),\left(G_L(x)-\nabla f(x)\right)-\left(G_L(y)-\nabla f(y)\right)\right\rangle\ge 0.$$
Therefore,
$$\langle G_L(x)-G_L(y),x-y\rangle\ge\frac{1}{L}\|G_L(x)-G_L(y)\|^2+\langle\nabla f(x)-\nabla f(y),x-y\rangle-\frac{1}{L}\langle G_L(x)-G_L(y),\nabla f(x)-\nabla f(y)\rangle.$$
Since $f$ is $L$-smooth, it follows from Theorem 5.8 (equivalence between (i) and (iv)) that
$$\langle\nabla f(x)-\nabla f(y),x-y\rangle\ge\frac{1}{L}\|\nabla f(x)-\nabla f(y)\|^2.$$
Consequently,
$$L\langle G_L(x)-G_L(y),x-y\rangle\ge\|G_L(x)-G_L(y)\|^2+\|\nabla f(x)-\nabla f(y)\|^2-\langle G_L(x)-G_L(y),\nabla f(x)-\nabla f(y)\rangle.$$

From the Cauchy–Schwarz inequality we get
$$L\langle G_L(x)-G_L(y),x-y\rangle\ge\|G_L(x)-G_L(y)\|^2+\|\nabla f(x)-\nabla f(y)\|^2-\|G_L(x)-G_L(y)\|\cdot\|\nabla f(x)-\nabla f(y)\|.\tag{10.10}$$
By denoting $\alpha=\|G_L(x)-G_L(y)\|$ and $\beta=\|\nabla f(x)-\nabla f(y)\|$, the right-hand side of (10.10) reads as $\alpha^2+\beta^2-\alpha\beta$ and satisfies
$$\alpha^2+\beta^2-\alpha\beta=\frac{3}{4}\alpha^2+\left(\frac{\alpha}{2}-\beta\right)^2\ge\frac{3}{4}\alpha^2,$$
which, combined with (10.10), yields the inequality
$$L\langle G_L(x)-G_L(y),x-y\rangle\ge\frac{3}{4}\|G_L(x)-G_L(y)\|^2.$$
Thus, (10.9) holds.

The next result shows a different kind of a monotonicity property of the gradient mapping norm under the setting of Lemma 10.11: the norm of the gradient mapping does not increase if a prox-grad step is employed on its argument.

Lemma 10.12 (monotonicity of the norm of the gradient mapping w.r.t. the prox-grad operator).[56] Let $f$ be a convex and $L_f$-smooth function ($L_f>0$), and let $g:\mathbb{E}\to(-\infty,\infty]$ be a proper closed and convex function. Then for any $x\in\mathbb{E}$,
$$\|G_{L_f}(T_{L_f}(x))\|\le\|G_{L_f}(x)\|,$$
where $G_{L_f}\equiv G^{f,g}_{L_f}$ and $T_{L_f}\equiv T^{f,g}_{L_f}$.

Proof. Let $x\in\mathbb{E}$. We will use the shorthand notation $x^+=T_{L_f}(x)$. By Theorem 5.8 (equivalence between (i) and (iv)), it follows that
$$\|\nabla f(x^+)-\nabla f(x)\|^2\le L_f\langle\nabla f(x^+)-\nabla f(x),x^+-x\rangle.\tag{10.11}$$
Denoting $a=\nabla f(x^+)-\nabla f(x)$ and $b=x^+-x$, inequality (10.11) can be rewritten as $\|a\|^2\le L_f\langle a,b\rangle$, which is the same as
$$\left\|\frac{1}{L_f}a-\frac{1}{2}b\right\|^2\le\frac{1}{4}\|b\|^2,$$
and as
$$\left\|\frac{1}{L_f}a-\frac{1}{2}b\right\|\le\frac{1}{2}\|b\|.$$
Using the triangle inequality,
$$\left\|\frac{1}{L_f}a-b\right\|\le\left\|\frac{1}{L_f}a-\frac{1}{2}b\right\|+\frac{1}{2}\|b\|\le\|b\|.$$

[56] Lemma 10.12 is a minor variation of Lemma 2.4 from Necoara and Patrascu [88].

Plugging the expressions for $a$ and $b$ into the above inequality, we obtain that
$$\left\|\left(x-\frac{1}{L_f}\nabla f(x)\right)-\left(x^+-\frac{1}{L_f}\nabla f(x^+)\right)\right\|\le\|x^+-x\|.$$
Combining the above inequality with the nonexpansivity of the prox operator (Theorem 6.42(b)), we finally obtain
$$\begin{aligned}\|G_{L_f}(T_{L_f}(x))\|&=\|G_{L_f}(x^+)\|=L_f\|x^+-T_{L_f}(x^+)\|\\&=L_f\left\|\operatorname{prox}_{\frac{1}{L_f}g}\left(x-\tfrac{1}{L_f}\nabla f(x)\right)-\operatorname{prox}_{\frac{1}{L_f}g}\left(x^+-\tfrac{1}{L_f}\nabla f(x^+)\right)\right\|\\&\le L_f\left\|\left(x-\tfrac{1}{L_f}\nabla f(x)\right)-\left(x^+-\tfrac{1}{L_f}\nabla f(x^+)\right)\right\|\\&\le L_f\|x^+-x\|=L_f\|T_{L_f}(x)-x\|=\|G_{L_f}(x)\|,\end{aligned}$$
which is the desired result.

Convergence of the Proximal Gradient Method — The Nonconvex Case

We will now analyze the convergence of the proximal gradient method under the validity of Assumption 10.1. Note that we do not assume at this stage that $f$ is convex. The two stepsize strategies that will be considered are constant and backtracking.

Constant. $L_k=\bar L\in\left(\frac{L_f}{2},\infty\right)$ for all $k$.

Backtracking procedure B1. The procedure requires three parameters $(s,\gamma,\eta)$, where $s>0$, $\gamma\in(0,1)$, and $\eta>1$. The choice of $L_k$ is done as follows. First, $L_k$ is set to be equal to the initial guess $s$. Then, while
$$F(x^k)-F(T_{L_k}(x^k))<\frac{\gamma}{L_k}\|G_{L_k}(x^k)\|^2,$$
we set $L_k:=\eta L_k$. In other words, $L_k$ is chosen as $L_k=s\eta^{i_k}$, where $i_k$ is the smallest nonnegative integer for which the condition
$$F(x^k)-F(T_{s\eta^{i_k}}(x^k))\ge\frac{\gamma}{s\eta^{i_k}}\|G_{s\eta^{i_k}}(x^k)\|^2$$
is satisfied.

Remark 10.13. Note that the backtracking procedure is finite under Assumption 10.1. Indeed, plugging $x=x^k$ and $L=L_k$ into (10.4), we obtain
$$F(x^k)-F(T_{L_k}(x^k))\ge\frac{L_k-\frac{L_f}{2}}{L_k^2}\|G_{L_k}(x^k)\|^2.\tag{10.12}$$

If $L_k\ge\frac{L_f}{2(1-\gamma)}$, then $\frac{L_k-\frac{L_f}{2}}{L_k}\ge\gamma$, and hence, by (10.12), the inequality
$$F(x^k)-F(T_{L_k}(x^k))\ge\frac{\gamma}{L_k}\|G_{L_k}(x^k)\|^2$$
holds, implying that the backtracking procedure must end when $L_k\ge\frac{L_f}{2(1-\gamma)}$. We can also compute an upper bound on $L_k$: either $L_k$ is equal to $s$, or the backtracking procedure is invoked, meaning that $\frac{L_k}{\eta}$ did not satisfy the backtracking condition, which by the above discussion implies that $\frac{L_k}{\eta}<\frac{L_f}{2(1-\gamma)}$, so that $L_k<\frac{\eta L_f}{2(1-\gamma)}$. To summarize, in the backtracking procedure B1, the parameter $L_k$ satisfies
$$L_k\le\max\left\{s,\frac{\eta L_f}{2(1-\gamma)}\right\}.\tag{10.13}$$

The convergence of the proximal gradient method in the nonconvex case is heavily based on the sufficient decrease lemma (Lemma 10.4). We begin with the following lemma showing that consecutive function values of the sequence generated by the proximal gradient method decrease by at least a constant times the squared norm of the gradient mapping.

Lemma 10.14 (sufficient decrease of the proximal gradient method). Suppose that Assumption 10.1 holds. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize defined by $L_k=\bar L\in\left(\frac{L_f}{2},\infty\right)$ or with a stepsize chosen by the backtracking procedure B1 with parameters $(s,\gamma,\eta)$, where $s>0,\gamma\in(0,1),\eta>1$. Then for any $k\ge 0$,
$$F(x^k)-F(x^{k+1})\ge M\|G_d(x^k)\|^2,\tag{10.14}$$
where
$$M=\begin{cases}\dfrac{\bar L-\frac{L_f}{2}}{\bar L^2},&\text{constant stepsize},\\[2mm]\dfrac{\gamma}{\max\left\{s,\frac{\eta L_f}{2(1-\gamma)}\right\}},&\text{backtracking},\end{cases}\tag{10.15}$$
and
$$d=\begin{cases}\bar L,&\text{constant stepsize},\\ s,&\text{backtracking}.\end{cases}\tag{10.16}$$

Proof. The result for the constant stepsize setting follows by plugging $L=\bar L$ and $x=x^k$ into (10.4). As for the case where the backtracking procedure is used, by its definition we have
$$F(x^k)-F(x^{k+1})\ge\frac{\gamma}{L_k}\|G_{L_k}(x^k)\|^2\ge\frac{\gamma}{\max\left\{s,\frac{\eta L_f}{2(1-\gamma)}\right\}}\|G_{L_k}(x^k)\|^2,$$
where the last inequality follows from the upper bound on $L_k$ given in (10.13). The result for the case where the backtracking procedure is invoked now follows by

the monotonicity property of the gradient mapping (Theorem 10.9) along with the bound $L_k\ge s$, which imply the inequality $\|G_{L_k}(x^k)\|\ge\|G_s(x^k)\|$.

We are now ready to prove the convergence of the norm of the gradient mapping to zero and that limit points of the sequence generated by the method are stationary points of problem (10.1).

Theorem 10.15 (convergence of the proximal gradient method — nonconvex case). Suppose that Assumption 10.1 holds and let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) either with a constant stepsize defined by $L_k=\bar L\in\left(\frac{L_f}{2},\infty\right)$ or with a stepsize chosen by the backtracking procedure B1 with parameters $(s,\gamma,\eta)$, where $s>0$, $\gamma\in(0,1)$, and $\eta>1$. Then

(a) the sequence $\{F(x^k)\}_{k\ge 0}$ is nonincreasing. In addition, $F(x^{k+1})<F(x^k)$ if and only if $x^k$ is not a stationary point of (10.1);

(b) $G_d(x^k)\to 0$ as $k\to\infty$, where $d$ is given in (10.16);

(c)
$$\min_{n=0,1,\ldots,k}\|G_d(x^n)\|\le\sqrt{\frac{F(x^0)-F_{\mathrm{opt}}}{M(k+1)}},\tag{10.17}$$
where $M$ is given in (10.15);

(d) all limit points of the sequence $\{x^k\}_{k\ge 0}$ are stationary points of problem (10.1).

Proof. (a) By Lemma 10.14 we have that
$$F(x^k)-F(x^{k+1})\ge M\|G_d(x^k)\|^2,\tag{10.18}$$
from which it readily follows that $F(x^k)\ge F(x^{k+1})$. If $x^k$ is not a stationary point of problem (10.1), then $G_d(x^k)\neq 0$, and hence, by (10.18), $F(x^k)>F(x^{k+1})$. If $x^k$ is a stationary point of problem (10.1), then $G_{L_k}(x^k)=0$, from which it follows that $x^{k+1}=x^k-\frac{1}{L_k}G_{L_k}(x^k)=x^k$, and consequently $F(x^k)=F(x^{k+1})$.

(b) Since the sequence $\{F(x^k)\}_{k\ge 0}$ is nonincreasing and bounded below, it converges. Thus, in particular, $F(x^k)-F(x^{k+1})\to 0$ as $k\to\infty$, which, combined with (10.18), implies that $\|G_d(x^k)\|\to 0$ as $k\to\infty$.

(c) Summing the inequality (10.18) over $n=0,1,\ldots,k$, we obtain
$$F(x^0)-F(x^{k+1})=\sum_{n=0}^{k}\left(F(x^n)-F(x^{n+1})\right)\ge M\sum_{n=0}^{k}\|G_d(x^n)\|^2\ge M(k+1)\min_{n=0,1,\ldots,k}\|G_d(x^n)\|^2.$$
Using the fact that $F(x^{k+1})\ge F_{\mathrm{opt}}$, the inequality (10.17) follows.

(d) Let $\bar x$ be a limit point of $\{x^k\}_{k\ge 0}$. Then there exists a subsequence $\{x^{k_j}\}_{j\ge 0}$ converging to $\bar x$. For any $j\ge 0$,
$$\|G_d(\bar x)\|\le\|G_d(\bar x)-G_d(x^{k_j})\|+\|G_d(x^{k_j})\|\le(2d+L_f)\|x^{k_j}-\bar x\|+\|G_d(x^{k_j})\|,\tag{10.19}$$
where Lemma 10.10(a) was used in the second inequality. Since the right-hand side of (10.19) goes to 0 as $j\to\infty$, it follows that $G_d(\bar x)=0$, which by Theorem 10.7(b) implies that $\bar x$ is a stationary point of problem (10.1).

10.4 Analysis of the Proximal Gradient Method — The Convex Case

The Fundamental Prox-Grad Inequality

The analysis of the proximal gradient method in the case where $f$ is convex is based on the following key inequality (which actually does not assume that $f$ is convex).

Theorem 10.16 (fundamental prox-grad inequality). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. For any $x\in\mathbb{E}$, $y\in\mathrm{int}(\mathrm{dom}(f))$, and $L>0$ satisfying
$$f(T_L(y))\le f(y)+\langle\nabla f(y),T_L(y)-y\rangle+\frac{L}{2}\|T_L(y)-y\|^2,\tag{10.20}$$
it holds that
$$F(x)-F(T_L(y))\ge\frac{L}{2}\|x-T_L(y)\|^2-\frac{L}{2}\|x-y\|^2+\ell_f(x,y),\tag{10.21}$$
where
$$\ell_f(x,y)=f(x)-f(y)-\langle\nabla f(y),x-y\rangle.$$

Proof. Consider the function
$$\varphi(u)=f(y)+\langle\nabla f(y),u-y\rangle+g(u)+\frac{L}{2}\|u-y\|^2.$$
Since $\varphi$ is an $L$-strongly convex function and $T_L(y)=\operatorname{argmin}_{u\in\mathbb{E}}\varphi(u)$, it follows by Theorem 5.25(b) that
$$\varphi(x)-\varphi(T_L(y))\ge\frac{L}{2}\|x-T_L(y)\|^2.\tag{10.22}$$
Note that by (10.20),
$$\varphi(T_L(y))=f(y)+\langle\nabla f(y),T_L(y)-y\rangle+\frac{L}{2}\|T_L(y)-y\|^2+g(T_L(y))\ge f(T_L(y))+g(T_L(y))=F(T_L(y)),$$
and thus (10.22) implies that for any $x\in\mathbb{E}$,
$$\varphi(x)\ge F(T_L(y))+\frac{L}{2}\|x-T_L(y)\|^2.$$

Plugging the expression for $\varphi(x)$ into the above inequality, we obtain
$$f(y)+\langle\nabla f(y),x-y\rangle+g(x)+\frac{L}{2}\|x-y\|^2\ge F(T_L(y))+\frac{L}{2}\|x-T_L(y)\|^2,$$
which is the same as the desired result:
$$F(x)-F(T_L(y))\ge\frac{L}{2}\|x-T_L(y)\|^2-\frac{L}{2}\|x-y\|^2+f(x)-f(y)-\langle\nabla f(y),x-y\rangle.$$

Remark 10.17. Obviously, by the descent lemma, (10.20) is satisfied for $L=L_f$, and hence, for any $x\in\mathbb{E}$ and $y\in\mathrm{int}(\mathrm{dom}(f))$, the inequality
$$F(x)-F(T_{L_f}(y))\ge\frac{L_f}{2}\|x-T_{L_f}(y)\|^2-\frac{L_f}{2}\|x-y\|^2+\ell_f(x,y)$$
holds.

A direct consequence of Theorem 10.16 is another version of the sufficient decrease lemma (Lemma 10.4). This is accomplished by substituting $y=x$ in the fundamental prox-grad inequality.

Corollary 10.18 (sufficient decrease lemma — second version). Suppose that $f$ and $g$ satisfy properties (A) and (B) of Assumption 10.1. For any $x\in\mathrm{int}(\mathrm{dom}(f))$ for which
$$f(T_L(x))\le f(x)+\langle\nabla f(x),T_L(x)-x\rangle+\frac{L}{2}\|T_L(x)-x\|^2,$$
it holds that
$$F(x)-F(T_L(x))\ge\frac{1}{2L}\|G_L(x)\|^2.$$

Stepsize Strategies in the Convex Case

When $f$ is also convex, we will consider, as in the nonconvex case, both constant and backtracking stepsize strategies. The backtracking procedure, which we will refer to as "backtracking procedure B2," will be slightly different than the one considered in the nonconvex case, and it will aim to find a constant $L_k$ satisfying
$$f(x^{k+1})\le f(x^k)+\langle\nabla f(x^k),x^{k+1}-x^k\rangle+\frac{L_k}{2}\|x^{k+1}-x^k\|^2.\tag{10.23}$$
In the special case where $g\equiv 0$, the proximal gradient method reduces to the gradient method $x^{k+1}=x^k-\frac{1}{L_k}\nabla f(x^k)$, and condition (10.23) reduces to
$$f(x^k)-f(x^{k+1})\ge\frac{1}{2L_k}\|\nabla f(x^k)\|^2,$$
which is similar to the sufficient decrease condition described in Lemma 10.4, and this is why condition (10.23) can also be viewed as a "sufficient decrease condition."
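For the $g\equiv 0$ case just described, the reduction of the condition to a sufficient decrease inequality for the pure gradient step can be checked numerically. The quadratic below is an illustrative assumption for the demo, not from the text.

```python
import numpy as np

# Check that for the gradient step x+ = x - (1/L)*grad_f(x) with L = L_f,
# the sufficient decrease inequality f(x) - f(x+) >= ||grad_f(x)||^2 / (2L)
# holds (it is guaranteed by the descent lemma).
Q = np.diag([1.0, 3.0, 5.0])            # f(x) = 0.5*<x, Qx>, so L_f = 5
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = 5.0

x = np.array([1.0, -2.0, 0.5])
x_plus = x - grad(x) / L
decrease = f(x) - f(x_plus)             # actual decrease in f
threshold = np.dot(grad(x), grad(x)) / (2 * L)
# decrease >= threshold must hold.
```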

Constant. $L_k=L_f$ for all $k$.

Backtracking procedure B2. The procedure requires two parameters $(s,\eta)$, where $s>0$ and $\eta>1$. Define $L_{-1}=s$. At iteration $k$ ($k\ge 0$) the choice of $L_k$ is done as follows. First, $L_k$ is set to be equal to $L_{k-1}$. Then, while
$$f(T_{L_k}(x^k))>f(x^k)+\langle\nabla f(x^k),T_{L_k}(x^k)-x^k\rangle+\frac{L_k}{2}\|T_{L_k}(x^k)-x^k\|^2,$$
we set $L_k:=\eta L_k$. In other words, $L_k$ is chosen as $L_k=L_{k-1}\eta^{i_k}$, where $i_k$ is the smallest nonnegative integer for which the condition
$$f(T_{L_{k-1}\eta^{i_k}}(x^k))\le f(x^k)+\langle\nabla f(x^k),T_{L_{k-1}\eta^{i_k}}(x^k)-x^k\rangle+\frac{L_{k-1}\eta^{i_k}}{2}\|T_{L_{k-1}\eta^{i_k}}(x^k)-x^k\|^2$$
is satisfied.

Remark 10.19 (upper and lower bounds on $L_k$). Under Assumption 10.1 and by the descent lemma (Lemma 5.7), it follows that both stepsize rules ensure that the sufficient decrease condition (10.23) is satisfied at each iteration. In addition, the constants $L_k$ that the backtracking procedure B2 produces satisfy the following bounds for all $k\ge 0$:
$$s\le L_k\le\max\{\eta L_f,s\}.\tag{10.24}$$
The inequality $s\le L_k$ is obvious. To understand the inequality $L_k\le\max\{\eta L_f,s\}$, note that there are two options. Either $L_k=s$, or $L_k>s$, and in the latter case there exists an index $k'\le k$ for which the inequality (10.23) was not satisfied at iteration $k'$ with $\frac{L_k}{\eta}$ replacing $L_{k'}$. By the descent lemma, this implies in particular that $\frac{L_k}{\eta}<L_f$, and we have thus shown that $L_k\le\max\{\eta L_f,s\}$. We also note that the bounds on $L_k$ can be rewritten as
$$\beta L_f\le L_k\le\alpha L_f,$$
where
$$\alpha=\begin{cases}1,&\text{constant},\\ \max\left\{\eta,\frac{s}{L_f}\right\},&\text{backtracking},\end{cases}\qquad\beta=\begin{cases}1,&\text{constant},\\ \frac{s}{L_f},&\text{backtracking}.\end{cases}\tag{10.25}$$

Remark 10.20 (monotonicity of the proximal gradient method). Since condition (10.23) holds for both stepsize rules, for any $k\ge 0$, we can invoke the fundamental prox-grad inequality (10.21) with $y=x=x^k$, $L=L_k$ and obtain the inequality
$$F(x^k)-F(x^{k+1})\ge\frac{L_k}{2}\|x^k-x^{k+1}\|^2,$$
which in particular implies that $F(x^k)\ge F(x^{k+1})$, meaning that the method produces a nonincreasing sequence of function values.
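The procedure B2 loop can be sketched directly from its description. The helper names and the tiny test problem below are assumptions for illustration; the loop itself mirrors the "while the condition fails, multiply by $\eta$" rule above.

```python
import numpy as np

def backtracking_b2(x, f, grad_f, prox_g, L_prev, eta):
    """Sketch of backtracking procedure B2: starting from the previous constant,
    increase L by factors of eta until the descent-lemma-type condition holds
    at T_L(x); returns (T_L(x), L)."""
    L = L_prev
    gx = grad_f(x)
    while True:
        T = prox_g(x - gx / L, 1.0 / L)            # T_L(x)
        d = T - x
        if f(T) <= f(x) + np.dot(gx, d) + (L / 2) * np.dot(d, d):
            return T, L
        L *= eta

# Illustrative check: f(x) = 0.5*||x||^2 (L_f = 1), g = 0 (prox is the identity).
# Starting from L_prev = 0.25 with eta = 2, the loop must stop at L = 1.
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
prox_g = lambda v, t: v
T, L = backtracking_b2(np.array([2.0, -1.0]), f, grad_f, prox_g, L_prev=0.25, eta=2.0)
```

Note that, as stated in Remark 10.19, restarting each iteration from $L_{k-1}$ (rather than from $s$) makes the produced constants nondecreasing.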

Convergence Analysis in the Convex Case

We will assume in addition to Assumption 10.1 that $f$ is convex. We begin by establishing an $O(1/k)$ rate of convergence of the generated sequence of function values to the optimal value. Such a rate of convergence is called a sublinear rate. This is of course an improvement over the $O(1/\sqrt{k})$ rate that was established for the projected subgradient and mirror descent methods. It is also not particularly surprising that an improved rate of convergence can be established, since additional properties are assumed on the objective function.

Theorem 10.21 ($O(1/k)$ rate of convergence of proximal gradient). Suppose that Assumption 10.1 holds and that in addition $f$ is convex. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$ or the backtracking procedure B2. Then for any $x^*\in X^*$ and $k\ge 1$,
$$F(x^k)-F_{\mathrm{opt}}\le\frac{\alpha L_f\|x^0-x^*\|^2}{2k},\tag{10.26}$$
where $\alpha=1$ in the constant stepsize setting and $\alpha=\max\left\{\eta,\frac{s}{L_f}\right\}$ if the backtracking rule is employed.

Proof. For any $n\ge 0$, substituting $L=L_n$, $x=x^*$, and $y=x^n$ in the fundamental prox-grad inequality (10.21) and taking into account the fact that in both stepsize rules condition (10.20) is satisfied, we obtain
$$\frac{2}{L_n}\left(F(x^*)-F(x^{n+1})\right)\ge\|x^*-x^{n+1}\|^2-\|x^*-x^n\|^2+\frac{2}{L_n}\ell_f(x^*,x^n)\ge\|x^*-x^{n+1}\|^2-\|x^*-x^n\|^2,$$
where the convexity of $f$ was used in the last inequality. Summing the above inequality over $n=0,1,\ldots,k-1$ and using the bound $L_n\le\alpha L_f$ for all $n\ge 0$ (see Remark 10.19), we obtain
$$\frac{2}{\alpha L_f}\sum_{n=0}^{k-1}\left(F(x^*)-F(x^{n+1})\right)\ge\|x^*-x^k\|^2-\|x^*-x^0\|^2.$$
Thus,
$$\sum_{n=0}^{k-1}\left(F(x^{n+1})-F_{\mathrm{opt}}\right)\le\frac{\alpha L_f}{2}\|x^*-x^0\|^2-\frac{\alpha L_f}{2}\|x^*-x^k\|^2\le\frac{\alpha L_f}{2}\|x^*-x^0\|^2.$$
By the monotonicity of $\{F(x^n)\}_{n\ge 0}$ (see Remark 10.20), we can conclude that
$$k\left(F(x^k)-F_{\mathrm{opt}}\right)\le\sum_{n=0}^{k-1}\left(F(x^{n+1})-F_{\mathrm{opt}}\right)\le\frac{\alpha L_f}{2}\|x^*-x^0\|^2.$$
Consequently,
$$F(x^k)-F_{\mathrm{opt}}\le\frac{\alpha L_f\|x^*-x^0\|^2}{2k}.$$
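The $O(1/k)$ bound of Theorem 10.21 can be checked numerically. The random least-squares instance below is an assumption for the demo (the optimal value and solution are approximated by running the method to high accuracy, which only makes the checked gap smaller, so the check remains conservative).

```python
import numpy as np

# Numerical sanity check (not a proof) of the O(1/k) bound of Theorem 10.21 on
# a small l1-regularized least-squares instance, constant stepsize 1/L_f (alpha = 1).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
L_f = np.linalg.norm(A.T @ A, 2)

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def prox_grad_step(x):
    v = x - A.T @ (A @ x - b) / L_f
    return np.sign(v) * np.maximum(np.abs(v) - lam / L_f, 0.0)   # ISTA step

x0 = np.zeros(5)
x, vals = x0.copy(), []
for _ in range(2000):
    x = prox_grad_step(x)
    vals.append(F(x))
F_opt, x_star = vals[-1], x   # high-accuracy surrogates for the optimum

# F(x^k) - F_opt should stay below L_f*||x0 - x*||^2 / (2k) for every k.
bound_holds = all(
    vals[k - 1] - F_opt <= L_f * np.sum((x0 - x_star) ** 2) / (2 * k) + 1e-9
    for k in range(1, 101)
)
```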

Remark 10.22. Note that we did not utilize in the proof of Theorem 10.21 the fact that procedure B2 produces a nondecreasing sequence of constants $\{L_k\}_{k\ge 0}$. This implies in particular that the monotonicity of this sequence of constants is not essential, and we can actually prove the same convergence rate for any backtracking procedure that guarantees the validity of condition (10.23) and the bound $L_k\le\alpha L_f$.

We can also prove that the generated sequence is Fejér monotone, from which convergence of the sequence to an optimal solution readily follows.

Theorem 10.23 (Fejér monotonicity of the sequence generated by the proximal gradient method). Suppose that Assumption 10.1 holds and that in addition $f$ is convex. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$ or the backtracking procedure B2. Then for any $x^*\in X^*$ and $k\ge 0$,
$$\|x^{k+1}-x^*\|\le\|x^k-x^*\|.\tag{10.27}$$

Proof. We will repeat some of the arguments used in the proof of Theorem 10.21. Substituting $L=L_k$, $x=x^*$, and $y=x^k$ in the fundamental prox-grad inequality (10.21) and taking into account the fact that in both stepsize rules condition (10.20) is satisfied, we obtain
$$\frac{2}{L_k}\left(F(x^*)-F(x^{k+1})\right)\ge\|x^*-x^{k+1}\|^2-\|x^*-x^k\|^2+\frac{2}{L_k}\ell_f(x^*,x^k)\ge\|x^*-x^{k+1}\|^2-\|x^*-x^k\|^2,$$
where the convexity of $f$ was used in the last inequality. The result (10.27) now follows by the inequality $F(x^*)-F(x^{k+1})\le 0$.

Thanks to the Fejér monotonicity property, we can now establish the convergence of the sequence generated by the proximal gradient method.

Theorem 10.24 (convergence of the sequence generated by the proximal gradient method). Suppose that Assumption 10.1 holds and that in addition $f$ is convex. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$ or the backtracking procedure B2. Then the sequence $\{x^k\}_{k\ge 0}$ converges to an optimal solution of problem (10.1).

Proof. By Theorem 10.23, the sequence is Fejér monotone w.r.t. $X^*$. Therefore, by Theorem 8.16, to show convergence to a point in $X^*$, it is enough to show that any limit point of the sequence $\{x^k\}_{k\ge 0}$ is necessarily in $X^*$. Let then $\tilde x$ be a limit point of the sequence. Then there exists a subsequence $\{x^{k_j}\}_{j\ge 0}$ converging to $\tilde x$. By Theorem 10.21,
$$F(x^{k_j})\to F_{\mathrm{opt}}\ \text{as}\ j\to\infty.\tag{10.28}$$
Since $F$ is closed, it is also lower semicontinuous, and hence $F(\tilde x)\le\lim_{j\to\infty}F(x^{k_j})=F_{\mathrm{opt}}$, implying that $\tilde x\in X^*$.
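Fejér monotonicity is easy to observe numerically on an instance where $x^*$ is known in closed form. The separable problem below is an assumption for the demo: with $f(x)=\frac{1}{2}\langle x,Dx\rangle-\langle c,x\rangle$ and $g=\lambda\|\cdot\|_1$, each coordinate decouples and $x^*_i=\mathcal{T}_\lambda(c_i)/d_i$.

```python
import numpy as np

# Check that the distances ||x^k - x*|| never increase (Fejer monotonicity)
# and that the iterates converge to x*.
d = np.array([1.0, 4.0])
c = np.array([3.0, -8.0])
lam = 1.0
L_f = 4.0                                          # max d_i
soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
x_star = soft(c, lam) / d                          # closed form: (2.0, -1.75)

x = np.zeros(2)
dists = [np.linalg.norm(x - x_star)]
for _ in range(100):
    x = soft(x - (d * x - c) / L_f, lam / L_f)     # prox-grad step, L_k = L_f
    dists.append(np.linalg.norm(x - x_star))
# dists is nonincreasing and x ends up (numerically) at x_star.
```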

To derive a complexity result for the proximal gradient method, we will assume that $\|x^0-x^*\|\le R$ for some $x^*\in X^*$ and some constant $R>0$; for example, if $\mathrm{dom}(g)$ is bounded, then $R$ might be taken as its diameter. By inequality (10.26) it follows that in order to obtain an $\varepsilon$-optimal solution of problem (10.1), it is enough to require that
$$\frac{\alpha L_f R^2}{2k}\le\varepsilon,$$
which is the same as
$$k\ge\frac{\alpha L_f R^2}{2\varepsilon}.$$
Thus, to obtain an $\varepsilon$-optimal solution, an order of $\frac{1}{\varepsilon}$ iterations is required, which is an improvement of the result for the projected subgradient method in which an order of $\frac{1}{\varepsilon^2}$ iterations is needed (see, for example, Theorem 8.18). We summarize the above observations in the following theorem.

Theorem 10.25 (complexity of the proximal gradient method). Under the setting of Theorem 10.21, for any $k$ satisfying
$$k\ge\frac{\alpha L_f R^2}{2\varepsilon},$$
it holds that $F(x^k)-F_{\mathrm{opt}}\le\varepsilon$, where $R$ is an upper bound on $\|x^*-x^0\|$ for some $x^*\in X^*$.

In the nonconvex case (meaning when $f$ is not necessarily convex), an $O(1/\sqrt{k})$ rate of convergence of the norm of the gradient mapping was established in Theorem 10.15(c). We will now show that with the additional convexity assumption on $f$, this rate can be improved to $O(1/k)$.

Theorem 10.26 ($O(1/k)$ rate of convergence of the minimal norm of the gradient mapping). Suppose that Assumption 10.1 holds and that in addition $f$ is convex. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$ or the backtracking procedure B2. Then for any $x^*\in X^*$ and $k\ge 1$,
$$\min_{n=0,1,\ldots,k}\|G_{\alpha L_f}(x^n)\|\le\frac{2\alpha^{1.5}L_f\|x^0-x^*\|}{\sqrt{\beta}\,k},\tag{10.29}$$
where $\alpha=\beta=1$ in the constant stepsize setting and $\alpha=\max\left\{\eta,\frac{s}{L_f}\right\}$, $\beta=\frac{s}{L_f}$ if the backtracking rule is employed.

Proof. By the sufficient decrease lemma (Corollary 10.18), for any $n\ge 0$,
$$F(x^n)-F(x^{n+1})=F(x^n)-F(T_{L_n}(x^n))\ge\frac{1}{2L_n}\|G_{L_n}(x^n)\|^2.\tag{10.30}$$
By Theorem 10.9 and the fact that $\beta L_f\le L_n\le\alpha L_f$ (see Remark 10.19), it follows that
$$\frac{1}{2L_n}\|G_{L_n}(x^n)\|^2=\frac{L_n}{2}\left(\frac{\|G_{L_n}(x^n)\|}{L_n}\right)^2\ge\frac{\beta L_f}{2}\left(\frac{\|G_{\alpha L_f}(x^n)\|}{\alpha L_f}\right)^2=\frac{\beta}{2\alpha^2 L_f}\|G_{\alpha L_f}(x^n)\|^2.\tag{10.31}$$

Therefore, combining (10.30) and (10.31),
$$F(x^n)-F_{\mathrm{opt}}\ge F(x^{n+1})-F_{\mathrm{opt}}+\frac{\beta}{2\alpha^2 L_f}\|G_{\alpha L_f}(x^n)\|^2.\tag{10.32}$$
Let $p$ be a positive integer. Summing (10.32) over $n=p,p+1,\ldots,2p-1$ yields
$$F(x^p)-F_{\mathrm{opt}}\ge F(x^{2p})-F_{\mathrm{opt}}+\frac{\beta}{2\alpha^2 L_f}\sum_{n=p}^{2p-1}\|G_{\alpha L_f}(x^n)\|^2.\tag{10.33}$$
By Theorem 10.21,
$$F(x^p)-F_{\mathrm{opt}}\le\frac{\alpha L_f\|x^0-x^*\|^2}{2p},$$
which, combined with the fact that $F(x^{2p})-F_{\mathrm{opt}}\ge 0$ and (10.33), implies
$$\frac{\beta p}{2\alpha^2 L_f}\min_{n=0,1,\ldots,2p-1}\|G_{\alpha L_f}(x^n)\|^2\le\frac{\beta}{2\alpha^2 L_f}\sum_{n=p}^{2p-1}\|G_{\alpha L_f}(x^n)\|^2\le\frac{\alpha L_f\|x^0-x^*\|^2}{2p}.$$
Thus,
$$\min_{n=0,1,\ldots,2p-1}\|G_{\alpha L_f}(x^n)\|^2\le\frac{\alpha^3 L_f^2\|x^0-x^*\|^2}{\beta p^2}\tag{10.34}$$
and also
$$\min_{n=0,1,\ldots,2p}\|G_{\alpha L_f}(x^n)\|^2\le\frac{\alpha^3 L_f^2\|x^0-x^*\|^2}{\beta p(p+1)}.\tag{10.35}$$
We conclude (treating odd and even $k$ via (10.34) and (10.35), respectively) that for any $k\ge 1$,
$$\min_{n=0,1,\ldots,k}\|G_{\alpha L_f}(x^n)\|^2\le\frac{4\alpha^3 L_f^2\|x^0-x^*\|^2}{\beta k^2},$$
and (10.29) follows by taking square roots.

When we assume further that $f$ is $L_f$-smooth over the entire space $\mathbb{E}$, we can use Lemma 10.12 to obtain an improved result in the case of a constant stepsize.

Theorem 10.27 ($O(1/k)$ rate of convergence of the norm of the gradient mapping under the constant stepsize rule). Suppose that Assumption 10.1 holds and that in addition $f$ is convex and $L_f$-smooth over $\mathbb{E}$. Let $\{x^k\}_{k\ge 0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge 0$. Then for any $x^*\in X^*$ and $k\ge 0$,

(a) $\|G_{L_f}(x^{k+1})\|\le\|G_{L_f}(x^k)\|$;

(b) $\|G_{L_f}(x^k)\|\le\frac{2L_f\|x^0-x^*\|}{k+1}$.

Proof. Invoking Lemma 10.12 with $x=x^k$, we obtain (a). Part (b) now follows by substituting $\alpha=\beta=1$ in the result of Theorem 10.26 and noting that by part (a), $\|G_{L_f}(x^k)\|=\min_{n=0,1,\ldots,k}\|G_{L_f}(x^n)\|$.
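Part (a) of Theorem 10.27 can be watched directly: along the iterates, the norms $\|G_{L_f}(x^k)\|=L_f\|x^k-T_{L_f}(x^k)\|=L_f\|x^k-x^{k+1}\|$ never increase. The separable instance below is an assumed demo problem, not from the text.

```python
import numpy as np

# Observe the monotone decay of ||G_{L_f}(x^k)|| along proximal gradient
# iterations on f(x) = 0.5*<x, Dx> - <c, x>, g = lam*||x||_1.
d = np.array([1.0, 4.0])
c = np.array([3.0, -8.0])
lam, L_f = 1.0, 4.0
soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
step = lambda x: soft(x - (d * x - c) / L_f, lam / L_f)    # T_{L_f}(x)

x = np.zeros(2)
g_norms = []
for _ in range(60):
    x_next = step(x)
    g_norms.append(L_f * np.linalg.norm(x - x_next))       # ||G_{L_f}(x^k)||
    x = x_next
# g_norms is nonincreasing and tends to zero.
```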

10.5 The Proximal Point Method

Consider the problem

$$\min_{x\in\mathbb{E}} g(x), \qquad(10.36)$$

where $g:\mathbb{E}\to(-\infty,\infty]$ is a proper closed and convex function. Problem (10.36) is actually a special case of the composite problem (10.1) with $f\equiv0$. The update step of the proximal gradient method in this case takes the form $x^{k+1}=\operatorname{prox}_{\frac{1}{L}g}(x^k)$. Taking $L=\frac{1}{c}$ for some $c>0$, we obtain the proximal point method.

The Proximal Point Method

Initialization: pick $x^0\in\mathbb{E}$ and $c>0$.
General step ($k\ge0$): $x^{k+1}=\operatorname{prox}_{cg}(x^k)$.

The proximal point method is actually not a practical algorithm, since the general step asks to minimize the function $g(x)+\frac{1}{2c}\|x-x^k\|^2$, which in general is as hard to accomplish as solving the original problem of minimizing $g$. Since the proximal point method is a special case of the proximal gradient method, we can deduce its main convergence results from the corresponding results on the proximal gradient method. Specifically, since the smooth part $f\equiv0$ is $0$-smooth, we can take any constant stepsize to guarantee convergence, and Theorems 10.21 and 10.24 imply the following result.

Theorem 10.28 (convergence of the proximal point method). Let $g:\mathbb{E}\to(-\infty,\infty]$ be a proper closed and convex function. Assume that the problem

$$\min_{x\in\mathbb{E}} g(x)$$

has a nonempty optimal set $X^*$, and let the optimal value be given by $g_{\rm opt}$. Let $\{x^k\}_{k\ge0}$ be the sequence generated by the proximal point method with parameter $c>0$. Then

(a) $g(x^k)-g_{\rm opt} \le \frac{\|x^0-x^*\|^2}{2ck}$ for any $x^*\in X^*$ and $k\ge1$;

(b) the sequence $\{x^k\}_{k\ge0}$ converges to some point in $X^*$.

10.6 Convergence of the Proximal Gradient Method: The Strongly Convex Case

In the case where $f$ is assumed to be $\sigma$-strongly convex for some $\sigma>0$, the sublinear rate of convergence can be improved into a linear rate of convergence, meaning a rate of the form $O(q^k)$ for some $q\in(0,1)$. Throughout the analysis of the strongly convex case we denote the unique optimal solution of problem (10.1) by $x^*$.
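As a concrete toy instance (my own illustration, not from the book), take $g(x)=\|x\|_1$, whose prox is the soft-thresholding operator. Each proximal point step then shrinks every component toward zero by $c$, so the iterates reach the minimizer $x^*=0$ after finitely many steps:

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# proximal point method on g(x) = ||x||_1, whose minimizer is x* = 0;
# prox_{c g} is soft thresholding with parameter c (data below are illustrative)
c = 0.1
x = np.array([1.0, -0.35, 2.4])
for _ in range(50):
    x = soft_threshold(x, c)   # x^{k+1} = prox_{cg}(x^k)

assert np.allclose(x, 0.0)     # g(x^k) -> g_opt = 0, consistent with Theorem 10.28
```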

Theorem 10.29 (linear rate of convergence of the proximal gradient method: the strongly convex case). Suppose that Assumption 10.1 holds and that in addition $f$ is $\sigma$-strongly convex ($\sigma>0$). Let $\{x^k\}_{k\ge0}$ be the sequence generated by the proximal gradient method for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge0$ or the backtracking procedure B2. Let

$$\alpha=\begin{cases}1, & \text{constant stepsize},\\[2pt] \max\left\{\eta,\frac{s}{L_f}\right\}, & \text{backtracking}.\end{cases}$$

Then for any $k\ge0$,

(a) $\|x^{k+1}-x^*\|^2 \le \left(1-\frac{\sigma}{\alpha L_f}\right)\|x^k-x^*\|^2$;

(b) $\|x^k-x^*\|^2 \le \left(1-\frac{\sigma}{\alpha L_f}\right)^k\|x^0-x^*\|^2$;

(c) $F(x^{k+1})-F_{\rm opt} \le \frac{\alpha L_f}{2}\left(1-\frac{\sigma}{\alpha L_f}\right)^{k+1}\|x^0-x^*\|^2$.

Proof. Plugging $L=L_k$, $x=x^*$, and $y=x^k$ into the fundamental prox-grad inequality (10.21) and taking into account the fact that in both stepsize rules condition (10.20) is satisfied, we obtain

$$F(x^*)-F(x^{k+1}) \ge \frac{L_k}{2}\|x^*-x^{k+1}\|^2 - \frac{L_k}{2}\|x^*-x^k\|^2 + \ell_f(x^*,x^k).$$

Since $f$ is $\sigma$-strongly convex, it follows by Theorem 5.24(ii) that

$$\ell_f(x^*,x^k) = f(x^*)-f(x^k)-\langle\nabla f(x^k),x^*-x^k\rangle \ge \frac{\sigma}{2}\|x^*-x^k\|^2.$$

Thus,

$$F(x^*)-F(x^{k+1}) \ge \frac{L_k}{2}\|x^*-x^{k+1}\|^2 - \frac{L_k-\sigma}{2}\|x^*-x^k\|^2. \qquad(10.37)$$

Since $x^*$ is a minimizer of $F$, $F(x^*)-F(x^{k+1})\le0$, and hence, by (10.37) and the fact that $L_k\le\alpha L_f$ (see Remark 10.19),

$$\|x^{k+1}-x^*\|^2 \le \left(1-\frac{\sigma}{L_k}\right)\|x^k-x^*\|^2 \le \left(1-\frac{\sigma}{\alpha L_f}\right)\|x^k-x^*\|^2,$$

establishing part (a). Part (b) follows immediately by (a). To prove (c), note that by (10.37),

$$F(x^{k+1})-F_{\rm opt} \le \frac{L_k-\sigma}{2}\|x^k-x^*\|^2 - \frac{L_k}{2}\|x^{k+1}-x^*\|^2 \le \frac{\alpha L_f}{2}\left(1-\frac{\sigma}{\alpha L_f}\right)\|x^k-x^*\|^2 \le \frac{\alpha L_f}{2}\left(1-\frac{\sigma}{\alpha L_f}\right)^{k+1}\|x^0-x^*\|^2,$$

where part (b) was used in the last inequality.
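The contraction in part (a) can be observed numerically. In the sketch below (synthetic data of my own choosing, not the book's example), $f(x)=\frac12\|Ax-b\|^2$ with a tall full-column-rank $A$, so $f$ is $\sigma$-strongly convex with $\sigma=\lambda_{\min}(A^TA)$, and each proximal gradient step with constant stepsize $1/L_f$ contracts the distance to the (numerically approximated) unique minimizer by at least $\sqrt{1-\sigma/L_f}$:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((80, 20))      # tall matrix -> f strongly convex (illustrative data)
b = rng.standard_normal(80)
lam = 0.5
H = A.T @ A
Lf = np.linalg.norm(H, 2)              # = lambda_max(H), smoothness constant
sigma = np.linalg.eigvalsh(H)[0]       # = lambda_min(H) > 0, strong convexity constant

def step(x):
    # proximal gradient step on f + lam*||.||_1 with stepsize 1/Lf
    return soft_threshold(x - A.T @ (A @ x - b) / Lf, lam / Lf)

# approximate the unique minimizer x* by running the (linearly convergent) method long
xstar = np.zeros(20)
for _ in range(5000):
    xstar = step(xstar)

# verify the contraction ||x^{k+1}-x*|| <= sqrt(1 - sigma/Lf) * ||x^k-x*||
q = np.sqrt(1.0 - sigma / Lf)
x = np.ones(20)
for _ in range(30):
    d_old = np.linalg.norm(x - xstar)
    x = step(x)
    assert np.linalg.norm(x - xstar) <= q * d_old + 1e-7
```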

Theorem 10.29 immediately implies that in the strongly convex case, the proximal gradient method requires an order of $\log\left(\frac{1}{\varepsilon}\right)$ iterations to obtain an $\varepsilon$-optimal solution.

Theorem 10.30 (complexity of the proximal gradient method: the strongly convex case). Under the setting of Theorem 10.29, for any $k\ge1$ satisfying

$$k \ge \alpha\kappa\log\left(\frac{1}{\varepsilon}\right)+\alpha\kappa\log\left(\frac{\alpha L_f R^2}{2}\right),$$

it holds that $F(x^k)-F_{\rm opt}\le\varepsilon$, where $R$ is an upper bound on $\|x^0-x^*\|$ and $\kappa=\frac{L_f}{\sigma}$.

Proof. Let $k\ge1$. By Theorem 10.29 and the definition of $\kappa$, a sufficient condition for the inequality $F(x^k)-F_{\rm opt}\le\varepsilon$ to hold is that

$$\frac{\alpha L_f}{2}\left(1-\frac{1}{\alpha\kappa}\right)^k R^2 \le \varepsilon,$$

which is the same as

$$-k\log\left(1-\frac{1}{\alpha\kappa}\right) \ge \log\left(\frac{1}{\varepsilon}\right)+\log\left(\frac{\alpha L_f R^2}{2}\right). \qquad(10.38)$$

Since $\log(1-x)\le -x$ for any$^{57}$ $x\le1$, it follows that a sufficient condition for (10.38) to hold is that

$$k\cdot\frac{1}{\alpha\kappa} \ge \log\left(\frac{1}{\varepsilon}\right)+\log\left(\frac{\alpha L_f R^2}{2}\right),$$

namely, that

$$k \ge \alpha\kappa\log\left(\frac{1}{\varepsilon}\right)+\alpha\kappa\log\left(\frac{\alpha L_f R^2}{2}\right).$$

10.7 The Fast Proximal Gradient Method (FISTA)

10.7.1 The Method

The proximal gradient method achieves an $O(1/k)$ rate of convergence in function values to the optimal value. In this section we will show how to accelerate the method in order to obtain a rate of $O(1/k^2)$ in function values. The method is known as "the fast proximal gradient method," but we will also refer to it as "FISTA," which is an acronym for "fast iterative shrinkage-thresholding algorithm"; see Example 10.37 for further explanations. The method was devised and analyzed by Beck and Teboulle in the paper [18], from which the convergence analysis is taken. We will assume that $f$ is convex and that it is $L_f$-smooth, meaning that it is $L_f$-smooth over the entire space $\mathbb{E}$. We gather all the required properties in the following assumption.

$^{57}$The inequality also holds for $x=1$ since in that case the left-hand side is $-\infty$.
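The two complexity bounds are easy to compare numerically. The helper functions below are illustrative (the names and the sample constants are my own choices); they evaluate the iteration counts prescribed by Theorem 10.25 and Theorem 10.30 for given $L_f$, $\sigma$, $R$, and $\varepsilon$, showing how much milder the $O(\log(1/\varepsilon))$ dependence is:

```python
import math

def pg_iters(Lf, R, eps, alpha=1.0):
    # O(1/eps) bound of Theorem 10.25: k >= alpha*Lf*R^2 / (2*eps)
    return math.ceil(alpha * Lf * R**2 / (2 * eps))

def pg_iters_strongly_convex(Lf, sigma, R, eps, alpha=1.0):
    # O(log(1/eps)) bound of Theorem 10.30 with condition number kappa = Lf/sigma
    kappa = Lf / sigma
    return math.ceil(alpha * kappa * (math.log(1 / eps) + math.log(alpha * Lf * R**2 / 2)))

# illustrative numbers: Lf = 100, R = 10, eps = 1e-4, sigma = 1
print(pg_iters(100, 10, 1e-4))                     # 50000000
print(pg_iters_strongly_convex(100, 1, 10, 1e-4))  # 1773
```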

Assumption 10.31.

(A) $g:\mathbb{E}\to(-\infty,\infty]$ is proper closed and convex.

(B) $f:\mathbb{E}\to\mathbb{R}$ is $L_f$-smooth and convex.

(C) The optimal set of problem (10.1) is nonempty and denoted by $X^*$. The optimal value of the problem is denoted by $F_{\rm opt}$.

The description of FISTA now follows.

FISTA

Input: $(f,g,x^0)$, where $f$ and $g$ satisfy properties (A) and (B) in Assumption 10.31 and $x^0\in\mathbb{E}$.
Initialization: set $y^0=x^0$ and $t_0=1$.
General step: for any $k=0,1,2,\ldots$ execute the following steps:

(a) pick $L_k>0$;

(b) set $x^{k+1}=\operatorname{prox}_{\frac{1}{L_k}g}\left(y^k-\frac{1}{L_k}\nabla f(y^k)\right)$;

(c) set $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;

(d) compute $y^{k+1}=x^{k+1}+\left(\frac{t_k-1}{t_{k+1}}\right)(x^{k+1}-x^k)$.

As usual, we will consider two options for the choice of $L_k$: constant and backtracking. The backtracking procedure for choosing the stepsize is referred to as "backtracking procedure B3" and is identical to procedure B2 with the sole difference that it is invoked on the vector $y^k$ rather than on $x^k$.

Constant. $L_k\equiv L_f$ for all $k$.

Backtracking procedure B3. The procedure requires two parameters $(s,\eta)$, where $s>0$ and $\eta>1$. Define $L_{-1}=s$. At iteration $k$ ($k\ge0$) the choice of $L_k$ is done as follows: First, $L_k$ is set to be equal to $L_{k-1}$. Then, while (recall that $T_L(y)\equiv T_L^{f,g}(y)=\operatorname{prox}_{\frac{1}{L}g}\left(y-\frac{1}{L}\nabla f(y)\right)$)

$$f(T_{L_k}(y^k)) > f(y^k)+\langle\nabla f(y^k),T_{L_k}(y^k)-y^k\rangle+\frac{L_k}{2}\|T_{L_k}(y^k)-y^k\|^2,$$

we set $L_k:=\eta L_k$. In other words, the stepsize is chosen as $L_k=L_{k-1}\eta^{i_k}$, where $i_k$ is the smallest nonnegative integer for which the condition

$$f(T_{L_{k-1}\eta^{i_k}}(y^k)) \le f(y^k)+\langle\nabla f(y^k),T_{L_{k-1}\eta^{i_k}}(y^k)-y^k\rangle+\frac{L_{k-1}\eta^{i_k}}{2}\|T_{L_{k-1}\eta^{i_k}}(y^k)-y^k\|^2$$

is satisfied.
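A minimal sketch of FISTA with backtracking procedure B3, on an $\ell_1$-regularized least squares instance with made-up data (the problem, sizes, and parameter values are illustrative assumptions, not the book's): $L_k$ starts at $L_{k-1}$ and is multiplied by $\eta$ until the descent condition holds at $y^k$, so the accepted value never exceeds $\max\{s,\eta L_f\}$.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 50))
b = rng.standard_normal(30)
lam = 1.0
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
F = lambda x: f(x) + lam * np.sum(np.abs(x))

def T(L, y, g):
    # prox-grad operator T_L(y) = prox_{(1/L)g}(y - grad f(y)/L)
    return soft_threshold(y - g / L, lam / L)

s, eta = 1.0, 2.0
L = s                                   # L_{-1} = s
x = np.zeros(50)
y = x.copy()
t = 1.0
for _ in range(200):
    g = grad(y)
    # backtracking procedure B3: increase L until the descent condition holds at y^k
    while True:
        z = T(L, y, g)
        if f(z) <= f(y) + g @ (z - y) + (L / 2) * np.sum((z - y) ** 2):
            break
        L *= eta
    x_new = z
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x_new + ((t - 1) / t_new) * (x_new - x)
    x, t = x_new, t_new
```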

In both stepsize rules, the following inequality is satisfied for any $k\ge0$:

$$f(T_{L_k}(y^k)) \le f(y^k)+\langle\nabla f(y^k),T_{L_k}(y^k)-y^k\rangle+\frac{L_k}{2}\|T_{L_k}(y^k)-y^k\|^2. \qquad(10.39)$$

Remark 10.32. Since the backtracking procedure B3 is identical to the B2 procedure (only employed on $y^k$), the arguments of Remark 10.19 are still valid, and we have that

$$\beta L_f \le L_k \le \alpha L_f,$$

where $\alpha$ and $\beta$ are given in (10.25).

The next lemma shows an important lower bound on the sequence $\{t_k\}_{k\ge0}$ that will be used in the convergence proof.

Lemma 10.33. Let $\{t_k\}_{k\ge0}$ be the sequence defined by

$$t_0=1,\qquad t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2},\quad k\ge0.$$

Then $t_k\ge\frac{k+2}{2}$ for all $k\ge0$.

Proof. The proof is by induction on $k$. Obviously, for $k=0$, $t_0=1\ge\frac{0+2}{2}$. Suppose that the claim holds for $k$, meaning $t_k\ge\frac{k+2}{2}$. We will prove that $t_{k+1}\ge\frac{k+3}{2}$. By the recursive relation defining the sequence and the induction assumption,

$$t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2} \ge \frac{1+\sqrt{4t_k^2}}{2} = \frac{1+2t_k}{2} \ge \frac{1+(k+2)}{2} = \frac{k+3}{2}.$$

10.7.2 Convergence Analysis of FISTA

Theorem 10.34 ($O(1/k^2)$ rate of convergence of FISTA). Suppose that Assumption 10.31 holds. Let $\{x^k\}_{k\ge0}$ be the sequence generated by FISTA for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge0$ or the backtracking procedure B3. Then for any $x^*\in X^*$ and $k\ge1$,

$$F(x^k)-F_{\rm opt} \le \frac{2\alpha L_f\|x^0-x^*\|^2}{(k+1)^2},$$

where $\alpha=1$ in the constant stepsize setting and $\alpha=\max\left\{\eta,\frac{s}{L_f}\right\}$ if the backtracking rule is employed.

Proof. Let $k\ge1$. Substituting $x=t_k^{-1}x^*+(1-t_k^{-1})x^k$, $y=y^k$, and $L=L_k$ in the fundamental prox-grad inequality (10.21), taking into account that inequality

(10.39) is satisfied and that $f$ is convex, we obtain that

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k)-F(x^{k+1}) \ge \frac{L_k}{2}\|x^{k+1}-t_k^{-1}x^*-(1-t_k^{-1})x^k\|^2 - \frac{L_k}{2}\|y^k-t_k^{-1}x^*-(1-t_k^{-1})x^k\|^2$$
$$= \frac{L_k}{2t_k^2}\|t_kx^{k+1}-(x^*+(t_k-1)x^k)\|^2 - \frac{L_k}{2t_k^2}\|t_ky^k-(x^*+(t_k-1)x^k)\|^2. \qquad(10.40)$$

By the convexity of $F$,

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k) \le t_k^{-1}F(x^*)+(1-t_k^{-1})F(x^k).$$

Therefore, using the notation $v_n\equiv F(x^n)-F_{\rm opt}$ for any $n\ge0$,

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k)-F(x^{k+1}) \le (1-t_k^{-1})(F(x^k)-F(x^*))-(F(x^{k+1})-F(x^*)) = (1-t_k^{-1})v_k-v_{k+1}. \qquad(10.41)$$

On the other hand, using the relation $y^k=x^k+\left(\frac{t_{k-1}-1}{t_k}\right)(x^k-x^{k-1})$,

$$t_ky^k-(x^*+(t_k-1)x^k) = t_kx^k+(t_{k-1}-1)(x^k-x^{k-1})-(x^*+(t_k-1)x^k) = t_{k-1}x^k-(x^*+(t_{k-1}-1)x^{k-1}). \qquad(10.42)$$

Combining (10.40), (10.41), and (10.42), we obtain that

$$(t_k^2-t_k)v_k - t_k^2v_{k+1} \ge \frac{L_k}{2}\|u^{k+1}\|^2-\frac{L_k}{2}\|u^k\|^2,$$

where we use the notation $u^n\equiv t_{n-1}x^n-(x^*+(t_{n-1}-1)x^{n-1})$ for any $n\ge1$. By the update rule of $t_{k+1}$, we have $t_k^2-t_k=t_{k-1}^2$, and hence

$$\frac{2}{L_k}t_{k-1}^2v_k - \frac{2}{L_k}t_k^2v_{k+1} \ge \|u^{k+1}\|^2-\|u^k\|^2.$$

Since $L_k\ge L_{k-1}$, we can conclude that

$$\frac{2}{L_{k-1}}t_{k-1}^2v_k - \frac{2}{L_k}t_k^2v_{k+1} \ge \|u^{k+1}\|^2-\|u^k\|^2.$$

Thus,

$$\|u^{k+1}\|^2+\frac{2}{L_k}t_k^2v_{k+1} \le \|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k,$$

and hence, for any $k\ge1$,

$$\|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|u^1\|^2+\frac{2}{L_0}t_0^2v_1 = \|x^1-x^*\|^2+\frac{2}{L_0}(F(x^1)-F_{\rm opt}). \qquad(10.43)$$

Substituting $x=x^*$, $y=y^0$, and $L=L_0$ in the fundamental prox-grad inequality (10.21), taking into account the convexity of $f$, yields

$$\frac{2}{L_0}(F(x^*)-F(x^1)) \ge \|x^1-x^*\|^2-\|y^0-x^*\|^2,$$

which, along with the fact that $y^0=x^0$, implies the bound

$$\|x^1-x^*\|^2+\frac{2}{L_0}(F(x^1)-F_{\rm opt}) \le \|x^0-x^*\|^2.$$

Combining the last inequality with (10.43), we get

$$\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|x^0-x^*\|^2.$$

Thus, using the bound $L_{k-1}\le\alpha L_f$, the definition of $v_k$, and Lemma 10.33,

$$F(x^k)-F_{\rm opt} \le \frac{L_{k-1}\|x^0-x^*\|^2}{2t_{k-1}^2} \le \frac{2\alpha L_f\|x^0-x^*\|^2}{(k+1)^2}.$$

Remark 10.35 (alternative choice for $t_k$). A close inspection of the proof of Theorem 10.34 reveals that the result is correct if $\{t_k\}_{k\ge0}$ is any sequence satisfying the following two properties for any $k\ge0$: (a) $t_k\ge\frac{k+2}{2}$; (b) $t_{k+1}^2-t_{k+1}\le t_k^2$. The choice $t_k=\frac{k+2}{2}$ also satisfies these two properties. The validity of (a) is obvious; to show (b), note that

$$t_{k+1}^2-t_{k+1} = t_{k+1}(t_{k+1}-1) = \frac{k+3}{2}\cdot\frac{k+1}{2} = \frac{k^2+4k+3}{4} \le \frac{k^2+4k+4}{4} = \left(\frac{k+2}{2}\right)^2 = t_k^2.$$

Remark 10.36. Note that FISTA has an $O(1/k^2)$ rate of convergence in function values, while the proximal gradient method has an $O(1/k)$ rate of convergence. This improvement was achieved despite the fact that the dominant computational steps at each iteration of both methods are essentially the same: one gradient evaluation and one prox computation.

10.7.3 Examples

Example 10.37. Consider the following model, which was already discussed in Example 10.2:

$$\min_{x\in\mathbb{R}^n} f(x)+\lambda\|x\|_1,$$

where $\lambda>0$ and $f:\mathbb{R}^n\to\mathbb{R}$ is assumed to be convex and $L_f$-smooth. The update formula of the proximal gradient method with constant stepsize $\frac{1}{L_f}$ has the form

$$x^{k+1} = \mathcal{T}_{\frac{\lambda}{L_f}}\left(x^k-\frac{1}{L_f}\nabla f(x^k)\right).$$

As was already noted in Example 10.3, since at each iteration one shrinkage/soft-thresholding operation is performed, this method is also known as the iterative shrinkage-thresholding algorithm (ISTA). The general update step of the accelerated proximal gradient method discussed in this section takes the following form:

(a) set $x^{k+1}=\mathcal{T}_{\frac{\lambda}{L_f}}\left(y^k-\frac{1}{L_f}\nabla f(y^k)\right)$;

(b) set $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;

(c) compute $y^{k+1}=x^{k+1}+\left(\frac{t_k-1}{t_{k+1}}\right)(x^{k+1}-x^k)$.

The above scheme truly deserves to be called "fast iterative shrinkage-thresholding algorithm" (FISTA), since it is an accelerated method that performs a thresholding step at each iteration. In this book we adopt the convention and use the acronym FISTA as the name of the fast proximal gradient method for a general nonsmooth part $g$.

Example 10.38 ($\ell_1$-regularized least squares). As a special instance of Example 10.37, consider the problem

$$\min_{x\in\mathbb{R}^n} \frac{1}{2}\|Ax-b\|_2^2+\lambda\|x\|_1, \qquad(10.44)$$

where $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^m$, and $\lambda>0$. The problem fits model (10.1) with $f(x)=\frac{1}{2}\|Ax-b\|_2^2$ and $g(x)=\lambda\|x\|_1$. The function $f$ is $L_f$-smooth with $L_f=\|A^TA\|_{2,2}=\lambda_{\max}(A^TA)$ (see Example 5.2). The update step of FISTA has the following form:

(a) set $x^{k+1}=\mathcal{T}_{\frac{\lambda}{L_k}}\left(y^k-\frac{1}{L_k}A^T(Ay^k-b)\right)$;

(b) set $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;

(c) compute $y^{k+1}=x^{k+1}+\left(\frac{t_k-1}{t_{k+1}}\right)(x^{k+1}-x^k)$.

The update step of the proximal gradient method, which in this case is the same as ISTA, is

$$x^{k+1}=\mathcal{T}_{\frac{\lambda}{L_k}}\left(x^k-\frac{1}{L_k}A^T(Ax^k-b)\right).$$

The stepsizes in both methods can be chosen to be the constant $L_k\equiv\lambda_{\max}(A^TA)$. To illustrate the difference in the actual performance of ISTA and FISTA, we generated an instance of the problem with $\lambda=1$. The components of $A$ were independently generated using a standard normal distribution. The true vector is $x_{\rm true}=e_3-e_7$, and $b$ was chosen as $b=Ax_{\rm true}$. We ran 200 iterations of ISTA and FISTA in order to solve problem (10.44) with initial vector $x^0=e$, the vector of all ones. It is well known that the $\ell_1$-norm element in the objective function is a regularizer that promotes sparsity, and we thus expect that the optimal solution of (10.44) will be close to the true sparse vector $x_{\rm true}$. The distances to optimality in terms of function values of the sequences generated by the two methods as a function of the iteration index are plotted in Figure 10.1, where it is apparent that FISTA is far superior to ISTA. In Figure 10.2 we plot the vectors that were obtained by the two methods. Obviously, the solution produced by 200 iterations of FISTA is much closer to the optimal solution (which is very close to $e_3-e_7$) than the solution obtained after 200 iterations of ISTA.
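The experiment can be reproduced in spirit with a few lines of NumPy. The sketch below uses synthetic stand-in data (the matrix sizes and the random seed are my own choices; the book's exact instance is not reproduced) and compares the objective values reached by ISTA and FISTA after the same budget of 200 iterations:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# synthetic stand-in for the experiment in Example 10.38 (sizes are illustrative)
rng = np.random.default_rng(3)
m, n = 100, 110
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[2], x_true[6] = 1.0, -1.0        # e_3 - e_7 in 1-based indexing
b = A @ x_true
lam = 1.0
L = np.linalg.norm(A.T @ A, 2)          # constant stepsize parameter lambda_max(A^T A)
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def ista(x0, iters):
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

def fista(x0, iters):
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

x0 = np.ones(n)                          # the vector of all ones, as in the text
xi, xf = ista(x0, 200), fista(x0, 200)
```

After the same iteration budget, `F(xf)` (FISTA) is expected to be far closer to the optimal value than `F(xi)` (ISTA), mirroring Figure 10.1.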

Figure 10.1. Results of 200 iterations of ISTA and FISTA on an $\ell_1$-regularized least squares problem.

Figure 10.2. Solutions obtained by ISTA (left) and FISTA (right).

10.7.4 MFISTA

FISTA is not a monotone method, meaning that the sequence of function values it produces is not necessarily nonincreasing. It is possible to define a monotone version of FISTA, which we call MFISTA,$^{58}$ and which is a descent method that at the same time preserves the same rate of convergence as FISTA.

$^{58}$MFISTA and its convergence analysis are from the work of Beck and Teboulle [17].

MFISTA

Input: $(f,g,x^0)$, where $f$ and $g$ satisfy properties (A) and (B) in Assumption 10.31 and $x^0\in\mathbb{E}$.
Initialization: set $y^0=x^0$ and $t_0=1$.
General step: for any $k=0,1,2,\ldots$ execute the following steps:

(a) pick $L_k>0$;

(b) set $z^k=\operatorname{prox}_{\frac{1}{L_k}g}\left(y^k-\frac{1}{L_k}\nabla f(y^k)\right)$;

(c) choose $x^{k+1}\in\mathbb{E}$ such that $F(x^{k+1})\le\min\{F(z^k),F(x^k)\}$;

(d) set $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;

(e) compute $y^{k+1}=x^{k+1}+\left(\frac{t_k}{t_{k+1}}\right)(z^k-x^{k+1})+\left(\frac{t_k-1}{t_{k+1}}\right)(x^{k+1}-x^k)$.

Remark 10.39. The choice $x^{k+1}\in\operatorname{argmin}\{F(x):x\in\{x^k,z^k\}\}$ is a very simple rule ensuring the condition $F(x^{k+1})\le\min\{F(z^k),F(x^k)\}$. We also note that the convergence established in Theorem 10.40 only requires the condition $F(x^{k+1})\le F(z^k)$.

The convergence result of MFISTA, whose proof is a minor adjustment of the proof of Theorem 10.34, is given below.

Theorem 10.40 ($O(1/k^2)$ rate of convergence of MFISTA). Suppose that Assumption 10.31 holds. Let $\{x^k\}_{k\ge0}$ be the sequence generated by MFISTA for solving problem (10.1) with either a constant stepsize rule in which $L_k\equiv L_f$ for all $k\ge0$ or the backtracking procedure B3. Then for any $x^*\in X^*$ and $k\ge1$,

$$F(x^k)-F_{\rm opt} \le \frac{2\alpha L_f\|x^0-x^*\|^2}{(k+1)^2},$$

where $\alpha=1$ in the constant stepsize setting and $\alpha=\max\left\{\eta,\frac{s}{L_f}\right\}$ if the backtracking rule is employed.

Proof. Let $k\ge1$. Substituting $x=t_k^{-1}x^*+(1-t_k^{-1})x^k$, $y=y^k$, and $L=L_k$ in the fundamental prox-grad inequality (10.21), taking into account that inequality (10.39) is satisfied and that $f$ is convex, we obtain that

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k)-F(z^k) \ge \frac{L_k}{2}\|z^k-t_k^{-1}x^*-(1-t_k^{-1})x^k\|^2 - \frac{L_k}{2}\|y^k-t_k^{-1}x^*-(1-t_k^{-1})x^k\|^2$$
$$= \frac{L_k}{2t_k^2}\|t_kz^k-(x^*+(t_k-1)x^k)\|^2 - \frac{L_k}{2t_k^2}\|t_ky^k-(x^*+(t_k-1)x^k)\|^2. \qquad(10.45)$$

By the convexity of $F$,

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k) \le t_k^{-1}F(x^*)+(1-t_k^{-1})F(x^k).$$

Therefore, using the notation $v_n\equiv F(x^n)-F_{\rm opt}$ for any $n\ge0$ and the fact that $F(x^{k+1})\le F(z^k)$, it follows that

$$F(t_k^{-1}x^*+(1-t_k^{-1})x^k)-F(z^k) \le (1-t_k^{-1})(F(x^k)-F(x^*))-(F(x^{k+1})-F(x^*)) = (1-t_k^{-1})v_k-v_{k+1}. \qquad(10.46)$$

On the other hand, using the relation $y^k=x^k+\left(\frac{t_{k-1}}{t_k}\right)(z^{k-1}-x^k)+\left(\frac{t_{k-1}-1}{t_k}\right)(x^k-x^{k-1})$, we have

$$t_ky^k-(x^*+(t_k-1)x^k) = t_{k-1}z^{k-1}-(x^*+(t_{k-1}-1)x^{k-1}). \qquad(10.47)$$

Combining (10.45), (10.46), and (10.47), we obtain that

$$(t_k^2-t_k)v_k - t_k^2v_{k+1} \ge \frac{L_k}{2}\|u^{k+1}\|^2-\frac{L_k}{2}\|u^k\|^2,$$

where we use the notation $u^n\equiv t_{n-1}z^{n-1}-(x^*+(t_{n-1}-1)x^{n-1})$ for any $n\ge1$. By the update rule of $t_{k+1}$, we have $t_k^2-t_k=t_{k-1}^2$, and hence

$$\frac{2}{L_k}t_{k-1}^2v_k-\frac{2}{L_k}t_k^2v_{k+1} \ge \|u^{k+1}\|^2-\|u^k\|^2.$$

Since $L_k\ge L_{k-1}$, we can conclude that

$$\frac{2}{L_{k-1}}t_{k-1}^2v_k-\frac{2}{L_k}t_k^2v_{k+1} \ge \|u^{k+1}\|^2-\|u^k\|^2.$$

Thus,

$$\|u^{k+1}\|^2+\frac{2}{L_k}t_k^2v_{k+1} \le \|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k,$$

and hence, for any $k\ge1$,

$$\|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|u^1\|^2+\frac{2}{L_0}t_0^2v_1 = \|z^0-x^*\|^2+\frac{2}{L_0}(F(x^1)-F_{\rm opt}). \qquad(10.48)$$

Substituting $x=x^*$, $y=y^0$, and $L=L_0$ in the fundamental prox-grad inequality (10.21), taking into account the convexity of $f$, yields

$$\frac{2}{L_0}(F(x^*)-F(z^0)) \ge \|z^0-x^*\|^2-\|y^0-x^*\|^2,$$

which, along with the facts that $y^0=x^0$ and $F(x^1)\le F(z^0)$, implies the bound

$$\|z^0-x^*\|^2+\frac{2}{L_0}(F(x^1)-F_{\rm opt}) \le \|x^0-x^*\|^2.$$

Combining the last inequality with (10.48), we get

$$\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|u^k\|^2+\frac{2}{L_{k-1}}t_{k-1}^2v_k \le \|x^0-x^*\|^2.$$

Thus, using the bound $L_{k-1}\le\alpha L_f$, the definition of $v_k$, and Lemma 10.33,

$$F(x^k)-F_{\rm opt} \le \frac{L_{k-1}\|x^0-x^*\|^2}{2t_{k-1}^2} \le \frac{2\alpha L_f\|x^0-x^*\|^2}{(k+1)^2}.$$
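A minimal MFISTA sketch (with illustrative data of my own, not the book's), using the simple monotone rule of Remark 10.39, $x^{k+1}\in\operatorname{argmin}\{F(x):x\in\{x^k,z^k\}\}$, and the constant stepsize $L_k\equiv L_f$; by construction the recorded function values never increase:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# MFISTA on l1-regularized least squares (made-up data), monotone rule of Remark 10.39
rng = np.random.default_rng(4)
A = rng.standard_normal((40, 60))
b = rng.standard_normal(40)
lam = 1.0
L = np.linalg.norm(A.T @ A, 2)          # constant stepsize parameter L_k = L_f
F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(60)
y = x.copy()
t = 1.0
vals = [F(x)]
for _ in range(100):
    z = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)   # z^k = T_L(y^k)
    x_new = z if F(z) <= F(x) else x                          # monotone choice of x^{k+1}
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x_new + (t / t_new) * (z - x_new) + ((t - 1) / t_new) * (x_new - x)
    x, t = x_new, t_new
    vals.append(F(x))
```

Unlike plain FISTA, the sequence `vals` is nonincreasing, while the $O(1/k^2)$ rate of Theorem 10.40 is retained.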


More information

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1 EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex

More information

c 2013 Society for Industrial and Applied Mathematics

c 2013 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 3, No., pp. 109 115 c 013 Society for Industrial and Applied Mathematics AN ACCELERATED HYBRID PROXIMAL EXTRAGRADIENT METHOD FOR CONVEX OPTIMIZATION AND ITS IMPLICATIONS TO SECOND-ORDER

More information

An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1

An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1 An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1 Nobuo Yamashita 2, Christian Kanzow 3, Tomoyui Morimoto 2, and Masao Fuushima 2 2 Department of Applied

More information

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS

PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS Fixed Point Theory, 15(2014), No. 2, 399-426 http://www.math.ubbcluj.ro/ nodeacj/sfptcj.html PROPERTIES OF A CLASS OF APPROXIMATELY SHRINKING OPERATORS AND THEIR APPLICATIONS ANDRZEJ CEGIELSKI AND RAFA

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

Lecture 5: September 15

Lecture 5: September 15 10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 15 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Di Jin, Mengdi Wang, Bin Deng Note: LaTeX template courtesy of UC Berkeley EECS

More information

Hedy Attouch, Jérôme Bolte, Benar Svaiter. To cite this version: HAL Id: hal

Hedy Attouch, Jérôme Bolte, Benar Svaiter. To cite this version: HAL Id: hal Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods Hedy Attouch, Jérôme Bolte, Benar Svaiter To cite

More information

Optimal Newton-type methods for nonconvex smooth optimization problems

Optimal Newton-type methods for nonconvex smooth optimization problems Optimal Newton-type methods for nonconvex smooth optimization problems Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint June 9, 20 Abstract We consider a general class of second-order iterations

More information

Primal and Dual Predicted Decrease Approximation Methods

Primal and Dual Predicted Decrease Approximation Methods Primal and Dual Predicted Decrease Approximation Methods Amir Beck Edouard Pauwels Shoham Sabach March 22, 2017 Abstract We introduce the notion of predicted decrease approximation (PDA) for constrained

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Proximal gradient methods

Proximal gradient methods ELE 538B: Large-Scale Optimization for Data Science Proximal gradient methods Yuxin Chen Princeton University, Spring 08 Outline Proximal gradient descent for composite functions Proximal mapping / operator

More information

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES Fenghui Wang Department of Mathematics, Luoyang Normal University, Luoyang 470, P.R. China E-mail: wfenghui@63.com ABSTRACT.

More information

Functions. Chapter Continuous Functions

Functions. Chapter Continuous Functions Chapter 3 Functions 3.1 Continuous Functions A function f is determined by the domain of f: dom(f) R, the set on which f is defined, and the rule specifying the value f(x) of f at each x dom(f). If f is

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

On the acceleration of the double smoothing technique for unconstrained convex optimization problems

On the acceleration of the double smoothing technique for unconstrained convex optimization problems On the acceleration of the double smoothing technique for unconstrained convex optimization problems Radu Ioan Boţ Christopher Hendrich October 10, 01 Abstract. In this article we investigate the possibilities

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

Lecture 6 : Projected Gradient Descent

Lecture 6 : Projected Gradient Descent Lecture 6 : Projected Gradient Descent EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Consider the following update. x l+1 = Π C (x l α f(x l )) Theorem Say f : R d R is (m, M)-strongly

More information

Stochastic model-based minimization under high-order growth

Stochastic model-based minimization under high-order growth Stochastic model-based minimization under high-order growth Damek Davis Dmitriy Drusvyatskiy Kellie J. MacPhee Abstract Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

On proximal-like methods for equilibrium programming

On proximal-like methods for equilibrium programming On proximal-lie methods for equilibrium programming Nils Langenberg Department of Mathematics, University of Trier 54286 Trier, Germany, langenberg@uni-trier.de Abstract In [?] Flam and Antipin discussed

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

GENERALIZED CANTOR SETS AND SETS OF SUMS OF CONVERGENT ALTERNATING SERIES

GENERALIZED CANTOR SETS AND SETS OF SUMS OF CONVERGENT ALTERNATING SERIES Journal of Applied Analysis Vol. 7, No. 1 (2001), pp. 131 150 GENERALIZED CANTOR SETS AND SETS OF SUMS OF CONVERGENT ALTERNATING SERIES M. DINDOŠ Received September 7, 2000 and, in revised form, February

More information

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Math. Program., Ser. B 2013) 140:125 161 DOI 10.1007/s10107-012-0629-5 FULL LENGTH PAPER Gradient methods for minimizing composite functions Yu. Nesterov Received: 10 June 2010 / Accepted: 29 December

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

Lecture 5: September 12

Lecture 5: September 12 10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 12 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Barun Patra and Tyler Vuong Note: LaTeX template courtesy of UC Berkeley EECS

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Lecture 25: Subgradient Method and Bundle Methods April 24

Lecture 25: Subgradient Method and Bundle Methods April 24 IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente May 3, 200 Abstract In this paper we prove that direct search of directional type shares the worst case complexity bound of steepest descent when sufficient

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Convex Functions and Optimization

Convex Functions and Optimization Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized

More information

Exercise Solutions to Functional Analysis

Exercise Solutions to Functional Analysis Exercise Solutions to Functional Analysis Note: References refer to M. Schechter, Principles of Functional Analysis Exersize that. Let φ,..., φ n be an orthonormal set in a Hilbert space H. Show n f n

More information

Convex Analysis and Economic Theory AY Elementary properties of convex functions

Convex Analysis and Economic Theory AY Elementary properties of convex functions Division of the Humanities and Social Sciences Ec 181 KC Border Convex Analysis and Economic Theory AY 2018 2019 Topic 6: Convex functions I 6.1 Elementary properties of convex functions We may occasionally

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Primal and Dual Variables Decomposition Methods in Convex Optimization

Primal and Dual Variables Decomposition Methods in Convex Optimization Primal and Dual Variables Decomposition Methods in Convex Optimization Amir Beck Technion - Israel Institute of Technology Haifa, Israel Based on joint works with Edouard Pauwels, Shoham Sabach, Luba Tetruashvili,

More information

Local strong convexity and local Lipschitz continuity of the gradient of convex functions

Local strong convexity and local Lipschitz continuity of the gradient of convex functions Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems Optimization Methods and Software ISSN: 1055-6788 (Print) 1029-4937 (Online) Journal homepage: http://www.tandfonline.com/loi/goms20 An accelerated non-euclidean hybrid proximal extragradient-type algorithm

More information