arxiv: v1 [math.oc] 10 Oct 2018


Frank-Wolfe Method is Automatically Adaptive to Error Bound Condition

Yi Xu (yi-xu@uiowa.edu), Tianbao Yang (tianbao-yang@uiowa.edu)
Department of Computer Science, The University of Iowa, Iowa City, IA 52242
October 10, 2018

Abstract

The error bound condition has recently regained interest in optimization. It has been leveraged to derive faster convergence for many popular algorithms, including subgradient methods, the proximal gradient method and the accelerated proximal gradient method. However, it is still unclear whether the Frank-Wolfe (FW) method can enjoy faster convergence under an error bound condition. In this short note, we give an affirmative answer to this question. We show that the FW method (with a line search for the step size) for optimization over a strongly convex set is automatically adaptive to the error bound condition of the problem. In particular, the iteration complexity of FW can be characterized by $O(\max(1/\epsilon^{1-\theta}, \log(1/\epsilon)))$, where $\theta \in [0,1]$ is a constant that characterizes the error bound condition. Our results imply that if the constraint set is characterized by a strongly convex function and the objective function can achieve a smaller value outside the considered domain, then the FW method enjoys a fast rate of $O(1/t^2)$.

1. Introduction

In this draft, we consider the following constrained convex optimization problem:

\[
\min_{x \in \Omega} f(x), \tag{1}
\]

where $f(x)$ is a smooth function and $\Omega \subseteq E$ is a bounded strongly convex set. We assume that linear optimization over $\Omega$ is much cheaper than projection onto $\Omega$, which makes the FW method more suitable for solving the above problem than gradient methods. The goal of this paper is to show that the FW method is automatically adaptive to an error bound condition of the optimization problem. Below, we first review the FW method and the error bound condition. In the next section, we prove that the FW method is automatically adaptive to the error bound condition.
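Reading the complexity bound from the abstract at a few values of $\theta$ makes the claimed adaptivity concrete. The display below is only a worked substitution: $T(\epsilon)$ is notation introduced here (not in the original) for the number of iterations needed to reach $f(x_t) - f_* \le \epsilon$, with $f_*$ the optimal value, and all constants are suppressed.

```latex
\begin{align*}
\theta = 0 &: \quad T(\epsilon) = O(1/\epsilon)
  && \text{i.e. } f(x_t) - f_* = O(1/t) \text{ (the standard FW rate)},\\
\theta = 1/2 &: \quad T(\epsilon) = O(1/\sqrt{\epsilon})
  && \text{i.e. } f(x_t) - f_* = O(1/t^2),\\
\theta = 1 &: \quad T(\epsilon) = O(\log(1/\epsilon))
  && \text{i.e. linear convergence.}
\end{align*}
```

For $\theta \in [0,1)$ the polynomial term dominates as $\epsilon \to 0$, and solving $\epsilon \approx t^{-1/(1-\theta)}$ for $t$ recovers the $1/\epsilon^{1-\theta}$ count.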
The original FW method, introduced by Frank and Wolfe (1956) (a.k.a. the Conditional Gradient method (Levitin and Polyak, 1966)), is a projection-free first-order method for minimizing smooth convex objective functions over a convex set. In recent years, the FW method has gained increasing interest in large-scale optimization and machine learning (e.g., (Garber and Hazan, 2015; Freund and Grigas, 2016; Nesterov, 2018; Narasimhan, 2018)). Many existing works have shown that the convergence rate of the standard FW method is $O(1/t)$ even for strongly convex objectives (Clarkson, 2008; Hazan, 2008; Jaggi, 2013), and in general the rate cannot be improved. Under different assumptions or for some special cases, a series of works tried to get faster rates for the FW method and its variants (Levitin and Polyak, 1966; Demyanov and Rubinov, 1970; Dunn, 1979; Guélat and Marcotte, 1986; Beck and Teboulle, 2004; Garber and Hazan, 2013; Lan, 2013; Lacoste-Julien and Jaggi, 2013; Garber and Hazan, 2015; Lacoste-Julien and Jaggi, 2015; Lan and Zhou, 2016). For example, for minimizing smooth and strongly convex objective functions over a strongly convex set, Garber and Hazan (2015) showed that the FW method enjoys a fast rate of $O(1/t^2)$.

In this paper, we first consider the FW method shown in Algorithm 1, where $L_f$ denotes a smoothness constant of $f(x)$ with respect to $\|\cdot\|$ such that

\[
f(x) \le f(y) + \nabla f(y)^\top (x - y) + \frac{L_f}{2}\|x - y\|^2
\]

holds for any $x, y \in \Omega$. Note that both options for selecting the step size have been considered in the literature (Jaggi, 2013; Garber and Hazan, 2015). Option I requires evaluating the objective function but does not need to know the smoothness constant. Option II could be cheaper but requires knowing the Lipschitz constant of the gradient. Our analysis applies to both options. In the sequel, we focus on option I, with which we have

\[
\begin{aligned}
f(x_{t+1}) &\le f(x_t + \eta(y_t - x_t)), && \forall \eta \in [0,1]\\
&\le f(x_t) + \eta (y_t - x_t)^\top \nabla f(x_t) + \frac{\eta^2 L_f}{2}\|y_t - x_t\|^2, && \forall \eta \in [0,1].
\end{aligned} \tag{2}
\]

Note that for option II, the second inequality above still holds.

We consider the following definition of the error bound condition for the optimization problem (1).

Definition 1 (Hölderian error bound (HEB)). A function $f(x)$ is said to satisfy a HEB condition on $\Omega$ if there exist $\theta \in [0,1]$ and $0 < c < \infty$ such that for any $x \in \Omega$

\[
\min_{w \in \Omega_*} \|x - w\| \le c\,(f(x) - f_*)^{\theta}, \tag{3}
\]

where $\Omega_*$ denotes the optimal set of $\min_{x \in \Omega} f(x)$ and $f_*$ denotes the optimal objective value.

It is notable that $\theta = 0$ is a trivial condition, since it always holds because $\Omega$ is a compact set.
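As an illustration of Definition 1, the standard strongly convex case can be worked out directly. The sketch below assumes, beyond what the text states, that $f$ is $\mu$-strongly convex with respect to the Euclidean norm (so the minimizer $x_*$ over $\Omega$ is unique) and uses the first-order optimality condition $\nabla f(x_*)^\top (x - x_*) \ge 0$ for all $x \in \Omega$; the symbol $\mu$ is introduced here for illustration.

```latex
\begin{align*}
f(x) - f_* &\ge \nabla f(x_*)^\top (x - x_*) + \frac{\mu}{2}\|x - x_*\|_2^2
            \;\ge\; \frac{\mu}{2}\|x - x_*\|_2^2, \\
\min_{w \in \Omega_*} \|x - w\|_2 &= \|x - x_*\|_2
            \;\le\; \sqrt{2/\mu}\,\bigl(f(x) - f_*\bigr)^{1/2}.
\end{align*}
```

Under these assumptions the HEB condition (3) holds with $\theta = 1/2$ and $c = \sqrt{2/\mu}$.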
The above HEB condition has been considered for deriving faster convergence of subgradient methods (Yang and Lin, 2018), the proximal gradient method (Liu and Yang, 2017), the accelerated gradient method (Xu et al., 2016), and stochastic subgradient methods (Xu et al., 2017a). It has been shown that many problems satisfy the above condition (Xu et al., 2016, 2017a,b; Liu and Yang, 2017; Yang and Lin, 2018). For example, when functions are semialgebraic and regular (for instance, continuous), the above inequality is known to hold on any compact set (cf. (Bolte et al., 2017) and references therein).

The last definition in this section concerns strongly convex sets.

Definition 2. A convex set $\Omega$ is $\alpha$-strongly convex with respect to $\|\cdot\|$ if for any $x, y \in \Omega$, any $\gamma \in [0,1]$ and any vector $z \in E$ such that $\|z\| = 1$, it holds that

\[
\gamma x + (1-\gamma) y + \gamma(1-\gamma)\frac{\alpha}{2}\|x - y\|^2 z \in \Omega.
\]

Remark. Many previous works (e.g., (Levitin and Polyak, 1966; Demyanov and Rubinov, 1970; Dunn, 1979; Garber and Hazan, 2015)) considered this condition on the feasible set when studying the FW method.
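The FW iteration with the two step-size options discussed above can be sketched on a small instance. This is a minimal illustration under assumptions that are not in the original: the objective is a diagonal quadratic $f(x) = \frac{1}{2}\sum_i a_i (x_i - b_i)^2$ with all $a_i > 0$ (so exact line search has a closed form and $L_f = \max_i a_i$), and $\Omega$ is the Euclidean ball of radius $r$, which satisfies Definition 2 with $\alpha = 1/r$. The function names are hypothetical.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def lmo_l2_ball(grad, radius):
    """Linear minimization oracle: argmin of grad^T y over ||y||_2 <= radius."""
    norm = math.sqrt(dot(grad, grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # every feasible point is a minimizer
    return [-radius * gi / norm for gi in grad]


def frank_wolfe_diag_quadratic(a, b, radius, x0, num_iters, option="I"):
    """FW for f(x) = 0.5 * sum_i a_i (x_i - b_i)^2 over the l2 ball (a_i > 0).

    option "I":  exact line search on [0, 1] (closed form for a quadratic).
    option "II": minimize the quadratic upper bound with L_f = max(a).
    Returns the last iterate and the list of objective values.
    """
    def f(x):
        return 0.5 * sum(ai * (xi - bi) ** 2 for ai, xi, bi in zip(a, x, b))

    L_f = max(a)  # smoothness constant with respect to the l2 norm
    x = list(x0)
    values = [f(x)]
    for _ in range(num_iters):
        g = [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]  # gradient
        y = lmo_l2_ball(g, radius)
        d = [yi - xi for yi, xi in zip(y, x)]
        slope = dot(g, d)  # nonpositive up to rounding, since y minimizes g^T y
        if option == "I":
            curvature = sum(ai * di * di for ai, di in zip(a, d))
        else:
            curvature = L_f * dot(d, d)
        if curvature > 0.0:
            eta = min(1.0, max(0.0, -slope / curvature))
        else:
            eta = 0.0  # d is numerically zero: x already coincides with y
        x = [xi + eta * di for xi, di in zip(x, d)]
        values.append(f(x))
    return x, values


if __name__ == "__main__":
    # Illustrative instance: the unconstrained minimizer b lies outside the
    # unit ball, and the constrained minimizer is (1, 0) with f_* = 0.5.
    a, b = [1.0, 4.0], [2.0, 0.0]
    for option in ("I", "II"):
        x, values = frank_wolfe_diag_quadratic(a, b, 1.0, [0.0, 1.0], 50, option)
        print(option, x, values[-1] - 0.5)
```

In this instance the gradient norm is bounded away from zero on the ball, which is the regime where the analysis of Garber and Hazan (2015) predicts fast convergence; printing `values[t] - 0.5` over `t` is one way to inspect the decay empirically.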

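Algorithm 1 Frank-Wolfe Method
    Initialization: $x_0 \in \Omega$
    for $t = 0, \ldots, T$ do
        Compute $y_t \in \arg\min_{y \in \Omega} \nabla f(x_t)^\top y$
        Option I: Set $\eta_t = \arg\min_{\eta \in [0,1]} f(x_t + \eta(y_t - x_t))$
        Option II: Set $\eta_t = \arg\min_{\eta \in [0,1]} \eta (y_t - x_t)^\top \nabla f(x_t) + \frac{\eta^2 L_f}{2}\|y_t - x_t\|^2$
        Compute $x_{t+1} = x_t + \eta_t (y_t - x_t)$
    end for

2. Adaptive Convergence of the FW method

In this section, we show that the FW method is automatically adaptive to the HEB condition, enjoying a faster convergence rate than the standard $O(1/t)$ rate without knowledge of the HEB condition. We first prove the following lemma, in which $\|\cdot\|_*$ denotes the dual norm of $\|\cdot\|$.

Lemma 3. Assume $f(x)$ obeys the HEB condition on $\Omega$ with $\theta \in [0,1]$. Then it holds that

\[
\|\nabla f(x)\|_* \ge \frac{1}{c}\,(f(x) - f_*)^{1-\theta}.
\]

Proof. Let $x_*$ denote the optimal solution in $\Omega_*$ that is closest to $x$ measured in $\|\cdot\|$. By convexity of $f(\cdot)$, we have

\[
f(x_*) \ge f(x) + \nabla f(x)^\top (x_* - x).
\]

Thus,

\[
f(x) - f(x_*) \le \|\nabla f(x)\|_* \|x - x_*\| \le c\,(f(x) - f_*)^{\theta}\,\|\nabla f(x)\|_*.
\]

As a result,

\[
\|\nabla f(x)\|_* \ge \frac{1}{c}\,(f(x) - f_*)^{1-\theta}.
\]

The second lemma is from (Garber and Hazan, 2015).

Lemma 4. For the FW method given in Algorithm 1, for $t = 0, 1, \ldots$, we have

\[
f(x_{t+1}) - f_* \le (f(x_t) - f_*)\,\max\left\{\frac{1}{2},\; 1 - \frac{\alpha \|\nabla f(x_t)\|_*}{8 L_f}\right\}.
\]

Finally, we prove the following theorem.

Theorem 1. For every $t \ge 1$, we have

\[
f(x_t) - f_* \le
\begin{cases}
\dfrac{C}{(t+k)^{1/(1-\theta)}} & \text{if } \theta \in [0,1),\\[2mm]
\rho^{t}\,(f(x_0) - f_*) & \text{otherwise,}
\end{cases}
\]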

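where $k \ge \frac{2 - 2^{1-\theta}}{2^{1-\theta} - 1}$, $C \ge \max\left\{\frac{L_f D^2 (1+k)^{1/(1-\theta)}}{2},\; 2\left(\frac{\tilde{C}}{M}\right)^{1/(1-\theta)}\right\}$, $M = \frac{\alpha}{8 c L_f}$, $\tilde{C} = \frac{1}{1 - \theta - \theta(2^{1-\theta} - 1)}$, and $\rho = \max\left\{\frac{1}{2},\; 1 - \frac{\alpha}{8 c L_f}\right\}$. Here $D$ is an upper bound on $\|x - y\|$ over $x, y \in \Omega$.

Remark. In order to find an $\epsilon$-approximate solution $x_t$ such that $f(x_t) - f_* \le \epsilon$, the iteration complexity of the FW method is $O(\max(1/\epsilon^{1-\theta}, \log(1/\epsilon)))$ with $\theta \in [0,1]$.

Proof. When $\theta = 1$, the conclusion is trivial: Lemma 3 gives $\|\nabla f(x_t)\|_* \ge 1/c$, and the claim follows directly from Lemma 4. Next, we prove the case $\theta \in [0,1)$. Let $\beta = 1 - \theta$ and $h_t = f(x_t) - f_*$. Combining Lemma 3 and Lemma 4, we have

\[
h_{t+1} \le h_t \max\left\{\frac{1}{2},\; 1 - \frac{\alpha}{8 c L_f} h_t^{\beta}\right\} = h_t \max\left\{\frac{1}{2},\; 1 - M h_t^{\beta}\right\}. \tag{4}
\]

We prove by induction that $h_t \le \frac{C}{(t+k)^{1/\beta}}$. For the case $t = 1$, following (2) and using $(y_0 - x_0)^\top \nabla f(x_0) \le (x_* - x_0)^\top \nabla f(x_0) \le -h_0$ together with $\|y_0 - x_0\| \le D$, we have

\[
h_1 \le h_0 (1 - \eta) + \frac{L_f \eta^2}{2} D^2, \quad \forall \eta \in [0,1],
\]

and taking $\eta = 1$ gives $h_1 \le \frac{L_f D^2}{2}$. As long as $\frac{C}{(1+k)^{1/\beta}} \ge \frac{L_f D^2}{2}$, the conclusion holds for $t = 1$.

Next, we consider $t \ge 1$. First assume that the max operation in (4) gives $1/2$, i.e.,

\[
h_{t+1} \le \frac{h_t}{2} \le \frac{C}{2 (t+k)^{1/\beta}} \le \frac{C}{(t+1+k)^{1/\beta}},
\]

where the last inequality holds as long as $(t+1+k)^{1/\beta} \le 2 (t+k)^{1/\beta}$ for all $t \ge 1$, i.e., $1 + \frac{1}{t+k} \le 2^{\beta}$ for all $t \ge 1$, i.e., $k \ge \frac{2 - 2^{\beta}}{2^{\beta} - 1}$.

Next, consider the case where the max operation is attained by the second argument. In this case, if $h_t \le \frac{C}{2 (t+k)^{1/\beta}}$, the same conclusion holds under the above condition on $k$. Otherwise, $\frac{C}{2 (t+k)^{1/\beta}} < h_t \le \frac{C}{(t+k)^{1/\beta}}$, and we have

\[
\begin{aligned}
h_{t+1} \le h_t (1 - M h_t^{\beta})
&\le \frac{C}{(t+k)^{1/\beta}} \left(1 - \frac{M C^{\beta}}{2^{\beta} (t+k)}\right)\\
&= \frac{C}{(t+k+1)^{1/\beta}} \cdot \frac{(t+k+1)^{1/\beta}}{(t+k)^{1/\beta}} \left(1 - \frac{M C^{\beta}}{2^{\beta} (t+k)}\right)\\
&= \frac{C}{(t+k+1)^{1/\beta}} \left(1 + \frac{1}{t+k}\right)^{1/\beta} \left(1 - \frac{M C^{\beta}}{2^{\beta}} \cdot \frac{1}{t+k}\right)\\
&\le \frac{C}{(t+k+1)^{1/\beta}}.
\end{aligned}
\]

To show that the last inequality holds, we can set $\tilde{C} = \frac{1}{\beta - (1-\beta)(2^{\beta} - 1)} \ge 1$ and $C \ge 2 (\tilde{C}/M)^{1/\beta}$, so that $\frac{M C^{\beta}}{2^{\beta}} \ge \tilde{C}$. Writing $x = \frac{1}{t+k}$ and noting that $(1 + \tilde{C} x)(1 - \tilde{C} x) \le 1$, it then suffices to have $1 + \tilde{C} x \ge (1 + x)^{1/\beta}$. To see this, we need to show that

\[
\log(1 + \tilde{C} x) - \frac{1}{\beta} \log(1 + x) \ge 0, \quad 0 \le x \le 2^{\beta} - 1.
\]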

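In fact, the left-hand side vanishes at $x = 0$ and its derivative satisfies

\[
\frac{\tilde{C}}{1 + \tilde{C} x} - \frac{1}{\beta (1 + x)} \ge 0, \quad 0 \le x \le 2^{\beta} - 1,
\]

so $1 + \tilde{C} x \ge (1 + x)^{1/\beta}$ holds for all $0 \le x \le 2^{\beta} - 1$. Plugging $x = \frac{1}{t+k} \le 2^{\beta} - 1$ into this inequality, we get what we want:

\[
1 + \frac{\tilde{C}}{t+k} \ge \left(1 + \frac{1}{t+k}\right)^{1/\beta}.
\]

3. Examples

Lastly, we give examples exhibiting the HEB condition with $\theta = 1/2$. In particular, let us consider

\[
\min_{g(x) \le r} f(x), \tag{5}
\]

where $g(x)$ is a non-negative, strongly convex and smooth function. It is shown that $\Omega = \{x : g(x) \le r\}$ is a strongly convex set (Garber and Hazan, 2015).

Lemma 5. Assume that $\min_x f(x) < \min_{g(x) \le r} f(x)$ and that there exists an $x_0$ such that $g(x_0) < r$. Then the above problem satisfies HEB with $\theta = 1/2$.

Proof. We set $\Omega = \{x : g(x) \le r\}$ and $\Omega_* = \arg\min_{g(x) \le r} f(x)$, and we define an indicator function as follows:

\[
I_{\Omega}(x) =
\begin{cases}
0 & \text{if } x \in \Omega,\\
+\infty & \text{if } x \notin \Omega.
\end{cases}
\]

Then problem (5) can be written as $\min_x \hat{f}(x) := f(x) + I_{\Omega}(x)$, and thus we also have $\Omega_* = \arg\min_x \hat{f}(x)$. We only need to consider any fixed $x_* \in \Omega_*$. By the condition $g(x_0) < r$ and Corollary 28.2.1 of (Rockafellar, 1970), there exists $\lambda_* \ge 0$ such that

\[
\hat{f}(x_*) = \min_x \hat{f}(x) = f(x_*) = \min_{x \in \Omega} f(x) = \min_x \{f(x) + \lambda_* (g(x) - r)\} \le f(x_*) + \lambda_* (g(x_*) - r) \le f(x_*), \tag{6}
\]

where the first inequality is due to $x_* \in \Omega_*$; the second inequality uses the fact that $x_* \in \Omega$, hence $g(x_*) - r \le 0$. Then equality holds in (6), which implies $f(x_*) + \lambda_* (g(x_*) - r) = f(x_*)$, that is,

\[
\lambda_* (g(x_*) - r) = 0. \tag{7}
\]

On the other hand, let $u_* \in \arg\min_x f(x)$. Based on the assumption $\min_x f(x) < \min_{g(x) \le r} f(x)$, we know $u_* \notin \Omega$, hence $u_* \notin \Omega_*$. By (6), we also know

\[
f(u_*) < \min_{x \in \Omega} f(x) = \min_x \{f(x) + \lambda_* (g(x) - r)\} \le f(u_*) + \lambda_* (g(u_*) - r),
\]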

which implies

\[
\lambda_* (g(u_*) - r) > 0. \tag{8}
\]

Since $u_* \notin \Omega$, we have $g(u_*) - r > 0$. In order to have (8), we need $\lambda_* > 0$. Thus, by (7) we have

\[
g(x_*) - r = 0. \tag{9}
\]

For any such $\lambda_* > 0$, by Theorem 28.1 of (Rockafellar, 1970), we also have

\[
\Omega_* = \{x : g(x) = r\} \cap \arg\min_x \{f(x) + \lambda_* (g(x) - r)\}. \tag{10}
\]

Since $g(x)$ is strongly convex, $f(x)$ is convex and $\lambda_* > 0$, the function $f(x) + \lambda_* (g(x) - r)$ is also strongly convex, implying that $v_* = \arg\min_x \{f(x) + \lambda_* (g(x) - r)\}$ is a unique point. Due to $\lambda_* > 0$, $g(v_*)$ is also a constant (Li and Pong, 2017). By (10) we have $g(v_*) = g(x_*) = r$. Therefore,

\[
\Omega_* = \arg\min_x \{f(x) + \lambda_* (g(x) - r)\}. \tag{11}
\]

By the strong convexity of $f(x) + \lambda_* (g(x) - r)$, we know that for any $x \in \Omega$ and $x_* \in \Omega_* \subseteq \Omega$,

\[
c \|x - x_*\|^2 \le f(x) + \lambda_* (g(x) - r) - [f(x_*) + \lambda_* (g(x_*) - r)],
\]

where $c > 0$. Since $\lambda_* > 0$, $g(x) - r \le 0$ and $g(x_*) - r = 0$, we get

\[
c \|x - x_*\|^2 \le f(x) - f(x_*).
\]

Therefore, for any $x \in \Omega$,

\[
\min_{w \in \Omega_*} \|x - w\| \le c^{-1/2} (f(x) - f_*)^{1/2},
\]

which implies $\theta = 1/2$.

References

Amir Beck and Marc Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research, 59(2):235-247, 2004.

Jérôme Bolte, Trong Phong Nguyen, Juan Peypouquet, and Bruce W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165(2):471-507, 2017.

Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics, 2008.

Vladimir Fedorovich Demyanov and Aleksandr Moiseevich Rubinov. Approximate methods in optimization problems, volume 32. Elsevier Publishing Company, 1970.

Joseph Dunn. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization, 17(2):187-211, 1979.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.

Robert M. Freund and Paul Grigas. New analysis and results for the Frank-Wolfe method. Mathematical Programming, 155(1-2):199-230, 2016.

Dan Garber and Elad Hazan. Playing non-linear games with linear oracles. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 420-428, 2013.

Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's away step. Mathematical Programming, 35(1):110-119, 1986.

Elad Hazan. Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics. Springer, 2008.

Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Simon Lacoste-Julien and Martin Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. arXiv preprint arXiv:1312.7864, 2013.

Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (NIPS), 2015.

Guanghui Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint, 2013.

Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2), 2016.

E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1-50, 1966.

Guoyin Li and Ting Kei Pong. Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Foundations of Computational Mathematics, pages 1-34, 2017.

Mingrui Liu and Tianbao Yang. Adaptive accelerated gradient converging method under Hölderian error bound condition. In Advances in Neural Information Processing Systems, 2017.

Harikrishna Narasimhan. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics, 2018.

Yu Nesterov. Complexity bounds for primal-dual methods minimizing the model of objective function. Mathematical Programming, 171(1-2):311-330, 2018.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

Yi Xu, Yan Yan, Qihang Lin, and Tianbao Yang. Homotopy smoothing for non-smooth problems with lower complexity than O(1/ǫ). In Advances in Neural Information Processing Systems (NIPS), pages 1208-1216, 2016.

Yi Xu, Qihang Lin, and Tianbao Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017a.

Yi Xu, Mingrui Liu, Qihang Lin, and Tianbao Yang. ADMM without a fixed penalty parameter: Faster convergence with new adaptive penalization. In Advances in Neural Information Processing Systems 30 (NIPS), pages 1267-1277, 2017b.

Tianbao Yang and Qihang Lin. RSG: Beating subgradient method without smoothness and strong convexity. Journal of Machine Learning Research, 19(6), 2018.
