Pairwise Away Steps for the Frank-Wolfe Algorithm

Héctor Allende, Department of Informatics, Universidad Federico Santa María, Chile
Ricardo Ñanculef, Department of Informatics, Universidad Federico Santa María, Chile
Emanuele Frandi, Department of Science and High Technology, University of Insubria, Italy
Claudio Sartori, Department of Computer Science and Engineering, University of Bologna, Italy

Abstract

Recently, there has been renewed interest in the machine learning community in variants of a sparse greedy approximation algorithm for concave optimization known as the Frank-Wolfe (FW) method. This algorithm has been successfully applied, for example, to train large-scale instances of non-linear Support Vector Machines (SVMs). In this paper, we investigate an improvement of the FW method based on a new way to perform away steps, a classic strategy used to guarantee the linear convergence of the original FW procedure. On the theoretical side, we present some results about the convergence rate of the algorithm. On the practical side, we assess the performance of the method on several SVM problems. We conclude that our method is faster than traditional FW methods most of the time, and works well even in cases where standard away steps slow down the algorithm.

1 Introduction

Consider the following optimization problem:

    maximize_α  g(α)   subject to   α ∈ S := { α ∈ R^m : Σ_i α_i = 1, α_i ≥ 0 },   (1)

where g is concave but not necessarily strongly or strictly concave. This problem encompasses several models used in machine learning [4, 9]. The FW method computes a sequence of approximations α_1, α_2, ..., α_k to a solution of problem (1) by iterating two simple steps until convergence: it first finds the largest coordinate of the gradient, i.e. i* ∈ argmax_i ∇g(α_k)_i, and then moves the current solution along the direction d_k^FW = (e_{i*} − α_k), seeking the best feasible improvement of the objective function [4, 11, 16]. It can be shown that the convergence of this method can be boosted by considering a slight variant [2, 6, 8]: instead of always moving α_k towards the ascent vertex e_{i*}, we can consider moving α_k away from the (descent) vertex e_{j*}, with j* ∈ argmin_{j: α_j > 0} ∇g(α_k)_j. This variant, known as the Modified Frank-Wolfe (MFW) method, is summarized in Algorithm 1. It has been shown that MFW asymptotically exhibits linear convergence to the solution of problem (1) under some assumptions on the form of the objective function and the feasible set [1, 2, 8, 17]. In addition, the MFW algorithm has the potential to compute sparser solutions in practice, since in contrast to the FW method it allows setting a coordinate of α_k to zero at each step. However, MFW often fails to improve the running times of FW and is sometimes indeed slower [2, 12, 17].
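
To fix ideas, the following is a minimal sketch of the plain FW loop just described, written for a generic concave objective supplied as callables g and grad_g that return the objective value and the gradient as a NumPy array; the function names and the simple grid line search are illustrative choices, not part of the paper.

```python
import numpy as np

def frank_wolfe_simplex(g, grad_g, alpha0, max_iter=1000, eps=1e-6):
    """Sketch of the classical FW loop on the unit simplex for maximizing a
    concave g. The line search is a plain grid search for simplicity."""
    alpha = np.asarray(alpha0, dtype=float).copy()
    grid = np.linspace(0.0, 1.0, 101)
    for _ in range(max_iter):
        grad = grad_g(alpha)
        i_star = int(np.argmax(grad))          # ascent vertex e_{i*}
        # primal-dual gap: max_i grad_i - alpha^T grad
        if grad[i_star] - alpha @ grad <= eps:
            break
        d_fw = -alpha.copy()
        d_fw[i_star] += 1.0                    # d^FW = e_{i*} - alpha
        lam = max(grid, key=lambda l: g(alpha + l * d_fw))
        alpha = alpha + lam * d_fw             # toward (FW) step
    return alpha
```

For quadratic objectives such as the SVM duals considered in Section 4, the line search admits a closed form, but the structure of the iteration is the same.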

Algorithm 1: MFW method for problem (1).
1. Compute an initial estimate α_0.
2. Set I_0 = {i : α_{0,i} ≠ 0}.
3. for k = 0, 1, ... do
4.   Search for i* ∈ argmax_i ∇g(α_k)_i and define d_k^FW = e_{i*} − α_k.
5.   Search for j* ∈ argmin_{j ∈ I_k} ∇g(α_k)_j and define d_k^A = α_k − e_{j*}.
6.   if ∇g(α_k)^T d_k^FW ≥ ∇g(α_k)^T d_k^A then
7.     Perform a line-search to find λ_fw ∈ argmax_{λ ∈ [0,1]} g(α_k + λ d_k^FW).
8.     Perform the FW step α_{k+1} = α_k + λ_fw (e_{i*} − α_k).
9.     Update I_k by I_{k+1} = I_k ∪ {i*}.
10.  else
11.    Perform a line-search to find λ_away ∈ argmax_{λ ∈ [0,1]} g(α_k + λ d_k^A).
12.    Clip the line-search parameter: λ_away = min(λ_away, α_{k,j*}/(1 − α_{k,j*})).
13.    Perform the AWAY step α_{k+1} = α_k + λ_away (α_k − e_{j*}).
14.    If λ_away = α_{k,j*}/(1 − α_{k,j*}), then I_{k+1} = I_k \ {j*}.

2 New Away Steps for FW

Here we define a new type of away step. At each iteration, we find the ascent vertex e_{i*} and the away vertex e_{j*}, where i* ∈ argmax_i ∇g(α_k)_i and j* ∈ argmin_{j: α_j > 0} ∇g(α_k)_j, just as in the MFW method. However, instead of considering the update α_{k+1} = α_k + λ (α_k − e_{j*}), we propose a pairwise away step of the form

    α_{k+1} = α_k + λ (e_{i*} − e_{j*}),   (2)

where λ is determined by a line-search. That is, instead of exploring the away direction d_k^MFW = (α_k − e_{j*}), our algorithm explores the direction d_k^SWAP = (e_{i*} − e_{j*}). The method, referred to as the SWAP method, is summarized in Algorithm 2 and depicted in Fig. 1. Note that this away step allows the algorithm to move away from e_{j*} and toward e_{i*} in the same iteration (see the code sketch following Algorithm 2).

Algorithm 2: The SWAP method for problem (1).
1. Set k = 0.
2. Compute an initial estimate α_0 and set I_0 = {i : α_{0,i} ≠ 0}.
3. for k = 0, 1, ... do
4.   Search for i* ∈ argmax_i ∇g(α_k)_i (ascent direction).
5.   Search for j* ∈ argmin_{j: α_{k,j} ≥ ε_w} ∇g(α_k)_j (descent direction).
6.   Perform a line-search to find λ_swap ∈ argmax_{λ ∈ [0, α_{k,j*}]} g(α_k + λ(e_{i*} − e_{j*})).
7.   Perform a line-search to find λ_fw ∈ argmax_{λ ∈ [0,1]} g(α_k + λ(e_{i*} − α_k)).
8.   Compute δ_swap = g(α_k + λ_swap(e_{i*} − e_{j*})) − g(α_k) (improvement of a SWAP step).
9.   Compute δ_fw = g(α_k + λ_fw(e_{i*} − α_k)) − g(α_k) (improvement of a toward step).
10.  Compute δ_k = max(δ_swap, δ_fw) (the best improvement).
11.  if δ_k = δ_swap then
12.    If λ_swap = α_{k,j*}, mark the iteration as a SWAP-drop step.
13.    Otherwise, mark the iteration as a SWAP-add step.
14.    Perform the SWAP step α_{k+1} = α_k + λ_swap(e_{i*} − e_{j*}).
15.    Set I_{k+1} = I_k ∪ {i*}.
16.    If a SWAP-drop step was performed, I_{k+1} = I_{k+1} \ {j*}.
17.  else
18.    Mark the iteration as a FW step.
19.    Perform the FW step α_{k+1} = α_k + λ_fw(e_{i*} − α_k).
20.    Set I_{k+1} = I_k ∪ {i*}.
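
As an illustration only (not the paper's implementation), one iteration of Algorithm 2 can be sketched as follows, using the same g/grad_g callables as before, a Python set I for the support, and a grid line search standing in for the exact one; eps_w plays the role of the threshold ε_w in step 5.

```python
import numpy as np

def swap_iteration(g, grad_g, alpha, I, eps_w=1e-10):
    """Sketch of one iteration of Algorithm 2: compare a toward (FW) step with
    a pairwise SWAP step and apply whichever improves g more."""
    grid = np.linspace(0.0, 1.0, 101)
    grad = grad_g(alpha)
    i_star = int(np.argmax(grad))                                # ascent vertex e_{i*}
    active = [j for j in range(len(alpha)) if alpha[j] >= eps_w]
    j_star = min(active, key=lambda j: grad[j])                  # descent vertex e_{j*}

    e_i = np.eye(len(alpha))[i_star]
    e_j = np.eye(len(alpha))[j_star]

    # line search for the SWAP direction over [0, alpha_{k,j*}]
    lam_swap = max(grid * alpha[j_star], key=lambda l: g(alpha + l * (e_i - e_j)))
    # line search for the toward (FW) direction over [0, 1]
    lam_fw = max(grid, key=lambda l: g(alpha + l * (e_i - alpha)))

    delta_swap = g(alpha + lam_swap * (e_i - e_j)) - g(alpha)
    delta_fw = g(alpha + lam_fw * (e_i - alpha)) - g(alpha)

    if delta_swap >= delta_fw:                                   # SWAP step
        drop = np.isclose(lam_swap, alpha[j_star])               # SWAP-drop vs SWAP-add
        alpha = alpha + lam_swap * (e_i - e_j)
        I = I | {i_star}
        if drop:                                                 # j* leaves the support
            I = I - {j_star}
    else:                                                        # toward (FW) step
        alpha = alpha + lam_fw * (e_i - alpha)
        I = I | {i_star}
    return alpha, I
```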

Figure 1: A sketch of the search directions used by the FW, MFW and SWAP methods.

3 Convergence Analysis

Here we state some results about the convergence of the proposed method. Proofs are omitted due to space constraints. Note that the stopping criterion in Proposition 3, d*(α_k) = max_i ∇g(α_k)_i − α_k^T ∇g(α_k) ≤ ε, corresponds to a primal-dual measure of optimality, as considered in [2, 4, 11] (a code sketch of its evaluation is given at the end of this section).

Proposition 1 (Global Convergence). Suppose ∇g is Lipschitz-continuous on the feasible set. Then, Algorithm 2 produces a sequence of iterates {α_k}_k such that g(α_k) converges to g(α*), where α* is a solution of problem (1). If α* is unique, {α_k}_k converges to α*.

Proposition 2 (Linear Convergence). Suppose g is twice continuously differentiable and that there is a solution α* of (1) satisfying the strong sufficient condition of Robinson in [13]. Then, for sufficiently large k, any iteration marked as SWAP-add or FW in Algorithm 2 produces an iterate α_{k+1} satisfying

    g(α*) − g(α_{k+1}) ≤ (1 − 1/M) (g(α*) − g(α_k))   (3)

for some constant M > 1.

Proposition 3. Suppose g is twice continuously differentiable and suppose we stop the algorithm using the stopping condition d*(α_k) = max_i ∇g(α_k)_i − α_k^T ∇g(α_k) ≤ ε for some ε ∈ (0, 1). Let K be the number of unclipped iterations performed by Algorithm 2. Then

    K ≤ Q + M/ε,   (4)

where Q, M are constants independent of m and ε.
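
For reference, the stopping measure above is cheap to evaluate once the gradient is available; a minimal helper (illustrative naming) is:

```python
import numpy as np

def primal_dual_gap(grad, alpha):
    """Stopping measure of Proposition 3: d*(alpha) = max_i grad_i - alpha^T grad."""
    return float(np.max(grad) - alpha @ grad)

# Example stopping test, with eps = 1e-6 as in the experiments of Section 4:
# if primal_dual_gap(grad_g(alpha_k), alpha_k) <= 1e-6: stop.
```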

4 Experiments

In this section, we evaluate the performance of the proposed method, specialized to the problem of training L2 Support Vector Machines [5, 6, 15]. We present experiments performed on well-known classification data available in several public repositories [3, 7]. To give an idea of the size of each problem, we specify the size m of the training set and the number of classes K. For multi-category classification problems, we adopt a one-versus-one (OVO) approach [10]. For the initialization of all the methods, that is, the computation of a starting solution α_0, we adopted the method proposed in [15]: the starting solution is obtained by solving problem (1) on a random subset of p training patterns, and the coordinates of α_0 corresponding to the other data points are set to zero. We used p = 20 points for initialization and the stopping criterion of Proposition 3 with ε = 10^−6 for all the algorithms. In all the experiments presented in this paper, SVMs were trained using an RBF (Gaussian) kernel k(x_1, x_2) = exp(−‖x_1 − x_2‖²/2σ²) with scale parameter σ². The parameter σ² was determined using the default method employed in [15], i.e. it was set to the average squared distance among training patterns. The parameter C was determined on the logarithmic grid [2^0, 2^12] using a validation set consisting of a randomly selected 30% of the training set. We also adopted the LRR caching strategy designed in [14] to avoid recomputing recently used kernel values, and the probabilistic speed-up described in [14] to accelerate the search for i*. A code sketch of this setup is given after Table 2.

Table 1: Running times (in seconds) of the FW, MFW and SWAP methods on the classification problems Adult a1a, Adult a5a, Adult a8a, Web w1a, Web w5a, Web w8a, Protein, Usps-Ext, Shuttle and Kdd-10pc (columns: Dataset, K, m, FW, MFW, SWAP).

Table 2: Testing accuracies of the FW, MFW and SWAP methods on the same classification problems (columns: Dataset, K, m, FW, MFW, SWAP).
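
For concreteness, the sketch below (an illustrative reconstruction, not the code used for the experiments) shows one common way, following the CVM formulation of [15], to cast L2-SVM training as an instance of problem (1): the dual objective is g(α) = −α^T K̃ α over the unit simplex, where K̃_ij = y_i y_j k(x_i, x_j) + y_i y_j + δ_ij/C and k is the RBF kernel with the default σ² described above. Function names are ours.

```python
import numpy as np

def rbf_kernel_matrix(X):
    """RBF kernel with the default scale used in the experiments:
    sigma^2 = average squared distance among training patterns."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    sigma2 = d2.mean()                                # default sigma^2
    return np.exp(-d2 / (2.0 * sigma2))

def l2_svm_problem(X, y, C):
    """Concave objective g and gradient grad_g for the L2-SVM dual in the form
    of problem (1), using the CVM-style modified kernel (illustrative sketch)."""
    K_tilde = np.outer(y, y) * (rbf_kernel_matrix(X) + 1.0) + np.eye(len(y)) / C
    g = lambda a: -float(a @ K_tilde @ a)
    grad_g = lambda a: -2.0 * K_tilde @ a
    return g, grad_g
```

The returned callables can be plugged directly into the FW and SWAP sketches given earlier; the experiments additionally rely on the caching and sampling speed-ups of [14], which are omitted here.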

5 Conclusions

We presented a variant of the FW method for the general problem of maximizing a concave function on the unit simplex, introducing a novel way to perform away steps. On the theoretical side, we showed that the method converges globally and that the unclipped optimization steps provide a linear rate of convergence. We also showed that the method achieves a primal-dual gap d*(α_k) = max_i ∇g(α_k)_i − α_k^T ∇g(α_k) [4, 11] lower than a given tolerance ε in O(1/ε) unclipped iterations, independently of m, the dimensionality of the feasible space and the number of examples in SVM problems. On the experimental side, we showed that the proposed method was faster than the FW and MFW methods on several SVM training problems. We observed that the SWAP method performed better than FW and MFW in those cases in which classic away steps effectively boost the convergence of the FW method, and proved to be a robust alternative to MFW in the cases where classic away steps failed.

References

[1] S. Damla Ahipasaoglu, Peng Sun, and Michael J. Todd. Linear convergence of a modified Frank-Wolfe algorithm for computing minimum volume enclosing ellipsoids. Optimization Methods and Software, 23(1):5–19, 2008.
[2] Héctor Allende, Emanuele Frandi, Ricardo Ñanculef, and Claudio Sartori. Novel Frank-Wolfe methods for SVM learning. Technical report.
[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines.
[4] Kenneth Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In Proceedings of SODA '08. SIAM, 2008.
[5] Emanuele Frandi, Maria Grazia Gasparo, Stefano Lodi, Ricardo Ñanculef, and Claudio Sartori. A new algorithm for training SVMs using approximate minimal enclosing balls. In Proceedings of the 15th Iberoamerican Congress on Pattern Recognition, Lecture Notes in Computer Science. Springer.
[6] Emanuele Frandi, Maria Grazia Gasparo, Stefano Lodi, Ricardo Ñanculef, and Claudio Sartori. Training support vector machines using Frank-Wolfe methods. International Journal of Pattern Recognition and Artificial Intelligence, 27(3).
[7] Andrew Frank and Arthur Asuncion. The UCI KDD Archive.
[8] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's away step. Mathematical Programming, 35, 1986.
[9] Bernd Gärtner and Martin Jaggi. Coresets for polytope distance. In John Hershberger and Efi Fogel, editors, Symposium on Computational Geometry. ACM.
[10] Thomas Hofmann, Bernhard Schölkopf, and Alexander Smola. Kernel methods in machine learning. Annals of Statistics, 36(3), 2008.
[11] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning (to appear).
[12] Hua Ouyang and Alexander Gray. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. In SDM.
[13] Stephen Robinson. Generalized equations and their solutions, part II: Applications to nonlinear programming. In Optimality and Stability in Mathematical Programming, volume 19 of Mathematical Programming Studies. Springer Berlin Heidelberg, 1982.
[14] Ivor Tsang, Andras Kocsor, and James Kwok. LibCVM Toolkit.
[15] Ivor Tsang, James Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.
[16] Philip Wolfe. Convergence theory in nonlinear programming. In J. Abadie, editor, Integer and Nonlinear Programming. North-Holland, Amsterdam, 1970.
[17] Emre Alper Yildirim. Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization, 19(3), 2008.
