
Universitatea Politehnica Bucureşti
Facultatea de Automatică şi Calculatoare
Departamentul de Automatică şi Ingineria Sistemelor

TEZĂ DE ABILITARE (Habilitation Thesis)

Metode de Descreştere pe Coordonate pentru Optimizare Rară
(Coordinate Descent Methods for Sparse Optimization)

Ion Necoară
2013


Contents

1 Rezumat (Summary)
  1.1 Contributions of this thesis
  1.2 Main publications on coordinate descent algorithms
2 Summary
  2.1 Contributions of this thesis
  2.2 Main publications on coordinate descent algorithms
3 Random coordinate descent methods for linearly constrained smooth optimization
  Introduction; Problem formulation; Previous work; Random block coordinate descent method; Convergence rate in expectation; Design of probabilities; Comparison with full projected gradient method; Convergence rate for strongly convex case; Convergence rate in probability; Random pairs sampling; Generalizations (parallel coordinate descent algorithm; optimization problems with general equality constraints); Applications (recovering approximate primal solutions from full dual gradient)
4 Random coordinate descent methods for singly linearly constrained smooth optimization
  Introduction; Problem formulation; Random block coordinate descent method; Convergence rate in expectation; Choices for probabilities; Worst case analysis between (RCD) and full projected gradient; Convergence rate in probability; Convergence rate for strongly convex case; Extensions (generalization of algorithm (RCD) to more than 2 blocks; extension to different local norms); Numerical experiments
5 Random coordinate descent method for linearly constrained composite optimization
  Introduction; Problem formulation; Previous work; Random coordinate descent algorithm; Convergence rate in expectation; Convergence rate for strongly convex functions; Convergence rate in probability; Generalizations; Complexity analysis and comparison with other approaches; Numerical experiments (support vector machine; Chebyshev center of a set of points; randomly generated problems with l1-regularization term)
6 Random coordinate descent methods for nonconvex composite optimization
  Introduction; Unconstrained minimization of composite objective functions (problem formulation; a 1-random coordinate descent algorithm; convergence; linear convergence for objective functions with error bound); Constrained minimization of composite objective functions (problem formulation; a 2-random coordinate descent algorithm; convergence); Constrained minimization of smooth objective functions; Numerical experiments
7 Distributed random coordinate descent methods for composite optimization
  Introduction; Problem formulation (motivating practical applications); Distributed and parallel coordinate descent method; Sublinear convergence for smooth convex minimization; Linear convergence for error bound convex minimization; Conditions for generalized error bound functions (Case 1: f strongly convex and Ψ convex; Case 2: Ψ indicator function of a polyhedral set; Case 3: Ψ polyhedral function; Case 4: dual formulation); Convergence analysis under sparsity conditions; Distributed implementation; Comparison with other approaches; Numerical simulations
8 Parallel coordinate descent algorithm for separable constraints optimization: application to MPC
  Introduction; Parallel coordinate descent algorithm (PCDM) for separable constraints minimization (parallel block-coordinate descent method); Application of PCDM to distributed suboptimal MPC (MPC for networked systems: terminal cost and no end constraints; distributed synthesis for a terminal cost; stability of the MPC scheme; distributed implementation of the MPC scheme based on PCDM); Numerical results (quadruple tank process; implementation of the MPC scheme using MPI; implementation of the MPC scheme using a Siemens S7-1200 PLC; implementation of the MPC scheme for random networked systems)
9 Future work
  Huge-scale sparse optimization: theory, algorithms and applications; Optimization based control for distributed networked systems; Optimization based control for resource-constrained embedded systems (each with methodology and capacity to generate results)
Bibliography

Chapter 1
Rezumat (Summary)

1.1 Contributions of this thesis

The main optimization problem considered in this thesis has the following form:

  min_{x ∈ R^n}  F(x)  ( = f(x) + Ψ(x) )
  s.t.:  Ax = b,                                                        (1.1)

where f is a smooth function (with Lipschitz continuous gradient), Ψ is a simple regularization function (minimizing the sum of this function and a quadratic term is easy) and the matrix A ∈ R^{m×n} is usually sparse, with a structure given by a graph associated with the problem. Another characteristic of the problem is its very large dimension, i.e. n is of the order of millions or even billions. We also assume that the decision variable x can be decomposed into (block) components x = [x_1^T x_2^T ... x_N^T]^T, where x_i ∈ R^{n_i} and Σ_i n_i = n. Note that this optimization problem is very general and appears in many engineering applications:

- Ψ is the indicator function of a convex set X which can usually be written as a Cartesian product X = X_1 × X_2 × ... × X_N, with X_i ⊆ R^{n_i}. This problem is known as the separable optimization problem with linear coupling constraints and appears in many applications from distributed control and estimation [13, 6, 65, 100, 11], network optimization [9, 8, 98, 110, 11], computer vision [10, 44], etc.

- Ψ is either the indicator function of a convex set X = X_1 × X_2 × ... × X_N or the ℓ1-norm ‖x‖_1 (used to obtain a sparse solution), and the matrix A = 0. This problem appears in distributed model predictive control [61, 103], image processing [14, 1, 47, 105], classification [99, 13, 14], data mining [16, 86, 119], etc.

- Ψ is the indicator function of a convex set X = X_1 × X_2 × ... × X_N and A = a^T, i.e. we have a single linear coupling constraint. This problem appears in page ranking (the Google problem) [59, 76], control [39, 83, 84, 104], learning [16-18, 109, 111], truss topology design [4], etc.

We observe that (1.1) falls into the class of large-scale optimization problems with sparse data and/or sparse solutions. The standard approach for solving the very large optimization problem (1.1) is based on decomposition. Decomposition methods are an efficient tool for solving this type of problem because they allow splitting the original large problem into small subproblems which are then coordinated by a

master problem. Decomposition methods fall into two classes: primal and dual decomposition. In primal decomposition methods the original problem is treated directly, while in dual methods the coupling constraints are moved into the cost using Lagrange multipliers, after which the dual problem is solved. In my research activity over the last 7 years I have developed and analyzed algorithms belonging to both classes of decomposition methods. To the best of my knowledge, I was among the first researchers to use smoothing techniques in dual decomposition in order to obtain faster convergence rates for the proposed dual algorithms (see the papers [64, 65, 71, 7, 90, 91, 110]). However, in this thesis I have chosen to present my most recent results on primal decomposition methods, namely coordinate descent methods (see the papers [59-61, 65, 67, 70, 84]). The main contributions of this thesis, by chapters, are the following:

Chapter 3: In this chapter we develop random coordinate descent methods for minimizing very large convex optimization problems with linear coupling constraints and an objective function with coordinate-wise Lipschitz continuous gradient. Since the optimization problem contains coupling constraints, we have to define an algorithm that updates two (block) components per iteration. We prove that these methods obtain an ϵ-approximate solution, in expectation of the objective function values, in at most O(1/ϵ) iterations. On the other hand, the numerical complexity of each iteration is much lower than that of methods based on the full gradient. We also focus on the optimal choice of the probabilities that makes these algorithms converge fast and show that this leads to solving a sparse, small SDP. The analysis of the convergence rate in probability is also given in this chapter. For strongly convex objective functions we show that the new algorithms converge linearly. We also extend the main algorithm, which updates a single pair of (block) components per iteration, to a parallel algorithm that updates several (block) components per iteration, and show that for this parallel version the convergence rate depends linearly on the number of (block) components updated. Numerical tests confirm that, on large optimization problems for which computing one component of the gradient is numerically cheap, the newly proposed methods are much more efficient than methods based on the full gradient. This chapter is based on the articles [67, 68].

Chapter 4: In this chapter we develop random coordinate descent methods for minimizing multi-agent convex optimization problems with an objective function with coordinate-wise Lipschitz continuous gradient and a single coupling constraint. Due to the presence of the coupling constraint in the optimization problem, the presented algorithms update two coordinates per iteration. For such methods we prove that an ϵ-approximate solution, in expectation of the objective function values, can be obtained in at most O(1/(λ_2(Q) ϵ)) iterations, where λ_2(Q) is the second smallest eigenvalue of a matrix Q defined in terms of the chosen probabilities and the number of blocks. On the other hand, the numerical complexity per iteration of our methods is much lower than that of full-gradient methods, and each iteration can be computed in a distributed way. We also analyze the optimal choice of the probabilities and show that this analysis leads to solving a sparse SDP. For the developed methods we also present convergence rates in probability. In the case of strongly convex functions we show that the new algorithms converge linearly. We also present a parallel version of the main algorithm, in which several (block) components are updated per iteration, and for which we also derive the convergence rate. The developed algorithms were implemented in Matlab for solving the Google problem, and the simulation results show their superiority over methods

based on full gradient information. This chapter is based on the papers [58, 59, 69].

Chapter 5: In this chapter we propose a variant of a random coordinate descent method for solving convex optimization problems with a composite objective function (the sum of a convex function with coordinate-wise Lipschitz continuous gradient and a convex function with simple structure but possibly nondifferentiable) and linear coupling constraints. If the smooth part of the objective function has coordinate-wise Lipschitz continuous gradient, then the proposed method randomly chooses two (block) components and obtains an ϵ-approximate solution, in expectation of the objective function values, in O(N^2/ϵ) iterations, where N is the number of (block) components. For optimization problems in which evaluating one component of the gradient is numerically cheap, the proposed method is more efficient than full-gradient methods. The analysis of the convergence rate in probability is also given in this chapter. For strongly convex objective functions we show that the new algorithms converge linearly. The proposed algorithm was implemented in C code and tested on real SVM data and on the problem of finding the Chebyshev center of a set of points. The numerical experiments confirm that on large problems our method is more efficient than methods based on the full gradient or greedy coordinate descent methods. This chapter is based on the paper [70].

Chapter 6: In this chapter we analyze new random coordinate descent methods for solving nonconvex optimization problems with a composite objective function: the sum of a nonconvex function with coordinate-wise Lipschitz continuous gradient and a convex, possibly nondifferentiable function with simple structure. We address both the unconstrained case and the case with linear coupling constraints. For optimization problems with the structure defined above we propose random coordinate descent methods and analyze their convergence properties. In the general case, we prove asymptotic convergence of the sequences generated by the new algorithms to stationary points and a sublinear convergence rate, in expectation, for a certain optimality measure. Moreover, if the objective function satisfies an error bound condition, we derive local linear convergence in expectation of the objective function values. We also present numerical experiments evaluating the practical performance of the proposed algorithms on the well-known eigenvalue complementarity problem. The numerical experiments show that on large problems our method is more efficient than methods based on the full gradient. This chapter is based on the papers [84, 85].

Chapter 7: In this chapter we propose a distributed version of a random coordinate descent method for minimizing a composite objective function: the sum of a smooth, partially separable convex function and a fully separable, convex, but possibly nondifferentiable function. Under the Lipschitz gradient assumption on the smooth part, this method has a sublinear convergence rate. A linear convergence rate is obtained for a newly introduced class of objective functions satisfying a generalized error bound condition. We show that this new class of functions includes functions already studied in the literature, such as strongly convex functions and functions satisfying the classical error bound condition. We also prove that the theoretical estimates of the convergence rates depend linearly on the number of (block) components chosen at random and on a measure of separability of the objective function. The proposed algorithm was implemented in C code and tested on the constrained lasso problem. The numerical experiments confirm that parallelization can substantially accelerate the convergence rate of the classical coordinate descent method. This chapter is based on the paper [60].

Chapter 8: In this chapter we propose a parallel coordinate descent algorithm for solving convex optimization problems with separable constraints, which arise for example in distributed model predictive control (MPC) for linear networked systems. Our algorithm is based on parallel updates of the coordinates and has a very simple iteration. We prove linear (sublinear) convergence rates for the sequence generated by the new algorithm under standard assumptions on the objective function. Moreover, the algorithm uses local information to update the components of the decision variable and is therefore suitable for distributed implementation. Since it also has a low iteration complexity, it is well suited for embedded control. We propose an MPC scheme based on this algorithm, in which every subsystem in the network can compute feasible and stabilizing inputs using cheap and distributed computations. The proposed control method was implemented on a Siemens PLC for the control of a real four-tank plant. This chapter is based on the paper [61].

1.2 Main publications on coordinate descent algorithms

The results presented in this thesis have been accepted for publication in top ISI journals or prestigious conferences. Part of the results (Chapter 7) have recently been submitted to journals. We list below the publications on which this thesis is based.

Articles in ISI journals

I. Necoara and D. Clipici, Distributed random coordinate descent methods for composite minimization, partially accepted in SIAM Journal on Optimization, pp. 1-40, December 2013.
A. Patrascu and I. Necoara, Random coordinate descent methods for l0 regularized convex optimization, accepted in IEEE Transactions on Automatic Control, to appear, 2014.
I. Necoara, Random coordinate descent algorithms for multi-agent convex optimization over networks, IEEE Transactions on Automatic Control, vol. 58, no. 8, 2013.
A. Patrascu, I. Necoara, Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization, Journal of Global Optimization, 2014.
I. Necoara, A. Patrascu, A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints, Computational Optimization and Applications, vol. 57, no. 2, 2014.
I. Necoara, D. Clipici, Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC, Journal of Process Control, vol. 23, no. 3, 2013.
I. Necoara, V. Nedelcu and I. Dumitrache, Parallel and distributed optimization methods for estimation and control in networks, Journal of Process Control, vol. 21, no. 5, 2011.

Articles in preparation

I. Necoara, Y. Nesterov and F. Glineur, A random coordinate descent method on large optimization problems with linear constraints, Technical Report, University Politehnica Bucharest, June 2014.

Conference articles

I. Necoara, Y. Nesterov and F. Glineur, A random coordinate descent method on large optimization problems with linear constraints, The Fourth International Conference on Continuous Optimization, Lisbon, 2013.
I. Necoara, A. Patrascu, A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints, The Fourth International Conference on Continuous Optimization, Lisbon, 2013.
I. Necoara, D. Clipici, A computationally efficient parallel coordinate descent algorithm for MPC implementation on a PLC, in Proceedings of the 12th European Control Conference, 2013.
A. Patrascu, I. Necoara, A random coordinate descent algorithm for large-scale sparse nonconvex optimization, in Proceedings of the 12th European Control Conference, 2013.
I. Necoara, Suboptimal distributed MPC based on a block-coordinate descent method with feasibility and stability guarantees, in Proceedings of the 51st IEEE Conference on Decision and Control, 2012.
I. Necoara, A random coordinate descent method for large-scale resource allocation problems, in Proceedings of the 51st IEEE Conference on Decision and Control, 2012.
I. Necoara, A. Patrascu, A random coordinate descent algorithm for singly linear constrained smooth optimization, in Proceedings of the 20th Symposium on Mathematical Theory of Networks and Systems, 2012.

Chapter 2
Summary

2.1 Contributions of this thesis

The main optimization problem of interest considered in this thesis has the following form:

  min_{x ∈ R^n}  F(x)  ( = f(x) + Ψ(x) )
  s.t.:  Ax = b,                                                        (2.1)

where f is a smooth function (i.e. with Lipschitz continuous gradient), Ψ is a simple convex function (i.e. minimizing the sum of this function and a quadratic term is easy) and the matrix A ∈ R^{m×n} is usually sparse according to some graph structure. Another characteristic of this problem is its very large dimension, i.e. n is very large; in particular, we deal with millions or even billions of variables. We further assume that the decision variable x can be decomposed into (block) components x = [x_1^T x_2^T ... x_N^T]^T, where x_i ∈ R^{n_i} and Σ_i n_i = n. Note that this problem is very general and appears in many engineering applications:

- Ψ is the indicator function of some convex set X that can usually be written as a Cartesian product X = X_1 × X_2 × ... × X_N, where X_i ⊆ R^{n_i}. This problem is known in the literature as the separable optimization problem with linear coupling constraints and appears in many applications from distributed control and estimation [13, 6, 65, 100, 11], network optimization [9, 8, 98, 110, 11], computer vision [10, 44], etc.

- Ψ is either the indicator function of some convex set X = X_1 × X_2 × ... × X_N or the ℓ1-norm ‖x‖_1 (in order to induce sparsity in the solution) and the matrix A = 0 (a concrete instance of this case is written out below). This problem appears in distributed model predictive control [61, 103], image processing [14, 1, 47, 105], classification [99, 13, 14], data mining [16, 86, 119], etc.

- Ψ is the indicator function of some convex set X = X_1 × X_2 × ... × X_N and A = a^T, i.e. a single linear coupling constraint. This problem appears in page ranking (also known as the Google problem) [59, 76], control [39, 83, 84, 104], learning [16-18, 109, 111], truss topology design [4], etc.

We notice that (2.1) belongs to the class of large-scale optimization problems with sparse data/solutions.
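For concreteness, one instance covered by the second case above (Ψ = λ‖·‖_1 and A = 0) is ℓ1-regularized least squares; the data C, d and the weight λ in this small sketch are generic placeholders rather than notation used later in the thesis:

```latex
\min_{x \in \mathbb{R}^n} \;
  \underbrace{\tfrac{1}{2}\,\|Cx - d\|^2}_{f(x)}
  \;+\;
  \underbrace{\lambda \|x\|_1}_{\Psi(x)},
\qquad C \in \mathbb{R}^{m \times n},\; d \in \mathbb{R}^m,\; \lambda > 0.
```

Here f has Lipschitz continuous gradient with constant ‖C‖^2 and the proximal step on Ψ reduces to coordinate-wise soft thresholding, so the problem fits the composite model (2.1) with no coupling constraint.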

The standard approach for solving the large-scale optimization problem (2.1) is to use decomposition. Decomposition methods represent a powerful tool for solving this type of problem due to their ability to divide the original large-scale problem into smaller subproblems which are coordinated by a master problem. Decomposition methods can be divided into two main classes: primal and dual decomposition. While in primal decomposition methods the optimization problem is solved using the original formulation and variables, in dual decomposition the constraints are moved into the cost using Lagrange multipliers and the dual problem is solved. In the last 7 years I have pursued both approaches in my research. To my knowledge, I am one of the first researchers to have used smoothing techniques in Lagrangian dual decomposition in order to obtain faster convergence rates for the corresponding algorithms (see e.g. the papers [64, 65, 71, 7, 90, 91, 110]). In this thesis, however, I have opted to present some of my recent results on primal decomposition, namely coordinate descent methods (see e.g. the papers [59-61, 65, 67, 70, 84]). The main contributions of this thesis, by chapters, are as follows:

Chapter 3: In this chapter we develop random (block) coordinate descent methods for minimizing large-scale convex problems with linearly coupled constraints and prove that they obtain in expectation an ϵ-accurate solution in at most O(1/ϵ) iterations. Since we have coupled constraints in the problem, we need to devise an algorithm that randomly updates two (block) components per iteration. However, the numerical complexity per iteration of the new methods is usually much cheaper than that of methods based on full gradient information. We focus on how to choose the probabilities so as to make the randomized algorithm converge as fast as possible, and we arrive at solving sparse SDPs. An analysis of the rate of convergence in probability is also provided. For strongly convex functions the new methods converge linearly. We also extend the main algorithm, where we update two (block) components per iteration, to a parallel random coordinate descent algorithm, where we update more than two (block) components per iteration, and we show that for this parallel version the convergence rate depends linearly on the number of (block) components updated. Numerical tests confirm that on large optimization problems with cheap coordinate derivatives the new methods are much more efficient than methods based on the full gradient. This chapter is based on papers [67, 68].

Chapter 4: In this chapter we develop randomized block-coordinate descent methods for minimizing multi-agent convex optimization problems with a single linear coupled constraint over networks and prove that they obtain in expectation an ϵ-accurate solution in at most O(1/(λ_2(Q) ϵ)) iterations, where λ_2(Q) is the second smallest eigenvalue of a matrix Q that is defined in terms of the probabilities and the number of blocks. However, the computational complexity per iteration of our methods is much lower than that of a method based on full gradient information, and each iteration can be computed in a completely distributed way. We focus on how to choose the probabilities so as to make these randomized algorithms converge as fast as possible, and we arrive at solving a sparse SDP. An analysis of the rate of convergence in probability is also provided. For strongly convex functions our distributed algorithms converge linearly. We also extend the main algorithm to a parallel random coordinate descent method and to problems with more general linearly coupled constraints, for which we also derive the rate of convergence.
The new algorithms were implemented in Matlab and applied to the Google problem (a standard formulation of which is recalled below), and the simulation results show the superiority of our approach compared to methods based on the full gradient. This chapter is based on papers [58, 59, 69].
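For reference, the Google (PageRank) problem mentioned above can be cast in the singly constrained form treated in this chapter by looking for a stationary distribution of a column-stochastic link matrix E; the reformulation below is a standard one and is given only as an illustration, not necessarily the exact formulation used in Chapter 4:

```latex
\min_{x \in \mathbb{R}^N} \; \tfrac{1}{2}\,\|E x - x\|^2
\qquad \text{s.t.:} \quad e^T x = 1, \;\; x \ge 0,
```

so that the smooth part is f(x) = ½‖Ex − x‖^2, Ψ is the indicator function of the nonnegative orthant and e^T x = 1 is the single linear coupling constraint.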

Chapter 5: In this chapter we propose a variant of the random coordinate descent method for solving linearly constrained convex optimization problems with composite objective functions. If the smooth part of the objective function has Lipschitz continuous gradient, then we prove that our method obtains an ϵ-optimal solution in O(N^2/ϵ) iterations, where N is the number of blocks. For the class of problems with cheap coordinate derivatives we show that the new method is faster than methods based on full-gradient information. An analysis of the rate of convergence in probability is also provided. For strongly convex functions our method converges linearly. The proposed algorithm was implemented in C code and tested on real SVM data and on the problem of finding the Chebyshev center of a set of points. Extensive numerical tests confirm that on very large problems our method is much more numerically efficient than methods based on full gradient information or coordinate descent methods based on greedy index selection. This chapter is based on paper [70].

Chapter 6: In this chapter we analyze several new methods for solving nonconvex optimization problems with the objective function formed as a sum of two terms: one is nonconvex and smooth, and the other is convex but simple, with known structure. Further, we consider both cases: unconstrained and linearly constrained nonconvex problems. For optimization problems with the above structure, we propose random coordinate descent algorithms and analyze their convergence properties. For the general case, when the objective function is nonconvex and composite, we prove asymptotic convergence of the sequences generated by our algorithms to stationary points and a sublinear rate of convergence in expectation for some optimality measure. Additionally, if the objective function satisfies an error bound condition, we derive a local linear rate of convergence for the expected values of the objective function. We also present extensive numerical experiments on the eigenvalue complementarity problem for evaluating the performance of our algorithms in comparison with state-of-the-art methods. From the numerical experiments we observe that on large optimization problems the new methods are much more efficient than methods based on the full gradient. This chapter is based on papers [84, 85].

Chapter 7: In this chapter we propose a distributed version of a randomized (block) coordinate descent method for minimizing the sum of a partially separable smooth convex function and a fully separable non-smooth convex function. Under the assumption of block Lipschitz continuity of the gradient of the smooth function, this method is shown to have a sublinear convergence rate. A linear convergence rate is obtained for the newly introduced class of generalized error bound functions. We prove that the new class of generalized error bound functions encompasses both global/local error bound functions and smooth strongly convex functions. We also show that the theoretical estimates on the convergence rate depend on the number of blocks chosen randomly and on a natural measure of separability of the objective function. The new algorithm was implemented in C code and tested on the constrained lasso problem. Numerical experiments show that by parallelization we can substantially accelerate the rate of convergence of the classical random coordinate descent method. This chapter is based on paper [60].

Chapter 8: In this chapter we propose a parallel coordinate descent algorithm for solving smooth convex optimization problems with separable constraints that may arise, e.g., in distributed model predictive control (MPC) for linear networked systems. Our algorithm is based on block coordinate descent updates performed in parallel and has a very simple iteration. We prove (sub)linear rates of convergence for the new algorithm under standard assumptions for smooth convex optimization.
Further, our algorithm uses local information and is thus suitable for distributed implementations. Moreover, it has low iteration complexity, which makes it appropriate for embedded control. An MPC scheme based on this new parallel algorithm is derived, for which every subsystem in the network can compute feasible and stabilizing control inputs using distributed and cheap computations. To ensure stability of the MPC scheme, we use a terminal cost formulation derived from a distributed synthesis. The proposed control method was implemented on a Siemens PLC

for controlling a four-tank process. This chapter is based on paper [61].

2.2 Main publications on coordinate descent algorithms

Most of the material presented in this thesis has been published, or accepted for publication, in top journals or conference proceedings. Some of the material (Chapter 7) has been submitted for publication recently. We detail below the main publications from this thesis.

Articles in ISI journals

I. Necoara and D. Clipici, Distributed random coordinate descent methods for composite minimization, partially accepted in SIAM Journal on Optimization, pp. 1-40, December 2013.
A. Patrascu and I. Necoara, Random coordinate descent methods for l0 regularized convex optimization, accepted in IEEE Transactions on Automatic Control, to appear, 2014.
I. Necoara, Random coordinate descent algorithms for multi-agent convex optimization over networks, IEEE Transactions on Automatic Control, vol. 58, no. 8, 2013.
A. Patrascu, I. Necoara, Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization, Journal of Global Optimization, 2014.
I. Necoara, A. Patrascu, A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints, Computational Optimization and Applications, vol. 57, no. 2, 2014.
I. Necoara, D. Clipici, Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC, Journal of Process Control, vol. 23, no. 3, 2013.
I. Necoara, V. Nedelcu and I. Dumitrache, Parallel and distributed optimization methods for estimation and control in networks, Journal of Process Control, vol. 21, no. 5, 2011.

Articles in preparation

I. Necoara, Y. Nesterov and F. Glineur, A random coordinate descent method on large optimization problems with linear constraints, Technical Report, University Politehnica Bucharest, June 2014.

Articles in conferences

I. Necoara, Y. Nesterov and F. Glineur, A random coordinate descent method on large optimization problems with linear constraints, The Fourth International Conference on Continuous Optimization, Lisbon, 2013.
I. Necoara, A. Patrascu, A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints, The Fourth International Conference on Continuous Optimization, Lisbon, 2013.

I. Necoara, D. Clipici, A computationally efficient parallel coordinate descent algorithm for MPC implementation on a PLC, in Proceedings of the 12th European Control Conference, 2013.
A. Patrascu, I. Necoara, A random coordinate descent algorithm for large-scale sparse nonconvex optimization, in Proceedings of the 12th European Control Conference, 2013.
I. Necoara, Suboptimal distributed MPC based on a block-coordinate descent method with feasibility and stability guarantees, in Proceedings of the 51st IEEE Conference on Decision and Control, 2012.
I. Necoara, A random coordinate descent method for large-scale resource allocation problems, in Proceedings of the 51st IEEE Conference on Decision and Control, 2012.
I. Necoara, A. Patrascu, A random coordinate descent algorithm for singly linear constrained smooth optimization, in Proceedings of the 20th Symposium on Mathematical Theory of Networks and Systems, 2012.

Chapter 3
Random coordinate descent methods for linearly constrained smooth optimization

In this chapter we develop random block coordinate descent methods for minimizing large-scale convex problems with linearly coupled constraints and prove that they obtain in expectation an ϵ-accurate solution in at most O(1/ϵ) iterations. Since we have coupled constraints in the problem, we need to devise an algorithm that updates randomly two (block) components per iteration. However, the numerical complexity per iteration of the new methods is usually much cheaper than that of methods based on full gradient information. We focus on how to choose the probabilities so as to make the randomized algorithm converge as fast as possible, and we arrive at solving sparse SDPs. An analysis of the rate of convergence in probability is also provided. For strongly convex functions the new methods converge linearly. We also extend the main algorithm, where we update two (block) components per iteration, to a parallel random coordinate descent algorithm, where we update more than two (block) components per iteration, and we show that for this parallel version the convergence rate depends linearly on the number of (block) components updated. Numerical tests confirm that on large optimization problems with cheap coordinate derivatives the new methods are much more efficient than methods based on the full gradient. This chapter is based on papers [67, 68].

3.1 Introduction

The performance of a network composed of interconnected subsystems can be increased if the traditionally separated subsystems are jointly optimized. Recently, parallel and distributed optimization methods have emerged as a powerful tool for solving large network optimization problems: e.g. resource allocation [3, 34, 11], telecommunications [8, 110], coordination in multi-agent systems [11], estimation in sensor networks [65], distributed control [65], image processing [1], traffic equilibrium problems [8], network flow [8] and other areas [5, 89, 10]. In this chapter we propose efficient distributed algorithms with cheap iterations for solving large separable convex problems with linearly coupled constraints that arise in network applications. For a centralized setup and problems of moderate size there exist many iterative algorithms to solve such problems, e.g. Newton, quasi-Newton or projected gradient methods. However, the problems that we consider in this chapter have the following features: the size of the data is very large, so that usual methods based on whole gradient computations are prohibitive. Moreover, the incomplete structure of information (e.g. the data are distributed over all the nodes of the network, so that at a given time we need to work only with the data available then) may also be an obstacle for

whole gradient computations. In this case, an appropriate way to approach these problems is through (block) coordinate descent methods.

(Block) coordinate descent methods, early variants of which can be traced back to a paper of Schwartz from 1870 [101], have recently become popular in the optimization community due to their low cost per iteration and good scalability properties. Much of this work is motivated by problems in networked systems, largely since such systems are a popular framework with which we can model different problems in a wide range of applications [5, 76, 77, 89, 93, 110, 10]. The main differences between the variants of coordinate descent methods lie in the criterion for choosing, at each iteration, the coordinate over which we minimize the objective function, and in the complexity of this choice. Two classical criteria often used in these algorithms are the cyclic [8] and the greedy coordinate search [107], which differ significantly in the amount of computation required to choose the appropriate index. For cyclic coordinate search, estimates on the rate of convergence were given recently in [6], while for greedy coordinate search (e.g. the Gauss-Southwell rule) the convergence rate is given e.g. in [107]. Another interesting approach is based on a random choice rule, where the coordinate search is random. Recent complexity results on random coordinate descent methods for smooth convex functions were obtained by Nesterov in [76]. The extension to composite functions was given in [93]. However, in most of the previous work the authors considered optimization models where the constraint set is decoupled (i.e. characterized by a Cartesian product).

In this chapter we develop random block coordinate descent methods suited for large optimization problems in networks where the information cannot be gathered centrally, but rather is distributed over the network. Moreover, we focus on optimization problems with linearly coupled constraints (i.e. the constraint set is coupled). Due to the coupling in the constraints we introduce a 2-block variant of the random coordinate descent method, which at each iteration involves the closed form solution of an optimization problem with respect to only 2 block variables, while keeping all the other variables fixed. We prove for the new algorithm an expected convergence rate of order O(1/k) for the function values, where k is the number of iterations. We focus on how to design the probabilities so as to make this algorithm converge as fast as possible, and we prove that this problem can be recast as a sparse SDP. We also show that for functions with cheap coordinate derivatives the new method is faster than schemes based on full gradient information or on greedy coordinate descent. An analysis of the rate of convergence in probability is also provided. For strongly convex functions we prove that the new method converges linearly. We also extend the algorithm to a scheme where we can choose more than 2 (block) components per iteration, and we show that the number of components appears directly in the convergence rate of this algorithm. While the most obvious benefit of randomization is that it can lead to faster algorithms, in worst-case complexity analysis and/or in numerical implementation, there are also other benefits of our algorithm that are at least as important.
For example, the use of randomization leads to a simpler algorithm that is easier to analyze, produces a more robust output and can often be organized to exploit modern computational architectures (e.g. distributed and parallel computer architectures).

The chapter is organized as follows. In Section 3.2 we introduce the optimization model analyzed in this chapter and the main assumptions. In Section 3.4 we present and analyze a random 2-block coordinate descent method for solving our optimization problem. We derive the convergence rate in expectation in Section 3.5, where we also provide means to choose the probability distribution. We also compare in Section 3.6 our algorithm with the full projected gradient method and other existing methods and show that on problems with cheap coordinate derivatives our method has better arithmetic complexity. In Sections 3.7 and 3.8 we analyze the convergence rate for strongly

convex functions and in probability, respectively. In Section 3.10 we extend our algorithm to more than a pair of indexes and analyze the convergence rate of the new scheme.

3.2 Problem formulation

We work in the space R^n composed of column vectors. For x, y ∈ R^n we denote the standard Euclidean inner product ⟨x, y⟩ = Σ_{i=1}^n x_i y_i and the Euclidean norm ‖x‖ = ⟨x, x⟩^{1/2}. We use the same notation ⟨·,·⟩ and ‖·‖ for spaces of different dimension. The inner product on the space of symmetric matrices is denoted by ⟨W_1, W_2⟩ = trace(W_1 W_2) for all symmetric matrices W_1, W_2. We decompose the full space R^{Nn} = Π_{i=1}^N R^n. We also define the corresponding partition of the identity matrix: I_{Nn} = [U_1 ... U_N], where U_i ∈ R^{Nn×n}. Then for any x ∈ R^{Nn} we write x = Σ_i U_i x_i. We denote by e ∈ R^N the vector with all entries 1 and by e_i ∈ R^N the vector with all entries zero except for component i, which is equal to 1. Furthermore, we define U = [I_n ... I_n] = e^T ⊗ I_n ∈ R^{n×Nn} and V_i = [0 ... U_i ... 0] = e_i^T ⊗ U_i ∈ R^{Nn×Nn}, where ⊗ denotes the Kronecker product. Given a vector ν = [ν_1 ... ν_N]^T ∈ R^N, we define the vector ν^p = [ν_1^p ... ν_N^p]^T for any integer p, and diag(ν) denotes the diagonal matrix with the entries ν_i on the diagonal. For a positive semidefinite matrix W ∈ R^{N×N} we consider the following order on its eigenvalues, 0 ≤ λ_1 ≤ λ_2 ≤ ... ≤ λ_N, and the notation ‖x‖_W^2 = x^T W x for any x.

We consider large network optimization problems where each agent in the network is associated with a local variable so that their sum is fixed, and we need to minimize a separable convex objective function:

  f^* = min_{x_i ∈ R^n}  f_1(x_1) + ... + f_N(x_N)
        s.t.:  x_1 + ... + x_N = 0.                                     (3.1)

Optimization problems with linearly coupled constraints (3.1) arise in many areas such as resource allocation in economic systems [34] or distributed computer systems [45], in signal processing [1], in traffic equilibrium and network flow [8] or in distributed control [65]. For problem (3.1) we associate a network composed of several nodes V = {1, ..., N} that can exchange information according to a communication graph G = (V, E), where E denotes the set of edges, i.e. (i, j) ∈ E ⊆ V × V models that node j sends information to node i. We assume that the graph G is undirected and connected. The local information structure imposed by the graph G should be considered as part of the problem formulation. Note that constraints of the form α_1 x_1 + ... + α_N x_N = b, where α_i ∈ R, can be easily handled in our framework by a change of coordinates.

The goal of this chapter is to devise a distributed algorithm that iteratively solves the convex problem (3.1) by passing the estimate of the optimizer only between neighboring nodes. There is great interest in designing such distributed algorithms, since centralized algorithms scale poorly with the number of nodes and are less resilient to failure of the central node.

Let us define the extended subspace S = { x ∈ R^{Nn} : Σ_{i=1}^N x_i = 0 }, whose orthogonal complement is the subspace T = { u ∈ R^{Nn} : u_1 = ... = u_N }. We also use the notation x = [x_1^T ... x_N^T]^T = Σ_{i=1}^N U_i x_i ∈ R^{Nn} and f(x) = f_1(x_1) + ... + f_N(x_N).
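To make the model concrete, consider the toy instance f_i(x_i) = ½‖x_i − a_i‖^2 (so that ∇f_i(x_i) = x_i − a_i and L_i = 1), for which the minimizer of (3.1) is the Euclidean projection of a = [a_1^T ... a_N^T]^T onto the subspace S, namely x_i^* = a_i − (1/N) Σ_j a_j. The snippet below is a minimal illustration added here, not code from the thesis; note that at the optimum the block gradients coincide, which is exactly the optimality condition stated further below.

```python
import numpy as np

# Toy instance of (3.1): f_i(x_i) = 0.5 * ||x_i - a_i||^2, hence grad f_i(x_i) = x_i - a_i
# and L_i = 1.  For this choice the minimizer is the projection of a onto the subspace S.
N, n = 5, 3
a = np.random.default_rng(0).standard_normal((N, n))

x_star = a - a.mean(axis=0)                  # x_i* = a_i - (1/N) * sum_j a_j

print(np.allclose(x_star.sum(axis=0), 0))    # feasibility: the blocks sum to zero
g = x_star - a                               # block gradients at the optimum
print(np.allclose(g, g[0]))                  # they all coincide (and equal -mean(a))
```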

The basic assumption considered in this chapter is the following:

Assumption 3.2.1 We assume that the functions f_i are convex and have Lipschitz continuous gradient, with Lipschitz constants L_i > 0, i.e.:

  ‖∇f_i(x_i) − ∇f_i(y_i)‖ ≤ L_i ‖x_i − y_i‖   ∀ x_i, y_i ∈ R^n, i ∈ V.                  (3.2)

From the Lipschitz property of the gradient (3.2), the following inequality holds (see e.g. [75]):

  f_i(x_i + d_i) ≤ f_i(x_i) + ⟨∇f_i(x_i), d_i⟩ + (L_i/2) ‖d_i‖^2   ∀ x_i, d_i ∈ R^n.     (3.3)

The following inequality, which is central in our derivations below, is a straightforward consequence of (3.3) and holds for all x ∈ R^{Nn} and d_i, d_j ∈ R^n:

  f(x + U_i d_i + U_j d_j) ≤ f(x) + ⟨∇f_i(x_i), d_i⟩ + (L_i/2) ‖d_i‖^2 + ⟨∇f_j(x_j), d_j⟩ + (L_j/2) ‖d_j‖^2.   (3.4)

We denote by X^* the set of optimal solutions of problem (3.1). The optimality conditions for optimization problem (3.1) become: x^* is an optimal solution of the convex problem (3.1) if and only if

  Σ_{i=1}^N x_i^* = 0,   ∇f_i(x_i^*) = ∇f_j(x_j^*)   ∀ i ≠ j ∈ V.

3.3 Previous work

We briefly review some well-known methods from the literature for solving the optimization model (3.1). In [3, 11] distributed weighted gradient methods were proposed to solve a similar problem as in (3.1); in particular, the authors in [11] consider strongly convex functions f_i with positive definite Hessians. These papers propose a class of center-free algorithms (in these papers the term center-free refers to the absence of a coordinator) with the following iteration:

  x_i^{k+1} = x_i^k − Σ_{j ∈ N_i} w_ij ( ∇f_i(x_i^k) − ∇f_j(x_j^k) )   ∀ i ∈ V, k ≥ 0,   (3.5)

where N_i denotes the set of neighbors of node i in the graph G. Under the strong convexity assumption, and provided that the weights w_ij are chosen as the solution of a certain SDP, a linear rate of convergence is obtained. Note however that this method requires at each iteration the computation of the full gradient, and the iteration complexity is O(N(n + n_f)), where n_f is the number of operations for evaluating the gradient of any function f_i, for all i ∈ V.
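To fix ideas, the following is a minimal sketch (written for this text, not taken from [3, 11] or from the thesis) of one iteration of (3.5), assuming symmetric nonnegative weights w_ij = w_ji so that the coupling constraint Σ_i x_i = 0 is preserved; in contrast with the coordinate descent methods developed below, it needs all block gradients at every iteration.

```python
import numpy as np

def center_free_step(x, grads, W, neighbors):
    """One iteration of the center-free weighted gradient method (3.5), as a sketch.

    x         : current feasible iterate, shape (N, n), with x.sum(axis=0) == 0
    grads     : grads[i](x_i) returns the gradient of f_i at x_i
    W         : (N, N) symmetric nonnegative weight matrix, W[i, j] > 0 only on edges
    neighbors : neighbors[i] = list of neighbors of node i in the graph G
    """
    N = x.shape[0]
    g = np.array([grads[i](x[i]) for i in range(N)])  # requires the full gradient
    x_new = x.copy()
    for i in range(N):
        for j in neighbors[i]:
            # symmetric weights make the pairwise terms cancel in the sum over i,
            # so the updated iterate still satisfies x_1 + ... + x_N = 0
            x_new[i] -= W[i, j] * (g[i] - g[j])
    return x_new
```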

In [107] Tseng studied optimization problems with linearly coupled constraints and composite objective functions of the form f + h, where h is a convex nonsmooth function, and developed a block coordinate descent method based on the Gauss-Southwell choice rule. The principal requirement of this method is that at each iteration a subset of indexes I is chosen according to the Gauss-Southwell rule, and then the update direction is a solution of the following QP problem:

  d_H(x; I) = arg min_{s : Σ_{j ∈ I} s_j = 0}  ⟨∇f(x), s⟩ + (1/2) ‖s‖_H^2 + h(x + s),

where H is a positive definite matrix chosen at the initial step of the algorithm. Using this direction and choosing an appropriate step size α_k, the next iterate is defined as x^{k+1} = x^k + α_k d_H(x^k; I_k). The total complexity per iteration of this method is O(Nn + n_f). In [107], the authors proved, for the particular case of a single linear constraint and a nonsmooth part h of the objective function that is piecewise linear and separable, that after k iterations a sublinear convergence rate of order O(NnLR_0^2/k) is attained for the function values, where L = max_{i ∈ V} L_i and R_0 is the Euclidean distance from the starting iterate to the set of optimal solutions. In [5] a 2-coordinate descent method is developed for minimizing a smooth function subject to a single linear equality constraint and additional bound constraints on the decision variables. In the convex case, when all the variables are lower bounded but not upper bounded, the author shows that the sequence of function values converges at a sublinear rate O(NnLR_0^2/k), while the complexity per iteration is at least O(Nn + n_f). A random coordinate descent algorithm for an optimization model with smooth objective function and separable constraints was analyzed by Nesterov in [76], where a complete rate analysis is provided. The main feature of his randomized algorithm is the cheap iteration complexity, of order O(n_f + n + ln N), while still keeping a sublinear rate of convergence. The generalization of this algorithm to composite objective functions has been studied in [89, 93]. However, none of these papers studied the application of random coordinate descent algorithms to smooth convex problems with linearly coupled constraints. In this chapter we develop a random coordinate descent method for this type of optimization model, as described in (3.1).

3.4 Random block coordinate descent method

In this section we devise a randomized block coordinate descent algorithm for solving the separable convex problem (3.1) and analyze its convergence. We present a distributed method where only neighbors need to communicate with each other. At a certain iteration, having a feasible estimate x ∈ S of the optimizer, we choose randomly a pair (i, j) ∈ E with probability p_ij > 0. Since we assume an undirected graph G = (V, E) associated to problem (3.1) (the generalization of the present scheme to directed graphs is straightforward), we consider p_ij = p_ji. We assume that the graph G is connected. For a feasible x ∈ S and a randomly chosen pair of indexes (i, j), with i < j, we define the next feasible iterate x^+ ∈ R^{Nn} as follows:

  x^+ = x + U_i d_i + U_j d_j.

Derivation of the directions d_i and d_j is based on inequality (3.4):

  f(x^+) ≤ f(x) + ⟨∇f_i(x_i), d_i⟩ + (L_i/2) ‖d_i‖^2 + ⟨∇f_j(x_j), d_j⟩ + (L_j/2) ‖d_j‖^2.   (3.6)

Minimizing the right hand side of inequality (3.6), while additionally imposing feasibility of the next iterate x^+ (i.e. we require d_i + d_j = 0), we arrive at the following local minimization problem:

  [d_i^T d_j^T]^T = arg min_{s_i, s_j ∈ R^n : s_i + s_j = 0}  ⟨∇f_i(x_i), s_i⟩ + ⟨∇f_j(x_j), s_j⟩ + (L_i/2) ‖s_i‖^2 + (L_j/2) ‖s_j‖^2,

which has the closed form solution

  d_i = − (1/(L_i + L_j)) ( ∇f_i(x_i) − ∇f_j(x_j) ),   d_j = −d_i.   (3.7)

More information

A DECOMPOSITION PROCEDURE BASED ON APPROXIMATE NEWTON DIRECTIONS

A DECOMPOSITION PROCEDURE BASED ON APPROXIMATE NEWTON DIRECTIONS Working Paper 01 09 Departamento de Estadística y Econometría Statistics and Econometrics Series 06 Universidad Carlos III de Madrid January 2001 Calle Madrid, 126 28903 Getafe (Spain) Fax (34) 91 624

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Definiţie. Pr(X a) - probabilitatea ca X să ia valoarea a ; Pr(a X b) - probabilitatea ca X să ia o valoare în intervalul a,b.

Definiţie. Pr(X a) - probabilitatea ca X să ia valoarea a ; Pr(a X b) - probabilitatea ca X să ia o valoare în intervalul a,b. Variabile aleatoare Definiţie Se numeşte variabilă aleatoare pe un spaţiu fundamental E şi se notează prin X, o funcţie definită pe E cu valori în mulţimea numerelor reale. Unei variabile aleatoare X i

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

Lecture 13: Constrained optimization

Lecture 13: Constrained optimization 2010-12-03 Basic ideas A nonlinearly constrained problem must somehow be converted relaxed into a problem which we can solve (a linear/quadratic or unconstrained problem) We solve a sequence of such problems

More information

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Cristalul cu N atomi = un sistem de N oscilatori de amplitudini mici;

Cristalul cu N atomi = un sistem de N oscilatori de amplitudini mici; Curs 8 Caldura specifica a retelei Cristalul cu N atomi = un sistem de N oscilatori de amplitudini mici; pentru tratarea cuantica, se inlocuieste tratamentul clasic al oscilatorilor cuplati, cu cel cuantic

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Semidefinite Programming Basics and Applications

Semidefinite Programming Basics and Applications Semidefinite Programming Basics and Applications Ray Pörn, principal lecturer Åbo Akademi University Novia University of Applied Sciences Content What is semidefinite programming (SDP)? How to represent

More information

COMPARATIVE DISCUSSION ABOUT THE DETERMINING METHODS OF THE STRESSES IN PLANE SLABS

COMPARATIVE DISCUSSION ABOUT THE DETERMINING METHODS OF THE STRESSES IN PLANE SLABS 74 COMPARATIVE DISCUSSION ABOUT THE DETERMINING METHODS OF THE STRESSES IN PLANE SLABS Codrin PRECUPANU 3, Dan PRECUPANU,, Ștefan OPREA Correspondent Member of Technical Sciences Academy Gh. Asachi Technical

More information

Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications

Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

You should be able to...

You should be able to... Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set

More information

Metode numerice de aproximare. a zerourilor unor operatori. şi de rezolvare a inegalităţilor variaţionale. cu aplicaţii

Metode numerice de aproximare. a zerourilor unor operatori. şi de rezolvare a inegalităţilor variaţionale. cu aplicaţii Facultatea de Matematică şi Informatică Universitatea Babeş-Bolyai Erika Nagy Metode numerice de aproximare a zerourilor unor operatori şi de rezolvare a inegalităţilor variaţionale cu aplicaţii Rezumatul

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

Nonlinear Optimization: What s important?

Nonlinear Optimization: What s important? Nonlinear Optimization: What s important? Julian Hall 10th May 2012 Convexity: convex problems A local minimizer is a global minimizer A solution of f (x) = 0 (stationary point) is a minimizer A global

More information

Numerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems

Numerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems 1 Numerical optimization Alexander & Michael Bronstein, 2006-2009 Michael Bronstein, 2010 tosca.cs.technion.ac.il/book Numerical optimization 048921 Advanced topics in vision Processing and Analysis of

More information

Teoreme de compresie-extensie de tip Krasnoselskii şi aplicaţii (Rezumatul tezei de doctorat)

Teoreme de compresie-extensie de tip Krasnoselskii şi aplicaţii (Rezumatul tezei de doctorat) Teoreme de compresie-extensie de tip Krasnoselskii şi aplicaţii (Rezumatul tezei de doctorat) Sorin Monel Budişan Coordonator ştiinţi c: Prof. dr. Radu Precup Cuprins Introducere 1 1 Generaliz¼ari ale

More information

Optimisation in Higher Dimensions

Optimisation in Higher Dimensions CHAPTER 6 Optimisation in Higher Dimensions Beyond optimisation in 1D, we will study two directions. First, the equivalent in nth dimension, x R n such that f(x ) f(x) for all x R n. Second, constrained

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Rezultate în Teoria Punctului Fix şi Procese Iterative cu Aplicaţii

Rezultate în Teoria Punctului Fix şi Procese Iterative cu Aplicaţii Rezultate în Teoria Punctului Fix şi Procese Iterative cu Aplicaţii Asist. drd. Adrian Sorinel Ghiura Departamentul de Matematică & Informatică Universitatea Politehnica din Bucureşti REZUMATUL TEZEI DE

More information

FORMULELE LUI STIRLING, WALLIS, GAUSS ŞI APLICAŢII

FORMULELE LUI STIRLING, WALLIS, GAUSS ŞI APLICAŢII DIDACTICA MATHEMATICA, Vol. 34), pp. 53 67 FORMULELE LUI STIRLING, WALLIS, GAUSS ŞI APLICAŢII Eugenia Duca, Emilia Copaciu şi Dorel I. Duca Abstract. In this paper are presented the Wallis, Stirling, Gauss

More information

AN APPROACH TO THE NONLINEAR LOCAL PROBLEMS IN MECHANICAL STRUCTURES

AN APPROACH TO THE NONLINEAR LOCAL PROBLEMS IN MECHANICAL STRUCTURES U.P.B. Sci. Bull., Series D, Vol. 74, Iss. 3, 2012 ISSN 1454-2358 AN APPROACH TO THE NONLINEAR LOCAL PROBLEMS IN MECHANICAL STRUCTURES Marius-Alexandru GROZEA 1, Anton HADĂR 2 Acest articol prezintă o

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization

An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization An Inexact Sequential Quadratic Optimization Method for Nonlinear Optimization Frank E. Curtis, Lehigh University involving joint work with Travis Johnson, Northwestern University Daniel P. Robinson, Johns

More information

Pentru clasa a X-a Ştiinţele naturii-sem II

Pentru clasa a X-a Ştiinţele naturii-sem II Pentru clasa a X-a Ştiinţele naturii-sem II Reprezentarea algoritmilor. Pseudocod. Principiile programării structurate. Structuri de bază: structura liniară structura alternativă structura repetitivă Algoritmi

More information

Numerical optimization

Numerical optimization Numerical optimization Lecture 4 Alexander & Michael Bronstein tosca.cs.technion.ac.il/book Numerical geometry of non-rigid shapes Stanford University, Winter 2009 2 Longest Slowest Shortest Minimal Maximal

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

Convex Optimization of Graph Laplacian Eigenvalues

Convex Optimization of Graph Laplacian Eigenvalues Convex Optimization of Graph Laplacian Eigenvalues Stephen Boyd Abstract. We consider the problem of choosing the edge weights of an undirected graph so as to maximize or minimize some function of the

More information

GENERATOARE DE SEMNAL DIGITALE

GENERATOARE DE SEMNAL DIGITALE Technical University of Iasi, Romania Faculty of Electronics and Telecommunications Signals, Circuits and Systems laboratory Prof. Victor Grigoras Cuprins Clasificarea generatoarelor Filtre reursive la

More information

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)

More information

Convex Optimization and l 1 -minimization

Convex Optimization and l 1 -minimization Convex Optimization and l 1 -minimization Sangwoon Yun Computational Sciences Korea Institute for Advanced Study December 11, 2009 2009 NIMS Thematic Winter School Outline I. Convex Optimization II. l

More information

12. Interior-point methods

12. Interior-point methods 12. Interior-point methods Convex Optimization Boyd & Vandenberghe inequality constrained minimization logarithmic barrier function and central path barrier method feasibility and phase I methods complexity

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Parallel Coordinate Optimization

Parallel Coordinate Optimization 1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

SIMULAREA DECIZIEI FINANCIARE

SIMULAREA DECIZIEI FINANCIARE SIMULAREA DECIZIEI FINANCIARE Conf. univ. dr. Nicolae BÂRSAN-PIPU T5.1 TEMA 5 DISTRIBUŢII DISCRETE T5. Cuprins T5.3 5.1 Variabile aleatoare discrete 5. Distribuţia de probabilitate a unei variabile aleatoare

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Gradient Methods Using Momentum and Memory

Gradient Methods Using Momentum and Memory Chapter 3 Gradient Methods Using Momentum and Memory The steepest descent method described in Chapter always steps in the negative gradient direction, which is orthogonal to the boundary of the level set

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Fantope Regularization in Metric Learning

Fantope Regularization in Metric Learning Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction

More information

Fast Linear Iterations for Distributed Averaging 1

Fast Linear Iterations for Distributed Averaging 1 Fast Linear Iterations for Distributed Averaging 1 Lin Xiao Stephen Boyd Information Systems Laboratory, Stanford University Stanford, CA 943-91 lxiao@stanford.edu, boyd@stanford.edu Abstract We consider

More information

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization E5295/5B5749 Convex optimization with engineering applications Lecture 8 Smooth convex unconstrained and equality-constrained minimization A. Forsgren, KTH 1 Lecture 8 Convex optimization 2006/2007 Unconstrained

More information

Constrained optimization: direct methods (cont.)

Constrained optimization: direct methods (cont.) Constrained optimization: direct methods (cont.) Jussi Hakanen Post-doctoral researcher jussi.hakanen@jyu.fi Direct methods Also known as methods of feasible directions Idea in a point x h, generate a

More information

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Coralia Cartis, University of Oxford INFOMM CDT: Modelling, Analysis and Computation of Continuous Real-World Problems Methods

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Lecture 5: September 12

Lecture 5: September 12 10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 12 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Barun Patra and Tyler Vuong Note: LaTeX template courtesy of UC Berkeley EECS

More information

ALGORITMI DE OPTIMIZARE IN INGINERIE ELECTRICA. Sef lucrari ing. Alin-Iulian DOLAN

ALGORITMI DE OPTIMIZARE IN INGINERIE ELECTRICA. Sef lucrari ing. Alin-Iulian DOLAN ALGORITMI DE OPTIMIZARE IN INGINERIE ELECTRICA Sef lucrari ing. Alin-Iulian DOLAN PROBLEME DE OPTIMIZARE OPTIMIZAREA gasirea celei mai bune solutii ale unei probleme, constand in minimizarea (maximizarea)

More information

Acta Technica Napocensis: Civil Engineering & Architecture Vol. 54 No.1 (2011)

Acta Technica Napocensis: Civil Engineering & Architecture Vol. 54 No.1 (2011) 1 Technical University of Cluj-Napoca, Faculty of Civil Engineering. 15 C Daicoviciu Str., 400020, Cluj-Napoca, Romania Received 25 July 2011; Accepted 1 September 2011 The Generalised Beam Theory (GBT)

More information

Sparse Gaussian conditional random fields

Sparse Gaussian conditional random fields Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian

More information

Minimizing the Difference of L 1 and L 2 Norms with Applications

Minimizing the Difference of L 1 and L 2 Norms with Applications 1/36 Minimizing the Difference of L 1 and L 2 Norms with Department of Mathematical Sciences University of Texas Dallas May 31, 2017 Partially supported by NSF DMS 1522786 2/36 Outline 1 A nonconvex approach:

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information