Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints
By I. Necoara, Y. Nesterov, and F. Glineur
Presented by Lijun Xu, Optimization Group Meeting, November 27, 2012
Outline
- Introduction
- Randomized Block (i,j) Coordinate Descent Method
- RCD Method in the Strongly Convex Case
- Random Pairs Sampling
- Extensions
- Numerical Experiment
Introduction
Coordinate Descent Method: consider \min_x f(x), updating one (block of) coordinate(s) at each iteration.
Q: How to choose the coordinate?
a) Cyclic: convergence is difficult to prove.
b) Maximal descent (greedy): the convergence rate is easy to establish, but in general no better than the simple gradient method, and finding the best coordinate is expensive.
c) Random: faster, simpler, robust, suited to distributed and parallel implementations, etc.
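A minimal runnable sketch of variant (c), not from the slides: randomized coordinate descent on an illustrative convex quadratic f(x) = 0.5 x^T A x - b^T x (the matrix A, the vector b, and the step rule are assumptions for the demo).

import numpy as np

# Illustrative randomized coordinate descent on a convex quadratic
# f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite.
rng = np.random.default_rng(0)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # well-conditioned SPD matrix (assumed data)
b = rng.standard_normal(n)
L = np.diag(A).copy()            # coordinate-wise Lipschitz constants A_ii

x = np.zeros(n)
for _ in range(5000):
    i = rng.integers(n)          # choose a coordinate uniformly at random
    g_i = A[i] @ x - b[i]        # i-th partial derivative of f
    x[i] -= g_i / L[i]           # exact minimization along coordinate i

print(np.linalg.norm(A @ x - b)) # residual ~ 0: x approaches the minimizer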
Introduction
Randomized (block) coordinate descent methods:
a) The first analysis of this method, when applied to the problem of minimizing a smooth convex function, was performed by Nesterov (2010) [1].
b) The extension to composite functions was given by Richtárik and Takáč (2011) [2].
[1] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, CORE Discussion Paper, 2010.
[2] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.
Problem Formulation
Minimize a separable convex objective function subject to linearly coupled constraints.
Extension to problems with a non-separable objective function and general linear constraints.
Motivation of the Formulation
- Applications: resource allocation in economic systems, distributed computer systems, traffic equilibrium problems, network flows, etc.
- The dual problem corresponding to the minimization of a sum of convex functions.
- Finding a point in the intersection of convex sets.
Notations
Problem (2.1) becomes
  \min_{x \in \mathbb{R}^{Nn}} f(x) \quad \text{s.t.} \quad Ux = 0, \qquad \text{i.e.} \quad \sum_{i=1}^N x_i = 0.
KKT conditions: Ux^* = 0 and \nabla f(x^*) = U^T \lambda^*, i.e.
  \big(\nabla f_1(x_1^*)^T, \ldots, \nabla f_N(x_N^*)^T\big)^T = \big((\lambda^*)^T, \ldots, (\lambda^*)^T\big)^T,
hence \nabla f_i(x_i^*) = \nabla f_j(x_j^*) for all i, j \in [1:N].
Notations
Consider the subspace S = \{x \in \mathbb{R}^{Nn} : \sum_{i=1}^N x_i = 0\} (the feasible set) and its orthogonal complement T = \{x \in \mathbb{R}^{Nn} : x_1 = \cdots = x_N\}.
Define the extended norm induced by a positive definite matrix G, \|x\|_G = (x^T G x)^{1/2}, and the corresponding dual norm \|y\|_G^* = (y^T G^{-1} y)^{1/2} (for the gradients).
They satisfy the Cauchy–Schwarz inequality \langle y, x \rangle \le \|y\|_G^* \, \|x\|_G.
Notations
Partition of the identity matrix: I_{Nn} = [U_1 \ \cdots \ U_N], where
  U_i = (0_{n \times n}, \ldots, I_{n \times n}, \ldots, 0_{n \times n})^T \quad \text{(} I_n \text{ in the } i\text{-th block entry)}.
Then every x \in \mathbb{R}^{Nn} writes as x = \sum_{i=1}^N U_i x_i with x_i \in \mathbb{R}^n, and
  f\Big(\sum_{i=1}^N U_i \alpha_i\Big) = \sum_{i=1}^N f_i(\alpha_i), \qquad \alpha_i \in \mathbb{R}^n.
An update x^+ = x + d = \sum_{i=1}^N U_i (x_i + d_i) therefore gives
  f(x^+) = \sum_{i=1}^N f_i(x_i + d_i), \qquad x_i, d_i \in \mathbb{R}^n.
Basic Assumptions
All f_i are convex, and the gradients \nabla f_i are Lipschitz continuous with constants L_i > 0, i.e.
  \|\nabla f_i(x_i + d_i) - \nabla f_i(x_i)\| \le L_i \|d_i\| \quad \text{for all } x_i, d_i \in \mathbb{R}^n.
The graph (V, E) is undirected and connected, with N nodes V = \{1, \ldots, N\}; the pairs (i, j) \in E are the coordinate pairs the method may choose.
Randomized Block (i,j) Coordinate Descent Method
Recall
  \min_{x \in \mathbb{R}^{Nn}} f(x) = f_1(x_1) + \cdots + f_N(x_N) \quad \text{s.t.} \quad x_1 + \cdots + x_N = 0.
Choose randomly a pair (i, j) \in E with probability p_{ij} (= p_{ji}) > 0.
Randomized Block (i,j) Coordinate Descent Method
Update only blocks i and j; feasibility of x^+ then requires d_i + d_j = 0. The Lipschitz gradients give the upper bound
  f(x + d) \le f(x) + \langle \nabla f_i(x_i), d_i \rangle + \langle \nabla f_j(x_j), d_j \rangle + \frac{L_i}{2}\|d_i\|^2 + \frac{L_j}{2}\|d_j\|^2.
Minimizing the right-hand side subject to feasibility (d_j = -d_i) gives
  d_i = -\frac{1}{L_i + L_j}\big(\nabla f_i(x_i) - \nabla f_j(x_j)\big), \qquad d_j = -d_i,
and the following decrease in f:
  f(x^+) \le f(x) - \frac{1}{2(L_i + L_j)}\big\|\nabla f_i(x_i) - \nabla f_j(x_j)\big\|^2.
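A minimal sketch of one (i,j)-RCD iteration, not from the slides: scalar blocks (n = 1), a complete graph with uniform pair sampling, and illustrative quadratics f_i(x_i) = 0.5 a_i (x_i - c_i)^2, so that L_i = a_i (all data and names are assumptions for the demo).

import numpy as np

# (i,j)-RCD sketch for min sum_i f_i(x_i) s.t. sum_i x_i = 0, scalar blocks.
rng = np.random.default_rng(1)
N = 8
a = rng.uniform(1.0, 5.0, N)     # curvatures of the f_i, so L_i = a_i
c = rng.standard_normal(N)

def grad(i, xi):
    return a[i] * (xi - c[i])    # gradient of f_i at x_i

x = np.zeros(N)                  # feasible start: sum(x) = 0
for _ in range(2000):
    i, j = rng.choice(N, size=2, replace=False)  # uniform random pair
    d = -(grad(i, x[i]) - grad(j, x[j])) / (a[i] + a[j])
    x[i] += d                    # d_i = -d_j preserves sum(x) = 0
    x[j] -= d

print(abs(x.sum()))              # feasibility kept up to round-off
print(np.ptp(a * (x - c)))       # block gradients equalize (KKT) -> ~0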
Randomized Block (i,j) Coordinate Descent Method
Each iteration computes only the two block gradients \nabla f_i(x_i) and \nabla f_j(x_j); full gradient methods need all N block gradients \nabla f_1(x_1), \ldots, \nabla f_N(x_N).
The iterate x^{k+1} depends on the random pairs chosen so far, \eta^k = \{(i_0, j_0), \ldots, (i_k, j_k)\}. Define the expected value \phi_k = E_{\eta^{k-1}}[f(x^k)].
Randomized Block (i,j) Coordinate Descent Method
Key inequality: taking the expectation over the random pair (i, j),
  E[f(x^+) \mid x] \le f(x) - \frac{1}{2} \nabla f(x)^T G \nabla f(x),
where
  G = \sum_{(i,j) \in E} \frac{p_{ij}}{L_i + L_j} G_{ij}, \qquad G_{ij} = \big((e_i - e_j)(e_i - e_j)^T\big) \otimes I_n \in \mathbb{R}^{Nn \times Nn},
and e_i denotes the i-th standard basis vector of \mathbb{R}^N.
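A quick numeric check, not from the slides, of the identity behind the key inequality, g^T G_{ij} g = \|g_i - g_j\|^2 (the sizes and indices are arbitrary):

import numpy as np

# Verify g^T G_ij g = ||g_i - g_j||^2 for G_ij = (e_i - e_j)(e_i - e_j)^T (x) I_n.
rng = np.random.default_rng(2)
N, n, i, j = 5, 3, 1, 4
e = np.eye(N)
G_ij = np.kron(np.outer(e[i] - e[j], e[i] - e[j]), np.eye(n))
g = rng.standard_normal(N * n)   # stacked block gradient (g_1, ..., g_N)
lhs = g @ G_ij @ g
rhs = np.sum((g[i*n:(i+1)*n] - g[j*n:(j+1)*n]) ** 2)
print(np.isclose(lhs, rhs))      # True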
Randomized Block (i,j) Coordinate Descent Method
Introduce the distance
  R(x^0) = \max_{x : f(x) \le f(x^0)} \min_{x^* \in X^*} \|x - x^*\|,
which measures the size of the level set of f determined by x^0.
Convergence result: the expected objective gap \phi_k - f^* decreases at a sublinear rate O(R^2(x^0)/k), with the constant governed by the matrix G (see the proof sketch on the next slide).
Proof (Randomized Block (i,j) Coordinate Descent Method)
Convexity and the Cauchy–Schwarz inequality give
  f(x) - f^* \le \langle \nabla f(x), x - x^* \rangle \le R(x^0) \, \|\nabla f(x)\|^*,
and the key inequality gives
  E[f(x^+) \mid x] \le f(x) - \frac{1}{2} \nabla f(x)^T G \nabla f(x).
Combining the two and taking expectations (denoting \Delta_k = \phi_k - f^*), we obtain a recursion of the form
  \Delta_{k+1} \le \Delta_k - c\,\Delta_k^2 \quad \text{for some constant } c > 0.
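The O(1/k) rate then follows from the standard treatment of this recursion; a sketch (standard argument, not copied from the slides, with c the constant above):

% From \Delta_{k+1} \le \Delta_k - c\,\Delta_k^2 and \Delta_{k+1} \le \Delta_k:
\frac{1}{\Delta_{k+1}} - \frac{1}{\Delta_k}
  = \frac{\Delta_k - \Delta_{k+1}}{\Delta_k \Delta_{k+1}}
  \ge \frac{c\,\Delta_k^2}{\Delta_k \Delta_{k+1}}
  = \frac{c\,\Delta_k}{\Delta_{k+1}} \ge c .
% Summing over the first k iterations: 1/\Delta_k \ge 1/\Delta_0 + c k, hence
\Delta_k \le \frac{\Delta_0}{1 + c\,k\,\Delta_0} \le \frac{1}{c\,k}.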
Design of the Probabilities
Uniform probabilities: p_{ij} = 1/|E| for all (i, j) \in E.
Probabilities dependent on the Lipschitz constants, e.g. p_{ij} \propto (L_i + L_j)^\alpha, normalized over E.
One can also design the probabilities to optimize the convergence rate, since the rate depends on p only through the matrix G.
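A small sketch, not from the slides, of the first two choices over the edge set of a complete graph (N, the L_i, and the exponent alpha are illustrative assumptions):

import itertools

# Two pair-probability designs over the edges of a complete graph.
N, alpha = 4, 1.0
L = [1.0, 2.0, 4.0, 8.0]                       # Lipschitz constants L_i
E = list(itertools.combinations(range(N), 2))  # edges (i, j) with i < j

p_uniform = {e: 1.0 / len(E) for e in E}       # p_ij = 1/|E|

w = {(i, j): (L[i] + L[j]) ** alpha for (i, j) in E}
total = sum(w.values())
p_lipschitz = {e: w[e] / total for e in E}     # p_ij ~ (L_i + L_j)^alpha

print(p_uniform)
print(p_lipschitz)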
Design of the Probabilities
Recall the convergence rate: \phi_k - f^* = O(R^2(x^0)/k), with a constant that depends on the probabilities only through G.
Idea: search over p = (p_{ij})_{(i,j) \in E} to optimize this constant, i.e. find the largest c such that
  \nabla f(x)^T G \nabla f(x) \ge c \, \|\nabla f(x)\|^2 \quad \text{for all } x \text{ in the level set}.
Design of the Probabilities
The resulting design problem is handled using a relaxation from semidefinite programming,
where R^2 = R_1^2 + \cdots + R_N^2 and the remaining parameters are the multipliers of the Lagrangian relaxation.
Design of the Probabilities
Note: the convergence rate under the designed probabilities is obtained by substituting the optimal p^* into the bound above.
Comparison with the Full Gradient Method
Consider a particular case:
a) a complete graph;
b) uniform probabilities p_{ij} = \frac{2}{N(N-1)} and equal Lipschitz constants L_i = L, so that
  G = \frac{1}{N(N-1)L}\,\big(N I_N - \mathbf{1}_N \mathbf{1}_N^T\big) \otimes I_n
(diagonal entries (N-1)/(N(N-1)L), off-diagonal entries -1/(N(N-1)L), tensored with I_n).
Upper bound for the block coordinate descent method: of order N L R^2(x^0)/k.
Comparison with the Full Gradient Method
For the full gradient method a similar analysis gives a bound of the same O(R^2(x^0)/k) form, but each iteration computes all N block gradients (full), whereas (i,j)-RCD computes only two (random). Measured per block-gradient evaluation, the randomized bound is comparable to the full gradient bound.
Strongly Convex Case
Assume f is strongly convex with convexity parameter \sigma (w.r.t. the extended norm):
  f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma}{2}\|y - x\|^2.
Minimizing both sides over the feasible set gives f^* \ge f(x) - \frac{1}{2\sigma}\|\nabla f(x)\|^{*\,2}; combined with the key inequality, this yields a linear rate of convergence.
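A sketch of how strong convexity upgrades the key inequality to a linear rate (standard argument, not copied from the slides; c denotes the constant supplied by the key inequality):

% Strong convexity, after minimizing over the feasible set:
f(x) - f^* \le \frac{1}{2\sigma}\,\|\nabla f(x)\|_*^2 .
% If the key inequality gives E[f(x^+) \mid x] \le f(x) - \tfrac{c}{2}\|\nabla f(x)\|_*^2,
% then subtracting f^* and combining the two bounds:
E[f(x^+) \mid x] - f^* \le (1 - c\,\sigma)\,\big(f(x) - f^*\big),
% so, taking full expectations, \phi_k - f^* \le (1 - c\sigma)^k \big(f(x^0) - f^*\big).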
Strongly Convex Case
Similarly, the optimal probabilities are chosen by solving a corresponding SDP.
Rate of Convergence in Probability
The proof uses reasoning similar to Theorem 1 in [14] and is derived from Markov's inequality.
[14] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.
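The Markov-inequality step is standard; a sketch (C denotes the constant from the expected-value rate, \rho the confidence level):

% Markov's inequality for the nonnegative variable f(x^k) - f^*:
P\big(f(x^k) - f^* \ge \epsilon\big) \le \frac{E[f(x^k)] - f^*}{\epsilon} = \frac{\phi_k - f^*}{\epsilon}.
% With the sublinear bound \phi_k - f^* \le C/k, any k \ge C/(\epsilon\rho) gives
P\big(f(x^k) - f^* \le \epsilon\big) \ge 1 - \rho.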
Random Pairs Sampling
The (RCD)_{(i,j)} method must choose a pair of coordinates at each iteration, so we need a fast procedure for generating random pairs.
Given the probability distribution (p_{ij})_{(i,j) \in E}, reindex the pairs (i, j) into a vector of indices and divide [0, 1] into n_p = |E| subintervals, one per pair.
Random Pairs Sampling
Clearly, the width of the l-th subinterval equals the probability p_{i_l j_l} of the l-th pair.
Sampling algorithm: draw u uniformly in [0, 1] and return the pair whose subinterval contains u.
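A minimal sketch of this sampling procedure, not from the slides: precompute the right endpoints of the subintervals, then binary-search each uniform draw (the edge list and probabilities are illustrative).

import bisect
import itertools
import random

# Sample pairs (i, j) by partitioning [0, 1] into |E| subintervals of width p_ij.
E = list(itertools.combinations(range(4), 2))  # illustrative edge list
p = [1.0 / len(E)] * len(E)                    # illustrative probabilities

cum = list(itertools.accumulate(p))            # right endpoints of subintervals

def sample_pair():
    u = random.random()                        # uniform draw in [0, 1)
    l = bisect.bisect_right(cum, u)            # subinterval containing u
    return E[min(l, len(E) - 1)]               # guard against round-off at 1.0

print([sample_pair() for _ in range(5)])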
Generalizations
Extension of (RCD)_{(i,j)} to updating more than one pair per iteration, (RCD)_M. The same rate of convergence is obtained as in the previous sections.
Generalizations
Extension of (RCD)_{(i,j)} to non-separable objective functions with general linear equality constraints.
f has component-wise Lipschitz continuous gradient:
  \|\nabla_i f(x + U_i s_i) - \nabla_i f(x)\| \le L_i \|s_i\| \quad \text{for all } s_i.
Generalizations
The update directions are obtained as
  (s_i, s_j) = \arg\min \big\{\text{local quadratic upper bound}\big\} \quad \text{s.t.} \quad A_i s_i + A_j s_j = 0,
assuming a feasible starting point, so that the condition A_i s_i + A_j s_j = 0 keeps every iterate feasible.
Generalizations
A similar convergence rate holds, and the probabilities can be designed in a similar way.
Google Problem (Numerical Experiment)
Goal: compute the PageRank vector of the Google matrix E, i.e. find x \ge 0 with Ex = x and \sum_i x_i = 1, a problem with a linearly coupled constraint.
Thank you!