Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints

Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints. By I. Necoara, Y. Nesterov, and F. Glineur. Presented by Lijun Xu, Optimization Group Meeting, November 27, 2012.

Outline: Introduction; Randomized Block (i,j) Coordinate Descent Method; the RCD Method in the Strongly Convex Case; Random Pairs Sampling; Extensions; Numerical Experiment.

Introduction. The coordinate descent method: at each iteration, update only one (block of) coordinate(s) of the decision variable. Q: How to choose the coordinate? (a) Cyclic: convergence is difficult to prove. (b) Maximal descent (greedy): a convergence rate is easy to obtain, but it is in general worse than that of the simple gradient method. (c) Random: faster, simpler, robust, well suited to distributed and parallel implementation, etc.
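As a point of reference for option (c), a minimal Python sketch of a generic coordinate descent loop with random coordinate selection; the function names and the quadratic test objective are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def random_coordinate_descent(grad, L, x0, iters=5000, rng=None):
    """Minimize a smooth convex f by a gradient step on one random coordinate per iteration.

    grad(x) returns the full gradient (only the chosen entry is used here),
    L[i] is a Lipschitz constant of the i-th partial derivative.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(iters):
        i = rng.integers(x.size)          # random coordinate
        x[i] -= grad(x)[i] / L[i]         # coordinate gradient step with step size 1/L_i
    return x

# Toy usage on an assumed test problem f(x) = 0.5 * ||A x - b||^2
A = np.random.randn(20, 5)
b = np.random.randn(20)
grad = lambda x: A.T @ (A @ x - b)
L = np.sum(A**2, axis=0)                  # per-coordinate Lipschitz constants (diagonal of A^T A)
x_hat = random_coordinate_descent(grad, L, np.zeros(5))
```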

Introduction. Randomized (block) coordinate descent methods: (a) the first analysis of this method, applied to minimizing a smooth convex function, was performed by Nesterov (2010) [1]; (b) the extension to composite functions was given by Richtárik and Takáč (2011) [2]. [1] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, CORE Discussion Paper, 2010. [2] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.

Problem formulation. Minimize a separable convex objective function subject to linearly coupled constraints. Extension to problems with a non-separable objective function and general linear constraints.

Motivation of the formulation. Applications: resource allocation in economic systems, distributed computer systems, traffic equilibrium problems, network flows, etc. The dual problem corresponding to the minimization of a sum of convex functions. Finding a point in the intersection of convex sets.

Notations. Problem (2.1) becomes
$$\min_{x \in \mathbb{R}^{Nn}} f(x) \quad \text{s.t.} \quad Ux := \sum_{i=1}^{N} x_i = 0.$$
KKT conditions: $Ux^* = 0$ and $\nabla f(x^*) = U^T \lambda^*$, i.e.
$$\big(\nabla f_1(x_1^*)^T, \ldots, \nabla f_N(x_N^*)^T\big)^T = \big(\lambda^{*\,T}, \ldots, \lambda^{*\,T}\big)^T,$$
so that $\nabla f_i(x_i^*) = \nabla f_j(x_j^*)$ for all $i, j \in [1:N]$.

Notations. Consider the subspace $S = \{x \in \mathbb{R}^{Nn} : \sum_{i=1}^{N} x_i = 0\}$ (the feasible set $Ux = 0$) and its orthogonal complement. Define the extended norm induced by a matrix $G \succeq 0$, $\|x\|_G = (x^T G x)^{1/2}$, together with a corresponding dual norm $\|\cdot\|_G^*$ for the gradients; they satisfy the Cauchy-Schwarz inequality $\langle y, x \rangle \le \|y\|_G^* \, \|x\|_G$.

Notations. Partition of the identity matrix: $U_i \in \mathbb{R}^{Nn \times n}$ is the block column with $I_{n \times n}$ in the $i$-th block entry and $0_{n \times n}$ elsewhere, so that
$$x = \sum_{i=1}^{N} U_i x_i, \quad x_i \in \mathbb{R}^n, \qquad f\Big(\sum_{i=1}^{N} U_i \alpha_i\Big) = \sum_{i=1}^{N} f_i(\alpha_i), \quad \alpha_i \in \mathbb{R}^n.$$
An update takes the form
$$x^+ = x + d = \sum_{i=1}^{N} U_i (x_i + d_i), \qquad f(x^+) = \sum_{i=1}^{N} f_i(x_i + d_i), \qquad x_i, d_i \in \mathbb{R}^n.$$

Basic assumptions. All $f_i$ are convex and their gradients $\nabla f_i$ are Lipschitz continuous with Lipschitz constants $L_i > 0$, i.e. $\|\nabla f_i(x_i + d_i) - \nabla f_i(x_i)\| \le L_i \|d_i\|$. The graph $(V, E)$ is undirected and connected, with $N$ nodes $V = \{1, \ldots, N\}$; the pairs $(i, j) \in E$ are used as the chosen coordinates.
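For reference, the standard consequence of this blockwise Lipschitz assumption (the descent lemma), which provides the quadratic upper bound minimized on the following slides:
$$f_i(x_i + d_i) \le f_i(x_i) + \langle \nabla f_i(x_i), d_i \rangle + \frac{L_i}{2}\|d_i\|^2 \qquad \text{for all } x_i, d_i \in \mathbb{R}^n.$$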

Randomized Block (i,j) Coordinate Descent Method. Recall
$$\min_{x \in \mathbb{R}^{Nn}} f(x) = f_1(x_1) + \cdots + f_N(x_N) \quad \text{s.t.} \quad x_1 + \cdots + x_N = 0.$$
Choose randomly a pair $(i, j) \in E$ with probability $p_{ij} \,(= p_{ji}) > 0$ and define the corresponding pair update.

Randomized Block (i,j) Coordinate Descent Method. Consider an update $x^+ = x + U_i d_i + U_j d_j$; feasibility of $x^+$ requires $d_i + d_j = 0$. Minimizing the right-hand side of the quadratic upper bound subject to this feasibility condition gives the pair directions and the following decrease in $f$.
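Carrying out the minimization referred to above (a sketch under the quadratic upper bound from the Lipschitz assumption, in the notation of the earlier slides): substituting $d_j = -d_i$ into
$$\min_{d_i + d_j = 0} \; \langle \nabla f_i(x_i), d_i \rangle + \langle \nabla f_j(x_j), d_j \rangle + \tfrac{L_i}{2}\|d_i\|^2 + \tfrac{L_j}{2}\|d_j\|^2$$
yields the closed-form directions
$$d_i = -\frac{1}{L_i + L_j}\big(\nabla f_i(x_i) - \nabla f_j(x_j)\big), \qquad d_j = -d_i,$$
and hence the decrease
$$f(x) - f(x^+) \ge \frac{\|\nabla f_i(x_i) - \nabla f_j(x_j)\|^2}{2(L_i + L_j)}.$$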

Randomized Block (i,j) Coordinate Descent Method. Each iteration computes only the two block gradients $\nabla f_i(x_i)$ and $\nabla f_j(x_j)$, whereas full gradient methods need the entire gradient. The iterate $x^{k+1}$ depends on the random pairs chosen so far; define the expected value $\phi_k = \mathbb{E}[f(x^k)]$.
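A minimal Python sketch of one way to implement the pair update above; the function signatures, the edge list, and the sampling of pairs are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rcd_pair_descent(grads, L, x0, edges, probs, iters=1000, rng=None):
    """Randomized (i,j) block coordinate descent for min sum_i f_i(x_i) s.t. sum_i x_i = 0.

    grads[i](xi) returns the gradient of f_i at the block xi (an n-vector),
    L[i] is its Lipschitz constant, edges/probs describe the pair distribution,
    and x0 is a feasible list of N blocks (summing to zero).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = [xi.copy() for xi in x0]
    for _ in range(iters):
        k = rng.choice(len(edges), p=probs)      # sample a pair (i, j) with probability p_ij
        i, j = edges[k]
        g = grads[i](x[i]) - grads[j](x[j])      # only two block gradients are needed
        d = -g / (L[i] + L[j])                   # closed-form pair direction
        x[i] += d                                # d_i = d
        x[j] -= d                                # d_j = -d, so sum_i x_i stays 0
    return x
```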

Randomized Block (i,j) Coordinate Descent Method. Key inequality: the expected decrease $f(x) - \mathbb{E}[f(x^+) \mid x]$ is bounded from below by a quadratic form in the gradient built from the matrices
$$G_{ij} = (e_i - e_j)(e_i - e_j)^T \otimes I_n \in \mathbb{R}^{Nn \times Nn},$$
where $e_i$ denotes the $i$-th column of $I_N$.

Randomized Block (i,j) Coordinate Descent Method. Introduce the distance $R(x^0)$, which measures the size of the level set of $f$ determined by $x^0$. Convergence result: the expected gap $\mathbb{E}[f(x^k)] - f^*$ decreases at a sublinear rate of order $O(R^2(x^0)/k)$.

Proof (Randomized Block (i,j) Coordinate Descent Method). By convexity and the key inequality, obtain a recursion for the objective gap; taking expectations (denoting $\phi_k = \mathbb{E}[f(x^k)]$) and solving the recursion yields the stated rate.

Design of the probability. Possible choices: uniform probabilities; probabilities dependent on the Lipschitz constants; and probabilities designed to optimize the convergence bound, since the bound depends on the choice of $p_{ij}$.
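A small Python helper sketching the first two choices; the uniform choice is simply $1/|E|$, while the proportional-to-$(L_i + L_j)$ rule below is an assumption used for illustration, not necessarily the formula used in the paper.

```python
import numpy as np

def uniform_probs(edges):
    """Uniform distribution over the edge set E: p_ij = 1/|E|."""
    return np.full(len(edges), 1.0 / len(edges))

def lipschitz_probs(edges, L):
    """An assumed Lipschitz-dependent choice (not necessarily the paper's formula):
    p_ij proportional to L_i + L_j, normalized over E."""
    w = np.array([L[i] + L[j] for (i, j) in edges], dtype=float)
    return w / w.sum()
```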

Design of the probability. Recall the convergence rate. Idea: search for the probability distribution $p$ that optimizes this rate, i.e., minimize the constant in the bound over $p$, where the level set is assumed bounded blockwise by constants $R_i$.

Design of the probability. Using a relaxation based on semidefinite programming, where $R^2 = (R_1^2, \ldots, R_N^2)^T$ and the remaining variables are the multipliers of the Lagrangian relaxation.

Design of the probability Note Convergence rate under designed probability

Comparison with the full gradient method. Consider a particular case: (a) a complete graph; (b) probabilities given by an $N \times N$ matrix whose entries are built from the Lipschitz constants $L_1, \ldots, L_N$. Upper bound (BCD method):

Comparison with the full gradient method. The full gradient method satisfies a similar bound; comparing the two gives the bounds labelled (full) and (random).

Strongly Convex Case. Assume $f$ is strongly convex with respect to the chosen norm with a given convexity parameter. Minimizing the strong convexity inequality over $x$ and combining it with the key inequality gives:
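For orientation, strong convexity typically upgrades the sublinear bound to a linear one of the generic form below; the constant $c \in (0, 1)$ depends on the convexity parameter, the Lipschitz constants and the probabilities, and this displays only the usual shape of such bounds under the stated assumptions:
$$\mathbb{E}\big[f(x^k)\big] - f^* \le (1 - c)^k \big(f(x^0) - f^*\big).$$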

Strongly Convex Case Similarly, choose the optimal probability by solving the following SDP:

Rate of convergence in probability. The proof uses reasoning similar to Theorem 1 in [14] and follows from the Markov inequality. [14] P. Richtárik and M. Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, submitted to Mathematical Programming, 2011.

Rate of convergence in probability

Random pairs sampling. The $(\mathrm{RCD})_{(i,j)}$ method needs to choose a pair of coordinates at each iteration, so we need a fast procedure for generating random pairs. Given the probability distribution $p_{ij}$, re-index the pairs $(i, j)$ into a vector of indices and divide $[0, 1]$ into $n_p$ subintervals, where $n_p = |E|$. Remark:

Random pairs sampling. Clearly, the width of interval $l$ equals the probability $p_{i_l j_l}$. Sampling algorithm description:
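A minimal Python sketch of the interval-based sampling procedure described here: build the cumulative sums of the pair probabilities once, then map a uniform draw in $[0,1]$ to a pair by locating its subinterval; the class and method names are illustrative assumptions.

```python
import numpy as np

class PairSampler:
    """Sample pairs (i_l, j_l) with probabilities p_l by inverting the CDF.

    [0, 1] is divided into n_p = |E| subintervals; the width of subinterval l
    equals p_l, so a uniform draw lands in it with probability p_l.
    """
    def __init__(self, edges, probs, rng=None):
        self.edges = list(edges)
        self.cdf = np.cumsum(probs)              # right endpoints of the subintervals
        self.rng = np.random.default_rng() if rng is None else rng

    def draw(self):
        u = self.rng.random()                    # uniform in [0, 1)
        l = int(np.searchsorted(self.cdf, u, side="right"))
        return self.edges[min(l, len(self.edges) - 1)]
```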

Generalizations. Extension of $(\mathrm{RCD})_{(i,j)}$ to more than one pair, $(\mathrm{RCD})_M$; the same rate of convergence as in the previous sections is obtained.

Generalizations. Extension of $(\mathrm{RCD})_{(i,j)}$ to non-separable objective functions with general equality constraints, where $f$ has a component-wise Lipschitz continuous gradient:

Generalizations. The pair directions $(s_i, s_j)$ are obtained from an arg-min subject to the coupling constraint $A_i s_i + A_j s_j = 0$, assuming:
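One plausible reading of this update (an assumption for illustration, not a quote of the original) is a constrained quadratic model of the same type as before, now with general coupling matrices $A_i$ and an assumed pairwise Lipschitz-type constant $L_{ij}$:
$$(s_i, s_j) \in \arg\min_{A_i s_i + A_j s_j = 0} \; \langle \nabla_i f(x), s_i \rangle + \langle \nabla_j f(x), s_j \rangle + \tfrac{L_{ij}}{2}\big(\|s_i\|^2 + \|s_j\|^2\big).$$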

Generalizations. A similar convergence rate holds, and the probabilities can be chosen in a similar way:

Numerical experiment: the Google problem. Goal:

Thank you!