Preconditioning via Diagonal Scaling

Reza Takapoui    Hamid Javadi

June 4, 2014

1 Introduction

Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time. However, for larger problems, as well as stochastic problems, one needs to use first-order methods such as stochastic gradient descent (SGD), the alternating direction method of multipliers (ADMM), and conjugate gradient (CG) in order to attain modest accuracy in a reasonable number of iterations. The condition number of a matrix A, denoted by κ(A), is defined as the ratio of its maximum singular value to its minimum singular value. Both theoretical analysis and practical evidence suggest that the precision and convergence rate of first-order methods can depend significantly on the condition number of the matrices involved in the problem. As an example, the CG algorithm for solving the linear system Ax = b converges faster when the condition number of A is smaller. Hence, it is desirable to decrease the condition number of A by applying a transformation to it; this process is called preconditioning.

A special case of preconditioning is diagonal scaling. Here, we are interested in finding diagonal matrices D and E that minimize the condition number of the matrix Ã = DAE, in order to accelerate first-order methods. For example, when applying CG to solve Ax = b, we can solve Ãx̃ = Db instead and recover x = Ex̃, while taking advantage of the small condition number of Ã.

In our numerical experiments and from theoretical analysis, we have concluded that preconditioning can improve the performance of first-order methods in two different ways. First, it can significantly accelerate the linear algebra operations. For example, in [OCPB13] each step of the algorithm involves running CG, which can be done remarkably faster if appropriate preconditioning is applied. The second effect of preconditioning, which should be distinguished from the first, is to decrease the number of iterations needed to achieve a desired accuracy, by making ADMM follow different intermediate points.

In this report, we first discuss heuristics for diagonal scaling. Next, we motivate preconditioning by an example, and then we study preconditioning for a specific splitting form in ADMM called graph projection splitting. Finally, we examine the performance of our methods on some numerical examples.
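As a concrete illustration of this recovery step (our own sketch, not code from the report), the NumPy snippet below scales a deliberately ill-conditioned matrix with diagonal factors D and E, solves the scaled system Ãx̃ = Db, and maps the solution back via x = Ex̃. The inverse row/column norms used to build D and E here are only a crude stand-in for the equilibration heuristics of Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)) * np.logspace(0, 3, 200)  # badly scaled columns
b = rng.standard_normal(200)

# Crude diagonal scalings: inverse row norms, then inverse column norms
# (a placeholder for the equilibration heuristics discussed in Section 2).
d = 1.0 / np.linalg.norm(A, axis=1)
e = 1.0 / np.linalg.norm(np.diag(d) @ A, axis=0)
D, E = np.diag(d), np.diag(e)

A_tilde = D @ A @ E
print("kappa(A)   =", np.linalg.cond(A))
print("kappa(DAE) =", np.linalg.cond(A_tilde))

# Solve the scaled system and recover the original variable: x = E x_tilde.
x_tilde = np.linalg.solve(A_tilde, D @ b)
x = E @ x_tilde
print("residual ||Ax - b|| =", np.linalg.norm(A @ x - b))
```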

2 Matrix equilibration

Let A ∈ R^{m×n} be given. Matrix equilibration is a heuristic method for finding diagonal matrices D ∈ R^{m×m} and E ∈ R^{n×n} that decrease the condition number of DAE. The basic idea in equilibration is to find D and E such that all columns of DAE have equal ℓ_p norms and all rows of DAE have equal ℓ_p norms. (The parameter p ≥ 1 can be selected in our algorithm.) We show that this can be formulated as a convex optimization problem. Consider the problem

    minimize    ∑_{i,j} |A_{ij}|^p e^{x_i} e^{y_j}
    subject to  ∑_i x_i = 0,  ∑_j y_j = 0,

with variables x ∈ R^m and y ∈ R^n. The optimality conditions for this problem can be shown to be equivalent to equilibration of DAE for D = diag(e^x) and E = diag(e^y). Although matrix equilibration is a convex problem, it is computationally favorable to solve it with heuristic iterative methods rather than interior point methods. There are several methods for matrix equilibration; here we mention a few well-known ones and refer interested readers to [Bra10] for a thorough discussion.

The Sinkhorn-Knopp equilibration algorithm was originally designed to convert matrices with nonnegative entries to doubly stochastic matrices [Sin64]. However, with a slight modification, it can be used to perform equilibration in any ℓ_p norm. The Ruiz equilibration algorithm was originally proposed to equilibrate square matrices [Rui01]; here we make a small modification to it so that A can be rectangular. Matrix-free algorithms are methods that obtain information about a matrix only through matrix-vector products; they are useful when a matrix is represented as an operator with no access to its entries [Bra10].

Algorithm 1 (Sinkhorn-Knopp)
  Initialize d_1 = 1_m, d_2 = 1_n.
  while r_1 > ε_1 or r_2 > ε_2 do
      (d_1)_i := (A d_2)_i^{-1}
      (d_2)_j := (A^T d_1)_j^{-1}
      B := diag(d_1) A diag(d_2)
      r_1 := max_i ‖B_{i:}‖ / min_i ‖B_{i:}‖,   r_2 := max_j ‖B_{:j}‖ / min_j ‖B_{:j}‖
  end while
  return D = diag(d_1), E = diag(d_2)

Algorithm 2 (Ruiz)
  Initialize d_1 = 1_m, d_2 = 1_n, B = A.
  while r_1 > ε_1 or r_2 > ε_2 do
      (d_1)_i := (d_1)_i ‖B_{i:}‖_p^{-1/2}
      (d_2)_j := (d_2)_j (m/n)^{1/(2p)} ‖B_{:j}‖_p^{-1/2}
      B := diag(d_1) A diag(d_2)
      r_1 := max_i ‖B_{i:}‖ / min_i ‖B_{i:}‖,   r_2 := max_j ‖B_{:j}‖ / min_j ‖B_{:j}‖
  end while
  return D = diag(d_1), E = diag(d_2)

Although these algorithms work extremely well in practice, there is no theoretical guarantee that they decrease the condition number. In fact, in our numerical experiments we observed that in some cases they may slightly increase the condition number.
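As a rough companion to Algorithm 2, the following Python sketch implements our reading of the rectangular Ruiz iteration for a chosen ℓ_p norm; the function name, default tolerance, iteration cap, and the small test matrix are ours, not part of the report.

```python
import numpy as np

def ruiz_equilibrate(A, p=2, tol=1e-3, max_iter=100):
    """Ruiz-style equilibration sketch: returns diagonal vectors d1, d2 so that
    diag(d1) @ A @ diag(d2) has (approximately) equal lp row norms and equal
    lp column norms.  This is our reading of Algorithm 2, not reference code."""
    m, n = A.shape
    d1, d2 = np.ones(m), np.ones(n)
    for _ in range(max_iter):
        B = d1[:, None] * A * d2[None, :]        # B = diag(d1) A diag(d2)
        row = np.linalg.norm(B, ord=p, axis=1)
        col = np.linalg.norm(B, ord=p, axis=0)
        if row.max() / row.min() < 1 + tol and col.max() / col.min() < 1 + tol:
            break
        d1 *= row ** -0.5
        d2 *= (m / n) ** (1.0 / (2 * p)) * col ** -0.5
    return d1, d2

A = np.random.default_rng(1).standard_normal((30, 20)) * np.logspace(0, 2, 20)
d1, d2 = ruiz_equilibrate(A)
B = np.diag(d1) @ A @ np.diag(d2)
print("condition numbers:", np.linalg.cond(A), "->", np.linalg.cond(B))
```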

3 Example

In this section we study one simple example to motivate diagonal scaling for ADMM. Consider the consensus problem

    minimize    f(x) + g(z)
    subject to  x = z,                                            (1)

with variables x, z ∈ R^n, where f(x) = (1/2)‖Ax − b‖_2^2 with A ∈ R^{m×n}, b ∈ R^m, and g : R^n → R ∪ {+∞} is a convex, closed, and proper function. This includes several popular problems such as the lasso, the support vector machine, and nonnegative least squares. Instead of solving (1) directly, we can use ADMM to solve the equivalent problem

    minimize    f(x) + g(z)
    subject to  Fx = Fz,                                          (2)

where F ∈ R^{n×n} is invertible. The augmented Lagrangian for this problem is

    L_F(x, z, y) = f(x) + g(z) + (Fy)^T (x − z) + (1/2)‖x − z‖_{F^2}^2,    (3)

where ‖v‖_{F^2}^2 denotes v^T F^2 v. Defining Fy as the new dual variable, we see that (3) is identical to the augmented Lagrangian for (1), except that a different norm is used to augment the Lagrangian. Notice that taking F = √ρ I is equivalent to scaling the step size in ADMM by ρ. Hence, choosing the best matrix F (the one that results in the fastest convergence) is at least as difficult as finding the optimal step size ρ in [BPC+10]. A question of interest here is the following: by expanding the class of matrices F from multiples of the identity to diagonal matrices, how much can we accelerate ADMM? To answer this question, we note that ADMM can be shown to achieve linear convergence in this case, and (an upper bound on) the convergence rate is governed by κ(F(A^T A)^{-1} F^T). Hence, choosing F such that κ(F(A^T A)^{-1} F^T) is smaller can hopefully result in faster convergence of ADMM.

To illustrate this effect, we used ADMM to solve the lasso problem with and without the matrix F. We generated a random matrix A of size m = 7500 and n = 2500, and used ADMM to solve (2) in three different cases, with the stopping tolerance set to 10^{-4}. First, we ran ADMM directly on (1), without F; it took 7416 iterations for the algorithm to converge. Second, we took F = √ρ I, where ρ = σ_min(A) σ_max(A) optimizes the upper bound on the convergence rate of the algorithm (ρ ≈ 7.2 in this example); in this case it took 1019 iterations for the algorithm to converge. Finally, we used a diagonal F which equilibrates (A^T A)^{-1}; in this last case it took only 8 iterations for the algorithm to converge. This shows the important effect of preconditioning on the convergence rate of ADMM for the lasso problem.
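The quantity κ(F(A^T A)^{-1} F^T) is easy to probe numerically. The sketch below is our own illustration, on a much smaller random instance than the experiment above: it compares F = I with a diagonal F built from the diagonal of (A^T A)^{-1}, a Jacobi-style stand-in for a full equilibration of (A^T A)^{-1} rather than the exact choice used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 300, 100                                           # small stand-in for 7500 x 2500
A = rng.standard_normal((m, n)) * np.logspace(0, 2, n)    # badly scaled columns
M = np.linalg.inv(A.T @ A)                                # matrix whose conditioning drives the bound

def bound(F):
    """Condition number kappa(F (A^T A)^{-1} F^T) appearing in the rate bound."""
    return np.linalg.cond(F @ M @ F.T)

print("F = I      :", bound(np.eye(n)))

# Diagonal F that roughly equilibrates M: Jacobi-style scaling F_ii = M_ii^{-1/2}.
F_diag = np.diag(np.diag(M) ** -0.5)
print("F diagonal :", bound(F_diag))
```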

4 Graph projection splitting

Consider the following problem, which is in canonical graph form [PB13]:

    minimize    f(y) + g(x)
    subject to  Ax = y.                                           (4)

Graph projection splitting [PB13] is a form of ADMM for solving this problem serially. The essential idea is to define z = (x, y) and φ(z) = f(y) + g(x), and to solve

    minimize    φ(z) + I_{Ax' = y'}(z')
    subject to  z = z'

instead. Here we notice that (4) is equivalent to

    minimize    f(D^{-1} ỹ) + g(E x̃)
    subject to  ỹ = DAE x̃,

where x̃ = E^{-1} x and ỹ = D y, for diagonal D ∈ R^{m×m} and E ∈ R^{n×n} with positive diagonal entries. Using graph projection splitting for this problem is equivalent to running ADMM on

    minimize    φ(z) + I_{Ax' = y'}(z')
    subject to  [E^{-1} 0; 0 D] z = [E^{-1} 0; 0 D] z'.

The second step of ADMM is to project onto the graph subspace G = {(x̃, ỹ) : DAE x̃ = ỹ}, which is the linear transformation defined by

    Π_G(x̃, ỹ) = [I, (DAE)^T; DAE, −I]^{-1} [I, (DAE)^T; 0, 0] (x̃, ỹ).

In the next section we will discuss how we can choose D and E to accelerate ADMM in graph projection splitting form. Notice that for fixed DAE (even when D and E change at each step), the projection matrix remains the same, and hence no additional factorization caching is required to evaluate the projection operator. We will return to this fact in the next section.
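For concreteness, the following sketch (our own, assuming a dense Ã and using SciPy's Cholesky helpers; class and variable names are ours) evaluates Π_G through the equivalent formula x = (I + Ã^T Ã)^{-1}(c + Ã^T d), y = Ãx, caching the factorization of I + Ã^T Ã so that repeated projections reuse the same factor, which is exactly the point made above about fixed DAE.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

class GraphProjector:
    """Projection onto G = {(x, y) : A_tilde x = y}, with the factorization of
    I + A_tilde^T A_tilde cached so repeated projections are cheap.
    A sketch of the operator Pi_G above, not the authors' implementation."""

    def __init__(self, A_tilde):
        self.A = A_tilde
        n = A_tilde.shape[1]
        self.chol = cho_factor(np.eye(n) + A_tilde.T @ A_tilde)

    def __call__(self, c, d):
        # x = (I + A^T A)^{-1} (c + A^T d),  y = A x
        x = cho_solve(self.chol, c + self.A.T @ d)
        return x, self.A @ x

# Because DAE stays fixed when the product alpha*beta is fixed (Section 5),
# the same cached factorization serves every iteration even if D and E change.
rng = np.random.default_rng(3)
A_tilde = rng.standard_normal((40, 25))
proj = GraphProjector(A_tilde)
x, y = proj(rng.standard_normal(25), rng.standard_normal(40))
print(np.linalg.norm(A_tilde @ x - y))   # ~1e-14: the projected point lies on the graph
```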

5 Parameter selection

In this section we propose methods for choosing the matrices D and E to speed up the graph projection splitting algorithm. We argue that following the steps below can help us achieve a desired accuracy in a smaller number of iterations.

- Using the algorithms discussed in Section 2, we choose D̂ and Ê such that D̂AÊ is equilibrated. This will usually reduce the condition number of A. As a result, the singular values of D̂AÊ will be closer to each other, and intuitively the matrix becomes more isotropic. We notice that defining D = αD̂ and E = βÊ for any α, β ∈ R_{++}, the matrix DAE remains equilibrated. We are left with two degrees of freedom: the product αβ can be chosen to scale DAE, and the ratio α/β can play the role of the step size in ADMM.

- The product αβ is chosen to set ‖DAE‖. Let γ = ‖D̂AÊ‖; then ‖DAE‖ = αβγ. Since DAE is likely to have a small condition number, the norms of ỹ and x̃ are related by a factor close to ‖DAE‖. This scale should be chosen appropriately for every problem; in most examples, setting ‖DAE‖ = 1 gives reasonably good results. Of course, this depends on the functions f and g. We inspect this effect in the next section (a small sketch of this selection rule appears at the end of Section 6.1).

- Note that we can change the ratio α/β at each step while DAE remains constant. Doing so allows us to change ρ adaptively to balance the norms of the dual and primal residuals. More importantly, this procedure is costless: it requires no additional factorization caching in direct methods, and in indirect methods a warm start can be used.

6 Numerical results

In this section we inspect the effect of the parameters described in Section 5 on the speed of the graph projection splitting algorithm, for two different problems.

6.1 Lasso

Consider the lasso problem

    minimize    (1/2)‖Ax − b‖_2^2 + λ‖x‖_1,

where A ∈ R^{m×n}, b ∈ R^m, and λ ∈ R are the problem data and x ∈ R^n is the decision variable. We can write this problem in form (4) with f(y) = (1/2)‖y − b‖_2^2 and g(x) = λ‖x‖_1. We generate instances of the lasso problem with m = 750 and n = 250, and plot the average number of iterations required versus the scaling ‖DAE‖ and the step size ρ in Figure 1. The relative tolerance was set to 10^{-4} as the stopping criterion.

Figure 1: Average number of iterations for graph projection splitting on the lasso problem.
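The two knobs swept in Figure 1, the scale ‖DAE‖ and the step size ρ, can be produced from equilibrated diagonals D̂, Ê as in the following sketch. This is our reading of Section 5: taking ‖·‖ to be the spectral norm and splitting αβ via the ratio α/β = ρ are assumptions, not formulas stated in the report, and ruiz_equilibrate refers to the hypothetical helper sketched after Section 2.

```python
import numpy as np

def scaled_diagonals(A, d1_hat, d2_hat, target_norm=1.0, rho=1.0):
    """Choose alpha, beta so that ||D A E|| = target_norm (with D = alpha*diag(d1_hat),
    E = beta*diag(d2_hat)) and the ratio alpha/beta plays the role of the step size rho.
    Our interpretation of Section 5; the spectral norm and the rho split are assumptions."""
    gamma = np.linalg.norm(np.diag(d1_hat) @ A @ np.diag(d2_hat), 2)
    ab = target_norm / gamma              # alpha * beta
    alpha = np.sqrt(ab * rho)             # so that alpha / beta = rho
    beta = np.sqrt(ab / rho)
    return alpha * d1_hat, beta * d2_hat

# Usage with the equilibration sketch from Section 2 (hypothetical helper name):
#   d1_hat, d2_hat = ruiz_equilibrate(A)
#   d1, d2 = scaled_diagonals(A, d1_hat, d2_hat, target_norm=1.0, rho=2.0)
```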

6.2 Linear program (LP)

Consider the problem

    minimize    c^T x
    subject to  Ax ≤ b,

where c ∈ R^n, b ∈ R^m, and A ∈ R^{m×n} are the problem data and x ∈ R^n is the decision variable. This can be written in form (4) with f(y) = I_{{y ≤ b}}(y) and g(x) = c^T x; a small sketch of the corresponding proximal steps is given at the end of this section. We generate several instances of the problem with m = 750 and n = 250, and plot the number of iterations required, the relative objective suboptimality, and the relative constraint error versus the scaling ‖DAE‖ and the step size ρ. The relative tolerance was set to 10^{-4} as the stopping criterion.

Figure 2: Graph projection splitting on an LP with inequality constraints: (a) average number of iterations; (b) relative suboptimality; (c) relative constraint error.

We see that after matrix equilibration, the two remaining degrees of freedom need to be set appropriately to achieve a desired accuracy in a smaller number of iterations. Setting α and β such that ‖DAE‖ = 1 and then choosing an appropriate step size usually results in fast convergence.
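As referenced above, here is a small sketch of the two proximal operations the LP contributes in graph form (4); the step size t and the function names are ours.

```python
import numpy as np

# Proximal/projection steps for the LP written in graph form (4):
#   f(y) = I_{y <= b}(y)  ->  prox is the componentwise projection onto {y : y <= b}
#   g(x) = c^T x          ->  prox_{t g}(v) = v - t c

def prox_f(v, b):
    return np.minimum(v, b)          # project v onto {y : y <= b}

def prox_g(v, c, t):
    return v - t * c                 # minimizes c^T x + (1/(2t)) ||x - v||^2

b = np.array([1.0, 2.0, 0.5])
c = np.array([1.0, -1.0])
print(prox_f(np.array([2.0, 1.5, 0.3]), b))    # -> [1.  1.5 0.3]
print(prox_g(np.array([0.0, 0.0]), c, t=0.1))  # -> [-0.1  0.1]
```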

References

[BPC+10] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2010.

[Bra10] A. M. Bradley. Algorithms for the equilibration of matrices and their application to limited-memory quasi-Newton methods. PhD thesis, Institute for Computational and Mathematical Engineering, Stanford University, 2010.

[OCPB13] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Operator splitting for conic optimization via homogeneous self-dual embedding. arXiv preprint arXiv:1312.3039, 2013.

[PB13] N. Parikh and S. Boyd. Graph projection block splitting for distributed optimization. Mathematical Programming Computation, 2013.

[Rui01] D. Ruiz. A scaling algorithm to equilibrate both rows and columns norms in matrices. Technical Report RAL-TR-2001-034, Rutherford Appleton Laboratory, Oxon, UK, 2001.

[Sin64] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876-879, 1964.