Subgradient methods for huge-scale optimization problems


Yurii Nesterov, CORE/INMA (UCL)
May 24, 2012 (Edinburgh, Scotland)

Outline
1. Problem sizes
2. Sparse optimization problems
3. Sparse updates for linear operators
4. Fast updates in computational trees
5. Simple subgradient methods
6. Application examples
7. Computational experiments: the Google problem

Nonlinear optimization: problem sizes

    Class         Operations   Dimension        Iteration cost   Memory
    Small-size    All          10^0 - 10^2      n^4 -> n^3       Kilobyte: 10^3
    Medium-size   A^{-1}       10^3 - 10^4      n^3 -> n^2       Megabyte: 10^6
    Large-scale   Ax           10^5 - 10^7      n^2 -> n         Gigabyte: 10^9
    Huge-scale    x + y        10^8 - 10^12     n -> log n       Terabyte: 10^12

Sources of huge-scale problems:
- Internet (new)
- Telecommunications (new)
- Finite-element schemes (old)
- PDE, weather prediction (old)

Main hope: sparsity.

Sparse problems

Problem: min f(x), x ∈ Q, where Q is closed and convex in R^N, and f(x) = Ψ(Ax), where Ψ is a simple convex function:

    Ψ(y₁) ≥ Ψ(y₂) + ⟨Ψ'(y₂), y₁ − y₂⟩,   y₁, y₂ ∈ R^M,

and A : R^N → R^M is a sparse matrix.

Let p(x) := number of nonzeros in x. Sparsity coefficient: γ(A) := p(A)/(MN).

Example 1: matrix-vector multiplication. Computation of the vector Ax needs p(A) operations, so the initial complexity MN is reduced by the factor γ(A).
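As a concrete illustration (my own sketch, not from the talk; the matrix here is random), the sparsity coefficient and the cost of a sparse product can be checked in a few lines of Python, assuming scipy is available:

```python
import numpy as np
import scipy.sparse as sp

# Illustrative random sparse matrix A in R^{M x N} with ~0.1% nonzeros.
M, N = 10_000, 20_000
A = sp.random(M, N, density=1e-3, format="csr", random_state=0)

p_A = A.nnz                    # p(A): number of nonzeros
gamma_A = p_A / (M * N)        # sparsity coefficient gamma(A) = p(A)/(MN)

x = np.random.randn(N)
y = A @ x                      # costs O(p(A)) operations instead of O(MN)

print(f"p(A) = {p_A}, gamma(A) = {gamma_A:.2e}")
```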

Example: gradient method

    x₀ ∈ Q,   x_{k+1} = π_Q(x_k − h f'(x_k)),   k ≥ 0.

Main computational expenses:
- Projection onto the simple set Q needs O(N) operations.
- The displacement x_k → x_k − h f'(x_k) needs O(N) operations.
- f'(x) = Aᵀ Ψ'(Ax): if Ψ is simple, the main effort is spent on two matrix-vector multiplications, i.e. 2 p(A) operations.

Conclusion: compared with full matrices, we accelerate by the factor γ(A).

Note: for large- and huge-scale problems we often have γ(A) ≈ 10⁻⁴ … 10⁻⁶. Can we get more?
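As a hedged illustration (my own example, not the talk's code), take Ψ(y) = ‖y − b‖₁ and Q = R^N₊; then f'(x) = Aᵀ sign(Ax − b), and one projected subgradient step costs essentially two sparse matrix-vector products:

```python
import numpy as np
import scipy.sparse as sp

def projected_subgradient_step(A, b, x, h):
    """One step x+ = pi_Q(x - h f'(x)) for f(x) = ||Ax - b||_1, Q = R^N_+.

    The dominant cost is the two sparse products A @ x and A.T @ s,
    i.e. about 2 p(A) operations.
    """
    r = A @ x - b                      # residual, O(p(A))
    g = A.T @ np.sign(r)               # subgradient f'(x) = A^T sign(Ax - b), O(p(A))
    return np.maximum(x - h * g, 0.0)  # projection onto the nonnegative orthant, O(N)

# tiny illustrative run
A = sp.random(50, 100, density=0.05, format="csr", random_state=1)
b = np.random.randn(50)
x = np.zeros(100)
for k in range(10):
    x = projected_subgradient_step(A, b, x, h=0.1)
```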

Sparse updating strategy

Main idea: after the update x₊ = x + d we have y₊ := Ax₊ = Ax + Ad = y + Ad.

What happens if d is sparse? Denote σ(d) = {j : d^(j) ≠ 0}. Then

    y₊ = y + Σ_{j ∈ σ(d)} d^(j) A e_j.

Its complexity, κ_A(d) := Σ_{j ∈ σ(d)} p(A e_j), can be VERY small! Indeed,

    κ_A(d) = M Σ_{j ∈ σ(d)} γ(A e_j) ≤ M p(d) max_{1≤j≤N} γ(A e_j) = γ(d) · max_{1≤j≤N} γ(A e_j) · MN.

If γ(d) ≤ c γ(A) and γ(A e_j) ≤ c γ(A), then κ_A(d) ≤ c² γ²(A) · MN.

Expected acceleration: (10⁻⁶)² = 10⁻¹², i.e. 1 sec instead of 32,000 years!
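A minimal sketch of this sparse update, assuming A is stored in CSC format so that the columns Ae_j are directly accessible (function and variable names are mine, not the talk's):

```python
import numpy as np
import scipy.sparse as sp

def sparse_update(A_csc, y, d_idx, d_val):
    """Update y = A x in place after the step x+ = x + d,
    where d is given by its nonzero indices/values.

    Cost: kappa_A(d) = sum over j in sigma(d) of p(A e_j),
    instead of a full product costing p(A).
    """
    for j, dj in zip(d_idx, d_val):
        start, end = A_csc.indptr[j], A_csc.indptr[j + 1]
        rows = A_csc.indices[start:end]     # nonzero rows of column A e_j
        vals = A_csc.data[start:end]
        y[rows] += dj * vals                # y+ = y + d^(j) * A e_j
    return y

# usage sketch: a step d with two nonzero entries touches only two columns
A = sp.random(1000, 800, density=2e-3, format="csc", random_state=0)
x = np.zeros(800)
y = A @ x                                   # maintained product y = A x
sparse_update(A, y, d_idx=[3, 17], d_val=[0.5, -1.0])
```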

When can it work?

Simple methods: no full-vector operations! (Is it possible?)
Simple problems: functions with sparse gradients.

Let us try:
1. Quadratic function f(x) = ½⟨Ax, x⟩ − ⟨b, x⟩. Its gradient f'(x) = Ax − b, x ∈ R^N, is not sparse even if A is sparse.
2. Piece-wise linear function g(x) = max_{1≤i≤m} [⟨a_i, x⟩ − b^(i)]. Its subgradient g'(x) = a_{i(x)}, with i(x) : g(x) = ⟨a_{i(x)}, x⟩ − b^(i(x)), can be sparse if a_{i(x)} is sparse!

But: we need a fast procedure for updating max-type operations.

Fast updates in short computational trees

Def: a function f(x), x ∈ R^n, is short-tree representable if it can be computed by a short binary tree of height ≈ ln n.

Let n = 2^k and let the tree have k + 1 levels, with v_{0,i} = x^(i), i = 1, …, n. Each level is half the size of the previous one:

    v_{i+1,j} = ψ_{i+1,j}(v_{i,2j−1}, v_{i,2j}),   j = 1, …, 2^{k−i−1},   i = 0, …, k − 1,

where the ψ_{i,j} are some bivariate functions.

(Diagram: binary tree with leaves v_{0,1}, …, v_{0,n} and root v_{k,1}.)

Main advantages

Important examples (symmetric functions):

    f(x) = ‖x‖_p, p ≥ 1,               ψ_{i,j}(t₁, t₂) = [|t₁|^p + |t₂|^p]^{1/p},
    f(x) = ln( Σ_{i=1}^n e^{x^(i)} ),   ψ_{i,j}(t₁, t₂) = ln(e^{t₁} + e^{t₂}),
    f(x) = max_{1≤i≤n} x^(i),           ψ_{i,j}(t₁, t₂) = max{t₁, t₂}.

- The binary tree requires only n − 1 auxiliary cells.
- Computing its value needs n − 1 applications of ψ_{i,j}(·,·) (≈ n operations).
- If x₊ differs from x in one entry only, then re-computing f(x₊) needs only k = log₂ n operations.

Thus, we can have pure subgradient minimization schemes with sublinear iteration cost (see the sketch below).
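As a concrete illustration (my own sketch, not part of the talk), here is such a short computational tree in Python for ψ(t₁, t₂) = max{t₁, t₂}; the same heap layout works for the p-norm and log-sum-exp recursions listed above:

```python
class MaxTree:
    """Binary max-tree over n = 2**k leaves.

    Full evaluation costs O(n); changing one leaf costs O(log n),
    mirroring the 'short computational tree' idea of the slides.
    """
    def __init__(self, values):
        n = len(values)
        assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
        self.n = n
        # heap layout: internal nodes 1..n-1, leaves at n..2n-1 (cell 0 unused)
        self.node = [float("-inf")] * (2 * n)
        self.node[n:] = list(values)
        for i in range(n - 1, 0, -1):
            self.node[i] = max(self.node[2 * i], self.node[2 * i + 1])

    def value(self):
        return self.node[1]              # root = max over all leaves

    def update(self, j, new_value):
        """Change leaf j and re-compute only the path to the root."""
        i = self.n + j
        self.node[i] = new_value
        i //= 2
        while i >= 1:                    # log2(n) re-computations
            self.node[i] = max(self.node[2 * i], self.node[2 * i + 1])
            i //= 2
```

For example, `tree = MaxTree(x)`, then `tree.update(j, xj_new)` followed by `tree.value()` refreshes the maximum along a single root path, without any full-vector work.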

Simple subgradient methods

I. Problem: f* := min_{x ∈ Q} f(x), where Q is closed and convex, ‖f'(x)‖ ≤ L(f) for all x ∈ Q, and the optimal value f* is known.

Consider the following optimization scheme (B. Polyak, 1967):

    x₀ ∈ Q,   x_{k+1} = π_Q( x_k − [(f(x_k) − f*)/‖f'(x_k)‖²] f'(x_k) ),   k ≥ 0.

Denote f*_k = min_{0≤i≤k} f(x_i). Then for any k ≥ 0 we have:

    f*_k − f* ≤ L(f) ‖x₀ − π_{X*}(x₀)‖ / (k + 1)^{1/2},
    ‖x_k − x*‖ ≤ ‖x₀ − x*‖,   for all x* ∈ X*.
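A minimal Python sketch of this scheme (the callables f, fprime, project and their names are assumed interfaces of mine, not the talk's code):

```python
import numpy as np

def polyak_subgradient(f, fprime, project, x0, f_star, iters):
    """Polyak's method x_{k+1} = pi_Q(x_k - (f(x_k)-f*)/||f'(x_k)||^2 * f'(x_k)),
    for a problem with known optimal value f_star."""
    x = np.asarray(x0, dtype=float)
    f_rec = np.inf                      # record value f*_k
    for _ in range(iters):
        fx, g = f(x), fprime(x)
        f_rec = min(f_rec, fx)
        gnorm2 = float(np.dot(g, g))
        if fx - f_star <= 0 or gnorm2 == 0:
            break                       # already optimal (up to numerics)
        x = project(x - (fx - f_star) / gnorm2 * g)
    return x, f_rec
```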

Proof: Let us fix x* ∈ X*. Denote r_k(x*) = ‖x_k − x*‖. Then

    r²_{k+1}(x*) ≤ ‖ x_k − [(f(x_k) − f*)/‖f'(x_k)‖²] f'(x_k) − x* ‖²
                = r²_k(x*) − 2 [(f(x_k) − f*)/‖f'(x_k)‖²] ⟨f'(x_k), x_k − x*⟩ + (f(x_k) − f*)²/‖f'(x_k)‖²
                ≤ r²_k(x*) − (f(x_k) − f*)²/‖f'(x_k)‖²  ≤  r²_k(x*) − (f*_k − f*)²/L²(f).

From this reasoning, ‖x_{k+1} − x*‖² ≤ ‖x_k − x*‖² for all x* ∈ X*.

Corollary: Assume X* has a recession direction d. Then

    ‖x_k − π_{X*}(x₀)‖ ≤ ‖x₀ − π_{X*}(x₀)‖,   ⟨d, x_k⟩ ≥ ⟨d, x₀⟩.

(Proof: consider x* = π_{X*}(x₀) + αd, α ≥ 0.)

Constrained minimization (N. Shor (1964) & B. Polyak)

II. Problem: min { f(x) : g(x) ≤ 0, x ∈ Q }, where Q is closed and convex, and f, g have uniformly bounded subgradients.

Consider the following method with step-size parameter h > 0:

    If g(x_k) > h ‖g'(x_k)‖, then (A): x_{k+1} = π_Q( x_k − [g(x_k)/‖g'(x_k)‖²] g'(x_k) ),
    else (B): x_{k+1} = π_Q( x_k − [h/‖f'(x_k)‖] f'(x_k) ).

Let F_k ⊆ {0, …, k} be the set of (B)-iterations, and f*_k = min_{i ∈ F_k} f(x_i).

Theorem: If k > ‖x₀ − x*‖²/h², then F_k ≠ ∅ and

    f*_k − f* ≤ h L(f),   max_{i ∈ F_k} g(x_i) ≤ h L(g).
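A hedged Python sketch of this switching scheme (again with assumed callable interfaces, not the talk's code): step (A) reduces infeasibility of g, step (B) is a normalized subgradient step on f.

```python
import numpy as np

def switching_subgradient(f, fp, g, gp, project, x0, h, iters):
    """Shor/Polyak switching scheme with step-size parameter h > 0."""
    x = np.asarray(x0, dtype=float)
    f_rec = np.inf                         # record over the (B)-iterations F_k
    for _ in range(iters):
        gx, gg = g(x), gp(x)
        if gx > h * np.linalg.norm(gg):    # (A): infeasibility step
            x = project(x - gx / np.dot(gg, gg) * gg)
        else:                              # (B): objective step
            fx, fg = f(x), fp(x)
            f_rec = min(f_rec, fx)
            x = project(x - h / np.linalg.norm(fg) * fg)
    return x, f_rec
```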

Computational strategies

1. Constants L(f), L(g) are known (e.g. Linear Programming). We can take h = ε / max{L(f), L(g)}. Then we only need to decide on the number of steps N (easy!). Note: the standard advice is h = R/(N + 1)^{1/2} (much more difficult!).
2. Constants L(f), L(g) are not known. Start from a guess. Restart from scratch each time we see the guess is wrong. The guess is doubled after each restart.
3. Tracking the record value f*_k. Double run. Other ideas are welcome!

Random sparse coordinate methods

III. Problem: min_{x ≥ 0} { f(x) := max_{1≤i≤M} [ ℓ_i(x) ≡ ⟨a_i, x⟩ − b^(i) ] }.

Define i(x) : f(x) = ℓ_{i(x)}(x), and a random variable ξ(x) which returns indexes from σ(a_{i(x)}) with equal probabilities 1/p(a_{i(x)}).

Assuming that f* is known, we can now define a random vector variable Next(x) by the following rules:
1. Compute h(x) = (f(x) − f*)/‖f'(x)‖². Generate j(x) = ξ(x).
2. Define [Next(x)]^{(j(x))} = ( x^{(j(x))} − h(x) a_{i(x)}^{(j(x))} )₊.
3. For the other indexes j ≠ j(x), define [Next(x)]^{(j)} = x^{(j)}.

Algorithmic scheme

0. Choose x₀ ≥ 0. Compute u₀ = Ax₀ − b and f(x₀).
1. k-th iteration (k ≥ 0):
   a) Generate j_k = ξ(x_k) and update x_{k+1} = Next(x_k).
   b) Update u_{k+1} = u_k + ( x^{(j_k)}_{k+1} − x^{(j_k)}_k ) A e_{j_k}, re-computing in parallel the value of f(x_{k+1}).

This method generates a sequence of discrete random variables {x_k}. Denote f*_k = min_{0≤i≤k} f(x_i).

Theorem: Let p(a_i) ≤ r, i = 1, …, M. Then, for any k ≥ 0 we have:

    (E[f*_k − f*])² ≤ r L²(f) ‖x₀ − π_{X*}(x₀)‖² / (k + 1),
    E(‖x_k − x*‖²) ≤ ‖x₀ − x*‖²,   x* ∈ X*.

NB: One iteration needs at most max_{1≤j≤N} p(A e_j) · log₂ M operations (a sketch follows below).
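The following Python sketch mirrors this scheme for f(x) = max_i [⟨a_i, x⟩ − b^(i)] over x ≥ 0 (my own illustrative code, not the talk's; for clarity it re-reads max(u) as a full-vector operation, whereas the talk would maintain it in a max-tree at cost p(Ae_j) · log₂ M per iteration):

```python
import numpy as np
import scipy.sparse as sp

def random_coordinate_polyak(A, b, f_star, x0, iters, seed=None):
    """Random sparse coordinate scheme for f(x) = max_i (<a_i, x> - b_i), x >= 0,
    with known optimal value f_star. Maintains u = A x - b incrementally."""
    rng = np.random.default_rng(seed)
    A = sp.csr_matrix(A)
    A_csc = A.tocsc()                        # column access for the u-update
    x = np.asarray(x0, dtype=float)
    u = A @ x - b
    for _ in range(iters):
        i = int(np.argmax(u))                # active row i(x)
        row = A.getrow(i)
        if row.nnz == 0 or u[i] - f_star <= 0:
            break
        h = (u[i] - f_star) / (row.data @ row.data)   # Polyak-type step length
        j = int(rng.choice(row.indices))     # j(x) = xi(x): random nonzero of a_i
        x_new = max(0.0, x[j] - h * A[i, j]) # coordinate step with projection
        delta = x_new - x[j]
        x[j] = x_new
        start, end = A_csc.indptr[j], A_csc.indptr[j + 1]
        u[A_csc.indices[start:end]] += delta * A_csc.data[start:end]  # u += delta * A e_j
    return x, float(u.max())
```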

Application examples

Observations:
1. Very often, large- and huge-scale problems have repetitive sparsity patterns and/or limited connectivity:
   - social networks;
   - mobile phone networks;
   - truss topology design (local bars);
   - finite-element models (2D: four neighbors, 3D: six neighbors).
2. For p-diagonal matrices, κ(A) ≤ p².

Google problem

Goal: rank the agents in the society by their social weights.
Unknown: x_i ≥ 0 — the social influence of agent i = 1, …, N.
Known: σ_i — the set of friends of agent i.

Hypothesis:
- Agent i shares his support among all his friends in equal parts.
- The influence of agent i is equal to the total support obtained from his friends.

Mathematical formulation: quadratic problem

Let E ∈ R^{N×N} be the incidence matrix of the connection graph. Denote e = (1, …, 1)ᵀ ∈ R^N and Ē = E · diag(Eᵀe)^{-1}. Since Ēᵀe = e, this matrix is stochastic.

Problem: find x* ≥ 0, x* ≠ 0, such that Ēx* = x*.

The size is very big!

Known technique: regularization + fixed point (Google founders, B. Polyak & coauthors, etc.).

N09: solve it by a random CD-method applied to

    ½ ‖Ēx − x‖² + (γ/2) [⟨e, x⟩ − 1]²,   γ > 0.

Main drawback: no interpretation for the objective function!
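A small sketch of the column normalization Ē = E diag(Eᵀe)^{-1} (assuming a nonnegative sparse E with no empty columns; the function name is mine):

```python
import numpy as np
import scipy.sparse as sp

def make_stochastic(E):
    """Build E_bar = E diag(E^T e)^{-1}: divide each column of E by its column sum.

    Then e^T E_bar = e^T, i.e. E_bar is column-stochastic, so rho(E_bar) = 1.
    """
    E = sp.csc_matrix(E, dtype=float)
    col_sums = np.asarray(E.sum(axis=0)).ravel()   # (E^T e)_j
    return E @ sp.diags(1.0 / col_sums)            # E diag(E^T e)^{-1}
```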

Nonsmooth formulation of the Google problem

Main property of the spectral radius (A ≥ 0): if A ∈ R^{n×n}₊, then

    ρ(A) = min_{x > 0} max_{1≤i≤n} (1/x^(i)) ⟨e_i, Ax⟩,

and the minimum is attained at the corresponding eigenvector.

Since ρ(Ē) = 1, our problem is as follows:

    f(x) := max_{1≤i≤N} [ ⟨e_i, Ēx⟩ − x^(i) ]  →  min_{x ≥ 0}.

Interpretation: increase self-confidence!

Since f* = 0, we can apply Polyak's method with sparse updates.

Additional features: the optimal set X* is a convex cone. If x₀ = e, then the whole sequence is separated from zero:

    ⟨x*, e⟩ ≤ ⟨x*, x_k⟩ ≤ ‖x*‖₁ ‖x_k‖_∞ = ⟨x*, e⟩ ‖x_k‖_∞.

Goal: find x̄ ≥ 0 such that ‖x̄‖_∞ ≥ 1 and f(x̄) ≤ ε. (The first condition is satisfied automatically.)
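For illustration, here is a dense, full-vector version of one Polyak step on this objective (my own sketch; the talk's GM_s would instead touch only the coordinates in the sparse row of Ē and keep the maximum in a computational tree):

```python
import numpy as np

def google_polyak_step(E_bar, x):
    """One Polyak step for f(x) = max_i [ (E_bar x)_i - x_i ], using f* = 0."""
    y = E_bar @ x - x                               # (E_bar - I) x
    i = int(np.argmax(y))
    f_val = float(y[i])
    if f_val <= 0:
        return f_val, x                             # already optimal
    g = E_bar.getrow(i).toarray().ravel()           # row i of E_bar ...
    g[i] -= 1.0                                     # ... minus e_i: subgradient f'(x)
    step = f_val / np.dot(g, g)
    return f_val, np.maximum(x - step * g, 0.0)     # project back onto x >= 0
```

Starting from x₀ = e and iterating `f_val, x = google_polyak_step(E_bar, x)` drives f toward f* = 0 while keeping ‖x‖_∞ ≥ 1, as stated above.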

Computational experiments: iteration cost

We compare Polyak's GM with sparse updates (GM_s) against the standard one (GM).

Setup: each agent has exactly p random friends. Thus κ(A) := max_{1≤i≤M} κ_A(Aᵀe_i) ≈ p².

Iteration cost: GM_s: κ(A) log₂ N ≈ p² log₂ N;  GM: pN.
(log₂ 10³ ≈ 10, log₂ 10⁶ ≈ 20, log₂ 10⁹ ≈ 30.)

Time for 10⁴ iterations (p = 32):

    N       κ(A)   GM_s   GM
    1024    1632   3.00   2.98
    2048    1792   3.36   6.41
    4096    1888   3.75   15.11
    8192    1920   4.20   139.92
    16384   1824   4.69   408.38

Time for 10³ iterations (p = 16):

    N        κ(A)   GM_s   GM
    131072   576    0.19   213.9
    262144   592    0.25   477.8
    524288   592    0.32   1095.5
    1048576  608    0.40   2590.8

1 sec versus 100 min!

Convergence of GM_s: medium size

Let N = 131072, p = 16, κ(A) = 576, and L(f) = 0.21.

    Iterations   f − f*    Time (sec)
    1.0·10⁵      0.1100    16.44
    3.0·10⁵      0.0429    49.32
    6.0·10⁵      0.0221    98.65
    1.1·10⁶      0.0119    180.85
    2.2·10⁶      0.0057    361.71
    4.1·10⁶      0.0028    674.09
    7.6·10⁶      0.0014    1249.54
    1.0·10⁷      0.0010    1644.13

The dimension and accuracy are sufficiently high, but the time is still reasonable.

Convergence of GM_s: large scale

Let N = 1048576, p = 8, κ(A) = 192, and L(f) = 0.21.

    Iterations   f − f*      Time (sec)
    0            2.000000    0.00
    1.0·10⁵      0.546662    7.69
    4.0·10⁵      0.276866    30.74
    1.0·10⁶      0.137822    76.86
    2.5·10⁶      0.063099    192.14
    5.1·10⁶      0.032092    391.97
    9.9·10⁶      0.016162    760.88
    1.5·10⁷      0.010009    1183.59

Final point x̄: ‖x̄‖_∞ = 2.941497, R₀² := ‖x̄ − e‖² = 1.2·10⁵.
Theoretical bound: L²(f) R₀² / ε² = 5.3·10⁷ iterations.
Time for GM: about 1 year!

Conclusion

1. Sparse GM is an efficient and reliable method for solving large- and huge-scale problems with uniform sparsity.
2. We can also treat dense rows. Assume that the inequality ⟨a, x⟩ ≤ b is dense. It is equivalent to the following system:

       y^(1) = a^(1) x^(1),
       y^(j) = y^(j−1) + a^(j) x^(j),   j = 2, …, n,
       y^(n) ≤ b.

   We need new variables y^(j) only for the nonzero coefficients of a, so we introduce p(a) additional variables and p(a) additional equality constraints. (No problem!)
   Hidden drawback: the above equalities are satisfied with errors. Maybe it is not too bad? (A sketch of this reformulation appears below.)
3. A similar technique can be applied to dense columns.
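A toy sketch of the dense-row reformulation in point 2 (the function name and constraint encoding are mine, for illustration only):

```python
def chain_dense_row(a, b):
    """Rewrite a dense inequality <a, x> <= b as the chain
    y_1 = a_1 x_1, y_j = y_{j-1} + a_j x_j, y_n <= b,
    introducing y-variables only for the nonzero coefficients of a.

    Returns constraints as (kind, [(variable, coefficient), ...], rhs) tuples,
    where variables are ('x', j) or ('y', j); each touches at most 3 variables.
    """
    nz = [j for j, aj in enumerate(a) if aj != 0]
    assert nz, "the row must have at least one nonzero coefficient"
    constraints, prev = [], None
    for j in nz:
        terms = [(('y', j), 1.0), (('x', j), -a[j])]
        if prev is not None:
            terms.append((('y', prev), -1.0))   # y_j - y_prev - a_j x_j = 0
        constraints.append(('eq', terms, 0.0))
        prev = j
    constraints.append(('le', [(('y', prev), 1.0)], b))  # y_last <= b
    return constraints
```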

Theoretical consequences

Assume that κ(A) ≈ γ²(A) n². Compare three methods:
- Sparse updates (SU). Complexity: γ²(A) n² · (L²R²/ε²) · log n operations.
- Smoothing technique (ST). Complexity: γ(A) n² · (LR/ε) operations.
- Polynomial-time methods (PT). Complexity: (γ(A) n² + n³) · n · ln(LR/ε) operations.

There are three possibilities:
- Low accuracy: γ(A) · LR/ε < 1. Then we choose SU.
- Moderate accuracy: 1 < γ(A) · LR/ε < n². We choose ST.
- High accuracy: γ(A) · LR/ε > n². We choose PT.

NB: For huge-scale problems we usually have γ(A) of order 1/n.