
Lec10p1, ORF363/COS323

This lecture:
- Conjugate direction methods
  - Conjugate directions
  - Conjugate Gram-Schmidt
  - The conjugate gradient (CG) algorithm
- Solving linear systems
  - Leontief input-output model of an economy

Instructor: Amir Ali Ahmadi        Fall 2014        TAs: Y. Chen, G. Hall, J. Ye

In the last couple of lectures we have seen several types of gradient descent methods as well as Newton's method. Today we see yet another class of descent methods that is particularly clever and efficient: the conjugate direction methods. These methods were primarily developed for minimizing quadratic functions. A classic reference is due to Hestenes and Stiefel [HS52], but some of the ideas date back further.

Conjugate direction methods are in some sense intermediate between gradient descent and Newton's method. They try to accelerate the convergence rate of steepest descent without paying the overhead of Newton's method. They have some very attractive properties:
- They minimize a quadratic function in n variables in n steps (in the absence of roundoff errors).
- Evaluation and storage of the Hessian matrix is not required.
- Unlike Newton's method, we do not need to invert a matrix (or solve a linear system) as a sub-problem.

Conjugate direction methods are also used in practice for solving large-scale linear systems, in particular those defined by a positive definite matrix.

Like our other descent methods, conjugate direction methods take the following iterative form:

    x^{(k+1)} = x^{(k)} + α_k d^{(k)}.

The direction d^{(k)} is chosen using the notion of conjugate directions, which is fundamental to everything that follows. So let us start with defining that formally.

Our presentation in this lecture is mostly based on [CZ13], but also adapts ideas from [Ber03], [Boy13], [HS52], [Kel09], [Lay03], [She94].

Lec10p2, ORF363/COS323

In this lecture we are mostly concerned with minimizing a quadratic function

    f(x) = (1/2) x^T Q x - b^T x,

where Q is an n×n symmetric positive definite matrix and b ∈ R^n.

Definition. Let Q be an n×n real symmetric matrix. We say that a set of nonzero vectors d^{(0)}, d^{(1)}, ..., d^{(m)} are Q-conjugate if

    d^{(i)T} Q d^{(j)} = 0  for all i ≠ j.

If Q is the identity matrix, this simply means that the vectors are pairwise orthogonal. (A small numerical check of this definition appears at the end of this page.) For general Q, [She94] gives a nice intuition of what Q-conjugacy means. Figure (a) below shows the level sets of a quadratic function and a number of Q-conjugate pairs of vectors. "Imagine if this page was printed on bubble gum, and you grabbed Figure (a) by the ends and stretched it until the ellipses appear circular. Then the vectors would appear orthogonal, as in Figure (b)."

[Figure omitted; image from [She94]: (a) Q-conjugate pairs of vectors drawn on elliptical level sets; (b) the same pairs after the stretching, now orthogonal.] The pairs on the left are Q-conjugate because the pairs on the right are orthogonal.

Recall that a set of vectors d^{(0)}, ..., d^{(k)} is linearly independent if for any set of scalars α_0, ..., α_k we have

    α_0 d^{(0)} + α_1 d^{(1)} + ... + α_k d^{(k)} = 0  ⟹  α_0 = α_1 = ... = α_k = 0.
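To make the definition concrete, here is a small numerical check of Q-conjugacy (a sketch; the matrix Q below is a made-up example, not one from the notes):

```python
import numpy as np

# A made-up symmetric positive definite matrix for illustration.
Q = np.array([[3.0, 0.0, 1.0],
              [0.0, 4.0, 2.0],
              [1.0, 2.0, 3.0]])

def are_Q_conjugate(dirs, Q, tol=1e-10):
    """Check that every pair of distinct vectors d_i, d_j satisfies d_i^T Q d_j = 0."""
    k = len(dirs)
    return all(abs(dirs[i] @ Q @ dirs[j]) < tol
               for i in range(k) for j in range(k) if i != j)

# The standard basis vectors are orthogonal but typically not Q-conjugate:
e = list(np.eye(3))
print(are_Q_conjugate(e, np.eye(3)))  # True: I-conjugacy is plain orthogonality
print(are_Q_conjugate(e, Q))          # False for this particular Q
```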

Lec10p3, ORF363/COS323

Lemma 1. Let Q be an n×n symmetric positive definite matrix. If the directions d^{(0)}, d^{(1)}, ..., d^{(k)}, k ≤ n-1, are Q-conjugate, then they are linearly independent.

Proof. (A sketch is given after the remark below.)

Remark. Why do we have k ≤ n-1 in the statement of the lemma? Because we cannot have more than n linearly independent vectors in R^n (a basic fact in linear algebra, proven for your convenience below). Hence, by the lemma, we cannot have more than n vectors that are Q-conjugate.
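A sketch of the standard proof of Lemma 1 (the handwritten board argument is not reproduced; this follows the argument in [CZ13]):

```latex
% Sketch of a standard proof of Lemma 1, in the notation of these notes.
Suppose $\alpha_0 d^{(0)} + \alpha_1 d^{(1)} + \dots + \alpha_k d^{(k)} = 0$ for some scalars
$\alpha_0, \dots, \alpha_k$. Fix any index $j$ and multiply on the left by $d^{(j)T} Q$:
\[
  \sum_{i=0}^{k} \alpha_i \, d^{(j)T} Q\, d^{(i)} \;=\; \alpha_j \, d^{(j)T} Q\, d^{(j)} \;=\; 0,
\]
where all terms with $i \neq j$ vanish by $Q$-conjugacy. Since $Q \succ 0$ and $d^{(j)} \neq 0$,
we have $d^{(j)T} Q\, d^{(j)} > 0$, so $\alpha_j = 0$. As $j$ was arbitrary, all the coefficients
are zero, and the directions are linearly independent. $\blacksquare$
```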

Lec10p4, ORF363/COS323

So what are Q-conjugate directions good for? As we will see next, if we have n conjugate directions, we can minimize f by doing exact line search iteratively along them. This is the conjugate direction method. Let's see it more formally.

The Conjugate Direction Algorithm:
Input: an n×n symmetric positive definite matrix Q, a vector b, and a set of Q-conjugate directions d^{(0)}, ..., d^{(n-1)}.
- Pick an initial point x^{(0)}.
- For k = 0 to n-1:
    Let g^{(k)} = ∇f(x^{(k)}) = Q x^{(k)} - b.
    Let α_k = - g^{(k)T} d^{(k)} / d^{(k)T} Q d^{(k)}.
    Let x^{(k+1)} = x^{(k)} + α_k d^{(k)}.

Lemma 2. The step size α_k given above gives the exact minimum of f along the direction d^{(k)}.

And more importantly,

Theorem 1. For any starting point x^{(0)}, the algorithm above converges to the unique minimum x* of f in n steps; i.e., x^{(n)} = x*.

Proof of Lemma 2.
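A sketch of the standard calculation behind Lemma 2 (the board proof is not reproduced here):

```latex
% Sketch of the proof of Lemma 2 (exact line search formula).
Fix $k$ and let $\phi(\alpha) = f\big(x^{(k)} + \alpha d^{(k)}\big)$. Since $Q \succ 0$, $\phi$ is a
strictly convex quadratic in $\alpha$, so its minimizer is the unique stationary point:
\[
  \phi'(\alpha) \;=\; \nabla f\big(x^{(k)} + \alpha d^{(k)}\big)^T d^{(k)}
  \;=\; \big(Q x^{(k)} - b\big)^T d^{(k)} + \alpha\, d^{(k)T} Q\, d^{(k)} \;=\; 0,
\]
which gives
\[
  \alpha_k \;=\; -\,\frac{g^{(k)T} d^{(k)}}{d^{(k)T} Q\, d^{(k)}}, \qquad g^{(k)} := Q x^{(k)} - b .
\]
```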

Lec10p5, ORF363/COS323

Proof of Theorem 1.
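The board proof is not reproduced here; the following is a sketch of the standard argument (cf. [CZ13]):

```latex
% Sketch of the proof of Theorem 1 (n-step convergence of the conjugate direction method).
Since the directions $d^{(0)}, \dots, d^{(n-1)}$ are $Q$-conjugate, they are linearly independent
(Lemma 1) and hence form a basis of $\mathbb{R}^n$. Write
\[
  x^* - x^{(0)} \;=\; \beta_0 d^{(0)} + \beta_1 d^{(1)} + \dots + \beta_{n-1} d^{(n-1)} .
\]
Multiplying on the left by $d^{(k)T} Q$ and using conjugacy,
\[
  \beta_k \;=\; \frac{d^{(k)T} Q \big(x^* - x^{(0)}\big)}{d^{(k)T} Q\, d^{(k)}} .
\]
On the other hand, $x^{(k)} - x^{(0)} = \sum_{i<k} \alpha_i d^{(i)}$, so
$d^{(k)T} Q \big(x^{(k)} - x^{(0)}\big) = 0$ by conjugacy, and therefore
\[
  d^{(k)T} Q \big(x^* - x^{(0)}\big) \;=\; d^{(k)T} Q \big(x^* - x^{(k)}\big)
  \;=\; d^{(k)T}\big(b - Q x^{(k)}\big) \;=\; -\,d^{(k)T} g^{(k)},
\]
where we used $Q x^* = b$. Hence $\beta_k = -\,g^{(k)T} d^{(k)} / d^{(k)T} Q\, d^{(k)} = \alpha_k$
for every $k$, so $x^{(n)} = x^{(0)} + \sum_{k=0}^{n-1} \alpha_k d^{(k)} = x^*$. $\blacksquare$
```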

Lec10p6, ORF363/COS323

The algorithm we presented assumed a list of Q-conjugate directions as input. But how can we compute these directions from the matrix Q? Here we propose a simple algorithm (and we'll see a more clever approach later in the lecture).

The conjugate Gram-Schmidt procedure

This is a simple procedure that allows us to start with any set of linearly independent vectors p^{(0)}, ..., p^{(n-1)} (say, the standard basis vectors) and turn them into Q-conjugate vectors d^{(0)}, ..., d^{(n-1)} in such a way that the two sets span the same subspace. In the special case where Q is the identity matrix, this is the usual Gram-Schmidt process for obtaining an orthogonal basis.

[Portraits: Jorgen Gram (1850-1916) and Erhard Schmidt (1876-1959).]

Theorem 2. Let Q be an n×n positive definite matrix and let p^{(0)}, ..., p^{(n-1)} be a set of linearly independent vectors. Then, the vectors d^{(0)}, ..., d^{(n-1)} constructed as follows are Q-conjugate (and span the same space as p^{(0)}, ..., p^{(n-1)}):

    d^{(0)} = p^{(0)},
    d^{(k+1)} = p^{(k+1)} - Σ_{i=0}^{k} ( p^{(k+1)T} Q d^{(i)} / d^{(i)T} Q d^{(i)} ) d^{(i)},   k = 0, ..., n-2.

Proof.

Lec10p7, ORF363/COS323

Proof (Cont'd).

While you can always use the conjugate Gram-Schmidt algorithm to generate conjugate directions and then start your conjugate direction method, there is a much more efficient way of doing this. This is the conjugate gradient (CG) algorithm that we will see shortly. This method generates your conjugate directions on the fly in each iteration. Moreover, the update rule for finding new conjugate directions will be more efficient than what the Gram-Schmidt process offers, which requires keeping a stack of all previous conjugate directions in memory (a code sketch of the Gram-Schmidt construction is given at the end of this page).

Before we get to the conjugate gradient algorithm, we state an important geometric property of any conjugate direction method, called the "expanding subspace theorem". The proof of this comes up in the proof of correctness of the conjugate gradient algorithm.
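First, though, here is a minimal code sketch of the conjugate Gram-Schmidt construction from Theorem 2 (the function name and the test matrix are my own):

```python
import numpy as np

def conjugate_gram_schmidt(P, Q):
    """Turn the linearly independent columns of P into Q-conjugate directions,
    following the construction in Theorem 2: each new direction is the next input
    vector minus its Q-projections onto the previously built directions."""
    n = P.shape[1]
    D = np.zeros_like(P, dtype=float)
    D[:, 0] = P[:, 0]
    for k in range(1, n):
        d = P[:, k].astype(float).copy()
        for i in range(k):
            d -= (P[:, k] @ Q @ D[:, i]) / (D[:, i] @ Q @ D[:, i]) * D[:, i]
        D[:, k] = d
    return D

# Minimal check on a made-up positive definite Q, starting from the standard basis.
Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
D = conjugate_gram_schmidt(np.eye(3), Q)
print(np.round(D.T @ Q @ D, 10))   # off-diagonal entries should be (numerically) zero
```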

Lec10p8, ORF363/COS323

Theorem 3 (the expanding subspace theorem). Let Q be an n×n symmetric positive definite matrix, d^{(0)}, ..., d^{(n-1)} a set of Q-conjugate directions, x^{(0)} an arbitrary point in R^n, and x^{(1)}, x^{(2)}, ... the sequence of points generated by the conjugate direction algorithm. Then, for k = 0, ..., n-1, x^{(k+1)} minimizes f over the affine subspace x^{(0)} + span{d^{(0)}, ..., d^{(k)}}.

Remark: Note that x^{(0)} + span{d^{(0)}, ..., d^{(n-1)}} = R^n. So this proves again that the algorithm indeed terminates with the optimal solution in n steps.

The following lemma is the main ingredient of the proof.

Lemma 3. Let Q be an n×n symmetric positive definite matrix, d^{(0)}, ..., d^{(n-1)} a set of Q-conjugate directions, x^{(0)} an arbitrary point in R^n, and x^{(1)}, x^{(2)}, ... the sequence of points generated by the conjugate direction algorithm. Let g^{(k)} = ∇f(x^{(k)}) = Q x^{(k)} - b. Then, for all k, 0 ≤ k ≤ n-1, and all i, 0 ≤ i ≤ k, we have g^{(k+1)T} d^{(i)} = 0.

[Figure illustrating the expanding subspace property; image credit: [CZ13].]

Lec10p9, ORF363/COS323

Proof of Lemma 3.

Proof of the expanding subspace theorem.

Lec10p10, ORF363/COS323

The conjugate gradient algorithm

We are now ready to introduce the conjugate gradient algorithm. This is a specific type of conjugate direction algorithm. It does not require the conjugate directions a priori but generates them on the fly. The update rule for these directions is very simple: in each iteration, the new conjugate direction is a linear combination of the current gradient and the previous conjugate direction.

The algorithm is spelled out below in 8 steps. We follow here the presentation of [CZ13]. The update rule for β_k (step 6) can appear in alternate forms; see, e.g., [Ber03]. Note that the first step of the algorithm is exactly the same as what we would do in steepest descent.

1. Set k := 0; select an initial point x^{(0)}.
2. Let g^{(0)} := ∇f(x^{(0)}) = Q x^{(0)} - b. If g^{(0)} = 0, stop; else, set d^{(0)} := -g^{(0)}.
3. α_k := - g^{(k)T} d^{(k)} / d^{(k)T} Q d^{(k)}.
4. x^{(k+1)} := x^{(k)} + α_k d^{(k)}.
5. g^{(k+1)} := ∇f(x^{(k+1)}) = Q x^{(k+1)} - b. If g^{(k+1)} = 0, stop.
6. β_k := g^{(k+1)T} Q d^{(k)} / d^{(k)T} Q d^{(k)}.
7. d^{(k+1)} := -g^{(k+1)} + β_k d^{(k)}.
8. Set k := k + 1; go to step 3.
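Here is a minimal Python sketch of the eight steps above (the function name and the numerical stopping tolerance are my own choices; in exact arithmetic the loop terminates in at most n iterations):

```python
import numpy as np

def conjugate_gradient(Q, b, x0=None, tol=1e-10):
    """Minimize f(x) = 0.5*x'Qx - b'x (equivalently, solve Qx = b) for a symmetric
    positive definite Q, following the 8 steps above."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    g = Q @ x - b                      # step 2: gradient at the initial point
    if np.linalg.norm(g) <= tol:
        return x
    d = -g                             # first direction = steepest descent direction
    for _ in range(n):                 # at most n iterations in exact arithmetic
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)    # step 3: exact line search along d
        x = x + alpha * d              # step 4: take the step
        g = Q @ x - b                  # step 5: new gradient
        if np.linalg.norm(g) <= tol:
            break
        beta = (g @ Qd) / (d @ Qd)     # step 6: update coefficient
        d = -g + beta * d              # step 7: new conjugate direction
    return x
```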

Lec10p11, ORF363/COS323

Theorem 4. The directions d^{(0)}, d^{(1)}, ... that the conjugate gradient algorithm produces (in step 7) are Q-conjugate.

Proof. See [CZ13], Proposition 10.1 on p. 184. A main step of the proof is Lemma 3.

Note that in view of Theorem 1, Theorem 4 immediately implies that the conjugate gradient algorithm minimizes f in n steps.

Solving linear systems

Solving linear systems is one of the most basic and fundamental tasks in computational mathematics. It has been studied for centuries. You have probably seen algorithms for this in your linear algebra class, most likely Gaussian elimination. Here we show how the conjugate gradient method can be used to solve a linear system. This method (or some of its more elaborate variants) is often the method of choice for large-scale (sparse) linear systems.

Suppose we are solving Ax = b, where A is symmetric and positive definite. This implies that there is a unique solution (why?).

Examples of problems where positive definite linear systems appear [Boy13]:
- Newton's method applied to convex functions (we have already seen this linear system for finding the Newton direction)
- Least-squares (the so-called normal equations): A^T A x = A^T b
- Solving for voltages in resistor circuits: Gv = i (G is the "conductance matrix")
- Graph Laplacian linear systems

How to solve the system Ax = b? Define the quadratic function

    f(x) = (1/2) x^T A x - b^T x

and let CG find its global minimum: since A is positive definite, the unique minimizer of f satisfies ∇f(x) = Ax - b = 0, i.e., it is exactly the solution of Ax = b.
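As a quick illustration, SciPy's built-in CG routine can be used this way on a made-up positive definite system (the matrix and right-hand side below are random, chosen only for the demonstration):

```python
import numpy as np
from scipy.sparse.linalg import cg   # SciPy's conjugate gradient solver

# Solve A x = b with CG for a made-up symmetric positive definite A.
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)      # symmetric positive definite by construction
b = rng.standard_normal(200)

x, info = cg(A, b)                   # info == 0 means successful convergence
print(info, np.linalg.norm(A @ x - b))
```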

Lec10p12, ORF363/COS323

Is the assumption that A is symmetric and positive definite too restrictive? No (at least not in theory). Indeed, any nondegenerate linear system can be reduced to a positive definite one. Suppose we want to solve Ax = b, where A is invertible but not positive definite or even symmetric.

Claim: Ax = b if and only if A^T A x = A^T b.

Note that A^T A is symmetric. It is also positive definite if A is invertible (why?). So we can solve A^T A x = A^T b using CG instead.

But there is a word of caution: while this simple reduction is mathematically valid, it may raise concerns from a numerical point of view. For example, the condition number of the second linear system is the square of that of the first one. (Why? Recall that the condition number of A is the ratio of its largest to its smallest singular value.) Moreover, we don't want to pay the price of matrix-matrix multiplication to compute A^T A. The good news is that we don't have to: in the CG algorithm only matrix-vector operations are needed, and we can carry them out from right to left by two matrix-vector multiplications instead of first doing the matrix-matrix multiplication (see the code sketch at the end of this page).

Leontief input-output model of an economy

The Leontief input-output model breaks a nation's economy into sectors (so-called producing sectors), for example:
- Agriculture
- Manufacturing
- Services
- Education

[Photo: Wassily Leontief (1906-1999), Nobel Memorial Prize in Economics, 1973.]

Separately, it considers the society as an "open sector", which is a consumer of the output of the sectors. Each of the sectors also needs the output of (some of) the other sectors in order to produce its own output. The model tries to understand the interdependencies among the sectors by studying how much each sector should produce to meet the demand of the other sectors, as well as the demand of the society.
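Before continuing with the Leontief model, here is a small sketch of the reduction A^T A x = A^T b in code: wrapping the map v ↦ A^T(Av) in SciPy's LinearOperator means each CG iteration costs two matrix-vector products and A^T A is never formed (the matrix A below is made up):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(1)
n = 100
# A made-up nonsymmetric matrix, kept close to the identity so it is invertible
# and well conditioned (only for the purpose of this illustration).
A = np.eye(n) + 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)

# v -> A^T (A v): two matrix-vector products, no matrix-matrix product.
AtA = LinearOperator((n, n), matvec=lambda v: A.T @ (A @ v))
x, info = cg(AtA, A.T @ b)
print(info, np.linalg.norm(A @ x - b))
```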

Lec10p13, ORF363/COS323

Leontief input-output model (Cont'd)

Input consumed per unit of output:

                    Transportation   Agriculture   Services   Manufacturing
    Transportation        .2             .3           .5           .3
    Agriculture           .5             .3           .1           .0
    Services              .1             .2           .2           .5
    Manufacturing         .1             .1           .1           .1

This is called the consumption matrix, denoted by C. Here is how you should interpret the first column: in order to produce one unit of transportation, the transportation industry needs to consume as input .2 units of transportation itself, .5 units of agriculture, .1 units of services, and .1 units of manufacturing. The other columns are interpreted analogously.

Let d be an n×1 vector denoting the demand of the open sector (i.e., the society, or the non-producing sector) for each of the producing sectors. We are interested in solving the following linear system, which is called the Leontief production equation:

    x = Cx + d        ("amount produced = intermediate demand + final demand").

Here, x_i is the amount that sector i needs to produce to meet the intermediate demand (the demand of the other producing sectors) and the final demand (the demand of the open sector). So for a given C and d, we need to solve the following linear system to figure out how much each sector should produce:

    (I - C) x = d.

An economy is called "productive" if for every nonnegative demand vector d there exists a nonnegative production vector x satisfying the above linear system. This is a property of the consumption matrix only. A sufficient condition for this is for C to have spectral radius (i.e., largest eigenvalue in absolute value) less than one. Can you prove this?
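A small numerical illustration (the consumption matrix is the one in the table above; the demand vector is hypothetical):

```python
import numpy as np

# Consumption matrix C from the table (rows: input sector, columns: output sector),
# sector order: transportation, agriculture, services, manufacturing.
C = np.array([[0.2, 0.3, 0.5, 0.3],
              [0.5, 0.3, 0.1, 0.0],
              [0.1, 0.2, 0.2, 0.5],
              [0.1, 0.1, 0.1, 0.1]])

print(max(abs(np.linalg.eigvals(C))))   # spectral radius; it is < 1 here

# A hypothetical final-demand vector d (made up, not from the notes):
d = np.array([100.0, 80.0, 200.0, 150.0])

# Leontief production equation: x = C x + d, i.e., (I - C) x = d.
x = np.linalg.solve(np.eye(4) - C, d)
print(np.round(x, 1))                   # production levels; nonnegative, as expected
```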

Lec10p14, ORF363/COS323

A bit of Leontief history (source: [Lay03])

In 1949, Wassily Leontief (then at Harvard) used statistics from the U.S. Bureau of Labor to divide the U.S. economy into 500 sectors. For each one he wrote a linear equation (as on the previous page) to describe how the sector distributed its output to the other sectors of the economy.

He used Harvard University's Mark II, a "super computer" of the time, to solve this linear system. This was a machine financed by the U.S. Navy and built in 1947. Since the resulting linear system was too big for the Mark II, Leontief aggregated his data to construct a 42×42 linear system. Programming the Mark II for this task took several months. Once it was finally done, the machine took 56 hours to solve this 42×42 linear system! Today, on my tiny laptop, this is done on the order of microseconds.

Leontief's achievement is considered to be one of the first significant uses of computers in mathematical economics.

If you had access to the fastest computer of today, what problem would you give to it?

Lec10p15, ORF363/COS323

Notes: The relevant [CZ13] chapter for this lecture is Chapter 10.

References:

- [Ber03] D.P. Bertsekas. Nonlinear Programming. Second edition. Athena Scientific, 2003.
- [Boy13] S. Boyd. Lecture slides for Convex Optimization II. Stanford University, 2013. http://stanford.edu/class/ee364b/
- [CZ13] E.K.P. Chong and S.H. Zak. An Introduction to Optimization. Fourth edition. Wiley, 2013.
- [HS52] M.R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, Vol. 49, No. 6, 1952.
- [Kel09] J. Kelner. Topics in Theoretical Computer Science: An Algorithmist's Toolkit. MIT OpenCourseWare, 2009.
- [Lay03] D.C. Lay. Linear Algebra and its Applications. Third edition. Addison Wesley, 2003.
- [She94] J.R. Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, 1994.