
Numerical Linear Algebra Primer
Ryan Tibshirani
Convex Optimization 10-725/36-725

Last time: proximal gradient descent

Consider the problem
    min_x  g(x) + h(x)
with g, h convex, g differentiable, and h simple in so much as
    prox_t(x) = argmin_z  1/(2t) ||x - z||_2^2 + h(z)
is computable.

Proximal gradient descent: let x^(0) ∈ R^n, repeat
    x^(k) = prox_{t_k}( x^(k-1) - t_k ∇g(x^(k-1)) ),   k = 1, 2, 3, ...

- Step sizes t_k chosen to be fixed and small, or via backtracking
- If ∇g is Lipschitz with constant L, then this has convergence rate O(1/ε) (this counts iteration complexity; each iteration uses a prox evaluation)
- Lastly, we can accelerate this to the optimal rate of O(1/√ε)
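As a concrete illustration (not from the slides), here is a minimal NumPy sketch of proximal gradient descent for the lasso, where the prox of h(β) = λ||β||_1 is elementwise soft-thresholding. The function names, data, and λ below are assumptions made up for the example.

    import numpy as np

    def soft_threshold(z, t):
        # prox of t*||.||_1 at z: elementwise soft-thresholding
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def prox_gradient_lasso(X, y, lam, t, n_iters=500):
        # minimize g(beta) + h(beta), with g(beta) = (1/2)||y - X beta||_2^2
        # and h(beta) = lam*||beta||_1, using a fixed step size t
        beta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            grad = X.T @ (X @ beta - y)                      # gradient of g
            beta = soft_threshold(beta - t * grad, t * lam)  # prox step
        return beta

    # small synthetic example (assumed data)
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    y = X @ np.concatenate([np.ones(5), np.zeros(15)]) + 0.1 * rng.standard_normal(100)
    t = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant of grad g
    beta_hat = prox_gradient_lasso(X, y, lam=1.0, t=t)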

Outline

Today:
- Complexity of basic operations
- Solving linear systems
- Matrix factorizations
- Sensitivity analysis
- Alternative indirect methods

Complexity (flop counts) of basic operations

Flop (floating point operation):
- One addition, subtraction, multiplication, or division of two floating point numbers
- Serves as a basic unit of computation
- We are interested in rough, not exact, flop counts

Vector-vector operations: given a, b ∈ R^n:
- Addition, a + b, costs n flops
- Scalar multiplication, c·a, costs n flops
- Inner product, a^T b, costs 2n flops

Flops do not tell the whole story: setting every element of a to 1 costs 0 flops

Matrix-vector product: given A ∈ R^{m×n}, b ∈ R^n, consider Ab:
- In general, costs 2mn flops
- For s-sparse A, costs 2s flops
- For k-banded A ∈ R^{n×n}, costs 2nk flops
- For A = Σ_{i=1}^r u_i v_i^T ∈ R^{m×n}, costs 2r(m + n) flops
- For A ∈ R^{n×n} a permutation matrix, costs 0 flops

Matrix-matrix product: for A ∈ R^{m×n}, B ∈ R^{n×p}, consider AB:
- In general, costs 2mnp flops
- For s-sparse A, costs 2sp flops (less if B is also sparse)

Matrix-matrix-vector product: for A ∈ R^{m×n}, B ∈ R^{n×p}, c ∈ R^p, consider ABc (see the timing sketch below):
- Costs 2np + 2mn flops if done properly, as A(Bc)
- Costs 2mnp + 2mp flops if done improperly, as (AB)c!
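The parenthesization point is easy to see on a machine. A small timing sketch (not from the slides; the sizes are arbitrary assumptions):

    import time
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p = 2000, 2000, 2000
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, p))
    c = rng.standard_normal(p)

    t0 = time.perf_counter()
    x_good = A @ (B @ c)     # two matrix-vector products: about 2np + 2mn flops
    t1 = time.perf_counter()
    x_bad = (A @ B) @ c      # matrix-matrix product first: about 2mnp + 2mp flops
    t2 = time.perf_counter()

    print(f"A(Bc): {t1 - t0:.3f}s   (AB)c: {t2 - t1:.3f}s")
    print("same answer up to roundoff:", np.allclose(x_good, x_bad))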

Solving linear systems

For nonsingular A ∈ R^{n×n}, consider solving the linear system Ax = b, i.e., computing x = A^{-1} b:
- In general, costs about n^3 flops (we'll see more on this later)
- For diagonal A, costs n flops: x = (b_1/a_1, ..., b_n/a_n)
- For lower triangular A (i.e., A_ij = 0 for j > i), costs n^2 flops:
      x_1 = b_1 / A_11
      x_2 = (b_2 - A_21 x_1) / A_22
      ...
      x_n = (b_n - A_{n,n-1} x_{n-1} - ... - A_{n1} x_1) / A_nn
  This is called forward substitution
- For upper triangular A, costs n^2 flops, by backward substitution
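A minimal forward substitution sketch in NumPy (an illustration, not the lecture's code); in practice one would call scipy.linalg.solve_triangular, which does the same thing in compiled code:

    import numpy as np

    def forward_substitution(L, b):
        # solve L x = b for lower triangular, nonsingular L, in about n^2 flops
        n = L.shape[0]
        x = np.zeros(n)
        for i in range(n):
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
        return x

    # quick check on an assumed small example
    rng = np.random.default_rng(0)
    L = np.tril(rng.standard_normal((5, 5))) + 5 * np.eye(5)
    b = rng.standard_normal(5)
    print(np.allclose(forward_substitution(L, b), np.linalg.solve(L, b)))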

- For s-sparse A, often costs ≪ n^3 flops, but exact (worst-case) flop counts are not known for arbitrary sparsity structures
- For k-banded A, costs nk^2 flops (more later)
- For orthogonal A, we have A^{-1} = A^T, and so x = A^T b costs 2n^2 flops
- For permutation A, again A^{-1} = A^T, and so x = A^T b costs 0 flops. Example:

      A = [ 0 1 0 0 ]        A^{-1} = A^T = [ 0 0 1 0 ]
          [ 0 0 1 0 ]                        [ 1 0 0 0 ]
          [ 1 0 0 0 ]                        [ 0 1 0 0 ]
          [ 0 0 0 1 ]                        [ 0 0 0 1 ]

Matrix factorizations

As you've probably learned, we can solve Ax = b by, e.g., Gaussian elimination. But instead it can be useful to factorize A:
    A = A_1 A_2 ... A_k
and then compute x = A_k^{-1} ... A_2^{-1} A_1^{-1} b. Usually k = 2 or 3, and:
- Computing the factorization is expensive, about n^3 flops
- Applying A_1^{-1}, ..., A_k^{-1} is cheaper, about n^2 flops
- This is because A_1, ..., A_k are structured: either orthogonal, triangular, diagonal, or permutation matrices

This is useful when we want to solve Ax = b, Ax' = b', ... that is, many linear systems in the same A (see the sketch below). Also, if A undergoes a simple change, then an old factorization for A can often be efficiently updated.
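A sketch of the factor once, solve many pattern, using SciPy's LU routines (an illustration under assumed sizes, not the lecture's code):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n))

    lu, piv = lu_factor(A)              # expensive: PA = LU, about n^3 flops

    for _ in range(10):                 # cheap: each solve is about n^2 flops
        b = rng.standard_normal(n)
        x = lu_solve((lu, piv), b)      # permutation + forward + backward substitution

    print(np.allclose(A @ x, b))        # check the last solve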

Cholesky decomposition

If A ∈ S^n_++ (symmetric, positive definite), then it has a Cholesky decomposition:
    A = L L^T
with L ∈ R^{n×n} lower triangular. Can compute this in n^3/3 flops.

Once we have the Cholesky factor, we solve Ax = b in two triangular solves (sketched below):
    y = L^{-1} b, by forward substitution, in n^2 flops
    x = (L^T)^{-1} y, by back substitution, in n^2 flops
So solving costs 2n^2 flops.

Factorization and solve steps together cost n^3/3 + 2n^2 flops.
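A minimal Cholesky factor-solve sketch with SciPy (an illustration; the positive definite matrix below is an assumed random example):

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    rng = np.random.default_rng(0)
    n = 200
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)                  # symmetric positive definite
    b = rng.standard_normal(n)

    L = cholesky(A, lower=True)                  # A = L L^T, about n^3/3 flops
    y = solve_triangular(L, b, lower=True)       # forward substitution, ~n^2 flops
    x = solve_triangular(L.T, y, lower=False)    # back substitution, ~n^2 flops

    print(np.allclose(A @ x, b))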

Least squares problems and Cholesky

Now given y ∈ R^n, X ∈ R^{n×p}, consider the least squares problem:
    min_{β ∈ R^p}  ||y - Xβ||_2^2

Assuming X has full column rank, the solution is β̂ = (X^T X)^{-1} X^T y. How expensive?
- Compute X^T y, in 2pn flops
- Compute X^T X, in p^2 n flops
- Compute the Cholesky decomposition of X^T X, in p^3/3 flops
- Solve (X^T X)β = X^T y, in 2p^2 flops
Thus in total, about (n + p/3)p^2 flops (or about np^2 flops if n ≫ p)
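A sketch of least squares via Cholesky on the normal equations (illustration only, with assumed sizes and data), checked against np.linalg.lstsq:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    rng = np.random.default_rng(0)
    n, p = 1000, 50
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    XtX = X.T @ X                        # about p^2 n flops
    Xty = X.T @ y                        # about 2pn flops
    c, low = cho_factor(XtX)             # about p^3/3 flops
    beta = cho_solve((c, low), Xty)      # about 2p^2 flops

    print(np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0]))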

QR decomposition

If A ∈ R^{m×n} with m ≥ n, then it has a QR decomposition:
    A = QR
with Q ∈ R^{m×n} orthogonal (i.e., Q^T Q = I), and R ∈ R^{n×n} upper triangular. Can compute this in 2(m - n/3)n^2 flops.

- If A has rank n, then all diagonal elements of R are nonzero, and the columns of Q form an orthonormal basis for col(A)
- If A has rank r, then the first r diagonal elements of R are nonzero, and the first r columns of Q form a basis for col(A)

Key identity: ||x||_2^2 = ||Q^T x||_2^2 + ||Q̃^T x||_2^2, where Q̃ ∈ R^{m×(m-n)} completes the basis from Q

Least squares problems and QR

Now back to y ∈ R^n, X ∈ R^{n×p} with full column rank, and the least squares problem:
    min_{β ∈ R^p}  ||y - Xβ||_2^2

Let X = QR be a QR decomposition. Then
    ||y - QRβ||_2^2 = ||Q^T y - Rβ||_2^2 + ||Q̃^T y||_2^2
The second term does not depend on β. So for the least squares solution:
- Compute X = QR, in 2(n - p/3)p^2 flops
- Compute Q^T y, in 2pn flops
- Solve Rβ = Q^T y, in p^2 flops
Hence in total, about 2(n - p/3)p^2 flops (or about 2np^2 flops if n ≫ p)
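The same least squares problem via QR (an illustrative sketch with assumed data), again checked against np.linalg.lstsq:

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(0)
    n, p = 1000, 50
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    Q, R = np.linalg.qr(X)                  # reduced QR: Q is n x p, R is p x p
    beta = solve_triangular(R, Q.T @ y)     # back substitution on R beta = Q^T y

    print(np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0]))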

Linear systems and sensitivity

Consider first the linear system Ax = b, for nonsingular A ∈ R^{n×n}. The singular value decomposition of A:
    A = U Σ V^T
where U, V ∈ R^{n×n} are orthogonal, and Σ ∈ R^{n×n} is diagonal with elements σ_1 ≥ ... ≥ σ_n > 0.

Even if A is full rank, it could be near a singular matrix B, i.e.,
    dist(A, R_k) = min_{rank(B)=k} ||A - B||_op
could be small, for some k < n. An easy SVD analysis shows that dist(A, R_k) = σ_{k+1}. If this is small, then solving x = A^{-1} b could pose problems.
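A quick numerical check of dist(A, R_k) = σ_{k+1} (an illustration, not from the slides): the best rank-k approximation from the SVD achieves this distance in operator norm.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 8, 5
    A = rng.standard_normal((n, n))

    U, s, Vt = np.linalg.svd(A)
    B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of A

    print(np.linalg.norm(A - B, 2))             # ||A - B||_op
    print(s[k])                                 # sigma_{k+1} (0-indexed as s[k])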

From the lens of the SVD:
    x = A^{-1} b = V Σ^{-1} U^T b = Σ_{i=1}^n (v_i u_i^T b) / σ_i
We can see that if some σ_i > 0 is small (A is close to the set of rank i-1 matrices), then we could be in trouble.

Precise sensitivity analysis: fix some F ∈ R^{n×n}, f ∈ R^n, and solve the perturbed system
    (A + εF) x(ε) = b + εf

Theorem: The solution to the perturbed system satisfies
    ||x(ε) - x||_2 / ||x||_2 ≤ κ(A)(ρ_A + ρ_b) + O(ε^2)
where κ(A) = σ_1/σ_n is the condition number of A, and ρ_A, ρ_b are the relative errors
    ρ_A = ε ||F||_op / ||A||_op,   ρ_b = ε ||f||_2 / ||b||_2
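An illustrative check of the sensitivity bound on an ill-conditioned system (not from the slides; the construction of A, F, f, and the choice of ε are assumptions for the example):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = U @ np.diag(np.logspace(0, -6, n)) @ V.T   # condition number about 1e6
    b = rng.standard_normal(n)
    F = rng.standard_normal((n, n))
    f = rng.standard_normal(n)
    eps = 1e-10

    x = np.linalg.solve(A, b)
    x_eps = np.linalg.solve(A + eps * F, b + eps * f)

    rho_A = eps * np.linalg.norm(F, 2) / np.linalg.norm(A, 2)
    rho_b = eps * np.linalg.norm(f) / np.linalg.norm(b)

    print("relative error:", np.linalg.norm(x_eps - x) / np.linalg.norm(x))
    print("bound:         ", np.linalg.cond(A, 2) * (rho_A + rho_b))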

Proof: By implicit differentiation (where we abbreviate x = x(0)),
    dx/dε (0) = A^{-1}(f - Fx)
Using a Taylor expansion around 0,
    x(ε) = x + ε A^{-1}(f - Fx) + O(ε^2)
Rearranging gives
    ||x(ε) - x||_2 / ||x||_2 ≤ ε ||A^{-1}||_op ( ||f||_2 / ||x||_2 + ||F||_op ) + O(ε^2)
Multiplying and dividing by ||A||_op proves the result, since κ(A) = ||A||_op ||A^{-1}||_op

Cholesky versus QR for least squares

Linear systems: worse conditioning means greater sensitivity. What about for least squares problems?
    min_{β ∈ R^p}  ||y - Xβ||_2^2

- Recall Cholesky solves X^T X β = X^T y. Hence we know that its sensitivity scales with κ(X^T X) = κ(X)^2
- Meanwhile, QR operates on X directly, never forming X^T X, and one can show that its sensitivity scales with κ(X) + ρ_LS κ(X)^2, where ρ_LS = ||y - Xβ̂||_2^2 is the least squares residual

Summary: Cholesky is cheaper (and uses less memory), but QR is more stable when ρ_LS is small and κ(X) is large. (A small numerical comparison follows below.)
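A small comparison (an illustration with an assumed ill-conditioned design and zero residual, so QR should win on accuracy): the normal equations square the condition number, and the Cholesky route loses accuracy accordingly.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve, solve_triangular

    rng = np.random.default_rng(0)
    n, p = 500, 10
    X = rng.standard_normal((n, p)) @ np.diag(np.logspace(0, -5, p))  # kappa(X) ~ 1e5
    beta_true = rng.standard_normal(p)
    y = X @ beta_true                      # noiseless, so rho_LS = 0

    print("kappa(X)    :", np.linalg.cond(X))
    print("kappa(X^T X):", np.linalg.cond(X.T @ X))

    c, low = cho_factor(X.T @ X)
    beta_chol = cho_solve((c, low), X.T @ y)
    Q, R = np.linalg.qr(X)
    beta_qr = solve_triangular(R, Q.T @ y)

    print("Cholesky error:", np.linalg.norm(beta_chol - beta_true))
    print("QR error      :", np.linalg.norm(beta_qr - beta_true))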

Some advanced topics

- Updating matrix factorizations: can often be done efficiently after a simple change. E.g., the QR decomposition of A ∈ R^{m×n} can be updated in O(m^2) flops after adding or deleting a row, and in O(mn) flops after adding or deleting a column
- Underdetermined least squares: if X ∈ R^{n×p} and rank(X) < p, the criterion ||y - Xβ||_2^2 has infinitely many minimizers. The one with smallest ℓ2 norm can be computed using QR
- Banded matrix factorizations: if A ∈ S^n_++ is k-banded, then we can compute its Cholesky decomposition in nk^2/4 flops, and apply it in 2nk flops
- Sparse matrix factorizations: this is in general a lot trickier, and can require very complex pivoting schemes. Theoretical analysis is loose, but practical performance is extremely good. See Davis (2006), "Direct Methods for Sparse Linear Systems", and SuiteSparse

Alternative indirect methods

So far we've been talking about direct methods for linear systems. These return the exact solution (in a perfect computing environment).

Indirect methods (iterative methods) produce x^(k), k = 1, 2, 3, ... converging to a solution x. They are most often used for very large, sparse systems.

Jacobi iterations are the most basic approach. Suppose A ∈ S^n_++, initialize x^(0) ∈ R^n, and repeat for k = 1, 2, 3, ...
    x_i^(k+1) = ( b_i - Σ_{j≠i} A_ij x_j^(k) ) / A_ii,   i = 1, ..., n

Gauss-Seidel iterations are similar but always use the most recent iterates, i.e., they use
    Σ_{j<i} A_ij x_j^(k+1) + Σ_{j>i} A_ij x_j^(k)
in place of the sum above. Gauss-Seidel iterations always converge, but Jacobi iterations do not. (A sketch of both appears below.)
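Minimal Jacobi and Gauss-Seidel sketches (an illustration, not the lecture's code); the test matrix is an assumed diagonally dominant positive definite example, so both iterations converge here:

    import numpy as np

    def jacobi(A, b, n_iters=100):
        x = np.zeros_like(b)
        D = np.diag(A)
        for _ in range(n_iters):
            x = (b - (A @ x - D * x)) / D    # A @ x - D*x is the off-diagonal part
        return x

    def gauss_seidel(A, b, n_iters=100):
        x = np.zeros_like(b)
        for _ in range(n_iters):
            for i in range(len(b)):
                x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        return x

    rng = np.random.default_rng(0)
    n = 50
    M = rng.standard_normal((n, n))
    A = M @ M.T + 10 * n * np.eye(n)         # SPD and diagonally dominant
    b = rng.standard_normal(n)

    x_star = np.linalg.solve(A, b)
    print(np.linalg.norm(jacobi(A, b) - x_star))
    print(np.linalg.norm(gauss_seidel(A, b) - x_star))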

Gradient descent on f(x) = (1/2) x^T Ax - b^T x: this repeats
    r^(k) = b - A x^(k)
    x^(k+1) = x^(k) + t_exact · r^(k)
Since A ∈ S^n_++, the criterion f is strongly convex, implying linear convergence. But the contraction depends adversely on κ(A). That is, the gradient directions r^(k) are not diverse enough across iterations.

The conjugate gradient method replaces the gradient directions above with clever directions p^(k) satisfying
    p^(k) ⊥ span{A p^(1), ..., A p^(k-1)}
Note these directions are constructed to be diverse. The conjugate gradient method still uses one multiplication by A per iteration, and in principle it takes n iterations or much less. In practice this is not true (due to numerical errors), and preconditioning is used. (A sketch follows below.)
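A conjugate gradient sketch on a large, sparse, symmetric positive definite system, using SciPy's built-in CG (an illustration; the tridiagonal 1D-Laplacian-like matrix below is an assumed example):

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import cg

    n = 1000
    A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    x, info = cg(A, b, maxiter=10 * n)       # info == 0 indicates convergence
    print("info:", info)
    print("residual norm:", np.linalg.norm(A @ x - b))
    # This matrix is ill-conditioned (kappa grows like n^2), which is exactly the
    # setting where preconditioned CG is preferred in practice.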

References and further reading

General:
- S. Boyd, Lecture notes for EE 264A, Stanford University, Winter 2014-2015
- S. Boyd and L. Vandenberghe (2004), "Convex Optimization", Appendix C
- G. Golub and C. van Loan (1996), "Matrix Computations", Chapters 1-5, 10

Sparse numerical linear algebra / special systems:
- T. Davis (2006), "Direct Methods for Sparse Linear Systems". Find his state-of-the-art (and free) C++ package SuiteSparse at http://faculty.cse.tamu.edu/davis/suitesparse.html
- N. Vishnoi (2013), "Lx = b: Laplacian Solvers and Their Applications"