Least squares: the big idea

Notes for 2016-02-22

Least squares problems are a special sort of minimization problem. Suppose $A \in \mathbb{R}^{m \times n}$ where $m > n$. In general, we cannot solve the overdetermined system $Ax = b$; the best we can do is minimize the residual $r = b - Ax$. In the least squares problem, we minimize the two-norm of the residual:
\[
  \mbox{Find } x \mbox{ to minimize } \|r\|_2^2 = \langle r, r \rangle.
\]
This is not the only way to approximately solve the system, but it is attractive for several reasons (the first two are checked numerically in the short sketch after this list):

1. It is mathematically attractive: the solution of the least squares problem is $x = A^\dagger b$, where $A^\dagger$ is the Moore-Penrose pseudoinverse of $A$.

2. There is a nice picture that goes with it: the least squares solution is the projection of $b$ onto the range of $A$, and the residual at the least squares solution is orthogonal to the range of $A$.

3. It is a mathematically reasonable choice in statistical settings when the data vector $b$ is contaminated by Gaussian noise.
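To make the first two points concrete, here is a minimal MATLAB/Octave sketch; the random test problem is made up for illustration and is not part of the notes. It checks that backslash and the pseudoinverse produce the same minimizer, and that the residual is orthogonal to the columns of $A$.

  % Minimal sketch on a made-up random problem: backslash and pinv agree,
  % and the residual is orthogonal to range(A).
  m = 100; n = 5;
  A = randn(m, n);               % full column rank with probability 1
  b = randn(m, 1);

  x_bs   = A \ b;                % least squares via backslash
  x_pinv = pinv(A) * b;          % least squares via the pseudoinverse
  r      = b - A*x_bs;           % residual at the minimizer

  norm(x_bs - x_pinv)            % small (roundoff level)
  norm(A'*r)                     % small: r is orthogonal to every column of A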

Cricket chirps: an example

Did you know that you can estimate the temperature by listening to the rate of chirps? The data set in Table 1 (originally attributed to http://mste.illinois.edu) represents measurements of the number of chirps (over 15 seconds) of a striped ground cricket at different temperatures measured in degrees Fahrenheit. A plot (Figure 1) shows that the two are roughly correlated: the higher the temperature, the faster the crickets chirp.

  Chirp | 20 16 20 18 17 16 15 17 15 16 15 17 16 17 14
  Temp  | 89 72 93 84 81 75 70 82 69 83 80 83 81 84 76

Table 1: Cricket data: chirp count over a 15 second period vs. temperature in degrees Fahrenheit.

[Figure 1: Cricket chirps vs. temperature and a model fit via linear regression.]

We can quantify this by attempting to fit a linear model
\[
  \mbox{temperature} = \alpha \cdot \mbox{chirps} + \beta + \epsilon,
\]
where $\epsilon$ is an error term. To solve this problem by linear regression, we minimize the residual $r = b - Ax$, where
\[
  b_i = \mbox{temperature in experiment } i, \quad
  A_{i1} = \mbox{chirps in experiment } i, \quad
  A_{i2} = 1, \quad
  x = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.
\]
MATLAB and Octave are capable of solving least squares problems using the backslash operator; that is, if chirps and temp are column vectors in MATLAB, we can solve this regression problem as

  A = [chirps, ones(ndata,1)];
  x = A\temp;

The algorithms underlying that backslash operation will make up most of the next lecture.

In more complex examples, we want to fit a model involving more than two variables. This still leads to a linear least squares problem, but one in which A may have more than one or two columns. As we will see later in the semester, we also use linear least squares problems as a building block for more complex fitting procedures, including fitting nonlinear models and models with more complicated objective functions.
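For convenience, here is a self-contained version of that snippet with the Table 1 data typed in; the plotting commands are just one way to reproduce something like Figure 1.

  % Self-contained version of the snippet above, with the Table 1 data.
  chirps = [20 16 20 18 17 16 15 17 15 16 15 17 16 17 14]';
  temp   = [89 72 93 84 81 75 70 82 69 83 80 83 81 84 76]';
  ndata  = length(chirps);

  A = [chirps, ones(ndata,1)];
  x = A\temp;                    % x(1) = alpha (slope), x(2) = beta (intercept)

  plot(chirps, temp, 'o', chirps, A*x, '-');   % data and fitted line (cf. Figure 1)
  xlabel('Chirps'); ylabel('Degrees');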

Normal equations

[Figure 2: Picture of a linear least squares problem. The vector $Ax$ is the closest vector in $\mathcal{R}(A)$ to a target vector $b$ in the Euclidean norm; consequently, the residual $r = b - Ax$ is normal (orthogonal) to $\mathcal{R}(A)$.]

When we minimize the Euclidean norm of $r = b - Ax$, we find that $r$ is normal to everything in the range space of $A$ (Figure 2): $b - Ax \perp \mathcal{R}(A)$, or, equivalently, for all $z \in \mathbb{R}^n$ we have
\[
  0 = (Az)^T (b - Ax) = z^T (A^T b - A^T A x).
\]
The statement that the residual is orthogonal to everything in $\mathcal{R}(A)$ thus leads to the normal equations
\[
  A^T A x = A^T b.
\]
To see why this is the right system, suppose $x$ satisfies the normal equations and let $y \in \mathbb{R}^n$ be arbitrary. Using the fact that $r \perp Ay$ and the Pythagorean theorem, we have
\[
  \|b - A(x+y)\|^2 = \|r - Ay\|^2 = \|r\|^2 + \|Ay\|^2 \geq \|r\|^2.
\]
The inequality is strict if $Ay \neq 0$; and if the columns of $A$ are linearly independent, $Ay = 0$ is equivalent to $y = 0$.

We can also reach the normal equations by calculus. Define the least squares objective function
\[
  F(x) = \|Ax - b\|^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 x^T A^T b + b^T b.
\]
The minimum occurs at a stationary point; that is, for any perturbation $\delta x$ to $x$ we have
\[
  \delta F = 2 \, \delta x^T (A^T A x - A^T b) = 0;
\]
equivalently, $\nabla F(x) = 2(A^T A x - A^T b) = 0$, which is the normal equations again!

A family of factorizations

Cholesky

If $A$ is full rank, then $A^T A$ is a symmetric positive definite matrix, and we can compute a Cholesky factorization of $A^T A$:
\[
  A^T A = R^T R.
\]
The solution to the least squares problem is then
\[
  x = (A^T A)^{-1} A^T b = R^{-1} R^{-T} A^T b,
\]
or, in MATLAB world,

  R = chol(A'*A, 'upper');
  x = R\(R'\(A'*b));
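As a quick sanity check on a made-up random problem (not part of the original notes), the Cholesky route reproduces the backslash solution:

  % Sanity check: the Cholesky-based solve matches backslash (made-up data).
  m = 30; n = 3;
  A = randn(m, n);  b = randn(m, 1);

  R = chol(A'*A, 'upper');
  x = R\(R'\(A'*b));
  norm(x - A\b)                  % small (roundoff level)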

Economy QR

The Cholesky factor $R$ appears in a different setting as well. Let us write $A = QR$ where $Q = AR^{-1}$; then
\[
  Q^T Q = R^{-T} A^T A R^{-1} = R^{-T} R^T R R^{-1} = I.
\]
That is, $Q$ is a matrix with orthonormal columns. This economy QR factorization can be computed in several different ways, including one that you have seen before in a different guise (the Gram-Schmidt process). MATLAB provides a numerically stable method to compute the QR factorization via

  [Q,R] = qr(A,0);

and we can use the QR factorization directly to solve the least squares problem without forming $A^T A$ by

  [Q,R] = qr(A,0);
  x = R\(Q'*b);

Full QR

There is an alternate full QR decomposition where we write
\[
  A = QR, \quad
  Q = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \in \mathbb{R}^{m \times m}, \quad
  R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix} \in \mathbb{R}^{m \times n},
\]
where $Q_1 \in \mathbb{R}^{m \times n}$ and $R_1 \in \mathbb{R}^{n \times n}$. To see how this connects to the least squares problem, recall that the Euclidean norm is invariant under orthogonal transformations, so
\[
  \|r\|^2 = \|Q^T r\|^2
  = \left\| \begin{bmatrix} Q_1^T b \\ Q_2^T b \end{bmatrix} - \begin{bmatrix} R_1 \\ 0 \end{bmatrix} x \right\|^2
  = \|Q_1^T b - R_1 x\|^2 + \|Q_2^T b\|^2.
\]
We can set $\|Q_1^T b - R_1 x\|^2$ to zero by setting $x = R_1^{-1} Q_1^T b$; the result is
\[
  \|r\|^2 = \|Q_2^T b\|^2.
\]
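The following sketch, on a made-up random problem, checks three of the claims above: the economy QR solve matches backslash, the triangular factor from qr agrees with the Cholesky factor of $A^T A$ up to row signs, and the minimal residual norm equals $\|Q_2^T b\|$.

  % QR sanity checks on a made-up random problem.
  m = 80; n = 6;
  A = randn(m, n);  b = randn(m, 1);

  [Q1, R1] = qr(A, 0);           % economy QR
  x = R1\(Q1'*b);
  norm(x - A\b)                  % matches backslash (roundoff level)

  Rchol = chol(A'*A, 'upper');
  norm(abs(R1) - abs(Rchol))     % same triangular factor, up to row signs

  [Q, R] = qr(A);                % full QR: Q is m-by-m
  Q2 = Q(:, n+1:end);
  norm(b - A*x) - norm(Q2'*b)    % minimal residual norm equals ||Q2'*b||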

SVD

The full QR decomposition is useful because orthogonal transformations do not change lengths. Hence, the QR factorization lets us change to a coordinate system where the problem is simple without changing the problem in any fundamental way. The same is true of the SVD, which we write as
\[
  A = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T
  \quad \mbox{(full SVD)}
  \quad = U_1 \Sigma V^T \quad \mbox{(economy SVD)}.
\]
As with the QR factorization, we can apply an orthogonal transformation involving the factor $U$ that makes the least squares residual norm simple:
\[
  \|U^T r\|^2
  = \left\| \begin{bmatrix} U_1^T b \\ U_2^T b \end{bmatrix} - \begin{bmatrix} \Sigma V^T \\ 0 \end{bmatrix} x \right\|^2
  = \|U_1^T b - \Sigma V^T x\|^2 + \|U_2^T b\|^2,
\]
and we can minimize by setting $x = V \Sigma^{-1} U_1^T b$.

The Moore-Penrose pseudoinverse

If $A$ is full rank, then $A^T A$ is a symmetric positive definite matrix, and the normal equations have a unique solution
\[
  x = A^\dagger b \quad \mbox{where} \quad A^\dagger = (A^T A)^{-1} A^T.
\]
The matrix $A^\dagger \in \mathbb{R}^{n \times m}$ is the Moore-Penrose pseudoinverse. We can also write $A^\dagger$ via the economy QR and SVD factorizations as
\[
  A^\dagger = R^{-1} Q_1^T, \qquad
  A^\dagger = V \Sigma^{-1} U_1^T.
\]
If $m = n$, the pseudoinverse and the inverse are the same. For $m > n$, the Moore-Penrose pseudoinverse has the property that $A^\dagger A = I$, and
\[
  \Pi = A A^\dagger = Q_1 Q_1^T = U_1 U_1^T
\]
is the orthogonal projector that maps each vector to the closest vector (in the Euclidean norm) in the range space of $A$.
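Here is a short check of these identities, again on a made-up random problem, using MATLAB's qr, svd, and pinv:

  % Check the pseudoinverse and projector identities on a made-up problem.
  m = 60; n = 4;
  A = randn(m, n);

  [Q1, R1] = qr(A, 0);
  [U1, S, V] = svd(A, 'econ');

  Adag_qr  = R1\(Q1');           % A^dagger = R^{-1} * Q1'
  Adag_svd = V*(S\(U1'));        % A^dagger = V * Sigma^{-1} * U1'

  norm(Adag_qr - pinv(A))        % small (roundoff level)
  norm(Adag_svd - pinv(A))       % small (roundoff level)
  norm(Adag_qr*A - eye(n))       % A^dagger * A = I
  norm(Q1*Q1' - U1*U1')          % same orthogonal projector onto range(A)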

The good, the bad, and the ugly

At a high level, there are two pieces to solving a least squares problem:

1. Project $b$ onto the span of $A$.

2. Solve a linear system so that $Ax$ equals the projected $b$.

Consequently, there are two ways we can get into trouble in solving least squares problems: either $b$ may be nearly orthogonal to the span of $A$, or the linear system might be ill conditioned.

Let's first consider the issue of $b$ nearly orthogonal to the range of $A$. Suppose we have the trivial problem
\[
  A = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad
  b = \begin{bmatrix} \epsilon \\ 1 \end{bmatrix}.
\]
The solution to this problem is $x = \epsilon$; but the solution for
\[
  A = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad
  \hat{b} = \begin{bmatrix} -\epsilon \\ 1 \end{bmatrix}
\]
is $\hat{x} = -\epsilon$. Note that $\|\hat{b} - b\|/\|b\| \approx 2\epsilon$ is small, but $\|\hat{x} - x\|/\|x\| = 2$ is huge. That is because the projection of $b$ onto the span of $A$ (i.e. the first component of $b$) is much smaller than $b$ itself; so an error in $b$ that is small relative to the overall size may not be small relative to the size of the projection onto the columns of $A$.

Of course, the case when $b$ is nearly orthogonal to $A$ often corresponds to a rather silly regression, like trying to fit a straight line to data distributed uniformly around a circle, or trying to find a meaningful signal when the signal-to-noise ratio is tiny. This is something to be aware of and to watch out for, but it isn't exactly subtle: if $\|r\|/\|b\|$ is near one, we have a numerical problem, but we also probably don't have a very good model.
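The two-by-one example above is easy to try directly (epsilon = 1e-3 is an arbitrary choice for illustration):

  % The nearly-orthogonal example above: a small relative change in b
  % produces a large relative change in x.
  epsilon = 1e-3;                % arbitrary small value
  A    = [1; 0];
  b    = [ epsilon; 1];
  bhat = [-epsilon; 1];

  x    = A\b;                    % = epsilon
  xhat = A\bhat;                 % = -epsilon

  norm(bhat - b)/norm(b)         % about 2*epsilon (small)
  norm(xhat - x)/norm(x)         % = 2 (huge)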

A more subtle problem occurs when some columns of $A$ are nearly linearly dependent (i.e. $A$ is ill-conditioned). The condition number of $A$ for least squares is
\[
  \kappa(A) = \|A\| \|A^\dagger\| = \sigma_1/\sigma_n.
\]
If $\kappa(A)$ is large, that means:

1. Small relative changes to $A$ can cause large changes to the span of $A$ (i.e. there are some vectors in the span of $\hat{A}$ that form a large angle with all the vectors in the span of $A$).

2. The linear system to find $x$ in terms of the projection onto $A$ will be ill-conditioned.

If $\theta$ is the angle between $b$ and the range of $A$, then the sensitivity to perturbations in $b$ is
\[
  \frac{\|\delta x\|}{\|x\|} \leq \frac{\kappa(A)}{\cos(\theta)} \frac{\|\delta b\|}{\|b\|},
\]
while the sensitivity to perturbations in $A$ is
\[
  \frac{\|\delta x\|}{\|x\|} \leq \left( \kappa(A)^2 \tan(\theta) + \kappa(A) \right) \frac{\|\delta A\|}{\|A\|}.
\]
Even if the residual is moderate, the sensitivity of the least squares problem to perturbations in $A$ (either due to roundoff or due to measurement error) can quickly be dominated by $\kappa(A)^2 \tan(\theta)$ if $\kappa(A)$ is at all large; the sketch at the end of these notes illustrates this.

In regression problems, the columns of $A$ correspond to explanatory factors. For example, we might try to use height, weight, and age to explain the probability of some disease. In this setting, ill-conditioning happens when the explanatory factors are correlated; for example, weight might be well predicted by height and age in our sample population. This happens reasonably often. When there is a lot of correlation, we have an ill-posed problem; we will talk about this case in a couple of lectures.
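Finally, to see the $\kappa(A)^2 \tan(\theta)$ effect numerically, here is a sketch with made-up data (the singular values, the residual size, and the perturbation level are all arbitrary choices): build an ill-conditioned $A$, add a residual component to $b$, perturb $A$ slightly, and compare the observed relative change in $x$ with the first-order bound above.

  % Sketch of the sensitivity bound with made-up data: an ill-conditioned A,
  % a modest residual, and a tiny random perturbation of A.
  m = 50; n = 5;
  [U, ~] = qr(randn(m, n+1), 0);           % orthonormal columns
  u_perp = U(:, n+1);                      % direction orthogonal to range(A)
  U = U(:, 1:n);
  V = orth(randn(n));
  s = logspace(0, -6, n);                  % singular values 1, ..., 1e-6
  A = U*diag(s)*V';                        % kappa(A) = 1e6

  x_true = randn(n, 1);
  b = A*x_true + 0.1*norm(A*x_true)*u_perp;  % residual component outside range(A)

  x  = A\b;
  E  = 1e-10*randn(m, n);                  % tiny perturbation of A
  xh = (A + E)\b;

  kappa = cond(A);
  theta = atan2(norm(b - A*x), norm(A*x)); % tan(theta) = ||r|| / ||A*x||
  observed = norm(xh - x)/norm(x)          % much larger than norm(E)/norm(A)
  bound    = (kappa^2*tan(theta) + kappa)*norm(E)/norm(A)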