Least squares: Mathematical theory

Below we provide the "vector space" formulation, and solution, of the least squares problem. While not strictly necessary until we bring in the machinery of matrix algebra, we usually think of a vector as a column with $n$ entries, and use the "arrow" notation to denote a vector, e.g. $\vec{u}$, $\vec{v}$, etc. When we get to the matrix formulation, we sometimes drop the "arrow", i.e. if we write $x$ we mean a column matrix.

Basic least-squares problem: Find coefficients $c_1, \ldots, c_k$ so as to approximate as closely as possible a given vector $b$ by a vector of the specified form $c_1 v_1 + \cdots + c_k v_k$, in the sense that the sum of the squares of the components of the error vector $e = b - c_1 v_1 - \cdots - c_k v_k$ is as small as possible. Alternatively, we can describe the problem as that of getting as close as possible to a given vector $b$ using a combination of the vectors $v_1, \ldots, v_k$, so that the sum of the squares of the component errors is as small as possible.

Background theory:

The dot product of two vectors: If $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$ then $u \cdot v = u_1 v_1 + \cdots + u_n v_n$. Note that $u \cdot u = u_1^2 + \cdots + u_n^2$. This is the sum of the squares of the components of $u$. Geometrically, $u \cdot u$ can be interpreted as the square of the length of the vector $u$, and we write $u \cdot u = \|u\|^2$, where the non-negative symbol $\|u\|$ is the "norm" or "length" of $u$ and is defined through the dot product, namely $\|u\| = (u \cdot u)^{1/2}$. We do not, however, need any geometric arguments here; rather, geometry is simply a motivation for certain definitions. We use only the intrinsic properties of the dot product, which we enumerate below.

Properties of the dot product we wish to distinguish:

$(u + v) \cdot w = u \cdot w + v \cdot w$
$(cu) \cdot v = c \,(u \cdot v)$
$u \cdot v = v \cdot u$
$u \cdot u \ge 0$, and $u \cdot u = 0$ if and only if $u = 0$

Note also that $\|cu\| = |c| \, \|u\|$, which follows from the definition of $\|u\|$.

Terminology: If vectors $u$ and $v$ satisfy $u \cdot v = 0$ we say that $u$ and $v$ are orthogonal to each other, or mutually orthogonal, or simply orthogonal. We also write $u \perp v$ to denote the fact that $u$ and $v$ are orthogonal to each other. (Orthogonality is motivated by the geometric property of two vectors being perpendicular to each other.)

Expansion formula using properties of the dot product (analogous to FOIL in algebra):

$\|u + v\|^2 = (u + v) \cdot (u + v) = u \cdot u + 2\, u \cdot v + v \cdot v = \|u\|^2 + 2\, u \cdot v + \|v\|^2$

Important special case: If $u$ and $v$ are orthogonal, $\|u + v\|^2 = \|u\|^2 + \|v\|^2$.
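As a quick numerical check of the expansion formula and its orthogonal special case, here is a minimal Python/NumPy sketch (added for illustration; the particular vectors are arbitrary choices, not from the notes):

import numpy as np

# Two example vectors in R^4 (arbitrary choices, for illustration only).
u = np.array([1.0, -2.0, 0.5, 3.0])
v = np.array([2.0, 1.0, -1.0, 0.0])

lhs = np.dot(u + v, u + v)                            # ||u + v||^2
rhs = np.dot(u, u) + 2 * np.dot(u, v) + np.dot(v, v)  # ||u||^2 + 2 u.v + ||v||^2
print(lhs, rhs)                                       # the two values agree

# Orthogonal special case: the cross term vanishes.
p = np.array([1.0, 1.0, 0.0, 0.0])
q = np.array([1.0, -1.0, 2.0, 0.0])
print(np.dot(p, q))                                       # 0.0, so p and q are orthogonal
print(np.dot(p + q, p + q), np.dot(p, p) + np.dot(q, q))  # equal: ||p+q||^2 = ||p||^2 + ||q||^2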

Next, a simple but fundamental special case of the least squares problem, with its solution:

Given a vector $b$ and a (nonzero) vector $v$, the minimum norm $\|e\|$ of $e = b - cv$ occurs when $c$ is chosen so that $e$ is orthogonal to $v$.

1) First, it's easy to find the $c$ that works, and it is unique: We set $e \cdot v = 0$, and obtain $(b - cv) \cdot v = 0$, $b \cdot v - c\,(v \cdot v) = 0$, $c = \dfrac{b \cdot v}{v \cdot v}$ as the unique value of $c$. In what follows, we let $c^* = \dfrac{b \cdot v}{v \cdot v}$ denote this optimal value, we let $v^* = c^* v$ denote the optimal approximating vector, and let $e^* = b - c^* v$ denote the corresponding error vector. (In short, if you see "$*$" it means we are talking about an optimal quantity.)

2) It's easy to show now that $c^*$ gives the smallest value of $\|e\|$. For consider that, in general,

$e = b - cv = b - c^* v + (c^* - c)\, v = e^* + (c^* - c)\, v$.

Recalling that $e^*$ is orthogonal to $v$ (and hence orthogonal to any scalar multiple of $v$), we obtain from the expansion formula

$\|e\|^2 = \|e^*\|^2 + \|(c^* - c)\, v\|^2 = \|e^*\|^2 + (c^* - c)^2 \|v\|^2$

and see that the smallest value of $\|e\|^2$ is obtained by choosing $c = c^*$ (so that the second term in the sum is zero). This completes the proof, but here are a couple of additional observations:

3) Note that if $b$ is orthogonal to $v$, then $c^* = \dfrac{b \cdot v}{v \cdot v} = 0$. The best approximation of $b$ in this case is the zero vector.

4) For general $b$, $v$, we can write $b = c^* v + e^* = v^* + e^*$, and since the two vectors on the right are orthogonal, we have $\|b\|^2 = \|v^*\|^2 + \|e^*\|^2$, and so $\|e^*\|^2 = \|b\|^2 - \|v^*\|^2$. We wish to point out, complementary to the observation in 3), that if $v^* = 0$ then $e^* = b$.
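The following Python/NumPy sketch (our illustration; the helper name project_onto and the particular vectors b and v are chosen for this example) computes $c^*$, $v^*$ and $e^*$ and checks the defining orthogonality property:

import numpy as np

def project_onto(b, v):
    """Best approximation of b by a scalar multiple of v (v nonzero):
    returns the optimal coefficient c*, the approximation v* = c*v,
    and the error vector e* = b - c*v."""
    c_star = np.dot(b, v) / np.dot(v, v)
    v_star = c_star * v
    e_star = b - v_star
    return c_star, v_star, e_star

b = np.array([3.0, 1.0, 4.0])
v = np.array([1.0, 2.0, 2.0])
c_star, v_star, e_star = project_onto(b, v)
print(c_star)                 # (b.v)/(v.v) = 13/9
print(np.dot(e_star, v))      # ~0: the optimal error vector is orthogonal to v

# Any other coefficient gives an error at least as large, as the expansion formula predicts.
for c in [0.0, 1.0, 2.0]:
    print(c, np.linalg.norm(b - c * v) >= np.linalg.norm(e_star))   # always True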

Now we can formulate the solution to the general least squares problem. It is called the Orthogonal Projection Theorem:

Given vectors $b$ and $v_1, \ldots, v_k$, the minimum value of $\|e\|$, where $e = b - c_1 v_1 - c_2 v_2 - \cdots - c_k v_k$, is obtained if and only if the coefficients $c_1, \ldots, c_k$ are chosen so that $e$ is orthogonal to each of the vectors $v_1, \ldots, v_k$. Moreover, this choice of coefficients gives the unique vector $c_1 v_1 + \cdots + c_k v_k$ that minimizes $\|e\|$.

Proof:

1) First, the orthogonality condition is shown to be necessary. For if the coefficients are chosen so that $e = b - c_1 v_1 - c_2 v_2 - \cdots - c_k v_k$ is not orthogonal to, say, $v_j$, then using the special case above, with $e$ playing the role of $b$, we see that $\|e - c^* v_j\| < \|e\|$, where $c^* = \dfrac{e \cdot v_j}{v_j \cdot v_j}$, which means that the coefficient of $v_j$ can be changed so as to reduce the magnitude of the error vector.

2) Next, the orthogonality condition is shown to be sufficient, and the optimal vector $v^* = c_1^* v_1 + c_2^* v_2 + \cdots + c_k^* v_k$ is shown to be unique. For let $c_1^*, \ldots, c_k^*$ be such that $e^* = b - c_1^* v_1 - c_2^* v_2 - \cdots - c_k^* v_k$ is orthogonal to each of $v_1, \ldots, v_k$. As in the simple case above, we can calculate, for a generic choice of coefficients,

$e = b - (c_1 v_1 + c_2 v_2 + \cdots + c_k v_k) = b - v$ (where $v$ is used to replace that whole expression)
$\;\; = b - v^* + (v^* - v)$ (where we added and subtracted our purportedly optimal $v^*$)
$\;\; = e^* + (v^* - v)$.

Now $v^* - v = (c_1^* - c_1) v_1 + (c_2^* - c_2) v_2 + \cdots + (c_k^* - c_k) v_k$, and we see that $e^*$ is orthogonal to each term in this sum and so is orthogonal to $v^* - v$ itself. By orthogonality and the expansion formula, we then obtain

$\|e\|^2 = \|e^*\|^2 + \|v^* - v\|^2$

and we see that $\|e\|^2$ is minimized if and only if we choose $v = v^*$, which of course can be done by letting $c_1 = c_1^*, \ldots, c_k = c_k^*$.

To completely "solve" the least squares problem it only remains to show that in fact a solution always exists (for if a solution exists it must have, and need only have, the property of the orthogonal projection theorem). This can be done either by showing the existence of an orthogonal basis using the Gram-Schmidt procedure on the vectors $v_1, \ldots, v_k$ (if you don't know what that means, that's OK) or by appeal to some theorems of analysis.
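Since the Gram-Schmidt procedure is only mentioned in passing, here is a minimal Python/NumPy sketch of the idea (the function name gram_schmidt and the example vectors are our own); it also checks the conclusion of the orthogonal projection theorem on a small example:

import numpy as np

def gram_schmidt(vectors):
    """Orthogonalize a list of (assumed linearly independent) vectors by
    subtracting, from each vector, its projections onto the ones already built."""
    ortho = []
    for v in vectors:
        w = np.array(v, dtype=float)
        for q in ortho:
            w = w - (np.dot(w, q) / np.dot(q, q)) * q   # remove the component along q
        ortho.append(w)
    return ortho

v1 = np.array([1.0, 1.0, 0.0])
v2 = np.array([1.0, 0.0, 1.0])
q1, q2 = gram_schmidt([v1, v2])
print(np.dot(q1, q2))    # ~0: the new vectors are mutually orthogonal

# With an orthogonal basis, projecting b onto span{v1, v2} is a sum of
# one-dimensional projections, and the residual is orthogonal to v1 and v2,
# exactly as the orthogonal projection theorem requires.
b = np.array([1.0, 2.0, 3.0])
b_hat = sum((np.dot(b, q) / np.dot(q, q)) * q for q in (q1, q2))
print(np.dot(b - b_hat, v1), np.dot(b - b_hat, v2))   # both ~0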

Least-squares and linear systems

We can describe the least squares problem and the orthogonal projection theorem very succinctly using matrix algebra, and conversely, we can interpret the least-squares solution of a linear system as a least-squares problem as discussed above. We note that a (linear) combination of vectors can be written as a matrix times a vector of coefficients:

$c_1 v_1 + \cdots + c_k v_k = \begin{bmatrix} v_1 & v_2 & \cdots & v_k \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_k \end{bmatrix} = Ac$,

where the matrix $A$ is composed of the vectors as columns, and $c$ is the column matrix of unknown coefficients. Then $e = b - Ac$. Next, note that the dot product $u \cdot v$ of two vectors can be carried out in matrix algebra as $u^t v$. The orthogonal projection theorem states that $\|e\|$ is minimized when $e$ is orthogonal to each of $v_1, v_2, \ldots, v_k$. In matrix form, this results in the equations:

$v_1^t (b - Ac) = 0$, or $v_1^t b = v_1^t A c$
$v_2^t (b - Ac) = 0$, or $v_2^t b = v_2^t A c$
$\vdots$
$v_k^t (b - Ac) = 0$, or $v_k^t b = v_k^t A c$

These are sometimes called the "normal equations" for the least squares solution. However, this system of $k$ linear equations (for the $k$ unknown coefficients in the vector $c$) can be assembled into a single matrix form. Noting that $v_1^t, \ldots, v_k^t$ are simply the columns of $A$ turned into rows, we can write the system as the single matrix equation:

$A^t (b - Ac) = 0$, or $A^t b = A^t A\, c$.

This system is also referred to as the "normal equations". Regardless of the matrix $A$, this represents a square system of linear equations, and it always has a solution (though the solution is not guaranteed to be unique unless the columns of $A$ are linearly independent). Now, if we wish to solve the overdetermined system $Ax = b$ so as to minimize $\|e\| = \|b - Ax\|$, it is clear that the minimum value of $\|e\|$ is obtained when $x$ satisfies the normal equations $A^t b = A^t A\, x$. This is the system that MATLAB solves when it is presented with an overdetermined system (more equations than unknowns).
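As an illustration of this matrix form, here is a short Python/NumPy sketch (added by us; the data values are made up) that solves the normal equations directly and compares the result with the library's own least-squares routine:

import numpy as np

# An overdetermined system Ax = b: 5 equations, 2 unknowns (made-up numbers).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Solve the normal equations A^t A x = A^t b directly...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# ...and compare with the library's least-squares routine, which minimizes ||b - Ax||.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)          # the two answers agree
print(A.T @ (b - A @ x_normal))   # ~0: the residual is orthogonal to the columns of A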

Data fitting:

In curve fitting we are given a set of $(x, y)$ values, where $y$ is assumed to be a function of $x$ (or simply determined by $x$ in some way). We wish to find a function $f(x)$, from among some simple collection of functions, which fairly well approximates the given data values in the sense that $(x, f(x)) \approx (x, y)$ for each given data value. To be more specific, if we suppose our data values are $(x_i, y_i)$, $i = 1, \ldots, n$, then we wish to choose a function $f(x)$ so as to minimize the pointwise errors $f(x_i) - y_i$ in the least squares sense, i.e. we want to minimize $\sum_{i=1}^{n} (f(x_i) - y_i)^2$.

Our function $f(x)$ is assumed to be obtained from a (linear) combination of a simple set of functions (e.g. polynomials). (This is an important assumption!) We assume there are $k$ such functions $\varphi_1(x), \ldots, \varphi_k(x)$ and we write $f(x) = c_1 \varphi_1(x) + \cdots + c_k \varphi_k(x)$. Now what we wish is that we could obtain:

$y_1 = f(x_1) = c_1 \varphi_1(x_1) + \cdots + c_k \varphi_k(x_1)$
$\vdots$
$y_n = f(x_n) = c_1 \varphi_1(x_n) + \cdots + c_k \varphi_k(x_n)$

But this is simply an overdetermined system of linear equations for the coefficients $c_1, \ldots, c_k$ whose least squares solution we know how to obtain. If we define $y$ as the vector of $y$ values and $x$ as the corresponding vector of $x$ values, we can write $y = \Phi(x)\, c$ as our system, where

$\Phi(x) = \begin{bmatrix} \varphi_1(x) & \varphi_2(x) & \cdots & \varphi_k(x) \end{bmatrix}$

is the matrix whose columns $\varphi_1(x), \ldots, \varphi_k(x)$ are the "data vectors" of each of the functions with which we are approximating the data. Indeed, we can write the curve fitting problem in the form: find $c_1, \ldots, c_k$ which minimize the sum of the squares of the error in approximating $y \approx c_1 \varphi_1(x) + \cdots + c_k \varphi_k(x)$, so that we are approximating the data vector $y$ as a combination of the data vectors of the functions $\varphi_1(x), \ldots, \varphi_k(x)$. We can then obtain the solution of this least squares problem using the normal equations.

Least squares function approximation (optional):

Imagine now that our data are the points on an entire curve, corresponding to the points $(x, f(x))$ for all $x$ in some interval $[a, b]$. Once again we wish to approximate $y = f(x) \approx c_1 \varphi_1(x) + \cdots + c_k \varphi_k(x)$ for all values of $x$ on the interval. But how do we measure the error over the whole interval? Instead of expressing the size of the error (or of a vector in general) in terms of the sum of the squares of the components of the vector, in the case of functions we take the integral of the square of the function over the interval concerned, that is, we define $\|f\|^2 = \int_a^b f(x)^2 \, dx$. What "dot product" would this norm come from? If we define the dot product of two functions with domain $[a, b]$ as $f \cdot g = \int_a^b f(x)\, g(x) \, dx$, then $\|f\|^2 = f \cdot f$. Note that this dot product has exactly the same general properties as the dot product for vectors, as we previously enumerated them. This gives rise, using exactly the same proof, to the orthogonal projection theorem for least squares function approximation: the smallest value of $\|e(x)\|$, where $e(x) = f(x) - (c_1 \varphi_1(x) + \cdots + c_k \varphi_k(x))$, is obtained when $e(x)$ is orthogonal to each of the functions $\varphi_1(x), \ldots, \varphi_k(x)$ on the interval $[a, b]$. That is, the optimal values of $c_1, \ldots, c_k$ are given by the solution of the system of linear equations:

$\varphi_1 \cdot f = (\varphi_1 \cdot \varphi_1)\, c_1 + (\varphi_1 \cdot \varphi_2)\, c_2 + \cdots + (\varphi_1 \cdot \varphi_k)\, c_k$
$\vdots$
$\varphi_k \cdot f = (\varphi_k \cdot \varphi_1)\, c_1 + (\varphi_k \cdot \varphi_2)\, c_2 + \cdots + (\varphi_k \cdot \varphi_k)\, c_k$
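To make these continuous normal equations concrete, here is a small Python/NumPy sketch we are adding (the choice of $f(x) = e^x$, the interval $[0, 1]$, the quadratic basis, and the midpoint-rule integration are all illustrative assumptions, not from the notes):

import numpy as np

# L^2 least squares approximation of f(x) = exp(x) on [0, 1] by a quadratic
# c1*1 + c2*x + c3*x^2, via the continuous normal equations
#   (phi_i . phi_j) c_j = phi_i . f,   where  g . h = integral over [a, b] of g(x) h(x) dx.
# The integrals are approximated with a fine midpoint rule.

f = np.exp
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]
a, b_right, n = 0.0, 1.0, 20000

x = a + (np.arange(n) + 0.5) * (b_right - a) / n   # midpoints of n subintervals
dx = (b_right - a) / n
def inner(g, h):
    return np.sum(g * h) * dx                      # approximate integral of g*h over [a, b]

G = np.array([[inner(p(x), q(x)) for q in basis] for p in basis])  # Gram matrix phi_i . phi_j
rhs = np.array([inner(p(x), f(x)) for p in basis])                 # right-hand sides phi_i . f
c = np.linalg.solve(G, rhs)
print(c)    # coefficients of the best quadratic approximation in the L^2 sense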

In general, least squares approximation by polynomials gives a much better overall "fit" than interpolation once the degree of the polynomials begins to grow. In a sense, we are minimizing the "average" squared error over all values of $x$, as opposed to interpolation, which forces zero error at a discrete set of points while not caring at all about the error at other values of $x$. In fact, the least squares polynomial approximations of a given function actually converge to the function as the degree of the polynomial approaches infinity, requiring only that the function satisfy a mild smoothness condition (a continuous derivative, or even just a piecewise continuous derivative, is sufficient).

In practice, one cannot exactly compute the integrals (i.e. the dot products) involved in the normal equations for least squares function approximation; one can resort to approximating the integrals involved or, for a quick and easy substitute, one can simply perform a vector least squares approximation by sampling the function at many equally spaced points on the interval $[a, b]$. If the points are not equally spaced, then we are approximating a slightly different and more general type of least squares function approximation called "weighted least squares", in which the errors in different parts of the interval can be given different emphasis or "weight". In this case we are "really" looking at a norm given by $\|f\|^2 = \int_a^b f(x)^2\, w(x) \, dx$, where the "weighting function" $w(x)$ satisfies $w(x) \ge 0$. Such problems also arise naturally in probability theory when we try to minimize the "average" or "expected" squared error when different values of $x$ are given different probabilities of occurring.

There are many other "systems" of functions besides polynomials that can be used for least squares function approximation. One very important system, even more important than the polynomials, is the so-called trigonometric polynomials on the interval $[-\pi, \pi]$, given by $1, \cos x, \sin x, \cos 2x, \sin 2x, \cos 3x, \sin 3x, \ldots$. These are particularly suitable for approximating $2\pi$-periodic functions and have the especially useful/important/fundamental property of orthogonality: $\varphi_i \cdot \varphi_j = \int_a^b \varphi_i(x)\, \varphi_j(x) \, dx = 0$ whenever $i$ and $j$ are different. In this case the normal equations reduce very simply to

$\varphi_i \cdot f = \int_a^b \varphi_i(x)\, f(x) \, dx = \left( \int_a^b \varphi_i(x)^2 \, dx \right) c_i$

and so the coefficients are immediately determined, and in fact are independent of which other functions are being used in the approximation. In the case of the trigonometric polynomials, the resulting coefficients are the so-called Fourier coefficients, and the resulting least squares approximations are the partial sums of the Fourier series of the function $f(x)$ on $[-\pi, \pi]$.
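As a final added sketch (our own illustration; the sample function $f(x) = |x|$ and the midpoint-rule integration are assumptions), one can check numerically that the trigonometric system is orthogonal and that the least squares coefficients therefore decouple into the Fourier coefficient formula:

import numpy as np

# Orthogonality of the trigonometric system on [-pi, pi], and the resulting
# decoupled least squares (Fourier) coefficients for a sample function f(x) = |x|.
# Integrals are again approximated with a midpoint rule.

n = 100000
x = -np.pi + (np.arange(n) + 0.5) * (2 * np.pi / n)   # midpoints covering [-pi, pi]
dx = 2 * np.pi / n

basis = [np.ones_like(x), np.cos(x), np.sin(x), np.cos(2 * x), np.sin(2 * x)]
G = np.array([[np.sum(p * q) * dx for q in basis] for p in basis])
print(np.round(G, 6))    # diagonal (up to rounding): distinct basis functions are orthogonal

f = np.abs(x)
coeffs = [np.sum(p * f) * dx / (np.sum(p * p) * dx) for p in basis]
# Each coefficient is determined on its own -- these are the Fourier coefficients of f.
print(np.round(coeffs, 6))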