Least squares: Mathematical theory

Below we provide the "vector space" formulation, and solution, of the least squares problem. While not strictly necessary until we bring in the machinery of matrix algebra, we usually think of a vector as a column with n entries, and use the "arrow" notation to denote a vector, e.g. u, v, etc. When we get to the matrix formulation, we sometimes drop the "arrow", i.e. if we write x we mean a column matrix.

Basic least-squares problem: Find coefficients c_1, ..., c_k so as to approximate as closely as possible a given vector y by a vector of the specified form c_1 v_1 + ... + c_k v_k, in the sense that the sum of the squares of the components of the error vector e = y − (c_1 v_1 + ... + c_k v_k) is as small as possible. Alternatively, we can describe the problem as that of getting as close as possible to a given vector y using a combination of the vectors v_1, ..., v_k, so that the sum of the squares of the component errors is as small as possible.

Background theory: The dot product of two vectors: If u = (u_1, ..., u_n) and v = (v_1, ..., v_n), then u·v = u_1 v_1 + ... + u_n v_n. Note that u·u = u_1² + ... + u_n². This is the sum of the squares of the components of u. Geometrically, u·u can be interpreted as the square of the length of the vector u, and we write u·u = ‖u‖², where the non-negative symbol ‖u‖ is the "norm" or "length" of u and is defined through the dot product, namely ‖u‖ = (u·u)^(1/2). We do not, however, need any geometric arguments here; rather, geometry is simply a motivation for certain definitions. We use only the intrinsic properties of the dot product, which we enumerate below.

Properties of the dot product we wish to distinguish:
(u + v)·w = u·w + v·w
(cu)·v = c(u·v)
u·v = v·u
u·u ≥ 0, and u·u = 0 if and only if u = 0
Note also that ‖cu‖ = |c| ‖u‖, which follows from the definition of ‖u‖.

Terminology: If vectors u and v satisfy u·v = 0 we say that u and v are orthogonal to each other, or mutually orthogonal, or simply orthogonal. We also write u ⊥ v to denote the fact that u and v are orthogonal to each other.
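These definitions and properties are easy to check numerically. The following sketch (using NumPy, purely as an illustration; the vectors are invented) verifies a few of them:

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, -1.0, 0.0])

# Dot product: u·v = u_1 v_1 + ... + u_n v_n
dot_uv = np.dot(u, v)

# u·u is the sum of the squares of the components, i.e. the squared norm
norm_u_sq = np.dot(u, u)
assert np.isclose(norm_u_sq, np.linalg.norm(u) ** 2)

# Symmetry and additivity in the first argument
w = np.array([0.5, 3.0, -2.0])
assert np.isclose(np.dot(u, v), np.dot(v, u))
assert np.isclose(np.dot(u + v, w), np.dot(u, w) + np.dot(v, w))

# u and v above are orthogonal: u·v = 1*2 + 2*(-1) + 2*0 = 0
assert np.isclose(dot_uv, 0.0)
```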
(Orthogonality is motivated by the geometric property of two vectors being perpendicular to each other.)

Expansion formula using properties of the dot product (analogous to FOIL in algebra):
‖u + v‖² = (u + v)·(u + v) = u·u + 2 u·v + v·v = ‖u‖² + 2 u·v + ‖v‖²
Important special case: If u and v are orthogonal, then ‖u + v‖² = ‖u‖² + ‖v‖².

Next, a simple but fundamental special case of the least squares problem, with its solution:
Given a vector b and a (nonzero) vector v, the minimum norm ‖e‖ of e = b − cv occurs when c is chosen so that e is orthogonal to v.

1) First, it's easy to find the c that works, and it is unique: We set e·v = 0 and obtain (b − cv)·v = 0, b·v − c(v·v) = 0, c = (b·v)/(v·v) as the unique value of c. In what follows, we let c* = (b·v)/(v·v) denote this optimal value, we let v* = c*v denote the optimal approximating vector, and we let e* = b − c*v denote the corresponding error vector. (In short, if you see "*" it means we are talking about an optimal quantity.)

2) It's easy to show now that c* gives the smallest value of ‖e‖. For consider that in general,
e = b − cv = b − c*v + (c* − c)v = e* + (c* − c)v.
Recalling that e* is orthogonal to v (and hence orthogonal to any scalar multiple of v), we obtain from the expansion formula
‖e‖² = ‖e*‖² + ‖(c* − c)v‖² = ‖e*‖² + (c* − c)² ‖v‖²
and see that the smallest value of ‖e‖² is obtained by choosing c = c* (so that the second term in the sum is zero). This completes the proof, but a couple of additional observations:

3) Note that if b is orthogonal to v, then c* = (b·v)/(v·v) = 0. The best approximation of b in this case is the zero vector.

4) For general b, v, we can write b = c*v + e* = v* + e*, and since the two vectors on the right are orthogonal, we have ‖b‖² = ‖v*‖² + ‖e*‖², and so ‖e*‖² = ‖b‖² − ‖v*‖² ≤ ‖b‖². We wish to point out, complementary to the observation in 3), that if v* = 0 then ‖e*‖ = ‖b‖.

Now we can formulate the solution to the general least squares problem. It is called the Orthogonal Projection Theorem:

Given vectors b and v_1, ..., v_k, the minimum value of ‖e‖, where e = b − (c_1 v_1 + c_2 v_2 + ... + c_k v_k), is obtained if and only if the coefficients c_1, ..., c_k are chosen so that e is orthogonal to each of the vectors v_1, ..., v_k. Moreover, this choice of coefficients gives the unique vector that minimizes ‖e‖.

Proof: 1) First, the orthogonality condition is shown to be necessary. For if the coefficients are chosen so that e = b − (c_1 v_1 + c_2 v_2 + ... + c_k v_k) is not orthogonal to, say, v_j, then using the special case above in the case where e plays the role of b, we see that ‖e − c* v_j‖ < ‖e‖, where c* = (e·v_j)/(v_j·v_j), which means that the coefficient of v_j can be changed so as to reduce the magnitude of the error vector.

2) Next, the orthogonality condition is shown to be sufficient, and the optimal vector v* = c_1* v_1 + c_2* v_2 + ... + c_k* v_k is shown to be unique. For let c_1*, ..., c_k* be such that e* = b − (c_1* v_1 + c_2* v_2 + ... + c_k* v_k) is orthogonal to each of v_1, ..., v_k. As in the simple case above, we can calculate for a generic choice of coefficients:
e = b − (c_1 v_1 + c_2 v_2 + ... + c_k v_k)
  = b − v   (where v is used to replace that whole expression)
  = b − v* + v* − v   (where we added and subtracted our purportedly optimal v*)
  = e* + (v* − v).
Now v* − v = (c_1* − c_1) v_1 + (c_2* − c_2) v_2 + ... + (c_k* − c_k) v_k, and we see that e* is orthogonal to each term in this sum, and so is orthogonal to v* − v itself. By orthogonality and the expansion formula, we then obtain
‖e‖² = ‖e*‖² + ‖v* − v‖²
and we see that ‖e‖² is minimized if and only if we choose v = v*, which of course can be done by letting c_1 = c_1*, ..., c_k = c_k*.

To completely "solve" the least squares problem it only remains to show that in fact a solution always exists (for if a solution exists it must have, and need only have, the property of the orthogonal projection theorem). This can be done either by showing the existence of an orthogonal basis using the Gram-Schmidt procedure on the vectors v_1, ..., v_k (if you don't know what that means, that's OK) or by appeal to some theorems of analysis.

Least-squares and linear systems

We can describe the least squares problem and the orthogonal projection theorem very succinctly using matrix algebra, and conversely, we can interpret the least-squares solution of a linear system as a least squares problem as discussed above. We note that a (linear) combination of vectors can be written as a matrix times a vector of coefficients:
c_1 v_1 + ... + c_k v_k = [v_1 v_2 ... v_k] (c_1, c_2, ..., c_k)^t = Ac,
where the matrix A is composed of the vectors v_1, ..., v_k as columns, and c is the column matrix of unknown coefficients. Then e = b − Ac. Next, note that the dot product u·v of two vectors can be carried out in matrix algebra as u^t v. The orthogonal projection theorem states that ‖e‖ is minimized when e is orthogonal to each of v_1, v_2, ..., v_k.
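The orthogonality conditions of the theorem can be solved numerically. Here is a small sketch (NumPy; the vectors b, v1, v2 are invented for illustration) that sets up the k = 2 conditions v_i·(b − c_1 v_1 − c_2 v_2) = 0 as a 2×2 linear system and checks the conclusions of the theorem:

```python
import numpy as np

# Target vector b and approximating vectors v1, v2 (illustrative values)
b = np.array([2.0, 1.0, 4.0])
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0])

# Orthogonality conditions v_i·(b − c1 v1 − c2 v2) = 0 give the system:
# (v1·v1) c1 + (v1·v2) c2 = v1·b
# (v2·v1) c1 + (v2·v2) c2 = v2·b
G = np.array([[v1 @ v1, v1 @ v2],
              [v2 @ v1, v2 @ v2]])
rhs = np.array([v1 @ b, v2 @ b])
c1, c2 = np.linalg.solve(G, rhs)

# The optimal error vector is orthogonal to both v1 and v2
e_star = b - c1 * v1 - c2 * v2
assert np.isclose(e_star @ v1, 0.0)
assert np.isclose(e_star @ v2, 0.0)

# Any other choice of coefficients gives a strictly larger error norm
e_other = b - (c1 + 0.1) * v1 - (c2 - 0.2) * v2
assert np.linalg.norm(e_star) < np.linalg.norm(e_other)
```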
In matrix form, the orthogonality conditions result in the equations:
v_1^t (b − Ac) = 0, or v_1^t b = v_1^t Ac
v_2^t (b − Ac) = 0, or v_2^t b = v_2^t Ac
...
v_k^t (b − Ac) = 0, or v_k^t b = v_k^t Ac
These are sometimes called the "normal equations" for the least squares solution. However, this system of k linear equations (for the k unknown coefficients in the vector c) can be assembled into a single matrix form. Noting that v_1^t, ..., v_k^t are simply the columns of A turned into rows, we can write the system as the single matrix equation:
A^t (b − Ac) = 0, or A^t b = A^t A c
This system is also referred to as the "normal equations". Regardless of the matrix A, this represents a square system of linear equations, and it always has a solution (though the solution is not guaranteed to be unique unless the columns of A are linearly independent).

Now, if we wish to solve the overdetermined system Ax = b so as to minimize ‖e‖ = ‖b − Ax‖, it is clear that the minimum value of ‖e‖ is obtained when x satisfies the normal equations A^t b = A^t A x. This is the system that MATLAB solves when it is presented with an overdetermined system (more equations than unknowns).

Data fitting:

In curve fitting we are given a set of (x, y) values, where y is assumed to be a function of x (or simply determined by x in some way). We wish to find a function f(x) from among some simple collection of functions which fairly well approximates the given data values, in the sense that (x, f(x)) ≈ (x, y) for each given data value. To be more specific, if we suppose our data values are (x_i, y_i), i = 1, ..., n, then we wish to choose a function f(x) so as to minimize the pointwise errors f(x_i) − y_i in the least squares sense, i.e. we want to minimize Σ_{i=1}^{n} (f(x_i) − y_i)². Our function f(x) is assumed obtained from a (linear) combination of a simple set of functions (e.g. polynomials). (This is an important assumption!) We assume there are k such functions φ_1(x), ..., φ_k(x), and we write f(x) = c_1 φ_1(x) + ... + c_k φ_k(x). Now what we wish is that we could obtain:
y_1 = f(x_1) = c_1 φ_1(x_1) + ... + c_k φ_k(x_1)
...
y_n = f(x_n) = c_1 φ_1(x_n) + ... + c_k φ_k(x_n)
But this is simply an overdetermined system of linear equations for the coefficients c_1, ..., c_k, whose least squares solution we know how to obtain. If we define y as the vector of y values and x as the corresponding vector of x values, we can write y = Φ(x) c as our system, where
Φ(x) = [φ_1(x) φ_2(x) ... φ_k(x)]
is the matrix whose columns φ_1(x), ..., φ_k(x) are the "data vectors" of each of the functions with which we are approximating the data. Indeed, we can write the curve fitting problem in the form: Find c_1, ..., c_k which minimizes the sum of the squares of the errors in approximating y ≈ c_1 φ_1(x) + ... + c_k φ_k(x), so that we are approximating the data vector y as a combination of the data vectors of the functions φ_1(x), ..., φ_k(x). We can then obtain the solution of this least squares problem using the normal equations.

Least squares function approximation (optional):

Imagine now that our data are the points on an entire curve, corresponding to the points (x, f(x)) for all x in some interval [a, b]. Once again we wish to approximate f(x) ≈ c_1 φ_1(x) + ... + c_k φ_k(x) for all values of x on the interval. But how do we measure the error over the whole interval? Instead of expressing the size of the error (or of a vector in general) in terms of the sum of the squares of the components of the vector, in the case of functions we take the integral of the square of the function over the interval concerned; that is, we define
‖f‖² = ∫_a^b f(x)² dx.
What "dot product" would this norm come from? If we define the dot product of two functions with domain [a, b] as
f·g = ∫_a^b f(x) g(x) dx,
then ‖f‖² = f·f. Note that this dot product has exactly the same general properties as the dot product for vectors, as we previously enumerated them. This gives rise, using exactly the same proof, to the orthogonal projection theorem for least squares function approximation: The smallest value of ‖e(x)‖ = ‖f(x) − p(x)‖, where p(x) = c_1 φ_1(x) + ... + c_k φ_k(x), is obtained when e(x) is orthogonal to each of the functions φ_1(x), ..., φ_k(x) on the interval [a, b]. That is, the optimal values of c_1, ..., c_k are given by the solution of the system of linear equations:
φ_1·f = (φ_1·φ_1) c_1 + (φ_1·φ_2) c_2 + ... + (φ_1·φ_k) c_k
...
φ_k·f = (φ_k·φ_1) c_1 + (φ_k·φ_2) c_2 + ... + (φ_k·φ_k) c_k
In general, least squares approximation by polynomials gives a much better overall "fit" than interpolation once the degree of the polynomials begins to grow.
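The discrete data-fitting procedure can be sketched as follows (NumPy; the data values and the basis 1, x are invented for illustration). The normal equations are solved directly and the result checked against NumPy's built-in least squares solver:

```python
import numpy as np

# Sample data (invented for illustration); y is roughly linear in x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Basis functions phi_1(x) = 1, phi_2(x) = x, so f(x) = c1 + c2 x.
# The matrix Phi has the data vectors of the basis functions as columns.
Phi = np.column_stack([np.ones_like(x), x])

# Normal equations: Phi^t Phi c = Phi^t y
c = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Residual is orthogonal to each column of Phi (orthogonal projection theorem)
r = y - Phi @ c
assert np.allclose(Phi.T @ r, 0.0)

# Same coefficients as NumPy's least squares solver
c_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
assert np.allclose(c, c_lstsq)
```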
In a sense, we are minimizing the "average" squared error over all values of x, as opposed to interpolation, which forces zero error at a discrete set of points while not caring at all about the error at other values of x. In fact, the least squares polynomial approximations of a given function actually converge to the function as the degree of the polynomial approaches infinity, requiring only that the function satisfy a mild smoothness condition (a continuous derivative, or even just a piecewise continuous derivative, is sufficient).

In practice, one cannot exactly compute the integrals (i.e. the dot products) involved in the normal equations for least squares function approximation; one can resort to approximating the integrals involved or, for a quick and easy substitute, one can simply perform a vector least squares approximation by sampling the function at many equally spaced points on the interval [a, b]. If the points are not equally spaced, then we are approximating a slightly different and more general type of least squares function approximation called "weighted least squares", in which the errors in different parts of the interval can be given different emphasis or "weight". In this case we are "really" looking at a norm given by
‖f‖² = ∫_a^b f(x)² w(x) dx,
where the "weighting function" w(x) satisfies w(x) ≥ 0. Such problems also arise naturally in probability theory when we try to minimize the "average" or "expected" squared error when different values of x are given different probabilities of occurring.

There are many other "systems" of functions besides polynomials that can be used for least squares function approximation. One very important system (in some respects even more important) is the so-called trigonometric polynomials on the interval [−π, π], given by 1, cos x, sin x, cos 2x, sin 2x, cos 3x, sin 3x, .... These are particularly suitable for approximating 2π-periodic functions and have the especially useful/important/fundamental property of orthogonality:
φ_i·φ_j = ∫_{−π}^{π} φ_i(x) φ_j(x) dx = 0 whenever i and j are different.
In this case the normal equations reduce very simply to:
φ_i·f = ∫_{−π}^{π} φ_i(x) f(x) dx = (∫_{−π}^{π} φ_i(x)² dx) c_i,
and so the coefficients are immediately determined, and are in fact independent of which other functions are being used in the approximation.
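The sampling substitute described above can be sketched as follows (NumPy; the function e^x, the interval [0, 1], and the quadratic basis are chosen purely for illustration). For comparison, the sampled least squares fit is measured against an interpolating parabola through three points:

```python
import numpy as np

# Quick substitute for continuous least squares: sample f at many
# equally spaced points on [a, b], then do an ordinary vector fit.
f = np.exp
a, b = 0.0, 1.0
xs = np.linspace(a, b, 1001)

# Polynomial basis 1, x, x^2
Phi = np.column_stack([np.ones_like(xs), xs, xs**2])
c, *_ = np.linalg.lstsq(Phi, f(xs), rcond=None)

# The sampled least squares fit has a small maximum error on the interval...
max_err = np.max(np.abs(Phi @ c - f(xs)))
assert max_err < 0.05

# ...and a smaller root-mean-square error than the parabola that
# interpolates f at the endpoints and the midpoint
nodes = np.array([a, (a + b) / 2, b])
interp_c = np.polyfit(nodes, f(nodes), 2)
rms = lambda v: np.sqrt(np.mean(v**2))
assert rms(Phi @ c - f(xs)) < rms(np.polyval(interp_c, xs) - f(xs))
```

The last assertion is guaranteed by construction: the least squares fit minimizes the RMS error over the sample points among all quadratics, and the interpolating parabola is one such quadratic.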
In the case of the trigonometric polynomials, the resulting coefficients are the so-called Fourier coefficients, and the resulting least squares approximations are the partial sums of the Fourier series of the function f(x) on [−π, π].
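As an illustration (NumPy, with the integrals approximated by a simple Riemann sum; the choice f(x) = x is for illustration only), the Fourier coefficients can be computed one at a time thanks to orthogonality, and compared with the known Fourier sine coefficients of x on [−π, π], namely 2(−1)^(n+1)/n:

```python
import numpy as np

def dot(g, h, x):
    """Approximate the integral of g(x) h(x) over the grid x (Riemann sum)."""
    dx = x[1] - x[0]
    return np.sum(g * h) * dx

x = np.linspace(-np.pi, np.pi, 200001)
f = x  # the function being approximated

# Each coefficient c_n = (phi_n·f)/(phi_n·phi_n), independently of the others
for n in range(1, 5):
    phi = np.sin(n * x)
    c_n = dot(phi, f, x) / dot(phi, phi, x)
    # Known Fourier sine coefficients of x on [-pi, pi]: 2(-1)^(n+1)/n
    assert np.isclose(c_n, 2 * (-1) ** (n + 1) / n, atol=1e-3)

# The constant term vanishes since f is odd on a symmetric interval
phi0 = np.ones_like(x)
assert abs(dot(phi0, f, x) / dot(phi0, phi0, x)) < 1e-6
```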