Fitting Linear Statistical Models to Data by Least Squares I: Introduction


Fitting Linear Statistical Models to Data by Least Squares I: Introduction

Brian R. Hunt and C. David Levermore
University of Maryland, College Park

Math 420: Mathematical Modeling
January 25, 2012 version

Outline of Three Lectures

1) Introduction to Linear Statistical Models
2) Linear Euclidean Least Squares Fitting
3) Linear Weighted Least Squares Fitting
4) Least Squares Fitting for Univariate Polynomial Models
5) Least Squares Fitting with Orthogonalization
6) Multivariate Linear Least Squares Fitting
7) General Multivariate Linear Least Squares Fitting

1. Introduction to Linear Statistical Models

In modeling one is often faced with the problem of fitting data with some analytic expression. Let us suppose that we are studying a phenomenon that evolves over time. Suppose we are given a set of $n$ times $\{t_j\}_{j=1}^n$ such that at each time $t_j$ we take a measurement $y_j$ of the phenomenon. We can represent this data as the set of ordered pairs $\{(t_j, y_j)\}_{j=1}^n$. Each $y_j$ might be a single number or a vector of numbers. For simplicity, we will first treat the univariate case, when it is a single number. The more complicated multivariate case, when it is a vector, will be treated later. The basic problem we will examine is the following. How can you use this data set to make a reasonable guess about what a measurement of this phenomenon might yield at any other time?

Of course, you can always find functions $f(t)$ such that $y_j = f(t_j)$ for every $j = 1, \dots, n$. For example, you can use Lagrange interpolation to construct the unique polynomial of degree at most $n-1$ that does this. However, such a polynomial often exhibits wild oscillations that make it a useless fit. This phenomenon is called overfitting. There are two reasons why such difficulties arise.

- The times $t_j$ and measurements $y_j$ are subject to error, so finding a function that fits the data exactly is not a good strategy.
- The assumed form of $f(t)$ might be ill suited to matching the behavior of the phenomenon over the time interval being considered.
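For concreteness, here is a minimal NumPy sketch (with made-up noisy data, not from the lecture) contrasting the degree $n-1$ interpolating polynomial with a low-degree least squares fit:

```python
import numpy as np

# Made-up noisy measurements of a smooth trend (illustrative only).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 9)                 # n = 9 sample times
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(t.size)

# Interpolating polynomial of degree n - 1: it reproduces the data exactly,
# but it may oscillate between the sample times (overfitting).
interp = np.polynomial.Polynomial.fit(t, y, deg=t.size - 1)

# A low-degree least squares fit of the same data for comparison.
low = np.polynomial.Polynomial.fit(t, y, deg=3)

tt = np.linspace(0.0, 1.0, 201)
print("max |interpolant| between samples:", np.abs(interp(tt)).max())
print("max |degree-3 fit| between samples:", np.abs(low(tt)).max())
```

On data like this the interpolant typically shows much larger excursions between the sample times than the low-degree fit does.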

One strategy to help avoid these difficulties is to draw $f(t)$ from a family of suitable functions, which is called a model in statistics. If we denote this model by $f(t; \beta_1, \dots, \beta_m)$ where $m \ll n$, then the idea is to find values of $\beta_1, \dots, \beta_m$ such that the graph of $f(t; \beta_1, \dots, \beta_m)$ best fits the data. More precisely, we will define the residuals $r_j(\beta_1, \dots, \beta_m)$ by the relation
$$y_j = f(t_j; \beta_1, \dots, \beta_m) + r_j(\beta_1, \dots, \beta_m), \quad \text{for every } j = 1, \dots, n,$$
and try to minimize the $r_j(\beta_1, \dots, \beta_m)$ in some sense. The problem can be simplified by restricting ourselves to models in which the parameters appear linearly, the so-called linear models. Such a model is specified by the choice of a basis $\{f_i(t)\}_{i=1}^m$ and takes the form
$$f(t; \beta_1, \dots, \beta_m) = \sum_{i=1}^m \beta_i f_i(t).$$

Example. The most classic linear model is the family of all polynomials of degree less than $m$. This family is often expressed as
$$f(t; \beta_0, \dots, \beta_{m-1}) = \sum_{i=0}^{m-1} \beta_i t^i.$$
Notice that here the index $i$ runs from $0$ to $m-1$ rather than from $1$ to $m$. This indexing convention is used for polynomial models because it matches the degree of each term in the sum.

Example. Another classic linear model is the family of all trigonometric polynomials of degree at most $l$ for some period $T$. This family can be expressed as
$$f(t; \alpha_0, \dots, \alpha_l, \beta_1, \dots, \beta_l) = \alpha_0 + \sum_{k=1}^{l} \bigl( \alpha_k \cos(k\omega t) + \beta_k \sin(k\omega t) \bigr),$$
where $\omega = 2\pi/T$ is its fundamental frequency. Notice that here $m = 2l + 1$.
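As a small illustration, these two model families can be evaluated as follows; the helper names `poly_model` and `trig_model` and the coefficients are hypothetical:

```python
import numpy as np

def poly_model(t, beta):
    """Evaluate f(t; beta_0, ..., beta_{m-1}) = sum_i beta_i * t**i."""
    t = np.asarray(t, dtype=float)
    return sum(b * t**i for i, b in enumerate(beta))

def trig_model(t, alpha, beta, T):
    """Evaluate a trigonometric polynomial of degree l with period T.

    alpha has length l + 1 (alpha_0, ..., alpha_l) and beta has length l,
    so the model has m = 2*l + 1 parameters.
    """
    t = np.asarray(t, dtype=float)
    omega = 2 * np.pi / T            # fundamental frequency
    out = alpha[0] * np.ones_like(t)
    for k in range(1, len(alpha)):
        out += alpha[k] * np.cos(k * omega * t) + beta[k - 1] * np.sin(k * omega * t)
    return out

# Illustrative coefficients (not from the lecture).
t = np.linspace(0.0, 2.0, 5)
print(poly_model(t, [1.0, -2.0, 0.5]))           # 1 - 2t + 0.5 t^2
print(trig_model(t, [0.5, 1.0], [0.25], T=2.0))  # degree l = 1, so m = 3
```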

It is just as easy to work in the more general setting in which we are given data $\{(x_j, y_j)\}_{j=1}^n$, where the $x_j$ lie within a bounded domain $X \subset \mathbb{R}^p$ and the $y_j$ lie in $\mathbb{R}$. The problem we will examine now becomes the following. How can you use this data set to make a reasonable guess about the value of $y$ when $x$ takes a value in $X$ that is not represented in the data set? We call $x$ the independent variable and $y$ the dependent variable. We will consider a linear statistical model with $m$ real parameters in the form
$$f(x; \beta_1, \dots, \beta_m) = \sum_{i=1}^m \beta_i f_i(x),$$
where each basis function $f_i(x)$ is defined over $X$ and takes values in $\mathbb{R}$.

Example. A classic linear model in this setting is the family of all affine functions. If $x_i$ denotes the $i^{\text{th}}$ entry of $x$ then this family can be written as
$$f(x; \beta_0, \dots, \beta_p) = \beta_0 + \sum_{i=1}^p \beta_i x_i.$$
Alternatively, it can be expressed in vector notation as $f(x; a, b) = a + b^T x$, where $a \in \mathbb{R}$ and $b \in \mathbb{R}^p$. Notice that here $m = p + 1$.

Remark. When the independent variable $x$ has dimension $p > 1$, the dimension $m$ of an associated general linear model grows rapidly as the model complexity increases. For example, the family of polynomials of degree at most $d$ over $\mathbb{R}^p$ has dimension
$$m = \frac{(p + d)!}{p!\, d!} = \frac{(p + 1)(p + 2) \cdots (p + d)}{d!}.$$
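A quick sketch of how this dimension grows (the helper name `poly_model_dimension` is hypothetical):

```python
from math import comb

def poly_model_dimension(p, d):
    """Number of monomials of degree at most d in p variables: (p + d)! / (p! d!)."""
    return comb(p + d, d)

# The dimension grows rapidly with the degree d and the number of variables p.
for p in (1, 2, 5, 10):
    print(p, [poly_model_dimension(p, d) for d in range(1, 6)])
```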

Given a linear model $f(x; \beta_1, \dots, \beta_m)$, we fit it to the data by minimizing the residuals $r_j(\beta_1, \dots, \beta_m)$ that are defined by the relation
$$y_j = f(x_j; \beta_1, \dots, \beta_m) + r_j(\beta_1, \dots, \beta_m).$$
This can be recast as a linear algebra problem. We introduce the $m$-vector $\beta$, the $n$-vectors $y$ and $r$, and the $n \times m$-matrix $F$ by
$$\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_m \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad r = \begin{pmatrix} r_1 \\ \vdots \\ r_n \end{pmatrix}, \qquad F = \begin{pmatrix} f_1(x_1) & \cdots & f_m(x_1) \\ \vdots & & \vdots \\ f_1(x_n) & \cdots & f_m(x_n) \end{pmatrix}.$$
We will assume the matrix $F$ has rank $m$. The fitting problem then becomes the problem of finding a value of $\beta$ that minimizes the size of the $n$-vector $r(\beta) = y - F\beta$. But what does size mean?
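As a sketch, with a hypothetical basis $\{1, x, x^2\}$ and made-up data, the matrix $F$ and the residual vector can be assembled like this:

```python
import numpy as np

def design_matrix(basis, xs):
    """Build the n-by-m matrix F with F[j, i] = f_i(x_j)."""
    return np.column_stack([np.array([f(x) for x in xs]) for f in basis])

def residual(beta, F, y):
    """Residual vector r(beta) = y - F beta."""
    return y - F @ beta

# Hypothetical basis {1, x, x^2} and data (illustrative only).
basis = [lambda x: 1.0, lambda x: x, lambda x: x**2]
xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.3, 2.1, 3.2, 4.9])

F = design_matrix(basis, xs)
print(F.shape)                       # (n, m) = (5, 3)
print(residual(np.zeros(3), F, y))   # with beta = 0, the residual is y itself
```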

2. Linear Euclidean Least Squares Fitting

One popular notion of the size of a vector is the Euclidean norm, which is
$$\|r(\beta)\| = \sqrt{r(\beta)^T r(\beta)} = \sqrt{\sum_{j=1}^n r_j(\beta_1, \dots, \beta_m)^2}.$$
Minimizing $\|r(\beta)\|$ is equivalent to minimizing $\|r(\beta)\|^2$, which is the sum of the squares of the residuals. For linear models $\|r(\beta)\|^2$ is a quadratic function of $\beta$ that is easy to minimize, which is why the method is popular. Specifically, because $r(\beta) = y - F\beta$, we minimize
$$q(\beta) = \tfrac{1}{2} \|r(\beta)\|^2 = \tfrac{1}{2} r(\beta)^T r(\beta) = \tfrac{1}{2} (y - F\beta)^T (y - F\beta) = \tfrac{1}{2} y^T y - \beta^T F^T y + \tfrac{1}{2} \beta^T F^T F \beta.$$
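A short numerical check (with random illustrative $F$ and $y$) that the expanded quadratic agrees with $\tfrac{1}{2}\|y - F\beta\|^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3
F = rng.standard_normal((n, m))      # illustrative design matrix
y = rng.standard_normal(n)
beta = rng.standard_normal(m)

# q(beta) computed directly from the residual...
q_direct = 0.5 * np.sum((y - F @ beta) ** 2)

# ...and from the expansion (1/2) y^T y - beta^T F^T y + (1/2) beta^T F^T F beta.
q_expanded = 0.5 * y @ y - beta @ (F.T @ y) + 0.5 * beta @ (F.T @ F @ beta)

print(np.isclose(q_direct, q_expanded))   # True: the two expressions agree
```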

Let us recall some multivariable calculus. The gradient (if it exists) of a real-valued function $q(\beta)$ with respect to the $m$-vector $\beta$ is the $m$-vector $\nabla_\beta q(\beta)$ such that
$$\left. \frac{d}{ds} q(\beta + s\gamma) \right|_{s=0} = \gamma^T \nabla_\beta q(\beta) \quad \text{for every } \gamma \in \mathbb{R}^m.$$
In particular, for the quadratic $q(\beta)$ arising from our least squares problem you can easily check that
$$q(\beta + s\gamma) = q(\beta) + s\, \gamma^T \bigl( F^T F \beta - F^T y \bigr) + \tfrac{1}{2} s^2\, \gamma^T F^T F \gamma.$$
By differentiating this with respect to $s$ and setting $s = 0$ we obtain
$$\left. \frac{d}{ds} q(\beta + s\gamma) \right|_{s=0} = \gamma^T \bigl( F^T F \beta - F^T y \bigr),$$
from which we read off that $\nabla_\beta q(\beta) = F^T F \beta - F^T y$.
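A finite-difference check of this gradient formula, on random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 3
F = rng.standard_normal((n, m))
y = rng.standard_normal(n)
beta = rng.standard_normal(m)
gamma = rng.standard_normal(m)

q = lambda b: 0.5 * np.sum((y - F @ b) ** 2)
grad = F.T @ F @ beta - F.T @ y      # the gradient read off above

# Directional derivative d/ds q(beta + s gamma) at s = 0, by a centered difference.
s = 1e-6
fd = (q(beta + s * gamma) - q(beta - s * gamma)) / (2 * s)

print(np.isclose(fd, gamma @ grad))  # True up to finite-difference error
```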

Similarly, the derivative (if it exists) of the vector-valued function $\nabla_\beta q(\beta)$ with respect to the $m$-vector $\beta$ is the $m \times m$-matrix $\nabla_{\beta\beta} q(\beta)$ such that
$$\left. \frac{d}{ds} \nabla_\beta q(\beta + s\gamma) \right|_{s=0} = \nabla_{\beta\beta} q(\beta)\, \gamma \quad \text{for every } \gamma \in \mathbb{R}^m.$$
The symmetric matrix-valued function $\nabla_{\beta\beta} q(\beta)$ is sometimes called the Hessian of $q(\beta)$. For the quadratic $q(\beta)$ arising from our least squares problem you can easily check that
$$\left. \frac{d}{ds} \nabla_\beta q(\beta + s\gamma) \right|_{s=0} = \left. \frac{d}{ds} \bigl( \nabla_\beta q(\beta) + s\, F^T F \gamma \bigr) \right|_{s=0} = F^T F \gamma,$$
from which we read off that $\nabla_{\beta\beta} q(\beta) = F^T F$. Because $F$ has rank $m$, the $m \times m$-matrix $F^T F$ is positive definite.
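A quick numerical confirmation (random illustrative $F$) that $F^T F$ is symmetric positive definite when $F$ has full column rank:

```python
import numpy as np

rng = np.random.default_rng(3)
F = rng.standard_normal((10, 4))     # a random F has full column rank with probability 1

H = F.T @ F                          # the Hessian of q(beta)

# A symmetric matrix is positive definite iff it admits a Cholesky factorization;
# np.linalg.cholesky raises LinAlgError otherwise.
L = np.linalg.cholesky(H)
print(np.allclose(L @ L.T, H))       # True

# Equivalently, all eigenvalues of F^T F are positive when rank(F) = m.
print(np.linalg.eigvalsh(H).min() > 0)
```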

Because $\nabla_{\beta\beta} q(\beta)$ is positive definite, the function $q(\beta)$ is strictly convex, whereby it has at most one global minimizer. We find this minimizer by setting the gradient of $q(\beta)$ equal to zero, yielding
$$\nabla_\beta q(\beta) = F^T F \beta - F^T y = 0.$$
Because the matrix $F^T F$ is positive definite, it is invertible. The solution of the above equation is therefore $\beta = \hat\beta$ where
$$\hat\beta = (F^T F)^{-1} F^T y.$$
The fact that $\hat\beta$ is a global minimizer can be seen from the fact that $F^T F$ is positive definite and the identity
$$q(\beta) = \tfrac{1}{2} y^T y - \tfrac{1}{2} \hat\beta^T F^T F \hat\beta + \tfrac{1}{2} (\beta - \hat\beta)^T F^T F (\beta - \hat\beta) = q(\hat\beta) + \tfrac{1}{2} (\beta - \hat\beta)^T F^T F (\beta - \hat\beta).$$
In particular, this shows that $q(\beta) \geq q(\hat\beta)$ for every $\beta \in \mathbb{R}^m$ and that $q(\beta) = q(\hat\beta)$ if and only if $\beta = \hat\beta$.
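In practice $\hat\beta$ is computed by solving the linear system $F^T F \beta = F^T y$ rather than by forming the inverse; a minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 20, 3
F = rng.standard_normal((n, m))
y = rng.standard_normal(n)

# Minimizer from the normal equations F^T F beta = F^T y.
beta_hat = np.linalg.solve(F.T @ F, F.T @ y)

# Cross-check against NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(F, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))   # True

# q(beta_hat) is no larger than q at any perturbed beta.
q = lambda b: 0.5 * np.sum((y - F @ b) ** 2)
print(q(beta_hat) <= q(beta_hat + 0.1 * rng.standard_normal(m)))  # True
```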

Remark. The least squares fit has a beautiful geometric interpretation with respect to the associated Euclidean inner product $(p \mid q) = p^T q$. Define $\hat r = y - F \hat\beta$. Observe that
$$y = F \hat\beta + \hat r = F (F^T F)^{-1} F^T y + \hat r.$$
The matrix $P = F (F^T F)^{-1} F^T$ has the properties
$$P^2 = P, \qquad P^T = P.$$
This means that $Py$ is the orthogonal projection of $y$ associated with the Euclidean inner product onto the subspace of $\mathbb{R}^n$ spanned by the columns of $F$. Then the residual $\hat r$ will be orthogonal to $Py$ with respect to this inner product. This means that the residual will have mean zero for any data if and only if the constant function 1 is in the span of the basis functions for the model.
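A numerical check of these projection properties, on random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 12, 3
F = rng.standard_normal((n, m))
y = rng.standard_normal(n)

P = F @ np.linalg.solve(F.T @ F, F.T)    # P = F (F^T F)^{-1} F^T
beta_hat = np.linalg.solve(F.T @ F, F.T @ y)
r_hat = y - F @ beta_hat

print(np.allclose(P @ P, P))             # P^2 = P
print(np.allclose(P.T, P))               # P^T = P
print(np.allclose(F.T @ r_hat, 0))       # residual orthogonal to the columns of F
print(np.allclose(P @ y, F @ beta_hat))  # Py is the fitted vector F beta_hat
```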

Example. If we want to find the Euclidean least squares fit of the affine model $f(t; \alpha, \beta) = \alpha + \beta t$ to the data $\{(t_j, y_j)\}_{j=1}^n$ then the matrix $F$ has the form
$$F = \begin{pmatrix} e & t \end{pmatrix}, \quad \text{where } e = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}, \quad t = \begin{pmatrix} t_1 \\ \vdots \\ t_n \end{pmatrix}.$$
If we define
$$\bar t = \frac{1}{n} \sum_{j=1}^n t_j, \qquad \overline{t^2} = \frac{1}{n} \sum_{j=1}^n t_j^2, \qquad \sigma_t^2 = \frac{1}{n} \sum_{j=1}^n (t_j - \bar t)^2,$$
then we see that
$$F^T F = n \begin{pmatrix} 1 & \bar t \\ \bar t & \overline{t^2} \end{pmatrix}, \qquad \det\bigl( F^T F \bigr) = n^2 \bigl( \overline{t^2} - \bar t^2 \bigr) = n^2 \sigma_t^2 > 0.$$
Notice that $\bar t$ and $\sigma_t^2$ are the sample mean and variance of $t$ respectively.

Then the $\hat\alpha$ and $\hat\beta$ that give the least squares fit are given by
$$\begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix} = (F^T F)^{-1} F^T y = \frac{1}{n \sigma_t^2} \begin{pmatrix} \overline{t^2} & -\bar t \\ -\bar t & 1 \end{pmatrix} \begin{pmatrix} e^T y \\ t^T y \end{pmatrix} = \frac{1}{\sigma_t^2} \begin{pmatrix} \overline{t^2} & -\bar t \\ -\bar t & 1 \end{pmatrix} \begin{pmatrix} \bar y \\ \overline{yt} \end{pmatrix} = \frac{1}{\sigma_t^2} \begin{pmatrix} \overline{t^2}\, \bar y - \bar t\, \overline{yt} \\ \overline{yt} - \bar t\, \bar y \end{pmatrix},$$
where
$$\bar y = \frac{1}{n} e^T y = \frac{1}{n} \sum_{j=1}^n y_j, \qquad \overline{yt} = \frac{1}{n} t^T y = \frac{1}{n} \sum_{j=1}^n y_j t_j.$$
These formulas for $\hat\alpha$ and $\hat\beta$ can be expressed simply as
$$\hat\beta = \frac{\overline{yt} - \bar y\, \bar t}{\sigma_t^2}, \qquad \hat\alpha = \bar y - \hat\beta\, \bar t.$$
Notice that $\hat\beta$ is the ratio of the covariance of $y$ and $t$ to the variance of $t$.

The best fit is therefore
$$\hat f(t) = \hat\alpha + \hat\beta t = \bar y + \hat\beta (t - \bar t) = \bar y + \frac{\overline{yt} - \bar y\, \bar t}{\sigma_t^2} (t - \bar t).$$

Remark. In the above example we inverted the matrix $F^T F$ to obtain $\hat\beta$. This was easy because our model had only two parameters in it, so $F^T F$ was only $2 \times 2$. The number of parameters $m$ does not have to be too large before this approach becomes slow or infeasible. However, for fairly large $m$ you can obtain $\hat\beta$ by using Gaussian elimination or some other direct method to efficiently solve the linear system $F^T F \beta = F^T y$. Such methods work because the matrix $F^T F$ is positive definite. As we will soon see, this step can be simplified by constructing the basis $\{f_i(t)\}_{i=1}^m$ so that $F^T F$ is diagonal.
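A minimal sketch (with made-up data) checking that the sample-statistics formulas for $\hat\alpha$ and $\hat\beta$ agree with solving the normal equations $F^T F \beta = F^T y$:

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0.0, 5.0, 30)                           # illustrative times
y = 1.5 + 0.7 * t + 0.2 * rng.standard_normal(t.size)   # made-up measurements

# Sample statistics appearing in the closed-form solution.
t_bar, y_bar = t.mean(), y.mean()
sigma_t2 = np.mean((t - t_bar) ** 2)
yt_bar = np.mean(y * t)

beta_hat = (yt_bar - y_bar * t_bar) / sigma_t2    # covariance(y, t) / variance(t)
alpha_hat = y_bar - beta_hat * t_bar

# Same answer from solving the normal equations F^T F beta = F^T y directly.
F = np.column_stack([np.ones_like(t), t])
alpha_ne, beta_ne = np.linalg.solve(F.T @ F, F.T @ y)

print(np.allclose([alpha_hat, beta_hat], [alpha_ne, beta_ne]))   # True
```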

Further Questions

We have seen how to use least squares to fit linear statistical models with $m$ parameters to data sets containing $n$ pairs when $m \ll n$. Among the questions that arise are the following.

- How does one pick a basis that is well suited to the given data?
- How can one avoid overfitting?
- Do these methods extend to nonlinear statistical models?
- Can one use other notions of smallness of the residual?