MASTER OF SCIENCE IN ANALYTICS 2014 EMPLOYMENT REPORT

Results at graduation, May 2014

Number of graduates: 79
Number of graduates seeking new employment: 75
Percent with one or more offers of employment by graduation: 100
Percent placed by graduation: 100
Number of employers interviewing: 38
Average number of initial job interviews per student: 3
Percent of all interviews arranged by Institute: 92
Percent of graduates with 2 or more job offers: 90
Percent of graduates with 3 or more job offers: 6
Percent of graduates with 4 or more job offers: 40
Average base salary offer ($): 96,600
Median base salary offer ($): 95,000
Average base salary offer, candidates with job experience ($): 100,600
Range of base salary offers, candidates with job experience ($): 80,000-135,000
Percent of graduates with prior professional work experience: 50
Average base salary offer, candidates without experience ($): 89,000
Range of base salary offers, candidates without experience ($): 75,000-110,000
Percent of graduates receiving a signing bonus: 65
Average amount of signing bonus ($): 2,200
Percent remaining in NC: 59
Percent of graduates sharing salary data: 95
Number of reported job offers: 246
Percent of reported job offers based in U.S.: 100

2014 North Carolina State University, 920 Main Campus Drive, Suite 530, Raleigh, NC 27606, http://analytics.ncsu.edu

LINEAR ALGEBRA
Author: Shaina Race

CONTENTS

1 The Basics
  1.1 Conventional Notation
    1.1.1 Matrix Partitions
    1.1.2 Special Matrices and Vectors
    1.1.3 n-space
  1.2 Vector Addition and Scalar Multiplication
  1.3 Exercises
2 Norms, Inner Products and Orthogonality
  2.1 Norms and Distances
  2.2 Inner Products
    2.2.1 Covariance
    2.2.2 Mahalanobis Distance
    2.2.3 Angular Distance
    2.2.4 Correlation
  2.3 Orthogonality
  2.4 Outer Products
3 Linear Combinations and Linear Independence
  3.1 Linear Combinations
  3.2 Linear Independence
    3.2.1 Determining Linear Independence
  3.3 Span of Vectors
4 Basis and Change of Basis
5 Least Squares

6 Eigenvalues and Eigenvectors
  6.1 Diagonalization
  6.2 Geometric Interpretation of Eigenvalues and Eigenvectors
7 Principal Components Analysis
  7.1 Comparison with Least Squares
  7.2 Covariance or Correlation Matrix?
  7.3 Applications of Principal Components
    7.3.1 PCA for dimension reduction
8 Singular Value Decomposition (SVD)
  8.1 Resolving a Matrix into Components
    8.1.1 Data Compression
    8.1.2 Noise Reduction
    8.1.3 Latent Semantic Indexing
9 Advanced Regression Techniques
  9.1 Biased Regression
    9.1.1 Principal Components Regression (PCR)
    9.1.2 Ridge Regression

CHAPTER 1
THE BASICS

1.1 Conventional Notation

Linear algebra has some conventional ways of representing certain types of numerical objects. Throughout this course, we will stick to the following basic conventions:

- Bold, uppercase letters like A, X, and U will be used to refer to matrices. Occasionally, the size of the matrix will be specified by subscripts, like $A_{m\times n}$, which means that A is a matrix with m rows and n columns.
- Bold, lowercase letters like x and y will be used to reference vectors. Unless otherwise specified, these vectors will be thought of as columns, with $x^T$ and $y^T$ referring to the row equivalents.
- The individual elements of a vector or matrix will often be referred to with subscripts, so that $A_{ij}$ (or sometimes $a_{ij}$) denotes the element in the i-th row and j-th column of the matrix A. Similarly, $x_k$ denotes the k-th element of the vector x. These references to individual elements are not generally bolded because they refer to scalar quantities.
- Scalar quantities are written as unbolded Greek letters like α, δ, and λ.
- The trace of a square matrix $A_{n\times n}$, denoted Tr(A) or Trace(A), is the sum of the diagonal elements of A,
$$\mathrm{Tr}(A) = \sum_{i=1}^{n} A_{ii}.$$
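As a concrete illustration of these conventions, here is a small NumPy sketch (Python/NumPy is assumed purely for illustration throughout these sketches; the notes themselves are software-agnostic, and all values below are made up):

```python
import numpy as np

# A 2x3 matrix A (m = 2 rows, n = 3 columns) and a vector x
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
x = np.array([7., 8., 9.])

print(A.shape)      # (2, 3)
print(A[0, 2])      # the element A_{13} = 3 (NumPy indexes from 0, the notes index from 1)
print(x[1])         # x_2 = 8

# The trace is defined for square matrices: the sum of the diagonal elements
S = np.array([[1., 2.],
              [3., 4.]])
print(np.trace(S))  # 1 + 4 = 5
```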

Beyond these basic conventions, there are other common notational tricks that we will become familiar with. The first of these is writing a partitioned matrix.

1.1.1 Matrix Partitions

We will often want to consider a matrix as a collection of either rows or columns rather than individual elements. As we will see in the next chapter, when we partition matrices in this form, we can view their multiplication in simplified form. This often leads us to a new view of the data which can be helpful for interpretation. When we write
$$A = (A_1\ A_2\ \dots\ A_n)$$
we are viewing the matrix A as a collection of column vectors $A_i$. Similarly, we can write A as a collection of row vectors:
$$A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{pmatrix}.$$
Sometimes, we will want to refer to both rows and columns in the same context. The above notation is not sufficient for this, as we would have $A_j$ referring to either a column or a row. In these situations, we may use $A_{\cdot j}$ to reference the j-th column and $A_{i\cdot}$ to reference the i-th row:
$$A = (A_{\cdot 1}\ A_{\cdot 2}\ \dots\ A_{\cdot n}) = \begin{pmatrix} A_{1\cdot} \\ \vdots \\ A_{i\cdot} \\ \vdots \\ A_{m\cdot} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ \vdots & & & \vdots \\ a_{i1} & \dots & a_{ij} & a_{in} \\ \vdots & & & \vdots \\ a_{m1} & \dots & \dots & a_{mn} \end{pmatrix}.$$

1.1.2 Special Matrices and Vectors

The bold capital letter I is used to denote the identity matrix. Sometimes this matrix has a single subscript to specify the size of the matrix. More often, the size of the identity is implied by the matrix equation in which it appears.
$$I_4 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
The bold lowercase $e_j$ is used to refer to the j-th column of I. It is simply a vector of zeros with a one in the j-th position. We do not often specify the size of the vector $e_j$; the number of elements is generally assumed from the context of the problem.
$$e_j = (0, \dots, 0, \underset{j\text{-th position}}{1}, 0, \dots, 0)^T$$
The vector e with no subscript refers to a vector of all ones:
$$e = (1, 1, \dots, 1)^T.$$
A diagonal matrix is a matrix for which the off-diagonal elements, $A_{ij}$ with $i \neq j$, are zero. For example:
$$D = \begin{pmatrix} \sigma_1 & 0 & 0 & 0 \\ 0 & \sigma_2 & 0 & 0 \\ 0 & 0 & \sigma_3 & 0 \\ 0 & 0 & 0 & \sigma_4 \end{pmatrix}$$
Since the off-diagonal elements are 0, we need only define the diagonal elements for such a matrix. Thus, we will frequently write
$$D = \mathrm{diag}\{\sigma_1, \sigma_2, \sigma_3, \sigma_4\}$$
or simply $D_{ii} = \sigma_i$.
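These special matrices are easy to build numerically; a brief NumPy sketch (the diagonal entries below are invented for illustration):

```python
import numpy as np

I4 = np.eye(4)           # the identity matrix I_4
e2 = np.eye(4)[:, 1]     # e_2: zeros with a one in the 2nd position
e  = np.ones(4)          # e: the vector of all ones

sigma = np.array([2.0, 0.5, 1.0, 3.0])   # hypothetical diagonal values
D = np.diag(sigma)       # D = diag{sigma_1, ..., sigma_4}

print(I4 @ e)            # multiplying by I changes nothing: [1. 1. 1. 1.]
print(D @ e)             # returns the diagonal entries: [2.  0.5 1.  3. ]
```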

1.1.3 n-space

You are already familiar with the concept of "ordered pairs" or coordinates $(x_1, x_2)$ on the two-dimensional plane (in linear algebra, we call this plane "2-space"). Fortunately, we do not live in a two-dimensional world! Our data will more often consist of measurements on a number (let's call that number n) of variables. Thus, our data points belong to what is known as n-space. They are represented by n-tuples, which are nothing more than ordered lists of numbers: $(x_1, x_2, x_3, \dots, x_n)$. An n-tuple defines a vector with the same n elements, and so these two concepts should be thought of interchangeably. The only difference is that the vector has a direction, away from the origin and toward the n-tuple.

You will recall that the symbol R is used to denote the set of real numbers. R is simply 1-space. It is a set of vectors with a single element. In this sense any real number, x, has a direction: if it is positive, it is to one side of the origin; if it is negative, it is to the opposite side. That number, x, also has a magnitude: |x| is the distance between x and the origin, 0.

n-space (the set of real n-tuples) is denoted $\mathbb{R}^n$. In set notation, the formal mathematical definition is simply
$$\mathbb{R}^n = \{(x_1, x_2, \dots, x_n) : x_i \in \mathbb{R},\ i = 1, \dots, n\}.$$
We will often use this notation to define the size of an arbitrary vector. For example, $x \in \mathbb{R}^p$ simply means that x is a vector with p entries: $x = (x_1, x_2, \dots, x_p)$.

Many (all, really) of the concepts we have previously considered in 2- or 3-space extend naturally to n-space, and a few new concepts become useful as well. One very important concept is that of a norm or distance metric, as we will see in Chapter 2. Before discussing norms, let's revisit the basics of vector addition and scalar multiplication.

1.2 Vector Addition and Scalar Multiplication

You've already learned how vector addition works algebraically: it occurs element-wise between two vectors of the same length:
$$a + b = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{pmatrix} = \begin{pmatrix} a_1 + b_1 \\ a_2 + b_2 \\ a_3 + b_3 \\ \vdots \\ a_n + b_n \end{pmatrix}$$

Geometrically, vector addition is witnessed by placing the two vectors, a and b, tail-to-head. The result, a + b, is the vector from the open tail to the open head. This is called the parallelogram law and is demonstrated in Figure 1.1a.

[Figure 1.1: Vector addition and subtraction geometrically, tail-to-head. (a) Addition of vectors; (b) Subtraction of vectors. Figure omitted.]

When subtracting vectors as a − b, we simply add −b to a. The vector −b has the same length as b but points in the opposite direction. This vector has the same length as the one which connects the two heads of a and b, as shown in Figure 1.1b.

Example 1.2.1: Vector Subtraction: Centering Data

One thing we will do frequently in this course is consider centered and/or standardized data. To center a group of variables, we merely subtract the mean of each variable from each observation. Geometrically, this amounts to a translation (shift) of the data so that its center (or mean) is at the origin. The following graphic illustrates this process using 4 data points.

.2. Vector Addition and Scalar Multiplication 6 x2 x2 x x x -x x2 x2 x x Scalar multiplication is another operation which acts element-wise: a αa a 2 αa 2 αa = α a 3 = αa 3.. a n αa n Scalar multiplication changes the length of a vector but not the overall direction (although a negative scalar will scale the vector in the opposite direction through the origin). We can see this geometric interpretation of scalar multiplication in Figure.2.

1.3 Exercises

1. For a general matrix $A_{m\times n}$, describe what the following products will provide. Also give the size of the result (i.e. "n×1 vector" or "scalar").
   a. $Ae_j$
   b. $e_i^T A$
   c. $e_i^T A e_j$
   d. $Ae$
   e. $e^T A$
   f. $\frac{1}{n} e^T A$

2. Let $D_{n\times n}$ be a diagonal matrix with diagonal elements $D_{ii}$. What effect does multiplying a matrix $A_{n\times m}$ on the left by D have? What effect does multiplying a matrix $A_{m\times n}$ on the right by D have? If you cannot see this effect in a general sense, try writing out a simple 3×3 matrix as an example first.

3. What is the inverse of a diagonal matrix, $D = \mathrm{diag}\{d_{11}, d_{22}, \dots, d_{nn}\}$?

4. Suppose you have a matrix of data, $A_{n\times p}$, containing n observations on p variables. Suppose the standard deviations of these variables are $\sigma_1, \sigma_2, \dots, \sigma_p$. Give a formula for a matrix that contains the same data but with each variable divided by its standard deviation. Hint: you should use Exercises 2 and 3.

5. Suppose we have a network/graph as shown in Figure 1.3. This particular network has 6 numbered vertices (the circles) and edges which connect the vertices. Each edge has a certain weight (perhaps reflecting some level of association between the vertices), which is given as a number.

[Figure 1.3: An example of a graph or network, with 6 numbered vertices and weighted edges. Figure omitted.]

   a. The adjacency matrix of a graph is defined to be the matrix A such that element $A_{ij}$ reflects the weight of the edge connecting vertex i and vertex j. Write out the adjacency matrix for this graph.
   b. The degree of a vertex is defined as the sum of the weights of the edges connected to that vertex. Create a vector d such that $d_i$ is the degree of node i.
   c. Write d as a matrix-vector product in two different ways using the adjacency matrix, A, and the ones vector e.

CHAPTER 2
NORMS, INNER PRODUCTS AND ORTHOGONALITY

2.1 Norms and Distances

In applied mathematics, norms are functions which measure the magnitude or length of a vector. They are commonly used to determine similarities between observations by measuring the distance between them. As we will see, there are many ways to define distance between two points.

Definition 2.1.1: Vector Norms and Distance Metrics

A norm, or distance metric, is a function that takes a vector as input and returns a scalar quantity ($f : \mathbb{R}^n \to \mathbb{R}$). A vector norm is typically denoted by two vertical bars surrounding the input vector, $\|x\|$, to signify that it is not just any function, but one that satisfies the following criteria:
1. If c is a scalar, then $\|cx\| = |c|\,\|x\|$.
2. The triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$.
3. $\|x\| = 0$ if and only if $x = 0$.
4. $\|x\| \geq 0$ for any vector x.

We will not spend any time on these axioms or on the theoretical aspects of

norms, but we will put a couple of these functions to good use in our studies, the first of which is the Euclidean norm or 2-norm.

Definition 2.1.2: Euclidean Norm, $\|\cdot\|_2$

The Euclidean norm, also known as the 2-norm, simply measures the Euclidean length of a vector (i.e. a point's distance from the origin). Let $x = (x_1, x_2, \dots, x_n)$. Then
$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}.$$
If x is a column vector, then $\|x\|_2 = \sqrt{x^T x}$. Often we will simply write $\|\cdot\|$ rather than $\|\cdot\|_2$ to denote the 2-norm, as it is by far the most commonly used norm.

This is merely the distance formula from undergraduate mathematics, measuring the distance between the point x and the origin. To compute the distance between two different points, say x and y, we'd calculate
$$\|x - y\|_2 = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}.$$

Example 2.1.1: Euclidean Norm and Distance

Suppose I have two vectors in 3-space: $x = (1, 1, 1)$ and $y = (1, 0, 0)$. Then the magnitude of x (i.e. its length or distance from the origin) is
$$\|x\|_2 = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3}$$
and the magnitude of y is
$$\|y\|_2 = \sqrt{1^2 + 0^2 + 0^2} = 1,$$
and the distance between point x and point y is
$$\|x - y\|_2 = \sqrt{(1-1)^2 + (1-0)^2 + (1-0)^2} = \sqrt{2}.$$
The Euclidean norm is crucial to many methods in data analysis, as it measures the closeness of two data points.
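The same computation in NumPy, reproducing Example 2.1.1 (the library call is used here only as an illustration of the formulas above):

```python
import numpy as np

x = np.array([1., 1., 1.])
y = np.array([1., 0., 0.])

print(np.linalg.norm(x))        # sqrt(3) ~ 1.732
print(np.linalg.norm(y))        # 1.0
print(np.linalg.norm(x - y))    # sqrt(2) ~ 1.414
```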

Thus, to turn any vector into a unit vector, a vector with a length of 1, we need only divide each of the entries in the vector by its Euclidean norm. This is a simple form of standardization used in many areas of data analysis. For a unit vector x, $x^T x = 1$.

Perhaps without knowing it, we've already seen many formulas involving the norm of a vector. Examples 2.1.2 and 2.1.3 show how some of the most important concepts in statistics can be represented using vector norms.

Example 2.1.2: Standard Deviation and Variance

Suppose a group of individuals has the following heights, measured in inches: (60, 70, 65, 50, 55). The mean height for this group is 60 inches. The formula for the sample standard deviation is typically given as
$$s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}.$$
We want to subtract the mean from each observation, square the numbers, sum the result, divide by n − 1, and take the square root. If we let $\bar{\mathbf{x}} = \bar{x}e = (60, 60, 60, 60, 60)$ be a vector containing the mean, and $x = (60, 70, 65, 50, 55)$ be the vector of data, then the standard deviation in matrix notation is
$$s = \frac{1}{\sqrt{n-1}}\,\|x - \bar{\mathbf{x}}\|_2 = 7.9.$$
The sample variance of this data is merely the square of the sample standard deviation:
$$s^2 = \frac{1}{n-1}\,\|x - \bar{\mathbf{x}}\|_2^2.$$

Example 2.1.3: Residual Sums of Squares

Another place we've seen a similar calculation is in linear regression. You'll recall that the objective of our regression line is to minimize the sum of squared residuals between the predicted value ŷ and the observed value y:
$$\sum_{i=1}^n (\hat{y}_i - y_i)^2.$$
In vector notation, we'd let y be a vector containing the observed data and ŷ be a vector containing the corresponding predictions and write this summation as
$$\|\hat{y} - y\|_2^2.$$
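Example 2.1.2 can be checked numerically; a short NumPy sketch of the norm-based standard deviation calculation:

```python
import numpy as np

x = np.array([60., 70., 65., 50., 55.])
n = len(x)
xbar = np.full(n, x.mean())               # the vector (60, 60, 60, 60, 60)

s = np.linalg.norm(x - xbar) / np.sqrt(n - 1)
print(s)                                   # ~7.906, as in Example 2.1.2
print(s**2)                                # the sample variance, ~62.5
print(np.isclose(s, x.std(ddof=1)))        # True: agrees with NumPy's sample standard deviation
```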

In fact, in any situation where the phrase "sum of squares" is encountered, the 2-norm is generally implicated.

Example 2.1.4: Coefficient of Determination, R²

Since variance can be expressed using the Euclidean norm, so can the coefficient of determination, or R²:
$$R^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = \frac{\|\hat{y} - \bar{\mathbf{y}}\|_2^2}{\|y - \bar{\mathbf{y}}\|_2^2}.$$

Other useful norms and distances

1-norm, $\|\cdot\|_1$. If $x = (x_1\ x_2\ \dots\ x_n)$ then the 1-norm of x is
$$\|x\|_1 = \sum_{i=1}^n |x_i|.$$
This metric is often referred to as Manhattan distance, city block distance, or taxicab distance because it measures the distance between points along a rectangular grid (as a taxicab must travel on the streets of Manhattan, for example). When x and y are binary vectors, the 1-norm distance between them is called the Hamming distance, and simply measures the number of elements that are different between the two vectors.

[Figure 2.1: The lengths of the red, yellow, and blue paths represent the 1-norm distance between the two points. The green line shows the Euclidean measurement (2-norm). Figure omitted.]
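A small sketch comparing these distances (vectors reused from Example 2.1.1; the binary vectors are made up):

```python
import numpy as np

x = np.array([1., 1., 1.])
y = np.array([1., 0., 0.])

print(np.linalg.norm(x - y, ord=1))   # 1-norm (Manhattan) distance: 2.0
print(np.linalg.norm(x - y, ord=2))   # 2-norm (Euclidean) distance: ~1.414

# For binary vectors, the 1-norm distance counts mismatched positions (Hamming distance)
u = np.array([1, 0, 1, 1, 0])
v = np.array([1, 1, 1, 0, 0])
print(int(np.sum(np.abs(u - v))))     # 2 positions differ
```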

∞-norm, $\|\cdot\|_\infty$. The infinity norm, also called the supremum or max distance, is
$$\|x\|_\infty = \max\{|x_1|, |x_2|, \dots, |x_p|\}.$$

2.2 Inner Products

The inner product of vectors is a notion that you've already seen; it is what's called the dot product in most physics and calculus textbooks.

Definition 2.2.1: Vector Inner Product

The inner product of two n×1 vectors x and y is written $x^T y$ (or sometimes $\langle x, y\rangle$) and is the sum of the products of corresponding elements:
$$x^T y = (x_1\ x_2\ \dots\ x_n)\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = x_1 y_1 + x_2 y_2 + \dots + x_n y_n = \sum_{i=1}^n x_i y_i.$$

When we take the inner product of a vector with itself, we get the square of the 2-norm: $x^T x = \|x\|_2^2$.

Inner products are at the heart of every matrix product. When we multiply two matrices, $X_{m\times n}$ and $Y_{n\times p}$, we can represent the individual elements of the result as inner products of rows of X and columns of Y as follows:
$$XY = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}(Y_1\ Y_2\ \dots\ Y_p) = \begin{pmatrix} X_1 Y_1 & X_1 Y_2 & \dots & X_1 Y_p \\ X_2 Y_1 & X_2 Y_2 & \dots & X_2 Y_p \\ X_3 Y_1 & X_3 Y_2 & \dots & X_3 Y_p \\ \vdots & \vdots & \ddots & \vdots \\ X_m Y_1 & X_m Y_2 & \dots & X_m Y_p \end{pmatrix}.$$
A small numerical illustration of this view of matrix multiplication is given below.
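A quick NumPy check of Definition 2.2.1 and of the inner-product view of matrix multiplication (the matrices are made up):

```python
import numpy as np

x = np.array([1., 2., 3.])
y = np.array([4., 5., 6.])

print(x @ y)                                       # inner product: 1*4 + 2*5 + 3*6 = 32
print(np.allclose(x @ x, np.linalg.norm(x)**2))    # True: x^T x = ||x||_2^2

# Each element of a matrix product is an inner product of a row of X with a column of Y
X = np.array([[1., 2.],
              [3., 4.]])
Y = np.array([[5., 6.],
              [7., 8.]])
print((X @ Y)[0, 1])          # entry in row 1, column 2 of XY
print(X[0, :] @ Y[:, 1])      # the same number: row 1 of X dotted with column 2 of Y
```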

2.2.1 Covariance

Another important statistical measurement that is represented by an inner product is covariance. Covariance is a measure of how much two random variables change together. The statistical formula for covariance is given as
$$\mathrm{Covariance}(x, y) = E[(x - E[x])(y - E[y])] \qquad (2.1)$$
where E[·] is the expected value of the variable. If larger values of one variable correspond to larger values of the other variable, and at the same time smaller values of one correspond to smaller values of the other, then the covariance between the two variables is positive. In the opposite case, if larger values of one variable correspond to smaller values of the other and vice versa, then the covariance is negative. Thus, the sign of the covariance shows the tendency of the linear relationship between the variables; however, the magnitude of the covariance is not easy to interpret. Covariance is a population parameter: it is a property of the joint distribution of the random variables x and y. Definition 2.2.2 provides the mathematical formulation for the sample covariance. This is our best estimate for the population parameter when we have data sampled from a population.

Definition 2.2.2: Sample Covariance

If x and y are n×1 vectors containing n observations for two different variables, then the sample covariance of x and y is given by
$$\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}(x - \bar{\mathbf{x}})^T (y - \bar{\mathbf{y}})$$
where again $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are vectors that contain $\bar{x}$ and $\bar{y}$ repeated n times.

It should be clear from this formulation that cov(x, y) = cov(y, x). When we have p vectors, $v_1, v_2, \dots, v_p$, each containing n observations for p different variables, the sample covariances are most commonly given by the sample covariance matrix, Σ, where $\Sigma_{ij} = \mathrm{cov}(v_i, v_j)$. This matrix is symmetric, since $\Sigma_{ij} = \Sigma_{ji}$. If we create a matrix V whose columns are the vectors $v_1, v_2, \dots, v_p$ once the variables have been centered to have mean 0, then the covariance matrix is given by:
$$\mathrm{cov}(V) = \Sigma = \frac{1}{n-1} V^T V.$$
The j-th diagonal element of this matrix gives the variance of $v_j$ since
$$\Sigma_{jj} = \mathrm{cov}(v_j, v_j) = \frac{1}{n-1}(v_j - \bar{\mathbf{v}}_j)^T (v_j - \bar{\mathbf{v}}_j) = \frac{1}{n-1}\|v_j - \bar{\mathbf{v}}_j\|_2^2 = \mathrm{var}(v_j). \qquad (2.2\text{--}2.4)$$
When two variables are completely uncorrelated, their covariance is zero.

This lack of correlation would be seen in a covariance matrix with a diagonal structure. That is, if $v_1, v_2, \dots, v_p$ are uncorrelated with individual variances $\sigma_1^2, \sigma_2^2, \dots, \sigma_p^2$ respectively, then the corresponding covariance matrix is:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_p^2 \end{pmatrix}.$$
Furthermore, for variables which are independent and identically distributed (take for instance the error terms in a linear regression model, which are assumed to be independent and normally distributed with mean 0 and constant variance σ²), the covariance matrix is a multiple of the identity matrix:
$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{pmatrix} = \sigma^2 I.$$
Transforming our variables in such a way that their covariance matrix becomes diagonal will be our goal in Chapter 7.

Theorem 2.2.1: Properties of Covariance Matrices

The following mathematical properties stem from Equation 2.1. Let $X_{n\times p}$ be a matrix of data containing n observations on p variables. If A is a constant matrix (or vector, in the first case) then
$$\mathrm{cov}(XA) = A^T \mathrm{cov}(X) A \quad\text{and}\quad \mathrm{cov}(X + A) = \mathrm{cov}(X).$$

2.2.2 Mahalanobis Distance

Mahalanobis distance is similar to Euclidean distance, but takes into account the correlation of the variables. This metric is relatively common in data mining applications like classification. Suppose we have p variables which have some covariance matrix, Σ. Then the Mahalanobis distance between two observations, $x = (x_1\ x_2\ \dots\ x_p)^T$ and $y = (y_1\ y_2\ \dots\ y_p)^T$, is given by
$$d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}.$$
A brief numerical sketch of the sample covariance matrix and the Mahalanobis distance follows.
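As a rough NumPy sketch of these two quantities (the data values are invented for illustration):

```python
import numpy as np

# Made-up data: n = 5 observations on p = 2 variables, one column per variable
V = np.array([[4.0, 2.0],
              [2.0, 4.0],
              [0.6, 0.6],
              [4.4, 3.8],
              [2.0, 2.6]])

Vc = V - V.mean(axis=0)                    # center each variable
n = V.shape[0]
Sigma = Vc.T @ Vc / (n - 1)                # sample covariance matrix, (1/(n-1)) V^T V
print(np.allclose(Sigma, np.cov(V, rowvar=False)))   # True: matches NumPy's estimate

# Mahalanobis distance between the first two observations
x, y = V[0], V[1]
d = np.sqrt((x - y) @ np.linalg.inv(Sigma) @ (x - y))
print(d)
```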

If the covariance matrix is diagonal (meaning the variables are uncorrelated), then the Mahalanobis distance reduces to Euclidean distance normalized by the variance of each variable:
$$d(x, y) = \sqrt{\sum_{i=1}^p \frac{(x_i - y_i)^2}{s_i^2}} = \|\Sigma^{-1/2}(x - y)\|_2.$$

2.2.3 Angular Distance

The inner product between two vectors can provide useful information about their relative orientation in space and about their similarity. For example, to find the cosine of the angle between two vectors in n-space, the inner product of their corresponding unit vectors will provide the result. This cosine is often used as a measure of similarity or correlation between two vectors.

Definition 2.2.3: Cosine of Angle between Vectors

The cosine of the angle between two vectors in n-space is given by
$$\cos(\theta) = \frac{x^T y}{\|x\|_2\,\|y\|_2}.$$

[Figure: two vectors x and y with the angle θ between them. Figure omitted.]

This angular distance is at the heart of Pearson's correlation coefficient.

2.2.4 Correlation

Pearson's correlation is a normalized version of the covariance, so that not only the sign of the coefficient is meaningful, but its magnitude is meaningful in measuring the strength of the linear association.

Example 2.2.1: Pearson's Correlation and Cosine Distance

You may recall the formula for Pearson's correlation between variables x and y with a sample size of n to be as follows:
$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\ \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}.$$
If we let $\bar{\mathbf{x}}$ be a vector that contains $\bar{x}$ repeated n times, like we did in Example 2.1.2, and let $\bar{\mathbf{y}}$ be a vector that contains $\bar{y}$, then Pearson's coefficient can be written as
$$r = \frac{(x - \bar{\mathbf{x}})^T (y - \bar{\mathbf{y}})}{\|x - \bar{\mathbf{x}}\|\,\|y - \bar{\mathbf{y}}\|}.$$
In other words, it is just the cosine of the angle between the two vectors once they have been centered to have mean 0. This makes sense: correlation is a measure of the extent to which the two variables share a line in space. If the cosine of the angle is positive or negative one, this means the angle between the two vectors is 0° or 180°; thus, the two vectors are perfectly correlated or collinear.

It is difficult to visualize the angle between two variable vectors because they exist in n-space, where n is the number of observations in the dataset. Unless we have fewer than 3 observations, we cannot draw these vectors or even picture them in our minds. As it turns out, this angular measurement does translate into something we can conceptualize: the angle whose cosine gives Pearson's correlation coefficient is the angle formed between the two possible regression lines using the centered data, y regressed on x and x regressed on y. This is illustrated in Figure 2.2.

To compute the matrix of pairwise correlations between variables $x_1, x_2, x_3, \dots, x_p$ (columns containing n observations for each variable), we'd first center them to have mean zero, then normalize them to have length $\|x_i\| = 1$, and then compose the matrix $X = [x_1\ x_2\ x_3\ \dots\ x_p]$. Using this centered and normalized data, the correlation matrix is simply
$$C = X^T X.$$
A short numerical check of the correspondence between correlation and the cosine of the angle between centered variables is given below.
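A minimal NumPy sketch of Example 2.2.1 (the two variables are made up):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 1., 4., 3., 6.])

xc, yc = x - x.mean(), y - y.mean()                 # center both variables

cos_angle = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(cos_angle)                                    # cosine of the angle between centered vectors
print(np.corrcoef(x, y)[0, 1])                      # Pearson's r: the same number

# Correlation matrix as C = X^T X using centered, unit-length columns
X = np.column_stack([xc / np.linalg.norm(xc), yc / np.linalg.norm(yc)])
print(X.T @ X)                                      # matches the 2x2 correlation matrix
```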

2.3. Orthogonality 8 y=f(x) θ x=f(y) r=cos(θ) Figure 2.2: Correlation Coefficient r and Angle between Regression Lines Definition 2.3.: Orthogonality Two vectors, x and y, are orthogonal in n-space if their inner product is zero: x T y = 0 Combining the notion of orthogonality and unit vectors we can define an orthonormal set of vectors, or an orthonormal matrix. Remember, for a unit vector, x T x =. Definition 2.3.2: Orthonormal Sets The n vectors {x, x 2, x 3,..., x p } form an orthonormal set if and only if. x T i x j = 0 when i = j and 2. x T i x i = (equivalently x i = ) In other words, an orthonormal set is a collection of unit vectors which are mutually orthogonal. If we form a matrix, X = (x x 2 x 3... x p ), having an orthonormal set of vectors as columns, we will find that multiplying the matrix by its transpose provides a nice result:

$$X^T X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_p^T \end{pmatrix}(x_1\ x_2\ \dots\ x_p) = \begin{pmatrix} x_1^T x_1 & x_1^T x_2 & \dots & x_1^T x_p \\ x_2^T x_1 & x_2^T x_2 & \dots & x_2^T x_p \\ \vdots & \vdots & \ddots & \vdots \\ x_p^T x_1 & x_p^T x_2 & \dots & x_p^T x_p \end{pmatrix} = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix} = I_p.$$
We will be particularly interested in these types of matrices when they are square. If X is a square matrix with orthonormal columns, the arithmetic above means that the inverse of X is $X^T$ (i.e. X also has orthonormal rows):
$$X^T X = X X^T = I.$$
Square matrices with orthonormal columns are called orthogonal matrices.

Definition 2.3.3: Orthogonal (or Orthonormal) Matrix

A square matrix, U, with orthonormal columns also has orthonormal rows and is called an orthogonal matrix. Such a matrix has an inverse which is equal to its transpose,
$$U^T U = U U^T = I.$$
A quick numerical check of this property is sketched below.
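One convenient way to produce an orthogonal matrix numerically is the QR factorization (the factorization itself is not introduced until later; it is used here only to generate an example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))     # a random square matrix
Q, R = np.linalg.qr(A)              # Q has orthonormal columns

print(np.allclose(Q.T @ Q, np.eye(4)))      # True: Q^T Q = I
print(np.allclose(Q @ Q.T, np.eye(4)))      # True: Q Q^T = I, so the rows are orthonormal too
print(np.allclose(np.linalg.inv(Q), Q.T))   # True: the inverse equals the transpose
```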

2.4 Outer Products

The outer product of two vectors $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, written $xy^T$, is an m×n matrix with rank 1. To see this basic fact, let's just look at an example.

Example 2.4.1: Outer Product

Let $x = (1, 2, 3, 4)^T$ and $y = (2, 1, 3)^T$. Then the outer product of x and y is
$$xy^T = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}(2\ 1\ 3) = \begin{pmatrix} 2 & 1 & 3 \\ 4 & 2 & 6 \\ 6 & 3 & 9 \\ 8 & 4 & 12 \end{pmatrix},$$
which clearly has rank 1.

It should be clear from this example that computing an outer product will always result in a matrix whose rows and columns are multiples of each other; a short numerical check of this fact is given below.
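The rank-1 structure is easy to verify numerically:

```python
import numpy as np

x = np.array([1., 2., 3., 4.])
y = np.array([2., 1., 3.])

M = np.outer(x, y)                  # the 4x3 outer product x y^T
print(M)
print(np.linalg.matrix_rank(M))     # 1: every row is a multiple of y, every column a multiple of x
```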

Example 2.4.2: Centering Data with an Outer Product

As we've seen in previous examples, many statistical formulas involve the centered data, that is, data from which the mean has been subtracted so that the new mean is zero. Suppose we have a matrix of data containing observations of individuals' heights (h) in inches, weights (w) in pounds, and wrist sizes (s) in inches:

            h    w     s
person 1    60   102   5.5
person 2    72   170   7.5
person 3    66   110   6.0
person 4    69   128   6.5
person 5    63   130   7.0

The average values for height, weight, and wrist size are
$$\bar{h} = 66, \qquad \bar{w} = 128, \qquad \bar{s} = 6.5. \qquad (2.5\text{--}2.7)$$
To center all of the variables in this data set simultaneously, we could compute an outer product using a vector of all ones and a vector containing the means:
$$\begin{pmatrix} 60 & 102 & 5.5 \\ 72 & 170 & 7.5 \\ 66 & 110 & 6.0 \\ 69 & 128 & 6.5 \\ 63 & 130 & 7.0 \end{pmatrix} - \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}(66\ 128\ 6.5) = \begin{pmatrix} -6 & -26 & -1 \\ 6 & 42 & 1 \\ 0 & -18 & -0.5 \\ 3 & 0 & 0 \\ -3 & 2 & 0.5 \end{pmatrix}.$$

Exercises

1. Let $u = (1, 2, 4, 2)^T$ and $v = (1, 1, 1, 1)^T$.
   a. Determine the Euclidean distance between u and v.
   b. Find a vector of unit length in the direction of u.
   c. Determine the cosine of the angle between u and v.
   d. Find the 1- and ∞-norms of u and v.
   e. Suppose these vectors are observations on four independent variables, which have the following covariance matrix:
   $$\Sigma = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
   Determine the Mahalanobis distance between u and v.

2. Let
$$U = \frac{1}{3}\begin{pmatrix} 1 & 2 & 0 & 2 \\ 2 & -2 & 0 & 1 \\ 0 & 0 & 3 & 0 \\ 2 & 1 & 0 & -2 \end{pmatrix}.$$
   a. Show that U is an orthogonal matrix.
   b. Let $b = (1, 1, 1, 1)^T$. Solve the equation Ux = b.

3. Write a matrix expression for the correlation matrix, C, for a matrix of centered data, X, where $C_{ij} = r_{ij}$ is Pearson's correlation measure between variables $x_i$ and $x_j$. To do this, we need more than an inner product; we need to normalize the rows and columns by the norms $\|x_i\|$. For a hint, see Exercise 2 in Chapter 1.

4. Suppose you have a matrix of data, $A_{n\times p}$, containing n observations on p variables. Develop a matrix formula for the standardized data (where the mean of each variable should be subtracted from the corresponding column before dividing by the standard deviation). Hint: use Exercises 1(f) and 4 from Chapter 1 along with Example 2.4.2.

5. Explain why, for any norm or distance metric, $\|x - y\| = \|y - x\|$.

6. Find two vectors which are orthogonal to $x = (1, 1, 1)^T$.

7. Pythagorean Theorem. Show that x and y are orthogonal if and only if
$$\|x + y\|_2^2 = \|x\|_2^2 + \|y\|_2^2.$$
(Hint: Recall that $\|x\|_2^2 = x^T x$.)

CHAPTER 3
LINEAR COMBINATIONS AND LINEAR INDEPENDENCE

One of the most central ideas in all of linear algebra is that of linear independence. For regression problems, it is repeatedly stressed that multicollinearity is problematic. Multicollinearity is simply a statistical term for linear dependence. It's bad. We will see the reason for this shortly, but first we have to develop the notion of a linear combination.

3.1 Linear Combinations

Definition 3.1.1: Linear Combination

A linear combination is constructed from a set of terms $v_1, v_2, \dots, v_n$ by multiplying each term by a constant and adding the results:
$$c = \alpha_1 v_1 + \alpha_2 v_2 + \dots + \alpha_n v_n = \sum_{i=1}^n \alpha_i v_i.$$
The coefficients $\alpha_i$ are scalar constants and the terms $\{v_i\}$ can be scalars, vectors, or matrices.

If we dissect our formula for a system of linear equations, Ax = b, we will find that the right-hand side vector b can be expressed as a linear combination of the columns in the coefficient matrix, A.

$$b = Ax \qquad (3.1)$$
$$b = (A_1\ A_2\ \dots\ A_n)\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \qquad (3.2)$$
$$b = x_1 A_1 + x_2 A_2 + \dots + x_n A_n \qquad (3.3)$$
A concrete example of this expression is given in Example 3.1.1.

Example 3.1.1: Systems of Equations as Linear Combinations

Consider the following system of equations:
$$3x_1 + 2x_2 + 9x_3 = 1 \qquad (3.4)$$
$$4x_1 + 2x_2 + 3x_3 = 5 \qquad (3.5)$$
$$2x_1 + 7x_2 + x_3 = 0 \qquad (3.6)$$
We can write this as a matrix-vector product Ax = b where
$$A = \begin{pmatrix} 3 & 2 & 9 \\ 4 & 2 & 3 \\ 2 & 7 & 1 \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 5 \\ 0 \end{pmatrix}.$$
We can also write b as a linear combination of the columns of A:
$$x_1\begin{pmatrix} 3 \\ 4 \\ 2 \end{pmatrix} + x_2\begin{pmatrix} 2 \\ 2 \\ 7 \end{pmatrix} + x_3\begin{pmatrix} 9 \\ 3 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 5 \\ 0 \end{pmatrix}.$$

Similarly, if we have a matrix-matrix product, we can write each column of the result as a linear combination of columns of the first matrix. Let $A_{m\times n}$, $X_{n\times p}$, and $B_{m\times p}$ be matrices. If we have AX = B then
$$(A_1\ A_2\ \dots\ A_n)\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} = (B_1\ B_2\ \dots\ B_p)$$
and we can write
$$B_j = AX_j = x_{1j}A_1 + x_{2j}A_2 + x_{3j}A_3 + \dots + x_{nj}A_n.$$
A concrete example of this expression is given in Example 3.1.2.

Example 3.1.2: Linear Combinations in Matrix-Matrix Products

Suppose we have the following matrix formula, AX = B, where
$$A = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 4 & 2 \\ 3 & 2 & 1 \end{pmatrix}, \qquad X = \begin{pmatrix} 5 & 6 \\ 9 & 5 \\ 7 & 8 \end{pmatrix}.$$
Then
$$B = AX = \begin{pmatrix} 2(5)+1(9)+3(7) & 2(6)+1(5)+3(8) \\ 1(5)+4(9)+2(7) & 1(6)+4(5)+2(8) \\ 3(5)+2(9)+1(7) & 3(6)+2(5)+1(8) \end{pmatrix} \qquad (3.7\text{--}3.8)$$
and we can immediately notice that the columns of B are linear combinations of the columns of A:
$$B_1 = 5\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} + 9\begin{pmatrix} 1 \\ 4 \\ 2 \end{pmatrix} + 7\begin{pmatrix} 3 \\ 2 \\ 1 \end{pmatrix}, \qquad B_2 = 6\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} + 5\begin{pmatrix} 1 \\ 4 \\ 2 \end{pmatrix} + 8\begin{pmatrix} 3 \\ 2 \\ 1 \end{pmatrix}.$$
We may also notice that the rows of B can be expressed as linear combinations of the rows of X:
$$B_{1\cdot} = 2(5\ 6) + 1(9\ 5) + 3(7\ 8)$$
$$B_{2\cdot} = 1(5\ 6) + 4(9\ 5) + 2(7\ 8)$$
$$B_{3\cdot} = 3(5\ 6) + 2(9\ 5) + 1(7\ 8)$$
Linear combinations are everywhere, and they can provide subtle but important meaning in the sense that they can break data down into a sum of parts. You should convince yourself of one final view of matrix multiplication, as a sum of outer products. In this case B is the sum of 3 outer products (3 matrices of rank 1) involving the columns of A and the corresponding rows of X:
$$B = A_1 X_{1\cdot} + A_2 X_{2\cdot} + A_3 X_{3\cdot}.$$
These different views of matrix multiplication are illustrated numerically in the sketch below.
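A brief NumPy check of the three views of Example 3.1.2 (columns, rows, and sum of outer products):

```python
import numpy as np

A = np.array([[2., 1., 3.],
              [1., 4., 2.],
              [3., 2., 1.]])
X = np.array([[5., 6.],
              [9., 5.],
              [7., 8.]])
B = A @ X

# Column view: each column of B is a linear combination of the columns of A
B1 = 5*A[:, 0] + 9*A[:, 1] + 7*A[:, 2]
print(np.allclose(B1, B[:, 0]))           # True

# Row view: each row of B is a linear combination of the rows of X
B_row1 = 2*X[0, :] + 1*X[1, :] + 3*X[2, :]
print(np.allclose(B_row1, B[0, :]))       # True

# Sum-of-outer-products view: B = A_1 X_1. + A_2 X_2. + A_3 X_3.
outer_sum = sum(np.outer(A[:, k], X[k, :]) for k in range(3))
print(np.allclose(outer_sum, B))          # True
```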

Example 3.1.2 turns out to have important implications for our interpretation of matrix factorizations. In this context we'd call AX a factorization of the matrix B. We will see how to use these expressions to our advantage in later chapters.

We don't necessarily have to use vectors as the terms for a linear combination. Example 3.1.3 shows how we can write any m×n matrix as a linear combination of nm matrices with rank 1.

Example 3.1.3: Linear Combination of Matrices

Write the matrix $A = \begin{pmatrix} 1 & 3 \\ 4 & 2 \end{pmatrix}$ as a linear combination of the following matrices:
$$\left\{\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}\right\}$$
Solution:
$$A = \begin{pmatrix} 1 & 3 \\ 4 & 2 \end{pmatrix} = 1\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} + 3\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} + 4\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} + 2\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$$

Now that we understand the concept of linear combination, we can develop the important concept of linear independence.

3.2 Linear Independence

Definition 3.2.1: Linear Dependence and Linear Independence

A set of vectors $\{v_1, v_2, \dots, v_n\}$ is linearly dependent if we can express the zero vector, 0, as a non-trivial linear combination of the vectors. In other words, there exist some constants $\alpha_1, \alpha_2, \dots, \alpha_n$ (non-trivial means that these constants are not all zero) for which
$$\alpha_1 v_1 + \alpha_2 v_2 + \dots + \alpha_n v_n = 0. \qquad (3.9)$$
A set of terms is linearly independent if Equation 3.9 has only the trivial solution ($\alpha_1 = \alpha_2 = \dots = \alpha_n = 0$).

Another way to express linear dependence is to say that we can write one of the vectors as a linear combination of the others. If there exists a non-trivial set of coefficients $\alpha_1, \alpha_2, \dots, \alpha_n$ for which
$$\alpha_1 v_1 + \alpha_2 v_2 + \dots + \alpha_n v_n = 0$$

then, for $\alpha_j \neq 0$, we could write
$$v_j = -\frac{1}{\alpha_j}\sum_{\substack{i=1 \\ i\neq j}}^n \alpha_i v_i.$$

Example 3.2.1: Linearly Dependent Vectors

The vectors
$$v_1 = \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 3 \\ 6 \\ 7 \end{pmatrix}$$
are linearly dependent because $v_3 = 2v_1 + v_2$ or, equivalently, because $2v_1 + v_2 - v_3 = 0$.

3.2.1 Determining Linear Independence

You should realize that the linear combination expressed in Definition 3.2.1 can be written as a matrix-vector product. Let $A_{m\times n} = (A_1\ A_2\ \dots\ A_n)$ be a matrix. Then by Definition 3.2.1, the columns of A are linearly independent if and only if the equation
$$Ax = 0 \qquad (3.10)$$
has only the trivial solution, x = 0. Equation 3.10 is commonly known as the homogeneous linear equation. For this equation to have only the trivial solution, it must be the case that, under Gauss-Jordan elimination, the augmented matrix (A | 0) reduces to (I | 0). We have already seen this condition in our discussion about matrix inverses: if a square matrix A reduces to the identity matrix under Gauss-Jordan elimination, then it is equivalently called full rank, nonsingular, or invertible. Now we add an additional condition equivalent to the others: the matrix A has linearly independent columns (and rows). In Theorem 3.2.1 an important list of equivalent conditions regarding linear independence and invertibility is given; a quick numerical check of the linear independence condition is also sketched below.
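In practice, the numerical rank of a matrix serves as a convenient stand-in for carrying out Gauss-Jordan elimination by hand (rank is one of the equivalent conditions listed in the theorem that follows):

```python
import numpy as np

# The vectors of Example 3.2.1 placed as the columns of a matrix
V = np.array([[1., 1., 3.],
              [2., 2., 6.],
              [2., 3., 7.]])
print(np.linalg.matrix_rank(V))   # 2 < 3 columns, so the columns are linearly dependent

# The coefficient matrix of Example 3.1.1 has full rank, so its columns are independent
A = np.array([[3., 2., 9.],
              [4., 2., 3.],
              [2., 7., 1.]])
print(np.linalg.matrix_rank(A))   # 3: full rank, hence invertible
```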

Theorem 3.2.1: Equivalent Conditions for Matrix Invertibility

Let A be an n×n matrix. The following statements are equivalent (if one of these statements is true, then all of these statements are true):
- A is invertible ($A^{-1}$ exists)
- A has full rank (rank(A) = n)
- The columns of A are linearly independent
- The rows of A are linearly independent
- The system Ax = b, b ≠ 0, has a unique solution
- Ax = 0 implies x = 0
- A is nonsingular
- A reduces to I under Gauss-Jordan elimination

3.3 Span of Vectors

Definition 3.3.1: Vector Span

The span of a single vector v is the set of all scalar multiples of v:
$$\mathrm{span}(v) = \{\alpha v \text{ for any constant } \alpha\}.$$
The span of a collection of vectors, $V = \{v_1, v_2, \dots, v_n\}$, is the set of all linear combinations of these vectors:
$$\mathrm{span}(V) = \{\alpha_1 v_1 + \alpha_2 v_2 + \dots + \alpha_n v_n \text{ for any constants } \alpha_1, \dots, \alpha_n\}.$$

Recall that addition of vectors can be done geometrically using the head-to-tail method shown in Figure 3.1.

[Figure 3.1: Geometrical addition of vectors: head-to-tail. Figure omitted.]

If we have two linearly independent vectors on a coordinate plane, then any

third vector can be written as a linear combination of them. This is because two linearly independent vectors are sufficient to span the entire 2-dimensional plane. You should take a moment to convince yourself of this geometrically.

In 3-space, two linearly independent vectors can still only span a plane. Figure 3.2 depicts this situation. The set of all linear combinations of the two vectors a and b (i.e. the span(a, b)) carves out a plane. We call this two-dimensional collection of vectors a subspace of R³. A subspace is formally defined in Definition 3.3.2.

[Figure 3.2: The span(a, b) in R³ creates a plane (a 2-dimensional subspace). Figure omitted.]

Definition 3.3.2: Subspace

A subspace, S, of $\mathbb{R}^n$ is thought of as a flat (having no curvature) surface within $\mathbb{R}^n$. It is a collection of vectors which satisfies the following conditions:
1. The origin (the 0 vector) is contained in S.
2. If x and y are in S, then the sum x + y is also in S.
3. If x is in S and α is a constant, then αx is also in S.

The span of two vectors a and b is a subspace because it satisfies these three conditions. (Can you prove it? See Exercise 4.)

Example 3.3.1: Span

Let $a = (1, 3, 4)^T$ and $b = (3, 0, 1)^T$. Explain why or why not each of the following vectors is contained in span(a, b).

a. $x = (5, 6, 9)^T$

To determine if x is in span(a, b) we need to find coefficients $\alpha_1, \alpha_2$ such that $\alpha_1 a + \alpha_2 b = x$. Thus, we attempt to solve the system
$$\begin{pmatrix} 1 & 3 \\ 3 & 0 \\ 4 & 1 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 5 \\ 6 \\ 9 \end{pmatrix}.$$
After Gaussian elimination, we find that the system is consistent with the solution
$$\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$$
and so x is in fact in span(a, b).

b. $y = (2, 4, 6)^T$

We could follow the same procedure as we did in part (a) to learn that the corresponding system is not consistent and thus that y is not in span(a, b).

Exercises

1. Six views of matrix multiplication: Let $A_{m\times k}$, $B_{k\times n}$, and $C_{m\times n}$ be matrices such that AB = C.
   a. Express the first column of C as a linear combination of the columns

of A.
   b. Express the first column of C as a matrix-vector product.
   c. Express C as a sum of outer products.
   d. Express the first row of C as a linear combination of the rows of B.
   e. Express the first row of C as a matrix-vector product.
   f. Express the element $C_{ij}$ as an inner product of row or column vectors from A and B.

2. Determine whether or not the vectors
$$x_1 = \begin{pmatrix} 1 \\ 3 \\ 1 \end{pmatrix}, \quad x_2 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}, \quad x_3 = \begin{pmatrix} 2 \\ 0 \\ 3 \end{pmatrix}$$
are linearly independent.

3. Let $a = (1, 3, 4)^T$ and $b = (1, 0, 0)^T$.
   a. Show that the zero vector, 0, is in span(a, b).
   b. Determine whether or not the vector $(0, 1, 0)^T$ is in span(a, b).

4. Prove that the span of vectors is a subspace by showing that it satisfies the three conditions from Definition 3.3.2. You can simply show this fact for the span of two vectors and notice how the concept will hold for more than two vectors.

5. True/False. Mark each statement as true or false. Justify your response.
   - If Ax = b has a solution, then b can be written as a linear combination of the columns of A.
   - If Ax = b has a solution, then b is in the span of the columns of A.
   - If the vectors $v_1$, $v_2$, and $v_3$ form a linearly dependent set, then $v_1$ is in span($v_2$, $v_3$).

CHAPTER 4
BASIS AND CHANGE OF BASIS

When we think of coordinate pairs, or coordinate triplets, we tend to think of them as points on a grid where each axis represents one of the coordinate directions:

[Figure: the points (−4, −2), (2, 3), and (5, 2) plotted on the plane spanned by $e_1$ and $e_2$. Figure omitted.]

When we think of our data points this way, we are considering them as linear combinations of the elementary basis vectors
$$e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \quad\text{and}\quad e_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.$$
For example, the point (2, 3) is written as
$$\begin{pmatrix} 2 \\ 3 \end{pmatrix} = 2\begin{pmatrix} 1 \\ 0 \end{pmatrix} + 3\begin{pmatrix} 0 \\ 1 \end{pmatrix} = 2e_1 + 3e_2. \qquad (4.1)$$

We consider the coefficients (the scalars 2 and 3) in this linear combination as coordinates in the basis $B_1 = \{e_1, e_2\}$. The coordinates, in essence, tell us how much information from the vector/point (2, 3) lies along each basis direction: to create this point, we must travel 2 units along the direction of $e_1$ and then 3 units along the direction of $e_2$. We can also view Equation 4.1 as a way to separate the vector (2, 3) into orthogonal components. Each component is an orthogonal projection of the vector onto the span of the corresponding basis vector. The orthogonal projection of a vector a onto the span of another vector v is simply the closest point to a contained on span(v), found by projecting a toward v at a 90° angle. Figure 4.1 shows this explicitly for a = (2, 3).

[Figure 4.1: Orthogonal projections of a = (2, 3) onto the spans of the basis vectors $e_1$ and $e_2$. Figure omitted.]

Definition 4.0.3: Elementary Basis

For any vector $a = (a_1, a_2, \dots, a_n)$, the basis $B = \{e_1, e_2, \dots, e_n\}$ (recall $e_i$ is the i-th column of the identity matrix $I_n$) is the elementary basis, and a can be written in this basis using the coordinates $a_1, a_2, \dots, a_n$ as follows:
$$a = a_1 e_1 + a_2 e_2 + \dots + a_n e_n.$$

The elementary basis is convenient for many reasons, one being its orthonormality:
$$e_1^T e_1 = e_2^T e_2 = 1, \qquad e_1^T e_2 = e_2^T e_1 = 0.$$
However, there are many (infinitely many, in fact) ways to represent the data points on different axes. If I wanted to view this data in a different

way, I could use a different basis. Let's consider, for example, the following orthonormal basis, drawn in green over the original grid in Figure 4.2:
$$B_2 = \{v_1, v_2\} = \left\{\frac{\sqrt{2}}{2}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \frac{\sqrt{2}}{2}\begin{pmatrix} 1 \\ -1 \end{pmatrix}\right\}$$

[Figure 4.2: New basis vectors, $v_1$ and $v_2$, shown on the original plane. Figure omitted.]

The scalar multipliers $\frac{\sqrt{2}}{2}$ are simply normalizing factors so that the basis vectors have unit length. You can convince yourself that this is an orthonormal basis by confirming that
$$v_1^T v_1 = v_2^T v_2 = 1, \qquad v_1^T v_2 = v_2^T v_1 = 0.$$
If we want to change the basis from the elementary $B_1$ to the new green basis vectors in $B_2$, we need to determine a new set of coordinates that direct us to the point using the green basis vectors as a frame of reference. In other words, we need to determine $(\alpha_1, \alpha_2)$ such that travelling $\alpha_1$ units along the direction $v_1$ and then $\alpha_2$ units along the direction $v_2$ will lead us to the point in question. For the point (2, 3) that means
$$\begin{pmatrix} 2 \\ 3 \end{pmatrix} = \alpha_1 v_1 + \alpha_2 v_2 = \alpha_1 \frac{\sqrt{2}}{2}\begin{pmatrix} 1 \\ 1 \end{pmatrix} + \alpha_2 \frac{\sqrt{2}}{2}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$
This is merely a system of equations Va = b:
$$\begin{pmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}.$$

The 2×2 matrix V on the left-hand side has linearly independent columns and thus has an inverse. In fact, V is an orthonormal matrix, which means its inverse is its transpose. Multiplying both sides of the equation by $V^{-1} = V^T$ yields the solution
$$a = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = V^T b = \begin{pmatrix} \frac{5\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} \end{pmatrix}.$$
This result tells us that in order to reach the red point (formerly known as (2, 3) in our previous basis), we should travel $\frac{5\sqrt{2}}{2}$ units along the direction of $v_1$ and then $-\frac{\sqrt{2}}{2}$ units along the direction of $v_2$ (note that $v_2$ points toward the southeast corner and we want to move northwest, hence the coordinate is negative). Another way (a more mathematical way) to say this is that the length of the orthogonal projection of a onto the span of $v_1$ is $\frac{5\sqrt{2}}{2}$, and the length of the orthogonal projection of a onto the span of $v_2$ is $\frac{\sqrt{2}}{2}$. While it may seem that these are difficult distances to plot, they work out quite well if we examine our drawing in Figure 4.2, because the diagonal of each square is $\sqrt{2}$.

In the same fashion, we can re-write all 3 of the red points on our graph in the new basis by solving the same system simultaneously for all the points. Let B be a matrix containing the original coordinates of the points and let A be a matrix containing the new coordinates:
$$B = \begin{pmatrix} -4 & 2 & 5 \\ -2 & 3 & 2 \end{pmatrix} \qquad A = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \alpha_{13} \\ \alpha_{21} & \alpha_{22} & \alpha_{23} \end{pmatrix}$$
Then the new data coordinates on the rotated plane can be found by solving VA = B, and thus
$$A = V^T B = \frac{\sqrt{2}}{2}\begin{pmatrix} -6 & 5 & 7 \\ -2 & -1 & 3 \end{pmatrix}.$$
Using our new basis vectors, our alternative view of the data is that in Figure 4.3.

In the above example, we changed our basis from the original elementary basis to a new orthogonal basis which provides a different view of the data. All of this amounts to a rotation of the data around the origin. No real information has been lost: the points maintain their distances from each other in nearly every distance metric. Our new variables, $v_1$ and $v_2$, are linear combinations of our original variables $e_1$ and $e_2$; thus we can transform the data back to its original coordinate system by again solving a linear system (in this example, we'd simply multiply the new coordinates again by V). In general, we can change bases using the procedure outlined in Theorem 4.0.1.

[Figure 4.3: The points plotted in the new basis, $B_2$. Figure omitted.]

Theorem 4.0.1: Changing Bases

Given a matrix of coordinates (in columns), A, in some basis $B_1 = \{x_1, x_2, \dots, x_n\}$, we can change the basis to $B_2 = \{v_1, v_2, \dots, v_n\}$, with the new set of coordinates in a matrix B, by solving the system
$$XA = VB$$
where X and V are matrices containing (as columns) the basis vectors from $B_1$ and $B_2$ respectively.

Note that when our original basis is the elementary basis, X = I, so our system reduces to A = VB. When our new basis vectors are orthonormal, the solution to this system is simply $B = V^T A$.

A short numerical sketch of this change-of-basis computation is given below.
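The worked example above can be reproduced in a few lines of NumPy:

```python
import numpy as np

s = np.sqrt(2) / 2
V = np.array([[s,  s],
              [s, -s]])            # columns are the new orthonormal basis vectors v1, v2

B = np.array([[-4., 2., 5.],
              [-2., 3., 2.]])      # original (elementary-basis) coordinates, one point per column

A = V.T @ B                        # new coordinates: since V is orthonormal, A = V^T B
print(A)                           # e.g. the point (2, 3) becomes (5*sqrt(2)/2, -sqrt(2)/2)

print(np.allclose(V @ A, B))       # True: multiplying by V maps the new coordinates back
```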

Definition 4.0.4: Basis Terminology

A basis for the vector space $\mathbb{R}^n$ can be any collection of n linearly independent vectors in $\mathbb{R}^n$; n is said to be the dimension of the vector space $\mathbb{R}^n$. When the basis vectors are orthonormal (as they were in our example), the collection is called an orthonormal basis.

The preceding discussion dealt entirely with bases for $\mathbb{R}^n$ (our example was for points in R²). However, we will also need to consider bases for subspaces of $\mathbb{R}^n$. Recall that the span of two linearly independent vectors in R³ is a plane. This plane is a 2-dimensional subspace of R³. Its dimension is 2 because 2 basis vectors are required to represent this space. However, not all points from R³ can be written in this basis, only those points which exist on the plane. In the next chapter, we will discuss how to proceed in a situation where the point we'd like to represent does not actually belong to the subspace we are interested in. This is the foundation for least squares.

Exercises

1. Show that the vectors $v_1 = (3, 1)^T$ and $v_2 = (-2, 6)^T$ are orthogonal. Create an orthonormal basis for R² using these two direction vectors.

2. Consider $a_1 = (1, 1)$ and $a_2 = (0, 1)$ as coordinates for points in the elementary basis. Write the coordinates of $a_1$ and $a_2$ in the orthonormal basis found in Exercise 1. Draw a picture which reflects the old and new basis vectors.

3. Write the orthonormal basis vectors from Exercise 1 as linear combinations of the original elementary basis vectors.

4. What is the length of the orthogonal projection of $a_1$ onto $v_1$?

CHAPTER 5
LEAST SQUARES

The least squares problem arises in almost all areas where mathematics is applied. Statistically, the idea is to find an approximate mathematical relationship between predictor and target variables such that the sum of squared errors between the true value and the approximation is minimized. In two dimensions, the goal would be to develop a line, as depicted in Figure 5.1, such that the sum of squared vertical distances (the residuals, in green) between the true data (in red) and the mathematical prediction (in blue) is minimized.

[Figure 5.1: Least squares illustrated in 2 dimensions; the residual r is the vertical distance between an observed point (x, y) and its prediction (x, ŷ). Figure omitted.]

If we let r be a vector containing the residual values $(r_1, r_2, \dots, r_n)$, then the

sum of squared residuals can be written in linear algebraic notation as
$$\sum_{i=1}^n r_i^2 = r^T r = (y - \hat{y})^T (y - \hat{y}) = \|y - \hat{y}\|^2.$$
Suppose we want to regress our target variable y on p predictor variables, $x_1, x_2, \dots, x_p$. If we have n observations, then the ideal situation would be to find a vector of parameters β containing an intercept, $\beta_0$, along with p slope parameters, $\beta_1, \dots, \beta_p$, such that
$$\underbrace{\begin{pmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & & & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix}}_{X}\ \underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}}_{\beta} = \underbrace{\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}}_{y} \qquad (5.1)$$
With many more observations than variables, this system of equations will not, in practice, have a solution. Thus, our goal becomes finding a vector of parameters $\hat{\beta}$ such that $X\hat{\beta} = \hat{y}$ comes as close to y as possible. Using the design matrix X, the least squares solution $\hat{\beta}$ is the one for which
$$\|y - X\hat{\beta}\|^2 = \|y - \hat{y}\|^2$$
is minimized. Theorem 5.0.2 characterizes the solution to the least squares problem.

Theorem 5.0.2: Least Squares Problem and Solution

For an n×m matrix X and an n×1 vector y, let $r = X\beta - y$. The least squares problem is to find a vector β that minimizes the quantity
$$\sum_{i=1}^n r_i^2 = \|y - X\beta\|^2.$$
Any vector β which provides a minimum value for this expression is called a least-squares solution. The set of all least squares solutions is precisely the set of solutions to the so-called normal equations,
$$X^T X \beta = X^T y.$$

There is a unique least squares solution if and only if rank(X) = m (i.e. linear independence of variables, or no perfect multicollinearity!), in which case $X^T X$ is invertible and the solution is given by
$$\beta = (X^T X)^{-1} X^T y.$$

Example 5.0.2: Solving a Least Squares Problem

In 2014, data was collected regarding the percentage of linear algebra exercises done by students and the grade they received on their examination. Based on this data, what is the expected effect of completing an additional 10% of the exercises on a student's exam grade?

ID   % of Exercises   Exam Grade
1    20               55
2    100              100
3    90               100
4    70               70
5    50               75
6    10               25
7    30               60

To find the least squares regression line, we want to solve the equation Xβ = y:
$$\begin{pmatrix} 1 & 20 \\ 1 & 100 \\ 1 & 90 \\ 1 & 70 \\ 1 & 50 \\ 1 & 10 \\ 1 & 30 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} 55 \\ 100 \\ 100 \\ 70 \\ 75 \\ 25 \\ 60 \end{pmatrix}.$$
This system is obviously inconsistent. Thus, we want to find the least squares solution $\hat{\beta}$ by solving $X^T X\hat{\beta} = X^T y$:
$$\begin{pmatrix} 7 & 370 \\ 370 & 26900 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} 485 \\ 30800 \end{pmatrix}.$$
Now, since multicollinearity was not a problem, we can simply find the inverse of $X^T X$ and multiply it on both sides of the equation:
$$\begin{pmatrix} 7 & 370 \\ 370 & 26900 \end{pmatrix}^{-1} = \begin{pmatrix} 0.5233 & -0.0072 \\ -0.0072 & 0.0001 \end{pmatrix}$$

and so
$$\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} = \begin{pmatrix} 0.5233 & -0.0072 \\ -0.0072 & 0.0001 \end{pmatrix}\begin{pmatrix} 485 \\ 30800 \end{pmatrix} = \begin{pmatrix} 32.09 \\ 0.7033 \end{pmatrix}.$$
Thus, for each additional 10% of exercises completed, exam grades are expected to increase by about 7 points. The data along with the regression line are shown below.

[Figure: scatterplot of exam grade against percentage of exercises completed, with the fitted line grade = 32.09 + 0.7033·percent_exercises. Figure omitted.]

Why the normal equations?

The solution of the normal equations has a nice geometrical interpretation. It involves the idea of orthogonal projection, a concept which will be useful for understanding future topics. In order for a system of equations, Ax = b, to have a solution, b must be a linear combination of the columns of A. That is simply the definition of matrix multiplication and equality. If A is m×n then
$$Ax = b \implies b = x_1 A_1 + x_2 A_2 + \dots + x_n A_n.$$
As discussed in Chapter 3, another way to say this is that b is in the span of the columns of A. The span of the columns of A is called the column space of A. In least-squares applications, the problem is that b is not in the column space of A. In essence, we want to find the vector $\hat{b}$ that is closest to b but contained in the column space of A. A short numerical sketch of the normal equations applied to Example 5.0.2 follows.
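As a rough NumPy sketch of Example 5.0.2 (solving the normal equations directly, and checking against a built-in least squares solver):

```python
import numpy as np

# Data from Example 5.0.2: percentage of exercises completed and exam grade
x = np.array([20., 100., 90., 70., 50., 10., 30.])
y = np.array([55., 100., 100., 70., 75., 25., 60.])

X = np.column_stack([np.ones_like(x), x])     # design matrix with an intercept column

# Solve the normal equations X^T X beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                                   # ~[32.09, 0.7033]

# np.linalg.lstsq solves the same least squares problem directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))          # True
```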