EECS 275 Matrix Computation


Ming-Hsuan Yang
Electrical Engineering and Computer Science, University of California at Merced, Merced, CA
Lecture 9

Overview
Least squares minimization
Regression
Regularization

Reading
Chapter 11 of Numerical Linear Algebra by Lloyd Trefethen and David Bau
Chapter 5 of Matrix Computations by Gene Golub and Charles Van Loan
Chapter 4 of Matrix Analysis and Applied Linear Algebra by Carl Meyer
Chapter 11 and Chapter 15 of Matrix Algebra From a Statistician's Perspective by David Harville

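Matrix differentiation
First-order differentiation of a linear form:
$$\frac{\partial\, x^\top a}{\partial x} = \frac{\partial\, a^\top x}{\partial x} = a$$
Componentwise, since $\partial x_i / \partial x_j = 1$ if $i = j$ and $0$ if $i \neq j$,
$$\frac{\partial}{\partial x_j} \Big( \sum_i a_i x_i \Big) = \sum_i a_i \frac{\partial x_i}{\partial x_j} = a_j$$
Likewise,
$$\frac{\partial (Ax)}{\partial x^\top} = A, \qquad \frac{\partial (Ax)^\top}{\partial x} = A^\top$$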

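Matrix differentiation (cont'd)
First-order differentiation of a quadratic form:
$$x^\top A x = \sum_{i,k} a_{ik} x_i x_k, \qquad \frac{\partial (x^\top A x)}{\partial x} = (A + A^\top) x$$
The partial derivatives of the products are
$$\frac{\partial (x_i x_k)}{\partial x_j} = \begin{cases} 2 x_j & \text{if } i = k = j \\ x_i & \text{if } k = j,\ i \neq j \\ x_k & \text{if } i = j,\ k \neq j \\ 0 & \text{otherwise} \end{cases}$$
so that
$$\begin{aligned}
\frac{\partial (x^\top A x)}{\partial x_j}
&= \frac{\partial}{\partial x_j} \Big( a_{jj} x_j^2 + \sum_{i \neq j} a_{ij} x_i x_j + \sum_{k \neq j} a_{jk} x_j x_k + \sum_{i \neq j,\, k \neq j} a_{ik} x_i x_k \Big) \\
&= 2 a_{jj} x_j + \sum_{i \neq j} a_{ij} x_i + \sum_{k \neq j} a_{jk} x_k + 0 \\
&= \sum_i a_{ij} x_i + \sum_k a_{jk} x_k
\end{aligned}$$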
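
The identity $\partial (x^\top A x)/\partial x = (A + A^\top)x$ can be sanity-checked numerically. The following is a minimal sketch in plain Python, assuming a small non-symmetric test matrix; the function names (`quad_form`, `analytic_grad`, `numeric_grad`), the sample values, and the step size `h` are illustrative choices rather than anything prescribed by the lecture.

```python
def quad_form(A, x):
    """Evaluate x^T A x = sum_{i,k} a_ik x_i x_k."""
    n = len(x)
    return sum(A[i][k] * x[i] * x[k] for i in range(n) for k in range(n))


def analytic_grad(A, x):
    """Gradient from the closed form (A + A^T) x."""
    n = len(x)
    return [sum((A[j][i] + A[i][j]) * x[i] for i in range(n)) for j in range(n)]


def numeric_grad(A, x, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        grad.append((quad_form(A, x_plus) - quad_form(A, x_minus)) / (2 * h))
    return grad


# Hypothetical test data: A is deliberately non-symmetric.
A = [[1, 2], [3, 4]]
x = [1.0, -2.0]

print("analytic:", analytic_grad(A, x))
print("numeric: ", numeric_grad(A, x))
```

Using a non-symmetric $A$ matters here: for symmetric $A$ the result reduces to $2Ax$, which would hide an error in the transpose term.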

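Matrix differentiation (cont'd)
First-order differentiation of a quadratic form:
$$\frac{\partial (x^\top A x)}{\partial x} = (A + A^\top) x$$
Let $W$ be a symmetric matrix. It can be easily shown that
$$\frac{\partial}{\partial s} (x - As)^\top W (x - As) = -2 A^\top W (x - As)$$
$$\frac{\partial}{\partial s} (x - s)^\top W (x - s) = -2 W (x - s)$$
$$\frac{\partial}{\partial x} (x - As)^\top W (x - As) = 2 W (x - As)$$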

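Matrix differentiation (cont'd)
Second-order derivative of a quadratic form:
$$\frac{\partial^2 (x^\top A x)}{\partial x_j\, \partial x_s} = \sum_i a_{ij} \frac{\partial x_i}{\partial x_s} + \sum_k a_{jk} \frac{\partial x_k}{\partial x_s} = a_{sj} + a_{js}$$
$$\frac{\partial^2 (x^\top A x)}{\partial x\, \partial x^\top} = A + A^\top$$
Recall the second-order Taylor expansion, with Jacobian $J$ and Hessian $H$:
$$f(x) \approx f(a) + J(a)(x - a) + \tfrac{1}{2} (x - a)^\top H(a) (x - a)$$
See The Matrix Cookbook by Kaare Petersen and Michael Pedersen for details.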

Overdetermined linear equations
Consider $y = Ax$ where $A \in \mathbb{R}^{m \times n}$ is skinny, i.e., $m > n$.
One can approximately solve $y \approx Ax$ and define the residual or error $r = Ax - y$.
Find $x = x_{ls}$ that minimizes $\|r\|$; $x_{ls}$ is the least squares solution.
Geometric interpretation: $A x_{ls}$ is the point in $\mathrm{ran}(A)$ that is closest to $y$, i.e., $A x_{ls}$ is the projection of $y$ onto $\mathrm{ran}(A)$.

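Least squares minimization
Minimize the squared norm of the residual $r = Ax - y$:
$$\|r\|^2 = x^\top A^\top A x - 2 y^\top A x + y^\top y$$
Set the gradient with respect to $x$ to zero:
$$\nabla_x \|r\|^2 = 2 A^\top A x - 2 A^\top y = 0$$
This gives $A^\top A x = A^\top y$ (also known as the normal equations).
Assuming $A^\top A$ is invertible, we have
$$x_{ls} = (A^\top A)^{-1} A^\top y, \qquad A x_{ls} = A (A^\top A)^{-1} A^\top y$$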
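
As a concrete illustration of the normal equations, here is a minimal sketch in plain Python using exact rational arithmetic. The helper names (`transpose`, `matmul`, `solve`, `least_squares`) and the 3-by-2 example data are assumptions made for this illustration; the Gauss-Jordan routine is only a stand-in for whatever linear solver one would normally use.

```python
from fractions import Fraction


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def solve(M, b):
    """Solve the square system M x = b by Gauss-Jordan elimination (exact fractions)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(M, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # first usable pivot row
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def least_squares(A, y):
    """Return x_ls from the normal equations A^T A x = A^T y (A must have full column rank)."""
    At = transpose(A)
    AtA = matmul(At, A)
    Aty = [sum(a * yi for a, yi in zip(row, y)) for row in At]
    return solve(AtA, Aty)


# Hypothetical skinny system: 3 equations, 2 unknowns, no exact solution.
A = [[1, 0], [0, 1], [1, 1]]
y = [1, 2, 4]

x_ls = least_squares(A, y)
residual = [sum(a * xj for a, xj in zip(row, x_ls)) - yi for row, yi in zip(A, y)]

print("x_ls     =", x_ls)
print("residual =", residual)
```

Forming $A^\top A$ explicitly squares the condition number of the problem, so in floating-point practice a QR-based solver is usually preferred; exact fractions sidestep that issue in this small example.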

Least squares minimization
For $y = Ax$, $x_{ls} = (A^\top A)^{-1} A^\top y$.
$x_{ls}$ is a linear function of $y$.
$x_{ls} = A^{-1} y$ if $A$ is square (and invertible).
$x_{ls}$ solves $y = A x_{ls}$ if $y \in \mathrm{ran}(A)$.
$A^\dagger = (A^\top A)^{-1} A^\top$ is called the pseudo-inverse or Moore-Penrose inverse.
$A^\dagger$ is a left inverse of (full rank, skinny) $A$: $A^\dagger A = (A^\top A)^{-1} A^\top A = I$.
$A (A^\top A)^{-1} A^\top$ is the projection matrix.

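Orthogonality principle
The optimal residual
$$r = A x_{ls} - y = \big( A (A^\top A)^{-1} A^\top - I \big) y$$
is orthogonal to $\mathrm{ran}(A)$:
$$\langle r, Az \rangle = y^\top \big( A (A^\top A)^{-1} A^\top - I \big)^\top A z = 0 \quad \text{for all } z \in \mathbb{R}^n$$
Since $r = A x_{ls} - y \perp A(x - x_{ls})$ for any $x \in \mathbb{R}^n$, we have
$$\|Ax - y\|^2 = \|(A x_{ls} - y) + A(x - x_{ls})\|^2 = \|A x_{ls} - y\|^2 + \|A(x - x_{ls})\|^2$$
which means that for $x \neq x_{ls}$, $\|Ax - y\| > \|A x_{ls} - y\|$.
The computation can be further simplified via QR decomposition.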
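
The step that splits the squared norm into two terms relies on the cross term vanishing. A short worked expansion, using only the normal equations $A^\top A x_{ls} = A^\top y$, makes this explicit:

```latex
\begin{aligned}
\|Ax - y\|^2
  &= \|(A x_{ls} - y) + A(x - x_{ls})\|^2 \\
  &= \|A x_{ls} - y\|^2 + \|A(x - x_{ls})\|^2
     + 2\,(x - x_{ls})^\top A^\top (A x_{ls} - y) \\
  &= \|A x_{ls} - y\|^2 + \|A(x - x_{ls})\|^2 ,
\end{aligned}
\qquad \text{since } A^\top (A x_{ls} - y) = A^\top A x_{ls} - A^\top y = 0 .
```

When $A$ has full column rank, $A(x - x_{ls}) = 0$ only if $x = x_{ls}$, which is why the inequality is strict for every other $x$.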

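Least squares minimization and orthogonal projection
Recall that if $u \in \mathbb{R}^m$, then $P = \dfrac{u u^\top}{u^\top u}$ is an orthogonal projection.
Given a point $x = x_\parallel + x_\perp$, its projection is
$$P_u x = \frac{u u^\top}{u^\top u} x_\parallel + \frac{u u^\top}{u^\top u} x_\perp = x_\parallel$$
This generalizes to orthogonal projection onto a subspace spanned by an orthonormal basis $A = [u_1, \ldots, u_r]$: $P_A = A A^\top$.
In general, a normalization term is needed: if $u_1, \ldots, u_r$ is not an orthonormal basis, $P_A = A (A^\top A)^{-1} A^\top$.
Given $A = U \Sigma V^\top$, where $U$ holds the left singular vectors associated with the nonzero singular values, it follows from least squares minimization that $P_A = U U^\top$.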
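
A small worked example may help fix the rank-one case. The vectors $u$ and $x$ below are arbitrary illustrative choices, not taken from the lecture:

```latex
u = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
P = \frac{u u^\top}{u^\top u}
  = \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad
x = \begin{pmatrix} 3 \\ 1 \end{pmatrix}
\;\Longrightarrow\;
P x = \begin{pmatrix} 2 \\ 2 \end{pmatrix}, \quad
x - P x = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad
u^\top (x - P x) = 0 .
```

Here $Px$ is the component of $x$ along $u$, the remainder $x - Px$ is orthogonal to $u$, and one can check directly that $P^2 = P$ and $P^\top = P$, the two defining properties of an orthogonal projection.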

Least squares estimation
Numerous applications in inversion, estimation, and reconstruction problems have the form $y = Ax + v$, where
$x$ is what we want to estimate or reconstruct,
$y$ is our sensor measurements,
$v$ is unknown noise or measurement error,
and the $i$-th row of $A$ characterizes the $i$-th sensor.
Least squares estimation: choose $\hat{x}$ that minimizes $\|A\hat{x} - y\|$, i.e., the deviation between what we actually observe ($y$) and what we would observe if $x = \hat{x}$ and there were no noise ($v = 0$).
The least squares estimate is $\hat{x} = (A^\top A)^{-1} A^\top y$.

Best linear unbiased estimator (BLUE)
Linear estimator with noise: $y = Ax + v$, where $A$ is full rank and skinny.
A linear estimator of the form $\hat{x} = By$ is unbiased if $\hat{x} = x$ whenever $v = 0$ (no estimation error when $v = 0$).
This is equivalent to $BA = I$, i.e., $B$ is a left inverse of $A$.
The estimation error of an unbiased linear estimator is $x - \hat{x} = x - B(Ax + v) = -Bv$.
$A^\dagger = (A^\top A)^{-1} A^\top$ is the smallest left inverse of $A$ in the following sense: for any $B$ with $BA = I$, we have
$$\sum_{i,j} B_{ij}^2 \geq \sum_{i,j} \big( A^\dagger_{ij} \big)^2$$
i.e., least squares provides the best linear unbiased estimator (BLUE).
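
The claim that the pseudo-inverse is the smallest left inverse can be verified in a few lines. The sketch below introduces $Z = B - A^\dagger$ as auxiliary notation and measures size with the Frobenius norm, which is the sum of squared entries used in the inequality:

```latex
\begin{aligned}
& BA = I,\quad A^\dagger A = I \;\Longrightarrow\; ZA = (B - A^\dagger) A = 0, \\
& Z (A^\dagger)^\top = Z A (A^\top A)^{-1} = 0, \\
& \|B\|_F^2 = \|A^\dagger + Z\|_F^2
   = \|A^\dagger\|_F^2 + \|Z\|_F^2 + 2\,\mathrm{tr}\!\big( Z (A^\dagger)^\top \big)
   = \|A^\dagger\|_F^2 + \|Z\|_F^2 \;\geq\; \|A^\dagger\|_F^2 .
\end{aligned}
```

Equality holds only when $Z = 0$, so $A^\dagger$ is the unique left inverse of minimum size.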

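Pseudo-inverse via regularization
For $\mu > 0$, let $x_\mu$ be the unique minimizer of
$$\|Ax - y\|^2 + \mu \|x\|^2 = \left\| \begin{bmatrix} A \\ \sqrt{\mu}\, I \end{bmatrix} x - \begin{bmatrix} y \\ 0 \end{bmatrix} \right\|^2 = \|\tilde{A} x - \tilde{y}\|^2$$
Thus
$$x_\mu = (\tilde{A}^\top \tilde{A})^{-1} \tilde{A}^\top \tilde{y} = (A^\top A + \mu I)^{-1} A^\top y$$
is called the regularized least squares solution for $Ax \approx y$.
This is also called Tikhonov (Tychonov) regularization (ridge regression in statistics).
Since $A^\top A + \mu I \succ 0$, it is invertible, and we have
$$\lim_{\mu \to 0} x_\mu = A^\dagger y \qquad \text{and} \qquad \lim_{\mu \to 0} (A^\top A + \mu I)^{-1} A^\top = A^\dagger$$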
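
The limit behaviour of the regularized solution can be observed on a tiny example. The following is a minimal sketch in plain Python with exact fractions; the helper names (`solve`, `ridge`), the example data, and the chosen values of `mu` are assumptions for illustration only.

```python
from fractions import Fraction


def solve(M, b):
    """Solve the square system M x = b by Gauss-Jordan elimination (exact fractions)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(M, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # first usable pivot row
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def ridge(A, y, mu):
    """Regularized least squares: x_mu = (A^T A + mu I)^{-1} A^T y."""
    At = [list(col) for col in zip(*A)]
    n = len(At)
    M = [[sum(p * q for p, q in zip(At[i], At[j])) + (mu if i == j else 0)
          for j in range(n)] for i in range(n)]
    b = [sum(a * yi for a, yi in zip(row, y)) for row in At]
    return solve(M, b)


# Hypothetical full-rank, skinny system.
A = [[1, 0], [0, 1], [1, 1]]
y = [1, 2, 4]

# mu = 0 reproduces the ordinary least squares solution because A has full column rank.
for mu in [1, Fraction(1, 10), Fraction(1, 1000), 0]:
    x_mu = ridge(A, y, mu)
    print("mu =", mu, "-> x_mu =", [float(v) for v in x_mu])
```

As `mu` shrinks, the printed solution approaches the unregularized least squares solution, matching the stated limit; larger `mu` pulls the solution toward zero.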

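Minimizing weighted-sum objective
Two (or more) objectives: we want $J_1 = \|Ax - y\|^2$ small and also $J_2 = \|Fx - g\|^2$ small.
Consider minimizing a weighted-sum objective:
$$\|Ax - y\|^2 + \mu \|Fx - g\|^2 = \left\| \begin{bmatrix} A \\ \sqrt{\mu}\, F \end{bmatrix} x - \begin{bmatrix} y \\ \sqrt{\mu}\, g \end{bmatrix} \right\|^2 = \|\tilde{A} x - \tilde{y}\|^2$$
Thus, the least squares solution is
$$x = (\tilde{A}^\top \tilde{A})^{-1} \tilde{A}^\top \tilde{y} = (A^\top A + \mu F^\top F)^{-1} (A^\top y + \mu F^\top g)$$
Widely used in function approximation, regression, optimization, image processing, computer vision, control, machine learning, graph theory, etc.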
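
The closed form follows by writing out the normal equations of the stacked system. With $\tilde{A}$ and $\tilde{y}$ denoting the stacked matrix and vector, the two products are:

```latex
\tilde{A} = \begin{bmatrix} A \\ \sqrt{\mu}\, F \end{bmatrix}, \quad
\tilde{y} = \begin{bmatrix} y \\ \sqrt{\mu}\, g \end{bmatrix}
\;\Longrightarrow\;
\tilde{A}^\top \tilde{A} = A^\top A + \mu F^\top F, \qquad
\tilde{A}^\top \tilde{y} = A^\top y + \mu F^\top g .
```

Setting $F = I$ and $g = 0$ recovers the Tikhonov-regularized solution $(A^\top A + \mu I)^{-1} A^\top y$ as a special case.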

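Least squares data fitting
Linear regression: model one scalar $y$ as a linear combination of $t_1, \ldots, t_n$,
$$y = \alpha_0 + \alpha_1 t_1 + \cdots + \alpha_n t_n = \sum_{j=0}^{n} \alpha_j t_j \quad (\text{with } t_0 = 1)$$
where the $\alpha_j$ are unknown parameters or coefficients.
For a set of $m$ data points $\{(t_i, y_i)\}$, $t_i \in \mathbb{R}^n$, we want to minimize
$$\sum_{i=1}^{m} \Big( y_i - \sum_{j=0}^{n} t_{ij} \alpha_j \Big)^2 \quad (\text{with } t_{i0} = 1)$$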

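Least squares data fitting
For a set of training data $\{(t_i, y_i)\}$, we form $y$ and $A$.
In matrix form, let $A$ be the $m \times (n + 1)$ matrix with each row an input vector (preceded by a 1), and $x \in \mathbb{R}^{n+1}$, so that $y = Ax$ with
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \qquad
A = \begin{pmatrix} 1 & t_{11} & t_{12} & \cdots & t_{1n} \\ 1 & t_{21} & t_{22} & \cdots & t_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & t_{m1} & t_{m2} & \cdots & t_{mn} \end{pmatrix}, \qquad
x = \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix}$$
We thus obtain the coefficients $\alpha_j$ from $x$, where
$$x = A^\dagger y = (A^\top A)^{-1} A^\top y$$
and the fitted model is $y = \alpha_0 + \alpha_1 t_1 + \cdots + \alpha_n t_n$.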
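
The matrix construction above can be sketched in a few lines of plain Python. The function names (`solve`, `fit_linear`) and the four data points are hypothetical, chosen so that the data lie exactly on a known plane and the recovered coefficients are easy to check.

```python
from fractions import Fraction


def solve(M, b):
    """Solve the square system M x = b by Gauss-Jordan elimination (exact fractions)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(M, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # first usable pivot row
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def fit_linear(T, y):
    """Fit y = a0 + a1*t1 + ... + an*tn by solving the normal equations."""
    A = [[1] + list(t) for t in T]          # prepend the constant column
    At = [list(col) for col in zip(*A)]
    AtA = [[sum(p * q for p, q in zip(ri, rj)) for rj in At] for ri in At]
    Aty = [sum(a * yi for a, yi in zip(row, y)) for row in At]
    return solve(AtA, Aty)


# Hypothetical data generated from y = 1 + 2*t1 + 3*t2 (noise-free).
T = [(0, 0), (1, 0), (0, 1), (1, 1)]
y = [1, 3, 4, 6]

coeffs = fit_linear(T, y)
print("alpha =", coeffs)
```

With noise-free data the fit returns the generating coefficients exactly; with noisy measurements the same code returns the coefficients that minimize the sum of squared residuals.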

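Least squares data fitting (cont'd)
Estimate the relationship between weight loss ($y$), storage time ($t_1$), and storage temperature ($t_2$) with the model $y = \alpha_0 + \alpha_1 t_1 + \alpha_2 t_2$.
[Data table with columns Time, Temp, Loss; numerical entries not recoverable.]
The least squares solution is found from $y = Ax$ with $x = (\alpha_0, \alpha_1, \alpha_2)^\top$, where each row of $A$ is $(1, t_1, t_2)$ for one observation and $y$ stacks the measured losses.
Using MATLAB: x = A\y gives the coefficients of the fitted model $y = \alpha_0 + \alpha_1 t_1 + \alpha_2 t_2$ [numerical values not recoverable].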

Least squares polynomial fitting
Fit a polynomial of degree $n - 1$, $n \leq m$, to data $(t_i, y_i)$:
$$y = p(t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \cdots + \alpha_{n-1} t^{n-1}$$
The basis functions are $f_j(t) = t^{j-1}$, $j = 1, \ldots, n$ (a geometric progression).
Straight line: $p(t) = \alpha_0 + \alpha_1 t$
Quadratic: $p(t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2$
Cubic, quartic, and higher polynomials follow the same pattern.

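Least squares polynomial fitting
Matrix $A$ has the form $A_{ij} = t_i^{\,j-1}$:
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \qquad
A = \begin{pmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^{n-1} \\ 1 & t_2 & t_2^2 & \cdots & t_2^{n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & t_m & t_m^2 & \cdots & t_m^{n-1} \end{pmatrix}, \qquad
x = \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_{n-1} \end{pmatrix}$$
($A$ is called a Vandermonde matrix.)
See also kernel regression and splines.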
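
A polynomial fit is the same least squares problem with a Vandermonde matrix in place of the generic data matrix. The sketch below is plain Python with exact fractions; the function names (`solve`, `polyfit_ls`) and the sample points, taken from the hypothetical curve $y = 1 + t^2$, are illustrative assumptions.

```python
from fractions import Fraction


def solve(M, b):
    """Solve the square system M x = b by Gauss-Jordan elimination (exact fractions)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(M, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # first usable pivot row
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def polyfit_ls(ts, ys, n):
    """Fit a polynomial with n coefficients (degree n - 1) by least squares."""
    A = [[t ** j for j in range(n)] for t in ts]   # Vandermonde rows: 1, t, ..., t^(n-1)
    At = [list(col) for col in zip(*A)]
    AtA = [[sum(p * q for p, q in zip(ri, rj)) for rj in At] for ri in At]
    Aty = [sum(a * yi for a, yi in zip(row, ys)) for row in At]
    return solve(AtA, Aty)


# Hypothetical samples of y = 1 + t^2 at four distinct points (m = 4, n = 3).
ts = [0, 1, 2, 3]
ys = [1, 2, 5, 10]

alpha = polyfit_ls(ts, ys, 3)
print("alpha =", alpha)   # coefficients ordered alpha_0, alpha_1, alpha_2
```

The sample points must contain at least $n$ distinct values of $t$ for the Vandermonde matrix to have full column rank. Vandermonde matrices become badly conditioned as the degree grows, so high-degree fits in floating point usually call for a different basis or an orthogonalization-based solver.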

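Least squares polynomial fitting (cont'd)
Estimate the relationship between the range (position) and the height of a missile.
[Data table with columns Position, Height; numerical entries not recoverable.]
$A$ and $y$ are formed from the data as before, and the fit is a quadratic $f(t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2$ [numerical coefficients not recoverable].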

Applications
Thin plate spline: model/morph non-rigid motion
