Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic
Unit II: Numerical Linear Algebra
Chapter II.3: QR Factorization, SVD
QR Factorization
A square matrix $Q \in \mathbb{R}^{n \times n}$ is called orthogonal if its columns and rows are orthonormal vectors. Equivalently, $Q^T Q = Q Q^T = I$.

Orthogonal matrices preserve the Euclidean norm of a vector, i.e.
$$\|Qv\|_2^2 = v^T Q^T Q v = v^T v = \|v\|_2^2$$
Hence, geometrically, we picture orthogonal matrices as reflection or rotation operators.

Orthogonal matrices are very important in scientific computing, since norm-preservation implies $\mathrm{cond}(Q) = 1$.
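A quick Matlab sanity check of these two facts (the test matrix is arbitrary; taking the QR factorization of a random matrix is just a convenient way to generate an orthogonal $Q$):

    [Q,~] = qr(rand(4));    % random 4x4 orthogonal matrix
    v = rand(4,1);
    norm(Q*v) - norm(v)     % ~0: Euclidean norm is preserved
    cond(Q)                 % 1, up to rounding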
A matrix $A \in \mathbb{R}^{m \times n}$, $m \geq n$, can be factorized as
$$A = QR$$
where
- $Q \in \mathbb{R}^{m \times m}$ is orthogonal
- $R = \begin{bmatrix} \hat{R} \\ 0 \end{bmatrix} \in \mathbb{R}^{m \times n}$
- $\hat{R} \in \mathbb{R}^{n \times n}$ is upper triangular

As we indicated earlier, QR is very good for solving overdetermined linear least-squares problems, $Ax \simeq b$.¹

¹ QR can also be used to solve a square system $Ax = b$, but it requires about twice as many operations as Gaussian elimination, hence it is not the standard choice.
To see why, consider the 2-norm of the least-squares residual:
$$\|r(x)\|_2^2 = \|b - Ax\|_2^2 = \left\| b - Q \begin{bmatrix} \hat{R} \\ 0 \end{bmatrix} x \right\|_2^2 = \left\| Q^T \left( b - Q \begin{bmatrix} \hat{R} \\ 0 \end{bmatrix} x \right) \right\|_2^2 = \left\| Q^T b - \begin{bmatrix} \hat{R} \\ 0 \end{bmatrix} x \right\|_2^2$$
(We used the fact that $\|Q^T z\|_2 = \|z\|_2$ in the second step.)
Then, let $Q^T b = [c_1, c_2]^T$ where $c_1 \in \mathbb{R}^n$, $c_2 \in \mathbb{R}^{m-n}$, so that
$$\|r(x)\|_2^2 = \|c_1 - \hat{R}x\|_2^2 + \|c_2\|_2^2$$
Question: Based on this expression, how do we minimize $\|r(x)\|_2$?
Answer: We don't have any control over the second term, $\|c_2\|_2^2$, since it doesn't contain $x$. Hence we minimize $\|r(x)\|_2^2$ by making the first term zero. That is, we solve the $n \times n$ triangular system $\hat{R}x = c_1$ (this is what Matlab does when we use backslash for least squares).

Also, this tells us that $\min_{x \in \mathbb{R}^n} \|r(x)\|_2 = \|c_2\|_2$.
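The following Matlab sketch walks through exactly this procedure on a made-up overdetermined system (the data below is arbitrary, purely for illustration):

    m = 100; n = 3;
    t = linspace(0,1,m)';
    A = [ones(m,1), t, t.^2];      % made-up design matrix
    b = cos(4*t);                  % made-up right-hand side
    [Q,R] = qr(A);                 % full QR factorization
    c = Q'*b;
    x = R(1:n,1:n)\c(1:n);         % solve Rhat*x = c1 by back substitution
    norm(A*x-b) - norm(c(n+1:m))   % ~0: residual norm equals ||c2||_2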
Recall that solving linear least-squares via the normal equations requires solving a system with the matrix $A^T A$. But using the normal equations directly is problematic since²
$$\mathrm{cond}(A^T A) = \mathrm{cond}(A)^2$$
The QR approach avoids this condition-number-squaring effect and is much more numerically stable!

² This can be shown via the SVD.
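The squaring effect is easy to observe in Matlab (the test matrix below is arbitrary; any moderately ill-conditioned matrix shows the same behavior):

    x = linspace(0,1,10)';
    A = x.^(0:4);          % 10x5 Vandermonde-type matrix, mildly ill-conditioned
    cond(A)                % some value kappa
    cond(A'*A)             % ~kappa^2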
How do we compute the QR factorization? There are three main methods:
- Gram-Schmidt orthogonalization
- Householder triangularization
- Givens rotations

We will cover Gram-Schmidt and Householder in class.
Suppose $A \in \mathbb{R}^{m \times n}$, $m \geq n$. One way to picture the QR factorization is to construct a sequence of orthonormal vectors $q_1, q_2, \ldots$ such that
$$\mathrm{span}\{q_1, q_2, \ldots, q_j\} = \mathrm{span}\{a_{(:,1)}, a_{(:,2)}, \ldots, a_{(:,j)}\}, \quad j = 1, \ldots, n$$
We seek coefficients $r_{ij}$ such that
$$a_{(:,1)} = r_{11} q_1,$$
$$a_{(:,2)} = r_{12} q_1 + r_{22} q_2,$$
$$\vdots$$
$$a_{(:,n)} = r_{1n} q_1 + r_{2n} q_2 + \cdots + r_{nn} q_n.$$
This can be done via the Gram-Schmidt process, as we'll discuss shortly.
In matrix form we have:
$$\begin{bmatrix} a_{(:,1)} & a_{(:,2)} & \cdots & a_{(:,n)} \end{bmatrix} = \begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ & r_{22} & & r_{2n} \\ & & \ddots & \vdots \\ & & & r_{nn} \end{bmatrix}$$
This gives $A = \hat{Q}\hat{R}$ for $\hat{Q} \in \mathbb{R}^{m \times n}$, $\hat{R} \in \mathbb{R}^{n \times n}$.

This is called the reduced QR factorization of $A$, which is slightly different from the definition we gave earlier. Note that for $m > n$, $\hat{Q}^T \hat{Q} = I$, but $\hat{Q}\hat{Q}^T \neq I$ (the latter is why the full QR is sometimes nice).
Full vs Reduced QR Factorization

The full QR factorization (defined earlier) $A = QR$ is obtained by appending $m - n$ arbitrary orthonormal columns to $\hat{Q}$ to make it an $m \times m$ orthogonal matrix.

We also need to append rows of zeros to $\hat{R}$ to "silence" the last $m - n$ columns of $Q$, to obtain $R = \begin{bmatrix} \hat{R} \\ 0 \end{bmatrix}$.
[Figure: block diagrams comparing the full QR factorization $A = QR$ and the reduced QR factorization $A = \hat{Q}\hat{R}$]
Exercise: Show that the linear least-squares solution is given by $\hat{R}x = \hat{Q}^T b$ by plugging $A = \hat{Q}\hat{R}$ into the normal equations. This is equivalent to the least-squares result we showed earlier using the full QR factorization, since $c_1 = \hat{Q}^T b$.
In Matlab, qr gives the full QR factorization by default:

    >> A = rand(5,3);
    >> [Q,R] = qr(A)

(Here Q is 5x5 and R is 5x3; the numerical entries are omitted.)
In Matlab, qr(A,0) gives the reduced QR factorization:

    >> [Q,R] = qr(A,0)

(Here Q is 5x3 and R is 3x3; the numerical entries are omitted.)
Gram-Schmidt Orthogonalization

Question: How do we use the Gram-Schmidt process to compute the $q_i$, $i = 1, \ldots, n$?

Answer: In step $j$, find a unit vector $q_j \in \mathrm{span}\{a_{(:,1)}, a_{(:,2)}, \ldots, a_{(:,j)}\}$ orthogonal to $\mathrm{span}\{q_1, q_2, \ldots, q_{j-1}\}$. To do this, we set
$$v_j \equiv a_{(:,j)} - (q_1^T a_{(:,j)}) q_1 - \cdots - (q_{j-1}^T a_{(:,j)}) q_{j-1},$$
and then $q_j \equiv v_j / \|v_j\|_2$ satisfies our requirements.

This gives the matrix $\hat{Q}$, and next we can determine the values of $r_{ij}$ such that $A = \hat{Q}\hat{R}$.
We write the set of equations for the $q_i$ as
$$q_1 = \frac{a_{(:,1)}}{r_{11}}, \quad q_2 = \frac{a_{(:,2)} - r_{12} q_1}{r_{22}}, \quad \ldots, \quad q_n = \frac{a_{(:,n)} - \sum_{i=1}^{n-1} r_{in} q_i}{r_{nn}}.$$
Then from the definition of $q_j$, we see that
$$r_{ij} = q_i^T a_{(:,j)}, \quad i < j$$
$$r_{jj} = \Big\| a_{(:,j)} - \sum_{i=1}^{j-1} r_{ij} q_i \Big\|_2$$
The sign of $r_{jj}$ is not determined uniquely, e.g. we could choose $r_{jj} > 0$ for each $j$.
Classical Gram-Schmidt Process

The Gram-Schmidt algorithm we have described is provided in the pseudocode below (a direct Matlab translation follows):

1: for j = 1 : n do
2:     v_j = a_(:,j)
3:     for i = 1 : j-1 do
4:         r_ij = q_i^T a_(:,j)
5:         v_j = v_j - r_ij q_i
6:     end for
7:     r_jj = ||v_j||_2
8:     q_j = v_j / r_jj
9: end for

This is referred to as the classical Gram-Schmidt (CGS) method.
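Here is a minimal Matlab implementation of CGS (the function name is ours; there is no pivoting or rank check, so it assumes $A$ has full column rank):

    function [Q,R] = cgs(A)
    % Classical Gram-Schmidt: reduced QR factorization A = Q*R
    [m,n] = size(A);
    Q = zeros(m,n); R = zeros(n,n);
    for j = 1:n
        v = A(:,j);
        for i = 1:j-1
            R(i,j) = Q(:,i)'*A(:,j);   % coefficient against the ORIGINAL column
            v = v - R(i,j)*Q(:,i);
        end
        R(j,j) = norm(v);
        Q(:,j) = v/R(j,j);
    end
    end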
The only way the Gram-Schmidt process can fail is if $r_{jj} = \|v_j\|_2 = 0$ for some $j$. This can only happen if $a_{(:,j)} = \sum_{i=1}^{j-1} r_{ij} q_i$ for some $j$, i.e. if
$$a_{(:,j)} \in \mathrm{span}\{q_1, q_2, \ldots, q_{j-1}\} = \mathrm{span}\{a_{(:,1)}, a_{(:,2)}, \ldots, a_{(:,j-1)}\}$$
This means that the columns of $A$ are linearly dependent. Therefore: Gram-Schmidt fails $\implies$ columns of $A$ linearly dependent.
Equivalently, by the contrapositive: columns of $A$ linearly independent $\implies$ Gram-Schmidt succeeds.

Theorem: Every $A \in \mathbb{R}^{m \times n}$ ($m \geq n$) of full rank has a unique reduced QR factorization $A = \hat{Q}\hat{R}$ with $r_{ii} > 0$.

The only non-uniqueness in the Gram-Schmidt process was in the sign of $r_{ii}$, hence $\hat{Q}\hat{R}$ is unique if $r_{ii} > 0$.
Theorem: Every $A \in \mathbb{R}^{m \times n}$ ($m \geq n$) has a full QR factorization.

Case 1: $A$ has full rank. We compute the reduced QR factorization from above. To make $Q$ square we pad $\hat{Q}$ with $m - n$ arbitrary orthonormal columns. We also pad $\hat{R}$ with $m - n$ rows of zeros to get $R$.

Case 2: $A$ doesn't have full rank. At some point in computing the reduced QR factorization, we encounter $\|v_j\|_2 = 0$. At this point we pick an arbitrary $q_j$ orthogonal to $\mathrm{span}\{q_1, q_2, \ldots, q_{j-1}\}$ and then proceed as in Case 1.
Modified Gram-Schmidt Process

The classical Gram-Schmidt process is numerically unstable! (It is sensitive to rounding error: the orthogonality of the $q_j$ degrades.)

The algorithm can be reformulated to give the modified Gram-Schmidt process, which is numerically more robust.

Key idea: when each new $q_j$ is computed, orthogonalize each remaining column of $A$ against it.
Modified Gram-Schmidt (MGS), with a Matlab sketch below:

1:  for i = 1 : n do
2:      v_i = a_(:,i)
3:  end for
4:  for i = 1 : n do
5:      r_ii = ||v_i||_2
6:      q_i = v_i / r_ii
7:      for j = i+1 : n do
8:          r_ij = q_i^T v_j
9:          v_j = v_j - r_ij q_i
10:     end for
11: end for
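A matching Matlab implementation (again, the function name is ours):

    function [Q,R] = mgs(A)
    % Modified Gram-Schmidt: reduced QR factorization A = Q*R
    [m,n] = size(A);
    V = A; Q = zeros(m,n); R = zeros(n,n);
    for i = 1:n
        R(i,i) = norm(V(:,i));
        Q(:,i) = V(:,i)/R(i,i);
        for j = i+1:n
            R(i,j) = Q(:,i)'*V(:,j);   % coefficient against the UPDATED column
            V(:,j) = V(:,j) - R(i,j)*Q(:,i);
        end
    end
    end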
MGS is just a rearrangement of CGS: in MGS the outer loop is over rows (of $\hat{R}$), whereas in CGS the outer loop is over columns. In exact arithmetic, MGS and CGS are identical.

MGS is better numerically since when we compute $r_{ij} = q_i^T v_j$, components of $q_1, \ldots, q_{i-1}$ have already been removed from $v_j$. Whereas in CGS we compute $r_{ij} = q_i^T a_{(:,j)}$, which involves some extra contamination from components of $q_1, \ldots, q_{i-1}$ in $a_{(:,j)}$.³

³ The $q_i$'s are not exactly orthogonal in finite-precision arithmetic.
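A small experiment makes the difference visible (this uses the cgs and mgs sketches above; hilb builds a Hilbert matrix, a standard ill-conditioned test case):

    A = hilb(10);
    [Qc,~] = cgs(A);
    [Qm,~] = mgs(A);
    norm(Qc'*Qc - eye(10))   % orthogonality badly degraded for CGS
    norm(Qm'*Qm - eye(10))   % several orders of magnitude smaller for MGS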
Operation Count

Work in MGS is dominated by lines 8 and 9, the innermost loop:
$$r_{ij} = q_i^T v_j, \qquad v_j = v_j - r_{ij} q_i$$
The first line requires $m$ multiplications and $m-1$ additions; the second line requires $m$ multiplications and $m$ subtractions. Hence $\sim 4m$ operations per single inner iteration.

Hence the total number of operations is asymptotic to
$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} 4m \sim 4m \sum_{i=1}^{n} i \sim 2mn^2$$
Householder Triangularization

The other main method for computing QR factorizations is Householder⁴ triangularization.

The Householder algorithm is more numerically stable and more efficient than Gram-Schmidt. But Gram-Schmidt allows us to build up an orthogonal basis for the successive spaces spanned by columns of $A$,
$$\mathrm{span}\{a_{(:,1)}\}, \ \mathrm{span}\{a_{(:,1)}, a_{(:,2)}\}, \ \ldots$$
which can be important in some cases.

⁴ Alston Householder (1904-1993), American numerical analyst.
Householder idea: Apply a succession of orthogonal matrices $Q_k \in \mathbb{R}^{m \times m}$ to $A$ to compute an upper triangular matrix $R$:
$$Q_n \cdots Q_2 Q_1 A = R$$
Hence we obtain the full QR factorization $A = QR$, where $Q \equiv Q_1^T Q_2^T \cdots Q_n^T$.
In 1958, Householder proposed a way to choose $Q_k$ to introduce zeros below the diagonal in column $k$ while preserving the previously introduced zeros:
$$A \to Q_1 A \to Q_2 Q_1 A \to Q_3 Q_2 Q_1 A \to \cdots$$
[Diagram: sparsity patterns of $A$, $Q_1 A$, $Q_2 Q_1 A$, $Q_3 Q_2 Q_1 A$, with zeros filling in below the diagonal one column at a time]

This is achieved by Householder reflectors.
Householder Reflectors

We choose
$$Q_k = \begin{bmatrix} I & 0 \\ 0 & F \end{bmatrix}$$
where
- $I \in \mathbb{R}^{(k-1) \times (k-1)}$
- $F \in \mathbb{R}^{(m-k+1) \times (m-k+1)}$ is a Householder reflector

The $I$ block ensures the first $k-1$ rows are unchanged. $F$ is an orthogonal matrix that operates on the bottom $m-k+1$ rows. (Note that $F$ orthogonal $\implies Q_k$ orthogonal.)
Let $x \in \mathbb{R}^{m-k+1}$ denote entries $k, \ldots, m$ of the $k$th column. We have two requirements for $F$:
1. $F$ is orthogonal, so we must have $\|Fx\|_2 = \|x\|_2$
2. Only the first entry of $Fx$ should be non-zero

Hence we must have
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{m-k+1} \end{bmatrix} \ \longrightarrow \ Fx = \begin{bmatrix} \pm\|x\|_2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \pm\|x\|_2 e_1$$
Question: How can we achieve this?
We can see geometrically that this can be achieved via reflection across a subspace $H$:

[Figure: reflection of $x$ across the hyperplane $H$ onto $\|x\|_2 e_1$]

Here $H$ is the subspace orthogonal to $v \equiv \|x\|_2 e_1 - x$, and the key point is that $H$ bisects $v$.
We see that this bisection property holds because $x$ and $Fx$ both live on the hypersphere centered at the origin with radius $\|x\|_2$.

[Figure: $x$ and $Fx = \|x\|_2 e_1$ on the hypersphere of radius $\|x\|_2$]
Next, we need to determine the matrix $F$ which maps $x$ to $\|x\|_2 e_1$.

$F$ is closely related to the orthogonal projection of $x$ onto $H$, since that projection takes us half way from $x$ to $\|x\|_2 e_1$. Hence we first consider orthogonal projection onto $H$, and subsequently derive $F$.

Recall that the orthogonal projection of $a$ onto $b$ is given by $\frac{(a \cdot b)}{\|b\|_2^2} b$, or equivalently $\frac{bb^T}{b^T b} a$. Hence we can see that $\frac{bb^T}{b^T b}$ is a rank-1 projection matrix.
Let $P_H \equiv I - \frac{vv^T}{v^T v}$. Then $P_H x$ gives $x$ minus the orthogonal projection of $x$ onto $v$.

We can show that $P_H x$ is in fact the orthogonal projection of $x$ onto $H$, since it satisfies:
- $P_H x \in H$: $\ v^T P_H x = v^T x - v^T \frac{vv^T}{v^T v} x = v^T x - \frac{v^T v}{v^T v} v^T x = 0$
- The projection error $x - P_H x$ is orthogonal to $H$: $\ x - P_H x = x - x + \frac{vv^T}{v^T v} x = \frac{v^T x}{v^T v} v$, which is parallel to $v$
But recall that $F$ should reflect across $H$ rather than project onto $H$. Hence we obtain $F$ by going twice as far in the direction of $v$ compared to $P_H$, i.e.
$$F = I - 2\frac{vv^T}{v^T v}$$
Exercise: Show that $F$ is an orthogonal matrix, i.e. that $F^T F = I$.
But in fact, we can see that there are two Householder reflectors that we can choose from:⁵

[Figure: the two candidate reflections of $x$, onto $\|x\|_2 e_1$ and onto $-\|x\|_2 e_1$]

⁵ This picture is not to scale; each hyperplane should bisect the corresponding vector $\pm\|x\|_2 e_1 - x$.
In practice, it's important to choose the better of the two reflectors. If $x$ and $\|x\|_2 e_1$ (or $x$ and $-\|x\|_2 e_1$) are close, we could obtain loss of precision due to cancellation (cf. Unit 0) when computing $v$.

To ensure $x$ and its reflection are well separated, we should choose the reflection to be $-\mathrm{sign}(x_1)\|x\|_2 e_1$ (more details on the next slide). Therefore, we want $v$ to be $v = -\mathrm{sign}(x_1)\|x\|_2 e_1 - x$. But note that the sign of $v$ does not affect $F$, hence we scale $v$ by $-1$ to get an equivalent formula "without minus signs":
$$v = \mathrm{sign}(x_1)\|x\|_2 e_1 + x$$
Let's compare the two options for $v$ in the potentially problematic case when $x \approx \pm\|x\|_2 e_1$, i.e. when $|x_1| \approx \|x\|_2$:
$$v^{\mathrm{bad}} \equiv -\mathrm{sign}(x_1)\|x\|_2 e_1 + x, \qquad v^{\mathrm{good}} \equiv \mathrm{sign}(x_1)\|x\|_2 e_1 + x$$
$$\|v^{\mathrm{bad}}\|_2^2 = \left(-\mathrm{sign}(x_1)\|x\|_2 + x_1\right)^2 + \|x_{2:(m-k+1)}\|_2^2 = \left(-\mathrm{sign}(x_1)\|x\|_2 + \mathrm{sign}(x_1)|x_1|\right)^2 + \|x_{2:(m-k+1)}\|_2^2$$
$$\|v^{\mathrm{good}}\|_2^2 = \left(\mathrm{sign}(x_1)\|x\|_2 + x_1\right)^2 + \|x_{2:(m-k+1)}\|_2^2 = \left(\mathrm{sign}(x_1)\|x\|_2 + \mathrm{sign}(x_1)|x_1|\right)^2 + \|x_{2:(m-k+1)}\|_2^2 \approx \left(2\,\mathrm{sign}(x_1)\|x\|_2\right)^2$$
In the first case the leading term is a difference of two nearly equal quantities (since $|x_1| \approx \|x\|_2$), while in the second it is their sum.
Recall that $v$ is computed from two vectors of magnitude $\|x\|_2$. The argument above shows that with $v^{\mathrm{bad}}$ we can get $\|v\|_2 \ll \|x\|_2$, so loss of precision due to cancellation is possible. In contrast, with $v^{\mathrm{good}}$ we always have $\|v^{\mathrm{good}}\|_2 \geq \|x\|_2$, which rules out loss of precision due to cancellation.
Householder Triangularization

We can now write out the Householder algorithm (a Matlab sketch follows below):

1: for k = 1 : n do
2:     x = a_(k:m,k)
3:     v_k = sign(x_1)||x||_2 e_1 + x
4:     v_k = v_k / ||v_k||_2
5:     a_(k:m,k:n) = a_(k:m,k:n) - 2 v_k (v_k^T a_(k:m,k:n))
6: end for

This replaces $A$ with $R$ and stores $v_1, \ldots, v_n$. Note that we don't divide by $v_k^T v_k$ in line 5 (as in the formula for $F$) since we normalize $v_k$ in line 4.

The Householder algorithm requires $2mn^2 - \frac{2}{3}n^3$ operations.⁶

⁶ Compared to $2mn^2$ for Gram-Schmidt.
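A minimal Matlab translation (the function names are ours; the sign0 helper treats sign(0) as +1, since Matlab's sign(0) = 0 would produce a zero reflector):

    function [V,R] = house_qr(A)
    % Householder triangularization: A is overwritten by R; the
    % Householder vectors v_k are returned as the columns of V.
    [m,n] = size(A);
    V = zeros(m,n);
    for k = 1:n
        x = A(k:m,k);
        v = x;
        v(1) = v(1) + sign0(x(1))*norm(x);
        v = v/norm(v);
        A(k:m,k:n) = A(k:m,k:n) - 2*v*(v'*A(k:m,k:n));
        V(k:m,k) = v;
    end
    R = triu(A);
    end

    function s = sign0(a)
    if a >= 0, s = 1; else, s = -1; end
    end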
Note that we don't explicitly form $Q$. We can use the vectors $v_1, \ldots, v_n$ to compute $Q$ in a post-processing step.

Recall that $Q_k = \begin{bmatrix} I & 0 \\ 0 & F \end{bmatrix}$ and $Q \equiv (Q_n \cdots Q_2 Q_1)^T = Q_1^T Q_2^T \cdots Q_n^T$. Also, the Householder reflectors are symmetric (refer to the definition of $F$), so
$$Q = Q_1^T Q_2^T \cdots Q_n^T = Q_1 Q_2 \cdots Q_n$$
Hence, we can evaluate $Qx = Q_1 Q_2 \cdots Q_n x$ using the $v_k$:

1: for k = n : -1 : 1 do
2:     x_(k:m) = x_(k:m) - 2 v_k (v_k^T x_(k:m))
3: end for

Question: How can we use this to form the matrix $Q$?
Answer: Compute $Q$ via $Q e_i$, $i = 1, \ldots, m$. Similarly, compute $\hat{Q}$ (the reduced QR factor) via $Q e_i$, $i = 1, \ldots, n$.

However, it is often not necessary to form $Q$ or $\hat{Q}$ explicitly; e.g. to solve $Ax \simeq b$ we only need the product $Q^T b$. Note that the product $Q^T b = Q_n \cdots Q_2 Q_1 b$ can be evaluated as (see the sketch below):

1: for k = 1 : n do
2:     b_(k:m) = b_(k:m) - 2 v_k (v_k^T b_(k:m))
3: end for
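Putting the pieces together, here is a least-squares solve using the house_qr sketch above (the test problem is arbitrary):

    A = rand(6,3); b = rand(6,1);
    [m,n] = size(A);
    [V,R] = house_qr(A);
    c = b;
    for k = 1:n
        c(k:m) = c(k:m) - 2*V(k:m,k)*(V(k:m,k)'*c(k:m));   % c = Q_k ... Q_1 b
    end
    x = R(1:n,1:n)\c(1:n);   % solve Rhat*x = c1
    norm(x - A\b)            % ~0: matches Matlab's least-squares solution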
Singular Value Decomposition (SVD)
The singular value decomposition (SVD) is a very useful matrix factorization.

Motivation for the SVD: the image of the unit sphere, $S$, under any $m \times n$ matrix is a hyperellipse. A hyperellipse is obtained by stretching the unit sphere in $\mathbb{R}^m$ by factors $\sigma_1, \ldots, \sigma_m$ in orthogonal directions $u_1, \ldots, u_m$.
For $A \in \mathbb{R}^{2 \times 2}$, we have:

[Figure: the unit circle $S$ mapped by $A$ to an ellipse with principal semiaxes $\sigma_1 u_1$ and $\sigma_2 u_2$]
Based on this picture, we make some definitions:
- Singular values: $\sigma_1, \sigma_2, \ldots, \sigma_n \geq 0$ (we typically assume $\sigma_1 \geq \sigma_2 \geq \cdots$)
- Left singular vectors: $\{u_1, u_2, \ldots, u_n\}$, unit vectors in the directions of the principal semiaxes of $AS$
- Right singular vectors: $\{v_1, v_2, \ldots, v_n\}$, preimages of the $u_i$ so that $Av_i = \sigma_i u_i$, $i = 1, \ldots, n$

(The names "left" and "right" come from the formula for the SVD below.)
The key equation above is $Av_i = \sigma_i u_i$, $i = 1, \ldots, n$. Writing this out in matrix form we get
$$A \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \end{bmatrix}$$
Or more compactly: $AV = \hat{U}\hat{\Sigma}$
Here
- $\hat{\Sigma} \in \mathbb{R}^{n \times n}$ is diagonal with non-negative, real entries
- $\hat{U} \in \mathbb{R}^{m \times n}$ has orthonormal columns
- $V \in \mathbb{R}^{n \times n}$ has orthonormal columns

Therefore $V$ is an orthogonal matrix ($V^T V = V V^T = I$), so that we have the reduced SVD for $A \in \mathbb{R}^{m \times n}$:
$$A = \hat{U}\hat{\Sigma}V^T$$
Just as with QR, we can pad the columns of $\hat{U}$ with $m - n$ arbitrary orthonormal vectors to obtain $U \in \mathbb{R}^{m \times m}$. We then need to "silence" these arbitrary columns by adding rows of zeros to $\hat{\Sigma}$ to obtain $\Sigma$. This gives the full SVD for $A \in \mathbb{R}^{m \times n}$:
$$A = U \Sigma V^T$$
Full vs Reduced SVD

[Figure: block diagrams comparing the full SVD $A = U\Sigma V^T$ and the reduced SVD $A = \hat{U}\hat{\Sigma}V^T$]
Theorem: Every matrix $A \in \mathbb{R}^{m \times n}$ has a full singular value decomposition. Furthermore:
- The $\sigma_j$ are uniquely determined
- If $A$ is square and the $\sigma_j$ are distinct, the $\{u_j\}$ and $\{v_j\}$ are uniquely determined up to sign
This theorem justifies the statement that the image of the unit sphere under any $m \times n$ matrix is a hyperellipse. Consider $A = U\Sigma V^T$ (full SVD) applied to the unit sphere, $S$, in $\mathbb{R}^n$:
1. The orthogonal map $V^T$ preserves $S$
2. $\Sigma$ stretches $S$ into a hyperellipse aligned with the canonical axes $e_j$
3. $U$ rotates or reflects the hyperellipse without changing its shape
SVD in Matlab

Matlab's svd function computes the full SVD of a matrix:

    >> A = rand(4,2);
    >> [U,S,V] = svd(A)

(Here U is 4x4, S is 4x2, and V is 2x2; the numerical entries are omitted.)
As with qr, svd(A,0) returns the reduced SVD:

    >> [U,S,V] = svd(A,0)
    >> norm(A - U*S*V')

(Here U is 4x2 and S, V are 2x2; the numerical entries are omitted. The final norm comes out on the order of machine precision, confirming that A = U*S*V' up to rounding.)
Matrix Properties via the SVD

The rank of $A$ is $r$, the number of nonzero singular values.⁷
Proof: In the full SVD $A = U\Sigma V^T$, $U$ and $V^T$ have full rank, hence it follows from linear algebra that $\mathrm{rank}(A) = \mathrm{rank}(\Sigma)$.

$\mathrm{image}(A) = \mathrm{span}\{u_1, \ldots, u_r\}$ and $\mathrm{null}(A) = \mathrm{span}\{v_{r+1}, \ldots, v_n\}$.
Proof: This follows from $A = U\Sigma V^T$ and
$$\mathrm{image}(\Sigma) = \mathrm{span}\{e_1, \ldots, e_r\} \subseteq \mathbb{R}^m, \qquad \mathrm{null}(\Sigma) = \mathrm{span}\{e_{r+1}, \ldots, e_n\} \subseteq \mathbb{R}^n$$

⁷ This also gives us a good way to define rank in finite precision: the number of singular values larger than some (small) tolerance.
$\|A\|_2 = \sigma_1$.
Proof: Recall that $\|A\|_2 \equiv \max_{\|v\|_2 = 1} \|Av\|_2$. Geometrically, we see that $\|Av\|_2$ is maximized if $v = v_1$, for which $Av = \sigma_1 u_1$.

The singular values of $A$ are the square roots of the eigenvalues of $A^T A$ or $AA^T$.
Proof (the $AA^T$ case is analogous):
$$A^T A = (U\Sigma V^T)^T (U\Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T,$$
hence $(A^T A)V = V(\Sigma^T \Sigma)$, or $(A^T A)v_{(:,j)} = \sigma_j^2 v_{(:,j)}$.
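This correspondence is easy to check numerically (the test matrix is arbitrary):

    A = rand(6,4);
    sqrt(sort(eig(A'*A),'descend'))   % square roots of eigenvalues of A'*A...
    svd(A)                            % ...match the singular values of A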
The pseudoinverse, $A^+$, can be defined more generally in terms of the SVD:
- Define the pseudoinverse of a scalar $\sigma$ to be $1/\sigma$ if $\sigma \neq 0$ and zero otherwise
- Define the pseudoinverse of a (possibly rectangular) diagonal matrix by transposing the matrix and taking the pseudoinverse of each entry
- The pseudoinverse of $A \in \mathbb{R}^{m \times n}$ is then defined as $A^+ = V \Sigma^+ U^T$

$A^+$ exists for any matrix $A$, and it captures our definitions of pseudoinverse from Unit I.
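A Matlab sketch of this construction, checked against the built-in pinv (the random test matrix is full rank with probability 1, so every singular value gets inverted):

    A = rand(5,3);
    [U,S,V] = svd(A,0);           % reduced SVD
    Splus = diag(1./diag(S));     % pseudoinverse of the diagonal factor
    Aplus = V*Splus*U';
    norm(Aplus - pinv(A))         % ~0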
We generalize the condition number to rectangular matrices via the definition $\kappa(A) = \|A\| \|A^+\|$.

We can use the SVD to compute the 2-norm condition number:
- $\|A\|_2 = \sigma_{\max}$
- the largest singular value of $A^+$ is $1/\sigma_{\min}$, so that $\|A^+\|_2 = 1/\sigma_{\min}$

Hence $\kappa(A) = \sigma_{\max}/\sigma_{\min}$.
These results indicate the importance of the SVD, both theoretically and as a computational tool.

Algorithms for calculating the SVD are an important topic in numerical linear algebra, but are outside the scope of this course. Computing the SVD requires $\sim 4mn^2 - \frac{4}{3}n^3$ operations. For more details on algorithms, see Trefethen & Bau, or Golub & Van Loan.
Low-Rank Approximation via the SVD

One of the most useful properties of the SVD is that it allows us to obtain an optimal low-rank approximation to $A$ (see lecture).

We can recast the SVD as
$$A = \sum_{j=1}^{r} \sigma_j u_j v_j^T$$
This follows from writing $\Sigma$ as a sum of $r$ matrices, $\Sigma_j$, where $\Sigma_j \equiv \mathrm{diag}(0, \ldots, 0, \sigma_j, 0, \ldots, 0)$.

Each $u_j v_j^T$ is a rank-one matrix: each column is a scaled version of $u_j$.
Theorem: For any $0 \leq \nu \leq r$, let $A_\nu \equiv \sum_{j=1}^{\nu} \sigma_j u_j v_j^T$; then
$$\|A - A_\nu\|_2 = \min_{B \in \mathbb{R}^{m \times n},\ \mathrm{rank}(B) \leq \nu} \|A - B\|_2 = \sigma_{\nu+1}$$
That is: $A_\nu$ gives us the closest rank-$\nu$ matrix to $A$, measured in the 2-norm. The error in $A_\nu$ is given by the first omitted singular value.
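We can verify the error formula numerically (arbitrary test matrix; here $\nu = 2$):

    A = rand(8,5);
    [U,S,V] = svd(A);
    nu = 2;
    Anu = U(:,1:nu)*S(1:nu,1:nu)*V(:,1:nu)';   % truncated SVD = best rank-nu approx
    norm(A - Anu) - S(nu+1,nu+1)               % ~0: 2-norm error equals sigma_{nu+1}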
A similar result holds in the Frobenius norm:
$$\|A - A_\nu\|_F = \min_{B \in \mathbb{R}^{m \times n},\ \mathrm{rank}(B) \leq \nu} \|A - B\|_F = \sqrt{\sigma_{\nu+1}^2 + \cdots + \sigma_r^2}$$
These theorems indicate that the SVD is an effective way to compress data encapsulated by a matrix!

If the singular values of $A$ decay rapidly, we can approximate $A$ with a few rank-one matrices (we only need to store $\sigma_j, u_j, v_j$ for $j = 1, \ldots, \nu$).

Matlab example: image compression via the SVD.
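A sketch of that image-compression example (the file name is hypothetical, and we assume a grayscale image so the data is a single matrix):

    X = double(imread('photo.png'));            % hypothetical grayscale image
    [U,S,V] = svd(X);
    nu = 20;                                    % keep 20 singular triplets
    Xnu = U(:,1:nu)*S(1:nu,1:nu)*V(:,1:nu)';
    imshow(uint8(Xnu));                         % compressed image
    % storage: nu*(m+n+1) numbers instead of m*n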