Why the QR Factorization can be more Accurate than the SVD
Slide 1: Why the QR Factorization can be more Accurate than the SVD

Leslie V. Foster, Department of Mathematics, San Jose State University, San Jose, CA
May 10, 2004
Slide 2: Problem

  Ax = b for A square,                          (1)
or
  min ‖b − Ax‖ for A m by n, m ≥ n,             (2)

where A is very ill-conditioned, assuming that b = b₀ + δb.

Goal: recover x₀, where A x₀ = b₀.

Applications: inverse problems, image reconstruction, computer-assisted tomography (CAT), backward heat equation, inverse scattering, ...
Slide 3: Regularization using low-rank approximations

Replace A with a lower-rank approximation Â and solve for the minimum-norm solution to

  min ‖b − Âx‖                                  (3)

Low-rank approximations can be constructed by:
- the SVD (Golub-65 and others), LAPACK's xgelsd
- UTV decompositions (Stewart-93, Mathias-93, Fierro-97, ...)
- rank-revealing QR factorizations (Foster-86, Chan-87, Hansen-90, Bischof-91, Chandrasekaran-94, ...)
- the pivoted QR algorithm (Golub-65, Lawson & Hanson-74), LAPACK's xgelsy
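As a concrete baseline, the truncated-SVD route to (3) can be sketched in a few lines of numpy. The function name and the test problem below are mine, not the talk's:

```python
import numpy as np

def tsvd_solve(A, b, r):
    """Minimum-norm solution of min ||b - A_r x||, where A_r is the
    rank-r truncated SVD of A (one of the low-rank methods listed)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the r largest singular values: A_r = U_r diag(s_r) V_r^T.
    return Vt[:r].T @ (U[:, :r].T @ b / s[:r])

# Sanity check: with full rank and consistent b, the true x is recovered.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
x_true = rng.standard_normal(3)
x = tsvd_solve(A, A @ x_true, r=3)
```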
Slide 4: Why is this Interesting?

There are a variety of commonly used matrix decompositions. In the full-rank case the advantages and disadvantages of the decompositions are relatively well known. In the rank-deficient case there are unanswered questions. For this case we provide results comparing the accuracy and efficiency of two important decompositions.
Slide 5: Truncated SVD and truncated QRP

Truncated SVD (assuming m = n):

  A = U_S D V_Sᵀ = ( Û_S  U_S0 ) ( D̂  0 ; 0  D₀ ) ( V̂_S  V_S0 )ᵀ

  Â = Û_S D̂ V̂_Sᵀ,   Â⁺ = V̂_S D̂⁻¹ Û_Sᵀ.   Let x_S = Â⁺ b.

Truncated QRP (assuming m = n):

  A = Q₁ R Pᵀ = ( Û  U₀ ) ( R̂  F ; 0  G ) Pᵀ,   Â = Û ( R̂  F ) Pᵀ.

Factor ( R̂  F ) = ( L  0 ) Q₂ᵀ, so

  Â = Û ( L  0 ) Q₂ᵀ Pᵀ = Û ( L  0 ) Vᵀ,  where  V = P Q₂ = ( V̂  V₀ ),

  Â⁺ = V̂ L⁻¹ Ûᵀ.   Let x_T = Â⁺ b.
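Both truncated solutions can be computed directly. A numpy/scipy sketch following the construction above (variable names approximate the slide's notation; the lower-triangular factor L is obtained as Tᵀ from a QR factorization of (R̂ F)ᵀ, and the function name is mine):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def tqrp_solve(A, b, r):
    """Minimum-norm solution from the rank-r truncated pivoted QR,
    x_T = Vhat L^{-1} Uhat^T b, as on the slide."""
    Q1, R, piv = qr(A, mode='economic', pivoting=True)  # A P = Q1 R
    RF = R[:r, :]                                       # ( Rhat  F ), r x n
    Q2, T = qr(RF.T, mode='economic')                   # (Rhat F)^T = Q2 T, so L = T^T
    y = solve_triangular(T, Q1[:, :r].T @ b, trans='T') # y = L^{-1} Uhat^T b
    x = np.zeros(A.shape[1])
    x[piv] = Q2 @ y                                     # x = P Q2 y, undoing pivoting
    return x

# Sanity check: full rank with consistent b recovers the true solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x_true = rng.standard_normal(4)
x = tqrp_solve(A, A @ x_true, r=4)
```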
Slide 6: Modification xgelsz of xgelsy

- G is not needed in the calculated solution x_T
- xgelsy uses a complete factorization of A; since G is not needed, it does not need to be factored

Properties of the modification:
- calculates the same numerical rank and essentially the same solution as xgelsy
- does not interfere with BLAS-3 calls in xgelsy
- O(mnr) flops for low-rank problems, much quicker (source of the observation: Golub and Van Loan, page 240)
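The two stock LAPACK routes being compared are both reachable from SciPy's `lstsq` via its `lapack_driver` switch; xgelsz itself is the talk's modification and is not in standard LAPACK or SciPy, so this sketch only contrasts the QRP route (gelsy) and the SVD route (gelsd) on an exactly rank-deficient system of my own construction:

```python
import numpy as np
from scipy.linalg import lstsq

# Build a 100 x 8 matrix of exact rank 4, then solve the same least
# squares problem with the pivoted-QR driver and the SVD driver.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 8))
b = rng.standard_normal(100)

x_qrp, _, rank_qrp, _ = lstsq(A, b, cond=1e-10, lapack_driver='gelsy')
x_svd, _, rank_svd, _ = lstsq(A, b, cond=1e-10, lapack_driver='gelsd')
```

Both drivers should detect numerical rank 4 and return the same minimum-norm least squares solution, up to rounding.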
Slide 7:

[Figure: time in seconds versus numerical rank for solving a 1600 by 1600 linear system, for DGELSZ (x), DGELSY (o) and DGELSD (+).]

Conclusion: QRP is faster than SVD, especially for low-rank problems.
Slide 8: Accuracy of the Regularized Solution

Suppose that x = Â⁺ b is the regularized solution to (1) and that x₀ is the underlying noiseless solution. Then

  x − x₀ = Â⁺ b − x₀ = (Â⁺A − I) x₀ + Â⁺ (δb)

and

  ‖x − x₀‖₂² = ‖(Â⁺A − I) x₀‖₂² + ‖Â⁺ (δb)‖₂².

The first term on the right-hand side is called the regularization error and the second term the perturbation error.
Slide 9:

  ‖x − x₀‖₂² = ‖(Â⁺A − I) x₀‖₂² + ‖Â⁺ (δb)‖₂².

- The regularization error decreases as the rank r increases
- The perturbation error increases as the rank r increases
- Need to choose the minimum error using a technique such as Generalized Cross Validation or the L-curve
- Our goal is to compare the minimum errors when using two different factorizations
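The trade-off between the two error terms can be seen numerically by sweeping the truncation rank of a TSVD solve. All construction details here (spectrum, noise level, solution model) are illustrative assumptions of mine, not the talk's:

```python
import numpy as np

# Ill-conditioned test problem with a geometrically decaying spectrum.
rng = np.random.default_rng(2)
n = 32
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 0.8 ** np.arange(n)                        # singular values 1, 0.8, 0.64, ...
A = U @ np.diag(s) @ V.T
x0 = V @ (np.sqrt(s) * rng.standard_normal(n)) # smooth, Picard-type true solution
b = A @ x0 + 1e-2 * rng.standard_normal(n)     # b = b0 + noise

# Error of the rank-r TSVD solution for every truncation rank r.
errors = []
for r in range(1, n + 1):
    x_r = V[:, :r] @ (U[:, :r].T @ b / s[:r])
    errors.append(np.linalg.norm(x_r - x0))
best = int(np.argmin(errors)) + 1              # rank minimizing the total error
```

Small ranks are dominated by regularization error, large ranks by amplified noise, so the minimum sits at an interior rank.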
Slide 10:

[Figure: ‖x − x₀‖/‖x₀‖ versus truncation index, i.e. versus the rank of the low-rank approximation, for the SVD (+) and QRP (x).]

Note that the minimum is smaller for QRP for this example. WHY?
Slide 11: Theorem 1

Let x_T be the solution to (3) calculated using the truncated QRP and x_S be the solution calculated using the truncated SVD. Then

  ‖x_T − x₀‖₂² = ‖x_S − x₀‖₂² + δb̃ᵀ M δb̃ + x̃₀ᵀ N x̃₀,   (4)

where

  Ũ = Uᵀ U_S = ( Ûᵀ Û_S   Ûᵀ U_S0 ; U₀ᵀ Û_S   U₀ᵀ U_S0 ) = ( Ũ₁₁  Ũ₁₂ ; Ũ₂₁  Ũ₂₂ ),

  Ṽ = Vᵀ V_S = ( V̂ᵀ V̂_S   V̂ᵀ V_S0 ; V₀ᵀ V̂_S   V₀ᵀ V_S0 ) = ( Ṽ₁₁  Ṽ₁₂ ; Ṽ₂₁  Ṽ₂₂ ).
Slide 12:

  M = ( −D̂⁻¹ Ṽ₂₁ᵀ Ṽ₂₁ D̂⁻¹   Ũ₁₁ᵀ L⁻ᵀ L⁻¹ Ũ₁₂ ; Ũ₁₂ᵀ L⁻ᵀ L⁻¹ Ũ₁₁   Ũ₁₂ᵀ L⁻ᵀ L⁻¹ Ũ₁₂ )

  N = ( Ṽ₂₁ᵀ Ṽ₂₁   Ṽ₂₁ᵀ Ṽ₂₂ ; Ṽ₂₂ᵀ Ṽ₂₁   −Ṽ₁₂ᵀ Ṽ₁₂ )

  δb̃ = U_Sᵀ δb   and   x̃₀ = V_Sᵀ x₀.
Slide 13: Meaning of ‖x_T − x₀‖₂² = ‖x_S − x₀‖₂² + δb̃ᵀ M δb̃ + x̃₀ᵀ N x̃₀

  M = ( −M₁₁  M₁₂ ; M₁₂ᵀ  M₂₂ ),   N = ( N₁₁  N₁₂ ; N₁₂ᵀ  −N₂₂ ),

where M₁₁, M₂₂, N₁₁, and N₂₂ are positive definite. M and N are definitely indefinite.

There will be cases where ‖x_T − x₀‖ is smaller than ‖x_S − x₀‖ and also cases where ‖x_S − x₀‖ is smaller than ‖x_T − x₀‖. HOW OFTEN?
Slide 14: Perturbation Error (N = 0, M ≠ 0), Large Gap in Singular Values

Theorem 2. Assume that N = 0 in

  ‖x_T − x₀‖₂² = ‖x_S − x₀‖₂² + δb̃ᵀ M δb̃ + x̃₀ᵀ N x̃₀,

that the QR factorization is rank revealing, and that the components of the noise δb come from Gaussian white noise (uncorrelated zero-mean Gaussian random variables with common variance). Then as the gap in the singular values, s_{r+1}/s_r, approaches 0, the probability that ‖x_T − x₀‖ is less than ‖x_S − x₀‖ approaches one-half.
Slide 15: Ideas in the proof of Theorem 2

Using a result of Bunch, Fierro and Hansen (95) we can show that as the gap decreases the diagonal blocks of

  M = ( −M₁₁  M₁₂ ; M₁₂ᵀ  M₂₂ )

become negligible, so that M ≈ M̄ = ( 0  M₁₂ ; M₁₂ᵀ  0 ).

M̄ has eigenvalues that come in + and − pairs of equal magnitude; therefore for white noise δb the distribution of δb̃ᵀ M δb̃ is nearly symmetric about 0.
Slide 16: Perturbation Error, No Gap in Singular Values (Extreme Case where A is Orthogonal)

Theorem 3. Assume that A is orthogonal, N is 0 in (4), and that the components of the noise δb come from Gaussian white noise. Then the probability that ‖x_T − x₀‖ is less than ‖x_S − x₀‖ is one-half.

Note: In this case the QR and SVD factorizations are not unique. The result is true for each choice of Q, R, U_S, D and V_S.
Slide 17: Perturbation Error Summary

For perturbation errors, in the case where there is a large gap in the singular values or where there is no gap in the singular values, then with our assumptions the probability that ‖x_T − x₀‖ is less than ‖x_S − x₀‖ is nearly one-half.

We can examine the cases in between experimentally. The (very simple) case of a rank-one approximation to a 2 by 2 system is informative. We will look at more realistic cases later.
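A small Monte Carlo sketch of this 2 by 2 experiment can be run directly. The setup details here (noise level 10⁻³, 2000 trials) are my illustrative choices, not the talk's much larger runs, so the percentage it produces is only expected to be near one-half:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

# Rank-one truncated QRP vs rank-one truncated SVD on random 2x2 systems
# with a large singular-value gap; count how often QRP wins.
rng = np.random.default_rng(3)
trials, wins = 2000, 0
for _ in range(trials):
    U, _ = np.linalg.qr(rng.standard_normal((2, 2)))
    V, _ = np.linalg.qr(rng.standard_normal((2, 2)))
    A = U @ np.diag([1.0, 0.01]) @ V.T            # singular values 1 and .01
    x0 = rng.standard_normal(2)
    b = A @ x0 + 1e-3 * rng.standard_normal(2)    # b = b0 + noise
    x_s = V[:, 0] * (U[:, 0] @ b)                 # rank-1 TSVD solution (s1 = 1)
    Q1, R, piv = qr(A, pivoting=True)             # rank-1 truncated QRP solution
    Q2, T = qr(R[:1, :].T, mode='economic')       # (Rhat F) = L Q2^T with L = T^T
    y = solve_triangular(T, Q1[:, :1].T @ b, trans='T')
    x_t = np.empty(2)
    x_t[piv] = Q2 @ y
    wins += np.linalg.norm(x_t - x0) < np.linalg.norm(x_s - x0)
frac = wins / trials                              # near one-half per the theory
```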
Slide 18: Perturbation Error

A is 2 by 2 with singular values 1 and .01 (large gap).

[Histogram of (‖x_T − x₀‖ − ‖x_S − x₀‖)/E(‖x_S − x₀‖) with summary statistics.]

QR better than SVD in 50.02% of the cases.
Slide 19: Perturbation Error

A is 2 by 2 with singular values 1 and 1 (no gap).

[Histogram of (‖x_T − x₀‖ − ‖x_S − x₀‖)/E(‖x_S − x₀‖) with summary statistics.]

QR better than SVD in 50.14% of the cases.
Slide 20: Perturbation Error

A is 2 by 2 with singular values 1 and .6 (small gap).

[Histogram of (‖x_T − x₀‖ − ‖x_S − x₀‖)/E(‖x_S − x₀‖) with summary statistics.]

QR better than SVD in 44.4% of the cases.
Slide 21: Regularization error term

Consider

  ‖x_T − x₀‖₂² = ‖x_S − x₀‖₂² + δb̃ᵀ M δb̃ + x̃₀ᵀ N x̃₀

with M = 0 and N ≠ 0. N in the term x̃₀ᵀ N x̃₀ is indefinite, as discussed earlier. Therefore x̃₀ᵀ N x̃₀ can be positive or negative depending on x₀. To examine how often, we will use a model for x₀ used by others.
Slide 22:

The class used by Bertero et al., 1980, and Neumaier, 1998, is

  x₀ = V_S D^p w   (5)

where w is governed by white noise.

Attractive properties: x₀ is a weighted combination of the singular vectors, x₀ is usually a smoothly varying solution, and for p > 0 the discrete Picard condition is true. The condition is that the components of U_Sᵀ b₀ should decrease faster than the singular values of A. p is called the relative decay rate in the Picard coefficients.
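Sampling this test-solution class is straightforward; a sketch with an assumed ill-conditioned spectrum (the spectrum and sizes are my choices, only the model x₀ = V_S D^p w is from the slide):

```python
import numpy as np

# Draw x0 from the Bertero/Neumaier class x0 = V_S D^p w, w white Gaussian
# noise, and verify the discrete Picard condition structure it induces.
rng = np.random.default_rng(4)
n, p = 64, 1.0
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, n)               # assumed ill-conditioned spectrum
A = U @ np.diag(s) @ V.T                # A = U_S D V_S^T by construction
w = rng.standard_normal(n)
x0 = V @ (s ** p * w)                   # x0 = V_S D^p w
b0 = A @ x0

# Picard coefficients: |u_k^T b0| = s_k^(p+1) |w_k|, which decays
# faster than s_k whenever p > 0.
picard = np.abs(U.T @ b0)
```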
Slide 23:

Using this model in ‖x_T − x₀‖₂² = ‖x_S − x₀‖₂² + δb̃ᵀ M δb̃ + x̃₀ᵀ N x̃₀, where x̃₀ = V_Sᵀ x₀, it follows that x̃₀ᵀ N x̃₀ = wᵀ N_p w with

  N_p = ( D̂^p Ṽ₂₁ᵀ Ṽ₂₁ D̂^p   D̂^p Ṽ₂₁ᵀ Ṽ₂₂ D₀^p ; D₀^p Ṽ₂₂ᵀ Ṽ₂₁ D̂^p   −D₀^p Ṽ₁₂ᵀ Ṽ₁₂ D₀^p ).
Slide 24: Regularization Error (M = 0, N ≠ 0), Large Gap in Singular Values

Theorem 4. Assume that M = 0 in

  ‖x_T − x₀‖₂² = ‖x_S − x₀‖₂² + δb̃ᵀ M δb̃ + x̃₀ᵀ N x̃₀,

that the QR factorization is rank revealing, and that x₀ satisfies x₀ = V_S D^p w with 0 ≤ p ≤ 1, where w follows white Gaussian noise. Then as the gap in the singular values, s_{r+1}/s_r, approaches 0, the probability that ‖x_T − x₀‖ is less than ‖x_S − x₀‖ approaches one-half.
Slide 25: Regularization Error, No Gap in Singular Values (Extreme Case where A is Orthogonal)

Theorem 5. Assume that A is orthogonal, M is 0 in (4), and that x₀ satisfies (5) where w follows white Gaussian noise. Then the probability that ‖x_T − x₀‖ is less than ‖x_S − x₀‖ is one-half.

Note: Theorem 5 has no restrictions on p; Theorem 4 has the condition 0 ≤ p ≤ 1. Values of p in the range 0 ≤ p ≤ 1 are common in practice. Also note that numerical experiments suggest that Theorem 4 is true for 0 ≤ p ≤ 2.
Slide 26: Regularization Error Summary

For regularization errors, if there is a large gap in the singular values or if there is no gap in the singular values, and assuming that p ≤ 2, then with our assumptions the probability that ‖x_T − x₀‖ is less than ‖x_S − x₀‖ is nearly one-half.

Numerical experiments with rank-one approximations to 2 by 2 systems are very similar to those presented earlier for perturbation errors, in the case that p ≤ 2.
Slide 27: Summary - Both Perturbation and Regularization Errors

In the case that M ≠ 0 and N ≠ 0, if there is a large gap in the singular values or if there is no gap in the singular values, then with our assumptions, including p ≤ 2 (which is common in practice), the probability that ‖x_T − x₀‖ is less than ‖x_S − x₀‖ is nearly one-half.
Slide 28: Numerical Experiments

We will illustrate the above results by using regularization to solve Ax = b for:
- 64 by 64 random matrices A with a variety of choices of singular values
- 64 by 64 matrices A from Hansen's Regularization Tools; examples in this collection have characteristic features of ill-posed problems
Slide 29: Samples of 64 by 64 random matrices with perturbation and regularization errors

- matrices A chosen using Per Christian Hansen's regutm
- singular values chosen according to specified distributions
- noise δb is Gaussian white noise
- x₀ chosen according to (5), where w follows white Gaussian noise
- noise-to-signal ratios set to seven values: .3, .1, .01, .001, .0001, 10⁻⁶, ...
- rank set to a specified value in some cases and chosen dynamically in other cases
- many trials for each case
Slide 30: Random 64 by 64 matrices, large gap in singular values

rank of approximation = 16, gap = 100, p = 1

[Histogram over 70000 trials of (‖x_T − x₀‖ − ‖x_S − x₀‖)/max(‖x_S − x₀‖, ‖x_T − x₀‖), together with the distribution of singular values.]

QR better than SVD in 49.9% of the cases.
Slide 31: Random 64 by 64 matrices, moderate gap in singular values

rank of approximation = 16, gap = 4, p = 1

[Histogram over 70000 trials of (‖x_T − x₀‖ − ‖x_S − x₀‖)/max(‖x_S − x₀‖, ‖x_T − x₀‖), together with the distribution of singular values.]

QR better than SVD in 41.3% of the cases.
Slide 32: Random 64 by 64 matrices, rank in a cluster of singular values

rank of approximation in a cluster of 10 singular values, p = 1

[Histogram over 70000 trials of (‖x_T − x₀‖ − ‖x_S − x₀‖)/max(‖x_S − x₀‖, ‖x_T − x₀‖), together with the distribution of singular values.]

QR better than SVD in 45% of the cases.
Slide 33: Random 64 by 64 matrices, singular values decreasing rapidly

rank chosen dynamically, mean gap = 10, p = 1

[Histogram over 70000 trials of (‖x_T − x₀‖ − ‖x_S − x₀‖)/max(‖x_S − x₀‖, ‖x_T − x₀‖), together with the distribution of singular values.]

QR better than SVD in 49.1% of the cases.
Slide 34: Random 64 by 64 matrices, moderate singular value decrease

rank chosen dynamically, mean gap = 2, p = 1

[Histogram over 70000 trials of (‖x_T − x₀‖ − ‖x_S − x₀‖)/max(‖x_S − x₀‖, ‖x_T − x₀‖), together with the distribution of singular values.]

QR better than SVD in 43% of the cases.
Slide 35: Caution

With probabilistic models of the accuracy of solutions to (1), one can skew the results in favor of either SVD or QRP solutions by a careful choice of the model.
Slide 36: Random 64 by 64 matrices, SVD better than QRP

rank chosen dynamically, mean gap = 2, p = 3, x₀ = V_S D^p w

[Histogram over 70000 trials of (‖x_T − x₀‖ − ‖x_S − x₀‖)/max(‖x_S − x₀‖, ‖x_T − x₀‖), together with the distribution of singular values.]

QR better than SVD in 14.7% of the cases.
Slide 37: Random 64 by 64 matrices, QRP better than SVD

rank chosen dynamically, mean gap = 2, p = 3, x₀ = V D^p w

[Histogram over 70000 trials of (‖x_T − x₀‖ − ‖x_S − x₀‖)/max(‖x_S − x₀‖, ‖x_T − x₀‖), together with the distribution of singular values.]

QR better than SVD in 85.4% of the cases.
Slide 38: Summary of numerical experiments with random matrices

The numerical experiments are consistent with the theory. If there is a large gap in the singular values, and if p, the decay rate of the Picard coefficients, is not large, then using the QRP is, on average, nearly as accurate as the SVD. For matrices with small gaps in the singular values the truncated SVD solutions may be, on average, better than truncated QRP solutions, but the difference may be modest. By selecting the model for the true solutions x₀, one can skew the results to favor the SVD or to favor the QRP.
Slide 39: Numerical Experiments for Examples from Hansen's Regularization Tools

- These examples have characteristic features of ill-posed problems
- A, x₀, b₀ not random
- Examples come from integral equations, numerical differentiation, inverse heat equations and inverse Laplace transforms
- Most examples do not have a gap in the singular values
- Noise δb chosen from white noise with seven noise-to-signal ratios: .3, .1, .01, .001, .0001, 10⁻⁶, ...

Slide 40: Examples from Hansen's Regularization Tools

A, x₀ and b₀ not random; 16 examples, 7 noise levels, 100 noise vectors in each case; rank chosen dynamically.

[Histogram over 10700 trials of (‖x_T − x₀‖ − ‖x_S − x₀‖)/max(‖x_S − x₀‖, ‖x_T − x₀‖).]

QR better than SVD in 50.5% of the cases.

Slide 41: Examples from Hansen's Regularization Tools

For each of 112 cases we calculated

  mean(‖x_T − x₀‖ − ‖x_S − x₀‖) / mean(‖x_S − x₀‖).

[Table: number (out of 112) of instances of this ratio in each of the ranges: less than −50%, −50% to −10%, −10% to −1%, −1% to 1%, 1% to 10%, 10% to 50%, 50% or more; the counts are not recoverable from the source.]
Slide 42: Summary of numerical experiments with matrices from Regularization Tools

Overall the truncated QRP appears to do better on these nonrandom examples than on the random examples. In some cases the truncated SVD solution is closer to the true solution, and in others the truncated QRP solution is. Overall the SVD and QRP have very similar accuracies for these examples.
Slide 43: Conclusions

Consider regularized solutions to Ax = b = b₀ + δb for an ill-conditioned matrix A.

1. For systems governed by statistical assumptions used by others, with reasonable parameter values, if the matrices have a gap in their singular values, then the truncated QRP is better than the truncated SVD approximately half the time.
2. For these systems, if there is not a large gap, truncated SVD solutions appear overall somewhat better than truncated QRP solutions, but the difference may be modest. The analysis in this case is not complete.
3. For problems from the Regularization Tools collection, truncated QRP solutions appear to be better close to half the time.
Slide 44: References

L. Foster, Solving Rank-Deficient and Ill-posed Problems using UTV and QR Factorizations, SIAM J. Matrix Anal. Appl. 25.

L. Foster and R. Kommu, An Efficient Algorithm for Solving Rank-Deficient Least Squares Problems, submitted to the ACM Transactions on Mathematical Software.
More informationarxiv: v1 [math.na] 23 Nov 2018
Randomized QLP algorithm and error analysis Nianci Wu a, Hua Xiang a, a School of Mathematics and Statistics, Wuhan University, Wuhan 43007, PR China. arxiv:1811.09334v1 [math.na] 3 Nov 018 Abstract In
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationLinear Algebra Review
Linear Algebra Review CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Linear Algebra Review 1 / 16 Midterm Exam Tuesday Feb
More informationSIMPLIFIED GSVD COMPUTATIONS FOR THE SOLUTION OF LINEAR DISCRETE ILL-POSED PROBLEMS (1.1) x R Rm n, x R n, b R m, m n,
SIMPLIFIED GSVD COMPUTATIONS FOR THE SOLUTION OF LINEAR DISCRETE ILL-POSED PROBLEMS L. DYKES AND L. REICHEL Abstract. The generalized singular value decomposition (GSVD) often is used to solve Tikhonov
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview & Matrix-Vector Multiplication Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 20 Outline 1 Course
More informationPDE Model Reduction Using SVD
http://people.sc.fsu.edu/ jburkardt/presentations/fsu 2006.pdf... John 1 Max Gunzburger 2 Hyung-Chun Lee 3 1 School of Computational Science Florida State University 2 Department of Mathematics and School
More informationNumerical Methods I: Eigenvalues and eigenvectors
1/25 Numerical Methods I: Eigenvalues and eigenvectors Georg Stadler Courant Institute, NYU stadler@cims.nyu.edu November 2, 2017 Overview 2/25 Conditioning Eigenvalues and eigenvectors How hard are they
More informationQR Factorization Based Blind Channel Identification and Equalization with Second-Order Statistics
60 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 48, NO 1, JANUARY 2000 QR Factorization Based Blind Channel Identification and Equalization with Second-Order Statistics Xiaohua Li and H (Howard) Fan, Senior
More informationPerformance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank
More informationRandomized Kaczmarz Nick Freris EPFL
Randomized Kaczmarz Nick Freris EPFL (Joint work with A. Zouzias) Outline Randomized Kaczmarz algorithm for linear systems Consistent (noiseless) Inconsistent (noisy) Optimal de-noising Convergence analysis
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More informationOn Incremental 2-norm Condition Estimators
On Incremental 2-norm Condition Estimators Jurjen Duintjer Tebbens Institute of Computer Science Academy of Sciences of the Czech Republic duintjertebbens@cs.cas.cz Miroslav Tůma Institute of Computer
More informationarxiv:cs/ v1 [cs.ms] 13 Aug 2004
tsnnls: A solver for large sparse least squares problems with non-negative variables Jason Cantarella Department of Mathematics, University of Georgia, Athens, GA 30602 Michael Piatek arxiv:cs/0408029v
More informationInverse Source Identification based on Acoustic Particle Velocity Measurements. Abstract. 1. Introduction
The 2002 International Congress and Exposition on Noise Control Engineering Dearborn, MI, USA. August 19-21, 2002 Inverse Source Identification based on Acoustic Particle Velocity Measurements R. Visser
More informationRANDOMIZED METHODS FOR RANK-DEFICIENT LINEAR SYSTEMS
Electronic Transactions on Numerical Analysis. Volume 44, pp. 177 188, 2015. Copyright c 2015,. ISSN 1068 9613. ETNA RANDOMIZED METHODS FOR RANK-DEFICIENT LINEAR SYSTEMS JOSEF SIFUENTES, ZYDRUNAS GIMBUTAS,
More informationSIGNAL AND IMAGE RESTORATION: SOLVING
1 / 55 SIGNAL AND IMAGE RESTORATION: SOLVING ILL-POSED INVERSE PROBLEMS - ESTIMATING PARAMETERS Rosemary Renaut http://math.asu.edu/ rosie CORNELL MAY 10, 2013 2 / 55 Outline Background Parameter Estimation
More informationarxiv: v1 [math.na] 29 Dec 2014
A CUR Factorization Algorithm based on the Interpolative Decomposition Sergey Voronin and Per-Gunnar Martinsson arxiv:1412.8447v1 [math.na] 29 Dec 214 December 3, 214 Abstract An algorithm for the efficient
More informationMulti-Linear Mappings, SVD, HOSVD, and the Numerical Solution of Ill-Conditioned Tensor Least Squares Problems
Multi-Linear Mappings, SVD, HOSVD, and the Numerical Solution of Ill-Conditioned Tensor Least Squares Problems Lars Eldén Department of Mathematics, Linköping University 1 April 2005 ERCIM April 2005 Multi-Linear
More informationMatrix Multiplication Chapter IV Special Linear Systems
Matrix Multiplication Chapter IV Special Linear Systems By Gokturk Poyrazoglu The State University of New York at Buffalo BEST Group Winter Lecture Series Outline 1. Diagonal Dominance and Symmetry a.
More informationFINITE-DIMENSIONAL LINEAR ALGEBRA
DISCRETE MATHEMATICS AND ITS APPLICATIONS Series Editor KENNETH H ROSEN FINITE-DIMENSIONAL LINEAR ALGEBRA Mark S Gockenbach Michigan Technological University Houghton, USA CRC Press Taylor & Francis Croup
More informationJacobian conditioning analysis for model validation
Neural Computation 16: 401-418 (2004). Jacobian conditioning analysis for model validation Isabelle Rivals and Léon Personnaz Équipe de Statistique Appliquée École Supérieure de Physique et de Chimie Industrielles
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More informationDimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas
Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx
More informationMATH 350: Introduction to Computational Mathematics
MATH 350: Introduction to Computational Mathematics Chapter V: Least Squares Problems Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Spring 2011 fasshauer@iit.edu MATH
More informationbe a Householder matrix. Then prove the followings H = I 2 uut Hu = (I 2 uu u T u )u = u 2 uut u
MATH 434/534 Theoretical Assignment 7 Solution Chapter 7 (71) Let H = I 2uuT Hu = u (ii) Hv = v if = 0 be a Householder matrix Then prove the followings H = I 2 uut Hu = (I 2 uu )u = u 2 uut u = u 2u =
More informationApplied Linear Algebra in Geoscience Using MATLAB
Applied Linear Algebra in Geoscience Using MATLAB Contents Getting Started Creating Arrays Mathematical Operations with Arrays Using Script Files and Managing Data Two-Dimensional Plots Programming in
More informationSVD, PCA & Preprocessing
Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees
More informationLanczos tridigonalization and Golub - Kahan bidiagonalization: Ideas, connections and impact
Lanczos tridigonalization and Golub - Kahan bidiagonalization: Ideas, connections and impact Zdeněk Strakoš Academy of Sciences and Charles University, Prague http://www.cs.cas.cz/ strakos Hong Kong, February
More informationLecture 5: Randomized methods for low-rank approximation
CBMS Conference on Fast Direct Solvers Dartmouth College June 23 June 27, 2014 Lecture 5: Randomized methods for low-rank approximation Gunnar Martinsson The University of Colorado at Boulder Research
More informationTotal least squares. Gérard MEURANT. October, 2008
Total least squares Gérard MEURANT October, 2008 1 Introduction to total least squares 2 Approximation of the TLS secular equation 3 Numerical experiments Introduction to total least squares In least squares
More informationLINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12,
LINEAR ALGEBRA: NUMERICAL METHODS. Version: August 12, 2000 74 6 Summary Here we summarize the most important information about theoretical and numerical linear algebra. MORALS OF THE STORY: I. Theoretically
More informationRandomized projection algorithms for overdetermined linear systems
Randomized projection algorithms for overdetermined linear systems Deanna Needell Claremont McKenna College ISMP, Berlin 2012 Setup Setup Let Ax = b be an overdetermined, standardized, full rank system
More informationGolub-Kahan iterative bidiagonalization and determining the noise level in the data
Golub-Kahan iterative bidiagonalization and determining the noise level in the data Iveta Hnětynková,, Martin Plešinger,, Zdeněk Strakoš, * Charles University, Prague ** Academy of Sciences of the Czech
More informationPrewhitening for Rank-Deficient Noise in Subspace Methods for Noise Reduction
Downloaded from orbit.dtu.dk on: Nov 02, 2018 Prewhitening for Rank-Deficient Noise in Subspace Methods for Noise Reduction Hansen, Per Christian; Jensen, Søren Holdt Published in: I E E E Transactions
More informationIterative Methods. Splitting Methods
Iterative Methods Splitting Methods 1 Direct Methods Solving Ax = b using direct methods. Gaussian elimination (using LU decomposition) Variants of LU, including Crout and Doolittle Other decomposition
More informationNumerical Methods I Solving Square Linear Systems: GEM and LU factorization
Numerical Methods I Solving Square Linear Systems: GEM and LU factorization Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 18th,
More information