Lecture 4, INF-MAT 4350, 2010: 5. Fast Direct Solution of Large Linear Systems
Tom Lyche
Centre of Mathematics for Applications, Department of Informatics, University of Oslo
September 16, 2010

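Test Matrix $T_1 \in \mathbb{R}^{m,m}$
$$T_1 := \begin{bmatrix} d & a & & & \\ a & d & a & & \\ & \ddots & \ddots & \ddots & \\ & & a & d & a \\ & & & a & d \end{bmatrix}, \qquad (1)$$
where $a, d \in \mathbb{R}$. We call this the 1D test matrix.
Second derivative matrix $T$: $a = -1$, $d = 2$.
The matrix $T_1$ is symmetric and tridiagonal; diagonally dominant if $|d| \ge 2|a|$; nonsingular if diagonally dominant and $a \ne 0$; strictly diagonally dominant if $|d| > 2|a|$.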

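Properties of $T_2 = T_1 \otimes I + I \otimes T_1$
For $m = 3$:
$$T_2 := \begin{bmatrix}
2d & a & 0 & a & 0 & 0 & 0 & 0 & 0 \\
a & 2d & a & 0 & a & 0 & 0 & 0 & 0 \\
0 & a & 2d & 0 & 0 & a & 0 & 0 & 0 \\
a & 0 & 0 & 2d & a & 0 & a & 0 & 0 \\
0 & a & 0 & a & 2d & a & 0 & a & 0 \\
0 & 0 & a & 0 & a & 2d & 0 & 0 & a \\
0 & 0 & 0 & a & 0 & 0 & 2d & a & 0 \\
0 & 0 & 0 & 0 & a & 0 & a & 2d & a \\
0 & 0 & 0 & 0 & 0 & a & 0 & a & 2d
\end{bmatrix}$$
1. It is the Poisson matrix $A$ for $a = -1$, $d = 2$.
2. It is symmetric positive definite if $d > 0$ and $d \ge 2|a|$.
3. It is banded.
4. It is block tridiagonal.
5. We know the eigenvalues and eigenvectors.
6. The eigenvectors are orthogonal.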

Fast Methods
Consider solving $Ax = b$, where $A = T_1$ or $A = T_2$.
For $T_1$ there is an $O(n)$ algorithm. What about $T_2$?
Consider $d = 2$, $a = -1$, the Poisson matrix. Call it $A$.
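The slide does not spell out the $O(n)$ algorithm for $T_1$. A minimal sketch, assuming it refers to LU factorization of the tridiagonal matrix without pivoting (often called the Thomas algorithm); the function name `solve_tridiag_toeplitz` is hypothetical.

```python
def solve_tridiag_toeplitz(a, d, b):
    """Solve T1 x = b for T1 = tridiag(a, d, a) in O(m) operations.

    Assumption: all pivots u[k] are nonzero, so no row exchanges are needed.
    This holds, for example, for the second derivative matrix (a=-1, d=2).
    """
    m = len(b)
    u = [0.0] * m            # diagonal of U in T1 = L U
    y = [0.0] * m            # solution of L y = b
    u[0], y[0] = d, b[0]
    for k in range(1, m):
        l = a / u[k - 1]     # subdiagonal entry of L
        u[k] = d - l * a
        y[k] = b[k] - l * y[k - 1]
    x = [0.0] * m
    x[-1] = y[-1] / u[-1]
    for k in range(m - 2, -1, -1):   # back substitution with U
        x[k] = (y[k] - a * x[k + 1]) / u[k]
    return x

# Second derivative matrix of order 5; b was chosen as T * [1, 2, 3, 4, 5].
x = solve_tridiag_toeplitz(-1.0, 2.0, [0.0, 0.0, 0.0, 0.0, 6.0])
# x is approximately [1, 2, 3, 4, 5]
```

Each loop touches every unknown once, which is where the linear operation count comes from.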

Full Cholesky
Symmetric positive definite system $Ax = b$ of order $n = m^2$.
$A = R^T R$ with $R$ upper triangular.
Full Cholesky: #flops $O(n^3/3)$.
If $m = 10^3$ then it takes years of computing time.
Need to use matrix structure.
[Figure: sparsity pattern, nz = 105]

Banded Cholesky
System $Ax = b$ of order $n = m^2$ and bandwidth $d = m = \sqrt{n}$.
Band Cholesky: #flops is $O(nd^2) = O(n^2)$.
But if $m = 10^3$ then it still takes hours of computing time.
[Figure: sparsity pattern for $n = 100$, nz = 1009]
We get fill-in inside the band, so the $O(n^2)$ count is realistic.

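Use Block Structure of A
For $m = 3$ the matrix $A$ is block tridiagonal:
$$A = \begin{bmatrix}
4 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
-1 & 4 & -1 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & -1 & 4 & 0 & 0 & -1 & 0 & 0 & 0 \\
-1 & 0 & 0 & 4 & -1 & 0 & -1 & 0 & 0 \\
0 & -1 & 0 & -1 & 4 & -1 & 0 & -1 & 0 \\
0 & 0 & -1 & 0 & -1 & 4 & 0 & 0 & -1 \\
0 & 0 & 0 & -1 & 0 & 0 & 4 & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & -1 & 4 & -1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 4
\end{bmatrix}
= \begin{bmatrix} T + 2I & -I & 0 \\ -I & T + 2I & -I \\ 0 & -I & T + 2I \end{bmatrix}$$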

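LU of Block Tridiagonal Matrix
$$\begin{bmatrix} D_1 & C_1 & 0 \\ A_2 & D_2 & C_2 \\ 0 & A_3 & D_3 \end{bmatrix} = \begin{bmatrix} I & 0 & 0 \\ L_2 & I & 0 \\ 0 & L_3 & I \end{bmatrix} \begin{bmatrix} R_1 & C_1 & 0 \\ 0 & R_2 & C_2 \\ 0 & 0 & R_3 \end{bmatrix}.$$
Given $A_i, D_i, C_i$ with $D_i$ square. Find $L_i, R_i$.
Compare the $(i, j)$ block entries on both sides:
$(1,1)$: $D_1 = R_1 \;\Rightarrow\; R_1 = D_1$
$(2,1)$: $A_2 = L_2 R_1 \;\Rightarrow\; L_2 = A_2 R_1^{-1}$
$(2,2)$: $D_2 = L_2 C_1 + R_2 \;\Rightarrow\; R_2 = D_2 - L_2 C_1$
$(3,2)$: $A_3 = L_3 R_2 \;\Rightarrow\; L_3 = A_3 R_2^{-1}$
$(3,3)$: $D_3 = L_3 C_2 + R_3 \;\Rightarrow\; R_3 = D_3 - L_3 C_2$
In general
$$R_1 = D_1, \qquad L_k = A_k R_{k-1}^{-1}, \quad R_k = D_k - L_k C_{k-1}, \quad k = 2, 3, \ldots, m.$$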

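LU of Block Tridiagonal Matrix
$Ax = b = [b_1^T, \ldots, b_m^T]^T$
$$\begin{bmatrix} D_1 & C_1 & & & \\ A_2 & D_2 & C_2 & & \\ & \ddots & \ddots & \ddots & \\ & & A_{m-1} & D_{m-1} & C_{m-1} \\ & & & A_m & D_m \end{bmatrix} = \begin{bmatrix} I & & & \\ L_2 & I & & \\ & \ddots & \ddots & \\ & & L_m & I \end{bmatrix} \begin{bmatrix} R_1 & C_1 & & \\ & \ddots & \ddots & \\ & & R_{m-1} & C_{m-1} \\ & & & R_m \end{bmatrix}.$$
$$\begin{aligned}
R_1 &= D_1, & L_k &= A_k R_{k-1}^{-1}, \quad R_k = D_k - L_k C_{k-1}, & k &= 2, 3, \ldots, m, \\
y_1 &= b_1, & y_k &= b_k - L_k y_{k-1}, & k &= 2, 3, \ldots, m, \\
x_m &= R_m^{-1} y_m, & x_k &= R_k^{-1}(y_k - C_k x_{k+1}), & k &= m-1, \ldots, 2, 1.
\end{aligned} \qquad (2)$$
The solution is then $x^T = [x_1^T, \ldots, x_m^T]$.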
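Recursion (2) can be sketched in a few lines of plain Python. This is an illustrative sketch, not an efficient implementation: the helper names (`matmul`, `matsub`, `solve`, `block_tridiag_solve`) are hypothetical, blocks are dense lists of lists, and $L_k$ is never stored because $L_k Z = A_k (R_{k-1}^{-1} Z)$ is evaluated with a linear solve instead.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matsub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def solve(A, B):
    """Return A^{-1} B using Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(ra) + list(rb) for ra, rb in zip(A, B)]   # augmented [A | B]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    X = [None] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        X[r] = [(M[r][n + j] - sum(M[r][k] * X[k][j] for k in range(r + 1, n)))
                / M[r][r] for j in range(len(B[0]))]
    return X

def block_tridiag_solve(A, D, C, b):
    """Solve the block tridiagonal system with recursion (2).

    A[k], D[k], C[k] are the sub-, main- and superdiagonal blocks of block
    row k (0-based; A[0] and C[-1] are unused); b[k] is a column (n_k-by-1).
    """
    m = len(D)
    R, y = [D[0]], [b[0]]
    for k in range(1, m):
        R.append(matsub(D[k], matmul(A[k], solve(R[k - 1], C[k - 1]))))
        y.append(matsub(b[k], matmul(A[k], solve(R[k - 1], y[k - 1]))))
    x = [None] * m
    x[-1] = solve(R[-1], y[-1])
    for k in range(m - 2, -1, -1):
        x[k] = solve(R[k], matsub(y[k], matmul(C[k], x[k + 1])))
    return x

# Poisson matrix for m = 3: D_k = T + 2I, A_k = C_k = -I, b = A * ones.
negI = [[-1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
Dk = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [[[2.0], [1.0], [2.0]], [[1.0], [0.0], [1.0]], [[2.0], [1.0], [2.0]]]
x = block_tridiag_solve([None, negI, negI], [Dk, Dk, Dk], [negI, negI, None], b)
# every entry of x is approximately 1
```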

Block Cholesky factor R for n = 100
[Figure: sparsity pattern of the block Cholesky factor $R$ for $n = 100$, nz = 1018]
We get fill-in in the $R_i$ blocks.
Computing $R_i$ requires $O(m^3)$ flops.
We need $m \cdot O(m^3) = O(m^4)$ flops, so the $O(n^2)$ count remains.
Block methods can still be faster because we replace some loops with matrix multiplications.

Other Methods
Iterative methods.
Multigrid.
Fast solvers based on diagonalization of $T = \mathrm{tridiag}(-1, 2, -1)$ and the Fast Fourier Transform.
For this, solve $Ax = b$ in the form $TV + VT = h^2 F$.

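Eigenpairs of $T = \mathrm{tridiag}(-1, 2, -1) \in \mathbb{R}^{m,m}$
Let $h = 1/(m + 1)$. We know that $T s_j = \lambda_j s_j$ for $j = 1, \ldots, m$, where
$$\begin{aligned}
s_j &= [\sin(j\pi h), \sin(2j\pi h), \ldots, \sin(mj\pi h)]^T, \\
\lambda_j &= 4 \sin^2\left(\frac{j\pi h}{2}\right), \\
s_j^T s_k &= \frac{1}{2h}\,\delta_{j,k}, \quad j, k = 1, \ldots, m.
\end{aligned}$$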

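The sine matrix S
$$S := [s_1, \ldots, s_m], \quad D = \mathrm{diag}(\lambda_1, \ldots, \lambda_m), \quad TS = SD, \quad S^T S = S^2 = \frac{1}{2h} I.$$
$S$ is almost, but not quite, orthogonal: it is symmetric and $\sqrt{2h}\,S$ is orthogonal.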
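The eigenpair formulas and the relation $S^T S = \frac{1}{2h} I$ are easy to check numerically. The sketch below uses hypothetical helper names (`sine_matrix`, `eigenvalues`, `apply_T`) and plain Python lists.

```python
import math

def sine_matrix(m):
    """S[j][k] = sin((j+1)(k+1) pi h), h = 1/(m+1), with 0-based j, k."""
    h = 1.0 / (m + 1)
    return [[math.sin((j + 1) * (k + 1) * math.pi * h) for k in range(m)]
            for j in range(m)]

def eigenvalues(m):
    """lambda_j = 4 sin^2(j pi h / 2) for j = 1, ..., m."""
    h = 1.0 / (m + 1)
    return [4.0 * math.sin((j + 1) * math.pi * h / 2) ** 2 for j in range(m)]

def apply_T(v):
    """Return T v for T = tridiag(-1, 2, -1)."""
    m = len(v)
    return [2 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < m - 1 else 0.0) for i in range(m)]

m = 6
S, lam = sine_matrix(m), eigenvalues(m)
# Largest entry of |T s_j - lambda_j s_j| over all j: rounding-error size.
eig_err = max(abs(t - lam[j] * s)
              for j in range(m)
              for t, s in zip(apply_T([S[i][j] for i in range(m)]),
                              [S[i][j] for i in range(m)]))
# Largest entry of |S^T S - I/(2h)|, where 1/(2h) = (m+1)/2.
orth_err = max(abs(sum(S[i][j] * S[i][k] for i in range(m))
                   - ((m + 1) / 2 if j == k else 0.0))
               for j in range(m) for k in range(m))
```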

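Find V using diagonalization
We define a matrix $X$ by $V = SXS$, where $V$ is the solution of $TV + VT = h^2 F$. Then
$$\begin{aligned}
TV + VT = h^2 F \quad &\overset{V = SXS}{\iff} \quad TSXS + SXST = h^2 F \\
&\overset{S(\cdot)S}{\iff} \quad STSXS^2 + S^2XSTS = h^2 SFS \\
&\overset{TS = SD}{\iff} \quad S^2 D X S^2 + S^2 X S^2 D = h^2 SFS \\
&\overset{S^2 = I/(2h)}{\iff} \quad DX + XD = 4h^4 SFS =: B.
\end{aligned}$$
Entry by entry this reads $\lambda_j x_{jk} + x_{jk} \lambda_k = b_{jk}$, so
$$x_{jk} = \frac{b_{jk}}{\lambda_j + \lambda_k} \quad \text{for all } j, k.$$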

Algorithm [A Simple Fast Poisson Solver]
1. $h = 1/(m + 1)$; $F = \big(f(jh, kh)\big)_{j,k=1}^m$; $S = \big(\sin(jk\pi h)\big)_{j,k=1}^m$; $\sigma = \big(\sin^2((j\pi h)/2)\big)_{j=1}^m$;
2. $G = (g_{j,k}) = SFS$;
3. $X = (x_{j,k})_{j,k=1}^m$, where $x_{j,k} = h^4 g_{j,k}/(\sigma_j + \sigma_k)$;
4. $V = SXS$. (3)
The output is the exact solution of the discrete Poisson equation on a square, computed in $O(n^{3/2})$ operations.
Only a couple of $m \times m$ matrices are required for storage.
Next: use the FFT to reduce the complexity to $O(n \log_2 n)$.
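The four steps translate almost line by line into code. The following is a sketch in plain Python with hypothetical function names (`matmul`, `fast_poisson`, `max_residual`); the matrix products are written as explicit loops, so it shows the $O(n^{3/2})$ variant, not the FFT-based one.

```python
import math

def matmul(A, B):
    """Plain O(m^3) product of two m-by-m lists of lists."""
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def fast_poisson(f, m):
    """Steps 1-4 of the algorithm: returns V with T V + V T = h^2 F."""
    h = 1.0 / (m + 1)
    idx = range(1, m + 1)
    F = [[f(j * h, k * h) for k in idx] for j in idx]
    S = [[math.sin(j * k * math.pi * h) for k in idx] for j in idx]
    sigma = [math.sin(j * math.pi * h / 2) ** 2 for j in idx]
    G = matmul(matmul(S, F), S)
    X = [[h ** 4 * G[j][k] / (sigma[j] + sigma[k]) for k in range(m)]
         for j in range(m)]
    return matmul(matmul(S, X), S)

def max_residual(V, f, m):
    """Largest entry of |T V + V T - h^2 F| (five-point stencil, zero boundary)."""
    h = 1.0 / (m + 1)
    def v(j, k):
        return V[j][k] if 0 <= j < m and 0 <= k < m else 0.0
    return max(abs(4 * v(j, k) - v(j - 1, k) - v(j + 1, k) - v(j, k - 1)
                   - v(j, k + 1) - h * h * f((j + 1) * h, (k + 1) * h))
               for j in range(m) for k in range(m))

# Example right-hand side (an arbitrary choice for illustration).
f = lambda x, y: x * y + 1.0
V = fast_poisson(f, 5)
res = max_residual(V, f, 5)   # of the order of rounding errors
```

Checking the residual of $TV + VT = h^2 F$ rather than comparing with a continuous solution isolates the solver from the discretization error.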

Discrete Sine Transform, DST
Given $v = [v_1, \ldots, v_m]^T \in \mathbb{R}^m$ we say that the vector $w = [w_1, \ldots, w_m]^T$ given by
$$w_j = \sum_{k=1}^m \sin\left(\frac{jk\pi}{m + 1}\right) v_k, \quad j = 1, \ldots, m,$$
is the Discrete Sine Transform (DST) of $v$.
In matrix form we can write the DST as the matrix-vector product $w = Sv$, where $S$ is the sine matrix.
We can identify the matrix $B = SA$ as the DST of $A \in \mathbb{R}^{m,n}$, i.e. as the DST of the columns of $A$.

The product B = AS
The product $B = AS$ can also be interpreted as a DST.
Indeed, since $S$ is symmetric we have $B = (SA^T)^T$, which means that $B$ is the transpose of the DST of the rows of $A$.
It follows that we can compute the unknowns $V$ in the simple fast Poisson solver by carrying out Discrete Sine Transforms on 4 $m$-by-$m$ matrices, in addition to the computation of $X$.

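The Euler formula
$$e^{i\phi} = \cos\phi + i\sin\phi, \qquad e^{-i\phi} = \cos\phi - i\sin\phi, \qquad \phi \in \mathbb{R}, \quad i = \sqrt{-1}.$$
Note that
$$\cos\phi = \frac{e^{i\phi} + e^{-i\phi}}{2}, \qquad \sin\phi = \frac{e^{i\phi} - e^{-i\phi}}{2i}.$$
Consider
$$\omega_N := e^{-2\pi i/N} = \cos\left(\frac{2\pi}{N}\right) - i\sin\left(\frac{2\pi}{N}\right), \qquad \omega_N^N = 1,$$
$$\omega_{2m+2}^{jk} = e^{-2jk\pi i/(2m+2)} = e^{-jk\pi h i} = \cos(jk\pi h) - i\sin(jk\pi h).$$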

Discrete Fourier Transform (DFT)
If
$$z_j = \sum_{k=1}^N \omega_N^{(j-1)(k-1)} y_k, \quad j = 1, \ldots, N,$$
then $z = [z_1, \ldots, z_N]^T$ is the DFT of $y = [y_1, \ldots, y_N]^T$. In matrix form
$$z = F_N y, \quad \text{where } F_N := \big(\omega_N^{(j-1)(k-1)}\big)_{j,k=1}^N \in \mathbb{C}^{N,N}.$$
$F_N$ is called the Fourier matrix.

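Example
$$\omega_4 = \exp(-2\pi i/4) = \cos\left(\frac{\pi}{2}\right) - i\sin\left(\frac{\pi}{2}\right) = -i,$$
$$F_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ 1 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}.$$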
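As a quick check of the definition, the Fourier matrix can be built directly from $\omega_N = e^{-2\pi i/N}$. The sketch below uses hypothetical names (`fourier_matrix`, `dft`) and costs $O(N^2)$ operations per transform.

```python
import cmath

def fourier_matrix(N):
    """F_N with entries omega_N^{jk} for 0-based indices j, k."""
    w = cmath.exp(-2j * cmath.pi / N)
    return [[w ** (j * k) for k in range(N)] for j in range(N)]

def dft(y):
    """Direct O(N^2) evaluation of z = F_N y."""
    return [sum(Fjk * yk for Fjk, yk in zip(row, y))
            for row in fourier_matrix(len(y))]

F4 = fourier_matrix(4)
# Up to rounding, the second row of F4 is [1, -i, -1, i].
z = dft([1, 0, 0, 0])
# z is the first column of F4, i.e. approximately [1, 1, 1, 1].
```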

Connection DST and DFT
The DST of order $m$ can be computed from the DFT of order $N = 2m + 2$ as follows:
Lemma. Given a positive integer $m$ and a vector $x \in \mathbb{R}^m$. Component $k$ of $S_m x$ is equal to $i/2$ times component $k + 1$ of $F_{2m+2}\, z$, where
$$z = (0, x_1, \ldots, x_m, 0, -x_m, -x_{m-1}, \ldots, -x_1)^T \in \mathbb{R}^{2m+2}.$$
In symbols
$$(S_m x)_k = \frac{i}{2}\,(F_{2m+2}\, z)_{k+1}, \quad k = 1, \ldots, m.$$

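Proof. Let $\omega = \omega_{2m+2}$ and $z = (0, x_1, \ldots, x_m, 0, -x_m, -x_{m-1}, \ldots, -x_1)^T$. Component $k + 1$ of $F_{2m+2}\, z$ is given by
$$\begin{aligned}
(F_{2m+2}\, z)_{k+1} &= \sum_{j=1}^m x_j \omega^{jk} - \sum_{j=1}^m x_j \omega^{(2m+2-j)k} \\
&= \sum_{j=1}^m x_j \left(\omega^{jk} - \omega^{-jk}\right) \\
&= -2i \sum_{j=1}^m x_j \sin(jk\pi h) = -2i\,(S_m x)_k.
\end{aligned}$$
Dividing both sides by $-2i$ proves the Lemma.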
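The Lemma can be tried out on a small vector. In this sketch the names `dst_direct`, `dft` and `dst_via_dft` are hypothetical, and the DFT is evaluated directly, so the point is the identity and not speed.

```python
import cmath
import math

def dst_direct(x):
    """(S_m x)_k = sum_j sin(j k pi h) x_j with h = 1/(m+1)."""
    m = len(x)
    h = 1.0 / (m + 1)
    return [sum(math.sin((j + 1) * (k + 1) * math.pi * h) * x[j]
                for j in range(m)) for k in range(m)]

def dft(y):
    """Direct DFT with omega_N = exp(-2 pi i / N)."""
    N = len(y)
    return [sum(cmath.exp(-2j * cmath.pi * j * k / N) * y[k] for k in range(N))
            for j in range(N)]

def dst_via_dft(x):
    """DST from a DFT of order 2m+2 of the odd extension z of x."""
    z = [0.0] + list(x) + [0.0] + [-v for v in reversed(x)]
    Fz = dft(z)
    # Component k+1 (1-based) of F z is Fz[k] in 0-based indexing.
    return [(0.5j * Fz[k]).real for k in range(1, len(x) + 1)]

x = [1.0, -2.0, 0.5, 3.0]
a, b = dst_direct(x), dst_via_dft(x)
# a and b agree up to rounding errors
```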

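Fast Fourier Transform (FFT)
Suppose $N$ is even. Express $F_N$ in terms of $F_{N/2}$.
Reorder the columns in $F_N$ so that the odd columns appear before the even ones:
$$P_N := (e_1, e_3, \ldots, e_{N-1}, e_2, e_4, \ldots, e_N).$$
$$F_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}, \qquad F_4 P_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & -i & i \\ 1 & 1 & -1 & -1 \\ 1 & -1 & i & -i \end{bmatrix}.$$
$$F_2 = \begin{bmatrix} 1 & 1 \\ 1 & \omega_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad D_2 = \mathrm{diag}(1, \omega_4) = \begin{bmatrix} 1 & 0 \\ 0 & -i \end{bmatrix},$$
$$F_4 P_4 = \begin{bmatrix} F_2 & D_2 F_2 \\ F_2 & -D_2 F_2 \end{bmatrix}.$$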

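$F_{2m}$, $P_{2m}$, $F_m$ and $D_m$
Theorem. If $N = 2m$ is even then
$$F_{2m} P_{2m} = \begin{bmatrix} F_m & D_m F_m \\ F_m & -D_m F_m \end{bmatrix}, \qquad (4)$$
where
$$D_m = \mathrm{diag}(1, \omega_N, \omega_N^2, \ldots, \omega_N^{m-1}). \qquad (5)$$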

Proof. Fix integers $j, k$ with $0 \le j, k \le m - 1$ and set $p = j + 1$ and $q = k + 1$. Since $\omega_m^m = 1$, $\omega_N^2 = \omega_m$, and $\omega_N^m = -1$ we find by considering elements in the four sub-blocks in turn
$$\begin{aligned}
(F_{2m} P_{2m})_{p,q} &= \omega_N^{j(2k)} = \omega_m^{jk} = (F_m)_{p,q}, \\
(F_{2m} P_{2m})_{p+m,q} &= \omega_N^{(j+m)(2k)} = \omega_m^{(j+m)k} = (F_m)_{p,q}, \\
(F_{2m} P_{2m})_{p,q+m} &= \omega_N^{j(2k+1)} = \omega_N^{j}\,\omega_m^{jk} = (D_m F_m)_{p,q}, \\
(F_{2m} P_{2m})_{p+m,q+m} &= \omega_N^{(j+m)(2k+1)} = -\omega_N^{j(2k+1)} = (-D_m F_m)_{p,q}.
\end{aligned}$$
It follows that the four $m$-by-$m$ blocks of $F_{2m} P_{2m}$ have the required structure.
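The block structure can also be confirmed numerically for a given $m$. A sketch with hypothetical names (`fourier_matrix`, `check_block_structure`); the returned number is the largest deviation between $F_{2m}P_{2m}$ and the block matrix in the Theorem.

```python
import cmath

def fourier_matrix(N):
    """F_N with entries omega_N^{jk} for 0-based indices j, k."""
    w = cmath.exp(-2j * cmath.pi / N)
    return [[w ** (j * k) for k in range(N)] for j in range(N)]

def check_block_structure(m):
    """Max entrywise error in F_2m P_2m = [[F_m, D_m F_m], [F_m, -D_m F_m]]."""
    N = 2 * m
    F2m, Fm = fourier_matrix(N), fourier_matrix(m)
    w = cmath.exp(-2j * cmath.pi / N)
    # Column order of P_2m: e_1, e_3, ..., e_{N-1}, e_2, e_4, ..., e_N (0-based here).
    perm = list(range(0, N, 2)) + list(range(1, N, 2))
    FP = [[row[c] for c in perm] for row in F2m]
    err = 0.0
    for p in range(m):
        for q in range(m):
            dF = w ** p * Fm[p][q]            # entry (p, q) of D_m F_m
            err = max(err,
                      abs(FP[p][q] - Fm[p][q]),
                      abs(FP[p + m][q] - Fm[p][q]),
                      abs(FP[p][q + m] - dF),
                      abs(FP[p + m][q + m] + dF))
    return err

err4 = check_block_structure(4)   # N = 8
err3 = check_block_structure(3)   # N = 6: m itself need not be even
```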

The basic step
Using the Theorem on $F_{2m} P_{2m}$ we can carry out the DFT as a block multiplication. Let $y \in \mathbb{R}^{2m}$ and set $w = P_{2m}^T y = (w_1^T, w_2^T)^T$, where $w_1, w_2 \in \mathbb{R}^m$. Then $F_{2m}\, y = F_{2m} P_{2m} P_{2m}^T y = F_{2m} P_{2m} w$ and
$$F_{2m} P_{2m} w = \begin{bmatrix} F_m & D_m F_m \\ F_m & -D_m F_m \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} q_1 + q_2 \\ q_1 - q_2 \end{bmatrix},$$
where $q_1 = F_m w_1$ and $q_2 = D_m (F_m w_2)$.
In order to compute $F_{2m}\, y$ we need to compute $F_m w_1$ and $F_m w_2$.
Note that $w_1^T = [y_1, y_3, \ldots, y_{N-1}]$, while $w_2^T = [y_2, y_4, \ldots, y_N]$. This follows since $w^T = [w_1^T, w_2^T] = y^T P_{2m}$, and postmultiplying a vector by $P_{2m}$ moves the odd-indexed components to the left of all the even-indexed components.

Recursive FFT
Algorithm (Recursive FFT). Recursive Matlab function for the case $N = 2^k$:

    function z = fftrec(y)
    % FFTREC  Recursive FFT of a vector y whose length is a power of two.
    n = length(y);
    if n == 1
        z = y;                                  % F_1 is the 1-by-1 identity
    else
        q1 = fftrec(y(1:2:n-1));                % F_{n/2} w_1 (odd-indexed entries)
        q2 = exp(-2*pi*1i/n).^(0:n/2-1) .* fftrec(y(2:2:n));   % D_{n/2} F_{n/2} w_2
        z = [q1+q2, q1-q2];
    end

Such a recursive version of the FFT is useful for testing purposes, but is much too slow for large problems. A challenge for FFT code writers is to develop nonrecursive versions and also to handle efficiently the case where $N$ is not a power of two.
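For readers without Matlab, the same recursion can be sketched in Python and compared with the direct $O(N^2)$ DFT. This is an illustrative translation, not an optimized FFT; `dft_direct` is a hypothetical helper used only for the comparison.

```python
import cmath

def fftrec(y):
    """Recursive FFT of a sequence whose length is a power of two."""
    n = len(y)
    if n == 1:
        return list(y)
    q1 = fftrec(y[0::2])                  # F_{n/2} w_1: entries y_1, y_3, ...
    r = fftrec(y[1::2])                   # F_{n/2} w_2: entries y_2, y_4, ...
    q2 = [cmath.exp(-2j * cmath.pi * k / n) * r[k] for k in range(n // 2)]
    return ([a + b for a, b in zip(q1, q2)]
            + [a - b for a, b in zip(q1, q2)])

def dft_direct(y):
    """Reference O(N^2) DFT with omega_N = exp(-2 pi i / N)."""
    N = len(y)
    return [sum(cmath.exp(-2j * cmath.pi * j * k / N) * y[k] for k in range(N))
            for j in range(N)]

y = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
z_fast, z_slow = fftrec(y), dft_direct(y)
# z_fast and z_slow agree up to rounding errors
```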

FFT Complexity
Let $x_k$ be the complexity (the number of flops) when $N = 2^k$.
Since we need two FFTs of order $N/2 = 2^{k-1}$ and a multiplication with the diagonal matrix $D_{N/2}$, it is reasonable to assume that $x_k = 2x_{k-1} + \gamma 2^k$ for some constant $\gamma$ independent of $k$.
Since $x_0 = 0$ we obtain by induction on $k$ that $x_k = \gamma k 2^k = \gamma N \log_2 N$.
This also holds when $N$ is not a power of 2.
Reasonable implementations of the FFT typically have $\gamma \approx 5$.
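The induction step behind $x_k = \gamma k 2^k$ is short enough to write out. The derivation below only uses the recurrence and the initial value stated above.

```latex
\begin{aligned}
&\text{Base case: } x_0 = 0 = \gamma \cdot 0 \cdot 2^0. \\
&\text{Induction step: assume } x_{k-1} = \gamma (k-1) 2^{k-1}. \text{ Then} \\
&\qquad x_k = 2x_{k-1} + \gamma 2^k
      = 2\gamma (k-1) 2^{k-1} + \gamma 2^k
      = \gamma (k-1) 2^k + \gamma 2^k
      = \gamma k 2^k. \\
&\text{With } N = 2^k \text{ we have } k = \log_2 N, \text{ so } x_k = \gamma N \log_2 N.
\end{aligned}
```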

Improvement using FFT
The efficiency improvement from using the FFT to compute the DFT is impressive for large $N$.
The direct multiplication $F_N y$ requires $O(8N^2)$ flops since complex arithmetic is involved.
Assuming that the FFT uses $5N \log_2 N$ flops, we find for $N = 2^{20} \approx 10^6$ the ratio
$$\frac{8N^2}{5N \log_2 N} \approx 84000.$$
Thus if the FFT takes one second of computing time, and if the computing time is proportional to the number of flops, then the direct multiplication would take something like 84000 seconds, or 23 hours.
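The ratio is simple arithmetic and can be reproduced directly. A small sketch; the function name `direct_to_fft_ratio` is hypothetical and the default $\gamma = 5$ is the value assumed in the text.

```python
import math

def direct_to_fft_ratio(N, gamma=5.0):
    """Ratio 8 N^2 / (gamma N log2 N) of direct DFT flops to FFT flops."""
    return 8 * N ** 2 / (gamma * N * math.log2(N))

ratio = direct_to_fft_ratio(2 ** 20)   # 8 * 2^20 / 100, about 83886
hours = ratio / 3600                   # about 23.3 if the FFT takes one second
```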

Poisson solver based on FFT
It requires $O(n \log_2 n)$ flops, where $n = m^2$ is the size of the linear system $Ax = b$. Why?
There are $4m$ sine transforms: $G_1 = SF$, $G = G_1 S$, $V_1 = SX$, $V = V_1 S$.
A total of $4m$ FFTs of order $2m + 2$ are needed.
Since one FFT requires $O(\gamma(2m + 2)\log_2(2m + 2))$ flops, the $4m$ FFTs amount to
$$8\gamma m(m + 1)\log_2(2m + 2) \approx 8\gamma m^2 \log_2 m = 4\gamma n \log_2 n.$$
This should be compared to the $O(8n^{3/2})$ flops needed for 4 straightforward matrix multiplications with $S$.
Which is faster will depend on the programming of the FFT and the size of the problem.

Compare exact methods
System $Ax = b$ of order $n = m^2$. Assume $m = 1000$ so that $n = 10^6$. The time column is $T = 10^{-9} \cdot \#\text{flops}$ seconds, i.e. $10^9$ flops per second.

Method              # flops                                        Storage             T
Full Cholesky       $\frac{1}{3}n^3 = \frac{1}{3}\cdot 10^{18}$    $n^2 = 10^{12}$     10 years
Band Cholesky       $2n^2 = 2\cdot 10^{12}$                        $n^{3/2} = 10^9$    1/2 hour
Block Tridiagonal   $2n^2 = 2\cdot 10^{12}$                        $n^{3/2} = 10^9$    1/2 hour
Diagonalization     $8n^{3/2} = 8\cdot 10^9$                       $n = 10^6$          8 sec
FFT                 $20n\log_2 n = 4\cdot 10^8$                    $n = 10^6$          1/2 sec
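The flop counts and times in the comparison can be recomputed from the formulas. A sketch with a hypothetical function `cost_table`; the rate of $10^9$ flops per second is the assumption used for the time column.

```python
import math

def cost_table(m, rate=1e9):
    """Map method name -> (flops, seconds) for a system of order n = m^2."""
    n = m * m
    flops = {
        "Full Cholesky": n ** 3 / 3,
        "Band Cholesky": 2 * n ** 2,
        "Block Tridiagonal": 2 * n ** 2,
        "Diagonalization": 8 * n ** 1.5,
        "FFT": 20 * n * math.log2(n),
    }
    return {name: (fl, fl / rate) for name, fl in flops.items()}

table = cost_table(1000)
years_full = table["Full Cholesky"][1] / (365 * 24 * 3600)   # roughly 10.6
minutes_band = table["Band Cholesky"][1] / 60                # roughly 33
seconds_fft = table["FFT"][1]                                # roughly 0.4
```

The rounded entries "10 years", "1/2 hour" and "1/2 sec" in the table correspond to these values.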