A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures


Manfred Liebmann
Technische Universität München, Chair of Optimal Control, Center for Mathematical Sciences, M17
January 18, 2016


Motivation

Optimization problem in magnetic resonance imaging (MRI): use the nuclear magnetic resonance (NMR) capabilities of the magnetic resonance scanner for fat / water quantification. Analyzing the magnitude and phase images of multiple echoes allows the computation of the fat fraction of tissue. (K. Bredies)

Figure 1: Magnitude and phase images from multiple echoes

Figure 2: Calculated fat and water separation in tissue

Optimization Problem

Assume that $t_m = t_0 + m\,\Delta t$ are equispaced echo times and that the relaxation parameter $R_2$ is given. Then the least squares residual for the field inhomogeneity image,

$$r^2(B_0) = \min_{w,f} \sum_{m=1}^{M} \left| s_m - (w\,w_m + f\,f_m)\, e^{2\pi i B_0 t_m}\, e^{-R_2 t_m} \right|^2, \qquad (1)$$

is a trigonometric polynomial under the substitution $z = e^{2\pi i B_0 \Delta t}$, since $e^{2\pi i B_0 t_m} = e^{2\pi i B_0 t_0}\, z^m$. The local minima of the functional are determined by the roots of the polynomial. Here $w$, $f$ are the water and fat intensity images, $B_0$ is the field inhomogeneity image, and $w_m$ and $f_m$ are the ideal water and fat signals.

Find the roots of a polynomial for every pixel!

Companion Matrix

Let $p(z) = a_0 + a_1 z + \cdots + a_{n-1} z^{n-1} + z^n$ be a complex polynomial. The companion matrix of the polynomial,

$$A = \begin{pmatrix} 0 & \cdots & 0 & -a_0 \\ 1 & \ddots & \vdots & -a_1 \\ & \ddots & 0 & \vdots \\ 0 & & 1 & -a_{n-1} \end{pmatrix} \in \mathbb{C}^{n \times n}, \qquad (2)$$

has the characteristic polynomial $\chi_A(z) := \det(zE - A) = p(z)$.

Companion matrices are Hessenberg matrices!

The roots command in Matlab uses this approach to calculate the roots of a polynomial.
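The link between roots and eigenvalues can be checked by hand on a small case. The sketch below is illustrative only: it assumes the companion form with ones on the subdiagonal and the negated coefficients in the last column, and the helper names `companion` and `left_multiply` are hypothetical. For every root $\lambda$ of $p$, the row vector $(1, \lambda, \ldots, \lambda^{n-1})$ is a left eigenvector of $A$ with eigenvalue $\lambda$, which the loop verifies in exact integer arithmetic.

```python
def companion(a):
    """Companion matrix of p(z) = a[0] + a[1] z + ... + a[n-1] z^(n-1) + z^n."""
    n = len(a)
    A = [[0] * n for _ in range(n)]
    for i in range(1, n):
        A[i][i - 1] = 1          # ones on the subdiagonal
    for i in range(n):
        A[i][n - 1] = -a[i]      # last column holds the negated coefficients
    return A


def left_multiply(v, A):
    """Row vector times matrix: (v^T A)_j = sum_i v[i] * A[i][j]."""
    n = len(v)
    return [sum(v[i] * A[i][j] for i in range(n)) for j in range(n)]


# p(z) = (z - 1)(z - 2)(z - 3) = -6 + 11 z - 6 z^2 + z^3
a = [-6, 11, -6]
A = companion(a)

for root in (1, 2, 3):
    v = [root ** i for i in range(len(a))]           # (1, root, root^2)
    assert left_multiply(v, A) == [root * x for x in v]
```

All entries below the subdiagonal are zero, so the matrix is already in upper Hessenberg form and no reduction step is needed before a QR iteration.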

The Long History of LAPACK

LAPACK was first written in the 70s / 80s with a different computer architecture in mind than we have today.

Good software: 70s / 80s
Sequential algorithms
FLOPS are the main concern!
Memory access patterns are not so important

Good software: Today
Massively parallel hardware architectures need efficient parallel algorithms
Efficient memory access is extremely important!
Parallelization usually for large scale problems
Expose parallelism for many small scale problems through vectorization!

Vectorization of the QR Algorithm ZLAHQR

It is very hard to derive any meaningful parallelism within a single small eigenvalue problem.
However, it is still possible to vectorize the algorithm and thus solve many small eigenvalue problems simultaneously.
The LAPACK implementation of the core QR algorithm ZLAHQR has complex program flow.
Can vectorized code with a complex program flow still be efficient?
Let's have a look at the algorithm!
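One common way to run data-dependent control flow in lockstep is a per-lane mask: every lane executes the same instruction stream, and lanes that have hit their own "break" are simply masked off. The sketch below is a toy illustration of that idea only, not the solver from this talk. It uses a Newton iteration for square roots as a stand-in for the QR iteration, assumes strictly positive inputs, and the function name `batched_sqrt` is hypothetical.

```python
def batched_sqrt(values, tol=1e-12, max_iter=60):
    """Newton iteration for sqrt(v), run in lockstep over all lanes (v > 0)."""
    x = [max(v, 1.0) for v in values]      # per-lane starting guess
    active = [True] * len(values)          # per-lane mask
    iterations = [0] * len(values)
    for _ in range(max_iter):
        if not any(active):                # every lane has converged
            break
        for lane, v in enumerate(values):  # stands in for one vector instruction
            if not active[lane]:
                continue                   # masked-off lane keeps its result
            new_x = 0.5 * (x[lane] + v / x[lane])
            if abs(new_x - x[lane]) <= tol * new_x:
                active[lane] = False       # per-lane "break"
            x[lane] = new_x
            iterations[lane] += 1
    return x, iterations


roots, iterations = batched_sqrt([1.0, 4.0, 1.0e6])
```

The outer loop runs as long as the slowest lane needs, so the cost of a vector of problems is set by its worst case. That is the efficiency question the slide raises for the iteration and deflation loops of ZLAHQR.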

Algorithm 1: ZLAHQR Body
for I = IHI to ILO, step −1 do
    L = ILO;
    for ITS = 0 to 30 do
        Deflation;
        if L ≥ I then break;
        I1 = L; I2 = I;
        if ITS = 10 then Exceptional Shift A;
        else if ITS = 20 then Exceptional Shift B;
        else Wilkinson Shift;
        QR Single Shift;
for I = IHI to ILO, step −1 do
    E[I] = H[I, I];

Algorithm 2: Deflation
for K = I to L+1, step −1 do
    w4 = H[K, K−1];
    if |w4| ≤ smlnum then break;
    w1 = H[K−1, K−1], w2 = H[K, K], tst = |w1| + |w2|;
    if tst = 0 then
        if K−2 >= ILO then w3 = H[K−1, K−2], tst = tst + |w3|;
        if K+1 <= IHI then w6 = H[K+1, K], tst = tst + |w6|;
    if |w4| ≤ ulp · tst then
        w5 = H[K−1, K];
        ab = max(|w4|, |w5|), ba = min(|w4|, |w5|);
        aa = max(|w2|, |w1 − w2|), bb = min(|w2|, |w1 − w2|);
        s = aa + ab;
        if ba · (ab/s) ≤ max(smlnum, ulp · (bb · (aa/s))) then break;
L = K;
if L > ILO then H[L, L−1] = 0;

Algorithm 3: Exceptional Shift A
t = 0.75 · |ℜ(H[L+1, L])| + H[L, L];

Algorithm 4: Exceptional Shift B
t = 0.75 · |ℜ(H[I, I−1])| + H[I, I];

Algorithm 5: Wilkinson Shift
t = H[I, I];
h2 = H[I−1, I]; h3 = H[I, I−1];
u = √h2 · √h3; s = |u|;
if s ≠ 0 then
    h4 = H[I−1, I−1];
    x = (h4 − t) · 0.5;
    s = max(s, |x|);
    z1 = x/s, z2 = u/s;
    y = √(z1² + z2²) · s;
    if |x| > 0 then
        z = x/|x|;
        if ℜ(z)·ℜ(y) + ℑ(z)·ℑ(y) < 0 then y = −y;
    v = u;
    t = t − u · (u/(x + y));
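The Wilkinson shift picks the eigenvalue of the trailing 2×2 block that lies closer to the last diagonal entry. The sketch below follows the steps of Algorithm 5 for a single block, under two assumptions: the function name `wilkinson_shift` and its argument order are hypothetical, and the plain complex modulus `abs()` is used wherever the algorithm takes a magnitude (LAPACK itself uses the cheaper |ℜ| + |ℑ| measure there).

```python
import cmath


def wilkinson_shift(h4, h2, h3, t):
    """Shift for the trailing 2x2 block [[h4, h2], [h3, t]]."""
    u = cmath.sqrt(h2) * cmath.sqrt(h3)
    s = abs(u)
    if s != 0:
        x = (h4 - t) * 0.5
        s = max(s, abs(x))
        z1, z2 = x / s, u / s                  # scale to avoid overflow
        y = cmath.sqrt(z1 * z1 + z2 * z2) * s
        if abs(x) > 0:
            z = x / abs(x)
            # pick the sign of y that makes |x + y| as large as possible
            if z.real * y.real + z.imag * y.imag < 0:
                y = -y
        t = t - u * (u / (x + y))
    return t


h4, h2, h3, d = 2 + 1j, 1 - 1j, 0.5 + 2j, -1 + 0.5j
shift = wilkinson_shift(h4, h2, h3, d)
# residual of the characteristic polynomial of the 2x2 block at the shift
residual = shift * shift - (h4 + d) * shift + (h4 * d - h2 * h3)
other = h4 + d - shift                         # the second eigenvalue
```

Because y² = x² + u², the update t − u²/(x + y) equals t + x − y, which is one of the two eigenvalues (h4 + t)/2 ± y of the block. The sign choice for y keeps the denominator x + y away from zero.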

Algorithm 6: QR Single Shift
for K = M to I−1 do
    if K > M then
        v0 = H[K, K−1]; v1 = H[K+1, K−1];
    if v1 = 0 and ℑ(v0) = 0 then t1 = 0;
    else Householder reflector;
    if K > M then
        H[K, K−1] = v0; H[K+1, K−1] = 0;
    t2 = ℜ(t1)·ℜ(v1) − ℑ(t1)·ℑ(v1);
    Householder reflector from the left;
    Householder reflector from the right;
z = H[I, I−1];
if ℑ(z) ≠ 0 then
    d = |z|; z = z/d; ℑ(z) = −ℑ(z);
    H[I, I−1] = d;
    Scale rows with z and columns with z̄;

Algorithm 7: Householder reflector
norm = √(ℜ(v0)² + ℑ(v0)² + ℜ(v1)² + ℑ(v1)²);
beta = copysign(norm, ℜ(v0));
ℜ(z) = ℜ(v0) + beta; ℑ(z) = ℑ(v0);
ℜ(t) = ℜ(z); ℑ(t) = ℑ(z);
z = 1/z;
v1 = v1 · z;
t = t/beta;
ℜ(v0) = −beta; ℑ(v0) = 0;
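The reflector step can be tried on a single 2-vector. The sketch below transcribes Algorithm 7 together with the `t1 = 0` guard from the QR sweep, and then checks the result under an assumed convention borrowed from LAPACK's ZLARFG: with w = (1, v1') and H = I − t·w·wᴴ, applying Hᴴ to the original vector (v0, v1) should give (−beta, 0). The function name `householder` and its return order are hypothetical.

```python
import math


def householder(v0, v1):
    """Return (new_v0, scaled_v1, t) for the complex 2-vector (v0, v1)."""
    if v1 == 0 and v0.imag == 0:
        return v0, v1, 0j                  # nothing to eliminate
    norm = math.sqrt(v0.real ** 2 + v0.imag ** 2 + v1.real ** 2 + v1.imag ** 2)
    beta = math.copysign(norm, v0.real)
    z = complex(v0.real + beta, v0.imag)   # z = v0 + beta
    t = z / beta
    v1 = v1 * (1 / z)
    return complex(-beta, 0.0), v1, t


v0, v1 = 3 + 4j, 1 - 2j
new_v0, w1, t = householder(v0, v1)

# Apply H^H = I - conj(t) * w * w^H, with w = (1, w1), to the original vector.
s = v0 + w1.conjugate() * v1               # w^H (v0, v1)
first = v0 - t.conjugate() * s
second = v1 - t.conjugate() * w1 * s
```

Taking beta with the sign of ℜ(v0) means z = v0 + beta never suffers cancellation in its real part, which is why the division by z is safe whenever the guard does not trigger.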

Algorithm 8: Householder reflector from the left
i1 = (K + K·n)·m + off;
i2 = ((K+1) + K·n)·m + off;
for j = K to I2 do
    h1 = H[i1]; h2 = H[i2];
    z = h1·t̄1 + h2·t2;
    H[i1] = H[i1] − z;
    H[i2] = H[i2] − z·v1;
    i1 = i1 + n·m; i2 = i2 + n·m;

Algorithm 9: Householder reflector from the right
i1 = (I1 + K·n)·m + off;
i2 = (I1 + (K+1)·n)·m + off;
for j = I1 to min(K+1, I) do
    h1 = H[i1]; h2 = H[i2];
    z = h1·t1 + h2·t2;
    H[i1] = H[i1] − z;
    H[i2] = H[i2] − z·v̄1;
    i1 = i1 + m; i2 = i2 + m;

Array of Structures (AoS) or Structure of Arrays (SoA) or Hybrid

Vectorization requires a uniform coalesced memory access for optimal performance.
Which data layout is preferable?
Array of Structures (AoS) layout is most of the time not suitable for vectorization!
Structure of Arrays (SoA) is the natural data layout for vectorization.
SoA problem: Memory access to far apart addresses leads to performance penalties.
Optimal: Hybrid: Many small SoA blocks adapted to the hardware vector length.
CPU: 8 to 16 matrices in a SoA block. L1 cache friendly layout.
GPU: matrices in a SoA block.
The vector length is a natural tuning parameter that can be adapted for various hardware.

Hybrid Data Layout for Vectorization

Let $k \in \mathbb{N}$, let $A^{(n)} \in \mathbb{C}^{k \times k}$ for $1 \le n \le N$, and let $M = 2^m$ be the vector length. Then we get $N/M$ interleaved matrix blocks.

$$\mathcal{A}^{(1)} = \begin{pmatrix} a^{(1)}_{1,1} & a^{(2)}_{1,1} & \cdots & a^{(M)}_{1,1} \\ \vdots & \vdots & & \vdots \\ a^{(1)}_{1,k} & a^{(2)}_{1,k} & \cdots & a^{(M)}_{1,k} \\ \vdots & \vdots & & \vdots \\ a^{(1)}_{k,1} & a^{(2)}_{k,1} & \cdots & a^{(M)}_{k,1} \\ \vdots & \vdots & & \vdots \\ a^{(1)}_{k,k} & a^{(2)}_{k,k} & \cdots & a^{(M)}_{k,k} \end{pmatrix} \qquad (3)$$

Continue with the remaining $L := N/M$ blocks $\mathcal{A}^{(l)}$, $1 \le l \le L$, and we get the full hybrid data layout: $\mathcal{A}^{(1)}, \ldots, \mathcal{A}^{(L)}$.
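The interleaving amounts to a single index formula. The sketch below is one possible reading, not the talk's implementation: it assumes 0-based indices, N divisible by M, and column-major order inside a block, following the index expression (row + col·n)·m + off that appears in the reflector algorithms. The names `hybrid_index` and `pack` are hypothetical.

```python
def hybrid_index(matrix, row, col, k, M):
    """Flat index of entry (row, col) of matrix number `matrix` (all 0-based)."""
    block, lane = divmod(matrix, M)        # which SoA block, which lane in it
    return block * k * k * M + (row + col * k) * M + lane


def pack(matrices, k, M):
    """Interleave k-by-k matrices (count divisible by M) into one flat list."""
    flat = [0] * (len(matrices) * k * k)
    for n, A in enumerate(matrices):
        for row in range(k):
            for col in range(k):
                flat[hybrid_index(n, row, col, k, M)] = A[row][col]
    return flat


k, M, N = 2, 2, 4
# entry value encodes (matrix, row, col) as 100*matrix + 10*row + col
matrices = [[[100 * n + 10 * r + c for c in range(k)] for r in range(k)]
            for n in range(N)]
flat = pack(matrices, k, M)
```

With this layout, the same entry of M consecutive matrices sits in M adjacent memory slots, so one vector load fetches it for all lanes, while a whole block of M matrices stays within k·k·M contiguous elements.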

Performance Results for Different Libraries and New Code

A test image with pixels gives eigenvalue problems. A benchmark run below calculates the eigenvalues of 3 different images with a total of matrices of dimension

Library Routine Time
ATLAS zgeev s
ATLAS zlahqr s
MKL zgeev s
MKL zlahqr s
NEW (compatible) s
NEW (tuned) s
NEW (deflation) s

Table 1: Comparison of different LAPACK libraries and the new code on a compute node with 2x Intel Xeon 2.0GHz (16 cores) using OpenMP parallelization

Performance Results for Accelerators

Hardware Cores Compiler Time
2x Intel Xeon E g++ openmp s
1x Intel Xeon Phi 5110P 60 icpc offload s
1x Intel Xeon Phi 5110P 60 icpc native s
1x Nvidia GTX pgcpp openacc s
1x Nvidia Tesla K pgcpp openacc s
1x Nvidia GTX nvcc sm s
1x Nvidia Tesla K nvcc sm s

Table 2: Timings for calculating all eigenvalues of companion matrices

All programs are compiled from the same source code! The only specific modifications are pragmas decorating the outer loop and CUDA thread ids replacing the outer loop.

Conclusions

Hardware has changed! Old code isn't efficient any more on modern architectures
Hardware characteristics have significantly changed over the years
FLOPS are free and memory access is expensive
Memory access patterns are very important for efficient code
Multicore CPUs and Manycore GPUs evolve towards a common architecture
Single Instruction Multiple Thread (SIMT) is a good model for vectorization (GPUs)
SIMT emulation for CPUs with AVX vector intrinsics already possible
Vectorization is key to efficient code
Future work: Explicit vectorization library using the SIMT model for CPUs
