Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009

1 III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009

2 Background and motivation
The running time of an algorithm is the sum of three terms:
1. #flops × time-per-flop
2. #words moved / bandwidth (memory communication)
3. #messages × latency (network communication)
Hardware trend: exponentially growing gaps.
Time-per-flop vs. 1/memory bandwidth vs. memory latency: improving 59%/year vs. 23%/year vs. 5.5%/year
Time-per-flop vs. 1/network bandwidth vs. network latency: improving 59%/year vs. 26%/year vs. 15%/year
Goal: re-design matrix algorithms to reduce communication and optimize performance.
Recent work by Dongarra, Demmel, et al., ...
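The three-term model and the yearly improvement rates above can be turned into a small calculation. This is a minimal sketch: the function names and the sample machine parameters are illustrative assumptions, not values from the talk.

```python
def run_time(flops, words, messages, time_per_flop, bandwidth, latency):
    """Three-term cost model: arithmetic + data volume + message count."""
    return flops * time_per_flop + words / bandwidth + messages * latency


def gap_after(years, fast_rate, slow_rate):
    """Ratio by which a quantity improving at fast_rate per year pulls ahead
    of one improving at slow_rate per year after the given number of years."""
    return ((1.0 + fast_rate) / (1.0 + slow_rate)) ** years


# Hypothetical machine: 1 ns per flop, 1e9 words/s, 1 microsecond per message.
t = run_time(flops=2e9, words=1e6, messages=1e3,
             time_per_flop=1e-9, bandwidth=1e9, latency=1e-6)

# Flop rate (59%/year) versus memory bandwidth (23%/year) over a decade.
flops_vs_memory_bw = gap_after(10, 0.59, 0.23)
# Flop rate (59%/year) versus memory latency (5.5%/year) over a decade.
flops_vs_memory_latency = gap_after(10, 0.59, 0.055)
```

With these rates the flop/bandwidth gap grows by roughly an order of magnitude per decade, which is the motivation for trading extra flops for less communication.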

3 Background and motivation
Experts' benchmark: matrix multiplication of the MKL library on Intel Xeon EMT64.
[Figure: Gflops versus matrix size, one curve per core count (1, 2, 4, 8, ... cores) plus a 16xdgemm-seq reference curve.]

4 Matrix multiplication of the ESSL library on an IBM Power 6.
[Figure: Gflops versus matrix size, one curve per core count (1, 2, 4, 8, 16, ... cores) plus a 32xdgemm-seq reference curve.]

5 PLASMA DGETRF on Intel Xeon EMT.
[Figure: Gflops versus matrix size, one curve per core count (1, 2, 4, 6, 8, 10, 12, 14, ... cores) plus a 16xdssssm-seq reference curve.]
Courtesy of Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al)

6 PLASMA DGEQRF on Intel Xeon EMT.
[Figure: Gflops versus matrix size, one curve per core count (1, 2, 4, 6, 8, 10, 12, 14, ... cores) plus a 16xdssrfb-seq reference curve.]
Courtesy of Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al)

7 PLASMA DGETRF on IBM Power 6.
[Figure: Gflops versus matrix size, one curve per core count (1, 2, 4, 8, 16, ... cores) plus an xdssssm-seq reference curve.]
Courtesy of Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al)

8 PLASMA DGEQRF on IBM Power 6.
[Figure: Gflops versus matrix size, one curve per core count (1, 2, 4, 8, 16, ... cores) plus a 32xdssrfb-seq reference curve.]
Courtesy of Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al)

9 Background and motivation
Poor man's benchmark data: DGEMM, DGEQRF and DGEQP3 of MKL 10.2 on Intel Core™ i7 Quad 2.66GHz.
[Figure, two panels: speed in Gflop/s on the quad-core, and scalability for n = 12,000, each comparing DGEMM, DGEQRF and DGEQP3.]
Message: pivoting is expensive!

10 Background and motivation
Poor man's benchmark data: PDGEMM, PDGEQRF and PDGEQPF (pivoted QR) of PBLAS and ScaLAPACK on Cray XT4 (Franklin of NERSC).
[Table: GFlop/s (n = 12000) per processor grid for PDGEMM, PDGEQRF and PDGEQPF; the numeric entries are not recoverable.]
Message: pivoting is even more expensive on hybrid systems!

11 [Figure: GFLOPS (log scale) of pdgemm, pdgeqrf and pdgeqpf versus number of processors (2x2, 4x4, 6x6, 8x8, 12x12, 16x16).]

12 Question and the rest of my talk
Question: how to re-design higher-level matrix algorithms using the fastest matrix operations?
Two case studies:
1. Communication-reducing algorithm for Green's function calculation and application in Quantum Monte Carlo (QMC) simulations
2. Optimizing Halley's iteration for computing the matrix polar decomposition and application in First-Principles Molecular Dynamics (FPMD)

13 Communication-reducing algorithm for Green's function calculation
Joint with Roger Lee (National Tsinghua Univ., Taiwan), Wenbin Chen (Fudan University, China) and Richard Scalettar (Physics, UC Davis)

14 Computational Material Science Simulation and understanding properties of solid-state materials: magnetism, metal-insulator transition, high temperature superconductivity,... Many body simulation on multi-layer lattice using Hubbard model and quantum Monte Carlo simulation QUEST (QUantum Electron Simulation Toolbox): Fortran 90 package for determinant (and hybrid) Monte Carlo simulations

15 QUEST kernel: Green's function (matrix) calculation
$$G = (I + B_L B_{L-1} \cdots B_2 B_1)^{-1}$$
where
$B_i = B V_i = e^{\tau K}\,\mathrm{diag}(e^{\pm\lambda})$,
$K$ is the hopping matrix, an adjacency matrix of the lattice,
$\tau$ is the time discretization parameter,
$L$ is the number of time slices,
$\lambda = \cosh^{-1}(e^{U\tau/2})$,
$U$ is a potential energy parameter for local repulsion between electrons.
Challenge: $B_L B_{L-1} \cdots B_2 B_1$ is extremely ill-conditioned!
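To make the definitions concrete, the factors $B_i$ can be assembled as follows. This is a minimal sketch under stated assumptions: a 1-D periodic chain stands in for the lattice, the ± signs are drawn at random, and the helper names `ring_hopping`, `expm_sym` and `build_B_list` are hypothetical (they are not QUEST routines).

```python
import numpy as np


def ring_hopping(n):
    """Adjacency matrix K of a 1-D periodic chain with n sites (illustrative)."""
    K = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        K[i, j] = 1.0
        K[j, i] = 1.0
    return K


def expm_sym(S):
    """Matrix exponential of a real symmetric matrix via its eigendecomposition."""
    w, Z = np.linalg.eigh(S)
    return (Z * np.exp(w)) @ Z.T


def build_B_list(K, tau, U, L, rng):
    """Return [B_1, ..., B_L] with B_i = exp(tau*K) @ diag(exp(+-lambda))."""
    lam = np.arccosh(np.exp(U * tau / 2.0))
    B = expm_sym(tau * K)
    n = K.shape[0]
    Bs = []
    for _ in range(L):
        signs = rng.choice([-1.0, 1.0], size=n)   # one +-1 per lattice site
        Bs.append(B * np.exp(signs * lam))        # column scaling = B @ diag(.)
    return Bs, lam
```

Multiplying the returned factors together and calling `np.linalg.cond` on the product for increasing `L` is a quick way to observe the ill-conditioning mentioned on the slide.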

16 Physicist's method [Loh et al 89, ... 05]
1. Compute a UDT decomposition of the product,
$$B_L B_{L-1} \cdots B_2 B_1 = U_L D_L T_L,$$
where $D_L$ is a stratified (graded) diagonal matrix.
2. Compute $G = T_L^{-1} (U_L^T T_L^{-1} + D_L)^{-1} U_L^T$.
The pivoted QR decomposition is used for the stratification.
There is a body of work on products of matrices by numerical linear algebraists [Stewart, ...].

17 Physicist's method [Loh et al]
1. Compute the pivoted QR decomposition $B_1 = Q_1 R_1 P_1^T$.
2. Set $U_1 = Q_1$, $D_1 = \mathrm{diag}(R_1)$, $T_1 = D_1^{-1} R_1 P_1^T$.
3. For $i = 2, 3, \ldots, L$:
   Compute $C_i = (B_i U_{i-1}) D_{i-1}$ (column scaling).
   Compute the pivoted QR decomposition $C_i = Q_i R_i P_i^T$.
   Set $U_i = Q_i$, $D_i = \mathrm{diag}(R_i)$, $T_i = (D_i^{-1} R_i)(P_i^T T_{i-1})$.
4. Compute $G = T_L^{-1} (U_L^T T_L^{-1} + D_L)^{-1} U_L^T$.
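The four steps above can be sketched in NumPy. One substitution is made and should be read as an assumption: NumPy has no column-pivoted QR, so the sketch pre-sorts the columns of $C_i$ by decreasing norm and then applies an unpivoted QR. This stands in for the true pivoted QR (DGEQP3) used in the physicist's method and is not equivalent to it. The names `presorted_qr` and `green_udt` are hypothetical.

```python
import numpy as np


def presorted_qr(C):
    """Stand-in for pivoted QR: C[:, perm] = Q @ R, columns sorted by norm."""
    perm = np.argsort(-np.linalg.norm(C, axis=0))
    Q, R = np.linalg.qr(C[:, perm])
    return Q, R, perm


def green_udt(Bs):
    """G = (I + B_L ... B_1)^{-1} through a stratified U D T decomposition.

    Bs is the list [B_1, ..., B_L] of real square matrices.
    """
    Q, R, perm = presorted_qr(Bs[0])
    U, d = Q, np.diag(R).copy()
    T = np.empty_like(R)
    T[:, perm] = R / d[:, None]                  # T_1 = D_1^{-1} R_1 P_1^T
    for B in Bs[1:]:
        C = (B @ U) * d                          # column scaling by D_{i-1}
        Q, R, perm = presorted_qr(C)
        d_new = np.diag(R).copy()
        T = (R / d_new[:, None]) @ T[perm, :]    # (D_i^{-1} R_i)(P_i^T T_{i-1})
        U, d = Q, d_new
    X = np.linalg.solve(T.T, U).T                # X = U^T T^{-1}
    return np.linalg.solve(T, np.linalg.solve(X + np.diag(d), U.T))
```

For well-conditioned test products the result can be compared directly with `np.linalg.inv(np.eye(n) + B_L @ ... @ B_1)`.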

18 However, pivoting is very expensive on multicore
Intel Core™ Quad 2.4GHz. DGEMM: matrix-matrix multiplication. DGEQRF: QR decomposition. DGEQP3: pivoted QR decomposition.
[Figure, two panels: speed (Gflop/s versus problem size) and scalability (Gflop/s versus number of processors) for DGEMM, DGEQRF and DGEQP3.]

19 Pivoting is even more expensive on distributed systems
Cray XT4 (Franklin of NERSC)
[Figure: GFLOPS (log scale) of pdgemm, pdgeqrf and pdgeqpf versus number of processors (2x2, 4x4, 6x6, 8x8, 12x12, 16x16).]

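20 SOF algorithm - without the pivoting
Structured orthogonal factorization (SOF) algorithm:
1. Set $M_2 = I$, $A_2 = B_1$.
2. For $i = 2, 3, \ldots, L$:
   (a) compute the QR decomposition
   $$\begin{bmatrix} M_i \\ -B_i \end{bmatrix} = \begin{bmatrix} Q^{(i)}_{11} & Q^{(i)}_{12} \\ Q^{(i)}_{21} & Q^{(i)}_{22} \end{bmatrix} \begin{bmatrix} R_i \\ 0 \end{bmatrix}$$
   (b) update
   $$\begin{bmatrix} A_{i+1} \\ M_{i+1} \end{bmatrix} = \begin{bmatrix} (Q^{(i)}_{12})^T & \\ & (Q^{(i)}_{22})^T \end{bmatrix} \begin{bmatrix} A_i \\ I \end{bmatrix}$$
3. Compute $G = (M_{L+1} + A_{L+1})^{-1} M_{L+1}$, using the final iterates of step 2(b).
Computational kernel: QR decomposition without pivoting.
Price: SOF takes about 3× more flops than the Physicist's method.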
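The SOF recurrence maps directly onto a full (square) QR factorization. The following is a minimal NumPy sketch that forms the whole Q-factor explicitly, in the spirit of the simplest implementation; the function name `green_sof` is hypothetical and the stacked matrix is taken to be $[M_i; -B_i]$.

```python
import numpy as np


def green_sof(Bs):
    """G = (I + B_L ... B_1)^{-1} by structured orthogonal factorization.

    Bs is the list [B_1, ..., B_L]; only unpivoted QR is used.
    """
    n = Bs[0].shape[0]
    M = np.eye(n)            # M_2 = I
    A = Bs[0].copy()         # A_2 = B_1
    for B in Bs[1:]:         # i = 2, ..., L
        # Full 2n-by-2n Q-factor of the stacked matrix [M_i; -B_i].
        Q, _ = np.linalg.qr(np.vstack([M, -B]), mode="complete")
        Q12, Q22 = Q[:n, n:], Q[n:, n:]
        A = Q12.T @ A        # A_{i+1} = (Q12^(i))^T A_i
        M = Q22.T            # M_{i+1} = (Q22^(i))^T
    return np.linalg.solve(M + A, M)
```

Only the right half of each Q-factor is used, which is what motivates computing $Q_{12}$ and $Q_{22}$ directly instead of forming all of $Q$.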

21 SOF algorithm: motivation and correctness
Motivation: connection between the Green's function and the inverse of a block p-cyclic matrix:
$$G = (I + B_L B_{L-1} \cdots B_2 B_1)^{-1} = (L,L)\text{-subblock of } M^{-1},$$
where
$$M = \begin{bmatrix} I & & & & B_1 \\ -B_2 & I & & & \\ & -B_3 & I & & \\ & & \ddots & \ddots & \\ & & & -B_L & I \end{bmatrix}.$$
Correctness: by direct verification.
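The connection with the block p-cyclic matrix can be checked numerically on a small example. This is a minimal sketch, assuming the sign convention in which the subdiagonal blocks are $-B_i$ and the top-right block is $+B_1$; the helper name `p_cyclic` is hypothetical.

```python
import numpy as np


def p_cyclic(Bs):
    """Assemble the block p-cyclic matrix with I on the diagonal, -B_i on the
    block subdiagonal (i = 2..L) and B_1 in the top-right corner."""
    L, n = len(Bs), Bs[0].shape[0]
    M = np.eye(L * n)
    M[:n, (L - 1) * n:] += Bs[0]
    for i in range(1, L):
        M[i * n:(i + 1) * n, (i - 1) * n:i * n] = -Bs[i]
    return M


def green_from_p_cyclic(Bs):
    """Read G off as the (L, L) block of the inverse of the p-cyclic matrix."""
    n = Bs[0].shape[0]
    return np.linalg.inv(p_cyclic(Bs))[-n:, -n:]
```

Inverting the full $Ln \times Ln$ matrix is only meant as a check; it is far more expensive than the algorithms discussed here.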

22 SOF algorithm: correctness
By the QR decomposition, we have
$$(Q^{(i)}_{12})^T M_i - (Q^{(i)}_{22})^T B_i = 0.$$
Since $M_2 = I$ and $M_i = (Q^{(i-1)}_{22})^T$ for $i > 2$, the above equations can be rewritten as
$$(Q^{(2)}_{12})^T = (Q^{(2)}_{22})^T B_2,$$
$$(Q^{(i)}_{12})^T (Q^{(i-1)}_{22})^T = (Q^{(i)}_{22})^T B_i \quad \text{for } i \ge 3.$$
Note that the $Q^{(i)}_{22}$ are invertible, so
$$(Q^{(2)}_{12})^T = (Q^{(2)}_{22})^T B_2,$$
$$(Q^{(i)}_{12})^T = (Q^{(i)}_{22})^T B_i (Q^{(i-1)}_{22})^{-T} \quad \text{for } i \ge 3.$$

23 Hence
$$(Q^{(L)}_{12})^T \cdots (Q^{(3)}_{12})^T (Q^{(2)}_{12})^T = (Q^{(L)}_{22})^T B_L \cdots B_3 B_2.$$
By steps 1 and 2(b) of SOF, the final iterate is
$$A_{L+1} = (Q^{(L)}_{12})^T \cdots (Q^{(3)}_{12})^T (Q^{(2)}_{12})^T B_1.$$
Hence
$$A_{L+1} = (Q^{(L)}_{22})^T B_L \cdots B_3 B_2 B_1.$$
In summary, with $M_{L+1} = (Q^{(L)}_{22})^T$,
$$(M_{L+1} + A_{L+1})^{-1} M_{L+1} = \left[ (Q^{(L)}_{22})^T + (Q^{(L)}_{22})^T B_L \cdots B_2 B_1 \right]^{-1} (Q^{(L)}_{22})^T = (I + B_L \cdots B_2 B_1)^{-1} = G.$$

24 SOF algorithm: implementation
QR decomposition in LAPACK:
1. DGEQRF: computes the QR decomposition; Q is stored implicitly.
2. DORGQR: forms the full Q-factor explicitly,
$$Q = I - V T V^T,$$
where $T$ is an $n \times n$ upper triangular matrix and $V$ is $2n \times n$ (DLARFT).

25 SOF algorithm: Implementation 1
Call DGEQRF for the QR decomposition, then explicitly form the full Q-factor by DORGQR.

26 SOF algorithm: Implementation 2
Modify DORGQR to compute $Q^{(i)}_{12}$ and $Q^{(i)}_{22}$ directly. Let $V$ be conformally partitioned,
$$V = \begin{bmatrix} V_u \\ V_d \end{bmatrix}. \quad (1)$$
Then the right half of the Q-factor can be computed by
$$Q^{(i)}_{12} = -V_u T V_d^T, \qquad Q^{(i)}_{22} = I - V_d T V_d^T.$$
Note that $V_u$ is a lower triangular matrix, which is used to reduce the cost of the matrix multiplication.

27 SOF algorithm: Implementation 3
Modify DGEQRF to compute the T matrix recursively. If
$$T = \begin{bmatrix} T_{11} & T_{12} & \cdots & T_{1b} \\ & T_{22} & \cdots & T_{2b} \\ & & \ddots & \vdots \\ & & & T_{bb} \end{bmatrix},$$
DGEQRF explicitly produces the diagonal blocks $T_{ii}$ in order.
Consider two consecutive block Householder transformations $H_1$ and $H_2$. Then
$$H_1 H_2 = (I - V_1 T_1 V_1^T)(I - V_2 T_2 V_2^T) = I - \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} T_1 & -T_1 V_1^T V_2 T_2 \\ 0 & T_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}.$$
The T matrix of the merged Householder transformation can be computed with the additional computation of $T_1 V_1^T V_2 T_2$.
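The merging formula is a purely algebraic identity, so it can be checked with arbitrary blocks. The sketch below is illustrative only: `merge_wy` is a hypothetical helper, not a LAPACK routine.

```python
import numpy as np


def merge_wy(V1, T1, V2, T2):
    """Merge H1 = I - V1 T1 V1^T and H2 = I - V2 T2 V2^T into I - V T V^T."""
    k1, k2 = T1.shape[0], T2.shape[0]
    V = np.hstack([V1, V2])
    T = np.zeros((k1 + k2, k1 + k2))
    T[:k1, :k1] = T1
    T[:k1, k1:] = -T1 @ (V1.T @ V2) @ T2   # the only extra computation
    T[k1:, k1:] = T2
    return V, T
```

When `T1` and `T2` are upper triangular, the merged `T` is upper triangular as well, so the merge can be applied repeatedly to build the full T matrix block column by block column.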

28 Performance I
Green's matrix on a lattice (n = 1024): $G = (I + B_{96} B_{95} \cdots B_2 B_1)^{-1}$
[Tables: CPU elapsed time and GFlop/s rate on 1, 2 and 4 cores, with rows for Loh's method and two of the SOF implementations; the numeric entries are not recoverable.]
Intel Core™ i7 Quad 2.66GHz, ifort Fortran and MKL

29 Performance II
Green's matrix on a lattice (n = 4096): $G = (I + B_{12} B_{11} \cdots B_2 B_1)^{-1}$
[Tables: CPU elapsed time and GFlop/s rate for processor grids (4x4), (8x8), (16x16) and (32x32), with rows for Loh's method and two of the SOF implementations; the numeric entries are not recoverable.]
Intel Core™ i7 Quad 2.66GHz, ifort Fortran and MKL

30 Further reading - I
Green's function calculation, which may appear in different forms, is one of the fundamental problems in nano-scale simulations. The physical background and the related numerical linear algebra problems are discussed in detail in the following reference:
Z. Bai, W. Chen, R. Scalettar, I. Yamazaki, Numerical Methods for Quantum Monte Carlo Simulations of the Hubbard Model, in Multi-Scale Phenomena in Complex Fluids, edited by T. Y. Hou et al, Higher Education Press, 2009 (114 pages)
A pre-print of the reference is available at bai/bcsy09.pdf

31 Optimized Halley's iteration for computing the matrix polar decomposition
Joint with Yuji Nakatsukasa (Applied Math, UC Davis) and Francois Gygi (Applied Science/CS, UC Davis)

32 Electronic structure calculations Investigate the properties of solids, liquids, biomolecules, and nanoparticles... First-principles molecular dynamics (FPMD) based on the plane-wave, pseudopotential formalism Qbox: a C++/MPI implementation of FPMD for massively parallel computers (F. Gygi et al)

33 Subspace alignment in self-consistent iteration
Given wave functions $Y(t-1)$ and $Y(t)$ at time steps $t-1$ and $t$:
Compute the QR decomposition $Y(t) = \hat{Y}(t) R$ (tall and skinny QR).
Solve the Procrustes problem $\min_Q \| Y(t-1) - \hat{Y}(t) Q \|_F$.
Compute $\tilde{Y}(t) = 2\hat{Y}(t) Q - Y(t-1)$.
Obtain the trial wavefunction $\hat{Y}(t+1)$ for the time step $t+1$ by QR: $\tilde{Y}(t) = \hat{Y}(t+1) R$.
Ref: [Arias et al 92, Edelman et al 99, Fattebert and Gygi 04]

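34 Matrix polar decomposition
Given $A \in \mathbb{C}^{n \times n}$, compute the polar decomposition
$$A = UH, \quad \text{where } U^H U = I \text{ and } H^H = H \succeq 0.$$
Application: orthogonal Procrustes problem in the subspace alignment of the SCF iteration for solving the Kohn-Sham equation in electronic structure calculation:
$$\min_U \| X_i - X_i U \|$$
(as extracted; the accents or indices that distinguish the two $X_i$ are missing in the source).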
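For reference, the polar decomposition and the Procrustes solution can both be obtained from an SVD. This is a baseline sketch for checking results, not the iteration-based approach the talk develops; the function names are hypothetical, and the Procrustes helper assumes the minimization is over unitary $Q$.

```python
import numpy as np


def polar_svd(A):
    """Polar decomposition A = U H from the SVD A = W diag(s) V^H."""
    W, s, Vh = np.linalg.svd(A)
    U = W @ Vh                       # unitary polar factor
    H = (Vh.conj().T * s) @ Vh       # Hermitian positive semidefinite factor
    return U, H


def procrustes_rotation(Y_hat, Y_prev):
    """Unitary Q minimizing ||Y_prev - Y_hat Q||_F: the polar factor of
    Y_hat^H Y_prev."""
    Q, _ = polar_svd(Y_hat.conj().T @ Y_prev)
    return Q
```

Given an orthonormal `Y_hat` and the previous wave functions `Y_prev`, the extrapolated block from the subspace alignment step would be `2 * Y_hat @ procrustes_rotation(Y_hat, Y_prev) - Y_prev`.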

35 Newton's iteration
Newton's iteration:
$$X_{k+1} = \tfrac{1}{2}\left( X_k + X_k^{-H} \right), \quad X_0 = A.$$
$X_k \to U$ as $k \to \infty$.
Singular values $\sigma_i(X_k) \to 1$ as $k \to \infty$.
The convergence is slow when $A$ is ill-conditioned.
Scaled Newton's iteration:
$$X_{k+1} = \tfrac{1}{2}\left( \zeta_k X_k + (\zeta_k X_k)^{-H} \right), \quad X_0 = A.$$
The inverse $X_k^{-1}$ has to be computed explicitly. ...
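A minimal sketch of both variants for real square nonsingular matrices follows. The slide does not say how $\zeta_k$ is chosen; the Frobenius-norm scaling used here, $\zeta_k = (\|X_k^{-1}\|_F / \|X_k\|_F)^{1/2}$, is one common choice and is an assumption of this example, as are the function name and the stopping test.

```python
import numpy as np


def polar_newton(A, scaled=True, tol=1e-12, max_iter=100):
    """Newton iteration for the orthogonal polar factor of a real matrix A.

    Returns the final iterate and the number of iterations performed.
    """
    X = np.array(A, dtype=float)
    for k in range(1, max_iter + 1):
        X_inv = np.linalg.inv(X)                 # explicit inverse, as noted
        if scaled:
            zeta = np.sqrt(np.linalg.norm(X_inv, "fro") / np.linalg.norm(X, "fro"))
        else:
            zeta = 1.0
        X_new = 0.5 * (zeta * X + X_inv.T / zeta)
        if np.linalg.norm(X_new - X, "fro") <= tol * np.linalg.norm(X_new, "fro"):
            return X_new, k
        X = X_new
    return X, max_iter
```

Running both variants on a matrix with condition number around 1e4 shows the effect of scaling on the iteration count: the unscaled iteration roughly halves the largest singular value per step until it is close to 1.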

36 Newton's iteration variant
Scaled Newton's iteration variant (SNV):
$$X_{k+1} = 2\eta_k X_k \left( I + \eta_k^2 X_k^H X_k \right)^{-1}, \quad Y_0 = \eta_0 A.$$
Inverse-free implementation: SNV can be implemented using a QR decomposition (discussed later).
Unfortunately, even the inverse-free implementation is not numerically stable, as discovered independently by [Crudge 98] and [Byers and Xu 01].

37 Halley's iteration
Halley's iteration:
$$X_{k+1} = X_k (3I + X_k^H X_k)(I + 3 X_k^H X_k)^{-1}, \quad X_0 = A.$$
There is a family of iterations based on rational function approximation.
Theorem. $X_k \to U$ and the asymptotic rate is cubic.
The initial steps could still be slow. Consider the 2 × 2 matrix $X_0 = A$ [matrix entries missing in the source]. The $k$th Halley iterate is given by
$$X_k = \begin{bmatrix} 1 & \\ & x_k \end{bmatrix}, \quad x_k = \frac{x_{k-1}(3 + x_{k-1}^2)}{1 + 3x_{k-1}^2}.$$
It takes 24 iterations to satisfy $\| I - X_{24} \| <$ [tolerance missing in the source].
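The slow start is easy to reproduce with the scalar recurrence alone. In the sketch below the starting value `x0 = 1e-10` and the tolerance `1e-15` are assumptions chosen for illustration, since the slide's own values did not survive extraction.

```python
def halley_scalar(x0, tol=1e-15, max_iter=200):
    """Iterate x <- x (3 + x^2) / (1 + 3 x^2) until |1 - x| <= tol.

    Returns the final value and the number of iterations used.
    """
    x, k = x0, 0
    while abs(1.0 - x) > tol and k < max_iter:
        x = x * (3.0 + x * x) / (1.0 + 3.0 * x * x)
        k += 1
    return x, k


# Hypothetical starting singular value (not taken from the slide).
x_final, n_iter = halley_scalar(1e-10)
```

While $x_k$ is small the recurrence behaves like $x_k \approx 3x_{k-1}$, so most iterations are spent climbing towards 1 by a factor of three per step; the cubic convergence only takes over in the last few steps.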

38 Dynamically weighted Halley's iteration
Dynamically weighted Halley's (DWH) iteration:
$$X_{k+1} = X_k (a_k I + b_k X_k^H X_k)(I + c_k X_k^H X_k)^{-1},$$
where $X_0 = A/\alpha$ and $\alpha \ge \|A\|_2$.
Question: how to dynamically choose $a_k$, $b_k$ and $c_k$ to optimize the convergence rate?

39 How to dynamically determine $a_k$, $b_k$ and $c_k$? First step
Let $l_0$ be chosen such that
$$[\sigma_{\min}(X_0), \sigma_{\max}(X_0)] \subseteq [l_0, 1] \subset (0, 1].$$
Then $\sigma_i(X_1) = g_0(\sigma_i(X_0))$, and
$$[\sigma_{\min}(X_1), \sigma_{\max}(X_1)] \subseteq \Big[ \min_{l_0 \le x \le 1} g_0(x),\ \max_{l_0 \le x \le 1} g_0(x) \Big],$$
where
$$g_0(x) = x\,\frac{a_0 + b_0 x^2}{1 + c_0 x^2}.$$
Rational optimization problem: find $a_0$, $b_0$, $c_0$ such that
(1) $0 < g_0(x) \le 1$ for $l_0 < x \le 1$;
(2) the max-min property holds: $\max_{a_0, b_0, c_0} \big\{ \min_{l_0 \le x \le 1} g_0(x) \big\}$.
Update $l_1 = \min_{l_0 \le x \le 1} g_0(x)$, such that
$$[\sigma_{\min}(X_1), \sigma_{\max}(X_1)] \subseteq [l_1, 1] \subset (0, 1].$$

40 How to dynamically determine $a_k$, $b_k$ and $c_k$? General case
At the $(k+1)$st iteration, we have
$$[\sigma_{\min}(X_k), \sigma_{\max}(X_k)] \subseteq [l_k, 1] \subset (0, 1]$$
and
$$[\sigma_{\min}(X_{k+1}), \sigma_{\max}(X_{k+1})] \subseteq \Big[ \min_{l_k \le x \le 1} g_k(x),\ \max_{l_k \le x \le 1} g_k(x) \Big],$$
where
$$g_k(x) = x\,\frac{a_k + b_k x^2}{1 + c_k x^2}.$$
Rational optimization problem: find $a_k$, $b_k$ and $c_k$ such that
(1) $0 < g_k(x) \le 1$ for $l_k < x \le 1$;
(2) the max-min property holds: $\max_{a_k, b_k, c_k} \big\{ \min_{l_k \le x \le 1} g_k(x) \big\}$.
Update $l_{k+1} = \min_{l_k \le x \le 1} g_k(x)$.

41 Solving the rational optimization problem
For the special three-parameter case, we have developed a direct method to obtain analytic expressions for $a_k$, $b_k$ and $c_k$: tedious, but doable.
In general, for high-order Halley's iteration, we can first reformulate the problem and then use an SDP solver, say SeDuMi or YALMIP.
Applications of the rational optimization problem in shape optimization of transfer functions [Nie et al 09].
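The slides do not state the analytic expressions. The sketch below uses the closed form published for the three-parameter case in the Nakatsukasa, Bai and Gygi paper listed under Further reading II, reproduced here from memory and therefore to be treated as an assumption: $a = h(l)$, $b = (a-1)^2/4$, $c = a + b - 1$, with $h(l) = \sqrt{1+d} + \tfrac12\sqrt{8 - 4d + 8(2-l^2)/(l^2\sqrt{1+d})}$ and $d = \sqrt[3]{4(1-l^2)/l^4}$, and the update $l \leftarrow l\,(a + b l^2)/(1 + c l^2)$. The function names, the tolerance and the starting bound `1e-10` are illustrative.

```python
import math


def dwh_coefficients(l):
    """Weights (a, b, c) for a lower bound l on the singular values, 0 < l <= 1."""
    l2 = l * l
    d = (4.0 * (1.0 - l2) / (l2 * l2)) ** (1.0 / 3.0)
    sqd = math.sqrt(1.0 + d)
    a = sqd + 0.5 * math.sqrt(8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * sqd))
    b = (a - 1.0) ** 2 / 4.0
    c = a + b - 1.0
    return a, b, c


def dwh_scalar_count(l0, tol=1e-15, max_iter=50):
    """Track the lower bound l_k until 1 - l_k <= tol; return (l, iterations)."""
    l, k = l0, 0
    while 1.0 - l > tol and k < max_iter:
        a, b, c = dwh_coefficients(l)
        # Clamp at 1 so rounding cannot push the bound above the upper limit.
        l = min(1.0, l * (a + b * l * l) / (1.0 + c * l * l))
        k += 1
    return l, k


# Hypothetical starting bound, i.e. a condition number of about 1e10.
l_final, n_iter = dwh_scalar_count(1e-10)
```

At $l = 1$ the weights reduce to $(3, 1, 3)$, i.e. to the plain Halley iteration, while for a small lower bound the first few steps take much larger jumps than the factor of three per step of unweighted Halley.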

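42 Convergence theorem and number of iterations
Theorem. $X_k \to U$ and the asymptotic rate is cubic.
For the 2 × 2 example, 5 iterations suffice to satisfy $\| I - X_5 \| <$ [tolerance missing in the source].
In general, we have the following theoretical prediction for the number of iterations:
[Table: iteration counts of SN and DWH versus $\kappa(A)$; the numeric entries are not recoverable.]

43 Inverse-free implementation
Theorem. Given the QR decomposition
$$\begin{bmatrix} \eta A \\ I \end{bmatrix} = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R,$$
then
$$Q_1 Q_2^H = \eta A (I + \eta^2 A^H A)^{-1}.$$
[Zha & Zhang 96, Higham 07]

44 Inverse-free implementation - DWH
Dynamically weighted Halley's (DWH) iteration:
$$X_{k+1} = X_k (a_k I + b_k X_k^H X_k)(I + c_k X_k^H X_k)^{-1} = \frac{b_k}{c_k} X_k + \left( a_k - \frac{b_k}{c_k} \right) X_k (I + c_k X_k^H X_k)^{-1}.$$
QR-based implementation of DWH:
$$\begin{bmatrix} \sqrt{c_k}\, X_k \\ I \end{bmatrix} = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R, \qquad X_{k+1} = \frac{b_k}{c_k} X_k + \frac{1}{\sqrt{c_k}} \left( a_k - \frac{b_k}{c_k} \right) Q_1 Q_2^H.$$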
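The equivalence of the QR-based update and the explicit formula can be checked on one step. This is a minimal NumPy sketch; the function names are hypothetical and the weights are passed in rather than computed.

```python
import numpy as np


def dwh_step_direct(X, a, b, c):
    """One DWH step using an explicit inverse (for comparison only)."""
    n = X.shape[1]
    Z = X.conj().T @ X
    return X @ (a * np.eye(n) + b * Z) @ np.linalg.inv(np.eye(n) + c * Z)


def dwh_step_qr(X, a, b, c):
    """One DWH step using only a QR decomposition of [sqrt(c) X; I]."""
    m, n = X.shape
    Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
    Q1, Q2 = Q[:m], Q[m:]
    return (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.conj().T)
```

The product $Q_1 Q_2^H$ does not depend on the sign convention of the QR routine, so any thin QR factorization can be used. Setting `(a, b, c) = (3, 1, 3)` gives an inverse-free step of the plain Halley iteration.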

45 Example: verification of theoretical iteration count
Diagonal matrices $A$.
[Table: iteration numbers and $\| X_k - I \|_F$ versus $\kappa(A)$ for Halley, Gander and DWH; the numeric entries are not recoverable.]

46 Example: Q-DWH and scaled Newton's iteration
Random matrices with varying condition numbers.
[Table: minimum and maximum iteration counts and residuals of Q-DWH and SN for each condition number; the numeric entries are not recoverable.]
Multicore implementation is in progress.

47 Further reading - II
The inverse-free implementation of Halley's iteration is described in the following paper:
Y. Nakatsukasa, Z. Bai and F. Gygi, Optimizing Halley's iteration for computing the matrix polar decomposition, submitted, 2009
It is available on our class website.

48 Concluding remarks
1. Our case studies illustrate the ideas of designing communication-reducing numerical linear algebra algorithms to balance flops, communication and stability, and to optimize the performance on modern computer architectures.
2. The hardware trend means the time has come to do it!
3. Lots of prior work in the numerical linear algebra community, but more open problems.
4. Why stop at numerical linear algebra?

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March