SLIM. University of British Columbia

1 Accelerating an Iterative Helmholtz Solver Using Reconfigurable Hardware
Art Petrenko
M.Sc. Defence, April 9, 2014
Seismic Laboratory for Imaging and Modelling (SLIM)
Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia (UBC)

2 Oh, by the way: I have a stutter.

3 Seismic Wave Simulation

4 Seismic Exploration for Oil and Gas

5 Full-waveform Inversion
Seismic wavefield (u), Earth model (m)

6 Full-waveform inversion is SLOW

7 The Accelerators Have Arrived Top 10 of Top 500 Supercomputers

8 FPGAs: Reconfigurable Hardware Accelerators

9 The Punchline

10 Modelling Seismic Waves: Mathematical Formulation

11 Modelling Seismic Waves: The Wave Equation
For a constant-density isotropic medium, the wave equation can be written as the Helmholtz equation:
$$(\omega^2 m + \nabla^2)\,u = q$$
Frequency: $\omega$. Laplacian: $\nabla^2$. Earth model: $m$. Seismic source (right-hand side): $q$. Wavefield (iterate): $u$.

12 Modelling Seismic Waves: Discretization [Operto, 2007]
$$A(m, \omega)\,u = q$$
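To make the step from the continuous equation to a matrix system concrete, here is a minimal sketch. It is not the 3D discretization cited on the slide: it assumes a 1D grid with uniform spacing `h`, second-order finite differences, and a wavefield that vanishes outside the grid. The function names `helmholtz_rows_1d` and `apply_rows` and the example values are hypothetical.

```python
def helmholtz_rows_1d(m, omega, h):
    """Return the rows of A(m, omega) for (omega^2 m + d^2/dx^2) u = q in 1D.

    Each row is a dict {column index: value}; u = 0 is assumed outside the grid.
    """
    n = len(m)
    rows = []
    for j in range(n):
        # Diagonal: omega^2 * m_j plus the centre weight of the second difference.
        row = {j: omega**2 * m[j] - 2.0 / h**2}
        if j > 0:
            row[j - 1] = 1.0 / h**2
        if j < n - 1:
            row[j + 1] = 1.0 / h**2
        rows.append(row)
    return rows


def apply_rows(rows, u):
    """Compute the matrix-vector product A u from the sparse rows."""
    return [sum(v * u[c] for c, v in row.items()) for row in rows]


# Hypothetical example: 5 grid points, constant earth model m.
A = helmholtz_rows_1d(m=[0.25] * 5, omega=2.0, h=1.0)
q = apply_rows(A, [0.0, 0.0, 1.0, 0.0, 0.0])
```

Each row touches only a grid point and its neighbours, which is the sparsity that row-by-row methods exploit.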

13 Solving the Helmholtz System

14 The Kaczmarz Algorithm [Kaczmarz, 1937]
$\mathrm{DKSWP}(A, u, q, \lambda)$
Adapted from [van Leeuwen, 2012]

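15 The Kaczmarz Algorithm: Equivalent to SSOR-NE [Björck and Elfving, 1979]
Double Kaczmarz sweep on the original system: $Au = q$
One iteration of SSOR on the normal equations: $AA^{*}y = q$, $A^{*}y = u$
Both are computed as:
$$u_{k+1} = u_k + \lambda\,\bigl(b_i - \langle a_i, u_k\rangle\bigr)\,\frac{a_i}{\|a_i\|^2}, \qquad k: 1 \to 2N, \quad i: 1 \to N,\ N \to 1$$
where $a_i$ is the $i$-th row of $A$, $b_i$ is the $i$-th entry of the right-hand side $q$, and $\lambda$ is the relaxation parameter.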
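The row-projection update on this slide can be sketched in a few lines. This is an illustrative dense-matrix version, not the accelerator kernel: it assumes `A` is a list of rows, takes the relaxation parameter as `lam`, and projects along the complex conjugate of each row (which is the row itself for real matrices). The name `dkswp` follows the slides; the example system is hypothetical.

```python
def dkswp(A, u, q, lam):
    """One double Kaczmarz sweep: rows 1..N forward, then N..1 backward."""
    u = list(u)  # work on a copy so the caller's iterate is untouched
    n_rows = len(A)
    order = list(range(n_rows)) + list(range(n_rows - 1, -1, -1))
    for i in order:
        a = A[i]
        # Residual of equation i: right-hand side minus row times iterate.
        residual = q[i] - sum(a_j * u_j for a_j, u_j in zip(a, u))
        norm_sq = sum(abs(a_j) ** 2 for a_j in a)
        scale = lam * residual / norm_sq
        # Move the iterate towards the hyperplane defined by row i.
        for j, a_j in enumerate(a):
            u[j] += scale * a_j.conjugate()
    return u


# Hypothetical 2 x 2 system whose exact solution is [1, 1].
A = [[2.0, 1.0], [1.0, 3.0]]
q = [3.0, 4.0]
u = [0.0, 0.0]
for _ in range(50):
    u = dkswp(A, u, q, lam=1.0)
```

Repeating the sweep drives `u` towards the solution of this small consistent system; the convergence rate depends on the angles between the rows.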

16 Kaczmarz + CG = CGMN [Björck and Elfving, 1979]

17 CGMN: Solves for Fixed Point of Kaczmarz Row Projections
$$\mathrm{DKSWP}(A, u, q, \lambda) = Q_1 \cdots Q_N\, Q_N \cdots Q_1\, u + Rq = Qu + Rq$$
Assume $u$ is a solution and re-arrange:
$$(I - Q)\,u = Rq$$
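The splitting of a double sweep into a part that acts on the iterate and a part that acts on the right-hand side can be checked numerically. The sketch below assumes the same dense, illustrative `dkswp` as a stand-in for the sweep (relaxation parameter passed as `lam`); the test system and starting vector are hypothetical.

```python
def dkswp(A, u, q, lam):
    """One double Kaczmarz sweep (forward over the rows, then backward)."""
    u = list(u)
    n_rows = len(A)
    for i in list(range(n_rows)) + list(range(n_rows - 1, -1, -1)):
        a = A[i]
        residual = q[i] - sum(a_j * u_j for a_j, u_j in zip(a, u))
        scale = lam * residual / sum(abs(a_j) ** 2 for a_j in a)
        for j, a_j in enumerate(a):
            u[j] += scale * a_j.conjugate()
    return u


A = [[2.0, 1.0], [1.0, 3.0]]
q = [3.0, 4.0]
u = [0.5, -1.0]
lam = 1.5
zeros = [0.0, 0.0]

Qu = dkswp(A, u, zeros, lam)   # sweep with zero right-hand side: action of Q on u
Rq = dkswp(A, zeros, q, lam)   # sweep from a zero iterate: action of R on q
full = dkswp(A, u, q, lam)     # the complete sweep

# Largest componentwise difference between the sweep and Qu + Rq.
gap = max(abs(f - (a + b)) for f, a, b in zip(full, Qu, Rq))

# The exact solution [1, 1] is left unchanged by a sweep (a fixed point).
fixed = dkswp(A, [1.0, 1.0], q, lam)
```

Because the sweep is affine in the iterate and the right-hand side, `gap` is at rounding level, and the solution of the original system is a fixed point.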

18 Contribution of This Work

19 Compute Node Overview [Maxeler Technologies, 2011]
Adapted from [Pell, 2013]
Algorithm 1 CGMN (Björck and Elfving [4])
Input: A, u, q, λ
 1: Rq ← DKSWP(A, 0, q, λ)
 2: r ← Rq − u + DKSWP(A, u, 0, λ)
 3: p ← r
 4: while ‖r‖² > tol do
 5:   s ← (I − Q)p = p − DKSWP(A, p, 0, λ)
 6:   α ← ‖r‖² / ⟨p, s⟩
 7:   u ← u + αp
 8:   r ← r − αs
 9:   β ← ‖r‖²_curr / ‖r‖²_prev
10:   ‖r‖²_prev ← ‖r‖²_curr
11:   p ← r + βp
12: end while
Output: u
Kernel: running on accelerator
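Algorithm 1 translates almost line for line into code. The sketch below is a plain CPU version for small dense systems, not the accelerated implementation: `dkswp` is an illustrative dense sweep, the relaxation parameter is passed as `lam`, and the `max_iter` guard and default tolerance are additions that do not appear in the algorithm. The example system is hypothetical.

```python
def dkswp(A, u, q, lam):
    """One double Kaczmarz sweep (forward over the rows, then backward)."""
    u = list(u)
    n_rows = len(A)
    for i in list(range(n_rows)) + list(range(n_rows - 1, -1, -1)):
        a = A[i]
        residual = q[i] - sum(a_j * u_j for a_j, u_j in zip(a, u))
        scale = lam * residual / sum(abs(a_j) ** 2 for a_j in a)
        for j, a_j in enumerate(a):
            u[j] += scale * a_j.conjugate()
    return u


def inner(x, y):
    """Inner product, conjugating the first argument."""
    return sum(x_i.conjugate() * y_i for x_i, y_i in zip(x, y))


def cgmn(A, u, q, lam, tol=1e-20, max_iter=1000):
    """Conjugate gradients on (I - Q) u = Rq, following Algorithm 1."""
    zero_u = [0.0] * len(u)
    zero_q = [0.0] * len(q)
    Rq = dkswp(A, zero_u, q, lam)                                # line 1
    Qu = dkswp(A, u, zero_q, lam)
    r = [rq_i - u_i + qu_i for rq_i, u_i, qu_i in zip(Rq, u, Qu)]  # line 2
    p = list(r)                                                  # line 3
    norm_prev = inner(r, r).real
    for _ in range(max_iter):            # guard added for this sketch
        if norm_prev <= tol:                                     # line 4
            break
        Qp = dkswp(A, p, zero_q, lam)
        s = [p_i - qp_i for p_i, qp_i in zip(p, Qp)]             # line 5
        alpha = norm_prev / inner(p, s).real                     # line 6
        u = [u_i + alpha * p_i for u_i, p_i in zip(u, p)]        # line 7
        r = [r_i - alpha * s_i for r_i, s_i in zip(r, s)]        # line 8
        norm_curr = inner(r, r).real
        beta = norm_curr / norm_prev                             # line 9
        norm_prev = norm_curr                                    # line 10
        p = [r_i + beta * p_i for r_i, p_i in zip(r, p)]         # line 11
    return u


# Hypothetical 2 x 2 system whose exact solution is [1, 1].
solution = cgmn([[2.0, 1.0], [1.0, 3.0]], [0.0, 0.0], [3.0, 4.0], lam=1.5)
```

Every iteration costs one double sweep plus a handful of vector operations.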

20 Low levels of abstraction are scary

21 Design at high level of abstraction

22 Implementation Details

23 Layout of 3D Wavefields in 1D Memory
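One common way to lay a 3D wavefield out in 1D memory is sketched below. The slide does not state which dimension varies fastest, so the ordering here (x fastest, then y, then z) is an assumption, as are the function names and the grid dimensions used in the example.

```python
def linear_index(x, y, z, nx, ny):
    """Map 3D grid coordinates to a 1D offset, with x varying fastest."""
    return x + nx * (y + ny * z)


def grid_coords(index, nx, ny):
    """Invert linear_index: recover (x, y, z) from a 1D offset."""
    z, rest = divmod(index, nx * ny)
    y, x = divmod(rest, nx)
    return x, y, z


# Hypothetical grid with nx = 432 and ny = 240 points in the two fastest dimensions.
step_x = linear_index(1, 0, 0, nx=432, ny=240)
step_y = linear_index(0, 1, 0, nx=432, ny=240)
step_z = linear_index(0, 0, 1, nx=432, ny=240)
```

Under this layout, neighbours in x are adjacent in memory, while neighbours in y and z are `nx` and `nx * ny` elements apart. That distance is why stencil computations buffer whole planes of the two faster dimensions.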

24 Buffering: Overcoming Latency of Memory Access

25 Pipelining: Overcoming Latency of Computation

26 Pipelining: Overcoming Latency of Computation

27 Pipelining: Overcoming Latency of Computation

28 Memory Access: 384 bytes / burst
[Table: number of bits in a real number, number of bits in a complex number, and complex numbers per burst, for single precision and double precision]
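The relationship between burst size and number format is simple arithmetic. The sketch below assumes a complex number is stored as two real numbers with no padding and uses the standard IEEE widths of 32 and 64 bits; only the 384-byte burst size comes from the slide, and the function name is hypothetical.

```python
BURST_BYTES = 384  # burst size stated on the slide


def complex_per_burst(bits_per_real):
    """Number of whole complex values that fit in one memory burst."""
    bits_per_complex = 2 * bits_per_real  # real part + imaginary part
    return (BURST_BYTES * 8) // bits_per_complex


single = complex_per_burst(32)  # IEEE single precision
double = complex_per_burst(64)  # IEEE double precision
```

Halving the width of a real number doubles the number of complex values delivered per burst.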

29 Backward Sweep: Double Buffering

30 Number Representation
[Figure: matrix row storage efficiency versus number of bits in a real number, with the number of bits used in this work marked]

31 Results

32 End-to-end Execution Time
[Figure: one Intel Xeon E core versus Intel core + FPGA-based accelerator]

33 Kaczmarz Sweeps: No Longer the Bottleneck
[Figure: Kaczmarz sweeps versus other CGMN operations (inner products, vector addition, etc.)]

34 Effect of Matrix Row Ordering on CGMN Convergence
[Figure: sequential (1 to N) ordering versus accelerator ordering]

35 FPGA Resource Usage

36 Recent Work: Multiple Kaczmarz Sweeps / CGMN Iteration (432 x 240 x 25 system)
[Figure labels: Total; Kaczmarz sweeps on accelerator; CGMN on CPU host]

37 Avoiding Future Communication Bottlenecks
38.4 GB/s: memory bandwidth limit
2 GB/s: PCIe bandwidth limit

38 The Next Step
Problem: On-chip memory (4 MB) limits block size to 300 x 300 in the two faster dimensions.
Solution: Implement domain decomposition for larger systems.

39 Straightforward Extension
Goal: Systematically use all 4 accelerators.
Solution: Solve several forward problems at once.

40 Future Work
Problem: Kaczmarz sweeps now account for only approximately 10% of CGMN time.
Solution: Port all of CGMN to the DFE.

41 Future Work
Fact: Reading A from memory limits optimizations like increasing FPGA frequency.
Result: Read only earth model m and generate A on the DFE.

42 Future Work
Problem: Domain size limited by memory size: 24 GB.
Solution: Parallelize CGMN to CARP-CG [Gordon & Gordon, 2010].

43 Conclusion
Frequency-domain wave simulation has been implemented using reconfigurable hardware.
The dataflow computing paradigm yields a speed-up of 2x over one Intel Xeon core.

44 Acknowledgements
Thank you to: Felix Herrmann, Henryk Modzelewski, Diego Oriato, Simon Tilbury, Tristan van Leeuwen, Eddie Hung, Lina Miao, Rafael Lago, my Master's committee members: Michael Friedlander, Christian Schoof, my external examiner: Steve Wilton, Maxeler Technologies, and everyone in the SLIM group!
This work was financially supported in part by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN ) and the Collaborative Research and Development Grant DNOISE II (CDRP J ).
This research was carried out as part of the SINBAD II project with support from the following organizations: BG Group, BGP, BP, Chevron, ConocoPhillips, CGG, ION GXT, Petrobras, PGS, Statoil, Total SA, WesternGeco, Woodside.

45 References
Å. Björck and T. Elfving. Accelerated projection methods for computing pseudoinverse solutions of systems of linear equations. BIT Numerical Mathematics, 19(2), 1979.
D. Gordon and R. Gordon. Component-averaged row projections: A robust, block-parallel scheme for sparse linear systems. SIAM Journal on Scientific Computing, 27(3).
D. Gordon and R. Gordon. CARP-CG: A robust and efficient parallel solver for linear systems, applied to strongly convection dominated PDEs. Parallel Computing, 36(9), 2010.
F. Grüll, M. Kunz, M. Hausmann, and U. Kebschull. An implementation of 3D electron tomography on FPGAs. In Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on, pages 1-5.
S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres, 35, 1937.
S. Kaczmarz. Approximate solution of systems of linear equations. International Journal of Control, 57(6). (translation)
T. van Leeuwen, D. Gordon, R. Gordon, and F. J. Herrmann. Preconditioning the Helmholtz equation via row-projections. In EAGE technical program. EAGE, 2012.
H. Meuer, E. Strohmaier, J. Dongarra, and H. Simon. Top 500 supercomputer sites, November.
S. Operto, J. Virieux, P. Amestoy, J.-Y. L'Excellent, L. Giraud, and H. B. H. Ali. 3D finite-difference frequency-domain modeling of visco-acoustic wave propagation using a massively parallel direct solver: A feasibility study. Geophysics, 72(5):SM195-SM211, 2007.
O. Pell, J. Bower, R. Dimond, O. Mencer, and M. J. Flynn. Finite-difference wave propagation modeling on special-purpose dataflow machines. Parallel and Distributed Systems, IEEE Transactions on, 24(5), 2013.
