On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations

1 On the Use of a Many-core Processor for Computational Fluid Dynamics Simulations
Sebastian Raase, Tomas Nordström
Halmstad University, Sweden
hh.se

2 Preface
- based on my master thesis
- supported by: Swedish Knowledge Foundation (via the ESCHER project (1)) and Volvo Penta AB
(1) Embedded Streaming Computations on Heterogeneous Energy-efficient architectures

3 I will talk about:
- Lattice Boltzmann algorithm
- Adapteva Epiphany and Parallella board
- Implementation
- Results
- Conclusion

4 Computational Fluid Dynamics
- uses numerical methods to analyze fluid flows
- both gases and liquids are fluids
- widespread applications in aerodynamics, architecture, automotive, chemistry, meteorology, navy
- computationally very intensive
- parallelization is extremely important
- focus on a single, particle-based algorithm: Lattice Boltzmann

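5 Lattice Boltzmann algorithm (I)
- based on the Boltzmann equation (late 19th century):

$$\frac{\partial f}{\partial t} = \left(\frac{\partial f}{\partial t}\right)_{\text{collision}} + \left(\frac{\partial f}{\partial t}\right)_{\text{diffusion}} + \left(\frac{\partial f}{\partial t}\right)_{\text{external}}$$

- f = f(x, v, t) describes the particle probability density in phase space (i.e. at a specific position, velocity and time)
- the collision term is particularly hard to solve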

6 Lattice Boltzmann algorithm (II)
- phase space f(x, v, t) is discretized (lattice models)
- discrete positions, velocities and time (thus, also angles)
- named DmQn (m: dimensions, n: number of discrete velocities)
- focus on two models: D2Q9 and D3Q19
- figures: D2Q9 (single node), D3Q19 (single node)
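To make the DmQn naming concrete, here is a minimal sketch of the D2Q9 model in Python. The velocity ordering and the lattice weights are the usual textbook choices and are assumptions here, since the slides do not list them; the helper names `density` and `velocity` are hypothetical.

```python
# D2Q9: 2 dimensions, 9 discrete velocities per node
# (one rest velocity, four axis-aligned, four diagonal).
D2Q9_VELOCITIES = [
    (0, 0),
    (1, 0), (0, 1), (-1, 0), (0, -1),
    (1, 1), (-1, 1), (-1, -1), (1, -1),
]

# Textbook lattice weights (assumption: not taken from the slides).
D2Q9_WEIGHTS = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def density(node):
    """Macroscopic density: sum of the nine distribution values of a node."""
    return sum(node)


def velocity(node):
    """Macroscopic velocity: first moment of the distribution over density."""
    rho = density(node)
    ux = sum(f * cx for f, (cx, _) in zip(node, D2Q9_VELOCITIES)) / rho
    uy = sum(f * cy for f, (_, cy) in zip(node, D2Q9_VELOCITIES)) / rho
    return ux, uy


# A node at rest with unit density is simply the weight vector.
rest_node = list(D2Q9_WEIGHTS)
```

A node therefore stores n values (9 for D2Q9, 19 for D3Q19), which is what drives the memory figures later in the talk.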

8 Adapteva Epiphany (I)
- two-dimensional mesh network-on-chip consisting of eCore processor nodes
- low power (16 cores @ 800 MHz < 1 W)
- single shared, flat 32-bit address space
- 1 MiB address space per node, 64x64 (= 4096) nodes maximum
- figure: mesh address format (row and column in the upper bits, starting at bit 31; local address in the lower bits, down to bit 0)
- figure: mesh structure (source: Architecture Ref.)
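The address layout follows from the numbers on this slide: 1 MiB per node needs 20 local address bits, and 64 rows by 64 columns need 6 bits each, which fills the 32-bit address. The sketch below composes and splits such an address; the function names are hypothetical and the code only illustrates the bit arithmetic.

```python
ROW_BITS = 6      # 64 rows
COL_BITS = 6      # 64 columns
LOCAL_BITS = 20   # 1 MiB of address space per node


def global_address(row, col, local):
    """Compose a 32-bit mesh address from row, column and local address."""
    if not (0 <= row < 1 << ROW_BITS):
        raise ValueError("row out of range")
    if not (0 <= col < 1 << COL_BITS):
        raise ValueError("column out of range")
    if not (0 <= local < 1 << LOCAL_BITS):
        raise ValueError("local address out of range")
    return (row << (COL_BITS + LOCAL_BITS)) | (col << LOCAL_BITS) | local


def split_address(addr):
    """Split a 32-bit mesh address into (row, column, local address)."""
    local = addr & ((1 << LOCAL_BITS) - 1)
    col = (addr >> LOCAL_BITS) & ((1 << COL_BITS) - 1)
    row = (addr >> (COL_BITS + LOCAL_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col, local
```

For example, `hex(global_address(32, 8, 0x100))` evaluates to `'0x80800100'`.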

9 Adapteva Epiphany (II)
- eCores are 32-bit RISC processors with IALU (integer) and FPU (float)
- single-precision FPU only
- only 32 KiB local memory per node, divided into four independent 8 KiB banks
- timers count events, allowing clock-cycle-precise runtime measurements
- programmable in ANSI C (gcc)
- figure: mesh node (eCore with register file, IALU and FPU; 2-channel DMA controller; 32 KiB local memory; 2 timers; mesh controller)

10 Parallella-16 board
- currently available reference platform for the Epiphany architecture
- Xilinx Zynq (1 GHz, dual-core ARM Cortex-A9) as host
- 16-core Epiphany E16G3 chip, connected using FPGA logic
- 32 MiB of (Epiphany-)external shared memory
- figure: Parallella-16 board, Epiphany chip marked in red

12 Implementation (I)
- two approaches: collision-streaming and boundary-bulk
- applications are divided into a host part and an Epiphany part
- host part: single-threaded ARM Linux application running on the Zynq
  - loads the eCores with code and starts them
  - reads lattice data (results) from shared memory
  - creates density/velocity grayscale images and GIF animations
  - writes lattice data and time measurements to ASCII files

13 Implementation (II)
- Epiphany part: single-threaded, but running on all active eCores simultaneously
  - works on a part of the lattice (block), which is always kept in local memory
  - after each iteration, the result may be copied to shared memory (and from there to the host)
  - only next-neighbor communication (except for shared memory)
  - all cores run in lockstep, using barriers
- figure: blocking approaches, with x/y and x/y/z axes (bold: domain boundaries)
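The block decomposition can be sketched as a mapping from lattice coordinates to cores. The code below is an illustration under assumptions, not the implementation from the talk: it assumes equally sized 2D blocks on a 4x4 core grid, no wrap-around at the domain boundaries, and diagonal neighbors counted as next neighbors (D2Q9 has diagonal velocities). The names `block_owner` and `neighbours` are hypothetical.

```python
def block_owner(x, y, block_w, block_h):
    """Map a global lattice coordinate to (core_row, core_col, local_x, local_y)."""
    core_col, local_x = divmod(x, block_w)
    core_row, local_y = divmod(y, block_h)
    return core_row, core_col, local_x, local_y


def neighbours(core_row, core_col, rows=4, cols=4):
    """Cores adjacent to the given core in the mesh.

    Diagonals are included and there is no wrap-around (both assumptions).
    """
    result = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = core_row + dr, core_col + dc
            if (dr, dc) != (0, 0) and 0 <= r < rows and 0 <= c < cols:
                result.append((r, c))
    return result
```

With 26x26 blocks, `block_owner(30, 5, 26, 26)` returns `(0, 1, 4, 5)`: the node lives on the core in row 0, column 1, at local position (4, 5). A corner core has three neighbors, an interior core eight.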

15 Results (I)
- excellent computational scalability
- figure panels: (a) collision-streaming, (b) boundary-bulk

16 Results (II)
- good computational performance in 2D (45 16 cores), much less impressive in 3D ( cores)
- but there is still room for improvement
- comparable in terms of energy efficiency (table; numeric cells lost in extraction):
  columns: system type | clock freq. in MHz | number of FPUs | power in W | MLU/s* total | MLU/s per FPU | MLU/s per W
  rows: Parallella | Tesla C | Xeon E
(* MLU/s: millions of lattice node updates per second)
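The metrics in the table header can be computed as follows. This is a small sketch of the arithmetic only; the function names are hypothetical and the numbers in the usage note are made-up inputs, not measurements from the talk.

```python
def mlups(nodes, iterations, seconds):
    """Millions of lattice node updates per second (MLU/s)."""
    return nodes * iterations / seconds / 1e6


def mlups_per_fpu(total_mlups, n_fpus):
    """Throughput normalized by the number of floating-point units."""
    return total_mlups / n_fpus


def mlups_per_watt(total_mlups, power_w):
    """Throughput normalized by power draw, as an energy-efficiency measure."""
    return total_mlups / power_w
```

For instance, a hypothetical run that updates a 104x104 lattice 1000 times in 0.5 s would be evaluated as `mlups(104 * 104, 1000, 0.5)`, and the result divided by 16 FPUs or by the measured power to fill the last two columns.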

17 Results (III)
- bulk-based optimization ineffective in 3D (too few bulk nodes)
- limited local memory (32 KiB total, split 8 KiB code / 24 KiB data)
- lattice sizes far too small
- table: achievable lattice sizes (cells partly lost in extraction)
  columns: lattice model | node size (bytes) | max. nodes per block | maximum block size | lattice size (16 cores)
  surviving cells: D2Q x x 104; D3Q x6x7, 28 x 6 x, x 35 x 3, 12 x 35 x 12; D3Q19 76
- figure: node update rates (16 cores)
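The memory limit can be checked with a short calculation. The sketch assumes one single-precision value (4 bytes) per discrete velocity, which matches the 76 bytes given for D3Q19, and it assumes the full 24 KiB of data memory holds lattice nodes only, so the results are upper bounds. The function names are hypothetical.

```python
import math

KIB = 1024
DATA_BYTES = 24 * KIB  # data share of the 32 KiB local memory


def node_size(n_velocities, bytes_per_value=4):
    """Bytes per lattice node: one single-precision value per velocity."""
    return n_velocities * bytes_per_value


def max_nodes_per_block(n_velocities, data_bytes=DATA_BYTES):
    """Upper bound on the number of nodes that fit into one core's data memory."""
    return data_bytes // node_size(n_velocities)


def max_square_block(n_velocities):
    """Edge length of the largest square 2D block that fits."""
    return math.isqrt(max_nodes_per_block(n_velocities))
```

Under these assumptions a D2Q9 node takes 36 bytes, at most 682 nodes fit per core, and the largest square block is 26x26. On a 4x4 core grid that gives a 104x104 lattice, which agrees with the 104 that survives in the table. For D3Q19 the bound is 323 nodes per core.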

18 Results (IV)
- very small bandwidth to shared memory
  - measured: 75 MiB/s
  - theoretical maximum is 600 MiB/s, or 150 MiB/s with non-optimal accesses*
- not enough to stream the lattice each iteration
- no overlap possible between calculation and transmission
* currently enforced by FPGA logic
- figure: computation / host copy comparison (2D, 24x24 block size, 16 cores)
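A rough estimate shows why this bandwidth is a bottleneck. The sketch below computes how long one full copy of the lattice to shared memory takes at a given bandwidth. It assumes that every node is copied with all of its values and that a D2Q9 node takes 36 bytes (9 single-precision values); neither detail is stated on the slide, and the function names are hypothetical.

```python
MIB = 1024 * 1024


def lattice_bytes(block_w, block_h, cores, node_bytes):
    """Total lattice size in bytes when every core holds one block."""
    return block_w * block_h * cores * node_bytes


def copy_time_ms(total_bytes, bandwidth_mib_s):
    """Time in milliseconds to move total_bytes at the given bandwidth."""
    return total_bytes / (bandwidth_mib_s * MIB) * 1e3


# 2D case from the slide: 24x24 blocks on 16 cores, assumed 36 bytes per node.
total = lattice_bytes(24, 24, 16, 36)
for bandwidth in (75, 150, 600):
    print(f"{bandwidth:3d} MiB/s: {copy_time_ms(total, bandwidth):.2f} ms per copy")
```

Because calculation and transmission cannot overlap, this copy time adds directly to every iteration in which the lattice is sent to the host.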

20 Conclusion
- computation:
  - excellent scalability, very good energy efficiency, fair performance
  - still room for improvement
- memory:
  - too small local memory, too small external bandwidth
  - too small lattices for real-world problems
- currently not suitable for the Lattice Boltzmann algorithm, but: this work used the very first publicly available Epiphany chip

21 The End.

22 Host / Epiphany connection
- figure labels: ARM, Epiphany III, eLink, FPGA logic, ARM, Xilinx Zynq

23 External memory access starvation
