Optimization of a parallel 3d-FFT with non-blocking collective operations Chair of Computer Architecture Technical University of Chemnitz Département de Physique Théorique et Appliquée Commissariat à l Énergie Atomique/DAM 3rd International ABINIT Developer Workshop Liège, Belgium 29th January 2007
1 Introduction Short introduction to non-blocking collectives 2 3 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results
Short introduction to non-blocking collectives 1 Introduction Short introduction to non-blocking collectives 2 3 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results
Non-blocking Collective Operations Short introduction to non-blocking collectives Advantages - Overlap leverage hardware parallelism (e.g. InfiniBand TM ) overlap similar to non-blocking point-to-point Usage? extension to MPI-2 mixture between non-blocking ptp and collectives uses MPI_Requests and MPI_Test/MPI_Wait MPI_Ibcast(buf1, p, MPI_INT, 0, comm, &req); MPI_Wait(&req);
Short introduction to non-blocking collectives Availability Prototype LibNBC: requires ANSI-C and MPI-2 LibNBC dowload and documentation: http://www.unixer.de/nbc Documentation T. HOEFLER, J. SQUYRES, W. REHM, AND A. LUMSDAINE: A Case for Non-Blocking Collective Operations. In Frontiers of High Performance Computing and Networking, pages 155-164, Springer Berlin, ISBN: 978-3-540-49860-5 Dec. 2006 T. HOEFLER, J. SQUYRES, G. BOSILCA, G. FAGG, A. LUMSDAINE, AND W. REHM: Non-Blocking Collective Operations for MPI-2. Open Systems Lab, Indiana University. presented in Bloomington, IN, USA, Aug. 2006
Performance Benefits? Short introduction to non-blocking collectives Time in microseconds 12000 10000 8000 6000 4000 2000 MPI Latency NBC Latency (max. overlap) 0 0 10000 20000 30000 40000 50000 60000 Datasize Figure: MPI_Alltoall latency on the tantale cluster@cea
1 Introduction Short introduction to non-blocking collectives 2 3 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results
Domain Decomposition Discretized 3D Domain (FFT-Box) z y x
Domain Decomposition Distributed 3d Domain z 0 1 2 y x
1D Transformation Introduction 1D Transformation in y Direction z y x
1D Transformation Introduction 1D Transformation in z Direction z y x
1D Transformation Introduction 1D Transformation in x Direction z y x
1 Introduction Short introduction to non-blocking collectives 2 3 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results
Non-blocking 3D-FFT Derivation from normal implementation distribution identical to normal 3D-FFT first FFT in y direction and local data transpose Design Goals to Minimize Communication Overhead start communication as early as possible achieve maximum overlap time Solution start NBC_Ialltoall as soon as first xz-plane is ready calculate next xz-plane start next communication accordingly... collect multiple xz-planes (A2A data size)
Non-blocking 3D-FFT Derivation from normal implementation distribution identical to normal 3D-FFT first FFT in y direction and local data transpose Design Goals to Minimize Communication Overhead start communication as early as possible achieve maximum overlap time Solution start NBC_Ialltoall as soon as first xz-plane is ready calculate next xz-plane start next communication accordingly... collect multiple xz-planes (A2A data size)
Non-blocking 3D-FFT Derivation from normal implementation distribution identical to normal 3D-FFT first FFT in y direction and local data transpose Design Goals to Minimize Communication Overhead start communication as early as possible achieve maximum overlap time Solution start NBC_Ialltoall as soon as first xz-plane is ready calculate next xz-plane start next communication accordingly... collect multiple xz-planes (A2A data size)
Transformation of in z Direction Data already transformed in y direction z y x 1 block = 1 double value (3x3x3 grid)
Transformation of in z Direction Transform first xz plane in z direction in parallel z y x pattern means that data was transformed in y and z direction
Transformation of in z Direction start NBC_Ialltoall of first xz plane and transform second plane z 2 2 2 " " " 4 4 4... 1 1 2 2!! " " 0 0 0 3 3 4 4 R R R - - -... N N N / / / 0 0 0 L L L - - -... / / / P P P Q Q R R - - -... J J J M M N N 0 0 0 / / / 0 0 0 F F K K L L O O O P P P - - - I I I... J J J / / / E E 0 0 0 F F O O O P P P I I I J J J E E F F O O O P P P I I I J J J E E F F & & O O O % % P P P I I I J J J E E F F & & % % 8 8 8 > > > D D & & 7 7 7 % % 8 8 8 = = = > > > C C D D & & 7 7 7 % % 8 8 8 = = = > > > C C D D & & 7 7 7 8 8 8 = = = > > > C C D D ( ( 7 7 7 ' ' 8 8 8 = = = > > > C C D D ( ( ' ' ( ( 5 5 5 ' ' 6 6 6 G G G H H H A A B B ( ( 5 5 5 ' ' 6 6 6 G G G H H H A A B B ( ( 5 5 5 6 6 6 G G G H H H A A B B 5 5 5 6 6 6 G G G H H H A A B B y x cyan color means that data is communicated in the background
Transformation of in z Direction start NBC_Ialltoall of second xz plane and transform third plane z T T T b b b V V V S S T T a a b b U U V V W š š š ] ] ] ^ ^ ^ Œ Œ Œ g g h h œ œ œ _ _ ` ` X ] ] ] ^ ^ ^ g g _ _ W X r r r š š ] ] ] ^ ^ ^ x x x Œ Œ h h ` ` g g h h j j œ œ _ _ ` ` W X q q q r r r ] ] ] ^ ^ ^ w w w x x x g g h h i i j j _ _ ` ` W X q q q r r r w w w i i j j Z Z W X q q q r r r x x x w w w x x x ƒ ƒ ƒ i i j j Y Y Z Z e f q q q r r r w w w x x x ƒ ƒ ƒ i i j j Y Y Z Z e f ƒ ƒ ƒ v v Y Y Z Z e f ƒ ƒ ƒ u u v v Y Y Z Z e f u u Ž Ž Ž v v \ \ e f { { { u u Ž Ž Ž v v [ [ \ \ c d { { { u u Ž Ž Ž v v [ [ \ \ c d { { { Ž Ž Ž l l [ [ \ \ c d { { { k k Ž Ž Ž l l [ [ \ \ c d k k l l c d y y y k k z z z Š Š l l y y y k k z z z Š Š l l y y y z z z Š Š y y y z z z Š Š y x data of two planes is not accessible due to communication
Introduction Transformation of in z Direction start communication of the third plane and... z ž ž ž ž ž ««Ÿ Ÿ À À À Â Â Â ª ª ¾ ¾ ¾ À À µ µ È È È ª ª Á Á Â Â ª ª Ò Ò Ò ½ ½ ½ ¾ ¾ ¾ Þ Þ Þ Ç Ç Ç È È È Ö Ö Ö ³ ³ ª ª Ñ Ñ Ò Ò ½ ½ ½ ¾ ¾ ¾ Ç Ç Ç ³ ³ ä ä ä Ý Ý Þ Þ ½ ½ ½ ¾ ¾ ¾ æ æ æ È È È Õ Õ Ö Ö Ç Ç Ç È È È à à ã ã ã ³ ³ ä ä ä ± ² ½ ½ ½ å å å ¾ ¾ ¾ æ æ æ Ç Ç Ç ß ß È È È à à ã ã ã ³ ³ ä ä ä å å å æ æ æ ß ß à à ± ² ã ã ã ä ä ä å å å æ æ æ ß ß à à Ä Ä ± ² ã ã ã Ã Ã ä ä ä å å å æ æ æ ß ß à à Ä Ä ± ² Ã Ã Ä Ä ± ² É É É Ê Ê Ê Ô Ô Ô Ü Ü Ã Ã Ä Ä É É É Ê Ê Ê Ó Ó Ó Ô Ô Ô Û Û Ü Ü Ã Ã Ä Ä É É É Ê Ê Ê Ó Ó Ó Ô Ô Ô Û Û Ü Ü É É É Ê Ê Ê Ó Ó Ó Ô Ô Ô Û Û Ü Ü É É É Ê Ê Ê Ó Ó Ó Ô Ô Ô Û Û Ü Ü á á á Ï Ï Ï â â â Ð Ð Ð Ù Ù Ú Ú á á á Ï Ï Ï â â â Ð Ð Ð Ù Ù Ú Ú á á á â â â Ï Ï Ï Ð Ð Ð Ù Ù Ú Ú á á á â â â Ï Ï Ï Ð Ð Ð Ù Ù Ú Ú y x we need the first xz plane to go on...
Transformation of in x Direction... so NBC_Wait for the first NBC_Ialltoall! z è è è ö ö ö ê ê ê ç ç è è õ õ ö ö é é ê ê ë ñ ñ ñ ò ò ò û û ü ü ó ó ô ô ì ñ ñ ñ ò ò ò û û ó ó ë ì ÿ ÿ ñ ñ ñ ò ò ò ü ü ô ô û û ü ü þ þ ó ó ô ô ë ì ñ ñ ñ ò ò ò ) ) ) û û ü ü!!! ý ý þ þ ó ó ô ô ë ì ý ý / / / ( ( ) ) þ þ î î ë ì 1 1 1!! + +... ý ý / / / þ þ í í î î ù ú 0 0 0 1 1 1 * * + +... ý ý / / / 0 0 0 1 1 1 * * + + þ þ í í î î ù ú... / / / 0 0 0 1 1 1 * * + + í í î î ù ú... / / / 0 0 0 1 1 1 * * + + í í î î ù ú ð ð ù ú ' ' ï ï ð ð ø & & ' ' ï ï ð ð ø & & ' ' ï ï ð ð ø & & ' ' ï ï ð ð ø & & ' ' ø,,, - - - $ $ % %,,, - - - $ $ % %,,, - - - $ $ % %,,, - - - $ $ % % y x and transform first plane (pattern means xyz transformed)
Transformation of in x Direction Wait and transform second xz plane z 3 3 3 A A A 5 5 5 2 2 3 3 @ @ A A 4 4 5 5 6 [ [ [ < < < = = = K K K F F G G > >?? 7 < < < = = = F F G G T T U U U > > 6 7 S S S Z Z [ [ J J K K < < < = = = Y Y Y?? F F G G I I U U > >?? 6 7 c c c R R R S S S < < < = = = m m m X X X Y Y Y F F G G g g g H H I I > >?? 6 7 b b c c R R R S S S X X X H H s s s l l m m I I 9 9 6 7 R R R S S S u u u Y Y Y f f g g X X X Y Y Y o o r r r H H s s s I I 8 8 9 9 D E R R R t t t S S S u u u X X X n n Y Y Y o o r r r H H s s s t t t u u u n n o o I I 8 8 9 9 D E r r r s s s t t t u u u n n o o W W 8 8 9 9 D E r r r V V s s s t t t u u u n n o o W W 8 8 9 9 D E V V W W ; ; D E ^ ^ ^ _ _ _ e e e k k V V W W : : ; ; B C ^ ^ ^ _ _ _ d d d e e e j j k k V V W W : : ; ; B C ^ ^ ^ _ _ _ d d d e e e j j k k M M : : ; ; B C ^ ^ ^ _ _ _ d d d e e e j j k k L L M M : : ; ; B C ^ ^ ^ _ _ _ d d d e e e j j k k L L M M B C p p p ` ` ` L L q q q a a a h h i i M M p p p ` ` ` L L q q q a a a h h i i M M p p p q q q ` ` ` a a a h h i i p p p q q q ` ` ` a a a h h i i y x first plane s data could be accessed
Introduction Transformation of in x Direction wait and transform last xz plane } } } } } Š Š ~ ~ Ž Ž ˆ ˆ Ž Ž š š ˆ ˆ Ÿ Ÿ Ÿ Ž Ž ˆ ˆ ž ž ž Ÿ Ÿ Ÿ Ž Ž ª ª ª «««ˆ ˆ ž ž ž ³ ³ ³ ƒ ƒ Á Á Á Ÿ Ÿ Ÿ ««ž ž ž Ÿ Ÿ Ÿ ² ² ² ³ ³ ³ ƒ ƒ À À À Á Á Á ž ž ž ¾ ¾ Ÿ Ÿ Ÿ ² ² ² ³ ³ ³ À À À Á Á Á ¾ ¾ ƒ ƒ ² ² ² ³ ³ ³ À À À Á Á Á ¾ ¾ ƒ ƒ ² ² ² œ œ ³ ³ ³ À À À Á Á Á ¾ ¾ ƒ ƒ œ œ ¹ ¹ ¹ ½ ½ œ œ Œ ¹ ¹ ¹ ¼ ¼ ½ ½ œ œ Œ ¹ ¹ ¹ ¼ ¼ ½ ½ Œ ¹ ¹ ¹ ¼ ¼ ½ ½ Œ z ¹ ¹ ¹ ¼ ¼ ½ ½ Œ ± ± ± ± ± ± ± ± ± ± ± ± y x done! 1 complete 1D-FFT overlaps a communication
Parameter and Problems Tile factor # of z-planes to gather before NBC_Ialltoall is started very performance critical! not easily predictable Window size and MPI_Test interval Window size = number of outstanding communications not very performance critical fine-tuning MPI_Test progresses internal state of MPI unneccessary in threaded NBC implementation (future) Problems? NOT cache friendly :-(
Parameter and Problems Tile factor # of z-planes to gather before NBC_Ialltoall is started very performance critical! not easily predictable Window size and MPI_Test interval Window size = number of outstanding communications not very performance critical fine-tuning MPI_Test progresses internal state of MPI unneccessary in threaded NBC implementation (future) Problems? NOT cache friendly :-(
Parameter and Problems Tile factor # of z-planes to gather before NBC_Ialltoall is started very performance critical! not easily predictable Window size and MPI_Test interval Window size = number of outstanding communications not very performance critical fine-tuning MPI_Test progresses internal state of MPI unneccessary in threaded NBC implementation (future) Problems? NOT cache friendly :-(
3D-FFT Benchmark Results (small input) Speedup 35 30 25 20 15 10 5 ideal NBC MPI 0 0 5 10 15 20 25 30 35 Nodes Communication Overhead (%) 50 45 40 35 30 25 20 15 10 5 NBC MPI 0 0 5 10 15 20 25 30 35 Nodes tantale @CEA: 128 2 GHz Quad Opteron 844 nodes Interconnect: InfiniBand TM System size 128x128x128 (1 CPU 0.75 s)
1 Introduction Short introduction to non-blocking collectives 2 3 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results
Cache optimal implementation cache optimality by yz transforming plane by plane (in cache)! z y x we need all yz-planes before we can start x transform :-(
Applying Non-blocking collectives Pipelined communication retain plane-by-plane transform simple pipelined scheme start A2A of plane as soon as it is transformed wait for all before x transform A2A overlapped with computation of remaining planes last A2A blocks (immediate wait :-( ) Issues less overap potential plane coalescing to adjust datasize new parameter: pipeline depth (# of A2As)
3D-FFT Benchmark Results (small input) 10 FORW BACK System tantale @CEA 2 GHz Quad Opteron InfiniBand TM Parameters 128x128x128 16 CPUs, 4 nodes 1 CPU 28 s 8 planes/proc 16kb/plane Time (s) 9 8 7 6 5 4 3 2 1 0 1 2 4 0 1 2 4 coalesced planes (0 = original impl.)
Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results 1 Introduction Short introduction to non-blocking collectives 2 3 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results
Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results Avoidance of the transformation of zeroes ABINIT Implementation changed routines back, forw, back_wf and forw_wf some minor changes to others (input params...) The routines back_wf and forw_wf avoid transformation of zeroes less computation and less communication changed communication (boxcut=2): forw_wf: nz/p planes, each has nx/2 ny/(2 p) doubles back_wf: nz/(2 p) planes, each has nx/2 ny/p doubles New Parameters all routines have different # planes three parameters
Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results 1 Introduction Short introduction to non-blocking collectives 2 3 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results
Autotuning of parameters Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results Three new input parameters fftplanes_fourdp, fftplanes_forw_wf, and fftplanes_back_wf default = 0 standard MPI implementation performance critical complicated to determine by hand Autotuning automatically determine them at runtime each planes parameter is benchmarked (after warmup round) fastest is chosen automatically relatively accurate but problems with jitter
Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results 1 Introduction Short introduction to non-blocking collectives 2 3 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results
Microbenchmarks Time (s) 10 9 8 7 6 5 4 3 2 1 Introduction Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results FORW BACK FORW_WF BACK_WF 0 1 2 4 0 1 2 4 0 1 2 4 0 1 2 4 coalesced planes (0 = original impl.)
ABINIT - Si, 60 bands, 128 3 FFT npbandxnfft = 2x16 Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results npbandxnfft = 4x8 Time (ms) 350 300 250 200 150 FORW_WF BACK_WF FORW_WF BACK_WF 100 50 FOURDP 0 1 0 1 0 1 0 1 0 1 0 1 TOTAL 39.5s > 23.6s 41.1s > 37.1s
Conclusions & Future Work Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results Conclusions applying NBC requires some effort NBC can improve parallel efficience cache usage vs. overlap potential Future Work tune FFT further (reduce serial overhead) better automatic parameter assessment (?) parallel model for 3d-FFT use NBC for parallel orthogonalization apply NBC at higher level (LOBPCG?)
Conclusions & Future Work Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results Conclusions applying NBC requires some effort NBC can improve parallel efficience cache usage vs. overlap potential Future Work tune FFT further (reduce serial overhead) better automatic parameter assessment (?) parallel model for 3d-FFT use NBC for parallel orthogonalization apply NBC at higher level (LOBPCG?)
Discussion Introduction Avoidance of the transformation of zeroes Autotuning of parameters Preliminary Performance Results ABINIT patch (soon): http://www.unixer.de/research/abinit/ Thanks to the CEA/DAM for support of this work and you for your attention!