Block Low-Rank (BLR) approximations to improve multifrontal sparse solvers Joint work with Patrick Amestoy, Cleve Ashcraft, Olivier Boiteau, Alfredo Buttari and Jean-Yves L Excellent, PhD started on October 1st, 2010 and financed by EDF Clément Weisbecker, ENSEEIHT-IRIT, University of Toulouse, France MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Introduction Multifrontal solver Block Low-Rank approximations to improve multifrontal sparse solvers direct solver for large linear systems objective: A = LU Low-rank approximations compression and flop reduction accuracy controlled by a numerical parameter Combine these two notions to improve multifrontal solvers (in the context of MUMPS) 2/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
The multifrontal method
Separator tree (Schreiber 1982) dependencies between variables topological order 4/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Frontal matrices At each node, a partial factorization of the frontal matrix is performed : CB 1 CB 2 FS assembly elim CB CBs are stacked waiting for assembly peak of CB stack NFS FS separator variables NFS border variables FS 5/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Low-rank approximations
Low-rank matrices Low-rank matrix (Bebendorf) Let A be a m n matrix Let k ε be the approximated numerical rank of A at accuracy ε A is said to be a low-rank matrix if there exist matrices X of size m k ε, Y of size n k ε, E of size m n such that : A = X Y T + E, where E 2 ε and k ε (m + n) < mn n k 2 k 2 Y T 2 k 1 k 1 C = Y T 1 X 2 Low-rank product: X 1 (Y T 1 X 2)Y T 2 m X1 7/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Where can I find low-rank matrices? Problem: Matrices are usually not low-rank! Solution (Börm, Bebendorf): Well defined subblocks A b = A στ can be low-rank! (b = σ τ I I) Define a clustering C to obtain low-rank blocks I 1 I 2 I 3 I 1 I 2 I 3 8/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Admissibility condition How to find a good clustering? Admissibility condition for elliptic PDEs (Börm, Grasedyck, Hackbusch) Let σ I and τ I be clusters of indices The block (σ, τ) is said admissible if σ and τ satisfy diam(σ) + diam(τ) 2ηdist(σ, τ) where η is a problem dependant fixed parameter (usually 1 or 05) diam(σ) dist(σ,τ) diam(τ) σ 9/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013 τ
Illustration (1) D 1 D 2 10/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Illustration (1) D 1 D 2 10/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Illustration (1) D 1 D 2 10/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Illustration (1) D 1 D 2 10/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Illustration (1) D 1 D 2 10/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Illustration (1) D 1 D 2 10/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Illustration (2) 200 180 160 rank of 140 128 120 100 80 0 10 20 30 40 50 60 70 80 90 100 distance between and room for compression 11/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Exploiting low-rank subblocks: H matrices à = D 11 X 12 Y T 12 X 36 Y T 36 X 21 Y T 21 D 22 X 63 Y T 63 D 44 X 45 Y T 45 X 54 Y T 54 D 55 I 7 I 7 I 3 I 3 I 3 I 6 I 6 I 3 I 6 I 6 I 1 I 1 I 1 I 2 I 2 I 1 I 2 I 2 I 4 I 4 I 4 I 5 I 5 I 4 I 5 I 5 (Börm, Bebendorf, Hackbush, Grasedyck) 12/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Exploiting low-rank subblocks: HSS matrices D 1 X 1 B 1 Y T 2 X 1 R 1 B 3 W T 4 YT 4 X 1 R 1 B 3 W T 5 YT 5 Ã = X 2 B 2 Y1 T D 2 X 2 R 2 B 3 W4 T YT 4 X 2 R 2 B 3 W5 T YT 5 X 4 R 4 B 6 W1 T YT 1 X 4 R 4 B 6 W2 T YT 2 D 4 X 4 B 4 Y5 T X 5 R 5 B 6 W1 T YT 1 X 5 R 5 B 6 W2 T YT 2 X 5 B 5 Y4 T D 5 7 3 B 3 B 6 6 R 1,W 1 B 1 R 2,W 2 R 4,W 4 B 4 R 5,W 5 1 B 2 2 4 B 5 5 X 1,Y 1 X 2,Y 2 X 4,Y 4 X 5,Y 5 D 1 D 2 D 4 D 5 (Li, Napov, Xia, Gu, Wang, Rouet Gillman, Martinsson) 13/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Exploiting low-rank subblocks: BLR matrices D 1 X 12 Y T 12 X 13 Y T 13 X 14 Y T 14 Ã = X 21 Y T 21 D 2 X 23 Y T 23 X 24 Y T 24 X 31 Y T 31 X 32 Y T 32 D 3 X 34 Y T 34 X 41 Y T 41 X 42 Y T 42 X 43 Y T 43 D 4 particular case of H-matrices no tree natural matrix structure Idea: representing fronts with BLR matrices 14/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Comparative study: compression rates Compression rates of the frontal matrix at the root of a multifrontal tree, on two 3D stencils (discretization of a 128 x 128 x 128 cube) fraction of dense storage 035 03 025 02 015 01 005 Laplacian 0 10 14 10 12 10 10 10 8 10 6 10 4 10 2 low-rank threshold H HSS BLR fraction of dense storage 1 09 08 07 06 05 04 03 02 01 27-pts Geoazur stencil 0 10 14 10 12 10 10 10 8 10 6 10 4 10 2 low-rank threshold H HSS BLR 15/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Comparative study: compression cost Compression cost of the frontal matrix at the root of a multifrontal tree, on two 3D stencils (discretization of a 128 x 128 x 128 cube) log 10 (number of operations for compression) 125 12 115 11 105 10 95 Laplacian H HSS BLR log 10 (number of operations for compression) 27-pts Geoazur stencil 9 9 10 14 10 12 10 10 10 8 10 6 10 4 10 2 10 14 10 12 10 10 10 8 10 6 10 4 10 2 low-rank threshold low-rank threshold 13 125 12 115 11 105 10 95 H HSS BLR 16/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Adaptation to a multifrontal solver 17/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Adaptation to a multifrontal solver no relative order efficient clustering 17/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Adaptation to a multifrontal solver no relative order efficient compr and decompr 200 efficient clustering distribution assembly 17/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Adaptation to a multifrontal solver no relative order efficient compr and decompr flat non hierarchical structure efficient clustering distribution assembly pivoting 17/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering variables
Admissible clustering Constraint : the admissibility condition should be satisfied large diameters fraction of memory used 83% small diameters fraction of memory used 57% 19/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Halo algorithm for the clustering of a separator Designed to catch the geometry of the problem Computed with the graph instead of the mesh Coupled with a third party partitioning tool 20/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Halo algorithm for the clustering of a separator Designed to catch the geometry of the problem Computed with the graph instead of the mesh Coupled with a third party partitioning tool 1 The separator 20/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Halo algorithm for the clustering of a separator Designed to catch the geometry of the problem Computed with the graph instead of the mesh Coupled with a third party partitioning tool 1 The separator 2 The halo 20/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Halo algorithm for the clustering of a separator Designed to catch the geometry of the problem Computed with the graph instead of the mesh Coupled with a third party partitioning tool 1 The separator 2 The halo 3 Extraction of the halo 20/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Halo algorithm for the clustering of a separator Designed to catch the geometry of the problem Computed with the graph instead of the mesh Coupled with a third party partitioning tool 1 The separator 2 The halo 3 Extraction of the halo 4 Partition of the halo 20/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Halo algorithm for the clustering of a separator Designed to catch the geometry of the problem Computed with the graph instead of the mesh Coupled with a third party partitioning tool P 1 P 2 P 3 1 The separator 2 The halo 3 Extraction of the halo 4 Partition of the halo 5 Partition of the separator (block size is fixed) 20/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border 1- separator : halo S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border 1- separator : halo 2- border? 2 choices : S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border 1- separator : halo 2- border? 2 choices : S EXPLICIT S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border 1- separator : halo 2- border? 2 choices : S EXPLICIT S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border 1- separator : halo 2- border? 2 choices : S EXPLICIT INHERITED (top down) S S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border 1- separator : halo 2- border? 2 choices : S EXPLICIT INHERITED (top down) S S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border 1- separator : halo 2- border? 2 choices : S EXPLICIT INHERITED (top down) S S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering of the variables of a front front = separator + border 1- separator : halo 2- border? 2 choices : S EXPLICIT INHERITED (top down) S S 21/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Structural comparison optimal optimal = optimal block 22/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Structural comparison optimal optimal = optimal block small optimal = large enough block 22/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Structural comparison optimal optimal = optimal block small optimal = large enough block small small = too small block 22/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Structural comparison 1 2 3 4 optimal optimal = optimal block small optimal = large enough block small small = too small block reclustering strategies 1 2 3 S 4 22/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Illustration with a top-level separator from SCOTCH side face view perspective view 23/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Block Low-Rank multifrontal method
General idea fronts approximated with low-rank matrices σ L i 1,1 U i 1,1 L i 2,2 U i 2,2 τ U i di,di L i di,di CB 25/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Switch level (1) percentof the entries of the factor computed in fronts larger than N 80 70 60 50 40 30 20 10 2 10 2 11 2 12 2 13 2 14 2 15 2 16 mesh size N =200 N =400 N =600 N =800 N =1000 a large ratio of the memory consumption is targeted 26/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Switch level (1) 100 percentof operations performed in fronts larger than N 95 90 85 80 75 70 65 60 N =200 N =400 N =600 N =800 N =1000 55 2 10 2 11 2 12 2 13 2 14 2 15 2 16 mesh size a large ratio of the flop count is targeted 26/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Switch level (2) low-rank techniques only used on large fronts mesh size 2 10 2 11 2 12 2 13 2 14 2 15 N = 200 161 168 171 173 174 174 N = 400 037 040 042 042 043 043 N = 600 017 020 021 022 023 023 N = 800 007 009 010 011 011 011 N = 1000 005 006 006 006 006 006 (Proportion of fronts larger than N with respect to the total number of fronts in the multifrontal tree, for different mesh sizes) Only a few fronts are actually compressed! 27/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 Standard FR algorithm: F Factor S Solve U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 Standard FR algorithm: F Factor S Solve U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 Standard FR algorithm: F Factor S Solve U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 Standard FR algorithm: F Factor S Solve U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 BLR FSCU algorithm: F Factor S Solve U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 BLR FSCU algorithm: F Factor S Solve U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 BLR FSCU algorithm: F Factor S Solve U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 BLR FSCU algorithm: F Factor S Solve C Compress U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 BLR FSCU algorithm: F Factor S Solve C Compress U Update 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 F Factor S Solve C Compress U Update FCSU version more efficient less stability 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 F Factor S Solve C Compress U Update FSUC version no flop reduction more stability 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Factorization algorithms task operation type dense low-rank Factor (F) B = LU T (2/3)n 3 (2/3)n 3 Compress (C) C = XY T kn 2 Solve (S) D = X(Y T L 1 ) n 3 kn 2 CB update (U) D = D X 1 (Y1 T X 2)Y2 T 2n 3 2kn 2 F Factor S Solve C Compress U Update High algorithmic flexibility 28/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Experiments
Set of problems Name Prop Arith N NZ mem LU flops LU CSR appli ( 10 6 ) ( 10 9 ) (GB) ( 10 12 ) Curl 5000 2 2D/sym D 50 02 29 5 2E-15 Geoazur 128 3 3D/unsym Z 2 55 54 60 3E-4 wave prop EDF_A_MECA_R12 2D/sym D 134 1 151 200 4E-15 mechanics EDF_D_THER_R7 3D/sym D 8 118 229 100 8E-15 thermics CSR = Componentwise Scaled Residual Code_Aster tpll01{a,d} test cases (refined) large matrices applicative problems 30/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Metrics memory: L = fraction of FR factors storage obtained with BLR (%) CB = fraction of FR maximum size of CB stack obtained with BLR (%) flops: fraction of FR operations needed for the BLR factorization (in percent or absolute data, including the compression cost) 31/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Clustering strategy memory flops time L CB facto clustering full rank analysis clustering inh exp inh exp inh exp inh exp Curl5000 637% 624% 70% 55% 109% 111% 13 s 27 s 897 s Geoazur128 790% 770% 470% 450% 608% 591% 5 s 42 s 62 s TH_RAFF7 341% 307% 175% 162% 72% 66% 34 s 206 s 387 s ME_RAFF12 529% 511% 48% 41% 61% 61% 121 s 239 s 1971 s inherited version is more than 2 times faster same results on L 11 comparable results on L 21 a little less good on CBs inherited clustering used for all the experiments 32/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Global ordering of the matrix: general results Results with different orderings on Geoazur128 problem FR ordering mry flops peak L CB flops AMD 109GB 39E + 14 92GB 735% 400% 594% AMF 72GB 18E + 14 45GB 699% 535% 483% PORD 55GB 10E + 14 28GB 704% 490% 489% METIS 46GB 62E + 13 20GB 787% 464% 626% SCOTCH 49GB 66E + 13 21GB 794% 481% 634% LR 33/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Global ordering of the matrix: general results Results with different orderings on Geoazur128 problem FR ordering mry flops peak L CB flops AMD 109GB 39E + 14 92GB 735% 400% 594% AMF 72GB 18E + 14 45GB 699% 535% 483% PORD 55GB 10E + 14 28GB 704% 490% 489% METIS 46GB 62E + 13 20GB 787% 464% 626% SCOTCH 49GB 66E + 13 21GB 794% 481% 634% LR Algebraic approach 33/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Global ordering of the matrix: AMF tree [visualization tool developed by M Bremond talk tomorrow 11:30] 34/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Global ordering of the matrix: SCOTCH tree [visualization tool developed by M Bremond talk tomorrow 11:30] 35/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Influence of the size of the problem different refinements of problem EDF_A_MECA done with Homard memory flops Refinement N L CB facto R8 2, 101, 258 71% 43% 37% R11 33, 570, 826 59% 29% 12% R12 134, 250, 506 53% 24% 7% the larger the problem, the more efficient the method target = challenging problems 36/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Scalability with respect to the size of the problem Laplacian problem, ε = 10 14 16 15 Full Rank O(N 2 ) BLR O(N 4/3 ) 14 flops facto 13 12 11 10 32 64 96 128 160 192 224256 Mesh size M O(N 4/3 ) complexity 37/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Influence of the size of the clusters memory flops block size L CB facto compr other 100 88% 55% 75% 98E+11 46E+13 150 83% 50% 68% 15E+12 41E+13 200 79% 46% 63% 19E+12 37E+13 250 76% 45% 60% 24E+12 34E+13 300 75% 44% 58% 27E+12 33E+13 350 74% 44% 56% 31E+12 32E+13 400 73% 45% 56% 34E+12 31E+13 450 73% 44% 56% 37E+12 31E+13 500 73% 44% 57% 44E+12 31E+13 flexibility (no need for a fine tuning) can be dynamically adapted for pivoting, parallelism, 38/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Global results: 2D problems 1 00 90 80 70 60 50 40 30 20 1 0 0 EDF_A_MECA_R12 QR DROPPING PARAMETER L CB 1 0 1 4 1 0 1 2 1 0 1 0 1 0 8 1 0 6 1 0 4 1 0 2 1 0 2 1 0 4 1 0 6 1 0 8 1 0 1 0 1 0 1 2 1 0 1 4 flops facto (%) CSR 1 1 0 1 00 90 80 70 60 50 40 30 20 1 0 0 CURL-CURL 2D 5000x5000 1 0 1 4 1 0 1 2 1 0 1 0 1 0 8 1 0 6 1 0 4 1 0 2 QR DROPPING PARAMETER 1 0 6 1 0 8 1 0 1 0 1 0 1 2 1 0 1 4 1 0 1 6 Name N mem LU flops LU CSR EDF_A_MECA_R12 134E+6 151 GB 2E+14 4E-15 Curl-Curl 5000 2 50E+6 29 GB 5E+12 2E-15 39/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Global results: 3D problems L CB flops facto (%) CSR 1 00 90 80 70 60 50 40 30 20 1 0 0 EDF_D_THER_R7 1 0 2 1 0 4 1 0 6 1 0 8 1 0 1 0 1 0 1 2 1 00 90 80 70 60 50 40 30 20 1 0 0 GEOAZUR 3D 128x128x128 1 0 1 0 2 1 0 4 1 0 6 1 0 8 1 0 1 0 1 0 1 2 1 0 1 4 1 0 1 2 1 0 1 0 1 0 8 1 0 6 1 0 4 1 0 2 1 0 1 4 1 0 1 2 1 0 1 0 1 0 8 1 0 6 1 0 4 1 0 2 QR DROPPING PARAMETER QR DROPPING PARAMETER Name N mem LU flops LU CSR EDF_D_THER_R7 8E+6 229 GB 1E+14 8E-15 Geoazur 128 3 2E+6 54 GB 6E+13 3E-4 40/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Application to geophysics (1) Helmholtz equation for seismic modeling (SEISCOPE project) EAGE overthrust ground model 0 20 Dip (km) 5 10 15 20 10 Cross (km) 15 5 Depth (km) 0 1 2 3 4 3000 4000 5000 6000 m/s fqcy Flops LU Mem LU Peak memory 2 Hz 8957E+11 3 GB 4 GB 4 Hz 1639E+13 22 GB 25 GB 8 Hz 5769E+14 247 GB 283 GB 41/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Application to geophysics (2) 5 Dip (km) 10 15 b) 20 ) Dip (km) 10 15 e) m ) 20 0 5 Dip (km) 10 15 Dip (km) 10 15 Depth (km) h) 20 20 0 5 Dip (km) 10 15 i) 20 15 k) 20 20 0 5 Dip (km) 10 15 C 10 Depth (km) 20 0 5 Dip (km) 10 15 20 15 10 5 5 0 1 2 3 4 20 (k ro ss s ro s 10 15 15 l) 20 15 (k m ) 15 5 Dip (km) 10 0 1 2 3 4 m ) 15 5 s Dip (km) 10 0 ro s 5 0 1 2 3 4 C 0 20 10 Depth (km) 20 20 5 Depth (km) Depth (km) j) 15 C ro ss C ro ss C ro ss 10 5 0 1 2 3 4 Dip (km) 10 (k (k m ) 15 10 5 0 1 2 3 4 m ) 5 0 C ro ss (k Depth (km) 0 20 15 10 0 1 2 3 4 5 (k m ) f) 20 15 C ro ss C ro ss 20 20 5 (k m ) g) 15 0 1 2 3 4 10 0 1 2 3 4 Dip (km) 10 ro ss C Depth (km) 20 10 5 10 m ) 5 0 (k 0 15 (k m ) 20 20 15 5 5 C c) 20 (k m Depth (km) Depth (km) Depth (km) 15 0 1 2 3 4 5 Depth (km) Dip (km) 10 10 0 1 2 3 4 Depth (km) 5 5 d) 42/51 0 C ro ss C ro ss 10 5 0 1 2 3 4 20 15 ) 0 (k m 20 15 (k m ) a) a) 0 1 2 3 4 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Application to geophysics (3) Dip (km) 10 15 20 0 5 Dip (km) 10 15 d) 20 15 0 5 Dip (km) 10 15 20 (k C ro 10 10 5 5 0 1 2 3 4 20 15 ss ss C ro 10 0 1 2 3 4 flops 43/51 c) 20 Depth (km) m ) 5 5 Depth (km) Depth (km) 0 (k (k ss C ro 10 5 0 1 2 3 4 20 15 m ) b) 20 ) 15 m Dip (km) 10 (k 5 C ro ss 0 Depth (km) 20 15 m ) a) 0 1 2 3 4 memory ε fqcy facto F CB (10 5 ) 2 Hz 4 Hz 8 Hz 418 % 274 % 218 % 618 % 500 % 416 % 323% 244% 239% (10 4 ) 2 Hz 4 Hz 8 Hz 329 % 200 % 152 % 534 % 422 % 289 % 239% 217% 194% (10 3 ) 2 Hz 4 Hz 8 Hz 246 % 138 % 98 % 447 % 345 % 213 % 168% 190% 159% MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Preconditioning with BLR: set of problems N NZ Cond application Piston 13E+6 547E+6 51E+5 external pressure force on the top perf001d 20E+6 758E+6 15E+11 cavity hook subjected to internal pressure force (challenging for EDF) 44/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Preconditioning with BLR: number of iterations Number of iterations needed for the convergence FR is MUMPS 410 single precision means no convergence in 1, 000 seconds FR 10 8 10 7 10 6 10 5 10 4 10 3 10 2 Piston 3 3 3 3 4 7 115 perf001d 47 65 65 65 65 77 perf001d: only 2 iterations to convergence with a BLR double precision precond at ε = 10 8 good solution versus coarse sword of usual single precision! 45/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Preconditioning with BLR: Piston (compr and time) 70 (%) PISTON 60 L CB flops facto (%) 50 40 30 20 10 1 0 8 1 0 6 1 0 5 1 0 4 1 0 3 1 0 7 QR DROPPING PARAMETER 46/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Preconditioning with BLR: Piston (compr and time) (s) 350 300 Total execution time Factorization time Solves time (GCPC) PISTON 250 time in seconds 200 150 100 50 FR = 313s FR = 308s FR = 5s 0 1 0 8 1 0 6 1 0 5 1 0 4 1 0 3 1 0 7 QR DROPPING PARAMETER 46/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Preconditioning with BLR: perf001d (compr and time) (%) PERF 60 50 L CB flops facto (%) 40 30 20 10 1 0 8 1 0 7 1 0 6 1 0 5 1 0 4 QR DROPPING PARAMETER 47/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Preconditioning with BLR: perf001d (compr and time) PERF (s) 450 Total execution time Solves time (GCPC) 400 Factorization time 350 time in seconds 300 250 200 FR = 569s FR = 117s FR = 452s 150 100 1 0 8 1 0 7 1 0 6 1 0 5 1 0 4 QR DROPPING PARAMETER 47/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Software work new FR factorization with two levels of blocking: internal panels: efficiency of BLAS 3 operations external panels: correspond to low-rank blocks done in factorization algorithm: BLR seq unsymm with all MUMPS features and double panel BLR seq symm prototype (single panel, no pivoting) actual flop compression is achieved in both cases MPI symm and unsymm: panel facto rewritten for BLR very short term: BLR seq symm with all MUMPS features and double panel short term: long term: BLR MPI symm and unsymm BLR solution phase actual memory gains 48/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Conclusion & perspectives efficient method on various applicative problems considerable memory reduction & substantial decrease in computations O(N 4/3 ) complexity, comparable to HSS can be used as a preconditioner or as a direct solver good potential for parallelism code is stable on tested problems MPI study on larger and more difficult problems error propagation study absolute or relative dropping parameter? relative to what? (work with S Gratton, M Ngom and D Titley-Peloquin started) 49/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Aknowledgements O Boiteau, B Quinnez and N Tardieu (EDF R&D) S Operto, R Brossier and J Virieux (SEISCOPE Project) for their contribution to the geophysics study the Toulouse Computing Center (CICT) and N Renon S Li, A Napov and F-H Rouet (LBNL Berkeley) Details on this work can be found in: P Amestoy, C Ashcraft, O Boiteau, A Buttari, J-Y L Excellent and C Weisbecker, Improving multifrontal methods by means of block low-rank representations, IRIT technical report RT/APO/12/6, INRIA technical report INRIA/RR-8199, submitted to SIAM SISC, http://weisbeckerperso enseeihtfr/documents/rt_apo_12_6_blrpdf, 2012 50/51 MUMPS Users Group Meeting Clamart, France, May 29-30, 2013
Thank you! Any questions?