UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement

Size: px

Start display at page:

Download "UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement"

Kathryn Cook
5 years ago
Views:

1 UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement Wuxi Li, Meng Li, Jiajun Wang, and David Z. Pan University of Texas at Austin wuxili@utexas.edu November 14, 2017 UT DA Wuxi Li (UT-Austin) UTPlaceF / 26

2 Outline 1 Introduction 2 Motivation & Challenges 3 UTPlcaeF 3.0 Algorithms 4 Experimental Results 5 Conclusion & Future Work Wuxi Li (UT-Austin) UTPlaceF / 26

3 Introduction: FPGA Scaling 15 Years of Xilinx/Intel(Altera) FPGA Size Trend Gate Count 10M 3M 1M 333K 100K 33K 10K Spartan-3 Arria GX Stratix II Virtex-4 Stratix III Virtex-5 Spartan-6 Arria II Virtex-6 Stratix IV Spartan-7 Artix-7 Arria V 5.5M Kintex-7 Stratix V Virtex-7 Kintex US Arria 10 Virtex US Kintex US+ Virtex US+ Stratix 10 3K 1K 90nm 65nm 40/45nm 28nm 20nm 14/16nm Wuxi Li (UT-Austin) UTPlaceF / 26

4 Introduction: The Era of Multi-Core CPUs Years of CPU Trend Data Number of Transistors (K) Single-Thread Performance (SpecINT 10 3 ) Frequency (MHz) Number of Cores Source: Original data up to the year 2010 was collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Data for was collected by K. Rupp. Wuxi Li (UT-Austin) UTPlaceF / 26

5 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Wuxi Li (UT-Austin) UTPlaceF / 26

6 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

7 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

8 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

9 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement What s Next?? Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03

10 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement What s Next?? This Work Parallel Analytical Placement Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

11 A Representative Quadratic Placement Flow Quadratic placers minimize cost function W (x, y) = 1 2 xt Q x x + c T x x yt Q y y + c T y y + const Circuit Info. Construct Linear Systems It is equivalent to solving Q x x = c x Q y y = c y (Conjugate Gradient (CG) method) Rough legalization [Kim+, TCAD 12] to remove cell overlapping Solve Linear Systems Rough Legalization Converge? Yes Done No Wuxi Li (UT-Austin) UTPlaceF / 26

12 A Simple Parallelization Attempt 100K-cell (FPGA-1) 1.1M-cell (FPGA-12) Intel Math Kernel Library (MKL) for CG Solve X and Y in parallel, if #thread 2 Enable parallelization for matrix-vector operations in CG, if #thread > 2 Use parallel sorting algorithms in libstdc++ parallel mode for rough legalization Runtime Normalized to Single-Threaded Execution # Threads Designs FPGA-1 FPGA FPGA-1 FPGA FPGA-1 FPGA FPGA-1 FPGA Linear System Construction Rough Legalization FPGA-1 FPGA-12 Solving Linear Systems Wuxi Li (UT-Austin) UTPlaceF / 26

13 Previous Works POLAR 3.0 [Lin+, ICCAD 15] ASIC quadratic placer Solve X and Y directions in parallel Solve geometric partitions in parallel Periodically solve flat placement 4X speedup with 16 threads Small to medium quality degradation Wuxi Li (UT-Austin) UTPlaceF / 26

14 UTPlaceF 3.0 Overview Circuit Info. Initial Placement Construct & Solve Linear Systems Rough Legalization & Update Anchor Force Parallelized Placement Construct & Solve Linear Systems for Sub-Netlists Synchronization Parallelized Incremental Placement Correction Synchronization Partition Netlist Rough Legalization & Update Anchor Force Converge? No Done Wuxi Li (UT-Austin) UTPlaceF / 26 Yes

block-jacobi preconditioner Q of Q Qx = c Q i x i =

15 Block-Jacobi Preconditioning The matrix Q Q after 8-way partitioning by row/column permutation The block-jacobi preconditioner Q of Q Qx = c Q i x i = c i, i = Wuxi Li (UT-Austin) UTPlaceF / 26

16 Physical Interpretation of Block-Jacobi Preconditioning A B C D E F Q x c A B C D E F x A x B x C xd = xe xf a b c d e f A B E C D F Q x c A B E C D F x A x B xe x C = xd xf a b e c d f A B E C D F bq x c A B E C D F x A x B xe x C = xd xf a b e c d f b d b d 0 b d B D B D B D A C E F A C E F A C E F a c e f a c e f a c e f Wuxi Li (UT-Austin) UTPlaceF / 26

17 Placement-Driven Block-Jacobi Preconditioning (PDBJP) bq x c bq x bc A B E C D F A B E C D F A B E C D F x A x B xe x C = xd xf a b e c d f A B E C D F x A x B xe x C = xd xf a b + x D e c d + x B f 0 b d b d B D B B D D A C E F A C E F a c e f Block-Jacobi Preconditioning a c e f Placement-Driven Block-Jacobi Preconditioning Wuxi Li (UT-Austin) UTPlaceF / 26

18 Can We Do Better? PDBJP is effective for a small number of partitions. Its quality degrades dramatically as the partition count increases. The quality degradation in PDBJP is introduced by the discrepancy of inter-partition nets in different partitions. A (k) A (k) A (k+1) B (k) B (k+1) B (k) An inter-partition net in an intermediate placement The solution after applying PDBJP with large discrepancy Wuxi Li (UT-Austin) UTPlaceF / 26

19 Parallelized Incremental Placement Correction (PIPC) A local and smooth perturbation of the post PDBJP placement Reduce the discrepancy of inter-partition nets A (k) A (k) A (k+1) A (k) A (k+2) A (k+1) B (k) B (k+1) B (k) B (k+1) B (k+2) B (k) An inter-partition net in an intermediate placement The solution after applying PDBJP with large discrepancy An incremental improvement that reduces the discrepancy Wuxi Li (UT-Austin) UTPlaceF / 26

20 Mathematical Formulation of PIPC PIPC is based on the idea of Additive Correction Multigrid Method The original problem we are trying to solve is Qx = c. (1) Let x (0) be the solution of the PDBJP linear system (1), that is If we can find x satisfying Qx (0) = ĉ. (2) Q x = r (0) = c Qx (0), the solution of the linear system (1) will be x (0) + x. Wuxi Li (UT-Austin) UTPlaceF / 26

21 Mathematical Formulation of PIPC x can be approached by the iterative method x (k+1) = x (k) + Q 1 (r (0) Q x (k) ) Computing Q 1 (r (0) Q x (k) ) is equivalent to solving which can be parallelized as well. Q 1 Q2 Q 3 Q4 Qx = r (0) Q x (k), Q x r (0) Q x (k) x 1 x 2 x 3 x 4 r (0) 1 r (0) 2 r (0) 3 r (0) 4 Q 1 Q 2 = Q 3 Q 4 Wuxi Li (UT-Austin) UTPlaceF / 26

22 Runtime and Quality Tradeoff of PIPC Theorem: The Convergence of PIPC By picking any placement-driven block-jacobi preconditioner of Q as Q, PIPC defined by iterative method x (k+1) = x (k) + Q 1 (r (0) Q x (k) ) is monotonically convergent for any choice of x (0). The Theorem reveals two important properties of PIPC: With sufficient iterations, the exact solution, x (0) + x, can be reached More iterations guarantees better solutions for Qx = c The runtime and quality can be perfectly traded to each other in PIPC. Wuxi Li (UT-Austin) UTPlaceF / 26

23 Experiments Setup C++ Intel Xeon E CPUs (2.6 GHz, 24 cores, and 30M L3 cache) 64 GB RAM Eigen 3 for linear system solving OpenMP 4.0 for multi-threading libstdc++ parallel mode for parallel sorting ISPD 16 FPGA placement contest benchmark suite Wuxi Li (UT-Austin) UTPlaceF / 26

24 Parallelization Configuration Solve X and Y in parallel Enable PDBJP for #thread > 2 PIPC is only enable for > 4 threads PIPC is not called for every placement iteration for 8 threads Less CG iterations for PIPC compared with PDBJP #Threads # Partitions # CG Iter. for PDBJP # PIPC Period # Correction Iter. of each PIPC # CG Iter. for PIPC Wuxi Li (UT-Austin) UTPlaceF / 26

25 Runtime Result w/o PIPC Speedup 7 1 thread 2 threads 4 threads 8 threads 16 threads FPGA-1 FPGA-2 FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads Avg. Speedup Wuxi Li (UT-Austin) UTPlaceF / 26

26 Wirelength Result w/o PIPC Norm. shpwl Incr. (%) FPGA-1 FPGA-2 1 thread 2 threads 4 threads 8 threads 16 threads FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads Avg. shpwl Incr. 0.0% 0.0% -0.1% 2.4% 6.5% Wuxi Li (UT-Austin) UTPlaceF / 26

27 Runtime Result w/ PIPC Speedup thread w/o PIPC 8 threads w/ PIPC 16 threads w/o PIPC 16 threads w/ PIPC FPGA-1 FPGA-2 FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads 8 w/o PIPC 8 w/ PIPC 16 w/o PIPC 16 w/ PIPC Avg. Speedup Wuxi Li (UT-Austin) UTPlaceF / 26

28 Wirelength Result w/ PIPC Norm. shpwl Incr. (%) FPGA-1 FPGA-2 8 thread w/o PIPC 8 threads w/ PIPC 16 threads w/o PIPC 16 threads w/ PIPC FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads 8 w/o PIPC 8 w/ PIPC 16 w/o PIPC 16 w/ PIPC Avg. shpwl Incr. 2.4% 1.7% 6.5% 3.0% Wuxi Li (UT-Austin) UTPlaceF / 26

29 Runtime Breakdown 1 # Threads Norm. Runtime Solving Linear System Linear System Construction Rough Legalization Netlist Partitioning PIPC Other Wuxi Li (UT-Austin) UTPlaceF / 26

30 Conclusion & Future Work Conclusion UTPlaceF 3.0, a parallelization framework for FPGA quadratic placement Placement-driven block-jacobi preconditioning Parallelized incremental placement correction 5X speedup with 3.0% wirelength increase Futher Work Parallelize packing and detailed placement Hardware acceleration: FPGA, GPU Wuxi Li (UT-Austin) UTPlaceF / 26

31 Thank You! Wuxi Li (UT-Austin) UTPlaceF / 26

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,