UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement
|
|
- Kathryn Cook
- 5 years ago
- Views:
Transcription
1 UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement Wuxi Li, Meng Li, Jiajun Wang, and David Z. Pan University of Texas at Austin wuxili@utexas.edu November 14, 2017 UT DA Wuxi Li (UT-Austin) UTPlaceF / 26
2 Outline 1 Introduction 2 Motivation & Challenges 3 UTPlcaeF 3.0 Algorithms 4 Experimental Results 5 Conclusion & Future Work Wuxi Li (UT-Austin) UTPlaceF / 26
3 Introduction: FPGA Scaling 15 Years of Xilinx/Intel(Altera) FPGA Size Trend Gate Count 10M 3M 1M 333K 100K 33K 10K Spartan-3 Arria GX Stratix II Virtex-4 Stratix III Virtex-5 Spartan-6 Arria II Virtex-6 Stratix IV Spartan-7 Artix-7 Arria V 5.5M Kintex-7 Stratix V Virtex-7 Kintex US Arria 10 Virtex US Kintex US+ Virtex US+ Stratix 10 3K 1K 90nm 65nm 40/45nm 28nm 20nm 14/16nm Wuxi Li (UT-Austin) UTPlaceF / 26
4 Introduction: The Era of Multi-Core CPUs Years of CPU Trend Data Number of Transistors (K) Single-Thread Performance (SpecINT 10 3 ) Frequency (MHz) Number of Cores Source: Original data up to the year 2010 was collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Data for was collected by K. Rupp. Wuxi Li (UT-Austin) UTPlaceF / 26
5 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Wuxi Li (UT-Austin) UTPlaceF / 26
6 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26
7 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26
8 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26
9 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement What s Next?? Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26
10 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement What s Next?? This Work Parallel Analytical Placement Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26
11 A Representative Quadratic Placement Flow Quadratic placers minimize cost function W (x, y) = 1 2 xt Q x x + c T x x yt Q y y + c T y y + const Circuit Info. Construct Linear Systems It is equivalent to solving Q x x = c x Q y y = c y (Conjugate Gradient (CG) method) Rough legalization [Kim+, TCAD 12] to remove cell overlapping Solve Linear Systems Rough Legalization Converge? Yes Done No Wuxi Li (UT-Austin) UTPlaceF / 26
12 A Simple Parallelization Attempt 100K-cell (FPGA-1) 1.1M-cell (FPGA-12) Intel Math Kernel Library (MKL) for CG Solve X and Y in parallel, if #thread 2 Enable parallelization for matrix-vector operations in CG, if #thread > 2 Use parallel sorting algorithms in libstdc++ parallel mode for rough legalization Runtime Normalized to Single-Threaded Execution # Threads Designs FPGA-1 FPGA FPGA-1 FPGA FPGA-1 FPGA FPGA-1 FPGA Linear System Construction Rough Legalization FPGA-1 FPGA-12 Solving Linear Systems Wuxi Li (UT-Austin) UTPlaceF / 26
13 Previous Works POLAR 3.0 [Lin+, ICCAD 15] ASIC quadratic placer Solve X and Y directions in parallel Solve geometric partitions in parallel Periodically solve flat placement 4X speedup with 16 threads Small to medium quality degradation Wuxi Li (UT-Austin) UTPlaceF / 26
14 UTPlaceF 3.0 Overview Circuit Info. Initial Placement Construct & Solve Linear Systems Rough Legalization & Update Anchor Force Parallelized Placement Construct & Solve Linear Systems for Sub-Netlists Synchronization Parallelized Incremental Placement Correction Synchronization Partition Netlist Rough Legalization & Update Anchor Force Converge? No Done Wuxi Li (UT-Austin) UTPlaceF / 26 Yes
15 Block-Jacobi Preconditioning The matrix Q Q after 8-way partitioning by row/column permutation The block-jacobi preconditioner Q of Q Qx = c Q i x i = c i, i = Wuxi Li (UT-Austin) UTPlaceF / 26
16 Physical Interpretation of Block-Jacobi Preconditioning A B C D E F Q x c A B C D E F x A x B x C xd = xe xf a b c d e f A B E C D F Q x c A B E C D F x A x B xe x C = xd xf a b e c d f A B E C D F bq x c A B E C D F x A x B xe x C = xd xf a b e c d f b d b d 0 b d B D B D B D A C E F A C E F A C E F a c e f a c e f a c e f Wuxi Li (UT-Austin) UTPlaceF / 26
17 Placement-Driven Block-Jacobi Preconditioning (PDBJP) bq x c bq x bc A B E C D F A B E C D F A B E C D F x A x B xe x C = xd xf a b e c d f A B E C D F x A x B xe x C = xd xf a b + x D e c d + x B f 0 b d b d B D B B D D A C E F A C E F a c e f Block-Jacobi Preconditioning a c e f Placement-Driven Block-Jacobi Preconditioning Wuxi Li (UT-Austin) UTPlaceF / 26
18 Can We Do Better? PDBJP is effective for a small number of partitions. Its quality degrades dramatically as the partition count increases. The quality degradation in PDBJP is introduced by the discrepancy of inter-partition nets in different partitions. A (k) A (k) A (k+1) B (k) B (k+1) B (k) An inter-partition net in an intermediate placement The solution after applying PDBJP with large discrepancy Wuxi Li (UT-Austin) UTPlaceF / 26
19 Parallelized Incremental Placement Correction (PIPC) A local and smooth perturbation of the post PDBJP placement Reduce the discrepancy of inter-partition nets A (k) A (k) A (k+1) A (k) A (k+2) A (k+1) B (k) B (k+1) B (k) B (k+1) B (k+2) B (k) An inter-partition net in an intermediate placement The solution after applying PDBJP with large discrepancy An incremental improvement that reduces the discrepancy Wuxi Li (UT-Austin) UTPlaceF / 26
20 Mathematical Formulation of PIPC PIPC is based on the idea of Additive Correction Multigrid Method The original problem we are trying to solve is Qx = c. (1) Let x (0) be the solution of the PDBJP linear system (1), that is If we can find x satisfying Qx (0) = ĉ. (2) Q x = r (0) = c Qx (0), the solution of the linear system (1) will be x (0) + x. Wuxi Li (UT-Austin) UTPlaceF / 26
21 Mathematical Formulation of PIPC x can be approached by the iterative method x (k+1) = x (k) + Q 1 (r (0) Q x (k) ) Computing Q 1 (r (0) Q x (k) ) is equivalent to solving which can be parallelized as well. Q 1 Q2 Q 3 Q4 Qx = r (0) Q x (k), Q x r (0) Q x (k) x 1 x 2 x 3 x 4 r (0) 1 r (0) 2 r (0) 3 r (0) 4 Q 1 Q 2 = Q 3 Q 4 Wuxi Li (UT-Austin) UTPlaceF / 26
22 Runtime and Quality Tradeoff of PIPC Theorem: The Convergence of PIPC By picking any placement-driven block-jacobi preconditioner of Q as Q, PIPC defined by iterative method x (k+1) = x (k) + Q 1 (r (0) Q x (k) ) is monotonically convergent for any choice of x (0). The Theorem reveals two important properties of PIPC: With sufficient iterations, the exact solution, x (0) + x, can be reached More iterations guarantees better solutions for Qx = c The runtime and quality can be perfectly traded to each other in PIPC. Wuxi Li (UT-Austin) UTPlaceF / 26
23 Experiments Setup C++ Intel Xeon E CPUs (2.6 GHz, 24 cores, and 30M L3 cache) 64 GB RAM Eigen 3 for linear system solving OpenMP 4.0 for multi-threading libstdc++ parallel mode for parallel sorting ISPD 16 FPGA placement contest benchmark suite Wuxi Li (UT-Austin) UTPlaceF / 26
24 Parallelization Configuration Solve X and Y in parallel Enable PDBJP for #thread > 2 PIPC is only enable for > 4 threads PIPC is not called for every placement iteration for 8 threads Less CG iterations for PIPC compared with PDBJP #Threads # Partitions # CG Iter. for PDBJP # PIPC Period # Correction Iter. of each PIPC # CG Iter. for PIPC Wuxi Li (UT-Austin) UTPlaceF / 26
25 Runtime Result w/o PIPC Speedup 7 1 thread 2 threads 4 threads 8 threads 16 threads FPGA-1 FPGA-2 FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads Avg. Speedup Wuxi Li (UT-Austin) UTPlaceF / 26
26 Wirelength Result w/o PIPC Norm. shpwl Incr. (%) FPGA-1 FPGA-2 1 thread 2 threads 4 threads 8 threads 16 threads FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads Avg. shpwl Incr. 0.0% 0.0% -0.1% 2.4% 6.5% Wuxi Li (UT-Austin) UTPlaceF / 26
27 Runtime Result w/ PIPC Speedup thread w/o PIPC 8 threads w/ PIPC 16 threads w/o PIPC 16 threads w/ PIPC FPGA-1 FPGA-2 FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads 8 w/o PIPC 8 w/ PIPC 16 w/o PIPC 16 w/ PIPC Avg. Speedup Wuxi Li (UT-Austin) UTPlaceF / 26
28 Wirelength Result w/ PIPC Norm. shpwl Incr. (%) FPGA-1 FPGA-2 8 thread w/o PIPC 8 threads w/ PIPC 16 threads w/o PIPC 16 threads w/ PIPC FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads 8 w/o PIPC 8 w/ PIPC 16 w/o PIPC 16 w/ PIPC Avg. shpwl Incr. 2.4% 1.7% 6.5% 3.0% Wuxi Li (UT-Austin) UTPlaceF / 26
29 Runtime Breakdown 1 # Threads Norm. Runtime Solving Linear System Linear System Construction Rough Legalization Netlist Partitioning PIPC Other Wuxi Li (UT-Austin) UTPlaceF / 26
30 Conclusion & Future Work Conclusion UTPlaceF 3.0, a parallelization framework for FPGA quadratic placement Placement-driven block-jacobi preconditioning Parallelized incremental placement correction 5X speedup with 3.0% wirelength increase Futher Work Parallelize packing and detailed placement Hardware acceleration: FPGA, GPU Wuxi Li (UT-Austin) UTPlaceF / 26
31 Thank You! Wuxi Li (UT-Austin) UTPlaceF / 26
Sparse LU Factorization on GPUs for Accelerating SPICE Simulation
Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,
More informationMinimizing Clock Latency Range in Robust Clock Tree Synthesis
Minimizing Clock Latency Range in Robust Clock Tree Synthesis Wen-Hao Liu Yih-Lang Li Hui-Chi Chen You have to enlarge your font. Many pages are hard to view. I think the position of Page topic is too
More informationFPGA Implementation of a Predictive Controller
FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationParallel Transposition of Sparse Data Structures
Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing
More informationAlgorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method
Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk
More informationNetwork Flow-based Simultaneous Retiming and Slack Budgeting for Low Power Design
Outline Network Flow-based Simultaneous Retiming and Slack Budgeting for Low Power Design Bei Yu 1 Sheqin Dong 1 Yuchun Ma 1 Tao Lin 1 Yu Wang 1 Song Chen 2 Satoshi GOTO 2 1 Department of Computer Science
More informationIncremental Latin Hypercube Sampling
Incremental Latin Hypercube Sampling for Lifetime Stochastic Behavioral Modeling of Analog Circuits Yen-Lung Chen +, Wei Wu *, Chien-Nan Jimmy Liu + and Lei He * EE Dept., National Central University,
More informationEE582 Physical Design Automation of VLSI Circuits and Systems
EE582 Prof. Dae Hyun Kim School of Electrical Engineering and Computer Science Washington State University Placement Metrics for Placement Wirelength Timing Power Routing congestion 2 Wirelength Estimation
More informationScalable Non-blocking Preconditioned Conjugate Gradient Methods
Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William
More informationSkew Management of NBTI Impacted Gated Clock Trees
International Symposium on Physical Design 2010 Skew Management of NBTI Impacted Gated Clock Trees Ashutosh Chakraborty and David Z. Pan ECE Department, University of Texas at Austin ashutosh@cerc.utexas.edu
More informationEE582 Physical Design Automation of VLSI Circuits and Systems
EE582 Prof. Dae Hyun Kim School of Electrical Engineering and Computer Science Washington State University Partitioning What We Will Study Partitioning Practical examples Problem definition Deterministic
More informationOptimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks
2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks Yufei Ma, Yu Cao, Sarma Vrudhula,
More informationJacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is
More informationParallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)
Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems
More informationAn Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators
An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators Hao Zhuang 1, Wenjian Yu 2, Ilgweon Kang 1, Xinan Wang 1, and Chung-Kuan Cheng 1 1. University of California, San
More informationUsing AmgX to accelerate a PETSc-based immersed-boundary method code
29th International Conference on Parallel Computational Fluid Dynamics May 15-17, 2017; Glasgow, Scotland Using AmgX to accelerate a PETSc-based immersed-boundary method code Olivier Mesnard, Pi-Yueh Chuang,
More informationSP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain
More informationMinimum Cost Flow Based Fast Placement Algorithm
Graduate Theses and Dissertations Graduate College 2012 Minimum Cost Flow Based Fast Placement Algorithm Enlai Xu Iowa State University Follow this and additional works at: http://lib.dr.iastate.edu/etd
More informationBuffered Clock Tree Sizing for Skew Minimization under Power and Thermal Budgets
Buffered Clock Tree Sizing for Skew Minimization under Power and Thermal Budgets Krit Athikulwongse, Xin Zhao, and Sung Kyu Lim School of Electrical and Computer Engineering Georgia Institute of Technology
More informationHyperspherical Clustering and Sampling for Rare Event Analysis with Multiple Failure Region Coverage
Hyperspherical Clustering and Sampling for Rare Event Analysis with Multiple Failure Region Coverage Wei Wu 1, Srinivas Bodapati 2, Lei He 1,3 1 Electrical Engineering Department, UCLA 2 Intel Corporation
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More informationAlgebraic Multigrid as Solvers and as Preconditioner
Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven
More informationAn Optimal Algorithm of Adjustable Delay Buffer Insertion for Solving Clock Skew Variation Problem
An Optimal Algorithm of Adjustable Delay Buffer Insertion for Solving Clock Skew Variation Problem Juyeon Kim 1 juyeon@ssl.snu.ac.kr Deokjin Joo 1 jdj@ssl.snu.ac.kr Taewhan Kim 1,2 tkim@ssl.snu.ac.kr 1
More informationEfficient Reluctance Extraction for Large-Scale Power Grid with High- Frequency Consideration
Efficient Reluctance Extraction for Large-Scale Power Grid with High- Frequency Consideration Shan Zeng, Wenjian Yu, Jin Shi, Xianlong Hong Dept. Computer Science & Technology, Tsinghua University, Beijing
More informationHIGH-PERFORMANCE circuits consume a considerable
1166 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL 17, NO 11, NOVEMBER 1998 A Matrix Synthesis Approach to Thermal Placement Chris C N Chu D F Wong Abstract In this
More informationPerformance Analysis of Lattice QCD Application with APGAS Programming Model
Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models
More informationProblems in VLSI design
Problems in VLSI design wire and transistor sizing signal delay in RC circuits transistor and wire sizing Elmore delay minimization via GP dominant time constant minimization via SDP placement problems
More informationIncomplete Cholesky preconditioners that exploit the low-rank property
anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université
More informationBlock AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark
Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise
More informationAccelerating Model Reduction of Large Linear Systems with Graphics Processors
Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex
More informationResearch on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method
NUCLEAR SCIENCE AND TECHNIQUES 25, 0501 (14) Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method XU Qi ( 徐琪 ), 1, YU Gang-Lin ( 余纲林 ), 1 WANG Kan ( 王侃 ),
More informationFINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION
FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros
More informationFast FPGA Placement Based on Quantum Model
Fast FPGA Placement Based on Quantum Model Lingli Wang School of Microelectronics Fudan University, Shanghai, China Email: llwang@fudan.edu.cn 1 Outline Background & Motivation VPR placement Quantum Model
More informationThe Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization
The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization Jia Wang, Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University Houghton, Michigan
More informationUnit 1A: Computational Complexity
Unit 1A: Computational Complexity Course contents: Computational complexity NP-completeness Algorithmic Paradigms Readings Chapters 3, 4, and 5 Unit 1A 1 O: Upper Bounding Function Def: f(n)= O(g(n)) if
More informationSaving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors
20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,
More information上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose
上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU
More informationTHE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression
Faster Static Timing Analysis via Bus Compression by David Van Campenhout and Trevor Mudge CSE-TR-285-96 THE UNIVERSITY OF MICHIGAN Computer Science and Engineering Division Department of Electrical Engineering
More informationGPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic
GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More informationEE582 Physical Design Automation of VLSI Circuits and Systems
EE582 Prof. Dae Hyun Kim School of Electrical Engineering and Computer Science Washington State University Partitioning What We Will Study Partitioning Practical examples Problem definition Deterministic
More informationA Framework for Layout-Level Logic Restructuring. Hosung Leo Kim John Lillis
A Framework for Layout-Level Logic Restructuring Hosung Leo Kim John Lillis Motivation: Logical-to-Physical Disconnect Logic-level Optimization disconnect Physical-level Optimization fixed netlist Limited
More informationiretilp : An efficient incremental algorithm for min-period retiming under general delay model
iretilp : An efficient incremental algorithm for min-period retiming under general delay model Debasish Das, Jia Wang and Hai Zhou EECS, Northwestern University, Evanston, IL 60201 Place and Route Group,
More informationA Temperature-aware Synthesis Approach for Simultaneous Delay and Leakage Optimization
A Temperature-aware Synthesis Approach for Simultaneous Delay and Leakage Optimization Nathaniel A. Conos and Miodrag Potkonjak Computer Science Department University of California, Los Angeles {conos,
More informationParallelizing large scale time domain electromagnetic inverse problem
Parallelizing large scale time domain electromagnetic inverse problems Eldad Haber with: D. Oldenburg & R. Shekhtman + Emory University, Atlanta, GA + The University of British Columbia, Vancouver, BC,
More informationMaking Fast Buffer Insertion Even Faster via Approximation Techniques
Making Fast Buffer Insertion Even Faster via Approximation Techniques Zhuo Li, C. N. Sze, Jiang Hu and Weiping Shi Department of Electrical Engineering Texas A&M University Charles J. Alpert IBM Austin
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationA Novel Cell Placement Algorithm for Flexible TFT Circuit with Mechanical Strain and Temperature Consideration
A Novel Cell Placement Algorithm for Flexible TFT Circuit with Mechanical Strain and Temperature Consideration Jiun-Li Lin, Po-Hsun Wu, and Tsung-Yi Ho Department of Computer Science and Information Engineering,
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationClock Buffer Polarity Assignment Utilizing Useful Clock Skews for Power Noise Reduction
Clock Buffer Polarity Assignment Utilizing Useful Clock Skews for Power Noise Reduction Deokjin Joo and Taewhan Kim Department of Electrical and Computer Engineering, Seoul National University, Seoul,
More informationRuntime Analysis of 4 VA HiCuM Versions with and without Internal Solver
Runtime Analysis of 4 VA HiCuM Versions with and without Internal Solver Didier Céli, Jean Remy 28 th ArbeitsKreis Bipolar - Letter Session Unterpremstaetten, Austria, November 5/6, 215 dm23a.15 Outline
More informationTowards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters
Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters HIM - Workshop on Sparse Grids and Applications Alexander Heinecke Chair of Scientific Computing May 18 th 2011 HIM
More informationConstraint Satisfaction in Incremental Placement with Application to Performance Optimization under Power Constraints
Constraint Satisfaction in Incremental Placement with Application to Performance Optimization under Power Constraints Huan Ren and Shantanu Dutt Dept. of ECE, University of Illinois-Chicago Abstract We
More informationJacobi-Davidson Eigensolver in Cusolver Library. Lung-Sheng Chien, NVIDIA
Jacobi-Davidson Eigensolver in Cusolver Library Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline CuSolver library - cusolverdn: dense LAPACK - cusolversp: sparse LAPACK - cusolverrf: refactorization
More informationFINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION
FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros
More informationFine-grained Parallel Incomplete LU Factorization
Fine-grained Parallel Incomplete LU Factorization Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Sparse Days Meeting at CERFACS June 5-6, 2014 Contribution
More informationLevel-3 BLAS on a GPU
Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón
More informationAdding a New Dimension to Physical Design. Sachin Sapatnekar University of Minnesota
Adding a New Dimension to Physical Design Sachin Sapatnekar University of Minnesota 1 Outline What is 3D about? Why 3D? 3D-specific challenges 3D analysis and optimization 2 Planning a city: Land usage
More informationPetaBricks: Variable Accuracy and Online Learning
PetaBricks: Variable Accuracy and Online Learning Jason Ansel MIT - CSAIL May 4, 2011 Jason Ansel (MIT) PetaBricks May 4, 2011 1 / 40 Outline 1 Motivating Example 2 PetaBricks Language Overview 3 Variable
More informationEnergy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems
Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems Jian-Jia Chen *, Chuan Yue Yang, Tei-Wei Kuo, and Chi-Sheng Shih Embedded Systems and Wireless Networking Lab. Department of Computer
More informationLattice Boltzmann simulations on heterogeneous CPU-GPU clusters
Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts
More informationQuantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm)
Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Alexander Smith & Khashayar Khavari Department of Electrical and Computer Engineering University of Toronto April 15, 2009 Alexander
More informationReview: From problem to parallel algorithm
Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:
More informationConjugate Gradients: Idea
Overview Steepest Descent often takes steps in the same direction as earlier steps Wouldn t it be better every time we take a step to get it exactly right the first time? Again, in general we choose a
More informationSOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA
1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization
More informationAccelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationIntroduction to numerical computations on the GPU
Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming
More informationAccelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationSOLUTION of linear systems of equations of the form:
Proceedings of the Federated Conference on Computer Science and Information Systems pp. Mixed precision iterative refinement techniques for the WZ factorization Beata Bylina Jarosław Bylina Institute of
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationMulti-Level Logic Optimization. Technology Independent. Thanks to R. Rudell, S. Malik, R. Rutenbar. University of California, Berkeley, CA
Technology Independent Multi-Level Logic Optimization Prof. Kurt Keutzer Prof. Sanjit Seshia EECS University of California, Berkeley, CA Thanks to R. Rudell, S. Malik, R. Rutenbar 1 Logic Optimization
More informationCycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators
CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators Sandeep D souza and Ragunathan (Raj) Rajkumar Carnegie Mellon University High (Energy) Cost of Accelerators Modern-day
More informationBuilding a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI
Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Charles Lo and Paul Chow {locharl1, pc}@eecg.toronto.edu Department of Electrical and Computer Engineering
More informationS0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA
S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA Date: 16th May 2012 Wed, 3pm to 3.25pm(Adv. Session) Sathyanarayana K., Manish Banga, and Ravi Kumar G. V. V. Engineering Services,
More informationEfficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization
Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Martin Takáč The University of Edinburgh Based on: P. Richtárik and M. Takáč. Iteration complexity of randomized
More informationStochastic Dynamic Thermal Management: A Markovian Decision-based Approach. Hwisung Jung, Massoud Pedram
Stochastic Dynamic Thermal Management: A Markovian Decision-based Approach Hwisung Jung, Massoud Pedram Outline Introduction Background Thermal Management Framework Accuracy of Modeling Policy Representation
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J. SCI. COMPUT. Vol. 37, No. 2, pp. C169 C193 c 2015 Society for Industrial and Applied Mathematics FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper
More informationUtilisation de la compression low-rank pour réduire la complexité du solveur PaStiX
Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX 26 Septembre 2018 - JCAD 2018 - Lyon Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman Outline 1. Context 2.
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationSolution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI
Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI Sagar Bhatt Person Number: 50170651 Department of Mechanical and Aerospace Engineering,
More informationABHELSINKI UNIVERSITY OF TECHNOLOGY
On Repeated Squarings in Binary Fields Kimmo Järvinen Helsinki University of Technology August 14, 2009 K. Järvinen On Repeated Squarings in Binary Fields 1/1 Introduction Repeated squaring Repeated squaring:
More informationBreaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points
Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Michael Griebel Christian Rieger Peter Zaspel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationPetascale Quantum Simulations of Nano Systems and Biomolecules
Petascale Quantum Simulations of Nano Systems and Biomolecules Emil Briggs North Carolina State University 1. Outline of real-space Multigrid (RMG) 2. Scalability and hybrid/threaded models 3. GPU acceleration
More informationMultilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota
Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work
More information392 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 3, MARCH 2013
392 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 3, MARCH 2013 An Optimal Allocation Algorithm of Adjustable Delay Buffers and Practical Extensions for Clock
More informationFine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning
Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,
More informationDynamic Scheduling within MAGMA
Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing
More informationPower-Driven Global Routing for Multi-Supply Voltage Domains
Power-Driven Global Routing for Multi-Supply Voltage Domains Tai-Hsuan Wu, Azadeh Davoodi, and Jeffrey T. Linderoth Department of Electrical and Computer Engineering Department of Industrial and Systems
More informationGPU accelerated Arnoldi solver for small batched matrix
15. 09. 22 GPU accelerated Arnoldi solver for small batched matrix Samsung Advanced Institute of Technology Hyung-Jin Kim Contents - Eigen value problems - Solution - Arnoldi Algorithm - Target - CUDA
More informationn-level Graph Partitioning
Vitaly Osipov, Peter Sanders - Algorithmik II 1 Vitaly Osipov: KIT Universität des Landes Baden-Württemberg und nationales Grossforschungszentrum in der Helmholtz-Gemeinschaft Institut für Theoretische
More informationIntroduction to Benchmark Test for Multi-scale Computational Materials Software
Introduction to Benchmark Test for Multi-scale Computational Materials Software Shun Xu*, Jian Zhang, Zhong Jin xushun@sccas.cn Computer Network Information Center Chinese Academy of Sciences (IPCC member)
More informationDynamic Adaptation for Resilient Integrated Circuits and Systems
Dynamic Adaptation for Resilient Integrated Circuits and Systems Krishnendu Chakrabarty Department of Electrical and Computer Engineering Duke University Durham, NC 27708, USA Department of Computer and
More informationMPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory
MAX PLANCK INSTITUTE November 5, 2010 MPI at MPI Jens Saak Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory FOR DYNAMICS OF COMPLEX TECHNICAL
More informationChapter 7 Iterative Techniques in Matrix Algebra
Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition
More informationA Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!
More informationSome Geometric and Algebraic Aspects of Domain Decomposition Methods
Some Geometric and Algebraic Aspects of Domain Decomposition Methods D.S.Butyugin 1, Y.L.Gurieva 1, V.P.Ilin 1,2, and D.V.Perevozkin 1 Abstract Some geometric and algebraic aspects of various domain decomposition
More informationDetecting Support-Reducing Bound Sets using Two-Cofactor Symmetries 1
3A-3 Detecting Support-Reducing Bound Sets using Two-Cofactor Symmetries 1 Jin S. Zhang Department of ECE Portland State University Portland, OR 97201 jinsong@ece.pdx.edu Malgorzata Chrzanowska-Jeske Department
More informationOptimal Folding Of Bit Sliced Stacks+
Optimal Folding Of Bit Sliced Stacks+ Doowon Paik University of Minnesota Sartaj Sahni University of Florida Abstract We develop fast polynomial time algorithms to optimally fold stacked bit sliced architectures
More information