UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement

Size: px
Start display at page:

Download "UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement"

Transcription

1 UTPlaceF 3.0: A Parallelization Framework for Modern FPGA Global Placement Wuxi Li, Meng Li, Jiajun Wang, and David Z. Pan University of Texas at Austin wuxili@utexas.edu November 14, 2017 UT DA Wuxi Li (UT-Austin) UTPlaceF / 26

2 Outline 1 Introduction 2 Motivation & Challenges 3 UTPlcaeF 3.0 Algorithms 4 Experimental Results 5 Conclusion & Future Work Wuxi Li (UT-Austin) UTPlaceF / 26

3 Introduction: FPGA Scaling 15 Years of Xilinx/Intel(Altera) FPGA Size Trend Gate Count 10M 3M 1M 333K 100K 33K 10K Spartan-3 Arria GX Stratix II Virtex-4 Stratix III Virtex-5 Spartan-6 Arria II Virtex-6 Stratix IV Spartan-7 Artix-7 Arria V 5.5M Kintex-7 Stratix V Virtex-7 Kintex US Arria 10 Virtex US Kintex US+ Virtex US+ Stratix 10 3K 1K 90nm 65nm 40/45nm 28nm 20nm 14/16nm Wuxi Li (UT-Austin) UTPlaceF / 26

4 Introduction: The Era of Multi-Core CPUs Years of CPU Trend Data Number of Transistors (K) Single-Thread Performance (SpecINT 10 3 ) Frequency (MHz) Number of Cores Source: Original data up to the year 2010 was collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Data for was collected by K. Rupp. Wuxi Li (UT-Austin) UTPlaceF / 26

5 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Wuxi Li (UT-Austin) UTPlaceF / 26

6 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

7 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

8 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

9 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement What s Next?? Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

10 The Evolution of FPGA Placement 20 Years of Selected Academic Papers on FPGA Placement What s Next?? This Work Parallel Analytical Placement Analytical Placement Gopalakrishman+ DAC 06 Xu+ FPL 05 Gort+ FPL 12 Chen+ ICCAD 14 Xu+ Integration, VLSI 11 Li+ ICCAD 16 Pui+ ICCAD 16 Pattison+ ICCAD 16 Min-Cut Partitioning Maidee+ DAC 03 Maidee+ TCAD 05 Simulated Annealing Betz+ FPL 97 Chen+ FPL 04 Chen+ TODAES Wuxi Li (UT-Austin) UTPlaceF / 26

11 A Representative Quadratic Placement Flow Quadratic placers minimize cost function W (x, y) = 1 2 xt Q x x + c T x x yt Q y y + c T y y + const Circuit Info. Construct Linear Systems It is equivalent to solving Q x x = c x Q y y = c y (Conjugate Gradient (CG) method) Rough legalization [Kim+, TCAD 12] to remove cell overlapping Solve Linear Systems Rough Legalization Converge? Yes Done No Wuxi Li (UT-Austin) UTPlaceF / 26

12 A Simple Parallelization Attempt 100K-cell (FPGA-1) 1.1M-cell (FPGA-12) Intel Math Kernel Library (MKL) for CG Solve X and Y in parallel, if #thread 2 Enable parallelization for matrix-vector operations in CG, if #thread > 2 Use parallel sorting algorithms in libstdc++ parallel mode for rough legalization Runtime Normalized to Single-Threaded Execution # Threads Designs FPGA-1 FPGA FPGA-1 FPGA FPGA-1 FPGA FPGA-1 FPGA Linear System Construction Rough Legalization FPGA-1 FPGA-12 Solving Linear Systems Wuxi Li (UT-Austin) UTPlaceF / 26

13 Previous Works POLAR 3.0 [Lin+, ICCAD 15] ASIC quadratic placer Solve X and Y directions in parallel Solve geometric partitions in parallel Periodically solve flat placement 4X speedup with 16 threads Small to medium quality degradation Wuxi Li (UT-Austin) UTPlaceF / 26

14 UTPlaceF 3.0 Overview Circuit Info. Initial Placement Construct & Solve Linear Systems Rough Legalization & Update Anchor Force Parallelized Placement Construct & Solve Linear Systems for Sub-Netlists Synchronization Parallelized Incremental Placement Correction Synchronization Partition Netlist Rough Legalization & Update Anchor Force Converge? No Done Wuxi Li (UT-Austin) UTPlaceF / 26 Yes

15 Block-Jacobi Preconditioning The matrix Q Q after 8-way partitioning by row/column permutation The block-jacobi preconditioner Q of Q Qx = c Q i x i = c i, i = Wuxi Li (UT-Austin) UTPlaceF / 26

16 Physical Interpretation of Block-Jacobi Preconditioning A B C D E F Q x c A B C D E F x A x B x C xd = xe xf a b c d e f A B E C D F Q x c A B E C D F x A x B xe x C = xd xf a b e c d f A B E C D F bq x c A B E C D F x A x B xe x C = xd xf a b e c d f b d b d 0 b d B D B D B D A C E F A C E F A C E F a c e f a c e f a c e f Wuxi Li (UT-Austin) UTPlaceF / 26

17 Placement-Driven Block-Jacobi Preconditioning (PDBJP) bq x c bq x bc A B E C D F A B E C D F A B E C D F x A x B xe x C = xd xf a b e c d f A B E C D F x A x B xe x C = xd xf a b + x D e c d + x B f 0 b d b d B D B B D D A C E F A C E F a c e f Block-Jacobi Preconditioning a c e f Placement-Driven Block-Jacobi Preconditioning Wuxi Li (UT-Austin) UTPlaceF / 26

18 Can We Do Better? PDBJP is effective for a small number of partitions. Its quality degrades dramatically as the partition count increases. The quality degradation in PDBJP is introduced by the discrepancy of inter-partition nets in different partitions. A (k) A (k) A (k+1) B (k) B (k+1) B (k) An inter-partition net in an intermediate placement The solution after applying PDBJP with large discrepancy Wuxi Li (UT-Austin) UTPlaceF / 26

19 Parallelized Incremental Placement Correction (PIPC) A local and smooth perturbation of the post PDBJP placement Reduce the discrepancy of inter-partition nets A (k) A (k) A (k+1) A (k) A (k+2) A (k+1) B (k) B (k+1) B (k) B (k+1) B (k+2) B (k) An inter-partition net in an intermediate placement The solution after applying PDBJP with large discrepancy An incremental improvement that reduces the discrepancy Wuxi Li (UT-Austin) UTPlaceF / 26

20 Mathematical Formulation of PIPC PIPC is based on the idea of Additive Correction Multigrid Method The original problem we are trying to solve is Qx = c. (1) Let x (0) be the solution of the PDBJP linear system (1), that is If we can find x satisfying Qx (0) = ĉ. (2) Q x = r (0) = c Qx (0), the solution of the linear system (1) will be x (0) + x. Wuxi Li (UT-Austin) UTPlaceF / 26

21 Mathematical Formulation of PIPC x can be approached by the iterative method x (k+1) = x (k) + Q 1 (r (0) Q x (k) ) Computing Q 1 (r (0) Q x (k) ) is equivalent to solving which can be parallelized as well. Q 1 Q2 Q 3 Q4 Qx = r (0) Q x (k), Q x r (0) Q x (k) x 1 x 2 x 3 x 4 r (0) 1 r (0) 2 r (0) 3 r (0) 4 Q 1 Q 2 = Q 3 Q 4 Wuxi Li (UT-Austin) UTPlaceF / 26

22 Runtime and Quality Tradeoff of PIPC Theorem: The Convergence of PIPC By picking any placement-driven block-jacobi preconditioner of Q as Q, PIPC defined by iterative method x (k+1) = x (k) + Q 1 (r (0) Q x (k) ) is monotonically convergent for any choice of x (0). The Theorem reveals two important properties of PIPC: With sufficient iterations, the exact solution, x (0) + x, can be reached More iterations guarantees better solutions for Qx = c The runtime and quality can be perfectly traded to each other in PIPC. Wuxi Li (UT-Austin) UTPlaceF / 26

23 Experiments Setup C++ Intel Xeon E CPUs (2.6 GHz, 24 cores, and 30M L3 cache) 64 GB RAM Eigen 3 for linear system solving OpenMP 4.0 for multi-threading libstdc++ parallel mode for parallel sorting ISPD 16 FPGA placement contest benchmark suite Wuxi Li (UT-Austin) UTPlaceF / 26

24 Parallelization Configuration Solve X and Y in parallel Enable PDBJP for #thread > 2 PIPC is only enable for > 4 threads PIPC is not called for every placement iteration for 8 threads Less CG iterations for PIPC compared with PDBJP #Threads # Partitions # CG Iter. for PDBJP # PIPC Period # Correction Iter. of each PIPC # CG Iter. for PIPC Wuxi Li (UT-Austin) UTPlaceF / 26

25 Runtime Result w/o PIPC Speedup 7 1 thread 2 threads 4 threads 8 threads 16 threads FPGA-1 FPGA-2 FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads Avg. Speedup Wuxi Li (UT-Austin) UTPlaceF / 26

26 Wirelength Result w/o PIPC Norm. shpwl Incr. (%) FPGA-1 FPGA-2 1 thread 2 threads 4 threads 8 threads 16 threads FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads Avg. shpwl Incr. 0.0% 0.0% -0.1% 2.4% 6.5% Wuxi Li (UT-Austin) UTPlaceF / 26

27 Runtime Result w/ PIPC Speedup thread w/o PIPC 8 threads w/ PIPC 16 threads w/o PIPC 16 threads w/ PIPC FPGA-1 FPGA-2 FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads 8 w/o PIPC 8 w/ PIPC 16 w/o PIPC 16 w/ PIPC Avg. Speedup Wuxi Li (UT-Austin) UTPlaceF / 26

28 Wirelength Result w/ PIPC Norm. shpwl Incr. (%) FPGA-1 FPGA-2 8 thread w/o PIPC 8 threads w/ PIPC 16 threads w/o PIPC 16 threads w/ PIPC FPGA-3 FPGA-4 FPGA-5 FPGA-6 FPGA-7 FPGA-8 FPGA-9 FPGA-10 FPGA-11 FPGA-12 #Threads 8 w/o PIPC 8 w/ PIPC 16 w/o PIPC 16 w/ PIPC Avg. shpwl Incr. 2.4% 1.7% 6.5% 3.0% Wuxi Li (UT-Austin) UTPlaceF / 26

29 Runtime Breakdown 1 # Threads Norm. Runtime Solving Linear System Linear System Construction Rough Legalization Netlist Partitioning PIPC Other Wuxi Li (UT-Austin) UTPlaceF / 26

30 Conclusion & Future Work Conclusion UTPlaceF 3.0, a parallelization framework for FPGA quadratic placement Placement-driven block-jacobi preconditioning Parallelized incremental placement correction 5X speedup with 3.0% wirelength increase Futher Work Parallelize packing and detailed placement Hardware acceleration: FPGA, GPU Wuxi Li (UT-Austin) UTPlaceF / 26

31 Thank You! Wuxi Li (UT-Austin) UTPlaceF / 26

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

Minimizing Clock Latency Range in Robust Clock Tree Synthesis

Minimizing Clock Latency Range in Robust Clock Tree Synthesis Minimizing Clock Latency Range in Robust Clock Tree Synthesis Wen-Hao Liu Yih-Lang Li Hui-Chi Chen You have to enlarge your font. Many pages are hard to view. I think the position of Page topic is too

More information

FPGA Implementation of a Predictive Controller

FPGA Implementation of a Predictive Controller FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS

More information

Parallel Transposition of Sparse Data Structures

Parallel Transposition of Sparse Data Structures Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing

More information

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk

More information

Network Flow-based Simultaneous Retiming and Slack Budgeting for Low Power Design

Network Flow-based Simultaneous Retiming and Slack Budgeting for Low Power Design Outline Network Flow-based Simultaneous Retiming and Slack Budgeting for Low Power Design Bei Yu 1 Sheqin Dong 1 Yuchun Ma 1 Tao Lin 1 Yu Wang 1 Song Chen 2 Satoshi GOTO 2 1 Department of Computer Science

More information

Incremental Latin Hypercube Sampling

Incremental Latin Hypercube Sampling Incremental Latin Hypercube Sampling for Lifetime Stochastic Behavioral Modeling of Analog Circuits Yen-Lung Chen +, Wei Wu *, Chien-Nan Jimmy Liu + and Lei He * EE Dept., National Central University,

More information

EE582 Physical Design Automation of VLSI Circuits and Systems

EE582 Physical Design Automation of VLSI Circuits and Systems EE582 Prof. Dae Hyun Kim School of Electrical Engineering and Computer Science Washington State University Placement Metrics for Placement Wirelength Timing Power Routing congestion 2 Wirelength Estimation

More information

Scalable Non-blocking Preconditioned Conjugate Gradient Methods

Scalable Non-blocking Preconditioned Conjugate Gradient Methods Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William

More information

Skew Management of NBTI Impacted Gated Clock Trees

Skew Management of NBTI Impacted Gated Clock Trees International Symposium on Physical Design 2010 Skew Management of NBTI Impacted Gated Clock Trees Ashutosh Chakraborty and David Z. Pan ECE Department, University of Texas at Austin ashutosh@cerc.utexas.edu

More information

EE582 Physical Design Automation of VLSI Circuits and Systems

EE582 Physical Design Automation of VLSI Circuits and Systems EE582 Prof. Dae Hyun Kim School of Electrical Engineering and Computer Science Washington State University Partitioning What We Will Study Partitioning Practical examples Problem definition Deterministic

More information

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks Yufei Ma, Yu Cao, Sarma Vrudhula,

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems

More information

An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators

An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators An Algorithmic Framework of Large-Scale Circuit Simulation Using Exponential Integrators Hao Zhuang 1, Wenjian Yu 2, Ilgweon Kang 1, Xinan Wang 1, and Chung-Kuan Cheng 1 1. University of California, San

More information

Using AmgX to accelerate a PETSc-based immersed-boundary method code

Using AmgX to accelerate a PETSc-based immersed-boundary method code 29th International Conference on Parallel Computational Fluid Dynamics May 15-17, 2017; Glasgow, Scotland Using AmgX to accelerate a PETSc-based immersed-boundary method code Olivier Mesnard, Pi-Yueh Chuang,

More information

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain

More information

Minimum Cost Flow Based Fast Placement Algorithm

Minimum Cost Flow Based Fast Placement Algorithm Graduate Theses and Dissertations Graduate College 2012 Minimum Cost Flow Based Fast Placement Algorithm Enlai Xu Iowa State University Follow this and additional works at: http://lib.dr.iastate.edu/etd

More information

Buffered Clock Tree Sizing for Skew Minimization under Power and Thermal Budgets

Buffered Clock Tree Sizing for Skew Minimization under Power and Thermal Budgets Buffered Clock Tree Sizing for Skew Minimization under Power and Thermal Budgets Krit Athikulwongse, Xin Zhao, and Sung Kyu Lim School of Electrical and Computer Engineering Georgia Institute of Technology

More information

Hyperspherical Clustering and Sampling for Rare Event Analysis with Multiple Failure Region Coverage

Hyperspherical Clustering and Sampling for Rare Event Analysis with Multiple Failure Region Coverage Hyperspherical Clustering and Sampling for Rare Event Analysis with Multiple Failure Region Coverage Wei Wu 1, Srinivas Bodapati 2, Lei He 1,3 1 Electrical Engineering Department, UCLA 2 Intel Corporation

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

Algebraic Multigrid as Solvers and as Preconditioner

Algebraic Multigrid as Solvers and as Preconditioner Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven

More information

An Optimal Algorithm of Adjustable Delay Buffer Insertion for Solving Clock Skew Variation Problem

An Optimal Algorithm of Adjustable Delay Buffer Insertion for Solving Clock Skew Variation Problem An Optimal Algorithm of Adjustable Delay Buffer Insertion for Solving Clock Skew Variation Problem Juyeon Kim 1 juyeon@ssl.snu.ac.kr Deokjin Joo 1 jdj@ssl.snu.ac.kr Taewhan Kim 1,2 tkim@ssl.snu.ac.kr 1

More information

Efficient Reluctance Extraction for Large-Scale Power Grid with High- Frequency Consideration

Efficient Reluctance Extraction for Large-Scale Power Grid with High- Frequency Consideration Efficient Reluctance Extraction for Large-Scale Power Grid with High- Frequency Consideration Shan Zeng, Wenjian Yu, Jin Shi, Xianlong Hong Dept. Computer Science & Technology, Tsinghua University, Beijing

More information

HIGH-PERFORMANCE circuits consume a considerable

HIGH-PERFORMANCE circuits consume a considerable 1166 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL 17, NO 11, NOVEMBER 1998 A Matrix Synthesis Approach to Thermal Placement Chris C N Chu D F Wong Abstract In this

More information

Performance Analysis of Lattice QCD Application with APGAS Programming Model

Performance Analysis of Lattice QCD Application with APGAS Programming Model Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models

More information

Problems in VLSI design

Problems in VLSI design Problems in VLSI design wire and transistor sizing signal delay in RC circuits transistor and wire sizing Elmore delay minimization via GP dominant time constant minimization via SDP placement problems

More information

Incomplete Cholesky preconditioners that exploit the low-rank property

Incomplete Cholesky preconditioners that exploit the low-rank property anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Accelerating Model Reduction of Large Linear Systems with Graphics Processors Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex

More information

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method NUCLEAR SCIENCE AND TECHNIQUES 25, 0501 (14) Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method XU Qi ( 徐琪 ), 1, YU Gang-Lin ( 余纲林 ), 1 WANG Kan ( 王侃 ),

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

Fast FPGA Placement Based on Quantum Model

Fast FPGA Placement Based on Quantum Model Fast FPGA Placement Based on Quantum Model Lingli Wang School of Microelectronics Fudan University, Shanghai, China Email: llwang@fudan.edu.cn 1 Outline Background & Motivation VPR placement Quantum Model

More information

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization Jia Wang, Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University Houghton, Michigan

More information

Unit 1A: Computational Complexity

Unit 1A: Computational Complexity Unit 1A: Computational Complexity Course contents: Computational complexity NP-completeness Algorithmic Paradigms Readings Chapters 3, 4, and 5 Unit 1A 1 O: Upper Bounding Function Def: f(n)= O(g(n)) if

More information

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors 20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,

More information

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose

上海超级计算中心 Shanghai Supercomputer Center. Lei Xu Shanghai Supercomputer Center San Jose 上海超级计算中心 Shanghai Supercomputer Center Lei Xu Shanghai Supercomputer Center 03/26/2014 @GTC, San Jose Overview Introduction Fundamentals of the FDTD method Implementation of 3D UPML-FDTD algorithm on GPU

More information

THE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression

THE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression Faster Static Timing Analysis via Bus Compression by David Van Campenhout and Trevor Mudge CSE-TR-285-96 THE UNIVERSITY OF MICHIGAN Computer Science and Engineering Division Department of Electrical Engineering

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

EE582 Physical Design Automation of VLSI Circuits and Systems

EE582 Physical Design Automation of VLSI Circuits and Systems EE582 Prof. Dae Hyun Kim School of Electrical Engineering and Computer Science Washington State University Partitioning What We Will Study Partitioning Practical examples Problem definition Deterministic

More information

A Framework for Layout-Level Logic Restructuring. Hosung Leo Kim John Lillis

A Framework for Layout-Level Logic Restructuring. Hosung Leo Kim John Lillis A Framework for Layout-Level Logic Restructuring Hosung Leo Kim John Lillis Motivation: Logical-to-Physical Disconnect Logic-level Optimization disconnect Physical-level Optimization fixed netlist Limited

More information

iretilp : An efficient incremental algorithm for min-period retiming under general delay model

iretilp : An efficient incremental algorithm for min-period retiming under general delay model iretilp : An efficient incremental algorithm for min-period retiming under general delay model Debasish Das, Jia Wang and Hai Zhou EECS, Northwestern University, Evanston, IL 60201 Place and Route Group,

More information

A Temperature-aware Synthesis Approach for Simultaneous Delay and Leakage Optimization

A Temperature-aware Synthesis Approach for Simultaneous Delay and Leakage Optimization A Temperature-aware Synthesis Approach for Simultaneous Delay and Leakage Optimization Nathaniel A. Conos and Miodrag Potkonjak Computer Science Department University of California, Los Angeles {conos,

More information

Parallelizing large scale time domain electromagnetic inverse problem

Parallelizing large scale time domain electromagnetic inverse problem Parallelizing large scale time domain electromagnetic inverse problems Eldad Haber with: D. Oldenburg & R. Shekhtman + Emory University, Atlanta, GA + The University of British Columbia, Vancouver, BC,

More information

Making Fast Buffer Insertion Even Faster via Approximation Techniques

Making Fast Buffer Insertion Even Faster via Approximation Techniques Making Fast Buffer Insertion Even Faster via Approximation Techniques Zhuo Li, C. N. Sze, Jiang Hu and Weiping Shi Department of Electrical Engineering Texas A&M University Charles J. Alpert IBM Austin

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

A Novel Cell Placement Algorithm for Flexible TFT Circuit with Mechanical Strain and Temperature Consideration

A Novel Cell Placement Algorithm for Flexible TFT Circuit with Mechanical Strain and Temperature Consideration A Novel Cell Placement Algorithm for Flexible TFT Circuit with Mechanical Strain and Temperature Consideration Jiun-Li Lin, Po-Hsun Wu, and Tsung-Yi Ho Department of Computer Science and Information Engineering,

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Clock Buffer Polarity Assignment Utilizing Useful Clock Skews for Power Noise Reduction

Clock Buffer Polarity Assignment Utilizing Useful Clock Skews for Power Noise Reduction Clock Buffer Polarity Assignment Utilizing Useful Clock Skews for Power Noise Reduction Deokjin Joo and Taewhan Kim Department of Electrical and Computer Engineering, Seoul National University, Seoul,

More information

Runtime Analysis of 4 VA HiCuM Versions with and without Internal Solver

Runtime Analysis of 4 VA HiCuM Versions with and without Internal Solver Runtime Analysis of 4 VA HiCuM Versions with and without Internal Solver Didier Céli, Jean Remy 28 th ArbeitsKreis Bipolar - Letter Session Unterpremstaetten, Austria, November 5/6, 215 dm23a.15 Outline

More information

Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters

Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters HIM - Workshop on Sparse Grids and Applications Alexander Heinecke Chair of Scientific Computing May 18 th 2011 HIM

More information

Constraint Satisfaction in Incremental Placement with Application to Performance Optimization under Power Constraints

Constraint Satisfaction in Incremental Placement with Application to Performance Optimization under Power Constraints Constraint Satisfaction in Incremental Placement with Application to Performance Optimization under Power Constraints Huan Ren and Shantanu Dutt Dept. of ECE, University of Illinois-Chicago Abstract We

More information

Jacobi-Davidson Eigensolver in Cusolver Library. Lung-Sheng Chien, NVIDIA

Jacobi-Davidson Eigensolver in Cusolver Library. Lung-Sheng Chien, NVIDIA Jacobi-Davidson Eigensolver in Cusolver Library Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline CuSolver library - cusolverdn: dense LAPACK - cusolversp: sparse LAPACK - cusolverrf: refactorization

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

Fine-grained Parallel Incomplete LU Factorization

Fine-grained Parallel Incomplete LU Factorization Fine-grained Parallel Incomplete LU Factorization Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Sparse Days Meeting at CERFACS June 5-6, 2014 Contribution

More information

Level-3 BLAS on a GPU

Level-3 BLAS on a GPU Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón

More information

Adding a New Dimension to Physical Design. Sachin Sapatnekar University of Minnesota

Adding a New Dimension to Physical Design. Sachin Sapatnekar University of Minnesota Adding a New Dimension to Physical Design Sachin Sapatnekar University of Minnesota 1 Outline What is 3D about? Why 3D? 3D-specific challenges 3D analysis and optimization 2 Planning a city: Land usage

More information

PetaBricks: Variable Accuracy and Online Learning

PetaBricks: Variable Accuracy and Online Learning PetaBricks: Variable Accuracy and Online Learning Jason Ansel MIT - CSAIL May 4, 2011 Jason Ansel (MIT) PetaBricks May 4, 2011 1 / 40 Outline 1 Motivating Example 2 PetaBricks Language Overview 3 Variable

More information

Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems

Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems Jian-Jia Chen *, Chuan Yue Yang, Tei-Wei Kuo, and Chi-Sheng Shih Embedded Systems and Wireless Networking Lab. Department of Computer

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm)

Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Alexander Smith & Khashayar Khavari Department of Electrical and Computer Engineering University of Toronto April 15, 2009 Alexander

More information

Review: From problem to parallel algorithm

Review: From problem to parallel algorithm Review: From problem to parallel algorithm Mathematical formulations of interesting problems abound Poisson s equation Sources: Electrostatics, gravity, fluid flow, image processing (!) Numerical solution:

More information

Conjugate Gradients: Idea

Conjugate Gradients: Idea Overview Steepest Descent often takes steps in the same direction as earlier steps Wouldn t it be better every time we take a step to get it exactly right the first time? Again, in general we choose a

More information

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

SOLUTION of linear systems of equations of the form:

SOLUTION of linear systems of equations of the form: Proceedings of the Federated Conference on Computer Science and Information Systems pp. Mixed precision iterative refinement techniques for the WZ factorization Beata Bylina Jarosław Bylina Institute of

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Multi-Level Logic Optimization. Technology Independent. Thanks to R. Rudell, S. Malik, R. Rutenbar. University of California, Berkeley, CA

Multi-Level Logic Optimization. Technology Independent. Thanks to R. Rudell, S. Malik, R. Rutenbar. University of California, Berkeley, CA Technology Independent Multi-Level Logic Optimization Prof. Kurt Keutzer Prof. Sanjit Seshia EECS University of California, Berkeley, CA Thanks to R. Rudell, S. Malik, R. Rutenbar 1 Logic Optimization

More information

CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators

CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators Sandeep D souza and Ragunathan (Raj) Rajkumar Carnegie Mellon University High (Energy) Cost of Accelerators Modern-day

More information

Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI

Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Charles Lo and Paul Chow {locharl1, pc}@eecg.toronto.edu Department of Electrical and Computer Engineering

More information

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA Date: 16th May 2012 Wed, 3pm to 3.25pm(Adv. Session) Sathyanarayana K., Manish Banga, and Ravi Kumar G. V. V. Engineering Services,

More information

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Martin Takáč The University of Edinburgh Based on: P. Richtárik and M. Takáč. Iteration complexity of randomized

More information

Stochastic Dynamic Thermal Management: A Markovian Decision-based Approach. Hwisung Jung, Massoud Pedram

Stochastic Dynamic Thermal Management: A Markovian Decision-based Approach. Hwisung Jung, Massoud Pedram Stochastic Dynamic Thermal Management: A Markovian Decision-based Approach Hwisung Jung, Massoud Pedram Outline Introduction Background Thermal Management Framework Accuracy of Modeling Policy Representation

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 37, No. 2, pp. C169 C193 c 2015 Society for Industrial and Applied Mathematics FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper

More information

Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX

Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX 26 Septembre 2018 - JCAD 2018 - Lyon Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman Outline 1. Context 2.

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI

Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI Sagar Bhatt Person Number: 50170651 Department of Mechanical and Aerospace Engineering,

More information

ABHELSINKI UNIVERSITY OF TECHNOLOGY

ABHELSINKI UNIVERSITY OF TECHNOLOGY On Repeated Squarings in Binary Fields Kimmo Järvinen Helsinki University of Technology August 14, 2009 K. Järvinen On Repeated Squarings in Binary Fields 1/1 Introduction Repeated squaring Repeated squaring:

More information

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Michael Griebel Christian Rieger Peter Zaspel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

Petascale Quantum Simulations of Nano Systems and Biomolecules

Petascale Quantum Simulations of Nano Systems and Biomolecules Petascale Quantum Simulations of Nano Systems and Biomolecules Emil Briggs North Carolina State University 1. Outline of real-space Multigrid (RMG) 2. Scalability and hybrid/threaded models 3. GPU acceleration

More information

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work

More information

392 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 3, MARCH 2013

392 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 3, MARCH 2013 392 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 3, MARCH 2013 An Optimal Allocation Algorithm of Adjustable Delay Buffers and Practical Extensions for Clock

More information

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,

More information

Dynamic Scheduling within MAGMA

Dynamic Scheduling within MAGMA Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing

More information

Power-Driven Global Routing for Multi-Supply Voltage Domains

Power-Driven Global Routing for Multi-Supply Voltage Domains Power-Driven Global Routing for Multi-Supply Voltage Domains Tai-Hsuan Wu, Azadeh Davoodi, and Jeffrey T. Linderoth Department of Electrical and Computer Engineering Department of Industrial and Systems

More information

GPU accelerated Arnoldi solver for small batched matrix

GPU accelerated Arnoldi solver for small batched matrix 15. 09. 22 GPU accelerated Arnoldi solver for small batched matrix Samsung Advanced Institute of Technology Hyung-Jin Kim Contents - Eigen value problems - Solution - Arnoldi Algorithm - Target - CUDA

More information

n-level Graph Partitioning

n-level Graph Partitioning Vitaly Osipov, Peter Sanders - Algorithmik II 1 Vitaly Osipov: KIT Universität des Landes Baden-Württemberg und nationales Grossforschungszentrum in der Helmholtz-Gemeinschaft Institut für Theoretische

More information

Introduction to Benchmark Test for Multi-scale Computational Materials Software

Introduction to Benchmark Test for Multi-scale Computational Materials Software Introduction to Benchmark Test for Multi-scale Computational Materials Software Shun Xu*, Jian Zhang, Zhong Jin xushun@sccas.cn Computer Network Information Center Chinese Academy of Sciences (IPCC member)

More information

Dynamic Adaptation for Resilient Integrated Circuits and Systems

Dynamic Adaptation for Resilient Integrated Circuits and Systems Dynamic Adaptation for Resilient Integrated Circuits and Systems Krishnendu Chakrabarty Department of Electrical and Computer Engineering Duke University Durham, NC 27708, USA Department of Computer and

More information

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory MAX PLANCK INSTITUTE November 5, 2010 MPI at MPI Jens Saak Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory FOR DYNAMICS OF COMPLEX TECHNICAL

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!

More information

Some Geometric and Algebraic Aspects of Domain Decomposition Methods

Some Geometric and Algebraic Aspects of Domain Decomposition Methods Some Geometric and Algebraic Aspects of Domain Decomposition Methods D.S.Butyugin 1, Y.L.Gurieva 1, V.P.Ilin 1,2, and D.V.Perevozkin 1 Abstract Some geometric and algebraic aspects of various domain decomposition

More information

Detecting Support-Reducing Bound Sets using Two-Cofactor Symmetries 1

Detecting Support-Reducing Bound Sets using Two-Cofactor Symmetries 1 3A-3 Detecting Support-Reducing Bound Sets using Two-Cofactor Symmetries 1 Jin S. Zhang Department of ECE Portland State University Portland, OR 97201 jinsong@ece.pdx.edu Malgorzata Chrzanowska-Jeske Department

More information

Optimal Folding Of Bit Sliced Stacks+

Optimal Folding Of Bit Sliced Stacks+ Optimal Folding Of Bit Sliced Stacks+ Doowon Paik University of Minnesota Sartaj Sahni University of Florida Abstract We develop fast polynomial time algorithms to optimally fold stacked bit sliced architectures

More information