Saving Energy in Sparse and Dense Linear Algebra Computations
1 Saving Energy in Sparse and Dense Linear Algebra Computations
P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca
Univ. Politécnica de Valencia (Spain), Univ. Jaume I de Castellón (Spain), The Univ. of Texas at Austin (TX)
quintana@icc.uji.es
SIAM PPSC, Savannah (GA), Feb. 2012
2 Introduction and motivation
Motivation: reduce energy consumption!
- Costs over the lifetime of an HPC facility often exceed the acquisition costs
- Carbon dioxide is a risk for health and the environment
- Heat reduces hardware reliability
Personal view:
- Hardware features mechanisms and modes to save energy
- Scientific apps. are, in general, energy-oblivious
4 Motivation (cont'd)
Scheduling of task-parallel linear algebra algorithms:
- Examples: Cholesky, QR and LU factorizations, etc.; ILUPACK
Energy-saving tools available for multi-core processors:
- Example: Dynamic Voltage and Frequency Scaling (DVFS)
Scheduling tasks + DVFS yields energy-aware execution on multi-core processors. Two scheduling strategies:
- SRA: reduce the frequency of cores that will execute non-critical tasks, to decrease idle times without sacrificing the total performance of the algorithm
- RIA: execute all tasks at the highest frequency to enjoy longer inactive periods
6 Outline
1. LU factorization with partial pivoting
   1. Slack Reduction Algorithm
   2. Race-to-Idle Algorithm
   3. Simulation
   4. Experimental Results
2. ILUPACK
   1. Race-to-Idle Algorithm
   2. Experimental Results
3. Conclusions
7 1. LU Factorization with Partial Pivoting (LUPP)
Factor A = LU, with L ∈ R^{n×n} unit lower triangular and U ∈ R^{n×n} upper triangular. For numerical stability, permutations are introduced to avoid operating with small pivot elements: PA = LU, with P ∈ R^{n×n} a permutation matrix that interchanges rows of A.
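The factorization PA = LU can be sketched in a few lines. A minimal NumPy illustration (not the presenters' blocked implementation), with row pivoting as described above:

```python
# Minimal sketch of LU with partial pivoting (PA = LU): illustration only,
# assuming a dense square NumPy array with a nonsingular input.
import numpy as np

def lu_partial_pivoting(A):
    """Return P, L, U with P @ A = L @ U (right-looking, row pivoting)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        # Pick the largest pivot in column k to avoid tiny divisors
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        A[k+1:, k] /= A[k, k]                               # multipliers (column of L)
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])   # trailing update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    P = np.eye(n)[piv]
    return P, L, U
```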
8 Blocked algorithm for LUPP (no pivoting, for simplicity)

for k = 1 : s do
  A_{k:s,k} = L_{k:s,k} U_{kk}                      (LU factorization of the panel)
  for j = k + 1 : s do
    A_{kj} <- L_{kk}^{-1} A_{kj}                    (triangular solve)
    A_{k+1:s,j} <- A_{k+1:s,j} - A_{k+1:s,k} A_{kj} (matrix-matrix product)
  end for
end for

[Figure: DAG for a matrix of 5x5 blocks (s = 5), with panel factorization tasks G_{kk}, triangular solves T_{jk}, and matrix-matrix products M_{jk}.]
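The blocked loop above can be sketched as follows. A simplified NumPy version, assuming no pivoting (as on the slide) and no zero pivots:

```python
# Sketch of right-looking blocked LU without pivoting, block size b.
# Illustration of the loop structure only; assumes no zero pivots arise.
import numpy as np

def blocked_lu(A, b):
    """In-place blocked LU: on return, A holds L (strictly lower) and U."""
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # G: unblocked LU of the current panel A[k:, k:e]
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # T: triangular solve  A[k:e, e:] <- L_kk^{-1} A[k:e, e:]
            Lkk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(Lkk, A[k:e, e:])
            # M: trailing matrix-matrix update
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```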
10 1.1 Slack Reduction Algorithm
Strategy: search for slacks (idle periods) in the DAG associated with the algorithm, and try to minimize them by applying, e.g., DVFS.
1. Search for slacks via the Critical Path Method (CPM) on the DAG of dependencies (nodes = tasks, edges = dependencies):
- ES_i / LF_i: earliest and latest times task T_i, with cost C_i, can start / finish without increasing the total execution time of the algorithm
- S_i: slack (time) task T_i can be delayed without increasing the total execution time
- Critical path: collection of directly connected tasks, from the initial to the final node of the graph, with total slack = 0
2. Minimize the slack of tasks with S_i > 0 by reducing their execution frequency.
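The CPM quantities on this slide (ES_i, LF_i, slack S_i and the critical path) follow from two passes over the DAG; a small sketch with illustrative task names and costs:

```python
# Sketch of the Critical Path Method: forward pass for earliest starts (ES),
# backward pass for latest finishes (LF), slack S = LF - ES - C.
# Task names/costs are illustrative, not the slides' LUPP tasks.
from collections import defaultdict

def cpm(costs, edges):
    """costs: {task: duration}, listed in topological order;
    edges: (u, v) means u must finish before v starts."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    tasks = list(costs)
    ES = {}                                    # earliest start times
    for t in tasks:                            # forward pass
        ES[t] = max((ES[p] + costs[p] for p in pred[t]), default=0.0)
    makespan = max(ES[t] + costs[t] for t in tasks)
    LF = {}                                    # latest finish times
    for t in reversed(tasks):                  # backward pass
        LF[t] = min((LF[s] - costs[s] for s in succ[t]), default=makespan)
    slack = {t: LF[t] - ES[t] - costs[t] for t in tasks}
    critical = [t for t in tasks if slack[t] == 0.0]
    return ES, LF, slack, critical
```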
12 Application of CPM to the DAG of the LUPP of a 5x5 blocked matrix.
[Figure/table: the DAG annotated with cost C, ES, LF and slack S per task; the numeric values are not recoverable from the transcription.]
The duration (cost C) of tasks of type G and M depends on the iteration! Evaluate the time of 1 flop for each type of task and, from its theoretical cost, approximate the execution time.
13 The Slack Reduction Algorithm proceeds in three stages: 1. frequency assignment (set initial frequencies); 2. critical subpath extraction; 3. slack reduction.

Stage 1, frequency assignment:
- Discrete range of frequencies: { ..., 1.50, 1.20, 1.00, 0.80 } GHz
- The duration of tasks of type G and M depends on the iteration
- Initially, all tasks are set to run at the highest frequency
14 Stage 2, critical subpath extraction: identify the critical subpaths.
1. Identify and extract the critical (sub)path
2. Eliminate the graph nodes/edges that belong to it
3. Repeat the process until the graph is empty
Results for the example DAG: the critical path CP_0 plus 3 critical subpaths, from longest to shortest: CP_1 > CP_2 > CP_3.
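The extraction loop above can be sketched as repeatedly computing the longest path of the remaining DAG and peeling it off (illustrative code; task names and costs are made up):

```python
# Sketch of critical-subpath extraction: take the longest (critical) path of
# the remaining DAG, remove its nodes and incident edges, repeat until empty.
from collections import defaultdict

def extract_subpaths(costs, edges):
    costs = dict(costs)
    edges = set(edges)
    paths = []
    while costs:
        # Topological order of the remaining tasks (Kahn's algorithm)
        succ = defaultdict(list)
        indeg = {t: 0 for t in costs}
        for u, v in edges:
            succ[u].append(v)
            indeg[v] += 1
        order, stack = [], [t for t in costs if indeg[t] == 0]
        while stack:
            t = stack.pop()
            order.append(t)
            for s in succ[t]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    stack.append(s)
        # Longest path via DP: best[t] = (length, path) starting at t
        best = {}
        for t in reversed(order):
            length, tail = max((best[s] for s in succ[t]), default=(0.0, []))
            best[t] = (costs[t] + length, [t] + tail)
        _, path = max(best.values())
        paths.append(path)
        for t in path:                        # remove the path from the graph
            del costs[t]
        edges = {(u, v) for u, v in edges if u in costs and v in costs}
    return paths   # paths[0] is the critical path, then subpaths by length
```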
15 Stage 3, slack reduction: determine the execution frequency of each task.
1. Reduce the frequency of the tasks of the largest unprocessed subpath
2. Check the lowest frequency-reduction ratio for each task of that subpath
3. Repeat the process until all subpaths are processed
Result for the example: the tasks of CP_1, CP_2 and CP_3 are assigned to run at 1.5 GHz!
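The per-subpath frequency choice can be sketched as picking the lowest discrete frequency whose slowdown still fits within the available slack. The linear time-vs-frequency model (time scaled by f_max / f) and the 2.0 GHz top frequency used in the example are assumptions, since the deck's top frequency value was lost:

```python
# Sketch of the slack-reduction frequency choice, under an assumed linear
# timing model: running at frequency f stretches a task by f_max / f.
def pick_frequency(duration_at_fmax, slack, freqs, f_max):
    """Lowest f in freqs such that the stretched duration fits in the slack."""
    for f in sorted(freqs):                   # try the slowest setting first
        if duration_at_fmax * (f_max / f) <= duration_at_fmax + slack:
            return f
    return f_max                              # no usable slack: run flat out
```

For instance, a task of 2 time units with 1 unit of slack cannot drop below 1.5 GHz under this model, which matches the slide's outcome for the example DAG.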
16 1.2 Race-to-Idle Algorithm
Race-to-Idle: complete the execution as soon as possible by running the tasks of the algorithm at the highest frequency, so as to enjoy longer inactive periods.
- Tasks are executed at the highest frequency
- During idle periods, the CPU frequency is reduced to the lowest possible
Why? Current processors are quite efficient at saving power when idle: the power of an idle core is much lower than the power during working periods. Moreover, the DAG requires no processing, unlike with the SRA.
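On Linux, race-to-idle can be approximated with the standard cpufreq governors: `performance` while tasks are ready, `powersave` when idle. A sketch (writing to sysfs requires root; the policy is split into a pure function for clarity):

```python
# Race-to-idle via the standard Linux cpufreq governors: a sketch assuming
# the usual sysfs layout; writing these files requires root privileges.
def governor_for(ready_tasks):
    """Policy: race at full speed while work remains, idle as low as possible."""
    return "performance" if ready_tasks else "powersave"

def apply_governor(core, ready_tasks):
    # Standard per-core cpufreq path on Linux; adjust if your system differs.
    path = f"/sys/devices/system/cpu/cpu{core}/cpufreq/scaling_governor"
    with open(path, "w") as fh:
        fh.write(governor_for(ready_tasks))
```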
17 1.3 Simulation
A simulator is used to evaluate the performance of the two strategies. Input parameters:
- The DAG capturing the tasks and dependencies of a blocked algorithm, plus the frequencies recommended by the Slack Reduction Algorithm and Race-to-Idle Algorithm
- A simple description of the target architecture: number of sockets (physical processors); number of cores per socket; discrete range of frequencies and associated voltages; whether frequency changes operate at core or socket level; a collection of real power measurements for each combination of frequency and idle/busy state per core; the cost (overhead) of performing frequency changes
- A static priority-list scheduler: the duration of tasks at each available frequency is known in advance, and tasks that lie on the critical path are prioritized
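A static priority-list scheduler like the one described can be sketched in a few lines. This is a single-frequency simplification with made-up tasks, returning the simulated makespan and core utilization:

```python
# Sketch of a static priority-list scheduler on p identical cores: ready
# tasks are dispatched in priority order (critical-path tasks first), each
# onto the earliest-free core. Illustration only, not the actual simulator.
from collections import defaultdict

def simulate(costs, edges, priority, p):
    """costs: {task: duration}; priority: {task: rank}, lower runs sooner."""
    succ, pred = defaultdict(list), defaultdict(list)
    npred = defaultdict(int)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        npred[v] += 1
    ready = [t for t in costs if npred[t] == 0]
    core_free = [0.0] * p                 # time at which each core frees up
    finish = {}                           # per-task finish times
    busy = 0.0
    while ready:
        t = min(ready, key=lambda x: priority.get(x, 0))
        ready.remove(t)
        c = min(range(p), key=core_free.__getitem__)
        start = max([core_free[c]] + [finish[u] for u in pred[t]])
        finish[t] = start + costs[t]
        core_free[c] = finish[t]
        busy += costs[t]
        for s in succ[t]:                 # release successors
            npred[s] -= 1
            if npred[s] == 0:
                ready.append(s)
    makespan = max(finish.values())
    return makespan, busy / (p * makespan)   # makespan, core utilization
```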
18 Simulation setup. AMD Opteron 6128 (8 cores):
- Evaluate the time of 1 flop for each type of task and, from its theoretical cost, approximate the execution time
- Frequency changes at core level, with f in { ..., 1.50, 1.20, 1.00, 0.80 } GHz
Blocked algorithm for LUPP:
- Simulation independent of the actual implementation (LAPACK, libflame, etc.)
- Matrix size n from 768 to 5,632; block size b = 256
Metrics: compare the simulated time and energy consumption of the original algorithm with those of the modified SRA/RIA algorithms:
- Simulated times T_SRA/RIA and T_original; impact of SRA/RIA on time: IT = (T_SRA/RIA - T_original) / T_original x 100
- Simulated energies E_SRA/RIA and E_original, accumulated from the per-interval power draw of the cores (sum_i W_i T_i); impact of SRA/RIA on energy: IE = (E_SRA/RIA - E_original) / E_original x 100
19 Impact of SRA/RIA on simulated time/energy for LUPP: only power/energy due to the workload (application)!
[Figure: % impact on time and % impact on energy vs. matrix size n, for RIA and SRA.]
- SRA: time is compromised, which increases the consumption for the largest problem sizes; the increase in execution time is due to the SRA being oblivious to the actual resources
- RIA: time is not compromised, and consumption is reduced for large problem sizes
20 The same comparison for the application energy only.
[Figure: % impact on time and % impact on application energy vs. matrix size n, for RIA and SRA.]
- SRA: time is compromised, increasing the consumption for the largest problem sizes, since the SRA is oblivious to the actual resources
- RIA: time is not compromised, and consumption is reduced for large problem sizes
21 1.4 Experimental Results for LUPP
Integrate RIA into a runtime for the task-parallel execution of dense linear algebra algorithms: libflame + SuperMatrix.
[Figure: a symbolic analysis of the algorithm builds the queue of pending tasks + dependencies (DAG); a dispatcher moves tasks with no pending dependencies to the queue of ready tasks, which is served by worker threads 1..p pinned to cores 1..p.]
Two energy-aware techniques:
- RIA1: reduce the operating frequency when there are no ready tasks (via the Linux governor)
- RIA2: remove polling when there are no ready tasks (while ensuring a quick recovery)
Applicable to any runtime: SuperMatrix, SMPSs, Quark, etc.!
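The RIA2 idea, replacing the spin-polling loop with blocking on a condition variable so that idle workers sleep yet wake immediately when work arrives, can be sketched as follows (illustrative; not the SuperMatrix code):

```python
# Sketch of a ready-task queue without polling: idle workers block on a
# condition variable and are woken as soon as a task (or shutdown) arrives.
import threading
from collections import deque

class ReadyQueue:
    def __init__(self):
        self._q = deque()
        self._cv = threading.Condition()
        self._closed = False

    def push(self, task):
        with self._cv:
            self._q.append(task)
            self._cv.notify()          # quick recovery: wake one idle worker

    def pop(self):
        """Block until a task is available; return None after close()."""
        with self._cv:
            while not self._q and not self._closed:
                self._cv.wait()        # sleep instead of burning cycles
            return self._q.popleft() if self._q else None

    def close(self):
        with self._cv:
            self._closed = True
            self._cv.notify_all()      # release all blocked workers
```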
22 "Doing nothing well" (David E. Culler)
AMD Opteron 6128, 1 core: power for different thread activities.
[Figure: power (Watts) over time (s) for MKL dgemm at the top frequency, blocking at 800 MHz, polling at the top frequency, and polling at 800 MHz.]
23 Experimental setup. AMD Opteron 6128 (8 cores):
- Frequency changes at core level, with f_max = ... GHz and f_min = 800 MHz
Blocked algorithm for LUPP: routine FLA_LU_piv from libflame v5.0.
Metrics:
- Matrix size n from 2,048 to 12,288; block size b = 256
- Compare the actual time and energy consumption of the original runtime with those of the modified RIA1/RIA2 runtimes
- Actual times T_RIA and T_original; impact of RIA on time: IT = (T_RIA - T_original) / T_original x 100
- Actual energies E_RIA and E_original measured with an internal powermeter at a sampling frequency of f = 25 Hz; impact of RIA on energy: IE = (E_RIA - E_original) / E_original x 100
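Turning powermeter samples taken at a fixed 25 Hz rate into energy, and the IT/IE-style impact metric, can be sketched as follows (a rectangle-rule approximation; the actual measurement setup is the presenters'):

```python
# Sketch: integrate fixed-rate power samples into energy, and compute the
# relative-impact metric (negative values = savings). Illustration only.
def energy_joules(samples_watts, rate_hz=25.0):
    """Rectangle rule: each sample is held for one 1/rate_hz period."""
    return sum(samples_watts) / rate_hz

def impact_percent(modified, original):
    """IT/IE-style metric: relative change of the modified run vs. original."""
    return (modified - original) / original * 100.0
```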
24 Impact of RIA1/RIA2 on actual time/energy for LUPP.
[Figure: % impact on time and % impact on energy vs. matrix size n, for RIA1, RIA2 and RIA1+RIA2.]
- Small impact on execution time
- Consistent savings of around 5% for RIA1+RIA2; there is not much hope for larger savings because there is no opportunity for them: the cores are busy most of the time
25 Impact of RIA1/RIA2 on actual time/energy for LUPP: only power/energy due to the workload (application)!
[Figure: % impact on time and % impact on application energy vs. matrix size n, for RIA1, RIA2 and RIA1+RIA2.]
- Small impact on execution time
- Consistent savings of around 7-8% for RIA1+RIA2; again, the cores are busy most of the time, leaving little opportunity for larger savings
26 2. ILUPACK
Sequential version (M. Bollhöfer):
- Iterative solver for large-scale sparse linear systems
- Multilevel ILU preconditioners for general and complex symmetric/Hermitian positive definite systems
- Based on inverse-based ILUs: incomplete LU decompositions that control the growth of the inverse triangular factors
Multi-threaded version (J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana-Ortí):
- Real symmetric positive definite systems
- Construction of the preconditioner and PCG solver
- Algebraic parallelization based on a task tree
- Leverages task-parallelism in the tree
- Dynamic scheduling via a tailored run-time (OpenMP)
27 [Figure: the permuted matrix PA and its partitioning GA, together with the associated binary task tree TT: leaf tasks (1,1)-(1,4) feed inner tasks (2,1) and (2,2), which feed the root task (3,1).]
28 Integrate RIA into the multi-threaded version of ILUPACK.
[Figure: the problem data (matrix A) is partitioned with METIS/SCOTCH into a queue of pending tasks (a binary tree); queues of ready tasks (no pending dependencies) are served by worker threads 1..p pinned to cores 1..p.]
Same energy-aware techniques, RIA1+RIA2: reduce the operating frequency when there are no ready tasks (Linux governor), and remove polling when there are no ready tasks (while ensuring a quick recovery).
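Dependency-driven execution over the binary task tree can be sketched with a readiness counter per inner node: a node runs once both of its children have finished. An illustrative, single-threaded version of what the runtime does in parallel:

```python
# Sketch of bottom-up scheduling over a binary task tree: leaves are ready
# immediately; an inner node becomes ready when both children complete.
def tree_schedule(children, root):
    """children: {node: (left, right)} for inner nodes; leaves are absent.
    Returns an execution order respecting the child-before-parent rule."""
    parent, nodes, stack = {}, [], [root]
    while stack:                               # enumerate the tree
        n = stack.pop()
        nodes.append(n)
        for c in children.get(n, ()):
            parent[c] = n
            stack.append(c)
    pending = {n: len(children.get(n, ())) for n in nodes}
    ready = [n for n in nodes if pending[n] == 0]   # the leaves
    order = []
    while ready:
        n = ready.pop()                        # any ready task may run next
        order.append(n)
        if n in parent:
            pending[parent[n]] -= 1
            if pending[parent[n]] == 0:
                ready.append(parent[n])        # both children finished
    return order
```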
29 AMD Opteron 6128 (8 cores); linear system associated with a Laplacian equation (n of about 16M).
Impact of RIA1+RIA2 on actual power / application power for ILUPACK (preconditioner computation only):
[Figure: total power and application power over the sample index, comparing the 1.6 GHz baseline against RIA1+RIA2.]
30 3. Conclusions
Goal: combine scheduling + DVFS to save energy during the execution of linear algebra algorithms on multi-threaded architectures.
For (dense) LUPP:
- Slack Reduction Algorithm: the DAG requires processing; currently does not take the number of resources into account; increases the execution time as the matrix size grows, and with it the energy consumption
- Race-to-Idle Algorithm: applied on-the-fly, no pre-processing needed; maintains the execution time in all cases; reduces the application energy consumption (by around 7-8%)
As the ratio between the problem size and the number of resources grows, the opportunities to save energy decrease!
31 For (sparse) ILUPACK:
- Race-to-Idle Algorithm: the DAG requires no processing; the algorithm is applied on-the-fly; maintains the execution time in all cases; reduces the power dissipated by the application by up to 40%
- Significant opportunities to save energy!
In general: we need architectures that know how to do nothing better!
32 More information
- Improving power-efficiency of dense linear algebra algorithms on multi-core processors via slack control. P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí. Workshop on Optimization Issues in Energy Efficient Distributed Systems (OPTIM 2011), Istanbul (Turkey), July 2011.
- DVFS-control techniques for dense linear algebra operations on multi-core processors. P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí. 2nd Int. Conf. on Energy-Aware High Performance Computing (EnaHPC 2011), Hamburg (Germany), Sept. 2011.
- Saving energy in the LU factorization with partial pivoting on multi-core processors. P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí. 20th Euromicro Conf. on Parallel, Distributed and Network-based Processing (PDP 2012), Garching (Germany), Feb. 2012.
More informationNumerical Analysis Fall. Gauss Elimination
Numerical Analysis 2015 Fall Gauss Elimination Solving systems m g g m m g x x x k k k k k k k k k 3 2 1 3 2 1 3 3 3 2 3 2 2 2 1 0 0 Graphical Method For small sets of simultaneous equations, graphing
More informationAnalytical Modeling of Parallel Programs. S. Oliveira
Analytical Modeling of Parallel Programs S. Oliveira Fall 2005 1 Scalability of Parallel Systems Efficiency of a parallel program E = S/P = T s /PT p Using the parallel overhead expression E = 1/(1 + T
More informationUsing Postordering and Static Symbolic Factorization for Parallel Sparse LU
Using Postordering and Static Symbolic Factorization for Parallel Sparse LU Michel Cosnard LORIA - INRIA Lorraine Nancy, France Michel.Cosnard@loria.fr Laura Grigori LORIA - Univ. Henri Poincaré Nancy,
More informationToday s class. Linear Algebraic Equations LU Decomposition. Numerical Methods, Fall 2011 Lecture 8. Prof. Jinbo Bi CSE, UConn
Today s class Linear Algebraic Equations LU Decomposition 1 Linear Algebraic Equations Gaussian Elimination works well for solving linear systems of the form: AX = B What if you have to solve the linear
More informationJacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is
More informationLinear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4
Linear Algebra Section. : LU Decomposition Section. : Permutations and transposes Wednesday, February 1th Math 01 Week # 1 The LU Decomposition We learned last time that we can factor a invertible matrix
More informationTall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222
Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department
More informationTDDB68 Concurrent programming and operating systems. Lecture: CPU Scheduling II
TDDB68 Concurrent programming and operating systems Lecture: CPU Scheduling II Mikael Asplund, Senior Lecturer Real-time Systems Laboratory Department of Computer and Information Science Copyright Notice:
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationTradeoffs between synchronization, communication, and work in parallel linear algebra computations
Tradeoffs between synchronization, communication, and work in parallel linear algebra computations Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel Department of EECS, UC Berkeley February,
More informationIntroduction to Mathematical Programming
Introduction to Mathematical Programming Ming Zhong Lecture 6 September 12, 2018 Ming Zhong (JHU) AMS Fall 2018 1 / 20 Table of Contents 1 Ming Zhong (JHU) AMS Fall 2018 2 / 20 Solving Linear Systems A
More informationSpecial Nodes for Interface
fi fi Special Nodes for Interface SW on processors Chip-level HW Board-level HW fi fi C code VHDL VHDL code retargetable compilation high-level synthesis SW costs HW costs partitioning (solve ILP) cluster
More informationComputation of the mtx-vec product based on storage scheme on vector CPUs
BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More informationA parallel tiled solver for dense symmetric indefinite systems on multicore architectures
A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris
More informationModule 5: CPU Scheduling
Module 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 5.1 Basic Concepts Maximum CPU utilization obtained
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationSolving linear systems (6 lectures)
Chapter 2 Solving linear systems (6 lectures) 2.1 Solving linear systems: LU factorization (1 lectures) Reference: [Trefethen, Bau III] Lecture 20, 21 How do you solve Ax = b? (2.1.1) In numerical linear
More informationChapter 6: CPU Scheduling
Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 6.1 Basic Concepts Maximum CPU utilization obtained
More information(17) (18)
Module 4 : Solving Linear Algebraic Equations Section 3 : Direct Solution Techniques 3 Direct Solution Techniques Methods for solving linear algebraic equations can be categorized as direct and iterative
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationNumerical Methods I: Numerical linear algebra
1/3 Numerical Methods I: Numerical linear algebra Georg Stadler Courant Institute, NYU stadler@cimsnyuedu September 1, 017 /3 We study the solution of linear systems of the form Ax = b with A R n n, x,
More informationENERGY EFFICIENT TASK SCHEDULING OF SEND- RECEIVE TASK GRAPHS ON DISTRIBUTED MULTI- CORE PROCESSORS WITH SOFTWARE CONTROLLED DYNAMIC VOLTAGE SCALING
ENERGY EFFICIENT TASK SCHEDULING OF SEND- RECEIVE TASK GRAPHS ON DISTRIBUTED MULTI- CORE PROCESSORS WITH SOFTWARE CONTROLLED DYNAMIC VOLTAGE SCALING Abhishek Mishra and Anil Kumar Tripathi Department of
More informationTowards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters
Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters HIM - Workshop on Sparse Grids and Applications Alexander Heinecke Chair of Scientific Computing May 18 th 2011 HIM
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationEnergy-aware scheduling for GreenIT in large-scale distributed systems
Energy-aware scheduling for GreenIT in large-scale distributed systems 1 PASCAL BOUVRY UNIVERSITY OF LUXEMBOURG GreenIT CORE/FNR project Context and Motivation Outline 2 Problem Description Proposed Solution
More informationFine-grained Parallel Incomplete LU Factorization
Fine-grained Parallel Incomplete LU Factorization Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Sparse Days Meeting at CERFACS June 5-6, 2014 Contribution
More informationOptimized LU-decomposition with Full Pivot for Small Batched Matrices S3069
Optimized LU-decomposition with Full Pivot for Small Batched Matrices S369 Ian Wainwright High Performance Consulting Sweden ian.wainwright@hpcsweden.se Based on work for GTC 212: 1x speed-up vs multi-threaded
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationParallel Circuit Simulation. Alexander Hayman
Parallel Circuit Simulation Alexander Hayman In the News LTspice IV features multi-threaded solvers designed to better utilize current multicore processors. Also included are new SPARSE matrix solvers
More information7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix
EE507 - Computational Techniques for EE 7. LU factorization Jitkomut Songsiri factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization
More informationCPU SCHEDULING RONG ZHENG
CPU SCHEDULING RONG ZHENG OVERVIEW Why scheduling? Non-preemptive vs Preemptive policies FCFS, SJF, Round robin, multilevel queues with feedback, guaranteed scheduling 2 SHORT-TERM, MID-TERM, LONG- TERM
More informationAccelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationCPU Scheduling. CPU Scheduler
CPU Scheduling These slides are created by Dr. Huang of George Mason University. Students registered in Dr. Huang s courses at GMU can make a single machine readable copy and print a single copy of each
More informationLSN 15 Processor Scheduling
LSN 15 Processor Scheduling ECT362 Operating Systems Department of Engineering Technology LSN 15 Processor Scheduling LSN 15 FCFS/FIFO Scheduling Each process joins the Ready queue When the current process
More informationFeeding of the Thousands. Leveraging the GPU's Computing Power for Sparse Linear Algebra
SPPEXA Annual Meeting 2016, January 25 th, 2016, Garching, Germany Feeding of the Thousands Leveraging the GPU's Computing Power for Sparse Linear Algebra Hartwig Anzt Sparse Linear Algebra on GPUs Inherently
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More information1.Chapter Objectives
LU Factorization INDEX 1.Chapter objectives 2.Overview of LU factorization 2.1GAUSS ELIMINATION AS LU FACTORIZATION 2.2LU Factorization with Pivoting 2.3 MATLAB Function: lu 3. CHOLESKY FACTORIZATION 3.1
More informationScheduling with Constraint Programming. Job Shop Cumulative Job Shop
Scheduling with Constraint Programming Job Shop Cumulative Job Shop CP vs MIP: Task Sequencing We need to sequence a set of tasks on a machine Each task i has a specific fixed processing time p i Each
More informationMore Science per Joule: Bottleneck Computing
More Science per Joule: Bottleneck Computing Georg Hager Erlangen Regional Computing Center (RRZE) University of Erlangen-Nuremberg Germany PPAM 2013 September 9, 2013 Warsaw, Poland Motivation (1): Scalability
More informationReclaiming the energy of a schedule: models and algorithms
Author manuscript, published in "Concurrency and Computation: Practice and Experience 25 (2013) 1505-1523" DOI : 10.1002/cpe.2889 CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.:
More informationNumerical Linear Algebra
Numerical Linear Algebra Direct Methods Philippe B. Laval KSU Fall 2017 Philippe B. Laval (KSU) Linear Systems: Direct Solution Methods Fall 2017 1 / 14 Introduction The solution of linear systems is one
More informationParallel Programming. Parallel algorithms Linear systems solvers
Parallel Programming Parallel algorithms Linear systems solvers Terminology System of linear equations Solve Ax = b for x Special matrices Upper triangular Lower triangular Diagonally dominant Symmetric
More informationSome notes on efficient computing and setting up high performance computing environments
Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient
More informationNumerical Methods I Solving Square Linear Systems: GEM and LU factorization
Numerical Methods I Solving Square Linear Systems: GEM and LU factorization Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 18th,
More information