Hydra: Generation and Tuning of Parallel Solutions for Linear Algebra Equations. Alexandre X. Duchâteau, University of Illinois at Urbana-Champaign


2 Collaborators
- Thesis advisors: Denis Barthou (LaBRI / INRIA Bordeaux), David Padua (University of Illinois at Urbana-Champaign)
- Future collaboration: StarPU team, INRIA Bordeaux

3 INTRODUCTION: Objectives and Contributions

4 Objectives
- Compile linear algebra equations:
  - Compute X for L*X - X*U = C [CTSY]
  - Compute L and U for L*U = A [LU]
- Generate efficient task-parallel code:
  - Identify tasks
  - Generate the task dependence graph

5 Motivation
- Focus is on linear algebra
- Start from a high-level description: no code or algorithm given
- Derivation through blocking of operands: a data-centric approach
- Derivation for parallelism: the output is a parallel task graph

6 Contributions
- A specification language: express the computation, characterize the operands (shapes), identify the wanted result
- Derivation rules: validity/applicability patterns, symbolic execution rules for operators
- A dependence-graph construction engine

7 SYSTEM DESCRIPTION: A detailed view of the generator

8 Description Language - Operands

    %% Operands
    X: Unknown Matrix
    L: Lower Triangular Square Matrix
    U: Upper Triangular Square Matrix
    C: Square Matrix
    %% Equation
    L*X - X*U = C
    %% ctsy

- All operands are declared (with type inference)
- Status: Known, Unknown
- Shape: triangular, diagonal
- Type: matrix (also vector, scalar)
- Modifiers: (transpose), (sizes), (density)

9 Description Language - Equation
The same specification (the operands above plus the equation L*X - X*U = C) is the base for decomposition. Supported equation forms:
- Simple equations / assignments: X = A*B
- Solvers: LU, Triangular (L*X = B), Sylvester, Cholesky
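For instance, the LU objective from slide 4 could plausibly be written as follows; this is a sketch extrapolated from the ctsy example above, and the exact spelling of the unknowns' declarations is an assumption:

    %% Operands
    L: Unknown Lower Triangular Square Matrix
    U: Unknown Upper Triangular Square Matrix
    A: Square Matrix
    %% Equation
    L*U = A
    %% lu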

10 Description Language - Parameters
Parameters drive code generation. The set of parameters includes:
- Name
- Codelet
- Order of operands
- Data type
They exist for customization; default values are used otherwise.

11 Kernel Declaration
- Generating a full solution means paying the cost of full recursion: usually not a good idea
- Declare and use existing kernels and libraries instead: they are already optimized

12 Kernel Performance
- Measure kernel performance; it depends on the problem size and the target
- Measurements guide the exploration toward optimal nodes (similar to ATLAS)
(Figure: kernel performance as a function of problem size.)
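To make that measurement concrete, below is a minimal sketch of such a timing sweep; the naive kernel() stand-in and the size range are illustrative only, not Hydra's harness.

    /* Sketch: ATLAS-style performance sweep over problem sizes.
     * kernel() is a naive stand-in for whatever optimized kernel
     * is being characterized. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void kernel(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    int main(void) {
        for (int n = 128; n <= 1024; n *= 2) {
            double *a = calloc((size_t)n * n, sizeof *a);
            double *b = calloc((size_t)n * n, sizeof *b);
            double *c = calloc((size_t)n * n, sizeof *c);
            double t0 = seconds();
            kernel(n, a, b, c);
            double t = seconds() - t0;
            /* a square matrix product costs 2*n^3 flops */
            printf("n=%5d  %7.2f GFlop/s\n", n, 2.0 * n * n * n / t / 1e9);
            free(a); free(b); free(c);
        }
        return 0;
    }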

13 StarPU Codelet

Codelet declaration:

    void gemm(void *buffers[], void *cl_arg);

    struct starpu_codelet gemm_cl = {
        .where     = STARPU_CPU,
        .cpu_funcs = { gemm, NULL },
        .nbuffers  = 3,
        .modes     = { STARPU_R, STARPU_R, STARPU_RW }
    };

Kernel code:

    void gemm(void *buffers[], void *cl_arg)
    {
        struct params *params = cl_arg;
        int n = params->n;
        double *a = (double *) STARPU_MATRIX_GET_PTR(buffers[0]);
        double *b = (double *) STARPU_MATRIX_GET_PTR(buffers[1]);
        double *c = (double *) STARPU_MATRIX_GET_PTR(buffers[2]);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 1.0, c, n);
    }
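For context, a minimal sketch of how such a codelet might be invoked; it assumes starpu_init() has already been called, and uses the explicit task API so that cl_arg can be read directly as a struct params pointer, matching the kernel above. The names A, B, C and submit_gemm are illustrative, not from Hydra's generated code.

    /* Sketch: registering data and submitting one gemm task with the
     * codelet above. A, B, C are n-by-n row-major arrays of doubles. */
    #include <starpu.h>

    void submit_gemm(double *A, double *B, double *C, int n) {
        starpu_data_handle_t ha, hb, hc;
        struct params p = { .n = n };

        starpu_matrix_data_register(&ha, STARPU_MAIN_RAM, (uintptr_t)A,
                                    n, n, n, sizeof(double));
        starpu_matrix_data_register(&hb, STARPU_MAIN_RAM, (uintptr_t)B,
                                    n, n, n, sizeof(double));
        starpu_matrix_data_register(&hc, STARPU_MAIN_RAM, (uintptr_t)C,
                                    n, n, n, sizeof(double));

        struct starpu_task *task = starpu_task_create();
        task->cl          = &gemm_cl;         /* codelet from the slide  */
        task->handles[0]  = ha;               /* access modes match      */
        task->handles[1]  = hb;               /* gemm_cl: R, R, RW       */
        task->handles[2]  = hc;
        task->cl_arg      = &p;
        task->cl_arg_size = sizeof(p);
        starpu_task_submit(task);

        starpu_task_wait_for_all();           /* p must outlive the task */
        starpu_data_unregister(ha);
        starpu_data_unregister(hb);
        starpu_data_unregister(hc);
    }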

14 Defining the Blocking Space
- Blocking defines the space of solutions; we must only generate valid solutions
- Look at the equation's operation tree: each node gets a set of dimensions
- Generate constraints depending on the operation

15 Validity of Blocking
(Figure: blocking a product A*B.) With A blocked in <x_A, y_A> tiles and B blocked in <x_B, y_B> tiles, the product A*B is blocked in <x_A, y_B> tiles, which adds the new constraint { y_A = x_B }.

16 Valid Blockings - DTSY
For A <x_A, y_A> * X <x_X, y_X> * B <x_B, y_B> - X <x_X, y_X> = C <x_C, y_C> (the intermediate products A*X and A*X*B get blockings <x_A, y_X> and <x_A, y_B>), the constraints are:

    y_A = x_X,  y_X = x_B,  x_A = x_X,  y_B = y_X,  x_C = x_X,  y_C = y_X

These collapse into two equivalence classes of dimensions: { x_A, y_A, x_X, x_C } and { x_B, y_B, y_X, y_C }.
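A minimal sketch of how such dimension equalities could be merged into equivalence classes with a union-find structure; this illustrates the idea, not Hydra's actual engine, and the dimension encoding is an assumption.

    /* Sketch: merging blocking-dimension equalities into equivalence
     * classes via union-find. Dimensions are numbered with an
     * illustrative encoding. */
    #include <stdio.h>

    enum { xA, yA, xX, yX, xB, yB, xC, yC, NDIM };
    static int parent[NDIM];

    static int find(int d) {
        while (parent[d] != d) d = parent[d] = parent[parent[d]];
        return d;
    }
    static void unite(int a, int b) { parent[find(a)] = find(b); }

    int main(void) {
        for (int d = 0; d < NDIM; d++) parent[d] = d;
        /* Constraints from A*X*B - X = C */
        unite(yA, xX); unite(yX, xB); unite(xA, xX);
        unite(yB, yX); unite(xC, xX); unite(yC, yX);
        const char *name[NDIM] = {"xA","yA","xX","yX","xB","yB","xC","yC"};
        for (int d = 0; d < NDIM; d++)
            printf("%s -> class %s\n", name[d], name[find(d)]);
        return 0;  /* two classes: {xA,yA,xX,xC} and {xB,yB,yX,yC} */
    }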

17 Valid Blockings - DTSY (2)
A*X*B - X = C, with A lower triangular and B upper triangular. Example valid blockings (block-matrix pictures omitted):
- x_A = y_A = x_X = 2 and x_B = y_B = y_X = 2: all operands blocked 2x2
- x_A = y_A = x_X = 2 and x_B = y_B = y_X = 1: A blocked 2x2, X and C split into two row blocks, B kept whole
- x_A = y_A = x_X = 3 and x_B = y_B = y_X = 3: all operands blocked 3x3

18 Derivation Example
Blocking A*X = T into 2x2 blocks and symbolically executing the product gives:

    T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)
    T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
    T(1,0) = A(1,0)*X(0,0) + A(1,1)*X(1,0)
    T(1,1) = A(1,0)*X(0,1) + A(1,1)*X(1,1)

Removal of 0-computation (the zero off-diagonal block of triangular A drops out):

    T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)
    T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
    T(1,0) = A(1,1)*X(1,0)
    T(1,1) = A(1,1)*X(1,1)

19 Operand Characterization
Blocks of X are outputs (unknown); A(0,0) and A(1,1) are lower triangular.

    Equation                                 | Input                   | Output
    T(1,1) = A(1,1)*X(1,1)                   | T(1,1), A(1,1)          | X(1,1)
    T(1,0) = A(1,1)*X(1,0)                   | T(1,0), A(1,1)          | X(1,0)
    T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)   | T(0,1), A(0,0), A(0,1)  | X(0,1), X(1,1)
    T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0)   | T(0,0), A(0,0), A(0,1)  | X(0,0), X(1,0)

20 Equation Signature
- Equations must be identified; identification goes through operand types and operators: A*X = T has signature LT UNK = MT
- A set of simplification rules rewrites signatures, e.g. UNK + MT*MT -> UNK + MT

21 Identify Task
- T(1,1) = A(1,1)*X(1,1) has signature LT UNK = MT: an instance of the original problem, hence solvable
- X(1,1) can now be considered known

22 Building Dependences
Find the other instances of X(1,1): it shifts into their input set, and a dependence edge is added.

    Equation                                 | Input                           | Output
    T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)   | T(0,1), A(0,0), A(0,1), X(1,1)  | X(0,1)

New dependence: { T(1,1) = A(1,1)*X(1,1) } -> { T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1) }
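A minimal sketch of this bookkeeping, assuming tasks and block names are plain strings; the structures are illustrative, not Hydra's.

    /* Sketch: when a task produces a block, record it; any later task
     * that reads that block gets a dependence edge from its producer. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_TASKS 64

    struct task { const char *name; int deps[MAX_TASKS]; int ndeps; };
    static struct task tasks[MAX_TASKS];
    static const char *producer_block[MAX_TASKS]; /* block written by task i */
    static int ntasks;

    static int add_task(const char *name, const char *out,
                        const char **in, int nin) {
        int id = ntasks++;
        tasks[id].name = name; tasks[id].ndeps = 0;
        producer_block[id] = out;
        for (int i = 0; i < nin; i++)             /* link inputs to producers */
            for (int t = 0; t < id; t++)
                if (strcmp(in[i], producer_block[t]) == 0)
                    tasks[id].deps[tasks[id].ndeps++] = t;
        return id;
    }

    int main(void) {
        const char *in1[] = { "T(1,1)", "A(1,1)" };
        add_task("solve X(1,1)", "X(1,1)", in1, 2);
        const char *in2[] = { "T(0,1)", "A(0,0)", "A(0,1)", "X(1,1)" };
        int t2 = add_task("solve X(0,1)", "X(0,1)", in2, 4);
        for (int i = 0; i < tasks[t2].ndeps; i++)
            printf("%s depends on %s\n",
                   tasks[t2].name, tasks[tasks[t2].deps[i]].name);
        return 0;
    }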

23 Signature Simplification
For T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1):

    1. MT = LT UNK + MT MT
    2. MT = LT UNK + MT
    3. MT - MT = LT UNK
    4. MT = LT UNK

Step 4 matches the original problem!

24 Post-Identification Expansion
Each simplification step has a corresponding rewrite of the equation:

    Simplification step    | Corresponding equations
    MT = LT UNK + MT MT    | T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1)
    MT = LT UNK + MT       | T1 = A(0,1)*X(1,1);  T(0,1) = A(0,0)*X(0,1) + T1
    MT - MT = LT UNK       | T1 = A(0,1)*X(1,1);  T2 = T(0,1) - T1
    MT = LT UNK            | T2 = A(0,0)*X(0,1)

25 Selection: Generator Flow
Flow: Equation + Dependence Graph -> Derivation -> Set of Equations -> Identification -> Dependence Graph -> Termination -> Final Graph.
Starting from L*X*U - X = C with an empty dependence graph, derivation by blocking and symbolic execution yields:

    1: L00*X00*U00 - X00 = C00
    2: L10*X00*U00 + L11*X10*U00 - X10 = C10
    3: L00*X00*U01 + L00*X01*U11 - X01 = C01
    4: L10*X00*U01 + L10*X01*U11 + L11*X10*U01 + L11*X11*U11 - X11 = C11

After identification and folding of known terms, each equation reduces to an instance of the original problem, L*X*U - X = T(i,j) (e.g. L11*X11*U11 - X11 = T); its dependences are added to the graph, and termination produces the final graph.

26 Simple Graph Example (figure)

27 Graph Example (figure)

28 Heterogeneous Graph Example (figure: a task graph mixing MM and DTSY tasks)

29 INITIAL RESULTS AND CHALLENGES

30 Challenges
- Version generation time: 5 minutes in the generator, 20 minutes compiling
- Generated code size: 500k lines of code in a single function (icc segfaults)

31 Speedup
(Figure: speedup of the generated Trisolve (L*X = B) over the reference, as a function of problem size.)

32 Conclusion
- Faster development cycle for new architectures: there is no time to hand-tune everything anymore, and we can't hand-tune for every hardware iteration
- Increased complexity: heterogeneous systems
- Described an automatic methodology that leverages existing small-scale libraries

33 JUST IN CASE (backup slides)

34 Exploiting Heterogeneous Machines: the StarPU Runtime
- Goal: scheduling (and offloading) tasks over heterogeneous machines: CPU + GPU + SPU = *PU
- Auto-tuning of performance models
- Optimization of memory transfers
- A target for compilers (StarSs [UPC], HMPP [CAPS]), libraries (PLASMA/MAGMA [UTK]), and applications (fast multipole methods [CESTA])
- StarPU provides an open scheduling platform: scheduling algorithms are plugins
(Figure: a machine made of CPUs and GPUs with their memory nodes, and measured speed in GFlops of the greedy, task-model, and prefetch-plus-data-model scheduling policies.)

35 Overview of StarPU
Maximize PU occupancy, minimize data transfers.
- Principle: accept tasks that may have multiple implementations, together with potential interdependencies; this leads to a dynamic acyclic graph of tasks (a data-flow approach)
- Provide a high-level data management layer: the application only describes which data may be accessed by tasks and how data may be divided
(Figure: software stack: HPC applications, parallel compilers, and parallel libraries on top of StarPU, which sits over the drivers (CUDA, OpenCL) and the CPU/GPU hardware.)
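As a hedged illustration of that data-management layer, the sketch below registers a matrix with StarPU and divides it into blocks with the standard filter API; the matrix A, its size n, and the number of parts are assumptions for the example.

    /* Sketch: describing data to StarPU and dividing it into blocks. */
    #include <starpu.h>

    #define PARTS 4

    void register_and_split(double *A, int n) {
        starpu_data_handle_t h;
        starpu_matrix_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)A,
                                    n /* ld */, n /* nx */, n /* ny */,
                                    sizeof(double));

        struct starpu_data_filter f = {
            .filter_func = starpu_matrix_filter_block,
            .nchildren   = PARTS,
        };
        starpu_data_partition(h, &f);   /* h now has PARTS sub-handles */

        /* Tasks are then submitted on each sub-block. */
        for (int i = 0; i < PARTS; i++) {
            starpu_data_handle_t sub = starpu_data_get_sub_data(h, 1, i);
            (void)sub; /* e.g. starpu_task_insert(&some_cl, STARPU_RW, sub, 0); */
        }
    }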

36 Code Styles and Optimization

Naive triple loop:

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

Blocked and unrolled version (bs is the block size, assumed to be even and to divide N):

    for (int kk = 0; kk < N; kk += bs)
        for (int ii = 0; ii < N; ii += bs)
            for (int jj = 0; jj < N; jj += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++) {
                        double *c_i  = c[i];     /* row of c being updated   */
                        double  a_ik = a[i][k];  /* scalar reused on the row */
                        double *b_k  = b[k];     /* row of b being streamed  */
                        for (int j = jj; j < jj + bs; j += 2) {  /* unrolled by 2 */
                            c_i[j]     += a_ik * b_k[j];
                            c_i[j + 1] += a_ik * b_k[j + 1];
                        }
                    }

37 Empirical Search
- Generation of multiple solutions: algorithmic exploration
- Theoretical evaluation: usually unreliable, but useful as a filtering heuristic
- Empirical evaluation (execution): slow (very slow when looking at bad versions)

38 Generalized System Overview
Pipeline: Driver -> Description -> Generator -> Versions -> Prediction -> Execution -> Selection
- The predictor filters which versions are actually executed
- The driver guides the search
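A minimal sketch of such a search loop, assuming hypothetical generate/predict/execute hooks; it only illustrates the prediction-filtered empirical selection described above, with toy stubs standing in for the real stages.

    /* Sketch: prediction-filtered empirical search over generated versions. */
    #include <stdio.h>

    #define NVERSIONS 32

    typedef struct { int id; } version_t;

    /* Toy stubs so the sketch runs; a real system plugs in the generator,
     * the performance predictor and actual timed execution. */
    static int generate_versions(version_t *out, int max) {
        for (int i = 0; i < max; i++) out[i].id = i;
        return max;
    }
    static double predicted_cost(const version_t *v) { return (double)(v->id % 7); }
    static double run_and_time(const version_t *v)   { return 1.0 + (v->id % 5); }

    static version_t select_best(double budget) {
        version_t vs[NVERSIONS];
        int n = generate_versions(vs, NVERSIONS);
        version_t best = vs[0];
        double best_time = 1e30;
        for (int i = 0; i < n; i++) {
            if (predicted_cost(&vs[i]) > budget)   /* predictor filters    */
                continue;                          /* out bad candidates   */
            double t = run_and_time(&vs[i]);       /* empirical evaluation */
            if (t < best_time) { best_time = t; best = vs[i]; }
        }
        return best;
    }

    int main(void) {
        version_t best = select_best(5.0);
        printf("selected version %d\n", best.id);
        return 0;
    }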
