Hydra: Generation and Tuning of parallel solutions for linear algebra equations. Alexandre X. Duchâteau University of Illinois at Urbana Champaign
|
|
- Joleen Watkins
- 5 years ago
- Views:
Transcription
1 Hydra: Generation and Tuning of parallel solutions for linear algebra equations Alexandre X. Duchâteau University of Illinois at Urbana Champaign
2 Collaborators Thesis Advisors Denis Barthou (Labri/INRIA Bordeaux) David Padua (University of Illinois at UC) Future collaboration starpu team, INRIA Bordeaux Jointlab INRIA-UIUC 2
3 Objectives and Contributions INTRODUCTION Jointlab INRIA-UIUC 3
4 Objectives Compile linear algebra equations Compute X for L X X U = C [CTSY] Compute L and U for L U = A [LU] Generate efficient task parallel code Identify tasks Generate task dependence graph Jointlab INRIA-UIUC 4
5 Motivation Focus is on linear algebra Start from high level description No code or algorithm Derivation through blocking of operands Data centric approach Derivation for parallelism Output is a parallel task graph Jointlab INRIA-UIUC 5
6 Contributions A specification language Express computation Characterize operands (shapes) Identify wanted result Derivation rules Validity/applicability patterns Operators symbolic execution rules Dependence build engine Jointlab INRIA-UIUC 6
7 A detailed view of the generator SYSTEM DESCRIPTION Jointlab INRIA-UIUC 7
8 Description Language - Operands %% Operands X: Unknown Matrix L: Lower Triangular Square Matrix U: Upper Triangular Square Matrix C: Square Matrix %% Equation L*X-X*U=C %% ctsy All operands (Type inference) Status Known, Unknown Shape Triangular, diagonal Type Matrix, (vector, scalar) Modifiers (transpose) (Sizes) (Density) Jointlab INRIA-UIUC 8
9 Description language - Equation %% Operands X: Unknown Matrix L: Lower Triangular Square Matrix U: Upper Triangular Square Matrix C: Square Matrix %% Equation L*X-X*U=C %% ctsy Base for decomposition Simple equations Assignments X = A*B Solvers LU Triangular Sylvester L*X=B Cholesky Jointlab INRIA-UIUC 9
10 Description language - Parameters %% Operands X: Unknown Matrix L: Lower Triangular Square Matrix U: Upper Triangular Square Matrix C: Square Matrix %% Equation L*X-X*U=C %% ctsy Drive code generation Set of parameters Name Codelet Order of operands Data type For customization Default values used Jointlab INRIA-UIUC 10
11 Kernel Declaration Generate a full solution Cost of full recursion Usually not a good idea Use existing kernels and libraries Already optimized Jointlab INRIA-UIUC 11
12 Performance Kernel Performance Measure performance Depend on size target Guide exploration Optimal nodes Similar to ATLAS Problem Size Jointlab INRIA-UIUC 12
13 Starpu Codelet Codelet declaration void gemm(void *buffers[], void *cl_arg); struct starpu_codelet gemm_cl = {.where = STARPU_CPU,.cpu_funcs = { gemm, NULL},.nbuffers = 3,.modes = {STARPU_R, STARPU_R, STARPU_RW} }; Kernel code void gemm(void *buffers[], void *cl_arg) { struct params *params = cl_arg; int n = params->n; double *a = (double *) STARPU_MATRIX_GET_PTR(buffers[0]); // cblas_dgemm(cblasrowmajor, CBlasNoTrans, CBlasNoTrans, n, n, n, 1.0, a, n, b, n, 1.0, c, n); } Jointlab INRIA-UIUC 13
14 Defining blocking space Blocking defines the space of solutions Must only generate valid solutions Look at the equation s operation tree Each node gets a set of dimensions Generate constraints depending on operation Jointlab INRIA-UIUC 14
15 Validity of Blocking y A x B y B B <x A,y B > A B <x A,y A > <x B,y B > x A A New Constraint { y A = x B } Jointlab INRIA-UIUC 15
16 Valid Blockings - DTSY A <x A,y A > <x A,y X > <x A,y B > X <x X,y X > - (C) <x C,y C > B <x B,y B > X <x X,y X > Constraints y A = x X y X = x B x A = x X y B = y X x C = x X y C = y X { x A,y A,x X, x C } { x B,y B,y X,y C } Jointlab INRIA-UIUC 16
17 Valid Blockings DTSY (2) A*X*B X = C A lower triangular, B upper triangular xa = ya = xx = 2 xb = yb = yx = 2 A(0,0) A(0,1) X(0,0) X(0,1) B(0,0) B(0,1) X(0,0) X(0,1) T(0,0) T(0,1) * * - = A(1,0) A(1,1) X(1,0) X(1,1) B(1,0) B(1,1) X(1,0) X(1,1) T(1,0) T(1,1) xa = ya = xx = 2 xb = yb = yx = 1 A(0,0) A(0,1) X(0,0) * * A(1,0) A(1,1) X(1,0) B(0,0) <=> B X(0,0) - = X(1,0) T(0,0) T(1,0) xa = ya = xx = 3 xb = yb = yx = 3 A(0,0) A(2,2) X(0,0) X(2,2) B(0,0) B(2,2) X(0,0) X(2,2) * * - = C(0,0) C(2,2) Jointlab INRIA-UIUC 17
18 Derivation example * = A X T blocking Sym. Exec. A(0,0) A(0,1) X(0,0) X(0,1) T(0,0) T(0,1) * = A(1,0) A(1,1) X(1,0) X(1,1) T(1,0) T(1,1) T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0) T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1) T(1,0) = A(1,0)*X(0,0) + A(1,1)*X(1,0) T(1,1) = A(1,0)*X(0,1) + A(1,1)*X(1,1) Removal of 0-computation T(0,0) = A(0,0)*X(0,0) + A(0,1)*X(1,0) T(0,1) = A(0,0)*X(0,1) + A(0,1)*X(1,1) T(1,0) = A(1,1)*X(1,0) T(1,1) = A(1,1)*X(1,1) Jointlab INRIA-UIUC 18
19 Operand characterization Blocks of X are outputs (unknown) A(0,0) and A(1,1) are lower triangular Equation Input Output T(1,1) = A(1,1) X(1,1) T(1,1) A(1,1) X(1,1) T(1,0) = A(1,1) X(1,0) T(1,0) A(1,1) X(1,0) T(0,1) = A(0,0) X(0,1) + A(0,1) X(1,1) T(0,1) A(0,0) A(0,1) X(0,1) X(1,1) T(0,0) = A(0,0) X(0,0) + A(0,1) X(1,0) T(0,0) A(0,0) A(0,1) X(0,0) X(1,0) Jointlab INRIA-UIUC 19
20 Equation Signature Need to identify equations Identification through types and operators A X = T : LT UNK = MT Set of simplification rules UNK + MT MT UNK + MT Jointlab INRIA-UIUC 20
21 Identify task T(1,1) = A(1,1) X(1,1) Signature : LT UNK = MT Instance of original problem Solvable X(1,1) can now be considered known Jointlab INRIA-UIUC 21
22 Building dependence Find instances of X(1,1) Shift in input set Add dependence edge Equation Input Output T(0,1) = A(0,0) X(0,1) + A(0,1) X(1,1) T(0,1) A(0,0) A(0,1) X(1, 1) X(0,1) New Dependence {T 1,1 = A 1,1 X 1,1 T 0,1 = A 0,0 X 0,1 + A 0,1 X(1,1) } Jointlab INRIA-UIUC 22
23 Signature Simplification T 0,1 = A 0,0 X 0,1 + A 0,1 X 1,1 1. MT = LT UNK + MT MT 2. MT = LT UNK + MT 3. MT MT = LT UNK 4. MT = LT UNK Match to the original problem! Jointlab INRIA-UIUC 23
24 Post identification expansion Simplification step Corresponding equation MT = LT UNK + MT MT T 0,1 = A 0,0 X 0,1 + A 0,1 X 1,1 MT = LT UNK + MT MT MT = LT UNK MT = LT UNK T1 = A 01 X(1,1) T 0,1 = A 0,0 X 0,1 + T1 T1 = A 01 X(1, 1) T2 = T 0, 1 T1 T2 = A 0, 0 X(0, 1) Jointlab INRIA-UIUC 24
25 Selection Generator Equation Dependence Graph Derivation L X U X = C ø L11 X11 U11 X11 = T Set Equations Dependence Graph Identification 1: L00 X00 U00 X00 = C00 2: L10 X00 U00 + L11 X10 U00 X10 = C10 3: L00 X00 U01 + L00 X01 U11 X01 = C01 4: L10 X00 U L11 X11 U11 X11 = C11 ø 1: L X U X = T00 2: L X U L X U X = T10 3: L X U L X U X = T01 4: L X U L X U X = T Dependence Graph Termination Final Graph Jointlab INRIA-UIUC 25
26 Simple graph example Jointlab INRIA-UIUC 26
27 Graph Example Jointlab INRIA-UIUC 27
28 Heterogeneous Graph Example MM DTSY DTSY Jointlab INRIA-UIUC 28
29 INITIAL RESULTS AND CHALLENGES Jointlab INRIA-UIUC 29
30 Challenges Version generation time 5 minutes in generator 20 minutes compiling Generated code size 500k lines of code in single function (icc segfaults) Jointlab INRIA-UIUC 30
31 Speedup Trisolve (L*X=B) ref Problem Size Jointlab INRIA-UIUC 31
32 Conclusion Faster development cycle for architectures No time to hand-tune everything anymore Can t hand tune for every HW iteration Increased complexity Heterogeneous systems Description of an automatic methodology Leverage existing small scale libraries Jointlab INRIA-UIUC 32
33 AU CAS OU Jointlab INRIA-UIUC 33
34 Exploiting heterogeneous machines : starpu Runtime Goal: Scheduling ( offloading) tasks over heterogeneous machines CPU + GPU + SPU = *PU Auto-tuning of performance models Optimization of memory transfers Target for Compilers StarSs [UPC], HMPP [CAPS] Libraries PLASMA/MAGMA [UTK] Applications Fast multipole methods [CESTA] StarPU provides an Open Scheduling platform Scheduling algorithm = plugins CPU CPU CPU CPU CPU CPU M. CPU CPU M. GPU GPU M. M Speed (GFlops) Jointlab INRIA-UIUC 34 Greedy task model prefetch data model
35 Overview of StarPU Maximizing PU occupancy, minimizing data transfers Principle Accept tasks that may have multiple implementations Together with potential interdependencies Leads to a dynamic acyclic graph of tasks Data-flow approach Provide a high-level data management layer Application should only describe Which data may be accessed by tasks How data may be divided Parallel Compilers CPU HPC Applications StarPU Parallel Libraries Drivers (CUDA, OpenCL GPU Jointlab INRIA-UIUC 35
36 Code styles and optimization for(int i = 0 ; i < N ; i++) for(int j = 0 ; j < N ; j++) for(int k = 0 ; k < N ; k++) c[i][j] += a[i][k] * b[k][j]; for(int kk = 0 ; kk < N ; kk+=b) for(int ii = 0 ; ii < N ; ii+=b) for(int jj = 0 ; jj < N ; jj+=b) for(int i = ii ; i < ii+b ; i++) for(int k = kk ; k < kk+b ; k++) { c_i = c[i]; a_ik = a[i][k]; b_k = b[k]; for(int j = jj ; i < jj+b ; i+=2) { c_i[j] += a_ik * b_k[j]; c_u[j+1] += a_ik * b_k[j+1]; } } Alexandre X. Duchateau - Preliminary Examination 36
37 Empirical search Generation of multiple solutions Algorithmic exploration Theoretical evaluation Usually unreliable Filtering heuristic Empirical evaluation (execution) slow (very slow when looking at bad versions) Alexandre X. Duchateau - Preliminary Examination 37
38 Generalized system overview Driver Description Generator Versions Prediction Execution Selection Predictor Filters what versions are actually executed Driver Guide the search Alexandre X. Duchateau - Preliminary Examination 38
Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationDynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed
More informationAccelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationMAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors
MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,
More informationPanorama des modèles et outils de programmation parallèle
Panorama des modèles et outils de programmation parallèle Sylvain HENRY sylvain.henry@inria.fr University of Bordeaux - LaBRI - Inria - ENSEIRB April 19th, 2013 1/45 Outline Introduction Accelerators &
More informationMagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs
MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012
MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7
More informationParallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors
Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationMARCH 24-27, 2014 SAN JOSE, CA
MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =
More informationJ.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009
Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures J.I. Aliaga M. Bollhöfer 2 A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ.
More informationDynamic Scheduling within MAGMA
Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing
More informationGRAPHITE Two Years After First Lessons Learned From Real-World Polyhedral Compilation
GRAPHITE Two Years After First Lessons Learned From Real-World Polyhedral Compilation Konrad Trifunovic 2 Albert Cohen 2 David Edelsohn 3 Li Feng 6 Tobias Grosser 5 Harsha Jagasia 1 Razya Ladelsky 4 Sebastian
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationAntti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum
More informationParallel Performance Theory - 1
Parallel Performance Theory - 1 Parallel Computing CIS 410/510 Department of Computer and Information Science Outline q Performance scalability q Analytical performance measures q Amdahl s law and Gustafson-Barsis
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationAuto-Tuning Complex Array Layouts for GPUs - Supplemental Material
BIN COUNT EGPGV,. This is the author version of the work. It is posted here by permission of Eurographics for your personal use. Not for redistribution. The definitive version is available at http://diglib.eg.org/.
More informationEnergy Consumption Evaluation for Krylov Methods on a Cluster of GPU Accelerators
PARIS-SACLAY, FRANCE Energy Consumption Evaluation for Krylov Methods on a Cluster of GPU Accelerators Serge G. Petiton a and Langshi Chen b April the 6 th, 2016 a Université de Lille, Sciences et Technologies
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationTips Geared Towards R. Adam J. Suarez. Arpil 10, 2015
Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done
More informationQuantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm)
Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Alexander Smith & Khashayar Khavari Department of Electrical and Computer Engineering University of Toronto April 15, 2009 Alexander
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationAccelerating Proton Computed Tomography with GPUs
Accelerating Proton Computed Tomography with GPUs Thomas'D.'Uram,'Argonne'Leadership'Compu2ng'Facility' Michael'E.'Papka,'Argonne'Leadership'Compu2ng'Facility,'Northern'Illinois'University' Nicholas'T.'Karonis,'Northern'Illinois'University,'Argonne'Na2onal'Laboratory
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationLevel-3 BLAS on a GPU
Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón
More informationSP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain
More informationSome notes on efficient computing and setting up high performance computing environments
Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient
More informationECE521 W17 Tutorial 1. Renjie Liao & Min Bai
ECE521 W17 Tutorial 1 Renjie Liao & Min Bai Schedule Linear Algebra Review Matrices, vectors Basic operations Introduction to TensorFlow NumPy Computational Graphs Basic Examples Linear Algebra Review
More informationLINEAR SYSTEMS (11) Intensive Computation
LINEAR SYSTEMS () Intensive Computation 27-8 prof. Annalisa Massini Viviana Arrigoni EXACT METHODS:. GAUSSIAN ELIMINATION. 2. CHOLESKY DECOMPOSITION. ITERATIVE METHODS:. JACOBI. 2. GAUSS-SEIDEL 2 CHOLESKY
More information11 Parallel programming models
237 // Program Design 10.3 Assessing parallel programs 11 Parallel programming models Many different models for expressing parallelism in programming languages Actor model Erlang Scala Coordination languages
More informationPackage magma. February 15, 2013
Package magma February 15, 2013 Title Matrix Algebra on GPU and Multicore Architectures Version 0.2.2-1 Date 2010-08-27 Author Brian J Smith Maintainer Brian J Smith
More informationParallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)
Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems
More informationParallel Performance Theory
AMS 250: An Introduction to High Performance Computing Parallel Performance Theory Shawfeng Dong shaw@ucsc.edu (831) 502-7743 Applied Mathematics & Statistics University of California, Santa Cruz Outline
More informationA Compiler for Linear Algebra Operations
A Compiler for Linear Algebra Operations Paolo Bientinesi In collaboration with Diego Fabregat AICES, RWTH Aachen pauldj@aices.rwth-aachen.de CScADS Autotuning Workshop 2012 August 13-14, 2012 Snowbird,
More informationA Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!
More informationUtilisation de la compression low-rank pour réduire la complexité du solveur PaStiX
Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX 26 Septembre 2018 - JCAD 2018 - Lyon Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman Outline 1. Context 2.
More informationAn Efficient Evolutionary Algorithm for Solving Incrementally Structured Problems
An Efficient Evolutionary Algorithm for Solving Incrementally Structured Problems Jason Ansel Maciej Pacula Saman Amarasinghe Una-May O Reilly MIT - CSAIL July 14, 2011 Jason Ansel (MIT) PetaBricks July
More informationLeveraging Task-Parallelism in Energy-Efficient ILU Preconditioners
Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners José I. Aliaga Leveraging task-parallelism in energy-efficient ILU preconditioners Universidad Jaime I (Castellón, Spain) José I. Aliaga
More informationSaving Energy in Sparse and Dense Linear Algebra Computations
Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain
More informationExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. X, M YYYY 1 ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems Sameh Abdulah, Hatem Ltaief, Ying Sun,
More informationEnhancing Scalability of Sparse Direct Methods
Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.
More informationMUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux
The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent
More informationCRYSTAL in parallel: replicated and distributed (MPP) data. Why parallel?
CRYSTAL in parallel: replicated and distributed (MPP) data Roberto Orlando Dipartimento di Chimica Università di Torino Via Pietro Giuria 5, 10125 Torino (Italy) roberto.orlando@unito.it 1 Why parallel?
More informationJulian Merten. GPU Computing and Alternative Architecture
Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg
More informationScheduling Trees of Malleable Tasks for Sparse Linear Algebra
Scheduling Trees of Malleable Tasks for Sparse Linear Algebra Abdou Guermouche, Loris Marchal, Bertrand Simon, Frédéric Vivien To cite this version: Abdou Guermouche, Loris Marchal, Bertrand Simon, Frédéric
More informationA CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method
A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method Jee Choi 1, Aparna Chandramowlishwaran 3, Kamesh Madduri 4, and Richard Vuduc 2 1 ECE, Georgia Tech 2 CSE, Georgia
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT
More informationParallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano
Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic
More informationAlgorithms and Methods for Fast Model Predictive Control
Algorithms and Methods for Fast Model Predictive Control Technical University of Denmark Department of Applied Mathematics and Computer Science 13 April 2016 Background: Model Predictive Control Model
More informationN-body Simulations. On GPU Clusters
N-body Simulations On GPU Clusters Laxmikant Kale Filippo Gioachin Pritish Jetley Thomas Quinn Celso Mendes Graeme Lufkin Amit Sharma Joachim Stadel Lukasz Wesolowski James Wadsley Edgar Solomonik Fabio
More informationNumerical Linear Algebra
Numerical Linear Algebra By: David McQuilling; Jesus Caban Deng Li Jan.,31,006 CS51 Solving Linear Equations u + v = 8 4u + 9v = 1 A x b 4 9 u v = 8 1 Gaussian Elimination Start with the matrix representation
More informationSolving PDEs with CUDA Jonathan Cohen
Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear
More informationA CUDA Solver for Helmholtz Equation
Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College
More informationFINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING
FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING Daniel Thuerck 1,2 (advisors Michael Goesele 1,2 and Marc Pfetsch 1 ) Maxim Naumov 3 1 Graduate School of Computational Engineering, TU Darmstadt
More informationDirect and Incomplete Cholesky Factorizations with Static Supernodes
Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)
More informationLattice Boltzmann simulations on heterogeneous CPU-GPU clusters
Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts
More informationCode Generation for GPU Accelerators in the Domain of Image Preprocessing
Code Generation for GPU Accelerators in the Domain of Image Preprocessing Oliver Reiche, Richard Membarth, Frank Hannig, and Jürgen Teich Hardware/Software Co-Design, University of Erlangen-Nuremberg Dagstuhl,
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationExploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units
Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Anoop Bhagyanath and Klaus Schneider Embedded Systems Chair University of Kaiserslautern
More informationEfficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization
Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Martin Takáč The University of Edinburgh Based on: P. Richtárik and M. Takáč. Iteration complexity of randomized
More informationReduced Vlasov-Maxwell modeling
Reduced Vlasov-Maxwell modeling Philippe Helluy, Michel Massaro, Laurent Navoret, Nhung Pham, Thomas Strub To cite this version: Philippe Helluy, Michel Massaro, Laurent Navoret, Nhung Pham, Thomas Strub.
More informationAlgebraic Multigrid as Solvers and as Preconditioner
Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven
More informationA Fast, Parallel Potential Flow Solver
Advisor: Jaime Peraire December 16, 2012 Outline 1 Introduction to Potential FLow 2 The Boundary Element Method 3 The Fast Multipole Method 4 Discretization 5 Implementation 6 Results 7 Conclusions Why
More informationInput: System of inequalities or equalities over the reals R. Output: Value for variables that minimizes cost function
Linear programming Input: System of inequalities or equalities over the reals R A linear cost function Output: Value for variables that minimizes cost function Example: Minimize 6x+4y Subject to 3x + 2y
More informationGPU accelerated Arnoldi solver for small batched matrix
15. 09. 22 GPU accelerated Arnoldi solver for small batched matrix Samsung Advanced Institute of Technology Hyung-Jin Kim Contents - Eigen value problems - Solution - Arnoldi Algorithm - Target - CUDA
More informationGRAPE-DR, GRAPE-8, and...
GRAPE-DR, GRAPE-8, and... Jun Makino Center for Computational Astrophysics and Division Theoretical Astronomy National Astronomical Observatory of Japan Dec 14, 2010 CPS, Kobe Talk structure GRAPE GRAPE-DR
More informationTools for Power and Energy Analysis of Parallel Scientific Applications
The 41st International Conference on Parallel Processing Tools for Power and Energy Analysis of Parallel Scientific Applications Pedro Alonso 1, Rosa M. Badia 2, Jesús Labarta 2, María Barreda 3, Manuel
More informationBinary Decision Diagrams and Symbolic Model Checking
Binary Decision Diagrams and Symbolic Model Checking Randy Bryant Ed Clarke Ken McMillan Allen Emerson CMU CMU Cadence U Texas http://www.cs.cmu.edu/~bryant Binary Decision Diagrams Restricted Form of
More informationWord-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator
Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator Chalermpol Saiprasert, Christos-Savvas Bouganis and George A. Constantinides Department of Electrical & Electronic
More informationUnderstanding and Implementing F5
Understanding and Implementing F5 John Perry john.perry@usm.edu University of Southern Mississippi Understanding and Implementing F5 p.1 Overview Understanding F5 Description Criteria Proofs Implementing
More informationCS 542G: Conditioning, BLAS, LU Factorization
CS 542G: Conditioning, BLAS, LU Factorization Robert Bridson September 22, 2008 1 Why some RBF Kernel Functions Fail We derived some sensible RBF kernel functions, like φ(r) = r 2 log r, from basic principles
More informationSparse LU Factorization on GPUs for Accelerating SPICE Simulation
Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,
More informationLecture 16: 9.2 Geometry of Linear Operators
Lecture 16: 9.2 Geometry of Linear Operators Wei-Ta Chu 2008/11/19 Theorem 9.2.1 If T: R 2 R 2 is multiplication by an invertible matrix A, then the geometric effect of T is the same as an appropriate
More informationIntroduction to Kleene Algebras
Introduction to Kleene Algebras Riccardo Pucella Basic Notions Seminar December 1, 2005 Introduction to Kleene Algebras p.1 Idempotent Semirings An idempotent semiring is a structure S = (S, +,, 1, 0)
More informationCSE 548: Analysis of Algorithms. Lecture 12 ( Analyzing Parallel Algorithms )
CSE 548: Analysis of Algorithms Lecture 12 ( Analyzing Parallel Algorithms ) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Fall 2017 Why Parallelism? Moore s Law Source: Wikipedia
More informationParallel Longest Common Subsequence using Graphics Hardware
Parallel Longest Common Subsequence using Graphics Hardware John Kloetzli rian Strege Jonathan Decker Dr. Marc Olano Presented by: rian Strege 1 Overview Introduction Problem Statement ackground and Related
More informationProgramming Languages and Compilers (CS 421)
Programming Languages and Compilers (CS 421) Sasa Misailovic 4110 SC, UIUC https://courses.engr.illinois.edu/cs421/fa2017/cs421a Based in part on slides by Mattox Beckman, as updated by Vikram Adve, Gul
More informationTall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222
Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department
More informationHigh-performance processing and development with Madagascar. July 24, 2010 Madagascar development team
High-performance processing and development with Madagascar July 24, 2010 Madagascar development team Outline 1 HPC terminology and frameworks 2 Utilizing data parallelism 3 HPC development with Madagascar
More informationUsing a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics
Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Jorge González-Domínguez Parallel and Distributed Architectures Group Johannes Gutenberg University of Mainz, Germany j.gonzalez@uni-mainz.de
More informationEveryday Multithreading
Everyday Multithreading Parallel computing for genomic evaluations in R C. Heuer, D. Hinrichs, G. Thaller Institute of Animal Breeding and Husbandry, Kiel University August 27, 2014 C. Heuer, D. Hinrichs,
More informationTowards Fast, Accurate and Reproducible LU Factorization
Towards Fast, Accurate and Reproducible LU Factorization Roman Iakymchuk 1, David Defour 2, and Stef Graillat 3 1 KTH Royal Institute of Technology, CSC, CST/PDC 2 Université de Perpignan, DALI LIRMM 3
More informationData Structures in Java
Data Structures in Java Lecture 21: Introduction to NP-Completeness 12/9/2015 Daniel Bauer Algorithms and Problem Solving Purpose of algorithms: find solutions to problems. Data Structures provide ways
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization
More informationDepartment of Mathematics Comprehensive Examination Option I 2016 Spring. Algebra
Comprehensive Examination Option I Algebra 1. Let G = {τ ab : R R a, b R and a 0} be the group under the usual function composition, where τ ab (x) = ax + b, x R. Let R be the group of all nonzero real
More informationLogic, Optimization and Data Analytics
Logic, Optimization and Data Analytics John Hooker Carnegie Mellon University United Technologies Research Center, Cork, Ireland August 2015 Thesis Logic and optimization have an underlying unity. Ideas
More informationDr. Andrea Bocci. Using GPUs to Accelerate Online Event Reconstruction. at the Large Hadron Collider. Applied Physicist
Using GPUs to Accelerate Online Event Reconstruction at the Large Hadron Collider Dr. Andrea Bocci Applied Physicist On behalf of the CMS Collaboration Discover CERN Inside the Large Hadron Collider at
More informationHeterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry
Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)
More informationSPECTRAL CLUSTERING OF LARGE NETWORKS
SPECTRAL CLUSTERING OF LARGE NETWORKS A. Fender, N. Emad, S. Petiton, M. Naumov May 8th, 207 Introduction Agenda Laplacian Modularity Conclusions 2 THE CLUSTERING PROBLEM Example : detect relevant groups
More informationImprovements for Implicit Linear Equation Solvers
Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often
More informationThis property turns out to be a general property of eigenvectors of a symmetric A that correspond to distinct eigenvalues as we shall see later.
34 To obtain an eigenvector x 2 0 2 for l 2 = 0, define: B 2 A - l 2 I 2 = È 1, 1, 1 Î 1-0 È 1, 0, 0 Î 1 = È 1, 1, 1 Î 1. To transform B 2 into an upper triangular matrix, subtract the first row of B 2
More informationRobust Preconditioned Conjugate Gradient for the GPU and Parallel Implementations
Robust Preconditioned Conjugate Gradient for the GPU and Parallel Implementations Rohit Gupta, Martin van Gijzen, Kees Vuik GPU Technology Conference 2012, San Jose CA. GPU Technology Conference 2012,
More informationParallel Circuit Simulation. Alexander Hayman
Parallel Circuit Simulation Alexander Hayman In the News LTspice IV features multi-threaded solvers designed to better utilize current multicore processors. Also included are new SPARSE matrix solvers
More informationDan Roth 461C, 3401 Walnut
CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn
More informationSaving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors
20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,
More information