Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!
|
|
- Phoebe Carroll
- 5 years ago
- Views:
Transcription
1 Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst: Numerical Linear Algebra or High-Perormance Computers Pacheco: A User s Guide to MPI (web) Parallel Programming with MPI Schüle: Paralleles Rechnen
2 Why parallel computing? SETI, weather prediction, quantum simulation TOP500 HLRB-II 2
3 I. Introduction. Computer Science Aspects 2. Numerical Problems 3. Graphs II. Elementary Linear Algebra Problems. BLAS 2. Matrix-Vector Operations 3. Matrix-Matrix-Product III. Linear Equations with Dense Matrices. Gaussian Elimination 2. Vectorization 3. Parallelization 4. QR-Decomposition with Householder matrices IV. Sparse Matrices. General Properties, Storage 2. Sparse Matrices and Graphs 3. Reordering 4. Gaussian Elimination and Graphs V. Iterative Methods or Sparse Matrices. Stationary Methods 2. Nonstationary Methods 3. Preconditioning VI. Domain decomposition VII.Eigenvalues, (Quantum Computing, GPU) 3
4 . Introduction. Computer Science Aspects o Parallel Numerics.. Parallelization in the CPU Elementary operations in CPU are carried out in pipelines: - Divide a task into smaller subtasks - Each small subtask is executed on a piece o hardware that operates concurrently with the other stages o the pipeline. Addition Pipeline: Operand Operand 2 Align exponents accordingly normalize result Stage Stage 2 Stage 3 Stage 4 Compare add mantissa exponents Output Result 4
5 Visualisation Pipelining x 2 x y y 2 5
6 Visualisation Pipelining x 3 x 2 x,y y 2 y 3 6
7 Visualisation Pipelining x 4 2 x 3 x 2,y 2 x,y y 3 y 4 7
8 Visualisation Pipelining x x 4 x 3,y 3 x 2,y 2 x,y y 4 y 5 8
9 Visualisation Pipelining x x 5 x 4,y 4 x 3,y 3 x 2,y 2 x,y y 5 y 6 9
10 Visualisation Pipelining x x 6 x 5,y 5 x 4,y 4 x 3,y 3 x 2,y 2 x +y y 6 y 7 0
11 Visualisation Pipelining x i x i-5 x i-4,y i-4 x i-3,y i-3 x i-2,y i-2 x i-,y i- x i +y i y i-5 y i-6 Startup time = k(=4) clock units Lateron on: per clock unit one result Total time: k*u + n*u
12 Advantages o Pipelines: I pipeline is illed: per clock unit one result is achieved. All additions should be organized such that the pipeline is always illed! I the pipeline is nearly empty, e.g. in the beginning o the computations, it is not eicient! Major task or CPU: Organize all operations such that the operands are just in time at the right position to ill the pipeline and keep it ull. 2
13 CPU - Pipelining 3
14 CPU 4
15 General Steps Instruction Fetch: Get the next command Decoding: Analyse instruction and compute addresses o operands Operand Fetch: Get the values o the next operands Execution step: Carry out command on operands Result Write: Write result in memory Pipelining o these steps, and also inside each step. 5
16 Special case: Vector instruction For set o data the same operation has to be executed on all components. y x M = α M For j =,2,..,n: y j = α x j ; y n x n α α x 5 α x 4 α x 3 α x 2 α x x 6 x 7 Total costs: Startup time + vector length * clock time (pipeline length + vector length ) * τ 6
17 Chaining: Combine pipelines directly α x + y : α α x Multiplication Addition α x + y x y Advantage: total cost = startup time + vector length * clock time 7
18 Problem: Data Dependency! Example: Fibonacci numbers x 0 = 0, x =, x 2 = x 0 + x,, x i = x i-2 + x i- ; x x 0 x 2 x 2 x Ater illing in x 0 and x the next pair needs x 2 which is known only ater the irst computation is inished! Pipeline contains always only one pair is nearly empty all the time! Similar Problem or recursive subroutine calls. 8
19 ..2 Memory Organization small CPU Register ast Capacity Level Cache Level 2 Cache Speed Main Memory large Hard Disc slow World (CD, DVD, Stick, Internet,.) 9
20 General Considerations DRAM (dynamic): Fast, periodically rereshing is necessary SRAM (static) SDRAM (small part SRAM combined with large DRAM) DDR: (double date rate) use both voltage lanks (side,shoulder) Necessary time or reading: - Transport addresses via bus to memory (bus speed) - Time between arrival o adresses and arrival o data: latency time (~4 cycles) - Rereshing o data: cycles - Transport o data rom memory: bus cycle 20
21 Cache Idea Cache as memory buer between large, slow memory and small, ast memory. By considering the data low (last used data), try to predict which data will be requested in the next step: - keep the last used data in ast cache because it is likely that the same data will be used again - keep also the neighborhood o the last used data in ast cache. Memory is organized in pages (main memory, hard disc,..). Hence, together with the last used data put the whole page in the cache. Page size ~ bus band width Main memory Cache Hard disc 2
22 Cache hit: The data requested rom the small, ast memory is ound in the cache: Copy the data to ast memory. Done. Cache miss: The data requested rom the small, ast memory is not ound in the cache: Look or data in the large, slow memory. Copy the related page to the cache (removing the oldest cache entry) and copy it to the ast, small memory. Also: Reuse data as oten as possible! Working blockwise to ensure neighbouring! 22
23 Mapping between Memory Direct mapping: Address in cache, modulo 00 0 Disadvantage: Immediately replacing o data in cache Associative mapping: Partition cache in blocks. Write data to direct mapped address in one o the blocks. 23 Replace oldest data in block.
24 Cyclic data distribution Memory is oten organized in banks connected by bus: bus bank bank2 bankn.... Per cycle n operands can be etched out o the n banks. Storing vectors! x in bank x 2 in bank2,. allows one step access to x 24
25 ..3 Parallel Processors Classical von Neumann model: Code and data in memory! Control unit etches instructions and data rom memory and sequentially coordinates the operations. 25
26 Parallel Computation Flynn s taxonomy: MIMD architecture: multiple instructions multiple data (compare to SISD = single instructions single data, etc.) 26
27 Shared Memory (SMP): memory I/O bus cache cache CPU CPU processors 27
28 Cache Coherence memory I/O start = proc_number; or (S=0; s<s_max; S++) parallel or(i=start; i<n; i+=s+proc_number) x[i]=2.; cache bus cache For S=0 and 2 threads: CPU Thread changes x(0,2,4, ) and thread 2 changes x(,3,5, ). CPU Does the cache contain 4 words (cache line = 4), then each changing step o thread also changes data that is also contained in the cache o thread 2 (and vice versa). Otherwise the data in the two caches is not consistent anymore! To retain the right values in both caches ater each changing step also the Value in the other cache has to be renewed! Leads to a dramatical increase o computational time, ev. slower than sequential computation! 28
29 Locally distributed memory: network memory memory cache cache knot CPU knot CPU P M processors..... memory P n M n Virtual shared memory: Distributed data but organized as shared memory. 29
30 Nonuniorm Memory Access Cluster o multiple CPU processors memory memory 2 controler bus controler cache cache cache cache Symmetric multiprocessor CPU CPU 2 CPU 3 CPU 4 Dierent types o communication! Shared memory and distributed memory! 30
31 Topology o the processor/memory interconnection P cache local m. I/O bus Bus: global memory P n Mesh (Array, Grid): p processors, longest path sqrt(p) P M P M P M. P M P M P M P M 3
32 Time or sending data rom one processor to another depends on the connection network topology: Mesh: 2*sqrt(p) vector or ring: p- or p/2 tree: 2 log(p) hypercube: log(p) Tree: Hypercube: 0d d 2d 3d 4d 32
33 Communication - Topology diameter = largest distance = p-: diameter = largest distance = p/2: diameter = largest distance = 2 sqrt(p) 33
34 Diameter = largest distance = 2 ln(p) Diameter:
35 Tree in Hypercube
36 Dierent Topologies: G p Diam(G) Degree(G) Edges(G) G (n) n n- 2 n- T (n) n loor(n/2) 2 n G 2 (n,n) n 2 2n-2 4 2n 2-2n T 2 (n,n) n 2 2*loor(n/2) 4 2n 2 BT(h) 2 h+ - 2h 3 2 h+ -2 HC(k) 2 k k k 2 k- k G: Grid, T: Torus, BT: binary Tree, HC: Hypercube 36
37 Network based on Switches 3-level Omega network Crossbar P7 P7 P7 P6 P6 P6 P5 P5 P5 P4 P4 P4 P3 P3 P3 P2 P2 P2 P P P P0 P0 P0 37
38 Communication Crossbar: Direct, independent connection between all processors. Nonblocking! Omega network: Blocking network. Simultaneous connection P0 P6 and P P7 is not possible! Turn-over o switches necessary! 38
39 ..4 Perormance Analysis Deinition: Computational Speed r = N / t Mlops, N loating point operations in t microseconds or by known speed r: time or N lops is given by t = N / r Amdahls s Law: Setting: An algorithm takes N lop s. A raction is carried out with speed o V Mlops (good in parallel) A raction - is carried out with S Mlops (bad) : high speed parallel : low speed, strongly sequential 39
40 40 Total CPU time: Overall speed (perormance): (Amdahl s Law) ) ( ) ( S V N S N V N t + = + = S V t N r + = = must be close to in order to beneit signiicantly rom parallelism
41 4 Discussion S S V t N r + = = with S the slow speed To achieve large speed, - has to be small! For very large parallel speed V: S S S V t N r = + + = = 0 The total speed is governed by the raction o the strongly sequential part o the algorithm that cannot be parallelized.
42 Speedup Executing a job using p processors in parallel we can achieve a speedup. Deine t p := wall clock time to execute the job on p parallel processors Speedup: S p := t / t p is the ratio o execution time with versus p processors In the ideal case it would hold t = p t p. Eiciency using p processors: E p = S p / p. 0<= E p <= E p : very good parallelizable, because then S p p or t p t p. Problem scales. E 42 p 0: bad, because E p = S p /p = t / (p t p ) and t << p t p.
43 43 Using the same deinition o speed and raction as above: ideally parallel strongly sequential Ware s Law ) ( ) ( ) ( t p p t t p t t p + = + = p p p S E p p ) ( ) ( + = = p p p p t t S p p + = + = = ) ( ) ( : 0 Ware s Law p E p We always will have a small portion o our algorithm that is not parallelizable and thereore the eiciency will always be zero in the limit!
44 /(-): Reachable Speedup or large p. : S: 0/9 0/8 2 0 Speedup depending on p. Saturation /(-). 44
45 Gustason s Law Other model: We assume that the problem can be solved in unit o time on a parallel machine with p processors. Fraction is good parallelizable, - not Compared with this parallel implementation an uniprocessor would perorm (-) + p or the same job. Speedup: t + p S p = = = p + ( p)( ) t p Eiciency: E p = S p p = p + p 45
46 Example = 0.99, p = 00 or 000: Amdahl/Ware: S= p/(+(-)p)) E=/(+(-)p) p=00: S 00 = 00/.99 ~ 50, E 00 = 0.5, p=000: S 000 = 000/0.99 ~ 00, E 000 = 0., Gustason: S=-+p E=(-)/p+ p=00: S 00 = 99.0, E = p=000: S 000 = 990.0, E =
NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1
NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using
More informationIntroduction The Nature of High-Performance Computation
1 Introduction The Nature of High-Performance Computation The need for speed. Since the beginning of the era of the modern digital computer in the early 1940s, computing power has increased at an exponential
More informationBLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product
Level-1 BLAS: SAXPY BLAS-Notation: S single precision (D for double, C for complex) A α scalar X vector P plus operation Y vector SAXPY: y = αx + y Vectorization of SAXPY (αx + y) by pipelining: page 8
More informationModel Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University
Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul
More informationWorst-Case Execution Time Analysis. LS 12, TU Dortmund
Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 02, 03 May 2016 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 53 Most Essential Assumptions for Real-Time Systems Upper
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationINF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)
INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder
More informationHYPERCUBE ALGORITHMS FOR IMAGE PROCESSING AND PATTERN RECOGNITION SANJAY RANKA SARTAJ SAHNI Sanjay Ranka and Sartaj Sahni
HYPERCUBE ALGORITHMS FOR IMAGE PROCESSING AND PATTERN RECOGNITION SANJAY RANKA SARTAJ SAHNI 1989 Sanjay Ranka and Sartaj Sahni 1 2 Chapter 1 Introduction 1.1 Parallel Architectures Parallel computers may
More informationParallele Numerik. Miriam Mehl. Institut für Parallele und Verteilte Systeme Universität Stuttgart
Parallele Numerik Miriam Mehl Institut für Parallele und Verteilte Systeme Universität Stuttgart Contents 1 Introduction 5 2 Parallelism and Parallel Architecture Basics 13 2.1 Levels of Parallelism...................................
More informationHigh Performance Computing
Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 2 nd appello February 11, 2015 Write your name, surname, student identification number (numero di matricola),
More informationMarwan Burelle. Parallel and Concurrent Programming. Introduction and Foundation
and and marwan.burelle@lse.epita.fr http://wiki-prog.kh405.net Outline 1 2 and 3 and Evolutions and Next evolutions in processor tends more on more on growing of cores number GPU and similar extensions
More informationAdministrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application
Administrivia 1. markem/cs333/ 2. Staff 3. Prerequisites 4. Grading Course Objectives 1. Theory and application 2. Benefits 3. Labs TAs Overview 1. What is a computer system? CPU PC ALU System bus Memory
More informationComputation of the mtx-vec product based on storage scheme on vector CPUs
BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on
More informationCMP 338: Third Class
CMP 338: Third Class HW 2 solution Conversion between bases The TINY processor Abstraction and separation of concerns Circuit design big picture Moore s law and chip fabrication cost Performance What does
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel Laura Grigori Mark Frederick Hoemmen Julien Langou Electrical Engineering and Computer Sciences University of California at
More informationChe-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University
Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization
More informationCMP N 301 Computer Architecture. Appendix C
CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)
More informationMaking electronic structure methods scale: Large systems and (massively) parallel computing
AB Making electronic structure methods scale: Large systems and (massively) parallel computing Ville Havu Department of Applied Physics Helsinki University of Technology - TKK Ville.Havu@tkk.fi 1 Outline
More informationGPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications
GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign
More informationMICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE
MICROPROCESSOR www.mpronline.com REPORT THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE ENERGY COROLLARIES TO AMDAHL S LAW Analyzing the Interactions Between Parallel Execution and Energy Consumption By
More informationww.padasalai.net
t w w ADHITHYA TRB- TET COACHING CENTRE KANCHIPURAM SUNDER MATRIC SCHOOL - 9786851468 TEST - 2 COMPUTER SCIENC PG - TRB DATE : 17. 03. 2019 t et t et t t t t UNIT 1 COMPUTER SYSTEM ARCHITECTURE t t t t
More informationPRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM
Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationCMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design
CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 19: Adder Design [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411 L19
More informationFPGA Implementation of a Predictive Controller
FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
More informationCRYSTAL in parallel: replicated and distributed (MPP) data. Why parallel?
CRYSTAL in parallel: replicated and distributed (MPP) data Roberto Orlando Dipartimento di Chimica Università di Torino Via Pietro Giuria 5, 10125 Torino (Italy) roberto.orlando@unito.it 1 Why parallel?
More informationFall 2008 CSE Qualifying Exam. September 13, 2008
Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for
More informationImprovements for Implicit Linear Equation Solvers
Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often
More informationPERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.
More informationQR FACTORIZATIONS USING A RESTRICTED SET OF ROTATIONS
QR FACTORIZATIONS USING A RESTRICTED SET OF ROTATIONS DIANNE P. O LEARY AND STEPHEN S. BULLOCK Dedicated to Alan George on the occasion of his 60th birthday Abstract. Any matrix A of dimension m n (m n)
More informationLecture 23: Illusiveness of Parallel Performance. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 23: Illusiveness of Parallel Performance James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L23 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping peel
More informationLecture 2: Metrics to Evaluate Systems
Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video
More informationJacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA
Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is
More informationDigital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.
Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEMORY INPUT-OUTPUT CONTROL DATAPATH
More informationDriving Point Impedance Computation Applying Parallel Processing Techniques
7th WSEAS Int. Con. on MATHEMATICAL METHODS and COMPUTATIONAL TECHNIQUES IN ELECTRICAL ENGINEERING, Soia, 27-29/10/05 (pp229-234) Driving Point Impedance Computation Applying Parallel Processing Techniques
More informationIntroduction to Computer Engineering. CS/ECE 252, Fall 2012 Prof. Guri Sohi Computer Sciences Department University of Wisconsin Madison
Introduction to Computer Engineering CS/ECE 252, Fall 2012 Prof. Guri Sohi Computer Sciences Department University of Wisconsin Madison Chapter 3 Digital Logic Structures Slides based on set prepared by
More informationStatic-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems
Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse
More informationParallel matrix computations
Parallel matrix computations (Gentle intro into HPC) Miroslav Tůma Extended syllabus June 4, 2018 Department of Numerical Mathematics, Faculty of Mathematics and Physics Charles University Contents 1 Computers,
More informationWorst-Case Execution Time Analysis. LS 12, TU Dortmund
Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 09/10, Jan., 2018 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 43 Most Essential Assumptions for Real-Time Systems Upper
More informationSolving PDEs with CUDA Jonathan Cohen
Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear
More informationTransposition Mechanism for Sparse Matrices on Vector Processors
Transposition Mechanism for Sparse Matrices on Vector Processors Pyrrhos Stathis Stamatis Vassiliadis Sorin Cotofana Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands
More informationWelcome to MCS 572. content and organization expectations of the course. definition and classification
Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson
More informationOPTIMAL PLACEMENT AND UTILIZATION OF PHASOR MEASUREMENTS FOR STATE ESTIMATION
OPTIMAL PLACEMENT AND UTILIZATION OF PHASOR MEASUREMENTS FOR STATE ESTIMATION Xu Bei, Yeo Jun Yoon and Ali Abur Teas A&M University College Station, Teas, U.S.A. abur@ee.tamu.edu Abstract This paper presents
More informationParallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and
More informationParallel Longest Common Subsequence using Graphics Hardware
Parallel Longest Common Subsequence using Graphics Hardware John Kloetzli rian Strege Jonathan Decker Dr. Marc Olano Presented by: rian Strege 1 Overview Introduction Problem Statement ackground and Related
More informationJ.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009
Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures J.I. Aliaga M. Bollhöfer 2 A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ.
More informationCSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm
CSCI 1760 - Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm Shay Mozes Brown University shay@cs.brown.edu Abstract. This report describes parallel Java implementations of
More information0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA
0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA 2008-09 Salvatore Orlando 1 0-1 Knapsack problem N objects, j=1,..,n Each kind of item j has a value p j and a weight w j (single
More informationParallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors
Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer
More informationCombinational vs. Sequential. Summary of Combinational Logic. Combinational device/circuit: any circuit built using the basic gates Expressed as
Summary of Combinational Logic : Computer Architecture I Instructor: Prof. Bhagi Narahari Dept. of Computer Science Course URL: www.seas.gwu.edu/~bhagiweb/cs3/ Combinational device/circuit: any circuit
More informationPerformance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So
Performance, Power & Energy ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Recall: Goal of this class Performance Reconfiguration Power/ Energy H. So, Sp10 Lecture 3 - ELEC8106/6102 2 PERFORMANCE EVALUATION
More informationECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering
ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering TIMING ANALYSIS Overview Circuits do not respond instantaneously to input changes
More informationDepartment of Electrical and Computer Engineering The University of Texas at Austin
Department of Electrical and Computer Engineering The University of Texas at Austin EE 360N, Fall 2004 Yale Patt, Instructor Aater Suleman, Huzefa Sanjeliwala, Dam Sunwoo, TAs Exam 1, October 6, 2004 Name:
More informationChapter 7. Sequential Circuits Registers, Counters, RAM
Chapter 7. Sequential Circuits Registers, Counters, RAM Register - a group of binary storage elements suitable for holding binary info A group of FFs constitutes a register Commonly used as temporary storage
More informationPorting a Sphere Optimization Program from lapack to scalapack
Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential
More informationTDDB68 Concurrent programming and operating systems. Lecture: CPU Scheduling II
TDDB68 Concurrent programming and operating systems Lecture: CPU Scheduling II Mikael Asplund, Senior Lecturer Real-time Systems Laboratory Department of Computer and Information Science Copyright Notice:
More informationAn Efficient FETI Implementation on Distributed Shared Memory Machines with Independent Numbers of Subdomains and Processors
Contemporary Mathematics Volume 218, 1998 B 0-8218-0988-1-03024-7 An Efficient FETI Implementation on Distributed Shared Memory Machines with Independent Numbers of Subdomains and Processors Michel Lesoinne
More informationSpecial Nodes for Interface
fi fi Special Nodes for Interface SW on processors Chip-level HW Board-level HW fi fi C code VHDL VHDL code retargetable compilation high-level synthesis SW costs HW costs partitioning (solve ILP) cluster
More informationA Parallel Implementation of the. Yuan-Jye Jason Wu y. September 2, Abstract. The GTH algorithm is a very accurate direct method for nding
A Parallel Implementation of the Block-GTH algorithm Yuan-Jye Jason Wu y September 2, 1994 Abstract The GTH algorithm is a very accurate direct method for nding the stationary distribution of a nite-state,
More informationAntonio Falabella. 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, September 2015, Hamburg
INFN - CNAF (Bologna) 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, 14-25 September 2015, Hamburg 1 / 44 Overview 1 2 3 4 5 2 / 44 to Computing The
More informationUC Santa Barbara. Operating Systems. Christopher Kruegel Department of Computer Science UC Santa Barbara
Operating Systems Christopher Kruegel Department of Computer Science http://www.cs.ucsb.edu/~chris/ Many processes to execute, but one CPU OS time-multiplexes the CPU by operating context switching Between
More informationTIME DEPENDENCE OF SHELL MODEL CALCULATIONS 1. INTRODUCTION
Mathematical and Computational Applications, Vol. 11, No. 1, pp. 41-49, 2006. Association for Scientific Research TIME DEPENDENCE OF SHELL MODEL CALCULATIONS Süleyman Demirel University, Isparta, Turkey,
More informationAnalytical Modeling of Parallel Programs (Chapter 5) Alexandre David
Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on
More informationPorting a sphere optimization program from LAPACK to ScaLAPACK
Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference
More informationReview for the Midterm Exam
Review for the Midterm Exam 1 Three Questions of the Computational Science Prelim scaled speedup network topologies work stealing 2 The in-class Spring 2012 Midterm Exam pleasingly parallel computations
More informationSolving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI *
Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * J.M. Badía and A.M. Vidal Dpto. Informática., Univ Jaume I. 07, Castellón, Spain. badia@inf.uji.es Dpto. Sistemas Informáticos y Computación.
More informationCompiling Techniques
Lecture 11: Introduction to 13 November 2015 Table of contents 1 Introduction Overview The Backend The Big Picture 2 Code Shape Overview Introduction Overview The Backend The Big Picture Source code FrontEnd
More informationModels: Amdahl s Law, PRAM, α-β Tal Ben-Nun
spcl.inf.ethz.ch @spcl_eth Models: Amdahl s Law, PRAM, α-β Tal Ben-Nun Design of Parallel and High-Performance Computing Fall 2017 DPHPC Overview cache coherency memory models 2 Speedup An application
More informationClock-driven scheduling
Clock-driven scheduling Also known as static or off-line scheduling Michal Sojka Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Control Engineering November 8, 2017
More informationLOGIC CIRCUITS. Basic Experiment and Design of Electronics
Basic Experiment and Design of Electronics LOGIC CIRCUITS Ho Kyung Kim, Ph.D. hokyung@pusan.ac.kr School of Mechanical Engineering Pusan National University Outline Combinational logic circuits Output
More informationDirect solution methods for sparse matrices. p. 1/49
Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve
More informationICS 233 Computer Architecture & Assembly Language
ICS 233 Computer Architecture & Assembly Language Assignment 6 Solution 1. Identify all of the RAW data dependencies in the following code. Which dependencies are data hazards that will be resolved by
More informationA Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationB629 project - StreamIt MPI Backend. Nilesh Mahajan
B629 project - StreamIt MPI Backend Nilesh Mahajan March 26, 2013 Abstract StreamIt is a language based on the dataflow model of computation. StreamIt consists of computation units called filters connected
More informationVLSI Design. [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1
VLSI Design Adder Design [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1 Major Components of a Computer Processor Devices Control Memory Input Datapath
More informationEnhancing Performance of Tall-Skinny QR Factorization using FPGAs
Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Abid Rafique Imperial College London August 31, 212 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18 Our Claim Common
More informationCS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms
CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms Prof. Gregory Provan Department of Computer Science University College Cork 1 Lecture Outline CS 4407, Algorithms Growth Functions
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationParallel Numerics, WT 2016/ Iterative Methods for Sparse Linear Systems of Equations. page 1 of 1
Parallel Numerics, WT 2016/2017 5 Iterative Methods for Sparse Linear Systems of Equations page 1 of 1 Contents 1 Introduction 1.1 Computer Science Aspects 1.2 Numerical Problems 1.3 Graphs 1.4 Loop Manipulations
More informationRetiming. delay elements in a circuit without affecting the input/output characteristics of the circuit.
Chapter Retiming NCU EE -- SP VLSI esign. Chap. Tsung-Han Tsai 1 Retiming & A transformation techniques used to change the locations of delay elements in a circuit without affecting the input/output characteristics
More informationConsider the following example of a linear system:
LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x + 2x 2 + 3x 3 = 5 x + x 3 = 3 3x + x 2 + 3x 3 = 3 x =, x 2 = 0, x 3 = 2 In general we want to solve n equations
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More informationContents. Preface... xi. Introduction...
Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism
More informationCHAPTER 5 - PROCESS SCHEDULING
CHAPTER 5 - PROCESS SCHEDULING OBJECTIVES To introduce CPU scheduling, which is the basis for multiprogrammed operating systems To describe various CPU-scheduling algorithms To discuss evaluation criteria
More informationBinary Decision Diagrams and Symbolic Model Checking
Binary Decision Diagrams and Symbolic Model Checking Randy Bryant Ed Clarke Ken McMillan Allen Emerson CMU CMU Cadence U Texas http://www.cs.cmu.edu/~bryant Binary Decision Diagrams Restricted Form of
More informationMassively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem
Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Katharina Kormann 1 Klaus Reuter 2 Markus Rampp 2 Eric Sonnendrücker 1 1 Max Planck Institut für Plasmaphysik 2 Max Planck Computing
More informationIntroduction to Computer Engineering. CS/ECE 252, Spring 2017 Rahul Nayar Computer Sciences Department University of Wisconsin Madison
Introduction to Computer Engineering CS/ECE 252, Spring 2017 Rahul Nayar Computer Sciences Department University of Wisconsin Madison Chapter 3 Digital Logic Structures Slides based on set prepared by
More informationSparse solver 64 bit and out-of-core addition
Sparse solver 64 bit and out-of-core addition Prepared By: Richard Link Brian Yuen Martec Limited 1888 Brunswick Street, Suite 400 Halifax, Nova Scotia B3J 3J8 PWGSC Contract Number: W7707-145679 Contract
More informationThe parallelization of the Keller box method on heterogeneous cluster of workstations
Available online at http://wwwibnusinautmmy/jfs Journal of Fundamental Sciences Article The parallelization of the Keller box method on heterogeneous cluster of workstations Norhafiza Hamzah*, Norma Alias,
More informationHeterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry
Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)
More informationLOGIC CIRCUITS. Basic Experiment and Design of Electronics. Ho Kyung Kim, Ph.D.
Basic Experiment and Design of Electronics LOGIC CIRCUITS Ho Kyung Kim, Ph.D. hokyung@pusan.ac.kr School of Mechanical Engineering Pusan National University Digital IC packages TTL (transistor-transistor
More informationImprovement of Sparse Computation Application in Power System Short Circuit Study
Volume 44, Number 1, 2003 3 Improvement o Sparse Computation Application in Power System Short Circuit Study A. MEGA *, M. BELKACEMI * and J.M. KAUFFMANN ** * Research Laboratory LEB, L2ES Department o
More informationLogic and Computer Design Fundamentals. Chapter 8 Sequencing and Control
Logic and Computer Design Fundamentals Chapter 8 Sequencing and Control Datapath and Control Datapath - performs data transfer and processing operations Control Unit - Determines enabling and sequencing
More informationNotation. Bounds on Speedup. Parallel Processing. CS575 Parallel Processing
Parallel Processing CS575 Parallel Processing Lecture five: Efficiency Wim Bohm, Colorado State University Some material from Speedup vs Efficiency in Parallel Systems - Eager, Zahorjan and Lazowska IEEE
More informationStreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory
StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,
More informationFigure 4.9 MARIE s Datapath
Term Control Word Microoperation Hardwired Control Microprogrammed Control Discussion A set of signals that executes a microoperation. A register transfer or other operation that the CPU can execute in
More informationCase Study: Quantum Chromodynamics
Case Study: Quantum Chromodynamics Michael Clark Harvard University with R. Babich, K. Barros, R. Brower, J. Chen and C. Rebbi Outline Primer to QCD QCD on a GPU Mixed Precision Solvers Multigrid solver
More information