Transposition Mechanism for Sparse Matrices on Vector Processors
Pyrrhos Stathis, Stamatis Vassiliadis, Sorin Cotofana
Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

Abstract— Many scientific applications involve operations on sparse matrices. However, due to irregularities induced by the sparsity patterns, many operations on sparse matrices execute inefficiently on traditional scalar and vector architectures. To tackle this problem a scheme has been proposed consisting of two parts: (a) an extension to a vector architecture to support sparse matrix-vector multiplication using (b) a novel Blocked Based sparse matrix Compression Storage (BBCS) format. Within this context, in this paper we propose and describe a hardware mechanism for the extended vector architecture that performs the transposition A^T of a sparse matrix A using a hierarchical variation of the aforementioned sparse matrix compression format. The proposed Sparse matrix Transposition Mechanism (STM) is used as a Functional Unit for a vector processor and requires an s×s-word in-processor memory, where s is the vector processor's section size. In this paper we provide a full description of the STM and show an expected performance increase of one order of magnitude.

Keywords— Vector processor, matrix transpose, sparse matrix, functional unit

I. INTRODUCTION

In many scientific computing areas the manipulation of sparse matrices constitutes the kernel of the solvers. The irregularities of the matrix sparsity patterns, i.e. the distribution of the non-zeros within the matrix, make many operations on sparse matrices execute inefficiently on traditional scalar and vector architectures. This problem has been tackled by both software and hardware approaches. Most of the approaches are in software [2], [3], because they are less costly. However, research focused on hardware approaches [4], [5], [6], [7] indicates that much greater improvements can be obtained.
In [1] the authors report a substantial speedup (depending on the sparsity pattern) using an Augmented Vector Architecture (AVA) and an associated sparse matrix storage scheme (BBCS) when performing sparse matrix-vector multiplication, compared to the JD (Jagged Diagonal) method on a conventional vector processor. The sparse matrix related problem that we address here is that of matrix transposition, i.e. the construction of the transpose A^T from a sparse matrix A on a vector processor. It is not possible to perform the transposition of a sparse matrix using the instruction set of a traditional vector processor. Therefore, in this paper we propose a mechanism to enable the transposition of a sparse matrix within the context of the aforementioned AVA. The contributions of this paper can be summarized as follows: We propose and describe a novel mechanism, the Sparse matrix Transposition Mechanism (STM), implemented as a functional unit for a vector processor, that can perform the transposition of a sparse matrix stored in a hierarchical sparse matrix storage format. We evaluate the timing properties of the STM and show an expected performance increase of one order of magnitude when compared to a scalar implementation of sparse matrix transposition. The remainder of the paper is organized as follows: In the next Section we provide some background information on transposition, vector processors and the hierarchical sparse matrix storage format. In Section III we describe and evaluate the proposed mechanism and finally, in Section IV we draw some conclusions.

II. BACKGROUND

This section provides some background information and assumptions made throughout the paper. The transposition of an M×N matrix A is the calculation of the N×M matrix A^T. The operation consists of the exchange of the rows and columns of the matrix. Thus, essentially it is an operation that does not alter the values of the elements but only their positions.
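The exchange-of-positions view above can be stated directly; the following minimal sketch (not from the paper) shows that transposition moves the element at (i, j) to (j, i) without changing its value:

```python
# Minimal illustration of the definition above: the M x N matrix A
# becomes the N x M matrix A^T, with A^T[j][i] == A[i][j].
def transpose(a):
    """a: M x N matrix as a list of rows; returns the N x M transpose."""
    m, n = len(a), len(a[0])
    return [[a[i][j] for i in range(m)] for j in range(n)]

a = [[1, 2, 3],
     [4, 5, 6]]
t = transpose(a)
assert t == [[1, 4], [2, 5], [3, 6]]
# Values are untouched; only positions are exchanged:
assert all(a[i][j] == t[j][i] for i in range(2) for j in range(3))
```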
For a dense matrix the problem is trivial: a row-wise stored matrix can simply be addressed with a column-wise stride (or vice versa). Sparse matrices, however, are usually stored in a more complex way that involves storing the non-zero values together with their positional information [2], [3]. This results in the need to use costly sorting algorithms in order to perform the transposition.
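The sorting cost can be made concrete with a sketch (not the paper's storage format): with a CSR-like scheme that keeps, per row, the column indices and values of the non-zeros, transposition amounts to re-sorting every entry by its column index — the irregular O(nnz·log nnz) step that motivates the STM.

```python
# Hypothetical CSR-style transpose via sorting; row_ptr/col_idx/vals are
# the usual compressed-row arrays, n_cols the matrix width.
def csr_transpose(row_ptr, col_idx, vals, n_cols):
    """Return (row_ptr, col_idx, vals) of the transposed matrix."""
    entries = []
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            entries.append((col_idx[k], r, vals[k]))   # exchange (row, col)
    entries.sort()                                     # the costly, irregular step
    t_ptr = [0] * (n_cols + 1)
    t_col, t_val = [], []
    for c, r, v in entries:
        t_ptr[c + 1] += 1
        t_col.append(r)
        t_val.append(v)
    for c in range(n_cols):                            # prefix-sum the row counts
        t_ptr[c + 1] += t_ptr[c]
    return t_ptr, t_col, t_val

# 2x3 matrix [[5, 0, 1], [0, 2, 0]] in CSR form:
ptr, col, val = csr_transpose([0, 2, 3], [0, 2, 1], [5, 1, 2], 3)
assert (ptr, col, val) == ([0, 1, 2, 3], [0, 1, 0], [5, 2, 1])
```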
With the proposed mechanism we attempt to streamline this operation and make it suitable for a vector processor. The proposed transposition mechanism functions as a Functional Unit of a vector processor. Vector processors, such as the one depicted in Figure 1, are based on architectures that support the execution of vector instructions. On most current vector architectures [8], the vectors are copied from the main memory into vector registers within the processor before they are operated upon. Vector registers are arrays of scalar registers that hold (parts of) the vectors to be processed. Because the vector register length cannot be arbitrarily large, large vectors have to be divided into smaller parts, a technique usually called strip mining; each part can be no larger than the maximum number of elements a vector register can hold, i.e. the architecturally defined section size s of the VP. In a VP the operations are carried out by (usually) pipelined Functional Units (FUs) that are able to fetch one or more new elements per cycle from each of the source vector register(s) involved, operate on them, and return the result(s) to the result (vector) register.

Fig. 1. Vector Architecture (main memory; load/store unit; vector unit with vector controller, vector register file and functional units 1..N; scalar unit with scalar controller, cache, scalar registers and scalar pipeline)

Before proceeding with the description of the STM functional unit we first give a brief description of the hierarchical storage format (HiSM), the sparse matrix format that we will assume for the remainder of the paper and which is a hierarchical variation of the aforementioned BBCS format: To obtain the HiSM representation, an M×N sparse matrix is partitioned into ⌈M/s⌉ × ⌈N/s⌉ square s×s sub-matrices, where s is the Section Size of the vector architecture.
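The strip-mining technique mentioned above can be sketched as follows (an illustration, not from the paper; the section size S = 8 is an assumed value):

```python
# Strip mining: a long vector operation is processed s elements at a
# time, one vector-register-sized strip per "vector instruction".
S = 8  # assumed section size for illustration

def vector_add(x, y):
    """Element-wise add of two long vectors, at most S elements per pass."""
    result = []
    for start in range(0, len(x), S):           # one strip per iteration
        xs = x[start:start + S]                 # load strip into a vector register
        ys = y[start:start + S]
        result.extend(a + b for a, b in zip(xs, ys))  # one vector add
    return result

assert vector_add(list(range(20)), [1] * 20) == list(range(1, 21))
```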
Each of these sub-matrices, which we will call s²-blocks, is then stored separately in memory in the following way: all the non-zero values, together with their positional information, are stored in a row-wise fashion in an array (the s²-blockarray) in memory. In Figure 2 (bottom left) we can observe how such a blockarray is formed, containing both the position and value data of the top-left s²-block of a sparse matrix; the section size in the example is 8. Note that the positional data consists of only the column position of each non-zero element within the sub-matrix plus an extra bit that indicates whether the non-zero element is the last element in its row. This bit is incorporated in the position data. We will not elaborate further on the exact bit-by-bit configuration of the s²-blockarray. The s²-blockarrays can contain up to s² non-zero elements and we will assume that an AVA can operate on them in the same way as described in [1]. These s²-blockarrays, which describe the non-empty s²-blocks, form the lowest (zero) level of the hierarchical structure of our format. As can be observed in Figure 2, the non-empty s²-blocks form a similar sparsity pattern as the non-zero values within an s²-block. Therefore, the next level of the hierarchy, level-1, is formed in exactly the same way as level zero, with the difference that the values of non-zero elements are replaced by pointers to the s²-blockarrays in memory that describe non-empty s²-blocks. This new array, which contains the pointers to the lower level, is stored in exactly the same fashion in memory (see Figure 2, bottom right). Notice that at level-1 the pointers are stored in a column-wise fashion. In this way an access pattern is provided where the s-element-wide columns are accessed row-wise. This is favorable for operations such as matrix-vector multiplication (refer to [1] for a more elaborate discussion).
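The level-0 storage just described can be sketched functionally as below. The exact bit layout is deliberately left unspecified by the paper, so a tuple (column, end-of-row bit, value) stands in for each packed entry; note also that, as the paper's footnote observes, fully empty rows need extra handling that is not shown here.

```python
# Sketch of building an s2-blockarray from a dense s x s block: for each
# non-zero, store its column position, an end-of-row flag (1 for the last
# non-zero of its row), and its value, scanning the block row-wise.
def block_to_blockarray(block):
    """block: s x s list of lists; returns [(col, end_of_row, value), ...]."""
    arr = []
    for row in block:
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        for k, (j, v) in enumerate(nz):
            arr.append((j, 1 if k == len(nz) - 1 else 0, v))
    return arr

blk = [[5, 0, 0, 0],
       [0, 0, 7, 2],
       [0, 0, 0, 0],   # empty row: needs the special handling of the footnote
       [1, 0, 0, 0]]
assert block_to_blockarray(blk) == [(0, 1, 5), (2, 0, 7), (3, 1, 2), (0, 1, 1)]
```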
The next level, level-2, if there is one (in the example of Figure 2 there is none), is formed in the same way as level-1, with the pointers pointing at the s²-blockarrays of level-1. Further, as in any hierarchical structure, the higher levels are formed in the same way, and we proceed until we have covered the entire matrix in ⌈max(log_s M, log_s N)⌉ levels. We can summarize the description of the Hierarchical Sparse Matrix storage format as follows: The entire matrix is divided hierarchically into blocks of size s×s (called s²-blocks), with the lowest level containing the actual values of the non-zero elements and the higher levels containing pointers to the non-empty s²-blocks of one level lower. The s²-blocks at all levels are represented as an array (called an s²-blockarray) whose entries are non-zero values (for level-0) or pointers to non-empty lower-level s²-blockarrays (for all higher levels), along with their corresponding positional information within the block.¹ The formats are identical for all levels.

¹The careful reader will notice that when there are empty rows within the s²-block this format will not suffice. We have incorporated this detail in our format in the same way as in [1]; however, as it is of no further consequence to what will be discussed in the remainder of the paper, we omit a detailed description for simplicity.

Fig. 2. Example of the Hierarchical Sparse Matrix Storage Format (legend: non-zero element; end of row; end of column; pointer to an 8×8 submatrix one level lower in the matrix hierarchy; the level-0 s²-blockarray holds positional data and value data, the level-1 s²-blockarray holds positional data and pointer data)

III. THE TRANSPOSITION MECHANISM

As mentioned previously, the proposed Sparse matrix Transposition Mechanism (STM) is implemented as a functional unit of a vector processor. The STM for a section size s is depicted in Figure 3. The main part of the unit consists of the s×s-memory, which is used to store one s²-block of the hierarchically stored matrix. The mechanism can transpose one s²-block at a time. The procedure is as follows: First, the s²-block is stored in the s×s-memory one section at a time. When the complete s²-block is stored, it is read back from the s×s-memory in the fashion transposed to the one used for storing, i.e. row-wise if stored column-wise and vice versa. More specifically: assume that a part of an s²-block is stored in a vector register R. The contents of register R can be stored in the s×s-memory via the column-wise I/O-buffer. The depth b of this buffer defines how many elements can maximally be stored per clock cycle. We will call this the s×s-memory bandwidth, which in the case of Figure 3 is 4. At each cycle the I/O-buffer is filled with non-zero elements of the same column along with their corresponding row positions.

Fig. 3. The Sparse matrix Transposition Mechanism (STM): vector register file; row-wise and column-wise I/O buffers; row-buffer and column-buffer with non-zero indicators ("1" when non-zero, "0" when zero); Non-zero Locators; storage cells of the s×s memory

In the next cycle, the row position is used by the Non-zero Locator unit to store the non-zero values at the correct row positions in the column-buffer. The non-zero indicators at the corresponding cells of the buffer are then set accordingly to indicate a non-zero or a zero value. This process is repeated until there are no more non-zero elements for the current column. Subsequently, the entire column-buffer is copied into the s×s-memory using the column position information (not shown in Figure 3). We can then read the transpose of the s²-block by reversing the order used for storing, at the row-wise section of the STM. Column by column, the s²-block is moved into the row-buffer. There, using the Non-zero Locator, the non-zero values and their column positions are copied into the I/O-buffer (maximally b at a time) and then stored into a register in the register file. When reading the s×s-memory, however, the working of the Non-zero Locator is not trivial, so we describe it in further detail. The Non-zero Locator is graphically depicted in Figure 4. The function of this circuit is to extract from a string of input bits (the non-zero indicators) the positions of the first b ones. When there are more than b non-zero elements, the located non-zeros are set to zero (not depicted in Figure 4) and the process is repeated in order to locate the following non-zero elements. When there are fewer than b non-zero elements, one or more of the counters will produce an overflow.
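Functionally, the Non-zero Locator can be sketched as below (a software model, not the circuit itself; the bandwidth B = 4 is the assumed I/O-buffer depth from the figure). A final group shorter than b models the case where the counters overflow and the control logic fetches the next row or column.

```python
# Software model of the Non-zero Locator: from a vector of non-zero
# indicator bits, emit the positions of the first b ones, then repeat
# until the current row/column buffer is drained.
B = 4  # assumed s x s-memory bandwidth (depth of the I/O-buffer)

def locate_nonzeros(indicators, b=B):
    """Yield successive groups of at most b positions whose indicator is 1."""
    positions = [i for i, bit in enumerate(indicators) if bit]
    for start in range(0, len(positions), b):
        yield positions[start:start + b]

# 8-bit indicator string with five ones: two passes are needed.
groups = list(locate_nonzeros([1, 0, 0, 1, 1, 0, 1, 1]))
assert groups == [[0, 3, 4, 6], [7]]
```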
This overflow indicates to the control logic that a new row or column needs to be fetched from the s×s-memory.

Fig. 4. The Non-zero Locator

As we have mentioned, the STM can only transpose one s²-block at a time. However, because of the similar structure of the HiSM at all hierarchy levels, we can apply the same transposition mechanism at all levels in order to achieve the transposition of the entire matrix. Figure 5 graphically illustrates this principle. Observe that when the matrix is transposed, every s²-block is also transposed. Additionally, one level higher, at level-1, the positions of the non-empty blocks (depicted darker in the figure) are also transposed. This means that if we can transpose an s²-blockarray at level-0, we can apply the same algorithm to the s²-blockarrays at all levels to transpose the entire matrix.

Fig. 5. Matrix Transposition

A. Timing Evaluation

In this section we provide performance estimations of the proposed mechanism. Because the s×s-memory has to be filled before it can be read back, the STM unit cannot be fully pipelined. However, the write and read phases can each be pipelined in three stages. This means that 3 cycles are required for the last elements to enter the s×s-memory and, similarly, 3 cycles are needed for the last results to be returned to the vector register. This results in a functional unit that has a latency of 2·(n_z/T) + 6, where n_z is the number of non-zero elements in the s²-block and T is the throughput of the I/O-buffer, i.e. the average number of elements in the I/O-buffer per cycle. The throughput T varies from 1 to b, where b is the previously mentioned STM bandwidth that is equal to the depth of the I/O-buffer. The precise value of T, and thus the performance of the STM, depends on the sparsity pattern of the matrix to be transposed. Therefore we provide only the worst and best case scenarios to evaluate the performance:

Best case (T = b): 2·(n_z/b) + 6
Worst case (T = 1): 2·n_z + 6

To perform the same operation on a scalar machine we would need a sorting loop of on average n_z·log(n_z) iterations. The operations within this loop are highly dependent and unpredictable, and therefore no advantage can be expected from ILP techniques such as pipelining, dependence checking and branch prediction. This results in a sustained execution time of several cycles per iteration. Compared to our scheme we can expect an order of magnitude of improvement.

IV. CONCLUSIONS

In this paper we have proposed and described a novel mechanism, the Sparse matrix Transposition Mechanism (STM), implemented as a functional unit for a vector processor, that can perform the transposition of a sparse matrix stored in a hierarchical sparse matrix storage format. We have evaluated the timing properties of the STM and showed an expected performance increase of one order of magnitude when compared to a scalar implementation of sparse matrix transposition.

REFERENCES

[1] S. Vassiliadis, S.
Cotofana, and P. Stathis, "Vector ISA extension for sparse matrix multiplication," in Euro-Par '99 Parallel Processing, 1999, Lecture Notes in Computer Science No. 1685, pp. 708-715, Springer-Verlag.
[2] Victor Eijkhout, "LAPACK working note 50: Distributed sparse data structures for linear algebra operations," Tech. Rep., Department of Computer Science, University of Tennessee, Sept. 1992.
[3] Yousef Saad, "SPARSKIT: A basic tool kit for sparse matrix computations," Tech. Rep., Computer Science Department, University of Minnesota, Minneapolis, MN 55455, June 1994, Version 2.
[4] Hideharu Amano, Taisuke Boku, Tomohiro Kudoh, and Hideo Aiso, "(SM)²-II: A new version of the sparse matrix solving machine," in Proceedings of the 12th Annual International Symposium on Computer Architecture, Boston, Massachusetts, June 1985, IEEE Computer Society TCA and ACM SIGARCH.
[5] Valerie E. Taylor, Abhiram Ranade, and David G. Messerschmitt, "SPAR: A New Architecture for Large Finite Element Computations," IEEE Transactions on Computers, vol. 44, no. 4, April 1995.
[6] Pyrrhos Stathis, Stamatis Vassiliadis, and Sorin Cotofana, "Sparse matrix vector multiplication evaluation using the BBCS scheme," to appear in 8th Panhellenic Conference on Informatics (PCI), Nov. 2001.
[7] A. Wolfe, M. Breternitz, Jr., C. Stephens, A. L. Ting, D. B. Kirk, R. P. Bianchini, Jr., and J. P. Shen, "The White Dwarf: A high-performance application-specific processor," in Proceedings of the 15th Annual International Symposium on Computer Architecture, H. J. Siegel, Ed., Honolulu, Hawaii, May-June 1988, IEEE Computer Society Press.
[8] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, California, 1990.
Author manuscript, published in "2nd Multidisciplinary International Conference on Scheduling : Theory and Applications (MISTA 2005), New York, NY. : United States (2005)" 2 More formally, we denote by
More informationDesigning Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way
EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate
More informationChapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>
Chapter 5 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 5 Chapter 5 :: Topics Introduction Arithmetic Circuits umber Systems Sequential Building
More informationECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University
ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University Prof. Mi Lu TA: Ehsan Rohani Laboratory Exercise #4 MIPS Assembly and Simulation
More information5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y)
5.1 Banded Storage u = temperature u= u h temperature at gridpoints u h = 1 u= Laplace s equation u= h u = u h = grid size u=1 The five-point difference operator 1 u h =1 uh (x + h, y) 2u h (x, y)+u h
More informationHalting and Equivalence of Program Schemes in Models of Arbitrary Theories
Halting and Equivalence of Program Schemes in Models of Arbitrary Theories Dexter Kozen Cornell University, Ithaca, New York 14853-7501, USA, kozen@cs.cornell.edu, http://www.cs.cornell.edu/~kozen In Honor
More informationDigital Logic: Boolean Algebra and Gates. Textbook Chapter 3
Digital Logic: Boolean Algebra and Gates Textbook Chapter 3 Basic Logic Gates XOR CMPE12 Summer 2009 02-2 Truth Table The most basic representation of a logic function Lists the output for all possible
More informationarxiv: v1 [cs.sc] 17 Apr 2013
EFFICIENT CALCULATION OF DETERMINANTS OF SYMBOLIC MATRICES WITH MANY VARIABLES TANYA KHOVANOVA 1 AND ZIV SCULLY 2 arxiv:1304.4691v1 [cs.sc] 17 Apr 2013 Abstract. Efficient matrix determinant calculations
More informationIntroducing a Bioinformatics Similarity Search Solution
Introducing a Bioinformatics Similarity Search Solution 1 Page About the APU 3 The APU as a Driver of Similarity Search 3 Similarity Search in Bioinformatics 3 POC: GSI Joins Forces with the Weizmann Institute
More informationB œ c " " ã B œ c 8 8. such that substituting these values for the B 3 's will make all the equations true
System of Linear Equations variables Ð unknowns Ñ B" ß B# ß ÞÞÞ ß B8 Æ Æ Æ + B + B ÞÞÞ + B œ, "" " "# # "8 8 " + B + B ÞÞÞ + B œ, #" " ## # #8 8 # ã + B + B ÞÞÞ + B œ, 3" " 3# # 38 8 3 ã + 7" B" + 7# B#
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program
More informationComputer Science. Questions for discussion Part II. Computer Science COMPUTER SCIENCE. Section 4.2.
COMPUTER SCIENCE S E D G E W I C K / W A Y N E PA R T I I : A L G O R I T H M S, T H E O R Y, A N D M A C H I N E S Computer Science Computer Science An Interdisciplinary Approach Section 4.2 ROBERT SEDGEWICK
More informationDigital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.
Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEMORY INPUT-OUTPUT CONTROL DATAPATH
More informationCMP N 301 Computer Architecture. Appendix C
CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)
More informationOutline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014
Outline 1 midterm exam on Friday 11 July 2014 policies for the first part 2 questions with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014 Intro
More informationCache-Oblivious Algorithms
Cache-Oblivious Algorithms 1 Cache-Oblivious Model 2 The Unknown Machine Algorithm C program gcc Object code linux Execution Can be executed on machines with a specific class of CPUs Algorithm Java program
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationLet s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.
Finite State Machines Introduction Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Such devices form
More informationFinite-choice algorithm optimization in Conjugate Gradients
Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the
More informationIntroduction to Matrices and Linear Systems Ch. 3
Introduction to Matrices and Linear Systems Ch. 3 Doreen De Leon Department of Mathematics, California State University, Fresno June, 5 Basic Matrix Concepts and Operations Section 3.4. Basic Matrix Concepts
More informationEECS150 - Digital Design Lecture 21 - Design Blocks
EECS150 - Digital Design Lecture 21 - Design Blocks April 3, 2012 John Wawrzynek Spring 2012 EECS150 - Lec21-db3 Page 1 Fixed Shifters / Rotators fixed shifters hardwire the shift amount into the circuit.
More informationImplementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System
Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System G.Suresh, G.Indira Devi, P.Pavankumar Abstract The use of the improved table look up Residue Number System
More informationParallel Sparse Tensor Decompositions using HiCOO Format
Figure sources: A brief survey of tensors by Berton Earnshaw and NVIDIA Tensor Cores Parallel Sparse Tensor Decompositions using HiCOO Format Jiajia Li, Jee Choi, Richard Vuduc May 8, 8 @ SIAM ALA 8 Outline
More informationSome long-period random number generators using shifts and xors
ANZIAM J. 48 (CTAC2006) pp.c188 C202, 2007 C188 Some long-period random number generators using shifts and xors Richard P. Brent 1 (Received 6 July 2006; revised 2 July 2007) Abstract Marsaglia recently
More informationDependence Analysis. Dependence Examples. Last Time: Brief introduction to interprocedural analysis. do I = 2, 100 A(I) = A(I-1) + 1 enddo
Dependence Analysis Dependence Examples Last Time: Brief introduction to interprocedural analysis Today: Optimization for parallel machines and memory hierarchies Dependence analysis Loop transformations
More informationConsider the following example of a linear system:
LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x + 2x 2 + 3x 3 = 5 x + x 3 = 3 3x + x 2 + 3x 3 = 3 x =, x 2 = 0, x 3 = 2 In general we want to solve n equations
More informationDesign at the Register Transfer Level
Week-7 Design at the Register Transfer Level Algorithmic State Machines Algorithmic State Machine (ASM) q Our design methodologies do not scale well to real-world problems. q 232 - Logic Design / Algorithmic
More informationLecture 11. Advanced Dividers
Lecture 11 Advanced Dividers Required Reading Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Design Chapter 15 Variation in Dividers 15.3, Combinational and Array Dividers Chapter 16, Division
More information4th year Project demo presentation
4th year Project demo presentation Colm Ó héigeartaigh CASE4-99387212 coheig-case4@computing.dcu.ie 4th year Project demo presentation p. 1/23 Table of Contents An Introduction to Quantum Computing The
More informationDistributed Data Storage with Minimum Storage Regenerating Codes - Exact and Functional Repair are Asymptotically Equally Efficient
Distributed Data Storage with Minimum Storage Regenerating Codes - Exact and Functional Repair are Asymptotically Equally Efficient Viveck R Cadambe, Syed A Jafar, Hamed Maleki Electrical Engineering and
More informationECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)
ECE 3401 Lecture 23 Pipeline Design Control State Register Combinational Control Logic New/ Modified Control Word ISA: Instruction Specifications (for reference) P C P C + 1 I N F I R M [ P C ] E X 0 PC
More informationPerformance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu
Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance
More information211: Computer Architecture Summer 2016
211: Computer Architecture Summer 2016 Liu Liu Topic: Storage Project3 Digital Logic - Storage: Recap - Review: cache hit rate - Project3 - Digital Logic: - truth table => SOP - simplification: Boolean
More informationFPGA Implementation of a Predictive Controller
FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
More informationALU A functional unit
ALU A functional unit that performs arithmetic operations such as ADD, SUB, MPY logical operations such as AND, OR, XOR, NOT on given data types: 8-,16-,32-, or 64-bit values A n-1 A n-2... A 1 A 0 B n-1
More informationParallel Sparse Matrix Vector Multiplication (PSC 4.3)
Parallel Sparse Matrix Vector Multiplication (PSC 4.) original slides by Rob Bisseling, Universiteit Utrecht, accompanying the textbook Parallel Scientific Computation adapted for the lecture HPC Algorithms
More informationDesigning Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way
EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it
More informationChapter 2. Divide-and-conquer. 2.1 Strassen s algorithm
Chapter 2 Divide-and-conquer This chapter revisits the divide-and-conquer paradigms and explains how to solve recurrences, in particular, with the use of the master theorem. We first illustrate the concept
More informationWord-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator
Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator Chalermpol Saiprasert, Christos-Savvas Bouganis and George A. Constantinides Department of Electrical & Electronic
More informationRandomized Simultaneous Messages: Solution of a Problem of Yao in Communication Complexity
Randomized Simultaneous Messages: Solution of a Problem of Yao in Communication Complexity László Babai Peter G. Kimmel Department of Computer Science The University of Chicago 1100 East 58th Street Chicago,
More informationECE/CS 250 Computer Architecture
ECE/CS 250 Computer Architecture Basics of Logic Design: Boolean Algebra, Logic Gates (Combinational Logic) Tyler Bletsch Duke University Slides are derived from work by Daniel J. Sorin (Duke), Alvy Lebeck
More informationMATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2
MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a
More information