Transposition Mechanism for Sparse Matrices on Vector Processors

Pyrrhos Stathis, Stamatis Vassiliadis, Sorin Cotofana
Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

Abstract: Many scientific applications involve operations on sparse matrices. However, due to irregularities induced by the sparsity patterns, many operations on sparse matrices execute inefficiently on traditional scalar and vector architectures. To tackle this problem a scheme has been proposed consisting of two parts: (a) an extension to a vector architecture to support sparse matrix-vector multiplication using (b) a novel Blocked Based sparse matrix Compression Storage (BBCS) format. Within this context, in this paper we propose and describe a hardware mechanism for the extended vector architecture that performs the transposition A^T of a sparse matrix A stored in a hierarchical variation of the aforementioned compression format. The proposed Sparse matrix Transposition Mechanism (STM) is used as a Functional Unit of a vector processor and requires an s x s word in-processor memory, where s is the vector processor's section size. In this paper we provide a full description of the STM and show an expected performance increase of one order of magnitude.

Keywords: Vector processor, matrix transpose, sparse matrix, functional unit.

I. INTRODUCTION

In many scientific computing areas the manipulation of sparse matrices constitutes the kernel of the solvers. However, the irregularities of the matrix sparsity patterns, i.e. the distribution of the non-zeros within the matrix, make many operations on sparse matrices execute inefficiently on traditional scalar and vector architectures. This problem has been tackled by both software and hardware approaches. Most of the approaches are in software [2], [3], because they are less costly. However, research focused on hardware approaches [4], [5], [6], [7] indicates that much greater improvements can be obtained. In [6] the authors report substantial speedups (depending on the sparsity pattern) when performing sparse matrix-vector multiplication using an Augmented Vector Architecture (AVA) and an associated sparse matrix storage scheme (BBCS), compared to the Jagged Diagonal (JD) method on a conventional vector processor.

The sparse matrix related problem that we address here is that of matrix transposition, i.e. the construction of A^T from a sparse matrix A on a vector processor. The transposition of a sparse matrix cannot be performed efficiently using the instruction set of a traditional vector processor. Therefore, in this paper we propose a mechanism that enables the transposition of a sparse matrix within the context of the aforementioned AVA. The contributions of this paper can be summarized as follows: We propose and describe a novel mechanism, the Sparse matrix Transposition Mechanism (STM), implemented as a functional unit for a vector processor, that can perform the transposition of a sparse matrix stored in a hierarchical sparse matrix storage format. We evaluate the timing properties of the STM and show an expected performance increase of one order of magnitude when compared to a scalar implementation of sparse matrix transposition.

The remainder of the paper is organized as follows: In the next Section we provide some background information on transposition, vector processors and the hierarchical sparse matrix storage format.
In Section III we describe and evaluate the proposed mechanism and finally, in Section IV, we draw some conclusions.

II. BACKGROUND

This section provides some background information and the assumptions made throughout the paper. The transposition of an M x N matrix A is the calculation of the N x M matrix A^T. The operation consists of the exchange of the rows and columns of the matrix; essentially it does not alter the values of the elements, only their positions. For a dense matrix the problem is trivial and can be solved by addressing a row-wise stored matrix with a stride equal to the number of columns (or, vice versa, a column-wise stored matrix with a stride equal to the number of rows). Sparse matrices, however, are usually stored in more complex ways that involve storing the non-zero values together with their positional information [2], [3]. This results in the need for costly sorting algorithms in order to perform the transposition.
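To make the dense case concrete, the following minimal C sketch (our illustration, not part of the paper) reads a row-wise stored M x N matrix with stride N, so that each column is gathered as a contiguous row of the N x M result:

#include <stdio.h>

/* Transpose a row-wise stored m x n matrix by strided addressing:
   element (i, j) of A sits at a[i*n + j], so walking j with fixed i
   has stride 1, and walking i with fixed j has stride n.            */
void dense_transpose(const double *a, double *at, int m, int n)
{
    for (int j = 0; j < n; j++)        /* for each column of A ...    */
        for (int i = 0; i < m; i++)    /* ... gather it with stride n */
            at[j * m + i] = a[i * n + j];
}

int main(void)
{
    double a[2 * 3] = {1, 2, 3, 4, 5, 6};   /* 2 x 3, row-wise */
    double at[3 * 2];
    dense_transpose(a, at, 2, 3);
    for (int k = 0; k < 6; k++)
        printf("%g ", at[k]);               /* prints: 1 4 2 5 3 6 */
    printf("\n");
    return 0;
}

For sparse formats no such fixed stride exists, which is precisely why the hardware mechanism described below is needed.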

With the proposed mechanism we attempt to streamline this operation and make it suitable for a vector processor. The proposed transposition mechanism functions as a Functional Unit of a vector processor.

Vector processors, such as the one depicted in Figure 1, are based on architectures that support the execution of vector instructions. On most current vector architectures [8], the vectors are copied from main memory into vector registers within the processor before they are operated upon. Vector registers are arrays of scalar registers that hold (parts of) the vectors to be processed. Because the vector register length cannot be arbitrarily large, large vectors have to be divided into smaller parts, a technique usually called strip mining; each part cannot be larger than the maximum number of elements a vector register can hold, i.e., the architecturally defined section size s of the vector processor (VP). In a VP the operations are carried out by (usually pipelined) Functional Units (FUs) that are able to fetch one or more new elements per cycle from each of the source vector registers involved, operate on them, and return the result(s) to the result (vector) register.

Fig. 1. Vector Architecture

Before proceeding with the description of the STM functional unit we first give a brief description of the Hierarchical Sparse Matrix storage format (HiSM), the sparse matrix format that we assume for the remainder of the paper and which is a hierarchical variation of the aforementioned BBCS format. To obtain the HiSM representation, an M x N sparse matrix is partitioned into ceil(M/s) x ceil(N/s) square sub-matrices of size s x s, where s is the section size of the vector architecture. Each of these sub-matrices, which we will call s^2-blocks, is then stored separately in memory in the following way: all the non-zero values as well as the corresponding positional information are stored in a row-wise fashion in an array (the s^2-blockarray) in memory. In Figure 2 (bottom left) we can observe how such a blockarray is formed, containing both the position and value data of the top-left s^2-block of the example sparse matrix; the section size in the example is s = 8. Note that the positional data consist of only the column position of each non-zero element within the sub-matrix, plus an extra bit that indicates whether the non-zero element is the last element in its row. This bit is incorporated in the position data. We will not elaborate further on the exact bit-by-bit configuration of the s^2-blockarray. An s^2-blockarray can contain up to s^2 non-zero elements and we will assume that an AVA can operate on these in the same way as described in [1].

These s^2-blockarrays, which describe the non-empty s^2-blocks, form the lowest (zero) level of the hierarchical structure of our format. As can be observed in Figure 2, the non-empty s^2-blocks form a sparsity pattern similar to that of the non-zero values within an s^2-block. Therefore, the next level of the hierarchy, level-1, is formed in exactly the same way as level zero, with the difference that the values of the non-zero elements are replaced by pointers to the s^2-blockarrays in memory that describe the non-empty s^2-blocks. This new array, which contains the pointers to the lower level, is stored in exactly the same fashion in memory (see Figure 2 (bottom right)).
Notice that at level-1 the pointers are stored in a column-wise fashion. In this way an access pattern is provided where the s-element-wide block columns are accessed row-wise. This is favorable for operations such as matrix-vector multiplication (refer to [1] for a more elaborate discussion). The next level, level-2, if there is one (in the example of Figure 2 there is none), is formed in the same way as level-1, with the pointers pointing at the s^2-blockarrays of level-1. Further, as in any hierarchical structure, the higher levels are formed in the same way, and we proceed until we have covered the entire matrix in max(ceil(log_s M), ceil(log_s N)) levels.

We can summarize the description of the Hierarchical Sparse Matrix storage format as follows: The entire matrix is divided hierarchically into blocks of size s x s (called s^2-blocks), with the lowest level containing the actual values of the non-zero elements and the higher levels containing pointers to the non-empty s^2-blocks of one level lower. The s^2-blocks at all levels are represented as an array (called an s^2-blockarray) whose entries are non-zero values (for level-0) or pointers to non-empty lower-level s^2-blockarrays (for all higher levels), along with their corresponding positional information within the block. The formats are identical for all levels.

(Footnote 1: The careful reader will notice that when there are empty rows within an s^2-block this format will not suffice. We have incorporated this detail in our format in the same way as in [1]; however, being of no further consequence to what is discussed in the remainder of the paper, we omit a detailed description for simplicity.)
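Since the exact bit-by-bit configuration of the s^2-blockarray is deliberately left open above, the following C sketch shows just one plausible encoding of a blockarray entry, together with the level-count formula; all names and field widths here are our assumptions, not the paper's:

#include <stdint.h>

#define S 8                       /* section size s (8 in the example) */

/* One s^2-blockarray entry: column position, last-in-row flag, and a
   payload that is a value at level-0 or a child pointer higher up.   */
typedef struct {
    uint8_t col;                  /* column position within the block  */
    uint8_t last_in_row;          /* 1 if last non-zero of its row     */
    union {
        double value;             /* level-0 payload                   */
        void  *child;             /* level >= 1: pointer to the block-
                                     array one level lower             */
    } payload;
} hism_entry;

/* Number of hierarchy levels for an M x N matrix:
   max(ceil(log_s M), ceil(log_s N)).                                  */
int hism_levels(int m, int n)
{
    int dim = m > n ? m : n;      /* the larger dimension dominates    */
    int levels = 0;
    for (long span = 1; span < dim; span *= S)
        levels++;                 /* one level per factor of s         */
    return levels;
}

For a 64 x 64 matrix with s = 8, hism_levels returns 2, matching the two levels (level-0 and level-1) of the example in Figure 2.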

Fig. 2. Example of the Hierarchical Sparse Matrix Storage Format (level-0 value s^2-blockarray and level-1 pointer s^2-blockarray, each holding positional data plus value or pointer data, for 8x8 sub-matrices)

III. THE TRANSPOSITION MECHANISM

As mentioned previously, the proposed Sparse matrix Transposition Mechanism (STM) is implemented as a functional unit of a vector processor. The STM for a section size s is depicted in Figure 3.

Fig. 3. The Sparse matrix Transposition Mechanism (STM)

The main part of the unit consists of the SxS-memory, which is used to store one s^2-block of the hierarchically stored matrix. The mechanism can transpose one s^2-block at a time. The procedure is as follows: First, the s^2-block is stored in the SxS-memory one section at a time. When the complete s^2-block is stored, it is read back from the SxS-memory in the fashion transposed to the one used for storing, i.e. row-wise if stored column-wise and vice versa. More specifically: assume that a part of an s^2-block is stored in a vector register R. The contents of R can be stored in the SxS-memory via the column-wise I/O-buffer. The depth of this buffer defines how many elements can at most be stored per clock cycle; we will call this the SxS-memory bandwidth q, which in the case of Figure 3 is q = 4. At each cycle the I/O-buffer is filled with up to q non-zero elements of the same column along with their corresponding row positions. In the next cycle, the row positions are used by the Non-zero Locator unit to store the non-zero values at the correct row positions in the column-buffer. The non-zero indicators at the corresponding cells of the buffer are then set accordingly, to indicate a non-zero or a zero value. This process is repeated until there are no more non-zero elements left for the current column. Subsequently, the entire column-buffer is copied into the SxS-memory using the column position information (not shown in Figure 3).
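As a rough software analogue of this write phase (a sketch under assumed names such as sxs_mem and write_column, not a description of the actual hardware), the following C fragment scatters one column's non-zeros into a column buffer by row position, q of them per step, and then commits the whole buffer into the s x s memory:

#define S 8                           /* section size                 */
#define Q 4                           /* I/O-buffer depth (bandwidth) */

static double sxs_mem[S][S];          /* the SxS storage cells        */
static int    nz_ind[S][S];           /* non-zero indicator bits      */

/* Store the n non-zeros of one column, given as (value, row) pairs,
   into column `col` of the SxS-memory.                               */
void write_column(int col, const double *vals, const int *rows, int n)
{
    double colbuf[S] = {0};           /* the column-buffer            */
    int    ind[S]    = {0};           /* its non-zero indicators      */

    /* fill the column-buffer, at most Q elements per "cycle"         */
    for (int base = 0; base < n; base += Q)
        for (int k = base; k < n && k < base + Q; k++) {
            colbuf[rows[k]] = vals[k];   /* place at its row position */
            ind[rows[k]]    = 1;
        }

    /* then copy the entire buffer into the SxS-memory at column col  */
    for (int r = 0; r < S; r++) {
        sxs_mem[r][col] = colbuf[r];
        nz_ind[r][col]  = ind[r];
    }
}

Reading the SxS-memory row by row afterwards, skipping cells whose nz_ind is 0 (the job of the Non-zero Locator in hardware), emits the same data in the transposed order.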

We can now read the transpose of the s^2-block by reversing the order used for storing, at the row-wise section of the STM. Column by column (each column of the transposed block being a row of the SxS-memory), the s^2-block is moved into the row-buffer. There, using the Non-zero Locator, the non-zero values and their column positions are copied into the I/O-buffer (maximally q at a time) and then stored into a register in the register file. When reading the SxS-memory, however, the working of the Non-zero Locator is not trivial, so we describe it in further detail. The Non-zero Locator is graphically depicted in Figure 4. The function of this circuit is to extract from a string of input bits (the non-zero indicators) the positions of the first q ones. When there are more than q non-zero elements, the located non-zeros are cleared (not depicted in Figure 4) and the process is repeated in order to locate the following q non-zero elements. When there are fewer than q non-zero elements, one or more of the counters will produce an overflow. This overflow indicates to the control logic that a new row or column needs to be fetched from the SxS-memory.

Fig. 4. The Non-zero Locator

As we have mentioned, the STM can only transpose a single s^2-block. However, because of the similar structure of the HiSM at all hierarchy levels, we can apply the same transposition mechanism at all levels in order to achieve the transposition of the entire matrix; Figure 5 graphically illustrates this principle, and a software sketch is given below. Observe that when the matrix is transposed, every s^2-block is also transposed. Additionally, one level higher, at level-1, the position of the non-empty blocks (depicted darker in the figure) is also transposed. This means that if we can transpose an s^2-blockarray at level-0, we can apply the same algorithm to the s^2-blockarrays at all levels to transpose the entire matrix.

Fig. 5. Matrix Transposition
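To make the hierarchical argument concrete, here is a small recursive C sketch (our own illustration with hypothetical types; in the actual design the per-block work is done by the STM hardware): transposing the whole matrix amounts to transposing every s^2-blockarray and recursing into the children of the pointer levels.

/* Simplified, COO-like view of a blockarray: parallel arrays of row
   and column positions plus either values (level-0) or child block
   pointers (level >= 1). Types and names are hypothetical.          */
typedef struct block block;
struct block {
    int level;                 /* 0: value block; >= 1: pointer block */
    int n;                     /* number of non-zero entries          */
    int *row, *col;            /* positions within the s^2-block      */
    union {
        double *val;           /* level-0 payload                     */
        block **child;         /* higher-level payload                */
    } data;
};

void hism_transpose(block *b)
{
    /* transpose this block: exchange row and column positions (the
       STM also re-emits the entries in the transposed storage order,
       which we omit here)                                            */
    for (int k = 0; k < b->n; k++) {
        int t = b->row[k];
        b->row[k] = b->col[k];
        b->col[k] = t;
    }
    /* above level-0, every payload is itself a block: recurse        */
    if (b->level > 0)
        for (int k = 0; k < b->n; k++)
            hism_transpose(b->data.child[k]);
}

Calling hism_transpose on the top-level blockarray transposes the positions at every level and every leaf block, which is exactly the principle of Figure 5.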

A. Timing Evaluation

In this section we provide performance estimations for the proposed mechanism. Because the SxS-memory has to be filled before it can be read back, the STM unit cannot be fully pipelined. However, the write and read phases can each be pipelined separately, in three stages. This means that 3 cycles are required for the last elements to enter the SxS-memory and, similarly, 3 cycles are needed for the last results to be returned to the vector register. The result is a functional unit with a latency of 2(n/T) + 6 cycles, where n is the number of non-zero elements in the s^2-block and T is the throughput of the I/O-buffer, i.e. the average number of elements in the I/O-buffer per cycle. The throughput T varies from 1 to q, where q is the previously mentioned STM bandwidth, equal to the depth of the I/O-buffer. The precise value of T, and thus the performance of the STM, depends on the sparsity pattern of the matrix to be transposed. Therefore we provide only the best- and worst-case scenarios to evaluate the performance:

Best case (T = q): 2(n/q) + 6 cycles
Worst case (T = 1): 2n + 6 cycles

To perform the same operation on a scalar machine we would need a sorting loop of, on average, n log n iterations. The operations within this loop are highly dependent and unpredictable, and therefore little advantage can be expected from ILP techniques such as pipelining, dependence checking and branch prediction. This results in a sustained execution time of several cycles per iteration. Compared to our scheme, an improvement of one order of magnitude can be expected.

IV. CONCLUSIONS

In this paper we have proposed and described a novel mechanism, the Sparse matrix Transposition Mechanism (STM), implemented as a functional unit for a vector processor, that can perform the transposition of a sparse matrix stored in a hierarchical sparse matrix storage format. We have evaluated the timing properties of the STM and shown an expected performance increase of one order of magnitude when compared to a scalar implementation of sparse matrix transposition.

REFERENCES

[1] S. Vassiliadis, S. Cotofana, and P. Stathis, "Vector ISA extension for sparse matrix multiplication," in EuroPar'99 Parallel Processing, Lecture Notes in Computer Science, vol. 1685, Springer-Verlag, 1999.
[2] V. Eijkhout, "LAPACK Working Note 50: Distributed sparse data structures for linear algebra operations," Tech. Rep., Department of Computer Science, University of Tennessee, Sept. 1992.
[3] Y. Saad, "SPARSKIT: A basic tool kit for sparse matrix computations," Tech. Rep., Computer Science Department, University of Minnesota, Minneapolis, MN 55455, June 1994, Version 2.
[4] H. Amano, T. Boku, T. Kudoh, and H. Aiso, "(SM)^2-II: A new version of the sparse matrix solving machine," in Proceedings of the 12th Annual International Symposium on Computer Architecture, Boston, Massachusetts, June 1985, IEEE Computer Society TCA and ACM SIGARCH.
[5] V. E. Taylor, A. Ranade, and D. G. Messerschmitt, "SPAR: A new architecture for large finite element computations," IEEE Transactions on Computers, vol. 44, no. 4, April 1995.
[6] P. Stathis, S. Vassiliadis, and S. Cotofana, "Sparse matrix vector multiplication evaluation using the BBCS scheme," to appear in Proceedings of the 8th Panhellenic Conference on Informatics (PCI), Nov. 2001.
[7] A. Wolfe, M. Breternitz Jr., C. Stephens, A. L. Ting, D. B. Kirk, R. P. Bianchini Jr., and J. P. Shen, "The White Dwarf: A high-performance application-specific processor," in Proceedings of the 15th Annual International Symposium on Computer Architecture, H. J. Siegel, Ed., Honolulu, Hawaii, May-June 1988, IEEE Computer Society Press.
[8] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, California, 1990.
