Modeling Computation and Communication Performance of Parallel Scientific Applications: A Case Study of the IBM SP2


Eric L. Boyd, Gheith A. Abandah, Hsien-Hsin Lee, and Edward S. Davidson
Advanced Computer Architecture Laboratory, Department of Electrical Engineering and Computer Science, University of Michigan, 1301 Beal Avenue, Ann Arbor, MI
{boyd, gabandah, linear, davidson}@eecs.umich.edu

Abstract
A methodology for performance analysis of Massively Parallel Processors (MPPs) is presented. The IBM SP2 and some key routines of a finite element method application (FEMC) are used as a case study. A hierarchy of lower bounds on run time is developed for the POWER2 processor, using the MACS methodology developed in earlier work for uniprocessors and vector processors. Significantly, this hierarchy is extended to incorporate the effects of the memory hierarchy of each SP2 node and communication across the High Performance Switch (HPS) linking the nodes of the SP2. The performance models developed via this methodology facilitate explaining performance, identifying performance bottlenecks, and guiding application code improvements.

1. Introduction
Scientific applications are typically dominated by loop code, floating point operations, and array references. The performance of such applications on scalar and vector uniprocessors has been found to be well characterized by the MACS model, a hierarchical series of performance bounds. As depicted in Figure 1, the M bound models the peak floating point performance of the Machine architecture, independent of the application. The MA bound models the machine in conjunction with the operations deemed to be essential in the high level Application workload. Hence Gap A represents the performance degradation due to the application algorithm's need for operations that are not entirely masked by the most efficient class of floating point operations. The MAC bound is derived from the actual Compiler generated workload. Hence Gap C represents the performance degradation due to the additional operations introduced by the compiler. The MACS bound factors in the compiler generated Schedule for the workload. Gap S thus represents the performance degradation due to scheduling constraints. The performance degradation seen in the remaining gap, Gap P, is due to as yet unmodeled effects, such as unmodeled cache miss penalties, load imbalances, OS interrupts, and I/O, that affect actual delivered performance. The bounds are generally expressed as lower bounds on run time and, along with measured run time, are given in units of CPF (clocks per essential floating point operation). The reciprocal of CPF times the clock rate in MHz yields an upper bound on performance in MFLOPS. The MACS model has been effectively demonstrated and applied in varying amounts of detail to a wide variety of processors. [1][2][3][4][5][6] The MACS model developed for each SP2 node incorporates a significant additional refinement beyond these earlier studies by including the effects of the memory hierarchy.

Figure 1: MACS Performance Bound Hierarchy (Gap A lies between the M and MA bounds, Gap C between MA and MAC, Gap S between MAC and MACS, and Gap P between MACS and the measured CPF)

Scientific applications often exhibit a large degree of potential parallelism in their algorithms, and hence are prime candidates for execution on MPPs. Gap P can become very large and poses the fundamental limit to scalability for parallel applications. Characterizing the communication performance of an MPP and the communication requirements of an application achieves a refinement of Gap P that is crucial to modeling and improving the performance of parallel scientific applications that exhibit moderate to high amounts of communication relative to computation. We demonstrate that coupling the MACS bounds hierarchy, which models the computation of an individual node, with models of communication, as well as load balancing and cache effects, to refine Gap P enables the effective modeling of the performance of scientific applications on a parallel computer such as the IBM SP2. This refinement is begun in this paper with the introduction of a communication model for the SP2. The MACS model is applied to the SP2 and integrated with this communication model in a case study of a commercial scientific application. This methodology could easily be extended to other message passing MPPs, such as the Intel Paragon and the Thinking Machines CM5. [7][8] We believe that it can also be extended to shared memory MPPs, such as the Kendall Square Research KSR2, the Convex Exemplar, and the Cray T3D [11][12][13][14], by using techniques to expose and characterize the implicit communication, as demonstrated in [6][9][10].

All experiments in this paper were run on an IBM SP2 with 32 Thin Node 66 POWER2 processors running the AIX operating system. An overview of the SP2 architecture is given in Section 2. The single node SP2 (POWER2) MACS model is detailed in Section 3, with extensions to include the effects of the memory hierarchy. Section 4 presents the communication model for the IBM SP2. An industrial structural finite element modeling code, FEMC, is used to demonstrate the methodology. This application is being ported from a vector supercomputer, parallelized, and tuned for the IBM SP2 at the University of Michigan, and hence represents one scenario of performance modeling. Three performance limiting FEMC routines are examined. These routines exhibit low, moderate, and high communication to computation ratios and various patterns of communication. Section 5 gives a brief overview of the application routines and discusses the performance modeling results. The performance models developed via this methodology facilitate explaining performance, identifying performance bottlenecks, and guiding application code improvements.
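To make the CPF metric concrete, the sketch below (ours, not from the original paper; the non-M bound values are illustrative placeholders) converts a set of CPF lower bounds on run time into the corresponding MFLOPS upper bounds for a 66.7 MHz POWER2 node, using the relation stated above: delivered MFLOPS is at most the clock rate in MHz divided by the CPF bound.

    # Illustrative sketch: converting CPF lower bounds on run time into
    # MFLOPS upper bounds on performance (MFLOPS <= MHz / CPF).
    # Only the M bound of 0.25 CPF comes from the text; the others are placeholders.
    CLOCK_MHZ = 66.7  # Thin Node POWER2 clock rate

    def mflops_upper_bound(cpf):
        """Upper bound on delivered MFLOPS implied by a CPF lower bound."""
        return CLOCK_MHZ / cpf

    example_bounds = {"M": 0.25, "MA": 0.60, "MAC": 0.85, "MACS": 1.10}  # hypothetical
    for name, cpf in example_bounds.items():
        print(f"{name:4s} bound: {cpf:5.2f} CPF -> at most {mflops_upper_bound(cpf):6.1f} MFLOPS")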

2. IBM SP2 Architectural Analysis
A typical IBM SP2 contains between 4 and 128 nodes connected by a High Performance Switch communication interconnect; bigger configurations are possible. Each node consists of a POWER or POWER2 processor and an SP2 communication adapter. There are three types of POWER2 nodes currently available: Thin Nodes, Thin 2 Nodes, and Wide Nodes. Thin Nodes have a 64 Kbyte data cache, Thin 2 Nodes have a 128 Kbyte data cache, and Wide Nodes have a 256 Kbyte data cache. Thin 2 Nodes and Wide Nodes can do quadword data accesses, copying two adjacent double words into two adjacent floating point registers from the primary cache in one cycle.

2.1. POWER2 Architecture [15]
Each Thin Node POWER2 operates at a clock speed of 66.7 MHz, corresponding to a 15 ns processor clock cycle time. The POWER2 processor is subdivided into an Instruction Cache Unit (ICU), a Data Cache Unit (DCU), a Fixed Point Unit (FXU), and a Floating Point Unit (FPU), as shown in Figure 2. The POWER2 processor also includes a Storage Control Unit (SCU) and two I/O Units (XIO), neither of which is shown or described here. Thin Nodes and Thin 2 Nodes also include an optional secondary cache (1 or 2 Mbytes, respectively), memory (64 Mbytes to 512 Mbytes), and a communication adapter. All experiments in this paper are run on Thin Nodes with no secondary cache and a 256 Mbyte memory.

The ICU can fetch up to eight instructions per clock cycle from the instruction cache (I-Cache). It can dispatch up to six instructions per cycle through the dispatch unit, two of which are reserved for branch and compare instructions. Two independent branch processors within the ICU can each resolve one branch per cycle. Most of the branch operation penalties can be masked by resolving them concurrently with FXU and FPU instruction execution. The instruction cache is 32 Kbytes with 128 byte cache lines and a one cycle access time.

The FXU decodes and executes all memory references, integer operations, and logical operations. It includes address translation, data protection, and data cache directories for load/store instructions. There are two fixed point execution units, which are fed by their own instruction decode unit and maintain a copy of the general purpose register file. Two fixed point instructions can be executed per cycle by the two execution units. Each unit contains an adder and a logic functional unit providing addition, subtraction, and Boolean operations. One unit can also execute special operations such as cache operations and privileged operations, while the other unit can perform fixed point multiply and divide operations.

The FPU consists of three units: a double precision arithmetic unit, a load unit, and a store unit. Each unit has dual pipelines and hence can execute up to two instructions per cycle. The peak issue rate from the FPU instruction buffers to these units is thus two double precision floating point multiply adds, two floating point loads, and two floating point stores per clock.

The DCU includes a four way set associative multiport write back cache. The experiments in this paper were performed on a configuration with a 64 Kbyte data cache with 64 byte cache lines. The DCU also provides several buffers for cache and direct memory access (DMA) operations as well as error detection/correction and bit steering for all data sent to and received from memory. The DCU has a one cycle access time. The miss penalty to memory is determined experimentally to be between 16 and 21 processor clock cycles.

Figure 2: POWER2 Processor Architecture

2.2. Interconnect Architecture [16][17][18]
The SP2 interconnect, termed the High Performance Switch (HPS), is designed to minimize the average latency of message transmissions while allowing the aggregate bandwidth to scale linearly with the number of nodes. The HPS is a bidirectional multistage interconnect (MIN), logically composed of switching elements.

It operates at 40 MHz, providing a peak bandwidth of 40 Mbytes/sec in each direction over each full duplex communication link. Each message packet includes routing information that enables each switching element to determine on the fly the next destination for the packet. Buffered wormhole routing is employed for switch flow control. Unlike standard wormhole routing algorithms, if a message is blocked within a switching element, it is temporarily buffered in dynamically allocated shared memory. Each switching element is physically an eight input, eight output device, wired as a bidirectional 4x4 element. Nodes of the SP2 system are grouped into frames; each frame consists of a switch board, as shown in Figure 3, and 16 nodes. Each frame incorporates 8 switching elements in two stages of four elements each, plus 8 shadow switching elements to provide a redundant check. For systems of up to eighty processors, the 16 links of a frame on the far side of the second switch stage are connected to the far side links of 1 to 4 other frames. SP2 systems with 81 to 128 nodes use intermediate switch boards between frames. Messages to be sent between nodes on the SP2 are broken into discrete packets, each containing the required routing information. The smallest unit on which flow control is performed on the SP2, a flit, is one byte wide, corresponding to the width of an output port of the HPS switch elements. Packets vary in length up to 255 flits. Each packet is of length r + n + 1, where the first flit contains the packet length, the next r flits contain the routing information, and the last n flits contain data and error checking bytes. On a 32 node system, r = 1 or 2. Experiments show that the first packet can contain up to 216 bytes (or flits) of useful data. Additional packets can contain up to 232 bytes (or flits) of useful data.
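A small sketch (ours, not the paper's) of the packetization arithmetic just described; the per-packet payload capacities are the figures quoted in the text and should be treated as approximate.

    # Sketch of SP2 HPS packetization: the first packet carries up to 216 bytes of
    # useful data and each additional packet up to 232 bytes (figures from the text).
    FIRST_PACKET_DATA = 216
    NEXT_PACKET_DATA = 232

    def packets_needed(message_bytes):
        """Number of HPS packets needed to carry a message of the given size."""
        if message_bytes <= FIRST_PACKET_DATA:
            return 1
        remaining = message_bytes - FIRST_PACKET_DATA
        # ceiling division for the additional packets
        return 1 + (remaining + NEXT_PACKET_DATA - 1) // NEXT_PACKET_DATA

    for n in (64, 216, 217, 2048, 65536):
        print(f"{n:6d} byte message -> {packets_needed(n)} packet(s)")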

Figure 3: SP2 Switch Board (two stages of switching elements, with connections to the 16 frame nodes on one side and to other frames on the other)

3. IBM SP2 MACS Model
A lower bound on the computation run time of floating point applications on the IBM SP2 is given by the MACS model developed for the POWER2 processor. Only the M bound, MA bound, and MAC bound are presented below. The MACS bound is not developed because the POWER2 architecture is implemented with extensive buffering, multiple issue, and out of order execution among its functional pipes; hence the schedule is modified dynamically at run time and schedule stalls are greatly reduced. Thus a realistic MACS bound is hard to develop and would likely be very close to the MAC bound.

3.1. M Bound Equations
The M bound models the peak performance of the POWER2 FPU, independent of the application requirements, the compiler workload, or the scheduler. The M bound for the POWER2 is 0.25 CPF since the POWER2 can compute at most four floating point operations per cycle (one multiply add issued to each of 2 pipelines).

3.2. MA Bound Equations
The MA bound models the peak application performance of the POWER2 architecture given the visible workload of the high level application code. The POWER2 is modeled as five independent functional units: floating point unit, fixed point unit, instruction issue unit, memory unit, and a dependence pseudo unit. The MA bound, t_MA, is calculated as the maximum CPF bound among the five independent functional units:

    t_MA = MAX(t_fl, t_fx, t_m, t_i, t_d) / (f_a + f_m + 2*f_ma + 4*f_div + 4*f_sqrt)   (1)

The CPF bound for each functional unit is calculated as a function of the number of essential operations that must be performed by that functional unit per simulation time step in the high level source code. The bound assumes that each functional unit need execute only these essential operations, and that it can execute them at its peak rate. The compiler is idealized in that no nonessential operations are considered; the scheduler is idealized in that no stalls due to resource constraints and/or schedule dependences are considered. Equation (1) bounds run time by assuming that the busiest functional unit is kept continuously busy. The set of essential arithmetic operations includes the minimum number of floating point assembly operations necessary to complete a computation (including floating point additions and subtractions, f_a; floating point multiplications, f_m; triad operations which do both, f_ma; division operations, f_div; and square root operations, f_sqrt). Both division and square root operations are weighted by a factor of four, as is commonly done for the Lawrence Livermore Fortran Kernels [19] and other benchmarks. The number of essential floating point loads, l_fl, equals the number of distinct values that appear on the right hand side (RHS) of a high level code statement before they appear on the left hand side (LHS) of a high level code statement. The number of essential floating point stores, s_fl, equals the number of distinct values that appear on the LHS of a high level code statement that are neither temporary values nor scalars which should spend their lifetime in registers.
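As an illustration of these counting rules, consider a hypothetical DAXPY-like statement, y(i) = a * x(i) + y(i), executed once per loop iteration; the tally below is our own example, not drawn from FEMC, and simply applies the definitions above.

    # Hypothetical example of counting essential operations for one iteration of
    # y(i) = a * x(i) + y(i): one multiply-add, two essential loads (x(i) and y(i);
    # the scalar a stays in a register), and one essential store (y(i)).
    f_a, f_m, f_ma, f_div, f_sqrt = 0, 0, 1, 0, 0
    l_fl, s_fl = 2, 1

    # Denominator of Equation (1): essential floating point operations, with a
    # multiply-add counted as two and divide/sqrt weighted by four.
    essential_flops = f_a + f_m + 2 * f_ma + 4 * f_div + 4 * f_sqrt
    print("essential flops per iteration:", essential_flops)   # -> 2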

For Thin 2 Nodes or Wide Nodes, l_fl and s_fl should be divided by two if the stride equals one, since it is possible to employ quadword loads and stores. The floating point functional unit bound, t_fl, models the time needed in the FPU to execute the essential arithmetic and memory floating point instructions. Since the POWER2 contains three dual pipelined execution units in the FPU (one for arithmetic operations, one for loading data, and one for normalizing store data):

    t_fl = MAX((f_a + f_m + f_ma + 17*f_div + 27*f_sqrt) / 2, s_fl / 2, l_fl / 2)   (2)

Floating point divide operations require the FPU for 17 cycles and square root operations require 27. The model's instruction issue functional unit bound, t_i, models the IBM SP2 instruction dispatch bandwidth of the ICU. This unit can dispatch four floating point arithmetic or memory reference instructions per cycle. The POWER2 can also dispatch branch and condition register instructions concurrently with the above, or other fixed-point instructions in place of memory operations; however, since in scientific loop-dominated code they have negligible or no impact on POWER2 performance, these instructions are not included in the MA bound.

    t_i = (f_a + f_m + f_ma + f_div + f_sqrt + l_fl + s_fl) / 4   (3)

Since non-floating point operations are assumed to have a negligible impact on the performance of scientific applications, the fixed point functional unit bound, t_fx, models only the impact of the fixed point unit on floating point operations. An address calculation is required for each floating point memory operation. Since the FXU in the POWER2 architecture can begin two floating point memory operations (either loads or stores) per cycle:

    t_fx = (l_fl + s_fl) / 2   (4)

In previously published implementations of the MACS model, the memory hierarchy unit bound, t_m, has ignored the instruction cache entirely and has assumed that all data accesses hit in the data cache. As a result, the memory unit models developed for other architectures have focused on data cache port bottlenecks only. Since the routines of interest modeled in Section 5 have working sets that exceed the size of the data cache, a more sophisticated model of memory is required to explain a significant fraction of run time. Two ports connect the floating point unit with the primary data cache in the POWER2 processor, and a single port connects the data cache to memory. A memory hierarchy unit lower bound on run time is as follows:

    t_m = MAX((load miss time + store miss time), (l_fl * L_eff + s_fl * S_eff) / 2)   (5)

The first term in Equation (5) models the single port between the data cache and memory. In applications with a high miss rate, multiple cache misses may occur at the same time, but only one can be serviced by the memory at a time due to this single port bottleneck. The second term in Equation (5) models the two ports between the FPU and the data cache. In applications with a low miss rate, there is typically only a single outstanding cache miss at any one time. While the memory services the miss, the other data cache port can continue to service a single memory access per cycle. The effective number of cycles per floating point load, L_eff, and per floating point store, S_eff, is calculated as a function of the number of essential misses. Note that t_m >= t_fx, and t_m = t_fx only if all memory access operations hit in the data cache.

A key issue in evaluating Equation (5) is developing a lower bound on the effective access time of floating point loads and stores, given a high level application. For a memory system of n levels, the Effective Access Time, T_eff, is:

    T_eff = SUM over j = 1 to n of (f_j * t_j)   (6)

where f_j is the Access Frequency for level j, and t_j is the Access Time for that level. The access frequency can be calculated as follows:

    f_j = m_1 * m_2 * ... * m_(j-1) * (1 - m_j)   (7)

where m_j is the miss ratio in the jth level of the memory hierarchy. SP2 systems have a primary cache, an optional secondary cache, and main memory, hence Equation (6) can be rewritten:

    T_eff = (1 - m_1)*t_1 + m_1*(1 - m_2)*t_2 + m_1*m_2*(1 - m_3)*t_3 = (1 - m_1)*t_1 + m_1*t_3   (8)

assuming no secondary cache (m_2 = 1) and ignoring page faults (m_3 = 0). Experimental calibration loops show that t_1l = t_1s = 1, t_3l = 16.3, and t_3s = 22.0, hence:

    L_eff = 1 + 15.3*m_l        S_eff = 1 + 21.0*m_s   (9)

Thus for our SP2 system, and later experiments, we use:

    t_m = MAX(l_fl*(16.3*m_l) + s_fl*(22.0*m_s), (l_fl*(1 + 15.3*m_l) + s_fl*(1 + 21.0*m_s)) / 2)   (10)
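The memory hierarchy bound of Equations (9) and (10) can be evaluated in a few lines of code; the sketch below is ours and uses the calibration values quoted above for a Thin Node with no secondary cache. The workload figures passed in at the end are hypothetical.

    # Sketch of the memory hierarchy bound, Equations (9)-(10), using the calibrated
    # Thin Node access times quoted in the text: 1 cycle on a data cache hit,
    # 16.3 cycles per load miss and 22.0 cycles per store miss to memory.
    def memory_bound(l_fl, s_fl, m_load, m_store):
        """Lower bound t_m (in cycles) on time spent in the memory hierarchy."""
        L_eff = 1.0 + 15.3 * m_load        # Equation (9)
        S_eff = 1.0 + 21.0 * m_store
        port_to_memory = l_fl * 16.3 * m_load + s_fl * 22.0 * m_store
        ports_to_cache = (l_fl * L_eff + s_fl * S_eff) / 2.0
        return max(port_to_memory, ports_to_cache)   # Equation (10)

    # Hypothetical workload: 600 loads, 200 stores, 5% load and store miss ratios.
    print(memory_bound(600, 200, 0.05, 0.05))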

Essential misses are classified as either compulsory (m_comp) or capacity (m_cap). Hence a lower bound on the miss ratio, m, can be calculated as follows:

    m >= (m_comp + m_cap) / (Number of Essential Accesses)   (11)

A lower bound on the number of compulsory misses equals the number of blocks in the Working Set (B), hence:

    m_comp = B   (12)

Capacity misses depend on the number of Working Set blocks (B), the number of Cache blocks (C), the Cache Degree of Associativity (A), and the Access Pattern. Assuming a linear access pattern, as is often found in loops, a lower bound on capacity misses, m_cap, can be calculated as follows:

    m_cap = 0                             for B <= C                  (cache region)
    m_cap = B * A * (B - C) / (D * C)     for C < B <= C*(1 + D/A)    (transition region)
    m_cap = B                             for B > C*(1 + D/A)         (memory region)   (13)

D is the number of degrees of freedom in the access pattern, which equals the number of distinct arrays and is restricted to be between 1 and A. The miss ratio is linear in the transition region [2], and m_cap equals the product of B and the miss ratio.

The loop carried dependence pseudo unit bound, t_d, models the performance of loops with a recurrence, i.e. a result of one iteration depends on the corresponding result of a previous iteration. Whenever there is such a cycle in the dependence graph of the floating point arithmetic operations, t_d is computed as the worst-case recurrence cycle, where each recurrence cycle is calculated as the total latency of the operations in one tour divided by the number of iterations in that recurrence cycle. The latency of an operation is related to pipeline depth and is computed as the minimum number of clocks between issuing that operation and issuing a succeeding operation that uses its result as an operand, hence:

    L_r = total latency of the loop-carried dependence in recurrence cycle r   (14)
    I_r = number of iterations in recurrence cycle r   (15)
    t_d = MAX over recurrence cycles r of (L_r / I_r)   (16)

The latency is 1 cycle for floating point adds and multiplies, 2 for floating point multiply adds, 17 for floating point divides, and 27 for floating point square roots.

3.3. MAC Bound Equations
The MAC bound for the POWER2 architecture is computed similarly to the MA bound, except that the operation counts are computed from the compiled assembly code. In the MAC bound equations, counts of the various types of operations are marked with primes to indicate that they represent the number of operations found in the compiled code, not the minimum number of essential operations needed in the high level code. Furthermore, all compiled operations are counted in the MAC model, including fixed point and branch instructions. The fixed point unit can begin two fixed point operations per cycle, but at most one fixed point multiplication or division per cycle. The branch functional unit, which can execute two branch instructions per cycle, is added for the MAC model. L_eff' and S_eff' are calculated as for the MA model, but the miss ratio is computed using the number of actual misses divided by the actual accesses. For our SP2, t_m' is calculated as in Equation (10), using the MAC parameter values.

    n_FPU = number of FPU computation instructions = f_a' + f_m' + f_ma' + 17*f_div' + 27*f_sqrt' + others   (17)
    t_fl' = MAX(n_FPU / 2, l_fl' / 2, s_fl' / 2)   (18)
    n_FXU = number of FXU instructions = l_fl' + s_fl' + others   (19)
    n_FXMD = number of fixed point multiplication and division instructions   (20)
    n_BC = number of branch and compare instructions   (21)
    t_fx' = MAX(n_FXU / 2, n_FXMD)   (22)
    t_m' = MAX((load miss time + store miss time), ((l_fx' + l_fl')*L_eff' + (s_fx' + s_fl')*S_eff') / 2)   (23)
    t_b' = n_BC / 2   (24)
    t_i' = (compiled code length - n_BC) / 4   (25)
    t_d' = MAX over recurrence cycles r of (L_r' / I_r')   (26)
    t_MAC = MAX(t_fl', t_fx', t_m', t_b', t_i', t_d') / (f_a + f_m + 2*f_ma + 4*f_div + 4*f_sqrt)   (27)
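A compact sketch (ours) of how the unit bounds are combined: it evaluates the loop carried dependence bound of Equations (14)-(16) with the latencies stated above and then forms the bound of Equation (27). All operation counts and unit bounds below are hypothetical.

    # Sketch of the dependence bound (Equations (14)-(16)) and the final combination
    # (Equation (27)). Latencies: 1 cycle for add/multiply, 2 for multiply-add,
    # 17 for divide, 27 for square root (from the text).
    LATENCY = {"add": 1, "mul": 1, "ma": 2, "div": 17, "sqrt": 27}

    def dependence_bound(recurrence_cycles):
        """t_d = max over recurrence cycles of (total latency / iterations spanned)."""
        return max(sum(LATENCY[op] for op in ops) / iters
                   for ops, iters in recurrence_cycles)

    # Hypothetical loop with one recurrence: an add feeding a multiply-add,
    # closing on itself every iteration.
    t_d = dependence_bound([(("add", "ma"), 1)])

    # Hypothetical per-iteration unit bounds (in cycles) and essential flop count.
    unit_bounds = {"fl": 3.0, "fx": 2.0, "m": 4.5, "b": 0.5, "i": 2.5, "d": t_d}
    essential_flops = 2.0 + 4.0   # e.g. one multiply-add (counted as 2) plus one divide (4)
    t_bound = max(unit_bounds.values()) / essential_flops   # Equation (27), in CPF
    print(f"t_d = {t_d:.1f} cycles/iteration, combined bound = {t_bound:.3f} CPF")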

3.4. Automatic MAC Bound Generator
The Automatic MAC Bound Generator is a single pass, forward-scanning tool which generates the parameters used in the MAC bound. [6] It reads a designated region of interest of the IBM POWER2 assembly code as its input and reports statistics for each loop. Reported statistics include the nesting relationship for each loop and the respective values of t_fl', t_fx', t_b', and t_i'. This tool accepts perfectly nested loops and most imperfectly nested loops (forward branches inside the loop body are allowed) as input. In general, all nested loops written in well-structured programming styles are accepted by this tool. Statistics for outer loops report only on code not contained in the inner loops, i.e. the residue code; statistics for code spanned by forward branches are reported separately. Forward branches within loops often complicate the computation of the bounds; however, the tool can recognize and handle such branches appropriately using weights derived from profiling. These statistics are then combined in a weighted average, with the weights determined by standard basic block profiling.

4. IBM SP2 Communication Model
Scientific applications executed on MPPs tend to exhibit a large Gap P due to communication overhead, nonessential cache stalls, load imbalance, other unmodeled effects, and system level phenomena. Extending the parameter based hierarchical performance bounds modeling methodology into Gap P can begin with developing performance models for communication as a function of simple performance parameters. One such estimation approach has been applied to a wide variety of parallel machines with good results: simple calibration loops were developed to measure latency and bandwidth for internode communication. [20] Let r be the asymptotic transfer rate of a communication interconnect in units of megabytes/second, let n be the message length in bytes, and let t_o be the (asymptotic) zero message length latency in microseconds. This suggests the following model of communication latency:

    T_comm(n) = t_o + n / r   (28)

As shown in Section 4.1, Equation (28) works well in characterizing point to point communication. For more complex communication patterns such as one to many, many to one, many to many, combine, and broadcast, both t_o and r are found to be functions of the number of processors, p. We define t_comm(p) as the setup time and π_comm(p) as the transfer time per byte. This suggests the following model of communication latency:

    T_comm(n, p) = t_comm(p) + π_comm(p) * n   (29)

In the rest of Section 4, we model the performance of the IBM SP2 in the form of Equation (29) for a variety of communication constructs defined in the message passing library, MPL. Although these models are actually estimates derived from curve fitting to clean machine primitive test loop performance, we have found their predictive accuracy to be good enough to permit treating them as performance bounds. Furthermore, there is no overlap in our case study between the computation modeled in the MAC bound and the communication, since all the communication patterns are blocking. As of this date, our experiments show that our SP2 always exhibits blocking behavior even for commands specified as nonblocking. Thus the communication model can simply be added to the MAC bound.

4.1. Point to Point Communication
The cost of point to point communication is measured by a classic ping pong experiment. Processor P1 executes a blocking send, MP_SEND, to processor P2, and then executes a blocking receive, MP_RECV, from P2. Meanwhile P2 executes a blocking receive, MP_RECV, from P1, and then executes a blocking send, MP_SEND, to P1. Both processors loop for many iterations, and the minimum time is divided by 2 to determine the latency for a single point to point communication on an otherwise clean system. As shown in Figure 4, Equation (28) applies well if the experimental results are split into four distinct regions as a function of the message length. The resulting values for t_o and r are shown in Table 1.

Figure 4: Point to Point Communication Time
Table 1: Point to Point Communication Parameters (t_o and r for each of the four message-length regions)

The length of a packet is at most 255 bytes, including the actual data, routing information, packet length information, and error checking bits. This suggests why there is a change in parameter values at n = 217. The length of the FIFO queue is 2 Kbytes, suggesting why the parameters change again near n = 2 Kbytes. The change in parameter values at n = 32 Kbytes is due to the fact that MPL does one copy of data for messages of size greater than 32 Kbytes, and two copies of data for messages of size less than or equal to 32 Kbytes. In successive cases, the asymptotic zero message length latency increases, but this effect is immediately compensated for by the change in the transfer time. As a result, after each transition point, the point to point communication time as a function of the message length is lower than would have been expected from the model(s) of smaller messages. Additional experiments indicate that the difference between intra frame (nearby node) and inter frame (remote node) communication latencies is approximately 1 µs, independent of message length; this is less than 3% of the total latency for the smallest messages. This effect is negligible for larger messages.
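The piecewise point to point model of Equation (28) can be expressed directly in code. In the sketch below (ours) the region boundaries follow the text, but the t_o and r values are placeholders only, since the calibrated Table 1 entries are not reproduced here.

    # Sketch of the four-region point-to-point model, Equation (28):
    # T(n) = t_o + n / r, with different (t_o, r) in each message-length region.
    # The (t_o, r) pairs below are placeholders, NOT the calibrated Table 1 values.
    REGIONS = [
        # (upper limit on n in bytes, t_o in microseconds, r in Mbytes/s)
        (217,           40.0, 10.0),
        (2048,          50.0, 20.0),
        (32 * 1024,     70.0, 30.0),
        (float("inf"),  90.0, 35.0),
    ]

    def point_to_point_time_us(n_bytes):
        """Estimated one-way latency in microseconds for an n-byte message."""
        for limit, t_o, r_mbytes_per_s in REGIONS:
            if n_bytes < limit:
                # bytes divided by Mbytes/s yields microseconds
                return t_o + n_bytes / r_mbytes_per_s
        raise AssertionError("unreachable")

    print(point_to_point_time_us(1024))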

4.2. One to Many Communication
The cost of a scatter operation, implemented as a one to many communication, is measured by having one processor send distinct messages to P other processors by executing a non blocking send, MP_SEND, to each in turn. At the end of every iteration MP_WAIT is called to ensure that all of the sends have completed. This series of operations is done for a large number of iterations; the total time is measured and averaged for a single iteration. Each receiving processor executes a blocking receive, MP_RECV, per iteration. As shown in Figure 5, Equation (29) applies fairly well over a range of p and n values with t_comm(p) = 4.5 + 5.5*p and π_comm(p) = 0.03*p. As a result:

    T_comm(n, p) = (4.5 + 5.5*p) + (0.03*p) * n   (30)

The setup time is 10.0 µs for a single destination and 5.5 µs for each additional destination, apparently due to the added overhead of managing more than one ongoing message. As n increases toward 64 Kbytes, the overall transfer rate goes to 32 Mbytes/s.

Figure 5: One to Many Communication Time

4.3. Many to One Communication
The cost of a gather operation from P processors to one, implemented as a many to one communication, is measured by having P processors execute a blocking send, MP_SEND, to one other processor. This series of operations is done for a large number of iterations. The receiving processor executes a non blocking receive, MP_RECV, for each source every iteration. At the end of every iteration MP_WAIT is called to ensure that all of the sends have completed. The total time is measured and averaged for a single iteration. As shown in Figure 6, Equation (29) again applies fairly well over this range of p and n values, with t_comm(p) and π_comm(p) both linear in p (Equation (31)). As with a one to many communication, as the message length n goes towards 64 Kbytes, the overall transfer rate approaches 35 Mbytes/s.

Figure 6: Many to One Communication Time
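The fits in Sections 4.2 and 4.3 both take the form of Equation (29); as an example, the one to many fit of Equation (30) can be evaluated directly. The sketch below is ours, and its coefficients are the approximate setup and per-byte figures quoted above.

    # Sketch of the one-to-many (scatter) model, Equation (30):
    # T(n, p) = t_comm(p) + pi_comm(p) * n, with
    # t_comm(p) = 4.5 + 5.5*p microseconds (10 us for one destination,
    # 5.5 us per additional destination) and pi_comm(p) = 0.03*p us/byte.
    def one_to_many_time_us(n_bytes, p):
        t_comm = 4.5 + 5.5 * p
        pi_comm = 0.03 * p
        return t_comm + pi_comm * n_bytes

    for p in (1, 2, 4, 8, 16, 32):
        print(p, round(one_to_many_time_us(4096, p), 1))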

4.4. Many to Many Communication
The cost of a many to many communication is measured by having each of P processors send a distinct message to each of the other processors.

Each processor executes a non blocking send, MP_SEND, to every other processor. Each processor then executes a non blocking receive, MP_RECV, from every other processor. At the end of every iteration MP_WAIT is called to ensure that all of the sends have completed. This series of operations is done for a large number of iterations. The total time is measured and averaged for a single iteration. As shown in Figure 7, Equation (29) applies over this range of p and n values, with t_comm(p) and π_comm(p) each linear in (p - 2):

    T_MM(n, p) = t_comm(p) + π_comm(p) * n   (32)

Figure 7: Many to Many Communication Time

For large messages, the bidirectional rate for a processor node approaches 32 Mbytes/s, implying good communication scalability even when the HPS is highly saturated.

4.5. Combine Communication
The cost of a combine operation, implemented using an MPL collective communication routine, is measured by having each of P processors execute MP_COMBINE. This results in a vector of double precision operands being sent to one processor. The receiving processor performs a double precision addition reduction and broadcasts the results back to the P processors. This operation is done for a large number of iterations and the total time is measured and averaged for a single iteration. As shown in Figure 8, Equation (29) applies over this range of p and n values in three message-length regions (n < 217, 217 <= n <= 2048, and n > 2048); within each region the combine time T_C(n, p) of Equation (33) is linear in n, with both the setup term and the per-byte transfer term proportional to D = lg2(p).

Figure 8: Combine Communication Time
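A sketch (ours) of the shape of the combine model in Equation (33): within each message-length region the time is linear in n, and both coefficients grow with D = lg2(p). The per-region constants below are placeholders only, not the paper's calibrated fit.

    import math

    # Sketch of the combine model, Equation (33): T_C(n, p) is linear in n within
    # each of three message-length regions, with both the intercept and the slope
    # proportional to D = lg2(p). The (a, b) pairs below are placeholder values.
    REGIONS = [
        # (upper limit on n in bytes, a in us per unit D, b in us per byte per unit D)
        (217,           100.0, 0.10),
        (2049,          115.0, 0.20),
        (float("inf"),  150.0, 0.09),
    ]

    def combine_time_us(n_bytes, p):
        d = math.log2(p)
        for limit, a, b in REGIONS:
            if n_bytes < limit:
                return a * d + b * d * n_bytes
        raise AssertionError("unreachable")

    print(round(combine_time_us(1024, 8), 1))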

4.6. Broadcast Communication
The cost of a broadcast operation, implemented using an MPL collective communication routine, is measured by having every processor execute a broadcast, MP_BCAST, specifying the same source, P0. This series of operations is done for a large number of iterations. The total time is measured for P0 and averaged for a single iteration. As shown in Figure 9, Equation (29) again applies over this range of p and n values in two message-length regions (n < 217 and n >= 217); within each region the broadcast time of Equation (34) is linear in n, with both the setup term and the per-byte transfer term proportional to D = lg2(p). Broadcasting to P processors takes about the same time as broadcasting to the nearest lower integer power of 2 processors, suggesting that broadcast is done using a recursive doubling algorithm.

Figure 9: Broadcast Communication Time

5. Performance Evaluation and Improvement of FEMC
This section analyzes the performance of three important routines drawn from FEMC, an industrial finite element code. All the modeling data shown is collected from a single (typical) simulation time step. The calculated miss ratio for the MA bound assumes cache lines are accessed with a stride of one, which serves as a lower bound. The MAC bounds were calculated using the miss ratios calculated for the MA bound. (PLEASE NOTE: Although this can be used as a lower bound, for the final draft of the paper we will use the actual cache miss counts obtained with the POWER2 Performance Monitor. [21]) An additional bound, MA_PC, is also shown since t_m dominates the MA and MAC bounds for all of the modeled routines and it is interesting to examine the bottlenecks masked by the memory component of the bound. This MA_PC bound is the MA bound with a perfect cache (hit ratio of 100%) assumption.

5.1. Routine A
Routine A exhibits the longest execution time of the three routines. It is characterized by large amounts of computation and communication. The communication pattern is many to many and is nonuniformly distributed throughout the routine. It employs the MP_SEND and MP_RECV communication primitives. Figure 10 shows the measured run time and modeled behavior of Routine A for each processor in an eight processor run. While the modeled behavior can be seen to exhibit very good load balancing, particularly with regard to computation, the measured performance exhibits somewhat more unbalanced, although acceptable, behavior. The large unmodeled gap between the MAC + Communication model and the measured time can be attributed to unmodeled data cache misses, unmodeled aspects of communication (e.g. nonzero processor wait times during serial code sections and load imbalances), poor instruction overlap, and interrupts.

Figure 10: Load Balancing in Routine A with 8 Processors

Figure 11 shows the measured run time of P0 in Routine A multiplied by the number of processors, P. Hence perfect speedup would yield a constant bar height, independent of P. As can be seen, although there is significant speedup, there is some overhead, which is only partially attributable to the modeled communication overhead.

The modeled run time explained by MA_PC is essentially unchanged (except for the additional loop overhead introduced by parallelization) as the number of processors is increased. The modeled run time explained by the MA bound varies significantly with the number of processors. As P increases, the working set size decreases, allowing it to fit in cache when P = 4. (PLEASE NOTE: The MAC bound tracks the MA bound because, in this draft of the paper, it uses the miss ratio calculated for the MA bound. This will be fixed in the final draft of the paper as noted above.)

As the number of processors is increased, the number of MP_SEND and MP_RECV communication primitives scales proportionally, causing T_comm to increase as shown.

Figure 11: (# Processors * Time) in Routine A, P0

5.2. Routine B
Routine B exhibits the shortest execution time of the three routines. It is communication intensive with relatively little calculation. The communication pattern is a gather (many to one) followed by a broadcast (one to many with identical data). It is implemented using MP_RECV, MP_SEND, and MP_BCAST. Computation in this routine begins only after the communication is completed. Figure 12 shows the measured run time and modeled behavior of Routine B for each processor in an eight processor run. In this communication dominated routine, the MA and MAC models are insignificant. While the measured behavior is seen to exhibit reasonable load balancing, the modeled communication is distinctly bimodal. The extra modeled communication time for P0 is due to the fact that P0 receives a many to one communication burst from all the other processors. P0 then performs some computation while the other processors stall, an effect which is unmodeled for P1 through P7. Only after P0 completes its computation does it broadcast to the other processors. The unmodeled gap between the MAC + Communication model for P0 and the measured time can be attributed to the communication load imbalance and OS interrupts.

Figure 12: Load Balancing in Routine B with 8 Processors

Figure 13 shows the measured run time of P0 in Routine B (presented as in Figure 11). Comparing the measured and modeled run time of a particular processor, P0, as the total number of processors is increased shows a wide variation in the percentage of run time explained by communication. With only one processor, T_comm explains only 3% of run time. As the routine is written, P0 will send a message to itself on a one processor run. As P is increased to 2, 4, and 8, T_comm explains 54%, 65%, and 46% of P0's run time, respectively. As can be seen, there is very little speedup in Routine B, as there is very little computation and P0 is a bottleneck during the communication phase.

Figure 13: (# Processors * Time) in Routine B, P0

5.3. Routine C
Routine C is calculation intensive with relatively little communication. There are two separate global reductions (combining trees) implemented with MP_COMBINE, one in the middle of the routine and one at the end. Figure 14 shows the measured run time and modeled behavior of Routine C for each processor on an eight processor run. The modeled behavior can be seen to exhibit very good load balancing, both with regard to computation and communication, as does the measured performance. The unmodeled gap between the MAC + Communication model and the measured time can be attributed primarily to unmodeled data cache misses.

Figure 14: Load Balancing in Routine C with 8 Processors

Figure 15 shows the measured run time of P0 in Routine C (presented as in Figure 11).

Comparing the measured and modeled run time of a particular processor, P0, as the total number of processors is increased shows that the run time explained by the MA_PC bound stays constant, and hence the amount of essential computation scales perfectly. The run time explained by the MA bound decreases as more of the working set fits in cache. The extra overhead in the MAC bound is due to the method of loop parallelization employed in the high level code. Each processor loops over the entire iteration space. Inside the loop, each processor checks a conditional to determine if the data utilized in that iteration is local. The overhead of checking the conditional remains constant in every processor, regardless of the number of processors.

The MAC bound summed over all processors thus increases significantly with the number of processors. Since checking the conditional statement is not essential, this effect does not occur in the MA bound. Note that Figure 15 serves to highlight the effect of this style of coding, which can easily be removed. As the number of processors increases, the width of the MP_COMBINE increases. T_comm increases with the log base 2 of the number of processors, but is relatively insignificant compared to the time spent in computation. Although there is significant speedup, there is also some overhead, which is only slightly attributable to the modeled communication overhead.

Figure 15: (# Processors * Time) in Routine C, P0
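The "number of processors times time" presentation used in Figures 11, 13, and 15 is easy to reproduce once measured and modeled times are available; the sketch below (ours, with made-up numbers) shows the bookkeeping: under perfect speedup the product P * T(P) stays constant, and growth of the product exposes parallel overhead, only part of which is covered by the modeled terms.

    # Sketch of the P * T(P) presentation of Figures 11, 13, and 15.
    # measured[p] is the measured run time of P0 on p processors (seconds);
    # modeled[p] is the MAC + communication model for the same runs.
    # All numbers below are made up for illustration.
    measured = {1: 10.0, 2: 5.6, 4: 3.1, 8: 1.9}
    modeled  = {1:  8.0, 2: 4.3, 4: 2.3, 8: 1.3}

    for p in sorted(measured):
        work_measured = p * measured[p]   # constant bar height means perfect speedup
        work_modeled = p * modeled[p]
        unmodeled = work_measured - work_modeled
        print(f"P={p:2d}: P*T measured {work_measured:5.1f}s, "
              f"modeled {work_modeled:5.1f}s, unmodeled gap {unmodeled:4.1f}s")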

6. Conclusion
We have shown that computation and communication can be effectively bounded on the IBM SP2 using the MACS Hierarchical Performance Bounds and the SP2 communication model developed in this paper. We have demonstrated the use of this methodology to provide an intuitive understanding of where time is spent within three time critical routines from a commercial code. While communication and computation are important parts of the performance puzzle, effects as yet unmodeled, including load imbalance and nonessential cache misses, also contribute significantly to performance degradation. Use of such modeling techniques leads directly to better understanding of new architectures and application codes and to their improvement.

Acknowledgments
The University of Michigan Center for Parallel Computing, site of the IBM SP2, is partially funded by an NSF CDA grant. We are most grateful to Doug Ranz of IBM for his sound advice and help in procuring salient facts about the IBM SP2 architecture.

References
[1] E. L. Boyd, E. S. Davidson. "Hierarchical Performance Modeling with MACS: A Case Study of the Convex C-240," Proceedings of the 20th International Symposium on Computer Architecture, May 1993.
[2] W. H. Mangione-Smith, T-P. Shih, S. G. Abraham, E. S. Davidson. "Approaching a Machine Application Bound in Delivered Performance on Scientific Code," Proceedings of the IEEE, August 1993.
[3] W. H. Mangione-Smith, S. G. Abraham, E. S. Davidson. "A Performance Comparison of the IBM RS/6000 and the Astronautics ZS-1," Computer, January 1991.
[4] P. H. Wang. "Hierarchical Performance Modeling with Cache Effects: A Case Study of the DEC Alpha," University of Michigan Technical Report CSE-TR, March 1995.
[5] W. Azeem. "Modeling and Approaching the Deliverable Performance Capability of the KSR1 Processor," University of Michigan Technical Report CSE-TR-64-93, June 1993.
[6] E. L. Boyd, W. Azeem, H-H. Lee, T-P. Shih, S-H. Hung, E. S. Davidson. "A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1," Proceedings of the 1994 International Conference on Parallel Processing, August 1994, Vol. 3.
[7] Paragon XP/S Product Overview, Supercomputer Systems Division, Intel Corporation, Beaverton, OR, 1991.
[8] Connection Machine CM5 Technical Summary, Thinking Machines Corporation, Cambridge, MA, November 1992.
[9] E. L. Boyd, J. D. Wellman, S. G. Abraham, E. S. Davidson. "Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads," Proceedings of the 1993 International Conference on Supercomputing, July 1993.
[10] E. L. Boyd, E. S. Davidson. "Communication in the KSR1 MPP: Performance Evaluation Using Synthetic Workload Experiments," Proceedings of the 1994 International Conference on Supercomputing, July 1994.
[11] KSR1 Principles of Operation, Kendall Square Research Corporation, Waltham, MA, 1992.
[12] KSR1 Technical Summary, Kendall Square Research Corporation, Waltham, MA, 1992.
[13] Convex Exemplar Programming Guide, Convex Press, Richardson, Texas, 1994.
[14] Cray T3D System Architecture Overview, Cray Research, Inc., Chippewa Falls, WI, September 1993.
[15] S. W. White, S. Dhawan, "POWER2: Next Generation of the RISC System/6000 Family"; D. J. Shippy, T. W. Griffith, "POWER2 Fixed-point, Data Cache, and Storage Control Units"; T. N. Hicks, et al., "POWER2 Floating-point Unit: Architecture and Implementation"; J. Barreh, et al., "POWER2 Instruction Cache Unit"; IBM Journal of Research and Development, Vol. 38, No. 5, September 1994.
[16] C. B. Stunkel, et al. "The SP1 High Performance Switch," Proceedings of the Scalable High Performance Computing Conference, May 1994.
[17] C. B. Stunkel, et al. "The SP2 Communication Subsystem," <URL: >.
[18] C. B. Stunkel, et al. "Architecture and Implementation of Vulcan," Proceedings of the 8th International Parallel Processing Symposium, April 1994.
[19] F. H. McMahon. "The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range," Technical Report UCRL-53745, Lawrence Livermore National Laboratory, December 1986.
[20] R. Hockney. "Performance Parameters and Benchmarking of Supercomputers," Parallel Computing, Vol. 17, No. 10 & 11, December 1991.
[21] E. H. Welbon, et al. "The POWER2 Performance Monitor," IBM Journal of Research and Development, Vol. 38, No. 5, September 1994.


More information

These are special traffic patterns that create more stress on a switch

These are special traffic patterns that create more stress on a switch Myths about Microbursts What are Microbursts? Microbursts are traffic patterns where traffic arrives in small bursts. While almost all network traffic is bursty to some extent, storage traffic usually

More information

Compiling Techniques

Compiling Techniques Lecture 11: Introduction to 13 November 2015 Table of contents 1 Introduction Overview The Backend The Big Picture 2 Code Shape Overview Introduction Overview The Backend The Big Picture Source code FrontEnd

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 02, 03 May 2016 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 53 Most Essential Assumptions for Real-Time Systems Upper

More information

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic

More information

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur March 19, 2018 Modular Exponentiation Public key Cryptography March 19, 2018 Branch Prediction Attacks 2 / 54 Modular Exponentiation

More information

ab initio Electronic Structure Calculations

ab initio Electronic Structure Calculations ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab

More information

CPU SCHEDULING RONG ZHENG

CPU SCHEDULING RONG ZHENG CPU SCHEDULING RONG ZHENG OVERVIEW Why scheduling? Non-preemptive vs Preemptive policies FCFS, SJF, Round robin, multilevel queues with feedback, guaranteed scheduling 2 SHORT-TERM, MID-TERM, LONG- TERM

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives

Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives Design and FPGA Implementation of Radix-10 Algorithm for Division with Limited Precision Primitives Miloš D. Ercegovac Computer Science Department Univ. of California at Los Angeles California Robert McIlhenny

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance

More information

CprE 281: Digital Logic

CprE 281: Digital Logic CprE 28: Digital Logic Instructor: Alexander Stoytchev http://www.ece.iastate.edu/~alexs/classes/ Simple Processor CprE 28: Digital Logic Iowa State University, Ames, IA Copyright Alexander Stoytchev Digital

More information

The goal differs from prime factorization. Prime factorization would initialize all divisors to be prime numbers instead of integers*

The goal differs from prime factorization. Prime factorization would initialize all divisors to be prime numbers instead of integers* Quantum Algorithm Processor For Finding Exact Divisors Professor J R Burger Summary Wiring diagrams are given for a quantum algorithm processor in CMOS to compute, in parallel, all divisors of an n-bit

More information

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. 1 2 Introduction Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. Defines the precise instants when the circuit is allowed to change

More information

VLSI Signal Processing

VLSI Signal Processing VLSI Signal Processing Lecture 1 Pipelining & Retiming ADSP Lecture1 - Pipelining & Retiming (cwliu@twins.ee.nctu.edu.tw) 1-1 Introduction DSP System Real time requirement Data driven synchronized by data

More information

How to deal with uncertainties and dynamicity?

How to deal with uncertainties and dynamicity? How to deal with uncertainties and dynamicity? http://graal.ens-lyon.fr/ lmarchal/scheduling/ 19 novembre 2012 1/ 37 Outline 1 Sensitivity and Robustness 2 Analyzing the sensitivity : the case of Backfilling

More information

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Simple Instruction-Pipelining. Pipelined Harvard Datapath 6.823, L8--1 Simple ruction-pipelining Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. I fetch decode & eg-fetch execute memory Clock period

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and

More information

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference) ECE 3401 Lecture 23 Pipeline Design Control State Register Combinational Control Logic New/ Modified Control Word ISA: Instruction Specifications (for reference) P C P C + 1 I N F I R M [ P C ] E X 0 PC

More information

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle

Computer Engineering Department. CC 311- Computer Architecture. Chapter 4. The Processor: Datapath and Control. Single Cycle Computer Engineering Department CC 311- Computer Architecture Chapter 4 The Processor: Datapath and Control Single Cycle Introduction The 5 classic components of a computer Processor Input Control Memory

More information

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

MICROPROCESSOR REPORT.   THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE MICROPROCESSOR www.mpronline.com REPORT THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE ENERGY COROLLARIES TO AMDAHL S LAW Analyzing the Interactions Between Parallel Execution and Energy Consumption By

More information

Simple Instruction-Pipelining (cont.) Pipelining Jumps

Simple Instruction-Pipelining (cont.) Pipelining Jumps 6.823, L9--1 Simple ruction-pipelining (cont.) + Interrupts Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Src1 ( j / ~j ) Src2 ( / Ind) Pipelining Jumps

More information

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Simple Instruction-Pipelining. Pipelined Harvard Datapath 6.823, L8--1 Simple ruction-pipelining Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. fetch decode & eg-fetch execute

More information

Information redundancy

Information redundancy Information redundancy Information redundancy add information to date to tolerate faults error detecting codes error correcting codes data applications communication memory p. 2 - Design of Fault Tolerant

More information

Exploring performance and power properties of modern multicore chips via simple machine models

Exploring performance and power properties of modern multicore chips via simple machine models Exploring performance and power properties of modern multicore chips via simple machine models G. Hager, J. Treibig, J. Habich, and G. Wellein Erlangen Regional Computing Center (RRZE) Martensstr. 1, 9158

More information

ww.padasalai.net

ww.padasalai.net t w w ADHITHYA TRB- TET COACHING CENTRE KANCHIPURAM SUNDER MATRIC SCHOOL - 9786851468 TEST - 2 COMPUTER SCIENC PG - TRB DATE : 17. 03. 2019 t et t et t t t t UNIT 1 COMPUTER SYSTEM ARCHITECTURE t t t t

More information

EECS150 - Digital Design Lecture 26 - Faults and Error Correction. Types of Faults in Digital Designs

EECS150 - Digital Design Lecture 26 - Faults and Error Correction. Types of Faults in Digital Designs EECS150 - Digital Design Lecture 26 - Faults and Error Correction April 25, 2013 John Wawrzynek 1 Types of Faults in Digital Designs Design Bugs (function, timing, power draw) detected and corrected at

More information

TDDI04, K. Arvidsson, IDA, Linköpings universitet CPU Scheduling. Overview: CPU Scheduling. [SGG7] Chapter 5. Basic Concepts.

TDDI04, K. Arvidsson, IDA, Linköpings universitet CPU Scheduling. Overview: CPU Scheduling. [SGG7] Chapter 5. Basic Concepts. TDDI4 Concurrent Programming, Operating Systems, and Real-time Operating Systems CPU Scheduling Overview: CPU Scheduling CPU bursts and I/O bursts Scheduling Criteria Scheduling Algorithms Multiprocessor

More information

Outline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014

Outline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014 Outline 1 midterm exam on Friday 11 July 2014 policies for the first part 2 questions with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014 Intro

More information

Digital Systems Roberto Muscedere Images 2013 Pearson Education Inc. 1

Digital Systems Roberto Muscedere Images 2013 Pearson Education Inc. 1 Digital Systems Digital systems have such a prominent role in everyday life The digital age The technology around us is ubiquitous, that is we don t even notice it anymore Digital systems are used in:

More information

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations! Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst:

More information

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0 NEC PerforCache Influence on M-Series Disk Array Behavior and Performance. Version 1.0 Preface This document describes L2 (Level 2) Cache Technology which is a feature of NEC M-Series Disk Array implemented

More information

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

Lecture 23: Illusiveness of Parallel Performance. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 23: Illusiveness of Parallel Performance. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 23: Illusiveness of Parallel Performance James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L23 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping peel

More information

CSE370: Introduction to Digital Design

CSE370: Introduction to Digital Design CSE370: Introduction to Digital Design Course staff Gaetano Borriello, Brian DeRenzi, Firat Kiyak Course web www.cs.washington.edu/370/ Make sure to subscribe to class mailing list (cse370@cs) Course text

More information

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide)

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide) Out-of-order Pipeline Buffer of instructions Issue = Select + Wakeup Select N oldest, read instructions N=, xor N=, xor and sub Note: ma have execution resource constraints: i.e., load/store/fp Fetch Decode

More information

Transposition Mechanism for Sparse Matrices on Vector Processors

Transposition Mechanism for Sparse Matrices on Vector Processors Transposition Mechanism for Sparse Matrices on Vector Processors Pyrrhos Stathis Stamatis Vassiliadis Sorin Cotofana Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

More information

McBits: Fast code-based cryptography

McBits: Fast code-based cryptography McBits: Fast code-based cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands Joint work with Daniel Bernstein, Tung Chou December 17, 2013 IMA International Conference on Cryptography

More information

Scalable Non-blocking Preconditioned Conjugate Gradient Methods

Scalable Non-blocking Preconditioned Conjugate Gradient Methods Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William

More information

Lecture: Pipelining Basics

Lecture: Pipelining Basics Lecture: Pipelining Basics Topics: Performance equations wrap-up, Basic pipelining implementation Video 1: What is pipelining? Video 2: Clocks and latches Video 3: An example 5-stage pipeline Video 4:

More information

ARecursive Doubling Algorithm. for Solution of Tridiagonal Systems. on Hypercube Multiprocessors

ARecursive Doubling Algorithm. for Solution of Tridiagonal Systems. on Hypercube Multiprocessors ARecursive Doubling Algorithm for Solution of Tridiagonal Systems on Hypercube Multiprocessors Omer Egecioglu Department of Computer Science University of California Santa Barbara, CA 936 Cetin K Koc Alan

More information

Fault Modeling. 李昆忠 Kuen-Jong Lee. Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan. VLSI Testing Class

Fault Modeling. 李昆忠 Kuen-Jong Lee. Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan. VLSI Testing Class Fault Modeling 李昆忠 Kuen-Jong Lee Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan Class Fault Modeling Some Definitions Why Modeling Faults Various Fault Models Fault Detection

More information

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture Rajeevan Amirtharajah University of California, Davis Outline Announcements Review: PDP, EDP, Intersignal Correlations, Glitching, Top

More information

2.5D algorithms for distributed-memory computing

2.5D algorithms for distributed-memory computing ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster

More information

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs Article Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs E. George Walters III Department of Electrical and Computer Engineering, Penn State Erie,

More information

THE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression

THE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression Faster Static Timing Analysis via Bus Compression by David Van Campenhout and Trevor Mudge CSE-TR-285-96 THE UNIVERSITY OF MICHIGAN Computer Science and Engineering Division Department of Electrical Engineering

More information

[2] Predicting the direction of a branch is not enough. What else is necessary?

[2] Predicting the direction of a branch is not enough. What else is necessary? [2] When we talk about the number of operands in an instruction (a 1-operand or a 2-operand instruction, for example), what do we mean? [2] What are the two main ways to define performance? [2] Predicting

More information

Processor Design & ALU Design

Processor Design & ALU Design 3/8/2 Processor Design A. Sahu CSE, IIT Guwahati Please be updated with http://jatinga.iitg.ernet.in/~asahu/c22/ Outline Components of CPU Register, Multiplexor, Decoder, / Adder, substractor, Varity of

More information

Bounded Delay for Weighted Round Robin with Burst Crediting

Bounded Delay for Weighted Round Robin with Burst Crediting Bounded Delay for Weighted Round Robin with Burst Crediting Sponsor: Sprint Kert Mezger David W. Petr Technical Report TISL-0230-08 Telecommunications and Information Sciences Laboratory Department of

More information

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits Chris Calabro January 13, 2016 1 RAM model There are many possible, roughly equivalent RAM models. Below we will define one in the fashion

More information

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018 ECE 172 Digital Systems Chapter 12 Instruction Pipelining Herbert G. Mayer, PSU Status 7/20/2018 1 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for

More information

Strassen s Algorithm for Tensor Contraction

Strassen s Algorithm for Tensor Contraction Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

Logic Design II (17.342) Spring Lecture Outline

Logic Design II (17.342) Spring Lecture Outline Logic Design II (17.342) Spring 2012 Lecture Outline Class # 10 April 12, 2012 Dohn Bowden 1 Today s Lecture First half of the class Circuits for Arithmetic Operations Chapter 18 Should finish at least

More information

Instruction Set Extensions for Reed-Solomon Encoding and Decoding

Instruction Set Extensions for Reed-Solomon Encoding and Decoding Instruction Set Extensions for Reed-Solomon Encoding and Decoding Suman Mamidi and Michael J Schulte Dept of ECE University of Wisconsin-Madison {mamidi, schulte}@caewiscedu http://mesaecewiscedu Daniel

More information

Analytical Modeling of Parallel Systems

Analytical Modeling of Parallel Systems Analytical Modeling of Parallel Systems Chieh-Sen (Jason) Huang Department of Applied Mathematics National Sun Yat-sen University Thank Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing

More information