Performance, Power & Energy
- Barnaby Holt
- 5 years ago
Transcription
Performance, Power & Energy
ELEC8106/ELEC6102, Spring 2010
Hayden Kwok-Hay So
H. So, Sp10 Lecture 3 - ELEC8106/6102

Recall: Goal of this class
- Performance
- Reconfiguration
- Power/Energy

PERFORMANCE EVALUATION

What is good performance?
- Latency: time needed to finish certain task(s)
- Throughput: number of tasks finished per unit time

Latency vs Throughput (1)
- Does low latency imply high throughput?
- Does high throughput imply low latency?
- Does high latency imply low throughput?
- Does low throughput imply high latency?

Latency vs Throughput (2)
Computers 1 and 2 must each finish tasks A, B and C, one after another.
- Computer 1: task A takes 15s, B takes 20s, C takes 50s
  - Latency = 15s + 20s + 50s = 85s
  - Throughput = 3 / 85s ≈ 0.035 tasks/s
- Computer 2: task A takes 20s, B takes 25s, C takes 45s
  - Latency = 20s + 25s + 45s = 90s
  - Throughput = 3 / 90s ≈ 0.033 tasks/s
Is Computer 1 faster than Computer 2?
Latency vs Throughput (3)
What if Computer 2 can perform 3 tasks at the same time?
- Computer 1: task A takes 15s, B takes 20s, C takes 50s
  - Latency = 15s + 20s + 50s = 85s
  - Throughput = 3 / 85s ≈ 0.035 tasks/s
- Computer 2: task A takes 20s, B takes 25s, C takes 45s
  - Latency = 45s
  - Throughput = 3 / 45s ≈ 0.067 tasks/s
Is Computer 2 faster than Computer 1?

Latency vs Throughput (4)
What if both Computer 1 and 2 can perform 2 tasks at the same time?
- Computer 1 (A:15, B:20, C:50)
  - Latency = 50s
  - Throughput = 3 / 50s = 0.06 tasks/s
- Computer 2 (A:20, B:25, C:45)
  - Latency = 45s
  - Throughput = 3 / 45s ≈ 0.067 tasks/s
Which computer is faster?

Latency vs Throughput (5)
Both Computer 1 and 2 can perform 2 tasks at the same time. Define latency as the time to get the first result.
- Computer 1 (A:15, B:20, C:50)
  - First result = 15s, last result = 50s
  - Throughput = 3 / 50s = 0.06 tasks/s
- Computer 2 (A:20, B:25, C:45)
  - First result = 20s, last result = 45s
  - Throughput = 3 / 45s ≈ 0.067 tasks/s

Latency vs Throughput (6)
Both Computer 1 and 2 can perform 2 tasks at the same time, and each now receives six tasks (A, B, C twice).
- Computer 1 (A:15, B:20, C:50)
  - First result = 15s, last result = 85s
  - Throughput = 6 / 85s ≈ 0.07 tasks/s
- Computer 2 (A:20, B:25, C:45)
  - First result = 20s, last result = 90s
  - Throughput = 6 / 90s ≈ 0.067 tasks/s

Latency vs Throughput Summary
- Latency
  - Time for the first data/response to arrive
  - Time for a task to finish
  - Indicates the responsiveness of a system
- Throughput
  - Sustained rate of task completion
  - Matters most when there is a lot of continuous input, especially streaming input
  - A long-term efficiency measurement

Latency vs Throughput Summary
- Latency and throughput are important in different scenarios
- The two are closely tied to each other, but there is no obvious relationship between them
- Many factors affect latency/throughput
  - Data input / workload
  - Scheduling
  - etc.
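The numbers on the Latency vs Throughput slides can be reproduced with a short sketch. It assumes each computer is described by an explicit schedule: one list of task durations per execution unit, run back to back on that unit. The function names and the particular assignment of tasks to units (A then B on one unit, C on the other) are assumptions chosen to match the slide figures, not something the slides spell out.

```python
from itertools import accumulate


def completion_times(schedule):
    """Return the sorted finish time of every task.

    `schedule` is a list with one entry per execution unit; each entry is the
    list of task durations that unit runs back to back, starting at t = 0.
    """
    return sorted(t for unit in schedule for t in accumulate(unit))


def summarize(schedule):
    """First result, last result, and throughput (tasks per second)."""
    done = completion_times(schedule)
    return {
        "first_result": done[0],
        "last_result": done[-1],
        "throughput": len(done) / done[-1],
    }


# Computer 1 (A:15, B:20, C:50) and Computer 2 (A:20, B:25, C:45).
c1_sequential = [[15, 20, 50]]        # one task at a time
c2_three_wide = [[20], [25], [45]]    # three tasks at the same time
c1_two_wide = [[15, 20], [50]]        # assumed split: A, B on one unit; C on the other
c2_two_wide = [[20, 25], [45]]

if __name__ == "__main__":
    for name, sched in [
        ("C1 sequential", c1_sequential),
        ("C2 three-wide", c2_three_wide),
        ("C1 two-wide", c1_two_wide),
        ("C2 two-wide", c2_two_wide),
    ]:
        print(name, summarize(sched))
```

With these schedules, Computer 1 delivers its first result sooner (15s vs 20s) while Computer 2 finishes the whole batch sooner (45s vs 50s), which is the point of slide (5): the answer to "which is faster" depends on the metric.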
Performance: task completion
- Time to complete 1 task is a good way to measure general purpose computers
- Time to complete 1 task (latency):
  $T = \dfrac{\text{no. of instrs} \times \text{CPI}}{f}$

How to improve speed?
$T = \dfrac{\text{no. of instrs} \times \text{CPI}}{f}$
- Decrease the number of instructions
- Increase the clock frequency $f$
- Decrease the cycles per instruction (CPI)

Increase clock frequency
- Linear increase in performance
- But heat dissipation has prohibited simple clock frequency boosts

Improving speed
- Compiler: determines the no. of instrs
- (Micro) computer architecture: determines the CPI
- NOTE: the number of instructions of a program is closely related to its CPI
  - CPI changes depending on the application
[Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith]

Review: CPI vs # of instructions
A program executes the following instruction profile:

| Instruction Type | Number | Clock Cycles |
| Add              | …      | …            |
| Multiply         | …      | …            |
| Division         | …      | …            |
| Load             | …      | …            |
| Store            | …      | …            |

With a clock cycle time of 1ns, how long does the program take to finish? What is the average CPI of the processor?
- (2000 × … + … × … + … × … + … × … + … × 2) × 1ns = 26 µs
- Avg. CPI = 26,000 / 5000 = 5.2

Amdahl's Law
The overall speedup due to improving a fraction P of the execution with a speedup of S is:
$\text{Speedup}_{\text{overall}} = \dfrac{1}{(1 - P) + \dfrac{P}{S}}$
- E.g. if P = 0.2 and S = 5, then the overall speedup is $\dfrac{1}{(1 - 0.2) + \frac{0.2}{5}} = 1.19$
- If the same improvement can be applied to a larger portion with P = 0.9, then the speedup is $\dfrac{1}{(1 - 0.9) + \frac{0.9}{5}} = 3.57$
- Always optimize for the common cases.
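As a quick check of the Amdahl's Law arithmetic above, here is a minimal sketch. The function name `amdahl_speedup` is hypothetical; the formula and the two worked cases (P = 0.2 and P = 0.9, both with S = 5) come from the slide.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction `p` of the run time is sped up by `s`.

    Implements 1 / ((1 - p) + p / s).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    # The two cases from the slide: same S, different affected fraction P.
    for p in (0.2, 0.9):
        print(f"P={p}, S=5 -> overall speedup {amdahl_speedup(p, 5):.2f}")
    # Even an enormous S cannot beat 1 / (1 - P) when P is small.
    print(f"P=0.2, S=1e9 -> {amdahl_speedup(0.2, 1e9):.2f}")
```

The last line illustrates why the slide says to optimize for the common case: with P = 0.2 the overall speedup stays below 1.25 no matter how large S becomes.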
Instruction example revisit

| Instruction Type | Number | Clock Cycles |
| Add              | …      | …            |
| Multiply         | …      | …            |
| Division         | …      | …            |
| Load             | …      | …            |
| Store            | …      | …            |

If we can reduce the execution time of any one instruction type, which one should we optimize?
- Case 1: Optimize Add
  (2000 × … + … × … + … × … + … × … + … × 2) × 1ns = 24.2 µs (Speedup = 26/24.2 = 1.07)
- Case 2: Optimize Load
  (2000 × … + … × … + … × … + … × … + … × 2) × 1ns = 18.8 µs (Speedup = 26/18.8 = 1.38)

Compiler optimizations
- Decrease # of instructions
  - E.g. common subexpression elimination
  - E.g. constant propagation
  - (?) use function call instead of macro
- Use less expensive instructions
  - E.g. shift right instead of divide by 2
  - E.g. register reuse to avoid load/store
- Many more

Ex: Predicated instructions
Pseudo-code:
    if cond { true_part } else { false_part }
    more_instr
Assembly code:
        branch !cond goto LF
        true_part
        goto LD
    LF: false_part
    LD: more_instr
Predicated code:
    (cond)  true_part
    (!cond) false_part
    more_instr
Effects:
- Reduces the number of instructions (lowers #instr)
- Reduces branch mispredictions (lowers CPI)
- Improves instruction-cache hit rate (lowers CPI)

Decreasing CPI
- Traditional high performance CPU architectures focus on decreasing CPI
- Reduce data/branch hazards
  - CPI close to 1
- Increase IPC (instructions per cycle)
  - Parallel processing
  - CPI < 1, IPC > 1
  - Implicit (hidden below the ISA)
    - Superscalar
  - Explicit (exposed through the ISA)
    - VLIW
    - Vector processors
    - SIMD

Superscalar Processors (1)
- Key idea: issue more than 1 instruction per cycle to make maximum use of computing resources
- Relatively simple: in-order instruction dispatch + execution
  - Dispatch N consecutive upcoming instructions each cycle until a data hazard arises
- Sophisticated: out-of-order dispatch + execution
  - Execute N not-necessarily-consecutive instructions per cycle as long as there is an available execution unit

Tomasulo Architecture
[Figure: Tomasulo architecture block diagram. Labels: From Mem, FP Op Queue, Load Buffers, FP Registers, Store Buffers, To Mem, Reservation Stations (Add1, Add2, Add3; Mult1, Mult2), FP adders, FP multipliers, Common Data Bus (CDB).]
Adapted from EECS252, U.C. Berkeley
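The in-order dispatch rule from the Superscalar Processors slide ("dispatch N consecutive upcoming instructions each cycle until a data hazard arises") can be sketched as a toy cycle counter. This is a simplified model under stated assumptions, not a description of any real pipeline: every instruction has single-cycle latency, there are always enough execution units, and a hazard only exists between instructions issued in the same cycle. The function name and the tuple encoding of instructions are hypothetical.

```python
def in_order_issue_cycles(program, width):
    """Count cycles needed by an idealized in-order, `width`-wide issue stage.

    `program` is a list of (dest, sources) tuples. Each cycle issues up to
    `width` consecutive instructions and stops early at the first instruction
    that reads or rewrites a register written earlier in the same cycle.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()  # registers written by instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, sources = program[i]
            if dest in written or any(src in written for src in sources):
                break  # data hazard: wait for the next cycle
            written.add(dest)
            i += 1
            issued += 1
        cycles += 1
    return cycles


# r1 = a + b; r2 = c + d; r3 = r1 + r2; r4 = e + f
independent_pairs = [
    ("r1", ("a", "b")),
    ("r2", ("c", "d")),
    ("r3", ("r1", "r2")),
    ("r4", ("e", "f")),
]
# r1 = a + b; r2 = r1 + c; r3 = r2 + d  (a serial dependence chain)
chain = [("r1", ("a", "b")), ("r2", ("r1", "c")), ("r3", ("r2", "d"))]

if __name__ == "__main__":
    for name, prog in [("independent pairs", independent_pairs), ("chain", chain)]:
        for width in (1, 2):
            cycles = in_order_issue_cycles(prog, width)
            print(f"{name}, width={width}: {cycles} cycles, IPC={len(prog) / cycles:.2f}")
```

The dependence chain gains nothing from the wider issue stage, which is why the slides go on to out-of-order schemes such as Tomasulo's that look past the stalled instruction.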
VLIW
- Very Long Instruction Word (VLIW) machines
- Each instruction is in fact composed of multiple smaller, standard instructions
  - 4 to 8 standard instructions per cycle
- The compiler looks for instructions from the original program that can be issued in the same cycle and packs them into one mega-instruction
- No dynamic instruction analysis in hardware
[Figure: a simplistic VLIW pipeline with one instruction fetch (IF), a shared register file (reg), and parallel EX units with caches ($)]

Vector Processors
- A processor that operates on vectors as its basic data type
  - Compared to a scalar processor
- Vector instructions
  - E.g. add 2 vectors:
      set_vector_len 64
      add vectorR, vectorA, vectorB
- A form of data-parallelism
- Reduces the no. of instructions

SIMD
- Single instruction, multiple data
- A class of computation architecture
- Only one instruction stream is presented, which operates on multiple data streams
- Vector processing is a special form of SIMD in which all data are indeed vectors
- E.g. Intel's MMX, SSE, SSE2 extensions
- To implement r1=a1+b1, r2=a2+b2, r3=a3+b3 and r4=a4+b4 in one instruction:
    add r1,a1,b1,r2,a2,b2,r3,a3,b3,r4,a4,b4
- Saves no. of instructions
- May pack 4 8-bit adds into a single 32-bit add
  - Reuses the 32-bit hardware adder (with small modifications)

Explicit vs Implicit (1)
- The Instruction Set Architecture (ISA) is the contract between the software and the hardware
- The hardware guarantees certain behavior to the software according to the ISA
  - E.g. if an instruction i1 comes before instruction i2, then the effect of i1 will definitely be reflected when i2 is executed
- Without changing the ISA, the hardware must extract all the instruction-level parallelism (ILP) behind the scenes while keeping the promised behavior to software
  - Very complicated hardware design
- Keeping the ISA maintains binary compatibility
  - Applications compiled to run on an Intel 8086 can still run on a modern Intel Core i7!
- Good division of labor, hence easy development
  - A change in HW won't affect SW
- SW cannot foresee the data-dependent run-time behavior of the program

Explicit vs Implicit (2)
- Exposing the underlying parallel architecture to software lets software bear the burden of extracting parallelism from the application
  - Simple hardware
  - Software can take a long time to do the best job because it is a one-off effort
- Any change to the hardware requires a major change to the software tools
  - No division of labor
- Data-dependent behavior cannot be anticipated at compile time
  - SW cannot fully exploit all possible parallelization opportunities

Performance Summary
- Key to computer performance: $T = \dfrac{\text{no. of instrs} \times \text{CPI}}{f}$
- Clock frequency is determined by the circuit implementation
- The number of instructions and the CPI both depend on the tight interaction between the compiler and the computer micro-architecture
- Implicit parallelism hidden behind the ISA puts the burden on low-level hardware implementations to extract ILP
- Explicit parallelism exposes the underlying architecture to the compiler and leaves the burden to software to extract ILP
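The SIMD slide's remark about packing four 8-bit adds into a single 32-bit add can be illustrated in software. The sketch below uses a well-known bit-masking trick (sometimes called SWAR) to keep carries from crossing lane boundaries; it stands in for the "small modifications" to the hardware adder and is not how MMX or SSE are implemented. All function names are hypothetical.

```python
HIGH_BITS = 0x80808080  # top bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F   # low 7 bits of each 8-bit lane


def pack_u8x4(lanes):
    """Pack four values in 0..255 into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for index, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * index)
    return word


def unpack_u8x4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * index)) & 0xFF for index in range(4)]


def packed_add_u8x4(x, y):
    """Add four 8-bit lanes with one 32-bit addition; each lane wraps mod 256.

    The low 7 bits of every lane are added together, so a carry can reach the
    lane's own top bit but never the neighbouring lane. The top bits are then
    combined separately with XOR.
    """
    low_sum = (x & LOW_BITS) + (y & LOW_BITS)
    return (low_sum ^ ((x ^ y) & HIGH_BITS)) & 0xFFFFFFFF


if __name__ == "__main__":
    a = pack_u8x4([1, 2, 3, 4])
    b = pack_u8x4([10, 20, 30, 40])
    print(unpack_u8x4(packed_add_u8x4(a, b)))
    # Lane 0 overflows (250 + 10) and wraps without disturbing lane 1.
    c = pack_u8x4([250, 2, 3, 4])
    print(unpack_u8x4(packed_add_u8x4(c, b)))
```

A plain 32-bit `x + y` would let the overflow from lane 0 ripple into lane 1; suppressing that carry between lanes is the essence of reusing one wide adder for several narrow adds.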
POWER AND ENERGY

Power and Energy
- The power consumption of a circuit is the energy consumed per unit time
  - Power measures how much energy is being used/dissipated at any one time
  - Affects heat dissipation
  - Affects the input power supply
  - Slightly affects battery lifetime
- Energy consumption is the measure of the absolute amount of energy used to perform a certain operation
  - Affects battery capacity
  - Concerns embedded system designers
- Both metrics are important for RC designs
  - Some techniques lower power but not energy

Power, Energy and Performance
- Power consumption depends on:
  - Activity factor $\alpha$ (amount of circuit switching)
  - Load capacitance $C_L$ (size of circuit)
  - Supply voltage $V_{dd}$
  - Swing voltage $V_{swing}$
  - Clock frequency $f$
- Power consumption (dynamic term first, then static terms):
  $P_{total} = \alpha \left( C_L V_{dd} V_{swing} f \right) + I_{sc} V_{dd} + I_{leakage} V_{dd}$
- Energy per operation: $E_{op} \propto P_{dyn} / f = \alpha C_L V_{dd} V_{swing}$
- Total energy consumption: $E_{total} = E_{op} \times \text{no. of operations}$
- Total run time: $T_{total} = \text{no. of operations} \times \text{CPI} / f$

Dynamic Power Dissipation
[Figure: CMOS inverter with input $V_{in}$ driving output $V_{out}$ into load capacitance $C_L$]
- Energy drawn from the supply during a 0→1 transition: $E_{0 \to 1} = C_L V_{dd}^2$
- Energy dissipated in the pull-up resistance: $E_R = \frac{1}{2} C_L V_{dd}^2$
- Energy stored on the capacitor: $E_C = \frac{1}{2} C_L V_{dd}^2$
- Energy is stored from $V_{dd}$ onto $C_L$ during the 0→1 transition
- Energy is drained from $C_L$ to ground during the 1→0 transition
- In the absence of static/leakage power consumption, the capacitance keeps the energy stored until it is discharged

Dynamic Power Consumption
- $P_{dynamic}$ = energy per transition × transition rate
  $P_{dynamic} = C_L V_{dd}^2 \cdot f \cdot P(\text{transition}) = \alpha C_L V_{dd}^2 f = C_{eff} V_{dd}^2 f$
  where $\alpha = P(\text{transition})$, $C_{eff} = \alpha C_L$, and the output swings rail to rail ($V_{swing} = V_{dd}$)
- Power dissipation depends on the data input statistics
  - The more data transitions, the more power is consumed

Switching activities
- Both inputs A and B switch randomly, i.e. each is equally likely to be 0 or 1 in any cycle
- Q = A & B (AND gate)
- Probability that Q has a 0→1 transition:
  $P(Q_{0 \to 1}) = P(Q = 0) \cdot P(Q = 1) = \frac{3}{4} \cdot \frac{1}{4} = \frac{3}{16}$
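The 3/16 switching probability for the AND gate follows from enumerating the truth table, and the same enumeration works for other gates. The sketch below assumes independent, uniformly random inputs and independence between consecutive cycles; the helper name `switching_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def switching_probability(gate, n_inputs):
    """P(output makes a 0 -> 1 transition) = P(output = 0) * P(output = 1).

    Assumes every input is independently 0 or 1 with probability 1/2 and that
    consecutive cycles are independent of each other.
    """
    combos = list(product((0, 1), repeat=n_inputs))
    p_one = Fraction(sum(gate(*bits) for bits in combos), len(combos))
    return (1 - p_one) * p_one


if __name__ == "__main__":
    gates = {
        "AND": lambda a, b: a & b,
        "OR": lambda a, b: a | b,
        "XOR": lambda a, b: a ^ b,
    }
    for name, gate in gates.items():
        print(name, switching_probability(gate, 2))
```

Under these assumptions the AND and OR gates both come out at 3/16, while XOR reaches 1/4, the maximum possible value of P(0)·P(1), so the logic function alone changes the activity factor of a node.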
Transistor Leakage
- Transistors are not completely turned off even when they should be
- The main contribution is from sub-threshold current
  - A function of $V_{th}$ and …
[Figure: CMOS inverters driving load $C_L$; the transistor that should be OFF still conducts $I_{leak}$]

What are the Options?
- Power consumption depends on:
  - Activity factor $\alpha$ (amount of circuit switching)
  - Load capacitance $C_L$ (size of circuit)
  - Supply voltage $V_{dd}$
  - Swing voltage $V_{swing}$
  - Clock frequency $f$
- Power consumption (dynamic term first, then static terms):
  $P_{total} = \alpha \left( C_L V_{dd} V_{swing} f \right) + I_{sc} V_{dd} + I_{leakage} V_{dd}$
- Energy per operation: $E_{op} \propto P_{dyn} / f = \alpha C_L V_{dd} V_{swing}$
- Total energy consumption: $E_{total} = E_{op} \times \text{no. of operations}$
- Total run time: $T_{total} = \text{no. of operations} \times \text{CPI} / f$
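The relations on the "What are the Options?" slide explain the earlier remark that some techniques lower power but not energy. The sketch below evaluates them for made-up parameter values; every number is an illustrative assumption, static and short-circuit power are ignored, and a full voltage swing (V_swing = V_dd) is assumed. Function names are hypothetical.

```python
def dynamic_power(alpha, c_load, v_dd, f_clk):
    """P_dyn = alpha * C_L * V_dd^2 * f, in watts (full swing assumed)."""
    return alpha * c_load * v_dd ** 2 * f_clk


def total_run_time(n_ops, cpi, f_clk):
    """T_total = no. of operations * CPI / f, in seconds."""
    return n_ops * cpi / f_clk


def total_dynamic_energy(alpha, c_load, v_dd, f_clk, n_ops, cpi):
    """Dynamic energy for the whole run: P_dyn * T_total, in joules."""
    return dynamic_power(alpha, c_load, v_dd, f_clk) * total_run_time(n_ops, cpi, f_clk)


# Hypothetical operating point, for illustration only.
BASE = dict(alpha=0.1, c_load=1e-9, v_dd=1.0, f_clk=1e9, n_ops=1e9, cpi=1.0)

if __name__ == "__main__":
    variants = {
        "baseline": BASE,
        "half clock frequency": {**BASE, "f_clk": BASE["f_clk"] / 2},
        "lower supply voltage": {**BASE, "v_dd": 0.8},
    }
    for name, cfg in variants.items():
        power = dynamic_power(cfg["alpha"], cfg["c_load"], cfg["v_dd"], cfg["f_clk"])
        time_s = total_run_time(cfg["n_ops"], cfg["cpi"], cfg["f_clk"])
        energy = total_dynamic_energy(**cfg)
        print(f"{name}: P={power:.3f} W, T={time_s:.2f} s, E={energy:.3f} J")
```

In this model, halving the clock halves the power but doubles the run time, leaving the dynamic energy unchanged, whereas lowering V_dd reduces both power and energy because of the quadratic voltage term.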
More informationObjective and Outline. Acknowledgement. Objective: Power Components. Outline: 1) Acknowledgements. Section 4: Power Components
Objective: Power Components Outline: 1) Acknowledgements 2) Objective and Outline 1 Acknowledgement This lecture note has been obtained from similar courses all over the world. I wish to thank all the
More informationSimple Instruction-Pipelining. Pipelined Harvard Datapath
6.823, L8--1 Simple ruction-pipelining Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. fetch decode & eg-fetch execute
More informationDelay and Energy Consumption Analysis of Conventional SRAM
World Academy of Science, Engineering and Technology 13 8 Delay and Energy Consumption Analysis of Conventional SAM Arash Azizi-Mazreah, Mohammad T. Manzuri Shalmani, Hamid Barati, and Ali Barati Abstract
More informationChapter 7. Sequential Circuits Registers, Counters, RAM
Chapter 7. Sequential Circuits Registers, Counters, RAM Register - a group of binary storage elements suitable for holding binary info A group of FFs constitutes a register Commonly used as temporary storage
More informationSpiral 2 7. Capacitance, Delay and Sizing. Mark Redekopp
2-7.1 Spiral 2 7 Capacitance, Delay and Sizing Mark Redekopp 2-7.2 Learning Outcomes I understand the sources of capacitance in CMOS circuits I understand how delay scales with resistance, capacitance
More informationEE115C Digital Electronic Circuits Homework #4
EE115 Digital Electronic ircuits Homework #4 Problem 1 Power Dissipation Solution Vdd =1.0V onsider the source follower circuit used to drive a load L =20fF shown above. M1 and M2 are both NMOS transistors
More informationDigital Integrated Circuits Designing Combinational Logic Circuits. Fuyuzhuo
Digital Integrated Circuits Designing Combinational Logic Circuits Fuyuzhuo Introduction Digital IC Dynamic Logic Introduction Digital IC EE141 2 Dynamic logic outline Dynamic logic principle Dynamic logic
More informationBinary addition example worked out
Binary addition example worked out Some terms are given here Exercise: what are these numbers equivalent to in decimal? The initial carry in is implicitly 0 1 1 1 0 (Carries) 1 0 1 1 (Augend) + 1 1 1 0
More informationA Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor
A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor Farhad Mehdipour, H. Noori, B. Javadi, H. Honda, K. Inoue, K. Murakami Faculty
More informationEE141. Administrative Stuff
-Spring 2004 Digital Integrated ircuits Lecture 15 Logical Effort Pass Transistor Logic 1 dministrative Stuff First (short) project to be launched next Th. Overall span: 1 week Hardware lab this week Hw
More informationEC 413 Computer Organization
EC 413 Computer Organization rithmetic Logic Unit (LU) and Register File Prof. Michel. Kinsy Computing: Computer Organization The DN of Modern Computing Computer CPU Memory System LU Register File Disks
More informationObjectives for Energy Reduction
genda Introduction, challenges and objectives Wireless (ody) Sensor Networks pplications, system requirements Generic architecture of a WSN node Processor and radio transceiver performance Examples of
More informationVector Lane Threading
Vector Lane Threading S. Rivoire, R. Schultz, T. Okuda, C. Kozyrakis Computer Systems Laboratory Stanford University Motivation Vector processors excel at data-level parallelism (DLP) What happens to program
More informationChapter # 3: Multi-Level Combinational Logic
hapter # 3: Multi-Level ombinational Logic ontemporary Logic esign Randy H. Katz University of alifornia, erkeley June 993 No. 3- hapter Overview Multi-Level Logic onversion to NN-NN and - Networks emorgan's
More informationEECS 151/251A Homework 5
EECS 151/251A Homework 5 Due Monday, March 5 th, 2018 Problem 1: Timing The data-path shown below is used in a simple processor. clk rd1 rd2 0 wr regfile 1 0 ALU REG 1 The elements used in the design have
More informationEnergy Delay Optimization
EE M216A.:. Fall 21 Lecture 8 Energy Delay Optimization Prof. Dejan Marković ee216a@gmail.com Some Common Questions Is sizing better than V DD for energy reduction? What are the optimal values of gate
More informationECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018
ECE 172 Digital Systems Chapter 12 Instruction Pipelining Herbert G. Mayer, PSU Status 7/20/2018 1 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for
More informationDynamic Combinational Circuits. Dynamic Logic
Dynamic Combinational Circuits Dynamic circuits Charge sharing, charge redistribution Domino logic np-cmos (zipper CMOS) Krish Chakrabarty 1 Dynamic Logic Dynamic gates use a clocked pmos pullup Two modes:
More informationBeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power
BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power James C. Hoe Department of ECE Carnegie Mellon niversity Eric S. Chung, et al., Single chip Heterogeneous Computing:
More informationEECS 141 F01 Lecture 17
EECS 4 F0 Lecture 7 With major inputs/improvements From Mary-Jane Irwin (Penn State) Dynamic CMOS In static circuits at every point in time (except when switching) the output is connected to either GND
More informationArithmetic Building Blocks
rithmetic uilding locks Datapath elements dder design Static adder Dynamic adder Multiplier design rray multipliers Shifters, Parity circuits ECE 261 Krish Chakrabarty 1 Generic Digital Processor Input-Output
More informationEE241 - Spring 2005 Advanced Digital Integrated Circuits. Admin. Lecture 10: Power Intro
EE241 - Spring 2005 Advanced Digital Integrated Circuits Lecture 10: Power Intro Admin Project Phase 2 due Monday March 14, 5pm (by e-mail to jan@eecs.berkeley.edu and huifangq@eecs.berkeley.edu) Should
More informationCMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes
CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 21: Shifters, Decoders, Muxes [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN
More information