Hardware Acceleration of DNNs


1 Lecture 12: Hardware Acceleration of DNNs. Visual Computing Systems. Stanford CS348V, Winter 2018

2 Hardware acceleration for DNNs: Huawei Kirin NPU, Google TPU, Apple Neural Engine, Intel Lake Crest Deep Learning Accelerator, MIT Eyeriss, Volta GPU with Tensor Cores. Slide credit: Xuan Yang

3 And many more. Image credit: Shan Tang

4 Modern NVIDIA GPU (Volta)

5 Recall properties of GPUs. "Compute rich": packed densely with processing elements - good for compute-bound applications. Good, because dense-matrix multiplication and DNN convolutional layers (when implemented properly) are compute bound. But also remember the cost of instruction stream processing and control in a programmable processor. Note: these figures are estimates for a CPU. [Figure credit: Eric Chung]
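To make "compute bound" concrete, here is a back-of-the-envelope arithmetic intensity calculation (my own sketch, not from the slides) for an N x N matrix multiply:

    # Arithmetic intensity of an N x N fp32 matrix multiply:
    # 2*N^3 flops vs. 3*N^2 * 4 bytes moved (read A and B, write C,
    # assuming ideal caching), so intensity grows linearly with N.
    def matmul_arithmetic_intensity(n: int, bytes_per_elem: int = 4) -> float:
        flops = 2 * n**3                         # one mul + one add per inner-product term
        bytes_moved = 3 * n**2 * bytes_per_elem  # A and B in, C out
        return flops / bytes_moved

    for n in [64, 512, 4096]:
        print(n, matmul_arithmetic_intensity(n))  # ~10.7, ~85.3, ~682.7 flops/byte

Since a modern GPU delivers only on the order of tens of flops per byte of DRAM bandwidth, even modestly sized matrix multiplies are limited by compute rather than memory.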

6 Volta GPU. Single instruction to perform 2x4x4x4 + 4x4 ops. Each SM core has: 64 fp32 ALUs (mul-add), 32 fp64 ALUs, 8 tensor cores. A tensor core executes a 4x4 matrix mul-add instruction D = A B + C for 4x4 matrices A, B, C, D, with A, B stored as fp16 and accumulation performed in fp32. There are 80 SM cores in the GV100 GPU: 5,120 fp32 mul-add ALUs, 640 tensor cores, 6 MB of L2 cache, 1.5 GHz max clock = 15.7 TFLOPs fp32 = 125 TFLOPs (fp16/32 mixed) in the tensor cores
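A minimal NumPy sketch (my own, emulating only the arithmetic, not the hardware) of the mixed-precision semantics of one tensor-core instruction:

    import numpy as np

    # One Volta tensor-core op per instruction: D = A*B + C on 4x4 tiles,
    # with A and B quantized to fp16 and all accumulation carried in fp32.
    def tensor_core_mma(A, B, C):
        assert A.shape == B.shape == C.shape == (4, 4)
        a16 = A.astype(np.float16)   # operands rounded to fp16
        b16 = B.astype(np.float16)
        # upcast before the matmul so NumPy accumulates at fp32,
        # modeling the fp32 accumulators inside the tensor core
        return a16.astype(np.float32) @ b16.astype(np.float32) + C.astype(np.float32)

    # 64 multiplies + 64 adds for A*B, plus 16 adds for +C:
    # 144 ops per instruction (the slide's 2x4x4x4 + 4x4)
    D = tensor_core_mma(np.eye(4), np.ones((4, 4)), np.zeros((4, 4)))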

7 Google TPU (version 1)

8 Discussion: workloads. What did the TPU paper state about the characteristics of modern DNN workloads at Google?

9 Google's TPU. Figure credit: Jouppi et al.

10 TPU area proportionality. Compute ~ 30% of chip. Note the low area footprint of control. Key instructions: read host memory, write host memory, read weights, matrix_multiply / convolve, activate. Figure credit: Jouppi et al.

11-16 Systolic array (matrix-vector multiplication example: y = Wx). [Six animation frames of one figure: a 4x4 grid of PEs holds weights w00..w33 loaded from the weights FIFO and kept stationary; inputs x0..x3 enter from the left in staggered fashion, one per cycle; each PE multiplies its stationary weight by the passing input and forwards the running partial sum (e.g., x0*w00 + x1*w01 + x2*w02 + x3*w03) toward the 32-bit accumulators, so once the pipeline fills, one finished dot product drains per cycle.]

17 Systolic array (matrix-matrix multiplication example: Y = WX). [Animation frame: the columns of X (x00..x33) stream through the same stationary weights back-to-back, so partial sums for several output columns are in flight at once.] Notice: need multiple 4x32-bit accumulators to hold the output columns.
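A minimal cycle-style sketch (my own, not TPU RTL) of the weight-stationary schedule the animation steps through; PE (i, j) holds w_ij and consumes x_j at cycle i + j:

    import numpy as np

    def systolic_matvec(W, x):
        n = W.shape[0]
        acc = np.zeros(n, dtype=np.int32)   # 32-bit accumulators
        # In hardware all PEs fire concurrently; serially we can model
        # the same schedule: at cycle t, PE (i, j) with t == i + j
        # consumes x[j] and adds w[i, j] * x[j] to row i's partial sum.
        for t in range(2 * n - 1):          # pipeline takes 2n-1 cycles to drain
            for i in range(n):
                j = t - i
                if 0 <= j < n:
                    acc[i] += W[i, j] * x[j]
        return acc

    W = np.arange(16, dtype=np.int32).reshape(4, 4)
    x = np.array([1, 2, 3, 4], dtype=np.int32)
    assert (systolic_matvec(W, x) == W @ x).all()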

18-21 Building larger matrix-matrix multiplies. [Four animation frames stepping through the computation.] Example: A = 8x8, B = 8x4096, C = 8x4096; compute C = AB. Assume 4096 accumulators.
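A minimal sketch of the tiling idea (assumed detail: A is loaded once as the stationary operand while the 4096 columns of B stream through, each draining into its own accumulator column, so the array never idles reloading A):

    import numpy as np

    def stream_columns(A, B):
        rows, cols = A.shape[0], B.shape[1]
        C = np.zeros((rows, cols), dtype=np.float64)  # one accumulator column per output column
        for j in range(cols):                         # columns of B stream back-to-back
            C[:, j] = A @ B[:, j]                     # the systolic array does this part
        return C

    A = np.random.randn(8, 8)
    B = np.random.randn(8, 4096)
    assert np.allclose(stream_columns(A, B), A @ B)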

22 TPU Perf/Watt. GM = geometric mean over all apps; WM = weighted mean over all apps; total = cost of host machine + TPU; incremental = only the cost of the TPU. Figure credit: Jouppi et al.

23 Alternative scheduling strategies. The TPU was weight stationary (weights kept in a register at each PE). [Fig. 8. Dataflows for DNNs: (a) Weight Stationary, (b) Output Stationary, (c) No Local Reuse.] Figure credit: Sze et al.
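To make the taxonomy concrete, here is a loop-nest sketch (my own, following Sze et al.'s naming) of the first two dataflows for b = W a; the arithmetic is identical, and the only difference is which operand stays pinned in a PE register across the inner loop:

    def weight_stationary(W, a, b):
        # Each weight is fetched once into a PE register and reused
        # against every input it touches (TPU-style); partial sums
        # travel to shared accumulators.
        for i in range(len(b)):
            for j in range(len(a)):
                w_reg = W[i][j]          # weight held in the PE register
                b[i] += w_reg * a[j]     # partial sum leaves the PE each step

    def output_stationary(W, a, b):
        # Each partial sum stays in a PE register until fully reduced;
        # weights and activations stream past it.
        for i in range(len(b)):
            acc = 0                      # output held in the PE register
            for j in range(len(a)):
                acc += W[i][j] * a[j]
            b[i] = acc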

24 EIE: targeting sparsified networks

25 Sparse, weight-sharing fully-connected layer. A fully-connected layer is a matrix-vector multiplication of the activation vector a against the weight matrix W: b_i = ReLU( sum_{j=0}^{n-1} W_ij a_j ). Sparse, weight-sharing representation: b_i = ReLU( sum_{j in X_i intersect Y} S[I_ij] a_j ), where S[] = table of shared weight values, I_ij = index (into S) for weight W_ij, X_i = list of non-zero indices in row i of W, and Y = list of non-zero indices in a. Note: activations are sparse due to ReLU. To exploit this sparsity, the weight matrix W is stored in a variant of compressed sparse column (CSC) format [24]: for each column W_j, the non-zero entries are stored in a vector v, together with an equal-length vector z that encodes the number of zeros before the corresponding entry as a four-bit value; a run of more than 15 zeros is broken up with a padding zero entry. For example, the column [0, 0, 1, 2, 0, ..., 0, 3] is encoded as v = [1, 2, 0, 3], z = [2, 0, 15, 2]. The v and z of all columns are stored in one large pair of arrays, with a pointer vector p pointing to the beginning of each column's vector; the final entry in p points one beyond the last element, so the number of non-zeros in column j is p_{j+1} - p_j.
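A minimal sketch of the relative-indexed encoding described above (matching the paper's v/z example; the 4-bit limit forces a padding zero after a run of more than 15 zeros):

    def encode_column(col):
        v, z, zeros = [], [], 0
        for x in col:
            if x == 0:
                zeros += 1
                if zeros == 16:                # a 4-bit zero count would overflow
                    v.append(0); z.append(15)  # emit a padding zero entry
                    zeros = 0
            else:
                v.append(x); z.append(zeros)
                zeros = 0
        return v, z

    col = [0, 0, 1, 2] + [0] * 18 + [3]
    assert encode_column(col) == ([1, 2, 0, 3], [2, 0, 15, 2])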

26 Efficient Inference Engine (EIE) ASIC. Custom hardware to decode and evaluate sparse, compressed DNNs. Hardware represents the weight matrix in compressed sparse column (CSC) format to exploit sparsity in activations:

for each nonzero a_j in a:
    for each nonzero M_ij in column M_j:
        b_i += M_ij * a_j

More detailed version:

int16* a;            // dense activation vector (scanned for non-zeros)
PTR*   M_start;      // start offset of each column j
int4*  M_values;     // 4-bit weight cluster indices, all columns packed
int4*  M_indices;    // 4-bit relative row indices
int16* lookup;       // lookup table for cluster (shared weight) values

for j = 0 to length(a):
    if (a[j] == 0) continue;                   // scan to next non-zero activation
    col_values   = &M_values[M_start[j]];
    col_indices  = &M_indices[M_start[j]];
    col_nonzeros = M_start[j+1] - M_start[j];
    i = 0;
    for i_count = 0 to col_nonzeros:
        i += col_indices[i_count];             // relative index -> absolute row
        b[i] += lookup[col_values[i_count]] * a[j];

* Keep in mind there's a unique lookup table for each chunk of matrix values
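The same kernel as a runnable sketch (names and test data are mine, not EIE's):

    import numpy as np

    # values[start[j]:start[j+1]] hold 4-bit cluster ids for column j;
    # indices[...] hold row offsets relative to the previous non-zero.
    def eie_spmv(a, start, values, indices, lookup, n_rows):
        b = np.zeros(n_rows, dtype=np.float32)
        for j in range(len(a)):
            if a[j] == 0:
                continue                        # skip zero activations entirely
            row = 0
            for k in range(start[j], start[j + 1]):
                row += indices[k]               # relative -> absolute row index
                b[row] += lookup[values[k]] * a[j]
        return b

    # W = [[0, 3], [5, 0]] with shared-value table lookup = [3, 5]:
    # column 0 has one non-zero (row 1, value 5); column 1 has one (row 0, value 3)
    b = eie_spmv(a=[2, 4], start=[0, 1, 2], values=[1, 0],
                 indices=[1, 0], lookup=[3, 5], n_rows=2)
    assert (b == [12, 10]).all()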

27 Parallelization of the sparse matrix-vector product. Stride the rows of the matrix across processing elements (PE0..PE3), so output activations are likewise strided across PEs. [Figure: a 16x8 sparse weight matrix W times sparse activation vector a (a0, 0, a2, 0, a4, a5, 0, a7), followed by ReLU, with rows color-coded by owning PE.] Weights are stored local to the PEs. Non-zero a_j's must be broadcast to all PEs; accumulation of each output b_i is local to one PE.
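A minimal sketch of the row-interleaved partitioning (PE count and data are mine): PE p owns every row i with i mod n_pes == p, so each non-zero a_j is broadcast to all PEs, but each b_i is accumulated by exactly one PE with no cross-PE reduction:

    import numpy as np

    def spmv_striped(W, a, n_pes=4):
        b = np.zeros(W.shape[0], dtype=W.dtype)
        for j in np.nonzero(a)[0]:              # broadcast each non-zero a_j
            for pe in range(n_pes):             # PEs would run concurrently
                rows = np.arange(pe, W.shape[0], n_pes)   # rows owned by this PE
                b[rows] += W[rows, j] * a[j]    # accumulation stays local to the PE
        return b

    W = np.random.randint(-2, 3, (16, 8))
    a = np.array([1, 0, 2, 0, 0, 3, 0, 0])
    assert (spmv_striped(W, a) == W @ a).all()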

28 EIE unit for quantized sparse matrix-vector product. A tuple representing a non-zero activation (a_j, j) arrives and is enqueued in the activation queue. [Figure (b): pipeline stages — Act Queue (act index / act value), Leading Non-Zero Detect, Even/Odd Pointer SRAM Banks (pointer read: column start/end address), Sparse Matrix SRAM (sparse matrix access: encoded weight + relative index), Weight Decoder, Address Accumulator (relative index -> absolute address), Arithmetic Unit with Accumulator Bypass, Dest/Src Act Regs, Act SRAM R/W, ReLU.]

29 EIE Efficiency. [Figure 6: speedups of CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed, and EIE, relative to a CPU Dense baseline running the uncompressed DNN model, for layers Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, NT-LSTM and their geometric mean; no batching in any case. EIE's speedups run to two and three orders of magnitude.] [Figure 7: energy efficiency of the same platforms over the same baseline; EIE's advantage reaches four and five orders of magnitude on some layers.] Platforms: CPU = Intel Core i7-5930k, GPU = NVIDIA GTX Titan X, mGPU = NVIDIA Tegra K1. EIE energy numbers were obtained by annotating the toggle rate from RTL simulation onto the gate-level netlist (SAIF) and estimating power with Prime-Time PX. [Table III: the benchmark layers are fully-connected layers from compressed AlexNet and VGG-16, plus NeuralTalk layers, with per-layer weight density, activation density, and FLOP ratio.] Warning: these are not end-to-end results: just fully-connected layers! Sources of energy savings: - Compression allows all weights to be stored in SRAM (few DRAM loads) - Low-precision 16-bit fixed-point math (5x more efficient than 32-bit fixed math) - Skip math on input activations that are zero (65% less math)

30 Thoughts. The EIE paper highlights performance on fully-connected layers (see graph above) - final layers of networks like AlexNet and VGG - common in recurrent network topologies like LSTMs. But many state-of-the-art image processing networks have moved to fully convolutional solutions - recall Inception, SqueezeNet, etc.

31 Summary of hardware techniques. Specialized datapaths for dense linear algebra computations - reduce the overhead of control (compared to CPUs/GPUs). Reduced precision (computation and storage). Exploit sparsity. Accelerate decompression.
