Hardware Acceleration of DNNs
1 Lecture 12: Hardware Acceleration of DNNs. Visual Computing Systems. Stanford CS348V, Winter 2018
2 Hardware acceleration for DNNs: Huawei Kirin NPU, Google TPU, Apple Neural Engine, Intel Lake Crest Deep Learning Accelerator, MIT Eyeriss, Volta GPU with Tensor Cores. Slide credit: Xuan Yang
3 And many more. Image credit: Shan Tang
4 Modern NVIDIA GPU (Volta)
5 Recall properties of GPUs: compute rich, packed densely with processing elements - Good for compute-bound applications. Good, because dense-matrix multiplication and DNN convolutional layers (when implemented properly) are compute bound. But also remember the cost of instruction stream processing and control in a programmable processor. Note: these figures are estimates for a CPU. [Figure credit: Eric Chung]
6 Volta GPU: single instruction to perform 2x4x4x4 + 4x4 ops. Each SM core has: 64 fp32 ALUs (mul-add), 32 fp64 ALUs, 8 tensor cores. Each tensor core executes a 4x4 matrix mul-add instruction D = A x B + C for 4x4 matrices A, B, C, with A and B stored as fp16 and accumulation in fp32. There are 80 SM cores in the GV100 GPU: 5,120 fp32 mul-add ALUs, 640 tensor cores, 6 MB of L2 cache, 1.5 GHz max clock = 15.7 TFLOPs fp32 = 125 TFLOPs (fp16/32 mixed) in tensor cores
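The tensor-core numerics above can be sketched in NumPy. This is a model of the arithmetic only (fp16 operands, fp32 accumulation), not of the hardware datapath, and `tensor_core_mma` is an invented name:

```python
import numpy as np

def tensor_core_mma(A, B, C):
    """Model of one tensor-core op: D = A @ B + C on 4x4 matrices,
    with A, B rounded to fp16 and all 64 multiply-adds accumulated
    in fp32 (numerics only, not the hardware datapath)."""
    assert A.shape == B.shape == C.shape == (4, 4)
    A16 = A.astype(np.float16)
    B16 = B.astype(np.float16)
    D = C.astype(np.float32).copy()
    for i in range(4):
        for j in range(4):
            for k in range(4):
                # each product of fp16 operands is accumulated at fp32
                D[i, j] += np.float32(A16[i, k]) * np.float32(B16[k, j])
    return D

A = np.arange(16, dtype=np.float32).reshape(4, 4) / 16  # values exactly representable in fp16
B = np.eye(4, dtype=np.float32)
C = np.ones((4, 4), dtype=np.float32)
D = tensor_core_mma(A, B, C)
```

With B the identity and C all ones, D is simply A + 1; the interesting part is that the rounding of A and B to fp16 happens before the multiply, while the running sum never leaves fp32.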
7 Google TPU (version 1)
8 Discussion: workloads. What did the TPU paper state about characteristics of modern DNN workloads at Google?
9 Google's TPU. Figure credit: Jouppi et al.
10 TPU area proportionality: compute is ~30% of the chip. Note the low area footprint of control. Key instructions: read host memory, write host memory, read weights, matrix_multiply / convolve, activate. Figure credit: Jouppi et al.
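The five key instructions suggest how a host might sequence one fully-connected layer. This sketch is entirely hypothetical: `FakeTPU` and its method names are invented stand-ins for the CISC instruction stream Jouppi et al. describe, used here only to show a plausible ordering:

```python
class FakeTPU:
    """Records the issued instruction stream; stands in for the device."""
    def __init__(self):
        self.trace = []
    def read_weights(self, w):      self.trace.append("read_weights")
    def read_host_memory(self, a):  self.trace.append("read_host_memory")
    def matrix_multiply(self):      self.trace.append("matrix_multiply")
    def activate(self):             self.trace.append("activate")
    def write_host_memory(self):
        self.trace.append("write_host_memory")
        return None

def run_fc_layer(tpu, weights, activations):
    """One fully-connected layer as a plausible instruction sequence."""
    tpu.read_weights(weights)          # stream weights toward the matrix unit
    tpu.read_host_memory(activations)  # DMA input activations on chip
    tpu.matrix_multiply()              # push activations through the systolic array
    tpu.activate()                     # apply the nonlinearity to the accumulators
    return tpu.write_host_memory()     # DMA results back to the host

tpu = FakeTPU()
run_fc_layer(tpu, weights=None, activations=None)
```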
11-16 Systolic array (matrix-vector multiplication example: y = Wx). Animation over six slides: a Weights FIFO loads a 4x4 grid of stationary weights w00..w33; activations x0..x3 enter skewed in time; partial sums (e.g. x0 * w00 + x1 * w01 + x2 * w02 + x3 * w03) build up as they pass through the array and land in 32-bit accumulators.
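The matrix-vector dataflow animated above can be simulated cycle by cycle. A minimal sketch, assuming a weight-stationary array in which PE[r][c] holds W[c][r] (so column c of the array computes y[c]), activations enter the left edge skewed in time, and partial sums march down each column into the accumulators:

```python
import numpy as np

def systolic_matvec(W, x):
    """Cycle-level simulation of a weight-stationary systolic array
    computing y = W x (a sketch of the dataflow, not the exact TPU
    microarchitecture)."""
    n = W.shape[0]
    pe_w = W.T.copy()            # stationary weights: PE[r][c] holds W[c][r]
    x_reg = np.zeros((n, n))     # activation latch in each PE
    psum = np.zeros((n, n))      # partial-sum latch in each PE
    acc = np.zeros(n)            # accumulators under each column
    for t in range(2 * n):
        acc += psum[n - 1]       # drain whatever reached the bottom row
        new_x = np.zeros_like(x_reg)
        new_psum = np.zeros_like(psum)
        for r in range(n):
            for c in range(n):
                # activation from the left neighbour (skewed injection at the edge)
                xin = (x[r] if t == r else 0.0) if c == 0 else x_reg[r][c - 1]
                # partial sum from the PE above (zero at the top edge)
                pin = psum[r - 1][c] if r > 0 else 0.0
                new_x[r][c] = xin
                new_psum[r][c] = pin + pe_w[r][c] * xin
        x_reg, psum = new_x, new_psum
    return acc + psum[n - 1]     # final drain

W = np.array([[ 1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.],
              [ 9., 10., 11., 12.],
              [13., 14., 15., 16.]])
x = np.array([1., 2., 3., 4.])
y = systolic_matvec(W, x)
```

Output y[c] exits the bottom of column c at cycle (n-1)+c, which is why the inputs must be skewed: every partial sum meets exactly one new activation per hop.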
17 Systolic array (matrix-matrix multiplication example: Y = WX). Multiple columns of X (x00..x33) stream through the same stationary weights, skewed in time, each accumulating partial sums such as x00 * w10 + x01 * w11 + x02 * w12. Notice: need multiple 4x32-bit accumulators to hold output columns.
18-21 Building larger matrix-matrix multiplies (animation over four slides). Example: A = 8x8, B = 8x4096, C = A x B = 8x4096. Assume 4096 accumulators: the columns of B stream through the array while A stays resident.
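The strategy above generalizes to the usual tiled multiply: break the operands into array-sized tiles, run each tile pair through the matrix unit, and let the accumulators collect partial products across the reduction dimension. A minimal sketch (the tile size and names are illustrative, and `@` stands in for one pass through the systolic array):

```python
import numpy as np

def tiled_matmul(A, B, tile=8):
    """C = A @ B built from tile x tile matrix-multiply primitives,
    with a bank of accumulators holding every output element, as in
    the slide's example (A = 8x8, B = 8x4096, 4096 accumulator columns)."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))                       # the accumulators
    for kk in range(0, k, tile):               # reduction dimension
        for i in range(0, n, tile):
            for j in range(0, m, tile):
                # one 'array pass': tile-sized mul-add into the accumulators
                C[i:i+tile, j:j+tile] += A[i:i+tile, kk:kk+tile] @ B[kk:kk+tile, j:j+tile]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 4096))
C = tiled_matmul(A, B)
```

With A only 8x8 the reduction loop runs once; for a larger A the accumulators are what let successive kk-tiles add into the same outputs without spilling them off chip.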
22 TPU Perf/Watt. GM = geometric mean over all apps; WM = weighted mean over all apps; total = cost of host machine + TPU; incremental = only cost of TPU. Figure credit: Jouppi et al.
23 Alternative scheduling strategies. The TPU was weight stationary (weights kept in registers at each PE). (a) Weight Stationary, (b) Output Stationary, (c) No Local Reuse. Fig. 8: Dataflows for DNNs. Figure credit: Sze et al.
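What "stationary" means is easiest to see as a loop nest. A sketch for Y = WX over a batch of inputs (function names invented): weight stationary pins each weight in a PE register and reuses it across the whole batch before replacing it, while output stationary pins each partial sum; the no-local-reuse dataflow would keep neither in PE registers.

```python
import numpy as np

def matmul_weight_stationary(W, X):
    """Weight stationary: load each weight once, reuse it against
    every input column in the batch before moving on."""
    n, k = W.shape
    _, batch = X.shape
    Y = np.zeros((n, batch))
    for i in range(n):
        for j in range(k):
            w = W[i, j]                  # weight held in the PE register
            for b in range(batch):       # reused for every input in the batch
                Y[i, b] += w * X[j, b]
    return Y

def matmul_output_stationary(W, X):
    """Output stationary: each Y[i, b] accumulates in a local register
    while weights and activations stream past it."""
    n, k = W.shape
    _, batch = X.shape
    Y = np.zeros((n, batch))
    for i in range(n):
        for b in range(batch):
            acc = 0.0                    # partial sum stays put in a register
            for j in range(k):
                acc += W[i, j] * X[j, b]
            Y[i, b] = acc
    return Y

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 5))
X = rng.standard_normal((5, 3))
```

Both nests compute the same Y = WX; what changes is which operand stays in the innermost register, and therefore which memory traffic is amortized.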
24 EIE: targeting sparsified networks
25 Sparse, weight-sharing fully-connected layer. Dense fully-connected layer: b_i = ReLU(sum over j of W_ij * a_j). Sparse, weight-sharing representation: b_i = ReLU(sum over j in (X_i intersect Y) of S[I_ij] * a_j), where S[] = table of shared weight values, I_ij = index for weight W_ij, X_i = list of non-zero indices in row i, Y = list of non-zero indices in a. Note: activations are sparse due to ReLU. To exploit sparsity, each sparse weight matrix W is stored in a variant of compressed sparse column (CSC) format: for each column W_j, a vector v contains the non-zero weights, and an equal-length vector z encodes the number of zeros before the corresponding entry in v, each represented by a four-bit value. For example, a column beginning [0, 0, 1, 2, 0, ...] is encoded as v = [1, 2, 0, 3], z = [2, 0, ...]. The v and z of all columns are stored in one large pair of arrays, with a pointer vector p giving the start of each column's vectors; the final entry in p points one beyond the last element, so the number of non-zeros in column j is p_{j+1} - p_j.
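The (v, z) encoding can be sketched directly. This assumes the convention above, where z counts the zeros before each non-zero and is stored in 4 bits; the padding-zero trick for long runs is what allows v itself to contain zeros:

```python
def encode_column(col, zbits=4):
    """Encode one sparse column as (v, z): v holds the non-zero values,
    z the number of zeros before each one. A run of zeros longer than
    2**zbits - 1 is broken with an explicit padding zero in v."""
    zmax = (1 << zbits) - 1
    v, z, run = [], [], 0
    for entry in col:
        if entry == 0:
            run += 1
            if run > zmax:       # counter would overflow:
                v.append(0)      # emit a padding zero value
                z.append(zmax)
                run = 0
        else:
            v.append(entry)
            z.append(run)
            run = 0
    return v, z

def decode_column(v, z, length):
    """Inverse of encode_column, for checking round trips."""
    col, pos = [0] * length, -1
    for value, zeros in zip(v, z):
        pos += zeros + 1
        col[pos] = value
    return col

v, z = encode_column([0, 0, 1, 2, 0, 0, 3])  # -> v = [1, 2, 3], z = [2, 0, 2]
```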
26 Efficient Inference Engine (EIE) ASIC: custom hardware to decode and evaluate sparse, compressed DNNs. Hardware represents the weight matrix in compressed sparse column (CSC) format to exploit sparsity in activations:

for each nonzero a_j in a:
  for each nonzero M_ij in column M_j:
    b_i += M_ij * a_j

More detailed version:

int16* a_values;
PTR* M_j_start;   // start of column j
int4* M_j_values; // 4-bit cluster indices
int4* M_j_indices; // 4-bit relative row indices
int16* lookup;    // lookup table for cluster values

for j = 0 to length(a):
  if (a_values[j] == 0) continue; // scan to next nonzero activation
  col_values = &M_j_values[M_j_start[j]];
  col_indices = &M_j_indices[M_j_start[j]];
  col_nonzeros = M_j_start[j+1] - M_j_start[j];
  i = 0;
  for i_count = 0 to col_nonzeros:
    i += col_indices[i_count];  // relative index -> absolute row
    b[i] += lookup[col_values[i_count]] * a_values[j];

* Keep in mind there's a unique lookup table for each chunk of matrix values
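A runnable version of that inner loop, assuming the relative index counts the zeros preceding each entry within its column (the function and variable names are illustrative, and a single lookup table is used for simplicity even though the hardware has one per chunk of matrix values):

```python
import numpy as np

def eie_spmv(a, col_ptr, values, indices, lookup, n_rows):
    """Sketch of the EIE inner loop in software: CSC matrix with
    cluster indices in `values`, relative row indices in `indices`,
    shared weight values in `lookup`."""
    b = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0:
            continue                  # skip zero activations entirely
        start, end = col_ptr[j], col_ptr[j + 1]
        i = -1                        # absolute row index, rebuilt from
        for k in range(start, end):   # the relative indices
            i += indices[k] + 1       # indices[k] zeros precede this entry
            b[i] += lookup[values[k]] * aj
    return b

lookup  = [0.0, 1.5, -2.0, 0.5]  # shared weight values (hypothetical)
col_ptr = [0, 2, 2, 3]           # column j spans values[col_ptr[j]:col_ptr[j+1]]
values  = [2, 1, 3]              # cluster indices into `lookup`
indices = [1, 1, 0]              # zeros before each entry in its column
a = [2.0, 0.0, 4.0]
b = eie_spmv(a, col_ptr, values, indices, lookup, n_rows=4)
```

Column 1 is empty and a_1 is zero, so both the sparse storage and the activation skip do real work even in this toy case.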
27 Parallelization of sparse matrix-vector product. Stride rows of the matrix across processing elements (PE0-PE3); output activations are likewise strided across processing elements. Example in the figure: a 16x8 sparse matrix W times a sparse input a = [a0, 0, a2, 0, a4, a5, 0, a7], producing b0..b15, followed by ReLU (which zeroes several outputs). Weights stored local to PEs. Must broadcast non-zero a_j's to all PEs. Accumulation of each output b_i is local to its PE.
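The PE partitioning can be sketched as follows; this models only the scheduling (each PE owns every n_pes-th row, non-zero activations are broadcast, accumulation stays local), not the hardware:

```python
import numpy as np

def spmv_striped(W, a, n_pes=4):
    """Rows of W (and outputs b) strided across PEs; each non-zero a_j
    is broadcast to all PEs, and every accumulation into b_i happens
    on the PE that owns row i."""
    n_rows = W.shape[0]
    # PE p owns rows p, p + n_pes, p + 2*n_pes, ...
    local_b = [np.zeros(len(range(pe, n_rows, n_pes))) for pe in range(n_pes)]
    for j, aj in enumerate(a):
        if aj == 0:
            continue                              # only non-zeros are broadcast
        for pe in range(n_pes):                   # the broadcast
            for slot, i in enumerate(range(pe, n_rows, n_pes)):
                if W[i, j] != 0:                  # only stored weights do work
                    local_b[pe][slot] += W[i, j] * aj
    # gather the strided outputs back into b
    b = np.zeros(n_rows)
    for pe in range(n_pes):
        b[pe::n_pes] = local_b[pe]
    return b

rng = np.random.default_rng(1)
W = rng.integers(-2, 3, size=(16, 8)).astype(float)
a = np.array([1., 0., 2., 0., 3., 1., 0., 2.])
b = spmv_striped(W, a)
```

Striding (rather than blocking) the rows helps balance load when non-zeros cluster, though PEs can still idle when one column's non-zeros fall unevenly across owners.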
28 EIE unit for quantized sparse matrix-vector product. A tuple representing a non-zero activation (a_j, j) arrives and is enqueued in the Act Queue (Act Index / Act Value). Datapath blocks: Leading NZero Detect; Even/Odd Ptr SRAM Banks (Pointer Read: Col Start/End Addr); Sparse Matrix SRAM (Sparse Matrix Access: Encoded Weight, Relative Index); Weight Decoder; Address Accum (Relative Index to Absolute Address); Bypass; Arithmetic Unit; Dest/Src Act Regs; Act SRAM (Act R/W); ReLU.
29 EIE Efficiency.
Figure 6 (Han et al.): speedups of GPU, mobile GPU, and EIE compared with a CPU running the uncompressed DNN model; no batching in all cases. Figure 7: energy efficiency of the same configurations.
Comparison baselines, three off-the-shelf computing units: CPU = Intel Core i7-5930k (a Haswell-E class processor, used in the NVIDIA DIGITS Deep Learning Dev Box), GPU = GTX Titan X, mGPU = Tegra K1. EIE energy was estimated by dumping gate-level switching activity (SAIF) from the RTL netlist and running Prime-Time PX.
Table III benchmarks: Alex-6/7/8 (compressed AlexNet for large-scale image classification), VGG-6/7/8 (compressed VGG-16), NT-We, NT-Wd, NT-LSTM, with per-layer weight/activation/FLOP densities.
Sources of energy savings:
- Compression allows all weights to be stored in SRAM (few DRAM loads)
- Low-precision 16-bit fixed-point math (5x more efficient than 32-bit fixed math)
- Skip math on input activations that are zero (65% less math)
Warning: these are not end-to-end results, just fully connected layers!
30 Thoughts. The EIE paper highlights performance on fully connected layers (see graph above) - Final layers of networks like AlexNet, VGG - Common in recurrent network topologies like LSTMs. But many state-of-the-art image processing networks have moved to fully convolutional solutions - Recall Inception, SqueezeNet, etc.
31 Summary of hardware techniques. Specialized datapaths for dense linear algebra computations - Reduce overhead of control (compared to CPUs/GPUs). Reduced precision (computation and storage). Exploit sparsity. Accelerate decompression.
More informationMIXED PRECISION TRAINING ON VOLTA GPUS
MIXED PRECISION TRAINING ON VOLTA GPUS Boris Ginsburg, Paulius Micikevicius, Oleksii Kuchaiev, Sergei Nikolaev bginsburg, pauliusm, okuchaiev, snikolaev@nvidia.com 10/17/2017 ACKNOWLEDGMENTS Michael Houston,
More informationLinear Algebra and Eigenproblems
Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details
More informationLecture 23: Illusiveness of Parallel Performance. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 23: Illusiveness of Parallel Performance James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L23 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping peel
More informationLecture 2: Metrics to Evaluate Systems
Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video
More informationTYPES OF MODEL COMPRESSION. Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad
TYPES OF MODEL COMPRESSION Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad 1. Pruning 2. Quantization 3. Architectural Modifications PRUNING WHY PRUNING? Deep Neural Networks have redundant parameters.
More informationLogic. Basic Logic Functions. Switches in series (AND) Truth Tables. Switches in Parallel (OR) Alternative view for OR
TOPIS: Logic Logic Expressions Logic Gates Simplifying Logic Expressions Sequential Logic (Logic with a Memory) George oole (85-864), English mathematician, oolean logic used in digital computers since
More information<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)
Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation
More informationFaster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine
More informationDigital Integrated Circuits A Design Perspective
Semiconductor Memories Adapted from Chapter 12 of Digital Integrated Circuits A Design Perspective Jan M. Rabaey et al. Copyright 2003 Prentice Hall/Pearson Outline Memory Classification Memory Architectures
More informationLecture 4: Linear Algebra 1
Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation
More informationSlide credit from Hung-Yi Lee & Richard Socher
Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step
More informationDistributed Machine Learning: A Brief Overview. Dan Alistarh IST Austria
Distributed Machine Learning: A Brief Overview Dan Alistarh IST Austria Background The Machine Learning Cambrian Explosion Key Factors: 1. Large s: Millions of labelled images, thousands of hours of speech
More informationEE141-Fall 2011 Digital Integrated Circuits
EE4-Fall 20 Digital Integrated Circuits Lecture 5 Memory decoders Administrative Stuff Homework #6 due today Project posted Phase due next Friday Project done in pairs 2 Last Lecture Last lecture Logical
More informationToday. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of
ESE532: System-on-a-Chip Architecture Day 20: November 8, 2017 Energy Today Energy Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators How does parallelism impact energy? 1 2 Message
More informationSystem Data Bus (8-bit) Data Buffer. Internal Data Bus (8-bit) 8-bit register (R) 3-bit address 16-bit register pair (P) 2-bit address
Intel 8080 CPU block diagram 8 System Data Bus (8-bit) Data Buffer Registry Array B 8 C Internal Data Bus (8-bit) F D E H L ALU SP A PC Address Buffer 16 System Address Bus (16-bit) Internal register addressing:
More informationMachine Learning for Signal Processing Neural Networks Continue. Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016
Machine Learning for Signal Processing Neural Networks Continue Instructor: Bhiksha Raj Slides by Najim Dehak 1 Dec 2016 1 So what are neural networks?? Voice signal N.Net Transcription Image N.Net Text
More informationLogic BIST. Sungho Kang Yonsei University
Logic BIST Sungho Kang Yonsei University Outline Introduction Basics Issues Weighted Random Pattern Generation BIST Architectures Deterministic BIST Conclusion 2 Built In Self Test Test/ Normal Input Pattern
More informationLecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets)
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 8: Introduction to Deep Learning: Part 2 (More on backpropagation, and ConvNets) Sanjeev Arora Elad Hazan Recap: Structure of a deep
More informationEECS 579: Logic and Fault Simulation. Simulation
EECS 579: Logic and Fault Simulation Simulation: Use of computer software models to verify correctness Fault Simulation: Use of simulation for fault analysis and ATPG Circuit description Input data for
More information