Analog Computation in Flash Memory for Datacenter-scale AI Inference in a Small Chip


Slide 1: Analog Computation in Flash Memory for Datacenter-scale AI Inference in a Small Chip
Dave Fick, CTO/Founder; Mike Henry, CEO/Founder

Slide 2: About Mythic
- Focused on high-performance Edge AI
- Full-stack co-design: from device physics to new algorithms and applications
- Founded in 2012 by Mike Henry and Dave Fick, while working with the Michigan Integrated Circuits Lab (MICL)
- Raised $55M from top-tier investors: DFJ, SoftBank, Lux, DCVC
- Offices in Redwood City & Austin; 60+ employees

Slide 3: DNNs are Largely Multiply-Accumulate
The primary DNN calculation is: input vector * weight matrix = output vector. With input data X_0 ... X_N and neuron weight columns A, B, C, the output equations are:

  Y_A = X_0*A_0 + X_1*A_1 + X_2*A_2 + ... + X_N*A_N
  Y_B = X_0*B_0 + X_1*B_1 + X_2*B_2 + ... + X_N*B_N
  Y_C = X_0*C_0 + X_1*C_1 + X_2*C_2 + ... + X_N*C_N

Key operation: the multiply-accumulate, or MAC.
Figure of merit: how many picojoules to execute a MAC?
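The matrix-vector product above can be sketched in a few lines; the sizes here are small illustrative placeholders, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 4, 3

X = rng.standard_normal(n_inputs)                # input vector X_0 ... X_N
W = rng.standard_normal((n_inputs, n_neurons))   # weight matrix (one column per neuron)

# One matrix-vector product = n_inputs * n_neurons MAC operations.
Y = X @ W

# Equivalent explicit multiply-accumulate loop:
Y_loop = np.zeros(n_neurons)
for j in range(n_neurons):
    for i in range(n_inputs):
        Y_loop[j] += X[i] * W[i, j]   # one MAC

assert np.allclose(Y, Y_loop)
print(n_inputs * n_neurons, "MACs")   # 12 MACs
```

Counting the inner-loop iterations is what makes "picojoules per MAC" a natural figure of merit: the MAC count is fixed by the network, so energy per MAC determines total energy.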

Slide 4: Memory Accesses Include Weight Data and Intermediate Data
The same matrix-vector product involves two kinds of memory traffic: the weight data (the matrix entries A_i, B_i, C_i) and the intermediate data (the layer inputs X_i and the outputs Y_A, Y_B, Y_C).

Slide 5: Intermediate Data Accesses are Naturally Amortized
For a 1,000-input, 1,000-neuron matrix there are 1,000 inputs, 1,000,000 weights, and 1,000 outputs. Intermediate data accesses are amortized 64-1024x, since each value is used in many MAC operations.

Slide 6: Weight Data Accesses are Not Amortized
For the same 1,000-input, 1,000-neuron matrix, each of the 1,000,000 weights participates in only one MAC per inference. Weight data may therefore need to be stored in DRAM, and it does not get the same amortization as the intermediate data.
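The asymmetry between the two slides is simple counting; for the 1,000-input, 1,000-neuron layer:

```python
# Access-count arithmetic for a 1,000-input, 1,000-neuron layer;
# all numbers follow directly from the matrix dimensions.
n_inputs = n_neurons = 1000

macs = n_inputs * n_neurons      # 1,000,000 MAC operations per inference
weights = n_inputs * n_neurons   # 1,000,000 weights

input_reuse = macs // n_inputs   # each input value feeds 1,000 MACs
weight_reuse = macs // weights   # each weight feeds exactly 1 MAC

print(f"input value re-use: {input_reuse}x")   # 1000x
print(f"weight re-use:      {weight_reuse}x")  # 1x
```

So a fetched input value pays for itself a thousand times over, while every weight fetch buys exactly one MAC; that is why weight memory dominates.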

Slide 7: DNN Processing is All About Weight Memory
10+M parameters to store; 20+B memory accesses. How do we achieve high energy efficiency and high performance within an Edge power budget (e.g., 5 W)?

  Network        Weights   MACs/frame   MACs/s @ 30 FPS
  AlexNet [1]    61 M      725 M        22 B
  ResNet-18 [1]  11 M      1.8 B        54 B
  ResNet-50 [1]  23 M      3.5 B        105 B
  VGG-19 [1]     144 M     22 B         660 B
  OpenPose [2]   46 M      180 B        5400 B

  [1] at 224x224 resolution. [2] at 656x368 resolution.

Very hard to fit this in an Edge solution.
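The last column of the table is just the per-frame MAC count scaled to 30 frames per second, which can be reproduced directly:

```python
# Per-frame MAC counts from the slide; the 30 FPS column is MACs/frame * 30.
networks = {                     # name: (weights, MACs per frame)
    "AlexNet":   (61e6, 725e6),
    "ResNet-18": (11e6, 1.8e9),
    "ResNet-50": (23e6, 3.5e9),
    "VGG-19":    (144e6, 22e9),
    "OpenPose":  (46e6, 180e9),
}
fps = 30
for name, (weights, macs) in networks.items():
    rate = macs * fps / 1e9      # billions of MACs per second
    print(f"{name:10s} {rate:7.1f} B MAC/s @ {fps} FPS")
# AlexNet comes out to 21.75 B MAC/s, which the slide rounds to 22 B.
```

At roughly one weight access per MAC, these rates explain the slide's point: hundreds of billions of weight accesses per second do not fit a 5 W edge budget if each access goes to DRAM.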

Slide 8: Common Techniques for Reducing Weight Energy Consumption

Weight re-use:
- Focus on CNNs: re-use weights for multiple windows; can build specialized structures; but not all problems map well to CNNs.
- Focus on large batches: re-use weights for multiple inputs; but Edge is often batch = 1, and batching increases latency.

Weight reduction:
- Shrink the model: use a smaller network that fits on-chip (e.g., SqueezeNet); possibly reduced capability.
- Compress the model: use sparsity to eliminate up to 99% of the parameters, or use literal compression; possibly reduced capability.
- Reduce weight precision: 32-bit floating point => 2-8 bit integer; possibly reduced capability.
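As one concrete instance of the precision-reduction bullet, here is a generic symmetric int8 weight quantizer (a standard sketch, not Mythic's scheme):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization: map the largest |weight| to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()

# float32 -> int8 is a 4x reduction in weight storage and bandwidth.
print(f"storage: {w.nbytes} B -> {q.nbytes} B, max abs error {err:.4f}")
```

The 4x storage saving directly cuts weight-memory traffic, at the cost of a bounded rounding error per weight, which is the "possibly reduced capability" trade-off the slide names.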

Slide 9: Key Question: Use DRAM or Not?

Benefits of DRAM:
- Can fit arbitrarily large models
- Not as much SRAM needed on chip

Drawbacks of DRAM:
- Huge energy cost for reading weights
- Limited bandwidth getting to the weight data
- Variable energy efficiency & performance depending on the application

Slide 10: Common NN Accelerator Design Points

               Enterprise    Enterprise   Edge         Edge
               with DRAM     no DRAM      with DRAM    no DRAM
  SRAM         <50 MB        100+ MB      <5 MB        <5 MB
  DRAM         8+ GB         -            4-8 GB       -
  Power        70+ W         70+ W        3-5 W        1-3 W
  Sparsity     Light         Light        Moderate     Heavy
  Precision    32f/16f/8i    32f/16f/8i   8i           1-8i
  Accuracy     Great         Great        Moderate     Poor
  Performance  High          High         Very Low     Very Low
  Efficiency   25 pJ/MAC     2 pJ/MAC     10 pJ/MAC    5 pJ/MAC

Slide 11: Mythic is Fundamentally Different

               Enterprise    Enterprise   Edge         Edge        Mythic
               with DRAM     no DRAM      with DRAM    no DRAM     NVM
  SRAM         <50 MB        100+ MB      <5 MB        <5 MB       <5 MB
  DRAM         8+ GB         -            4-8 GB       -           -
  Power        70+ W         70+ W        3-5 W        1-3 W       1-5 W
  Sparsity     Light         Light        Moderate     Heavy       None
  Precision    32f/16f/8i    32f/16f/8i   8i           1-8i        1-8i
  Accuracy     Great         Great        Moderate     Poor        Great
  Performance  High          High         Very Low     Very Low    High
  Efficiency   25 pJ/MAC     2 pJ/MAC     10 pJ/MAC    5 pJ/MAC    0.5 pJ/MAC

Slide 12: Mythic is Fundamentally Different (continued)
The same comparison, with one addition: Mythic achieves this in a 40 nm process, compared to the 7/10/16 nm processes of the other design points.

Slide 13: Mythic's New Architecture Merges Enterprise and Edge
Mythic introduces the Matrix Multiplying Memory: the weights are never read out. This effectively makes weight memory access energy-free (you only pay for the MAC), and it eliminates the need for:
- Batch > 1
- A CNN-only focus
- Sparsity or compression
- Nerfed DNN models
Made possible with Mixed-Signal Computing on embedded flash.

Slide 14: Revisiting Matrix Multiply
The primary DNN calculation is again: input vector * weight matrix = output vector, with Y_A = X_0*A_0 + X_1*A_1 + X_2*A_2 + ..., and likewise for Y_B and Y_C. This time, the weight matrix is implemented with flash transistors.

Slide 15: Analog Circuits Give Us the MAC We Need
Flash transistors can be modeled as variable resistors whose resistance represents the weight. Ohm's law (I = V/R) provides the multiply, and summing the currents on each column wire provides the accumulate:
- Inputs (X) = DACs driving voltages V_0, V_1, V_2
- Weights (R) = flash transistors R_A0 ... R_C2
- Outputs (Y) = ADCs reading the column currents
The ADCs convert current into digital codes, and provide the non-linearity needed for the DNN.
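An idealized numerical model of that circuit shows why it computes exactly the matrix-vector product: with each cell treated as a conductance G = 1/R, the current summed on column j is I_j = sum_i V_i * G[i, j] (Ohm's law plus Kirchhoff's current law). The voltage and conductance ranges below are illustrative placeholders, not device values from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.uniform(0.0, 1.0, size=3)          # DAC output voltages (the X inputs)
G = rng.uniform(1e-6, 1e-5, size=(3, 3))   # cell conductances 1/R (the weights)

# Column currents as an ideal ADC would see them: I = V^T G.
I = V @ G

# This is exactly the DNN layer's matrix-vector product.
expected = [sum(V[i] * G[i, j] for i in range(3)) for j in range(3)]
assert np.allclose(I, expected)
print(I)
```

The model ignores wire resistance, device mismatch, and ADC quantization; those non-idealities are what make the real mixed-signal design hard, but the underlying math is this one line.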

Slide 16: DACs & ADCs Give Us a Flexible Architecture
Because the inputs and outputs are digital, we have a digital top-level architecture:
- Interconnect
- Intermediate data storage
- Programmability (XLA/ONNX => Mythic IPU)

Slide 17: To Simplify, We Use Digital Approximation
To improve time-to-market, we have left the input DAC as a future endeavor; we achieve the same result through digital approximation of the input voltages. Silver lining: we have future improvements available.

Slide 18: Mythic Mixed-Signal Computing
A single tile contains:
- Input data path with digital-to-analog conversion
- Weight storage + analog matrix multiplier
- Analog-to-digital conversion of the activations
- SRAM
- SIMD unit
- RISC-V processor
- Router
Made possible with Mixed-Signal Computing on embedded flash.

Slide 19: Mythic Mixed-Signal Computing (continued)
Tiles are connected in an expandable grid through their routers and network connections. An example post-silicon DNN mapping runs scene segmentation, camera enhancement, and object tracking on the same grid.

Slide 20: System Overview: Intelligence Processing Unit (IPU)

Initial product:
- 50M weight capacity
- PCIe 2.1 x4
- Basic control processor

Envisioned customizations (Gen 1):
- Up to 250M weight capacity
- PCIe 2.1 x16; USB 3.0/2.0
- Direct audio/video interfaces
- Enhanced control processor (e.g., ARM)

Slide 21: Mythic is a PCIe Accelerator
The Mythic IPU attaches over PCIe to a host SoC (with its own DRAM): the host sends data, and the IPU returns inference results. The inference model is specified via TensorFlow, Caffe2, or others; the host runs the operating system, applications, interfaces, and the Mythic IPU driver.

Slide 22: We Also Support Multiple IPUs
The same arrangement scales to multiple Mythic IPUs on one host SoC's PCIe bus, with the same driver and software stack.

Slide 23: We Account For All Energy Consumed
Numbers are for a typical application (e.g., ResNet-50) at batch size = 1. We are relatively application-agnostic, especially compared to DRAM-based systems. 8-bit analog compute accounts for about half of our energy (we can also run at lower precision); control, storage, and PCIe account for the other half.

  Energy (pJ/MAC)
  Analog compute   0.25
  Digital storage  0.10
  PCIe port        0.10
  Control logic    0.05
  Total            0.50
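Summing the breakdown and converting the figure of merit to power makes the later "500 mW per TMAC/s" claim easy to check:

```python
# Energy breakdown from the slide, in picojoules per MAC.
breakdown_pj = {
    "analog compute": 0.25,
    "digital storage": 0.10,
    "PCIe port": 0.10,
    "control logic": 0.05,
}

total_pj = sum(breakdown_pj.values())   # 0.5 pJ/MAC

# Power at a 1 TMAC/s (1e12 MAC/s) operating point:
power_w = total_pj * 1e-12 * 1e12       # pJ/MAC * MAC/s = W

print(f"{total_pj:.2f} pJ/MAC -> {power_w * 1000:.0f} mW per TMAC/s")
# 0.50 pJ/MAC -> 500 mW per TMAC/s
```

The units cancel cleanly: energy per operation times operations per second is power, so 0.5 pJ/MAC and 500 mW/TMAC/s are the same statement.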

Slide 24: Example Application: ResNet-50
GPU performance in an Edge form factor. Running at 224x224 resolution; Mythic numbers estimated, GPU/SoC numbers measured.

Slide 25: Example Application: OpenPose
Running at 656x368 resolution; Mythic numbers estimated, GPU/SoC numbers measured.

Slide 26: Timeline
- Alpha release of software tools and profiler: late 2018
- External samples: mid 2019 (PCIe dev boards with 1 or 4 IPUs)
- Volume shipments: late 2019 (BGA chips; PCIe cards with 1, 4, or 16 IPUs)
Pictured: a four-IPU PCIe card.

Slide 27: Mythic IPU Overview
- Low latency: runs batch size = 1 (e.g., single-frame delay)
- High performance: 10s of TMAC/s
- High efficiency: 0.5 pJ/MAC, i.e., 500 mW per TMAC/s
- Hyper-scalable: from ultra low power to high performance
- Easy to use: topology-agnostic (CNN/DNN/RNN); TensorFlow/Caffe2/etc. supported
Made possible with Mixed-Signal Computing on embedded flash.

Slide 28: Thank you for listening! Questions?