TYPES OF MODEL COMPRESSION. Soham Saha, MS by Research in CSE, CVIT, IIIT Hyderabad

1. Pruning
2. Quantization
3. Architectural Modifications

PRUNING

WHY PRUNING? Deep Neural Networks have redundant parameters. Such parameters have a negligible value and can be ignored. Removing them does not affect performance. Figure: Distribution of weights after Training

TYPES OF PRUNING
Fine Pruning -- Prune the weights
Coarse Pruning -- Prune neurons and layers
Static Pruning -- Pruning after training
Dynamic Pruning -- Pruning during training

Weight Pruning (After Training) The matrices can be made sparse. A naive method is to drop those weights which are 0 after training, or to drop the weights below some threshold. The matrix can be stored in an optimized way if it becomes sparse, and sparse matrix multiplications are faster.
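A minimal sketch of threshold-based weight pruning and sparse storage; the layer shape and threshold value are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy weight matrix standing in for a trained fully connected layer.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(512, 256))

# Drop weights whose magnitude is below a chosen threshold.
threshold = 0.05                      # illustrative value
W_pruned = np.where(np.abs(W) < threshold, 0.0, W)

# Store the pruned matrix in a sparse format; only the nonzeros are kept.
W_sparse = csr_matrix(W_pruned)
print(f"kept {W_sparse.nnz / W.size:.1%} of the weights")

# Sparse matrix-vector product for a forward pass.
x = rng.normal(size=256)
y = W_sparse @ x
```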

Ensuring Sparsity Adding an L1 regulariser to the loss encourages sparsity.

Sparsify at Training Time Iterative pruning and retraining:
Train for a few epochs.
Prune weights which are 0 or below some threshold.
Start training again.
Add a regularizer (weight decay) to the loss function: Loss + λ Σ_i |w_i|^r, where r = 1 gives the L1 regularizer and r = 2 the L2 regularizer.
DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING Song Han, Huizi Mao, William J. Dally
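A toy sketch of this prune-and-retrain loop on a linear model trained with plain gradient descent; the data, threshold, learning rate and regularization strength are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
true_w = np.zeros(64); true_w[:8] = rng.normal(size=8)   # mostly-redundant target
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = rng.normal(scale=0.1, size=64)
mask = np.ones_like(w)                 # 1 = keep, 0 = pruned
lr, lam, r = 1e-2, 1e-2, 1             # r = 1 -> L1, r = 2 -> L2

for _round in range(3):                # iterative pruning and retraining
    for _ in range(200):               # train for a few epochs
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        grad += lam * r * np.sign(w) * np.abs(w) ** (r - 1)   # d/dw of lam*|w|^r
        w -= lr * grad * mask          # pruned weights stay at zero
    mask *= (np.abs(w) > 0.05)         # prune small weights, then retrain
    w *= mask

print(f"nonzero weights after pruning: {int(mask.sum())}/{w.size}")
```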

Results reported by Deep Compression

Remaining parameters in Different Layers ALEXNET VGG16 DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING Song Han, Huizi Mao, William J. Dally

Comments on Weight Pruning
1. Matrices become sparse, so storage on disk (HDD) is efficient.
2. The weight matrices still occupy the same memory in RAM.
3. Matrix multiplication is not faster, since each zero-valued weight occupies as much space as before.
4. Optimized sparse matrix multiplication algorithms need to be coded up separately, even for a basic forward pass operation.

Neuron Pruning Previously, we had a sparse weight matrix. Now we effectively remove rows and columns of the weight matrix, so matrix multiplication becomes faster, improving test time. DropNeuron uses custom regularizers to prune neurons, and thresholding to remove all connections of a neuron. DropNeuron: Simplifying the Structure of Deep Neural Networks Wei Pan, Hao Dong, Yike Guo

Dropping Neurons by Regularization DropNeuron: Simplifying the Structure of Deep Neural Networks Wei Pan, Hao Dong, Yike Guo

Dropping principles
All input connections to a neuron are forced to be 0 or as close to 0 as possible (force li_regulariser to be small).
All output connections of a neuron are forced to be 0 or as close to 0 as possible (force lo_regulariser to be small).
Add the regularisers to the loss function and train.
Remove all connections below a threshold after training.
Discard neurons with no remaining connections.
DropNeuron: Simplifying the Structure of Deep Neural Networks Wei Pan, Hao Dong, Yike Guo
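A sketch of one plausible reading of these penalties, interpreting li_regulariser and lo_regulariser as group norms over each neuron's incoming and outgoing weights; this interpretation, and the threshold value, are assumptions rather than the authors' code:

```python
import numpy as np

def drop_neuron_penalties(W_in, W_out):
    """Group-sparsity penalties for the neurons of one hidden layer.

    W_in  : (n_prev, n_hidden) weights into the layer (columns = input connections)
    W_out : (n_hidden, n_next) weights out of the layer (rows = output connections)
    """
    li_regulariser = np.linalg.norm(W_in, axis=0).sum()    # input connections per neuron
    lo_regulariser = np.linalg.norm(W_out, axis=1).sum()   # output connections per neuron
    return li_regulariser, lo_regulariser

def prune_neurons(W_in, W_out, threshold=1e-2):
    """Drop neurons whose connections all fall below the threshold after training."""
    W_in = np.where(np.abs(W_in) < threshold, 0.0, W_in)
    W_out = np.where(np.abs(W_out) < threshold, 0.0, W_out)
    keep = (np.abs(W_in).sum(axis=0) > 0) & (np.abs(W_out).sum(axis=1) > 0)
    return W_in[:, keep], W_out[keep, :]      # rows/columns of dead neurons removed
```

Adding these penalties (scaled by small coefficients) to the training loss pushes whole columns and rows of the weight matrices toward zero, so the corresponding neurons can be removed after thresholding.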

Effect of neuron pruning on weight matrices DropNeuron: Simplifying the Structure of Deep Neural Networks Wei Pan, Hao Dong, Yike Guo

Results on FC Layer (MNIST) DropNeuron: Simplifying the Structure of Deep Neural Networks Wei Pan, Hao Dong, Yike Guo

Neuron and Layer Pruning Can we learn hyperparameters by backpropagation? For example, the hidden layer / filter size and the number of layers. We would effectively be learning the architecture. The method modifies the activation function; w and d are binary variables in the equation below. Learning Neural Network Architectures using Backpropagation Suraj Srinivas, R. Venkatesh Babu

Loss Function Learning Neural Network Architectures using Backpropagation Suraj Srinivas, R. Venkatesh Babu

Results Learning Neural Network Architectures using Backpropagation Suraj Srinivas, R. Venkatesh Babu

QUANTIZATION

Binary Quantization
Size drop: 32x.
Runtime: much faster (7x) matrix multiplication for binary matrices.
Accuracy drop: top-5 classification error is about 20% on the ILSVRC dataset.
COMPRESSING DEEP CONVOLUTIONAL NETWORKS USING VECTOR QUANTIZATION Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev

Binary Quantization while Training Add a regularizer and round at the end of training. Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1 Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio
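A minimal sketch of the final rounding step described above; how the regularizer is defined during training is not shown, and mapping zero weights to +1 is an arbitrary choice:

```python
import numpy as np

def round_to_binary(W):
    """Round trained real-valued weights to +1/-1 at the end of training."""
    return np.where(W >= 0.0, 1.0, -1.0).astype(np.float32)

W = np.array([[0.8, -0.3], [-0.05, 0.0]], dtype=np.float32)
print(round_to_binary(W))   # [[ 1. -1.] [-1.  1.]]
```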

8-bit uniform quantization
Divide the range between the minimum and maximum weight values into 256 equal intervals.
Round each weight to the nearest point.
Store weights as 8-bit integers.
Size drop: 4x.
Runtime: much faster matrix multiplication for 8-bit matrices.
Accuracy drop: the error is acceptable for classification in non-critical tasks.
https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
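A minimal sketch of this scheme following the description above (the degenerate case where all weights are equal is not handled):

```python
import numpy as np

def quantize_uint8(W):
    """Uniformly quantize weights to 8-bit integers over [W.min(), W.max()]."""
    w_min, w_max = W.min(), W.max()
    scale = (w_max - w_min) / 255.0
    q = np.round((W - w_min) / scale).astype(np.uint8)
    return q, w_min, scale

def dequantize_uint8(q, w_min, scale):
    """Map the 8-bit codes back to approximate float weights."""
    return q.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)
q, w_min, scale = quantize_uint8(W)
W_hat = dequantize_uint8(q, w_min, scale)
print("max quantization error:", np.abs(W - W_hat).max())   # roughly scale / 2
```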

8-bit Uniform Quantization while Training Add L1/L2 regularizers to keep the min and max weight values close (a narrower range gives finer 8-bit resolution).

Non-Uniform Quantization / Weight Sharing Perform k-means clustering on the weights. We need to store a mapping from integer codes to cluster centers. Only log2(k) bits are needed to code the clusters, which gives a compression rate of 32/log2(k). In this case (k = 256) the compression rate is 4. DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING Song Han, Huizi Mao, William J. Dally
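A sketch of weight sharing via k-means using scikit-learn; k = 16 (4-bit codes, so a 32/log2(16) = 8x rate) is an illustrative choice, not a value from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def weight_share(W, k=16):
    """Cluster weights into k shared values; each weight stores a log2(k)-bit code."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(W.reshape(-1, 1))          # cluster the scalar weights
    codebook = km.cluster_centers_.ravel()             # k shared float values
    codes = labels.astype(np.uint8).reshape(W.shape)   # integer code per weight
    return codes, codebook

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(64, 64))
codes, codebook = weight_share(W, k=16)
W_shared = codebook[codes]                             # reconstruct the shared-weight matrix
```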

Weight Sharing while Training Iterate: train, cluster the weights, and set the weights in each cluster to the same shared value. We need to ensure the gradients are updated with respect to the weight-shared model. DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING Song Han, Huizi Mao, William J. Dally

Deep Compression by Song Han DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING Song Han, Huizi Mao, William J. Dally

Deep Compression by Song Han DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING Song Han, Huizi Mao, William J. Dally

Product Quantization Partition the given matrix into several submatrices and perform k-means clustering on each of them. COMPRESSING DEEP CONVOLUTIONAL NETWORKS USING VECTOR QUANTIZATION Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev
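A sketch of product quantization under some illustrative assumptions (splitting along columns, 4 partitions, k = 16 centers per partition):

```python
import numpy as np
from sklearn.cluster import KMeans

def product_quantize(W, n_subvectors=4, k=16):
    """Split the columns of W into partitions and run k-means on each partition."""
    subs = np.split(W, n_subvectors, axis=1)           # equal column-wise partitions
    codes, codebooks = [], []
    for S in subs:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(S)
        codes.append(km.labels_)                       # one code per row and partition
        codebooks.append(km.cluster_centers_)
    return np.stack(codes, axis=1), codebooks

def reconstruct(codes, codebooks):
    """Rebuild the matrix by looking up each partition's code in its codebook."""
    return np.hstack([cb[c] for c, cb in zip(codes.T, codebooks)])

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 64))
codes, codebooks = product_quantize(W)                 # 1024 x 4 codes + 4 small codebooks
W_hat = reconstruct(codes, codebooks)
```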

Residual Quantization First quantize the vectors into k centers. Next, compute the residual (w - c) for each data point and perform k-means on the residuals. The resulting weight vectors are then reconstructed by summing the selected centers from each stage. COMPRESSING DEEP CONVOLUTIONAL NETWORKS USING VECTOR QUANTIZATION Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev
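A sketch of residual quantization under the same illustrative assumptions (k and the number of stages are not values from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def residual_quantize(W, k=16, stages=2):
    """Quantize the rows of W, then repeatedly quantize the remaining residuals."""
    residual, all_codes, all_codebooks = W.copy(), [], []
    for _ in range(stages):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(residual)
        all_codes.append(km.labels_)
        all_codebooks.append(km.cluster_centers_)
        residual = residual - km.cluster_centers_[km.labels_]     # w - c
    # Reconstruction is the sum of the selected centers from every stage.
    W_hat = sum(cb[c] for c, cb in zip(all_codes, all_codebooks))
    return W_hat, all_codes, all_codebooks
```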

Comparison of Quantization methods on Imagenet COMPRESSING DEEP CONVOLUTIONAL NETWORKS USING VECTOR QUANTIZATION Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev

XNOR Net
Binary Weight Networks: estimate the real-valued weight filter using a binary filter. Only the weights are binarized. Convolutions are estimated with only additions and subtractions (no multiplications required due to binarization).
XNOR Networks: binary estimation of both inputs and weights. The inputs to the convolutions are binary as well. Binary inputs and weights allow the computations to be done with XNOR operations.
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi

Binary weight networks Estimating binary weights: Objective function : Solution : XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi
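For reference, the objective and its closed-form solution from the XNOR-Net paper, for a real-valued filter W with n elements approximated by a scale α and a binary filter B:

J(B, α) = ||W − αB||²,  with  B* = sign(W)  and  α* = (1/n) ||W||_ℓ1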

XNOR Networks Objective function for dot product approximation: We can approximate the input I and weight filter W by using the following binary operations: XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi
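In the paper, the dot product between an input X and a weight filter W (each with n elements) is approximated as X^T W ≈ β α (H^T B), where H = sign(X), B = sign(W), β = (1/n)||X||_ℓ1 and α = (1/n)||W||_ℓ1; the binary part H^T B can then be computed with XNOR and bit-count operations.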

Approximating a convolution using binary operations XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi

Results XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi

Results XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi

FIXED POINT REPRESENTATION FLOATING POINT VS FIXED POINT REPRESENTATION

Fixed point Fixed-point formats consist of a signed mantissa and a global scaling factor shared between all fixed-point variables; the scaling factor is usually fixed. Reducing the scaling factor reduces the range and increases the precision of the format. Fixed point relies on integer operations and is cheaper in hardware than its floating-point counterpart, since the exponent is shared and fixed.
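A minimal sketch of such a format with one global scaling factor (the word length and number of fractional bits are illustrative assumptions):

```python
import numpy as np

def to_fixed_point(x, frac_bits=8, word_bits=16):
    """Encode floats as signed integers with a shared scale of 2**-frac_bits."""
    scale = 2 ** frac_bits
    q_min, q_max = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    return np.clip(np.round(x * scale), q_min, q_max).astype(np.int16)

def from_fixed_point(q, frac_bits=8):
    """Decode back to floats using the same shared scaling factor."""
    return q.astype(np.float32) / (2 ** frac_bits)

w = np.array([0.75, -1.5, 3.14159], dtype=np.float32)
q = to_fixed_point(w)                    # integer mantissas, shared fixed exponent
print(from_fixed_point(q))               # [ 0.75      -1.5        3.140625]
```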

Disadvantages of using fixed point When training deep neural networks:
Activations, gradients and parameters have very different ranges.
The ranges of the gradients slowly diminish during training.
Fixed-point arithmetic is not optimised on regular hardware, so specialised hardware such as FPGAs is required.
As a result, the fixed-point format with its single shared, fixed exponent is ill-suited to deep learning. The dynamic fixed-point format is a variant in which there are several scaling factors instead of a single global one.

Summary
Pruning weights and neurons
Uniform Quantization
Non-Uniform Quantization / Weight Sharing

THANK YOU