PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz


PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz. 2017

WHY CAN WE PRUNE CNNS?

WHY CAN WE PRUNE CNNS?
Optimization failures:
- Some neurons are "dead": little activation
- Some neurons are uncorrelated with the output
Modern CNNs are overparameterized:
- VGG16 has 138M parameters
- AlexNet has 61M parameters
- ImageNet has only 1.2M images

PRUNING FOR TRANSFER LEARNING
Small dataset: Caltech-UCSD Birds (200 classes, <6,000 images)

PRUNING FOR TRANSFER LEARNING
Small dataset -> training -> small network -> Oriole / Goldfinch (evaluate accuracy and size/speed)

PRUNING FOR TRANSFER LEARNING
Small dataset -> fine-tuning a large pretrained network (AlexNet, VGG16, ResNet) -> Oriole / Goldfinch (evaluate accuracy and size/speed)

PRUNING FOR TRANSFER LEARNING
Small dataset -> fine-tuning a large pretrained network (AlexNet, VGG16, ResNet) -> pruning -> smaller network -> Oriole / Goldfinch (evaluate accuracy and size/speed)

TYPES OF UNITS
Convolutional units (our focus): heavy on computation, small on storage.
Fully connected (dense) units: fast to compute, heavy on storage.
To reduce computation, we focus pruning on convolutional units.

Ratio of floating point operations:
Network   Convolutional layers   Fully connected layers
VGG16     99%                    1%
AlexNet   89%                    11%
R3DCNN    90%                    10%
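For intuition on why convolutional layers dominate the FLOP count while fully connected layers dominate storage, here is a minimal sketch (not from the talk; the layer shapes are illustrative) of counting multiply-add operations for each layer type:

# Rough FLOP counts (2 x multiply-adds) for a conv layer vs. a fully connected layer.
# The shapes below are illustrative, not taken from the talk.

def conv_flops(h_out, w_out, c_in, c_out, k_h, k_w):
    # Each of the h_out * w_out * c_out outputs needs c_in * k_h * k_w multiply-adds.
    return 2 * h_out * w_out * c_out * c_in * k_h * k_w

def fc_flops(n_in, n_out):
    # One multiply-add per weight.
    return 2 * n_in * n_out

# Example: a VGG-style 3x3 conv on a 56x56 map with 256 channels vs. a 4096x4096 fc layer.
print(conv_flops(56, 56, 256, 256, 3, 3))   # ~3.7 GFLOPs per image
print(fc_flops(4096, 4096))                 # ~0.03 GFLOPs, but ~16.8M weights to store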

TYPES OF PRUNING
No pruning.
Fine pruning: remove individual connections between neurons/feature maps; may require special SW/HW for full speed-up.
Coarse pruning (our focus): remove entire neurons/feature maps; instant speed-up, no change to HW/SW.

NETWORK PRUNING

NETWORK PRUNING
Training: min_W C(W, D)
C: training cost function; D: training data; W: network weights; W': pruned network weights

NETWORK PRUNING
Training: min_W C(W, D)
Pruning: min_{W'} |C(W', D) - C(W, D)|  s.t.  W' ⊂ W, |W'| < B
C: training cost function; D: training data; W: network weights; W': pruned network weights

NETWORK PRUNING
Training: min_W C(W, D)
Pruning: min_{W'} |C(W', D) - C(W, D)|  s.t.  W' ⊂ W, ||W'||_0 ≤ B
(l_0 norm: number of non-zero elements)
C: training cost function; D: training data; W: network weights; W': pruned network weights

NETWORK PRUNING
Exact solution: a combinatorial optimization problem, too computationally expensive.
VGG-16 has |W| = 4224 convolutional units, so there are 2^4224 ≈ 10^1271 possible subsets to evaluate.

NETWORK PRUNING
Exact solution: a combinatorial optimization problem, too computationally expensive.
VGG-16 has |W| = 4224 convolutional units, so there are 2^4224 ≈ 10^1271 possible subsets to evaluate.
Greedy pruning:
- Assumes all neurons are independent (the same assumption made for back-propagation)
- Iteratively remove the neuron with the smallest contribution

GREEDY NETWORK PRUNING
Iterative pruning algorithm:
1) Estimate the importance of neurons (units)
2) Rank the units
3) Remove the least important unit
4) Fine-tune the network for K iterations
5) Go back to step 1)
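A minimal sketch of this loop in Python (estimate_importance, remove_unit, and finetune are hypothetical helpers standing in for the steps above; this is not the authors' released code):

def greedy_prune(model, train_data, units_to_remove, finetune_iters=30):
    # Iterative greedy pruning: score, rank, remove one unit, fine-tune, repeat.
    for _ in range(units_to_remove):
        scores = estimate_importance(model, train_data)          # step 1: score every unit
        weakest = min(scores, key=scores.get)                    # steps 2-3: lowest-ranked unit
        remove_unit(model, weakest)                              # step 3: drop the whole feature map
        finetune(model, train_data, iterations=finetune_iters)   # step 4: K update steps
    return model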

ORACLE

ORACLE
Caltech-UCSD Birds-200-2011 dataset: 200 classes, <6,000 training images.

Method                               Test accuracy
S. Belongie et al., SIFT+SVM*        19%
CNN trained from scratch             25%
S. Razavian et al., OverFeat+SVM*    62%
Our baseline: VGG16 fine-tuned       72.2%
N. Zhang et al., R-CNN               74%
S. Branson et al., Pose-CNN*         76%
J. Krause et al., R-CNN+*            82%
* requires additional attributes

ORACLE
VGG16 on the Birds-200 dataset: exhaustively computed the change in loss from removing each single unit.
[Figure: per-unit oracle scores, from the first layer to the last layer]
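A minimal sketch of how such an oracle can be computed (zero_out_unit and evaluate_loss are hypothetical helpers; zero_out_unit is assumed to temporarily mask one feature map):

def compute_oracle(model, data, all_units):
    # Oracle: change in loss when each unit is removed, one at a time.
    base_loss = evaluate_loss(model, data)
    oracle = {}
    for unit in all_units:
        with zero_out_unit(model, unit):          # mask this feature map only
            oracle[unit] = evaluate_loss(model, data) - base_loss
    return oracle  # larger change in loss = more important unit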

ORACLE
VGG-16 on Birds-200 (only convolutional layers; y-axis: rank, lower is better; x-axis: layer #)
- On average, the first layers are more important
- Every layer has very important units
- Every layer has unimportant units
- Layers with pooling are more important

APPROXIMATING THE ORACLE

APPROXIMATING THE ORACLE
Candidate criteria:
- Average activation (discard units with lower activations)
- Minimum weight (discard units with lower l_2 norm of the weights)
- First-order Taylor expansion (TE): absolute difference in cost from removing a neuron,
  |ΔC(h_i)| = |C(D, h_i = 0) - C(D, h_i)| ≈ |(∂C/∂h_i) h_i|
  (higher-order terms of the expansion are ignored)
  where ∂C/∂h_i is the gradient of the cost w.r.t. the activation and h_i is the unit's output; both are computed during standard backprop.
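A minimal PyTorch-style sketch of the Taylor criterion for one convolutional layer (names and the exact averaging are illustrative; the activation and its gradient are assumed to be captured with forward/backward hooks during a normal training pass):

import torch

def taylor_criterion(activation: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # activation, grad: (N, C, H, W) layer output and gradient of the cost w.r.t. it.
    # Absolute value of the spatial average of activation * gradient, then averaged over the batch.
    per_example = (activation * grad).mean(dim=(2, 3)).abs()   # (N, C)
    return per_example.mean(dim=0)                             # one score per feature map, (C,)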

APPROXIMATING THE ORACLE
Candidate criteria, alternative: Optimal Brain Damage (OBD) by Y. LeCun et al., 1990.
Uses second-order derivatives to estimate the importance of neurons:
ΔC(h_i) ≈ (∂C/∂h_i) h_i [= 0 for a trained network] + (1/2)(∂²C/∂h_i²) h_i² [+ higher-order terms, ignored]
Needs extra computation of the second-order derivative.

APPROXIMATING THE ORACLE
Comparison to OBD:
- OBD: second-order expansion, ΔC ≈ (1/2)(∂²C/∂h_i²) h_i² (first-order term assumed = 0)
- We propose: absolute value of the first-order expansion, |ΔC| ≈ |(∂C/∂h_i) h_i|
Assuming y = (∂C/∂h_i) h_i: for a perfectly trained model the expected value of y is 0, but the expected value of |y| is non-zero if y is Gaussian.

APPROXIMATING THE ORACLE
Comparison to OBD:
- OBD: second-order expansion, ΔC ≈ (1/2)(∂²C/∂h_i²) h_i² (first-order term assumed = 0)
- We propose: absolute value of the first-order expansion, |ΔC| ≈ |(∂C/∂h_i) h_i|
Assuming y = (∂C/∂h_i) h_i: for a perfectly trained model the expected value of y is 0, but the expected value of |y| is non-zero if y is Gaussian.
- No extra computations
- We look at the absolute difference
- Cannot predict the exact change in loss
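The last point rests on a standard identity for a zero-mean Gaussian (a known fact, stated here as a supporting step rather than a quote from the slides). With y = (∂C/∂h_i) h_i and y ~ N(0, σ²):

\mathbb{E}[\,|y|\,] = \int_{-\infty}^{\infty} |y| \, \frac{1}{\sqrt{2\pi}\,\sigma} e^{-y^2/(2\sigma^2)} \, dy = \sigma \sqrt{2/\pi} > 0

so the proposed criterion keeps a non-zero expected signal exactly where OBD assumes the first-order term vanishes.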

EVALUATING PRUNING CRITERIA
Spearman's rank correlation with the oracle, VGG16 on Birds-200.
Mean rank correlation (across layers):
Min weight: 0.27, Activation: 0.56, OBD: 0.59, Taylor expansion: 0.73
[Figure: per-layer correlation with the oracle (layers 1-13) for Min weight, Activation, OBD, and Taylor expansion]
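For reference, the rank correlation between a criterion and the oracle can be computed with SciPy, as in this small sketch (the score arrays are toy numbers, not the talk's data):

from scipy.stats import spearmanr

oracle_scores    = [0.9, 0.1, 0.5, 0.3, 0.7]   # oracle change in loss per unit
criterion_scores = [0.8, 0.2, 0.4, 0.1, 0.9]   # e.g. Taylor scores for the same units

rho, _ = spearmanr(oracle_scores, criterion_scores)
print(rho)   # 1.0 would mean the criterion ranks units exactly like the oracle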

EVALUATING PRUNING CRITERIA
Pruning with an objective: regularize the criterion with the objective.
The regularizer can be: FLOPs, memory, bandwidth, the target device, exact inference time.
[Figure: FLOPs per unit for each VGG16 layer (layers 1-14)]
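As a sketch of what regularizing the criterion with an objective can look like (the subtraction form and the lambda value are illustrative assumptions, not quoted from the slides):

def regularized_importance(raw_score, unit_flops, lam=1e-3):
    # Bias each unit's importance by the FLOPs it costs, so that
    # equally useful but more expensive units are pruned first.
    return raw_score - lam * unit_flops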

RESULTS

RESULTS
VGG16 on the Birds-200 dataset: remove 1 convolutional unit every 30 updates.

RESULTS
VGG16 on the Birds-200 dataset (accuracy vs. number of convolutional kernels and vs. GFLOPs):
- Training from scratch doesn't work
- Taylor shows the best result versus any other pruning metric

RESULTS
AlexNet on Oxford Flowers-102: 102 classes, ~2k training images, ~6k testing images.
Varying the number of updates between pruning iterations: 1000, 60, 30, 10 updates.

RESULTS
AlexNet on Oxford Flowers-102: 102 classes, ~2k training images, ~6k testing images.
Varying the number of updates between pruning iterations: 1000, 60, 30, 10 updates.
3.8x FLOPs reduction, 2.4x actual speed-up.

RESULTS
VGG16 on ImageNet: pruned over 7 epochs; Top-5 accuracy on the validation set.

RESULTS
VGG16 on ImageNet: pruned over 7 epochs, then fine-tuned for 7 epochs; Top-5 accuracy on the validation set.

GFLOPs   FLOPs reduction   Actual speed-up   Top-5
31       1x                -                 89.5%
12       2.6x              2.5x              -2.5%
8        3.9x              3.3x              -5.0%

RESULTS
R3DCNN for gesture recognition: a 3D-CNN with recurrent layers, fine-tuned for 25 dynamic gestures.
(P. Molchanov, "Gesture recognition with 3D CNNs", GTC 2016)

RESULTS
R3DCNN for gesture recognition: a 3D-CNN with recurrent layers, fine-tuned for 25 dynamic gestures.
12.6x reduction in FLOPs, 2.5% drop in accuracy, 5.2x speed-up.
(P. Molchanov, "Gesture recognition with 3D CNNs", GTC 2016)

How many neurons do we need to classify a cat?

DOGS VS. CATS
Dogs vs. Cats classification (Kaggle, 25,000 images); Marco Lugo's solution, 3rd place.

DOGS VS. CATS
Fine-tuned ResNet-101: full network 99.2%, pruned network 99.0%.

DOGS VS. CATS
Fine-tuned ResNet-101: full network 99.2% (52,672 convolutional units), pruned network 99.0% (3,472 units), 15x compression.
[Figure: number of convolutional units vs. pruning iteration]

CONCLUSIONS
- Pruning as greedy feature selection
- New criterion based on the Taylor expansion
- Pruning is especially effective (and necessary!) for transfer learning
- Pruning can incorporate desired objectives (such as FLOPs)
Read more in our ICLR 2017 paper: https://openreview.net/pdf?id=sjgciw5gl

THANK YOU!