PRUNING CONVOLUTIONAL NEURAL NETWORKS. Pavlo Molchanov Stephen Tyree Tero Karras Timo Aila Jan Kautz

1 PRUNING CONVOLUTIONAL NEURAL NETWORKS
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz
2017

2 WHY CAN WE PRUNE CNNS?

3 WHY CAN WE PRUNE CNNS?
Optimization failures:
- Some neurons are "dead": little activation
- Some neurons are uncorrelated with the output
Modern CNNs are overparameterized:
- VGG16 has 138M parameters
- AlexNet has 61M parameters
- ImageNet has 1.2M images

4 PRUNING FOR TRANSFER LEARNING
Small dataset: Caltech-UCSD Birds (200 classes, <6000 images)

5 PRUNING FOR TRANSFER LEARNING
Small dataset -> small network -> training -> classifier (Oriole, Goldfinch)
Criteria: accuracy, size/speed

6 PRUNING FOR TRANSFER LEARNING
Small dataset + large pretrained network (AlexNet, VGG16, ResNet) -> fine-tuning -> classifier (Oriole, Goldfinch)
Criteria: accuracy, size/speed

7 PRUNING FOR TRANSFER LEARNING
Small dataset + large pretrained network (AlexNet, VGG16, ResNet) -> fine-tuning -> pruning -> smaller network -> classifier (Oriole, Goldfinch)
Criteria: accuracy, size/speed

8 TYPES OF UNITS
Convolutional units (our focus):
- Heavy on computation
- Small on storage
Fully connected (dense) units:
- Fast on computation
- Heavy on storage
To reduce computation, we focus pruning on convolutional units.
Ratio of floating point operations:
Network   Convolutional layers   Fully connected layers
VGG16     99%                    1%
AlexNet   89%                    11%
R3DCNN    90%                    10%
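The computation/storage contrast can be made concrete with a rough count for two well-known VGG16 layers. This is a minimal sketch: the helper names are hypothetical, and the counting convention (2 FLOPs per multiply-add, bias included) is an assumption, so other conventions will give slightly different totals.

```python
def conv_layer_cost(height, width, c_in, c_out, kernel):
    """Return (parameters, FLOPs) for one conv layer with bias.

    Assumed convention: each output value costs one multiply-add
    (2 FLOPs) per incoming weight, with the bias counted the same way.
    """
    per_output_value = c_in * kernel * kernel + 1
    params = per_output_value * c_out
    flops = 2 * height * width * per_output_value * c_out
    return params, flops


def dense_layer_cost(n_in, n_out):
    """Return (parameters, FLOPs) for one fully connected layer with bias."""
    params = (n_in + 1) * n_out
    flops = 2 * n_in * n_out
    return params, flops


# First conv layer of VGG16: 3 -> 64 channels, 3x3 kernel, 224x224 output.
conv_params, conv_flops = conv_layer_cost(224, 224, 3, 64, 3)
# First dense layer of VGG16: 25088 -> 4096.
fc_params, fc_flops = dense_layer_cost(25088, 4096)

print(f"conv:  {conv_params:>11,} params, {conv_flops:>11,} FLOPs")
print(f"dense: {fc_params:>11,} params, {fc_flops:>11,} FLOPs")
print(f"FLOPs per parameter: conv {conv_flops // conv_params}, "
      f"dense {fc_flops / fc_params:.2f}")
```

Under these assumptions the convolutional layer reuses each weight at every spatial position, while the dense layer uses each weight about once, which is the trade-off the slide describes.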

9 TYPES OF PRUNING
No pruning
Fine pruning:
- Remove connections between neurons/feature maps
- May require special SW/HW for full speed-up
Coarse pruning (our focus):
- Remove entire neurons/feature maps
- Instant speed-up
- No change to HW/SW

10 NETWORK PRUNING

11 NETWORK PRUNING
Training: $\min_{W} C(W, D)$
C: training cost function
D: training data
W: network weights
W': pruned network weights

12 NETWORK PRUNING
Training: $\min_{W} C(W, D)$
Pruning: $\min_{W'} \left| C(W', D) - C(W, D) \right| \quad \text{s.t.} \quad W' \subset W,\ |W'| < B$
C: training cost function
D: training data
W: network weights
W': pruned network weights

13 NETWORK PRUNING
Training: $\min_{W} C(W, D)$
Pruning: $\min_{W'} \left| C(W', D) - C(W, D) \right| \quad \text{s.t.} \quad W' \subset W,\ |W'| < B$
C: training cost function
D: training data
W: network weights
W': pruned network weights
s.t. $\|W'\|_0 \le B$, where $\|\cdot\|_0$ is the $\ell_0$ norm, the number of non-zero elements

14 NETWORK PRUNING
Exact solution: combinatorial optimization problem, too computationally expensive
- VGG-16 has $|W| = 4224$ convolutional units
- $2^{|W|} = 2^{4224}$ subsets of units

15 NETWORK PRUNING
Exact solution: combinatorial optimization problem, too computationally expensive
- VGG-16 has $|W| = 4224$ convolutional units
- $2^{|W|} = 2^{4224}$ subsets of units
Greedy pruning:
- Assumes all neurons are independent (same assumption as for back propagation)
- Iteratively remove the neuron with the smallest contribution
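To get a feel for why the exact solution is out of reach, the size of the search space can be computed directly. A small sketch, using only the unit count quoted on the slide:

```python
# Every convolutional unit is either kept or removed, so the number of
# candidate pruned networks is 2 ** |W|.
num_units = 4224                  # |W| for VGG-16, from the slide
num_subsets = 2 ** num_units      # Python integers are exact at any size

num_digits = len(str(num_subsets))
print(f"2^{num_units} has {num_digits} decimal digits")
```

Evaluating even a vanishing fraction of that many candidates is infeasible, which motivates the greedy approach.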

16 GREEDY NETWORK PRUNING
Iterative pruning algorithm:
1) Estimate importance of neurons (units)
2) Rank units
3) Remove the least important unit
4) Fine-tune the network for K iterations
5) Go back to step 1)
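The five steps above can be sketched as a short loop. This is an illustrative skeleton, not the authors' implementation: `greedy_prune`, `estimate_importance`, `fine_tune`, and the stopping rule `target_units` are hypothetical names, and the toy usage replaces a real network with fixed scores and a no-op fine-tuning step.

```python
from typing import Callable, Dict, List


def greedy_prune(
    units: List[str],
    estimate_importance: Callable[[List[str]], Dict[str, float]],
    fine_tune: Callable[[List[str], int], None],
    target_units: int,
    k_iterations: int,
) -> List[str]:
    """Remove one unit at a time until `target_units` remain."""
    remaining = list(units)
    while len(remaining) > target_units:                     # step 5: repeat
        scores = estimate_importance(remaining)              # step 1
        ranked = sorted(remaining, key=scores.__getitem__)   # step 2 (ascending)
        remaining.remove(ranked[0])                          # step 3
        fine_tune(remaining, k_iterations)                   # step 4
    return remaining


# Toy usage: fixed scores stand in for a real importance criterion.
toy_scores = {"conv1_0": 0.9, "conv1_1": 0.1, "conv2_0": 0.5, "conv2_1": 0.3}
kept = greedy_prune(
    units=list(toy_scores),
    estimate_importance=lambda rem: {u: toy_scores[u] for u in rem},
    fine_tune=lambda rem, k: None,    # no-op placeholder for K training updates
    target_units=2,
    k_iterations=30,
)
print(kept)
```

Re-estimating importance inside the loop matters in practice, because fine-tuning changes the scores of the units that remain.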

17 ORACLE

18 ORACLE
Caltech-UCSD Birds dataset: 200 classes, <6000 training images
Method                               Test accuracy
S. Belongie et al., *SIFT+SVM        19%
From-scratch CNN                     25%
S. Razavian et al., *OverFeat+SVM    62%
Our baseline, VGG16 fine-tuned       72.2%
N. Zhang et al., R-CNN               74%
S. Branson et al., *Pose-CNN         76%
J. Krause et al., *R-CNN+            82%
*require additional attributes

19 ORACLE
VGG16 on Birds-200 dataset
Exhaustively computed change in loss by removing one unit
[Figure: change in loss per unit, from first layer to last layer]
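The exhaustive oracle amounts to one extra loss evaluation per unit. A minimal sketch under stated assumptions: `oracle_scores` and `toy_loss` are hypothetical names, the "network" is a masked weighted sum rather than a CNN, and the absolute change in loss is used as the score.

```python
from typing import Callable, List


def oracle_scores(loss_fn: Callable[[List[int]], float], num_units: int) -> List[float]:
    """Return |loss with unit i removed - loss of the full network| for each i."""
    full_mask = [1] * num_units
    base_loss = loss_fn(full_mask)
    scores = []
    for i in range(num_units):
        mask = full_mask.copy()
        mask[i] = 0                      # remove exactly one unit
        scores.append(abs(loss_fn(mask) - base_loss))
    return scores


# Toy model: the output is a masked weighted sum, the loss is the squared
# error against a target of 1.0.
weights = [0.5, 0.25, 0.25, 0.0]


def toy_loss(mask: List[int]) -> float:
    output = sum(w * m for w, m in zip(weights, mask))
    return (output - 1.0) ** 2


scores = oracle_scores(toy_loss, len(weights))
print(scores)
```

On a real network each call to `loss_fn` is a full pass over the evaluation data, so the cost grows linearly with the number of units; that is why cheaper criteria are needed.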

20 ORACLE
VGG-16 on Birds-200
- On average, the first layers are more important
- Every layer has very important units
- Every layer has unimportant units
- Layers with pooling are more important
[Figure: oracle rank (lower is better) vs. layer #; only convolutional layers]

21 APPROXIMATING THE ORACLE

22 APPROXIMATING THE ORACLE
Candidate criteria:
- Average activation (discard lower activations)
- Minimum weight (discard lower $\ell_2$ norm of weight)
- First-order Taylor expansion (TE), ignoring the higher-order remainder:
  absolute difference in cost by removing a neuron $\approx \left| \frac{\delta C}{\delta h_i} h_i \right|$
  $\frac{\delta C}{\delta h_i}$: gradient of the cost w.r.t. activation $h_i$
  $h_i$: unit's output
  Both are computed during standard backprop.
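Given the activations and gradients that backprop already produces, the Taylor criterion is a few lines of arithmetic. The sketch below is a simplification: `taylor_criterion` is a hypothetical helper, each unit's feature map is a flat list for a single example, and averaging over positions before taking the absolute value is an assumption about how a multi-valued feature map is reduced to one score.

```python
from typing import List


def taylor_criterion(
    activations: List[List[float]],
    gradients: List[List[float]],
) -> List[float]:
    """One score per unit: |mean over positions of gradient * activation|."""
    scores = []
    for act_map, grad_map in zip(activations, gradients):
        products = [g * a for a, g in zip(act_map, grad_map)]
        scores.append(abs(sum(products) / len(products)))
    return scores


# Three toy units, two spatial positions each.
acts = [[1.0, 2.0], [0.0, 0.0], [4.0, 4.0]]
grads = [[0.5, -0.5], [3.0, 3.0], [0.25, 0.25]]

scores = taylor_criterion(acts, grads)
print(scores)
```

The second unit has large gradients but zero activation, so it scores 0: removing a unit that outputs nothing does not change the cost.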

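23 APPROXIMATING THE ORACLE
Candidate criteria, alternative: Optimal Brain Damage (OBD) by Y. LeCun et al., 1990
- Uses second-order derivatives to estimate the importance of neurons
- In the expansion of the cost, the first-order term is taken to be 0 and the higher-order terms are ignored
- Needs extra computation of the second-order derivative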

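24 APPROXIMATING THE ORACLE
Comparison to OBD
- OBD: second-order expansion; the first-order term is taken to be 0
- We propose: absolute value of the first-order expansion
Assuming $y = \frac{\delta C}{\delta h_i} h_i$: for a perfectly trained model $y$ averages to 0, but its absolute value does not, for example if $y$ is Gaussian.

25 APPROXIMATING THE ORACLE
Comparison to OBD
- OBD: second-order expansion; the first-order term is taken to be 0
- We propose: absolute value of the first-order expansion
Assuming $y = \frac{\delta C}{\delta h_i} h_i$: for a perfectly trained model $y$ averages to 0, but its absolute value does not, for example if $y$ is Gaussian.
- No extra computations
- We look at the absolute difference
- Can't predict the exact change in loss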
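A quick numerical check shows why taking the absolute value keeps the first-order term informative. It relies on a standard property of the normal distribution: if y is zero-mean Gaussian with standard deviation sigma, then E[y] = 0 while E[|y|] = sigma * sqrt(2 / pi). The values of `sigma`, the seed, and the sample size are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)                    # fixed seed so the run is repeatable
sigma = 2.0
samples = [random.gauss(0.0, sigma) for _ in range(100_000)]

mean_y = sum(samples) / len(samples)                       # close to 0
mean_abs_y = sum(abs(y) for y in samples) / len(samples)   # clearly non-zero
expected_abs = sigma * math.sqrt(2.0 / math.pi)            # half-normal mean

print(f"mean(y)   = {mean_y:+.4f}")
print(f"mean(|y|) = {mean_abs_y:.4f}  (theory: {expected_abs:.4f})")
```

Since E[|y|] scales with sigma, the averaged absolute first-order term reflects how much the product of gradient and activation varies, even when its signed average vanishes.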

26 EVALUATING PRUNING CRITERIA
Spearman's rank correlation with oracle: VGG16 on Birds
Criteria compared: min weight, activation, OBD, Taylor expansion
[Figure: correlation with oracle vs. layer #, and mean rank correlation (across layers), for each criterion]
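Spearman's rank correlation compares only the ordering of units, which is what matters when deciding which unit to remove next. A minimal sketch using the textbook formula for data without ties; the helper names and the toy scores are made up for illustration.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 for the smallest value; assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no-ties form."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


oracle = [0.25, 0.06, 0.40, 0.01]        # toy oracle scores for four units
criterion = [0.30, 0.02, 0.90, 0.05]     # toy scores from a cheaper criterion

rho = spearman(oracle, criterion)
print(f"rho = {rho:.2f}")
```

For real scores, where ties are common, a library routine that handles ties (for example the one in SciPy) would be the safer choice.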

27 EVALUATING PRUNING CRITERIA
Pruning with objective
Regularize criteria with objective. The regularizer can be:
- FLOPs
- Memory
- Bandwidth
- Target device
- Exact inference time
[Figure: FLOPs per unit vs. layer #, VGG]
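One way to fold an objective into the ranking is to subtract a scaled cost from each unit's importance, so that expensive units are removed earlier. The sketch below is an assumption about the form of the regularizer, not a statement of the authors' exact formula: `regularized_scores`, the weight `lam`, and the toy numbers are all hypothetical.

```python
from typing import List


def regularized_scores(
    scores: List[float],
    cost_per_unit: List[float],
    lam: float,
) -> List[float]:
    """Importance minus lam * cost; lower values are pruned first."""
    return [s - lam * c for s, c in zip(scores, cost_per_unit)]


# Two units with equal importance, but the first costs 4x more FLOPs.
importance = [0.5, 0.5]
flops = [4.0, 1.0]                # arbitrary units, e.g. MFLOPs per unit

adjusted = regularized_scores(importance, flops, lam=0.125)
prune_first = adjusted.index(min(adjusted))
print(adjusted, "-> prune unit", prune_first)
```

Swapping `flops` for memory, bandwidth, or measured inference time per unit gives the other regularizers listed on the slide.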

28 RESULTS

29 RESULTS
VGG16 on Birds-200 dataset
Remove 1 conv unit every 30 updates

30 RESULTS
VGG16 on Birds-200 dataset
- Training from scratch doesn't work
- Taylor shows the best result compared with any other pruning metric
[Figure: axes labeled # convolutional kernels and GFLOPs]

31 RESULTS
AlexNet on Oxford Flowers102: 102 classes, ~2k training images, ~6k testing images
Changing the number of updates between pruning iterations: 1000, 60, 30, 10 updates

32 RESULTS
AlexNet on Oxford Flowers102: 102 classes, ~2k training images, ~6k testing images
Changing the number of updates between pruning iterations: 1000, 60, 30, 10 updates
3.8x FLOPs reduction, 2.4x actual speed-up

33 RESULTS
VGG16 on ImageNet
Pruned over 7 epochs
Top-5, validation set

34 RESULTS
VGG16 on ImageNet
Pruned over 7 epochs, fine-tuning for 7 epochs
Top-5, validation set
GFLOPs   FLOPs reduction   Actual speed-up   Top-5
…        …x                …                 89.5%
…        …x                2.5x              -2.5%
8        3.9x              3.3x              -5.0%

35 RESULTS
R3DCNN for gesture recognition
3D-CNN with recurrent layers, fine-tuned for 25 dynamic gestures
P. Molchanov, Gesture recognition with 3D CNNs, GTC

36 RESULTS
R3DCNN for gesture recognition
3D-CNN with recurrent layers, fine-tuned for 25 dynamic gestures
- 12.6x reduction in FLOPs
- 2.5% drop in accuracy
- 5.2x speed-up
P. Molchanov, Gesture recognition with 3D CNNs, GTC

37 How many neurons do we need to classify a cat?

38 DOGS VS. CATS
Dogs vs. Cats classification
Marco Lugo's solution, 3rd place
25,000 images

39 DOGS VS. CATS
Fine-tuned ResNet-101
Full network: 99.2%
Pruned network: 99.0%

40 DOGS VS. CATS
Fine-tuned ResNet-101
Full network: 99.2%
Pruned network: 99.0%
Compression: 15x, 3472 units
[Figure: convolutional units vs. pruning iteration]

41 CONCLUSIONS
- Pruning as greedy feature selection
- New criterion based on Taylor expansion
- Pruning is especially effective (and necessary!) for transfer learning
- Pruning can incorporate desired objectives (such as FLOPs)
Read more in our ICLR 2017 paper.

42 THANK YOU!
