Understanding CNNs using visualisation and transformation analysis

1 Understanding CNNs using visualisation and transformation analysis Andrea Vedaldi Medical Imaging Summer School August 2016

2 Image representations. An encoder Φ maps the data into a vectorial representation, facilitating the labelling of images, text, sound, videos, etc.

3 Modern convolutional nets [AlexNet by Krizhevsky et al. 2012]. Excellent performance in image understanding tasks. Learns a sequence of general-purpose representations. Millions of parameters learned from data. The meaning of the representation is unclear.

4 Understanding visual representations 4 Visualizing representations Backpropagation networks and deconvolution Representations: equivalence & transformations

5 Visualization: Pre-Image 5 Image Space Φ Representation Space Φ -1 Φ(x) Φ -1 The representation is not injective

6 Visualization: Pre-Image 6 Image Space Representation Space Φ -1 Φ(x) The representation is not injective The reconstruction ambiguity provides useful information about the representation

7 Finding a Pre-Image 7 A simple yet general and effective method Image Representation Pre-Image Reconstruction Start from random noise Optimize using stochastic gradient descent

8 Finding a Pre-Image 8 A simple yet general and effective method No prior TV-norm β = 1 TV-norm β = 2
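A minimal sketch of this pre-image procedure, assuming a pretrained torchvision AlexNet; the layer index, TV weight, learning rate and iteration count are illustrative rather than the values used in the talk, and x0 stands in for a real, preprocessed image.

```python
# Minimal pre-image reconstruction sketch (hedged: layer index, weights and
# step count are illustrative choices, not the talk's exact settings).
import torch
import torch.nn.functional as F
import torchvision

cnn = torchvision.models.alexnet(pretrained=True).features.eval()
for p in cnn.parameters():
    p.requires_grad_(False)

def phi(x, depth=11):
    # representation = activations after the first `depth` layers of cnn
    for i, layer in enumerate(cnn):
        x = layer(x)
        if i == depth:
            break
    return x

def tv_norm(x, beta=2.0):
    # total-variation regulariser; beta = 1 or 2 as on the slide above
    dh = (x[..., 1:, :] - x[..., :-1, :]).abs() ** beta
    dw = (x[..., :, 1:] - x[..., :, :-1]).abs() ** beta
    return dh.mean() + dw.mean()

x0 = torch.rand(1, 3, 224, 224)        # stand-in for a real, preprocessed image
target = phi(x0).detach()              # representation Φ(x0) to be inverted

x = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise
opt = torch.optim.SGD([x], lr=1.0, momentum=0.9)
for step in range(200):
    opt.zero_grad()
    loss = F.mse_loss(phi(x), target) + 0.1 * tv_norm(x)
    loss.backward()
    opt.step()
```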

9 Related Work
Analysis tools:
Visualizing higher-layer features of a deep network. Erhan et al. 2009 [intermediate features]
Deep inside convolutional networks. Simonyan et al. 2014 [deepest features, aka "deep dreams"]
DeConvNets. Zeiler and Fergus. In ECCV, 2014 [intermediate features]
Understanding neural networks through deep visualization. Yosinski et al. 2015 [intermediate features]
Artistic tools:
Google's Inceptionism. Mordvintsev et al. 2015
Style synthesis and transfer. Gatys et al. 2015

10 Inversion. AlexNet [Krizhevsky et al. 2012] layers: conv1 (Conv 1, ReLU 1, Max pool 1, LRN 1), conv2 (Conv 2, ReLU 2, Max pool 2, LRN 2), conv3 (Conv 3, ReLU 3), conv4 (Conv 4, ReLU 4), conv5 (Conv 5, ReLU 5, Max pool 5), fc6 (FC 6, ReLU 6), fc7 (FC 7, ReLU 7), fc8 (FC 8).

Slides 11-30: Inversion results. Each slide shows the original image together with reconstructions obtained by inverting the representation at successive layers (conv1, conv2, conv3, conv4, conv5, fc6, fc7, fc8).

31 Inverting a Deep CNN. Reconstructions of the original image from Conv 1, Conv 2, Conv 3, Conv 4, Conv 5, FC 6, ReLU 6, FC 7 and FC 8.

32 CNNs = visual codes? 32 conv1 conv5 fc8

33 Activation Maximization 33 Look for an image that maximally activates a specific feature component
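A minimal activation-maximization sketch, reusing the phi() and tv_norm() helpers from the inversion sketch above; the channel index, regularisation weights and step count are arbitrary choices.

```python
# Activation maximization sketch: ascend the response of one feature component,
# starting from white noise (all hyper-parameters are illustrative).
import torch

x = 1e-2 * torch.randn(1, 3, 224, 224)
x.requires_grad_(True)
opt = torch.optim.SGD([x], lr=1.0, momentum=0.9)
channel = 42                                   # which feature component to maximise

for step in range(200):
    opt.zero_grad()
    act = phi(x)[0, channel].mean()            # average response of that channel
    loss = -act + 0.1 * tv_norm(x) + 1e-4 * (x ** 2).sum()
    loss.backward()
    opt.step()
```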

Slides 34-39: activation maximization results (images only).

40 Remember: the starting point is white noise, not an image!

Slides 41-46: activation maximization results for individual features at successive layers: conv1, conv2, conv3, conv4, conv5.

47 Network comparison 47 conv5 features AlexNet VGG-M VGG-VD

48 Caricaturization [Google Inceptionism 2015, Mahendran et al. 2016]. Emphasise the patterns that are detected by a given representation. Key differences from activation maximization: the starting point is the image x0, and particular configurations of features are emphasised, not individual features.
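A sketch of one such recipe, with an assumption on my part: following the Inceptionism-style formulation, the optimisation starts from the image x0 itself and amplifies the responses that Φ already produces by ascending the squared feature norm. It reuses x0, phi() and tv_norm() from the inversion sketch above.

```python
# Caricaturization sketch (assumed Inceptionism-style objective: maximise the
# feature norm starting from the image rather than from noise).
import torch

x = x0.clone().requires_grad_(True)            # start from the image, not from noise
opt = torch.optim.SGD([x], lr=0.1, momentum=0.9)

for step in range(50):
    opt.zero_grad()
    loss = -(phi(x) ** 2).mean() + 0.1 * tv_norm(x)
    loss.backward()
    opt.step()
```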

49 Caricaturization (VGG-M): input, conv2, conv3, conv4

50 Caricaturization (VGG-M): conv5, fc6, fc7, fc8

Slide 51: example images only.

52 Interlude: neural art. Surprisingly, the filters learned by discriminative neural networks capture well the style of an image. This can be used to transfer the style of an image (e.g. a painting) to any other.
Optimisation based: L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks. In Proc. NIPS, 2015.
Feed-forward neural network equivalents: D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In Proc. ICML, 2016. J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV, 2016.

53 Generation by moment matching. The objective is a sum of per-layer losses. Content statistics: same as inversion (match the activations themselves). Style statistics: cross-channel correlations of the activations.
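A minimal sketch of the two kinds of statistics, assuming 1 x C x H x W activation tensors; gram_matrix, content_loss and style_loss are my own helper names and the weighting of the terms is left open.

```python
# Content statistic: the activations themselves (as in inversion).
# Style statistic: cross-channel correlations, i.e. the Gram matrix of one layer.
import torch
import torch.nn.functional as F

def gram_matrix(features):
    # features: 1 x C x H x W activations from one layer
    _, c, h, w = features.shape
    f = features.view(c, h * w)
    return (f @ f.t()) / (c * h * w)           # C x C channel correlations

def content_loss(features, target_features):
    return F.mse_loss(features, target_features)

def style_loss(features, target_features):
    return F.mse_loss(gram_matrix(features), gram_matrix(target_features))

# total objective: a weighted sum of content and style losses over several layers
```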

Slides 54-68: neural art and style transfer examples (images only).

69 Understanding visual representations 69 Visualizing representations Backpropagation networks and deconvolution Representations: equivalence & transformations

70 Backpropagation. Compute derivatives using the chain rule. Forward pass: x -> c1 -> c2 -> c3 -> c4 -> c5 -> f6 -> f7 -> f8 -> loss -> error (e.g. for the class "bike"), with parameters w1, ..., w8. Backward pass: compute derror/dw1, derror/dw2, ..., derror/dw8.

71 Chain rule: scalar version. A chain of functions: x0 -> f1 -> x1 -> f2 -> ... -> fn-1 -> xn-1 -> fn -> xn.

72 Chain rule: scalar version. A composition of n functions: xn = fn(fn-1(... f1(x0))). Derivative chain rule: dxn/dx0 = (dxn/dxn-1)(dxn-1/dxn-2) ... (dx1/dx0).

73 Tensor-valued functions. E.g. linear convolution = bank of 3D filters. Input x: H x W x C (x 1 or N instances). Filters F: Hf x Wf x C x K. Output y: (H - Hf + 1) x (W - Wf + 1) x K (x 1 or N).
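A quick numerical check of these sizes with an actual convolution; the specific values of N, C, H, W, K, Hf and Wf below are arbitrary.

```python
# Shape check: convolving N x C x H x W inputs with K filters of size Hf x Wf x C
# (stride 1, no padding) gives an N x K x (H-Hf+1) x (W-Wf+1) output.
import torch
import torch.nn.functional as F

N, C, H, W = 2, 3, 32, 32          # instances, channels, height, width
K, Hf, Wf = 8, 5, 5                # number of filters and filter size

x = torch.randn(N, C, H, W)
filters = torch.randn(K, C, Hf, Wf)
y = F.conv2d(x, filters)

print(y.shape)                     # torch.Size([2, 8, 28, 28])
```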

74 Vector representation. The vec operator stacks a 3D tensor into a vector.

75 Derivative of tensor-valued functions 75 Derivative (Jacobian): every output element w.r.t. every input element! The vec operator allows us to use a familiar matrix notation for the derivatives

76 Chain rule: tensor version. Using vec() and matrix notation: d vec(xn) / d vec(x0)^T = (d vec(xn) / d vec(xn-1)^T)(d vec(xn-1) / d vec(xn-2)^T) ... (d vec(x1) / d vec(x0)^T).

77 The (unbearable) size of tensor derivatives. The size of these Jacobian matrices is huge. Example: 275 billion elements, about 1 TB of memory required!!

78 Unless the output is a scalar. This is always the case if the last layer is the loss function. Now the Jacobian has the same size as x (the output has size 1 x 1 x 1). Example: 524K elements, just 2 MB of memory.
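The arithmetic behind the figures on the last two slides can be checked directly, assuming an activation tensor with 524,288 elements (for instance 64 x 64 x 128) stored in single precision (4 bytes per element).

```python
# Worked check of the Jacobian sizes quoted above (the 64 x 64 x 128 layer size is
# an assumption chosen to match the 524K-element figure on the slide).
n = 64 * 64 * 128                       # number of elements of x: 524,288

full_jacobian = n * n                   # every output element w.r.t. every input element
print(full_jacobian)                    # 274,877,906,944, i.e. about 275 billion elements
print(4 * full_jacobian / 1e12)         # about 1.1 TB of memory

scalar_output_jacobian = 1 * n          # scalar output: the Jacobian has the size of x
print(4 * scalar_output_jacobian / 1e6) # about 2.1 MB
```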

Slides 79-82: Backpropagation. Assume that xn is a scalar (e.g. the loss). Compute the product of Jacobians starting from the loss end: each partial product dxn/dxi is then a small row vector that can be computed and stored explicitly, while the huge ("uber") Jacobian matrices of the individual layers are never explicitly formed; only their products with a vector are ever needed.
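A toy numerical illustration of this order of evaluation, with made-up layer sizes: the backward pass only ever multiplies the running derivative (a vector the size of an activation) by one layer at a time, and never builds a layer's full Jacobian.

```python
# Two-layer toy network (linear, ReLU, linear) with a scalar output z.
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(100)
W1 = rng.standard_normal((50, 100))
W2 = rng.standard_normal((1, 50))

# forward pass
x1 = W1 @ x0                    # linear layer
x2 = np.maximum(x1, 0)          # ReLU
z = (W2 @ x2).item()            # scalar "loss" xn

# backward pass, from the loss end towards the input
dz_dx2 = W2.ravel()             # size 50, same as x2
dz_dx1 = dz_dx2 * (x1 > 0)      # ReLU backward: multiply by the on/off mask
dz_dx0 = W1.T @ dz_dx1          # size 100, same as x0
dz_dW1 = np.outer(dz_dx1, x0)   # parameter gradient, same size as W1
```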

83 Projected function derivative: the BP-reversed layer. Consider the function f projected onto a vector p, i.e. z = <p, f(x)>. The projected function derivative is dz/dx = (df/dx)^T p. An equivalent circuit is obtained by introducing a transposed function f^T that maps p to (df/dx)^T p.

Slides 84-87: Anatomy of a building block. Forward (eval): y = vl_nnconv(x, W, b) maps the input x and parameters W, b to the output y; the rest of the network maps y to a scalar z in R. Backward (backprop): dzdx = vl_nnconv(x, W, b, dzdy) takes the incoming derivative dz/dy and produces dz/dx (and, likewise, dz/dW and dz/db).

88 Backpropagation network. BP induces a reversed network that maps dxn back to dxn-1, ..., dx1, dx0. Note: the BP network is linear in dx1, ..., dxn-1, dxn. Why?

89 Backpropagation network 89 BP induces a transposed network forward x0 x1 xn-1 xn dx0 dx1 dxn-1 dxn backward

90 Backpropagation network 90 Conv, ReLU, MP and their transposed blocks forward x0 x1 x2 conv ReLU MP x3 dx0 dx1 dx2 conv BP ReLU BP MP BP dx3 backward

91 Sufficient statistics and bottlenecks. Usually much less information than the full forward activations is needed by the backward blocks: conv BP needs nothing (beyond the filters), ReLU BP needs only an on/off mask, and max pooling BP needs only the pooling switches.

92 Three visualisation techniques: modified backpropagation networks that differ only in how the ReLU is reversed.
SaliNet (equivalent to [Simonyan et al. 14]): conv BP, ReLU BP, MP BP (plain backpropagation).
DeConvNet (same as [Zeiler and Fergus 14]): conv BP, a forward ReLU applied to the backward signal, MP BP.
DeSaliNet ([Mahendran and Vedaldi 16, Springenberg et al. 15]): conv BP, ReLU and ReLU BP combined, MP BP.
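Since the three methods differ only in the backward ReLU rule, a minimal sketch of the three rules is given below; the function names are mine and this follows the standard backprop / DeConvNet / guided-backprop formulations rather than any particular codebase.

```python
# Three ways of reversing a ReLU given the backward signal dy and the forward input.
import numpy as np

def relu_backward_salinet(dy, x_forward):
    # SaliNet / plain backprop: keep the signal where the forward *input* was positive
    return dy * (x_forward > 0)

def relu_backward_deconvnet(dy, x_forward):
    # DeConvNet: apply a ReLU to the *backward* signal, ignoring the forward mask
    return dy * (dy > 0)

def relu_backward_desalinet(dy, x_forward):
    # DeSaliNet / guided backprop: impose both conditions at once
    return dy * (x_forward > 0) * (dy > 0)
```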

93 Results. VGG-VD pool layer and VGG-VD FC8: for each, the image and the outputs of DeConvNet, SaliNet and DeSaliNet.

94 They are largely not neuron selective: a key limitation of DeConvNets and similar methods. The bottleneck information r is fixed to the one computed during the forward pass through AlexNet, and the output is computed by choosing e as: the most active neuron (top row), a second neuron chosen at random (middle row), or a positive random mixture of all neurons (bottom row). The results barely differ, particularly for the deeper layers (pool5, fc8). Good for saliency, but bad for studying individual neurons (use inversion, activation maximization, etc. instead).

95 Understanding visual representations 95 Visualizing representations Backpropagation networks and deconvolution Representations: equivalence & transformations

96 When are two representations the same? 96 Learning representations means that there is an endless number of them Variants obtained by learning on different datasets, or different local optima CNN-A x CNN-B E Equivalence ΦB(x) = E ΦA(x)

97 Equivalence. AlexNet, same training data, different parametrization: two networks CNN-A and CNN-B yield representations ΦA and ΦB. Are ΦA and ΦB equivalent?

98 Equivalence. Stitch the networks: feed the output of the first layers of CNN-A, through a learned map E, into the remaining layers of CNN-B, and train E alone with SGD against the classification loss.
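A sketch of such a stitching experiment, assuming head_a (the first layers of CNN-A) and tail_b (the remaining layers of CNN-B) are available as modules with c_a and c_b channels at the cut; these names and the choice of a 1 x 1 convolution for E are illustrative.

```python
# Stitching layer E between two frozen networks; only E is trained.
import torch.nn as nn

class StitchedNet(nn.Module):
    def __init__(self, head_a, tail_b, c_a, c_b):
        super().__init__()
        self.head_a, self.tail_b = head_a, tail_b
        self.E = nn.Conv2d(c_a, c_b, kernel_size=1)        # the learned map E
        for p in list(head_a.parameters()) + list(tail_b.parameters()):
            p.requires_grad_(False)                        # freeze both networks

    def forward(self, x):
        return self.tail_b(self.E(self.head_a(x)))

# train only StitchedNet.E with SGD against the usual classification loss
```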

99 Equivalence with different random seeds: AlexNet A stitched into AlexNet B with a map E inserted at conv1, conv2, conv3, conv4 or conv5. Plot of top-5 error [%] for the baseline network, for the stitched network before training E, and after training E.

100 Equivalence of similar architectures trained on two different datasets: the ILSVRC12 dataset (CNN-IMNET) and the Places dataset (CNN-PLACES).

101 Equivalence with different training data: the Places network stitched into the ImageNet network with a map E inserted at conv1 to conv5. Plot of top-5 error [%] for the baseline, for naive stitching, and for learned stitching layers.

102 Meaningful representations. A representation maps semantic similarity to vector similarity (distance) in an embedding space R^d: a congruous pair (x, y) should map to nearby points, an incongruous pair (x, z) to far-apart points. Desiderata: invariance and distinctiveness.

103 Invariance is task dependent. Are x and y the same category? Near. Are x and y the same colour? Far.

104 Equivariance. g is an image transformation and Mg is the corresponding feature transformation: Φ(gx) = Mg Φ(x).

105 Invariance. g is an image transformation and the corresponding feature map Mg is the identity: Φ(x) = Φ(gx).

106 No equivariance. g is an image transformation, but a corresponding feature map Mg does not exist (or is intractable).

107 Representations and transformations. Equivariance: there is a (simple) Mg with Φ(gx) = Mg Φ(x). Invariance: Mg is the identity. Neither: there is no (simple) Mg.

108 An empirical test of equivariance. Regularized linear regression: learn an affine map Mg Φ(x) = Ag Φ(x) + bg empirically, so that Mg Φ(x) ≈ Φ(gx).
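A sketch of this regression, assuming the features Φ(x) and Φ(gx) have been precomputed as n x d matrices; fit_equivariant_map and the regularisation weight lam are illustrative names and values.

```python
# Ridge regression for the affine map Mg(phi) = Ag @ phi + bg.
import numpy as np

def fit_equivariant_map(feats_x, feats_gx, lam=1e-2):
    # feats_x, feats_gx: n x d matrices of Phi(x) and Phi(g x) over n samples
    n, d = feats_x.shape
    X = np.hstack([feats_x, np.ones((n, 1))])                 # append 1 for the bias bg
    A = np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ feats_gx)
    Ag, bg = A[:d].T, A[d]                                    # Mg Phi(x) = Ag @ Phi(x) + bg
    return Ag, bg
```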

109 An empirical test of equivariance. The learned map Ag is given a sparse structure: a spatial permutation followed by convolution with a 1 x 1 filter bank.

Slides 110-111: An empirical test of equivariance with HOG features and a rotation of 45 degrees: comparison of Φ(x), Mg Φ(x) and Φ(gx) for several example images.

112 CNN: a sequence of representations. We run the same analysis on a typical CNN architecture: AlexNet [Krizhevsky et al. 12], 5 convolutional layers + fully-connected layers, trained on ImageNet ILSVRC. The network is split into a representation Φ followed by a classifier Ψ producing a label, trained against the classification error.

113 A discriminative goal to learn equivariance. The transformation g is applied to the input, Mg is inserted after Φ and learned empirically against the classification error; all the other layers (the representation Φ and the classifier Ψ) are frozen.

114 Vertical flips. Plot of top-5 error [%] with Mg inserted at conv1 to conv5: uncompensated (no TF), uncompensated (TF), compensated (TF with permutation), compensated (TF, learned).

115 Summary. Modern CNNs are still new technology: expect bigger and meaner CNNs to further improve performance. CNNs are more than big balls of parameters, but we do not really understand what they do. Understanding deep nets: What do deep networks do? Are there computational / statistical principles to be learned? How much do deep nets learn about the visual world? Possible angles: visualisations; probing statistical properties.
