Convolutional Neural Networks II. Slides from Dr. Vlad Morariu


1 Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

2 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.)

3 Learning rate The effects of step size (or "learning rate")

4 Multiple gradient update formulas The effects of different update formulas (image credits to Alec Radford)

5 Stochastic Gradient Descent (SGD) Update weights for each sample: $E = \frac{1}{2}\|y_n - \hat{y}_n\|^2$, $w_i(t+1) = w_i(t) - \varepsilon \frac{\partial E_n}{\partial w_i}$. + Fast, online. - Sensitive to noise. Minibatch SGD: update weights for a small set of samples: $E = \frac{1}{2}\sum_{n \in B}\|y_n - \hat{y}_n\|^2$, $w_i(t+1) = w_i(t) - \varepsilon \frac{\partial E_B}{\partial w_i}$. + Fast, online. + Robust to noise. Slide credit: Bohyung Han
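To make the update rules above concrete, here is a minimal NumPy sketch of one SGD / minibatch-SGD step for the squared loss, assuming a simple linear model y_hat = X W; the model, batch size, and learning rate are illustrative choices, not from the slides.

    import numpy as np

    def sgd_step(w, grad, lr=0.01):
        """Plain SGD: move the weights against the gradient of the loss."""
        return w - lr * grad

    def minibatch_grad(W, X_batch, y_batch):
        """Gradient of the squared loss for a linear model y_hat = X W,
        averaged over the minibatch (the slide writes a sum; averaging is
        a common variant and only rescales the learning rate)."""
        err = X_batch @ W - y_batch           # residuals on the batch
        return X_batch.T @ err / len(X_batch)

    # usage: one minibatch update
    W = np.zeros((5, 1))
    X_batch, y_batch = np.random.randn(32, 5), np.random.randn(32, 1)
    W = sgd_step(W, minibatch_grad(W, X_batch, y_batch), lr=0.1)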

6 Momentum Remember the previous direction. + Converge faster. + Avoid oscillation. $v_i(t) = \alpha v_i(t-1) - \varepsilon \frac{\partial E}{\partial w_i}(t)$, $w(t+1) = w(t) + v(t)$. Slide credit: Bohyung Han
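A minimal sketch of the momentum update from the slide, writing the learning rate ε as lr and the momentum coefficient α as alpha; the default values are illustrative.

    import numpy as np

    def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
        """Momentum: v(t) = alpha * v(t-1) - lr * dE/dw, then w(t+1) = w(t) + v(t)."""
        v = alpha * v - lr * grad
        return w + v, v

    # usage: the velocity v remembers the previous direction across steps
    w, v = np.zeros(3), np.zeros(3)
    w, v = momentum_step(w, v, grad=np.array([0.5, -0.2, 0.1]))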

7 Weight Decay Penalize the size of the weights: $C = E + \frac{\lambda}{2}\sum_i w_i^2$, so $w_i(t+1) = w_i(t) - \varepsilon \frac{\partial C}{\partial w_i} = w_i(t) - \varepsilon\left(\frac{\partial E}{\partial w_i} + \lambda w_i\right)$. + Improves generalization a lot! Slide credit: Bohyung Han
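A minimal sketch of one weight-decay step, assuming the penalty is written as C = E + (λ/2) Σ_i w_i² so that ∂C/∂w_i = ∂E/∂w_i + λ w_i as in the update above; the value of lam is illustrative.

    import numpy as np

    def weight_decay_step(w, grad, lr=0.01, lam=1e-4):
        """SGD step on C = E + (lambda/2) * sum_i w_i^2:
        w(t+1) = w(t) - lr * (dE/dw + lambda * w)."""
        return w - lr * (grad + lam * w)

    # usage: with a zero data gradient, the penalty alone shrinks the weights
    w = np.ones(3)
    w = weight_decay_step(w, grad=np.zeros(3))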

8 Issues in Deep Neural Networks Large amount of training time: there is sometimes a lot of training data, many iterations (epochs) are typically required for optimization, and computing gradients in each iteration takes too much time. Overfitting: the learned function fits training data well but performs poorly on new data (high-capacity model, not enough training data). Vanishing gradient problem: backpropagation multiplies per-layer factors, e.g. $\frac{\partial E}{\partial w_{ki}} = \sum_n \frac{\partial z_i^n}{\partial w_{ki}}\,\frac{d y_i^n}{d z_i^n}\,\frac{\partial E}{\partial y_i^n}$ with $\frac{\partial E}{\partial y_i^n} = \sum_j w_{ij}\,\frac{d y_j^n}{d z_j^n}\,\frac{\partial E}{\partial y_j^n}$; for sigmoid units the derivative $\frac{d y}{d z}$ is small (at most 0.25), so gradients in the lower layers are typically extremely small and optimizing multi-layer neural networks takes a huge amount of time. Slide credit: adapted from Bohyung Han
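A small numeric illustration of the vanishing gradient effect described above: the backpropagated factor contributed by a sigmoid unit is at most σ'(z) ≤ 0.25, so the product of such factors over many layers shrinks toward zero. The depth and the assumption of weights near 1 are illustrative.

    import numpy as np

    def sigmoid_deriv(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)              # maximum value 0.25, reached at z = 0

    # product of per-layer factors |w| * sigmoid'(z) across 10 layers,
    # with weights taken to be ~1 and pre-activations near z = 0
    factor = 1.0
    for layer in range(10):
        factor *= 1.0 * sigmoid_deriv(0.0)
    print(factor)                         # 0.25**10, about 1e-6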

9 New winter and revival in the early 2000s New winter in the early 2000s due to problems with training NNs; Support Vector Machines (SVMs) and Random Forests (RFs) were easy to train and had nice theory. Revival: name change ("neural networks" -> "deep learning") + algorithmic developments (unsupervised layer-wise pre-training; ReLU, dropout, layer normalization) + big data + GPU computing = large outperformance on many datasets (Vision: ILSVRC'12)

10 Big Data ImageNet Large Scale Visual Recognition Challenge: 1000 categories with 1000 images per category; 1.2 million training images, 50,000 validation, 150,000 testing. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.

11 AlexNet Architecture Figure credit: Krizhevsky et al., NIPS 2012. 60 million parameters! Various tricks: ReLU nonlinearity, overlapping pooling, local response normalization, dropout (set hidden neuron output to 0 with probability 0.5), data augmentation, training on GPUs. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
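A minimal sketch of the dropout trick listed above. This uses the "inverted dropout" convention (rescale the surviving activations at training time, do nothing at test time), which is a common variant rather than the exact recipe in the AlexNet paper.

    import numpy as np

    def dropout(h, p_drop=0.5, train=True):
        """Zero each hidden activation with probability p_drop during training,
        rescaling the rest so the expected activation matches test time."""
        if not train:
            return h
        mask = (np.random.rand(*h.shape) >= p_drop) / (1.0 - p_drop)
        return h * mask

    # usage
    h = np.random.randn(4, 100)
    h_train = dropout(h, p_drop=0.5, train=True)
    h_test = dropout(h, train=False)      # identity at test time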

12 GPU Computing Big data and big models require lots of computational power. GPUs: thousands of cores for parallel operations; multiple GPUs. It still took about 5-6 days to train AlexNet on two NVIDIA GTX 580 3GB GPUs (much faster today).

13 Architecture overview Image credit: LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. Components: convolution layers, pooling/subsampling layers, fully connected layers, batch normalization layers.

14 Convolutional Layer 32x32x3 image: width 32, height 32, depth 3.

15 Convolutional Layer 32x32x3 image, 5x5x3 filter. Convolve the filter with the image, i.e. slide over the image spatially, computing dot products.

16 Convolutional Layer 32x32x3 image, 5x5x3 filter. Filters always extend the full depth of the input volume. Convolve the filter with the image, i.e. slide over the image spatially, computing dot products.

17 Convolutional Layer 32x32x3 image, 5x5x3 filter. 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias).

18 Convolutional Layer 32x32x3 image, 5x5x3 filter. Convolve (slide) over all spatial locations to produce a 28x28 activation map.

19 Convolutional Layer Consider a second, green filter: 32x32x3 image, 5x5x3 filter. Convolving (sliding) over all spatial locations gives a second 28x28 activation map.

20 Convolutional Layer For example, if we had 6 5x5 filters, we'll get 6 separate 28x28 activation maps. We stack these up to get a new image of size 28x28x6! (A naive implementation is sketched below.)
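A naive sketch of the convolution just described: sliding 6 filters of shape 5x5x3 over a 32x32x3 image with stride 1 and no padding, where each output value is a 75-dimensional dot product plus a bias. Function and variable names are illustrative, and real implementations use much faster routines.

    import numpy as np

    def conv_forward(image, filters, biases):
        """Naive convolution, stride 1, no padding.
        image: (H, W, depth); filters: (num_filters, F, F, depth)."""
        H, W, _ = image.shape
        K, F, _, _ = filters.shape
        out = np.zeros((H - F + 1, W - F + 1, K))
        for k in range(K):
            for i in range(H - F + 1):
                for j in range(W - F + 1):
                    patch = image[i:i+F, j:j+F, :]          # 5x5x3 chunk
                    out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
        return out

    image = np.random.randn(32, 32, 3)
    filters = np.random.randn(6, 5, 5, 3)
    print(conv_forward(image, filters, np.zeros(6)).shape)  # (28, 28, 6)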

21 Convolutional Layer ConvNet is a sequence of convolutional layers, interspersed with activation functions: e.g. a 32x32x3 input passes through CONV, ReLU with 6 5x5x3 filters to give a 28x28x6 output.

22 Convolutional Layer ConvNet is a sequence of convolutional layers, interspersed with activation functions: 32x32x3 input -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (e.g. 10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU -> ...

23 The brain/neuron view of CONV Layer 32x32x3 image, 5x5x3 filter. 1 number: the result of taking a dot product between the filter and this part of the image (i.e. a 5*5*3 = 75-dimensional dot product).

24 The brain/neuron view of CONV Layer 32x32x3 image, 5x5x3 filter. It's just a neuron with local connectivity... 1 number: the result of taking a dot product between the filter and this part of the image (i.e. a 5*5*3 = 75-dimensional dot product).

25 The brain/neuron view of CONV Layer An activation map is a 28x28 sheet of neuron outputs: 1. each is connected to a small region in the input; 2. all of them share parameters. A 5x5 filter gives a 5x5 receptive field for each neuron.

26 The brain/neuron view of CONV Layer E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5). There will be 5 different neurons all looking at the same region in the input volume.

27 Pooling Layer - makes the representations smaller and more manageable - operates over each activation map independently

28 Pooling Layer MAX POOLING: on a single depth slice (x, y), max pool with 2x2 filters and stride 2.
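A minimal sketch of the 2x2, stride-2 max pooling shown above, applied to each depth slice independently; shapes and names are illustrative.

    import numpy as np

    def max_pool(x, size=2, stride=2):
        """Max pooling over non-overlapping 2x2 windows of each depth slice."""
        H, W, D = x.shape
        out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
        out = np.zeros((out_h, out_w, D))
        for i in range(out_h):
            for j in range(out_w):
                window = x[i*stride:i*stride+size, j*stride:j*stride+size, :]
                out[i, j, :] = window.max(axis=(0, 1))
        return out

    print(max_pool(np.random.randn(28, 28, 6)).shape)       # (14, 14, 6)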

29 Many more details Data preprocessing, initialization, learning rate, batch normalization layer, ReLU, PReLU, other activation functions

30 AlexNet

31 Convolutional filter visualization [From recent Yann LeCun slides]

32 Convolutional filter visualization [From recent Yann LeCun slides]

33 Convolutional filter visualization One filter => one activation map; example 5x5 filters ( total). We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image).

34 Case Study: VGGNet [Simonyan and Zisserman, 2014] Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2. Best model: 11.2% top-5 error in ILSVRC'13 -> 7.3% top-5 error.
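A small worked sketch of the output-size arithmetic behind VGGNet's design, using the standard formula (W - F + 2P)/S + 1: a 3x3 conv with stride 1 and pad 1 preserves the spatial size, while a 2x2 max pool with stride 2 halves it. The block structure below (two convs per block, five blocks, 224x224 input) is a simplification, not the exact VGG configuration.

    def conv_out(w, f=3, pad=1, stride=1):
        """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
        return (w - f + 2 * pad) // stride + 1

    def pool_out(w, f=2, stride=2):
        return (w - f) // stride + 1

    size = 224
    for block in range(5):
        size = conv_out(conv_out(size))   # two 3x3 convs keep the spatial size
        size = pool_out(size)             # 2x2 pool with stride 2 halves it
    print(size)                           # 224 -> 112 -> 56 -> 28 -> 14 -> 7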

35 Case Study: GoogLeNet [Szegedy et al., 2014] Inception module. ILSVRC 2014 winner (6.7% top-5 error)

36 Case Study: ResNet [He et al., 2015]
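A minimal sketch of the key ResNet idea, a residual block that computes x + F(x) so the identity shortcut lets gradients flow directly to earlier layers; conv1 and conv2 are hypothetical stand-ins for the block's convolutional layers.

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def residual_block(x, conv1, conv2):
        """Residual block: output = relu(x + F(x)), with F = conv2(relu(conv1(.)))."""
        return relu(x + conv2(relu(conv1(x))))

    # usage with identity stand-ins, just to show the wiring
    x = np.random.randn(8)
    y = residual_block(x, conv1=lambda v: v, conv2=lambda v: v)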

37 Questions? References (& great tutorials):

38 Unsupervised Neural Networks Autoencoders: encode then decode the same input (input x -> hidden layer -> output x); no supervision needed. (Restricted) Boltzmann Machines (RBMs): stochastic networks that can learn representations; in the restricted version, the neurons must form a bipartite graph. H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59, 4-5 (September 1988). Ackley, David H.; Hinton, Geoffrey E.; Sejnowski, Terrence J. A learning algorithm for Boltzmann machines. Cognitive Science, 1985. Smolensky, Paul. Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, 1986.
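A minimal sketch of an autoencoder forward pass as described above: encode the input into a smaller hidden code, decode it back, and measure reconstruction error, with no labels involved. The weight shapes, bottleneck size, and ReLU choice are illustrative.

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def autoencoder_forward(x, W_enc, W_dec):
        """Encode then decode the same input; training minimizes ||x - x_hat||^2."""
        h = relu(W_enc @ x)          # hidden code (bottleneck)
        x_hat = W_dec @ h            # reconstruction
        loss = 0.5 * np.sum((x - x_hat) ** 2)
        return x_hat, loss

    x = np.random.randn(64)
    W_enc, W_dec = 0.1 * np.random.randn(16, 64), 0.1 * np.random.randn(64, 16)
    x_hat, loss = autoencoder_forward(x, W_enc, W_dec)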

39 Recurrent Neural Networks Networks with loops: the output of a layer is used as input for the same (or a lower) layer. Can model dynamics (e.g. in space or time). Loops are unrolled, giving a standard feed-forward network with many layers, which suffers from the vanishing gradient problem: in theory it can learn long-term memory, in practice not (Bengio et al., 1994). Image credit: Christopher Olah's blog. Sepp Hochreiter (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber. Y. Bengio, P. Simard, P. Frasconi. Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE TNN, 1994.
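A minimal sketch of a vanilla RNN unrolled over time, as described above: the same weights are reused at every step and the hidden state carries information forward. The tanh nonlinearity, sizes, and names are illustrative.

    import numpy as np

    def rnn_forward(xs, h0, W_xh, W_hh, b):
        """Unrolled vanilla RNN: h(t) = tanh(W_xh x(t) + W_hh h(t-1) + b)."""
        h, hs = h0, []
        for x in xs:                      # xs: sequence of input vectors over time
            h = np.tanh(W_xh @ x + W_hh @ h + b)
            hs.append(h)
        return hs

    xs = [np.random.randn(10) for _ in range(5)]          # a length-5 sequence
    hs = rnn_forward(xs, np.zeros(8), 0.1 * np.random.randn(8, 10),
                     0.1 * np.random.randn(8, 8), np.zeros(8))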

40 Long Short-Term Memory (LSTM) Image credit: Christopher Olah's blog. A type of RNN explicitly designed not to have the vanishing or exploding gradient problem. Models long-term dependencies. Memory is propagated and accessed by gates. Used for speech recognition, language modeling. Hochreiter, Sepp; Schmidhuber, Jürgen. Long Short-Term Memory. Neural Computation, 1997.

41 Image Classification Performance Figure from: K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. arXiv, 2015 (slides). Image Classification Top-5 Errors (%). Slide credit: Bohyung Han
