Introduction to CNN and PyTorch

1 Introduction to CNN and PyTorch Kripasindhu Sarkar Kaiserslautern University, DFKI Deutsches Forschungszentrum für Künstliche Intelligenz Some of the contents are taken from Stanford CS231n Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

2 Contents Machine learning fundamentals Convolutional Neural Networks Models, parameters, scores Loss function Back-propagation Convolution filters Convolution operator CNN architecture Popular CNN architectures General training principles: transfer learning/model initializations/optimization strategies Introduction to PyTorch General overview Examples from Official documentation Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

3 Machine learning - Classification Given an image predict the label Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

4 Machine learning - Classification Classifier (magic): input image -> label "cat" Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

5 Machine learning - Classification Train: Classifier (Learn W) - learn a model f() with parameters W. Predict: Classifier (Use W) - input image -> label "cat" Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

6 Machine learning - Classification Model/Score function - F(X, W) Takes input: data sample - X and parameters - W W - internal parameters or weights Maps input data X to class scores More score for a class - more likely it belongs to that class N classes - N different scores Predicted class - class with max score? Loss function L (F, data samples) Data samples - { (Xi, yi) } Measures how good F is Higher loss - bad model Lower loss - good model Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

7 Parametric Approach Image f(x,w) Array of 32x32x3 numbers (3072 numbers total) 10 numbers giving class scores W parameters or weights Lecture 2-7 April 5, 2018

8 Parametric Approach: Linear Classifier Image f(x,w) = Wx f(x,w) Array of 32x32x3 numbers (3072 numbers total) 10 numbers giving class scores W parameters or weights Lecture 2-8 April 5, 2018

9 Parametric Approach: Linear Classifier Image f(x,W) = Wx + b, with x: 3072x1, W: 10x3072, b: 10x1, f(x,W): 10x1 Array of 32x32x3 numbers (3072 numbers total) 10 numbers giving class scores W parameters or weights Lecture 2-9 April 5, 2018

10 Example with an image with 4 pixels, and 3 classes (cat/dog/ship) Stretch pixels into a column vector Input image Lecture 2-10 April 5, 2018

11 Example with an image with 4 pixels, and 3 classes (cat/dog/ship) Stretch pixels into a column vector; (cat score, dog score, ship score) = W x + b Lecture 2-11 April 5, 2018
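To make the score function concrete, here is a minimal numeric sketch of f(x, W) = Wx + b for a 4-pixel image and 3 classes; the particular pixel, weight, and bias values are purely illustrative.

import numpy as np

# 4-pixel image stretched into a column vector
x = np.array([56.0, 231.0, 24.0, 2.0])
# one row of weights per class (cat, dog, ship), plus one bias per class
W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])

scores = W.dot(x) + b      # one score per class
print(scores)              # the predicted class is the index of the max score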

12 Machine learning - Classification Model/Score function - F(X, W) Takes input: data sample - X and parameters - W W - internal parameters or weights Maps input data X to class scores More score for a class - more likely it belongs to that class N classes - N different scores Predicted class - class with max score? Loss function L (F, data samples) Data samples - { (Xi, yi) } Measures how good F is Higher loss - bad model Lower loss - good model Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

13 Suppose: 3 training examples, 3 classes. With some W the scores are as shown (cat / car / frog). A loss function tells how good our current classifier is. Given a dataset of examples {(x_i, y_i)}, where x_i is the image and y_i is the (integer) label, the loss over the dataset is a sum of loss over examples: L = (1/N) Σ_i L_i(f(x_i, W), y_i) Lecture 3-13 April 10, 2018

14 Suppose: 3 training examples, 3 classes. With some W the scores are as shown (cat / car / frog). Multiclass SVM loss: Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1) Lecture 3-14 April 10, 2018

15 Suppose: 3 training examples, 3 classes. With some W the scores are as shown (cat / car / frog). Multiclass SVM loss (the hinge loss): Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1) Lecture 3-15 April 10, 2018
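A small sketch of the multiclass SVM (hinge) loss defined above, for a single training example; the scores below are illustrative, and y marks the correct class.

import numpy as np

def svm_loss(scores, y, delta=1.0):
    # hinge loss: sum over the wrong classes of max(0, s_j - s_y + delta)
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0          # the correct class does not contribute
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])   # illustrative cat / car / frog scores
print(svm_loss(scores, y=0))          # about 2.9 for these numbers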

16 E.g. Suppose that we found a W such that L = 0. Is this W unique? No! 2W also has L = 0! How do we choose between W and 2W? Lecture 3-16 April 10, 2018

17 Regularization Full loss = data loss + regularization: L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W) Data loss: Model predictions should match training data. Regularization: Prevent the model from doing too well on training data. λ = regularization strength (hyperparameter) Simple examples - L2 regularization: R(W) = Σ_k Σ_l W_{k,l}^2 - L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}| - Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}^2 + |W_{k,l}|) More complex: Dropout, Batch normalization, Stochastic depth, fractional pooling, etc Lecture 3-17 April 10, 2018

18 Softmax Classifier (Multinomial Logistic Regression) Want to interpret raw classifier scores as probabilities Softmax function: P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j} Probabilities must be >= 0 (take exp of the unnormalized scores) and must sum to 1 (normalize) cat / car / frog scores -> exp -> unnormalized probabilities -> normalize -> probabilities Lecture 3-18 April 10, 2018

19 Softmax Classifier (Multinomial Logistic Regression) Want to interpret raw classifier scores as probabilities The raw scores are unnormalized log-probabilities / logits Softmax function: P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j} Probabilities must be >= 0 (exp) and must sum to 1 (normalize) Cross entropy loss: L_i = -log P(Y = y_i | X = x_i), e.g. L_i = -log(0.13) = 0.89 in the slide example Lecture 3-19 April 10, 2018
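The softmax and cross-entropy computation described above, as a short sketch; the scores are illustrative and the logarithm here is the natural log.

import numpy as np

def cross_entropy_loss(scores, y):
    shifted = scores - np.max(scores)                    # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))    # softmax: >= 0 and sums to 1
    return -np.log(probs[y])                             # cross-entropy loss for the true class y

scores = np.array([3.2, 5.1, -1.7])    # unnormalized log-probabilities / logits
print(cross_entropy_loss(scores, y=0))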

20 Machine learning - Classification Model/Score function - F(X, W) Takes input: data sample - X and parameters - W W - internal parameters or weights Maps input data X to class scores More score for a class - more likely it belongs to that class N classes - N different scores Predicted class - class with max score? Loss function L (F, data samples) Data samples - { (Xi, yi) } Measures how good F is Higher loss - bad model Lower loss - good model Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

21 Machine learning - Classification Model/Score function - F(X, W) Takes input: data sample - X and parameters - W W - internal parameters or weights Maps input data X to class scores More score for a class - more likely it belongs to that class N classes - N different scores Predicted class - class with max score? Loss function L (F, data samples) Data samples - { (Xi, yi) } Measures how good F is Higher loss - bad model Lower loss - good model Determine the parameters W that give the lowest loss! Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

22 Optimization Find W that minimizes the loss function L(X, W). Random search? Follow the negative gradient. Gradient of f(x): ∇f(x). -∇f(x): direction of the steepest descent - the direction along which the function decreases the most Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

23 Optimization Minimize the loss function L(W) wrt. W Find the gradient of L: ∇L(W) Update W as W = W - h ∇L(W), where h = step size/learning rate Do this iteratively Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018
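A minimal sketch of this iterative update on a toy quadratic loss whose gradient is known in closed form; the loss and its gradient here are assumptions purely for illustration.

import numpy as np

# toy loss L(W) = ||W - W_star||^2, with gradient 2 * (W - W_star)
W_star = np.array([1.0, -2.0, 3.0])
W = np.zeros(3)
h = 0.1                          # step size / learning rate

for step in range(100):
    grad = 2.0 * (W - W_star)    # gradient of L at the current W
    W = W - h * grad             # move in the negative gradient direction

print(W)                         # approaches W_star as the loss is minimized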

24 Optimization Find W that minimizes the loss function. (Figure: loss surface over the weights W_1, W_2; from the original W, step in the negative gradient direction) Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

25 Optimization - Backpropagation A systematic method of computing the gradient of a function f(x): write the compound expression of f(x) as a computational graph and recursively apply the chain rule. Often called the backward phase of the network. Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

26-36 Backpropagation: a simple example f(x, y, z) = (x + y) z, e.g. x = -2, y = 5, z = -4 Want: the gradients df/dx, df/dy, df/dz. Working backward through the computational graph with q = x + y: df/dz = q = 3, df/dq = z = -4, and by the chain rule df/dx = df/dq * dq/dx = -4 and df/dy = df/dq * dq/dy = -4. Chain rule at each node: [downstream gradient] = [upstream gradient] x [local gradient] Lecture 4-26 to 4-36 April 13, 2017
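The same example can be checked with PyTorch autograd, which performs exactly this chain-rule bookkeeping; a small sketch using the example values above.

import torch

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

q = x + y            # intermediate node, q = 3
f = q * z            # f = -12

f.backward()         # backpropagation through the computational graph
print(x.grad, y.grad, z.grad)   # df/dx = z = -4, df/dy = z = -4, df/dz = q = 3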

37 At each node f: the upstream gradient coming from the loss is multiplied by the local gradient of f to produce the gradients passed back to its inputs Lecture 4-37 April 12, 2018

38 Machine learning - Summary Model/Score function - F(X, W) Loss function L (F, data samples) Maps input data X to class scores More score for a class - more likely it belongs to that class Measures how good F is Good parameters are found by - minimizing loss function L(W) wrt the variables W The function is minimized by iteratively updating the weights in the negative gradient direction. Gradients are computed using backpropagation. Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

39 Machine learning - Examples Models - F(X, W): Linear; Fully Connected (FC), multilayered FC; CNNs Loss functions: SVM; Cross Entropy; Euclidean/L2 (mostly for regression) Optimization strategy: Gradient descent; ADAM, RMSPROP Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

40 Machine learning - Examples Models - F(X, W): Linear; Fully Connected (FC), multilayered FC; CNNs Loss functions: SVM; Cross Entropy; Euclidean/L2 (mostly for regression) Optimization strategy: Gradient descent; ADAM, RMSPROP Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

41 Neural networks: without the brain stuff (Before) Linear score function: f = W x (Now) 2-layer Neural Network: f = W2 max(0, W1 x) x (3072) -> W1 -> h (100) -> W2 -> s (10) Lecture 4-41 April 12, 2018
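A sketch of this 2-layer score function in plain numpy (random weights, purely illustrative):

import numpy as np

x = np.random.randn(3072)               # input image stretched to 3072 numbers
W1 = 0.01 * np.random.randn(100, 3072)  # first layer weights
W2 = 0.01 * np.random.randn(10, 100)    # second layer weights

h = np.maximum(0, W1.dot(x))   # hidden layer with the max(0, .) non-linearity, 100 values
s = W2.dot(h)                  # class scores, 10 values
print(s.shape)                 # (10,)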

42 (Figure: a biological neuron - impulses carried toward the cell body by the dendrites, and away from the cell body along the axon to the presynaptic terminal. This image by Felipe Perucho is licensed under CC-BY 3.0.) sigmoid activation function Lecture 4-42 April 12, 2018

43 Convolutional Neural Networks Lecture 5-43 April 17, 2018

44 Fully Connected Layer 32x32x3 image -> stretch to 3072 x 1 input (3072 x 1) and weights (10 x 3072) give a 1 x 10 activation Lecture 5-44 April 17, 2018

45 Fully Connected Layer 32x32x3 image -> stretch to 3072 x 1 input (3072 x 1) and weights (10 x 3072) give a 1 x 10 activation; each number is the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product) Lecture 5-45 April 17, 2018
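The same stretch-and-dot-product view of a fully connected layer, as a short sketch with random values:

import numpy as np

image = np.random.randn(32, 32, 3)   # 32x32x3 image
x = image.reshape(-1)                # stretch to a 3072 x 1 vector
W = np.random.randn(10, 3072)        # 10 x 3072 weights

activation = W.dot(x)                # each output is a 3072-dimensional dot product
print(activation.shape)              # (10,) - one number per output neuron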

46 Convolution Layer 32x32x3 image -> preserve spatial structure 32 height 32 width 3 depth Lecture 5-46 April 17, 2018

47 Convolution Layer 32x32x3 image, 5x5x3 filter. Convolve the filter with the image i.e. slide over the image spatially, computing dot products Lecture 5-47 April 17, 2018

48 Convolution Layer Filters always extend the full depth of the input volume. 32x32x3 image, 5x5x3 filter. Convolve the filter with the image i.e. slide over the image spatially, computing dot products Lecture 5-48 April 17, 2018

49 Convolution Layer 32x32x3 image, 5x5x3 filter. 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias) Lecture 5-49 April 17, 2018

50 Convolution Layer 32x32x3 image, 5x5x3 filter: convolve (slide) over all spatial locations to get an activation map Lecture 5-50 April 17, 2018

51 Convolution Layer Consider a second, green filter: 32x32x3 image, 5x5x3 filter, convolve (slide) over all spatial locations to get a second 28x28 activation map Lecture 5-51 April 17, 2018

52 Convolution Layer For example, if we had 6 5x5 filters, we'll get 6 separate activation maps. We stack these up to get a new image of size 28x28x6! Lecture 5-52 April 17, 2018

53 Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions: CONV, ReLU (e.g. 6 5x5x3 filters) Lecture 5-53 April 17, 2018

54 Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions: CONV, ReLU (e.g. 6 5x5x3 filters) -> CONV, ReLU (e.g. 10 5x5x6 filters) -> CONV, ReLU -> ... Lecture 5-54 April 17, 2018

55 Preview [Zeiler and Fergus 2013] Lecture 5-55 April 17, 2018

56-60 A closer look at spatial dimensions: 7x7 input (spatially), assume 3x3 filter, slid across the input one position at a time => 5x5 output Lecture 5-56 to 5-60 April 17, 2018

61 Common settings: K = number of filters (powers of 2, e.g. 32, 64, 128, 512) - F = 3, S = 1, P = 1 - F = 5, S = 1, P = 2 - F = 5, S = 2, P = ? (whatever fits) - F = 1, S = 1, P = 0 (F = filter size, S = stride, P = zero padding) Lecture 5-61 April 17, 2018

62 Corresponding PyTorch method: class torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True) Lecture 5-62 April 17, 2018
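To tie the common settings to this layer, a brief sketch; the output spatial size follows (N - F + 2P)/S + 1, so F=3, S=1, P=1 preserves the input size.

import torch
import torch.nn as nn

# K = 32 filters, F = 3, S = 1, P = 1 on a 32x32 input with 3 channels
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
out = conv(x)
print(out.shape)                # torch.Size([1, 32, 32, 32]); (32 - 3 + 2*1)/1 + 1 = 32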

63 Remember back to: E.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially! (32 -> 28 -> 24...). Shrinking too fast is not good, doesn't work well. (CONV, ReLU e.g. 6 5x5x3 filters -> CONV, ReLU e.g. 10 5x5x6 filters -> CONV, ReLU) Lecture 5-63 April 17, 2018

64 Pooling layer - makes the representations smaller and more manageable - operates over each activation map independently: Lecture 5-64 April 17, 2018

65 MAX POOLING Single depth slice: max pool with 2x2 filters and stride 2 Lecture 5-65 April 17, 2018
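A quick sketch of 2x2 max pooling with stride 2 in PyTorch; the 4x4 input values are illustrative.

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).view(1, 1, 4, 4)   # (batch, channel, H, W)

out = F.max_pool2d(x, kernel_size=2, stride=2)   # max over each 2x2 window
print(out.view(2, 2))                            # tensor([[6., 8.], [3., 4.]])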

66 Summary - ConvNets stack CONV,POOL,FC layers - Trend towards smaller filters and deeper architectures - Trend towards getting rid of POOL/FC layers (just CONV) - Typical architectures look like [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX where N is usually up to ~5, M is large, 0 <= K <= 2. - but recent advances such as ResNet/GoogLeNet challenge this paradigm Lecture 5-66 April 17, 2018

67 CNN Architectures Case Studies: AlexNet, VGG, GoogLeNet, ResNet. Also: NiN (Network in Network), Wide ResNet, ResNeXT, Stochastic Depth, Squeeze-and-Excitation Network, DenseNet, FractalNet, SqueezeNet, NASNet Lecture 9-6 May 1, 2018

68 Review: LeNet-5 [LeCun et al., 1998] Conv filters were 5x5, applied at stride 1 Subsampling (Pooling) layers were 2x2 applied at stride 2 i.e. architecture is [CONV-POOL-CONV-POOL-FC-FC] Lecture 9-6 May 1, 2018

69 Case Study: AlexNet [Krizhevsky et al. 2012] Architecture: CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8 Lecture 9-6 May 1, 2018

70 Case Study: AlexNet [Krizhevsky et al. 2012] Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives: first use of ReLU; used Norm layers (not common anymore); heavy data augmentation; dropout 0.5; batch size 128; SGD Momentum 0.9; learning rate 1e-2, reduced by 10 manually when val accuracy plateaus; L2 weight decay 5e-4 Lecture 9-70 May 1, 2018

71 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners (Figure: number of layers of the winning model by year - shallow, 8 layers, 8 layers, 19 layers, 22 layers, 152 layers - Deeper Networks) Lecture 9-71 May 1, 2018

72 Case Study: VGGNet [Simonyan and Zisserman, 2014] Small filters, Deeper networks 8 layers (AlexNet) -> 16-19 layers (VGG16Net) Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2 11.7% top 5 error in ILSVRC'13 (ZFNet) -> 7.3% top 5 error in ILSVRC'14 AlexNet, VGG16, VGG19 Lecture 9-72 May 1, 2018

73 Case Study: GoogLeNet [Szegedy et al., 2014] Deeper networks, with computational efficiency - 22 layers Efficient Inception module No FC layers Only 5 million parameters! 12x less than AlexNet - ILSVRC 14 classification winner (6.7% top 5 error) Inception module Lecture 9-73 May 1, 2018

74 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners - Revolution of Depth (Figure: number of layers of the winning model by year - shallow, 8 layers, 8 layers, 19 layers, 22 layers, 152 layers) Lecture 9-74 May 1, 2018

75 Case Study: ResNet [He et al., 2015] Very deep networks using residual connections - 152-layer model for ImageNet - ILSVRC'15 classification winner (3.57% top 5 error) - Swept all classification and detection competitions in ILSVRC'15 and COCO'15! Residual block: output = F(x) + x, where F(x) is a small stack of conv layers with relu and x is carried by an identity shortcut Lecture 9-75 May 1, 2018
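A minimal sketch of a residual block in PyTorch (simplified: real ResNet blocks also use batch normalization and, in some stages, a projection shortcut):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # output = relu(F(x) + x), where F is a small stack of conv layers
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)       # add the identity shortcut, then relu

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])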

76 Transfer Learning "You need a lot of data if you want to train/use CNNs" Lecture 7-76 April 24, 2018

77 Transfer Learning "You need a lot of data if you want to train/use CNNs" - BUSTED Lecture 7-77 April 24, 2018

78 Transfer Learning with CNNs (Donahue et al, DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, ICML 2014; Razavian et al, CNN Features Off-the-Shelf: An Astounding Baseline for Recognition, CVPR Workshops 2014) 1. Train on ImageNet. 2. Small dataset (C classes): reinitialize the final FC-1000 layer as FC-C and train only this layer; freeze all the earlier layers. 3. Bigger dataset: train more of the later layers and freeze the earlier ones; lower the learning rate when finetuning - 1/10 of the original LR is a good starting point Lecture 7-78 April 24, 2018

79 Transfer learning: when to finetune what (later layers of the pretrained network are more specific, earlier layers more generic): very little data + very similar dataset: use a linear classifier on the top layer; very little data + very different dataset: you're in trouble... try a linear classifier from different stages; quite a lot of data + very similar dataset: finetune a few layers; quite a lot of data + very different dataset: finetune a larger number of layers Lecture 7-79 April 24, 2018
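A sketch of this recipe in PyTorch using a torchvision model; the slides use a VGG-style network, so resnet18 and the 10-class head below are substitutions chosen for brevity.

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)   # 1. start from an ImageNet-pretrained model

for param in model.parameters():
    param.requires_grad = False            # 2. freeze the pretrained layers

num_classes = 10                           # C classes in the small target dataset (assumed)
model.fc = nn.Linear(model.fc.in_features, num_classes)   # reinitialized last layer stays trainable

# 3. With a bigger dataset, unfreeze some of the later layers as well and finetune
#    with a lower learning rate (around 1/10 of the original).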

80 Deep Learning Software Lecture 8-2 April 26, 2018

81 A zoo of frameworks! Caffe (UC Berkeley), Caffe2 (Facebook), Torch (NYU / Facebook), PyTorch (Facebook), Theano (U Montreal), TensorFlow (Google), PaddlePaddle (Baidu), Chainer, MXNet (Amazon; developed by U Washington, CMU, MIT, Hong Kong U, etc, but main framework of choice at AWS), CNTK (Microsoft), Deeplearning4j, and others... Lecture 8-2 April 26, 2018

82 Recall: Computational Graphs x and W feed a * node producing s (scores); s goes through the hinge loss, which is summed with the regularization term R(W) to give the total loss L Lecture 8-2 April 26, 2018

83 Recall: Computational Graphs input image and weights flow through the network graph to produce the loss (Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Reproduced with permission.) Lecture 8-2 April 26, 2018

84 Computational Graphs - Numpy Example graph: a = x * y, b = a + z, c = sum(b) Lecture 8-2 April 26, 2018

85 Computational Graphs - Numpy Good: clean API, easy to write numeric code. Bad: have to compute our own gradients; can't run on GPU Lecture 8-3 April 26, 2018

86 Computational Graphs - Numpy vs PyTorch The PyTorch forward pass looks exactly like numpy! Lecture 8-3 April 26, 2018

87 Computational Graphs - Numpy vs PyTorch PyTorch handles gradients for us! Lecture 8-3 April 26, 2018
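The actual code on these slides is not reproduced in the transcript; the sketch below reconstructs the idea - the same small graph written with numpy (gradients by hand) and with PyTorch (gradients via backward()).

import numpy as np
import torch

x, y, z = np.random.randn(3), np.random.randn(3), np.random.randn(3)

# Numpy: forward pass, then hand-written backward pass
a = x * y
b = a + z
c = np.sum(b)
grad_b = np.ones(3)        # dc/db = 1 elementwise
grad_z = grad_b.copy()     # db/dz = 1
grad_a = grad_b.copy()     # db/da = 1
grad_x = grad_a * y        # da/dx = y
grad_y = grad_a * x        # da/dy = x

# PyTorch: the forward pass looks the same, backward() fills in the gradients
xt = torch.tensor(x, requires_grad=True)
yt = torch.tensor(y, requires_grad=True)
zt = torch.tensor(z, requires_grad=True)
ct = torch.sum(xt * yt + zt)
ct.backward()
print(np.allclose(grad_x, xt.grad.numpy()))   # True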

88 PyTorch - Tutorial Tensors Tensors are similar to NumPy's ndarrays, with the addition being that Tensors can also be used on a GPU to accelerate computing.

import torch
x = torch.rand(5, 3)
print(x)

Out: a 5x3 tensor of random values

Operations
y = torch.rand(5, 3)
print(x + y)

Examples taken from - PyTorch: 60 Minute Blitz tutorial Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018
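As a brief illustration of the GPU remark above (assuming a CUDA device is available at runtime):

import torch

x = torch.rand(5, 3)
y = torch.rand(5, 3)

if torch.cuda.is_available():   # move the tensors to the GPU if there is one
    x = x.cuda()
    y = y.cuda()

print(x + y)    # the addition runs on whichever device holds x and y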

89 PyTorch - Tensor If you set its attribute .requires_grad as True, it starts to track all operations on it. When you finish your computation call .backward() and have all the gradients computed automatically.

x = torch.ones(2, 2, requires_grad=True)
y = x + 2
# more operations
z = y * y * 3
out = z.mean()

Gradients
# Doing back propagation on the entire computation graph
out.backward()
# print d(out)/dx
print(x.grad)

Out: tensor([[4.5000, 4.5000], [4.5000, 4.5000]])

Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

90 Review: LeNet-5 [LeCun et al., 1998] Conv filters were 5x5, applied at stride 1 Subsampling (Pooling) layers were 2x2 applied at stride 2 i.e. architecture is [CONV-POOL-CONV-POOL-FC-FC] Lecture 9-9 May 1, 2018

91 Machine learning - Summary Model/Score function - F(X, W) Loss function L (F, data samples) Maps input data X to class scores More score for a class - more likely it belongs to that class Measures how good F is Good parameters are found by - minimizing loss function L(W) wrt the variables W The function is minimized by iteratively updating the weights in the negative gradient direction. Gradients are computed using backpropagation. Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

92 PyTorch - Defining the network/model

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

An nn.Module contains layers, and a method forward(input) that returns the output. The learnable parameters of a model are returned by net.parameters() Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

93 PyTorch - Defining the network (using the Net class defined above)

net = Net()
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)
# Out: a 1x10 tensor of scores

net.zero_grad()
out.backward(torch.randn(1, 10))

Recap: torch.Tensor - A multi-dimensional array with support for autograd operations like backward(). Also holds the gradient w.r.t. the tensor. nn.Module - Neural network module. Convenient way of encapsulating parameters, with helpers for moving them to GPU, exporting, loading, etc. nn.Parameter - A kind of Tensor, that is automatically registered as a parameter when assigned as an attribute to a Module. Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

94 PyTorch - Loss Function

output = net(input)
target = torch.arange(1, 11).float()   # a dummy target, for example
target = target.view(1, -1)            # make it the same shape as output
criterion = nn.MSELoss()
loss = criterion(output, target)
print(loss)
# Out: tensor(...) - the scalar MSE loss

Backprop
net.zero_grad()     # zeroes the gradient buffers of all parameters
print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)
loss.backward()
print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward: tensor([ 0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward: a tensor of six non-zero gradient values

Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

95 PyTorch - Optimize weight = weight - learning_rate * gradient

learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)

import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()        # Does the update

Train the network: step multiple times using all the data Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018

96 PyTorch - Complete training

Setup data

import torch.optim as optim
import torchvision.datasets as dset
import torchvision.transforms as transforms

trans = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (1.0,))])

# if not exist, download mnist dataset
train_set = dset.MNIST(root=root, train=True, transform=trans, download=True)
test_set = dset.MNIST(root=root, train=False, transform=trans, download=True)

batch_size = 100
train_loader = torch.utils.data.DataLoader(dataset=train_set, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_set, batch_size=batch_size, shuffle=False)

model = Net()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

Train!

for epoch in range(10):
    for batch_idx, (x, target) in enumerate(train_loader):
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, target)
        loss.backward()
        optimizer.step()

Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018
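Not shown on the slide, but a natural companion to the training loop above: a sketch of an evaluation pass over test_loader, reusing the model and loaders defined there (all names are those from the slide).

import torch

correct, total = 0, 0
model.eval()                   # switch to evaluation mode
with torch.no_grad():          # no gradients needed for evaluation
    for x, target in test_loader:
        out = model(x)
        pred = out.argmax(dim=1)                  # predicted class = class with max score
        correct += (pred == target).sum().item()
        total += target.size(0)
print('test accuracy:', correct / total)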

97 Further reading CS231n: Introduction to CNN and PyTorch - Kripasindhu Sarkar - May 2018
