Statistical Machine Learning


Lecture 9 - Numerical optimization and deep learning
Niklas Wahlström
Division of Systems and Control, Department of Information Technology, Uppsala University
niklas.wahlstrom@it.uu.se

Summary of Lecture 8 (I/III) - Neural network

A neural network is a sequential construction of several linear regression models. The inputs X_1, ..., X_p are mapped to hidden units H_1, ..., H_M, which in turn are mapped to the output Z:

H_m = \sigma\big(\beta_{0m}^{(1)} + \sum_{j=1}^{p} \beta_{jm}^{(1)} X_j\big), \quad m = 1, \dots, M,
Z = \beta_0^{(2)} + \sum_{m=1}^{M} \beta_m^{(2)} H_m.

Collecting the hidden units in H = [H_1, ..., H_M]^T, the offsets in the offset vectors b^{(1)} = [\beta_{01}^{(1)}, ..., \beta_{0M}^{(1)}] and b^{(2)} = [\beta_0^{(2)}], and the remaining parameters in the weight matrices W^{(1)} (with entries \beta_{jm}^{(1)}, j = 1, ..., p, m = 1, ..., M) and W^{(2)} = [\beta_1^{(2)}, ..., \beta_M^{(2)}]^T, the model can be written compactly as

H = \sigma(W^{(1)T} X + b^{(1)T}), \quad Z = W^{(2)T} H + b^{(2)T},

where the non-linearity \sigma acts element-wise. With two layers of hidden units the construction is repeated:

H^{(1)} = \sigma(W^{(1)T} X + b^{(1)T}), \quad H^{(2)} = \sigma(W^{(2)T} H^{(1)} + b^{(2)T}), \quad Z = W^{(3)T} H^{(2)} + b^{(3)T}.

The model learns better using a deep network (several layers) instead of a wide and shallow network.
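
To make the vectorized layer equations concrete, here is a minimal numerical sketch of the forward pass, assuming a tanh non-linearity and randomly drawn parameters (both are illustrative choices, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

p, M = 4, 3                      # number of inputs and hidden units (illustrative)
X = rng.normal(size=(p, 1))      # one input vector

# Layer parameters: W1 is p x M, b1 is 1 x M, W2 is M x 1, b2 is 1 x 1
W1 = rng.normal(size=(p, M))
b1 = rng.normal(size=(1, M))
W2 = rng.normal(size=(M, 1))
b2 = rng.normal(size=(1, 1))

sigma = np.tanh                  # element-wise non-linearity (assumed choice)

# H = sigma(W^(1)T X + b^(1)T),  Z = W^(2)T H + b^(2)T
H = sigma(W1.T @ X + b1.T)
Z = W2.T @ H + b2.T
print(H.shape, Z.shape)          # (3, 1) (1, 1)
```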

Summary of Lecture 8 (II/III) - Classification

For K > 2 classes we want to predict the class probability for all K classes, q_k(X; \theta) = \Pr(Y = k \mid X). We extend the logistic function to the softmax activation function

q_k(X; \theta) = g_k(Z) = \frac{e^{Z_k}}{\sum_{l=1}^{K} e^{Z_l}}, \quad k = 1, \dots, K.

The inputs X_1, ..., X_p are mapped via the hidden units to the logits Z_1, ..., Z_K, and the softmax maps the logits to class probabilities,

\mathrm{softmax}(Z) = [g_1(Z), \dots, g_K(Z)]^T.

Summary of Lecture 8 (III/III) - Example K = 3 classes

Consider an example with three classes, K = 3, where the true output is class 2, i.e. Y = [Y_1, Y_2, Y_3] = [0, 1, 0] in one-hot encoding. The network is trained by minimizing the cross-entropy

L(X, Y, \theta) = -\sum_{k=1}^{K} Y_k \log(q_k(X; \theta)).

Three examples of logits and the resulting class probabilities and losses:

- Z = [-0.5, 0.4, -2.5] gives q(X; \theta) = [0.28, 0.68, 0.04] and L = -\log 0.68 = 0.39.
- Z = [-1.1, -4.0, -1.2] gives q(X; \theta) = [0.51, 0.03, 0.46] and L = -\log 0.03 = 3.51.
- Z = [0.9, 4.2, 0.2] gives q(X; \theta) = [0.03, 0.95, 0.02] and L = -\log 0.95 = 0.05.

A poor prediction of the true class (second example) thus gives a large loss, while a confident correct prediction (third example) gives a small loss.
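
The three examples above can be checked with a short sketch; the helper functions below are illustrative (not code from the lecture) and simply evaluate the softmax and cross-entropy formulas:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_onehot, q):
    # L = -sum_k Y_k log q_k
    return -np.sum(y_onehot * np.log(q))

y = np.array([0.0, 1.0, 0.0])          # true class 2, one-hot encoded

for z in ([-0.5, 0.4, -2.5], [-1.1, -4.0, -1.2], [0.9, 4.2, 0.2]):
    q = softmax(np.array(z))
    print(np.round(q, 2), round(cross_entropy(y, q), 2))
# The printed values agree with the numbers on the slide up to rounding.
```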

Contents Lecture 9

1. Convolutional neural networks (cont.)
2. Numerical optimization
3. Stochastic gradient descent

Practicals

- Lab: either February 28, 8:15-12:00, or March 1, 13:15-17:00. Room: 1515:ITC. Sign up! Pick up a lab-PM and bring it to the lab session.
- Next lecture is next Tuesday. The last 45 min will be a Q&A session. Send any questions that you have to fredrik.lindsten@liu.se.
- Peer-review: deadline March 1, 23:59 (Thursday).
- Guest lecture: March 15, by Daniel Langkilde from Annotell. Room: 2446:ITC.
- Reading instructions slightly updated, see the homepage.

Convolutional neural networks

Convolutional neural networks (CNNs) are a special kind of neural network tailored for problems where the input data has a grid-like structure. Examples:

- Digital images (2D grid of pixels)
- Audio waveform data (1D grid, time series)
- Volumetric data, e.g. CT scans (3D grid)

The description here will focus on images.

Data representation of images

Consider a grayscale image of 6 x 6 pixels. Each pixel value represents the color, ranging from 0 (total absence, black) to 1 (total presence, white). The pixels are the input variables X_{1,1}, X_{1,2}, ..., X_{6,6}, arranged in the same 6 x 6 grid as the image.
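
As a concrete illustration of this data representation (with made-up pixel values), a 6 x 6 grayscale image is simply a 6 x 6 array of numbers between 0 and 1:

```python
import numpy as np

# A made-up 6 x 6 grayscale image; entry [i-1, j-1] corresponds to the pixel X_{i,j}.
X = np.zeros((6, 6))
X[2:4, 2:4] = 1.0        # a white 2 x 2 square in the middle of a black image

print(X.shape)           # (6, 6)
print(X[0, 0], X[2, 2])  # X_{1,1} = 0.0 (black), X_{3,3} = 1.0 (white)
```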

The convolutional layer

Consider a hidden layer with 6 x 6 hidden units.

Dense layer: each hidden unit is connected with all pixels, and each pixel-hidden-unit pair has its own unique parameter.

Convolutional layer: each hidden unit is connected with a region of pixels via a set of parameters, a so-called kernel (here a 3 x 3 kernel with parameters \beta_{1,1}^{(1)}, \dots, \beta_{3,3}^{(1)} and an offset \beta_0^{(1)}). Different hidden units have the same set of parameters: the kernel is slid across the image, and each position produces one hidden unit. For hidden units near the border, where the kernel partly extends outside the image, the missing pixel values are set to zero.

A convolutional layer thus uses sparse interactions and parameter sharing.

Condensing information with strides

Problem: as we proceed through the network we want to condense the information.
Solution: apply the kernel to every second pixel, i.e. use a stride of 2 (instead of 1). With stride 2 we get half the number of rows and columns in the hidden layer (here 3 x 3 hidden units instead of 6 x 6).
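
A minimal numpy sketch of a single-kernel convolutional layer, with the stride as a parameter, is given below; it is illustrative code (not the lecture's implementation), assuming zero padding around the border and using tanh as a placeholder for the non-linearity \sigma:

```python
import numpy as np

def conv_layer(X, kernel, beta0, stride=1, sigma=np.tanh):
    # One convolutional channel: a shared kernel, an offset beta0, zero padding.
    k = kernel.shape[0]                   # kernel size (assumed square and odd)
    pad = k // 2
    Xp = np.pad(X, pad)                   # zero padding around the image border
    rows = range(0, X.shape[0], stride)
    cols = range(0, X.shape[1], stride)
    H = np.empty((len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            region = Xp[i:i + k, j:j + k]                 # the k x k region of pixels
            H[a, b] = sigma(beta0 + np.sum(kernel * region))
    return H

rng = np.random.default_rng(0)
X = rng.uniform(size=(6, 6))              # the 6 x 6 grayscale image
kernel = rng.normal(size=(3, 3))          # shared kernel parameters beta_{1,1}..beta_{3,3}

print(conv_layer(X, kernel, beta0=0.1, stride=1).shape)   # (6, 6): one hidden unit per pixel
print(conv_layer(X, kernel, beta0=0.1, stride=2).shape)   # (3, 3): half the rows and columns
```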

Multiple channels

One kernel per layer does not give enough flexibility. We therefore use multiple kernels; each kernel c has its own parameters \beta_{1,1,c}^{(1)}, \dots, \beta_{3,3,c}^{(1)} and offset \beta_{0,c}^{(1)} (here c = 1, ..., 4), and each kernel produces its own set of hidden units, a channel. The hidden layers are organized in tensors of size (rows x columns x channels): the 6 x 6 input here gives a hidden layer of dimension 6 x 6 x 4.

What is a tensor?

A tensor is a generalization of scalars, vectors and matrices to arbitrary order: a scalar is a tensor of order 0, a vector is a tensor of order 1, a matrix is a tensor of order 2, and a general tensor can have any order (for example order 3, which can be visualized as a stack of matrices T_{:,:,1}, T_{:,:,2}, ...).
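
In numpy, the order of a tensor is simply the number of axes of an array; a small illustrative example (the values are made up):

```python
import numpy as np

a = np.array(3.0)                       # scalar: order 0, shape ()
b = np.array([1.0, 2.0, 3.0])           # vector: order 1, shape (3,)
W = np.array([[1.0, 2.0], [3.0, 4.0]])  # matrix: order 2, shape (2, 2)
T = np.zeros((6, 6, 4))                 # order-3 tensor, e.g. rows x columns x channels
print(a.ndim, b.ndim, W.ndim, T.ndim)   # 0 1 2 3
```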

Multiple channels (cont.)

A kernel operates on all channels in a hidden layer. Each kernel therefore has size (kernel rows x kernel columns x input channels), here 3 x 3 x 4. We stack the parameters of all kernels in a weight tensor of size (kernel rows x kernel columns x input channels x output channels), here 3 x 3 x 4 x 6, together with an offset vector, here b^{(2)} \in R^6. In the example we also use stride 2, so the convolutional layer with W^{(2)} \in R^{3 x 3 x 4 x 6} and b^{(2)} \in R^6 maps the first hidden layer (of size 6 x 6 x 4) to a second hidden layer of size 3 x 3 x 6.
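
The shapes involved can be checked with a small illustrative sketch of a multi-channel convolutional layer (again not the lecture's code; zero padding and tanh are assumed choices):

```python
import numpy as np

def conv_layer_multi(H_in, W, b, stride=2, sigma=np.tanh):
    # H_in: (rows, cols, C_in). W: (k, k, C_in, C_out). b: (C_out,). Zero padding.
    k, _, C_in, C_out = W.shape
    pad = k // 2
    Hp = np.pad(H_in, ((pad, pad), (pad, pad), (0, 0)))   # pad rows and columns only
    rows = range(0, H_in.shape[0], stride)
    cols = range(0, H_in.shape[1], stride)
    H_out = np.empty((len(rows), len(cols), C_out))
    for a, i in enumerate(rows):
        for c, j in enumerate(cols):
            region = Hp[i:i + k, j:j + k, :]              # (k, k, C_in) region
            for m in range(C_out):
                H_out[a, c, m] = sigma(b[m] + np.sum(W[:, :, :, m] * region))
    return H_out

rng = np.random.default_rng(0)
H1 = rng.normal(size=(6, 6, 4))        # first hidden layer: 6 x 6 x 4
W2 = rng.normal(size=(3, 3, 4, 6))     # weight tensor: 3 x 3 x 4 x 6
b2 = rng.normal(size=6)                # offset vector in R^6
H2 = conv_layer_multi(H1, W2, b2, stride=2)
print(H2.shape)                        # (3, 3, 6)
```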

Full CNN architecture

A full CNN usually consists of multiple convolutional layers (here two) and a few final dense layers (here two). If we have a classification problem at hand, we end with a softmax activation function to produce class probabilities.

In the example, the 6 x 6 input image passes through a first convolutional layer (W^{(1)}, b^{(1)} \in R^4) giving hidden units of dimension 6 x 6 x 4, a second convolutional layer (W^{(2)}, b^{(2)} \in R^6) giving hidden units of dimension 3 x 3 x 6, a dense layer (W^{(3)}, b^{(3)} \in R^{30}) giving 30 hidden units, and a final dense softmax layer (W^{(4)}, b^{(4)} \in R^{10}) giving the logits Z_1, ..., Z_{10} and the outputs q_1(X; \theta), ..., q_{10}(X; \theta). Here we use 30 hidden units in the last hidden layer and consider a classification problem with K = 10 classes.
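
For reference, a sketch of this kind of architecture in Keras (assuming TensorFlow is installed; the ReLU activations and "same" padding are placeholder choices, not specified in the lecture):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(6, 6, 1)),                       # 6 x 6 grayscale image
    tf.keras.layers.Conv2D(4, 3, strides=1, padding="same",
                           activation="relu"),             # -> 6 x 6 x 4
    tf.keras.layers.Conv2D(6, 3, strides=2, padding="same",
                           activation="relu"),             # -> 3 x 3 x 6
    tf.keras.layers.Flatten(),                             # -> 54 units
    tf.keras.layers.Dense(30, activation="relu"),          # 30 hidden units
    tf.keras.layers.Dense(10, activation="softmax"),       # K = 10 class probabilities
])
model.summary()
```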

Why deep?

Example: image classification. Input: the pixels of an image. Output: the object identity. Each hidden layer extracts increasingly abstract features.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. Computer Vision - ECCV (2014).

Skin cancer - background

One result on the use of deep learning in medicine: detecting skin cancer (February 2017).

Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M. and Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, February 2017.

Some background figures (from the US) on skin cancer:
- Melanomas represent less than 5% of all skin cancers, but account for 75% of all skin-cancer-related deaths. Early detection is absolutely critical.
- Estimated 5-year survival rate for melanoma: over 99% if detected in its earlier stages, and 14% if detected in its later stages.

Skin cancer - task

(Figure; image copyright Nature, doi: /nature21056.)

Skin cancer - taxonomy used

(Figure; image copyright Nature, doi: /nature21056.)

Skin cancer - solution (ultra-brief)

In the paper they used the following network architecture and training procedure:
- Initialize all parameters from a neural network trained on 1.28 million images (transfer learning).
- From this initialization, learn new model parameters using clinical images (about 100 times more images than any previous study).
- Use the model to predict the class based on unseen data.

Skin cancer - indication of the results

sensitivity = \frac{\text{true positive}}{\text{positive}}, \quad specificity = \frac{\text{true negative}}{\text{negative}}

(Results figure; image copyright Nature, doi: /nature21056.)

Numerical optimization

We train a network by considering the optimization problem

\hat{\theta} = \arg\min_{\theta} J(\theta), \quad J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, \theta).

The best possible solution \hat{\theta} is the global minimizer of the cost function J(\theta). The global minimizer is typically very hard to find, however, and we have to settle for a local minimizer.

Iterative solution - example in 1D

In our search for a local minimizer of J(\theta) we make an initial guess \theta_0 and then update \theta iteratively, \theta_1, \theta_2, \dots, moving towards lower values of the cost function.

Iterative solution (gradient descent) - example in 2D

Consider a cost function J(\theta) of two parameters, \theta = [\beta_0, \beta_1]^T \in R^2. Gradient descent proceeds as follows:

1. Pick an initial \theta_0.
2. While not converged:
   - update \theta_{t+1} = \theta_t - \gamma g_t, where g_t = \nabla_\theta J(\theta_t),
   - update t \leftarrow t + 1.

We call \gamma \in R the step length or learning rate. How large a learning rate \gamma should we use?
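
A minimal sketch of this loop on a made-up quadratic cost function (the function, starting point and learning rate are illustrative choices):

```python
import numpy as np

def J(theta):
    # An illustrative cost function of theta = [beta0, beta1]
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

def grad_J(theta):
    # Gradient of the cost function above
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)])

theta = np.array([3.0, 2.0])     # initial guess theta_0
gamma = 0.1                      # learning rate / step length

for t in range(100):
    g = grad_J(theta)            # g_t = gradient of J at theta_t
    if np.linalg.norm(g) < 1e-6: # simple convergence check
        break
    theta = theta - gamma * g    # theta_{t+1} = theta_t - gamma * g_t

print(theta)                     # close to the minimizer [1.0, -0.5]
```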

Learning rate

If the learning rate \gamma is too low, the steps are small and progress towards the minimum is very slow. If \gamma is too high, the iterates overshoot the minimum and may oscillate or diverge. With a well-chosen \gamma, the iterates move steadily towards a minimum.

A good strategy to find a good learning rate is:
- if the error keeps getting worse or oscillates widely, reduce the learning rate;
- if the error is decreasing fairly consistently but slowly, increase the learning rate.
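
This behaviour can be reproduced on a simple one-dimensional quadratic cost, J(\theta) = \theta^2; the three values of \gamma below are illustrative choices:

```python
def run_gd(gamma, theta0=2.0, steps=20):
    # Gradient descent on J(theta) = theta^2, whose gradient is 2 * theta.
    theta = theta0
    for _ in range(steps):
        theta = theta - gamma * 2.0 * theta
    return theta

for gamma in (0.01, 1.2, 0.3):   # too low, too high, ok
    print(gamma, run_gd(gamma))
# gamma = 0.01: still far from 0 (slow progress)
# gamma = 1.2:  the iterates diverge (overshooting)
# gamma = 0.3:  close to the minimizer theta = 0
```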

Computational challenge 1 - dim(\theta) is big

At each optimization step we need to compute the gradient

g_t = \nabla_\theta J(\theta_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L(x_i, y_i, \theta_t).

Challenge: a neural network contains a lot of parameters, so computing the gradient is costly.
Solution: a NN is a composition of multiple layers. Hence, each term \nabla_\theta L(x_i, y_i, \theta) can be computed efficiently by repeatedly applying the chain rule. This is called the back-propagation algorithm (not part of the course).

Computational challenge 2 - n is big

At each optimization step we need to compute the gradient

g_t = \nabla_\theta J(\theta_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L(x_i, y_i, \theta_t).

Challenge: we typically use a lot of training data n for training the neural network, so computing the gradient is costly.
Solution: at each iteration, we only use a small part of the data set to compute the gradient g_t. This is called stochastic gradient descent.

Stochastic gradient descent

A big data set is often redundant: many data points are similar. Consider a training data set (x_1, y_1), ..., (x_{20}, y_{20}). If the training data is big, the gradient computed on either half of the data is a good approximation of the full gradient,

\nabla_\theta J(\theta) \approx \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_\theta L(x_i, y_i, \theta) \quad \text{and} \quad \nabla_\theta J(\theta) \approx \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_\theta L(x_i, y_i, \theta).

We can then do two updates with only half the computational cost each:

\theta_{t+1} = \theta_t - \gamma \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_\theta L(x_i, y_i, \theta_t),
\theta_{t+2} = \theta_{t+1} - \gamma \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_\theta L(x_i, y_i, \theta_{t+1}).

The extreme version of this strategy is to use only one data point at each training step (called online learning):

\theta_1 = \theta_0 - \gamma \nabla_\theta L(x_1, y_1, \theta_0), \quad \theta_2 = \theta_1 - \gamma \nabla_\theta L(x_2, y_2, \theta_1), \quad \theta_3 = \theta_2 - \gamma \nabla_\theta L(x_3, y_3, \theta_2), \dots

We typically do something in between (not one data point, and not all data): we use a smaller set called a mini-batch. With a mini-batch size of 5, the updates become

\theta_1 = \theta_0 - \gamma \frac{1}{5} \sum_{i=1}^{5} \nabla_\theta L(x_i, y_i, \theta_0), \quad
\theta_2 = \theta_1 - \gamma \frac{1}{5} \sum_{i=6}^{10} \nabla_\theta L(x_i, y_i, \theta_1),
\theta_3 = \theta_2 - \gamma \frac{1}{5} \sum_{i=11}^{15} \nabla_\theta L(x_i, y_i, \theta_2), \quad
\theta_4 = \theta_3 - \gamma \frac{1}{5} \sum_{i=16}^{20} \nabla_\theta L(x_i, y_i, \theta_3).

One pass through the training data is called an epoch.

If we pick the mini-batches in order, they might be unbalanced and not representative of the whole data set. Therefore, we pick data points at random from the training data to form a mini-batch. One implementation is to randomly reshuffle the data before dividing it into mini-batches. After each epoch we do another reshuffling and another pass through the data set.

Mini-batch gradient descent

The full stochastic gradient descent algorithm (a.k.a. mini-batch gradient descent) is as follows:

1. Initialize \theta_0, set t \leftarrow 1, choose a batch size n_b and a number of epochs E.
2. For i = 1 to E:
   (a) Randomly shuffle the training data \{(x_i, y_i)\}_{i=1}^{n}.
   (b) For j = 1 to n/n_b:
       (i) Approximate the gradient of the loss function using the mini-batch \{(x_i, y_i)\}_{i=(j-1)n_b+1}^{j n_b},
           \hat{g}_t = \frac{1}{n_b} \sum_{i=(j-1)n_b+1}^{j n_b} \nabla_\theta L(x_i, y_i, \theta) \big|_{\theta=\theta_t}.
       (ii) Do a gradient step \theta_{t+1} = \theta_t - \gamma \hat{g}_t.
       (iii) Update the iteration index t \leftarrow t + 1.

At each iteration we get a stochastic approximation of the true gradient, \hat{g}_t \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L(x_i, y_i, \theta) \big|_{\theta=\theta_t}, hence the name.
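
A minimal sketch of this algorithm on a linear regression problem with squared-error loss (the model, data and hyperparameters are illustrative choices, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = x^T theta_true + noise
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def grad_L(x_i, y_i, theta):
    # Gradient of the squared-error loss L = (y_i - x_i^T theta)^2 for one data point
    return -2.0 * (y_i - x_i @ theta) * x_i

theta = np.zeros(p)          # initialize theta_0
gamma, n_b, E = 0.01, 5, 20  # learning rate, batch size, number of epochs

for epoch in range(E):
    perm = rng.permutation(n)            # (a) randomly shuffle the training data
    for j in range(n // n_b):            # (b) loop over mini-batches
        batch = perm[j * n_b:(j + 1) * n_b]
        g_hat = np.mean([grad_L(X[i], y[i], theta) for i in batch], axis=0)
        theta = theta - gamma * g_hat    # gradient step with the mini-batch gradient

print(theta)                             # close to theta_true = [1.0, -2.0, 0.5]
```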

Some comments - Why now?

Neural networks have been around for more than fifty years. Why have they become so popular now (again)? To solve really interesting problems you need:
1. Efficient learning algorithms
2. Efficient computational hardware
3. A lot of labeled data!
These three factors have not been fulfilled to a satisfactory level until the last 5-10 years.

Some pointers

A book has recently been published: I. Goodfellow, Y. Bengio and A. Courville, Deep Learning.

A well written and timely introduction: LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553).

You will also find more material than you can possibly want online.

A few concepts to summarize lecture 9

Convolutional neural network (CNN): a NN with a particular structure tailored for input data with a grid-like structure, for example images.
Kernel (a.k.a. filter): a set of parameters that is convolved with a hidden layer. Each kernel produces a new channel.
Channel: a set of hidden units produced by the same kernel. Each hidden layer consists of one or more channels.
Stride: a positive integer deciding how many steps to move the kernel during the convolution.
Tensor: a generalization of matrices to arbitrary order.
Gradient descent: an iterative optimization algorithm where we at each iteration take a step proportional to the negative gradient.
Learning rate (a.k.a. step length): a scalar tuning parameter deciding the length of each gradient step in gradient descent.
Stochastic gradient descent (SGD): a version of gradient descent where we at each iteration only use a small part of the training data (a mini-batch).
Mini-batch: the group of training data points that we use at each iteration in SGD.
Batch size: the number of data points in one mini-batch.


More information

Welcome to the Machine Learning Practical Deep Neural Networks. MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1

Welcome to the Machine Learning Practical Deep Neural Networks. MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1 Welcome to the Machine Learning Practical Deep Neural Networks MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1 Introduction to MLP; Single Layer Networks (1) Steve Renals Machine Learning

More information

Lecture 3 Classification, Logistic Regression

Lecture 3 Classification, Logistic Regression Lecture 3 Classification, Logistic Regression Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se F. Lindsten Summary

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University.

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University. Nonlinear Models Numerical Methods for Deep Learning Lars Ruthotto Departments of Mathematics and Computer Science, Emory University Intro 1 Course Overview Intro 2 Course Overview Lecture 1: Linear Models

More information

Lecture 5 Neural models for NLP

Lecture 5 Neural models for NLP CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

Solutions. Part I Logistic regression backpropagation with a single training example

Solutions. Part I Logistic regression backpropagation with a single training example Solutions Part I Logistic regression backpropagation with a single training example In this part, you are using the Stochastic Gradient Optimizer to train your Logistic Regression. Consequently, the gradients

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Neural networks COMS 4771

Neural networks COMS 4771 Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a

More information

Deep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 8: Autoencoder & DBM Princeton University COS 495 Instructor: Yingyu Liang Autoencoder Autoencoder Neural networks trained to attempt to copy its input to its output Contain

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

CSC321 Lecture 4: Learning a Classifier

CSC321 Lecture 4: Learning a Classifier CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 31 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron

More information

Rapid Introduction to Machine Learning/ Deep Learning

Rapid Introduction to Machine Learning/ Deep Learning Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/62 Lecture 1b Logistic regression & neural network October 2, 2015 2/62 Table of contents 1 1 Bird s-eye

More information

RegML 2018 Class 8 Deep learning

RegML 2018 Class 8 Deep learning RegML 2018 Class 8 Deep learning Lorenzo Rosasco UNIGE-MIT-IIT June 18, 2018 Supervised vs unsupervised learning? So far we have been thinking of learning schemes made in two steps f(x) = w, Φ(x) F, x

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

Machine Learning Basics

Machine Learning Basics Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Spatial Transformer Networks

Spatial Transformer Networks BIL722 - Deep Learning for Computer Vision Spatial Transformer Networks Max Jaderberg Andrew Zisserman Karen Simonyan Koray Kavukcuoglu Contents Introduction to Spatial Transformers Related Works Spatial

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

More information

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,

More information

GRADIENT DESCENT. CSE 559A: Computer Vision GRADIENT DESCENT GRADIENT DESCENT [0, 1] Pr(y = 1) w T x. 1 f (x; θ) = 1 f (x; θ) = exp( w T x)

GRADIENT DESCENT. CSE 559A: Computer Vision GRADIENT DESCENT GRADIENT DESCENT [0, 1] Pr(y = 1) w T x. 1 f (x; θ) = 1 f (x; θ) = exp( w T x) 0 x x x CSE 559A: Computer Vision For Binary Classification: [0, ] f (x; ) = σ( x) = exp( x) + exp( x) Output is interpreted as probability Pr(y = ) x are the log-odds. Fall 207: -R: :30-pm @ Lopata 0

More information

Machine Learning Basics III

Machine Learning Basics III Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 Multi-layer networks Steve Renals Machine Learning Practical MLP Lecture 3 7 October 2015 MLP Lecture 3 Multi-layer networks 2 What Do Single

More information

Feedforward Neural Networks. Michael Collins, Columbia University

Feedforward Neural Networks. Michael Collins, Columbia University Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation

More information

Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm

Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm Name: Student number: This is a closed-book test. It is marked out of 15 marks. Please answer ALL of the

More information

Deep Learning & Artificial Intelligence WS 2018/2019

Deep Learning & Artificial Intelligence WS 2018/2019 Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Lecture 2: Learning with neural networks

Lecture 2: Learning with neural networks Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction

More information

Lecture 12. Talk Announcement. Neural Networks. This Lecture: Advanced Machine Learning. Recap: Generalized Linear Discriminants

Lecture 12. Talk Announcement. Neural Networks. This Lecture: Advanced Machine Learning. Recap: Generalized Linear Discriminants Advanced Machine Learning Lecture 2 Neural Networks 24..206 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Talk Announcement Yann LeCun (NYU & FaceBook AI) 28..

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Neural Networks and Deep Learning.

Neural Networks and Deep Learning. Neural Networks and Deep Learning www.cs.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts perceptrons the perceptron training rule linear separability hidden

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

OPTIMIZATION METHODS IN DEEP LEARNING

OPTIMIZATION METHODS IN DEEP LEARNING Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

Convolutional Neural Network Architecture

Convolutional Neural Network Architecture Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation

More information

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 CS 1674: Intro to Computer Vision Final Review Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 Final info Format: multiple-choice, true/false, fill in the blank, short answers, apply an

More information

Introduction to Neural Networks

Introduction to Neural Networks CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued) Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 12 Neural Networks 10.12.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

CSC321 Lecture 8: Optimization

CSC321 Lecture 8: Optimization CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set

More information

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 12 Neural Networks 24.11.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Talk Announcement Yann LeCun (NYU & FaceBook AI)

More information