Statistical Machine Learning


Lecture 9 - Numerical optimization and deep learning
Niklas Wahlström
Division of Systems and Control, Department of Information Technology, Uppsala University
niklas.wahlstrom@it.uu.se

Summary of Lecture 8 (I/III) - Neural network

A neural network is a sequential construction of several linear regression models. The inputs X_1, ..., X_p are mapped to hidden units H_1, ..., H_M, which in turn are mapped to the output Z:

H_m = \sigma\big(\beta_{0m}^{(1)} + \sum_{j=1}^{p} \beta_{jm}^{(1)} X_j\big), \quad m = 1, \dots, M,
Z = \beta_0^{(2)} + \sum_{m=1}^{M} \beta_m^{(2)} H_m.

Collecting the hidden units in H = [H_1, ..., H_M]^T, the offsets in the offset vectors b^{(1)} = [\beta_{01}^{(1)}, ..., \beta_{0M}^{(1)}] and b^{(2)} = [\beta_0^{(2)}], and the remaining parameters in the weight matrices W^{(1)} (with entries \beta_{jm}^{(1)}, j = 1, ..., p, m = 1, ..., M) and W^{(2)} = [\beta_1^{(2)}, ..., \beta_M^{(2)}]^T, the model can be written compactly as

H = \sigma(W^{(1)T} X + b^{(1)T}), \quad Z = W^{(2)T} H + b^{(2)T},

where the non-linearity \sigma acts element-wise. With two layers of hidden units the construction is repeated:

H^{(1)} = \sigma(W^{(1)T} X + b^{(1)T}), \quad H^{(2)} = \sigma(W^{(2)T} H^{(1)} + b^{(2)T}), \quad Z = W^{(3)T} H^{(2)} + b^{(3)T}.

The model learns better using a deep network (several layers) instead of a wide and shallow network.
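
To make the vectorized layer equations concrete, here is a minimal numerical sketch of the forward pass, assuming a tanh non-linearity and randomly drawn parameters (both are illustrative choices, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

p, M = 4, 3                      # number of inputs and hidden units (illustrative)
X = rng.normal(size=(p, 1))      # one input vector

# Layer parameters: W1 is p x M, b1 is 1 x M, W2 is M x 1, b2 is 1 x 1
W1 = rng.normal(size=(p, M))
b1 = rng.normal(size=(1, M))
W2 = rng.normal(size=(M, 1))
b2 = rng.normal(size=(1, 1))

sigma = np.tanh                  # element-wise non-linearity (assumed choice)

# H = sigma(W^(1)T X + b^(1)T),  Z = W^(2)T H + b^(2)T
H = sigma(W1.T @ X + b1.T)
Z = W2.T @ H + b2.T
print(H.shape, Z.shape)          # (3, 1) (1, 1)
```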

Summary of Lecture 8 (II/III) - Classification

For K > 2 classes we want to predict the class probability for all K classes, q_k(X; \theta) = \Pr(Y = k \mid X). We extend the logistic function to the softmax activation function

q_k(X; \theta) = g_k(Z) = \frac{e^{Z_k}}{\sum_{l=1}^{K} e^{Z_l}}, \quad k = 1, \dots, K.

The inputs X_1, ..., X_p are mapped via the hidden units to the logits Z_1, ..., Z_K, and the softmax maps the logits to class probabilities,

\mathrm{softmax}(Z) = [g_1(Z), \dots, g_K(Z)]^T.

Summary of Lecture 8 (III/III) - Example K = 3 classes

Consider an example with three classes, K = 3, where the true output is class 2, i.e. Y = [Y_1, Y_2, Y_3] = [0, 1, 0] in one-hot encoding. The network is trained by minimizing the cross-entropy

L(X, Y, \theta) = -\sum_{k=1}^{K} Y_k \log(q_k(X; \theta)).

Three examples of logits and the resulting class probabilities and losses:

- Z = [-0.5, 0.4, -2.5] gives q(X; \theta) = [0.28, 0.68, 0.04] and L = -\log 0.68 = 0.39.
- Z = [-1.1, -4.0, -1.2] gives q(X; \theta) = [0.51, 0.03, 0.46] and L = -\log 0.03 = 3.51.
- Z = [0.9, 4.2, 0.2] gives q(X; \theta) = [0.03, 0.95, 0.02] and L = -\log 0.95 = 0.05.

A poor prediction of the true class (second example) thus gives a large loss, while a confident correct prediction (third example) gives a small loss.
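
The three examples above can be checked with a short sketch; the helper functions below are illustrative (not code from the lecture) and simply evaluate the softmax and cross-entropy formulas:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_onehot, q):
    # L = -sum_k Y_k log q_k
    return -np.sum(y_onehot * np.log(q))

y = np.array([0.0, 1.0, 0.0])          # true class 2, one-hot encoded

for z in ([-0.5, 0.4, -2.5], [-1.1, -4.0, -1.2], [0.9, 4.2, 0.2]):
    q = softmax(np.array(z))
    print(np.round(q, 2), round(cross_entropy(y, q), 2))
# The printed values agree with the numbers on the slide up to rounding.
```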

Contents Lecture 9

1. Convolutional neural networks (cont.)
2. Numerical optimization
3. Stochastic gradient descent

Practicals

- Lab: either February 28, 8:15-12:00, or March 1, 13:15-17:00. Room: 1515:ITC. Sign up! Pick up a lab-PM and bring it to the lab session.
- Next lecture is next Tuesday. The last 45 min will be a Q&A session. Send any questions that you have to fredrik.lindsten@liu.se.
- Peer-review: deadline March 1, 23:59 (Thursday).
- Guest lecture: March 15, by Daniel Langkilde from Annotell. Room: 2446:ITC.
- Reading instructions slightly updated, see the homepage.

Convolutional neural networks

Convolutional neural networks (CNNs) are a special kind of neural network tailored for problems where the input data has a grid-like structure. Examples:

- Digital images (2D grid of pixels)
- Audio waveform data (1D grid, time series)
- Volumetric data, e.g. CT scans (3D grid)

The description here will focus on images.

Data representation of images

Consider a grayscale image of 6 x 6 pixels. Each pixel value represents the color, ranging from 0 (total absence, black) to 1 (total presence, white). The pixels are the input variables X_{1,1}, X_{1,2}, ..., X_{6,6}, arranged in the same 6 x 6 grid as the image.
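
As a concrete illustration of this data representation (with made-up pixel values), a 6 x 6 grayscale image is simply a 6 x 6 array of numbers between 0 and 1:

```python
import numpy as np

# A made-up 6 x 6 grayscale image; entry [i-1, j-1] corresponds to the pixel X_{i,j}.
X = np.zeros((6, 6))
X[2:4, 2:4] = 1.0        # a white 2 x 2 square in the middle of a black image

print(X.shape)           # (6, 6)
print(X[0, 0], X[2, 2])  # X_{1,1} = 0.0 (black), X_{3,3} = 1.0 (white)
```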

The convolutional layer

Consider a hidden layer with 6 x 6 hidden units.

Dense layer: each hidden unit is connected with all pixels, and each pixel-hidden-unit pair has its own unique parameter.

Convolutional layer: each hidden unit is connected with a region of pixels via a set of parameters, a so-called kernel (here a 3 x 3 kernel with parameters \beta_{1,1}^{(1)}, \dots, \beta_{3,3}^{(1)} and an offset \beta_0^{(1)}). Different hidden units have the same set of parameters: the kernel is slid across the image, and each position produces one hidden unit. For hidden units near the border, where the kernel partly extends outside the image, the missing pixel values are set to zero.

A convolutional layer thus uses sparse interactions and parameter sharing.

Condensing information with strides

Problem: as we proceed through the network we want to condense the information.
Solution: apply the kernel to every second pixel, i.e. use a stride of 2 (instead of 1). With stride 2 we get half the number of rows and columns in the hidden layer (here 3 x 3 hidden units instead of 6 x 6).
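
A minimal numpy sketch of a single-kernel convolutional layer, with the stride as a parameter, is given below; it is illustrative code (not the lecture's implementation), assuming zero padding around the border and using tanh as a placeholder for the non-linearity \sigma:

```python
import numpy as np

def conv_layer(X, kernel, beta0, stride=1, sigma=np.tanh):
    # One convolutional channel: a shared kernel, an offset beta0, zero padding.
    k = kernel.shape[0]                   # kernel size (assumed square and odd)
    pad = k // 2
    Xp = np.pad(X, pad)                   # zero padding around the image border
    rows = range(0, X.shape[0], stride)
    cols = range(0, X.shape[1], stride)
    H = np.empty((len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            region = Xp[i:i + k, j:j + k]                 # the k x k region of pixels
            H[a, b] = sigma(beta0 + np.sum(kernel * region))
    return H

rng = np.random.default_rng(0)
X = rng.uniform(size=(6, 6))              # the 6 x 6 grayscale image
kernel = rng.normal(size=(3, 3))          # shared kernel parameters beta_{1,1}..beta_{3,3}

print(conv_layer(X, kernel, beta0=0.1, stride=1).shape)   # (6, 6): one hidden unit per pixel
print(conv_layer(X, kernel, beta0=0.1, stride=2).shape)   # (3, 3): half the rows and columns
```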

Multiple channels

One kernel per layer does not give enough flexibility. We therefore use multiple kernels; each kernel c has its own parameters \beta_{1,1,c}^{(1)}, \dots, \beta_{3,3,c}^{(1)} and offset \beta_{0,c}^{(1)} (here c = 1, ..., 4), and each kernel produces its own set of hidden units, a channel. The hidden layers are organized in tensors of size (rows x columns x channels): the 6 x 6 input here gives a hidden layer of dimension 6 x 6 x 4.

What is a tensor?

A tensor is a generalization of scalars, vectors and matrices to arbitrary order: a scalar is a tensor of order 0, a vector is a tensor of order 1, a matrix is a tensor of order 2, and a general tensor can have any order (for example order 3, which can be visualized as a stack of matrices T_{:,:,1}, T_{:,:,2}, ...).
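
In numpy, the order of a tensor is simply the number of axes of an array; a small illustrative example (the values are made up):

```python
import numpy as np

a = np.array(3.0)                       # scalar: order 0, shape ()
b = np.array([1.0, 2.0, 3.0])           # vector: order 1, shape (3,)
W = np.array([[1.0, 2.0], [3.0, 4.0]])  # matrix: order 2, shape (2, 2)
T = np.zeros((6, 6, 4))                 # order-3 tensor, e.g. rows x columns x channels
print(a.ndim, b.ndim, W.ndim, T.ndim)   # 0 1 2 3
```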

Multiple channels (cont.)

A kernel operates on all channels in a hidden layer. Each kernel therefore has size (kernel rows x kernel columns x input channels), here 3 x 3 x 4. We stack the parameters of all kernels in a weight tensor of size (kernel rows x kernel columns x input channels x output channels), here 3 x 3 x 4 x 6, together with an offset vector, here b^{(2)} \in R^6. In the example we also use stride 2, so the convolutional layer with W^{(2)} \in R^{3 x 3 x 4 x 6} and b^{(2)} \in R^6 maps the first hidden layer (of size 6 x 6 x 4) to a second hidden layer of size 3 x 3 x 6.
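
The shapes involved can be checked with a small illustrative sketch of a multi-channel convolutional layer (again not the lecture's code; zero padding and tanh are assumed choices):

```python
import numpy as np

def conv_layer_multi(H_in, W, b, stride=2, sigma=np.tanh):
    # H_in: (rows, cols, C_in). W: (k, k, C_in, C_out). b: (C_out,). Zero padding.
    k, _, C_in, C_out = W.shape
    pad = k // 2
    Hp = np.pad(H_in, ((pad, pad), (pad, pad), (0, 0)))   # pad rows and columns only
    rows = range(0, H_in.shape[0], stride)
    cols = range(0, H_in.shape[1], stride)
    H_out = np.empty((len(rows), len(cols), C_out))
    for a, i in enumerate(rows):
        for c, j in enumerate(cols):
            region = Hp[i:i + k, j:j + k, :]              # (k, k, C_in) region
            for m in range(C_out):
                H_out[a, c, m] = sigma(b[m] + np.sum(W[:, :, :, m] * region))
    return H_out

rng = np.random.default_rng(0)
H1 = rng.normal(size=(6, 6, 4))        # first hidden layer: 6 x 6 x 4
W2 = rng.normal(size=(3, 3, 4, 6))     # weight tensor: 3 x 3 x 4 x 6
b2 = rng.normal(size=6)                # offset vector in R^6
H2 = conv_layer_multi(H1, W2, b2, stride=2)
print(H2.shape)                        # (3, 3, 6)
```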

Full CNN architecture

A full CNN usually consists of multiple convolutional layers (here two) and a few final dense layers (here two). If we have a classification problem at hand, we end with a softmax activation function to produce class probabilities.

In the example, the 6 x 6 input image passes through a first convolutional layer (W^{(1)}, b^{(1)} \in R^4) giving hidden units of dimension 6 x 6 x 4, a second convolutional layer (W^{(2)}, b^{(2)} \in R^6) giving hidden units of dimension 3 x 3 x 6, a dense layer (W^{(3)}, b^{(3)} \in R^{30}) giving 30 hidden units, and a final dense softmax layer (W^{(4)}, b^{(4)} \in R^{10}) giving the logits Z_1, ..., Z_{10} and the outputs q_1(X; \theta), ..., q_{10}(X; \theta). Here we use 30 hidden units in the last hidden layer and consider a classification problem with K = 10 classes.
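
For reference, a sketch of this kind of architecture in Keras (assuming TensorFlow is installed; the ReLU activations and "same" padding are placeholder choices, not specified in the lecture):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(6, 6, 1)),                       # 6 x 6 grayscale image
    tf.keras.layers.Conv2D(4, 3, strides=1, padding="same",
                           activation="relu"),             # -> 6 x 6 x 4
    tf.keras.layers.Conv2D(6, 3, strides=2, padding="same",
                           activation="relu"),             # -> 3 x 3 x 6
    tf.keras.layers.Flatten(),                             # -> 54 units
    tf.keras.layers.Dense(30, activation="relu"),          # 30 hidden units
    tf.keras.layers.Dense(10, activation="softmax"),       # K = 10 class probabilities
])
model.summary()
```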

Why deep?

Example: image classification. Input: the pixels of an image. Output: the object identity. Each hidden layer extracts increasingly abstract features.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. Computer Vision - ECCV (2014).

Skin cancer - background

One result on the use of deep learning in medicine: detecting skin cancer (February 2017).

Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M. and Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, February 2017.

Some background figures (from the US) on skin cancer:
- Melanomas represent less than 5% of all skin cancers, but account for 75% of all skin-cancer-related deaths. Early detection is absolutely critical.
- Estimated 5-year survival rate for melanoma: over 99% if detected in its earlier stages, and 14% if detected in its later stages.

Skin cancer - task

(Figure; image copyright Nature, doi: /nature21056.)

Skin cancer - taxonomy used

(Figure; image copyright Nature, doi: /nature21056.)

Skin cancer - solution (ultra-brief)

In the paper they used the following network architecture and training procedure:
- Initialize all parameters from a neural network trained on 1.28 million images (transfer learning).
- From this initialization, learn new model parameters using clinical images (about 100 times more images than any previous study).
- Use the model to predict the class based on unseen data.

Skin cancer - indication of the results

sensitivity = \frac{\text{true positive}}{\text{positive}}, \quad specificity = \frac{\text{true negative}}{\text{negative}}

(Results figure; image copyright Nature, doi: /nature21056.)

Numerical optimization

We train a network by considering the optimization problem

\hat{\theta} = \arg\min_{\theta} J(\theta), \quad J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, \theta).

The best possible solution \hat{\theta} is the global minimizer of the cost function J(\theta). The global minimizer is typically very hard to find, however, and we have to settle for a local minimizer.

Iterative solution - example in 1D

In our search for a local minimizer of J(\theta) we make an initial guess \theta_0 and then update \theta iteratively, \theta_1, \theta_2, \dots, moving towards lower values of the cost function.

Iterative solution (gradient descent) - example in 2D

Consider a cost function J(\theta) of two parameters, \theta = [\beta_0, \beta_1]^T \in R^2. Gradient descent proceeds as follows:

1. Pick an initial \theta_0.
2. While not converged:
   - update \theta_{t+1} = \theta_t - \gamma g_t, where g_t = \nabla_\theta J(\theta_t),
   - update t \leftarrow t + 1.

We call \gamma \in R the step length or learning rate. How large a learning rate \gamma should we use?
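
A minimal sketch of this loop on a made-up quadratic cost function (the function, starting point and learning rate are illustrative choices):

```python
import numpy as np

def J(theta):
    # An illustrative cost function of theta = [beta0, beta1]
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

def grad_J(theta):
    # Gradient of the cost function above
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 0.5)])

theta = np.array([3.0, 2.0])     # initial guess theta_0
gamma = 0.1                      # learning rate / step length

for t in range(100):
    g = grad_J(theta)            # g_t = gradient of J at theta_t
    if np.linalg.norm(g) < 1e-6: # simple convergence check
        break
    theta = theta - gamma * g    # theta_{t+1} = theta_t - gamma * g_t

print(theta)                     # close to the minimizer [1.0, -0.5]
```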

Learning rate

If the learning rate \gamma is too low, the steps are small and progress towards the minimum is very slow. If \gamma is too high, the iterates overshoot the minimum and may oscillate or diverge. With a well-chosen \gamma, the iterates move steadily towards a minimum.

A good strategy to find a good learning rate is:
- if the error keeps getting worse or oscillates widely, reduce the learning rate;
- if the error is decreasing fairly consistently but slowly, increase the learning rate.
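
This behaviour can be reproduced on a simple one-dimensional quadratic cost, J(\theta) = \theta^2; the three values of \gamma below are illustrative choices:

```python
def run_gd(gamma, theta0=2.0, steps=20):
    # Gradient descent on J(theta) = theta^2, whose gradient is 2 * theta.
    theta = theta0
    for _ in range(steps):
        theta = theta - gamma * 2.0 * theta
    return theta

for gamma in (0.01, 1.2, 0.3):   # too low, too high, ok
    print(gamma, run_gd(gamma))
# gamma = 0.01: still far from 0 (slow progress)
# gamma = 1.2:  the iterates diverge (overshooting)
# gamma = 0.3:  close to the minimizer theta = 0
```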

Computational challenge 1 - dim(\theta) is big

At each optimization step we need to compute the gradient

g_t = \nabla_\theta J(\theta_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L(x_i, y_i, \theta_t).

Challenge: a neural network contains a lot of parameters, so computing the gradient is costly.
Solution: a NN is a composition of multiple layers. Hence, each term \nabla_\theta L(x_i, y_i, \theta) can be computed efficiently by repeatedly applying the chain rule. This is called the back-propagation algorithm (not part of the course).

Computational challenge 2 - n is big

At each optimization step we need to compute the gradient

g_t = \nabla_\theta J(\theta_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L(x_i, y_i, \theta_t).

Challenge: we typically use a lot of training data n for training the neural network, so computing the gradient is costly.
Solution: at each iteration, we only use a small part of the data set to compute the gradient g_t. This is called stochastic gradient descent.

Stochastic gradient descent

A big data set is often redundant: many data points are similar. Consider a training data set (x_1, y_1), ..., (x_{20}, y_{20}). If the training data is big, the gradient computed on either half of the data is a good approximation of the full gradient,

\nabla_\theta J(\theta) \approx \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_\theta L(x_i, y_i, \theta) \quad \text{and} \quad \nabla_\theta J(\theta) \approx \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_\theta L(x_i, y_i, \theta).

We can then do two updates with only half the computational cost each:

\theta_{t+1} = \theta_t - \gamma \frac{1}{n/2} \sum_{i=1}^{n/2} \nabla_\theta L(x_i, y_i, \theta_t),
\theta_{t+2} = \theta_{t+1} - \gamma \frac{1}{n/2} \sum_{i=n/2+1}^{n} \nabla_\theta L(x_i, y_i, \theta_{t+1}).

The extreme version of this strategy is to use only one data point at each training step (called online learning):

\theta_1 = \theta_0 - \gamma \nabla_\theta L(x_1, y_1, \theta_0), \quad \theta_2 = \theta_1 - \gamma \nabla_\theta L(x_2, y_2, \theta_1), \quad \theta_3 = \theta_2 - \gamma \nabla_\theta L(x_3, y_3, \theta_2), \dots

We typically do something in between (not one data point, and not all data): we use a smaller set called a mini-batch. With a mini-batch size of 5, the updates become

\theta_1 = \theta_0 - \gamma \frac{1}{5} \sum_{i=1}^{5} \nabla_\theta L(x_i, y_i, \theta_0), \quad
\theta_2 = \theta_1 - \gamma \frac{1}{5} \sum_{i=6}^{10} \nabla_\theta L(x_i, y_i, \theta_1),
\theta_3 = \theta_2 - \gamma \frac{1}{5} \sum_{i=11}^{15} \nabla_\theta L(x_i, y_i, \theta_2), \quad
\theta_4 = \theta_3 - \gamma \frac{1}{5} \sum_{i=16}^{20} \nabla_\theta L(x_i, y_i, \theta_3).

One pass through the training data is called an epoch.

If we pick the mini-batches in order, they might be unbalanced and not representative of the whole data set. Therefore, we pick data points at random from the training data to form a mini-batch. One implementation is to randomly reshuffle the data before dividing it into mini-batches. After each epoch we do another reshuffling and another pass through the data set.

Mini-batch gradient descent

The full stochastic gradient descent algorithm (a.k.a. mini-batch gradient descent) is as follows:

1. Initialize \theta_0, set t \leftarrow 1, choose a batch size n_b and a number of epochs E.
2. For i = 1 to E:
   (a) Randomly shuffle the training data \{(x_i, y_i)\}_{i=1}^{n}.
   (b) For j = 1 to n/n_b:
       (i) Approximate the gradient of the loss function using the mini-batch \{(x_i, y_i)\}_{i=(j-1)n_b+1}^{j n_b},
           \hat{g}_t = \frac{1}{n_b} \sum_{i=(j-1)n_b+1}^{j n_b} \nabla_\theta L(x_i, y_i, \theta) \big|_{\theta=\theta_t}.
       (ii) Do a gradient step \theta_{t+1} = \theta_t - \gamma \hat{g}_t.
       (iii) Update the iteration index t \leftarrow t + 1.

At each iteration we get a stochastic approximation of the true gradient, \hat{g}_t \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta L(x_i, y_i, \theta) \big|_{\theta=\theta_t}, hence the name.
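
A minimal sketch of this algorithm on a linear regression problem with squared-error loss (the model, data and hyperparameters are illustrative choices, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = x^T theta_true + noise
n, p = 200, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def grad_L(x_i, y_i, theta):
    # Gradient of the squared-error loss L = (y_i - x_i^T theta)^2 for one data point
    return -2.0 * (y_i - x_i @ theta) * x_i

theta = np.zeros(p)          # initialize theta_0
gamma, n_b, E = 0.01, 5, 20  # learning rate, batch size, number of epochs

for epoch in range(E):
    perm = rng.permutation(n)            # (a) randomly shuffle the training data
    for j in range(n // n_b):            # (b) loop over mini-batches
        batch = perm[j * n_b:(j + 1) * n_b]
        g_hat = np.mean([grad_L(X[i], y[i], theta) for i in batch], axis=0)
        theta = theta - gamma * g_hat    # gradient step with the mini-batch gradient

print(theta)                             # close to theta_true = [1.0, -2.0, 0.5]
```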

Some comments - Why now?

Neural networks have been around for more than fifty years. Why have they become so popular now (again)? To solve really interesting problems you need:
1. Efficient learning algorithms
2. Efficient computational hardware
3. A lot of labeled data!
These three factors have not been fulfilled to a satisfactory level until the last 5-10 years.

Some pointers

A book has recently been published: I. Goodfellow, Y. Bengio and A. Courville, Deep Learning.

A well written and timely introduction: LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553).

You will also find more material than you can possibly want online.

A few concepts to summarize lecture 9

Convolutional neural network (CNN): a NN with a particular structure tailored for input data with a grid-like structure, for example images.
Kernel (a.k.a. filter): a set of parameters that is convolved with a hidden layer. Each kernel produces a new channel.
Channel: a set of hidden units produced by the same kernel. Each hidden layer consists of one or more channels.
Stride: a positive integer deciding how many steps to move the kernel during the convolution.
Tensor: a generalization of matrices to arbitrary order.
Gradient descent: an iterative optimization algorithm where we at each iteration take a step proportional to the negative gradient.
Learning rate (a.k.a. step length): a scalar tuning parameter deciding the length of each gradient step in gradient descent.
Stochastic gradient descent (SGD): a version of gradient descent where we at each iteration only use a small part of the training data (a mini-batch).
Mini-batch: the group of training data points that we use at each iteration in SGD.
Batch size: the number of data points in one mini-batch.


More information

Welcome to the Machine Learning Practical Deep Neural Networks. MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1

Welcome to the Machine Learning Practical Deep Neural Networks. MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1 Welcome to the Machine Learning Practical Deep Neural Networks MLP Lecture 1 / 18 September 2018 Single Layer Networks (1) 1 Introduction to MLP; Single Layer Networks (1) Steve Renals Machine Learning

More information

Lecture 3 Classification, Logistic Regression

Lecture 3 Classification, Logistic Regression Lecture 3 Classification, Logistic Regression Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se F. Lindsten Summary

More information

Bayesian Networks (Part I)

Bayesian Networks (Part I) 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10 10.2.1 Bishop 8.1,

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University.

Nonlinear Models. Numerical Methods for Deep Learning. Lars Ruthotto. Departments of Mathematics and Computer Science, Emory University. Nonlinear Models Numerical Methods for Deep Learning Lars Ruthotto Departments of Mathematics and Computer Science, Emory University Intro 1 Course Overview Intro 2 Course Overview Lecture 1: Linear Models

More information

Lecture 5 Neural models for NLP

Lecture 5 Neural models for NLP CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

Solutions. Part I Logistic regression backpropagation with a single training example

Solutions. Part I Logistic regression backpropagation with a single training example Solutions Part I Logistic regression backpropagation with a single training example In this part, you are using the Stochastic Gradient Optimizer to train your Logistic Regression. Consequently, the gradients

More information

Neural networks and optimization

Neural networks and optimization Neural networks and optimization Nicolas Le Roux Criteo 18/05/15 Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85 1 Introduction 2 Deep networks 3 Optimization 4 Convolutional

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Neural networks COMS 4771

Neural networks COMS 4771 Neural networks COMS 4771 1. Logistic regression Logistic regression Suppose X = R d and Y = {0, 1}. A logistic regression model is a statistical model where the conditional probability function has a

More information

Deep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 8: Autoencoder & DBM. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 8: Autoencoder & DBM Princeton University COS 495 Instructor: Yingyu Liang Autoencoder Autoencoder Neural networks trained to attempt to copy its input to its output Contain

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

CSC321 Lecture 4: Learning a Classifier

CSC321 Lecture 4: Learning a Classifier CSC321 Lecture 4: Learning a Classifier Roger Grosse Roger Grosse CSC321 Lecture 4: Learning a Classifier 1 / 31 Overview Last time: binary classification, perceptron algorithm Limitations of the perceptron

More information

Rapid Introduction to Machine Learning/ Deep Learning

Rapid Introduction to Machine Learning/ Deep Learning Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/62 Lecture 1b Logistic regression & neural network October 2, 2015 2/62 Table of contents 1 1 Bird s-eye

More information

RegML 2018 Class 8 Deep learning

RegML 2018 Class 8 Deep learning RegML 2018 Class 8 Deep learning Lorenzo Rosasco UNIGE-MIT-IIT June 18, 2018 Supervised vs unsupervised learning? So far we have been thinking of learning schemes made in two steps f(x) = w, Φ(x) F, x

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

Machine Learning Basics

Machine Learning Basics Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Spatial Transformer Networks

Spatial Transformer Networks BIL722 - Deep Learning for Computer Vision Spatial Transformer Networks Max Jaderberg Andrew Zisserman Karen Simonyan Koray Kavukcuoglu Contents Introduction to Spatial Transformers Related Works Spatial

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

More information

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,

More information

GRADIENT DESCENT. CSE 559A: Computer Vision GRADIENT DESCENT GRADIENT DESCENT [0, 1] Pr(y = 1) w T x. 1 f (x; θ) = 1 f (x; θ) = exp( w T x)

GRADIENT DESCENT. CSE 559A: Computer Vision GRADIENT DESCENT GRADIENT DESCENT [0, 1] Pr(y = 1) w T x. 1 f (x; θ) = 1 f (x; θ) = exp( w T x) 0 x x x CSE 559A: Computer Vision For Binary Classification: [0, ] f (x; ) = σ( x) = exp( x) + exp( x) Output is interpreted as probability Pr(y = ) x are the log-odds. Fall 207: -R: :30-pm @ Lopata 0

More information

Machine Learning Basics III

Machine Learning Basics III Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: Ryan Lowe (ryan.lowe@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted,

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 Multi-layer networks Steve Renals Machine Learning Practical MLP Lecture 3 7 October 2015 MLP Lecture 3 Multi-layer networks 2 What Do Single

More information

Feedforward Neural Networks. Michael Collins, Columbia University

Feedforward Neural Networks. Michael Collins, Columbia University Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation

More information

Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm

Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm Name: Student number: This is a closed-book test. It is marked out of 15 marks. Please answer ALL of the

More information

Deep Learning & Artificial Intelligence WS 2018/2019

Deep Learning & Artificial Intelligence WS 2018/2019 Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron

More information

Lecture 2: Learning with neural networks

Lecture 2: Learning with neural networks Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction

More information

Lecture 12. Talk Announcement. Neural Networks. This Lecture: Advanced Machine Learning. Recap: Generalized Linear Discriminants

Lecture 12. Talk Announcement. Neural Networks. This Lecture: Advanced Machine Learning. Recap: Generalized Linear Discriminants Advanced Machine Learning Lecture 2 Neural Networks 24..206 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Talk Announcement Yann LeCun (NYU & FaceBook AI) 28..

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Neural Networks and Deep Learning.

Neural Networks and Deep Learning. Neural Networks and Deep Learning www.cs.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts perceptrons the perceptron training rule linear separability hidden

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

OPTIMIZATION METHODS IN DEEP LEARNING

OPTIMIZATION METHODS IN DEEP LEARNING Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

Convolutional Neural Network Architecture

Convolutional Neural Network Architecture Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation

More information

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016

CS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 CS 1674: Intro to Computer Vision Final Review Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 Final info Format: multiple-choice, true/false, fill in the blank, short answers, apply an

More information

Introduction to Neural Networks

Introduction to Neural Networks CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued) Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 12 Neural Networks 10.12.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

CSC321 Lecture 8: Optimization

CSC321 Lecture 8: Optimization CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:

More information

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang

Machine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set

More information

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen

Lecture 12. Neural Networks Bastian Leibe RWTH Aachen Advanced Machine Learning Lecture 12 Neural Networks 24.11.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Talk Announcement Yann LeCun (NYU & FaceBook AI)

More information