Statistical Machine Learning
Lecture 9: Numerical optimization and deep learning
Niklas Wahlström, Division of Systems and Control, Department of Information Technology, Uppsala University. niklas.wahlstrom@it.uu.se. N. Wahlström, 2018.
Summary of Lecture 8 (I/III) - Neural network

A neural network is a sequential construction of several linear regression models. It maps the inputs X_1, ..., X_p through hidden units H_1, ..., H_M to the output Z:

H_m = σ(β^(1)_{0m} + Σ_{j=1}^p β^(1)_{jm} X_j), m = 1, ..., M,
Z = β^(2)_0 + Σ_{m=1}^M β^(2)_m H_m.

Collecting the hidden units in H = [H_1, ..., H_M]^T, this can be written compactly as

H = σ(W^(1)T X + b^(1)T), Z = W^(2)T H + b^(2)T,

with offset vectors b^(1) = [β^(1)_{01}, ..., β^(1)_{0M}], b^(2) = [β^(2)_0], and weight matrices W^(1) (entry β^(1)_{jm} in row j, column m) and W^(2) = [β^(2)_1, ..., β^(2)_M]^T. The non-linearity σ acts element-wise.

Stacking several hidden layers gives a deep network:

H^(1) = σ(W^(1)T X + b^(1)T),
H^(2) = σ(W^(2)T H^(1) + b^(2)T),
Z = W^(3)T H^(2) + b^(3)T.

The model learns better using a deep network (several layers) instead of a wide and shallow network.
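The two-layer forward pass above can be written out in a few lines of Python; all parameter values below are made-up illustrations, not learned weights:

```python
import math

def sigmoid(z):
    # Element-wise logistic non-linearity sigma(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W, b, activation=None):
    # One linear-regression-style layer: out_m = b_m + sum_j W[j][m] * x[j],
    # optionally followed by an element-wise non-linearity.
    M = len(b)
    out = [b[m] + sum(W[j][m] * x[j] for j in range(len(x))) for m in range(M)]
    if activation is not None:
        out = [activation(v) for v in out]
    return out

# Toy network: p = 2 inputs, M = 3 hidden units, scalar output Z.
W1 = [[0.5, -0.3, 0.8],
      [0.1, 0.7, -0.2]]      # shape p x M
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0], [-1.0], [0.5]]  # shape M x 1
b2 = [0.2]

X = [1.0, 2.0]
H = layer(X, W1, b1, activation=sigmoid)   # H = sigma(W1^T X + b1)
Z = layer(H, W2, b2)                       # Z = W2^T H + b2 (no non-linearity)
```

The sigmoid here stands in for whatever element-wise non-linearity σ the network uses.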
Summary of Lecture 8 (II/III) - Classification

For K > 2 classes we want to predict the class probability for all K classes, q_k(X; θ) = Pr(Y = k | X). We extend the logistic function to the softmax activation function

q_k(X; θ) = g_k(Z) = e^{Z_k} / Σ_{l=1}^K e^{Z_l}, k = 1, ..., K,

where Z_1, ..., Z_K are the logits produced by the last layer and softmax(Z) = [g_1(Z), ..., g_K(Z)]^T.
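The softmax is straightforward to compute; a minimal Python sketch (the max-subtraction is a standard numerical-stability trick, not something from the slides):

```python
import math

def softmax(z):
    # Subtracting max(z) leaves the probabilities unchanged (it cancels in
    # the ratio) but avoids overflow in exp for large logits.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# The logits from the first example slide below give
# approximately [0.28, 0.68, 0.04].
q = softmax([-0.5, 0.4, -2.5])
```

By construction the outputs are positive and sum to one, so they can be read as class probabilities.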
Summary of Lecture 8 (III/III) - Example with K = 3 classes

Consider an example with three classes, where the true class is k = 2, i.e. Y = [0, 1, 0]^T in one-hot encoding. The network is trained by minimizing the cross-entropy

L(X, Y, θ) = -Σ_{k=1}^K Y_k log q_k(X; θ),

which with one-hot Y reduces to -log q_2(X; θ). Three cases:

Logits Z = [-0.5, 0.4, -2.5] give q(X; θ) = [0.28, 0.68, 0.04] and loss -log 0.68 = 0.39.
Logits Z = [-1.1, -4.0, -1.2] give q(X; θ) = [0.51, 0.03, 0.46] and loss -log 0.03 = 3.51.
Logits Z = [0.9, 4.2, 0.2] give q(X; θ) = [0.03, 0.95, 0.02] and loss -log 0.95 = 0.05.

The more probability the network assigns to the true class, the smaller the loss.
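The loss values in the example can be checked with a few lines of Python; `cross_entropy` is an illustrative helper, not slide code:

```python
import math

def cross_entropy(y, q):
    # L = -sum_k y_k * log(q_k); with a one-hot y this picks out -log of
    # the probability assigned to the true class.
    return -sum(yk * math.log(qk) for yk, qk in zip(y, q))

y = [0, 1, 0]                                      # true class is k = 2
loss_good = cross_entropy(y, [0.03, 0.95, 0.02])   # confident and correct
loss_bad  = cross_entropy(y, [0.51, 0.03, 0.46])   # confident and wrong
```

A confident correct prediction costs almost nothing, while a confident wrong one is penalized heavily.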
Contents Lecture 9
1. Convolutional neural networks (cont.)
2. Numerical optimization
3. Stochastic gradient descent

Practicals
Lab: either February 28, 8:15-12:00, or March 1, 13:15-17:00. Room: 1515:ITC. Sign up! Pick up a lab PM and bring it to the lab session.
Next lecture is next Tuesday. The last 45 min will be a Q&A session. Send any questions you have to fredrik.lindsten@liu.se.
Peer review: deadline March 1, 23:59 (Thursday).
Guest lecture: March 15, by Daniel Langkilde (Annotell). Room: 2446:ITC.
Reading instructions slightly updated, see the homepage.
Convolutional neural networks

Convolutional neural networks (CNNs) are a special kind of neural network tailored for problems where the input data has a grid-like structure. Examples:
Digital images (2D grid of pixels)
Audio waveform data (1D grid, time series)
Volumetric data, e.g. CT scans (3D grid)
The description here will focus on images.
Data representation of images

Consider a grayscale image of 6 x 6 pixels. Each pixel value represents the color, ranging from 0 (total absence, black) to 1 (total presence, white). The pixels are the input variables X_{1,1}, X_{1,2}, ..., X_{6,6}, arranged in a 6 x 6 grid.
The convolutional layer

Consider a hidden layer with 6 x 6 hidden units, computed from the 6 x 6 input pixels.

Dense layer: each hidden unit is connected with all pixels, and each pixel-hidden-unit pair has its own unique parameter.

Convolutional layer: each hidden unit is connected with a region of pixels via a set of parameters, a so-called kernel (here 3 x 3, with parameters β^(1)_{1,1}, ..., β^(1)_{3,3} and offset β^(1)_0). The same kernel slides across the image, so different hidden units share the same set of parameters; positions outside the image border are treated as zeros.

A convolutional layer thus uses sparse interactions and parameter sharing.
Condensing information with strides

Problem: as we proceed through the network we want to condense the information.
Solution: apply the kernel to every second pixel, i.e. use a stride of 2 (instead of 1). With stride 2 we get half the number of rows and columns in the hidden layer.
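A convolutional layer with a shared kernel and a stride parameter can be sketched in plain Python. One simplification: this sketch uses a "valid" convolution with no zero padding, so a 3 x 3 kernel on a 6 x 6 image yields a 4 x 4 map at stride 1 and a 2 x 2 map at stride 2, whereas the slides pad with zeros to keep the 6 x 6 size. The function name and all values are illustrative:

```python
def conv2d(X, K, b=0.0, stride=1):
    # Valid 2-D convolution (strictly, cross-correlation, as in most deep
    # learning libraries): one shared kernel K slides over the image X.
    n, k = len(X), len(K)
    out = []
    for i in range(0, n - k + 1, stride):
        row = []
        for j in range(0, n - k + 1, stride):
            s = b + sum(K[a][c] * X[i + a][j + c]
                        for a in range(k) for c in range(k))
            row.append(s)
        out.append(row)
    return out

# 6x6 image, 3x3 kernel: stride 1 gives a 4x4 output map, stride 2
# roughly halves the number of rows and columns (here 2x2).
X = [[float(i + j) for j in range(6)] for i in range(6)]
K = [[0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0]]   # picks out the centre pixel of each region

H1 = conv2d(X, K, stride=1)
H2 = conv2d(X, K, stride=2)
```

Every output unit reuses the same nine kernel weights and one offset: that is the parameter sharing of the slide in code form.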
Multiple channels

One kernel per layer does not give enough flexibility. We therefore use multiple kernels, each with its own parameters β^(1)_{0,k}, β^(1)_{1,1,k}, ..., β^(1)_{3,3,k}. Each kernel produces its own set of hidden units, a channel.

Hidden layers are organized in tensors of size (rows x columns x channels); with four kernels applied to the 6 x 6 image, the hidden layer has dimension 6 x 6 x 4.
What is a tensor?

A tensor is a generalization of scalars, vectors, and matrices to arbitrary order:
Scalar - order 0
Vector - order 1
Matrix - order 2
Tensor - any order (here order 3)
An order-3 tensor T can be viewed as a stack of matrices T_{:,:,1}, T_{:,:,2}, ...
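In code, a tensor's order is simply the number of index levels needed to reach a scalar; a small sketch using nested Python lists (NumPy's `ndim` attribute plays the same role for arrays):

```python
def order(t):
    # Order of a nested-list "tensor": 0 for a scalar, 1 for a vector,
    # 2 for a matrix, and so on.
    d = 0
    while isinstance(t, list):
        t = t[0]
        d += 1
    return d

a = 3.0                              # scalar, order 0
b = [1.0, 2.0, 3.0]                  # vector, order 1
W = [[1.0, 2.0], [3.0, 4.0]]         # matrix, order 2
T = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]       # order-3 tensor (2 x 2 x 2)
```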
Multiple channels (cont.)

A kernel operates on all channels in a hidden layer. Each kernel has size (kernel rows x kernel columns x input channels), here 3 x 3 x 4. We stack all kernel parameters in a weight tensor of size (kernel rows x kernel columns x input channels x output channels), here 3 x 3 x 4 x 6. In the example we also use stride 2, producing the second hidden layer from the first.

Convolutional layer: W^(2) ∈ R^{3 x 3 x 4 x 6}, b^(2) ∈ R^6.
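The parameter sharing pays off in the parameter count. A quick sanity check under the slide's shapes (the dense-layer comparison is my own, assuming the stride-2 output layer has dimension 3 x 3 x 6):

```python
# Convolutional layer: a (3 x 3 x 4 x 6) weight tensor plus one bias
# per output channel.
kr, kc, cin, cout = 3, 3, 4, 6
conv_params = kr * kc * cin * cout + cout   # 216 weights + 6 biases

# For comparison, a dense layer between the same two layers:
# 6*6*4 = 144 units in, 3*3*6 = 54 units out.
dense_params = (6 * 6 * 4) * (3 * 3 * 6) + (3 * 3 * 6)
```

The convolutional layer needs 222 parameters where a dense layer between the same two layers would need 7830.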
Full CNN architecture

A full CNN usually consists of multiple convolutional layers (here two) and a few final dense layers (here two). If we have a classification problem at hand, we end with a softmax activation function to produce class probabilities.

In the example: input variables (6 x 6 image) -> convolutional layer with W^(1) ∈ R^{3 x 3 x 1 x 4}, b^(1) ∈ R^4 -> hidden units (6 x 6 x 4) -> convolutional layer with W^(2) ∈ R^{3 x 3 x 4 x 6}, b^(2) ∈ R^6 -> hidden units -> dense layer with b^(3) ∈ R^{30} -> 30 hidden units -> dense layer with softmax, W^(4) ∈ R^{30 x 10}, b^(4) ∈ R^{10} -> logits Z_1, ..., Z_10 -> outputs q_1(X; θ), ..., q_10(X; θ).

Here we use 30 hidden units in the last hidden layer and consider a classification problem with K = 10 classes.
Why deep?

Example: image classification. Input: the pixels of an image. Output: object identity. Each hidden layer extracts increasingly abstract features.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. Computer Vision - ECCV (2014).
Skin cancer - background

One result on the use of deep learning in medicine: detecting skin cancer (February 2017).

Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M. and Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, February 2017.

Some background figures (from the US) on skin cancer:
Melanomas represent less than 5% of all skin cancers, but account for 75% of all skin-cancer-related deaths. Early detection is absolutely critical.
Estimated 5-year survival rate for melanoma: over 99% if detected in its earlier stages, 14% if detected in its later stages.
Skin cancer - task
Image copyright Nature (doi: /nature21056).
Skin cancer - taxonomy used
Image copyright Nature (doi: /nature21056).
Skin cancer - solution (ultra-brief)

In the paper they used the following approach:
Initialize all parameters from a neural network trained on 1.28 million images (transfer learning).
From this initialization, learn new model parameters using clinical images (roughly 100 times more images than any previous study).
Use the model to predict the class of unseen data.
Skin cancer - indication of the results

sensitivity = true positives / positives
specificity = true negatives / negatives

Image copyright Nature (doi: /nature21056).
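The two metrics are easy to compute from a confusion table; the counts below are made up for illustration and are not from the paper:

```python
def sensitivity(tp, fn):
    # Fraction of actual positives (e.g. malignant lesions) that the
    # classifier flags as positive: TP / (TP + FN).
    return tp / (tp + fn)

def specificity(tn, fp):
    # Fraction of actual negatives the classifier correctly clears:
    # TN / (TN + FP).
    return tn / (tn + fp)

# Hypothetical confusion counts:
tp, fn, tn, fp = 90, 10, 160, 40
sens = sensitivity(tp, fn)   # 0.9
spec = specificity(tn, fp)   # 0.8
```

For a screening task like melanoma detection, high sensitivity matters most: a missed cancer is far costlier than a false alarm.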
Numerical optimization

We train a network by considering the optimization problem

θ̂ = arg min_θ J(θ), J(θ) = (1/n) Σ_{i=1}^n L(x_i, y_i, θ).

The best possible solution θ̂ is the global minimizer of J(θ). The global minimizer is typically very hard to find, and we have to settle for a local minimizer.
Iterative solution - Example in 1D

In our search for a local minimizer of J(θ) we make an initial guess θ_0 and then update θ iteratively (θ_1, θ_2, ...), moving downhill on the cost surface.
Iterative solution (gradient descent) - Example in 2D

With θ = [β_0, β_1]^T ∈ R^2:

1. Pick a θ_0.
2. While not converged:
   Update θ_{t+1} = θ_t - γ g_t, where g_t = ∇_θ J(θ_t).
   Update t ← t + 1.

We call γ ∈ R the step length or learning rate. How big a learning rate γ should we use?
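The loop above is only a few lines of Python; the quadratic cost and all numeric choices below are illustrative:

```python
def gradient_descent(grad, theta0, gamma, n_iter):
    # Plain gradient descent: theta_{t+1} = theta_t - gamma * grad(theta_t).
    theta = list(theta0)
    for _ in range(n_iter):
        g = grad(theta)
        theta = [t - gamma * gi for t, gi in zip(theta, g)]
    return theta

# Toy cost J(theta) = (b0 - 1)^2 + (b1 + 2)^2 with gradient
# [2(b0 - 1), 2(b1 + 2)]; the minimizer is [1, -2].
grad_J = lambda th: [2.0 * (th[0] - 1.0), 2.0 * (th[1] + 2.0)]
theta_hat = gradient_descent(grad_J, [0.0, 0.0], gamma=0.1, n_iter=100)
```

On this convex cost the iterates converge to the unique (hence global) minimizer; on a neural-network cost the same loop would typically land in a local minimizer.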
Learning rate

γ too low: the iterates creep toward the minimum and progress is very slow.
γ too high: the iterates overshoot and oscillate around the minimum, possibly diverging.
γ ok: the iterates converge quickly to the minimum.

A good strategy to find a good learning rate is:
if the error keeps getting worse or oscillates widely, reduce the learning rate;
if the error is falling fairly consistently but slowly, increase the learning rate.
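The three regimes are easy to reproduce on the toy cost J(θ) = θ² (gradient 2θ), where the update becomes θ ← (1 - 2γ)θ; the specific γ values are made up:

```python
def run_gd(gamma, theta0=1.0, n_iter=20):
    # Gradient descent on J(theta) = theta^2, whose gradient is 2*theta.
    theta = theta0
    for _ in range(n_iter):
        theta = theta - gamma * 2.0 * theta
    return theta

slow     = run_gd(gamma=0.01)  # too low: still far from 0 after 20 steps
diverged = run_gd(gamma=1.1)   # too high: overshoots and grows each step
ok       = run_gd(gamma=0.4)   # ok: converges quickly toward 0
```

On this cost the update multiplies θ by (1 - 2γ) each step, so the iterates converge exactly when |1 - 2γ| < 1, i.e. 0 < γ < 1.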
Computational challenge 1 - dim(θ) is big

At each optimization step we need to compute the gradient

g_t = ∇_θ J(θ_t) = (1/n) Σ_{i=1}^n ∇_θ L(x_i, y_i, θ_t).

Challenge: a neural network contains a lot of parameters, so computing the gradient is costly.
Solution: a neural network is a composition of multiple layers. Hence, each term ∇_θ L(x_i, y_i, θ) can be computed efficiently by repeatedly applying the chain rule. This is called the back-propagation algorithm (not part of the course).
Computational challenge 2 - n is big

At each optimization step we need to compute the gradient

g_t = ∇_θ J(θ_t) = (1/n) Σ_{i=1}^n ∇_θ L(x_i, y_i, θ_t).

Challenge: we typically use a lot of training data n for training the neural network, so computing the gradient is costly.
Solution: at each iteration, we only use a small part of the data set to compute the gradient g_t. This is called stochastic gradient descent.
Stochastic gradient descent

A big data set is often redundant: many data points are similar. Consider training data (x_1, y_1), ..., (x_20, y_20). If the training data is big,

∇_θ J(θ) ≈ (1/(n/2)) Σ_{i=1}^{n/2} ∇_θ L(x_i, y_i, θ) and ∇_θ J(θ) ≈ (1/(n/2)) Σ_{i=n/2+1}^{n} ∇_θ L(x_i, y_i, θ).

We can then do each update with only half the computational cost:

θ_{t+1} = θ_t - γ (1/(n/2)) Σ_{i=1}^{n/2} ∇_θ L(x_i, y_i, θ_t),
θ_{t+2} = θ_{t+1} - γ (1/(n/2)) Σ_{i=n/2+1}^{n} ∇_θ L(x_i, y_i, θ_{t+1}).
103 Stochastic gradient descent

Training data: (x_1, y_1), (x_2, y_2), ..., (x_20, y_20)

    θ_1 = θ_0 − γ ∇_θ L(x_1, y_1, θ_0)
    θ_2 = θ_1 − γ ∇_θ L(x_2, y_2, θ_1)
    θ_3 = θ_2 − γ ∇_θ L(x_3, y_3, θ_2)
    θ_4 = θ_3 − γ ∇_θ L(x_4, y_4, θ_3)

The extreme version of this strategy is to use only one data point at each training step (called online learning).

27 / 32 N. Wahlström, 2018
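The single-point updates θ_1, ..., θ_4 are the first steps of online SGD. Below is a minimal runnable sketch, assuming a squared-error loss on simulated linear-regression data; the data and the names are mine, not the lecturer's.

```python
import numpy as np

# Simulated training data for a linear model y = x^T theta + noise.
rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.normal(size=(n, p))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true + 0.05 * rng.normal(size=n)

gamma = 0.05         # learning rate
theta = np.zeros(p)  # theta_0

# One pass through the data: theta_{t+1} = theta_t - gamma * grad L(x_{t+1}, y_{t+1}, theta_t),
# using a single data point at each step.
for x_t, y_t in zip(X, y):
    grad_t = 2.0 * (x_t @ theta - y_t) * x_t  # gradient of one squared-error term
    theta = theta - gamma * grad_t

print(theta)  # typically close to theta_true after one pass over the data
```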
107 Stochastic gradient descent

Training data: (x_1, y_1), ..., (x_20, y_20), processed in mini-batches of five points:

    θ_1 = θ_0 − γ (1/5) Σ_{i=1}^{5}   ∇_θ L(x_i, y_i, θ_0)
    θ_2 = θ_1 − γ (1/5) Σ_{i=6}^{10}  ∇_θ L(x_i, y_i, θ_1)
    θ_3 = θ_2 − γ (1/5) Σ_{i=11}^{15} ∇_θ L(x_i, y_i, θ_2)
    θ_4 = θ_3 − γ (1/5) Σ_{i=16}^{20} ∇_θ L(x_i, y_i, θ_3)

The extreme version of this strategy is to use only one data point at each training step (called online learning).

We typically do something in between (not one data point, and not all data): we use a smaller set called a mini-batch.

One pass through the training data is called an epoch.

27 / 32 N. Wahlström, 2018
111 Stochastic gradient descent

Training data (reshuffled): (x_7, y_7), (x_10, y_10), (x_3, y_3), (x_20, y_20), (x_16, y_16), (x_2, y_2), (x_1, y_1), (x_18, y_18), (x_19, y_19), (x_12, y_12), (x_6, y_6), (x_11, y_11), (x_17, y_17), (x_15, y_15), (x_5, y_5), (x_14, y_14), (x_4, y_4), (x_9, y_9), (x_13, y_13), (x_8, y_8)

Iterations 1-4, epoch 1: each iteration uses the next mini-batch of five points from the reshuffled data.

If we pick the mini-batches in order, they might be unbalanced and not representative for the whole data set. Therefore, we pick data points at random from the training data to form a mini-batch. One implementation is to randomly reshuffle the data before dividing it into mini-batches.

28 / 32 N. Wahlström, 2018
120 Stochastic gradient descent

Training data (reshuffled again): (x_19, y_19), (x_16, y_16), (x_18, y_18), (x_6, y_6), (x_9, y_9), (x_13, y_13), (x_1, y_1), (x_14, y_14), (x_20, y_20), (x_11, y_11), (x_3, y_3), (x_8, y_8), (x_7, y_7), (x_12, y_12), (x_4, y_4), (x_17, y_17), (x_5, y_5), (x_10, y_10), (x_2, y_2), (x_15, y_15)

Iterations 5-8, epoch 2.

After each epoch we do another reshuffling and another pass through the data set.

28 / 32 N. Wahlström, 2018
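The shuffle-then-slice scheme described above can be written as a small generator. This is my own sketch, with the made-up helper name `minibatch_indices`, not code from the course.

```python
import numpy as np

def minibatch_indices(n, n_b, n_epochs, seed=0):
    """Yield (epoch, index array) pairs: reshuffle once per epoch,
    then slice the shuffled order into mini-batches of size n_b."""
    rng = np.random.default_rng(seed)
    for epoch in range(1, n_epochs + 1):
        order = rng.permutation(n)  # a fresh random reshuffling each epoch
        for start in range(0, n, n_b):
            yield epoch, order[start:start + n_b]

# 20 data points with batch size 5 gives 4 iterations per epoch, as on the slide.
batches = list(minibatch_indices(n=20, n_b=5, n_epochs=2))
print(len(batches))  # 8 iterations in total over 2 epochs
```

Within one epoch the batches are disjoint and together cover every data point exactly once, which is what distinguishes this implementation from sampling each batch independently at random.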
125 Mini-batch gradient descent

The full stochastic gradient descent algorithm (a.k.a. mini-batch gradient descent) is as follows:

1. Initialize θ_0, set t ← 1, choose a batch size n_b and a number of epochs E.
2. For i = 1 to E:
   (a) Randomly shuffle the training data {(x_i, y_i)}_{i=1}^n.
   (b) For j = 1 to n/n_b:
       (i) Approximate the gradient of the loss function using the mini-batch {(x_i, y_i)}_{i=(j−1)n_b+1}^{j n_b},
               ĝ_t = (1/n_b) Σ_{i=(j−1)n_b+1}^{j n_b} ∇_θ L(x_i, y_i, θ)|_{θ=θ_t}.
       (ii) Do a gradient step θ_{t+1} = θ_t − γ ĝ_t.
       (iii) Update the iteration index t ← t + 1.

At each iteration we get a stochastic approximation of the true gradient, ĝ_t ≈ (1/n) Σ_{i=1}^{n} ∇_θ L(x_i, y_i, θ)|_{θ=θ_t}, hence the name.

29 / 32 N. Wahlström, 2018
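The numbered algorithm above translates almost line for line into code. The sketch below fills in a concrete loss (squared error on simulated linear-regression data); that choice, and all names, are mine rather than the lecturer's.

```python
import numpy as np

# Simulated linear-regression training data.
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, 3.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def grad_L(X_b, y_b, theta):
    # Mini-batch gradient: (1/n_b) * sum over the batch of grad_theta L(x_i, y_i, theta)
    # for the squared-error loss L = (x_i^T theta - y_i)^2.
    return 2.0 * X_b.T @ (X_b @ theta - y_b) / len(y_b)

def minibatch_gd(X, y, gamma=0.1, n_b=10, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    theta = np.zeros(X.shape[1])           # 1. initialize theta_0
    for _ in range(n_epochs):              # 2. for i = 1 to E
        order = rng.permutation(n)         # (a) randomly shuffle the training data
        for start in range(0, n, n_b):     # (b) for j = 1 to n/n_b
            idx = order[start:start + n_b]
            g_hat = grad_L(X[idx], y[idx], theta)  # (i) stochastic gradient g_t
            theta = theta - gamma * g_hat          # (ii) gradient step
    return theta                           # (iii) the index t is implicit in the loops

theta_hat = minibatch_gd(X, y)
print(theta_hat)  # should land close to theta_true
```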
126 Some comments - Why now?

Neural networks have been around for more than fifty years. Why have they become so popular now (again)? To solve really interesting problems you need:

1. Efficient learning algorithms
2. Efficient computational hardware
3. A lot of labeled data!

These three factors were not fulfilled to a satisfactory level until the last 5-10 years.

30 / 32 N. Wahlström, 2018
127 Some pointers

A book has recently been published: I. Goodfellow, Y. Bengio and A. Courville, Deep Learning.

A well written and timely introduction: LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553).

You will also find more material than you can possibly want here.

31 / 32 N. Wahlström, 2018
130 A few concepts to summarize lecture 9

Convolutional neural network (CNN): A NN with a particular structure tailored for input data with a grid-like structure, like for example images.
Kernel: (a.k.a. filter) A set of parameters that is convolved with a hidden layer. Each kernel produces a new channel.
Channel: A set of hidden units produced by the same kernel. Each hidden layer consists of one or more channels.
Stride: A positive integer deciding how many steps to move the kernel during the convolution.
Tensor: A generalization of matrices to arbitrary order.
Gradient descent: An iterative optimization algorithm where we at each iteration take a step proportional to the negative gradient.
Learning rate: (a.k.a. step length) A scalar tuning parameter deciding the length of each gradient step in gradient descent.
Stochastic gradient descent (SGD): A version of gradient descent where we at each iteration only use a small part of the training data (a mini-batch).
Mini-batch: The group of training data that we use at each iteration in SGD.
Batch size: The number of data points in one mini-batch.

32 / 32 N. Wahlström, 2018
More informationFeedforward Neural Networks. Michael Collins, Columbia University
Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation
More informationMidterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm
Midterm for CSC321, Intro to Neural Networks Winter 2018, night section Tuesday, March 6, 6:10-7pm Name: Student number: This is a closed-book test. It is marked out of 15 marks. Please answer ALL of the
More informationDeep Learning & Artificial Intelligence WS 2018/2019
Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Neural networks Daniel Hennes 21.01.2018 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Logistic regression Neural networks Perceptron
More informationLecture 2: Learning with neural networks
Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
More informationLecture 12. Talk Announcement. Neural Networks. This Lecture: Advanced Machine Learning. Recap: Generalized Linear Discriminants
Advanced Machine Learning Lecture 2 Neural Networks 24..206 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Talk Announcement Yann LeCun (NYU & FaceBook AI) 28..
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA
More informationNeural Networks and Deep Learning.
Neural Networks and Deep Learning www.cs.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts perceptrons the perceptron training rule linear separability hidden
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationOPTIMIZATION METHODS IN DEEP LEARNING
Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate
More informationUNSUPERVISED LEARNING
UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training
More informationConvolutional Neural Network Architecture
Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation
More informationCS 1674: Intro to Computer Vision. Final Review. Prof. Adriana Kovashka University of Pittsburgh December 7, 2016
CS 1674: Intro to Computer Vision Final Review Prof. Adriana Kovashka University of Pittsburgh December 7, 2016 Final info Format: multiple-choice, true/false, fill in the blank, short answers, apply an
More informationIntroduction to Neural Networks
CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character
More informationTopics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)
Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 3: Introduction to Deep Learning (continued) Course Logistics - Update on course registrations - 6 seats left now -
More informationLarge-Scale Feature Learning with Spike-and-Slab Sparse Coding
Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationCSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression
CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward
More informationLecture 3 Feedforward Networks and Backpropagation
Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression
More informationINF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018
Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)
More informationNeural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann
Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable
More informationLecture 12. Neural Networks Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 12 Neural Networks 10.12.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationCSC321 Lecture 8: Optimization
CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationMachine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang
Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set
More informationLecture 12. Neural Networks Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 12 Neural Networks 24.11.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de Talk Announcement Yann LeCun (NYU & FaceBook AI)
More information