EE-559 Deep learning 9. Autoencoders and generative models


François Fleuret [version of: May 1, 2018] ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Embeddings and generative models

Many applications such as image synthesis, denoising, super-resolution, speech synthesis, compression, etc. require going beyond classification and regression, and explicitly modeling a high-dimension signal. This modeling consists of finding meaningful degrees of freedom, of lesser dimension, that describe the signal.

[Figure: the original space X, a lower-dimensional latent space F, a mapping f from X to F, and a mapping g from F back to X.]

When dealing with real-world signals, this objective involves the same theoretical and practical issues as classification or regression: defining the right class of high-dimension models, and optimizing them. Regarding synthesis, we saw that deep feed-forward architectures exhibit good generative properties, which motivates using them explicitly for that purpose.

Autoencoders

An autoencoder combines an encoder f that embeds the original space X into a latent space F of lower dimension, and a decoder g that maps back to X, such that their composition g ∘ f is [close to] the identity on the data.

[Figure: the encoder f maps X into the latent space F, and the decoder g maps F back into X.]

A proper autoencoder has to capture a good parametrization of the signal, and in particular the statistical dependencies between the signal components.

Let q be the data distribution over X. A good autoencoder could be characterized with the MSE loss

\mathbb{E}_{X \sim q}\left[ \| X - g \circ f(X) \|^2 \right] \approx 0.

Given two parametrized mappings f(\cdot; w_f) and g(\cdot; w_g), training consists of minimizing an empirical estimate of that loss

\hat{w}_f, \hat{w}_g = \operatorname*{argmin}_{w_f, w_g} \frac{1}{N} \sum_{n=1}^{N} \| x_n - g(f(x_n; w_f); w_g) \|^2.

A simple example of such an autoencoder would be with both f and g linear, in which case the optimal solution is given by PCA. Better results can be achieved with more sophisticated classes of mappings, in particular deep architectures.
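For the linear case, the solution can be computed in closed form; a minimal PCA-reconstruction sketch, where the data matrix X of shape N x c and the latent dimension d are illustrative:

import torch

# Project centered data on the d leading principal directions and map back,
# i.e. the optimal linear encoder/decoder pair for the MSE loss.
def pca_reconstruct(X, d):
    mean = X.mean(0)
    U, S, V = torch.svd(X - mean)     # X - mean = U diag(S) V^T
    P = V[:, :d]                      # c x d basis of the principal subspace
    return (X - mean) @ P @ P.t() + mean

X = torch.randn(100, 16)
print(((X - pca_reconstruct(X, 4)) ** 2).mean())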

Transposed convolutions

Constructing deep generative architectures, such as the decoder of an autoencoder, requires layers that increase the signal dimension, the contrary of what we have done so far with feed-forward networks.

Generative processes that consist of optimizing the input rely on back-propagation to expand the signal from a low-dimension representation to the high-dimension signal space. The same can be done in the forward pass with transposed convolution layers, whose forward operation corresponds to a convolution layer's backward pass.

Consider a 1d convolution with a kernel \kappa:

y_i = (x \circledast \kappa)_i = \sum_a x_{i+a-1} \, \kappa_a = \sum_u x_u \, \kappa_{u-i+1}.

We get

\left[ \frac{\partial \ell}{\partial x} \right]_u = \frac{\partial \ell}{\partial x_u} = \sum_i \frac{\partial \ell}{\partial y_i} \frac{\partial y_i}{\partial x_u} = \sum_i \frac{\partial \ell}{\partial y_i} \, \kappa_{u-i+1},

which looks a lot like a standard convolution layer, except that the kernel coefficients are visited in reverse order.
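This correspondence can be checked numerically: the gradient of a loss with respect to the input of F.conv1d is exactly F.conv_transpose1d applied to the gradient with respect to the output. A sketch with arbitrary sizes:

import torch
import torch.nn.functional as F

# Backward of conv1d w.r.t. its input == transposed convolution of the
# gradient flowing back from the output (stride 1, no padding).
x = torch.randn(1, 1, 10, requires_grad = True)
k = torch.randn(1, 1, 3)

y = F.conv1d(x, k)
grad_y = torch.randn_like(y)
y.backward(grad_y)

print(torch.allclose(x.grad, F.conv_transpose1d(grad_y, k)))  # True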

This is actually the standard convolution operator from signal processing. If \ast denotes this operation, we have

(x \ast \kappa)_i = \sum_a x_a \, \kappa_{i-a+1}.

Coming back to the backward pass of the convolution layer, if y = x \circledast \kappa, then

\left[ \frac{\partial \ell}{\partial x} \right] = \left[ \frac{\partial \ell}{\partial y} \right] \ast \kappa.

In the deep-learning field, since it corresponds to transposing the weight matrix of the equivalent fully-connected layer, this operation is called a transposed convolution:

\begin{pmatrix}
\kappa_1 & \kappa_2 & \kappa_3 & 0 & 0 & 0 & 0 \\
0 & \kappa_1 & \kappa_2 & \kappa_3 & 0 & 0 & 0 \\
0 & 0 & \kappa_1 & \kappa_2 & \kappa_3 & 0 & 0 \\
0 & 0 & 0 & \kappa_1 & \kappa_2 & \kappa_3 & 0 \\
0 & 0 & 0 & 0 & \kappa_1 & \kappa_2 & \kappa_3
\end{pmatrix}^{T}
=
\begin{pmatrix}
\kappa_1 & 0 & 0 & 0 & 0 \\
\kappa_2 & \kappa_1 & 0 & 0 & 0 \\
\kappa_3 & \kappa_2 & \kappa_1 & 0 & 0 \\
0 & \kappa_3 & \kappa_2 & \kappa_1 & 0 \\
0 & 0 & \kappa_3 & \kappa_2 & \kappa_1 \\
0 & 0 & 0 & \kappa_3 & \kappa_2 \\
0 & 0 & 0 & 0 & \kappa_3
\end{pmatrix}

While a convolution can be seen as a series of inner products, a transposed convolution can be seen as a weighted sum of translated kernels.
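The matrix view can be checked explicitly on a small 1d example; the sketch below builds the convolution matrix by hand and verifies that F.conv_transpose1d multiplies by its transpose (sizes and values are arbitrary):

import torch
import torch.nn.functional as F

x = torch.tensor([[[0., 1., 2., 3., 4.]]])   # 1 x 1 x W
k = torch.tensor([[[1., 2., 3.]]])           # 1 x 1 x w

W, w = x.size(2), k.size(2)

# Matrix M such that conv1d(x, k) = M @ x: row i holds the kernel at offset i.
M = torch.zeros(W - w + 1, W)
for i in range(W - w + 1):
    M[i, i:i + w] = k[0, 0]

y = F.conv1d(x, k)
print(torch.allclose(y.view(-1), M @ x.view(-1)))                             # True

# The transposed convolution multiplies by M^T.
print(torch.allclose(F.conv_transpose1d(y, k).view(-1), M.t() @ y.view(-1)))  # True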

[Figure: 1d convolution layer, with an input of size W and a kernel of size w producing an output of size W − w + 1.]

[Figure: 1d transposed convolution layer, with an input of size W and a kernel of size w producing an output of size W + w − 1.]

torch.nn.functional.conv_transpose1d implements the operation we just described. It takes as input a batch of multi-channel samples, and produces a batch of multi-channel samples.

>>> x = Variable(Tensor([[[0, 0, 1, 0, 0, 0, 0]]]))
>>> k = Variable(Tensor([[[1, 2, 3]]]))
>>> F.conv1d(x, k)
Variable containing:
(0 ,.,.) =
  3  2  1  0  0
[torch.FloatTensor of size 1x1x5]

>>> F.conv_transpose1d(x, k)
Variable containing:
(0 ,.,.) =
  0  0  1  2  3  0  0  0  0
[torch.FloatTensor of size 1x1x9]

The class torch.nn.ConvTranspose1d implements that operation as a torch.nn.Module.

>>> x = Variable(Tensor([[[ 2, 3, 0, -1 ]]]))
>>> m = nn.ConvTranspose1d(1, 1, kernel_size = 3)
>>> m.bias.data.zero_()
 0
[torch.FloatTensor of size 1]
>>> m.weight.data.copy_(Tensor([ 1, 2, -1 ]))
(0 ,.,.) =
  1  2 -1
[torch.FloatTensor of size 1x1x3]
>>> y = m(x)
>>> y
Variable containing:
(0 ,.,.) =
  2  7  4 -4 -2  1
[torch.FloatTensor of size 1x1x6]

Transposed convolutions also have a dilation parameter that behaves as for convolutions and expands the kernel size without increasing the number of parameters, by making it sparse.

They also have stride and padding parameters; however, due to the relation between convolutions and transposed convolutions, while for convolutions stride and padding are defined in the input map, for transposed convolutions these parameters are defined in the output map, and the latter modulates a cropping operation.

[Figure: 1d transposed convolution layer with stride s = 2, with an input of size W and a kernel of size w producing an output of size s(W − 1) + w.]

The composition of a convolution and a transposed convolution with the same parameters keeps the signal size [roughly] unchanged. A convolution with a stride greater than one may ignore parts of the signal. Its composition with the corresponding transposed convolution generates a map of the size of the observed area.

For instance, a 1d convolution of kernel size w and stride s composed with the transposed convolution of same parameters maintains the signal size W only if

\exists q \in \mathbb{N}, \; W = w + s q.
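A small sketch checking this condition, where the kernel size and stride are arbitrary choices:

import torch
import torch.nn as nn

w, s = 4, 2   # illustrative kernel size and stride
conv = nn.Conv1d(1, 1, kernel_size = w, stride = s)
tconv = nn.ConvTranspose1d(1, 1, kernel_size = w, stride = s)

for W in range(8, 14):
    x = torch.randn(1, 1, W)
    y = tconv(conv(x))
    # The size is preserved exactly when W = w + s*q for some integer q.
    print(W, y.size(2), (W - w) % s == 0)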

It has been observed that transposed convolutions may create grid-structure artifacts, since generated pixels are not all covered similarly, for instance with a 4 × 4 kernel and stride 3.

An alternative is to use an analytic up-scaling. Two standard such PyTorch modules are nn.UpsamplingBilinear2d and nn.Upsample.

>>> x = Variable(Tensor([[[[ 1, 2 ], [ 3, 4 ]]]]))
>>> b = nn.UpsamplingBilinear2d(scale_factor = 3)
>>> b(x)
Variable containing:
(0 ,0 ,.,.) =
  1.0000  1.2000  1.4000  1.6000  1.8000  2.0000
  1.4000  1.6000  1.8000  2.0000  2.2000  2.4000
  1.8000  2.0000  2.2000  2.4000  2.6000  2.8000
  2.2000  2.4000  2.6000  2.8000  3.0000  3.2000
  2.6000  2.8000  3.0000  3.2000  3.4000  3.6000
  3.0000  3.2000  3.4000  3.6000  3.8000  4.0000
[torch.FloatTensor of size 1x1x6x6]

>>> u = nn.Upsample(scale_factor = 3, mode = 'nearest')
>>> u(x)
Variable containing:
(0 ,0 ,.,.) =
  1  1  1  2  2  2
  1  1  1  2  2  2
  1  1  1  2  2  2
  3  3  3  4  4  4
  3  3  3  4  4  4
  3  3  3  4  4  4
[torch.FloatTensor of size 1x1x6x6]

Such a module is usually combined with a convolution to learn local corrections to undesirable artifacts of the up-scaling. In practice, a transposed convolution such as

nn.ConvTranspose2d(nic, noc, kernel_size = 3, stride = 2,
                   padding = 1, output_padding = 1)

can be replaced by

nn.UpsamplingBilinear2d(scale_factor = 2)
nn.Conv2d(nic, noc, kernel_size = 3, padding = 1)

or

nn.Upsample(scale_factor = 2, mode = 'nearest')
nn.Conv2d(nic, noc, kernel_size = 3, padding = 1)
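As a quick shape check of this replacement, with illustrative channel counts and input size, both options produce a map with twice the spatial resolution:

import torch
import torch.nn as nn

nic, noc = 8, 16   # hypothetical channel counts

tconv = nn.ConvTranspose2d(nic, noc, kernel_size = 3, stride = 2,
                           padding = 1, output_padding = 1)
up_conv = nn.Sequential(
    nn.UpsamplingBilinear2d(scale_factor = 2),
    nn.Conv2d(nic, noc, kernel_size = 3, padding = 1),
)

x = torch.randn(1, nic, 14, 14)
print(tconv(x).size())    # torch.Size([1, 16, 28, 28])
print(up_conv(x).size())  # torch.Size([1, 16, 28, 28])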

Deep Autoencoders

A deep autoencoder combines an encoder composed of convolutional layers, and a decoder composed of the reciprocal transposed convolution layers.

To run a simple example on MNIST, we consider the following model, where dimension reduction is obtained through filter sizes and strides > 1, avoiding max-pooling layers.

AutoEncoder(
  (encoder): Sequential(
    (0): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU(inplace)
    (2): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
    (3): ReLU(inplace)
    (4): Conv2d(32, 32, kernel_size=(4, 4), stride=(2, 2))
    (5): ReLU(inplace)
    (6): Conv2d(32, 32, kernel_size=(3, 3), stride=(2, 2))
    (7): ReLU(inplace)
    (8): Conv2d(32, 8, kernel_size=(4, 4), stride=(1, 1))
  )
  (decoder): Sequential(
    (0): ConvTranspose2d(8, 32, kernel_size=(4, 4), stride=(1, 1))
    (1): ReLU(inplace)
    (2): ConvTranspose2d(32, 32, kernel_size=(3, 3), stride=(2, 2))
    (3): ReLU(inplace)
    (4): ConvTranspose2d(32, 32, kernel_size=(4, 4), stride=(2, 2))
    (5): ReLU(inplace)
    (6): ConvTranspose2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
    (7): ReLU(inplace)
    (8): ConvTranspose2d(32, 1, kernel_size=(5, 5), stride=(1, 1))
  )
)

Encoder: tensor sizes / operations

1×28×28
nn.Conv2d(1, 32, kernel_size=5, stride=1)
32×24×24
nn.Conv2d(32, 32, kernel_size=5, stride=1)
32×20×20
nn.Conv2d(32, 32, kernel_size=4, stride=2)
32×9×9
nn.Conv2d(32, 32, kernel_size=3, stride=2)
32×4×4
nn.Conv2d(32, 8, kernel_size=4, stride=1)
8×1×1

Decoder: tensor sizes / operations

8×1×1
nn.ConvTranspose2d(8, 32, kernel_size=4, stride=1)
32×4×4
nn.ConvTranspose2d(32, 32, kernel_size=3, stride=2)
32×9×9
nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2)
32×20×20
nn.ConvTranspose2d(32, 32, kernel_size=5, stride=1)
32×24×24
nn.ConvTranspose2d(32, 1, kernel_size=5, stride=1)
1×28×28
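These sizes can be verified by pushing a dummy MNIST-sized input through the same layers; a sketch that prints the shape after each convolution:

import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size = 5), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size = 5), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size = 4, stride = 2), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size = 3, stride = 2), nn.ReLU(),
    nn.Conv2d(32, 8, kernel_size = 4),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(8, 32, kernel_size = 4), nn.ReLU(),
    nn.ConvTranspose2d(32, 32, kernel_size = 3, stride = 2), nn.ReLU(),
    nn.ConvTranspose2d(32, 32, kernel_size = 4, stride = 2), nn.ReLU(),
    nn.ConvTranspose2d(32, 32, kernel_size = 5), nn.ReLU(),
    nn.ConvTranspose2d(32, 1, kernel_size = 5),
)

x = torch.randn(1, 1, 28, 28)
for m in list(encoder) + list(decoder):
    x = m(x)
    if not isinstance(m, nn.ReLU):
        print(type(m).__name__, tuple(x.size()))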

Training is achieved with MSE and Adam:

model = AutoEncoder(embedding_dim, nb_channels)
mse_loss = nn.MSELoss()

if torch.cuda.is_available():
    model.cuda()
    mse_loss.cuda()

optimizer = optim.Adam(model.parameters(), lr = 1e-3)

for epoch in range(args.nb_epochs):
    for input, _ in iter(train_loader):
        if torch.cuda.is_available():
            input = input.cuda()
        input = Variable(input)

        output = model(input)
        loss = mse_loss(output, input)

        model.zero_grad()
        loss.backward()
        optimizer.step()

[Figures: original samples X, deep autoencoder reconstructions g ∘ f(X) (CNN), and PCA reconstructions, for latent dimensions d = 2, 4, 8, 16, 32.]

To get an intuition of the latent representation, we can pick two samples x and x' at random and interpolate samples along the line in the latent space:

\forall x, x' \in \mathcal{X}^2, \; \alpha \in [0, 1], \quad \xi(x, x', \alpha) = g\big( (1 - \alpha) f(x) + \alpha f(x') \big).

[Figure: x and x' are encoded into f(x) and f(x'), and points along the segment between them are decoded with g.]
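A minimal sketch of this interpolation, assuming the trained autoencoder exposes model.encode and model.decode methods (an assumption about the model class used here) and given two images x and x_prime:

import torch

# Decode a few points along the segment between f(x) and f(x_prime) in the
# latent space; `steps` is an illustrative number of interpolation points.
def interpolate(model, x, x_prime, steps = 8):
    z, z_prime = model.encode(x), model.encode(x_prime)
    alphas = torch.linspace(0, 1, steps)
    return torch.cat([ model.decode((1 - a) * z + a * z_prime) for a in alphas ], 0)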

[Figures: autoencoder interpolation (d = 8) and autoencoder interpolation (d = 32).]

And we can assess the generative capabilities of the decoder g by introducing a [simple] density model q_Z over the latent space F, sampling there, and mapping the samples into the image space X with g.

We can for instance use a Gaussian model with diagonal covariance matrix,

f(X) \sim \mathcal{N}(\hat{m}, \hat{\Sigma}),

where \hat{m} is a vector and \hat{\Sigma} a diagonal matrix, both estimated on training data.

[Figure: a Gaussian density is fitted to the encoded training points in F, samples are drawn from it, and mapped back to X with g.]
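A hedged sketch of this procedure, under the same assumption that the model exposes encode and decode, and given a tensor train_input of training images:

import torch

# Fit a diagonal Gaussian to the latent codes of the training set and
# decode samples drawn from it.
def sample_from_latent_gaussian(model, train_input, nb = 16):
    with torch.no_grad():
        z = model.encode(train_input).view(train_input.size(0), -1)
        m_hat, s_hat = z.mean(0), z.std(0)                    # estimated mean and per-dim std
        z_new = torch.randn(nb, z.size(1)) * s_hat + m_hat    # samples from the fitted Gaussian
        return model.decode(z_new.view(nb, -1, 1, 1))         # map the samples back to X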

[Figures: autoencoder sampling for d = 8, d = 16, and d = 32.]

These results are unsatisfying, because the density model used on the latent space F is too simple and inadequate. Building a good model amounts to our original problem of modeling an empirical distribution, although it may now be in a lower-dimension space.

Denoising Autoencoders

Vincent et al. (2010) interpret the autoencoder in a probabilistic framework as a way of building an encoder that maximizes the mutual information between the input and the latent state.

Let X be a sample, Z = f(X; \theta) its latent representation, and q_\theta(x, z) the distribution of (X, Z). We have

\operatorname*{argmax}_\theta \, I(X, Z) = \operatorname*{argmax}_\theta \, H(X) - H(X \mid Z) = \operatorname*{argmax}_\theta \, {-H(X \mid Z)} = \operatorname*{argmax}_\theta \, \mathbb{E}\left[ \log q_\theta(X \mid Z) \right].

However, there is no expression of q_\theta(X \mid Z) in any reasonable setup.

For any distribution p we have

\mathbb{E}\left[ \log q_\theta(X \mid Z) \right] \geq \mathbb{E}\left[ \log p(X \mid Z) \right].

So we can in particular approximate the left term with the right one by optimizing a reconstruction model p_\eta to make the inequality tight.

If we consider the following model for p,

p_\eta(\cdot \mid Z = z) = \mathcal{N}(g(z), \sigma),

where g is deterministic, we get

\mathbb{E}\left[ \log p_\eta(X \mid Z) \right] = -\mathbb{E}\left[ \frac{\| X - g(Z; \eta) \|^2}{2\sigma^2} \right] + \text{cst} = -\mathbb{E}\left[ \frac{\| X - g(f(X; \theta); \eta) \|^2}{2\sigma^2} \right] + \text{cst}.

If optimizing \eta makes the bound tight, the final loss is the reconstruction error

\operatorname*{argmax}_\theta \, I(X, Z) \simeq \operatorname*{argmin}_\theta \, \min_\eta \frac{1}{N} \sum_{n=1}^{N} \| x_n - g(f(x_n; \theta); \eta) \|^2.

This abstract view of the encoder as maximizing information justifies its use to build generic encoding layers.

In the perspective of building a good feature representation, just retaining information is not enough, otherwise the identity would be a good choice. Reducing the dimension, or forcing sparsity, is a way to push the model to maximize the retained information in a constrained coding space.

In their work, Vincent et al. proposed to degrade the signal with noise before feeding it to the encoder, but to keep the MSE to the original signal. This forces the encoder to retain meaningful structures.
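In the MSE/Adam loop shown earlier, this amounts to corrupting the input before the forward pass while keeping the clean image as the target; a minimal sketch, where the noise level sigma is an illustrative choice:

import torch

# One denoising training step: corrupt the input with additive Gaussian
# noise, but compute the reconstruction loss against the clean input.
def denoising_step(model, mse_loss, input, sigma = 0.5):
    noisy_input = input + sigma * torch.randn_like(input)
    output = model(noisy_input)
    return mse_loss(output, input)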

Figure 6 of Vincent et al. (2010): Weight decay vs. Gaussian noise. Typical filters learnt from natural image patches in the over-complete case (200 hidden units). Left: regular autoencoder with weight decay. A wide range of weight-decay values and learning rates was tried; filters never appeared to capture a more interesting structure than what is shown, although some local blob detectors are recovered compared to using no weight decay at all. Right: a denoising autoencoder with additive Gaussian noise (σ = 0.5) learns Gabor-like local oriented edge detectors. Clearly the filters learnt are qualitatively very different in the two cases.

(Vincent et al., 2010)

Figure 7 of Vincent et al. (2010): Filters obtained on natural image patches by denoising autoencoders using other noise types. Left: with 10% salt-and-pepper noise, we obtain oriented Gabor-like filters; they appear slightly less localized than when using Gaussian noise. Right: with 55% zero-masking noise we obtain filters that look like oriented gratings. For the three considered noise types, denoising training appears to learn filters that capture meaningful natural image statistics structure.

(Vincent et al., 2010)

Vincent et al. build deep MLPs whose layers are initialized successively as encoders trained within a noisy autoencoder. A final classifying layer is added and the full structure can be fine-tuned.

[Figure: stacked training, where each successive layer is trained as the encoder of a denoising autoencoder on top of the already-trained layers.]

Table 3 of Vincent et al. (2010): comparison of stacked denoising autoencoders (SDAE-3) with other models (SVM_rbf, DBN-1, SAE-3, DBN-3) on MNIST and its variants (basic, rot, bg-rand, bg-img, bg-img-rot, rect, rect-img, convex, tzanetakis). Test error rate on all considered classification problems is reported together with a 95% confidence interval; the best performer is in bold, as well as those for which confidence intervals overlap. SDAE-3 appears to achieve performance superior or equivalent to the best other model on all problems except bg-rand. For SDAE-3, the fraction ν of corrupted input components (or, in the case of tzanetakis, the standard deviation of the Gaussian noise) is chosen by proper model selection. Note that SAE-3 is equivalent to SDAE-3 with ν = 0%.

(Vincent et al., 2010)

Variational Autoencoders

Coming back to generating a signal, instead of training an autoencoder and modeling the distribution of Z, we can try an alternative approach: impose a distribution for Z and then train a decoder g so that g(Z) matches the training data.

We consider the following two distributions:

- q is the distribution on \mathcal{X} \times \mathbb{R}^d of a pair (X, Z) composed of a sample X taken from the data distribution and the output of the encoder on it,
- p is the distribution on \mathcal{X} \times \mathbb{R}^d of a pair (X, Z) composed of an encoding state Z \sim \mathcal{N}(0, I) and the output of the decoder on it.

We should ideally look for the g that maximizes the [empirical] log-likelihood

\frac{1}{N} \sum_n \log p(x_n).

However, while we can sample z and compute g(z), we cannot compute p(x) for a given x, and even less compute its derivatives.

The Variational Autoencoder proposed by Kingma and Welling (2013) relies on a tractable approximation of this log-likelihood. Their framework considers a stochastic encoder f and decoder g, whose outputs depend on their inputs as usual but with some remaining randomness.

We can equivalently maximize (all expectations are under q)

S = \mathbb{E}\left[ \log p(X) - \log q(X) \right]
  = \mathbb{E}\left[ \log p(X, Z) - \log p(Z \mid X) - \log q(X, Z) + \log q(Z \mid X) \right]
  = \mathbb{E}\left[ \log \frac{p(X, Z)}{q(X, Z)} - \log \frac{p(Z \mid X)}{q(Z \mid X)} \right]
  = -D_{KL}\big(q(X, Z) \,\|\, p(X, Z)\big) + \mathbb{E}\left[ D_{KL}\big(q(Z \mid X) \,\|\, p(Z \mid X)\big) \right].

While the second term cannot be properly estimated, we can use the first as a lower bound, and rewrite it [up to an additive constant that does not depend on the model parameters]

S \geq -D_{KL}\big(q(X, Z) \,\|\, p(X, Z)\big) = -\mathbb{E}\left[ D_{KL}\big(q(Z \mid X) \,\|\, p(Z)\big) \right] + \mathbb{E}\left[ \log p(X \mid Z) \right].

It pushes to match the joint distributions on (X, Z), while matching the marginals on X alone would be enough.

Kingma and Welling use Gaussians with diagonal covariance for both q(Z \mid X) and p(X \mid Z). So in practice, the encoder maps a data point from the signal space \mathbb{R}^c to [the parameters of] a Gaussian in the latent space \mathbb{R}^d,

f : \mathbb{R}^c \to \mathbb{R}^{2d}, \quad x \mapsto \left( \mu^f_1, \dots, \mu^f_d, \sigma^f_1, \dots, \sigma^f_d \right),

and the decoder maps a latent value from \mathbb{R}^d to [the parameters of] a Gaussian in the signal space \mathbb{R}^c,

g : \mathbb{R}^d \to \mathbb{R}^{2c}, \quad z \mapsto \left( \mu^g_1, \dots, \mu^g_c, \sigma^g_1, \dots, \sigma^g_c \right).

[Figure: in the variational autoencoder, the encoder f maps a point of the original space X to [the parameters of] a Gaussian in the latent space F, and the decoder g maps a latent value to [the parameters of] a Gaussian in X.]

We have to minimize

\mathscr{L} = \hat{\mathbb{E}}\left[ D_{KL}\big(q(Z \mid X) \,\|\, p(Z)\big) \right] - \hat{\mathbb{E}}\left[ \log p(X \mid Z) \right].

Since q(Z \mid X) and p(Z) are Gaussian, we have

D_{KL}\big(q(Z \mid X = x) \,\|\, p(Z)\big) = -\frac{1}{2} \sum_d \left( 1 + 2 \log \sigma^f_d(x) - \left( \mu^f_d(x) \right)^2 - \left( \sigma^f_d(x) \right)^2 \right).

And with

z_{n,l} \sim \mathcal{N}\left( \mu^f(x_n), \sigma^f(x_n) \right), \quad n = 1, \dots, N, \; l = 1, \dots, L,

we have

\hat{\mathbb{E}}\left[ \log p(X \mid Z) \right] \simeq -\frac{1}{N L} \sum_{n=1}^{N} \sum_{l=1}^{L} \sum_c \frac{\left( x_{n,c} - \mu^g_c(z_{n,l}) \right)^2}{2 \left( \sigma^g_c(z_{n,l}) \right)^2} + \text{cst}.

Kingma and Welling point out that using L = 1 is enough.
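The closed form of the KL term can be sanity-checked against torch.distributions on a single latent coordinate (the values of mu and sigma below are arbitrary):

import torch
from torch.distributions import Normal, kl_divergence

mu, sigma = torch.tensor(0.7), torch.tensor(0.3)   # illustrative values

# -1/2 (1 + 2 log sigma - mu^2 - sigma^2) vs. KL(N(mu, sigma) || N(0, 1))
closed_form = - 0.5 * (1 + 2 * sigma.log() - mu.pow(2) - sigma.pow(2))
library = kl_divergence(Normal(mu, sigma), Normal(torch.tensor(0.0), torch.tensor(1.0)))
print(closed_form, library)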

For MNIST, we keep our convolutional structure, but the encoder now maps to twice the number of dimensions, which corresponds to the \mu^f s and \sigma^f s, and we use a fixed variance for the decoder. We use Adam for training, and the loss estimate for the standard autoencoder

output = model(input)
loss = mse_loss(output, input)

becomes

param = model.encode(input)
mu, logvar = param.split(param.size(1) // 2, 1)
logvar = logvar + math.log(0.01)
std = logvar.mul(0.5).exp()

kl = - 0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
kl = kl.mean()

u = Variable(mu.data.new(mu.size()).normal_())
z = u * std + mu
output = model.decode(z)

loss = mse_loss(output, input) / (2 * 0.01) + kl

During inference we do not sample, and instead use \mu^f and \mu^g as prediction.

[Figures: original samples, autoencoder reconstruction (d = 32), and variational autoencoder reconstruction (d = 32).]

[Figures: autoencoder sampling (d = 32) and variational autoencoder sampling (d = 32).]

Non-Volume Preserving network

A standard result of probability theory is that if f is continuous, invertible and [almost everywhere] differentiable, then

\forall x, \quad p_{f^{-1}(Z)}(x) = p_Z(f(x)) \left| J_f(x) \right|.

[Figure: a density p_Z, an invertible mapping f, and the resulting density p_{f^{-1}(Z)}.]
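A one-dimensional sanity check of this formula, with the illustrative affine map f(x) = a x + b, for which |J_f(x)| = |a| and f^{-1}(Z) is again Gaussian:

import torch
from torch.distributions import Normal

a, b = 2.0, -1.0                       # assumed affine map f(x) = a*x + b
p_Z = Normal(0.0, 1.0)

x = torch.linspace(-3.0, 3.0, 7)
lhs = Normal(-b / a, 1 / abs(a)).log_prob(x).exp()   # density of f^{-1}(Z) at x
rhs = p_Z.log_prob(a * x + b).exp() * abs(a)         # p_Z(f(x)) |J_f(x)|
print(torch.allclose(lhs, rhs))  # True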

From this equality, if f is a parametric function such that we can compute [and differentiate] p_Z(f(x)) and \left| J_f(x) \right|, then we can make the distribution of f^{-1}(Z) fit the data by optimizing

\sum_n \log p_{f^{-1}(Z)}(x_n) = \sum_n \left( \log p_Z(f(x_n)) + \log \left| J_f(x_n) \right| \right).

If we are able to do so, then we can synthesize a new X by sampling Z \sim \mathcal{N}(0, I) and computing f^{-1}(Z).

If Z \sim \mathcal{N}(0, I),

\log p_Z(f(x_n)) = -\frac{1}{2} \left( \| f(x_n) \|^2 + d \log 2\pi \right).

And remember that if f is a composition of functions, f = f^{(K)} \circ \dots \circ f^{(1)}, we have

J_f(x) = \prod_{k=1}^{K} J_{f^{(k)}}\left( f^{(k-1)} \circ \dots \circ f^{(1)}(x) \right),

so

\log \left| J_f(x) \right| = \sum_{k=1}^{K} \log \left| J_{f^{(k)}}\left( f^{(k-1)} \circ \dots \circ f^{(1)}(x) \right) \right|.

If the f^{(k)} are standard layers, we cannot compute f^{-1}(z), and computing \left| J_f(x) \right| is intractable.

Dinh et al. (2014) introduced the coupling layers to address both issues. The resulting Non-Volume Preserving network (NVP) is an example of a normalizing flow (Rezende and Mohamed, 2015).

We use here the formalism from Dinh et al. (2016). Given a dimension d, a Boolean vector b \in \{0, 1\}^d and two mappings

s : \mathbb{R}^d \to \mathbb{R}^d, \quad t : \mathbb{R}^d \to \mathbb{R}^d,

we define a [fully connected] coupling layer as the transformation

c : \mathbb{R}^d \to \mathbb{R}^d, \quad x \mapsto b \odot x + (1 - b) \odot \left( x \odot \exp(s(b \odot x)) + t(b \odot x) \right),

where exp is component-wise, and \odot is the Hadamard component-wise product.

The expression

c(x) = b \odot x + (1 - b) \odot \left( x \odot \exp(s(b \odot x)) + t(b \odot x) \right)

can be understood as: forward b \odot x unchanged, and apply to (1 - b) \odot x an invertible transformation parametrized by b \odot x.

[Figure: computational graph of the coupling layer, combining s, exp, t, the component-wise product and the sum.]

The consequence is that c is invertible, and if y = c(x),

x = b \odot y + (1 - b) \odot \left( y - t(b \odot y) \right) \odot \exp(-s(b \odot y)).

[Figure: computational graph of the inverse mapping.]
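A quick numerical check of this invertibility, with illustrative one-layer maps for s and t and an arbitrary mask b:

import torch
import torch.nn as nn

dim = 6
b = torch.zeros(1, dim)
b[0, :dim // 2] = 1.0                                  # arbitrary mask

s = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
t = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

x = torch.randn(1, dim)
y = b * x + (1 - b) * (x * s(b * x).exp() + t(b * x))             # c(x)
x_back = b * y + (1 - b) * (y - t(b * y)) * (- s(b * y)).exp()    # c^{-1}(y)
print(torch.allclose(x, x_back, atol = 1e-6))  # True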

The second property of this mapping is the simplicity of its Jacobian determinant. Since

c_i(x) = b_i x_i + (1 - b_i) \left( x_i \exp(s_i(b \odot x)) + t_i(b \odot x) \right),

we have, \forall i, j, x,

b_i = 1 \;\Rightarrow\; c_i(x) = x_i \;\text{ and }\; \frac{\partial c_i}{\partial x_j} = 1_{\{i=j\}},

and

b_i = 0 \;\Rightarrow\; c_i(x) = x_i \exp(s_i(b \odot x)) + t_i(b \odot x)

\frac{\partial c_i}{\partial x_j} = 1_{\{i=j\}} \exp(s_i(b \odot x)) + x_i \exp(s_i(b \odot x)) \underbrace{\frac{\partial s_i(b \odot x)}{\partial x_j}}_{0 \text{ if } b_j = 0} + \underbrace{\frac{\partial t_i(b \odot x)}{\partial x_j}}_{0 \text{ if } b_j = 0}

\frac{\partial c_i}{\partial x_j} = 1_{\{i=j\}} \exp(s_i(b \odot x)) + b_j \left( \dots \right).

Hence \frac{\partial c_i}{\partial x_j} can be non-zero only if i = j, or (1 - b_i) b_j = 1.

If we re-order both the rows and the columns of the Jacobian to put first the entries where b is non-zero, and then those where it is zero, it becomes lower triangular,

J_c(x) = \begin{pmatrix} I & 0 \\ \ast & \operatorname{diag}\left( \exp(s_i(x \odot b)), \; i : b_i = 0 \right) \end{pmatrix},

its determinant remains unchanged, and we have

\log \left| J_{f^{(k)}}(x) \right| = \sum_{i : b_i = 0} s_i(x \odot b) = \sum_i \left( (1 - b) \odot s(x \odot b) \right)_i.

dim = 6

x = Variable(Tensor(1, dim).normal_(), requires_grad = True)

b = Variable(Tensor(1, dim).zero_())
b.data.narrow(1, 0, dim // 2).fill_(1.0)

s = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
t = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

c = b * x + (1 - b) * (x * s(b * x).exp() + t(b * x))

j = torch.cat([torch.autograd.grad(c_k, x, retain_graph = True)[0]
               for c_k in c[0]])

print(j)

prints a 6 × 6 Jacobian which is lower triangular, with ones on the diagonal entries of the forwarded coordinates and exp(s_i(b ⊙ x)) on the others (the exact values depend on the random initialization).

To recap, with f = f^{(K)} \circ \dots \circ f^{(1)}, where the f^{(k)}, k = 1, \dots, K, are coupling layers, and

x_n^{(0)} = x_n, \quad x_n^{(k)} = f^{(k)}\left( x_n^{(k-1)} \right),

we train by maximizing

\mathscr{L}(f) = \sum_n \left( -\frac{1}{2} \left( \left\| x_n^{(K)} \right\|^2 + d \log 2\pi \right) + \sum_{k=1}^{K} \log \left| J_{f^{(k)}}\left( x_n^{(k-1)} \right) \right| \right),

with

\log \left| J_{f^{(k)}}(x) \right| = \sum_i \left( \left( 1 - b^{(k)} \right) \odot s^{(k)}\left( x \odot b^{(k)} \right) \right)_i.

And to sample we just need to generate Z \sim \mathcal{N}(0, I) and compute f^{-1}(Z).

A coupling layer can be implemented with

class NVPCouplingLayer(Module):
    def __init__(self, map_s, map_t, b):
        super(NVPCouplingLayer, self).__init__()
        self.map_s = map_s
        self.map_t = map_t
        self.b = Variable(b.clone().unsqueeze(0), requires_grad = False)

    def forward(self, x_and_logdetjac):
        x, logdetjac = x_and_logdetjac
        s, t = self.map_s(self.b * x), self.map_t(self.b * x)
        logdetjac += ((1 - self.b) * s).sum(1)
        y = self.b * x + (1 - self.b) * (torch.exp(s) * x + t)
        return (y, logdetjac)

    def invert(self, y):
        s, t = self.map_s(self.b * y), self.map_t(self.b * y)
        return self.b * y + (1 - self.b) * (torch.exp(-s) * (y - t))

The forward pass here computes both the image of x and the update on the accumulated log-determinant of the Jacobian, i.e. (x, u) ↦ (f(x), u + log|J_f(x)|).

We can then define a complete network with one-hidden-layer tanh MLPs for the s and t mappings

class NVPNet(Module):
    def __init__(self, dim, hdim, depth):
        super(NVPNet, self).__init__()
        b = Tensor(dim)
        self.layers = nn.ModuleList()
        for d in range(depth):
            if d % 2 == 0:
                # Tag half the dimensions
                i = torch.randperm(b.numel()).narrow(0, 0, b.numel() // 2)
                b.zero_()[i] = 1
            else:
                b = 1 - b
            map_s = nn.Sequential(nn.Linear(dim, hdim), nn.Tanh(),
                                  nn.Linear(hdim, dim))
            map_t = nn.Sequential(nn.Linear(dim, hdim), nn.Tanh(),
                                  nn.Linear(hdim, dim))
            self.layers.append(NVPCouplingLayer(map_s, map_t, b))

    def forward(self, x):
        for m in self.layers:
            x = m(x)
        return x

    def invert(self, y):
        for m in reversed(self.layers):
            y = m.invert(y)
        return y

And the log-probability of the individual samples of a batch

def LogProba(x_and_logdetjac):
    (x, logdetjac) = x_and_logdetjac
    log_p = logdetjac - 0.5 * x.pow(2).add(math.log(2 * math.pi)).sum(1)
    return log_p

173 Training is achieved by maximizing the mean log-proba

batch_size = 100

model = NVPNet(dim = 2, hdim = 2, depth = 4)
optimizer = optim.Adam(model.parameters(), lr = 1e-2)

for e in range(args.nb_epochs):
    for b in range(0, nb_train_samples, batch_size):
        output = model((input.narrow(0, b, batch_size), 0))
        loss = - LogProba(output).mean()
        model.zero_grad()
        loss.backward()
        optimizer.step()

Finally, we can sample according to the distribution of f^{-1}(Z), with Z \sim \mathcal{N}(0, I):

z = Variable(Tensor(nb_train_samples, dim).normal_())
x = model.invert(z).data

François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 71 / 80
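The slides do not show how input is built; a minimal sketch of a possible toy training set, here a balanced mixture of two 2d Gaussians (the exact choice of distribution is our assumption, any simple 2d distribution would do):

nb_train_samples, dim = 1000, 2

# Hypothetical toy data; the name "input" matches the one used in the training loop above.
centers = Tensor([[-2.0, 0.0], [2.0, 0.0]])
labels = (torch.rand(nb_train_samples) < 0.5).long()
input = Variable(centers[labels] + 0.5 * torch.randn(nb_train_samples, dim))

The args.nb_epochs used in the training loop is likewise assumed to come from a command-line parser that is not shown.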

175 [Figure: real samples ("Real") and samples synthesized by the trained model ("Synth") for the toy 2d problem.]

François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 72 / 80

180 Dinh et al. (2016) apply this approach to convolutional layers by using masks b consistent with the activation map structure, and by reducing the map size while increasing the number of channels.

Figure 3: Masking schemes for affine coupling layers. On the left, a spatial checkerboard pattern mask. On the right, a channel-wise masking. The squeezing operation reduces the 4 × 4 × 1 tensor (on the left) into a 2 × 2 × 4 tensor (on the right). Before the squeezing operation, a checkerboard pattern is used for coupling layers while a channel-wise masking pattern is used afterward.

The affine coupling layer and its inverse are

y_{1:d} = x_{1:d}, \quad y_{d+1:D} = x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d}),   (7)

x_{1:d} = y_{1:d}, \quad x_{d+1:D} = \big(y_{d+1:D} - t(y_{1:d})\big) \odot \exp\big(-s(y_{1:d})\big),   (8)

meaning that sampling is as efficient as inference for this model. (Dinh et al., 2016)

François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 73 / 80
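To illustrate what masks "consistent with the activation map structure" look like, here is one possible way to build the two mask types for a c × h × w activation tensor (a sketch of ours; the function names and the exact layout are not from the paper):

def checkerboard_mask(c, h, w):
    # b[:, j, k] = 1 where j + k is even: a spatial checkerboard, shared across channels.
    b = torch.zeros(c, h, w)
    for j in range(h):
        for k in range(w):
            if (j + k) % 2 == 0:
                b[:, j, k] = 1
    return b

def channelwise_mask(c, h, w):
    # The first half of the channels is left unchanged, the second half is transformed.
    b = torch.zeros(c, h, w)
    b[:c // 2] = 1
    return b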

181 They combine these layers by alternating masks, and branching out half of the channels at certain points to forward them unchanged.

Figure 4: Composition schemes for affine coupling layers. (a) In this alternating pattern, units which remain identical in one transformation are modified in the next. (b) Factoring out variables. At each step, half the variables are directly modeled as Gaussians, while the other half undergo further transformation.

3.6 Multi-scale architecture

We implement a multi-scale architecture using a squeezing operation: for each channel, it divides the image into subsquares of shape 2 × 2 × c, then reshapes them into subsquares of shape 1 × 1 × 4c. The squeezing operation transforms an s × s × c tensor into an s/2 × s/2 × 4c tensor (see Figure 3), effectively trading spatial size for number of channels. At each scale, we combine several operations into a sequence: we first apply three coupling layers with alternating checkerboard masks, then perform a squeezing operation, and finally apply three more coupling layers with alternating channel-wise masks. (Dinh et al., 2016)

François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 74 / 80
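The squeezing operation can be written with two reshapes and a permutation; a minimal sketch of ours for N × C × S × S tensors in the usual PyTorch layout (the exact channel ordering may differ from the paper's):

def squeeze(x):
    # (N, C, S, S) -> (N, 4C, S/2, S/2): each 2x2 spatial block becomes four channels.
    n, c, h, w = x.size()
    x = x.view(n, c, h // 2, 2, w // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, 4 * c, h // 2, w // 2)

def unsqueeze(x):
    # Inverse of squeeze: (N, 4C, S/2, S/2) -> (N, C, S, S).
    n, c4, h, w = x.size()
    x = x.view(n, c4 // 4, 2, 2, h, w)
    x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
    return x.view(n, c4 // 4, 2 * h, 2 * w)

x = torch.randn(1, 3, 8, 8)
print(squeeze(x).size())                        # torch.Size([1, 12, 4, 4])
print((unsqueeze(squeeze(x)) - x).abs().max())  # 0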

182 The structure for generating images consists of two stages of (3 checkerboard coupling layers, a squeezing layer, 3 channel coupling layers, a factor-out layer), followed by one stage of (4 checkerboard coupling layers, a factor-out layer). The s and t mappings get more complex in the later layers. François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 75 / 80

183 Figure 7: Samples from a model trained on Imagenet (64 × 64). (Dinh et al., 2016) François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 76 / 80

184 Figure 8: Samples from a model trained on CelebA. (Dinh et al., 2016) François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 77 / 80

185 Figure 9: Samples from a model trained on LSUN (bedroom category). (Dinh et al., 2016) François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 78 / 80

186 Figure 10: Samples from a model trained on LSUN (church outdoor category). (Dinh et al., 2016) François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 79 / 80

187 Figure 6: Manifold generated from four examples in the dataset. Clockwise from top left: CelebA, Imagenet (64 × 64), LSUN (tower), LSUN (bedroom). (Dinh et al., 2016)

[...] over sample quality in a limited capacity setting. As a result, our model outputs sometimes highly improbable samples as we can notice especially on CelebA. As opposed to variational autoencoders, the samples generated from our model look not only globally coherent but also sharp. Our hypothesis is that as opposed to these models, real NVP does not rely on fixed form reconstruction cost like an L2 norm which tends to reward capturing low frequency components more heavily than high frequency components. Unlike autoregressive models, sampling from our model is done very efficiently as it is [...]

François Fleuret EE-559 Deep learning / 9. Autoencoders and generative models 80 / 80
