Reading Group on Deep Learning Session 2


Reading Group on Deep Learning Session 2 Stéphane Lathuilière & Pablo Mesejo 10 June 2016 1/39

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. Possible Future Topics. 2/39


5.5. Regularization in Neural Networks First approach: Number of hidden units Figure: 5.9. Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively. 3/39

5.5. Gaussian priors Second approach: Previously we proposed to train the network by maximizing the likelihood

p(t | x, w) = N(t | y(x, w), β⁻¹)   (5.12)

Now we maximize the posterior probability p(w | t, x, λ) ∝ p(t | x, w) p(w | λ), with the prior p(w | λ) = N(w | 0, λ⁻¹ I). Taking the negative logarithm, we obtain the error function

Ẽ(w) = E(w) + (λ/2) wᵀw   (5.112)

4/39
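To make the weight-decay objective (5.112) concrete, here is a minimal NumPy sketch (not from the slides; the function names are illustrative) of the regularized error, i.e. the negative log-posterior up to an additive constant:

```python
import numpy as np

def sum_of_squares_error(y_pred, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    return 0.5 * np.sum((y_pred - t) ** 2)

def regularized_error(y_pred, t, w, lam):
    """E~(w) = E(w) + (lam/2) * w^T w, i.e. the negative log-posterior
    up to an additive constant (Gaussian likelihood + Gaussian prior)."""
    return sum_of_squares_error(y_pred, t) + 0.5 * lam * w @ w

# toy usage: random predictions, targets and a flattened weight vector
rng = np.random.default_rng(0)
y_pred, t, w = rng.normal(size=10), rng.normal(size=10), rng.normal(size=25)
print(regularized_error(y_pred, t, w, lam=0.1))
```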

5.5. Consistent Gaussian priors Let's illustrate the inconsistency of a simple Gaussian prior. First hidden layer:

z_j = h( Σ_i w_ji x_i + w_j0 )   (5.113)

Output units:

y_k = Σ_j w_kj z_j + w_k0   (5.114)

We apply a linear transformation to the data: x̃_i = a x_i + b. Equivalent network:

w̃_ji = (1/a) w_ji,   w̃_j0 = w_j0 − (b/a) Σ_i w_ji

5/39

5.5. Consistent Gaussian priors We apply a similar transformation to the output: ỹ_k = c y_k + d. Equivalent network:

w̃_kj = c w_kj,   w̃_k0 = c w_k0 + d

Problem: there is no λ̃ such that λ̃ w̃ᵀw̃ = λ wᵀw. Thus, if we train an MLP on the transformed data, we do not obtain the equivalent network described above.

6/39

5.5. Consistent Gaussian priors Solution: We consider the following prior

p(w | α₁, α₂) ∝ exp( −(α₁/2) Σ_{w∈W₁} w² − (α₂/2) Σ_{w∈W₂} w² )   (5.122)

and we obtain the following regularizer

(λ₁/2) Σ_{w∈W₁} w² + (λ₂/2) Σ_{w∈W₂} w²   (5.121)

where W₁ and W₂ denote the first- and second-layer weights. Then we can choose λ̃₁ = a^{1/2} λ₁ and λ̃₂ = c^{−1/2} λ₂.

7/39
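A minimal sketch of the group-wise regularizer (5.121), assuming the weights have already been split into the groups W₁ and W₂ (biases, left unregularized here, would form further groups as on the next slide); the function name and the toy weight shapes are illustrative:

```python
import numpy as np

def consistent_regularizer(w_groups, lambdas):
    """(5.121)-style regularizer: sum_g (lambda_g / 2) * sum_{w in W_g} w^2,
    with one coefficient per weight group (e.g. first-layer weights,
    second-layer weights), so the penalty can be rescaled consistently
    under linear transformations of inputs/outputs."""
    return sum(0.5 * lam * np.sum(w ** 2) for w, lam in zip(w_groups, lambdas))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 5))   # first-layer weights (group W1)
W2 = rng.normal(size=(2, 3))   # second-layer weights (group W2)
print(consistent_regularizer([W1, W2], lambdas=[0.1, 0.01]))
```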

5.5. Consistent Gaussian priors

p(w | α₁ʷ, α₁ᵇ, α₂ʷ, α₂ᵇ) ∝ exp( −(α₁ʷ/2) Σ_{w∈W₁ʷ} w² − (α₁ᵇ/2) Σ_{w∈W₁ᵇ} w² − (α₂ʷ/2) Σ_{w∈W₂ʷ} w² − (α₂ᵇ/2) Σ_{w∈W₂ᵇ} w² )

Figure: 5.11. Samples drawn from the prior, and plots of the corresponding network functions.

8/39

5.5. Early stopping Third approach: Figure: 5.12. Illustration of the behaviour of training set error and validation set error 9/39

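A generic early-stopping loop, as a hedged sketch (not from the slides): train while monitoring validation error, remember the best weights, and stop once the validation error has not improved for a fixed number of epochs. The `model`, `train_step` and `val_error` arguments are placeholders for whatever training setup is in use:

```python
import copy

def train_with_early_stopping(model, train_step, val_error, max_epochs=1000, patience=10):
    """Keep the weights that gave the lowest validation error and stop when it
    has not improved for `patience` consecutive epochs."""
    best_err, best_model, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                 # one pass of gradient-based training
        err = val_error(model)            # error on the held-out validation set
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # validation error stopped improving
    return best_model, best_err

# toy usage: "training" drives a scalar parameter toward 2.0
model = {"w": 5.0}
train_step = lambda m: m.update(w=m["w"] - 0.1 * (m["w"] - 2.0))
val_error = lambda m: (m["w"] - 2.0) ** 2
best, err = train_with_early_stopping(model, train_step, val_error)
print(best, err)
```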

5.5. Learning Invariances by Augmenting the Data Fourth approach: Figure: 5.14. Illustration of the synthetic warping of a handwritten digit. The original image is shown on the left. On the right, the top row shows three examples of warped digits, with the corresponding displacement fields shown on the bottom row. 10/39

5.5. Learning Invariances by Augmenting the Data We consider: a transformation governed by a single parameter ξ; a transformation s(x, ξ) with s(x, 0) = x; a sum-of-squares error function. Error function for the untransformed data (infinite dataset):

E = (1/2) ∫∫ {y(x) − t}² p(t | x) p(x) dx dt   (5.129)

Error function for the expanded data:

Ẽ = (1/2) ∫∫∫ {y(s(x, ξ)) − t}² p(t | x) p(x) p(ξ) dx dt dξ   (5.130)

11/39

5.5. Learning Invariances by Augmenting the Data Expand s as a Taylor series:

s(x, ξ) = s(x, 0) + ξ ∂s/∂ξ |_{ξ=0} + (ξ²/2) ∂²s/∂ξ² |_{ξ=0} + O(ξ³) = x + ξτ + (ξ²/2) τ′ + O(ξ³)

This allows us to expand the model function:

y(s(x, ξ)) = y(x) + ξ τᵀ∇y(x) + (ξ²/2) [ (τ′)ᵀ∇y(x) + τᵀ∇∇y(x) τ ] + O(ξ³)

12/39

5.5. Learning Invariances by Augmenting the Data Substituting into Ẽ:

Ẽ = E + λΩ   (5.131)

with

Ω = (1/2) ∫ (τᵀ∇y(x))² p(x) dx   (5.134)

and if we choose the transformation s : x ↦ x + ξ, then

Ω = (1/2) ∫ ‖∇y(x)‖² p(x) dx   (5.135)

This is known as Tikhonov regularization.

13/39
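The regularizer (5.135) can be approximated numerically by replacing the expectation over p(x) with an average over the training inputs and the gradient ∇y(x) with central finite differences. A small illustrative sketch (the network y and the data X below are hypothetical stand-ins):

```python
import numpy as np

def tikhonov_penalty(y, X, eps=1e-4):
    """Monte-Carlo / finite-difference sketch of (5.135):
    Omega ~= (1/2N) * sum_n ||grad_x y(x_n)||^2,
    approximating the expectation over p(x) by the training inputs."""
    N, D = X.shape
    total = 0.0
    for x in X:
        grad = np.zeros(D)
        for d in range(D):
            e = np.zeros(D); e[d] = eps
            grad[d] = (y(x + e) - y(x - e)) / (2 * eps)   # central difference
        total += grad @ grad
    return 0.5 * total / N

# toy usage with a hypothetical scalar network y(x)
y = lambda x: np.tanh(x @ np.array([1.0, -2.0]))
X = np.random.default_rng(0).normal(size=(100, 2))
print(tikhonov_penalty(y, X))
```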

5.5. Convolutional networks (Convnets) Fifth approach: Figure: Cat recognition Weight Sharing! 14/39

5.5. Convolutional networks (Convnets) Figure: Architecture of LeNet-5, a Convolutional Neural Network, here for digit recognition. Each plane is a feature map, i.e. a set of units whose weights are constrained to be identical.¹ First layer: without weight sharing: 5×5×6×28×28 = 117600 weights; with weight sharing: 5×5×6 = 150 weights.

¹ Y. LeCun et al.: Gradient-Based Learning Applied to Document Recognition, 1998

15/39
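The parameter counts quoted for LeNet-5's first layer can be reproduced with a few lines (biases and extra input channels are ignored, as on the slide; the function name is illustrative):

```python
def conv_layer_params(kernel_h, kernel_w, n_maps, out_h, out_w, weight_sharing=True):
    """Weight count for the first convolutional layer of a LeNet-5-style
    network (biases ignored, single input channel), reproducing the numbers
    on the slide."""
    per_map = kernel_h * kernel_w
    if weight_sharing:
        return per_map * n_maps                     # one shared kernel per feature map
    return per_map * n_maps * out_h * out_w         # independent weights per output unit

print(conv_layer_params(5, 5, 6, 28, 28, weight_sharing=True))    # 150
print(conv_layer_params(5, 5, 6, 28, 28, weight_sharing=False))   # 117600
```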

5.5. Soft weight sharing Sixth approach:

p(w) = Π_i p(w_i)   (5.136)

with

p(w_i) = Σ_{j=1}^{M} π_j N(w_i | µ_j, σ_j²)   (5.137)

Total error function:

Ẽ(w) = E(w) + λΩ(w)   (5.139)

with

Ω(w) = − Σ_i ln( Σ_{j=1}^{M} π_j N(w_i | µ_j, σ_j²) )   (5.138)

16/39
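A small NumPy sketch of the soft-weight-sharing penalty (5.138); in practice the mixture parameters π, µ, σ² are learned jointly with the weights, but here they are fixed stand-ins and the function name is illustrative:

```python
import numpy as np

def soft_weight_sharing_penalty(w, pi, mu, sigma2):
    """Omega(w) = -sum_i ln sum_j pi_j * N(w_i | mu_j, sigma2_j)  (cf. 5.138).
    w: flat weight vector; pi, mu, sigma2: parameters of the M-component
    Gaussian mixture prior over individual weights."""
    w = w[:, None]                                   # shape (n_weights, 1)
    gauss = np.exp(-0.5 * (w - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return -np.sum(np.log(gauss @ pi))

rng = np.random.default_rng(0)
w = rng.normal(size=50)
pi, mu, sigma2 = np.array([0.7, 0.3]), np.array([0.0, 1.0]), np.array([0.1, 0.5])
print(soft_weight_sharing_penalty(w, pi, mu, sigma2))
```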

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. Possible Future Topics. 17/39


5.6. Mixture Density Networks Goal of supervised learning: model p(t | x). Generative approach: model p(t, x) = p(x | t) p(t) and use Bayes' theorem, p(t | x) = p(x | t) p(t) / p(x). Discriminative approach: model p(t | x) directly. With inverse problems the distribution can be multimodal. Forward problem: model parameters → data/predictions. Inverse problem: data → model parameters. Figure: 5.19: Left: forward problem (x → t). Right: inverse problem (t → x). A standard ANN trained by least squares approximates the conditional mean. 18/39


5.6. Mixture Density Networks We seek a general framework for modeling p(t | x). Mixture Density Network (MDN): a mixture model in which both the mixing coefficients π_k(x) and the component densities (µ_k(x), σ_k²(x)) are flexible functions of the input vector x:

p(t | x) = Σ_{k=1}^{K} π_k(x) N(t | µ_k(x), σ_k²(x) I)   (5.148)

Recall that the Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form

p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)   (9.7)

Note: we can use other distributions for the components, such as Bernoulli distributions if the target variables are binary rather than continuous.

19/39

5.6. Mixture Density Networks Mixing coefficients must satisfy the constraints

Σ_{k=1}^{K} π_k(x) = 1,   0 ≤ π_k(x) ≤ 1   (5.149)

which can be achieved using a set of softmax outputs:

π_k(x) = exp(a_k^π) / Σ_{l=1}^{K} exp(a_l^π)   (5.150)

Variances must satisfy σ_k² ≥ 0 and so can be represented using

σ_k(x) = exp(a_k^σ)   (5.151)

Means have real components, so they can be represented directly by the network output activations: µ_kj(x) = a_kj^µ

20/39
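A minimal sketch of how the raw output activations of an MDN are mapped to valid mixture parameters, following (5.150), (5.151) and the direct mean mapping above; the ordering of the π-, σ- and µ-parts of the activation vector is one possible convention, not prescribed by the slides:

```python
import numpy as np

def mdn_outputs(a, K, L):
    """Split the (L+2)K raw network activations of an MDN into mixture
    parameters: softmax for the mixing coefficients, exp for the standard
    deviations, identity for the means."""
    a_pi, a_sigma, a_mu = a[:K], a[K:2 * K], a[2 * K:]
    pi = np.exp(a_pi - a_pi.max())
    pi /= pi.sum()                         # softmax -> sums to 1, each in [0, 1]
    sigma = np.exp(a_sigma)                # strictly positive
    mu = a_mu.reshape(K, L)                # unconstrained means
    return pi, mu, sigma

K, L = 2, 1
a = np.random.default_rng(0).normal(size=(L + 2) * K)
print(mdn_outputs(a, K, L))
```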

5.6. Mixture Density Networks Figure: MDN architecture (from Heiga Zen's slides; Heiga Zen is a researcher at Google working on statistical speech synthesis and recognition). The total number of network outputs is given by (L + 2)K, having K (= 2) components in the mixture model and L (= 1) components in t. 21/39


5.6. Mixture Density Networks Minimization of the negative logarithm of the likelihood:

E(w) = − Σ_{n=1}^{N} ln{ Σ_{k=1}^{K} π_k(x_n, w) N(t_n | µ_k(x_n, w), σ_k²(x_n, w) I) }   (5.153)

In order to minimize the error function, we need the derivatives of E(w) with respect to the components of w. These can be evaluated using backpropagation, starting from the derivatives of the error with respect to the output-unit activations (the δ's):

δ_k^π = ∂E_n/∂a_k^π = π_k − γ_nk,   δ_kl^µ = ∂E_n/∂a_kl^µ = γ_nk { (µ_kl − t_nl) / σ_k² },   δ_k^σ = ∂E_n/∂a_k^σ = γ_nk { L − ‖t_n − µ_k‖² / σ_k² }

where

γ_nk = γ_k(t_n | x_n) = π_k N_nk / Σ_{l=1}^{K} π_l N_nl

and N_nk denotes N(t_n | µ_k(x_n), σ_k²(x_n) I). γ_nk is the responsibility of component k for data point n, i.e. the posterior probability that component k generated t_n.

22/39
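A per-sample sketch of the MDN error (5.153) and of the responsibilities γ_nk, computed in log-space for numerical stability (an implementation detail not discussed on the slides); the function name and the toy parameters are illustrative:

```python
import numpy as np

def mdn_nll_and_responsibilities(t, pi, mu, sigma):
    """Per-sample MDN negative log-likelihood (cf. 5.153) and responsibilities
    gamma_nk, for one input x: t has shape (L,), pi and sigma shape (K,),
    mu shape (K, L). Isotropic Gaussian components N(t | mu_k, sigma_k^2 I)."""
    L = t.shape[0]
    sq_dist = np.sum((t - mu) ** 2, axis=1)                       # ||t - mu_k||^2
    log_norm = -0.5 * L * np.log(2 * np.pi * sigma ** 2) - sq_dist / (2 * sigma ** 2)
    log_weighted = np.log(pi) + log_norm                          # log(pi_k * N_nk)
    log_lik = np.logaddexp.reduce(log_weighted)                   # log sum_k pi_k N_nk
    gamma = np.exp(log_weighted - log_lik)                        # posterior over components
    return -log_lik, gamma

t = np.array([0.3])
pi, sigma = np.array([0.6, 0.4]), np.array([0.2, 0.5])
mu = np.array([[0.0], [1.0]])
nll, gamma = mdn_nll_and_responsibilities(t, pi, mu, sigma)
print(nll, gamma)
```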

5.6. Mixture Density Networks Once an MDN has been trained, it can predict the conditional density function of the target data for any given input vector. This conditional density represents a complete description of the generator of the data. Figure: 5.21. Bottom right: Mean of the most probable component (i.e., the one with the largest mixing coefficient) at each value of x 23/39


Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. Possible Future Topics. 24/39


5.7. Bayesian Neural Networks Main idea: apply Bayesian inference to get an entire probability distribution over the network weights, w, given the training data: p(w | D). Classical neural network training algorithms produce a single solution.

Approach: Training / Equivalent problem or solution
- ML: arg max_w p(t | x, w) = arg max_w N(t | y(x, w), β⁻¹)  →  minimize E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}²
- MAP: arg max_w p(w | t, x, λ) = arg max_w p(t | x, w) p(w | λ), with p(w | λ) = N(w | 0, λ⁻¹ I)  →  minimize Ẽ(w) = E(w) + (λ/2) wᵀw
- Posterior: p(w | D)  →  ?

25/39

5.7. Bayesian Neural Networks In a multilayered network, y(x, w) depends highly nonlinearly on w. An exact Bayesian treatment is not possible because the log of the posterior distribution is non-convex, so approximate methods are necessary. First approximation: the Laplace approximation, i.e. replace the posterior by a Gaussian centred at a mode of the true posterior. 26/39

5.7. Bayesian Neural Networks: Simple Regression Case Predict a single continuous target t from a vector x of inputs, assuming the hyperparameters α and β are fixed and known.

p(t | x, w, β) = N(t | y(x, w), β⁻¹)   (5.161)

Prior over weights assumed Gaussian:

p(w | α) = N(w | 0, α⁻¹ I)   (5.162)

Likelihood function:

p(D | w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β⁻¹)   (5.163)

Posterior:

p(w | D, α, β) ∝ p(w | α) p(D | w, β)   (5.164)

It will be non-Gaussian as a consequence of the nonlinear dependence of y(x, w) on w → Laplace approximation.

27/39

5.7. Bayesian Neural Networks: Gaussian approximation to p(w | D, α, β) It is convenient to maximize the logarithm of the posterior (which corresponds to the regularized sum-of-squares error):

ln p(w | D, α, β) = −(α/2) wᵀw − (β/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + const   (5.165)

The maximum is denoted w_MAP, and is found using nonlinear optimization methods (e.g. conjugate gradient). Having found the mode w_MAP, we can build a local Gaussian approximation

q(w | D) = N(w | w_MAP, A⁻¹)   (5.167)

where A is given in terms of the Hessian H of the sum-of-squares error function:

A = −∇∇ ln p(w | D, α, β) = αI + βH   (5.166)

28/39
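A minimal sketch of building the Laplace approximation (5.166)-(5.167), assuming w_MAP and the Hessian H of the sum-of-squares error are already available (here H is a random symmetric stand-in):

```python
import numpy as np

def laplace_posterior(w_map, H, alpha, beta):
    """Gaussian (Laplace) approximation q(w | D) = N(w | w_MAP, A^-1)
    with A = alpha*I + beta*H.  H is the Hessian of the sum-of-squares
    error evaluated at w_MAP (assumed given)."""
    A = alpha * np.eye(len(w_map)) + beta * H
    cov = np.linalg.inv(A)
    return w_map, cov

# toy usage with a hypothetical 3-parameter model
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
H = M @ M.T                      # symmetric positive semi-definite stand-in Hessian
w_map = np.array([0.5, -1.0, 2.0])
mean, cov = laplace_posterior(w_map, H, alpha=1.0, beta=10.0)
print(mean, cov)
```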

5.7. Bayesian Neural Networks

Approach: Training / Equivalent problem / Testing
- ML: arg max_w p(t | x, w)  →  arg min_w E(w), where E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}²  →  arg max_t p(t | w_ML, x) = arg max_t N(t | y(x, w_ML), β⁻¹)
- MAP: arg max_w p(w | t, x, λ)  →  arg min_w Ẽ(w), where Ẽ(w) = E(w) + (λ/2) wᵀw  →  arg max_t p(t | w_MAP, x) = arg max_t N(t | y(x, w_MAP), β⁻¹)
- Posterior: p(w | D, α, β) ≈ q(w | D)  →  ?

Table: Summary in the case of using Gaussian distributions.

29/39


5.7. Bayesian Neural Networks: p(t | x, D) Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution:

p(t | x, D) = ∫ p(t | x, w) q(w | D) dw   (5.168)

This integration is still analytically intractable due to the nonlinearity of y(x, w) as a function of w. Second approximation: the posterior has small variance compared with the characteristic scales of w over which y(x, w) is varying. This allows us to make a Taylor series expansion of the network function around w_MAP and retain only the linear terms:

y(x, w) ≃ y(x, w_MAP) + gᵀ(w − w_MAP)   (5.169)

where g = ∇_w y(x, w) |_{w = w_MAP}

31/39

5.7. Bayesian Neural Networks: p(t | x, D) We now have a linear-Gaussian model, with a Gaussian distribution for p(w) and a Gaussian for p(t | w) whose mean is a linear function of w:

p(t | x, w, β) ≃ N(t | y(x, w_MAP) + gᵀ(w − w_MAP), β⁻¹)   (5.171)

We can make use of the general result (2.115) for the marginal p(t) to give

p(t | x, D, α, β) = N(t | y(x, w_MAP), σ²(x))   (5.172)

where σ²(x) = β⁻¹ + gᵀA⁻¹g.

2.3.3 Marginal and Conditional Gaussians: given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x of the form p(x) = N(x | µ, Λ⁻¹) and p(y | x) = N(y | Ax + b, L⁻¹), the marginal distribution of y is

p(y) = N(y | Aµ + b, L⁻¹ + AΛ⁻¹Aᵀ)   (2.115)

32/39
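The predictive mean and variance in (5.172) are cheap to compute once g and A are available; a hedged sketch with stand-in values for g, A, β and y(x, w_MAP):

```python
import numpy as np

def predictive_distribution(y_map, g, A, beta):
    """Linearized predictive distribution:
    mean = y(x, w_MAP),  var = 1/beta + g^T A^-1 g,
    where g = grad_w y(x, w) at w_MAP and A = alpha*I + beta*H."""
    var = 1.0 / beta + g @ np.linalg.solve(A, g)
    return y_map, var

# toy usage with hypothetical quantities
g = np.array([0.2, -0.1, 0.4])           # gradient of the network output wrt w
A = np.diag([2.0, 3.0, 5.0])             # stand-in for alpha*I + beta*H
print(predictive_distribution(y_map=1.3, g=g, A=A, beta=25.0))
```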

5.7. Bayesian Neural Networks

Approach: Training / Equivalent problem / Testing
- ML: arg max_w p(t | x, w)  →  arg min_w E(w), where E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}²  →  arg max_t p(t | w_ML, x) = arg max_t N(t | y(x, w_ML), β⁻¹)
- MAP: arg max_w p(w | t, x, λ)  →  arg min_w Ẽ(w), where Ẽ(w) = E(w) + (λ/2) wᵀw  →  arg max_t p(t | w_MAP, x) = arg max_t N(t | y(x, w_MAP), β⁻¹)
- Posterior: p(w | D, α, β) ≈ q(w | D)  →  arg max_t p(t | x, D) = arg max_t N(t | y(x, w_MAP), σ²(x))

Table: Summary in the case of using Gaussian distributions.

33/39

5.7. Bayesian Neural Networks: Hyperparameter Optimization Estimates of α and β are obtained by maximizing ln p(D | α, β), given by

p(D | α, β) = ∫ p(D | w, β) p(w | α) dw   (5.174)

This is easily evaluated using the Laplace approximation result (4.135). Taking logarithms then gives

ln p(D | α, β) ≃ −E(w_MAP) − (1/2) ln|A| + (W/2) ln α + (N/2) ln β − (N/2) ln(2π)   (5.175)

where W is the total number of parameters in w, and the regularized error function is defined by

E(w_MAP) = (β/2) Σ_{n=1}^{N} {y(x_n, w_MAP) − t_n}² + (α/2) w_MAPᵀ w_MAP   (5.176)

Z = ∫ f(z) dz ≃ f(z₀) (2π)^{W/2} / |A|^{1/2}   (4.135)

34/39
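A direct transcription of (5.175) into code, assuming E(w_MAP), A, α, β, N and W are given (the values below are stand-ins):

```python
import numpy as np

def log_evidence(E_map, A, alpha, beta, N, W):
    """Laplace-approximation log model evidence (5.175):
    ln p(D | alpha, beta) ~= -E(w_MAP) - 0.5*ln|A| + (W/2)*ln(alpha)
                             + (N/2)*ln(beta) - (N/2)*ln(2*pi)."""
    _, logdet_A = np.linalg.slogdet(A)
    return (-E_map - 0.5 * logdet_A + 0.5 * W * np.log(alpha)
            + 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi))

# toy usage with stand-in values (W parameters, N data points)
A = np.diag([2.0, 3.0, 5.0])
print(log_evidence(E_map=12.4, A=A, alpha=1.0, beta=25.0, N=50, W=3))
```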

5.7. Bayesian Neural Networks: Hyperparameter Optimization Consider first the maximization with respect to α:

α = γ / (w_MAPᵀ w_MAP)   (5.178)

where γ represents the effective number of parameters and is defined by

γ = Σ_{i=1}^{W} λ_i / (α + λ_i)   (5.179)

and βH u_i = λ_i u_i is the eigenvalue equation. Maximizing the evidence with respect to β gives

1/β = (1/(N − γ)) Σ_{n=1}^{N} {y(x_n, w_MAP) − t_n}²   (5.180)

35/39
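One step of the evidence-framework re-estimation (5.178)-(5.180), as a sketch; in practice these updates are interleaved with re-optimization of w_MAP until convergence, and H and the residuals below are stand-ins:

```python
import numpy as np

def reestimate_hyperparameters(w_map, H, residuals, alpha, beta):
    """One update of alpha and beta: lambda_i are the eigenvalues of beta*H,
    gamma is the effective number of parameters, and residuals are
    y(x_n, w_MAP) - t_n."""
    lam = np.linalg.eigvalsh(beta * H)            # eigenvalues of beta*H
    gamma = np.sum(lam / (alpha + lam))           # effective number of parameters
    alpha_new = gamma / (w_map @ w_map)
    N = len(residuals)
    beta_new = (N - gamma) / np.sum(residuals ** 2)
    return alpha_new, beta_new, gamma

# toy usage with stand-in quantities
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3)); H = M @ M.T
w_map = np.array([0.5, -1.0, 2.0])
residuals = rng.normal(scale=0.1, size=20)
print(reestimate_hyperparameters(w_map, H, residuals, alpha=1.0, beta=10.0))
```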

5.7. Bayesian Neural Networks: Comparison of Models Consider a set of candidate models H_i to compare. We can apply Bayes' theorem to compute the posterior distribution over models, then pick the model with the largest posterior:

p(H_i | D) ∝ p(D | H_i) p(H_i)

The term p(D | H_i) is called the evidence for H_i and is given by

p(D | H_i) = ∫ p(D | w, H_i) p(w | H_i) dw

Assuming that we have no reason to assign strongly different priors p(H_i), models H_i are ranked by evaluating the evidence. The evidence is approximated by taking (5.175) and substituting the values of α and β obtained from the iterative optimization of these hyperparameters:

ln p(D | α, β) ≃ −E(w_MAP) − (1/2) ln|A| + (W/2) ln α + (N/2) ln β − (N/2) ln(2π)   (5.175)

36/39

Chapter Structure Introduction. 5.1. Feed-forward Network Functions. 5.2. Network Training. 5.3. Error Backpropagation. 5.4. The Hessian Matrix. 5.5. Regularization in Neural Networks. 5.6. Mixture Density Networks. 5.7. Bayesian Neural Networks. Possible Future Topics. 37/39


Possible Future Topics: Models

Self-Organizing Maps (SOM): Ch. 9, Neural Networks and Learning Machines (S. Haykin, 2009)
Radial Basis Function Networks (RBFN): Ch. 5, Neural Networks for Pattern Recognition (C.M. Bishop, 1995)
Convolutional Neural Networks (CNN):
- AlexNet: ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., NIPS 2012)
- LeNet-5, convolutional neural networks: http://yann.lecun.com/exdb/lenet/
- VGGNet: Very Deep Convolutional Networks for Large-Scale Image Recognition (Simonyan and Zisserman, arXiv, 2014)
- GoogLeNet: Going Deeper with Convolutions (Szegedy et al., arXiv, 2014)
- DeepFace: Closing the Gap to Human-Level Performance in Face Verification (Taigman et al., CVPR 2014)
Recurrent Neural Networks (RNN): LSTM: Long Short-Term Memory (Hochreiter and Schmidhuber, Neural Computation, 1997)
Boltzmann Machine: Ch. 11.7 from Haykin 2009; Ch. 7 from Introduction to the Theory of Neural Computation (Hertz et al., 1991)
Hopfield Network: Ch. 13.7 from Haykin 2009; Ch. 2 from Hertz et al. 1991; Ch. 13 from Neural Networks - A Systematic Introduction (Raul Rojas, 1996)
Deep Belief Networks (DBN): Ch. 11.9 from Haykin 2009; A Fast Learning Algorithm for Deep Belief Nets (Hinton et al., Neural Computation, 2006)

Webpage: https://project.inria.fr/deeplearning
Mailing list: deeplearning@inria.fr

38/39

Possible Future Topics: Problems

Transfer Learning (or domain adaptation): How can the knowledge gained solving one problem be transferred to a different but related problem? Machine learning methods work well under the assumption that the training and test data are drawn from the same feature space and the same distribution. When the distribution changes, it would be desirable to reduce the need and effort to collect new training data.

Integrating supervised and unsupervised learning in a single algorithm: It seems that Deep Boltzmann Machines do this, but there are issues of scalability.

Webpage: https://project.inria.fr/deeplearning
Mailing list: deeplearning@inria.fr

39/39