Deep Learning Lab Course 2017 (Deep Learning Practical)

Labs: Computer Vision (Thomas Brox), Robotics (Wolfram Burgard), Machine Learning (Frank Hutter), Neurorobotics (Joschka Boedecker)
University of Freiburg
January 2, 2018
Andreas Eitel Uni FR DL 2017

Technical Issues
- Location: Monday, 14:00-16:00, building 082, room 00 028
- Remark: We will be there for questions every week from 14:00-16:00.
- We expect you to work on your own. Your attendance is required during lectures/presentations.
- We expect you to have basic knowledge in ML (e.g. heard the Machine Learning lecture).
- Contact information (tutors):
  - Aaron Klein kleinaa@cs.uni-freiburg.de
  - Artemij Amiranashvili amiranas@cs.uni-freiburg.de
  - Mohammadreza Zolfaghari zolfagha@cs.uni-freiburg.de
  - Maria Hügle hueglem@cs.uni-freiburg.de
  - Jingwhei Zhang zhang@cs.uni-freiburg.de
  - Andreas Eitel eitel@cs.uni-freiburg.de
- Homepage: http://ais.informatik.uni-freiburg.de/teaching/ws7/deep learning course

Schedule and outline
Phase 1
- Today: Introduction to Deep Learning (lecture).
- Assignment 1) 16.10 - 06.11: Implementation of a feed-forward neural network from scratch (each person individually, 1-2 page report).
- 23.10: meeting, solving open questions
- 30.10: meeting, solving open questions
- 06.11: Intro to CNNs and TensorFlow (mandatory lecture), hand in Assignment 1
- Assignment 2) 06.11 - 20.11: Train a CNN in TensorFlow (each person individually, 1-2 page report)
- 13.11: meeting, solving open questions
Phase 2 (split into three tracks)
- 20.11: First meeting in tracks, hand in Assignment 2
- Assignment 3) 20.11 - 18.12: Group work, track-specific task, 1-2 page report
- Assignment 4) 18.12 - 22.01: Group work, advanced topic, 1-2 page report


Tracks
- Track 1: Neurorobotics/Robotics
  - Robot navigation
  - Deep reinforcement learning
- Track 2: Machine Learning
  - Architecture search and hyperparameter optimization
  - Visual recognition

Groups and Evaluation
- In total we have three practical phases.
- Reports:
  - Short 1-2 page report explaining your results, typically accompanied by 1-2 figures (e.g. a learning curve)
  - Hand in the code you used to generate these results as well; send us your github/bitbucket repo
- Presentations:
  - Phase 3: 5 min group work presentation in the last week
  - The reports and presentation after each exercise will be evaluated w.r.t. scientific quality and quality of presentation
- Requirements for passing:
  - Attending the class in the lecture and presentation weeks
  - Active participation in the small groups, i.e., programming, testing, writing the report

Today...
- Groups: You will work on your own for the first exercise
- Assignment: You will have three weeks to build an MLP and apply it to the MNIST dataset (more on this at the end)
- Lecture: Short recap on how MLPs (feed-forward neural networks) work and how to train them

(Deep) Machine Learning Overview
[Figure: Data -> Model M -> inference]
1. Learn Model M from the data
2. Let the model M infer unknown quantities from data


Machine Learning Overview
What is the difference between deep learning and a standard machine learning pipeline?

Standard Machine Learning Pipeline
(1) Engineer good features (not learned)
(2) Learn Model
(3) Inference, e.g. classes of unseen data
[Figure: Data -> (1) hand-engineered features (e.g. color, height) for examples x_1, x_2, x_3 -> (2) model -> (3) classes]

Unsupervised Feature Learning Pipeline
(1a) Maybe engineer good features (not learned)
(1b) Learn (deep) representation unsupervisedly
(2) Learn Model
(3) Inference, e.g. classes of unseen data
[Figure: Data (pixels) -> (1a) features (e.g. color, height) and (1b) learned representation -> (2) model -> (3) classes]

Supervised Deep Learning Pipeline
(1) Jointly learn everything with a deep architecture
(2) Inference, e.g. classes of unseen data
[Figure: Data (pixels) -> (1) deep architecture -> (2) classes]

Training supervised feed-forward neural networks
Let's formalize! We are given:
- Dataset $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$
- A neural network with parameters $\theta$ which implements a function $f_\theta(x)$
We want to learn:
- The parameters $\theta$ such that $\forall i \in [1, N]: f_\theta(x_i) = y_i$

Training supervised feed-forward neural networks
- A neural network with parameters $\theta$ which implements a function $f_\theta(x)$
- $\theta$ is given by the network weights $w$ and bias terms $b$
[Figure: feed-forward network with inputs $x_1, \dots, x_j, \dots, x_n$, hidden layers $h^{(1)}(x)$ and $h^{(2)}(x)$, and softmax output $f(x)$]

Neural network forward-pass
Computing $f_\theta(x)$ for a neural network is a forward-pass.
- unit $i$ activation: $a_i = \sum_{j=1}^{N} w_{i,j} x_j + b_i$
- unit $i$ output: $h^{(1)}_i(x) = t(a_i)$, where $t(\cdot)$ is an activation or transfer function
[Figure: network diagram highlighting unit $a_i$ with incoming weights $w_{i,j}$ from inputs $x_j$ and bias $b_i$]

Neural network forward-pass
Computing $f_\theta(x)$ for a neural network is a forward-pass. Alternatively (and much faster) use vector notation:
- layer activation: $a^{(1)} = W^{(1)} x + b^{(1)}$
- layer output: $h^{(1)}(x) = t(a^{(1)})$, where $t(\cdot)$ is applied element-wise
[Figure: network diagram highlighting the first-layer weight matrix $W^{(1)}$]

Neural network forward-pass
Computing $f_\theta(x)$ for a neural network is a forward-pass. Second layer:
- layer 2 activation: $a^{(2)} = W^{(2)} h^{(1)}(x) + b^{(2)}$
- layer 2 output: $h^{(2)}(x) = t(a^{(2)})$, where $t(\cdot)$ is applied element-wise
[Figure: network diagram highlighting the weight matrices $W^{(1)}$ and $W^{(2)}$]

Neural network forward-pass
Computing $f_\theta(x)$ for a neural network is a forward-pass. Output layer:
- output layer activation: $a^{(3)} = W^{(3)} h^{(2)}(x) + b^{(3)}$
- network output: $f(x) = o(a^{(3)})$, where $o(\cdot)$ is the output nonlinearity
- for classification use softmax: $o_i(z) = \frac{e^{z_i}}{\sum_{j=1}^{|z|} e^{z_j}}$
[Figure: network diagram with weight matrices $W^{(1)}$, $W^{(2)}$, $W^{(3)}$ and softmax output $f(x)$]
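The layer-by-layer forward-pass in vector notation can be sketched in NumPy as follows. This is a minimal illustration, not the course stub: the function names, the layer sizes, the choice of tanh as transfer function, and the weight scale are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum avoids overflow and does not change the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward(x, params, t=np.tanh):
    """Forward-pass: hidden layers use transfer function t, the output layer uses softmax."""
    h = x
    for W, b in params[:-1]:
        h = t(W @ h + b)                  # a = W h + b, then h = t(a) element-wise
    W_out, b_out = params[-1]
    return softmax(W_out @ h + b_out)     # f(x) = o(a) at the output layer

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]   # hypothetical: 4 inputs, two hidden layers of 5 units, 3 classes
params = [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(4)
f_x = forward(x, params)   # vector of class probabilities
```

Because the output layer is a softmax, the entries of `f_x` are positive and sum to one.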

Training supervised feed-forward neural networks
Neural network activation functions
- Typical nonlinearities for hidden layers are: $\tanh(a_i)$, sigmoid $\sigma(a_i) = \frac{1}{1 + e^{-a_i}}$, ReLU $\mathrm{relu}(a_i) = \max(a_i, 0)$
- tanh and sigmoid are both squashing non-linearities
- ReLU just thresholds at 0
- Why not linear?
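One way to approach the "why not linear?" question is to check numerically what happens when the transfer function is the identity. The sketch below (all names and sizes are illustrative assumptions) defines sigmoid and ReLU and shows that two stacked linear layers compute the same function as a single linear layer, so depth would add no expressive power.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(a, 0.0)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)
x = rng.standard_normal(4)

# Two layers with a linear (identity) transfer function ...
two_layers = W2 @ (W1 @ x + b1) + b2

# ... collapse into one linear layer with merged weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b
```

Here `two_layers` and `one_layer` agree up to floating-point error, whereas inserting `sigmoid`, `np.tanh`, or `relu` between the layers breaks this equivalence.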


Training supervised feed-forward neural networks
Train parameters $\theta$ such that $\forall i \in [1, N]: f_\theta(x_i) = y_i$
- We can do this via minimizing the empirical risk on our dataset $D$:
  $\min_\theta L(f_\theta, D) = \min_\theta \frac{1}{N} \sum_{i=1}^{N} l(f_\theta(x_i), y_i)$,
  where $l(\cdot, \cdot)$ is a per-example loss
- For regression often use the squared loss: $l(f_\theta(x), y) = \frac{1}{2} \sum_{j=1}^{M} (f_{j,\theta}(x) - y_j)^2$
- For M-class classification use the negative log likelihood: $l(f_\theta(x), y) = \sum_{j}^{M} -\log(f_{j,\theta}(x)) \cdot y_j$
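The two per-example losses and the empirical risk translate almost directly into NumPy. The following is a small sketch; the function names and the example probabilities are made up for illustration, and the targets are assumed to be one-hot encoded.

```python
import numpy as np

def squared_loss(f_x, y):
    # 1/2 * sum_j (f_j(x) - y_j)^2
    return 0.5 * np.sum((f_x - y) ** 2)

def nll_loss(f_x, y_onehot):
    # -sum_j log(f_j(x)) * y_j, with y one-hot encoded
    return -np.sum(y_onehot * np.log(f_x))

def empirical_risk(loss, outputs, targets):
    # Average of the per-example loss over the dataset.
    return np.mean([loss(f, y) for f, y in zip(outputs, targets)])

# Hypothetical network outputs (class probabilities) for two examples.
outputs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
risk = empirical_risk(nll_loss, outputs, targets)
```

With one-hot targets the negative log likelihood reduces to minus the log of the probability assigned to the correct class, so `risk` here is the mean of `-log(0.7)` and `-log(0.8)`.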


Gradient descent
The simplest approach to minimizing $\min_\theta L(f_\theta, D)$ is gradient descent.
Gradient descent:
  $\theta^0 \leftarrow$ init randomly
  do
    $\theta^{t+1} = \theta^t - \gamma \frac{\partial L(f_\theta, D)}{\partial \theta}$
  while $(L(f_{\theta^{t+1}}, V) - L(f_{\theta^t}, V))^2 > \epsilon$
- Where $V$ is a validation dataset (why not use $D$?)
- Remember in our case: $L(f_\theta, D) = \frac{1}{N} \sum_{i=1}^{N} l(f_\theta(x_i), y_i)$
- We will get to computing the derivatives shortly


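Gradient descent
Gradient descent example:
- $D = \{(x_1, y_1), \dots, (x_{100}, y_{100})\}$ with $x \sim U[0, 1]$, $y = 3x + \epsilon$ where $\epsilon \sim N(0, 0.1)$
- Learn parameter $\theta$ of function $f_\theta(x) = \theta x$ using loss $l(f_\theta(x), y) = \frac{1}{2} \| f_\theta(x) - y \|_2^2 = \frac{1}{2} (f_\theta(x) - y)^2$
- $\frac{\partial L(f_\theta, D)}{\partial \theta} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial l(f_\theta(x_i), y_i)}{\partial \theta} = \frac{1}{N} \sum_{i=1}^{N} (\theta x_i - y_i) x_i$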
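The toy example can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions: 0.1 is used as the standard deviation of the noise, the validation set is a second sample of 100 points, and the step size, tolerance, and iteration cap are illustrative choices.

```python
import random

random.seed(0)

def make_data(n):
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]   # 0.1 taken as std (assumption)
    return xs, ys

def risk(theta, xs, ys):
    # L = 1/N * sum_i 1/2 (theta x_i - y_i)^2
    return sum(0.5 * (theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(theta, xs, ys):
    # dL/dtheta = 1/N * sum_i (theta x_i - y_i) x_i
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(100)   # dataset D
val_x, val_y = make_data(100)       # validation set V (assumed to be a fresh sample)

theta = random.uniform(-1.0, 1.0)   # random initialisation
gamma, eps = 2.0, 1e-12             # step size and stopping tolerance (assumptions)
prev = risk(theta, val_x, val_y)
for step in range(1000):            # hard cap as a safety net
    theta -= gamma * grad(theta, train_x, train_y)
    cur = risk(theta, val_x, val_y)
    if (cur - prev) ** 2 <= eps:    # stop once the validation loss no longer changes
        break
    prev = cur
```

After the loop, `theta` should lie close to the true slope of 3 used to generate the data.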

Gradient descent
Gradient descent example with γ = 2.
[Figure: gradient descent]

Stochastic Gradient descent (SGD)
There are two problems with gradient descent:
1. We have to find a good γ
2. Computing the gradient is expensive if the training dataset is large!
- Problem 2 can be attacked with online optimization (we will have a look at this)
- Problem 1 remains but can be tackled via second order methods or other advanced optimization algorithms (rprop/rmsprop, adagrad)

Gradient descent
1. We have to find a good γ (γ = 2., γ = 5.)
[Figure: gradient descent for the two step sizes]

Stochastic Gradient descent (SGD)
2. Computing the gradient is expensive if the training dataset is large!
What if we would only evaluate $f$ on parts of the data?
Stochastic Gradient Descent:
  $\theta^0 \leftarrow$ init randomly
  do
    $(x', y') \sim D$ (sample example from dataset $D$)
    $\theta^{t+1} = \theta^t - \gamma_t \frac{\partial l(f_\theta(x'), y')}{\partial \theta}$
  while $(L(f_{\theta^{t+1}}, V) - L(f_{\theta^t}, V))^2 > \epsilon$
  where $\sum_{t=1}^{\infty} \gamma_t \to \infty$ and $\sum_{t=1}^{\infty} (\gamma_t)^2 < \infty$ (γ should go to zero but not too fast)
- SGD can speed up optimization for large datasets
- but can yield very noisy updates
- in practice mini-batches are used (compute $l(\cdot, \cdot)$ for several samples and average)
- we still have to find a learning rate schedule $\gamma_t$
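A mini-batch variant of SGD on the same kind of toy regression data might look as follows. This is only a sketch: the schedule $\gamma_t = \gamma_0 / t$ is one choice that satisfies both summation conditions, and the values of `gamma_0`, the batch size, the noise standard deviation, and the fixed number of steps (instead of the validation-based stopping rule) are assumptions.

```python
import random

random.seed(0)

# Toy data: y = 3x + noise, with 0.1 taken as the noise std (assumption).
xs = [random.uniform(0.0, 1.0) for _ in range(100)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.1)) for x in xs]

def minibatch_grad(theta, batch):
    # Average of the per-example gradients (theta x - y) x over the mini-batch.
    return sum((theta * x - y) * x for x, y in batch) / len(batch)

theta = random.uniform(-1.0, 1.0)   # random initialisation
gamma_0 = 3.0                        # initial step size (assumption)
batch_size = 10                      # mini-batch size (assumption)

for t in range(1, 2001):
    batch = random.sample(data, batch_size)   # sample examples from D
    gamma_t = gamma_0 / t                     # sum gamma_t diverges, sum gamma_t^2 converges
    theta -= gamma_t * minibatch_grad(theta, batch)
```

Each update touches only 10 of the 100 examples, so a step costs roughly a tenth of a full gradient evaluation, at the price of noisier updates.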

Stochastic Gradient descent (SGD)
Same data, assuming that gradient evaluation on all data takes 4 times as much time as evaluating a single datapoint (gradient descent (γ = 2), stochastic gradient descent (γ t = 0.0 t ))
[Figure: gradient descent vs. sgd]

Neural Network backward pass
Now how do we compute the gradient for a network?
- Use the chain rule: $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \frac{\partial g(x)}{\partial x}$
- first compute the loss $l(f(x), y)$ on the output layer
- then backpropagate to get $\frac{\partial l(f(x), y)}{\partial W^{(3)}}$ and $\frac{\partial l(f(x), y)}{\partial a^{(3)}}$
[Figure: network diagram with weight matrices $W^{(1)}$, $W^{(2)}$, $W^{(3)}$ and the loss $l(f(x), y)$ attached to the softmax output]

Neural Network backward pass
Now how do we compute the gradient for a network?
- gradient wrt. layer 3 weights: $\frac{\partial l(f(x), y)}{\partial W^{(3)}} = \frac{\partial l(f(x), y)}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial W^{(3)}}$
- assuming $l$ is NLL and softmax outputs, gradient wrt. layer 3 activation is: $\frac{\partial l(f(x), y)}{\partial a^{(3)}} = -(y - f(x))$, where $y$ is one-hot encoded
- gradient of $a^{(3)}$ wrt. $W^{(3)}$: $\frac{\partial a^{(3)}}{\partial W^{(3)}} = (h^{(2)}(x))^T$
- recall $a^{(3)} = W^{(3)} h^{(2)}(x) + b^{(3)}$

Neural Network backward pass
- combined gradient wrt. layer 3 weights: $\frac{\partial l(f(x), y)}{\partial W^{(3)}} = -(y - f(x)) (h^{(2)}(x))^T$
- recall $a^{(3)} = W^{(3)} h^{(2)}(x) + b^{(3)}$

Neural Network backward pass
- gradient wrt. previous layer: $\frac{\partial l(f(x), y)}{\partial h^{(2)}(x)} = \frac{\partial l(f(x), y)}{\partial a^{(3)}} \frac{\partial a^{(3)}}{\partial h^{(2)}(x)} = (W^{(3)})^T \frac{\partial l(f(x), y)}{\partial a^{(3)}}$
- recall $a^{(3)} = W^{(3)} h^{(2)}(x) + b^{(3)}$

Neural Network backward pass
Now how do we compute the gradient for a network?
- gradient wrt. layer 2 weights: $\frac{\partial l(f(x), y)}{\partial W^{(2)}} = \frac{\partial l(f(x), y)}{\partial h^{(2)}(x)} \frac{\partial h^{(2)}(x)}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial W^{(2)}}$
- same schema as before, just have to consider computing the derivative of the activation function $\frac{\partial h^{(2)}(x)}{\partial a^{(2)}}$, e.g. for sigmoid $\sigma(\cdot)$: $\frac{\partial h^{(2)}(x)}{\partial a^{(2)}} = \sigma(a_i)(1 - \sigma(a_i))$
- and backprop even further
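Putting the backward-pass formulas together, a compact NumPy sketch for a network with sigmoid hidden layers, softmax output, and NLL loss could look like this. The function names, layer sizes, and weight scale are assumptions for illustration; the bias gradients are included as well, which follow the same schema as the weight gradients.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward_backward(x, y, params):
    """Return the NLL loss and a list of (dl/dW, dl/db) pairs, one per layer."""
    # Forward pass, keeping the input of every layer: hs[k] feeds layer k.
    hs = [x]
    for W, b in params[:-1]:
        hs.append(sigmoid(W @ hs[-1] + b))
    W_out, b_out = params[-1]
    f = softmax(W_out @ hs[-1] + b_out)
    loss = -np.sum(y * np.log(f))

    # Backward pass, starting from dl/da = -(y - f(x)) at the output layer.
    grads = [None] * len(params)
    delta = -(y - f)
    for k in reversed(range(len(params))):
        W, _ = params[k]
        grads[k] = (np.outer(delta, hs[k]), delta)   # dl/dW = delta h^T, dl/db = delta
        if k > 0:
            dh = W.T @ delta                          # gradient wrt. previous layer output
            delta = dh * hs[k] * (1.0 - hs[k])        # times sigmoid'(a) = h (1 - h)
    return loss, grads

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]   # hypothetical layer sizes
params = [(0.5 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(4)
y = np.eye(3)[0]       # one-hot target for class 0
loss, grads = forward_backward(x, y, params)
```

Each entry of `grads` has the same shapes as the corresponding `(W, b)` pair in `params`, so it can be plugged directly into a gradient descent update.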

Gradient Checking
- Backward-pass is just repeated application of the chain rule
- However, there is a huge potential for bugs...
- Gradient checking to the rescue (simply check code via finite differences):
Gradient Checking:
  $\theta = (W^{(1)}, b^{(1)}, \dots)$ init randomly
  $x \leftarrow$ init randomly; $y \leftarrow$ init randomly
  $g_{analytic} \leftarrow$ compute gradient $\frac{\partial l(f_\theta(x), y)}{\partial \theta}$ via backprop
  for $i$ in $\#\theta$:
    $\hat{\theta} = \theta$
    $\hat{\theta}_i = \hat{\theta}_i + \epsilon$
    $g_{numeric} = \frac{l(f_{\hat{\theta}}(x), y) - l(f_\theta(x), y)}{\epsilon}$
    assert($\| g_{numeric} - g_{analytic,i} \| < \epsilon$)
- can also be used to test partial implementations (i.e. layers, activation functions): simply remove the loss computation and backprop ones
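The finite-difference check can be sketched for a single softmax layer with NLL loss, which is small enough to verify by hand. The helper name `numeric_gradient`, the layer sizes, and the use of a tolerance separate from the step size are assumptions of this sketch rather than part of the course stub.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def numeric_gradient(loss_fn, theta, eps=1e-6):
    """Forward-difference estimate of the gradient, one parameter at a time."""
    base = loss_fn(theta)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        theta_hat = theta.copy()
        theta_hat[i] += eps                       # perturb only parameter i
        g[i] = (loss_fn(theta_hat) - base) / eps
    return g

rng = np.random.default_rng(0)
n_in, n_out = 4, 3                                # hypothetical layer size
theta = rng.standard_normal(n_out * n_in)         # flattened weight matrix W
x = rng.standard_normal(n_in)
y = np.eye(n_out)[1]                              # one-hot target

def loss_fn(theta):
    f = softmax(theta.reshape(n_out, n_in) @ x)
    return -np.sum(y * np.log(f))

# Analytic gradient for softmax + NLL: -(y - f(x)) x^T, flattened like theta.
f = softmax(theta.reshape(n_out, n_in) @ x)
g_analytic = np.outer(-(y - f), x).ravel()
g_numeric = numeric_gradient(loss_fn, theta)
max_diff = np.max(np.abs(g_numeric - g_analytic))
assert max_diff < 1e-4, "analytic and numeric gradients disagree"
```

The forward difference has an error proportional to the step size, so the tolerance is chosen somewhat larger than `eps`; a central difference would allow a tighter bound.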

Overfitting
- If you train the parameters of a large network $\theta = (W^{(1)}, b^{(1)}, \dots)$ you will see overfitting! $L(f_\theta(x), D) \ll L(f_\theta(x), V)$
- This can be at least partly conquered with regularization
- weight decay: change cost (and gradient)
  $\min_\theta L(f_\theta, D) = \frac{1}{N} \sum_{i=1}^{N} l(f_\theta(x_i), y_i) + \frac{1}{\#\theta} \sum_{i}^{\#\theta} \theta_i^2$
  enforces small weights (Occam's razor)
- dropout: kill 50% of the activations in each hidden layer during the training forward pass. Multiply hidden activations by $\frac{1}{2}$ during testing
  prevents co-adaptation / enforces robustness to noise
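Both regularizers are short to implement. The sketch below follows the slide's variant of dropout (drop during training, scale by the keep probability at test time) and the unweighted weight-decay term; the function names and the `p_drop` parameter are assumptions for illustration.

```python
import numpy as np

def dropout_forward(h, rng, train, p_drop=0.5):
    """Apply dropout to the hidden activations h."""
    if train:
        mask = rng.random(h.shape) >= p_drop   # keep each unit with probability 1 - p_drop
        return h * mask
    return h * (1.0 - p_drop)                  # test time: multiply by 1/2 for p_drop = 0.5

def weight_decay_penalty(theta):
    # 1/#theta * sum_i theta_i^2, added to the empirical risk
    return np.sum(theta ** 2) / theta.size

def weight_decay_gradient(theta):
    # Derivative of the penalty, added to the gradient of the empirical risk
    return 2.0 * theta / theta.size

rng = np.random.default_rng(0)
h = np.ones(10000)
h_train = dropout_forward(h, rng, train=True)    # roughly half of the entries are zeroed
h_test = dropout_forward(h, rng, train=False)    # every entry scaled to 0.5
theta = np.array([1.0, 2.0, 3.0])
penalty = weight_decay_penalty(theta)
penalty_grad = weight_decay_gradient(theta)
```

Scaling at test time keeps the expected input to the next layer the same as during training, which is why the mean of `h_train` is close to the constant value of `h_test`.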

Assignment
Implementation: Implement a simple feed-forward neural network by completing the provided stub. This includes:
- possibility to use 2-4 layers
- sigmoid/tanh and ReLU for the hidden layer
- softmax output layer
- optimization via gradient descent (gd)
- optimization via stochastic gradient descent (sgd)
- gradient checking code (!!!)
- weight initialization with random noise (!!!) (use normal distribution with changing std. deviation for now)
Bonus points for testing some advanced ideas:
- implement dropout, l2 regularization
- implement a different optimization scheme (rprop, rmsprop, adagrad)
Code stub: https://github.com/aisrobots/dl_lab_207
Evaluation:
- Find good parameters (learning rate, number of iterations etc.) using a validation set (usually take the last 10k examples from the training set)
- After optimizing parameters, run on the full dataset and test once on the test set (you should be able to reach 1.6-1.8% error)
Submission: Clone our github repo and send us the link to your github/bitbucket repo including your solution code and report. Email to eitel@cs.uni-freiburg.de, hueglem@cs.uni-freiburg.de and zhang@cs.uni-freiburg.de

Slide Information
These slides have been created by Tobias Springenberg for the DL Lab Course WS 2016/2017.