Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical)
Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker
University of Freiburg, January 2, 2018
Andreas Eitel

Technical Issues
Location: Monday, 14:00-16:00, building 082, room 00 028
Remark: We will be there for questions every week from 14:00-16:00. We expect you to work on your own. Your attendance is required during lectures/presentations.
We expect you to have basic knowledge in ML (e.g. you have heard the Machine Learning lecture).
Contact information (tutors):
Aaron Klein kleinaa@cs.uni-freiburg.de
Artemij Amiranashvili amiranas@cs.uni-freiburg.de
Mohammadreza Zolfaghari zolfagha@cs.uni-freiburg.de
Maria Hügle hueglem@cs.uni-freiburg.de
Jingwhei Zhang zhang@cs.uni-freiburg.de
Andreas Eitel eitel@cs.uni-freiburg.de
Homepage: http://ais.informatik.uni-freiburg.de/teaching/ws17/deep learning course

Schedule and outline
Phase 1
Today: Introduction to Deep Learning (lecture).
Assignment 1) 16.10 - 06.11: Implement a feed-forward neural network from scratch (each person individually, 1-2 page report).
23.10: meeting, solving open questions
30.10: meeting, solving open questions
06.11: Intro to CNNs and TensorFlow (mandatory lecture), hand in Assignment 1
Assignment 2) 06.11 - 20.11: Train a CNN in TensorFlow (each person individually, 1-2 page report)
13.11: meeting, solving open questions
Phase 2 (split into three tracks)
20.11: first meeting in tracks, hand in Assignment 2
Assignment 3) 20.11 - 18.12: group work on a track-specific task, 1-2 page report
Assignment 4) 18.12 - 22.01: group work on an advanced topic, 1-2 page report
Phase 3
15.01: meeting, final project topics and solving open questions
Assignment 5) 22.01 - 12.02: Solve a challenging problem of your choice using the tools from Phase 2 (group work + 5 min presentation)

Tracks
Track 1 (Neurorobotics/Robotics): Robot navigation, Deep reinforcement learning
Track 2 (Machine Learning): Architecture search and hyperparameter optimization, Visual recognition
Track 3 (Computer Vision): Image segmentation, Autoencoders, Generative adversarial networks

Groups and Evaluation
In total we have three practical phases.
Reports: a short 1-2 page report explaining your results, typically accompanied by 1-2 figures (e.g. a learning curve). Hand in the code you used to generate these results as well; send us your github/bitbucket repo.
Presentations: in Phase 3, a 5 min group work presentation in the last week.
The reports and the presentation after each exercise will be evaluated w.r.t. scientific quality and quality of presentation.
Requirements for passing: attending the class in the lecture and presentation weeks, and active participation in the small groups, i.e. programming, testing, writing the report.

Today...
Groups: You will work on your own for the first exercise.
Assignment: You will have three weeks to build an MLP and apply it to the MNIST dataset (more on this at the end).
Lecture: a short recap of how MLPs (feed-forward neural networks) work and how to train them.

(Deep) Machine Learning Overview
Data -> Model M -> inference
1. Learn the model M from the data.
2. Let the model M infer unknown quantities from data.

Machine Learning Overview
What is the difference between deep learning and a standard machine learning pipeline?

Standard Machine Learning Pipeline
(1) Engineer good features (not learned)
(2) Learn a model
(3) Inference, e.g. predict classes of unseen data
[Figure: example data points are mapped to hand-engineered features (color, height: 4 cm, 3.5 cm, 10 cm), a model is learned on these features, and inference is performed on new data]

Unsupervised Feature Learning Pipeline
(1a) Maybe engineer good features (not learned)
(1b) Learn a (deep) representation unsupervisedly
(2) Learn a model
(3) Inference, e.g. predict classes of unseen data
[Figure: raw pixels are mapped either to hand-engineered features (color, height) or to an unsupervisedly learned deep representation, a model is learned on top, and inference is performed on new data]

Supervised Deep Learning Pipeline
(1) Jointly learn everything with a deep architecture
(2) Inference, e.g. predict classes of unseen data
[Figure: raw pixels are fed directly into a deep network that outputs classes]

Training supervised feed-forward neural networks
Let's formalize! We are given:
a dataset D = {(x_1, y_1), ..., (x_N, y_N)}
a neural network with parameters θ which implements a function f_θ(x)
We want to learn: the parameters θ such that for all i ∈ [1, N]: f_θ(x_i) = y_i

Training supervised feed-forward neural networks
A neural network with parameters θ implements a function f_θ(x); θ is given by the network weights w and the bias terms b.
[Figure: network with inputs x_1 ... x_j ... x_n, hidden layers h^(1)(x) and h^(2)(x), and a softmax output layer f(x)]

Neural network forward-pass
Computing f_θ(x) for a neural network is a forward-pass.
Activation of unit i: a_i = Σ_{j=1}^N w_{i,j} x_j + b_i
Output of unit i: h^(1)_i(x) = t(a_i), where t(·) is an activation or transfer function.
[Figure: hidden unit with activation a_i connected to the inputs x_1 ... x_j ... x_n via weights w_{i,j} and bias b_i]

Neural network forward-pass
Computing f_θ(x) for a neural network is a forward-pass.
Alternatively (and much faster), use vector notation:
Layer 1 activation: a^(1) = W^(1) x + b^(1)
Layer 1 output: h^(1)(x) = t(a^(1)), where t(·) is applied element-wise.

Neural network forward-pass
Second layer:
Layer 2 activation: a^(2) = W^(2) h^(1)(x) + b^(2)
Layer 2 output: h^(2)(x) = t(a^(2)), where t(·) is applied element-wise.

Neural network forward-pass
Output layer:
Output layer activation: a^(3) = W^(3) h^(2)(x) + b^(3)
Network output: f(x) = o(a^(3)), where o(·) is the output nonlinearity.
For classification use the softmax: o_i(z) = e^{z_i} / Σ_j e^{z_j}
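
To make the forward pass concrete, here is a minimal NumPy sketch of the three-layer network above. The function names (forward, softmax) and the layer sizes are illustrative assumptions, not the API of the assignment stub.

    import numpy as np

    def softmax(z):
        # subtract the max for numerical stability; the result is unchanged
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    def forward(x, params, t=np.tanh):
        # forward pass through a 2-hidden-layer MLP with softmax output
        W1, b1, W2, b2, W3, b3 = params
        a1 = W1 @ x + b1            # layer 1 activation
        h1 = t(a1)                  # layer 1 output
        a2 = W2 @ h1 + b2           # layer 2 activation
        h2 = t(a2)                  # layer 2 output
        a3 = W3 @ h2 + b3           # output layer activation
        return softmax(a3)          # network output f(x)

    # tiny usage example with random parameters (sizes chosen arbitrarily)
    rng = np.random.RandomState(0)
    sizes = [(20, 5), (20,), (10, 20), (10,), (3, 10), (3,)]
    params = [0.1 * rng.randn(*s) for s in sizes]
    print(forward(rng.randn(5), params))   # 3 class probabilities that sum to 1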

Training supervised feed-forward neural networks
Neural network activation functions. Typical nonlinearities for the hidden layers are:
tanh(a_i), the sigmoid σ(a_i) = 1 / (1 + e^{-a_i}), and the ReLU relu(a_i) = max(a_i, 0)
tanh and sigmoid are both squashing nonlinearities; the ReLU just thresholds at 0.
Why not linear? A composition of linear layers is itself linear, so a purely linear network would gain no expressive power from depth.
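
For illustration, a small sketch of these nonlinearities and their derivatives (the derivatives will be needed again for the backward pass); the function names are arbitrary.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def d_sigmoid(a):
        s = sigmoid(a)
        return s * (1.0 - s)            # sigma(a) * (1 - sigma(a))

    def d_tanh(a):
        return 1.0 - np.tanh(a) ** 2

    def relu(a):
        return np.maximum(a, 0.0)

    def d_relu(a):
        return (a > 0.0).astype(a.dtype)

    a = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    print(sigmoid(a))    # squashed into (0, 1)
    print(np.tanh(a))    # squashed into (-1, 1)
    print(relu(a))       # negative inputs thresholded to 0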

Training supervised feed-forward neural networks
Train the parameters θ such that for all i ∈ [1, N]: f_θ(x_i) = y_i.
We can do this by minimizing the empirical risk on our dataset D:
min_θ L(f_θ, D) = min_θ (1/N) Σ_{i=1}^N l(f_θ(x_i), y_i),   (1)
where l(·, ·) is a per-example loss.
For regression, often use the squared loss: l(f_θ(x), y) = 1/2 Σ_{j=1}^M (f_{j,θ}(x) - y_j)^2
For M-class classification, use the negative log likelihood: l(f_θ(x), y) = -Σ_{j=1}^M log(f_{j,θ}(x)) y_j
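
A minimal sketch of these two losses in NumPy, assuming the classification target y is one-hot encoded (the names and the example numbers are illustrative):

    import numpy as np

    def squared_loss(f_x, y):
        # 1/2 * sum_j (f_j(x) - y_j)^2
        return 0.5 * np.sum((f_x - y) ** 2)

    def nll_loss(f_x, y_onehot, eps=1e-12):
        # negative log likelihood: -sum_j log(f_j(x)) * y_j
        return -np.sum(y_onehot * np.log(f_x + eps))

    f_x = np.array([0.7, 0.2, 0.1])    # softmax outputs
    y = np.array([1.0, 0.0, 0.0])      # one-hot target
    print(nll_loss(f_x, y))            # -log(0.7), roughly 0.357
    print(squared_loss(f_x, y))        # 0.07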

Gradient descent
The simplest approach to minimizing min_θ L(f_θ, D) is gradient descent:
θ^0 initialized randomly
do
  θ^{t+1} = θ^t - γ ∂L(f_θ, D)/∂θ
while (L(f_{θ^{t+1}}, V) - L(f_{θ^t}, V))^2 > ε
where V is a validation dataset (why not use D?).
Remember, in our case: L(f_θ, D) = (1/N) Σ_{i=1}^N l(f_θ(x_i), y_i)
We will get to computing the derivatives shortly.
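
A sketch of this loop for a generic parameter vector; grad_D and loss_V stand in for the gradient of the training loss and the validation loss and are assumptions for illustration, not part of the provided stub.

    import numpy as np

    def gradient_descent(loss_V, grad_D, theta0, gamma=0.1, eps=1e-8, max_iter=10000):
        # plain gradient descent with a validation-loss based stopping criterion
        theta = theta0
        prev_val = loss_V(theta)
        for _ in range(max_iter):
            theta = theta - gamma * grad_D(theta)     # step along the negative gradient
            val = loss_V(theta)
            if (val - prev_val) ** 2 <= eps:          # stop when the validation loss stops changing
                break
            prev_val = val
        return theta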

Gradient descent
Gradient descent example: D = {(x_1, y_1), ..., (x_100, y_100)} with x ~ U[0, 1] and y = 3x + ε, where ε ~ N(0, 0.1).
Learn the parameter θ of the function f_θ(x) = θx using the loss
l(f_θ(x), y) = 1/2 ||f_θ(x) - y||_2^2 = 1/2 (f_θ(x) - y)^2
∂L(f_θ, D)/∂θ = (1/N) Σ_{i=1}^N ∂l(f_θ(x_i), y_i)/∂θ = (1/N) Σ_{i=1}^N (θ x_i - y_i) x_i
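
The same toy example worked out as a short NumPy script; the learning rate and iteration count are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.RandomState(0)
    N = 100
    x = rng.uniform(0.0, 1.0, size=N)            # x ~ U[0, 1]
    y = 3.0 * x + rng.normal(0.0, 0.1, size=N)   # y = 3x + noise

    theta = 0.0
    gamma = 2.0
    for t in range(50):
        grad = np.mean((theta * x - y) * x)      # (1/N) sum_i (theta * x_i - y_i) * x_i
        theta = theta - gamma * grad
    print(theta)                                 # approaches 3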

Gradient descent
Gradient descent example with γ = 2.
[Figure: gradient descent iterates on the toy problem]

Stochastic Gradient Descent (SGD)
There are two problems with gradient descent:
1. We have to find a good γ.
2. Computing the gradient is expensive if the training dataset is large!
Problem 2 can be attacked with online optimization (we will have a look at this).
Problem 1 remains, but can be tackled via second-order methods or other advanced optimization algorithms (rprop/rmsprop, adagrad).

Gradient descent
1. We have to find a good γ (γ = 2 vs. γ = 5).
[Figure: gradient descent iterates for the two learning rates]

Stochastic Gradient Descent (SGD)
2. Computing the gradient is expensive if the training dataset is large!
What if we only evaluated f on parts of the data?
Stochastic gradient descent:
θ^0 initialized randomly
do
  (x', y') sampled from the dataset D
  θ^{t+1} = θ^t - γ_t ∂l(f_θ(x'), y')/∂θ
while (L(f_{θ^{t+1}}, V) - L(f_{θ^t}, V))^2 > ε
where Σ_{t=1}^∞ γ_t = ∞ and Σ_{t=1}^∞ (γ_t)^2 < ∞ (γ_t should go to zero, but not too fast).
SGD can speed up optimization for large datasets, but can yield very noisy updates; in practice mini-batches are used (compute l(·, ·) for several samples and average). We still have to find a learning rate schedule γ_t.
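
A minimal sketch of mini-batch SGD with a decaying step size; the schedule, batch size and function names are illustrative assumptions.

    import numpy as np

    def sgd(grad_example, X, Y, theta0, gamma0=0.5, batch_size=32, epochs=10, seed=0):
        # mini-batch SGD: average per-example gradients over a small random batch
        rng = np.random.RandomState(seed)
        theta = theta0
        step = 0
        for _ in range(epochs):
            order = rng.permutation(len(X))
            for start in range(0, len(X), batch_size):
                idx = order[start:start + batch_size]
                grad = np.mean([grad_example(theta, X[i], Y[i]) for i in idx], axis=0)
                step += 1
                gamma_t = gamma0 / (1.0 + 0.01 * step)   # decaying learning rate schedule
                theta = theta - gamma_t * grad
        return theta

    # reuse the toy problem from above: f_theta(x) = theta * x
    grad_example = lambda th, x, y: (th * x - y) * x
    X = np.random.RandomState(1).uniform(0.0, 1.0, 200)
    Y = 3.0 * X + np.random.RandomState(2).normal(0.0, 0.1, 200)
    print(sgd(grad_example, X, Y, theta0=0.0))           # approaches 3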

Stochastic Gradient Descent (SGD)
Same data, assuming that a gradient evaluation on all the data takes 4 times as much time as evaluating a single datapoint (gradient descent with γ = 2, stochastic gradient descent with a decaying step size γ_t).
[Figure: loss over computation time for gradient descent vs. stochastic gradient descent]

Neural network backward pass
Now how do we compute the gradient of l(f(x), y) for a network?
Use the chain rule: ∂f(g(x))/∂x = (∂f(g(x))/∂g(x)) · (∂g(x)/∂x)
First compute the loss on the output layer, then backpropagate to get ∂l(f(x), y)/∂W^(3) and ∂l(f(x), y)/∂a^(3).
[Figure: the three-layer network with the gradient of l(f(x), y) flowing back from the output towards the weights]

Neural network backward pass
Gradient w.r.t. the layer 3 weights:
∂l(f(x), y)/∂W^(3) = (∂l(f(x), y)/∂a^(3)) · (∂a^(3)/∂W^(3))
Assuming l is the NLL and softmax outputs, the gradient w.r.t. the layer 3 activation is:
∂l(f(x), y)/∂a^(3) = -(y - f(x)), where y is one-hot encoded.
Gradient of a^(3) w.r.t. W^(3): ∂a^(3)/∂W^(3) = (h^(2)(x))^T
Combined: ∂l(f(x), y)/∂W^(3) = -(y - f(x)) (h^(2)(x))^T
(Recall a^(3) = W^(3) h^(2)(x) + b^(3).)

Neural network backward pass
Gradient w.r.t. the previous layer:
∂l(f(x), y)/∂h^(2)(x) = (∂a^(3)/∂h^(2)(x)) · (∂l(f(x), y)/∂a^(3)) = (W^(3))^T ∂l(f(x), y)/∂a^(3)
(Recall a^(3) = W^(3) h^(2)(x) + b^(3).)

Neural network backward pass
Gradient w.r.t. the layer 2 weights:
∂l(f(x), y)/∂W^(2) = (∂l(f(x), y)/∂h^(2)(x)) · (∂h^(2)(x)/∂a^(2)) · (∂a^(2)/∂W^(2))
Same schema as before; we just have to compute the derivative of the activation function ∂h^(2)(x)/∂a^(2), e.g. for the sigmoid σ(·):
∂h^(2)_i(x)/∂a^(2)_i = σ(a_i)(1 - σ(a_i))
... and backprop even further.
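
Putting the last few slides together, a minimal NumPy sketch of the backward pass for the softmax/NLL network from the forward-pass sketch above (tanh hidden units; the names and shapes are the same illustrative assumptions as before, not the assignment stub's API):

    import numpy as np

    def forward_backward(x, y_onehot, params, t=np.tanh, dt=lambda a: 1.0 - np.tanh(a) ** 2):
        W1, b1, W2, b2, W3, b3 = params
        # forward pass, keeping the intermediate activations for the backward pass
        a1 = W1 @ x + b1;  h1 = t(a1)
        a2 = W2 @ h1 + b2; h2 = t(a2)
        a3 = W3 @ h2 + b3
        e = np.exp(a3 - np.max(a3)); f = e / e.sum()    # softmax output f(x)
        loss = -np.sum(y_onehot * np.log(f + 1e-12))    # NLL

        # backward pass: chain rule, layer by layer
        d_a3 = f - y_onehot                             # = -(y - f(x))
        d_W3 = np.outer(d_a3, h2); d_b3 = d_a3          # dl/dW3 = d_a3 (h2)^T
        d_h2 = W3.T @ d_a3                              # dl/dh2 = (W3)^T d_a3
        d_a2 = d_h2 * dt(a2)                            # element-wise activation derivative
        d_W2 = np.outer(d_a2, h1); d_b2 = d_a2
        d_h1 = W2.T @ d_a2
        d_a1 = d_h1 * dt(a1)
        d_W1 = np.outer(d_a1, x); d_b1 = d_a1
        return loss, (d_W1, d_b1, d_W2, d_b2, d_W3, d_b3)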

Gradient Checking
The backward pass is just repeated application of the chain rule. However, there is huge potential for bugs...
Gradient checking to the rescue (simply check your code via finite differences):
θ = (W^(1), b^(1), ...) initialized randomly
x initialized randomly; y initialized randomly
g_analytic = gradient ∂l(f_θ(x), y)/∂θ computed via backprop
for i in 1..#θ:
  θ' = θ
  θ'_i = θ'_i + ε
  g_numeric = (l(f_θ'(x), y) - l(f_θ(x), y)) / ε
  assert |g_numeric - (g_analytic)_i| < ε
Gradient checking can also be used to test partial implementations (i.e. layers, activation functions): simply remove the loss computation and backprop ones.
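
A sketch of such a check against the forward_backward function above, using central differences (slightly more accurate than the one-sided formula on the slide); the tolerance and ε are typical choices, not prescribed by the course.

    import numpy as np

    def grad_check(x, y_onehot, params, eps=1e-5, tol=1e-6):
        # analytic gradients via backprop (forward_backward is defined in the sketch above)
        _, grads = forward_backward(x, y_onehot, params)
        for p, g in zip(params, grads):
            for idx in np.ndindex(p.shape):
                old = p[idx]
                p[idx] = old + eps
                loss_plus, _ = forward_backward(x, y_onehot, params)
                p[idx] = old - eps
                loss_minus, _ = forward_backward(x, y_onehot, params)
                p[idx] = old                                  # restore the parameter
                g_numeric = (loss_plus - loss_minus) / (2 * eps)
                assert abs(g_numeric - g[idx]) < tol, (idx, g_numeric, g[idx])

    rng = np.random.RandomState(0)
    sizes = [(20, 5), (20,), (10, 20), (10,), (3, 10), (3,)]
    params = [0.1 * rng.randn(*s) for s in sizes]
    y = np.zeros(3); y[1] = 1.0
    grad_check(rng.randn(5), y, params)
    print("gradient check passed")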

Overfitting
If you train the parameters of a large network θ = (W^(1), b^(1), ...), you will see overfitting: L(f_θ, D) << L(f_θ, V)
This can be at least partly conquered with regularization.
Weight decay: change the cost (and the gradient) to
min_θ L(f_θ, D) = (1/N) Σ_{i=1}^N l(f_θ(x_i), y_i) + (1/#θ) Σ_{i=1}^{#θ} θ_i^2
This enforces small weights (Occam's razor).
Dropout: kill 50% of the activations in each hidden layer during the training forward pass, and multiply the hidden activations by 1/2 during testing. This prevents co-adaptation / enforces robustness to noise.
Many, many more!
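
A sketch of how these two regularizers typically look in code; the 0.5 dropout rate and the test-time scaling follow the slide, while the helper names are illustrative.

    import numpy as np

    def l2_penalty(params):
        # weight decay term: (1 / #theta) * sum_i theta_i^2, added to the data loss
        n_params = sum(p.size for p in params)
        return sum(np.sum(p ** 2) for p in params) / n_params

    def dropout(h, rng, p_drop=0.5, train=True):
        if train:
            mask = rng.rand(*h.shape) >= p_drop   # kill 50% of the activations during training
            return h * mask
        return h * (1.0 - p_drop)                 # multiply activations by 1/2 at test time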

Assignment 1
Implementation: Implement a simple feed-forward neural network by completing the provided stub. This includes:
possibility to use 2-4 layers
sigmoid/tanh and ReLU for the hidden layers
softmax output layer
optimization via gradient descent (gd)
optimization via stochastic gradient descent (sgd)
gradient checking code (!!!)
weight initialization with random noise (!!!) (use a normal distribution with changing std. deviation for now)
Bonus points for testing some advanced ideas:
implement dropout, l2 regularization
implement a different optimization scheme (rprop, rmsprop, adagrad)
Code stub: https://github.com/aisrobots/dl_lab_2017
Evaluation: Find good parameters (learning rate, number of iterations, etc.) using a validation set (usually take the last 10k examples from the training set). After optimizing the parameters, run on the full dataset and test once on the test set (you should be able to reach 1.6-1.8% error).
Submission: Clone our github repo and send us the link to your github/bitbucket repo including your solution code and report. Email to eitel@cs.uni-freiburg.de, hueglem@cs.uni-freiburg.de and zhang@cs.uni-freiburg.de

Slide Information
These slides were created by Tobias Springenberg for the DL Lab Course WS 2016/2017.