Machine Learning Lecture 10

Machine Learning Lecture 10: Neural Networks. 26.11.2018. Bastian Leibe, RWTH Aachen. http://www.vision.rwth-aachen.de, leibe@vision.rwth-aachen.de

Today's Topic: Deep Learning

Course Outline. Fundamentals: Bayes Decision Theory, Probability Density Estimation. Classification Approaches: Linear Discriminants, Support Vector Machines, Ensemble Methods & Boosting (Random Forests). Deep Learning: Foundations, Convolutional Neural Networks, Recurrent Neural Networks.

Recap: AdaBoost ("Adaptive Boosting"). Main idea [Freund & Schapire, 1996]: iteratively select an ensemble of component classifiers; after each iteration, reweight misclassified training examples, either by increasing their chance of being selected in a sampled training set or by increasing their misclassification cost when training on the full set. Components: $h_m(x)$, the weak or base classifiers (condition: <50% training error over any distribution); $H(x)$, the strong or final classifier. AdaBoost constructs the strong classifier as a thresholded linear combination of the weighted weak classifiers:
$H(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$

Recap: AdaBoost Algorithm
1. Initialization: Set $w_n^{(1)} = 1/N$ for $n = 1,\dots,N$.
2. For $m = 1,\dots,M$ iterations:
a) Train a new weak classifier $h_m(x)$ using the current weighting coefficients $W^{(m)}$ by minimizing the weighted error function $J_m = \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)$
b) Estimate the weighted error of this classifier on $X$: $\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}$
c) Calculate a weighting coefficient for $h_m(x)$: $\alpha_m = ?$
d) Update the weighting coefficients: $w_n^{(m+1)} = ?$
How should we do this exactly?

Recap: Minimizing Exponential Error. The original algorithm used an exponential error function
$E = \sum_{n=1}^{N} \exp\{-t_n f_m(x_n)\}$
where $f_m(x)$ is a classifier defined as a linear combination of base classifiers $h_l(x)$: $f_m(x) = \frac{1}{2}\sum_{l=1}^{m} \alpha_l h_l(x)$. Goal: minimize $E$ with respect to both the weighting coefficients $\alpha_l$ and the parameters of the base classifiers $h_l(x)$.

Recap: Minimizing Exponential Error. Sequential minimization (continuation from last lecture): only minimize with respect to $\alpha_m$ and $h_m(x)$, with $f_m(x) = \frac{1}{2}\sum_{l=1}^{m} \alpha_l h_l(x)$:
$E = \sum_{n=1}^{N} \exp\{-t_n f_m(x_n)\} = \sum_{n=1}^{N} \exp\left\{-t_n f_{m-1}(x_n) - \tfrac{1}{2} t_n \alpha_m h_m(x_n)\right\} = \sum_{n=1}^{N} w_n^{(m)} \exp\left\{-\tfrac{1}{2} t_n \alpha_m h_m(x_n)\right\}$
with $w_n^{(m)} = \exp\{-t_n f_{m-1}(x_n)\} = \text{const.}$ Splitting the sum into correctly and incorrectly classified points then gives
$E = \left(e^{\alpha_m/2} - e^{-\alpha_m/2}\right) \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n) + e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}$

AdaBoost: Minimizing Exponential Error. Minimize with respect to $h_m(x)$:
$E = \left(e^{\alpha_m/2} - e^{-\alpha_m/2}\right) \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n) + e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}$
Both prefactors are constant with respect to $h_m(x)$, so requiring $\frac{\partial E}{\partial h_m(x_n)} \overset{!}{=} 0$ is equivalent to minimizing
$J_m = \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)$
(our weighted error function from step 2a) of the algorithm). We're on the right track. Let's continue.

AdaBoost: Minimizing Exponential Error. Minimize with respect to $\alpha_m$:
$E = \left(e^{\alpha_m/2} - e^{-\alpha_m/2}\right) \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n) + e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}$
Setting $\frac{\partial E}{\partial \alpha_m} \overset{!}{=} 0$:
$\left(\tfrac{1}{2} e^{\alpha_m/2} + \tfrac{1}{2} e^{-\alpha_m/2}\right) \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n) = \tfrac{1}{2} e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}$
With the weighted error $\epsilon_m := \frac{\sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}$, this becomes
$\epsilon_m = \frac{e^{-\alpha_m/2}}{e^{\alpha_m/2} + e^{-\alpha_m/2}} = \frac{1}{e^{\alpha_m} + 1}$
Update for the coefficients: $\alpha_m = \ln\left\{\frac{1 - \epsilon_m}{\epsilon_m}\right\}$

AdaBoost: Minimizing Exponential Error. Remaining step: update the weights. Recall that
$E = \sum_{n=1}^{N} w_n^{(m)} \exp\left\{-\tfrac{1}{2} t_n \alpha_m h_m(x_n)\right\}$
Therefore
$w_n^{(m+1)} = w_n^{(m)} \exp\left\{-\tfrac{1}{2} t_n \alpha_m h_m(x_n)\right\} = \dots = w_n^{(m)} \exp\left\{\alpha_m I(h_m(x_n) \neq t_n)\right\}$
This is the update for the weight coefficients; it becomes $w_n^{(m+1)}$ in the next iteration.

AdaBoost Final Algorithm
1. Initialization: Set $w_n^{(1)} = 1/N$ for $n = 1,\dots,N$.
2. For $m = 1,\dots,M$ iterations:
a) Train a new weak classifier $h_m(x)$ using the current weighting coefficients $W^{(m)}$ by minimizing the weighted error function $J_m = \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)$
b) Estimate the weighted error of this classifier on $X$: $\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}$
c) Calculate a weighting coefficient for $h_m(x)$: $\alpha_m = \ln\left\{\frac{1 - \epsilon_m}{\epsilon_m}\right\}$
d) Update the weighting coefficients: $w_n^{(m+1)} = w_n^{(m)} \exp\left\{\alpha_m I(h_m(x_n) \neq t_n)\right\}$
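
For illustration, here is a minimal NumPy sketch of these steps, using decision stumps as the weak classifiers $h_m(x)$; the stump learner, the epsilon clipping, and the function names are assumptions of this sketch, not part of the lecture.

```python
import numpy as np

def fit_stump(X, t, w):
    """Assumed weak learner: a decision stump chosen to minimize the
    weighted error J_m = sum_n w_n * I(h(x_n) != t_n), with t in {-1,+1}."""
    best = None
    for d in range(X.shape[1]):
        for thresh in np.unique(X[:, d]):
            for sign in (+1, -1):
                pred = np.where(X[:, d] >= thresh, sign, -sign)
                err = np.sum(w * (pred != t))
                if best is None or err < best[0]:
                    best = (err, d, thresh, sign)
    return best[1:]                             # (feature, threshold, sign)

def stump_predict(stump, X):
    d, thresh, sign = stump
    return np.where(X[:, d] >= thresh, sign, -sign)

def adaboost_train(X, t, M=20):
    """AdaBoost as in the algorithm above, targets t in {-1,+1}."""
    N = len(t)
    w = np.full(N, 1.0 / N)                     # 1. w_n^(1) = 1/N
    ensemble = []
    for m in range(M):                          # 2. for m = 1,...,M
        stump = fit_stump(X, t, w)              # 2a. minimize J_m
        miss = stump_predict(stump, X) != t
        eps = np.sum(w * miss) / np.sum(w)      # 2b. weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)    # numerical safeguard (not on the slide)
        alpha = np.log((1 - eps) / eps)         # 2c. alpha_m
        w = w * np.exp(alpha * miss)            # 2d. reweight the training examples
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    """Strong classifier H(x) = sign(sum_m alpha_m h_m(x))."""
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))
```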

AdaBoost Analysis. Result of this derivation: we now know that AdaBoost minimizes an exponential error function in a sequential fashion. This allows us to analyze AdaBoost's behavior in more detail. In particular, we can see how robust it is to outlier data points.

Recap: Error Functions. [Plot of error functions versus $z_n = t_n y(x_n)$.] The ideal misclassification error function (black) is what we want to approximate. Unfortunately, it is not differentiable, and the gradient is zero for misclassified points, so we cannot minimize it by gradient descent. Image source: Bishop, 2006

Recap: Error Functions. [Plot: ideal misclassification error and squared error versus $z_n = t_n y(x_n)$.] Squared error, as used in least-squares classification: very popular, leads to closed-form solutions. However, it is sensitive to outliers due to the squared penalty, and it penalizes "too correct" data points. Generally does not lead to good classifiers. Image source: Bishop, 2006

Recap: Error Functions. [Plot: ideal misclassification, squared, and hinge error versus $z_n = t_n y(x_n)$.] Hinge error, as used in SVMs: zero error for points outside the margin ($z_n > 1$), which favors sparse solutions; linear penalty for misclassified points ($z_n < 1$), which makes it robust to outliers. It is not differentiable around $z_n = 1$ and cannot be optimized directly. Image source: Bishop, 2006

Discussion: AdaBoost Error Function. [Plot: ideal misclassification, squared, hinge, and exponential error versus $z_n = t_n y(x_n)$.] Exponential error, as used in AdaBoost: a continuous approximation to the ideal misclassification function; its sequential minimization leads to the simple AdaBoost scheme. Properties? Image source: Bishop, 2006

Discussion: AdaBoost Error Function. [Same plot; the exponential error is sensitive to outliers.] Exponential error, as used in AdaBoost: no penalty for "too correct" data points and fast convergence. Disadvantage: exponential penalty for large negative values of $z_n$, which makes it less robust to outliers and misclassified data points. Image source: Bishop, 2006

Discussion: Other Possible Error Functions. [Plot: ideal misclassification, squared, hinge, exponential, and cross-entropy error versus $z_n = t_n y(x_n)$.] Cross-entropy error, as used in logistic regression: $E = -\sum_n \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}$. It is similar to the exponential error for $z > 0$, but only grows linearly with large negative values of $z$. Making AdaBoost more robust by switching to this error function leads to GentleBoost. Image source: Bishop, 2006
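
As a quick numerical illustration (not part of the lecture), the following sketch evaluates the five error functions as functions of $z = t\,y(x)$; the cross-entropy curve is taken as the logistic loss $\ln(1 + e^{-z})$, whose exact scaling relative to Bishop's figure is an assumption here.

```python
import numpy as np

# Error functions from the comparison plots, all as functions of z = t * y(x).
z = np.linspace(-2.0, 2.0, 401)

misclassification = (z < 0).astype(float)     # ideal 0/1 error
squared           = (z - 1.0) ** 2            # least-squares classification
hinge             = np.maximum(0.0, 1.0 - z)  # SVM hinge loss
exponential       = np.exp(-z)                # AdaBoost
cross_entropy     = np.log1p(np.exp(-z))      # logistic loss (up to scaling)

# For large negative z the exponential error explodes, while the
# cross-entropy error only grows linearly -- the robustness argument above.
print(exponential[0], cross_entropy[0])       # at z = -2: ~7.39 vs ~2.13
```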

Summary: AdaBoost. Properties: simple combination of multiple classifiers; easy to implement; can be used with many different types of classifiers, none of which needs to be too good on its own (in fact, they only have to be slightly better than chance); commonly used in many areas; empirically good generalization capabilities. Limitations: the original AdaBoost is sensitive to misclassified training data points because of the exponential error function (improved by GentleBoost); it is a binary classifier, but multiclass extensions are available.

Today's Topic: Deep Learning

Topics of This Lecture: A Brief History of Neural Networks. Perceptrons (definition, loss functions, regularization, limits). Multi-Layer Perceptrons (definition, learning with hidden units). Obtaining the Gradients (naive analytical differentiation, numerical differentiation, backpropagation).

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron, and a cool learning algorithm: Perceptron Learning. Hardware implementation: the Mark I Perceptron, built for 20×20 pixel image analysis. "The embryo of an electronic computer that [...] will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." Image source: Wikipedia, clipartpanda.com

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert show that (single-layer) Perceptrons cannot solve all problems. Many misunderstood this to mean that neural networks were worthless. "Neural Networks don't work!" Image source: colourbox.de, thinkstock

A Brief History of Neural Networks 1957 Rosenblatt invents the Perceptron 1969 Minsky & Papert 1980s Resurgence of Neural Networks Some notable successes with multi-layer perceptrons. Backpropagation learning algorithm Oh no! Killer robots will achieve world domination! OMG! They work like the human brain! 24 Image sources: clipartpanda.com, cliparts.co

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. 1980s: Resurgence of Neural Networks: some notable successes with multi-layer perceptrons and the backpropagation learning algorithm. But they are hard to train, tend to overfit, and have unintuitive parameters. So, the excitement fades again... sigh! Image source: clipartof.com, colourbox.de

A Brief History of Neural Networks 1957 Rosenblatt invents the Perceptron 1969 Minsky & Papert 1980s Resurgence of Neural Networks 1995+ Interest shifts to other learning methods Notably Support Vector Machines Machine Learning becomes a discipline of its own. I can do science, me! 26 Image source: clipartof.com

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. 1980s: Resurgence of Neural Networks. 1995+: Interest shifts to other learning methods, notably Support Vector Machines; Machine Learning becomes a discipline of its own. The general public and the press still love Neural Networks. "I'm doing Machine Learning." "So, you're using Neural Networks?" "Actually..."

A Brief History of Neural Networks 1957 Rosenblatt invents the Perceptron 1969 Minsky & Papert 1980s Resurgence of Neural Networks 1995+ Interest shifts to other learning methods 2005+ Gradual progress Better understanding how to successfully train deep networks Availability of large datasets and powerful GPUs Still largely under the radar for many disciplines applying ML Are you using Neural Networks? Come on. Get real! 28

A Brief History of Neural Networks 1957 Rosenblatt invents the Perceptron 1969 Minsky & Papert 1980s Resurgence of Neural Networks 1995+ Interest shifts to other learning methods 2005+ Gradual progress 2012 Breakthrough results ImageNet Large Scale Visual Recognition Challenge A ConvNet halves the error rate of dedicated vision approaches. Deep Learning is widely adopted. It works! 29 Image source: clipartpanda.com, clipartof.com

Topics of This Lecture: A Brief History of Neural Networks. Perceptrons (definition, loss functions, regularization, limits). Multi-Layer Perceptrons (definition, learning with hidden units). Obtaining the Gradients (naive analytical differentiation, numerical differentiation, backpropagation).

Perceptrons (Rosenblatt 1957). Standard Perceptron: an input layer of hand-designed features (based on common sense), a layer of weights, and an output layer; the outputs can be linear or logistic. [Figure: network diagram with input layer, weights, and outputs.] Learning = determining the weights $w$.

Extension: Multi-Class Networks. One output node per class. [Figure: input layer, weights, output layer with several output nodes; again with linear or logistic outputs.] Can be used to do multidimensional linear regression or multiclass classification. Slide adapted from Stefan Roth

Extension: Non-Linear Basis Functions. Straightforward generalization: insert a feature layer with a fixed mapping between the input layer and the output layer. [Figure: output layer, weights, feature layer, fixed mapping, input layer; again with linear or logistic outputs.]

Extension: Non-Linear Basis Functions. [Figure: the same architecture, with weights $W_{kd}$ connecting the feature layer to the output layer.] Remarks: Perceptrons are generalized linear discriminants! Everything we know about the latter can also be applied here. Note: the feature functions $\phi(x)$ are kept fixed, not learned!

Perceptron Learning Very simple algorithm Process the training cases in some permutation If the output unit is correct, leave the weights alone. If the output unit incorrectly outputs a zero, add the input vector to the weight vector. If the output unit incorrectly outputs a one, subtract the input vector from the weight vector. This is guaranteed to converge to a correct solution if such a solution exists. Slide adapted from Geoff Hinton 35

Perceptron Learning. Let's analyze this algorithm: process the training cases in some permutation; if the output unit is correct, leave the weights alone; if it incorrectly outputs a zero, add the input vector to the weight vector; if it incorrectly outputs a one, subtract the input vector from the weight vector. Translation: $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - (y_k(x_n; \mathbf{w}) - t_{kn})\,\phi_j(x_n)$ Slide adapted from Geoff Hinton

Perceptron Learning. The same update rule, $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - (y_k(x_n; \mathbf{w}) - t_{kn})\,\phi_j(x_n)$, is the Delta rule, a.k.a. the LMS rule! Perceptron Learning corresponds to 1st-order (stochastic) Gradient Descent (e.g., of a quadratic error function)! Slide adapted from Geoff Hinton
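
A minimal NumPy sketch of this update rule (illustrative, not code from the lecture); `Phi` holds the precomputed fixed features $\phi(x_n)$, and the 0/1 target coding, thresholded outputs, and epoch loop are assumptions of this sketch.

```python
import numpy as np

def perceptron_train(Phi, T, n_epochs=100):
    """Delta-rule training of a single-layer perceptron.
    Phi: (N, D) matrix of fixed features phi(x_n); T: (N, K) 0/1 targets."""
    N, D = Phi.shape
    K = T.shape[1]
    W = np.zeros((K, D))
    for _ in range(n_epochs):
        for n in np.random.permutation(N):      # process the cases in some permutation
            y = (W @ Phi[n] > 0).astype(float)  # thresholded outputs of the K units
            # w_kj <- w_kj - (y_k - t_k) * phi_j(x_n):
            # no change if correct; add phi(x_n) if the unit wrongly outputs 0,
            # subtract phi(x_n) if it wrongly outputs 1
            W -= np.outer(y - T[n], Phi[n])
        if np.all(((Phi @ W.T) > 0).astype(float) == T):
            return W                            # converged: all training cases correct
    return W
```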

Loss Functions. We can now also apply other loss functions: L2 loss (least-squares regression); L1 loss (median regression); cross-entropy loss (logistic regression); hinge loss (SVM classification); softmax loss (multi-class probabilistic classification),
$L(t, y(x)) = -\sum_n \sum_k \left\{ I(t_n = k)\, \ln \frac{\exp(y_k(x_n))}{\sum_j \exp(y_j(x_n))} \right\}$

Regularization. In addition, we can apply regularizers, e.g. an L2 regularizer. This is known as weight decay in Neural Networks. We can also apply other regularizers, e.g. L1 for sparsity. Since Neural Networks often have many parameters, regularization becomes very important in practice. We will see more complex regularization techniques later on...
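
For concreteness, a standard way to write these penalties (a sketch in Bishop-style notation; the slide's own formula is not preserved in this transcription):

```latex
\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert_2^2
  \quad\text{(L2 regularizer / weight decay)}
\qquad
\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda\,\lVert\mathbf{w}\rVert_1
  \quad\text{(L1 regularizer / sparsity)}
```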

Limitations of Perceptrons. What makes the task difficult? Perceptrons with fixed, hand-coded input features can model any separable function perfectly... given the right input features. For some tasks this requires an exponential number of input features, e.g. by enumerating all possible binary input vectors as separate feature units (similar to a look-up table). But this approach won't generalize to unseen test cases! It is the feature design that solves the task! Once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn. Classic example: the XOR function.

Wait... Didn't we just say that Perceptrons correspond to generalized linear discriminants, and that Perceptrons are very limited? Doesn't this mean that what we have been doing so far in this lecture has the same problems? Yes, this is the case. A linear classifier cannot solve certain problems (e.g., XOR). However, with a non-linear classifier based on the right kind of features, the problem becomes solvable. So far, we have solved such problems by hand-designing good features $\phi$ and kernels $k(x, x') = \phi(x)^\top \phi(x')$. Can we also learn such feature representations?

Topics of This Lecture: A Brief History of Neural Networks. Perceptrons (definition, loss functions, regularization, limits). Multi-Layer Perceptrons (definition, learning with hidden units). Obtaining the Gradients (naive analytical differentiation, numerical differentiation, backpropagation).

Multi-Layer Perceptrons. Adding more layers: input layer, hidden layer, output layer, where the hidden-layer mapping is now learned (not fixed). [Figure: network diagram with input layer, hidden layer (mapping learned!), and output layer.] Slide adapted from Stefan Roth

Multi-Layer Perceptrons. Activation functions $g^{(k)}$: for example, $g^{(2)}(a) = \sigma(a)$, $g^{(1)}(a) = a$. The hidden layer can have an arbitrary number of nodes, and there can also be multiple hidden layers. Universal approximators: a 2-layer network (1 hidden layer) can approximate any continuous function on a compact domain arbitrarily well (assuming sufficiently many hidden nodes)! Slide credit: Stefan Roth
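
A small NumPy sketch of the corresponding 2-layer forward pass (illustrative only; the explicit bias terms and the default tanh/sigmoid activations are assumptions of this sketch, while $g^{(1)}$ and $g^{(2)}$ remain configurable as in the notation above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2, g1=np.tanh, g2=sigmoid):
    """Two-layer MLP: y = g2(W2 @ g1(W1 @ x + b1) + b2)."""
    z1 = W1 @ x + b1        # inputs to the hidden layer
    y1 = g1(z1)             # hidden-layer outputs
    z2 = W2 @ y1 + b2       # inputs to the output layer
    return g2(z2)           # network outputs

# Example: 3 inputs, 5 hidden nodes, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)
print(mlp_forward(rng.normal(size=3), W1, b1, W2, b2))
```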

Learning with Hidden Units. Networks without hidden units are very limited in what they can learn: more layers of linear units do not help (the result is still linear), and fixed output non-linearities are not enough. We need multiple layers of adaptive non-linear hidden units. But how can we train such nets? We need an efficient way of adapting all weights, not just the last layer. Learning the weights to the hidden units = learning features. This is difficult, because nobody tells us what the hidden units should do. This is the main challenge in deep learning. Slide adapted from Geoff Hinton

Learning with Hidden Units. How can we train multi-layer networks efficiently? We need an efficient way of adapting all weights, not just the last layer. Idea: Gradient Descent. Set up an error function consisting of a loss $L(\cdot)$ and a regularizer, e.g. an L2 loss with an L2 regularizer ("weight decay"), and update each weight in the direction of the gradient.

Gradient Descent. Two main steps: 1. Computing the gradients for each weight (today). 2. Adjusting the weights in the direction of the gradient (next lecture).

Topics of This Lecture: A Brief History of Neural Networks. Perceptrons (definition, loss functions, regularization, limits). Multi-Layer Perceptrons (definition, learning with hidden units). Obtaining the Gradients (naive analytical differentiation, numerical differentiation, backpropagation).

Obtaining the Gradients Approach 1: Naive Analytical Differentiation Compute the gradients for each variable analytically. What is the problem when doing this? 49

Excursion: Chain Rule of Differentiation One-dimensional case: Scalar functions 50

Excursion: Chain Rule of Differentiation Multi-dimensional case: Total derivative Need to sum over all paths that lead to the target variable x. 51
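
The formulas on these two slides did not survive the transcription; in standard notation (consistent with the backpropagation slides below), the two cases read:

```latex
% One-dimensional case, scalar functions y = y(z), z = z(x):
\frac{\mathrm{d}y}{\mathrm{d}x} = \frac{\mathrm{d}y}{\mathrm{d}z}\,\frac{\mathrm{d}z}{\mathrm{d}x}

% Multi-dimensional case (total derivative): sum over all intermediate
% variables z_1, ..., z_K, i.e. over all paths that lead to the target variable x:
\frac{\partial y}{\partial x} = \sum_{k=1}^{K} \frac{\partial y}{\partial z_k}\,\frac{\partial z_k}{\partial x}
```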

Obtaining the Gradients Approach 1: Naive Analytical Differentiation Compute the gradients for each variable analytically. What is the problem when doing this? With increasing depth, there will be exponentially many paths! Infeasible to compute this way. 52

Topics of This Lecture: A Brief History of Neural Networks. Perceptrons (definition, loss functions, regularization, limits). Multi-Layer Perceptrons (definition, learning with hidden units). Obtaining the Gradients (naive analytical differentiation, numerical differentiation, backpropagation).

Obtaining the Gradients. Approach 2: Numerical Differentiation. Given the current state $W^{(\tau)}$, we can evaluate $E(W^{(\tau)})$. Idea: make small changes to $W^{(\tau)}$ and accept those that improve $E(W^{(\tau)})$. Horribly inefficient! We need several forward passes for each weight, and each forward pass is one run over the entire dataset!
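
As a sketch of what this would look like in code (the central-difference formula and the function names are illustrative assumptions; the slide only describes perturbing the weights and re-evaluating $E$):

```python
import numpy as np

def numerical_gradient(E, W, eps=1e-6):
    """Approximate dE/dW by central differences.
    E: callable returning the error for a flat weight vector W
    (each call is one full forward pass over the dataset)."""
    grad = np.zeros_like(W)
    for i in range(W.size):                    # one loop iteration per weight...
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i] += eps
        W_minus[i] -= eps
        # ...and two evaluations of E per weight: horribly inefficient
        grad[i] = (E(W_plus) - E(W_minus)) / (2.0 * eps)
    return grad

# Toy usage with an assumed quadratic error:
E = lambda w: 0.5 * np.sum(w ** 2)
print(numerical_gradient(E, np.array([1.0, -2.0, 3.0])))   # ~ [1, -2, 3]
```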

Topics of This Lecture: A Brief History of Neural Networks. Perceptrons (definition, loss functions, regularization, limits). Multi-Layer Perceptrons (definition, learning with hidden units). Obtaining the Gradients (naive analytical differentiation, numerical differentiation, backpropagation).

Obtaining the Gradients Approach 3: Incremental Analytical Differentiation Idea: Compute the gradients layer by layer. Each layer below builds upon the results of the layer above. The gradient is propagated backwards through the layers. Backpropagation algorithm 56

Backpropagation Algorithm. Core steps: 1. Convert the discrepancy between each output and its target value into an error derivative $\partial E / \partial y_j$. 2. Compute the error derivatives in each hidden layer, $\partial E / \partial y_i^{(k-1)}$, from the error derivatives in the layer above. 3. Use the error derivatives w.r.t. activities to get the error derivatives w.r.t. the incoming weights, $\partial E / \partial w_{ji}^{(k-1)}$. Slide adapted from Geoff Hinton

Backpropagation Algorithm.
$\frac{\partial E}{\partial z_j} = \frac{\partial y_j}{\partial z_j}\frac{\partial E}{\partial y_j} = \frac{\partial g(z_j)}{\partial z_j}\,\frac{\partial E}{\partial y_j}$
Notation: $y_j^{(k)}$ is the output of layer $k$, $z_j^{(k)}$ is the input of layer $k$; connections: $z_j^{(k)} = \sum_i w_{ji}^{(k-1)} y_i^{(k-1)}$ and $y_j^{(k)} = g(z_j^{(k)})$. Slide adapted from Geoff Hinton

Backpropagation Algorithm. As before, $\frac{\partial E}{\partial z_j} = \frac{\partial g(z_j)}{\partial z_j}\frac{\partial E}{\partial y_j}$. In addition,
$\frac{\partial E}{\partial y_i^{(k-1)}} = \sum_j \frac{\partial z_j}{\partial y_i^{(k-1)}}\frac{\partial E}{\partial z_j} = \sum_j w_{ji}^{(k-1)}\,\frac{\partial E}{\partial z_j}$
using $\frac{\partial z_j}{\partial y_i^{(k-1)}} = w_{ji}^{(k-1)}$ from the connection equations $z_j^{(k)} = \sum_i w_{ji}^{(k-1)} y_i^{(k-1)}$, $y_j^{(k)} = g(z_j^{(k)})$. Slide adapted from Geoff Hinton

Backpropagation Algorithm. Finally,
$\frac{\partial E}{\partial w_{ji}^{(k-1)}} = \frac{\partial z_j}{\partial w_{ji}^{(k-1)}}\frac{\partial E}{\partial z_j} = y_i^{(k-1)}\,\frac{\partial E}{\partial z_j}$
using $\frac{\partial z_j}{\partial w_{ji}^{(k-1)}} = y_i^{(k-1)}$. Notation as before: $y_j^{(k)}$ is the output of layer $k$, $z_j^{(k)}$ is the input of layer $k$, with $z_j^{(k)} = \sum_i w_{ji}^{(k-1)} y_i^{(k-1)}$ and $y_j^{(k)} = g(z_j^{(k)})$. Slide adapted from Geoff Hinton

Backpropagation Algorithm. Putting the three equations together gives an efficient propagation scheme:
$\frac{\partial E}{\partial z_j} = \frac{\partial g(z_j)}{\partial z_j}\frac{\partial E}{\partial y_j}, \qquad \frac{\partial E}{\partial y_i^{(k-1)}} = \sum_j w_{ji}^{(k-1)}\frac{\partial E}{\partial z_j}, \qquad \frac{\partial E}{\partial w_{ji}^{(k-1)}} = y_i^{(k-1)}\frac{\partial E}{\partial z_j}$
Here $y_i^{(k-1)}$ is already known from the forward pass (Dynamic Programming): propagate the gradient back from layer $k$ and multiply with $y_i^{(k-1)}$. Slide adapted from Geoff Hinton

Summary: MLP Backpropagation. Forward Pass: for k = 1,..., l do (compute $z^{(k)}$ and $y^{(k)}$ for layer $k$) endfor. Backward Pass: for k = l, l-1,..., 1 do (compute the error derivatives for layer $k$ and pass them down) endfor. Notes: for efficiency, an entire batch of data X is processed at once; $\odot$ denotes the element-wise product.
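
Since the loop bodies are not preserved in this transcription, here is one plausible NumPy instantiation of the forward and backward passes for a batch X, assuming tanh hidden layers, a linear output layer, and an L2 loss (these choices are assumptions of this sketch, not the lecture's exact pseudocode):

```python
import numpy as np

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def mlp_backprop(X, T, weights):
    """Forward + backward pass for an MLP with L2 loss E = 0.5 * ||y - T||^2.
    X: (N, D) inputs; T: (N, K) targets; weights: list of (W, b) per layer."""
    L = len(weights)
    ys, zs = [X], []
    # Forward pass: for k = 1, ..., l
    for k, (W, b) in enumerate(weights):
        z = ys[-1] @ W.T + b                        # z^(k) = sum_i w_ji y_i^(k-1)
        zs.append(z)
        ys.append(z if k == L - 1 else np.tanh(z))  # y^(k) = g(z^(k)), linear output layer
    # Backward pass: for k = l, l-1, ..., 1
    grads = [None] * L
    dE_dy = ys[-1] - T                              # dE/dy at the output (L2 loss)
    for k in reversed(range(L)):
        g_prime = np.ones_like(zs[k]) if k == L - 1 else tanh_prime(zs[k])
        dE_dz = dE_dy * g_prime                     # element-wise product (the "⊙" above)
        grads[k] = (dE_dz.T @ ys[k], dE_dz.sum(axis=0))  # dE/dW, dE/db for this layer
        dE_dy = dE_dz @ weights[k][0]               # dE/dy^(k-1) = sum_j w_ji dE/dz_j
    return ys[-1], grads
```

Each backward-pass line matches one of the three derivative equations above: multiply element-wise by $g'(z)$, multiply by $y^{(k-1)}$ for the weight gradients, and multiply by $W$ to move one layer down.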

Analysis: Backpropagation. Backpropagation is the key to making deep NNs tractable. However... the Backprop algorithm given here is specific to MLPs; it does not work with more complex architectures, e.g. skip connections or recurrent networks! Whenever a new connection function induces a different functional form of the chain rule, you have to derive a new Backprop algorithm for it. Tedious... Let's analyze Backprop in more detail; this will lead us to a more flexible algorithm formulation. Next lecture.

References and Further Reading. More information on Neural Networks can be found in Chapters 6 and 7 of the Goodfellow & Bengio book: I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. https://goodfeli.github.io/dlbook/