Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1
Lecture Overview
o Machine Learning Paradigm for Neural Networks
o The Backpropagation algorithm for learning with a neural network
o Neural Networks as modular architectures
o Various Neural Network modules
o How to implement and check your very own module
The Machine Learning Paradigm UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING OPTIMIZING NEURAL NETWORKS IN THEORY AND IN PRACTICE - PAGE 3
Forward computations
o Collect annotated data
o Define model and initialize randomly
o Predict based on current model
  In neural network jargon: "forward propagation"
o Evaluate predictions
[Figure: Data (Input: X, Targets: Y) -> Model $h(x_i; \theta)$ -> Score/Prediction/Output $\hat{y}_i \propto h(x_i; \theta)$ -> Objective/Loss/Cost/Energy $\mathcal{L}(\theta; \hat{y}_i, h)$, e.g. $(y_i - \hat{y}_i)^2$]
Backward computations
o Collect gradient data
o Define model and initialize randomly
o Predict based on current model
  In neural network jargon: "backpropagation"
o Evaluate predictions
[Figure: the same pipeline traversed in reverse. Objective/Loss/Cost/Energy: $\frac{\partial \mathcal{L}(\theta; \hat{y}_i)}{\partial \hat{y}_i}$ -> Score/Prediction/Output: $\frac{\partial \hat{y}_i}{\partial h}$ -> Model: $\frac{\partial h(x_i)}{\partial \theta}$; Data (Input: X, Targets: Y)]
Optimization through Gradient Descent
o As with many models, we optimize our neural network with Gradient Descent: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
o The most important component in this formulation is the gradient
o Backpropagation to the rescue
  The backward computations of the network return the gradients
  How do we make the backward computations?
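The update rule above can be sketched in a few lines. This is a minimal illustration, assuming a fixed learning rate and a toy quadratic loss; the names `gradient_descent`, `grad_fn` and `target` are hypothetical and not part of the lecture.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, steps=100):
    """Iterate theta <- theta - lr * grad_fn(theta) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Toy loss L(theta) = 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = np.array([1.0, -2.0])
theta_star = gradient_descent(lambda th: th - target, theta0=np.zeros(2))
```

In a neural network, `grad_fn` is exactly what backpropagation has to provide.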
Backpropagation
What is a neural network again?
o A family of parametric, non-linear and hierarchical representation learning functions, which are massively optimized with stochastic gradient descent to encode domain knowledge, i.e. domain invariances, stationarity.
o $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1) \dots, \theta_{L-1}), \theta_L)$
  $x$: input, $\theta_l$: parameters for layer $l$, $a_l = h_l(x, \theta_l)$: (non-)linear function
o Given training corpus $\{X, Y\}$ find optimal parameters
  $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \ell(y, a_L(x; \theta_{1,\dots,L}))$
Neural network models
o A neural network model is a series of hierarchically connected functions
o These hierarchies can be very, very complex
  Forward connections (feedforward architecture)
  Interweaved connections (Directed Acyclic Graph architecture, DAGNN)
  Loopy connections (recurrent architecture, special care needed)
o Functions -> Modules
[Figure: the three architectures drawn as graphs of modules $h_1(x_i; \theta)$, ..., $h_5(x_i; \theta)$ between Input and Loss]
What is a module?
o A module is a building block for our network
o Each module is an object/function $a = h(x; \theta)$ that
  Contains trainable parameters ($\theta$)
  Receives as an argument an input $x$
  And returns an output $a$ based on the activation function $h(\cdot)$
o The activation function should be (at least) first-order differentiable (almost) everywhere
o For easier/more efficient backpropagation, store the module input
  easy to get the module output fast
  easy to compute derivatives
[Figure: network of modules from Input through $h_1(x_1; \theta_1)$, ..., $h_5(x_5; \theta_5)$ to Loss]
Anything goes or do special constraints exist?
o A neural network is a composition of modules (building blocks)
o Any architecture works
o If the architecture is a feedforward cascade, no special care is needed
o If it is acyclic, there is a right order for the forward computations
o If there are loops, these form recurrent connections (revisited later)
Forward computations for neural networks
o Simply compute the activation of each module in the network
  $a_l = h_l(x_l; \theta)$, where $a_l = x_{l+1}$ (or $x_l = a_{l-1}$)
o We need to know the precise function behind each module $h_l(\cdot)$
o Recursive operations
  One module's output is another's input
o Steps
  Visit modules one by one starting from the data input
  Some modules might have several inputs from multiple modules
o Compute module activations in the right order
  Make sure all the inputs are computed at the right time
[Figure: network of modules from Data Input through $h_1(x_1; \theta_1)$, ..., $h_5(x_5; \theta_5)$ to Loss]
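For a plain feedforward cascade, the forward computations amount to a loop over the modules. The sketch below assumes each module is just a Python callable; `forward_pass` and the three example modules are hypothetical names chosen for illustration.

```python
import numpy as np

def forward_pass(modules, x):
    """Run modules in order; return the final output and every module's input."""
    cached_inputs = []
    for h in modules:
        cached_inputs.append(x)  # store x_l, needed later by the backward pass
        x = h(x)                 # a_l = h_l(x_l) becomes x_{l+1}
    return x, cached_inputs

# Hypothetical three-module cascade: linear scaling, tanh, then a sum.
modules = [lambda x: 2.0 * x, np.tanh, np.sum]
output, cached_inputs = forward_pass(modules, np.array([0.0, 1.0]))
```

A DAG architecture would need the modules sorted topologically first; that ordering step is left out here.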
Backward computations for neural networks
o Simply compute the gradients of each module for our data
  We need to know the gradient formulation of each module $\partial h_l(x_l; \theta_l)$ w.r.t. its inputs $x_l$ and parameters $\theta_l$
o We need the forward computations first
  Their result is the sum of losses for our input data
o Then take the reverse network (reverse connections) and traverse it backwards
o Instead of using the activation functions, we use their gradients
o The whole process can be described very neatly and concisely with the backpropagation algorithm
[Figure: the reverse network, from dLoss(Input) through $dh_5(x_5; \theta_5)$, ..., $dh_1(x_1; \theta_1)$ back to Data]
Again, what is a neural network again?
o $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1) \dots, \theta_{L-1}), \theta_L)$
  $x$: input, $\theta_l$: parameters for layer $l$, $a_l = h_l(x, \theta_l)$: (non-)linear function
o Given training corpus $\{X, Y\}$ find optimal parameters
  $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \ell(y, a_L(x; \theta_{1,\dots,L}))$
o To use any gradient descent based optimization ($\theta^{(t+1)} = \theta^{(t)} - \eta_t \frac{\partial \mathcal{L}}{\partial \theta^{(t)}}$) we need the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$, $l = 1, \dots, L$
o How to compute the gradients for such a complicated function enclosing other functions, like $a_L(\dots)$?
Chain rule
o Assume a nested function, $z = f(y)$ and $y = g(x)$
o Chain rule for scalars $x, y, z$: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
o When $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $z \in \mathbb{R}$: $\frac{dz}{dx_i} = \sum_j \frac{dz}{dy_j} \frac{dy_j}{dx_i}$
  gradients from all possible paths
  e.g. for $m = 3$, $n = 2$: $\frac{dz}{dx_1} = \frac{dz}{dy_1} \frac{dy_1}{dx_1} + \frac{dz}{dy_2} \frac{dy_2}{dx_1}$, and likewise for $x_2$ and $x_3$
o or in vector notation $\frac{dz}{dx} = \left(\frac{dy}{dx}\right)^T \cdot \frac{dz}{dy}$
  $\frac{dy}{dx}$ is the Jacobian
[Figure: graph with $z$ on top, $y_1, y_2$ in the middle and $x_1, x_2, x_3$ at the bottom]
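The Jacobian
o When $x \in \mathbb{R}^3$, $y \in \mathbb{R}^2$
  $J(y(x)) = \frac{dy}{dx} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \end{bmatrix}$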
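The vector form of the chain rule can be checked numerically. The functions `g` and `f` below are made-up examples (not from the lecture) with $x \in \mathbb{R}^3$, $y \in \mathbb{R}^2$, $z \in \mathbb{R}$; the sketch compares $(dy/dx)^T \cdot dz/dy$ against central finite differences.

```python
import numpy as np

def g(x):
    return np.array([x[0] * x[1], x[1] + x[2] ** 2])

def f(y):
    return y[0] ** 2 + 3.0 * y[1]

def jacobian_g(x):
    # Rows index outputs y_j, columns index inputs x_i: shape (2, 3).
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 1.0, 2.0 * x[2]]])

def grad_f(y):
    return np.array([2.0 * y[0], 3.0])

x = np.array([1.0, 2.0, 3.0])
dz_dx = jacobian_g(x).T @ grad_f(g(x))  # (dy/dx)^T dz/dy

# Central finite differences on z = f(g(x)) as an independent check.
eps = 1e-6
numeric = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(3)
])
```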
Chain rule in practice
o $f(y) = \sin(y)$, $y = g(x) = 0.5 x^2$
  $\frac{df}{dx} = \frac{d[\sin(y)]}{dg} \cdot \frac{d(0.5 x^2)}{dx} = \cos(0.5 x^2) \cdot x$
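Backpropagation: Chain rule!!!
o The loss function $\mathcal{L}(y, a_L)$ depends on $a_L$, which depends on $a_{L-1}$, ..., which depends on $a_2$:
  $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1) \dots, \theta_{L-1}), \theta_L)$
o Gradients of parameters of layer $l$: chain rule
  $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdot \frac{\partial a_{L-1}}{\partial a_{L-2}} \cdot \ldots \cdot \frac{\partial a_l}{\partial \theta_l}$
o When shortened, we need two quantities
  $\frac{\partial \mathcal{L}}{\partial \theta_l} = \left(\frac{\partial a_l}{\partial \theta_l}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_l}$
  $\frac{\partial a_l}{\partial \theta_l}$: gradient of a module w.r.t. its parameters
  $\frac{\partial \mathcal{L}}{\partial a_l}$: gradient of loss w.r.t. the module output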
Backpropagation: Chain rule!!!
o For $\frac{\partial a_l}{\partial \theta_l}$ in $\frac{\partial \mathcal{L}}{\partial \theta_l} = \left(\frac{\partial a_l}{\partial \theta_l}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_l}$ we only need the Jacobian of the $l$-th module output $a_l$ w.r.t. the module's parameters $\theta_l$
o Very local rule, every module looks after its own
o Since computations can be very local
  graphs can be very complicated
  modules can be complicated (as long as they are differentiable)
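Backpropagation: Chain rule!!!
o For $\frac{\partial \mathcal{L}}{\partial a_l}$ in $\frac{\partial \mathcal{L}}{\partial \theta_l} = \left(\frac{\partial a_l}{\partial \theta_l}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_l}$ we apply the chain rule again
  $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial a_l}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
o We can rewrite $\frac{\partial a_{l+1}}{\partial a_l}$ as the gradient of the module w.r.t. its input
  Remember, the output of a module is the input for the next one: $a_l = x_{l+1}$, with $a_l = h_l(x_l; \theta_l)$ and $a_{l+1} = h_{l+1}(x_{l+1}; \theta_{l+1})$
  $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
  $\frac{\partial a_{l+1}}{\partial x_{l+1}}$: gradient w.r.t. the module input
o Recursive rule (good for us)!!!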
Multivariate functions f(x)
o Often module functions depend on multiple input variables
  Softmax! Each output dimension depends on multiple input dimensions
  $a_j = \frac{e^{x_j}}{e^{x_1} + e^{x_2} + e^{x_3}}$, $j = 1, 2, 3$
o For these cases, for $\frac{\partial a_l}{\partial x_l}$ (or $\frac{\partial a_l}{\partial \theta_l}$) we must compute the Jacobian matrix, as $a_l$ depends on multiple inputs $x_l$ (or $\theta_l$)
  e.g. in softmax $a_2$ depends on all of $e^{x_1}$, $e^{x_2}$ and $e^{x_3}$, not just on $e^{x_2}$
Diagonal Jacobians
o Often in modules each output depends only on a single input
  e.g. a sigmoid $a = \sigma(x)$, or $a = \tanh(x)$, or $a = \exp(x)$
  $a(x) = \sigma(x) = \sigma\left([x_1, x_2, x_3]^T\right) = [\sigma(x_1), \sigma(x_2), \sigma(x_3)]^T$
o No need for the full Jacobian, only the diagonal: anyway $\frac{da_i}{dx_j} = 0$, $\forall i \neq j$
  $\frac{da}{dx} = \frac{d\sigma}{dx} = \begin{bmatrix} \sigma(x_1)(1 - \sigma(x_1)) & 0 & 0 \\ 0 & \sigma(x_2)(1 - \sigma(x_2)) & 0 \\ 0 & 0 & \sigma(x_3)(1 - \sigma(x_3)) \end{bmatrix} \sim \begin{bmatrix} \sigma(x_1)(1 - \sigma(x_1)) \\ \sigma(x_2)(1 - \sigma(x_2)) \\ \sigma(x_3)(1 - \sigma(x_3)) \end{bmatrix}$
o Can rewrite the equations as element-wise products to save computations
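A small numerical sketch of the saving: multiplying by the full (diagonal) Jacobian gives the same result as an element-wise product with its diagonal. The vector `upstream` is a made-up stand-in for the gradient arriving from the layer above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-1.0, 0.0, 2.0])
upstream = np.array([0.5, -1.0, 2.0])   # stands in for dL/da

a = sigmoid(x)
diag = a * (1.0 - a)                    # only the diagonal of da/dx

full = np.diag(diag).T @ upstream       # full Jacobian-transpose product
cheap = diag * upstream                 # same result, element-wise
```

The element-wise version never materializes the d x d matrix, which matters once d is in the thousands.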
Dimension analysis
o To make sure everything is done correctly: "dimension analysis"
o The dimensions of the gradient w.r.t. $\theta_l$ must be equal to the dimensions of the respective weight $\theta_l$
  $\dim\left(\frac{\partial \mathcal{L}}{\partial a_l}\right) = \dim(a_l)$
  $\dim\left(\frac{\partial \mathcal{L}}{\partial \theta_l}\right) = \dim(\theta_l)$
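Dimension analysis
  $\dim(a_l) = d_l$, $\dim(\theta_l) = d_{l-1} \times d_l$
o For $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
  $[d_l \times 1] = [d_{l+1} \times d_l]^T \cdot [d_{l+1} \times 1]$
o For $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$
  $[d_{l-1} \times d_l] = [d_{l-1} \times 1] \cdot [1 \times d_l]$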
Backpropagation: Recursive chain rule
o Step 1. Compute forward propagations for all layers recursively
  $a_l = h_l(x_l)$ and $x_{l+1} = a_l$
o Step 2. Once done with forward propagation, follow the reverse path.
  Start from the last layer and for each new layer compute the gradients
  Cache computations when possible to avoid redundant operations
  $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
  (vector $[d_l \times 1]$ = Jacobian matrix $[d_{l+1} \times d_l]^T$ times vector $[d_{l+1} \times 1]$)
  $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$
  (matrix $[d_{l-1} \times d_l]$ = vector $[d_{l-1} \times 1]$ times vector $[1 \times d_l]$)
o Step 3. Use the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$ with Stochastic Gradient Descent to train
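Dimensionality analysis: An Example
o $d_{l-1} = 15$ (15 neurons), $d_l = 10$ (10 neurons), $d_{l+1} = 5$ (5 neurons)
o Let's say $a_l = \theta_l^T x_l$ and $a_{l+1} = \theta_{l+1}^T x_{l+1}$
o Forward computations
  $a_{l-1}$: $[15 \times 1]$, $a_l$: $[10 \times 1]$, $a_{l+1}$: $[5 \times 1]$
  $x_l$: $[15 \times 1]$, $x_{l+1}$: $[10 \times 1]$ (recall $x_l = a_{l-1}$)
  $\theta_l$: $[15 \times 10]$
o Gradients
  $\frac{\partial \mathcal{L}}{\partial a_l}$: $[5 \times 10]^T \cdot [5 \times 1] = [10 \times 1]$
  $\frac{\partial \mathcal{L}}{\partial \theta_l}$: $[15 \times 1] \cdot [10 \times 1]^T = [15 \times 10]$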
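The same dimension analysis can be replayed with array shapes. This is only a sketch: the values are random, `dL_da_next` is a placeholder for the gradient coming from the layer above, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x_l = rng.normal(size=(15, 1))
theta_l = rng.normal(size=(15, 10))
theta_next = rng.normal(size=(10, 5))

a_l = theta_l.T @ x_l            # [10 x 1]
a_next = theta_next.T @ a_l      # [5 x 1], since x_{l+1} = a_l

dL_da_next = rng.normal(size=(5, 1))   # placeholder for the gradient from above
jac_next = theta_next.T                # d a_{l+1} / d x_{l+1}: [5 x 10]
dL_da_l = jac_next.T @ dL_da_next      # [5 x 10]^T [5 x 1] = [10 x 1]
dL_dtheta_l = x_l @ dL_da_l.T          # [15 x 1] [1 x 10] = [15 x 10]
```

If any of these products fails with a shape error, the corresponding gradient formula is wrong before a single number has been checked.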
Intuitive Backpropagation
Backpropagation in practice
o Things are dead simple, just compute per module
  $\frac{\partial a(x; \theta)}{\partial x}$ and $\frac{\partial a(x; \theta)}{\partial \theta}$
o Then follow the iterative procedure [remember: $a_l = x_{l+1}$]
  $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
  $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$
  Module derivatives: $\frac{\partial a_{l+1}}{\partial x_{l+1}}$ and $\frac{\partial a_l}{\partial \theta_l}$
  Derivatives from the layer above: $\frac{\partial \mathcal{L}}{\partial a_{l+1}}$ and $\frac{\partial \mathcal{L}}{\partial a_l}$
Forward propagation
o For instance, let's consider our module is the function $\cos(\theta x) + \pi$
o The forward computation is simply
    import numpy as np

    # Method of a module object that holds its parameter in self.theta
    def forward(self, x):
        return np.cos(self.theta * x) + np.pi
Backward propagation
o The backpropagation for the function $\cos(\theta x) + \pi$
    import numpy as np

    # Methods of the same module object (parameter stored in self.theta)
    def backward_dx(self, x):
        # d/dx [cos(theta * x) + pi] = -theta * sin(theta * x)
        return -self.theta * np.sin(self.theta * x)

    def backward_dtheta(self, x):
        # d/dtheta [cos(theta * x) + pi] = -x * sin(theta * x)
        return -x * np.sin(self.theta * x)
Backpropagation: An example
Backpropagation visualization at epoch (t)
[Figure: three-module network, $L = 3$: $x_1 = (x_1^1, x_1^2, x_1^3, x_1^4)$ -> $a_1 = h_1(x_1, \theta_1)$ -> $a_2 = h_2(x_2, \theta_2)$ -> $a_3 = h_3(x_3) = \mathcal{L}$, with $\theta_3 = \emptyset$]
o Forward propagations: compute and store $a_1 = h_1(x_1)$, then $a_2 = h_2(x_2)$, then $a_3 = h_3(x_3)$
  Example: $a_1 = \sigma(\theta_1 x_1)$, $a_2 = \sigma(\theta_2 x_2)$, $a_3 = 0.5 \|y - x_3\|^2$
  Store!!!
o Backpropagation, step 1: $\frac{\partial \mathcal{L}}{\partial a_3}$, direct computation
  Example: $a_3 = \mathcal{L}(y, x_3) = h_3(x_3) = 0.5 \|y - x_3\|^2$, so $\frac{\partial \mathcal{L}}{\partial x_3} = -(y - x_3)$
o Backpropagation, step 2: $\frac{\partial \mathcal{L}}{\partial a_2} = \frac{\partial \mathcal{L}}{\partial a_3} \cdot \frac{\partial a_3}{\partial a_2}$ and $\frac{\partial \mathcal{L}}{\partial \theta_2} = \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial \theta_2}$
  Example: $\mathcal{L}(y, x_3) = 0.5 \|y - x_3\|^2$, $x_3 = a_2$, $a_2 = \sigma(\theta_2 x_2)$
  $\frac{\partial \mathcal{L}}{\partial a_2} = \frac{\partial \mathcal{L}}{\partial x_3} = -(y - x_3)$
  $\partial \sigma(x) = \sigma(x)(1 - \sigma(x))$
  $\frac{\partial a_2}{\partial \theta_2} = x_2 \sigma(\theta_2 x_2)(1 - \sigma(\theta_2 x_2)) = x_2 a_2 (1 - a_2)$ (stored during forward computations)
  $\frac{\partial \mathcal{L}}{\partial \theta_2} = \frac{\partial \mathcal{L}}{\partial a_2} \cdot x_2 a_2 (1 - a_2)$
o Backpropagation, step 3: $\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial a_1}$ and $\frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial \theta_1}$
  Example: $a_2 = \sigma(\theta_2 x_2)$, $x_2 = a_1$, $a_1 = \sigma(\theta_1 x_1)$
  $\frac{\partial a_2}{\partial a_1} = \frac{\partial a_2}{\partial x_2} = \theta_2 a_2 (1 - a_2)$
  $\frac{\partial a_1}{\partial \theta_1} = x_1 a_1 (1 - a_1)$
  $\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial a_2} \cdot \theta_2 a_2 (1 - a_2)$
  $\frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot x_1 a_1 (1 - a_1)$
  $\frac{\partial \mathcal{L}}{\partial a_2}$ is computed from the exact previous backpropagation step (remember, recursive rule)
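The three-module example can be run end to end for scalars. The sketch below assumes scalar $x_1$, $\theta_1$, $\theta_2$, $y$ and the loss $0.5 (y - x_3)^2$; the concrete numbers and the name `loss_and_grads` are made up, and the analytic gradients are compared against central finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(theta1, theta2, x1, y):
    # Forward: compute and store a1, a2 (x2 = a1, x3 = a2).
    a1 = sigmoid(theta1 * x1)
    a2 = sigmoid(theta2 * a1)
    loss = 0.5 * (y - a2) ** 2
    # Backward: start at the loss and reuse the stored activations.
    dL_da2 = -(y - a2)
    dL_dtheta2 = dL_da2 * a1 * a2 * (1.0 - a2)
    dL_da1 = dL_da2 * theta2 * a2 * (1.0 - a2)
    dL_dtheta1 = dL_da1 * x1 * a1 * (1.0 - a1)
    return loss, dL_dtheta1, dL_dtheta2

theta1, theta2, x1, y = 0.7, -1.2, 1.5, 1.0
loss, g1, g2 = loss_and_grads(theta1, theta2, x1, y)

# Numerical gradients, one parameter at a time.
eps = 1e-6
num_g1 = (loss_and_grads(theta1 + eps, theta2, x1, y)[0]
          - loss_and_grads(theta1 - eps, theta2, x1, y)[0]) / (2 * eps)
num_g2 = (loss_and_grads(theta1, theta2 + eps, x1, y)[0]
          - loss_and_grads(theta1, theta2 - eps, x1, y)[0]) / (2 * eps)
```

Note how `dL_da2` is computed once and reused for both `dL_dtheta2` and `dL_da1`, which is the caching the recursive rule relies on.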
Backpropagation visualization at epoch (t + 1)
o The same forward propagations (compute and store $a_1$, $a_2$, $a_3$) and the same backpropagation steps as at epoch (t) are repeated
Some practical tricks of the trade
o For classification use cross-entropy loss
o Use Stochastic Gradient Descent on mini-batches
o Shuffle training examples at each new epoch
o Normalize input variables: $(\mu, \sigma^2) = (0, 1)$, or only $\mu = 0$
Everything is a module
Linear module
o Activation function $a = \theta x$
o Gradient with respect to the input $\frac{\partial a}{\partial x} = \theta$
o Gradient with respect to the parameters $\frac{\partial a}{\partial \theta} = x$
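Sigmoid module
o Activation function $a = \sigma(x) = \frac{1}{1 + e^{-x}}$
o Gradient wrt the input $\frac{\partial a}{\partial x} = \sigma(x)(1 - \sigma(x))$
o When the sigmoid operates on $\theta x$:
  Gradient wrt the input $\frac{\partial \sigma(\theta x)}{\partial x} = \theta \cdot \sigma(\theta x)(1 - \sigma(\theta x))$
  Gradient wrt the parameters $\frac{\partial \sigma(\theta x)}{\partial \theta} = x \cdot \sigma(\theta x)(1 - \sigma(\theta x))$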
Sigmoid module: Pros and Cons
+ Output can be interpreted as probability
+ Output bounded in [0, 1], so the network cannot overshoot
- Always multiply with < 1, so gradients can be small in deep networks
- The gradients at the tails are flat (close to 0), so no serious SGD updates
  Overconfident, but not necessarily "correct"
  Neurons get stuck
Tanh module
o Activation function $a = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
o Gradient with respect to the input $\frac{\partial a}{\partial x} = 1 - \tanh^2(x)$
o Similar to sigmoid, but with a different output range: $[-1, +1]$ instead of $[0, +1]$
  Stronger gradients, because data is centered around 0 (not 0.5)
  Less bias to hidden layer neurons, as now outputs can be both positive and negative (more likely to have zero mean in the end)
Rectified Linear Unit (ReLU) module (Alexnet)
o Activation function $a = h(x) = \max(0, x)$
o Gradient wrt the input $\frac{\partial a}{\partial x} = \begin{cases} 0, & \text{if } x \le 0 \\ 1, & \text{if } x > 0 \end{cases}$
o Very popular in computer vision and speech recognition
o Much faster computations, gradients
  No vanishing or exploding problems, only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations
o No saturation
o Non-symmetric
o Non-differentiable at 0
o A large gradient during training can cause a neuron to "die". Lower learning rates mitigate the problem
[Figure: ReLU convergence rate, ReLU vs. Tanh]
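Other ReLUs
o Soft approximation (softplus): $a = h(x) = \ln(1 + e^x)$
o Noisy ReLU: $a = h(x) = \max(0, x + \varepsilon)$, $\varepsilon \sim \mathcal{N}(0, \sigma(x))$
o Leaky ReLU: $a = h(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.01x, & \text{otherwise} \end{cases}$
o Parametric ReLU: $a = h(x) = \begin{cases} x, & \text{if } x > 0 \\ \beta x, & \text{otherwise} \end{cases}$ (parameter $\beta$ is trainable)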
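The ReLU family is short enough to write out directly. A minimal NumPy sketch follows; the function names are hypothetical, and the parametric ReLU is represented only through the `slope` argument (its training is not shown).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 0 for x <= 0 and 1 for x > 0.
    return (x > 0).astype(float)

def leaky_relu(x, slope=0.01):
    # slope=0.01 gives the leaky ReLU; a trainable slope gives the parametric ReLU.
    return np.where(x > 0, x, slope * x)

def softplus(x):
    # ln(1 + e^x), written with logaddexp to avoid overflow for large x.
    return np.logaddexp(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
```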
Softmax module
o Activation function $a^{(k)} = \mathrm{softmax}(x^{(k)}) = \frac{e^{x^{(k)}}}{\sum_j e^{x^{(j)}}}$
  Outputs a probability distribution, $\sum_{k=1}^{K} a^{(k)} = 1$ for $K$ classes
o Because $e^{a+b} = e^a e^b$, we usually compute
  $a^{(k)} = \frac{e^{x^{(k)} - \mu}}{\sum_j e^{x^{(j)} - \mu}}$, $\mu = \max_k x^{(k)}$, because
  $\frac{e^{x^{(k)} - \mu}}{\sum_j e^{x^{(j)} - \mu}} = \frac{e^{-\mu} e^{x^{(k)}}}{e^{-\mu} \sum_j e^{x^{(j)}}} = \frac{e^{x^{(k)}}}{\sum_j e^{x^{(j)}}}$
o Avoid exponentiating large numbers: better stability
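The max-subtraction trick in code, as a minimal sketch for a single input vector (the function name and the test values are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)      # subtract mu = max_k x^(k)
    exps = np.exp(shifted)       # largest exponent is now e^0 = 1
    return exps / np.sum(exps)

# Without the shift, np.exp(1000.0) would overflow to inf.
probs = softmax(np.array([1000.0, 1001.0, 1002.0]))
```

Because the shift cancels in the ratio, `probs` is identical to the softmax of `[0, 1, 2]`.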
Euclidean loss module
o Activation function $a(x) = 0.5 \|y - x\|^2$
  Mostly used to measure the loss in regression tasks
o Gradient with respect to the input $\frac{\partial a}{\partial x} = x - y$
Cross-entropy loss (log-loss or log-likelihood) module
o Activation function $a(x) = -\sum_{k=1}^{K} y^{(k)} \log x^{(k)}$, $y^{(k)} \in \{0, 1\}$
o Gradient with respect to the input $\frac{\partial a}{\partial x^{(k)}} = -\frac{y^{(k)}}{x^{(k)}}$, i.e. $-\frac{1}{x^{(k)}}$ for the target class
o The cross-entropy loss is the most popular classification loss for classifiers that output probabilities (not SVM)
o Cross-entropy loss couples well with the softmax/sigmoid module
  Often the modules are combined and joint gradients are computed
o Generalization of logistic regression for more than 2 outputs
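As an illustration of the joint gradient: when softmax and cross-entropy are combined and the target is one-hot, the gradient w.r.t. the softmax input reduces to the standard result $\mathrm{softmax}(z) - y$. The sketch below checks this numerically; the names `z`, `joint_grad` and the example values are assumptions, not taken from the lecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, y):
    return -np.sum(y * np.log(p))

def joint_grad(z, y):
    # Gradient of cross_entropy(softmax(z), y) w.r.t. the softmax input z.
    return softmax(z) - y

z = np.array([0.2, -1.0, 1.5])
y = np.array([0.0, 0.0, 1.0])    # one-hot target

# Central finite differences, one input dimension at a time.
eps = 1e-6
numeric = np.array([
    (cross_entropy(softmax(z + eps * e), y)
     - cross_entropy(softmax(z - eps * e), y)) / (2 * eps)
    for e in np.eye(3)
])
```

The combined form also avoids dividing by a possibly tiny probability $x^{(k)}$.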
Many, many more modules out there
o Regularization modules
  Dropout
o Normalization modules
  $\ell_2$-normalization, $\ell_1$-normalization
  Question: When is a normalization module needed?
  Answer: Possibly when combining different modalities/networks (e.g. in Siamese or multiple-branch networks)
o Loss modules
  Hinge loss
o and others, which we are going to discuss later in the course
Composite modules or Make your own module
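Backpropagation again
o Step 1. Compute forward propagations for all layers recursively
  $a_l = h_l(x_l)$ and $x_{l+1} = a_l$
o Step 2. Once done with forward propagation, follow the reverse path.
  Start from the last layer and for each new layer compute the gradients
  Cache computations when possible to avoid redundant operations
  $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
  $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$
o Step 3. Use the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$ with Stochastic Gradient Descent to train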
New modules
o Everything can be a module, given some ground rules
o How to make our own module?
  Write a function that follows the ground rules
o Needs to be (at least) first-order differentiable (almost) everywhere
o Hence, we need to be able to compute $\frac{\partial a(x; \theta)}{\partial x}$ and $\frac{\partial a(x; \theta)}{\partial \theta}$
A module of modules
o As everything can be a module, a module of modules could also be a module
o We can therefore make new building blocks as we please, if we expect them to be used frequently
o Of course, the same rules for the eligibility of modules still apply
1 sigmoid == 2 modules?
o Assume the sigmoid $\sigma(\dots)$ operating on top of $\theta x$: $a = \sigma(\theta x)$
o Directly computing it leads to complicated backpropagation equations
o Since everything is a module, we can decompose this into 2 modules
  $a_1 = \theta x$ -> $a_2 = \sigma(a_1)$
1 sigmoid == 2 modules?
- Two backpropagation steps instead of one
+ But now our gradients are simpler
  Algorithmic way of computing gradients
  We avoid taking more gradients than needed in a (complex) non-linearity
  $a_1 = \theta x$ -> $a_2 = \sigma(a_1)$
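The decomposition can be sketched with two tiny module classes, each knowing only its own forward and backward rule. The class names, the scalar setting and the `forward`/`backward` interface are assumptions made for this illustration, not an interface prescribed by the lecture.

```python
import numpy as np

class Linear:
    def __init__(self, theta):
        self.theta = theta
    def forward(self, x):
        self.x = x                      # store the input for the backward pass
        return self.theta * x
    def backward(self, grad_out):
        self.grad_theta = grad_out * self.x
        return grad_out * self.theta    # gradient w.r.t. the input

class Sigmoid:
    def forward(self, x):
        self.a = 1.0 / (1.0 + np.exp(-x))
        return self.a
    def backward(self, grad_out):
        return grad_out * self.a * (1.0 - self.a)

linear, sigmoid = Linear(theta=0.5), Sigmoid()
a2 = sigmoid.forward(linear.forward(2.0))   # a1 = theta x, a2 = sigma(a1)
linear.backward(sigmoid.backward(1.0))      # seed with d a2 / d a2 = 1
```

After the two backward calls, `linear.grad_theta` holds $x \, \sigma(\theta x)(1 - \sigma(\theta x))$, the same value the single-module formula gives.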
Network-in-network [Lin et al., arXiv 2013]
ResNet [He et al., CVPR 2016]
Radial Basis Function (RBF) Network module
o RBF module $a = \sum_j u_j \exp(-\beta_j (x - w_j)^2)$
o Decompose into a cascade of modules
  $a_1 = (x - w)^2$
  $a_2 = \exp(-\beta a_1)$
  $a_3 = u a_2$
  $a_4 = \mathrm{plus}(\dots, a_3^{(j)}, \dots)$
o An RBF module is good for regression problems, in which case it is followed by a Euclidean loss module
o The Gaussian centers $w_j$ can be initialized externally, e.g. with k-means
An RBF visually
[Figure: cascade for three centers, from bottom to top: $a_1^{(j)} = (x - w_j)^2$ -> $a_2^{(j)} = \exp(-\beta_j a_1^{(j)})$ -> $a_3^{(j)} = u_j a_2^{(j)}$, $j = 1, 2, 3$ -> $a_4 = \mathrm{plus}(a_3)$ (RBF module output) -> $a_5 = (y - a_4)^2$]
Unit tests
Unit test
o Always check your implementations
  Not only for Deep Learning
o Does my implementation of the sin function return the correct values?
  If I execute sin(π/2), does it return 1 as it should?
o Even more important for gradient functions
  not only can our implementation be wrong, but also our math
o At the slightest sign of malfunction, ALWAYS RECHECK
  Ignoring problems never solved problems
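Gradient check
Original gradient definition: $\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
o Most dangerous part for new modules: getting the gradients wrong
o Compute gradient analytically
o Compute gradient computationally
o Compare
  $\Delta(\theta^{(i)}) = \left\| \frac{\partial a(x; \theta^{(i)})}{\partial \theta^{(i)}} - g(\theta^{(i)}) \right\|^2$, with $g(\theta^{(i)}) \approx \frac{a(\theta + \varepsilon) - a(\theta - \varepsilon)}{2\varepsilon}$
o If the difference is in the range $(10^{-4}, 10^{-7})$, then the gradients are good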
Gradient check
o Perturb one parameter $\theta^{(i)}$ at a time with $\theta^{(i)} + \varepsilon$
o Then check $\Delta(\theta^{(i)})$ for that one parameter only
o Do not perturb the whole parameter vector $\theta + \varepsilon$
  This will give wrong results (simple geometry)
o Sample dimensions of the gradient vector
  If you get a few dimensions of a gradient vector right, all is good
  Sample function and bias gradients equally, otherwise you might get your bias wrong
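A gradient check along these lines could look as follows. This is a minimal sketch: the helper names, the choice of $\varepsilon$ and the example module $a(\theta) = \sum \cos(\theta x)$ are assumptions made for illustration.

```python
import numpy as np

def numerical_grad(fn, theta, i, eps=1e-5):
    """Central difference of fn with respect to theta[i] only."""
    step = np.zeros_like(theta)
    step[i] = eps                       # perturb a single parameter
    return (fn(theta + step) - fn(theta - step)) / (2 * eps)

def gradient_check(fn, grad_fn, theta, indices, eps=1e-5):
    """Return the squared difference for each sampled dimension."""
    analytic = grad_fn(theta)
    return np.array([
        (analytic[i] - numerical_grad(fn, theta, i, eps)) ** 2
        for i in indices
    ])

# Hypothetical module output: a(theta) = sum(cos(theta * x)) for a fixed x.
x = np.array([0.5, -1.0, 2.0])
fn = lambda th: np.sum(np.cos(th * x))
grad_fn = lambda th: -x * np.sin(th * x)
theta = np.array([0.3, 1.1, -0.7])
diffs = gradient_check(fn, grad_fn, theta, indices=[0, 2])
```

Passing a deliberately wrong `grad_fn` (for example with the sign flipped) makes the returned differences jump by many orders of magnitude, which is the signal the check is looking for.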
Numerical gradients
o Can we replace analytical gradients with numerical gradients?
o In theory, yes!
o In practice, no!
  Too slow
Be creative!
o What about trigonometric modules?
o Or polynomial modules?
o Or new loss modules?
Summary
o Machine learning paradigm for neural networks
o Backpropagation algorithm, backbone for training neural networks
o Neural network == modular architecture
o Visited different modules, saw how to implement and check them
Reading material & references
o http://www.deeplearningbook.org/
  Part I: Chapter 2-5
  Part II: Chapter 6
Next lecture
o Optimizing deep networks
o Which loss functions per machine learning task
o Advanced modules
o Deep Learning theory