Lecture 2: Learning with neural networks
1 Lecture 2: Learning with neural networks (Deep Learning @ UvA)
2 Lecture Overview
o Machine Learning Paradigm for Neural Networks
o The Backpropagation algorithm for learning with a neural network
o Neural Networks as modular architectures
o Various Neural Network modules
o How to implement and check your very own module
3 The Machine Learning Paradigm
4 Forward computations
o Collect annotated data (input: X, targets: Y)
o Define the model and initialize it randomly
o Predict based on the current model (in neural network jargon: forward propagation)
o Evaluate the predictions
[Figure: data flows from the input $x_i$ through the model $h(x_i; \theta)$ to the score/prediction/output $\hat{y}_i = h(x_i; \theta)$ and then to the objective/loss/cost/energy $\mathcal{L}(\theta; y_i, h)$, e.g. $(y_i - \hat{y}_i)^2$]
8 Backward computations
o Collect annotated data (input: X, targets: Y)
o Define the model and initialize it randomly
o Predict based on the current model
o Evaluate the predictions and compute the gradients of the loss (in neural network jargon: backpropagation)
[Figure: the same pipeline traversed in reverse, propagating the gradients $\frac{\partial \mathcal{L}}{\partial \hat{y}_i}$ and $\frac{\partial h(x_i)}{\partial \theta}$ back from the loss $\mathcal{L}(\theta; y_i) = (y_i - \hat{y}_i)^2$ to the parameters $\theta$]
13 Optimization through Gradient Descent
o As with many models, we optimize our neural network with Gradient Descent: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
o The most important component in this formulation is the gradient
o Backpropagation to the rescue: the backward computations of the network return the gradients (a sketch of the update follows below)
o How to make the backward computations?
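A minimal sketch of the update rule in numpy (illustrative only; the function name and interface are assumptions, and grad_L stands for the gradient returned by backpropagation):

import numpy as np

def gradient_descent_step(theta, grad_L, eta):
    # theta^(t+1) = theta^(t) - eta_t * dL/dtheta
    return theta - eta * grad_L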
14 Backpropagation
15 What is a neural network again?
o A family of parametric, non-linear and hierarchical representation-learning functions, which are massively optimized with stochastic gradient descent to encode domain knowledge, i.e. domain invariances, stationarity
o $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$
  $x$: input, $\theta_l$: parameters for layer $l$, $a_l = h_l(x, \theta_l)$: (non-)linear function
o Given training corpus $\{X, Y\}$ find the optimal parameters $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \ell(y, a_L(x; \theta_{1,\dots,L}))$
16 Neural network models
o A neural network model is a series of hierarchically connected functions
o These hierarchies can be very, very complex
[Figure: forward connections (feedforward architecture): Input, then $h_1(x_i; \theta)$ through $h_5(x_i; \theta)$, then the Loss]
17 Neural network models
o A neural network model is a series of hierarchically connected functions
o These hierarchies can be very, very complex
[Figure: interweaved connections (Directed Acyclic Graph architecture, DAGNN): modules can receive inputs from several earlier modules on the way to the Loss]
18 Neural network models
o A neural network model is a series of hierarchically connected functions
o These hierarchies can be very, very complex
[Figure: loopy connections (recurrent architecture, special care needed): a module's output feeds back into an earlier module]
19 Neural network models
o A neural network model is a series of hierarchically connected functions
o These hierarchies can be very, very complex
o Functions == Modules
[Figure: the same network drawn both vertically and horizontally: Input, then $h_1(x_i; \theta)$ through $h_5(x_i; \theta)$, then the Loss, with every function treated as a module]
20 What is a module?
o A module is a building block for our network
o Each module is an object/function $a = h(x; \theta)$ that contains trainable parameters $\theta$, receives as an argument an input $x$, and returns an output $a$ based on the activation function $h(\cdot)$
o The activation function should be (at least) first-order differentiable (almost) everywhere
o For easier/more efficient backpropagation, store the module input: easy to get the module output fast, easy to compute the derivatives
[Figure: a stack of modules $h_1(x_1; \theta_1)$ through $h_5(x_5; \theta_5)$ between the Input and the Loss]
21 Anything goes, or do special constraints exist?
o A neural network is a composition of modules (building blocks)
o Any architecture works
o If the architecture is a feedforward cascade, no special care is needed
o If the architecture is acyclic, there is a right order for computing the forward computations
o If there are loops, these form recurrent connections (revisited later)
22 Forward computations for neural networks
o Simply compute the activation of each module in the network: $a_l = h_l(x_l; \theta_l)$, where $a_l = x_{l+1}$ (or $x_l = a_{l-1}$)
o We need to know the precise function behind each module $h_l(\cdot)$
o Recursive operations: one module's output is another's input
o Steps: visit modules one by one starting from the data input; some modules might have several inputs from multiple modules
o Compute module activations in the right order, making sure all inputs are computed at the right time (a sketch follows below)
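A minimal sketch of this forward loop for a plain feedforward cascade (the module list and the forward-method interface are assumptions, not the course's actual API):

def forward_pass(modules, x):
    activations = []
    for module in modules:       # visit modules in topological order
        x = module.forward(x)    # a_l = h_l(x_l; theta_l), with x_{l+1} = a_l
        activations.append(x)    # store for the backward pass
    return x, activations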
23 Backward computations for neural networks
o Simply compute the gradients of each module for our data; we need to know the gradient formulation of each module $h_l(x_l; \theta_l)$ w.r.t. its inputs $x_l$ and parameters $\theta_l$
o We need the forward computations first; their result is the sum of losses for our input data
o Then take the reverse network (reverse the connections) and traverse it backwards
o Instead of using the activation functions, we use their gradients
o The whole process can be described very neatly and concisely with the backpropagation algorithm
[Figure: the reversed network, from dLoss(Input) through $dh_5(x_5; \theta_5)$, $dh_4(x_4; \theta_4)$, $dh_3(x_3; \theta_3)$, $dh_2(x_2; \theta_2)$, down to $dh_1(x_1; \theta_1)$]
24 Again, what is a neural network again?
o $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$
  $x$: input, $\theta_l$: parameters for layer $l$, $a_l = h_l(x, \theta_l)$: (non-)linear function
o Given training corpus $\{X, Y\}$ find the optimal parameters $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \ell(y, a_L(x; \theta_{1,\dots,L}))$
o To use any gradient-descent-based optimization ($\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$) we need the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$, $l = 1, \dots, L$
o How to compute the gradients for such a complicated function enclosing other functions, like $a_L(\cdot)$?
25 Chain rule
o Assume a nested function, $z = f(y)$ and $y = g(x)$
o Chain rule for scalars $x, y, z$: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
o When $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $z \in \mathbb{R}$: $\frac{dz}{dx_i} = \sum_j \frac{dz}{dy_j} \frac{dy_j}{dx_i}$, i.e. gradients from all possible paths
[Figure: computation graph with output $z$, intermediate nodes $y^{(1)}, y^{(2)}$, and inputs $x^{(1)}, x^{(2)}, x^{(3)}$]
26 Chain rule
o For the graph above, expanding the sum for each input dimension:
  $\frac{dz}{dx_1} = \frac{dz}{dy_1} \frac{dy_1}{dx_1} + \frac{dz}{dy_2} \frac{dy_2}{dx_1}$
  $\frac{dz}{dx_2} = \frac{dz}{dy_1} \frac{dy_1}{dx_2} + \frac{dz}{dy_2} \frac{dy_2}{dx_2}$
  $\frac{dz}{dx_3} = \frac{dz}{dy_1} \frac{dy_1}{dx_3} + \frac{dz}{dy_2} \frac{dy_2}{dx_3}$
29 Chain rule
o Assume a nested function, $z = f(y)$ and $y = g(x)$
o Chain rule for scalars $x, y, z$: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
o When $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $z \in \mathbb{R}$: $\frac{dz}{dx_i} = \sum_j \frac{dz}{dy_j} \frac{dy_j}{dx_i}$ (gradients from all possible paths), or in vector notation $\frac{dz}{dx} = \left(\frac{dy}{dx}\right)^T \frac{dz}{dy}$, where $\frac{dy}{dx}$ is the Jacobian
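A small numerical illustration of the vector form (values are arbitrary): for $y = Ax$ the Jacobian $\frac{dy}{dx}$ is simply $A$, so $\frac{dz}{dx} = A^T \frac{dz}{dy}$:

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # y in R^2, x in R^3, Jacobian dy/dx = A
dz_dy = np.array([0.5, -1.0])     # gradient of the scalar z w.r.t. y
dz_dx = A.T @ dz_dy               # sums the contributions of all paths per x_i
print(dz_dx)                      # shape (3,)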
30 The Jacobian
o When $x \in \mathbb{R}^3$, $y \in \mathbb{R}^2$:
  $J_y(x) = \frac{dy}{dx} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \end{pmatrix}$
31 Chain rule in practice
o $f(y) = \sin(y)$, $y = g(x) = 0.5 x^2$
o $\frac{df}{dx} = \frac{d[\sin(y)]}{dy} \cdot \frac{d[0.5 x^2]}{dx} = \cos(0.5 x^2) \cdot x$
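A quick numerical sanity check of this worked example (illustrative):

import numpy as np

x = 1.3
analytic = np.cos(0.5 * x**2) * x
h = 1e-6
numeric = (np.sin(0.5 * (x + h)**2) - np.sin(0.5 * (x - h)**2)) / (2 * h)
print(analytic, numeric)  # the two values agree to roughly 1e-10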
32 Backpropagation: Chain rule!!!
o The loss function $\mathcal{L}(y, a_L)$ depends on $a_L$, which depends on $a_{L-1}$, ..., which depends on $a_l$: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$
o Gradients of the parameters of layer $l$: chain rule, $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdot \frac{\partial a_{L-1}}{\partial a_{L-2}} \cdots \frac{\partial a_l}{\partial \theta_l}$
o When shortened, we need two quantities: $\frac{\partial \mathcal{L}}{\partial \theta_l} = \left(\frac{\partial a_l}{\partial \theta_l}\right)^T \frac{\partial \mathcal{L}}{\partial a_l}$, the gradient of a module w.r.t. its parameters and the gradient of the loss w.r.t. the module output
33 Backpropagation: Chain rule!!!
o For $\frac{\partial a_l}{\partial \theta_l}$ in $\frac{\partial \mathcal{L}}{\partial \theta_l} = \left(\frac{\partial a_l}{\partial \theta_l}\right)^T \frac{\partial \mathcal{L}}{\partial a_l}$ we only need the Jacobian of the $l$-th module output $a_l$ w.r.t. the module's parameters $\theta_l$
o Very local rule: every module looks only after its own parameters
o Since computations can be very local, graphs can be very complicated and modules can be complicated (as long as they are differentiable)
34 Backpropagation: Chain rule!!!
o For $\frac{\partial \mathcal{L}}{\partial a_l}$ in $\frac{\partial \mathcal{L}}{\partial \theta_l} = \left(\frac{\partial a_l}{\partial \theta_l}\right)^T \frac{\partial \mathcal{L}}{\partial a_l}$ we apply the chain rule again: $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial a_l}\right)^T \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
o We can rewrite $\frac{\partial a_{l+1}}{\partial a_l}$ as the gradient of the module w.r.t. its input: remember, the output of a module is the input of the next one, $a_l = x_{l+1}$, so with $a_{l+1} = h_{l+1}(x_{l+1}; \theta_{l+1})$ we get $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \frac{\partial \mathcal{L}}{\partial a_{l+1}}$
o Recursive rule (good for us)!!!
35 Multivariate functions f(x)
o Often module functions depend on multiple input variables; softmax is an example, where each output dimension depends on multiple input dimensions: $a_j = \frac{e^{x_j}}{e^{x_1} + e^{x_2} + e^{x_3}}$, $j = 1, 2, 3$
o For these cases, for $\frac{\partial a_l}{\partial x_l}$ (or $\frac{\partial a_l}{\partial \theta_l}$) we must compute the full Jacobian matrix, as $a_l$ depends on multiple inputs $x_l$ (or $\theta_l$)
o E.g. in softmax $a_2$ depends on all of $e^{x_1}$, $e^{x_2}$ and $e^{x_3}$, not just on $e^{x_2}$
36 Diagonal Jacobians
o Often in modules each output dimension depends only on a single input dimension, e.g. the sigmoid $a = \sigma(x)$, or $a = \tanh(x)$, or $a = \exp(x)$, applied elementwise: $a = \sigma(x) = \sigma\left(\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}\right) = \begin{pmatrix} \sigma(x_1) \\ \sigma(x_2) \\ \sigma(x_3) \end{pmatrix}$
o No need for the full Jacobian, only the diagonal: anyway $\frac{da_i}{dx_j} = 0$ for $i \neq j$:
  $\frac{da}{dx} = \frac{d\sigma}{dx} = \begin{pmatrix} \sigma(x_1)(1 - \sigma(x_1)) & 0 & 0 \\ 0 & \sigma(x_2)(1 - \sigma(x_2)) & 0 \\ 0 & 0 & \sigma(x_3)(1 - \sigma(x_3)) \end{pmatrix} \sim \begin{pmatrix} \sigma(x_1)(1 - \sigma(x_1)) \\ \sigma(x_2)(1 - \sigma(x_2)) \\ \sigma(x_3)(1 - \sigma(x_3)) \end{pmatrix}$
o We can rewrite the equations as elementwise products to save computations, as in the sketch below
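A minimal sketch of that saving for the sigmoid (function names are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(x, dL_da):
    # diag(sigma'(x)) @ dL_da collapses to an elementwise product
    a = sigmoid(x)
    return dL_da * a * (1.0 - a)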
37 Dimension analysis
o To make sure everything is done correctly: dimension analysis
o The dimensions of the gradient w.r.t. $\theta_l$ must equal the dimensions of the respective weight $\theta_l$:
  $\dim\left(\frac{\partial \mathcal{L}}{\partial a_l}\right) = \dim(a_l)$ and $\dim\left(\frac{\partial \mathcal{L}}{\partial \theta_l}\right) = \dim(\theta_l)$
38 Dimension analysis
o For $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \frac{\partial \mathcal{L}}{\partial a_{l+1}}$, with $\dim(a_l) = d_l$:
  $[d_l \times 1] = [d_{l+1} \times d_l]^T \cdot [d_{l+1} \times 1]$
o For $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$, with $\dim(\theta_l) = d_{l-1} \times d_l$:
  $[d_{l-1} \times d_l] = [d_{l-1} \times 1] \cdot [1 \times d_l]$
39 Backpropagation: Recursive chain rule
o Step 1. Compute the forward propagations for all layers recursively: $a_l = h_l(x_l)$ and $x_{l+1} = a_l$
o Step 2. Once done with forward propagation, follow the reverse path. Start from the last layer and for each new layer compute the gradients; cache computations when possible to avoid redundant operations:
  $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \frac{\partial \mathcal{L}}{\partial a_{l+1}}$ and $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$
o Step 3. Use the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$ with Stochastic Gradient Descent to train
40 Backpropagation: Recursive chain rule
o The same three steps, annotated with dimensions:
  $\frac{\partial \mathcal{L}}{\partial a_l}$ is a vector with dimensions $[d_l \times 1]$, computed as a Jacobian with dimensions $[d_{l+1} \times d_l]$, transposed, times a vector with dimensions $[d_{l+1} \times 1]$
  $\frac{\partial \mathcal{L}}{\partial \theta_l}$ is a matrix with dimensions $[d_{l-1} \times d_l]$, computed as a vector with dimensions $[d_{l-1} \times 1]$ times a vector with dimensions $[1 \times d_l]$
41 Dimensionality analysis: An example
o $d_{l-1} = 15$ (15 neurons), $d_l = 10$ (10 neurons), $d_{l+1} = 5$ (5 neurons)
o Let's say $a_l = \theta_l^T x_l$ and $a_{l+1} = \theta_{l+1}^T x_{l+1}$
o Forward computations:
  $x_l: [15 \times 1]$, $a_l: [10 \times 1]$, $x_{l+1} = a_l: [10 \times 1]$, $a_{l+1}: [5 \times 1]$, $\theta_l: [15 \times 10]$
o Gradients:
  $\frac{\partial \mathcal{L}}{\partial a_l} = [5 \times 10]^T \cdot [5 \times 1] = [10 \times 1]$
  $\frac{\partial \mathcal{L}}{\partial \theta_l} = x_l \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T = [15 \times 1] \cdot [1 \times 10] = [15 \times 10]$
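Replaying this example with random arrays (the shapes are the point; the values are arbitrary):

import numpy as np

d_prev, d_l, d_next = 15, 10, 5
x_l = np.random.randn(d_prev, 1)           # [15 x 1]
theta_l = np.random.randn(d_prev, d_l)     # [15 x 10]
a_l = theta_l.T @ x_l                      # [10 x 1]
theta_next = np.random.randn(d_l, d_next)  # [10 x 5]
dL_da_next = np.random.randn(d_next, 1)    # [5 x 1]
dL_da_l = theta_next @ dL_da_next          # [5 x 10]^T [5 x 1] -> [10 x 1]
dL_dtheta_l = x_l @ dL_da_l.T              # [15 x 1][1 x 10] -> [15 x 10]
print(dL_da_l.shape, dL_dtheta_l.shape)    # matches dim(a_l) and dim(theta_l)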
42 Intuitive Backpropagation
43 Backpropagation in practice
o Things are dead simple, just compute per module: $\frac{\partial a(x; \theta)}{\partial x}$ and $\frac{\partial a(x; \theta)}{\partial \theta}$
o Then follow the iterative procedure:
  $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \frac{\partial \mathcal{L}}{\partial a_{l+1}}$ and $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$
44 Backpropagation in practice
o Things are dead simple, just compute per module: $\frac{\partial a(x; \theta)}{\partial x}$ and $\frac{\partial a(x; \theta)}{\partial \theta}$
o Then follow the iterative procedure [remember: $a_l = x_{l+1}$]: the derivatives from the layer above, $\frac{\partial \mathcal{L}}{\partial a_{l+1}}$, are combined with the module derivatives $\frac{\partial a_{l+1}}{\partial x_{l+1}}$ and $\frac{\partial a_l}{\partial \theta_l}$ in the same two equations, as sketched below
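A minimal sketch of this iterative procedure for a feedforward cascade. The interface is an assumption: each module stored its input during the forward pass, and its backward methods take the incoming gradient $\frac{\partial \mathcal{L}}{\partial a_l}$ and return loss gradients directly (the single-module code on the next slides uses a slightly different convention, returning local derivatives given the input):

def backward_pass(modules, dL_da):
    # dL_da starts as dL/da_L, the gradient of the loss w.r.t. the last output
    grads = []
    for module in reversed(modules):
        grads.append(module.backward_dtheta(dL_da))  # dL/dtheta_l
        dL_da = module.backward_dx(dL_da)            # recursive rule: dL/da_{l-1}
    return list(reversed(grads))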
45 Forward propagation
o For instance, let's consider that our module is the function $\cos(\theta x) + \pi$
o The forward computation is then simply (the original snippet uses self.theta, so it is wrapped here in a module class; the class name CosModule is an editorial addition to make it runnable):

import numpy as np

class CosModule:
    def __init__(self, theta):
        self.theta = theta

    def forward(self, x):
        return np.cos(self.theta * x) + np.pi
46 Backward propagation
o The backpropagation for the function $\cos(\theta x) + \pi$, continuing the module class above:

    def backward_dx(self, x):
        # d/dx [cos(theta*x) + pi] = -theta * sin(theta*x)
        return -self.theta * np.sin(self.theta * x)

    def backward_dtheta(self, x):
        # d/dtheta [cos(theta*x) + pi] = -x * sin(theta*x)
        return -x * np.sin(self.theta * x)
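A hypothetical usage of this module, with a quick centered-difference check of backward_dx:

m = CosModule(theta=2.0)
x, h = 0.7, 1e-6
numeric = (m.forward(x + h) - m.forward(x - h)) / (2 * h)
print(m.backward_dx(x), numeric)  # should agree to roughly 1e-9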
47 Backpropagation: An example
48 Backpropagation visualization
[Figure: a chain of L = 3 modules: input $x_1 = (x_1^{(1)}, x_1^{(2)}, x_1^{(3)}, x_1^{(4)})$, then $a_1 = h_1(x_1, \theta_1)$, then $a_2 = h_2(x_2, \theta_2)$, and on top $a_3 = h_3(x_3) = \mathcal{L}$, the loss module, which has no parameters]
49 Backpropagation visualization at epoch (t)
o Forward propagations: compute and store $a_1 = h_1(x_1)$
o Example: $a_1 = \sigma(\theta_1 x_1)$ (store!!!)
50 Backpropagation visualization at epoch (t)
o Forward propagations: compute and store $a_2 = h_2(x_2)$
o Example: $a_1 = \sigma(\theta_1 x_1)$, $a_2 = \sigma(\theta_2 x_2)$ (store!!!)
51 Backpropagation visualization at epoch (t)
o Forward propagations: compute and store $a_3 = h_3(x_3)$
o Example: $a_1 = \sigma(\theta_1 x_1)$, $a_2 = \sigma(\theta_2 x_2)$, $a_3 = 0.5 (y - x_3)^2$ (store!!!)
52 Backpropagation visualization at epoch (t)
o Backpropagation: $\frac{\partial \mathcal{L}}{\partial a_3}$ by direct computation
o Example: $a_3 = \mathcal{L}(y, x_3) = h_3(x_3) = 0.5 (y - x_3)^2$, so $\frac{\partial \mathcal{L}}{\partial x_3} = -(y - x_3)$
53 Backpropagation visualization at epoch (t)
o Backpropagation: $\frac{\partial \mathcal{L}}{\partial a_2} = \frac{\partial a_3}{\partial a_2} \cdot \frac{\partial \mathcal{L}}{\partial a_3}$ and $\frac{\partial \mathcal{L}}{\partial \theta_2} = \frac{\partial a_2}{\partial \theta_2} \cdot \frac{\partial \mathcal{L}}{\partial a_2}$
o Example: $\mathcal{L}(y, x_3) = 0.5 (y - x_3)^2$ with $x_3 = a_2$ and $a_2 = \sigma(\theta_2 x_2)$
  $\frac{\partial \mathcal{L}}{\partial a_2} = \frac{\partial \mathcal{L}}{\partial x_3} = -(y - x_3)$
  $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$, so $\frac{\partial a_2}{\partial \theta_2} = x_2 \, \sigma(\theta_2 x_2)(1 - \sigma(\theta_2 x_2)) = x_2 \, a_2 (1 - a_2)$ (with $a_2$ stored during the forward computations)
  $\frac{\partial \mathcal{L}}{\partial \theta_2} = -(y - a_2) \cdot x_2 \, a_2 (1 - a_2)$
54 Backpropagation visualization at epoch (t)
o Backpropagation: $\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial \mathcal{L}}{\partial a_2}$ and $\frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial a_1}{\partial \theta_1} \cdot \frac{\partial \mathcal{L}}{\partial a_1}$
o Example: $a_2 = \sigma(\theta_2 x_2)$ with $x_2 = a_1$ and $a_1 = \sigma(\theta_1 x_1)$
  $\frac{\partial a_2}{\partial a_1} = \frac{\partial a_2}{\partial x_2} = \theta_2 \, a_2 (1 - a_2)$ and $\frac{\partial a_1}{\partial \theta_1} = x_1 \, a_1 (1 - a_1)$
  $\frac{\partial \mathcal{L}}{\partial a_1} = \theta_2 \, a_2 (1 - a_2) \cdot \frac{\partial \mathcal{L}}{\partial a_2}$, where $\frac{\partial \mathcal{L}}{\partial a_2}$ was computed in the exact previous backpropagation step (remember, the recursive rule)
  $\frac{\partial \mathcal{L}}{\partial \theta_1} = x_1 \, a_1 (1 - a_1) \cdot \frac{\partial \mathcal{L}}{\partial a_1}$
55 Backpropagation visualization at epoch (t + 1)
o Slides 55-60 repeat exactly the same forward propagations (compute and store $a_1$, $a_2$, $a_3$) and backpropagation steps at the next epoch, now with the updated parameters.
61 Some practical tricks of the trade
o For classification, use the cross-entropy loss
o Use Stochastic Gradient Descent on mini-batches
o Shuffle the training examples at each new epoch
o Normalize the input variables to zero mean and unit variance: $(\mu, \sigma^2) = (0, 1)$
62 Everything is a module
63 Neural network models
o A neural network model is a series of hierarchically connected functions
o These hierarchies can be very, very complex
o Functions == Modules
[Figure: the same feedforward chain as before, Input through $h_1, \dots, h_5$ to the Loss, with every function drawn as a module]
64 Linear module
o Activation function: $a = \theta x$
o Gradient with respect to the input: $\frac{\partial a}{\partial x} = \theta$
o Gradient with respect to the parameters: $\frac{\partial a}{\partial \theta} = x$
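A sketch of this module as a class in the style of the cos module earlier; here the backward methods consume the incoming loss gradient $\frac{\partial \mathcal{L}}{\partial a}$ and return loss gradients (the class name and this convention are illustrative):

import numpy as np

class LinearModule:
    def __init__(self, theta):
        self.theta = theta               # weight matrix [m x n]

    def forward(self, x):
        self.x = x                       # store the input for backpropagation
        return self.theta @ x            # a = theta x

    def backward_dx(self, dL_da):
        return self.theta.T @ dL_da      # (da/dx)^T dL/da, with da/dx = theta

    def backward_dtheta(self, dL_da):
        return np.outer(dL_da, self.x)   # dL/dtheta = dL/da x^T, since da/dtheta = x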
65 Sigmoid module
o Activation function: $a = \sigma(x) = \frac{1}{1 + e^{-x}}$
o Gradient w.r.t. the input: $\frac{\partial a}{\partial x} = \sigma(x)(1 - \sigma(x))$
o For $a = \sigma(\theta x)$, gradient w.r.t. the input: $\frac{\partial \sigma(\theta x)}{\partial x} = \theta \, \sigma(\theta x)(1 - \sigma(\theta x))$
o For $a = \sigma(\theta x)$, gradient w.r.t. the parameters: $\frac{\partial \sigma(\theta x)}{\partial \theta} = x \, \sigma(\theta x)(1 - \sigma(\theta x))$
66 Sigmoid module: Pros and Cons
+ Output can be interpreted as a probability
+ Output bounded in [0, 1], so the network cannot overshoot
- Always multiplies with values < 1, so gradients can become very small in deep networks
- The gradients at the tails are flat, near 0, so there are no serious SGD updates: neurons saturate and get stuck, overconfident but not necessarily correct
67 Tanh module
o Activation function: $a = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
o Gradient with respect to the input: $\frac{\partial a}{\partial x} = 1 - \tanh^2(x)$
o Similar to the sigmoid, but with a different output range: $[-1, +1]$ instead of $[0, +1]$, giving stronger gradients, because the data is centered around 0 (not 0.5), and less bias for the hidden layer neurons, as outputs can now be both positive and negative (more likely to have zero mean in the end)
68 Rectified Linear Unit (ReLU) module (AlexNet)
o Activation function: $a = h(x) = \max(0, x)$
o Gradient w.r.t. the input: $\frac{\partial a}{\partial x} = 0$ if $x \leq 0$, and $1$ if $x > 0$
o Very popular in computer vision and speech recognition
o Much faster computations and gradients: no vanishing or exploding problems, only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations; no saturation; non-symmetric; non-differentiable at 0
o A large gradient during training can cause a neuron to die; lower learning rates mitigate the problem
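A minimal sketch of the ReLU forward and backward (function names are illustrative):

import numpy as np

def relu_forward(x):
    return np.maximum(0.0, x)

def relu_backward(x, dL_da):
    return dL_da * (x > 0)   # gradient 1 where x > 0, 0 elsewhere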
69 ReLU convergence rate
[Figure: training error vs. epochs, showing ReLU converging faster than tanh]
70 Other ReLUs
o Soft approximation (softplus): $a = h(x) = \ln(1 + e^x)$
o Noisy ReLU: $a = h(x) = \max(0, x + \varepsilon)$, $\varepsilon \sim \mathcal{N}(0, \sigma(x))$
o Leaky ReLU: $a = h(x) = x$ if $x > 0$, $0.01x$ otherwise
o Parametric ReLU: $a = h(x) = x$ if $x > 0$, $\beta x$ otherwise (parameter $\beta$ is trainable)
71 Softmax module
o Activation function: $a^{(k)} = \mathrm{softmax}(x^{(k)}) = \frac{e^{x^{(k)}}}{\sum_j e^{x^{(j)}}}$; outputs a probability distribution, $\sum_{k=1}^K a^{(k)} = 1$ for $K$ classes
o Because $e^{a+b} = e^a e^b$, we usually compute $a^{(k)} = \frac{e^{x^{(k)} - \mu}}{\sum_j e^{x^{(j)} - \mu}}$ with $\mu = \max_k x^{(k)}$, because $\frac{e^{x^{(k)} - \mu}}{\sum_j e^{x^{(j)} - \mu}} = \frac{e^{x^{(k)}} / e^{\mu}}{\sum_j e^{x^{(j)}} / e^{\mu}} = \frac{e^{x^{(k)}}}{\sum_j e^{x^{(j)}}}$
o This avoids exponentiating large numbers, for better stability
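The max-subtraction trick in numpy (a small sketch):

import numpy as np

def softmax(x):
    shifted = x - np.max(x)   # subtract mu = max_k x^(k); the result is unchanged
    e = np.exp(shifted)       # no overflow from large inputs
    return e / np.sum(e)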
72 Euclidean loss module
o Activation function: $a(x) = 0.5 \|y - x\|^2$; mostly used to measure the loss in regression tasks
o Gradient with respect to the input: $\frac{\partial a}{\partial x} = x - y$
73 Cross-entropy loss (log-loss or log-likelihood) module
o Activation function: $a(x) = -\sum_{k=1}^K y^{(k)} \log x^{(k)}$, with $y^{(k)} \in \{0, 1\}$
o Gradient with respect to the input: $\frac{\partial a}{\partial x^{(k)}} = -\frac{y^{(k)}}{x^{(k)}}$
o The cross-entropy loss is the most popular classification loss for classifiers that output probabilities (not SVMs)
o The cross-entropy loss couples well with the softmax/sigmoid module; often the modules are combined and the joint gradients are computed
o It is a generalization of logistic regression to more than 2 outputs
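The coupling mentioned above has a well-known simplification: for softmax followed by cross-entropy against a one-hot target $y$, the joint gradient w.r.t. the logits is simply $a - y$ (a sketch; the function name is illustrative):

import numpy as np

def softmax_xent_backward(x, y):
    e = np.exp(x - np.max(x))
    a = e / np.sum(e)
    return a - y   # dL/dx, much simpler than chaining the two Jacobians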
74 Many, many more modules out there
o Regularization modules, e.g. dropout
o Normalization modules, e.g. $\ell_2$-normalization, $\ell_1$-normalization
  Question: when is a normalization module needed? Answer: possibly when combining different modalities/networks (e.g. in Siamese or multiple-branch networks)
o Loss modules, e.g. the hinge loss
o ... and others, which we are going to discuss later in the course
75 Composite modules, or: Make your own module
76 Backpropagation again
o Step 1. Compute the forward propagations for all layers recursively: $a_l = h_l(x_l)$ and $x_{l+1} = a_l$
o Step 2. Once done with forward propagation, follow the reverse path. Start from the last layer and for each new layer compute the gradients; cache computations when possible to avoid redundant operations:
  $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \frac{\partial \mathcal{L}}{\partial a_{l+1}}$ and $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$
o Step 3. Use the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$ with Stochastic Gradient Descent to train
77 New modules
o Everything can be a module, given some ground rules
o How to make our own module? Write a function that follows the ground rules
o It needs to be (at least) first-order differentiable (almost) everywhere
o Hence, we need to be able to compute $\frac{\partial a(x; \theta)}{\partial x}$ and $\frac{\partial a(x; \theta)}{\partial \theta}$
78 A module of modules
o As everything can be a module, a module of modules could also be a module
o We can therefore make new building blocks as we please, if we expect them to be used frequently
o Of course, the same rules for the eligibility of modules still apply
79 1 sigmoid == 2 modules?
o Assume the sigmoid $\sigma(\cdot)$ operating on top of $\theta x$: $a = \sigma(\theta x)$
o Computing the gradients directly leads to complicated backpropagation equations
o Since everything is a module, we can decompose this into 2 modules: $a_1 = \theta x$, then $a_2 = \sigma(a_1)$
80 1 sigmoid == 2 modules?
- Two backpropagation steps instead of one
+ But now our gradients are simpler: an algorithmic way of computing gradients, and we avoid taking more gradients than needed in a (complex) non-linearity (a sketch follows below)
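The decomposition written out as two chained scalar modules (values are illustrative):

import numpy as np

theta, x = 1.5, 0.3
a1 = theta * x                      # module 1: a1 = theta * x
a2 = 1.0 / (1.0 + np.exp(-a1))      # module 2: a2 = sigma(a1)

dL_da2 = 1.0                        # pretend the loss gradient w.r.t. a2 is 1
dL_da1 = dL_da2 * a2 * (1.0 - a2)   # backward through the sigmoid module
dL_dtheta = dL_da1 * x              # backward through the linear module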
81 Network-in-network [Lin et al., arXiv 2013]
82 ResNet [He et al., CVPR 2016]
83 Radial Basis Function (RBF) Network module
o RBF module: $a = \sum_j u_j \exp(-\beta_j (x - w_j)^2)$
o Decompose into a cascade of modules: $a_1 = (x - w)^2$, $a_2 = \exp(-\beta a_1)$, $a_3 = u a_2$, $a_4 = \mathrm{plus}(\dots, a_3^{(j)}, \dots)$
84 Radial Basis Function (RBF) Network module
o An RBF module is good for regression problems, in which case it is followed by a Euclidean loss module
o The Gaussian centers $w_j$ can be initialized externally, e.g. with k-means
o Decomposition as before: $a_1 = (x - w)^2$, $a_2 = \exp(-\beta a_1)$, $a_3 = u a_2$, $a_4 = \mathrm{plus}(\dots, a_3^{(j)}, \dots)$
85 An RBF visually
[Figure: the RBF module as a cascade: per basis $j$, $\alpha_1^{(j)} = (x - w_j)^2$, $\alpha_2^{(j)} = \exp(-\beta_j \alpha_1^{(j)})$, $\alpha_3^{(j)} = u_j \alpha_2^{(j)}$; then $a_4 = \mathrm{plus}(a_3)$ sums the branches, and $a_5 = \|y - a_4\|^2$ is the Euclidean loss]
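A compact sketch of this forward cascade over all basis functions at once (illustrative):

import numpy as np

def rbf_forward(x, w, beta, u):
    a1 = (x - w) ** 2         # one value per basis j
    a2 = np.exp(-beta * a1)
    a3 = u * a2
    return np.sum(a3)         # the plus(...) module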
86 Unit tests
87 Unit test
o Always check your implementations, and not only for Deep Learning
o Does my implementation of the sin function return the correct values? If I execute sin(π/2), does it return 1 as it should?
o Even more important for gradient functions: not only can our implementation be wrong, but also our math
o At the slightest sign of malfunction, ALWAYS RECHECK: ignoring problems never solved any problem
88 Gradient check
o Original gradient definition: $\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$
o The most dangerous part of a new module is getting the gradients wrong
o Compute the gradient analytically: $g(\theta^{(i)}) = \frac{\partial a(x; \theta^{(i)})}{\partial \theta^{(i)}}$
o Compute the gradient numerically (centered differences): $\Delta(\theta^{(i)}) = \frac{a(\theta + \varepsilon) - a(\theta - \varepsilon)}{2\varepsilon}$
o Compare $\|g(\theta^{(i)}) - \Delta(\theta^{(i)})\|_2^2$; if the difference is around $10^{-4}$ to $10^{-7}$, the gradients are good
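A minimal sketch of such a check (the function names and the relative-difference normalization are illustrative choices):

import numpy as np

def gradient_check(f, grad_f, theta, eps=1e-5):
    # f: theta -> scalar loss; grad_f: theta -> analytical gradient array
    g = grad_f(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i.flat[i] = eps                                  # perturb one parameter at a time
        num = (f(theta + e_i) - f(theta - e_i)) / (2 * eps)
        rel = abs(num - g.flat[i]) / max(1e-12, abs(num) + abs(g.flat[i]))
        print(i, rel)   # good gradients give differences around 1e-4 to 1e-7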
89 Gradient check
o Perturb one parameter $\theta^{(i)}$ at a time with $\theta^{(i)} + \varepsilon$
o Then check $\Delta(\theta^{(i)})$ for that one parameter only
o Do not perturb the whole parameter vector with $\theta + \varepsilon$; this will give wrong results (simple geometry)
o Sample dimensions of the gradient vector: if a few sampled dimensions of a gradient vector are good, probably all of it is good
o Sample weight and bias gradients equally, otherwise you might get your bias wrong
90 Numerical gradients
o Can we replace analytical gradients with numerical gradients?
o In theory, yes!
o In practice, no! Too slow
91 Be creative!
o What about trigonometric modules?
o Or polynomial modules?
o Or new loss modules?
92 Summary
o Machine learning paradigm for neural networks
o The backpropagation algorithm, the backbone for training neural networks
o Neural network == modular architecture
o Visited different modules, saw how to implement and check them
93 Reading material & references
o Part I: Chapters 2-5; Part II: Chapter 6
94 Next lecture
o Optimizing deep networks
o Which loss functions per machine learning task
o Advanced modules
o Deep Learning theory