Feedforward Neural Nets and Backpropagation

Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23

Supervised Learning Roadmap Supervised Learning: Assume that we are given the features x i. Could also use basis functions or kernels. 2 / 23

Supervised Learning Roadmap Supervised Learning: Assume that we are given the features x i. Could also use basis functions or kernels. Unsupervised Learning: Learn a representation z i based on features x i. Also used for supervised learning: use z i as features. 2 / 23

Supervised Learning Roadmap Linear Model yi w1 w2 w3 wd xi1 xi2 xi3 xid 3 / 23

Supervised Learning Roadmap Change of Basis Linear Model yi w1 w2 w3 wk yi zi1 zi2 zi3 zik w1 w2 w3 wd xi1 xi2 xi3 xid Basis xi1 xi2 xi3 xid 3 / 23

Supervised Learning Roadmap Change of Basis Basis from Latent-Factor Model yi Linear Model yi w1 w2 w3 wk yi zi1 zi2 zi3 zik w1 w2 w3 wk zi1 zi2 zi3 zik w1 w2 w3 wd xi1 xi2 xi3 xid Basis w and W are trained separately zi1 zi2 zi3 zik W11 Wkd xi1 xi2 xi3 xid xi1 xi2 xi3 xid 3 / 23

Supervised Learning Roadmap Linear Model Change of Basis yi Basis from Latent-Factor Model yi Simultaneously Learn Features for Task and Regression Model w1 w2 w3 wk yi zi1 zi2 zi3 zik w1 w2 w3 wk zi1 zi2 zi3 zik yi w1 w2 w3 wk w1 w2 w3 wd xi1 xi2 xi3 xid Basis w and W are trained separately zi1 zi2 zi3 zik W11 Wkd zi1 zi2 zi3 zik xi1 xi2 xi3 xid W11 Wkd xi1 xi2 xi3 xid xi1 xi2 xi3 xid 3 / 23

Feed-forward Neural Network Information always moves one direction. No loops. Never goes backwards. Forms a directed acyclic graph. 4 / 23

Feed-forward Neural Network Information always moves one direction. No loops. Never goes backwards. Forms a directed acyclic graph. Each node receives input only from immediately preceding layer. Simplest type of artificial neural network. 4 / 23

Neural Networks - Historical Background 1943: McCulloch and Pitts proposed first computational model of neuron 1949: Hebb proposed the first learning rule 1958: Rosenblatt s work on perceptrons 1969: Minsky and Papert s paper exposed limits of theory 1970s: Decade of dormancy for neural networks 1980-90s: Neural network return (self-organization, back-propagation algorithms, etc.) 5 / 23

Model of Single Neuron McCulloch and Pitts (1943): integrate and fire model (no hidden layers) yi w1 w2 w3 wd xi1 xi2 xi3 xid Denote the d input values for sample i by x i1, x i2,..., x id. Each of the d inputs has a weight w 1, w 2,..., w d. Compute prediction as weighted sum, d ŷ i = w 1x i1 + w 2x i2 + + w d x id = w jx ij j=1 6 / 23

Incorporating Hidden Layers Consider more than one neuron: yi w1 w2 w3 wk zi1 zi2 zi3 zik W11 Wkd xi1 xi2 xi3 xid 7 / 23

Incorporating Hidden Layers Consider more than one neuron: yi w1 w2 w3 wk zi1 zi2 zi3 zik W11 Wkd xi1 xi2 xi3 xid Input to hidden layer: function h of features from latent-factor model: z i = h(w x i). Each neuron has directed connection to ALL neurons of a subsequent layer. 7 / 23

Linear Activation Function A linear activation function has the form where a is called bias (intercept). h(w x i) = a + W x i, 8 / 23

Linear Activation Function A linear activation function has the form where a is called bias (intercept). h(w x i) = a + W x i, Example: linear regression with linear bias (linear-linear model) Representation: z i = h(w x i ) (from latent-factor model) Prediction: ŷ i = w T z i Loss: 1 2 (y i ŷ i ) 2 To train this model, we solve: 1 n argmin (y i w T z i) 2 = 1 w IR k,w IR k d 2 2 i=1 }{{} linear regression with z i as features n (y i w T h(w x i)) 2 i=1 8 / 23

Binary Activation Function To increase flexibility, something needs to be non-linear. 9 / 23

Binary Activation Function To increase flexibility, something needs to be non-linear. A Heaviside step function has the form 1 if v a h(v) = 0 otherwise where a is the threshold. 9 / 23

Binary Activation Function To increase flexibility, something needs to be non-linear. A Heaviside step function has the form 1 if v a h(v) = 0 otherwise where a is the threshold. Example: Let a = 0, This yields a binary z i = h(w x i). 9 / 23

Perceptrons Minsky and Papert (late 50s). Algorithm for supervised learning of binary classifiers. Decides whether input belongs to a specific class or not. 10 / 23

Perceptrons Minsky and Papert (late 50s). Algorithm for supervised learning of binary classifiers. Decides whether input belongs to a specific class or not. Uses binary activation function. Can learn AND, OR, NOT functions. 10 / 23

Learning Algorithm for Perceptrons The perceptron learning rule is given as follows: 1 For each x i and desired output y i in training set, ( ( 1 Calculate output: ŷ (t) i = h w (t)) ) T xi. 11 / 23

Learning Algorithm for Perceptrons The perceptron learning rule is given as follows: 1 For each x i and desired output y i in training set, ( ( 1 Calculate output: ŷ (t) i = h w (t)) ) T xi. 2 Update the weights: w (t+1) j = w (t) j + (y i ŷ (t) i )x ji 1 j d. 2 Continue these steps until 1 n 2 i=1 y i ŷ (t) i < γ (threshold). If training set is linearly separable, then guaranteed to converge. (Rosenblatt, 1962). 11 / 23

Non-Linear Continuous Activation Function What about a smooth approximation to the step function? 12 / 23

Non-Linear Continuous Activation Function What about a smooth approximation to the step function? A sigmoid function has the form h(v) = 1 1 + e v 12 / 23

Non-Linear Continuous Activation Function What about a smooth approximation to the step function? A sigmoid function has the form h(v) = 1 1 + e v Applying the sigmoid function element-wise, 1 z ic =. 1 + e W c T x i This is called a multi-layer perceptron or neural network. 12 / 23

Why Neural Network? A typical neuron. Neuron has many dendrites. Each dendrite takes input. Neuron has a single axon. Axon sends output. 13 / 23

Why Neural Network? A typical neuron. Neuron has many dendrites. Each dendrite takes input. Neuron has a single axon. Axon sends output. With the right input to dendrites: Action potential along axon. 13 / 23

Why Neural Network? First neuron: Each dendrite: x i1, x i1,..., x id Nucleus: computes Wc T x i 1 Axon: sends binary signal 1+e W c T x i Axon terminal: neurotransmitter, synapse Second neuron: Each dendrite receives: z i1, z i2,..., z ik Nucleus: computes w T z i Axon: sends signal ŷ i Describes a neural network with one hidden layer (2 neurons). 14 / 23

Artificial Neural Nets vs. Real Neural Nets yi Artificial neural network: x i is measuring of the world. z i is internal representation of the world. ŷ i as output of neuron for classification/regression. W11 w1 w2 w3 wk zi1 zi2 zi3 zik Wkd xi1 xi2 xi3 xid Real neural networks are more complicated: Timing of action potentials seems to be important. Rate coding: frequency of action potentials simulates continuous output. Neural networks don t reflect sparsity of action potentials. 15 / 23

The Universal Approximation Theorem One of the first versions proven by George Cybenko (1989). 16 / 23

The Universal Approximation Theorem One of the first versions proven by George Cybenko (1989). For feedforward neural networks, the universal approximation theorem:... claims that every continuous function defined on a compact set can be arbitrarily well approximated by a neural network with one hidden layer. Thus, a simple neural network capable of representing a wide variety of functions when given appropriate parameters. 16 / 23

Artificial Neural Networks With squared loss, our objective function is: 1 argmin w IR k,w IR k d 2 n (y i w T h(w x i)) 2 i=1 17 / 23

Artificial Neural Networks With squared loss, our objective function is: 1 argmin w IR k,w IR k d 2 n (y i w T h(w x i)) 2 i=1 Usual training procedure: stochastic gradient. Compute gradient of random example i, update w and W. Computing the gradient is known as backpropagation. 17 / 23

Backpropagation Output values ŷ i compared to correct values y i to compute error. 18 / 23

Backpropagation Output values ŷ i compared to correct values y i to compute error. Error is fed back through the network (backpropagation of errors). 18 / 23

Backpropagation Output values ŷ i compared to correct values y i to compute error. Error is fed back through the network (backpropagation of errors). Weights are adjusted to reduce value of error function by small amount. After number of cycles, network converges to state where error is small. To adjust weights, use non-linear optimization method gradient descent. Backpropagation can only be applied to differentiable activation functions. 18 / 23

Backpropagation Algorithm repeat until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do /*Propagate the inputs forward to compute the outputs*/ /*Propagate the deltas backwards from output layer to input layer*/ /*Update every weight in network using deltas*/ until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do /*Propagate the inputs forward to compute the outputs*/ for each node i in the input layer do a i x i for l = 2 to L do for each node j in layer l do v j i wi,jai a j h(v j) /*Propagate the deltas backwards from output layer to input layer*/ /*Update every weight in network using deltas*/ until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do /*Propagate the inputs forward to compute the outputs*/ for each node i in the input layer do a i x i for l = 2 to L do for each node j in layer l do v j i wi,jai a j h(v j) /*Propagate the deltas backwards from output layer to input layer*/ for each node j in output layer do [j] h (v j)(y j a j) for l = L 1 to 1 do for each node i in layer l do [i] h (v i) j wi,j [j] /*Update every weight in network using deltas*/ until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do /*Propagate the inputs forward to compute the outputs*/ for each node i in the input layer do a i x i for l = 2 to L do for each node j in layer l do v j i wi,jai a j h(v j) /*Propagate the deltas backwards from output layer to input layer*/ for each node j in output layer do [j] h (v j)(y j a j) for l = L 1 to 1 do for each node i in layer l do [i] h (v i) j wi,j [j] /*Update every weight in network using deltas*/ for each weight w i,j in network do w i,j w i,j + αa i [j] until some stopping criterion is satisfied 19 / 23

Backpropagation Derivations of the back-propagation equations are very similar to the gradient calculation for logistic regression. Uses the multivariable chain rule more than once. 20 / 23

Backpropagation Derivations of the back-propagation equations are very similar to the gradient calculation for logistic regression. Uses the multivariable chain rule more than once. Consider the loss for a single node: f(w, W ) = 1 k 2 (yi w jh(w jx i)) 2. j=1 Derivatives with respect to w j (weights connect hidden to output layer): f(w, W ) w j = (y i k w jh(w jx i)) h(w jx i) j=1 20 / 23

Backpropagation Notice repeated calculations in gradients: f(w, W ) w j f(w, W ) W ij = (y i = (y i k w jh(w jx)) h(w jx) j=1 k w jh(w jx)) w jh (W jx)x i j=1 21 / 23

Backpropagation Notice repeated calculations in gradients: f(w, W ) w j f(w, W ) W ij = (y i = (y i k w jh(w jx)) h(w jx) j=1 k w jh(w jx)) w jh (W jx)x i j=1 Same value for f(w,w ) w j with all j and f(w,w ) W ij for all i and j. Same value for f(w,w ) W ij with all i. 21 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. 22 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. Many free parameters (e.g., number of hidden nodes, learning rate, minimal error, etc.) 22 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. Many free parameters (e.g., number of hidden nodes, learning rate, minimal error, etc.) Not great for precision calculations. Do not provide explanations (unlike slopes in linear models that represent correlations). Network overfits training data, fails to capture true statistical process behind the data. Early stopping can be used to avoid this. 22 / 23

Advantages of Neural Networks Easy to maintain. Versatile. Used on problems for which analytical methods do not yet exist. 23 / 23

Advantages of Neural Networks Easy to maintain. Versatile. Used on problems for which analytical methods do not yet exist. Models non-linear dependencies. Quickly work out patterns, even with noisy data. 23 / 23