Statistical Machine Learning (BE4M33SSU)
Lecture 5: Artificial Neural Networks
Jan Drchal
Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Computer Science
Outline
Topics covered in the lecture:
- Neuron types
- Layers
- Loss functions
- Computing loss gradients via backpropagation
- Learning neural networks
- Regularization
Neural Networks Overview
Composition of simple linear or non-linear functions (neurons) parametrized by weights and biases.
Training examples: $\mathcal{T}^m = \{(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y}) \mid i = 1, \dots, m\}$, where $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} \subseteq \mathbb{R}^K$.
Here we consider $\mathcal{H}$ a hypothesis space of neural networks having a fixed architecture.
Learning methods are based on Empirical Risk Minimization:
$R_{\mathcal{T}^m}(h_{(w,b)}) = \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, h_{(w,b)}(x_i))$,
where $h_{(w,b)} \in \mathcal{H}$ denotes a neural network parametrized by $w$ and $b$.
Note that in the following we use $L(w) \equiv m \, R_{\mathcal{T}^m}(h_{(w,b)})$ and $\hat{y}_i \equiv h_{(w,b)}(x_i)$ to simplify the notation.
McCulloch-Pitts Perceptron
[Diagram: perceptron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, bias $b$, summation, activation $f$, output $\hat{y}$]
- $x = (x_1, x_2, \dots, x_n)^T \in \mathbb{R}^n$ ... input (feature vector)
- $w = (w_1, w_2, \dots, w_n)^T \in \mathbb{R}^n$ ... weights
- $b \in \mathbb{R}$ ... bias (threshold)
- $s = \langle w, x \rangle \in \mathbb{R}$ ... inner potential
- $f(s) = \begin{cases} -1 & \text{if } s < 0 \\ 1 & \text{else} \end{cases}$ ... activation function
- $\hat{y} = h_{(w,b)}(x) \in \{-1, 1\}$ ... output (activity)
$\hat{y} = f(s) = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\langle w, x \rangle + b)$
It is the linear classifier we have already seen.
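For illustration (not part of the original slides), a minimal NumPy sketch of the perceptron's forward computation; the function name and example values are arbitrary:

```python
import numpy as np

def perceptron_output(x, w, b):
    """McCulloch-Pitts perceptron: f(<w, x> + b) with a bipolar step activation."""
    s = np.dot(w, x) + b           # inner potential
    return -1.0 if s < 0 else 1.0  # output in {-1, +1}

# Example: a 2-input perceptron
x = np.array([0.5, -1.0])
w = np.array([1.0, 2.0])
b = 0.5
print(perceptron_output(x, w, b))  # -> -1.0, since s = 0.5 - 2.0 + 0.5 = -1.0
```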
McCulloch-Pitts Perceptron: Treating Bias
Treat the bias as an extra fixed input $x_0 = 1$ weighted by $w_0 = b$:
$\hat{y} = f(\langle w, x \rangle + b) = f(\langle w, x \rangle + w_0) = f(\langle \tilde{w}, \tilde{x} \rangle)$
where $\tilde{x} = (1, x_1, \dots, x_n)^T \in \mathbb{R}^{n+1}$ and $\tilde{w} = (w_0, w_1, \dots, w_n)^T \in \mathbb{R}^{n+1}$.
Unless otherwise noted we will use $x$, $w$ instead of $\tilde{x}$, $\tilde{w}$.
Activation Functions
[Plots: Step Function, Bipolar Step Function, Linear, Sigmoid, Hyperbolic Tangent, ReLU]
Logistic sigmoid: $\sigma(s) = \frac{1}{1 + e^{-s}} = \frac{e^s}{e^s + 1}$
Note: $\tanh(s) = \frac{e^s - e^{-s}}{e^s + e^{-s}} = 2\sigma(2s) - 1$
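A small sketch of the sigmoid, tanh and ReLU activations, also checking the identity $\tanh(s) = 2\sigma(2s) - 1$ numerically (the helper names are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    return np.maximum(0.0, s)

s = np.linspace(-5, 5, 11)
# Verify the identity tanh(s) = 2*sigmoid(2s) - 1 numerically
assert np.allclose(np.tanh(s), 2.0 * sigmoid(2.0 * s) - 1.0)
print(sigmoid(s), np.tanh(s), relu(s), sep="\n")
```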
Linear Neuron
Training examples: $\mathcal{T}^m = \{(x_i, y_i) \in (\mathbb{R}^{n+1} \times \mathbb{R}) \mid i = 1, \dots, m\}$
A single neuron with a linear activation function performs linear regression:
$\hat{y} = s = \langle x, w \rangle$, $\hat{y} \in \mathbb{R}$
Inputs: $X = \begin{pmatrix} x_{10} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{m0} & \cdots & x_{mn} \end{pmatrix} = \begin{pmatrix} x_1^T \\ \vdots \\ x_m^T \end{pmatrix}$
Targets: $y = (y_1, \dots, y_m)^T$, $y_i \in \mathbb{R}$
Outputs: $\hat{y} = (\hat{y}_1, \dots, \hat{y}_m)^T$, $\hat{y}_i \in \mathbb{R}$
For the whole dataset we get: $\hat{y} = Xw$, $\hat{y} \in \mathbb{R}^m$
Linear Neuron: Maximum Likelihood Estimation
Assumption: data are Gaussian distributed with mean $\langle x_i, w \rangle$ and variance $\sigma^2$:
$y_i \sim \mathcal{N}(\langle x_i, w \rangle, \sigma^2) = \langle x_i, w \rangle + \mathcal{N}(0, \sigma^2)$
Likelihood for i.i.d. data:
$p(y \mid w, X, \sigma) = \prod_{i=1}^{m} p(y_i \mid w, x_i, \sigma) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y_i - \langle w, x_i \rangle)^2} = \left(2\pi\sigma^2\right)^{-\frac{m}{2}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - \langle w, x_i \rangle)^2} = \left(2\pi\sigma^2\right)^{-\frac{m}{2}} e^{-\frac{1}{2\sigma^2}(y - Xw)^T(y - Xw)}$
Negative log-likelihood (switching to minimization):
$L(w) = \frac{m}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}(y - Xw)^T(y - Xw)$
Linear Neuron: Maximum Likelihood Estimation (contd.)
Note that $\sum_{i=1}^{m} \underbrace{(y_i - \langle w, x_i \rangle)^2}_{\ell(y_i, \hat{y}_i)} = (y - Xw)^T(y - Xw)$ is the sum-of-squares or squared error (SE).
Minimization of $L(w)$ is therefore least-squares estimation.
Solving $\frac{\partial L}{\partial w} = 0$ we get $w^* = \left(X^T X\right)^{-1} X^T y$ (see seminar).
[Plot: training data and the fitted linear regression model]
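A minimal sketch of the closed-form least-squares estimate $w^* = (X^T X)^{-1} X^T y$ on synthetic data; the data-generating parameters and sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 10, size=m)
y = 0.8 * x + 1.0 + rng.normal(0.0, 1.0, size=m)  # assumed toy model: y = 0.8x + 1 + noise

# Design matrix with a column of ones (the x_0 = 1 bias convention)
X = np.column_stack([np.ones(m), x])

# Closed-form least-squares solution; solving the normal equations avoids an explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [1.0, 0.8]
```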
Logistic Sigmoid and Probability
Denote $\hat{y} = \sigma(s)$, $\hat{y} \in (0, 1)$.
The sigmoid output can represent a parameter of the Bernoulli distribution:
$p(y \mid \hat{y}) = \mathrm{Ber}(y \mid \hat{y}) = \hat{y}^{y}(1 - \hat{y})^{1-y} = \begin{cases} \hat{y} & \text{for } y = 1 \\ 1 - \hat{y} & \text{for } y = 0 \end{cases}$
Motivation: log-odds linear model (see AE4B33RPZ)
Binary classifier: $h(\hat{y}) = \begin{cases} 1 & \text{if } \hat{y} > \frac{1}{2} \\ 0 & \text{else} \end{cases}$
[Plot: Sigmoid]
Logistic Regression
An MCP neuron using the sigmoid activation function performs logistic regression:
$\hat{y} = \sigma(\langle w, x \rangle)$, $\hat{y} \in (0, 1)$
Inputs: $X = \begin{pmatrix} x_{10} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{m0} & \cdots & x_{mn} \end{pmatrix} = \begin{pmatrix} x_1^T \\ \vdots \\ x_m^T \end{pmatrix}$
Target class: $y = (y_1, \dots, y_m)^T$, $y_i \in \{0, 1\}$
Output class: $\hat{y} = (\hat{y}_1, \dots, \hat{y}_m)^T$, $\hat{y}_i \in (0, 1)$
Logistic Regression: MLE Leads to the Cross-Entropy
Likelihood for logistic regression:
$p(y \mid w, X) = \prod_{i=1}^{m} \mathrm{Ber}(y_i \mid \hat{y}_i) = \prod_{i=1}^{m} \hat{y}_i^{\,y_i}(1 - \hat{y}_i)^{1 - y_i}$
Negative log-likelihood:
$L(w) = \sum_{i=1}^{m} \underbrace{-\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]}_{\ell(y_i, \hat{y}_i)}$
This loss function is called the cross-entropy.
$\ell(y_i, \hat{y}_i)$ is the negative log probability of the correct answer $y_i \in \{0, 1\}$ given by the model output $\hat{y}_i \in (0, 1)$.
Maximum Likelihood Estimation
Maximum likelihood estimate: $w^* = \underset{w}{\mathrm{argmin}}\, L(w)$
Derivative of the loss w.r.t. the sigmoid argument:
$\frac{\partial L}{\partial s_i} = \hat{y}_i - y_i$ (see seminar)
Gradient w.r.t. the logistic regression parameters:
$\frac{\partial L}{\partial w} = \sum_{i=1}^{m} \frac{\partial L}{\partial s_i}\frac{\partial s_i}{\partial w} = \sum_{i=1}^{m} x_i(\hat{y}_i - y_i) = X^T(\hat{y} - y)$
$\frac{\partial L}{\partial w} = 0$ has no analytical solution ⇒ use numerical methods.
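Since there is no closed-form solution, the gradient $X^T(\hat{y} - y)$ can be plugged into plain gradient descent; a minimal sketch with arbitrary learning rate, iteration count and toy data:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Gradient descent on the cross-entropy loss; X includes the bias column."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w)          # predicted class probabilities
        grad = X.T @ (y_hat - y)        # dL/dw = X^T (y_hat - y)
        w -= eta / len(y) * grad        # averaged gradient step
    return w

# Toy 1D example: class 1 for positive inputs
X = np.column_stack([np.ones(6), np.array([-3., -2., -1., 1., 2., 3.])])
y = np.array([0., 0., 0., 1., 1., 1.])
print(fit_logistic_regression(X, y))
```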
Rectified Linear Unit (ReLU)
Definition: $f(s) = \max(0, s)$
- Fast to compute
- Helps with the vanishing gradient problem: the gradient is constant for $s > 0$, while for sigmoid-like activations it becomes increasingly small
- Leads to sparse representations: $s < 0$ turns the neuron completely off
- Unbounded: use regularization to prevent numerical problems
- Might block gradient propagation ⇒ dead units ⇒ Leaky ReLU
[Plot: ReLU]
Multilayer Perceptron (MLP)
[Diagram: inputs → fully connected hidden layer → hidden layer → softmax output layer]
- Feed-forward ANN
- Fully-connected layers
- An MLP for regression would typically use a linear output layer
Recurrent Neural Network (RNN)
[Diagram: Fully-Connected Recurrent Neural Network (FRNN)]
- Both inputs and outputs are sequences
- Feedback connections ⇒ memory
Modular and Hierarchical Architectures
[Diagram: two interconnected modules]
- Layers can be organized in modules
- Hierarchies of modules
- Module reuse
Linear Layer
Output $k$: $\hat{y}_k = \langle x, w_k \rangle$, $k = 1, 2, \dots, K$
All outputs using the weight matrix $W$: $\hat{y} = x^T W$
Multiple samples: $\hat{Y} = XW$
$W = \begin{pmatrix} w_1 & \cdots & w_K \end{pmatrix} = \begin{pmatrix} w_{01} & \cdots & w_{0K} \\ \vdots & & \vdots \\ w_{n1} & \cdots & w_{nK} \end{pmatrix}$,
$X = \begin{pmatrix} x_1^T \\ \vdots \\ x_m^T \end{pmatrix} = \begin{pmatrix} x_{10} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{m0} & \cdots & x_{mn} \end{pmatrix}$,
$\hat{Y} = \begin{pmatrix} \hat{y}_1^T \\ \vdots \\ \hat{y}_m^T \end{pmatrix} = \begin{pmatrix} \hat{y}_{11} & \cdots & \hat{y}_{1K} \\ \vdots & & \vdots \\ \hat{y}_{m1} & \cdots & \hat{y}_{mK} \end{pmatrix}$
[Diagram: linear layer with inputs $x_0, \dots, x_n$, weights $w_{01}, \dots, w_{nK}$ and outputs $\hat{y}_1, \dots, \hat{y}_K$]
Softmax Layer
[Diagram: inputs $s_1, \dots, s_K$ → softmax → outputs $\hat{y}_1, \dots, \hat{y}_K$]
Multinomial classification, $K$ mutually exclusive classes.
Definition: $\sigma_k(s) \equiv \frac{e^{s_k}}{\sum_{c=1}^{K} e^{s_c}}$, where $K$ is the number of classes.
Softmax represents a probability distribution: $\sigma_k \in (0, 1)$ for $k \in \{1, \dots, K\}$ and $\sum_{c=1}^{K} \sigma_c = 1$.
It describes class membership probabilities: $p(y = k \mid s) = \sigma_k(s)$
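A short sketch of the softmax computed in a numerically stable way; subtracting the maximum does not change the result because adding a constant to all inputs cancels in the ratio:

```python
import numpy as np

def softmax(s):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    s = s - np.max(s, axis=-1, keepdims=True)
    e = np.exp(s)
    return e / np.sum(e, axis=-1, keepdims=True)

s = np.array([2.0, 1.0, 0.1])
p = softmax(s)
print(p, p.sum())  # probabilities in (0, 1) summing to 1
```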
Softmax Layer: MLE
Target: $y = (y_1, \dots, y_m)^T$, $y_i \in \{1, 2, \dots, K\}$
One-hot encoding for sample $i$ and class $k$: let $y_{ik} = [\![\, y_i = k \,]\!]$
Likelihood: $p(y \mid w, X) = \prod_{i=1}^{m} \prod_{c=1}^{K} \hat{y}_{ic}^{\,y_{ic}}$
Negative log-likelihood: $L(w) = -\sum_{i=1}^{m}\sum_{c=1}^{K} y_{ic}\log(\hat{y}_{ic})$
Again the cross-entropy. See seminar for the gradient.
Multinomial Logistic Regression
linear layer + softmax layer = multinomial logistic regression:
$\hat{y}_k = \sigma_k(x^T W)$
[Diagram: inputs $x_1, \dots, x_n$ → linear layer → $s_1, \dots, s_K$ → softmax → $\hat{y}_1, \dots, \hat{y}_K$]
Classifier: $h(x, W) = \underset{k}{\mathrm{argmax}}\ \hat{y}_k$
Loss Functions: Summary
Suggested loss functions by problem type:
- binary classification: cross-entropy $-\sum_{i=1}^{m}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$
- multinomial classification: multinomial cross-entropy $-\sum_{i=1}^{m}\sum_{c=1}^{K} y_{ic}\log(\hat{y}_{ic})$
- regression: squared error $\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$
- multi-output regression: squared error $\sum_{i=1}^{m}\sum_{c=1}^{K}(y_{ic} - \hat{y}_{ic})^2$
The mean w.r.t. $m$ is often used; in that case these losses exactly correspond to the empirical risk $R_{\mathcal{T}^m}(h)$.
Backpropagation Overview
A method to compute the gradient of the loss function with respect to its parameters: $\nabla L(w)$
$\nabla L(w)$ is in turn used by optimization methods like gradient descent.
Here, we present the "modular" backpropagation (see Nando de Freitas' Machine Learning course: https://www.cs.ox.ac.uk/people/nandodefreitas/machinelearning/).
Let us use multinomial logistic regression as an example.
[Diagram: inputs $x$ → linear layer → $s$ → softmax → $\hat{y}$ → loss (with targets $y$) → $L$]
Backpropagation: the Loss Function
The loss function is the multinomial cross-entropy in this case:
$L(w) = -\sum_{i=1}^{m}\sum_{c=1}^{K} [\![\, y_i = c \,]\!] \log\left(\frac{\exp(\langle x_i, w_c \rangle)}{\sum_{k=1}^{K}\exp(\langle x_i, w_k \rangle)}\right)$
[Diagram: $z_1 = x$ → linear layer → $z_2 = s$ → softmax → $z_3 = \hat{y}$ → loss (with targets $y$) → $z_4 = L$]
Backpropagation Based on Modules
- Computation of $\nabla L(w)$ involves repetitive use of the chain rule.
- We can make things simpler by a divide-and-conquer approach.
- Divide into the simplest possible modules (these can later be combined into complex hierarchies).
- Represent even the loss function as a module.
- Passing messages between modules (see the diagram below).
[Diagram: module $l$ with forward input $z^l$, output $z^{l+1}$, backward messages $\delta^l$, $\delta^{l+1}$ and parameter gradient $\frac{\partial L}{\partial w^l}$]
Backpropagation: Backward Pass Message
Let $\delta^l = \frac{\partial L}{\partial z^l}$ be the sensitivity of the loss to the module input for layer $l$, then:
$\delta_i^l = \frac{\partial L}{\partial z_i^l} = \sum_j \frac{\partial L}{\partial z_j^{l+1}}\frac{\partial z_j^{l+1}}{\partial z_i^l} = \sum_j \delta_j^{l+1}\frac{\partial z_j^{l+1}}{\partial z_i^l}$
We need to know how to compute derivatives of outputs w.r.t. inputs only!
[Diagram: chain $z_1 = x$ → $z_2 = f_1(z_1)$ (linear layer) → $z_3 = f_2(z_2)$ (softmax) → $L = z_4 = f_3(z_3)$ (loss); backward messages $\delta_1 = \frac{\partial L}{\partial z_1}, \dots, \delta_4 = \frac{\partial L}{\partial z_4} = \frac{\partial L}{\partial L} = 1$]
Backpropagation: Parameters
Similarly, if the module has parameters we want to know how the loss changes w.r.t. them:
$\frac{\partial L}{\partial w_i^l} = \sum_j \frac{\partial L}{\partial z_j^{l+1}}\frac{\partial z_j^{l+1}}{\partial w_i^l} = \sum_j \delta_j^{l+1}\frac{\partial z_j^{l+1}}{\partial w_i^l}$
Derivatives of module outputs w.r.t. the parameters are all we need.
[Diagram: the same module chain, with the parameter gradient $\frac{\partial L}{\partial w^1}$ at the linear layer]
Backpropagation: Steps
So for each module we only need to specify these three messages:
- forward: $z^{l+1} = f(z^l)$
- backward: $\frac{\partial z^{l+1}}{\partial z^l}$
- parameter (optional): $\frac{\partial z^{l+1}}{\partial w^l}$
[Diagram: forward pass through the module chain, backward pass (backpropagation) with $\delta$ messages and parameter gradients]
Example: Linear Layer
- forward: $z_j^{l+1} = \sum_{i=0}^{n} w_{ij} z_i^l$, $j = 1, \dots, K$
- backward: $\frac{\partial z_j^{l+1}}{\partial z_i^l} = w_{ij}$, $i = 0, \dots, n$, $j = 1, \dots, K$
- parameter: $\frac{\partial z_j^{l+1}}{\partial w_{ik}} = [\![\, j = k \,]\!] z_i^l$
[Diagram: linear layer with inputs $z_0^l, \dots, z_n^l$, weights $w_{01}, \dots, w_{nK}$ and outputs $z_1^{l+1}, \dots, z_K^{l+1}$]
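A minimal sketch of the linear layer as a backpropagation module implementing the three messages above for a single sample; the class interface and initialization are illustrative assumptions:

```python
import numpy as np

class LinearModule:
    """Linear layer z^{l+1} = W^T z^l for a single sample; W has shape (n+1, K)."""

    def __init__(self, n_inputs, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(n_inputs, n_outputs))

    def forward(self, z):
        self.z = z                       # cache the input for the backward pass
        return self.W.T @ z              # z^{l+1}_j = sum_i w_ij z^l_i

    def backward(self, delta_next):
        # dL/dw_ij = delta^{l+1}_j * z^l_i, dL/dz^l_i = sum_j delta^{l+1}_j * w_ij
        self.dW = np.outer(self.z, delta_next)
        return self.W @ delta_next

layer = LinearModule(n_inputs=3, n_outputs=2)
out = layer.forward(np.array([1.0, 0.5, -1.0]))
grad_in = layer.backward(np.ones(2))
print(out, grad_in, layer.dW, sep="\n")
```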
Example: Squared Error
- forward: $z^{l+1} = \sum_{i=1}^{n}(y_i - z_i^l)^2$
- backward: $\frac{\partial z^{l+1}}{\partial z_i^l} = -2(y_i - z_i^l)$, $i \in \{1, \dots, n\}$
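The squared-error loss can be wrapped as a module in the same way; a minimal sketch:

```python
import numpy as np

class SquaredErrorModule:
    """Loss module: forward returns sum_i (y_i - z_i)^2, backward returns dL/dz."""

    def forward(self, z, y):
        self.z, self.y = z, y
        return np.sum((y - z) ** 2)

    def backward(self):
        return -2.0 * (self.y - self.z)   # dL/dz_i = -2 (y_i - z_i)

loss = SquaredErrorModule()
print(loss.forward(np.array([0.5, 2.0]), np.array([1.0, 1.0])))  # 0.25 + 1.0 = 1.25
print(loss.backward())                                            # [-1.0, 2.0]
```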
Gradient Descent
Task: find the parameters which minimize the loss over the training dataset:
$\theta^* = \underset{\theta}{\mathrm{argmin}}\, L(\theta)$
where $\theta$ is the set of all parameters defining the ANN (e.g., all weight matrices).
Gradient descent: $\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \nabla L(\theta^{(t)})$
where $\eta^{(t)} > 0$ is the learning rate or step size at iteration $t$.
[Plots: gradient descent trajectories on a 2D loss surface]
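A generic gradient-descent loop with a fixed step size, sketched on a simple quadratic test function (the function and hyperparameters are illustrative):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, eta=0.1, n_iters=100):
    """theta^{t+1} = theta^t - eta * grad L(theta^t) with a fixed step size."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - eta * grad_fn(theta)
    return theta

# Minimize L(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2; the minimum is at (3, -1)
grad = lambda t: 2.0 * (t - np.array([3.0, -1.0]))
print(gradient_descent(grad, np.zeros(2)))
```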
Batch, Online and Mini-Batch Learning
When to update the weights?
- (Full) batch learning: after all patterns have been used (an epoch); inefficient for redundant datasets
- Online learning: after each training pattern; the noise can help overcome local minima but can also harm convergence in the final stages of fine-tuning; Stochastic Gradient Descent (SGD) does this; it converges almost surely to a local minimum when $\eta^{(t)}$ decreases appropriately in time
- Mini-batch learning: after a small sample of training patterns
Momentum
Simulate inertia to overcome plateaus in the error landscape:
$v^{(t+1)} = \mu v^{(t)} - \eta^{(t)} \nabla L(\theta^{(t)})$
$\theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}$
where $\mu \in [0, 1]$ is the momentum parameter.
- Momentum damps oscillations in directions of high curvature.
- It builds up velocity in directions with a consistent (possibly small) gradient.
[Plots: gradient descent trajectories with and without momentum]
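The same loop extended with the momentum update above; a sketch with illustrative hyperparameters:

```python
import numpy as np

def gradient_descent_momentum(grad_fn, theta0, eta=0.1, mu=0.9, n_iters=200):
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        v = mu * v - eta * grad_fn(theta)   # v^{t+1} = mu v^t - eta grad L(theta^t)
        theta = theta + v                   # theta^{t+1} = theta^t + v^{t+1}
    return theta

grad = lambda t: 2.0 * (t - np.array([3.0, -1.0]))
print(gradient_descent_momentum(grad, np.zeros(2)))  # approaches (3, -1)
```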
Adagrad
Adaptive Gradient method (Duchi, Hazan and Singer, 2011)
Motivation: the magnitude of the gradient differs a lot between parameters.
Idea: reduce the learning rate for parameters with large gradient values.
$g_i^{(t+1)} = g_i^{(t)} + \left(\frac{\partial L}{\partial \theta_i^{(t)}}\right)^2$
$\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\eta}{\sqrt{g_i^{(t+1)}} + \epsilon}\,\frac{\partial L}{\partial \theta_i^{(t)}}$
- $g_i$ accumulates the squared partial derivatives w.r.t. the parameter $\theta_i$
- $\epsilon$ is a small positive number to prevent division by zero
- Weakness: the ever-increasing $g_i$ eventually leads to slow convergence
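A per-parameter Adagrad update on the same quadratic test function; hyperparameters are illustrative:

```python
import numpy as np

def adagrad(grad_fn, theta0, eta=0.5, eps=1e-8, n_iters=500):
    theta = np.asarray(theta0, dtype=float)
    g = np.zeros_like(theta)                     # accumulated squared gradients
    for _ in range(n_iters):
        grad = grad_fn(theta)
        g = g + grad ** 2                        # g_i^{t+1} = g_i^t + (dL/dtheta_i)^2
        theta = theta - eta / (np.sqrt(g) + eps) * grad
    return theta

grad = lambda t: 2.0 * (t - np.array([3.0, -1.0]))
print(adagrad(grad, np.zeros(2)))
```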
RMSProp
Similar to Adagrad but employs a moving average:
$g_i^{(t+1)} = \gamma g_i^{(t)} + (1 - \gamma)\left(\frac{\partial L}{\partial \theta_i^{(t)}}\right)^2$
- $\gamma$ is a decay parameter (typical value $\gamma = 0.9$)
- Unlike Adagrad, the updates do not become infinitesimally small
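And the corresponding RMSProp sketch with the moving average of squared gradients (hyperparameters again illustrative):

```python
import numpy as np

def rmsprop(grad_fn, theta0, eta=0.01, gamma=0.9, eps=1e-8, n_iters=2000):
    theta = np.asarray(theta0, dtype=float)
    g = np.zeros_like(theta)
    for _ in range(n_iters):
        grad = grad_fn(theta)
        g = gamma * g + (1.0 - gamma) * grad ** 2   # moving average of squared gradients
        theta = theta - eta / (np.sqrt(g) + eps) * grad
    return theta

grad = lambda t: 2.0 * (t - np.array([3.0, -1.0]))
print(rmsprop(grad, np.zeros(2)))
```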
Regularization
How to deal with overfitting?
- get more data
- find a simpler model, search for the optimal architecture, e.g., number, type and size of layers
- use regularization
Most types of regularization are based on penalties for model complexity.
Bayesian point of view: introduce a prior distribution on the model parameters.
L2 Regularization
Recall the solution for linear regression: $w^* = (X^T X)^{-1} X^T y$
What if $X^T X$ has no inverse? We can modify the solution by adding a small element to the diagonal:
$w^* = \left(X^T X + \lambda I\right)^{-1} X^T y$, $\lambda > 0$
It turns out that this approach not only helps with inverting $X^T X$ but also improves model generalization.
It is the solution of the regularized loss function:
$L(w) = (y - Xw)^T(y - Xw) + \lambda w^T w$
This is called L2 regularization, see seminar for the derivation.
The term $\lambda w^T w = \lambda \lVert w \rVert_2^2$ minimizes the size of the weight vector. Note that we omit the bias in $\lambda w^T w$.
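A sketch of the closed-form ridge (L2-regularized) solution on synthetic data; the toy data and the value of λ are illustrative assumptions:

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Closed-form L2-regularized least squares: w = (X^T X + lam I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(0.0, 0.1, size=50)
print(ridge_regression(X, y, lam=1.0))  # estimates shrunk towards zero relative to w_true
```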
L2 Regularization as a Gaussian Prior
Recall the likelihood:
$p(y \mid w, X) = \left(2\pi\sigma^2\right)^{-\frac{m}{2}} e^{-\frac{1}{2\sigma^2}(y - Xw)^T(y - Xw)}$
Define a Gaussian prior with zero mean and variance $\sigma_0^2$ for the parameters:
$p(w) = \left(2\pi\sigma_0^2\right)^{-\frac{1}{2}} e^{-\frac{1}{2\sigma_0^2} w^T w}$
Then the posterior is:
$p(w \mid y, X) = \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)}$
The denominator does not depend on the parameters $w$:
$p(w \mid y, X) \propto p(y \mid w, X)\, p(w)$
MAP Estimate
Maximizing $p(w \mid y, X)$ gives us the Maximum a Posteriori (MAP) estimate:
$w_{\mathrm{MAP}} = \underset{w}{\mathrm{argmax}}\, p(w \mid y, X) = \underset{w}{\mathrm{argmin}}\, \left(-\log p(w \mid y, X)\right)$
where
$-\log p(w \mid y, X) = \frac{1}{2\sigma^2}(y - Xw)^T(y - Xw) + \frac{1}{2\sigma_0^2} w^T w + C$
We can omit $C$, define $\lambda = \frac{\sigma^2}{\sigma_0^2}$ and minimize the loss function we already know:
$L(w) = (y - Xw)^T(y - Xw) + \lambda w^T w$
Weight Decay: Discussion
- A zero-mean Gaussian prior keeps the weights small.
- Weight decay is widely used for most types of layers in ANNs.
- Intuition: sigmoid-like neurons kept near zero potential (via small weights) behave similarly to linear neurons.
- The same works for other models, e.g., polynomial regression.
- $\lambda$ is usually set using cross-validation.
[Plot: Sigmoid]
Other Regularization Approaches
- L1 regularization: sum of absolute values, i.e., use $\lambda \lVert w \rVert_1$
- Early stopping: start with small weights, stop when the validation loss starts to grow
- Randomizing inputs: has the same effect as weight decay for linear neurons
- Weight sharing and sparse connectivity: Convolutional Neural Networks
- Model averaging
- Dropout and DropConnect
- Dataset augmentation
Next Lecture
- Deep Neural Networks
- Convolutional Neural Networks
- Transfer learning