Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks


Statistical Machine Learning (BE4M33SSU), Lecture 5: Artificial Neural Networks. Jan Drchal, Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Computer Science.

Outline
Topics covered in the lecture:
- Neuron types
- Layers
- Loss functions
- Computing loss gradients via backpropagation
- Learning neural networks
- Regularization

Neural Networks Overview
Composition of simple linear or non-linear functions (neurons) parametrized by weights and biases.
Training examples: $\mathcal{T}^m = \{(x_i, y_i) \in (\mathcal{X} \times \mathcal{Y}) \mid i = 1, \ldots, m\}$, where $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} \subseteq \mathbb{R}^K$.
Here we consider $\mathcal{H}$, a hypothesis space of neural networks having a fixed architecture.
Learning methods are based on Empirical Risk Minimization:
$R_{\mathcal{T}^m}(h_{(w,b)}) = \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, h_{(w,b)}(x_i))$,
where $h_{(w,b)} \in \mathcal{H}$ denotes a neural network parametrized by weights $w$ and biases $b$.
Note that in the following we use $L(w) \equiv m\, R_{\mathcal{T}^m}(h_{(w,b)})$ and $\hat{y}_i \equiv h_{(w,b)}(x_i)$ to simplify the notation.

McCulloch-Pitts Perceptron
[Figure: a single neuron with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, bias $b$, summation producing $s$ and output $\hat{y}$.]
$x = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$ ... input (feature vector)
$w = (w_1, w_2, \ldots, w_n)^T \in \mathbb{R}^n$ ... weights
$b \in \mathbb{R}$ ... bias (threshold)
$s = \langle w, x \rangle \in \mathbb{R}$ ... inner potential
$f(s) = \begin{cases} -1 & \text{if } s < 0 \\ 1 & \text{else} \end{cases}$ ... activation function
$\hat{y} = h_{(w,b)}(x) \in \{-1, 1\}$ ... output (activity)
$\hat{y} = f(s) = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\langle w, x \rangle + b)$
It is the linear classifier we have already seen.
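To make the definitions concrete, here is a minimal NumPy sketch of the perceptron forward pass; the function and variable names (`perceptron_output`, `w`, `b`) are our own illustration, not part of the lecture.

```python
import numpy as np

def perceptron_output(x, w, b):
    """McCulloch-Pitts perceptron: bipolar step applied to the inner potential."""
    s = np.dot(w, x) + b            # inner potential s = <w, x> + b
    return -1.0 if s < 0 else 1.0   # activation f(s)

# tiny usage example with made-up numbers
x = np.array([0.5, -1.2, 3.0])
w = np.array([1.0, 0.3, -0.2])
print(perceptron_output(x, w, b=0.1))
```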

McCulloch-Pitts Perceptron: Treating Bias
Treat the bias as an extra fixed input $x_0 = 1$ weighted by $w_0 = b$:
$\hat{y} = f(\langle w, x \rangle + b) = f(\langle w, x \rangle + w_0 x_0) = f(\langle \tilde{w}, \tilde{x} \rangle)$
$\tilde{x} = (1, x_1, \ldots, x_n)^T \in \mathbb{R}^{n+1}$, $\tilde{w} = (w_0, w_1, \ldots, w_n)^T \in \mathbb{R}^{n+1}$
Unless otherwise noted we will use $x$, $w$ instead of $\tilde{x}$, $\tilde{w}$.

Activation Functions
[Figure: plots of the step function, bipolar step function, linear, sigmoid, hyperbolic tangent and ReLU activations.]
Logistic sigmoid: $\sigma(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$
Note: $\tanh(s) = \frac{e^s - e^{-s}}{e^s + e^{-s}} = 2\sigma(2s) - 1$
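As a small sketch, the activations above can be written directly in NumPy; the assertion checks the tanh-sigmoid identity from the slide.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    return np.maximum(0.0, s)

s = np.linspace(-5.0, 5.0, 11)
# tanh(s) = 2*sigmoid(2*s) - 1
assert np.allclose(np.tanh(s), 2.0 * sigmoid(2.0 * s) - 1.0)
```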

Linear Neuron
Training examples: $\mathcal{T}^m = \{(x_i, y_i) \in (\mathbb{R}^{n+1} \times \mathbb{R}) \mid i = 1, \ldots, m\}$
A single neuron with a linear activation function performs linear regression: $\hat{y} = s = \langle x, w \rangle$, $\hat{y} \in \mathbb{R}$
Inputs: $X = (x_1, \ldots, x_m)^T \in \mathbb{R}^{m \times (n+1)}$, i.e., the matrix with rows $x_i^T$
Targets: $y = (y_1, \ldots, y_m)^T$, $y_i \in \mathbb{R}$
Outputs: $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_m)^T$, $\hat{y}_i \in \mathbb{R}$
For the whole dataset we get: $\hat{y} = Xw$, $\hat{y} \in \mathbb{R}^m$

Linear Neuron: Maximum Likelihood Estimation
Assumption: the data are Gaussian distributed with mean $\langle x_i, w \rangle$ and variance $\sigma^2$:
$y_i \sim \mathcal{N}(\langle x_i, w \rangle, \sigma^2) = \langle x_i, w \rangle + \mathcal{N}(0, \sigma^2)$
Likelihood for i.i.d. data:
$p(y \mid w, X, \sigma) = \prod_{i=1}^{m} p(y_i \mid w, x_i, \sigma) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y_i - \langle w, x_i \rangle)^2} = (2\pi\sigma^2)^{-\frac{m}{2}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - \langle w, x_i \rangle)^2} = (2\pi\sigma^2)^{-\frac{m}{2}} e^{-\frac{1}{2\sigma^2}(y - Xw)^T (y - Xw)}$
Negative log likelihood (switching to minimization):
$L(w) = \frac{m}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}(y - Xw)^T (y - Xw)$

Linear Neuron: Maximum Likelihood Estimation (contd.)
Note that $\sum_{i=1}^{m} \underbrace{(y_i - \langle w, x_i \rangle)^2}_{\ell(y_i, \hat{y}_i)} = (y - Xw)^T (y - Xw)$ is the sum-of-squares or squared error (SE).
Minimization of $L(w)$ is therefore least squares estimation.
Solving $\frac{\partial L}{\partial w} = 0$ we get $w^* = (X^T X)^{-1} X^T y$ (see seminar).
[Figure: sample data and the fitted linear regression model.]
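A minimal sketch of the least-squares estimate $w^* = (X^T X)^{-1} X^T y$ on synthetic data; the data generation and the `np.linalg.solve` call are our illustration (solving the normal equations rather than inverting $X^T X$ explicitly).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # prepend x0 = 1 for the bias
w_true = rng.normal(size=n + 1)
y = X @ w_true + rng.normal(scale=0.1, size=m)             # Gaussian noise assumption

# closed-form least squares: solve (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w_hat - w_true, 3))                         # close to zero
```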

Logistic Sigmoid and Probability
Denote: $\hat{y} = \sigma(s)$, $\hat{y} \in (0, 1)$
The sigmoid output can represent a parameter of the Bernoulli distribution:
$p(y \mid \hat{y}) = \mathrm{Ber}(y \mid \hat{y}) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y} = \begin{cases} \hat{y} & \text{for } y = 1 \\ 1 - \hat{y} & \text{for } y = 0 \end{cases}$
Motivation: the log-odds linear model (see AE4B33RPZ).
Binary classifier: $h(\hat{y}) = \begin{cases} 1 & \text{if } \hat{y} > \frac{1}{2} \\ 0 & \text{else} \end{cases}$

Logistic Regression
An MCP neuron using the sigmoid activation function performs logistic regression:
$\hat{y} = \sigma(\langle w, x \rangle)$, $\hat{y} \in (0, 1)$
Inputs: $X \in \mathbb{R}^{m \times (n+1)}$ with rows $x_i^T$, as before
Target class: $y = (y_1, \ldots, y_m)^T$, $y_i \in \{0, 1\}$
Output class: $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_m)^T$, $\hat{y}_i \in (0, 1)$

Logistic Regression: MLE Leads to the Cross-Entropy
Likelihood for logistic regression:
$p(y \mid w, X) = \prod_{i=1}^{m} \mathrm{Ber}(y_i \mid \hat{y}_i) = \prod_{i=1}^{m} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$
Negative log likelihood:
$L(w) = \sum_{i=1}^{m} \underbrace{-\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]}_{\ell(y_i, \hat{y}_i)}$
This loss function is called the cross-entropy.
The $\ell(y_i, \hat{y}_i)$ is the negative log probability of the correct answer $y_i \in \{0, 1\}$ given by the model output $\hat{y}_i \in (0, 1)$.

Maximum Likelihood Estimation
Maximum likelihood estimation: $w^* = \operatorname{argmin}_w L(w)$
Derivative of the loss w.r.t. the sigmoid argument: $\frac{\partial L}{\partial s_i} = \hat{y}_i - y_i$ (see seminar)
Gradient w.r.t. the logistic regression parameters:
$\frac{\partial L}{\partial w} = \sum_{i=1}^{m} \frac{\partial L}{\partial s_i}\frac{\partial s_i}{\partial w} = \sum_{i=1}^{m} x_i(\hat{y}_i - y_i) = X^T(\hat{y} - y)$
$\frac{\partial L}{\partial w} = 0$ has no analytical solution, so we use numerical methods.
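Since $\frac{\partial L}{\partial w} = X^T(\hat{y} - y)$ has no closed-form root, a plain gradient-descent sketch for logistic regression might look as follows; the learning rate, iteration count and function names are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Minimize the cross-entropy by gradient descent; X includes the bias column."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w)       # model outputs in (0, 1)
        grad = X.T @ (y_hat - y)     # dL/dw = X^T (y_hat - y)
        w -= lr * grad / len(y)      # mean gradient keeps the step size scale-free
    return w
```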

Rectified Linear Unit (ReLU)
Definition: $f(s) = \max(0, s)$
- Fast to compute
- Helps with the vanishing gradients problem: the gradient is constant for $s > 0$, while for sigmoid-like activations it becomes increasingly small
- Leads to sparse representations: $s < 0$ turns the neuron completely off
- Unbounded: use regularization to prevent numerical problems
- Might block gradient propagation: dead units; the leaky ReLU addresses this

Multilayer Perceptron (MLP)
[Figure: a fully connected network with two hidden layers and a softmax output layer.]
- Feed-forward ANN
- Fully-connected layers
- An MLP for regression would typically use a linear output layer

Recurrent Neural Network (RNN)
[Figure: a fully-connected recurrent neural network.]
- Fully-Connected Recurrent Neural Network (FRNN)
- Both inputs and outputs are sequences
- Feedback connections provide memory

Modular and Hierarchical Architectures
- Layers can be organized in modules
- Hierarchies of modules
- Module reuse

Linear Layer
Output $k$: $\hat{y}_k = \langle x, w_k \rangle$, $k = 1, 2, \ldots, K$
All outputs using the weight matrix $W$: $\hat{y} = x^T W$
Multiple samples: $\hat{Y} = XW$
where $W = (w_1, \ldots, w_K) \in \mathbb{R}^{(n+1) \times K}$ with elements $w_{ij}$, $i = 0, \ldots, n$, $j = 1, \ldots, K$; $X \in \mathbb{R}^{m \times (n+1)}$ has rows $x_i^T$ and $\hat{Y} \in \mathbb{R}^{m \times K}$ has rows $\hat{y}_i^T$.

Softmax Layer
[Figure: softmax layer mapping inner potentials $s_1, \ldots, s_K$ to outputs $\hat{y}_1, \ldots, \hat{y}_K$.]
Multinomial classification, $K$ mutually exclusive classes.
Definition: $\sigma_k(s) \equiv \frac{e^{s_k}}{\sum_{c=1}^{K} e^{s_c}}$, where $K$ is the number of classes.
Softmax represents a probability distribution: $\sigma_k \in (0, 1)$ for $k \in \{1, \ldots, K\}$ and $\sum_{c=1}^{K}\sigma_c = 1$.
It describes class membership probabilities: $p(y = k \mid s) = \sigma_k(s)$.
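A numerically stable softmax sketch: subtracting the row maximum before exponentiating does not change $\sigma_k(s)$ but avoids overflow (the stabilization trick is our addition, not on the slide).

```python
import numpy as np

def softmax(S):
    """Row-wise softmax for an (m, K) matrix of inner potentials."""
    S = S - S.max(axis=1, keepdims=True)       # shift for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)    # each row sums to 1

P = softmax(np.array([[1.0, 2.0, 3.0]]))
print(P, P.sum())                              # a distribution over K = 3 classes
```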

Softmax Layer: MLE
Target: $y = (y_1, \ldots, y_m)^T$, $y_i \in \{1, 2, \ldots, K\}$
One-hot encoding for sample $i$ and class $k$: let $y_{ik} = [\![y_i = k]\!]$
Likelihood: $p(y \mid w, X) = \prod_{i=1}^{m}\prod_{c=1}^{K} \hat{y}_{ic}^{\,y_{ic}}$
Negative log likelihood: $L(w) = -\sum_{i=1}^{m}\sum_{c=1}^{K} y_{ic}\log(\hat{y}_{ic})$
Again the cross-entropy. See seminar for the gradient.
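A sketch of the multinomial cross-entropy for one-hot targets; the small clip that avoids $\log 0$ is our addition.

```python
import numpy as np

def cross_entropy(Y_onehot, Y_hat, eps=1e-12):
    """L(w) = -sum_i sum_c y_ic * log(y_hat_ic) for one-hot targets and softmax outputs."""
    return -np.sum(Y_onehot * np.log(np.clip(Y_hat, eps, 1.0)))
```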

Multinomial Logistic Regression
linear layer + softmax layer = multinomial logistic regression:
$\hat{y}_k = \sigma_k(x^T W)$
[Figure: a linear layer followed by a softmax layer.]
Classifier: $h(x, W) = \operatorname{argmax}_k \hat{y}_k$

Loss Functions: Summary
- binary classification: cross-entropy, $-\sum_{i=1}^{m}\left[y_i\log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$
- multinomial classification: multinomial cross-entropy, $-\sum_{i=1}^{m}\sum_{c=1}^{K} y_{ic}\log(\hat{y}_{ic})$
- regression: squared error, $\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$
- multi-output regression: squared error, $\sum_{i=1}^{m}\sum_{c=1}^{K}(y_{ic} - \hat{y}_{ic})^2$
The mean w.r.t. $m$ is often used; in that case these losses exactly correspond to the empirical risk $R_{\mathcal{T}^m}(h)$.

Backpropagation Overview
A method to compute the gradient $\nabla L(w)$ of the loss function with respect to its parameters.
$\nabla L(w)$ is in turn used by optimization methods like gradient descent.
Here we present the "modular" backpropagation (see Nando de Freitas' Machine Learning course: https://www.cs.ox.ac.uk/people/nandodefreitas/machinelearning/).
Let us use multinomial logistic regression as an example.
[Figure: linear layer, softmax layer and loss connected in sequence, with input $x$, target $y$ and loss $L$.]

Backpropagation: the Loss Function
The loss function is the multinomial cross-entropy in this case:
$L(w) = -\sum_{i=1}^{m}\sum_{c=1}^{K} [\![y_i = c]\!] \log\left(\frac{\exp(\langle x_i, w_c \rangle)}{\sum_{k=1}^{K}\exp(\langle x_i, w_k \rangle)}\right)$
[Figure: the same pipeline annotated with module outputs $z^1 = x$, $z^2 = s$, $z^3 = \hat{y}$, $z^4 = L$.]

Backpropagation Based on Modules
Computation of $\nabla L(w)$ involves repetitive use of the chain rule.
We can make things simpler by a divide and conquer approach:
- divide into the simplest possible modules (these can later be combined into complex hierarchies),
- represent even the loss function as a module,
- pass messages between modules.
[Figure: module $l$ with input $z^l$, output $z^{l+1}$ and parameter gradient $\partial L / \partial w^l$.]

Backpropagation: Backward Pass Message
Let $\delta^l = \frac{\partial L}{\partial z^l}$ be the sensitivity of the loss to the module input for layer $l$; then:
$\delta_i^l = \frac{\partial L}{\partial z_i^l} = \sum_j \frac{\partial L}{\partial z_j^{l+1}} \frac{\partial z_j^{l+1}}{\partial z_i^l} = \sum_j \delta_j^{l+1} \frac{\partial z_j^{l+1}}{\partial z_i^l}$
We need to know how to compute derivatives of outputs w.r.t. inputs only!
[Figure: chain of modules (linear layer, softmax, loss) with backward messages $\delta^l$ flowing from the loss back to the input.]

Backpropagation: Parameters
Similarly, if the module has parameters we want to know how the loss changes w.r.t. them:
$\frac{\partial L}{\partial w_i^l} = \sum_j \frac{\partial L}{\partial z_j^{l+1}} \frac{\partial z_j^{l+1}}{\partial w_i^l} = \sum_j \delta_j^{l+1} \frac{\partial z_j^{l+1}}{\partial w_i^l}$
Derivatives of module outputs w.r.t. the parameters are all we need.
[Figure: the same module chain, now with the parameter gradient $\partial L / \partial w^1$ of the linear layer highlighted.]

Backpropagation: Steps
So for each module we only need to specify these three messages:
- forward: $z^{l+1} = f(z^l)$
- backward: $\frac{\partial z^{l+1}}{\partial z^l}$
- parameter (optional): $\frac{\partial z^{l+1}}{\partial w^l}$
[Figure: the forward pass runs left to right through the modules; the backward pass (backpropagation) runs right to left and produces the parameter gradients along the way.]

Example: Linear Layer
- forward: $z_j^{l+1} = \sum_{i=0}^{n} w_{ij} z_i^l$, $j = 1, \ldots, K$
- backward: $\frac{\partial z_j^{l+1}}{\partial z_i^l} = w_{ij}$, $i = 0, \ldots, n$, $j = 1, \ldots, K$
- parameter: $\frac{\partial z_j^{l+1}}{\partial w_{ik}} = [\![j = k]\!]\, z_i^l$
[Figure: linear layer with inputs $z_0^l, \ldots, z_n^l$ and outputs $z_1^{l+1}, \ldots, z_K^{l+1}$.]
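A sketch of the linear layer as a backpropagation module implementing the three messages above; the class interface (`forward`/`backward` methods, cached input) is our own convention, not something fixed by the lecture.

```python
import numpy as np

class LinearModule:
    """Linear layer as a module: forward, backward and parameter messages."""

    def __init__(self, n_in, n_out):
        # weight matrix W with elements w_ij; row 0 acts as the bias if z[0] = 1
        self.W = 0.01 * np.random.randn(n_in, n_out)

    def forward(self, z):
        self.z = z                  # cache the input for the parameter message
        return z @ self.W           # z^{l+1}_j = sum_i w_ij z^l_i

    def backward(self, delta_next):
        # dL/dw_ij = delta^{l+1}_j * z^l_i  (parameter message)
        self.dW = np.outer(self.z, delta_next)
        # delta^l_i = sum_j delta^{l+1}_j * w_ij  (backward message)
        return delta_next @ self.W.T
```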

Example: Squared Error
- forward: $z^{l+1} = \sum_{i=1}^{n}(y_i - z_i^l)^2$
- backward: $\frac{\partial z^{l+1}}{\partial z_i^l} = -2(y_i - z_i^l)$, $i \in \{1, \ldots, n\}$

Gradient Descent
Task: find the parameters which minimize the loss over the training dataset:
$\theta^* = \operatorname{argmin}_\theta L(\theta)$,
where $\theta$ is the set of all parameters defining the ANN (e.g., all weight matrices).
Gradient descent: $\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \nabla L(\theta^{(t)})$,
where $\eta^{(t)} > 0$ is the learning rate or step size at iteration $t$.
[Figure: gradient descent trajectories on a two-dimensional loss surface for different learning rates.]

Batch, Online and Mini-Batch Learning
When to update the weights?
- (Full) batch learning: after all patterns are used (an epoch); inefficient for redundant datasets.
- Online learning: after each training pattern; the noise can help overcome local minima but can also harm convergence in the final stages while fine-tuning. Stochastic Gradient Descent (SGD) does this; it converges almost surely to a local minimum when $\eta^{(t)}$ decreases appropriately in time.
- Mini-batch learning: after a small sample of training patterns (see the sketch below).
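A minimal mini-batch SGD loop, assuming a `loss_grad(theta, X_batch, y_batch)` function that returns the gradient; that name and signature are our assumption for illustration.

```python
import numpy as np

def sgd(theta, X, y, loss_grad, lr=0.01, batch_size=32, n_epochs=10):
    """Mini-batch stochastic gradient descent over the training set."""
    m = len(y)
    for _ in range(n_epochs):
        perm = np.random.permutation(m)              # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - lr * loss_grad(theta, X[idx], y[idx])
    return theta
```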

Momentum
Simulate inertia to overcome plateaus in the error landscape:
$v^{(t+1)} = \mu v^{(t)} - \eta^{(t)} \nabla L(\theta^{(t)})$
$\theta^{(t+1)} = \theta^{(t)} + v^{(t+1)}$
where $\mu \in [0, 1]$ is the momentum parameter.
- Momentum damps oscillations in directions of high curvature.
- It builds up velocity in directions with a consistent (possibly small) gradient.
[Figure: gradient descent trajectories with and without momentum on a two-dimensional loss surface.]
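The momentum update from the slide, written as a single step (a sketch; the default values are common choices, not mandated by the lecture).

```python
def momentum_step(theta, velocity, grad, lr=0.01, mu=0.9):
    """One momentum update: v <- mu*v - lr*grad, theta <- theta + v."""
    velocity = mu * velocity - lr * grad
    theta = theta + velocity
    return theta, velocity
```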

Adagrad
Adaptive Gradient method (Duchi, Hazan and Singer, 2011).
Motivation: the magnitude of the gradient differs a lot for different parameters.
Idea: reduce the learning rates for parameters having high values of the gradient.
$g_i^{(t+1)} = g_i^{(t)} + \left(\frac{\partial L}{\partial \theta_i^{(t)}}\right)^2$
$\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\eta}{\sqrt{g_i^{(t+1)}} + \epsilon}\, \frac{\partial L}{\partial \theta_i^{(t)}}$
- $g_i$ accumulates the squared partial derivatives w.r.t. the parameter $\theta_i$.
- $\epsilon$ is a small positive number to prevent division by zero.
- Weakness: the ever increasing $g_i$ eventually leads to slow convergence.

RMSProp
Similar to Adagrad, but employs a moving average:
$g_i^{(t+1)} = \gamma g_i^{(t)} + (1 - \gamma)\left(\frac{\partial L}{\partial \theta_i^{(t)}}\right)^2$
- $\gamma$ is a decay parameter (typical value $\gamma = 0.9$).
- Unlike for Adagrad, the updates do not get infinitesimally small.
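The Adagrad and RMSProp updates side by side, following the formulas above; the default hyperparameter values are common choices, not prescribed by the lecture.

```python
import numpy as np

def adagrad_step(theta, g_acc, grad, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients, scale each parameter's step."""
    g_acc = g_acc + grad ** 2
    theta = theta - lr * grad / (np.sqrt(g_acc) + eps)
    return theta, g_acc

def rmsprop_step(theta, g_avg, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: moving average of squared gradients instead of a running sum."""
    g_avg = gamma * g_avg + (1.0 - gamma) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(g_avg) + eps)
    return theta, g_avg
```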

Regularization
How to deal with overfitting?
- Get more data.
- Find a simpler model; search for the optimal architecture, e.g., number, type and size of layers.
- Use regularization.
Most types of regularization are based on penalties for model complexity.
Bayesian point of view: introduce a prior distribution on the model parameters.

L2 Regularization
Recall the solution for linear regression: $w^* = (X^T X)^{-1} X^T y$. What if $X^T X$ has no inverse?
We can modify the solution by adding a small element to the diagonal:
$w^* = (X^T X + \lambda I)^{-1} X^T y$, $\lambda > 0$
It turns out that this approach not only helps with inverting $X^T X$ but also improves model generalization.
It is the solution of the regularized loss function:
$L(w) = (y - Xw)^T (y - Xw) + \lambda w^T w$
This is called L2 regularization; see seminar for the derivation.
The term $\lambda w^T w = \lambda \lVert w \rVert_2^2$ minimizes the size of the weight vector. Note that we omit the bias in $\lambda w^T w$.
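A sketch of the ridge (L2-regularized) closed-form solution $w^* = (X^T X + \lambda I)^{-1} X^T y$; note that, unlike the slide's convention, this simple version also penalizes the bias weight.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form L2-regularized least squares (ridge regression)."""
    n = X.shape[1]
    A = X.T @ X + lam * np.eye(n)      # adding lam*I makes A invertible
    return np.linalg.solve(A, X.T @ y)
```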

L2 Regularization as a Gaussian Prior
Recall the likelihood:
$p(y \mid w, X) = (2\pi\sigma^2)^{-\frac{m}{2}} e^{-\frac{1}{2\sigma^2}(y - Xw)^T (y - Xw)}$
Define a Gaussian prior with zero mean and variance $\sigma_0^2$ for the parameters:
$p(w) = \left(2\pi\sigma_0^2\right)^{-\frac{n+1}{2}} e^{-\frac{1}{2\sigma_0^2} w^T w}$
Then the posterior is:
$p(w \mid y, X) = \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)}$
The denominator does not depend on the parameters $w$:
$p(w \mid y, X) \propto p(y \mid w, X)\, p(w)$

MAP Estimate
Maximizing $p(w \mid y, X)$ gives us the Maximum A Posteriori (MAP) estimate:
$w_{\mathrm{MAP}} = \operatorname{argmax}_w p(w \mid y, X) = \operatorname{argmin}_w \left(-\log p(w \mid y, X)\right)$
where
$-\log p(w \mid y, X) = \frac{1}{2\sigma^2}(y - Xw)^T (y - Xw) + \frac{1}{2\sigma_0^2} w^T w + C$
We can omit $C$, define $\lambda = \frac{\sigma^2}{\sigma_0^2}$ and minimize the loss function we already know:
$L(w) = (y - Xw)^T (y - Xw) + \lambda w^T w$

Weight Decay Discussion
- Having a zero-mean Gaussian prior keeps the weights smaller.
- Weight decay is widely used for most types of layers in ANNs.
- Intuition: sigmoid-like neurons kept near zero potential (via small weights) behave similarly to linear neurons.
- The same works for other models, e.g., polynomial regression.
- $\lambda$ is usually set using cross-validation.

Other Regularization Approaches
- L1 regularization: sum of absolute values, i.e., use $\lambda \lVert w \rVert_1$.
- Early stopping: start with small weights, stop when the validation loss starts to grow.
- Randomizing inputs: same effect as weight decay for linear neurons.
- Weight sharing and sparse connectivity: Convolutional Neural Networks.
- Model averaging.
- Dropout and DropConnect.
- Augmenting the dataset.

Next Lecture
- Deep Neural Networks
- Convolutional Neural Networks
- Transfer learning
