Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation


Deep Neural Networks (3): Computational Graphs, Learning Algorithms, Initialisation
Steve Renals
Machine Learning Practical, MLP Lecture 5, 16 October 2018

Computational Graphs

Computational graphs
- Each node is an operation
- Data flows between nodes (scalars, vectors, matrices, tensors)
- More complex operations can be formed by composing simpler operations

Computational graph example 1
[Figure: inputs x and y feed a multiplication node whose output is z]
Graph to compute $z = xy$

Computational graph example 2
[Figure: x and w feed a dot node, whose output is added to b, then passed through a sigmoid node to give y]
Graph for logistic regression: $y = \mathrm{sigmoid}(\mathbf{w}^{\top}\mathbf{x} + b)$

Computational graph example 3
[Figure: X and W feed a matmul node, whose output is added to b, then passed through a relu node to give H]
Graph for a ReLU layer: $H = \mathrm{relu}(XW + b)$

Computational graphs and back-propagation
A node $z = f(x, y)$ receives an incoming gradient $\partial E/\partial z$ and passes gradients back to its inputs:
$$\frac{\partial E}{\partial x} = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial x} \qquad \frac{\partial E}{\partial y} = \frac{\partial E}{\partial z}\,\frac{\partial z}{\partial y}$$
The chain rule of differentiation is applied as the backward pass through the computational graph.

Computational graphs
- Each node is an operation
- Data flows between nodes (scalars, vectors, matrices, tensors)
- More complex operations can be formed by composing simpler operations
- The chain rule of differentiation is implemented as a backward pass through the graph
- Back-propagation: multiply the local gradient of an operation by the incoming gradient (or sum of incoming gradients)
- See http://colah.github.io/posts/2015-08-backprop/
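As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the logistic-regression graph from example 2: a forward pass through the dot, add and sigmoid nodes, then a backward pass that multiplies each node's local gradient by the incoming gradient. The squared-error loss and the variable names are my own choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Forward pass through the graph: dot -> add -> sigmoid
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.3
u = w @ x            # dot node
a = u + b            # + node
y = sigmoid(a)       # sigmoid node

# Backward pass: multiply local gradients by the incoming gradient.
# Assume E = 0.5 * (y - t)^2 for a target t, so dE/dy = y - t.
t = 1.0
dE_dy = y - t
dE_da = dE_dy * y * (1.0 - y)   # sigmoid local gradient: dy/da = y(1 - y)
dE_db = dE_da                   # + node passes the gradient through unchanged
dE_dw = dE_da * x               # dot node: du/dw = x
dE_dx = dE_da * w               # dot node: du/dx = w
print(dE_dw, dE_db)
```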

How to set the learning rate?

Weight Updates
Let $d_i(t) = \partial E/\partial w_i(t)$ be the gradient of the error function $E$ with respect to a weight $w_i$ at update time $t$.
Vanilla gradient descent updates the weight along the negative gradient direction:
$$\Delta w_i(t) = -\eta\, d_i(t) \qquad w_i(t+1) = w_i(t) + \Delta w_i(t)$$
The hyperparameter $\eta$ is the learning rate. Initialise $\eta$, and update it as training progresses (learning rate schedule).
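A minimal sketch of this update in NumPy (the `grad` callable and the example numbers are assumptions for illustration):

```python
import numpy as np

def sgd_step(w, grad, eta=0.1):
    """Vanilla gradient descent: delta_w = -eta * d, then w <- w + delta_w."""
    d = grad(w)
    return w - eta * d

# Example: one step on E(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
w = sgd_step(w, grad=lambda w: w, eta=0.1)
```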

Learning Rate Schedules
- Proofs of convergence for stochastic optimisation rely on a learning rate that decreases over time (as $1/t$): Robbins and Monro (1951)
- A learning rate schedule typically takes larger steps initially, followed by smaller steps for fine tuning; this gives faster convergence and better solutions
- Time-dependent schedules, $\Delta w_i(t) = -\eta(t)\, d_i(t)$:
  - Piecewise constant: pre-determined $\eta$ for each epoch
  - Exponential: $\eta(t) = \eta(0)\exp(-t/r)$ ($r \sim$ training set size)
  - Reciprocal: $\eta(t) = \eta(0)(1 + t/r)^{-c}$ ($c \approx 1$)
- Performance-dependent $\eta$, e.g. NewBOB: fixed $\eta$ until the validation set stops improving, then halve $\eta$ each epoch (i.e. constant, then exponential)
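The time-dependent schedules above can be written directly as functions of the update count; a sketch with assumed parameter names:

```python
import numpy as np

def eta_exponential(t, eta0, r):
    """eta(t) = eta(0) * exp(-t / r), with r of the order of the training set size."""
    return eta0 * np.exp(-t / r)

def eta_reciprocal(t, eta0, r, c=1.0):
    """eta(t) = eta(0) * (1 + t / r)**(-c), with c approximately 1."""
    return eta0 * (1.0 + t / r) ** (-c)

def eta_newbob(eta, improved):
    """NewBOB-style rule: keep eta while the validation error improves,
    then halve it each epoch once improvement stops."""
    return eta if improved else eta / 2.0
```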

Training with Momentum
$$\Delta w_i(t) = -\eta\, d_i(t) + \alpha\, \Delta w_i(t-1)$$
- $\alpha \approx 0.9$ is the momentum hyperparameter
- Weight changes start by following the gradient
- After a few updates they acquire velocity: no longer pure gradient descent
- The momentum term encourages the weight change to continue in the previous direction
- This damps the random fluctuations of the gradients, encouraging weight changes in a consistent direction
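A sketch of the momentum update, carrying the previous weight change as state (function and variable names are my own):

```python
def momentum_step(w, d, delta_prev, eta=0.01, alpha=0.9):
    """delta_w(t) = -eta * d(t) + alpha * delta_w(t-1); returns new weights and velocity."""
    delta = -eta * d + alpha * delta_prev
    return w + delta, delta

# Usage: delta_prev starts at 0 (or zeros_like(w)) and is carried between updates.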

Adaptive Learning Rates
- Tuning the learning rate (and momentum) hyperparameters can be expensive (hyperparameter grid search): it works, but we can do better
- Adaptive learning rates and per-weight learning rates:
  - AdaGrad: normalise the update for each weight
  - RMSProp: AdaGrad forces the learning rate to always decrease; RMSProp relaxes this constraint
  - Adam: RMSProp with momentum
- Well explained by Andrej Karpathy at http://cs231n.github.io/neural-networks-3/

AdaGrad
Separate, normalised update for each weight, normalised by the sum of squared gradients $S$:
$$S_i(0) = 0 \qquad S_i(t) = S_i(t-1) + d_i(t)^2$$
$$\Delta w_i(t) = -\frac{\eta}{\sqrt{S_i(t)} + \epsilon}\, d_i(t)$$
- $\epsilon \approx 10^{-8}$ is a small constant to prevent division-by-zero errors
- The update step for a parameter $w_i$ is normalised by the square root of the sum of squared gradients for that parameter
- Weights with larger gradient magnitudes will have smaller effective learning rates
- $S_i$ cannot get smaller, so the effective learning rates decrease monotonically
- AdaGrad can decrease the effective learning rate too aggressively in deep networks
Duchi et al, http://jmlr.org/papers/v12/duchi11a.html
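A per-weight sketch of the AdaGrad update as written above; $S$ starts at zero and is carried between updates (names are my own):

```python
import numpy as np

def adagrad_step(w, d, S, eta=0.01, eps=1e-8):
    """S_i(t) = S_i(t-1) + d_i(t)^2; step divided by sqrt(S) + eps."""
    S = S + d ** 2
    w = w - (eta / (np.sqrt(S) + eps)) * d
    return w, S

# Usage: S starts at np.zeros_like(w) and is carried between updates.
```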

RMSProp
- RProp (Riedmiller & Braun, http://dx.doi.org/10.1109/icnn.1993.298623) is a method for batch gradient descent with an adaptive learning rate for each parameter, using only the sign of the gradient (which is equivalent to normalising by the gradient magnitude)
- RMSProp can be viewed as a stochastic gradient descent version of RProp, normalised by a moving average of the squared gradient (Hinton, http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
- Similar to AdaGrad, but replacing the sum by a moving average for $S$:
$$S_i(t) = \beta S_i(t-1) + (1-\beta)\, d_i(t)^2$$
$$\Delta w_i(t) = -\frac{\eta}{\sqrt{S_i(t)} + \epsilon}\, d_i(t)$$
- $\beta \approx 0.9$ is the decay rate
- Effective learning rates are no longer guaranteed to decrease
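The corresponding RMSProp sketch changes only the accumulator, from a running sum to a moving average (names are my own):

```python
import numpy as np

def rmsprop_step(w, d, S, eta=0.001, beta=0.9, eps=1e-8):
    """S_i(t) = beta * S_i(t-1) + (1 - beta) * d_i(t)^2; step divided by sqrt(S) + eps."""
    S = beta * S + (1.0 - beta) * d ** 2
    w = w - (eta / (np.sqrt(S) + eps)) * d
    return w, S
```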

Adam
Hinton commented about RMSProp: "Momentum does not help as much as it normally does."
Adam (Kingma & Ba, https://arxiv.org/abs/1412.6980) can be viewed as addressing this; it is a variant of RMSProp with momentum:
$$M_i(t) = \alpha M_i(t-1) + (1-\alpha)\, d_i(t)$$
$$S_i(t) = \beta S_i(t-1) + (1-\beta)\, d_i(t)^2$$
$$\Delta w_i(t) = -\frac{\eta}{\sqrt{S_i(t)} + \epsilon}\, M_i(t)$$
Here a momentum-smoothed gradient is used for the update in place of the raw gradient. Kingma and Ba recommend $\alpha \approx 0.9$, $\beta \approx 0.999$.
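A sketch of the Adam update as written on the slide; note that the published algorithm also bias-corrects $M$ and $S$ for their zero initialisation, a step omitted here as it is on the slide:

```python
import numpy as np

def adam_step(w, d, M, S, eta=0.001, alpha=0.9, beta=0.999, eps=1e-8):
    """M and S both start at zero and are carried between updates."""
    M = alpha * M + (1.0 - alpha) * d        # momentum-smoothed gradient
    S = beta * S + (1.0 - beta) * d ** 2     # moving average of the squared gradient
    w = w - (eta / (np.sqrt(S) + eps)) * M
    return w, M, S
```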

Many hyperparameters: batch size, learning rate, momentum, learning rate decay, number of layers, number of units, ... How should we set them?

Coursework 1
http://www.inf.ed.ac.uk/teaching/courses/mlp/coursework-2018.html
- Build a baseline using the EMNIST dataset
- Compare RMSProp and Adam with SGD
- Cosine annealing learning rate scheduler
- L2 regularisation / weight decay with Adam and cosine annealing
- Inspired by Loshchilov and Hutter, "Fixing Weight Decay Regularization in Adam"
Main aims of the coursework:
- Read and understand a recent paper in the area
- Take the ideas from the paper, implement them in Python, and carry out experiments to address research questions
- Write a clear, concise, correct report that explains what you did and why you did it, and provides an interpretation of your results and some conclusions
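As a starting point for the cosine annealing scheduler, here is a minimal sketch of a single cosine period (no warm restarts), following the schedule described by Loshchilov and Hutter; the bounds eta_max and eta_min are assumed hyperparameters, not values from the coursework:

```python
import numpy as np

def eta_cosine(t, T, eta_max=0.1, eta_min=0.0):
    """eta(t) = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T)), 0 <= t <= T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + np.cos(np.pi * t / T))
```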

How should we initialise deep networks?

Random weight initialisation
- Initialise weights to small random numbers, sampling weights independently from a Gaussian or from a uniform distribution
- Control the initialisation by setting the mean (typically 0) and variance of the weight distribution
- Biases may be initialised to 0
  - Output (softmax) biases can be initialised to $\log(P(c))$, the log prior probability of the corresponding class $c$
- Calibration: make the variance of the input to a unit independent of the number of incoming connections (fan-in, $n_{in}$)
- Heuristic: $w_i \sim U(-1/\sqrt{n_{in}},\, 1/\sqrt{n_{in}})$ [$U$ is the uniform distribution]
- This corresponds to a variance $\mathrm{Var}(w_i) = 1/(3 n_{in})$
  (since, if $x \sim U(a, b)$, then $\mathrm{Var}(x) = (b-a)^2/12$; so if $x \sim U(-n, n)$, then $\mathrm{Var}(x) = n^2/3$)
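A sketch of this fan-in heuristic for one fully connected layer (the shape convention and the seed are my choices):

```python
import numpy as np

rng = np.random.RandomState(1234)

def init_layer(n_in, n_out):
    """w ~ U(-1/sqrt(n_in), 1/sqrt(n_in)), so Var(w) = 1/(3*n_in); biases start at 0."""
    limit = 1.0 / np.sqrt(n_in)
    W = rng.uniform(-limit, limit, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b
```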

Why $\mathrm{Var}(w) \sim 1/n$?
Consider a linear unit:
$$y = \sum_{i=1}^{n_{in}} w_i x_i$$
If $w$ and $x$ are zero-mean and iid (independent and identically distributed), then
$$\mathrm{Var}(y) = \mathrm{Var}\!\left(\sum_{i=1}^{n_{in}} w_i x_i\right) = n_{in}\, \mathrm{Var}(x)\, \mathrm{Var}(w)$$
So, if we want the variance of the inputs $x$ and outputs $y$ to be the same, set
$$\mathrm{Var}(w_i) = \frac{1}{n_{in}}$$
Nicely explained at http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
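A quick empirical check of this relation (toy sizes, my choice): drawing zero-mean inputs with unit variance and weights with variance $1/n_{in}$, the output variance stays close to 1.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_samples = 1000, 5000
x = rng.randn(n_samples, n_in)                                          # Var(x) = 1
w = rng.uniform(-np.sqrt(3.0 / n_in), np.sqrt(3.0 / n_in), size=n_in)   # Var(w) = 1/n_in
y = x @ w
print(np.var(x), np.var(y))   # both approximately 1: the variance is preserved
```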

GlorotInit ("Xavier initialisation")
- We would like to constrain the variance of each layer to be $1/n_{in}$, thus $w_i \sim U(-\sqrt{3/n_{in}},\, \sqrt{3/n_{in}})$
- However, we also need to take back-propagation into account, hence we would also like $\mathrm{Var}(w_i) = 1/n_{out}$
- As a compromise, set the variance to $\mathrm{Var}(w_i) = 2/(n_{in} + n_{out})$
- This corresponds to Glorot and Bengio's normalised initialisation:
$$w_i \sim U\!\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\; \sqrt{\frac{6}{n_{in}+n_{out}}}\right)$$
Glorot and Bengio, "Understanding the difficulty of training deep feedforward networks", AISTATS, 2010. http://www.jmlr.org/proceedings/papers/v9/glorot10a.html
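A sketch of the normalised (Glorot/Xavier) initialisation for a weight matrix, under the same assumed conventions as the earlier initialisation sketch:

```python
import numpy as np

rng = np.random.RandomState(1234)

def glorot_init(n_in, n_out):
    """w ~ U(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out))), i.e. Var(w) = 2/(n_in+n_out)."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))
```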

Summary
- Computational graphs
- Learning rate schedules and gradient descent algorithms
- Initialising the weights
Reading:
- Goodfellow et al, sections 6.5, 8.3, 8.5
- Olah, "Calculus on Computational Graphs: Backpropagation", http://colah.github.io/posts/2015-08-backprop/
- Andrej Karpathy, CS231n notes (Stanford), http://cs231n.github.io/neural-networks-3/
Additional reading:
- Kingma and Ba, "Adam: A Method for Stochastic Optimization", ICLR 2015, https://arxiv.org/abs/1412.6980
- Glorot and Bengio, "Understanding the difficulty of training deep feedforward networks", AISTATS 2010, http://www.jmlr.org/proceedings/papers/v9/glorot10a.html