Training Neural Networks: Practical Issues


Training Neural Networks: Practical Issues. M. Soleymani, Sharif University of Technology, Fall 2017. Most slides have been adapted from the lectures of Fei-Fei Li and colleagues (cs231n, Stanford, 2017), and some from Andrew Ng's lectures (Deep Learning Specialization, Coursera, 2017).

Outline Activation Functions Data Preprocessing Weight Initialization Checking gradients

Neuron

Activation functions

Activation functions: sigmoid. Squashes numbers to the range [0,1]. Historically popular since they have a nice interpretation as a saturating firing rate of a neuron. σ(x) = 1 / (1 + e^(-x)). Problems: 1. Saturated neurons kill the gradients.

Activation functions: sigmoid What happens when x = -10? What happens when x = 0? What happens when x = 10?

Activation functions: sigmoid. Squashes numbers to the range [0,1]. Historically popular since they have a nice interpretation as a saturating firing rate of a neuron. σ(x) = 1 / (1 + e^(-x)). Problems: 1. Saturated neurons kill the gradients. 2. Sigmoid outputs are not zero-centered.

Activation functions: sigmoid. Consider what happens when the input to a neuron (x) is always positive: f(Σ_i w_i x_i + b). What can we say about the gradients on w?

Activation functions: sigmoid. Consider what happens when the input to a neuron is always positive: f(Σ_i w_i x_i + b). What can we say about the gradients on w? They are always all positive or all negative.

Activation functions: sigmoid. Squashes numbers to the range [0,1]. Historically popular since they have a nice interpretation as a saturating firing rate of a neuron. σ(x) = 1 / (1 + e^(-x)). 3 problems: 1. Saturated neurons kill the gradients. 2. Sigmoid outputs are not zero-centered. 3. exp() is a bit compute expensive.
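A minimal NumPy sketch (my own illustration, not from the slides) of problem 1: the sigmoid's local gradient σ(x)(1 - σ(x)) is essentially zero for saturated inputs such as x = ±10, so almost no gradient flows back through a saturated neuron.

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # local gradient: d sigma / dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(f"x = {x:+.1f}  sigmoid = {sigmoid(x):.5f}  local grad = {sigmoid_grad(x):.2e}")
# x = -10 and x = +10 give local gradients of ~4.5e-05: the neuron is saturated
# and "kills" the gradient; only near x = 0 (grad = 0.25) does it pass signal.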

Activation functions: tanh. Squashes numbers to the range [-1,1]. Zero-centered (nice). Still kills gradients when saturated :( tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x)) = 2σ(2x) - 1. [LeCun et al., 1991]

Activation functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012] Does not saturate (in +region) Very computationally efficient Converges much faster than sigmoid/tanh in practice (e.g. 6x) Actually more biologically plausible than sigmoid f(x) = max(0, x)

Activation functions: ReLU What happens when x = -10? What happens when x = 0? What happens when x = 10?

Activation functions: ReLU [Krizhevsky et al., 2012] Does not saturate (in +region) Very computationally efficient Converges much faster than sigmoid/tanh in practice (e.g. 6x) Actually more biologically plausible than sigmoid Problems: Not zero-centered output An annoyance: hint: what is the gradient when x < 0? f(x) = max(0, x)

Activation functions: ReLU => people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)

Activation functions: Leaky ReLU, f(x) = max(0.01x, x). Does not saturate. Computationally efficient. Converges much faster than sigmoid/tanh in practice (e.g. 6x). Will not "die". [Maas et al., 2013] [He et al., 2015] Parametric Rectifier (PReLU): f(x) = max(αx, x), where α is learned during backpropagation.

Activation Functions: Exponential Linear Units (ELU) All benefits of ReLU Closer to zero mean outputs Negative saturation regime compared with Leaky ReLU adds some robustness to noise [Clevert et al., 2015] Computation requires exp()

Maxout Neuron [Goodfellow et al., 2013] Does not have the basic form of dot product->nonlinearity Generalizes ReLU and Leaky ReLU Linear Regime! Does not saturate! Does not die! Problem: doubles the number of parameters/neuron :(

Activation functions: TLDR. In practice: Use ReLU. Be careful with your learning rates. Try out Leaky ReLU / Maxout / ELU. Try out tanh but don't expect much. Don't use sigmoid.
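A hedged NumPy sketch (not part of the slides) collecting the activations discussed above; the ELU alpha = 1.0 and Leaky ReLU slope 0.01 are common defaults, assumed here for illustration.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)                       # max(0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)            # max(slope*x, x) for slope < 1

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # maxout over two linear pieces: max(W1 x + b1, W2 x + b2)
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")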

Data Preprocessing

Normalizing the input. On the training set, compute the mean and variance of each input (feature): μ_i = (1/N) Σ_{n=1..N} x_i^(n), σ_i^2 = (1/N) Σ_{n=1..N} (x_i^(n) - μ_i)^2.

Normalizing the input. On the training set, compute the mean and variance of each input (feature): μ_i = (1/N) Σ_{n=1..N} x_i^(n), σ_i^2 = (1/N) Σ_{n=1..N} (x_i^(n) - μ_i)^2. Remove the mean from each input: x_i ← x_i - μ_i. Normalize the variance: x_i ← x_i / σ_i. If we normalize the training set, we use the same μ and σ^2 to normalize the test data too.
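A minimal sketch of this preprocessing in NumPy (my own illustration, not from the slides): statistics are computed on the training set only and reused on the test set; the small eps guards against division by zero for constant features.

import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    # per-feature statistics, computed on the training set only
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps
    return mu, sigma

def apply_normalizer(X, mu, sigma):
    # zero-mean, unit-variance features
    return (X - mu) / sigma

X_train = np.random.randn(1000, 20) * 5.0 + 3.0   # toy data with non-zero mean/scale
X_test = np.random.randn(200, 20) * 5.0 + 3.0
mu, sigma = fit_normalizer(X_train)
X_train_n = apply_normalizer(X_train, mu, sigma)
X_test_n = apply_normalizer(X_test, mu, sigma)     # same mu, sigma as on the training set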

Why zero-mean input? Reminder: sigmoid Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? Always all positive or all negative this is also why you want zero-mean data!

[Andrew Ng, Deep Neural Network, 2017] 2017 Coursera Inc. Why normalize inputs? Very different range of input features

[Andrew Ng, Deep Neural Network, 2017] 2017 Coursera Inc. Why normalize inputs? Speed up training

TLDR: In practice for images: center only. Example: CIFAR-10 with [32,32,3] images. Subtract the mean image (e.g. AlexNet); the mean image is a [32,32,3] array. Or subtract the per-channel mean (e.g. VGGNet); the mean along each channel is 3 numbers. It is not common to normalize the variance or to do PCA or whitening.
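A sketch of the two centering options (my own illustration with an assumed (N, 32, 32, 3) array layout for a CIFAR-10-like batch):

import numpy as np

X_train = np.random.randint(0, 256, size=(50000, 32, 32, 3)).astype(np.float32)

# Option 1 (AlexNet-style): subtract the mean image, a [32, 32, 3] array
mean_image = X_train.mean(axis=0)
X_centered = X_train - mean_image

# Option 2 (VGGNet-style): subtract the per-channel mean, 3 numbers
channel_mean = X_train.mean(axis=(0, 1, 2))        # shape (3,)
X_centered = X_train - channel_mean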

Weight initialization How to choose the starting point for the iterative process of optimization

Weight initialization. Initializing to zero (or to equal values) does not work: all neurons then compute the same output and receive the same gradient, so they never differentiate. First idea: small random numbers, e.g. Gaussian with zero mean and 1e-2 standard deviation. Works ~okay for small networks, but causes problems with deeper networks.

Multi-layer network. Output = a^[L] = f(z^[L]) = f(W^[L] a^[L-1]) = f(W^[L] f(W^[L-1] a^[L-2])) = ... = f(W^[L] f(W^[L-1] ... f(W^[2] f(W^[1] x)))).

Vanishing/exploding gradients. ∂E_n/∂W^[l] = (∂E_n/∂a^[l]) (∂a^[l]/∂W^[l]) = δ^[l] a^[l-1] f'(z^[l]); δ^[l-1] = f'(z^[l]) W^[l] δ^[l] = f'(z^[l]) W^[l] f'(z^[l+1]) W^[l+1] δ^[l+1] = ... = f'(z^[l]) W^[l] f'(z^[l+1]) W^[l+1] f'(z^[l+2]) W^[l+2] ... f'(z^[L]) W^[L] δ^[L]

Vanishing/exploding gradients. δ^[l-1] = f'(z^[l]) W^[l] f'(z^[l+1]) W^[l+1] f'(z^[l+2]) W^[l+2] ... f'(z^[L]) W^[L] δ^[L]. For deep networks: large weights can cause exploding gradients; small weights can cause vanishing gradients.

Vanishing/exploding gradients. δ^[l-1] = f'(z^[l]) W^[l] f'(z^[l+1]) W^[l+1] f'(z^[l+2]) W^[l+2] ... f'(z^[L]) W^[L] δ^[L]. For deep networks: large weights can cause exploding gradients; small weights can cause vanishing gradients. Example: W^[1] = ... = W^[L] = ωI and f(z) = z give δ^[1] = (ωI)^(L-1) δ^[L] = ω^(L-1) δ^[L].
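A tiny numeric illustration of this example (my own, with an assumed depth of L = 50): a per-layer factor slightly below 1 shrinks the gradient to almost nothing, and a factor slightly above 1 blows it up.

L = 50
for omega in (0.9, 1.0, 1.1):
    factor = omega ** (L - 1)          # |delta^[1]| / |delta^[L]| when W^[l] = omega*I, f(z) = z
    print(f"omega = {omega}: gradient scaled by {factor:.3e}")
# omega = 0.9 -> ~5.7e-03 (vanishing), omega = 1.1 -> ~1.07e+02 (exploding)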

Let's look at some activation statistics. E.g. a 10-layer net with 500 neurons in each layer, using tanh non-linearities. Initialization: Gaussian with zero mean and 0.01 standard deviation.

Let's look at some activation statistics. Initialization: Gaussian with zero mean and 0.01 standard deviation; tanh activation function. All activations become zero! Q: think about the backward pass. What do the gradients look like? Hint: think about the backward pass for a W*X gate.

Let's look at some activation statistics. Initialization: Gaussian with zero mean and 1 standard deviation; tanh activation function. Almost all neurons are completely saturated, either at -1 or at 1. Gradients will be all zero.
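A condensed sketch of this experiment (adapted in spirit from the cs231n demo; the layer sizes and the scales 0.01 and 1.0 follow the slides, everything else is my own): it prints the per-layer mean and std of tanh activations for a 10-layer, 500-unit net.

import numpy as np

def activation_stats(weight_scale, n_layers=10, n_units=500, n_samples=1000):
    rng = np.random.default_rng(0)
    h = rng.standard_normal((n_samples, n_units))        # unit-Gaussian input data
    for layer in range(n_layers):
        W = rng.standard_normal((n_units, n_units)) * weight_scale
        h = np.tanh(h @ W)                                # forward through one layer
        print(f"  layer {layer + 1}: mean {h.mean():+.4f}, std {h.std():.4f}")

print("weight scale 0.01 (activations shrink toward zero):")
activation_stats(0.01)
print("weight scale 1.0 (activations saturate at +/-1):")
activation_stats(1.0)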

Xavier initialization: intuition. To have similar variances for the neurons' outputs, neurons with a larger number of inputs need weights with smaller variance: z = w_1 x_1 + ... + w_r x_r. Thus, we scale down the weight variance when the fan-in is higher. In this way, Xavier initialization helps reduce the exploding and vanishing gradient problems.

Xavier initialization [Glorot et al., 2010]. Initialization: Gaussian with zero mean and standard deviation 1/√fan_in, where fan_in for fully connected layers is the number of neurons in the previous layer. Tanh activation function. Reasonable initialization. (The mathematical derivation assumes linear activations.)

Xavier initialization. Initialization: Gaussian with zero mean and standard deviation 1/√fan_in, but when using the ReLU nonlinearity it breaks.

Initialization: ReLU activation. Initialization: Gaussian with zero mean and standard deviation 1/√(fan_in / 2), i.e. √(2/fan_in) [He et al., 2015] (note the additional /2).

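A sketch of these two initializations in NumPy (my own illustration; the layer shapes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 500, 500

# Xavier initialization (as on the slides): std = 1/sqrt(fan_in), suited to tanh
W_xavier = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# He initialization: std = sqrt(2/fan_in), i.e. the extra factor of 2 for ReLU units
W_he = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)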

Proper initialization is an active area of research Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010 Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013 Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014 Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015 Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015 All you need is a good init, Mishkin and Matas, 2015.

Check the implementation of backpropagation. After implementing backprop, we may need to check whether it is correct. We need to identify the modules that may have a bug: which elements of the approximate gradient vector are far from the corresponding ones in the exact (backprop) gradient vector?

Check the implementation of backpropagation. Example: f(x) = x^4. One-sided difference (f(x + h) - f(x)) / h: error O(h) as h → 0. Centered difference (f(x + h) - f(x - h)) / (2h): error O(h^2) as h → 0.

Check the implementation of backpropagation. Example: f(x) = x^4. One-sided difference (f(x + h) - f(x)) / h: error O(h) as h → 0. Centered difference (f(x + h) - f(x - h)) / (2h): error O(h^2) as h → 0. (f(3 + 0.001) - f(3 - 0.001)) / (2 · 0.001) = (81.1080540 - 80.8920540) / 0.002 = 108.000006, while the exact derivative is f'(3) = 4 · 3^3 = 108.

Check gradients and debug backpropagation: θ is a vector containing all parameters of the network (W^[1], ..., W^[L], vectorized and concatenated).

Check gradients and debug backpropagation: θ is a vector containing all parameters of the network (W^[1], ..., W^[L], vectorized and concatenated). dJ_i = ∂J/∂θ_i (i = 1, ..., p) is found via backprop.

Check gradients and debug backpropagation: θ is a vector containing all parameters of the network (W^[1], ..., W^[L], vectorized and concatenated). dJ_i = ∂J/∂θ_i (i = 1, ..., p) is found via backprop. For each i: dJ_approx[i] = ( J(θ_1, ..., θ_{i-1}, θ_i + h, θ_{i+1}, ..., θ_p) - J(θ_1, ..., θ_{i-1}, θ_i - h, θ_{i+1}, ..., θ_p) ) / (2h).

Check gradients and debug backpropagation: θ is a vector containing all parameters of the network (W^[1], ..., W^[L], vectorized and concatenated). dJ_i = ∂J/∂θ_i (i = 1, ..., p) is found via backprop. For each i: dJ_approx[i] = ( J(θ_1, ..., θ_{i-1}, θ_i + h, θ_{i+1}, ..., θ_p) - J(θ_1, ..., θ_{i-1}, θ_i - h, θ_{i+1}, ..., θ_p) ) / (2h). Compute ε = ||dJ_approx - dJ||_2 / (||dJ_approx||_2 + ||dJ||_2).
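A minimal sketch of this check (my own illustration; loss_fn and grad_fn are hypothetical stand-ins for your network's loss and backprop gradient):

import numpy as np

def gradient_check(loss_fn, grad_fn, theta, h=1e-7):
    """Compare the backprop gradient against centered finite differences."""
    dJ = grad_fn(theta)                          # gradient from backprop
    dJ_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += h
        theta_minus[i] -= h
        dJ_approx[i] = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2 * h)
    eps = np.linalg.norm(dJ_approx - dJ) / (np.linalg.norm(dJ_approx) + np.linalg.norm(dJ))
    return eps

# Toy check: J(theta) = sum(theta^4), dJ/dtheta = 4*theta^3
theta = np.random.randn(5)
eps = gradient_check(lambda t: np.sum(t**4), lambda t: 4 * t**3, theta)
print(f"relative error: {eps:.2e}")              # very small for a correct gradient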

CheckGrad with e.g. h = 10^-7: ε < 10^-7 (great). ε ≈ 10^-5 (may need a check): okay for objectives with kinks, but if there are no kinks then 1e-4 is too high. ε ≈ 10^-3 (concern). ε > 10^-2 (probably wrong). Use double precision and stay within the active range of floating point. Be careful with the step size h (not so small that it causes underflow). Gradient check after a burn-in of training: after running gradcheck at random initialization, run it again once you've trained for some number of iterations.

Before learning: sanity checks (tips/tricks). Look for the correct loss at chance performance. Increasing the regularization strength should increase the loss. The network should be able to overfit a tiny subset of the data: set regularization to zero and make sure you can achieve (near-)zero cost.

Initial loss. Example: CIFAR-10 has 10 classes, so with a softmax classifier the expected initial loss at chance is -ln(1/10) ≈ 2.3.

Double check that the loss is reasonable

Primitive check of the whole network: make sure that you can overfit a very small portion of the training data. Example: take the first 20 examples from CIFAR-10, turn off regularization, and use simple vanilla SGD. Very small loss and train accuracy 1.00: nice!
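A hedged sketch of this sanity check (my own; a tiny two-layer net on 20 random toy examples stands in for the first 20 CIFAR-10 images, and the learning rate and step count are arbitrary choices to tune if the loss does not reach near zero):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))                 # 20 toy examples, 100 features
y = rng.integers(0, 10, size=20)                   # 10 classes

W1 = rng.standard_normal((100, 50)) / np.sqrt(100) # Xavier-style init
W2 = rng.standard_normal((50, 10)) / np.sqrt(50)
lr = 1e-1

for step in range(2000):                           # vanilla (full-batch) SGD, no regularization
    h = np.maximum(0, X @ W1)                      # ReLU hidden layer
    scores = h @ W2
    scores -= scores.max(axis=1, keepdims=True)    # numeric stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(20), y]).mean() # softmax cross-entropy
    dscores = probs.copy()
    dscores[np.arange(20), y] -= 1
    dscores /= 20
    dW2 = h.T @ dscores
    dh = dscores @ W2.T
    dh[h <= 0] = 0
    dW1 = X.T @ dh
    W1 -= lr * dW1
    W2 -= lr * dW2

acc = ((np.maximum(0, X @ W1) @ W2).argmax(axis=1) == y).mean()
print(f"final loss {loss:.4f}, train accuracy {acc:.2f}")  # expect near-zero loss, accuracy 1.00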

Resource Please see the following notes: http://cs231n.github.io/neural-networks-2/ http://cs231n.github.io/neural-networks-3/