TTIC 31230, Fundamentals of Deep Learning
David McAllester, April 2017

Vanishing and Exploding Gradients
ReLUs
Xavier Initialization
Batch Normalization
Highway Architectures: ResNets, LSTMs and GRUs

Causes of Vanishing and Exploding Gradients
Activation function saturation
Repeated multiplication by network weights

Activation Function Saturation
Consider the sigmoid activation function 1/(1 + e^{-x}). The gradient of this function is quite small for |x| > 4. In deep networks, backpropagation can pass through many sigmoids and the gradient can vanish.
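To make the saturation concrete, here is a quick NumPy check (an illustration, not from the slides) of the sigmoid gradient σ(x)(1 − σ(x)):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    # The gradient peaks at 0.25 (at x = 0) and is already below 0.02 for |x| > 4.
    for x in [0.0, 2.0, 4.0, 6.0]:
        print(x, sigmoid_grad(x))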

Activation Function Saturation
ReLU(x) = max(x, 0)
The ReLU does not saturate at positive inputs (good) but is completely saturated at negative inputs (bad). Variants of the ReLU still have only small gradients at negative inputs.

Repeated Multiplication by Network Weights
Consider a deep CNN with L_{i+1} = ReLU(Conv(L_i, F_i)). For large i, L_i has been multiplied by many weights. If the weights are small then the neuron values, and hence the weight gradients, decrease exponentially with depth. If the weights are large, and the activation functions do not saturate, then the neuron values, and hence the weight gradients, increase exponentially with depth.
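A minimal NumPy sketch of this effect (an illustration, not from the slides): repeatedly multiplying by random weight matrices makes the activation scale, and hence the gradient scale, grow or shrink exponentially with depth depending on the weight scale.

    import numpy as np

    def forward_norms(depth=50, width=100, scale=1.0, seed=0):
        # Track the norm of ReLU activations through `depth` random layers.
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(width)
        norms = []
        for _ in range(depth):
            W = scale * rng.standard_normal((width, width)) / np.sqrt(width)
            x = np.maximum(W @ x, 0.0)      # ReLU layer
            norms.append(np.linalg.norm(x))
        return norms

    print(forward_norms(scale=0.5)[-1])   # shrinks toward zero with depth
    print(forward_norms(scale=2.0)[-1])   # blows up with depth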

Methods for Maintaining Gradients
Initialization
Batch Normalization
Highway Architectures (Skip Connections)

Methods for Maintaining Gradients [Kaiming He]
We will say "highway architecture" rather than "skip connections".

Initialization

Xavier Initialization
Initialize a weight matrix (or tensor) to preserve zero-mean, unit-variance distributions. If we assume each x_i has zero mean and unit variance, then we want y_j = Σ_{i=0}^{N-1} x_i w_{i,j} to have zero mean and unit variance. Xavier initialization randomly sets each w_{i,j} to be uniform in the interval [-sqrt(3/N), sqrt(3/N)]. Assuming independence, this gives zero mean and unit variance for y_j.

EDF Implementation

    import numpy as np

    def xavier(shape):
        # Fan-in N is the product of all but the last index.
        sq = np.sqrt(3.0 / np.prod(shape[:-1]))
        return np.random.uniform(-sq, sq, shape)

This assumes that we sum over all but the last index. For example, an image convolution filter has shape (H, W, C_1, C_2) and we sum over the first three indices.

Kaiming Initialization
A ReLU nonlinearity reduces the variance. Before a ReLU nonlinearity it seems better to use the larger interval [-sqrt(6/N), sqrt(6/N)].
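A sketch of the corresponding initializer, following the same fan-in convention as the xavier function above (the function name kaiming is mine, not EDF code):

    import numpy as np

    def kaiming(shape):
        # Same convention as xavier: fan-in N is the product of all but the last index.
        sq = np.sqrt(6.0 / np.prod(shape[:-1]))
        return np.random.uniform(-sq, sq, shape)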

Batch Normalization

Normalization
We first compute the mean and variance over the batch for each channel and then renormalize the channel values:

Norm(x) = (x - μ̂) / σ̂

μ̂ = (1/B) Σ_b x_b

σ̂ = sqrt( (1/(B-1)) Σ_b (x_b - μ̂)^2 )

Backpropagation Through Normalization
Note that we can backpropagate through the normalization operation. Here the different batch elements are not independent. At test time a single fixed estimate of μ and σ is used.

Adding an Affine Transformation

BatchNorm(x) = γ Norm(x) + β

Keep in mind that this is done separately for each channel. This allows batch normalization to learn an arbitrary affine transformation (offset and scaling). It can even undo the normalization.
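A minimal NumPy sketch of the forward computation for a batch of shape (B, C), with one learned γ and β per channel (the function name and the small eps stabilizer are my additions, not part of the slides):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x: (B, C) batch; gamma, beta: (C,) learned per-channel parameters.
        mu = x.mean(axis=0)               # per-channel batch mean
        sigma = x.std(axis=0, ddof=1)     # per-channel batch std (1/(B-1) variance)
        norm = (x - mu) / (sigma + eps)   # Norm(x): zero mean, unit variance
        return gamma * norm + beta        # learned affine transformation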

Batch Normalization
Batch normalization is empirically very successful in CNNs, but not so successful in RNNs (RNNs are discussed below). It is typically used just prior to a nonlinear activation function. It was originally justified in terms of covariate shift. However, it can also be justified along the same lines as Xavier initialization: we need to keep the input to an activation function in its active region.

Normalization Interacts with SGD
Consider backpropagation through a weight layer:

    y.value[j] += w.value[j,i] * x.value[i]
    w.grad[j,i] += y.grad[j] * x.value[i]

Replacing x by x/σ̂ seems related to RMSProp for the update of w.

Highway Architectures

Deep Residual Networks (ResNets), Kaiming He et al., 2015
Here we have a highway with diversions: the highway path connects the input to the output and preserves gradients. ResNets were introduced in late 2015 and revolutionized computer vision. The ResNet that won the ImageNet competition in 2015 was 152 layers deep.

Pure and Gated Highway Architectures

Pure (ResNet): L_{i+1} = L_i + D_i
Forget Gated (LSTM): L_{i+1} = F_i ⊙ L_i + D_i
Exclusively Gated (GRU): L_{i+1} = F_i ⊙ L_i + (1 - F_i) ⊙ D_i

In the pure case the diversion D_i fits the residual of the identity function.
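The three update rules as code, assuming L, D, and the gate F are NumPy arrays of the same shape (all products componentwise; the function names are mine):

    def pure_step(L, D):                 # ResNet
        return L + D

    def forget_gated_step(L, D, F):      # LSTM-style
        return F * L + D

    def exclusive_gated_step(L, D, F):   # GRU-style
        return F * L + (1.0 - F) * D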

Methods for Maintaining Gradients [Kaiming He] Identity skip connections give a pure highway architecture.

[Kaiming He]

Deep Residual Networks As with most of deep learning, not much is known about what resnets are actually doing. For example, different diversions might update disjoint channels making the networks shallower than they look. They are capable of representing very general circuit topologies.

A Bottleneck Diversion This reduction in the number of channels used in the diversion suggests a modest update of the highway information. [Kaiming He]

Long Short-Term Memory (LSTMs)

Recurrent Neural Networks [Christopher Olah]
In EDF: h_{t+1} = tanh(W_h h_t + W_x x_t + β)

    for t in range(T):
        H.append(None)
        H[t+1] = Sigmoid(ADD(VDot(Wh, H[t]), VDot(Wx, X[t]), Beta))

Gradient Clipping
An RNN uses the same weights at every time step. If we avoid saturation of the activation functions then we get exponential growth or shrinkage along the eigenvectors of the weight matrix. To handle exploding gradients we can define a gradient clipping component:

h_{t+1} = clip(tanh(W_h h_t + W_x x_t + β))

The forward method of the clip component is the identity function. The backward method shrinks the gradients to stay within some limit.
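A sketch of such a component in the spirit of EDF (only the idea of an identity forward pass and a clipped backward pass comes from the slide; the exact node interface here, with value and grad fields, is an assumption):

    import numpy as np

    class Clip:
        # Identity in the forward pass; clips gradients in the backward pass.
        def __init__(self, x, limit=1.0):
            self.x = x            # parent node with .value and .grad arrays
            self.limit = limit

        def forward(self):
            self.value = self.x.value                    # identity
            self.grad = np.zeros_like(self.value)

        def backward(self):
            # Shrink the incoming gradient to stay within [-limit, limit].
            self.x.grad += np.clip(self.grad, -self.limit, self.limit)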

Long Short-Term Memory (LSTM) [Christopher Olah]
The highway path goes across the top of the figure and is called the carry:

ResNet: L_{i+1} = L_i + D_i
LSTM: C_{t+1} = F_t ⊙ C_t + D_t

Long Short-Term Memory (LSTM) [Christopher Olah]
The highway path C (across the top of the figure) is gated:

C_{t+1} = F_t ⊙ C_t + D_t

The product F ⊙ C is componentwise (as in NumPy). We say F gates C. F is called the forget gate.

Long Short-Term Memory (LSTM) [Christopher Olah]

C_{t+1} = F_t ⊙ C_t + D_t
F_t = σ(W_F [X_t, H_t])
D_t = σ(W_I [X_t, H_t]) ⊙ tanh(W_D [X_t, H_t])
H_{t+1} = σ(W_O [X_t, H_t]) ⊙ tanh(W_H C_{t+1})
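A minimal NumPy sketch of one step of these equations, writing [X_t, H_t] as a concatenation (the weight shapes and function names are assumptions for illustration, not EDF code):

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(X, H, C, W_F, W_I, W_D, W_O, W_H):
        # One LSTM step; all gate products are componentwise.
        XH = np.concatenate([X, H])
        F = sigma(W_F @ XH)                       # forget gate F_t
        D = sigma(W_I @ XH) * np.tanh(W_D @ XH)   # gated diversion D_t
        C_next = F * C + D                        # carry (highway) update
        H_next = sigma(W_O @ XH) * np.tanh(W_H @ C_next)
        return H_next, C_next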

Gated Recurrent Unit (GRU), Cho et al. 2014 [Christopher Olah]
The highway path is H:

H_{t+1} = (1 - I_t) ⊙ H_t + I_t ⊙ D_t
I_t = σ(W_I [X_t, H_t])
D_t = tanh(W_D [X_t, σ(W_r [X_t, H_t]) ⊙ H_t])
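The corresponding GRU step as a sketch (again with assumed weight names, reusing sigma from the LSTM sketch above):

    def gru_step(X, H, W_I, W_D, W_r):
        # One GRU step; products are componentwise.
        XH = np.concatenate([X, H])
        I = sigma(W_I @ XH)                             # update gate I_t
        R = sigma(W_r @ XH)                             # reset gate
        D = np.tanh(W_D @ np.concatenate([X, R * H]))   # diversion D_t
        return (1.0 - I) * H + I * D                    # highway update of H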

GRUs vs. LSTMs
In TTIC 31230 class projects, GRUs have consistently outperformed LSTMs. They are also conceptually simpler.

GRU Language Models

GRU Character Language Models
h_0 and c_0 are parameters. For t > 0, h_t is defined by the GRU equations.

P(x_{t+1} | x_0, x_1, ..., x_t) = Softmax(W_o h_t)

The parameter W_o has shape (V, H) where V is the vocabulary size (the number of characters in a character model) and H is the dimension of the hidden state vectors.

ℓ(x_0, ..., x_{T-1}) = Σ_{t=0}^{T-1} ln 1/P(x_t | x_0, ..., x_{t-1})
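A small sketch of this loss computation (the function name and argument layout are mine):

    import numpy as np

    def char_log_loss(probs, xs):
        # probs[t] is the predicted distribution P(. | x_0, ..., x_{t-1});
        # xs[t] is the observed character index x_t.
        return sum(-np.log(p[x]) for p, x in zip(probs, xs))   # Σ_t ln 1/P(x_t | ...)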

Generating from a Character Language Model
Repeatedly select x_{t+1} from P(x_{t+1} | x_0, x_1, ..., x_t).
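A sketch of the sampling loop, assuming the gru_step function above, an output matrix W_o of shape (V, H), and a character embedding matrix E (all names are illustrative, not EDF code):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def generate(h0, E, W_o, gru_weights, length, rng=np.random.default_rng(0)):
        # Repeatedly sample the next character from Softmax(W_o h_t).
        h, xs = h0, []
        for _ in range(length):
            p = softmax(W_o @ h)                 # P(x_{t+1} | x_0, ..., x_t)
            x = rng.choice(len(p), p=p)          # sample the next character index
            xs.append(x)
            h = gru_step(E[x], h, *gru_weights)  # advance the hidden state
        return xs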

END