CSC321 Lecture 16: ResNets and Attention

Roger Grosse

Overview

Two topics for today:

Topic 1: Deep Residual Networks (ResNets). This is the state-of-the-art approach to object recognition. It applies the insight of avoiding exploding/vanishing gradients to train really deep conv nets.

Topic 2: Attention. Machine translation: it's hard to summarize a long sentence in a single vector, so let's let the decoder peek at the input. Vision: have a network glance at one part of an image at a time, so that we can understand what information it's using. We can also use attention to build differentiable computers (e.g. Neural Turing Machines).

Deep Residual Networks

I promised I'd explain the best ImageNet object recognizer from 2015, but noted that it required another idea.

Year  Model                             Top-5 error
2010  Hand-designed descriptors + SVM   28.2%
2011  Compressed Fisher Vectors + SVM   25.8%
2012  AlexNet                           16.4%
2013  a variant of AlexNet              11.7%
2014  GoogLeNet                         6.6%
2015  deep residual nets                4.5%

That idea is exploding and vanishing gradients, and dealing with them by making it easy to pass information directly through a network.

Deep Residual Networks

Recall: the Jacobian $\partial h^{(T)} / \partial h^{(1)}$ is the product of the individual Jacobians:

$$\frac{\partial h^{(T)}}{\partial h^{(1)}} = \frac{\partial h^{(T)}}{\partial h^{(T-1)}} \cdots \frac{\partial h^{(2)}}{\partial h^{(1)}}$$

But this applies to multilayer perceptrons and conv nets as well! (Let t index the layers rather than time.)

Then how come we didn't have to worry about exploding/vanishing gradients until we talked about RNNs? MLPs and conv nets were at most 10s of layers deep. RNNs would be run over hundreds of time steps.

This means if we want to train a really deep conv net, we need to worry about exploding/vanishing gradients!
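To make the scaling issue concrete, here is a minimal NumPy sketch (not from the lecture) that multiplies together 100 hypothetical per-layer Jacobians whose singular values are slightly below or slightly above 1; the product's norm vanishes or explodes accordingly.

```python
import numpy as np

np.random.seed(0)
D = 10  # hidden dimension (arbitrary, just for illustration)

for scale in [0.9, 1.1]:
    prod = np.eye(D)
    for t in range(100):
        # A hypothetical per-layer Jacobian: an orthogonal matrix scaled by `scale`,
        # so every singular value equals `scale`.
        J = scale * np.linalg.qr(np.random.randn(D, D))[0]
        prod = J @ prod
    print(f"scale={scale}: ||dh(100)/dh(1)||_F = {np.linalg.norm(prod):.2e}")
    # scale=0.9 gives roughly 1e-4 (vanishing); scale=1.1 gives roughly 1e4 (exploding)
```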

Deep Residual Networks

Remember Homework 3? You derived backprop for this architecture:

$$z = W^{(1)} x + b^{(1)}, \qquad h = \phi(z), \qquad y = x + W^{(2)} h$$

Writing $F(x) = W^{(2)} \phi(W^{(1)} x + b^{(1)})$, the unit computes $y = x + F(x)$.

This is called a residual unit, and it's actually pretty useful. (This usage of "unit" is different from the units of a network.) Each layer adds something (i.e. a residual) to the previous value, rather than producing an entirely new value.

Note: the network for F can have multiple layers, be convolutional, etc.
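A minimal NumPy sketch of this unit, assuming $\phi$ is a ReLU and picking arbitrary sizes; the residual function F here is the one-hidden-layer network from the equations above.

```python
import numpy as np

def residual_unit(x, W1, b1, W2):
    """y = x + F(x), where F(x) = W2 @ phi(W1 @ x + b1) and phi is a ReLU."""
    h = np.maximum(0, W1 @ x + b1)   # h = phi(z), z = W^(1) x + b^(1)
    return x + W2 @ h                # add the residual F(x) to the input

# Example with arbitrary sizes; W2 maps back to x's dimension so the sum is valid.
D, H = 8, 16
x = np.random.randn(D)
W1, b1, W2 = 0.1 * np.random.randn(H, D), np.zeros(H), 0.1 * np.random.randn(D, H)
y = residual_unit(x, W1, b1, W2)
```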

Deep Residual Networks

We can string together a bunch of residual units.

What happens if we set the parameters such that $F(x^{(l)}) = 0$ in every layer? Then it passes $x^{(1)}$ straight through unmodified! This means it's easy for the network to represent the identity function.

Backprop:

$$\bar{x}^{(l)} = \bar{x}^{(l+1)} \frac{\partial F}{\partial x} + \bar{x}^{(l+1)} = \bar{x}^{(l+1)} \left( I + \frac{\partial F}{\partial x} \right)$$

This means the derivatives don't vanish.
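A quick sketch of the pass-through behaviour: with all weights set to zero, F(x) = 0 in every unit, so even a deep stack computes exactly the identity.

```python
import numpy as np

def residual_unit(x, W1, b1, W2):
    return x + W2 @ np.maximum(0, W1 @ x + b1)   # y = x + F(x)

def resnet(x, params):
    for W1, b1, W2 in params:
        x = residual_unit(x, W1, b1, W2)
    return x

D, L = 8, 50
zero_params = [(np.zeros((D, D)), np.zeros(D), np.zeros((D, D)))] * L
x = np.random.randn(D)
print(np.allclose(resnet(x, zero_params), x))   # True: 50 layers, input unchanged
```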

Deep Residual Networks

Deep Residual Networks (ResNets) consist of many layers of residual units. For vision tasks, the F functions are usually 2- or 3-layer conv nets.

Performance on CIFAR-10, a small object recognition dataset: for a regular conv net, performance declines with depth, but for a ResNet, it keeps improving.

Deep Residual Networks

A 152-layer ResNet achieved 4.49% top-5 error on ImageNet. An ensemble of them achieved 3.57%. Previous state of the art: 6.6% (GoogLeNet). Humans: 5.1%.

They were able to train ResNets with more than 1000 layers, but classification performance leveled off by around 150 layers.

What are all these layers doing? We don't have a clear answer, but the idea that they're computing increasingly abstract features is starting to sound fishy...

Attention-Based Machine Translation

Next topic: attention-based models.

Remember the encoder/decoder architecture for machine translation: the network reads a sentence and stores all the information in its hidden units.

Some sentences can be really long. Can we really store all the information in a vector of hidden units? Let's make things easier by letting the decoder refer to the input sentence.

Attention-Based Machine Translation

We'll look at the translation model from the classic paper: Bahdanau et al., "Neural machine translation by jointly learning to align and translate." ICLR, 2015.

Basic idea: each output word comes from one word, or a handful of words, from the input. Maybe we can learn to attend to only the relevant ones as we produce the output.

Attention-Based Machine Translation

The model has both an encoder and a decoder. The encoder computes an annotation of each word in the input.

It takes the form of a bidirectional RNN. This just means we have an RNN that runs forwards and an RNN that runs backwards, and we concatenate their hidden vectors. The idea: information earlier or later in the sentence can help disambiguate a word, so we need both directions.

The RNNs use an LSTM-like architecture called gated recurrent units.
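As a rough sketch (using a plain tanh RNN in place of the gated recurrent units from the paper, with illustrative parameter names), the annotation for word j concatenates the forward hidden state at position j with the backward hidden state at position j:

```python
import numpy as np

def rnn_states(inputs, W, U, b):
    """Run a simple tanh RNN and return the hidden state at every position."""
    h, states = np.zeros(U.shape[0]), []
    for x in inputs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

def annotations(inputs, fwd_params, bwd_params):
    """Bidirectional encoder: annotation j = [forward state at j ; backward state at j]."""
    f = rnn_states(inputs, *fwd_params)
    b = rnn_states(inputs[::-1], *bwd_params)[::-1]   # run right-to-left, then re-align
    return [np.concatenate([fj, bj]) for fj, bj in zip(f, b)]

# Example: 6 "words", each a 10-dim embedding; 16 hidden units per direction.
E, D = 10, 16
words = [np.random.randn(E) for _ in range(6)]
make = lambda: (0.1 * np.random.randn(D, E), 0.1 * np.random.randn(D, D), np.zeros(D))
ann = annotations(words, make(), make())   # six 32-dimensional annotation vectors
```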

Attention-Based Machine Translation

The decoder network is also an RNN. Like the encoder/decoder translation model, it makes predictions one word at a time, and its predictions are fed back in as inputs.

The difference is that it also receives a context vector $c^{(t)}$ at each time step, which is computed by attending to the inputs.
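One way to picture a single decoder step, as a simplified sketch (the paper's decoder is GRU-based; the update rule and parameter names below are illustrative stand-ins): the new state folds in the previous output word, the previous state, and the context vector, and the output is a softmax over the vocabulary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def decoder_step(y_prev, s_prev, context, W, U, C, V):
    """One decoder step: combine the previous output word, the previous state, and
    the context vector into a new state, then predict the next-word distribution."""
    s = np.tanh(W @ y_prev + U @ s_prev + C @ context)
    return s, softmax(V @ s)   # new state, distribution over the output vocabulary
```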

Attention-Based Machine Translation

The context vector is computed as a weighted average of the encoder's annotations:

$$c^{(i)} = \sum_j \alpha_{ij} h^{(j)}$$

The attention weights are computed as a softmax, where the inputs depend on the annotation and the decoder's state:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \qquad e_{ij} = a(s^{(i-1)}, h^{(j)})$$

Note that the attention function depends on the annotation vector, rather than the position in the sentence. This means it's a form of content-based addressing: "My language model tells me the next word should be an adjective. Find me an adjective in the input."
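A minimal NumPy sketch of these two equations. The scoring function a is assumed here to be the additive form $v^\top \tanh(W_s s + W_h h)$, one standard choice; the parameter names are illustrative.

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()

def attend(s_prev, annotations, v, Ws, Wh):
    """e_ij = a(s_prev, h_j), alpha = softmax(e), c = sum_j alpha_j h_j."""
    e = np.array([v @ np.tanh(Ws @ s_prev + Wh @ h) for h in annotations])
    alpha = softmax(e)                                   # attention weights, sum to 1
    c = sum(a * h for a, h in zip(alpha, annotations))   # context: weighted average
    return c, alpha
```

The weights alpha are exactly what gets plotted in the attention-map visualization on the next slide.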

Attention-Based Machine Translation

Here's a visualization of the attention maps at each time step. Nothing forces the model to go linearly through the input sentence, but somehow it learns to do it.

It's not perfectly linear; e.g., French adjectives can come after the nouns.

Attention-Based Machine Translation

The attention-based translation model does much better than the encoder/decoder model on long sentences.

Attention-Based Caption Generation

Attention can also be used to understand images. We humans can't process a whole visual scene at once. The fovea of the eye gives us high-acuity vision in only a tiny region of our field of view. Instead, we must integrate information from a series of glimpses.

The next few slides are based on this paper from the UofT machine learning group: Xu et al., "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention." ICML, 2015.

Attention-Based Caption Generation

The task is caption generation, just like Programming Assignment 2.

Encoder: a classification conv net (VGGNet, similar to AlexNet). This computes a bunch of feature maps over the image.

Decoder: an attention-based RNN, analogous to the decoder in the translation model. In each time step, the decoder computes an attention map over the entire image, effectively deciding which regions to focus on. It receives a context vector, which is the weighted average of the conv net features.
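The mechanics are the same as in the translation model, just with image regions in place of words. A sketch under assumed shapes (a 14x14 grid of 512-dim conv features and a 256-dim decoder state; all dimensions and parameters here are illustrative, not the paper's exact configuration):

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()

np.random.seed(0)
features = np.random.randn(14 * 14, 512)   # one conv feature vector per image region
s_prev = np.random.randn(256)              # decoder hidden state
v = np.random.randn(128)
Ws, Wf = np.random.randn(128, 256), np.random.randn(128, 512)

scores = np.array([v @ np.tanh(Ws @ s_prev + Wf @ f) for f in features])
alpha = softmax(scores)                    # attention map over the 196 regions
context = alpha @ features                 # 512-dim weighted average of conv features
attention_map = alpha.reshape(14, 14)      # can be upsampled and overlaid on the image
```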

Attention-Based Caption Generation

This lets us understand where the network is looking as it generates a sentence.

Attention-Based Caption Generation

This can also help us understand the network's mistakes.

Neural Turing Machines

We said earlier that multilayer perceptrons are like differentiable circuits. Using an attention model, we can build differentiable computers.

We've seen hints that sparsity of memory accesses can be useful: computers have a huge memory, but they only access a handful of locations at a time. Can we make neural nets more computer-like?

Neural Turing Machines

Recall Turing machines: you have an infinite tape, and a head which transitions between various states and reads and writes to the tape. A rule might say: if in state A and the current symbol is 0, write a 0, transition to state B, and move right.

These simple machines are universal: they're capable of doing any computation that ordinary computers can.

Neural Turing Machines

Neural Turing Machines are an analogue of Turing machines where all of the computations are differentiable. This means we can train the parameters by doing backprop through the entire computation.

Each memory location stores a vector. The read and write heads interact with a weighted average of memory locations, just as in the attention models.

The controller is an RNN (in particular, an LSTM) which can issue commands to the read/write heads.
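A minimal sketch of the soft memory access, assuming content-based addressing with cosine similarity (one of the addressing mechanisms in the NTM paper); variable names are illustrative:

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()

def read(memory, key, beta):
    """Soft read: a weighted average over all memory rows, so it's differentiable."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sims)            # attention over memory slots; beta sharpens it
    return w @ memory, w

def write(memory, w, erase, add):
    """Soft write: every row is partially erased and updated, weighted by w."""
    return memory * (1 - np.outer(w, erase)) + np.outer(w, add)

# Example: a memory of 8 slots, each a 4-dim vector, addressed by an emitted key.
M = np.random.randn(8, 4)
r, w = read(M, key=np.random.randn(4), beta=5.0)
M = write(M, w, erase=0.5 * np.ones(4), add=np.random.randn(4))
```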

Neural Turing Machines

Repeat copy task: the network receives a sequence of binary vectors, and has to output several repetitions of the sequence.

(Figure: pattern of memory accesses for the read and write heads.)

Neural Turing Machines

Priority sort task: the network receives a sequence of (key, value) pairs, and has to output the values in sorted order by key.

(Figure: sequence of memory accesses.)