CSC321 Lecture 15: Recurrent Neural Networks

CSC321 Lecture 15: Recurrent Neural Networks
Roger Grosse

Overview

Sometimes we're interested in predicting sequences: speech-to-text and text-to-speech, caption generation, machine translation. If the input is also a sequence, this setting is known as sequence-to-sequence prediction.

We already saw one way of doing this: neural language models. But autoregressive models are memoryless, so they can't learn long-distance dependencies. Recurrent neural networks (RNNs) are a kind of architecture which can remember things over time.

Overview

Recall that we made a Markov assumption: p(w_i | w_1, ..., w_{i-1}) = p(w_i | w_{i-3}, w_{i-2}, w_{i-1}). This means the model is memoryless, i.e. it has no memory of anything before the last few words. But sometimes long-distance context can be important.

Overview

Autoregressive models such as the neural language model are memoryless, so they can only use information from their immediate context (in this figure, context length = 1). If we add connections between the hidden units, it becomes a recurrent neural network (RNN). Having a memory lets an RNN use longer-term dependencies.

Recurrent neural nets

We can think of an RNN as a dynamical system with one set of hidden units which feed into themselves. The network's graph would then have self-loops. We can unroll the RNN's graph by explicitly representing the units at all time steps. The weights and biases are shared between all time steps, except that there is typically a separate set of biases for the first time step.
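
To make the unrolling and weight sharing concrete, here is a minimal sketch, assuming scalar units and a tanh nonlinearity (the names u, w, v are illustrative, not the lecture's exact setup); the same parameters are reused at every time step.

```python
import numpy as np

def rnn_forward(x_seq, u, w, v, b_h=0.0, b_y=0.0, phi=np.tanh):
    """Forward pass of an unrolled RNN with one scalar input, hidden,
    and output unit. The same parameters (u, w, v and the biases) are
    reused at every time step: this is the weight sharing shown in the
    unrolled graph."""
    h = 0.0                       # hidden state before the first step
    hiddens, outputs = [], []
    for x in x_seq:
        z = u * x + w * h + b_h   # input contribution plus recurrent contribution
        h = phi(z)                # shared hidden nonlinearity
        y = v * h + b_y           # output at this time step
        hiddens.append(h)
        outputs.append(y)
    return hiddens, outputs

# The same weights are applied at T = 1, 2, 3, ...
hs, ys = rnn_forward([1.0, 0.5, -2.0], u=1.0, w=0.5, v=2.0)
```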

RNN examples

Now let's look at some simple examples of RNNs. This one sums its inputs: a linear input unit feeds a linear hidden unit with a self-loop, which feeds a linear output unit, and every weight is 1. [Figure: unrolled for T = 1 to 4; on the inputs 2, -0.5, 1, 1 the hidden and output units take the values 2, 1.5, 2.5, 3.5.]
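
The same example in code, reading the weights off the figure (linear units, every weight equal to 1):

```python
def summing_rnn(inputs):
    """The figure's sum-of-inputs RNN: a linear hidden unit with a
    self-loop and a linear output unit, every weight equal to 1."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = 1.0 * x + 1.0 * h     # linear hidden unit, w=1 on input and self-loop
        outputs.append(1.0 * h)   # linear output unit, w=1
    return outputs

print(summing_rnn([2, -0.5, 1, 1]))   # [2.0, 1.5, 2.5, 3.5], as in the figure
```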

RNN examples

This one determines whether the running total of the first or second input is larger: the two input units feed a linear hidden unit through weights +1 and -1, the hidden unit has a self-loop with weight 1, and it feeds a logistic output unit through a weight of 5. [Figure: over T = 1 to 3, the input pairs (2, -2), (0, 3.5), (1, 2.2) give hidden values 4, 0.5, -0.7 and outputs 1.00, 0.92, 0.03.]
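
A sketch of this example too, using the weights read off the figure (+1 and -1 from the two inputs, 1 on the self-loop, and 5 into the logistic output unit):

```python
import numpy as np

def larger_total_rnn(xs1, xs2):
    """The figure's second example: the linear hidden unit accumulates
    (input 1 - input 2), and a logistic output unit with weight 5 reports
    whether the running total of input 1 is larger."""
    h = 0.0
    outputs = []
    for x1, x2 in zip(xs1, xs2):
        h = 1.0 * h + 1.0 * x1 - 1.0 * x2               # self-loop w=1, inputs w=+1 and w=-1
        outputs.append(1.0 / (1.0 + np.exp(-5.0 * h)))  # logistic output unit, w=5
    return outputs

print(larger_total_rnn([2, 0, 1], [-2, 3.5, 2.2]))  # approximately [1.00, 0.92, 0.03]
```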

Example: Parity

Assume we have a sequence of binary inputs. We'll consider how to determine the parity, i.e. whether the number of 1s is even or odd. We can compute parity incrementally by keeping track of the parity of the input so far:

Parity bits: 0 1 1 0 1 1
Input:       0 1 0 1 1 0 1 0 1 1

Each parity bit is the XOR of the input and the previous parity bit. Parity is a classic example of a problem that's hard to solve with a shallow feed-forward net, but easy to solve with an RNN.
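
The incremental computation is just a running XOR:

```python
def parity_bits(bits):
    """Each parity bit is the XOR of the current input bit and the
    previous parity bit (starting from 0)."""
    parity, out = 0, []
    for b in bits:
        parity ^= b
        out.append(parity)
    return out

print(parity_bits([0, 1, 0, 1, 1, 0, 1, 0, 1, 1]))  # begins 0, 1, 1, 0, 1, 1, ...
```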

Example: Parity

Assume we have a sequence of binary inputs. We'll consider how to determine the parity, i.e. whether the number of 1s is even or odd.

Let's find weights and biases for the RNN on the right so that it computes the parity. All hidden and output units are binary threshold units.

Strategy: the output unit tracks the current parity, which is the XOR of the current input and previous output. The hidden units help us compute the XOR.

Example: Parity

Unrolling the parity RNN: [figure showing the parity RNN unrolled over several time steps]

Example: Parity

The output unit should compute the XOR of the current input and previous output:

y(t-1)   x(t)   y(t)
  0       0      0
  0       1      1
  1       0      1
  1       1      0

Example: Parity

Let's use hidden units to help us compute XOR. Have one unit compute AND, and the other one compute OR. Then we can pick weights and biases just like we did for multilayer perceptrons.

y(t-1)   x(t)   h1(t)   h2(t)   y(t)
  0       0       0       0      0
  0       1       0       1      1
  1       0       0       1      1
  1       1       1       1      0

Example: Parity

We still need to determine the hidden biases for the first time step. The network should behave as if the previous output was 0. This is represented with the following table:

x(1)   h1(1)   h2(1)
 0       0       0
 1       0       1
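
Putting the construction together, here is a sketch of the parity RNN with one possible choice of weights and biases (these particular numbers are one choice that works, not necessarily the lecture's). Initializing the previous output to 0 plays the role of the separate first-time-step biases.

```python
def threshold(z):
    """Binary threshold unit: output 1 if the total input is positive."""
    return 1 if z > 0 else 0

def parity_rnn(bits):
    """Parity RNN with hand-picked weights (one possible choice, shown
    for illustration). h1 computes AND of the previous output and the
    current input, h2 computes OR, and the output unit combines them
    into XOR. Starting with y_prev = 0 makes the first step behave as
    if the previous output were 0."""
    y_prev, outputs = 0, []
    for x in bits:
        h1 = threshold(1.0 * y_prev + 1.0 * x - 1.5)   # AND
        h2 = threshold(1.0 * y_prev + 1.0 * x - 0.5)   # OR
        y = threshold(-2.0 * h1 + 1.0 * h2 - 0.5)      # XOR = OR and not AND
        outputs.append(y)
        y_prev = y
    return outputs

print(parity_rnn([0, 1, 0, 1, 1, 0]))   # [0, 1, 1, 0, 1, 1]
```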

Backprop Through Time

As you can guess, we don't usually set RNN weights by hand. Instead, we learn them using backprop. In particular, we do backprop on the unrolled network. This is known as backprop through time.

Backprop Through Time

Here's the unrolled computation graph. Notice the weight sharing.

Backprop Through Time

Activations:
\bar{L} = 1
\bar{y}^{(t)} = \bar{L} \, \partial L / \partial y^{(t)}
\bar{r}^{(t)} = \bar{y}^{(t)} \phi'(r^{(t)})
\bar{h}^{(t)} = \bar{r}^{(t)} v + \bar{z}^{(t+1)} w
\bar{z}^{(t)} = \bar{h}^{(t)} \phi'(z^{(t)})

Parameters:
\bar{u} = \sum_t \bar{z}^{(t)} x^{(t)}
\bar{v} = \sum_t \bar{r}^{(t)} h^{(t)}
\bar{w} = \sum_t \bar{z}^{(t+1)} h^{(t)}
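
A minimal NumPy sketch of these equations, assuming the scalar forward pass they correspond to (z(t) = u x(t) + w h(t-1), h(t) = phi(z(t)), r(t) = v h(t), y(t) = phi(r(t)), with phi = tanh) and a squared-error loss, since the slide leaves the loss unspecified:

```python
import numpy as np

def bptt(x, targets, u, v, w):
    """Backprop through time for the scalar RNN
        z(t) = u*x(t) + w*h(t-1),  h(t) = phi(z(t)),
        r(t) = v*h(t),             y(t) = phi(r(t)),
    with phi = tanh and an assumed squared-error loss.
    Bars (u_bar, etc.) are the error signals from the slide."""
    phi, dphi = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2
    T = len(x)
    z, h, r, y = (np.zeros(T) for _ in range(4))
    h_prev = 0.0
    for s in range(T):                        # forward pass through time
        z[s] = u * x[s] + w * h_prev
        h[s] = phi(z[s])
        r[s] = v * h[s]
        y[s] = phi(r[s])
        h_prev = h[s]

    L_bar = 1.0
    y_bar = L_bar * (y - targets)             # dL/dy for the assumed squared-error loss
    r_bar = y_bar * dphi(r)                   # r_bar(t) = y_bar(t) phi'(r(t))
    h_bar, z_bar = np.zeros(T), np.zeros(T)
    for s in reversed(range(T)):              # backward pass through time
        z_next = z_bar[s + 1] if s + 1 < T else 0.0
        h_bar[s] = r_bar[s] * v + z_next * w  # h_bar(t) = r_bar(t) v + z_bar(t+1) w
        z_bar[s] = h_bar[s] * dphi(z[s])      # z_bar(t) = h_bar(t) phi'(z(t))

    h_before = np.concatenate(([0.0], h[:-1]))  # h(t-1), with the initial state 0
    u_bar = np.sum(z_bar * x)                 # u_bar = sum_t z_bar(t) x(t)
    v_bar = np.sum(r_bar * h)                 # v_bar = sum_t r_bar(t) h(t)
    w_bar = np.sum(z_bar * h_before)          # w_bar = sum_t z_bar(t+1) h(t)
    return u_bar, v_bar, w_bar

print(bptt(np.array([1.0, -0.5, 2.0]), np.array([0.5, 0.0, 1.0]), u=0.3, v=0.8, w=0.5))
```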

Backprop Through Time

Now you know how to compute the derivatives using backprop through time. The hard part is using the derivatives in optimization. They can explode or vanish. Addressing this issue will take all of the next lecture.

Language Modeling

One way to use RNNs as a language model: as with our language model, each word is represented as an indicator vector, the model predicts a distribution, and we can train it with cross-entropy loss. This model can learn long-distance dependencies.

Language Modeling

When we generate from the model (i.e. compute samples from its distribution over sentences), the outputs feed back in to the network as inputs. At training time, the inputs are the tokens from the training set (rather than the network's outputs). This is called teacher forcing.
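
A toy, runnable sketch of the contrast; the tiny one-hot RNN, the sizes, and the softmax output below are illustrative assumptions, not the lecture's exact model. At training time the true previous token is the input at every step; at sampling time the model's own sampled token is fed back in.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                                 # toy vocabulary and hidden sizes (assumed)
Wxh = rng.normal(0.0, 0.1, (H, V))          # input-to-hidden weights
Whh = rng.normal(0.0, 0.1, (H, H))          # hidden-to-hidden weights
Why = rng.normal(0.0, 0.1, (V, H))          # hidden-to-output weights

def step(x_id, h):
    """One RNN step on an indicator (one-hot) input, returning the new
    hidden state and a softmax distribution over the next token."""
    x = np.zeros(V); x[x_id] = 1.0
    h = np.tanh(Wxh @ x + Whh @ h)
    logits = Why @ h
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    return h, probs

def teacher_forced_loss(tokens):
    """Training: the *true* previous token is the input at every step
    (teacher forcing); accumulate cross-entropy against the true next token."""
    h, loss = np.zeros(H), 0.0
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        h, probs = step(prev, h)
        loss += -np.log(probs[nxt])
    return loss

def sample(start, length):
    """Generation: the sampled output token is fed back in as the next input."""
    h, x, out = np.zeros(H), start, []
    for _ in range(length):
        h, probs = step(x, h)
        x = int(rng.choice(V, p=probs))     # sample from the predicted distribution
        out.append(x)
    return out

print(teacher_forced_loss([0, 2, 1, 3]), sample(0, 6))
```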

Some remaining challenges:

Vocabularies can be very large once you include people, places, etc. It's computationally difficult to predict distributions over millions of words.

How do we deal with words we haven't seen before?

In some languages (e.g. German), it's hard to define what should be considered a word.

Language Modeling

Another approach is to model text one character at a time! This solves the problem of what to do about previously unseen words. Note that long-term memory is essential at the character level!

Note: modeling language well at the character level requires multiplicative interactions, which we're not going to talk about.

Language Modeling

From Geoff Hinton's Coursera course, an example of a paragraph generated by an RNN language model one character at a time:

He was elected President during the Revolutionary War and forgave Opus Paul at Rome. The regime of his crew of England, is now Arab women's icons in and the demons that use something between the characters sisters in lower coil trains were always operated on the line of the ephemerable street, respectively, the graphic or other facility for deformation of a given proportion of large segments at RTUS). The B every chord was a "strongly cold internal palette pour even the white blade.

J. Martens and I. Sutskever, 2011. Learning recurrent neural networks with Hessian-free optimization. http://machinelearning.wustl.edu/mlpapers/paper_files/icml2011martens_532.pdf

Neural Machine Translation

We'd like to translate, e.g., English to French sentences, and we have pairs of translated sentences to train on. What's wrong with the following setup?

Neural Machine Translation

We'd like to translate, e.g., English to French sentences, and we have pairs of translated sentences to train on. What's wrong with the following setup?

The sentences might not be the same length, and the words might not align perfectly. You might need to resolve ambiguities using information from later in the sentence.

Neural Machine Translation

Encoder-decoder architecture: the network first reads and memorizes the sentence. When it sees the end token, it starts outputting the translation. The encoder and decoder are two different networks with different weights.

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio. EMNLP 2014.
Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals and Quoc Le. NIPS 2014.
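
A structural sketch of the encoder-decoder idea, with assumed toy sizes, one-hot inputs, tanh units, and greedy decoding; everything here is illustrative rather than the architecture from the cited papers. The key point it shows is that the decoder is a separate network, with its own weights, that starts from the encoder's final hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
V_src, V_tgt, H = 6, 7, 8                   # toy vocabulary and hidden sizes (assumed)
enc_Wx, enc_Wh = rng.normal(0, 0.5, (H, V_src)), rng.normal(0, 0.5, (H, H))
dec_Wx, dec_Wh = rng.normal(0, 0.5, (H, V_tgt)), rng.normal(0, 0.5, (H, H))
dec_Wy = rng.normal(0, 0.5, (V_tgt, H))     # the decoder has its own weights

def encode(source_ids):
    """Encoder: read the whole source sentence and summarize it in the
    final hidden state."""
    h = np.zeros(H)
    for x_id in source_ids:
        x = np.zeros(V_src); x[x_id] = 1.0
        h = np.tanh(enc_Wx @ x + enc_Wh @ h)
    return h

def decode(h, start_id, end_id, max_len=20):
    """Decoder: starting from the encoder's final state, emit target
    tokens until the end token (greedy decoding, for simplicity)."""
    out, x_id = [], start_id
    for _ in range(max_len):
        x = np.zeros(V_tgt); x[x_id] = 1.0
        h = np.tanh(dec_Wx @ x + dec_Wh @ h)
        x_id = int(np.argmax(dec_Wy @ h))   # pick the most probable next token
        if x_id == end_id:
            break
        out.append(x_id)
    return out

print(decode(encode([2, 5, 1]), start_id=0, end_id=1))
```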

What can RNNs compute?

In 2014, Google researchers built an encoder-decoder RNN that learns to execute simple Python programs, one character at a time!

Example training inputs:

Input:
j=8584
for x in range(8):
    j+=920
b=(1500+j)
print((b+7567))
Target: 25011.

Input:
i=8827
c=(i-5347)
print((c+8704) if 2641<8500 else 5308)
Target: 12184.

A training input with characters scrambled:

Input:
vqppkn
sqdvfljmnc
y2vxdddsepnimcbvubkomhrpliibtwztbljipcc
Target: hkhpg

W. Zaremba and I. Sutskever, Learning to Execute. http://arxiv.org/abs/1410.4615

What can RNNs compute?

Some example results:

Input: print(6652)
Target: 6652. Baseline prediction: 6652. Naive prediction: 6652. Mix prediction: 6652. Combined prediction: 6652.

Input: print((5997-738))
Target: 5259. Baseline prediction: 5101. Naive prediction: 5101. Mix prediction: 5249. Combined prediction: 5229.

Input:
d=5446
for x in range(8):
    d+=(2678 if 4803<2829 else 9848)
print((d if 5935<4845 else 3043))
Target: 3043. Baseline prediction: 3043. Naive prediction: 3043. Mix prediction: 3043. Combined prediction: 3043.

Input: print(((1090-3305)+9466))
Target: 7251. Baseline prediction: 7111. Naive prediction: 7099. Mix prediction: 7595. Combined prediction: 7699.

Take a look through the results (http://arxiv.org/pdf/1410.4615v2.pdf#page=10). It's fun to try to guess from the mistakes what algorithms it's discovered.