Recurrent and Recursive Networks

Neural Networks with Applications to Vision and Language. Recurrent and Recursive Networks. Marco Kuhlmann

Introduction

Applications of sequence modelling Map unsegmented connected handwriting to strings. Map sequences of acoustic signals to sequences of phonemes. Translate sentences from one language into another. Generate baby names, poems, source code, patent applications.

The bag-of-words model The gorgeously elaborate continuation of The Lord of the Rings trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson's expanded vision of J.R.R. Tolkien's Middle-earth. (pos) ... is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920s, as if to stop would hasten the economic and global political turmoil that was to come. (neg)

The bag-of-words model a adequately cannot co-writer column continuation describe director elaborate expanded gorgeously huge is J.R.R. Jackson Lord Middle-earth of of of of Peter Rings so that The The the Tolkien trilogy vision words (pos) 1920s a an and as at come core economic emptiness exploration gaiety global hasten if is its little movie of of political relentless sour stop that that the the the the to to turmoil underlay was would (neg)

Part-of-speech tagging jag bad om en kort bit ('I asked for a short piece') PN VB PP DT JJ NN [the slide also shows competing candidate tags for the words: NN NN SN PN AB VB PL RG NN AB NN]

Hidden Markov Models (HMMs) Tagged sentence: jag/PN bad/VB om/PP en/DT kort/JJ bit/NN. Emission probabilities: P(jag | PN), P(bad | VB), P(om | PP), P(en | DT), P(kort | JJ), P(bit | NN). Transition probabilities: P(PN | BOS), P(VB | PN), P(PP | VB), P(DT | PP), P(JJ | DT), P(NN | JJ), P(EOS | NN).

[State diagram of an HMM with states BOS, VB, PN, and EOS: transition probabilities such as P(VB | BOS), P(PN | BOS), P(VB | VB), P(PN | PN), P(VB | PN), P(PN | VB), P(EOS | VB), P(EOS | PN), and emission distributions P(w | VB) and P(w | PN)]

A weakness of Hidden Markov Models The only information that an HMM has access to at any given point in time is its state. Suppose that the HMM has n states. Then the index of the current state can be written using log₂ n bits. Thus the current state contains at most log₂ n bits of information about the sequence generated so far.

Strengths of recurrent neural networks Distributed hidden state: in recurrent neural networks, several units can be active at once, which allows them to store a lot of information efficiently (contrast this with the single current state of an HMM). Non-linear dynamics: different units can interact with each other in non-linear ways, which makes recurrent neural networks Turing-complete (contrast this with linear dynamical systems). Attribution: Geoffrey Hinton

Recurrent neural networks (RNNs) Recurrent neural networks can be visualised as networks with feedback connections, which form directed cycles between units. These feedback connections are unfolded over time. A crucial property of recurrent neural networks is that they share the same set of parameters across different timesteps.

RNN, cyclic representation [diagram: the input x feeds the hidden state h, which produces the output o; the recurrent connection from h to itself (labelled f) passes through a delay of one timestep]

RNN, unrolled [diagram: the cycle is unrolled into a chain; at each timestep t, the same function f maps the input x(t) and the previous hidden state to the new hidden state h(t), which produces the output o(t)]

General observations The parameters of the model are shared across all timesteps. The hidden state can be influenced by the entire input seen so far; contrast this with the Markov assumption of HMMs. The hidden state can be a lossy summary of the input sequence; hopefully, this state will encode useful information for the task at hand. The model has the same input size regardless of sequence length, because it is specified in terms of transitions from one state to the next.
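To make the parameter sharing concrete, here is a minimal sketch (not from the lecture) of such a forward pass in NumPy, using the weight names U (input-to-hidden), W (hidden-to-hidden), and V (hidden-to-output) that also appear in the computation graph below; every timestep reuses the same parameters:

    import numpy as np

    def rnn_forward(xs, h0, U, W, V, b, c):
        """Run a simple tanh RNN over a list of input vectors xs.

        The same parameters U, W, V and biases b, c are reused at every
        timestep; only the hidden state h changes."""
        h = h0
        outputs = []
        for x in xs:                        # one iteration per timestep
            h = np.tanh(b + W @ h + U @ x)  # new hidden state
            o = c + V @ h                   # unnormalised output scores
            outputs.append(o)
        return outputs, h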

Different types of RNN architectures: transducer, encoder, generator

Training recurrent neural networks

Computation graph for a standard architecture [diagram: at each timestep t, the input x(t) is mapped by U into the hidden state h(t), which also receives the previous hidden state h(t−1) through the recurrent weights W; V maps h(t) to the output o(t), which is compared with the target y(t) to give the loss L(t)]

Assumptions The hidden states are computed by some nonlinear activation function, such as tanh. The outputs at each time step are unnormalised log-probabilities; the softmax that turns them into distributions over a finite set of labels is assumed (as in the book) to happen implicitly when computing the loss.
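In symbols, the forward pass of this standard architecture takes the usual form (a sketch, with input-to-hidden weights U, hidden-to-hidden weights W, hidden-to-output weights V, and biases b and c):

    \begin{aligned}
    a^{(t)} &= b + W h^{(t-1)} + U x^{(t)} \\
    h^{(t)} &= \tanh\bigl(a^{(t)}\bigr) \\
    o^{(t)} &= c + V h^{(t)} \\
    \hat{y}^{(t)} &= \operatorname{softmax}\bigl(o^{(t)}\bigr)
    \end{aligned}

The total loss is the sum of the per-timestep losses L(t), e.g. the cross-entropy between ŷ(t) and the target y(t).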

Backpropagation through time Unrolled recurrent neural networks are just feedforward networks (with parameter sharing, i.e. linear constraints on the parameters), and can therefore be trained using backpropagation. This way of training recurrent neural networks is called backpropagation through time. Given that the unrolled computation graphs can be very deep, the vanishing gradient problem is exacerbated in RNNs.

[Backpropagation in a feedforward network, forward and backward pass: target t, error E, output unit k with y_k = f(z_k), hidden unit j with y_j = f(z_j), input y_i, and weights w_jk and w_ij]
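The figure illustrates the chain rule that backpropagation applies; written out in the figure's notation (a sketch):

    \frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y_k}\, f'(z_k)\, y_j
    \qquad
    \frac{\partial E}{\partial w_{ij}} = \Bigl( \sum_k \frac{\partial E}{\partial y_k}\, f'(z_k)\, w_{jk} \Bigr)\, f'(z_j)\, y_i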

Backpropagation through time [diagram: the same computation graph as before; the gradients of the losses L(t) flow backwards through the unrolled graph, and because the parameters U, V, W are shared, their gradients are summed over all timesteps]
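As a rough sketch (not the lecture's code; it assumes the softmax-plus-cross-entropy setup described above and the forward pass from the earlier snippet), backpropagation through time walks backwards over the unrolled graph and accumulates the gradients of the shared parameters:

    import numpy as np

    def softmax(o):
        e = np.exp(o - o.max())
        return e / e.sum()

    def bptt(xs, ys, h0, U, W, V, b, c):
        """Forward pass plus backpropagation through time for a simple tanh
        RNN with softmax outputs; ys are the target label indices."""
        # Forward pass, storing all hidden states for the backward pass.
        hs, os = [h0], []
        for x in xs:
            hs.append(np.tanh(b + W @ hs[-1] + U @ x))
            os.append(c + V @ hs[-1])
        # Backward pass: gradients of the shared parameters are summed
        # over all timesteps.
        dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
        db, dc = np.zeros_like(b), np.zeros_like(c)
        dh_next = np.zeros_like(h0)
        for t in reversed(range(len(xs))):
            do = softmax(os[t])
            do[ys[t]] -= 1.0                  # dL/do for cross-entropy
            dV += np.outer(do, hs[t + 1])
            dc += do
            dh = V.T @ do + dh_next           # gradient flowing into h(t)
            da = (1.0 - hs[t + 1] ** 2) * dh  # through the tanh
            dU += np.outer(da, xs[t])
            dW += np.outer(da, hs[t])
            db += da
            dh_next = W.T @ da                # passed back to h(t-1)
        return dU, dW, dV, db, dc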

Initial values of the hidden state We could manually fix the initial state to some sensible starting value, e.g. a vector of zeros. Alternatively, we could learn the initial state by starting with a random guess and then updating that guess during backpropagation.

Networks with output recurrence [diagram: as in the standard architecture, but the recurrent connection W goes from the output o(t−1) back into the hidden state h(t), rather than from h(t−1) to h(t)]

The limitations of recurrent neural networks In principle, recurrent networks are capable of learning long-distance dependencies. In practice, standard gradient-based learning algorithms do not perform very well: Bengio et al. (1994), the vanishing gradient problem. Today, there are several methods available for training recurrent neural networks that avoid these problems: LSTMs, optimisation with small gradients, careful weight initialisation, ...

Vanishing and exploding gradients [plots of the sigmoid, tanh, and ReLU activation functions (left) and of their gradients (right) over the input range −6 to 6]
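The problem compounds over timesteps: the gradient reaching the start of the sequence involves a product of one Jacobian per step, so its norm tends to shrink or blow up exponentially. Here is a small illustrative sketch (not from the lecture) using a linear recurrence:

    import numpy as np

    rng = np.random.default_rng(0)
    T, n = 50, 20                        # number of timesteps, hidden size

    for scale in (0.5, 1.5):             # "small" vs. "large" recurrent weights
        W = rng.standard_normal((n, n)) * scale / np.sqrt(n)
        grad = np.ones(n)                # gradient arriving at the last timestep
        for _ in range(T):               # propagate it back through T steps
            grad = W.T @ grad            # one (here: linear) Jacobian per step
        print(scale, np.linalg.norm(grad))
        # scale 0.5: the norm collapses towards zero (vanishing gradient)
        # scale 1.5: the norm grows to a huge value (exploding gradient)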

Recursive neural networks [diagram: the inputs x(1), ..., x(4) are each projected with V and then combined pairwise, bottom-up along a tree, by repeatedly applying the same weights U and W; the root representation yields the output o, which is compared with the target y to give the loss L]

Long Short-Term Memory (LSTM)

Long Short-Term Memory The Long Short-Term Memory (LSTM) architecture was specifically designed to battle the vanishing gradient problem. Metaphor: the dynamic state of the neural network can be considered as a short-term memory. The LSTM architecture tries to make this short-term memory last as long as possible by preventing gradients from vanishing. Central idea: a gating mechanism.

Memory cell and gating mechanism The crucial innovation in an LSTM is the design of its memory cell. Information is written into the cell whenever its write gate is on. The information stays in the cell as long as its keep gate is on. Information is read from the cell whenever its read gate is on.

Information flow in an LSTM [diagram: a value (1.7) is written into the memory cell while the write gate is on, is retained over several timesteps while the keep gate is on, and can later be read out through the read gate]. Attribution: Geoffrey Hinton

A look inside an LSTM cell [diagram: the cell state s(i−1) and the external state h(i−1) enter together with the input x(i); three sigmoid gates and a tanh candidate update the cell state to s(i), and the output y(i) and the new external state h(i) are read off via a final tanh]. Attribution: Chris Olah

The keep gate ('forget gate') [diagram: a sigmoid layer over h(i−1) and x(i) decides which components of the cell state to keep]. Attribution: Chris Olah

The write gate ('input gate') [diagram: a sigmoid layer over h(i−1) and x(i) decides which components of the candidate update are written into the cell state]. Attribution: Chris Olah

Update candidate [diagram: a tanh layer over h(i−1) and x(i) computes a vector of candidate values for updating the cell state]. Attribution: Chris Olah

Updating the internal state [diagram: the old cell state s(i−1) is multiplied elementwise by the keep gate and added to the write-gated update candidate, yielding the new cell state s(i)]. Attribution: Chris Olah

The read gate ('output gate') [diagram: a sigmoid layer over h(i−1) and x(i) decides which components of the cell state are read out into the new external state]. Attribution: Chris Olah

Updating the external state [diagram: the new cell state is passed through tanh and multiplied by the read gate to produce the output y(i) and the new external state h(i)]. Attribution: Chris Olah
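Putting the pieces together, one step of such an LSTM cell might look as follows. This is a minimal sketch (not the lecture's code) with made-up parameter names and without peephole connections:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, s_prev, params):
        """One step of a basic LSTM cell (no peephole connections).

        params holds one weight matrix and bias per gate; every gate sees the
        concatenation of the previous external state and the current input."""
        z = np.concatenate([h_prev, x])
        keep  = sigmoid(params["W_keep"]  @ z + params["b_keep"])   # forget gate
        write = sigmoid(params["W_write"] @ z + params["b_write"])  # input gate
        read  = sigmoid(params["W_read"]  @ z + params["b_read"])   # output gate
        cand  = np.tanh(params["W_cand"]  @ z + params["b_cand"])   # update candidate
        s = keep * s_prev + write * cand     # new internal (cell) state
        h = read * np.tanh(s)                # new external state / output
        return h, s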

Peephole connections [diagram: a variant of the LSTM cell in which the gate layers also receive the cell state as an additional input]. Attribution: Chris Olah

Gated Recurrent Unit (GRU) [diagram: the GRU merges the cell state and the external state into a single state h(i); an update gate interpolates, via factors z and 1 − z, between the previous state h(i−1) and a tanh candidate computed from the reset-gated previous state and the input x(i)]. Attribution: Chris Olah
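Written out (a sketch of the standard GRU equations in the notation of the diagram; the slide itself only shows the picture, and biases are omitted here):

    \begin{aligned}
    z^{(i)} &= \sigma\bigl(W_z\,[\,h^{(i-1)};\, x^{(i)}\,]\bigr) \\
    r^{(i)} &= \sigma\bigl(W_r\,[\,h^{(i-1)};\, x^{(i)}\,]\bigr) \\
    \tilde{h}^{(i)} &= \tanh\bigl(W_h\,[\,r^{(i)} \odot h^{(i-1)};\, x^{(i)}\,]\bigr) \\
    h^{(i)} &= (1 - z^{(i)}) \odot h^{(i-1)} + z^{(i)} \odot \tilde{h}^{(i)}
    \end{aligned}

Here z is the update gate, r the reset gate, ⊙ the elementwise product, and [·; ·] concatenation.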

Bidirectional RNNs In speech recognition, the correct interpretation of a given sound may depend on both the previous sounds and the next sounds. Bidirectional RNNs combine one RNN that moves forward through time with another RNN that moves backward. The output can be a representation that depends on both the past and the future, without having to specify a fixed-sized window.

A bidirectional RNN [diagram: a forward RNN F computes states h(1), h(2), h(3) left to right, a backward RNN B computes states g(1), g(2), g(3) right to left, and the output y(t) at each position depends on both h(t) and g(t)]. Attribution: Chris Olah
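A minimal sketch (not from the lecture) of this idea with two simple tanh RNNs, concatenating the forward state h and the backward state g at every position:

    import numpy as np

    def birnn_states(xs, Wf, Uf, bf, Wb, Ub, bb, h0f, h0b):
        """Hidden states of a simple bidirectional RNN.

        One tanh RNN (Wf, Uf, bf) reads the sequence left to right, another
        (Wb, Ub, bb) reads it right to left; the state at position t is the
        concatenation of the two."""
        h = h0f
        forward = []
        for x in xs:                            # left-to-right pass
            h = np.tanh(bf + Wf @ h + Uf @ x)
            forward.append(h)
        g = h0b
        backward = [None] * len(xs)
        for t in reversed(range(len(xs))):      # right-to-left pass
            g = np.tanh(bb + Wb @ g + Ub @ xs[t])
            backward[t] = g
        # Each position now sees both the past (forward) and the future (backward).
        return [np.concatenate([f, b]) for f, b in zip(forward, backward)]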