CSCI 315: Artificial Intelligence through Deep Learning

CSCI 315: Artificial Intelligence through Deep Learning W&L Winter Term 2017 Prof. Levy Recurrent Neural Networks (Chapter 7)

Recall our first-week discussion...

How do we know stuff?

(MIT Press 1996)

Intelligence as Prediction

Simple Recurrent Network (Elman 1990) The context layer is just a copy of the hidden layer at the previous time step. Like the input layer, it is fully connected to the hidden layer. It acts as an additional input that provides a history (context) for the current input. So the C in ABCABC looks different to the hidden layer than the C in BACBAC.
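As a minimal sketch of this idea (the function and variable names are illustrative, not Elman's code), one SRN time step in NumPy looks like this: the context is literally a saved copy of the previous hidden vector, fed in alongside the current input.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srn_step(x, context, W_xh, W_ch, W_hy, b_h, b_y):
    """One time step of a Simple Recurrent Network (Elman 1990).
    x       : current input vector
    context : copy of the hidden vector from the previous time step
    """
    # The hidden layer sees the current input plus the saved history (context).
    hidden = sigmoid(W_xh @ x + W_ch @ context + b_h)
    # The output layer predicts the next input.
    output = sigmoid(W_hy @ hidden + b_y)
    # COPY step: the context for the next time step is this hidden vector.
    return output, hidden.copy()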

Experiment #1: Sequential XOR Recall XOR (Exclusive OR) as a minimal test case for nontrivial machine learning:

Input    Output
0 0      0
0 1      1
1 0      1
1 1      0

This is a purely static learning task: for a given input, the output is always the same. We can turn XOR into a prediction task by repeating a sequence consisting of two random bits, followed by their XOR: 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1...

The target sequence is the input sequence shifted left by one time step:
input:  1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1
target: 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1
The training sequence was 3,000 bits long. The SRN had one input / target unit and two hidden / context units, trained with 600 iterations of back-prop (each iteration = one pass through the entire sequence). What sort of results do we expect: how good will the network be at predicting each next target?
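Before looking at the results, here is a sketch of how such a training stream and its shifted target could be generated (the helper name make_xor_stream is hypothetical, not from the lecture):

import numpy as np

def make_xor_stream(n_triples, seed=0):
    """Concatenate (a, b, a XOR b) triples of random bits, e.g. 1 0 1 0 0 0 ..."""
    rng = np.random.default_rng(seed)
    bits = []
    for _ in range(n_triples):
        a, b = rng.integers(0, 2, size=2)
        bits.extend([a, b, a ^ b])
    return np.array(bits)

stream  = make_xor_stream(1000)   # 3,000 bits, as in the experiment
inputs  = stream[:-1]             # bit presented at time t
targets = stream[1:]              # bit to predict: the input shifted left by one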

Experiment #4: Discovering Lexical Classes Lexical classes: noun, verb, adjective, etc. These can be further analyzed into animate nouns (person, animal), inanimate nouns (car, rock), transitive verbs (take, see), intransitive verbs (leave, go), etc.

Elman used a small grammar to generate simple English sentences from these words:

31 words, each represented by a one-hot code
150 hidden/context units
Training set of sentences 27,354 words long
Six passes through the training sequence

Instead of looking at the error signal, Elman looked at the average activation vector computed by the hidden units when the network was presented with each word. With 150 hidden units, it is difficult to see coherent patterns directly, so Elman used Hierarchical Cluster Analysis to group the hidden-layer vectors recursively:
1) Create a distance matrix of the Euclidean distance of each word's vector from the others.
2) If two words have a small distance, put them in the same group.
3) The average vector for a group can be used to represent the group as a whole, enabling distance comparisons between groups.
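These steps can be sketched with SciPy's hierarchical-clustering tools (the word_vectors array below is a random stand-in for Elman's per-word averages, and centroid linkage is one reasonable reading of step 3):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Stand-in data: one row per word = that word's average hidden-layer vector.
rng = np.random.default_rng(0)
word_vectors = rng.random((31, 150))
words = ["word%d" % i for i in range(31)]

# Step 1: pairwise Euclidean distances between the word vectors.
distances = pdist(word_vectors, metric="euclidean")

# Steps 2-3: repeatedly merge the closest groups; centroid linkage represents
# each group by its average vector when comparing groups.
tree = linkage(distances, method="centroid")

dendrogram(tree, labels=words)   # plots the cluster hierarchy (needs matplotlib)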

Hypothetical distance matrix:

        cat   man   move  plate  smash  woman
break   .9    .8    .2    .7     .1     .8
cat           .1    .8    .3     .9     .2
man                 .9    .4     .8     .1
move                      .8     .1     .8
plate                            .2     .8
smash                                   .9

Results

Responding to Novelty

SRN: Summing Up The SRN revived the nature / nurture debate in cognitive science, revealing a huge amount of learnable hidden structure in word sequences. Elman (1990) became one of the top-cited papers in psychology and cognitive science. It led researchers to wonder whether the training algorithm could be beefed up to deal with real-world, practical sequence tasks (translation, part-of-speech tagging).

From SRN to BPTT SRN back-prop is truncated in that it uses only the most recent hidden-layer activations. Hence there is a rapid discounting of errors computed at previous time steps. By unrolling (copying) the net over time, we can make better use of errors computed farther back in time: hence, Back-Prop Through Time (BPTT). In essence, we turn a simple recurrent net into a complicated non-recurrent net.

Back-Prop Through Time During training, we maintain a distinct copy of the weights at each time step, and modify them independently using backprop. After training, we re-roll the unrolled network back into its original form, averaging trained weight copies to get the final weights. https://www.researchgate.net/publication/2903062_a_guide_to_recurrent_neural_networks_and_backpropagation
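Here is a sketch of BPTT for a tiny vanilla RNN in NumPy (the names and the squared-error loss are my choices, not the lecture's). Instead of literally keeping weight copies and averaging them afterward, this common formulation keeps one shared set of weights and sums each unrolled step's gradient contribution, which amounts to the same update up to a constant scale factor absorbed by the learning rate.

import numpy as np

def bptt(inputs, targets, h0, Wxh, Whh, Why, bh, by):
    """Forward and backward passes through time for a vanilla RNN."""
    hs, ys = {-1: h0}, {}
    loss = 0.0
    for t in range(len(inputs)):                       # unroll forward over time
        hs[t] = np.tanh(Wxh @ inputs[t] + Whh @ hs[t - 1] + bh)
        ys[t] = Why @ hs[t] + by
        loss += 0.5 * np.sum((ys[t] - targets[t]) ** 2)

    grads = {n: np.zeros_like(w) for n, w in
             zip(("Wxh", "Whh", "Why", "bh", "by"), (Wxh, Whh, Why, bh, by))}
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(inputs))):             # backward through time
        dy = ys[t] - targets[t]
        grads["Why"] += np.outer(dy, hs[t])
        grads["by"] += dy
        dh = Why.T @ dy + dh_next                      # error from output + later steps
        draw = (1.0 - hs[t] ** 2) * dh                 # back through tanh
        grads["bh"] += draw
        grads["Wxh"] += np.outer(draw, inputs[t])
        grads["Whh"] += np.outer(draw, hs[t - 1])
        dh_next = Whh.T @ draw                         # send error one step further back
    return loss, grads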

BPTT: The Vanishing Gradient Problem Even with an unrolled network, the error gradient will eventually become too small to be useful for weight updates. The figure at right illustrates this for a toy network of three units (input, hidden, output), but it is true of any BPTT network in general.
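A toy numerical illustration (a single hidden unit with an illustrative recurrent weight, not the slide's actual figure): the error reaching a point k steps back has been multiplied k times by the recurrent weight times tanh', and so shrinks geometrically.

import numpy as np

rng = np.random.default_rng(0)
w_hh = 0.9      # recurrent weight (illustrative value)
delta = 1.0     # error signal at the final time step

for k in range(1, 21):
    a = rng.normal()                          # pre-activation at an earlier step
    delta *= w_hh * (1.0 - np.tanh(a) ** 2)   # one step of backprop through time
    if k % 5 == 0:
        print("%2d steps back: gradient scale ~ %.1e" % (k, delta))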

Long Short-Term Memory: A Solution to the Vanishing Gradient LSTM was invented by Hochreiter & Schmidhuber in 1997, but citations have exploded recently thanks to Deep Learning. Best intro: google "colah LSTM". Consider a general RNN architecture, shown at right: http://colah.github.io/posts/2015-08-understanding-lstms/

General RNN Architecture (diagram: inputs xt, recurrent hidden layer, outputs yt)

Unrolled

SRN Revisited As in the previous SRN illustration, the hidden layer is computed from the current input and the previous hidden layer. But the modern approach uses the hyperbolic tangent tanh() here, instead of the logistic sigmoid.

tanh()
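In the notation of the Colah post linked above (the exact symbols are my rendering, not text from the slide), this repeating module computes
ht = tanh(W · [ht-1, xt] + b)
where [ht-1, xt] is the previous hidden vector concatenated with the current input.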

LSTM Diagram legend: (box) = layer; × = elementwise multiply; + = elementwise add; σ = logistic sigmoid

Cell State (diagram: the cell-state line, with gates controlling what passes along it)

Cell State (diagram: a gate acts like a transistor, passing or blocking a signal)

Forget Gate Layer ft will be 0 where we want to forget a vector component, 1 where we want to remember it. For example, introduction of a singular noun in xt might override a previous plural noun, for verb agreement later. So we forget the plural noun.
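In the Colah notation these slides follow, the forget gate is a sigmoid layer applied to the previous hidden state and the current input:
ft = σ(Wf · [ht-1, xt] + bf)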

Input Gate Layer it will be 1 where we want to learn a vector component, 0 where we don't. For example, introduction of a singular noun in xt might override a previous plural noun, for verb agreement later. So we remember the singular noun. Ĉt is the actual data that we want to remember.
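The corresponding equations (same Colah notation) are:
it = σ(Wi · [ht-1, xt] + bi)
Ĉt = tanh(Wc · [ht-1, xt] + bc)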

Cell State Update The new cell state Ct is what we chose to keep (i.e., did not forget) from the previous state, plus what we want to remember from the new info.
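As an equation, the update is elementwise:
Ct = ft * Ct-1 + it * Ĉt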

Output ot is yet another sigmoidal gate that allows us to select the components of the output. Finally, we pass the current cell state Ct through tanh() to keep it in [-1, +1], and multiply elementwise by ot to get the new hidden state / output ht.
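The corresponding equations are:
ot = σ(Wo · [ht-1, xt] + bo)
ht = ot * tanh(Ct)

Putting the four gates together, here is a minimal NumPy sketch of one LSTM step (illustrative names and shapes, not the lecture's code), following the equations above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM time step."""
    z = np.concatenate([h_prev, x])        # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)               # forget gate: what to drop from C_{t-1}
    i = sigmoid(Wi @ z + bi)               # input gate: what new info to admit
    C_hat = np.tanh(Wc @ z + bc)           # candidate values to remember
    C = f * C_prev + i * C_hat             # new cell state
    o = sigmoid(Wo @ z + bo)               # output gate: which components to expose
    h = o * np.tanh(C)                     # new hidden state / output
    return h, C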

LSTM vs. SRN: Summary
SRN has two layers of weights:
- input, context → hidden
- hidden → output
LSTM has four:
- Wf
- Wi
- Wc
- Wo
Traditional SRN used the logistic sigmoid everywhere. LSTM uses tanh(), with the sigmoid reserved for gating.
So how does LSTM avoid the vanishing gradient?