RECURRENT NETWORKS I. Philipp Krähenbühl

RECAP: CLASSIFICATION [diagram: image → conv 1 → conv 2 → conv 3 → conv 4 → class scores]

RECAP: SEGMENTATION [diagram: conv 1 → conv 2 → conv 3 → conv 4]

RECAP: DETECTION [diagram: conv 1 → conv 2 → conv 3 → conv 4]

RECAP: GENERATION [diagram: noise → conv 1 → conv 2 → conv 3 → conv 4]

FEED FORWARD NETWORKS Order of computation: [diagram: conv 1 → conv 2 → conv 3 → conv 4 → outputs]

FEED FORWARD NETWORKS (Fixed) order of computation: from lower to upper layers. Once we have the result, discard all activations. [diagram: conv 1 → conv 2 → conv 3 → conv 4]

WOULD YOU USE THIS TO DRIVE A CAR?

WOULD YOU USE THIS TO DRIVE A CAR? [diagram: per-frame network conv 1 → conv 2 → conv 3 → output]

WOULD YOU USE THIS TO DRIVE A CAR? Independent decision for each frame. No state or memory. For SuperTuxKart it might still be ok; in the real world, probably not.

HOW DO WE KEEP A STATE AROUND? [diagram: per-frame networks conv 1 → conv 2 → conv 3 with connections carrying state between time steps]

RECURRENT NEURAL NETWORK (RNN) State update: h_t = f_h(x_t, h_{t-1}, θ_h). Output: y_t = f_y(x_t, h_t, θ_y). [diagram: x → h → y, with a recurrent connection on h]
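
A minimal sketch of this recurrence in Python (the names rnn_forward, f_h, f_y are illustrative, not from the slides): the state h is threaded through the loop, and each step produces an output from the current input and the updated state.

def rnn_forward(xs, h0, f_h, f_y, theta_h, theta_y):
    # xs: inputs x_0..x_T, h0: initial state
    h, ys = h0, []
    for x in xs:
        h = f_h(x, h, theta_h)         # state update: h_t = f_h(x_t, h_{t-1}, θ_h)
        ys.append(f_y(x, h, theta_y))  # output:       y_t = f_y(x_t, h_t, θ_y)
    return ys, h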

ELMAN NETWORKS State update: h_t = f_h(x_t, h_{t-1}, θ_h) = σ(U_h h_{t-1} + W_h x_t + b_h). Output: y_t = f_y(x_t, h_t, θ_y) = σ(W_y h_t + b_y). [diagram: sigmoid state unit h, sigmoid output unit y]
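
A from-scratch sketch of one Elman step in PyTorch; the sizes and tensor names below are illustrative, not from the slides.

import torch

n_x, n_h, n_y = 8, 16, 4                       # illustrative sizes
W_h, U_h, b_h = torch.randn(n_h, n_x), torch.randn(n_h, n_h), torch.zeros(n_h)
W_y, b_y = torch.randn(n_y, n_h), torch.zeros(n_y)

def elman_step(x_t, h_prev):
    h_t = torch.sigmoid(U_h @ h_prev + W_h @ x_t + b_h)  # state update
    y_t = torch.sigmoid(W_y @ h_t + b_y)                  # output
    return h_t, y_t

h, y = elman_step(torch.randn(n_x), torch.zeros(n_h))     # one step on a random input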

JORDAN NETWORKS State update: h_t = σ(U_h y_{t-1} + W_h x_t + b_h). Output: y_t = σ(W_y h_t + b_y). [diagram: sigmoid state unit h, sigmoid output unit y fed back into the state]
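
For contrast, a Jordan step (a sketch reusing the illustrative tensors above) feeds back the previous output y_{t-1} instead of the previous hidden state, so the recurrent weight matrix acts on a vector of size n_y.

U_y = torch.randn(n_h, n_y)   # illustrative recurrent weights on the previous output

def jordan_step(x_t, y_prev):
    h_t = torch.sigmoid(U_y @ y_prev + W_h @ x_t + b_h)  # state update from y_{t-1}
    y_t = torch.sigmoid(W_y @ h_t + b_y)                  # output, fed back at the next step
    return h_t, y_t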

HOW DO WE TRAIN RNNS? State update: h_t = f_h(x_t, h_{t-1}, θ_h). Output: y_t = f_y(x_t, h_t, θ_y). [diagram: x → h → y, with a recurrent connection on h]

UNROLLING THROUGH TIME [diagram: unrolled chain with inputs x_0, x_1, x_2, ..., x_t, hidden states carried left to right, and outputs y_0, y_1, y_2, ..., y_t]

UNROLLING THROUGH TIME An unrolled RNN is a feed-forward network with shared parameters, trained with back-prop. [diagram: unrolled chain x_0 ... x_t → y_0 ... y_t]
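
A sketch of that training scheme, assuming the illustrative elman_step above and a toy squared-error loss: the same parameters are reused at every time step, and a single backward() call back-propagates through the whole unrolled graph.

params = [W_h, U_h, b_h, W_y, b_y]
for p in params:
    p.requires_grad_(True)
opt = torch.optim.SGD(params, lr=0.01)

xs = [torch.randn(n_x) for _ in range(20)]   # toy input sequence
ys = [torch.rand(n_y) for _ in range(20)]    # toy target sequence

h, loss = torch.zeros(n_h), 0.0
for x_t, y_target in zip(xs, ys):
    h, y_t = elman_step(x_t, h)              # same weights at every step
    loss = loss + ((y_t - y_target) ** 2).sum()
opt.zero_grad()
loss.backward()                              # back-propagation through time
opt.step()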

UNROLLING THROUGH TIME - ISSUES Long unrolling: vanishing or exploding gradients. Very long unrolling: computationally expensive.

VERY LONG UNROLLING Solution (hack): during training, cut the RNN (set h = 0) after n time steps. Often still trains well in practice.
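
A sketch of that cut, modifying the toy loop above: every n steps the accumulated loss is back-propagated and the hidden state is detached from the graph (usually called truncated back-propagation through time); using h = torch.zeros(n_h) instead of detach() resets the state to 0, as on the slide.

n = 5                                        # illustrative truncation length
h, loss = torch.zeros(n_h), 0.0
for t, (x_t, y_target) in enumerate(zip(xs, ys)):
    if t > 0 and t % n == 0:
        opt.zero_grad()
        loss.backward()                      # gradients flow only through the last n steps
        opt.step()
        h, loss = h.detach(), 0.0            # cut the graph (or reset: h = torch.zeros(n_h))
    h, y_t = elman_step(x_t, h)
    loss = loss + ((y_t - y_target) ** 2).sum()
opt.zero_grad(); loss.backward(); opt.step() # update on the remaining steps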

EXPLODING AND VANISHING GRADIENTS If ∂h_t/∂h_{t-1} ≈ α, then ∂h_n/∂h_0 ≈ α^n. Vanishing gradients: α < 1 ⇒ α^n → 0. Exploding gradients: α > 1 ⇒ α^n → ∞. [diagram: x → h → y with recurrent connection on h]
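
A quick numeric illustration of how fast α^n vanishes or explodes over a 100-step unrolling:

print(0.9 ** 100)   # ≈ 2.7e-05  (vanished)
print(1.1 ** 100)   # ≈ 1.4e+04  (exploded)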

EXPLODING AND VANISHING GRADIENTS Exploding gradients: gradient clipping (hack), ∂l/∂h_{t-1} ← clip(∂l/∂h_t · ∂h_t/∂h_{t-1}, −ε, ε). Vanishing gradients: different RNN structure.
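
A sketch of gradient clipping with standard PyTorch utilities, assuming the params list from the toy training loop above: clip_grad_value_ is the element-wise clip to [−ε, ε] from the slide, clip_grad_norm_ the norm-based variant often used in practice.

# after loss.backward(), before opt.step():
torch.nn.utils.clip_grad_value_(params, clip_value=1.0)     # clip every gradient entry to [-1, 1]
# alternatively: torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)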

LSTM Long short-term memory
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ τ(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ⊙ τ(c_t)
(⊙: element-wise product, τ: tanh) [diagram: LSTM cell]
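
These equations transcribe almost line by line into a from-scratch PyTorch step (a sketch with illustrative sizes and names; in practice one would use torch.nn.LSTM):

import torch

n_x, n_h = 8, 16   # illustrative sizes
def gate_params():
    return torch.randn(n_h, n_x), torch.randn(n_h, n_h), torch.zeros(n_h)
(W_f, U_f, b_f), (W_i, U_i, b_i) = gate_params(), gate_params()
(W_o, U_o, b_o), (W_c, U_c, b_c) = gate_params(), gate_params()

def lstm_step(x_t, h_prev, c_prev):
    f_t = torch.sigmoid(W_f @ x_t + U_f @ h_prev + b_f)                    # forget gate
    i_t = torch.sigmoid(W_i @ x_t + U_i @ h_prev + b_i)                    # input gate
    o_t = torch.sigmoid(W_o @ x_t + U_o @ h_prev + b_o)                    # output gate
    c_t = f_t * c_prev + i_t * torch.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # cell state
    h_t = o_t * torch.tanh(c_t)                                            # hidden state
    return h_t, c_t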

LSTM Long short-term memory Cell state c: allows information to just flow through nearly unchanged. [diagram: LSTM cell with the cell-state path c_{t-1} → c_t highlighted]

LSTM Long short-term memory Forget gate f: clears the cell state. [diagram: LSTM cell with the forget gate highlighted]

LSTM Long short-term memory Input gate i: allows a state update (or not). [diagram: LSTM cell with the input gate highlighted; inputs are x_t and the previous state h_{t-1}]

LSTM Long short-term memory Output gate o: should we produce an output? The output h_t is the tanh of the cell state, gated by o_t. [diagram: LSTM cell with the output gate highlighted]

LSTM Long short-term memory Can learn to keep state for up to 100 time steps. Fewer vanishing gradients. Trained by unrolling through time. [diagram: LSTM cell]

GRU Gated Recurrent Unit
z_t = σ(W_z x_t + U_z h_{t-1} + b_z)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)
h̃_t = τ(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
[diagram: GRU cell]
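
The GRU equations transcribe the same way (a sketch reusing the illustrative gate_params helper above; torch.nn.GRU is the built-in equivalent):

(W_z, U_z, b_z), (W_r, U_r, b_r), (W_n, U_n, b_n) = gate_params(), gate_params(), gate_params()

def gru_step(x_t, h_prev):
    z_t = torch.sigmoid(W_z @ x_t + U_z @ h_prev + b_z)          # update gate
    r_t = torch.sigmoid(W_r @ x_t + U_r @ h_prev + b_r)          # reset gate
    h_cand = torch.tanh(W_n @ x_t + U_n @ (r_t * h_prev) + b_n)  # candidate state h̃_t
    return (1 - z_t) * h_prev + z_t * h_cand                     # new hidden state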

GRU Gated Recurrent Unit Similar performance to LSTM. Almost the same state update. Fewer gates. [diagram: GRU cell]

SUMMARY Training RNNs: unroll in time + backprop. Exploding gradients: clip them. Vanishing gradients (no long-term interactions): use an LSTM or GRU.