Recurrent Neural Networks. Jian Tang

Slide credit from Hung-Yi Lee & Richard Socher

Transcription:

Recurrent Neural Networks. Jian Tang, tangjianpku@gmail.com

RNN: Recurrent Neural Networks. Neural networks for sequence modeling: summarize a sequence with a fixed-size vector by recursively updating a hidden state, h_t = F_θ(h_{t−1}, x_t). (Figure: unrolled chain of hidden states h_{t−1}, h_t, h_{t+1}.)

Recurrent Neural Networks. Can produce an output at each time step; unfolding the graph tells us how to back-propagate through time: h_t = tanh(W h_{t−1} + U x_t).

Recurrent Neural Networks. Or produce a single output at the end of the sequence: h_t = tanh(W h_{t−1} + U x_t).
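As a concrete illustration of the two usage patterns above, here is a minimal NumPy sketch of the unrolled forward pass (the parameter names W, U, V and the shapes are illustrative, not taken from the slides):

```python
import numpy as np

def rnn_forward(X, W, U, V, h0):
    """Unrolled forward pass of a vanilla RNN: h_t = tanh(W h_{t-1} + U x_t).

    X  : sequence of inputs, shape (T, input_dim)
    W  : hidden-to-hidden weights, shape (hidden_dim, hidden_dim) -- shared across time
    U  : input-to-hidden weights,  shape (hidden_dim, input_dim)  -- shared across time
    V  : hidden-to-output weights, shape (output_dim, hidden_dim)
    h0 : initial hidden state, shape (hidden_dim,)
    """
    h = h0
    per_step_outputs = []
    for x_t in X:                       # one step per element of the sequence
        h = np.tanh(W @ h + U @ x_t)    # the same W and U are reused at every time step
        per_step_outputs.append(V @ h)  # output at each time step (e.g. for tagging)
    final_output = V @ h                # or a single output at the end of the sequence
    return per_step_outputs, final_output, h
```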

Language Modeling. A language model computes a probability for a sequence of words, P(w_1, …, w_T). Useful for machine translation: word ordering, p(the cat is small) > p(small the is cat); word choice, p(walking home after school) > p(walking house after school).

RNN for Language Modeling. Estimate the probability of a sequence with a model in which every word is predicted from all previous ones: P(w_1, …, w_T) = ∏_t P(w_t | w_1, …, w_{t−1}). At a single time step: h_t = σ(W^(hh) h_{t−1} + W^(hx) x_[t]), ŷ_t = softmax(W^(S) h_t), where σ is a nonlinearity (e.g. sigmoid or tanh).

RNN for Language Modeling. Main idea: we use the same set of W weights at all time steps! Everything else is the same: h_0 is some initialization vector for the hidden layer at time step 0, and x_[t] is the column vector of the word-embedding matrix L at index [t] at time step t.

RNN for Language Modeling. ŷ_t is a probability distribution over the vocabulary. Same cross-entropy loss function, but predicting words instead of classes: J^(t)(θ) = −∑_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}.
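A minimal sketch of a single RNN language-model step, assuming the notation above (embedding matrix L, hidden weights W^(hh) and W^(hx), output weights W^(S)); the helper names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_step(word_index, h_prev, L, W_hh, W_hx, W_S):
    """One RNN language-model step.

    L    : word-embedding matrix, shape (embed_dim, |V|); x_[t] is a column of L
    W_hh : hidden-to-hidden weights (shared across time)
    W_hx : input-to-hidden weights  (shared across time)
    W_S  : hidden-to-vocabulary weights, shape (|V|, hidden_dim)
    """
    x_t = L[:, word_index]                     # embedding of the current word
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)  # hidden update (tanh here; the slides use a generic σ)
    y_hat = softmax(W_S @ h_t)                 # distribution over the vocabulary
    return h_t, y_hat
```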

RNN for Language Modeling. Evaluation could just be the negative of the average log probability over a dataset of size (number of words) T: J = −(1/T) ∑_{t=1}^{T} ∑_{j=1}^{|V|} y_{t,j} log ŷ_{t,j}. But more common is perplexity: 2^J. Lower is better!
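A small sketch of this evaluation, assuming the model's per-position distributions over the vocabulary are already available; the base-2 logarithm matches the 2^J definition of perplexity above:

```python
import numpy as np

def perplexity(pred_probs, targets):
    """Perplexity from the average negative log probability of the data.

    pred_probs : array of shape (T, |V|), model distribution at each position
    targets    : array of shape (T,), index of the actual next word
    """
    T = len(targets)
    log_probs = np.log2(pred_probs[np.arange(T), targets])  # log-prob of each true word
    J = -log_probs.mean()                                    # average cross-entropy (in bits)
    return 2.0 ** J                                          # lower is better

# tiny usage example with made-up numbers
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(perplexity(probs, np.array([0, 1])))   # ~1.34 for this toy case
```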

Training RNNs is very Hard. We multiply by the same matrix W at each time step during forward prop (figure: RNN unrolled over inputs x_{t−1}, x_t, x_{t+1}, hidden states h_{t−1}, h_t, h_{t+1}, and outputs y_{t−1}, y_t, y_{t+1}). Ideally, inputs from many time steps ago should be able to modify the output y. Take, for example, an RNN with 2 time steps. Insightful!

Gradient Vanishing/Exploding. We multiply by the same matrix W at each time step during backprop (figure: the same unrolled RNN, with W shared across time steps).

Details. A similar but simpler RNN formulation: h_t = W f(h_{t−1}) + W^(hx) x_[t], ŷ_t = W^(S) f(h_t). The total error is the sum of the errors at each time step t: ∂E/∂W = ∑_{t=1}^{T} ∂E_t/∂W. Hardcore chain-rule application: ∂E_t/∂W = ∑_{k=1}^{t} (∂E_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W).

Details. Similar to backprop, but a less efficient formulation. Useful for the analysis we'll look at: ∂h_t/∂h_k = ∏_{j=k+1}^{t} ∂h_j/∂h_{j−1}. Remember: h_t = W f(h_{t−1}) + W^(hx) x_[t]. More chain rule: each partial ∂h_j/∂h_{j−1} is a Jacobian matrix.

Details. From the previous slide: ∂h_t/∂h_k = ∏_{j=k+1}^{t} ∂h_j/∂h_{j−1}. Remember: h_t = W f(h_{t−1}) + W^(hx) x_[t]. To compute the Jacobian, derive each element of the matrix: ∂h_j/∂h_{j−1} = W^⊤ diag[f′(h_{j−1})], where diag[·] turns a vector into a diagonal matrix. Check at home that you understand the diag-matrix formulation.

Details. Analyzing the norms of the Jacobians yields: ‖∂h_j/∂h_{j−1}‖ ≤ ‖W^⊤‖ ‖diag[f′(h_{j−1})]‖ ≤ β_W β_h, where we define β_W and β_h as upper bounds on the norms, so that ‖∂h_t/∂h_k‖ ≤ (β_W β_h)^{t−k}. The gradient is a product of Jacobian matrices, each associated with a step in the forward computation. This can become very small or very large quickly [Bengio et al., 1994], and the locality assumption of gradient descent breaks down. → Vanishing or exploding gradients.
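A quick numerical illustration of this effect: repeatedly multiplying a gradient by W^⊤ (taking f′ ≈ 1 to isolate the matrix product) makes its norm shrink or explode depending on the scale of W. The sizes and scales below are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 50                                    # hidden size (illustrative)
grad = rng.normal(size=H)                 # stand-in for dE_t/dh_t

for scale in [0.5, 1.0, 1.5]:             # roughly controls the spectral radius of W
    W = scale * rng.normal(size=(H, H)) / np.sqrt(H)
    g = grad.copy()
    for step in range(50):                # back-propagate through 50 time steps
        # Jacobian of h_j w.r.t. h_{j-1} is W^T diag(f'(h_{j-1})); with f' ~ 1 this is just W^T
        g = W.T @ g
    print(f"scale={scale}: ||gradient after 50 steps|| = {np.linalg.norm(g):.3e}")
# The norm collapses towards 0 for small scales (vanishing) and blows up for large ones (exploding).
```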

Long Short-Term Memory (LSTM). From multiplication to summation. Input gate (current cell matters): i_t = σ(W^(i) x_t + U^(i) h_{t−1}). Forget gate (0 means forget the past): f_t = σ(W^(f) x_t + U^(f) h_{t−1}). Output gate (how much the cell is exposed): o_t = σ(W^(o) x_t + U^(o) h_{t−1}). New memory cell: c̃_t = tanh(W^(c) x_t + U^(c) h_{t−1}). Final memory cell: c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t. Final hidden state: h_t = o_t ∘ tanh(c_t).
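A minimal sketch of one LSTM step implementing the gate equations above; the parameter-dictionary layout is an assumption made for readability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, p):
    """One LSTM step. p is a dict of parameter matrices, e.g. p['W_i'], p['U_i']
    for the input gate; W_* have shape (hidden, input_dim), U_* (hidden, hidden)."""
    i = sigmoid(p['W_i'] @ x_t + p['U_i'] @ h_prev)         # input gate: does the current cell matter?
    f = sigmoid(p['W_f'] @ x_t + p['U_f'] @ h_prev)         # forget gate: 0 means forget the past
    o = sigmoid(p['W_o'] @ x_t + p['U_o'] @ h_prev)         # output gate: how much cell is exposed
    c_tilde = np.tanh(p['W_c'] @ x_t + p['U_c'] @ h_prev)   # new memory cell candidate
    c = f * c_prev + i * c_tilde                            # final memory cell: a sum, not a product
    h = o * np.tanh(c)                                      # final hidden state
    return h, c
```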

Gated Recurrent Unit (GRU, Cho et al. 2014). Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t−1}). Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t−1}). New memory content: h̃_t = tanh(W x_t + r_t ∘ U h_{t−1}); if the reset gate unit is ~0, this ignores the previous memory and only stores the new word information. Final memory at time step t combines the current and previous time steps: h_t = z_t ∘ h_{t−1} + (1 − z_t) ∘ h̃_t.
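And a matching sketch of one GRU step, following the same convention as above (z gates the old state, 1 − z the new content); again the parameter layout is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, p):
    """One GRU step (Cho et al. 2014). p is a dict of parameter matrices:
    W_z, U_z (update gate), W_r, U_r (reset gate), W, U (candidate memory)."""
    z = sigmoid(p['W_z'] @ x_t + p['U_z'] @ h_prev)           # update gate
    r = sigmoid(p['W_r'] @ x_t + p['U_r'] @ h_prev)           # reset gate
    h_tilde = np.tanh(p['W'] @ x_t + r * (p['U'] @ h_prev))   # new memory content
    h = z * h_prev + (1.0 - z) * h_tilde                      # mix past state and new content
    return h
```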

Gated Recurrent Unit (GRU, Cho et al. 2014). Figure: GRU computation graph over two time steps, with inputs x_{t−1}, x_t, reset gates r_{t−1}, r_t, update gates z_{t−1}, z_t, candidate memories h̃_{t−1}, h̃_t, and final memories h_{t−1}, h_t.

Gated Recurrent Unit (GRU, Cho et al. 2014). If the reset gate is close to 0, the previous hidden state is ignored → this allows the model to drop information that is irrelevant in the future. The update gate z controls how much of the past state should matter now; if z is close to 1, we can copy information in that unit through many time steps → less vanishing gradient! Units with short-term dependencies often have very active reset gates.

Gated Recurrent Unit (GRU, Cho et al. 2014). Units with long-term dependencies have active update gates z. Illustration:

Deep Bidirectional RNN (Irsoy and Cardie). Going deep: stack layers h^(1), h^(2), h^(3), … between the input x and the output y. →h_t^(i) = f(→W^(i) h_t^(i−1) + →V^(i) →h_{t−1}^(i) + →b^(i)); ←h_t^(i) = f(←W^(i) h_t^(i−1) + ←V^(i) ←h_{t+1}^(i) + ←b^(i)); y_t = g(U[→h_t^(L); ←h_t^(L)] + c). Each memory layer passes an intermediate sequential representation to the next.
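A sketch of one bidirectional layer under these equations, with illustrative parameter names (Wf/Vf/bf for the forward direction, Wb/Vb/bb for the backward one); the output projection y_t = g(U[·;·] + c) is applied only after the top layer and is omitted here:

```python
import numpy as np

def bidirectional_layer(H_below, Wf, Vf, bf, Wb, Vb, bb):
    """One layer of a deep bidirectional RNN: a forward and a backward pass over
    the layer below.

    H_below : outputs of layer i-1, shape (T, D)
    Returns the concatenated forward and backward states, shape (T, 2*hidden).
    """
    T = H_below.shape[0]
    hidden = Wf.shape[0]
    h_fwd = np.zeros((T, hidden))
    h_bwd = np.zeros((T, hidden))
    for t in range(T):                                    # left-to-right pass
        prev = h_fwd[t - 1] if t > 0 else np.zeros(hidden)
        h_fwd[t] = np.tanh(Wf @ H_below[t] + Vf @ prev + bf)
    for t in reversed(range(T)):                          # right-to-left pass
        nxt = h_bwd[t + 1] if t + 1 < T else np.zeros(hidden)
        h_bwd[t] = np.tanh(Wb @ H_below[t] + Vb @ nxt + bb)
    return np.concatenate([h_fwd, h_bwd], axis=1)         # fed to the next layer (or to y_t)
```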

Optimization for Long-term Dependencies. Avoiding gradient explosion: clipping gradients, i.e. if ‖g‖ exceeds a threshold, rescale it as g ← (threshold/‖g‖) g.
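A minimal sketch of norm-based clipping; the threshold value is an arbitrary example:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Norm-based gradient clipping: if ||g|| > threshold,
    rescale g so that its norm equals the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```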

Optimization for Long-term Dependencies. Avoiding gradient vanishing: use LSTM or GRU, or regularize or constrain the parameters so as to encourage information flow, i.e. make (∂L/∂h_t)(∂h_t/∂h_{t−1}) close to ∂L/∂h_t in magnitude. Pascanu et al. (2013a) propose the following regularizer: Ω = ∑_t ( ‖(∂L/∂h_t)(∂h_t/∂h_{t−1})‖ / ‖∂L/∂h_t‖ − 1 )².
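A sketch of computing this regularizer, assuming the per-step gradients ∂L/∂h_t and Jacobians ∂h_t/∂h_{t−1} have already been collected during backprop:

```python
import numpy as np

def pascanu_regularizer(dL_dh, jacobians):
    """Omega = sum_t ( ||(dL/dh_t) dh_t/dh_{t-1}|| / ||dL/dh_t|| - 1 )^2.

    dL_dh     : list of gradients of the loss w.r.t. h_t, each of shape (H,)
    jacobians : list of matching Jacobians dh_t/dh_{t-1}, each of shape (H, H)
    """
    omega = 0.0
    for g, J in zip(dL_dh, jacobians):
        ratio = np.linalg.norm(g @ J) / np.linalg.norm(g)  # how much the gradient's norm
        omega += (ratio - 1.0) ** 2                        # changes across one time step
    return omega
```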

Applications: Language Modeling

Applications: Sentence Classification

Applications: Sequence Tagging Figure: Bidirectional LSTM-CRF

Applications: Sequential Recommendation Figure: User sequential behaviors

References: Chapter 10, Deep Learning Book (Goodfellow, Bengio, and Courville).