Slide credit from Hung-Yi Lee & Richard Socher

Review: Recurrent Neural Network

Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step.
Assumption: temporal information matters.

RNN Language Modeling
[Figure: an unrolled RNN language model. The input word vectors ("START", "wreck", "a", "nice") feed into the hidden layer (context vector) at each step, which produces the output word probability distribution: P(next word is "wreck"), P(next word is "a"), P(next word is "nice"), P(next word is "beach").]
Idea: pass the information from the previous hidden layer to leverage all contexts.

RNNLM Formulation
At each time step, the model takes the vector of the current word and outputs the probability distribution of the next word, conditioned on the hidden state carried over from previous steps.

Recurrent Neural Network Definition
s_t = f(U x_t + W s_{t-1})
o_t = softmax(V s_t)
f: tanh, ReLU
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
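As a concrete illustration of this definition, a minimal NumPy sketch of one forward step is given below; the rnn_step name, the vocabulary/hidden sizes, and the random initialization are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    """One time step: new hidden state s_t and output distribution o_t."""
    s_t = np.tanh(U @ x_t + W @ s_prev)        # f = tanh (ReLU is another option)
    logits = V @ s_t
    o_t = np.exp(logits - logits.max())
    o_t /= o_t.sum()                           # softmax over the vocabulary
    return s_t, o_t

# Illustrative sizes (assumptions): a vocabulary of 8 words, 16 hidden units.
vocab_size, hidden_size = 8, 16
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

x_t = np.zeros(vocab_size)
x_t[3] = 1.0                                   # one-hot vector of the current word
s_t, o_t = rnn_step(x_t, np.zeros(hidden_size), U, W, V)
# o_t is the predicted probability of each word in the vocabulary being next.
```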

Model Training
All model parameters can be updated by comparing the predicted output at each time step with the corresponding target (y_{t-1}, y_t, y_{t+1}, ...) and following the gradient of the resulting loss.
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
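A minimal sketch of such a per-step loss, assuming one-hot targets and the softmax outputs o_t from the step function above (cross-entropy is a common choice here, not something the slide prescribes):

```python
import numpy as np

def cross_entropy(o_t, y_t):
    """Loss between the predicted distribution o_t and the one-hot target y_t."""
    return -np.sum(y_t * np.log(o_t + 1e-12))

# The sequence loss is the sum of the per-step losses; its gradients w.r.t.
# U, W, V are computed by backpropagation through time (next slides) and
# applied with e.g. SGD: theta <- theta - eta * grad_theta.
```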

Outline
- Language Modeling
  - N-gram Language Model
  - Feed-Forward Neural Language Model
  - Recurrent Neural Network Language Model (RNNLM)
- Recurrent Neural Network
  - Definition
  - Training via Backpropagation through Time (BPTT)
  - Training Issue
- Applications
  - Sequential Input
  - Sequential Output
    - Aligned Sequential Pairs (Tagging)
    - Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

Backpropagation
[Figure: review of standard backpropagation in a feed-forward network. Forward pass: the activations a_j of layer l-1 feed through the weights w_ij into layer l. Backward pass: the error signal δ_i of layer l flows in the opposite direction.]

Backpropagation
[Figure: the error signals δ_1, δ_2, ... at layer l are obtained from the cost derivatives ∂C/∂y_1, ..., ∂C/∂y_n at the output layer L by propagating backward through the transposed weight matrices of the intervening layers.]

Backpropagation through Time (BPTT)
Input: init, x_1, x_2, ..., x_t; Output: o_t; Target: y_t
[Figure, spread over several slides: the recurrent network is unfolded through time into a feed-forward network with one copy per step (init, s_1, ..., s_{t-2}, s_{t-1}, s_t). The unfolded copies point to the same memory, and the weights at every time step are tied together, so the error signal from the cost C(y) at o_t is propagated back through s_t, s_{t-1}, ..., s_1 and accumulated into the shared weights.]

BPTT
Forward Pass: compute s_1, s_2, s_3, s_4 (and the outputs o_1, ..., o_4).
Backward Pass: propagate the error for C^(4), then for C^(3), C^(2), C^(1), accumulating gradients into the shared weights.
[Figure: the unrolled network with inputs x_1, ..., x_4, states s_1, ..., s_4, outputs o_1, ..., o_4, targets y_1, ..., y_4, and per-step costs C^(1), ..., C^(4).]
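To make the unrolled computation concrete, here is a minimal NumPy sketch of BPTT for the RNN defined earlier; it is a simplified illustration (one-hot inputs, summed cross-entropy loss, no mini-batching), not the exact code behind the slides.

```python
import numpy as np

def bptt(xs, ys, U, W, V, s_init):
    """Forward pass, sequence loss, and gradients via backpropagation through time.
    xs: list of one-hot input vectors; ys: list of target word indices."""
    # Forward pass: keep every state and output for the backward pass.
    ss, os = [s_init], []
    for x in xs:
        s = np.tanh(U @ x + W @ ss[-1])
        logits = V @ s
        o = np.exp(logits - logits.max())
        os.append(o / o.sum())
        ss.append(s)
    loss = -sum(np.log(os[t][ys[t]] + 1e-12) for t in range(len(xs)))

    # Backward pass: walk from the last time step to the first,
    # accumulating gradients into the shared (tied) weights.
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros_like(s_init)
    for t in reversed(range(len(xs))):
        dlogits = os[t].copy()
        dlogits[ys[t]] -= 1.0                  # gradient of softmax + cross-entropy
        dV += np.outer(dlogits, ss[t + 1])
        ds = V.T @ dlogits + ds_next           # error from the output and from the future
        dz = (1.0 - ss[t + 1] ** 2) * ds       # back through tanh
        dU += np.outer(dz, xs[t])
        dW += np.outer(dz, ss[t])
        ds_next = W.T @ dz                     # pass the error one time step back
    return loss, dU, dW, dV
```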

RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation. Because the same matrix enters the product at every time step during backprop, the gradient becomes very small or very large very quickly: the vanishing or exploding gradient problem.
Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, 1994. [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," ICML, 2013. [link]
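A tiny numerical illustration of why the repeated product matters (the matrix and the step counts are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=1.0, size=(16, 16)) / np.sqrt(16)   # a recurrent weight matrix
grad = np.ones(16)                                        # stand-in for an error signal

for steps in (1, 10, 50):
    g = grad.copy()
    for _ in range(steps):
        g = W.T @ g              # the same matrix enters the product at every step
    print(steps, np.linalg.norm(g))
# Depending on whether the largest singular value of W is below or above 1,
# the norm collapses toward zero (vanishing) or blows up (exploding) with the
# number of steps; the tanh derivatives only make the vanishing case worse.
```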

Rough Error Surface
[Figure: the cost as a surface over two weights w_1 and w_2.]
The error surface is either very flat or very steep.
Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, 1994. [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," ICML, 2013. [link]

Possible Solutions (Recurrent Neural Network)

Exploding Gradient: Clipping
[Figure: the cost surface over w_1 and w_2, showing the original gradient step and the clipped gradient step.]
Idea: control the gradient value to avoid exploding.
Parameter setting: threshold values from half to ten times the average gradient norm can still yield convergence.
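A minimal sketch of one common variant, clipping by the global gradient norm (the threshold value and the clip_by_global_norm name are illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale the gradients if their combined L2 norm exceeds the threshold."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads

# e.g. dU, dW, dV = clip_by_global_norm([dU, dW, dV], threshold=5.0)
```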

Vanishing Gradient: Initialization + ReLU
IRNN: initialize the recurrent weight matrix W as the identity matrix I and use ReLU for the activation function.
Le et al., "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units," arXiv, 2015. [link]
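A sketch of that initialization for the RNN used above (the sizes and the small input-weight scale are illustrative assumptions):

```python
import numpy as np

hidden_size, vocab_size = 16, 8
rng = np.random.default_rng(0)

W = np.eye(hidden_size)                                      # recurrent weights start as the identity
U = rng.normal(scale=0.001, size=(hidden_size, vocab_size))  # small random input weights

def relu(z):
    return np.maximum(z, 0.0)

# One step of the IRNN: with W = I and ReLU, the state initially behaves like a
# running sum of the inputs, so the error signal does not shrink through the recurrence.
x_t = np.zeros(vocab_size)
x_t[3] = 1.0
s_t = relu(U @ x_t + W @ np.zeros(hidden_size))
```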

Vanishing Gradient: Gating Mechanism
An RNN models temporal sequence information and can handle long-term dependencies in theory, e.g. "I grew up in France ... I speak fluent French."
Issue: in practice, an RNN cannot handle such long-term dependencies due to the vanishing gradient.
Solution: apply a gating mechanism (e.g. LSTM/GRU) to directly encode the long-distance information.
http://colah.github.io/posts/2015-08-understanding-lstms/
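For reference, a minimal sketch of one gated (LSTM) step, following the standard gate equations described in the linked post; packing the weights into a single parameter dict p is just a convenience for this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds weight matrices W_f, W_i, W_c, W_o over [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(p["W_f"] @ z + p["b_f"])     # forget gate: what to keep from c_prev
    i = sigmoid(p["W_i"] @ z + p["b_i"])     # input gate: what to write
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])
    c = f * c_prev + i * c_tilde             # additive cell update eases gradient flow
    o = sigmoid(p["W_o"] @ z + p["b_o"])     # output gate
    h = o * np.tanh(c)
    return h, c
```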

Extension (Recurrent Neural Network)

Bidirectional RNN
h_t = [→h_t ; ←h_t]: the concatenation of a forward and a backward hidden state represents (summarizes) the past and future around a single token.
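A minimal sketch of that concatenation, with separate (hypothetical) forward and backward parameter sets:

```python
import numpy as np

def run_rnn(xs, U, W, s0):
    """Return the hidden state at every position for one direction."""
    states, s = [], s0
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    return states

def bidirectional_states(xs, params_fwd, params_bwd, s0):
    fwd = run_rnn(xs, *params_fwd, s0)                 # read the sequence left to right
    bwd = run_rnn(xs[::-1], *params_bwd, s0)[::-1]     # read it right to left, then re-align
    # h_t = [forward state ; backward state]: summarizes past and future around token t
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```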

Deep Bidirectional RNN
Each memory layer passes an intermediate representation to the next.
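Continuing the sketch above, stacking layers just means feeding each layer's concatenated states to the next layer as its input sequence (the helper name and per-layer parameters are illustrative):

```python
# Stack several bidirectional layers: the concatenated states of layer l become
# the input sequence of layer l+1; the top layer's states feed the output layer.
def deep_bidirectional_states(xs, layer_params, initial_states):
    seq = xs
    for (params_fwd, params_bwd), s0 in zip(layer_params, initial_states):
        seq = bidirectional_states(seq, params_fwd, params_bwd, s0)
    return seq
```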

Concluding Remarks
- Recurrent Neural Networks: Definition
- Issue: Vanishing/Exploding Gradient
- Solutions
  - Exploding Gradient: Clipping
  - Vanishing Gradient: Initialization, ReLU, Gated RNNs
- Extensions: Bidirectional and Deep RNNs