Generating Sequences with Recurrent Neural Networks

Generating Sequences with Recurrent Neural Networks
Alex Graves, University of Toronto & Google DeepMind
Presented by Zhe Gan, Duke University, May 15, 2015

Outline
- Deep recurrent neural networks based on Long Short-Term Memory
- Text prediction (where the data are discrete)
- Handwriting prediction (where the data are real-valued)
- Handwriting synthesis (generation of handwriting for a given text)

Recurrent Neural Network
Given an input sequence x and an output sequence y, an RNN aims to find a deterministic function that maps x to y.
White circles: input sequence; black circles: hidden states; grey circles: output sequence.
Figure: Schematic illustration of a DRNN, taken from M. Hermans and B. Schrauwen (2013, [3]). Left: a standard 1-layer RNN folded out in time. Middle: a DRNN of 3 layers folded out in time.

Prediction Network
Here we focus on generating sequences: repeatedly predicting what will happen next, treating past predictions as if they were real inputs.
This amounts to sampling from a conditional model Pr(x) = ∏_t Pr(x_t | x_{1:t-1}).
The network is deep in both space (vertically) and time (horizontally).
Figure: Deep recurrent neural network prediction architecture. The circles represent network layers, the solid lines represent weighted connections.
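As a rough illustration of this sampling loop (not from the slides), the sketch below feeds each sampled symbol back in as the next input; step and sample_from are hypothetical stand-ins for the network's forward pass and for drawing from Pr(x_{t+1} | y_t).

```python
def generate(step, sample_from, x0, state, length):
    """Autoregressive sampling sketch.
    step(x, state) -> (y, state): one forward pass of the prediction network.
    sample_from(y) -> x_next: a draw from Pr(x_{t+1} | y_t)."""
    x, seq = x0, []
    for _ in range(length):
        y, state = step(x, state)   # predict the distribution over the next symbol
        x = sample_from(y)          # sample from it ...
        seq.append(x)               # ... and feed the sample back in as if it were real data
    return seq
```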

Prediction Network
Notation: input sequence x = (x_1, ..., x_T); output sequence y = (y_1, ..., y_T); hidden layers h^n = (h^n_1, ..., h^n_T), n = 1, ..., N.
Given the input sequence, the hidden layer activations are computed by
    h^1_t = H(W_{i h^1} x_t + W_{h^1 h^1} h^1_{t-1} + b^1_h)   (1)
    h^n_t = H(W_{i h^n} x_t + W_{h^{n-1} h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)   (2)
Given the hidden sequences, the output sequence is computed by
    ŷ_t = b_y + Σ_{n=1}^N W_{h^n y} h^n_t,   y_t = Y(ŷ_t)   (3)
y_t is used to parameterize the predictive distribution Pr(x_{t+1} | y_t).
Goal: minimize the negative log-likelihood
    L(x) = − Σ_{t=1}^T log Pr(x_{t+1} | y_t)   (4)
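A minimal NumPy sketch of equations (1)-(3), with a generic element-wise nonlinearity H standing in for the full LSTM cell; all argument names are illustrative.

```python
import numpy as np

def deep_rnn_forward(x_seq, W_ih, W_hh, W_skip, W_hy, b_h, b_y, H=np.tanh):
    """Skeleton of Eqs. (1)-(3): N stacked recurrent layers, with skip connections
    from the input to every layer and from every hidden layer to the output.
    x_seq: (T, d_in); the lists W_* and b_h hold one array per layer."""
    N, T = len(W_hh), len(x_seq)
    h_prev = [np.zeros(W_hh[n].shape[0]) for n in range(N)]
    yhat = []
    for t in range(T):
        h = []
        for n in range(N):
            a = W_ih[n] @ x_seq[t] + W_hh[n] @ h_prev[n] + b_h[n]  # Eqs. (1)/(2)
            if n > 0:
                a += W_skip[n] @ h[n - 1]                          # h^{n-1}_t term in Eq. (2)
            h.append(H(a))
        yhat.append(b_y + sum(W_hy[n] @ h[n] for n in range(N)))   # Eq. (3), before applying Y
        h_prev = h
    return np.array(yhat)
```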

Long Short-Term Memory (LSTM)
In most RNNs, H is an element-wise sigmoid function. Such networks are
- unable to store information about past inputs for very long;
- not robust to mistakes.
Idea of LSTM (S. Hochreiter & J. Schmidhuber, 1997, [4]): use linear memory cells surrounded by multiplicative gate units to store, read, write, and reset information.
- Input gate: scales input to the cell (write)
- Output gate: scales output from the cell (read)
- Forget gate: scales the old cell value (reset)

Long Short-Term Memory (LSTM)
For the LSTM variant of F. Gers et al. (2003, [2]) used here, H is implemented by the gating equations below.
The network is trained with the backpropagation through time (BPTT) algorithm.
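For reference, the standard peephole LSTM equations (as in Gers et al. [2] and the Graves paper), where σ is the logistic sigmoid and i, f, o, c denote the input gate, forget gate, output gate, and cell:

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t \, c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \tanh(c_t)
\end{aligned}
```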

Text Prediction (Language Modeling)
Task: generate text sequences one character at a time.
Discrete data are modelled with length-K one-hot vectors, and the predictive distribution is a softmax over the K output units:
    Pr(x_{t+1} = k | y_t) = exp(ŷ^k_t) / Σ_{k'=1}^K exp(ŷ^{k'}_t)   (5)
Word-level modelling: K can be as large as 100,000, so there are many parameters.
Character-level modelling: K is small, and it allows the network to invent novel words and strings.
Penn Treebank experiments:
- dataset size: a little over a million words in total
- vocabulary size: 10,000 words; character set size: 49
- architecture: a single hidden layer with 1000 LSTM units
- trained with SGD; dynamic evaluation is allowed
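A small sketch of Eq. (5), sampling one character from the raw output vector ŷ_t; logits and alphabet are illustrative names.

```python
import numpy as np

def sample_char(logits, alphabet, rng=np.random.default_rng()):
    """Eq. (5): softmax over the K output units, then sample one character."""
    z = logits - logits.max()            # stabilise the exponentials
    p = np.exp(z) / np.exp(z).sum()      # Pr(x_{t+1} = k | y_t)
    return alphabet[rng.choice(len(alphabet), p=p)]

# e.g. K = 49 for character-level Penn Treebank: next_char = sample_char(yhat_t, ptb_chars)
```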

Penn Treebank Experiments
Metrics: (i) bits-per-character (BPC), the average of −log_2 Pr(x_{t+1} | y_t) over the whole test set; (ii) word-level perplexity, approximately 2^{5.6 BPC}, assuming an average of about 5.6 characters per word.

Wikipedia Experiments - Generating Text
Dataset size: 100M bytes, split into 96M for training and 4M for validation.
Besides English words, the data also contain foreign words, indented XML tags, website addresses, and markup used to indicate page formatting.

Wikipedia Experiments - Generating Text
Model size: input and output layers of size 205, with seven hidden layers of 700 LSTM cells each, approx. 21.3M weights, trained by SGD.
Results (from a four-page-long sample):
- samples only make sense at the level of short phrases;
- the network learned a large vocabulary of words and also invented feasible-looking words and names;
- it learned basic punctuation, with commas etc.;
- it is able to correctly open and close quotation marks and parentheses.

Wikipedia Experiments - Generating Text
Results (cont.):
- it also balances formatting marks, such as the equals signs used to denote headings;
- it generates convincing-looking internet addresses;
- it generates non-English characters.

Handwriting Prediction
Task: generate pen trajectories by predicting the (x, y) pen offset and the end-of-stroke marker one point at a time.
Data: IAM online handwriting database, about 10K training sequences, many writers, unconstrained style, captured from a whiteboard.

Recurrent Mixture Density Networks
MDN (C. Bishop, 1994, [1]): use the outputs of a neural network to parameterize a Gaussian mixture model. Extended to RNNs by M. Schuster (1999, [5]).
The output function Y maps the raw network outputs ŷ_t to the mixture parameters y_t, which define the predictive distribution over the next point.
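For the handwriting case the predictive distribution takes the form given in the Graves paper: a mixture of M bivariate Gaussians over the pen offset, combined with a Bernoulli end-of-stroke probability e_t:

```latex
\Pr(x_{t+1} \mid y_t) \;=\; \sum_{j=1}^{M} \pi_t^j \,
  \mathcal{N}\!\left(x_{t+1} \mid \mu_t^j, \sigma_t^j, \rho_t^j\right)
  \times
  \begin{cases}
    e_t & \text{if } (x_{t+1})_3 = 1 \\
    1 - e_t & \text{otherwise}
  \end{cases}
```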

Handwriting Prediction
Network details:
- 3 inputs: the (x, y) offset and the pen up/down bit
- 121 output units: 20 two-dimensional Gaussians for (x, y), i.e. 40 means (linear) + 40 standard deviations (exp) + 20 correlations (tanh) + 20 mixture weights (softmax), plus 1 sigmoid for pen up/down
- 3 hidden layers with 400 LSTM cells each, 3.6M weights in total
Figure: generated samples.
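A sketch of this output parameterisation, splitting the 121 raw outputs into mixture parameters with the squashing functions listed above; the ordering of the slices is an assumption for illustration.

```python
import numpy as np

def split_outputs(yhat, M=20):
    """Map the 121 raw outputs to a 20-component bivariate GMM plus an
    end-of-stroke probability (slice order is illustrative)."""
    e_hat, pi_hat, mu_hat, sigma_hat, rho_hat = np.split(
        yhat, [1, 1 + M, 1 + 3 * M, 1 + 5 * M])
    e = 1.0 / (1.0 + np.exp(-e_hat[0]))            # sigmoid: pen up/down probability
    pi = np.exp(pi_hat) / np.exp(pi_hat).sum()     # softmax: 20 mixture weights
    mu = mu_hat.reshape(M, 2)                      # linear: 40 means
    sigma = np.exp(sigma_hat).reshape(M, 2)        # exp: 40 standard deviations
    rho = np.tanh(rho_hat)                         # tanh: 20 correlations
    return e, pi, mu, sigma, rho
```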

Handwriting Prediction
Top: density map; bottom: mixture component weights.

Handwriting Synthesis
Task: tell the network what to write without losing the distribution over how it writes.
How: condition the predictions on a text sequence.
Challenges: (i) the text and the writing have different lengths; (ii) the alignment between them is unknown.
Solution: before each prediction, let the network decide where it is in the text sequence, via soft windows.

Handwriting Synthesis - Soft Windows
Given a length-U character sequence c and a length-T data sequence x, the soft window w_t is defined as
    w_t = Σ_{u=1}^U φ(t, u) c_u   (6)
with window weights
    φ(t, u) = Σ_{k=1}^K α^k_t exp(−β^k_t (κ^k_t − u)^2)
φ(t, u) can be read as the network's belief that it is writing character c_u at time t.
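A minimal sketch of Eq. (6): mixture-of-Gaussians attention over the character sequence; c_onehot is an assumed (U, C) one-hot encoding of the text, and alpha, beta, kappa mirror α^k_t, β^k_t, κ^k_t.

```python
import numpy as np

def soft_window(alpha, beta, kappa, c_onehot):
    """Eq. (6): w_t = sum_u phi(t, u) c_u, with
    phi(t, u) = sum_k alpha_k * exp(-beta_k * (kappa_k - u)^2).
    alpha, beta, kappa: (K,) window parameters at time t; c_onehot: (U, C)."""
    U = c_onehot.shape[0]
    u = np.arange(1, U + 1)                       # character positions 1..U
    phi = (alpha[:, None]
           * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(axis=0)  # (U,)
    return phi @ c_onehot                         # window vector w_t, shape (C,)
```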

Handwriting Synthesis - Network Architecture
Figure: the synthesis network architecture.

Handwriting Synthesis - Alignment
Figure: each point on the map shows the value of φ(t, u).

Handwriting Synthesis
Top: density map; bottom: mixture component weights.

Handwriting Synthesis
Figure: unbiased, biased, and primed sampling.

References
[1] Christopher M. Bishop. Mixture density networks. 1994.
[2] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. JMLR, 2003.
[3] Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In NIPS, 2013.
[4] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
[5] Mike Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In NIPS, 1999.

Some slides are based on the materials below:
http://www.cs.toronto.edu/~graves/gen_seq_rnn.pdf
https://www.youtube.com/watch?v=-yx1syedhbg