Conditional Language Modeling with Attention

Conditional Language Modeling with Attention. 2017.08.25, Oxford Deep NLP. 조수현.

Review. A conditional language model assigns probabilities to a sequence of words given some conditioning context $x$: what is the probability of the next word, given the history of previously generated words and the conditioning context $x$?

$$p(\mathbf{w} \mid x) = \prod_{t=1}^{l} p(w_t \mid x, w_1, w_2, \ldots, w_{t-1})$$

where $\mathbf{w} = (w_1, w_2, \ldots, w_l)$ is the sequence of words and $x$ is the conditioning context.
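As a concrete illustration of the chain-rule factorization above, here is a minimal NumPy sketch that scores a sequence one word at a time; `next_word_probs` is a hypothetical callable standing in for whatever model provides $p(w_t \mid x, w_{<t})$:

```python
import numpy as np

def sequence_log_prob(next_word_probs, context_x, words):
    """Score a word sequence under p(w | x) = prod_t p(w_t | x, w_1, ..., w_{t-1})."""
    log_prob = 0.0
    history = []
    for w in words:
        probs = next_word_probs(context_x, history)  # distribution over the vocabulary
        log_prob += np.log(probs[w])                 # add log p(w_t | x, history)
        history.append(w)
    return log_prob
```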

Basic RNN language modeling in machine translation has two problems. Problem 1: a lot of information must be compressed into a finite-sized vector. Problem 2: gradients have a long way to travel, so learning is troublesome. The attention mechanism addresses both.

Solving the vector problem in translation: 1. represent the input sentence as a matrix; 2. condition on that matrix to generate the target sentence. This solves the capacity problem (more capacity to hold longer sentences) and the gradient flow problem.

Solving the vector problem in translation. Sentences have different lengths, but a vector has a fixed size. Instead, use a matrix with a fixed number of rows and one column per word. Q: how do we build this matrix?

Forming the matrix: concatenation (method 1). Each word type is represented by an n-dimensional vector. Take all the vectors for the sentence and concatenate them into a matrix. This is the simplest method: the i-th column of F is the i-th word's vector.
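A minimal sketch of this concatenation, assuming a hypothetical embedding table `emb` that maps word ids to n-dimensional vectors:

```python
import numpy as np

def sentence_matrix(emb, word_ids):
    """Stack word embeddings column-wise: F has shape (n, sentence_length),
    so the i-th column of F is the embedding of the i-th word."""
    return np.stack([emb[w] for w in word_ids], axis=1)

# Usage: a toy vocabulary of 5 word types with 4-dimensional embeddings.
emb = np.random.randn(5, 4)
F = sentence_matrix(emb, [2, 0, 3])   # F.shape == (4, 3)
```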

Forming the matrix: convolutional nets (method 2). Start from the simple concatenated matrix and apply convolutional networks to obtain a context-dependent matrix; stacking convolutions can eventually reduce this to a single fixed-size vector representation. See A Convolutional Encoder Model for Neural Machine Translation (Gehring et al., 2016).
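A minimal sketch of the idea (not the Gehring et al. architecture): a single 1-D convolution over the columns of the embedding matrix, so each output column depends on a window of neighbouring words; the kernel here is a random placeholder:

```python
import numpy as np

def conv1d_columns(F, W):
    """One 1-D convolution over the columns of F (shape: n x length).
    W has shape (n_out, n, k); each output column mixes a window of k
    input columns, making the representation context-dependent."""
    n_out, n, k = W.shape
    pad = k // 2
    Fp = np.pad(F, ((0, 0), (pad, pad)))
    cols = [np.tensordot(W, Fp[:, j:j + k], axes=([1, 2], [0, 1]))
            for j in range(F.shape[1])]
    return np.tanh(np.stack(cols, axis=1))   # shape: (n_out, length)

F = np.random.randn(4, 6)      # 4-dimensional embeddings, 6 words
W = np.random.randn(8, 4, 3)   # 8 output channels, kernel width 3
F_ctx = conv1d_columns(F, W)   # shape: (8, 6); stack layers to widen the receptive field
```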

Forming the matrix: bidirectional RNN (method 3). Run a forward RNN and a backward RNN over the sentence and concatenate their hidden states at each position.

Forming the matrix: bidirectional RNN (method 3). This is the most widely used matrix representation, with one column per word. GRU or LSTM cells can be used in place of the vanilla RNN.
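A minimal NumPy sketch of the bidirectional encoding, using a toy tanh RNN cell in place of a GRU/LSTM; all parameters are random placeholders:

```python
import numpy as np

def rnn_states(X, Wx, Wh):
    """Simple tanh RNN over the columns of X; returns one hidden state per word."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in X.T:                      # one word embedding at a time
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states, axis=1)    # shape: (hidden, length)

def birnn_matrix(X, Wx_f, Wh_f, Wx_b, Wh_b):
    """Concatenate forward and backward states: one column per word."""
    fwd = rnn_states(X, Wx_f, Wh_f)
    bwd = rnn_states(X[:, ::-1], Wx_b, Wh_b)[:, ::-1]   # reverse, run, un-reverse
    return np.concatenate([fwd, bwd], axis=0)           # shape: (2 * hidden, length)

X = np.random.randn(4, 6)                               # 4-dim embeddings, 6 words
Wx_f, Wh_f = np.random.randn(8, 4), np.random.randn(8, 8)
Wx_b, Wh_b = np.random.randn(8, 4), np.random.randn(8, 8)
F = birnn_matrix(X, Wx_f, Wh_f, Wx_b, Wh_b)             # shape: (16, 6)
```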

Other ways. There has been very little systematic work here and much remains unexplored: using CNNs to learn local grammatical relationships; combining word embeddings with syntactic information (e.g. embeddings plus POS tags as linguistic information); and representing phrase types rather than word types, since multi-word expressions such as "a pain in the neck" (= troublesome) are awkward to handle word by word.

Generating a sentence with the basic RNN model in machine translation: predict each word by sampling from the output distribution, then feed the representation of the sampled word in at the next time step.

Generating a sentence with the attention model in machine translation. The input representation is now the matrix $F$, and we generate from it using attention. Idea: generate the output sentence with an RNN, one word at a time. At each time step there are two inputs: the output from the previous time step (a fixed-sized vector embedding), and a fixed-sized vector encoding a view of the input matrix. $a_t$ (the attention) is a weighting of the input columns at each time step, and $F a_t$ is a weighted sum of the columns of $F$ based on how important they are at the current time step.

Start with the start symbol and the encoded sentence (the matrix $F$). Compare the decoder hidden state with the columns of the matrix to obtain the attention weighting.

Make the context vector by multiplying each column by its attention weight and summing (a weighted sum). Feed in the context vector, run the RNN as usual, compute the new hidden state, sample a word from the vocabulary, and use the sampled word at the next time step.

Use the new hidden state ($h_1$) to compute attention again: construct $a_2$, i.e. find how much weight to give to each column, and feed in the resulting context vector.

Repeat until the stop symbol is produced. By keeping track of the attention weights over time, you can inspect the history of what the model paid attention to while producing a particular output. (A minimal sketch of this decoding loop follows below.)
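Putting those steps together, here is a minimal sketch of the decoding loop; `attention_weights`, `rnn_step`, `output_dist`, and `embed` are hypothetical helpers standing in for the scoring function, the decoder cell, the softmax output layer, and the target-side embedding table:

```python
import numpy as np

def decode_with_attention(F, s0, start_id, stop_id, attention_weights,
                          rnn_step, output_dist, embed, max_len=50):
    """Sampling decoder: at every step, attend over the columns of F,
    build the context vector F @ a_t, and feed it to the RNN."""
    s, w, output = s0, start_id, []
    for _ in range(max_len):
        a_t = attention_weights(s, F)               # weights over the input columns
        c_t = F @ a_t                               # context vector: weighted sum of columns
        s = rnn_step(s, embed[w], c_t)              # update the decoder hidden state
        probs = output_dist(s, embed[w], c_t)       # distribution over the vocabulary
        w = np.random.choice(len(probs), p=probs)   # predict the next word by sampling
        if w == stop_id:
            break
        output.append(w)
    return output
```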

Computing attention. At each time step we want to attend to different words in the source sentence, so we need a weight for every column of $F$.

$s_i$: decoder hidden state.
$h_j = [\overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top]^\top$ (the columns of $F$): summarizes information about the preceding and following words.
$e_{ij}$ $(= u_t)$ $= a(s_{i-1}, h_j)$; with a linear scoring model the whole score vector is $F^\top r_i = F^\top V s_{i-1}$, where $r_i = V s_{i-1}$. $e_{ij}$ indicates how important $h_j$ is (an un-normalized attention weight).
$\alpha_{ij}$ $(= a_t)$ $= \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$: normalization.
$c_i$ $(= c_t)$ $= \sum_{j=1}^{T_x} \alpha_{ij} h_j$: weighted sum of the $h_j$.
$s_i = f(s_{i-1}, y_{i-1}, c_i)$
$p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$

The linear scoring model is simple but does not work well in practice.

Computing attention with an MLP score. Same setup as above, but the score is computed with a small neural network:

$e_{ij}$ $(= u_t)$ $= v^\top \tanh(W h_j + r_i)$, or over the whole matrix $v^\top \tanh(W F + r_i)$ with $r_i = V s_{i-1}$ broadcast across columns; $v$ and $W$ are learned parameters. $e_{ij}$ indicates how important $h_j$ is (an un-normalized attention weight).

The normalization $\alpha_{ij}$, the context vector $c_i$, the state update $s_i = f(s_{i-1}, y_{i-1}, c_i)$, and the output distribution $p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$ are the same as above.
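A minimal NumPy sketch of this MLP ("Bahdanau-style") scoring, followed by the softmax normalization and the weighted sum; $V$, $W$, and $v$ are random placeholder parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp_attention(F, s_prev, V, W, v):
    """e_j = v^T tanh(W h_j + V s_{i-1}); alpha = softmax(e); c = sum_j alpha_j h_j."""
    r = V @ s_prev                          # r_i = V s_{i-1}
    e = v @ np.tanh(W @ F + r[:, None])     # un-normalized score for every column
    alpha = softmax(e)                      # attention weights
    c = F @ alpha                           # context vector: weighted sum of columns
    return alpha, c

# Toy shapes: 16-dim encoder columns, 8-dim decoder state, 10-dim attention layer.
F = np.random.randn(16, 6); s_prev = np.random.randn(8)
V = np.random.randn(10, 8); W = np.random.randn(10, 16); v = np.random.randn(10)
alpha, c = mlp_attention(F, s_prev, V, W, v)   # alpha.shape == (6,), c.shape == (16,)
```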

Putting it all together

Model variant: compute the attention weights and the context vector as a function of the RNN's previous hidden state, but do not feed the context vector into the hidden state at time t; use the information from the context only when deciding what to generate. This costs more time at test and less time in training.

Summary. Attention is good for interpretability: the attention weights provide an interpretation you can look at. Attention is closely related to pooling in convnets. Bahdanau's attention model only seems to care about content; some work has begun to add other structural biases (Luong et al., 2015). https://medium.com/@ozinkegliyin/six-challenges-for-neural-machine-translation-8a780ead92ab

Solving the gradient flow problem. Situation: each column is weighted by a scalar attention weight from the attention mask. Suppose there is a large error in the cross-entropy loss (i.e. a problem with the model's parameters). The error is back-propagated down to the representations of the words, with a stronger gradient reaching the words that received more attention. This gives a much more direct connection between an output time step and the relevant inputs, which helps with the forgetting problem of the LSTM.

Image caption generation with attention. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al., 2015). Move over the whole image and compute representations that are functions of local receptive fields, building the matrix up column by column: $F = [a_1]$, $F = [a_1\ a_2]$, $F = [a_1\ a_2\ a_3]$, ...

Hard attention vs. soft attention.

Soft attention (Bahdanau et al., 2014): deterministic and differentiable. The attention term and the loss function are differentiable functions of the inputs, so all gradients exist and standard back-propagation applies; $c_t = F a_t$ (a weighted average).

Hard attention (Xu et al., 2015): not differentiable, and we do not know the correct attention decisions. $s_t \sim \mathrm{Categorical}(a_t)$ and $c_t = F_{:, s_t}$, i.e. sample a single column. Sample $N$ sequences of attention decisions from the model; the gradient is estimated through the probability of each sampled sequence. This is reinforcement learning, with the log probability of the word as the reward function. https://stackoverflow.com/questions/35549588/soft-attention-vs-hard-attention
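A minimal sketch of the two context-vector computations, given attention weights `a_t` and the encoder matrix `F` from earlier:

```python
import numpy as np

def soft_context(F, a_t):
    """Soft attention: deterministic weighted average of the columns of F."""
    return F @ a_t

def hard_context(F, a_t, rng=np.random.default_rng()):
    """Hard attention: sample one column index s_t ~ Categorical(a_t) and
    use that single column as the context (not differentiable in a_t)."""
    s_t = rng.choice(F.shape[1], p=a_t)
    return F[:, s_t]
```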

Hard attention (continued). The training objective is the marginal log-likelihood, lower-bounded with Jensen's inequality and estimated with Monte Carlo samples:

$$\mathcal{L} = \log p(w \mid x) = \log \sum_s p(w, s \mid x) = \log \sum_s p(s \mid x)\, p(w \mid x, s) \geq \sum_s p(s \mid x) \log p(w \mid x, s) \approx \frac{1}{N} \sum_{i=1}^{N} \log p(w \mid x, s^{(i)}), \quad s^{(i)} \sim p(s \mid x).$$

The inequality is Jensen's inequality for a concave function (here $f = \log$): $f\!\left(\sum_{i=1}^{n} p_i x_i\right) \geq \sum_{i=1}^{n} p_i f(x_i)$; the last step is the MC approximation. http://suhak.tistory.com/221

Result

Results and conclusion: significant performance improvements, more interpretability, better gradient flow, and better capacity.

Q&A