CS230: Lecture 8 Word2Vec applications + Recurrent Neural Networks with Attention


Today's outline
We will learn how to: generalize results with word vectors, and augment an RNN with attention mechanisms.
I. Word Vector Representation: i. Training, ii. Operations, iii. Applications (debiasing / restaurant reviews)
II. Attention: i. Machine Translation, ii. Image Captioning

Word Vector Representation
Vocabulary V_eng = ("a", "abs", ..., "football", ..., "zone", "zoo"). A word such as "football" can be represented either by its one-hot vector e_football (all zeros except a 1 at the word's index in the vocabulary) or by a dense word vector v_football (with entries such as 0.3, 0.84, ..., 0.57). How do you get the vector representation?

Word Vector Representation: Training
1. Data? Slide a window over a corpus, e.g. "Stanford is going to beat Cal next week", to build (target word x, nearby word y) pairs: (Stanford, is), (is, Stanford), (is, going), (going, is), (going, to), ...
2. Model? From the one-hot encoding of the target word, compute z^[1] = W^[1] v + b^[1] and take a softmax over the vocabulary, giving p("a" | "football"), p("abs" | "football"), p("abyss" | "football"), ..., p("zebra" | "football"), p("zone" | "football"), p("zoo" | "football").
3. Loss? Cross-entropy against the one-hot nearby word: L = - Σ y log(ŷ).
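As a concrete illustration of this training setup, here is a minimal PyTorch sketch of a skip-gram-style model trained on (target, nearby) pairs with a softmax and cross-entropy loss; the vocabulary, dimensions, and hyperparameters are toy values, not the ones from the lecture.

import torch
import torch.nn as nn

# Toy (target word, nearby word) pairs from the sliding-window example
# "Stanford is going to beat Cal next week"; vocabulary and sizes are made up.
vocab = ["stanford", "is", "going", "to", "beat", "cal", "next", "week"]
word2idx = {w: i for i, w in enumerate(vocab)}
pairs = [("stanford", "is"), ("is", "stanford"), ("is", "going"),
         ("going", "is"), ("going", "to")]
V, d = len(vocab), 10                      # vocabulary size, embedding dimension

class SkipGram(nn.Module):
    def __init__(self, V, d):
        super().__init__()
        self.embed = nn.Embedding(V, d)    # plays the role of W[1]: one vector per word
        self.out = nn.Linear(d, V)         # scores z over the whole vocabulary
    def forward(self, x):
        return self.out(self.embed(x))     # logits; the softmax lives inside the loss

model = SkipGram(V, d)
loss_fn = nn.CrossEntropyLoss()            # L = -sum y log(y_hat) with one-hot y
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.tensor([word2idx[t] for t, _ in pairs])
y = torch.tensor([word2idx[c] for _, c in pairs])
for _ in range(100):                       # a few gradient steps on the toy data
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()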

e "a" e "abs" " e "zone" e "zoo" = e football W [] v football = Word Vector Representation: Embedding matrix

Word Vector Representation: Visualization and Operations
Scatter plot of word vectors: related words land close together (e.g. France, Paris, Italia, Rome, Vegas, casino, baguette; dog, cat, lion, zoo, park; goal, ball, football; maths, deep-learning, computer-science).
Operations on vectors: man - king = woman - x, so x = queen.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean: Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean: Efficient Estimation of Word Representations in Vector Space
Laurens van der Maaten, Geoffrey Hinton: Visualizing Data using t-SNE
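A sketch of how such analogy operations can be computed with cosine similarity; the toy vectors below are hand-crafted so the analogy works, whereas in practice they come from training:

import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by maximizing cosine(v_b - v_a + v_c, v_w)."""
    target = vectors[b] - vectors[a] + vectors[c]
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

# Hand-crafted toy vectors so the analogy holds; real ones come from training.
vectors = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([2.0, 0.0, 0.9]),
    "queen": np.array([2.0, 1.0, 0.9]),
    "zoo":   np.array([0.1, 0.3, 7.0]),
}
print(analogy("man", "king", "woman", vectors))   # -> "queen"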

Word Vector Representation: Bias
Word vectors can encode sexist bias: man - king = woman - x gives x = queen, but also man - woman = computer programmer - homemaker.
Tolga Bolukbasi et al.: Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
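A simplified sketch of one way to reduce such bias, projecting out an estimated bias direction; this is only in the spirit of the "neutralize" step of Bolukbasi et al. (their full method also equalizes word pairs), and the vectors below are random placeholders:

import numpy as np

def neutralize(v, bias_direction):
    """Remove the component of v along the bias direction (simplified 'neutralize' step)."""
    b = bias_direction / np.linalg.norm(bias_direction)
    return v - (v @ b) * b

# The bias direction is often estimated from gendered pairs such as v_woman - v_man
# (or via PCA over several pairs); the random vectors here are placeholders.
d = 50
v_man, v_woman = np.random.randn(d), np.random.randn(d)
gender_direction = v_woman - v_man
v_programmer = np.random.randn(d)
v_programmer_debiased = neutralize(v_programmer, gender_direction)
# After neutralizing, the vector has (numerically) no component along the bias direction.
assert abs(v_programmer_debiased @ (gender_direction / np.linalg.norm(gender_direction))) < 1e-8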

Word Vector Representation: Application
Sentiment analysis on restaurant reviews.
1. Data? Labeled reviews (x = review, y = positive/negative), e.g. "Tonight was awesome" (positive), "Worst entrée ever" (negative).
2. Architecture? Map each word to its word vector ("Tonight" -> e_tonight, "was" -> e_was, "awesome" -> e_awesome), average them into e, and predict ŷ = σ(W e + b) > 0.5. This generalizes very well because of the word vector representations.
3. Loss? L = -[y log(ŷ) + (1 - y) log(1 - ŷ)]
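A minimal NumPy sketch of this architecture, averaging pretrained word vectors and applying a logistic unit; the vectors, weights, and review below are illustrative, not learned values:

import numpy as np

def predict_sentiment(review, vectors, W, b):
    """Average the review's word vectors, then apply a logistic unit:
    y_hat = sigmoid(W . e_avg + b); predict positive if y_hat > 0.5."""
    words = [w for w in review.lower().split() if w in vectors]
    e_avg = np.mean([vectors[w] for w in words], axis=0)
    return 1.0 / (1.0 + np.exp(-(W @ e_avg + b)))

def bce_loss(y, y_hat):
    """L = -[y log(y_hat) + (1 - y) log(1 - y_hat)]"""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy pretrained vectors and parameters; in practice the word vectors come from
# word2vec-style training and (W, b) are learned on the labeled reviews.
d = 4
vectors = {w: np.random.randn(d) for w in ["tonight", "was", "awesome"]}
W, b = np.random.randn(d), 0.0
y_hat = predict_sentiment("Tonight was awesome", vectors, W, b)
print(y_hat, bce_loss(1, y_hat))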

Word Vector Representation: Application
To remember:
- In NLP, words are often represented by meaningful vectors
- These vectors are trained with a neural network
- We can do operations on these vectors
- They can be biased, depending on the dataset used to train them
- They have great generalization power

Attention: Motivation
Neural Machine Translation. [Figure: an encoder RNN reads the source sentence "... is in the kitchen <eos>" into hidden states h_1, ..., h_5; its final state is passed to a decoder RNN whose states s_1, ..., s_6 emit the translation "... dans la cuisine <eos>" one word at a time.]

Attention: Motivation
Neural Machine Translation. [Same encoder-decoder figure, with the source sentence "... is in the kitchen <eos>" fed to the encoder in reverse order.] Inverting the input works better?
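For reference, a minimal PyTorch sketch of such a fixed-vector encoder-decoder; it uses GRUs rather than the LSTMs discussed later, and all names and sizes are made up:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Plain encoder-decoder: the whole source sentence is squeezed into the
    encoder's final hidden state, which initializes the decoder."""
    def __init__(self, src_vocab, tgt_vocab, d=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d)
        self.tgt_embed = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)
    def forward(self, src, tgt):
        _, h = self.encoder(self.src_embed(src))        # h: fixed-length summary of the source
        dec_states, _ = self.decoder(self.tgt_embed(tgt), h)
        return self.out(dec_states)                     # one set of logits per target position

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 5))    # batch of 2 source sentences of length 5
tgt = torch.randint(0, 1200, (2, 6))    # shifted target sentences (teacher forcing)
logits = model(src, tgt)                # shape (2, 6, 1200)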

Neural Machine Translation with Attention: Problems & Ideas
Some problems:
- The encoder encodes all the information in the source sentence into a single fixed-length vector
- Yet some specific parts of the source sentence seem more useful than others for predicting some parts of the output sentence
- Performance degrades on long sentences
Ideas:
- We'd like to spread out the information encoded from the source sentence and selectively retrieve the relevant parts at each prediction step of the output sentence
- Why don't we use every hidden state h_j from the encoding part?

Attention: Motivation
[Figure: encoder states h_1, ..., h_5 and decoder states s_1, ..., s_6 as before.] Instead of relying only on the last encoder state, combine all of them into a context vector for the first decoding step: c_1 = Σ_j α_{1,j} h_j.

Attention: Motivation
A context vector c_i = Σ_j α_{i,j} h_j is generated at each step during decoding.
- h_4 contains information about the input sentence up to "is", with a stronger focus on the parts closer to "is".
- α_{2,3} = f(s_1, h_3) = the probability that the target word y_2 is translated from the source word x_3 ("in"); it is a score passed through a normalizing function.
- c_1 = the expected annotation over all annotations h_j, with probabilities α_{1,j}.
[Figure: the decoder outputs "dans la cuisine <eos>" while attending over the encoder states of "is in the kitchen <eos>".]
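A small NumPy sketch of one decoding step of this mechanism, using an additive score of the form given later in the training slide, f(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j); all dimensions are toy values:

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_step(s_prev, H, W_a, U_a, v_a):
    """One decoding step of additive attention:
    e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j),  alpha = softmax(e),  c_i = sum_j alpha_j h_j."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alphas = softmax(scores)        # attention weights over the source positions
    c = alphas @ H                  # expected annotation (context vector)
    return c, alphas

# Toy dimensions: 5 encoder states of size 8, decoder state of size 8, score size 16.
Tx, dh, da = 5, 8, 16
H = np.random.randn(Tx, dh)                 # h_1, ..., h_Tx (one per row)
s_prev = np.random.randn(dh)                # previous decoder state s_{i-1}
W_a, U_a, v_a = np.random.randn(da, dh), np.random.randn(da, dh), np.random.randn(da)
c_i, alphas = attention_step(s_prev, H, W_a, U_a, v_a)
print(alphas.sum())                         # the weights sum to 1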

Neural Machine Translation with Attention: Architecture
c_i = the expected annotation over all annotations, with probabilities α_{i,j} = f(s_{i-1}, h_j).
[Figure, built up over several slides: at each decoding step i, an attention mechanism weights the encoder states h_1, ..., h_5 to produce the context vector c_i, which together with the previous decoder state s_{i-1} gives the next state s_i and the next output word ("dans", "la", ...).]

Neural Machine Translation with Attention: Architecture
[Figure: one attention mechanism per decoding step, each computing c_i from h_1, ..., h_5 and the previous decoder state.]
How to train this? The same way as plain Machine Translation, except that derivatives are also taken with respect to the attention parameters.

Neural Machine Translation with Attention: Training
What are the parameters?
Encoder: W^[1] = embedding matrix, plus an LSTM:
f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
C~_t = tanh(W_C [h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C~_t
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
Attention:
c_i = Σ_{j=1..T_x} α_{ij} h_j
α_{ij} = exp(e_{ij}) / Σ_{k=1..T_x} exp(e_{ik})
e_{ij} = v_a^T tanh(W_a s_{i-1} + U_a h_j)
Decoder: another LSTM of the same form, with hidden states s_t.
Loss function: L = - Σ_{batch} log P(y | x; θ) + λ Σ_{i ∈ patches} (1 - Σ_{t ∈ output} α_{t,i})^2
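A short PyTorch sketch of this regularized loss for one training example, assuming the attention weights of all output steps have been stacked into a matrix; shapes and the example values are illustrative:

import torch

def attention_regularized_loss(log_probs, alphas, lam=1.0):
    """Negative log-likelihood plus the doubly-stochastic attention penalty
    lam * sum_i (1 - sum_t alpha_{t,i})^2, which encourages every annotation i
    to receive attention at some output step t.
    log_probs: (T_out,) values log P(y_t | ...);  alphas: (T_out, N) attention weights."""
    nll = -log_probs.sum()
    penalty = lam * ((1.0 - alphas.sum(dim=0)) ** 2).sum()
    return nll + penalty

# Toy usage: 6 output words attending over 5 annotations.
log_probs = torch.log(torch.rand(6))
alphas = torch.softmax(torch.rand(6, 5), dim=1)
print(attention_regularized_loss(log_probs, alphas, lam=0.1))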

Neural Machine Translation with Attention: Training. Visualizing Attention
[Figure: visualization of the attention weights between source and target words.]
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate

Image Captioning with Attention
Recap, Neural Machine Translation: [figure of the encoder-decoder translating "... is in the kitchen <eos>" into "... dans la cuisine <eos>", with the encoder on the source side and the decoder on the target side].

Image Captioning with Attention
Image Captioning: [same figure layout, with the encoder input replaced by an image and the decoder emitting the caption word by word up to <eos>].

Image Captioning with Attention
Image Captioning with no attention: [figure: a CNN encodes the image into a single vector that initializes the decoder, which generates the caption "a bird flying over a ..."].
Vinyals et al.: Show and Tell: A Neural Image Caption Generator

Image Captioning with Attention
Image Captioning with attention: [figure: a CNN produces a set of annotation vectors {a_1, a_2, ..., a_L}, one per image region; their average initializes the decoder state, and at each decoding step an attention mechanism weights {a_1, ..., a_L} into a context vector c_i used to generate the caption "A bird is flying ..."].
Vinyals et al.: Show and Tell: A Neural Image Caption Generator
Kelvin Xu et al.: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
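A minimal PyTorch sketch of attention over the CNN annotation vectors at one decoding step; the feature-map size, dimensions, and layer names are assumptions for illustration, not the exact architecture of Xu et al.:

import torch
import torch.nn as nn

# One decoding step of attention over L CNN annotation vectors a_1, ..., a_L.
L, D, dh, da = 196, 512, 256, 64         # e.g. a 14x14 feature map with 512 channels
features = torch.randn(1, L, D)          # {a_1, ..., a_L} from a convolutional layer
s_prev = torch.randn(1, dh)              # previous decoder (LSTM) state; s_0 can be
                                         # initialized from the average of the a_l
W_a = nn.Linear(dh, da)                  # projects the decoder state
U_a = nn.Linear(D, da)                   # projects each annotation vector
v_a = nn.Linear(da, 1)                   # turns each combination into a scalar score

scores = v_a(torch.tanh(W_a(s_prev).unsqueeze(1) + U_a(features)))  # (1, L, 1)
alphas = torch.softmax(scores, dim=1)    # where to look in the image
context = (alphas * features).sum(dim=1) # (1, D) weighted image summary fed to the decoder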