Neural Networks Language Models
Philipp Koehn
10 October 2017

N-Gram Backoff Language Model

Previously, we approximated

  p(W) = p(w_1, w_2, ..., w_n)

by applying the chain rule

  p(W) = \prod_i p(w_i | w_1, ..., w_{i-1})

... and limiting the history (Markov order)

  p(w_i | w_1, ..., w_{i-1}) ≈ p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1})

- Each p(w_i | w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}) may not have enough statistics to estimate
- → we back off to p(w_i | w_{i-3}, w_{i-2}, w_{i-1}), p(w_i | w_{i-2}, w_{i-1}), etc., all the way to p(w_i)
- exact details of backing off get complicated: interpolated Kneser-Ney
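To make the backoff idea concrete, here is a minimal sketch of the simplest possible fallback chain (not interpolated Kneser-Ney, which additionally discounts and interpolates): estimate p(w_i | history) from counts and drop the oldest history word whenever the longer history was never observed. The function names and the toy corpus are made up for illustration.

```python
from collections import Counter

def train_counts(tokens, max_order=3):
    """Count all n-grams up to max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_prob(counts, word, history, total_tokens):
    """p(word | history), backing off to shorter histories,
    down to the unigram relative frequency.
    A real scheme would discount and interpolate instead of
    returning 0 when the continuation count is zero."""
    while history:
        hist_count = counts[tuple(history)]
        if hist_count > 0:
            return counts[tuple(history) + (word,)] / hist_count
        history = history[1:]          # drop the oldest word and back off
    return counts[(word,)] / total_tokens

tokens = "the cat sat on the mat the cat ate".split()
counts = train_counts(tokens)
print(backoff_prob(counts, "sat", ["the", "cat"], len(tokens)))   # 0.5
```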

Refinements

- A whole family of back-off schemes
- Skip-n-gram models that may back off to p(w_i | w_{i-2})
- Class-based models: p(C(w_i) | C(w_{i-4}), C(w_{i-3}), C(w_{i-2}), C(w_{i-1}))

→ We are wrestling here with using as much relevant evidence as possible, pooling evidence between words

First Sketch

[Figure: Words 1-4 feed into a hidden layer, which predicts Word 5]

Representing Words

Words are represented with a one-hot vector, e.g.,
- dog = (0,0,0,0,1,0,0,0,0,...)
- cat = (0,0,0,0,0,0,0,1,0,...)
- eat = (0,1,0,0,0,0,0,0,0,...)

That's a large vector! Remedies:
- limit to, say, 20,000 most frequent words, rest are OTHER
- place words in n classes, so each word is represented by one class label and one word-in-class label
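A quick sketch of the one-hot representation described above, using a small hypothetical vocabulary where all out-of-vocabulary words map to the OTHER index:

```python
import numpy as np

vocab = ["eat", "dog", "cat", "OTHER"]               # hypothetical toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index.get(word, word_to_index["OTHER"])] = 1.0
    return vec

print(one_hot("dog"))        # [0. 1. 0. 0.]
print(one_hot("unicorn"))    # out of vocabulary, falls back to OTHER: [0. 0. 0. 1.]
```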

Word Classes for Two-Hot Representations

- WordNet classes
- Brown clusters
- Frequency binning (sketched below):
  - sort words by frequency
  - place them in order into classes
  - each class has the same token count
  - → very frequent words have their own class
  - → rare words share a class with many other words
- Anything goes: assign words randomly to classes
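A minimal sketch of the frequency-binning option: sort words by frequency and fill classes in that order so every class covers roughly the same number of tokens. Function name, toy corpus, and the number of classes are made up for illustration.

```python
from collections import Counter

def frequency_bins(tokens, num_classes):
    """Assign words to classes so each class covers ~equal token mass."""
    counts = Counter(tokens)
    tokens_per_class = len(tokens) / num_classes
    word_class, mass, current = {}, 0.0, 0
    for word, count in counts.most_common():     # most frequent words first
        word_class[word] = current
        mass += count
        # move to the next class once this one has its share of the token mass
        if mass >= tokens_per_class * (current + 1) and current < num_classes - 1:
            current += 1
    return word_class

tokens = "the the the the cat cat sat on on a a mat".split()
print(frequency_bins(tokens, num_classes=3))
# e.g. {'the': 0, 'cat': 1, 'on': 1, 'a': 2, 'sat': 2, 'mat': 2}
```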

Second Sketch

[Figure: Words 1-4 feed into a hidden layer, which predicts Word 5]

word embeddings

Add a Hidden Layer

[Figure: Words 1-4 are each mapped by the matrix C into a hidden layer, which predicts Word 5]

- Map each word first into a lower-dimensional real-valued space
- Shared weight matrix C

Details (Bengio et al., 2003)

- Add direct connections from embedding layer to output layer
- Activation functions:
  - input → embedding: none
  - embedding → hidden: tanh
  - hidden → output: softmax
- Training:
  - loop through the entire corpus
  - update based on the difference between predicted probabilities and the 1-hot vector for the output word
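To make the architecture concrete, here is a sketch of a single forward pass of a Bengio-style feedforward language model in numpy: shared embedding matrix C, tanh hidden layer, softmax output, plus the direct embedding-to-output connections mentioned above. Dimensions and the random weights are made-up placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H, n = 10_000, 100, 200, 4           # vocab size, embedding dim, hidden size, context length

C     = rng.normal(0, 0.1, (V, d))         # shared embedding matrix (one row per word)
W_h   = rng.normal(0, 0.1, (n * d, H))     # concatenated context -> hidden
W_out = rng.normal(0, 0.1, (H, V))         # hidden -> output
W_dir = rng.normal(0, 0.1, (n * d, V))     # direct embedding -> output connections

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(context_word_ids):
    """Predict a distribution over the next word from n previous word ids."""
    x = C[context_word_ids].reshape(-1)    # look up and concatenate embeddings
    h = np.tanh(x @ W_h)                   # hidden layer with tanh activation
    return softmax(h @ W_out + x @ W_dir)  # softmax over the vocabulary

p_next = forward([23, 977, 5, 412])        # arbitrary word ids
print(p_next.shape, p_next.sum())          # (10000,) 1.0
```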

Word Embeddings

[Figure: the word embedding C]

- By-product: embedding of each word into continuous space
- Similar contexts → similar embedding
- Recall: distributional semantics

Word Embeddings

[Figures: visualizations of learned word embeddings]

Are Word Embeddings Magic?

Morphosyntactic regularities (Mikolov et al., 2013):
- adjectives: base form vs. comparative, e.g., good, better
- nouns: singular vs. plural, e.g., year, years
- verbs: present tense vs. past tense, e.g., see, saw

Semantic regularities:
- clothing is to shirt as dish is to bowl
- evaluated on human judgment data of semantic similarities
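These regularities are usually probed with simple vector arithmetic: the offset between "good" and "better" should roughly match the offset between other base/comparative pairs. A sketch, assuming a hypothetical dictionary of pre-trained embeddings (not provided here):

```python
import numpy as np

def most_similar(query_vec, embeddings, exclude=()):
    """Return the word whose embedding has the highest cosine similarity to query_vec."""
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = vec @ query_vec / (np.linalg.norm(vec) * np.linalg.norm(query_vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# embeddings: a dict mapping words to vectors, e.g. loaded from a pre-trained model (hypothetical)
# analogy "good is to better as bad is to ?":
# answer = most_similar(embeddings["better"] - embeddings["good"] + embeddings["bad"],
#                       embeddings, exclude={"good", "better", "bad"})
```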

recurrent neural networks

Recurrent Neural Networks

[Figure: Word 1 → C → E → H → Word 2, with an extra layer of nodes, all with value 1, feeding into H]

- Start: predict second word from first
- Mystery layer with nodes all with value 1

Recurrent Neural Networks

[Figure: the hidden layer values H from the first step are copied and fed, together with Word 2, into the next step, which predicts Word 3]

Recurrent Neural Networks

[Figure: the unrolling continues: at every step the hidden layer values are copied forward and combined with the current word to predict the next word (Word 3 → Word 4, and so on)]
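In code, the unrolled network above is one set of weights applied at every position: each step combines the embedding of the current word with the copied hidden values from the previous step. A minimal numpy sketch with arbitrary sizes and random weights; the hidden state is initialized to all ones, as in the "mystery layer" of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, H = 10_000, 100, 200                 # vocab size, embedding dim, hidden size

C     = rng.normal(0, 0.1, (V, d))         # embedding matrix
W_in  = rng.normal(0, 0.1, (d, H))         # embedding -> hidden
W_rec = rng.normal(0, 0.1, (H, H))         # hidden(t-1) -> hidden(t), the "copy values" link
W_out = rng.normal(0, 0.1, (H, V))         # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_predictions(word_ids):
    """Run the RNN over a sentence; at each step predict the next word."""
    h = np.ones(H)                         # initial hidden state of ones
    predictions = []
    for w in word_ids:
        h = np.tanh(C[w] @ W_in + h @ W_rec)
        predictions.append(softmax(h @ W_out))
    return predictions

probs = rnn_predictions([12, 4057, 33])    # arbitrary word ids
print(len(probs), probs[0].shape)          # 3 (10000,)
```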

Training

[Figure: Word 1 → E → H → Word 2]

- Process first training example
- Update weights with back-propagation

Training

[Figure: copied hidden values and Word 2 → E → H → Word 3]

- Process second training example
- Update weights with back-propagation
- And so on...
- But: no feedback to previous history

Back-Propagation Through Time

[Figure: the network unfolded over several time steps, from Word 1 through Word 4]

After processing a few training examples, update through the unfolded recurrent neural network.

Back-Propagation Through Time

- Carry out back-propagation through time (BPTT) after each training example
  - 5 time steps seems to be sufficient
  - network learns to store information for more than 5 time steps
- Or: update in mini-batches
  - process 10-20 training examples
  - update backwards through all examples
  - removes need for multiple steps for each training example

long short term memory

Vanishing Gradients

- Error is propagated to previous steps
- Updates consider both the prediction at that time step and the impact on future time steps
- Vanishing gradient: the propagated error disappears
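A tiny numeric illustration of why the propagated error disappears: pushing the error one step further back multiplies it by the recurrent weight matrix again, and if that matrix is contractive (largest singular value below 1) the gradient norm shrinks exponentially; above 1 it explodes. Sizes and weights here are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
H = 50
W_rec = rng.normal(0, 0.05, (H, H))          # small recurrent weights -> contraction
error = rng.normal(0, 1.0, H)                # error signal at the last time step

for step in range(0, 21, 5):
    # repeatedly push the error one step further back (ignoring the activation
    # function's derivative, which would shrink it even faster for tanh/sigmoid)
    print(f"{step:2d} steps back: gradient norm = {np.linalg.norm(error):.2e}")
    for _ in range(5):
        error = W_rec.T @ error
```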

Recent vs. Early History

Hidden layer plays double duty:
- memory of the network
- continuous space representation used to predict output words

Sometimes only recent context is important:
  After much economic progress over the years, the country has
Sometimes much earlier context is important:
  The country which has made much economic progress over the years still has

Long Short-Term Memory (LSTM)

- Design quite elaborate, although not very complicated to use
- Basic building block: LSTM cell
  - similar to a node in a hidden layer
  - but: has an explicit memory state
- Output and memory state changes depend on gates
  - input gate: how much new input changes the memory state
  - forget gate: how much of the prior memory state is retained
  - output gate: how strongly the memory state is passed on to the next layer
- Gates can be not just open (1) and closed (0), but slightly ajar (e.g., 0.2)

LSTM Cell

[Figure: LSTM cell between the LSTM layer at time t−1 and time t, connected to the preceding layer X and the next layer Y; input, forget, and output gates control the memory state m and the output h]

LSTM Cell (Math)

Memory and output values at time step t:

  memory_t = gate_{input} × input_t + gate_{forget} × memory_{t-1}
  output_t = gate_{output} × memory_t

Hidden node value h_t passed on to the next layer applies an activation function f:

  h_t = f(output_t)

Input is computed as in a recurrent neural network node:
- given node values for the prior layer x_t = (x_{t,1}, ..., x_{t,X})
- given values for the hidden layer from the previous time step h_{t-1} = (h_{t-1,1}, ..., h_{t-1,H})
- input value is a combination of matrix multiplication with weights w^x and w^h and an activation function g:

  input_t = g( \sum_{i=1}^{X} w^x_i x_{t,i} + \sum_{i=1}^{H} w^h_i h_{t-1,i} )

Values for Gates

- Gates are very important
- How do we compute their value? → with a neural network layer!
- For each gate a ∈ {input, forget, output}:
  - weight matrix W^{xa} to consider node values in the previous layer x_t
  - weight matrix W^{ha} to consider the hidden layer h_{t-1} at the previous time step
  - weight matrix W^{ma} to consider the memory at the previous time step memory_{t-1}
  - activation function h:

  gate_a = h( \sum_{i=1}^{X} w^{xa}_i x_{t,i} + \sum_{i=1}^{H} w^{ha}_i h_{t-1,i} + \sum_{i=1}^{H} w^{ma}_i memory_{t-1,i} )
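Putting the last two slides together, here is one LSTM cell step in numpy as a sketch of the formulation given here: gates computed from x_t, h_{t-1}, and memory_{t-1}, then the memory and output updates. The slides leave the activation functions unnamed; sigmoid for the gate activation h and tanh for f and g are assumptions, and all sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
X, H = 100, 200                              # size of the previous layer and of the hidden layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# one weight matrix per gate for each of its three inputs (x_t, h_{t-1}, memory_{t-1})
W = {(gate, src): rng.normal(0, 0.1, (dim, H))
     for gate in ("input", "forget", "output")
     for src, dim in (("x", X), ("h", H), ("m", H))}
w_x = rng.normal(0, 0.1, (X, H))             # weights for the cell input
w_h = rng.normal(0, 0.1, (H, H))

def lstm_step(x_t, h_prev, memory_prev):
    """One LSTM time step following the slide's equations."""
    def gate(name):
        return sigmoid(x_t @ W[(name, "x")] + h_prev @ W[(name, "h")]
                       + memory_prev @ W[(name, "m")])
    input_t  = np.tanh(x_t @ w_x + h_prev @ w_h)                  # g = tanh (assumed)
    memory_t = gate("input") * input_t + gate("forget") * memory_prev
    output_t = gate("output") * memory_t
    h_t      = np.tanh(output_t)                                  # f = tanh (assumed)
    return h_t, memory_t

h, m = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H))
print(h.shape, m.shape)                      # (200,) (200,)
```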

Training

- LSTMs are trained the same way as recurrent neural networks: back-propagation through time
- This all looks very complex, but:
  - all the operations are still based on matrix multiplications and differentiable activation functions
  - → we can compute gradients for the objective function with respect to all parameters
  - → we can compute update functions

What is the Point?

(from Tran, Bisazza, Monz, 2016)

- Each node has a memory memory_i, independent from the current output h_i
- Memory may be carried through unchanged (gate^{input}_i = 0, gate^{memory}_i = 1)
- → can remember important features over a long time span (capture long-distance dependencies)

Visualizing Individual Cells

Karpathy et al. (2015): Visualizing and Understanding Recurrent Networks

Visualizing Individual Cells

[Figure: visualizations of individual cell activations]

Gated Recurrent Unit (GRU)

[Figure: GRU cell between the GRU layer at time t−1 and time t, connected to the preceding layer X and the next layer Y; reset and update gates control the hidden state h]

Gated Recurrent Unit (Math)

Two gates:

  update_t = g(W_{update} × input_t + U_{update} × state_{t-1} + bias_{update})
  reset_t = g(W_{reset} × input_t + U_{reset} × state_{t-1} + bias_{reset})

Combination of input and previous state (similar to a traditional recurrent neural network):

  combination_t = f(W × input_t + U × (reset_t ∘ state_{t-1}))

Interpolation with the previous state:

  state_t = (1 − update_t) ∘ state_{t-1} + update_t ∘ combination_t + bias
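The same kind of sketch for one GRU step, following the equations above with g = sigmoid for the gates and f = tanh for the candidate state (the usual choices, assumed here since the slide does not name them). Sizes and weights are arbitrary, and the trailing bias term on the interpolation is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
X, H = 100, 200                              # input size, hidden state size
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W_u, U_u, b_u = rng.normal(0, 0.1, (X, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
W_r, U_r, b_r = rng.normal(0, 0.1, (X, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
W,   U        = rng.normal(0, 0.1, (X, H)), rng.normal(0, 0.1, (H, H))

def gru_step(input_t, state_prev):
    """One GRU time step following the slide's equations."""
    update_t = sigmoid(input_t @ W_u + state_prev @ U_u + b_u)
    reset_t  = sigmoid(input_t @ W_r + state_prev @ U_r + b_r)
    # candidate state: the reset gate scales how much of the previous state is used
    combination_t = np.tanh(input_t @ W + (reset_t * state_prev) @ U)
    # interpolate between the previous state and the candidate
    return (1 - update_t) * state_prev + update_t * combination_t

state = gru_step(rng.normal(size=X), np.zeros(H))
print(state.shape)                           # (200,)
```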

deeper models

Deep Learning?

[Figure: Input → Hidden Layer → Output]

- Not much deep learning so far
- Between prediction from input to output: only 1 hidden layer
- How about more hidden layers?

Deep Models

[Figure: three architectures side by side: Shallow (Input → Hidden Layer → Output), Deep Stacked (Input → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output), and Deep Transition (Input → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output)]
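A sketch of the "deep stacked" variant: the hidden state of layer 1 at time t becomes the input of layer 2 at time t, and so on, with each layer keeping its own recurrent state; the "deep transition" variant would instead apply several feed-forward layers inside each single recurrent step. Sizes, weights, and the plain-tanh recurrence are illustrative assumptions (in practice each layer would typically be an LSTM or GRU).

```python
import numpy as np

rng = np.random.default_rng(5)
d, H, L = 100, 200, 3                        # input dim, hidden size, number of stacked layers

W_in  = [rng.normal(0, 0.1, (d if l == 0 else H, H)) for l in range(L)]
W_rec = [rng.normal(0, 0.1, (H, H)) for _ in range(L)]

def stacked_rnn(inputs):
    """Run L stacked recurrent layers over a sequence of input vectors."""
    states = [np.zeros(H) for _ in range(L)]
    outputs = []
    for x in inputs:
        layer_input = x
        for l in range(L):
            # each layer sees the output of the layer below and its own previous state
            states[l] = np.tanh(layer_input @ W_in[l] + states[l] @ W_rec[l])
            layer_input = states[l]
        outputs.append(states[-1])           # top layer's state feeds the output layer
    return outputs

seq = [rng.normal(size=d) for _ in range(6)]
outs = stacked_rnn(seq)
print(len(outs), outs[0].shape)              # 6 (200,)
```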

questions?