COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS

LANGUAGE MODELS
- Assign a probability to a sequence of words.
- Framed as sliding a window over the sentence, predicting each word from the finite context to its left. E.g., n = 3, a trigram model:
  P(w_1, w_2, ..., w_m) = \prod_{i=1}^{m} P(w_i | w_{i-2}, w_{i-1})
- Training (estimation) is from frequency counts.
- Difficulty with rare events: smoothing.
- But are there better solutions? Neural networks (deep learning) are a popular alternative.
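
A minimal sketch of the count-based estimation described above, not taken from the lecture: trigram probabilities are estimated by relative frequency from a toy tokenised corpus, with no smoothing, so unseen events get probability zero.

```python
from collections import defaultdict

def train_trigram_lm(sentences):
    """Estimate P(w_i | w_{i-2}, w_{i-1}) by relative frequency (no smoothing)."""
    tri_counts = defaultdict(int)
    bi_counts = defaultdict(int)
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            context = (words[i - 2], words[i - 1])
            tri_counts[context + (words[i],)] += 1
            bi_counts[context] += 1

    def prob(w, u, v):
        """P(w | u, v); zero for unseen contexts, which is why smoothing is needed."""
        return tri_counts[(u, v, w)] / bi_counts[(u, v)] if bi_counts[(u, v)] else 0.0

    return prob

p = train_trigram_lm([["the", "cow", "eats", "grass"]])
print(p("grass", "cow", "eats"))    # 1.0 on this tiny corpus
print(p("grass", "sheep", "eats"))  # 0.0: unseen context, motivating smoothing
```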

OUTLINE
- Neural network fundamentals
- Feed-forward neural language models
- Recurrent neural language models

LMS AS CLASSIFIERS
- LMs can be considered simple classifiers; e.g., a trigram model
  P(w_i | w_{i-2} = "cow", w_{i-1} = "eats")
  classifies the likely next word in a sequence.
- It has a parameter for every combination of w_{i-2}, w_{i-1}, w_i.
- Can think of this as a specific type of classifier, one with a very simple parameterisation.

BIGRAM LM IN PICTURES
- Each word type is assigned an id number between 1..V (vocabulary size).
- [Figure: a V x V matrix with rows indexed by w_{i-1} and columns by w_i; e.g., the cell (eats, grass) holds P(grass | eats).]
- The matrix stores the bigram LM parameters, i.e., conditional probabilities: a total of V^2 parameters.

INFERENCE AS MATRIX MULTIPLICATION
- Using the parameter matrix, we can query with simple linear algebra: represent "eats" as a 1-hot vector x and multiply it by the parameter matrix.
- [Figure: x (1-hot for "eats") times the parameter matrix selects the row P(w_i | w_{i-1} = eats), from which we read off P(grass | eats).]
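
A small sketch of this lookup-as-multiplication idea, assuming a toy three-word vocabulary and made-up probabilities: multiplying a 1-hot vector by the parameter matrix simply selects a row.

```python
import numpy as np

vocab = ["cow", "eats", "grass"]      # toy vocabulary (illustrative)
V = len(vocab)

# Toy bigram parameter matrix: row = w_{i-1}, column = w_i, rows sum to 1.
A = np.array([[0.1, 0.8, 0.1],        # after "cow"
              [0.1, 0.1, 0.8],        # after "eats"
              [0.4, 0.3, 0.3]])       # after "grass"

x = np.zeros(V)
x[vocab.index("eats")] = 1.0          # 1-hot vector for "eats"

p_next = x @ A                        # selects the row for "eats"
print(p_next[vocab.index("grass")])   # P(grass | eats) = 0.8
```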

FROM BIGRAMS TO TRIGRAMS
Problems:
- too many parameters (V^3) to be learned from data (growing with LM order)
- parameters are completely separate for different trigrams, even when they share words, e.g., (cow, eats) vs (farmer, eats) [motivating smoothing]
- doesn't recognise similar words, e.g., cow vs sheep, eats vs chews
[Figure: a V x V x V cube of parameters indexed by w_{i-2}, w_{i-1}, w_i, with cells such as P(grass | cow, eats).]

CAN'T WE USE WORD EMBEDDINGS?
[Figure: two ways to parameterise a bigram LM.
Top: "eats" as a 1-hot vector of size V, multiplied by a V x V parameter matrix, gives P(w_i | w_{i-1} = eats) directly.
Bottom: "eats" is first mapped to a dense d-dimensional embedding, transformed by a d x d parameter matrix, then un-embedded back to size V with a softmax to give P(w_i | w_{i-1} = eats).]

LOG-BILINEAR LM
Parameters:
- embedding matrix (or 2 matrices, for input & output) of size V x d
- weights, now d x d
If d << V, this is a considerable saving vs V^2; d is chosen as a hyperparameter, typically in [100, 1000].
This model is called the log-bilinear LM, and is closely related to word2vec.
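
A minimal numpy sketch of the forward pass just described (embed, transform with d x d weights, un-embed, softmax). The matrix names, sizes and random initialisation are illustrative assumptions, not part of the lecture.

```python
import numpy as np

V, d = 10000, 200                     # vocabulary and embedding sizes (illustrative)
rng = np.random.default_rng(0)
E_in = rng.normal(scale=0.1, size=(V, d))    # input embeddings
E_out = rng.normal(scale=0.1, size=(V, d))   # output embeddings
W = rng.normal(scale=0.1, size=(d, d))       # d x d weights

def next_word_probs(prev_word_id):
    """P(w_i | w_{i-1}): embed, transform, un-embed, softmax."""
    h = E_in[prev_word_id] @ W               # d-dimensional hidden representation
    scores = E_out @ h                       # un-embed: one score per vocabulary word
    scores -= scores.max()                   # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

p = next_word_probs(42)
print(p.shape, p.sum())                      # (10000,) and ~1.0
```

Note the parameter count: two V x d embedding matrices plus one d x d weight matrix, the saving over V^2 claimed on the slide.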

FEED-FORWARD NEURAL NET LMS
Neural networks are a more general approach, based on the same principle:
- embed the input context words
- transform them in hidden space
- un-embed to a probability over the full vocabulary
A neural network is used to define the transformations, e.g., the feed-forward LM (FFLM).
[Figure: w_{i-2} and w_{i-1} as 1-hot vectors of size V are each embedded (lookup) to d dimensions, combined by a function into a hidden vector, then un-embedded with a softmax to give P(w_i) over V words.]
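
A minimal numpy sketch of a feed-forward LM with a two-word context, following the embed / transform / un-embed recipe above. All sizes, names and the tanh hidden layer are illustrative assumptions; this is not the lecture's reference implementation.

```python
import numpy as np

V, d, d_h = 10000, 100, 256                  # vocab, embedding, hidden sizes (illustrative)
rng = np.random.default_rng(1)
E = rng.normal(scale=0.1, size=(V, d))       # shared embedding (lookup) matrix
W = rng.normal(scale=0.1, size=(2 * d, d_h)) # combines the two context embeddings
b = np.zeros(d_h)
U = rng.normal(scale=0.1, size=(d_h, V))     # un-embedding / output weights

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def ff_lm(w_prev2, w_prev1):
    """P(w_i | w_{i-2}, w_{i-1}) for a feed-forward LM."""
    x = np.concatenate([E[w_prev2], E[w_prev1]])   # embed both context words
    h = np.tanh(x @ W + b)                         # transform in hidden space
    return softmax(h @ U)                          # un-embed to probs over V

print(ff_lm(3, 42).shape)   # (10000,)
```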

INTERIM SUMMARY
N-gram LMs:
- cheap to train (just compute counts)
- but too many parameters, with problems of sparsity and of scaling to larger contexts
- don't adequately capture properties of words (grammatical and semantic similarity), e.g., film vs movie
NNLMs:
- more robust
- force words through low-dimensional embeddings
- automatically capture word properties, leading to more robust estimates

NEURAL NETWORKS
- Deep neural networks provide a mechanism for learning richer models.
- Based on vector embeddings and compositional functions over these vectors.
- Word embeddings capture grammatical and semantic similarity: cows ~ sheep, eats ~ chews, etc.
- Vector composition allows combinations of features to be learned (e.g., "humans consume meat").
- Limit the size of the vector representation to keep model capacity under control.

COMPONENTS OF NN CLASSIFIER
NN = Neural Network, a.k.a. artificial NN, deep learning, multilayer perceptron.
Composed of simple functions of vector-valued inputs.
[Figure (JM3 Ch. 8): inputs x_1..x_{d_in} plus a bias input +1 feed hidden units h_1..h_{d_h} via weights W and bias b; the hidden layer feeds outputs y_1..y_{d_out} via weights U.]

NN UNITS
Each unit is a function: given input x, it computes a real-valued (scalar) output
h = \tanh( \sum_j w_j x_j + b )
- it scales the input (with weights, w) and adds an offset (bias, b)
- it applies a non-linear function, such as the logistic sigmoid, the hyperbolic tangent (tanh), or the rectified linear unit (ReLU)

NEURAL NETWORK COMPONENTS
Typically there are several hidden units, i.e.,
h_i = \tanh( \sum_j w_{ij} x_j + b_i )
each with its own weights (w_i) and bias term (b_i).
This can be expressed using matrix and vector operators:
\vec{h} = \tanh( W \vec{x} + \vec{b} )
where W is a matrix comprising the unit weight vectors, b is a vector of all the bias terms, and tanh is applied element-wise to the vector. This gives a very efficient and simple implementation.
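
A small numpy sketch, with illustrative sizes and random values, checking that the element-wise and vectorised forms above compute the same hidden layer.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h = 4, 3
W = rng.normal(size=(d_h, d_in))     # one weight vector per hidden unit
b = rng.normal(size=d_h)             # one bias per hidden unit
x = rng.normal(size=d_in)

# Element-wise form: h_i = tanh(sum_j w_ij * x_j + b_i)
h_elementwise = np.array([np.tanh(W[i] @ x + b[i]) for i in range(d_h)])

# Vectorised form: h = tanh(W x + b)
h_vectorised = np.tanh(W @ x + b)

print(np.allclose(h_elementwise, h_vectorised))   # True
```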

ANN IN PICTURES
[Figure (JM3 Ch. 8): a single unit computing y from x: inputs x_1, x_2, x_3 with weights w_1, w_2, w_3 and a bias b (from a +1 input) are summed to z, then passed through an activation a = σ(z) to produce y.]
Typical networks have several units, and additional layers, e.g., an output layer for the classification target.
[Figure (JM3 Ch. 8): inputs x_1..x_{d_in} connect via weights W (and bias b) to hidden units h_1..h_{d_h}, which connect via weights U to outputs y_1..y_{d_out}.]

BEHIND THE NON-LINEARITY
Non-linearity lends expressive power beyond logistic regression (a linear ANN is otherwise equivalent).
[Figure (from Neubig, 2017): plots of the activation functions step(x), tanh(x) and relu(x).]
Non-linearity allows complex functions to be learned:
- e.g., logical operations: AND, OR, NOT, etc.
- a single-hidden-layer ANN is a universal approximator: it can represent any function, given a sufficiently large hidden state.

EXAMPLE NETWORKS
What do these units do?
- using the step activation function
- consider binary values for x_1 and x_2
[Figure: two step units over inputs x_1 and x_2, each with weights 1 and 1; h_1 has bias -1, h_2 has bias 0.]
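
A truth-table sketch for the question above, assuming the weights recovered from the figure (weights 1 and 1 for both units, bias -1 for h_1 and 0 for h_2) and a strict step activation that outputs 1 only for positive pre-activations. Under these assumptions h_1 behaves as AND and h_2 as OR; with a different step convention the answer changes.

```python
def step(z):
    # Strict step activation: 1 if the pre-activation is positive, else 0.
    return 1 if z > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = step(1 * x1 + 1 * x2 - 1)   # bias -1: fires only when both inputs are 1 (AND)
        h2 = step(1 * x1 + 1 * x2 + 0)   # bias 0: fires when either input is 1 (OR)
        print(x1, x2, "->", h1, h2)
```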

COUPLING THE OUTPUT LAYER
To make this into a classifier, we need to produce a classification output, e.g., a vector of probabilities for the next word.
Add another layer, which takes h as input and maps to the classification target (e.g., V). To ensure the outputs are probabilities, apply the softmax transform, which exponentiates and normalises; applied to a vector v:
\text{softmax}(v) = [ \exp(v_1) / \sum_i \exp(v_i), \exp(v_2) / \sum_i \exp(v_i), ..., \exp(v_m) / \sum_i \exp(v_i) ]
The full network is then
\vec{h} = \tanh( W \vec{x} + \vec{b} )
\vec{y} = \text{softmax}( U \vec{h} )
[Figure: inputs x_1..x_{d_in} feed hidden units h_1..h_{d_h} via W, which feed outputs y_1..y_{d_out} via U.]
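
A minimal numpy sketch of the two equations above. Subtracting the maximum before exponentiating is an added numerical-stability trick, not something stated on the slide; sizes and random values are illustrative.

```python
import numpy as np

def softmax(v):
    v = v - v.max()              # subtract the max for numerical stability
    e = np.exp(v)
    return e / e.sum()

def forward(x, W, b, U):
    """y = softmax(U tanh(W x + b))"""
    h = np.tanh(W @ x + b)
    return softmax(U @ h)

rng = np.random.default_rng(3)
d_in, d_h, d_out = 5, 4, 3
y = forward(rng.normal(size=d_in),
            rng.normal(size=(d_h, d_in)), np.zeros(d_h),
            rng.normal(size=(d_out, d_h)))
print(y, y.sum())                # probabilities summing to ~1
```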

DEEP STRUCTURES
Can stack several hidden layers, e.g.:
1. map from 1-hot words, w, to word embeddings, e (lookup)
2. transform e to hidden state h_1 (with a non-linearity)
3. transform h_1 to hidden state h_2 (with a non-linearity)
4. repeat
5. transform h_n to the output classification space y (with softmax)
Each layer is typically fully connected to the next lower layer, i.e., each unit is connected to all input elements.
Depth allows more complex functions to be learned, e.g., longer logical expressions.
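
A short numpy sketch of the five steps above as a stack of layers; the number of layers, sizes and initialisation are illustrative assumptions.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(4)
V, d, d_h, n_layers = 1000, 50, 64, 3
E = rng.normal(scale=0.1, size=(V, d))                       # step 1: embedding lookup
layers = [(rng.normal(scale=0.1, size=(d_h, d)), np.zeros(d_h))]
layers += [(rng.normal(scale=0.1, size=(d_h, d_h)), np.zeros(d_h)) for _ in range(n_layers - 1)]
U = rng.normal(scale=0.1, size=(V, d_h))                     # output (un-embedding) weights

def deep_forward(word_id):
    h = E[word_id]                     # a 1-hot lookup is just selecting a row of E
    for W, b in layers:                # steps 2-4: repeated fully-connected tanh layers
        h = np.tanh(W @ h + b)
    return softmax(U @ h)              # step 5: map to the output space with softmax

print(deep_forward(7).shape)           # (1000,)
```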

LEARNING WITH SGD
How to learn the parameters from data? (parameters = sets of weights, biases, embeddings)
Consider how well the model fits the training data, in terms of the probability it assigns to the correct output, e.g.,
\prod_{i=1}^{m} P(w_i | w_{i-1}, w_{i-2})
for a sequence of m words.
We want to maximise this probability, or equivalently minimise its negative log; -log P is known as the loss function.
Trained using gradient-based methods, much like logistic regression (tools like TensorFlow, Theano, Torch, etc. use autodiff to compute gradients automatically).
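
A sketch of the negative log-likelihood loss and a generic SGD update. The gradient itself is left abstract here, since, as noted above, autodiff tools compute it in practice; function names and the toy numbers are illustrative.

```python
import numpy as np

def nll_loss(probs_per_step, target_ids):
    """-log P of the training sequence: sum of -log P(w_i | context) over positions i."""
    return -sum(np.log(p[t]) for p, t in zip(probs_per_step, target_ids))

def sgd_step(params, grads, lr=0.1):
    """Generic SGD update: theta <- theta - lr * dL/dtheta (grads come from autodiff)."""
    return [theta - lr * g for theta, g in zip(params, grads)]

# Toy check: model probabilities for a 3-word sequence, each over a 4-word vocabulary.
probs = [np.array([0.1, 0.6, 0.2, 0.1]),
         np.array([0.3, 0.3, 0.3, 0.1]),
         np.array([0.25, 0.25, 0.25, 0.25])]
print(nll_loss(probs, [1, 0, 3]))    # lower is better
```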

FFNNLM
Application to language modelling:
[Figure (Bengio et al., 2003): each context word index w_{t-n+1}, ..., w_{t-2}, w_{t-1} is looked up in a shared embedding matrix C; the embeddings are concatenated and passed through a tanh hidden layer (most computation happens here), and a softmax output layer gives the i-th output = P(w_t = i | context).]

RECURRENT NNLMS
What if we structure the network differently, e.g., according to the sequence, using Recurrent Neural Networks (RNNs)?

RECURRENT NNLMS
Start with an initial hidden state h_0.
For each word, w_i, in order i = 1..m:
- embed the word to produce a vector, e_i
- compute the hidden state h_i = \tanh(W e_i + V h_{i-1} + b)
- compute the output P(w_{i+1}) = \text{softmax}(U h_i + c)
Train so as to minimise -\sum_i \log P(w_i), learning the parameters W, V, U, b, c, h_0.
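
A minimal numpy sketch of the recurrence above, computing the sequence loss with no training loop. Sizes, names and random initialisation are illustrative; the recurrence matrix is called Vh here to avoid clashing with the vocabulary size.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

V_size, d, d_h = 1000, 50, 64
rng = np.random.default_rng(5)
E = rng.normal(scale=0.1, size=(V_size, d))       # word embeddings
W = rng.normal(scale=0.1, size=(d_h, d))          # input-to-hidden weights
Vh = rng.normal(scale=0.1, size=(d_h, d_h))       # hidden-to-hidden weights (the slide's V)
U = rng.normal(scale=0.1, size=(V_size, d_h))     # hidden-to-output weights
b, c, h0 = np.zeros(d_h), np.zeros(V_size), np.zeros(d_h)

def rnn_lm_loss(word_ids):
    """Return -sum_i log P(w_i), predicting each word from the previous hidden state."""
    h, loss = h0, 0.0
    for w in word_ids:
        probs = softmax(U @ h + c)                # distribution over the next word
        loss -= np.log(probs[w])
        e = E[w]                                  # embed the current word
        h = np.tanh(W @ e + Vh @ h + b)           # h_i = tanh(W e_i + V h_{i-1} + b)
    return loss

print(rnn_lm_loss([3, 17, 42, 7]))
```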

RNNS
- Can result in very deep networks, which are difficult to train due to exploding or vanishing gradients; variant RNNs are designed to behave better, e.g., GRU, LSTM.
- RNNs are used widely as sentence encodings: the RNN processes the sentence a word at a time, and the final state is used as a fixed-dimensional representation of the sentence (of any length).
- Can also run another RNN over the reversed sentence, and concatenate both final representations.
- Used in translation, summarisation, generation, text classification, and more.
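
A brief sketch of the sentence-encoding idea: run the recurrence forward and over the reversed sentence, then concatenate the two final states. For brevity the same weights are reused for both directions; a real bidirectional encoder would use separate parameters for each direction, and the sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
d, d_h = 8, 6
E = rng.normal(scale=0.1, size=(100, d))
W = rng.normal(scale=0.1, size=(d_h, d))
Vh = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)

def encode(word_ids):
    """Run the RNN over a sequence and return its final hidden state."""
    h = np.zeros(d_h)
    for w in word_ids:
        h = np.tanh(W @ E[w] + Vh @ h + b)
    return h

sentence = [5, 23, 7, 42]
forward = encode(sentence)                   # final state, left to right
backward = encode(list(reversed(sentence)))  # final state, right to left
rep = np.concatenate([forward, backward])    # fixed-size sentence representation
print(rep.shape)                             # (12,)
```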

FINAL WORDS
NNet models, pros:
- Robust to word variation, typos, etc.
- Excellent generalisation, especially RNNs
- Flexible: forms the basis for many other models (translation, summarisation, generation, tagging, etc.)
Cons:
- Much slower than counts, but hardware acceleration helps
- Need to limit the vocabulary; not so good with rare words
- Not as good at memorising fixed sequences
- Data hungry; not so good on tiny data sets

REQUIRED READING
- J&M3, Ch. 8
- Neubig (2017), Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial, Sections 5 & 6