COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS
LANGUAGE MODELS
- Assign a probability to a sequence of words
- Framed as sliding a window over the sentence, predicting each word from the finite context to its left
- E.g., n = 3, a trigram model: $P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})$
- Training (estimation) from frequency counts
- Difficulty with rare events: smoothing
- But are there better solutions? Neural networks (deep learning) a popular alternative.
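As a concrete illustration of count-based estimation with smoothing, a minimal sketch (the toy corpus, function names, and add-k smoothing choice are illustrative, not from the slides):

```python
from collections import defaultdict

def train_trigram_lm(corpus, k=1.0):
    """Estimate smoothed trigram probabilities P(w_i | w_{i-2}, w_{i-1}) from counts."""
    trigram_counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + sentence + ["</s>"]
        vocab.update(tokens)
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            trigram_counts[context][tokens[i]] += 1
    V = len(vocab)

    def prob(word, context):
        # add-k smoothing avoids zero probability for unseen trigrams
        counts = trigram_counts[context]
        return (counts[word] + k) / (sum(counts.values()) + k * V)

    return prob

prob = train_trigram_lm([["the", "cow", "eats", "grass"]])
print(prob("grass", ("cow", "eats")))
```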
OUTLINE
- Neural network fundamentals
- Feed-forward neural language models
- Recurrent neural language models
LMS AS CLASSIFIERS
- LMs can be considered simple classifiers, e.g., a trigram model $P(w_i \mid w_{i-2} = \text{cow}, w_{i-1} = \text{eats})$ classifies the likely next word in a sequence.
- Has a parameter for every combination of $w_{i-2}$, $w_{i-1}$, $w_i$
- Can think of this as a specific type of classifier, one with a very simple parameterisation.
BIGRAM LM IN PICTURES
- Each word type is assigned an id number between 1..V (vocabulary size)
- A V x V matrix stores the bigram LM parameters, i.e., the conditional probabilities: row $w_{i-1}$, column $w_i$, e.g., P(grass | eats)
- Total of V^2 parameters
INFERENCE AS MATRIX MULTIPLICATION
- Using the parameter matrix, we can query the model with simple linear algebra: represent "eats" as a 1-hot vector x, then multiplying x by the parameter matrix gives $P(w_i \mid w_{i-1} = \text{eats})$, from which we can read off P(grass | eats).
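A minimal numpy sketch of this query (the vocabulary and the uniform parameter matrix are placeholders for illustration):

```python
import numpy as np

vocab = ["cow", "eats", "grass", "chews"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# rows = w_{i-1}, columns = w_i; each row is a conditional distribution P(w_i | w_{i-1})
params = np.full((V, V), 1.0 / V)  # placeholder: uniform probabilities

x = np.zeros(V)
x[word_to_id["eats"]] = 1.0        # 1-hot vector for the context word "eats"

next_word_probs = x @ params       # selects the "eats" row: P(w_i | w_{i-1} = eats)
print(next_word_probs[word_to_id["grass"]])
```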
FROM BIGRAMS TO TRIGRAMS
Problems:
- too many parameters (V^3) to be learned from data (growing with LM order)
- parameters are completely separate for different trigrams, even when they share words, e.g., (cow, eats) vs (farmer, eats) [motivating smoothing]
- doesn't recognise similar words, e.g., cow vs sheep, eats vs chews
(Figure: trigram parameter cube indexed by $w_{i-2}$, $w_{i-1}$, $w_i$, e.g., P(grass | cow, eats))
CAN'T WE USE WORD EMBEDDINGS?
- Bigram LM as before: "eats" as a 1-hot vector (size V) times a V x V parameter matrix gives $P(w_i \mid w_{i-1} = \text{eats})$
- Instead: map "eats" to a dense embedding (size d), transform with a d x d parameter matrix, then un-embed and softmax to get $P(w_i \mid w_{i-1} = \text{eats})$
LOG-BILINEAR LM
- Parameters:
  - embedding matrix (or 2 matrices, for input & output) of size V x d
  - weights, now d x d
- If d << V then considerable saving vs V^2
- d chosen as a parameter, typically in [100, 1000]
- Model is called the log-bilinear LM, and is closely related to word2vec
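A rough sketch of a log-bilinear bigram forward pass with the V x d and d x d shapes described above; the parameters here are random and untrained, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10000, 300                            # vocabulary size and embedding dimension

E_in = rng.normal(scale=0.01, size=(V, d))   # input embeddings
E_out = rng.normal(scale=0.01, size=(V, d))  # output (un-embedding) matrix
W = rng.normal(scale=0.01, size=(d, d))      # d x d transformation

def next_word_distribution(prev_word_id):
    """P(w_i | w_{i-1}) under a log-bilinear bigram model with random (untrained) weights."""
    e = E_in[prev_word_id]          # embed the context word (d,)
    h = e @ W                       # bilinear transform (d,)
    scores = E_out @ h              # un-embed: one score per vocabulary word (V,)
    scores -= scores.max()          # numerical stability for softmax
    probs = np.exp(scores)
    return probs / probs.sum()

print(next_word_distribution(42).shape)      # (10000,)
```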
FEED FORWARD NEURAL NET LMS
- Neural networks are a more general approach, based on the same principle:
  - embed input context words
  - transform in hidden space
  - un-embed to a probability over the full vocabulary
- A neural network is used to define the transformations, e.g., the feed-forward LM (FFLM)
(Figure: $w_{i-2}$ and $w_{i-1}$ as 1-hot vectors (size V) -> embedding lookup (size d each) -> a hidden transformation -> un-embed and softmax -> $P(w_i)$ over V words)
INTERIM SUMMARY
- Ngram LMs
  - cheap to train (just compute counts)
  - but too many parameters, problems with sparsity and scaling to larger contexts
  - don't adequately capture properties of words (grammatical and semantic similarity), e.g., film vs movie
- NNLMs more robust
  - force words through low-dimensional embeddings
  - automatically capture word properties, leading to more robust estimates
NEURAL NETWORKS
- Deep neural networks provide a mechanism for learning richer models.
- Based on vector embeddings and compositional functions over these vectors.
- Word embeddings capture grammatical and semantic similarity: cows ~ sheep, eats ~ chews, etc.
- Vector composition can allow combinations of features to be learned (e.g., humans consume meat)
- Limit the size of the vector representation to keep model capacity under control.
COMPONENTS OF NN CLASSIFIER
- NN = Neural Network, a.k.a. artificial NN, deep learning, multilayer perceptron
- Composed of simple functions of vector-valued inputs
(Figure: feed-forward network with inputs $x_1..x_{d_{in}}$, hidden units $h_1..h_{d_h}$ with weights W and bias b, and outputs $y_1..y_{d_{out}}$ with weights U; from JM3 Ch 8)
<latexit sha1_base64="vvtrjdgkfur43jqf10s1hlwks/e=">aaace3icbvdlsgmxfm34rpu16tjnsaivqpmkoc6eohuxfawtdiyhk2y6sznmknxrs+lhupfx3lhqcevgnx9j+lho64edh3pujbknsaxx4djf1tz8wulscm4lv7q2vrfpb23f6crtlnvpihlvdihmgktwbw6cnvpfsbwi1gi6f8o8cceu5om8hl7kvjh0ja85jwas3y5f+ay7qgsexcfckgjxz7f/i+8nhwxlomcu4p0idny74jsdefcsqexeau1q8+0vt53qlgysqcbatypocl6fkobuseheztrlce2sdmszkunmtncfhtxa+8zp4zbrhhlwyp290sex1r04mjmxguhpz0pzv6yvqxji9blmm2csjh8km4ehwcogcjsrrkh0jcbucfnxtcoicaxty96uujk+evbud8unzefqqfa9n7srq7todxvrbr2jkrpenvrhfd2iz/sk3qwn68v6tz7go3pwzgch/yh1+qo67jxs</latexit> <latexit sha1_base64="vvtrjdgkfur43jqf10s1hlwks/e=">aaace3icbvdlsgmxfm34rpu16tjnsaivqpmkoc6eohuxfawtdiyhk2y6sznmknxrs+lhupfx3lhqcevgnx9j+lho64edh3pujbknsaxx4djf1tz8wulscm4lv7q2vrfpb23f6crtlnvpihlvdihmgktwbw6cnvpfsbwi1gi6f8o8cceu5om8hl7kvjh0ja85jwas3y5f+ay7qgsexcfckgjxz7f/i+8nhwxlomcu4p0idny74jsdefcsqexeau1q8+0vt53qlgysqcbatypocl6fkobuseheztrlce2sdmszkunmtncfhtxa+8zp4zbrhhlwyp290sex1r04mjmxguhpz0pzv6yvqxji9blmm2csjh8km4ehwcogcjsrrkh0jcbucfnxtcoicaxty96uujk+evbud8unzefqqfa9n7srq7todxvrbr2jkrpenvrhfd2iz/sk3qwn68v6tz7go3pwzgch/yh1+qo67jxs</latexit> <latexit sha1_base64="vvtrjdgkfur43jqf10s1hlwks/e=">aaace3icbvdlsgmxfm34rpu16tjnsaivqpmkoc6eohuxfawtdiyhk2y6sznmknxrs+lhupfx3lhqcevgnx9j+lho64edh3pujbknsaxx4djf1tz8wulscm4lv7q2vrfpb23f6crtlnvpihlvdihmgktwbw6cnvpfsbwi1gi6f8o8cceu5om8hl7kvjh0ja85jwas3y5f+ay7qgsexcfckgjxz7f/i+8nhwxlomcu4p0idny74jsdefcsqexeau1q8+0vt53qlgysqcbatypocl6fkobuseheztrlce2sdmszkunmtncfhtxa+8zp4zbrhhlwyp290sex1r04mjmxguhpz0pzv6yvqxji9blmm2csjh8km4ehwcogcjsrrkh0jcbucfnxtcoicaxty96uujk+evbud8unzefqqfa9n7srq7todxvrbr2jkrpenvrhfd2iz/sk3qwn68v6tz7go3pwzgch/yh1+qo67jxs</latexit> NN UNITS Each unit is a function given input x, computes real-value (scalar) h 0 1 h = tanh @ X j w j x j + ba scales input (with weights, w) and adds offset (bias, b) applies non-linear function, such as logistic sigmoid, hyperbolic sigmoid (tanh), or rectified linear unit
<latexit sha1_base64="afm6umpq4lc1x8gat7ozmgrc8+k=">aaacgnicbzbns8naeiy3flu/qh69lbzbeupabpugfl14vlbwaerzbcfn0s0m7e6kjer/epgvepgg4k28+g/cfhzuordw8l47zmzrj1iydn0vz2z2bn5hcwm5slk6tr5r3ny6nxgqodr5lgn95zmduiioo0ajd4kgfvksgn7vyug3+qcninundhjorayrrca4qyu1i1wvdzwlc3pgpwqqpj6eapdpg46m+5wejsnpqadfn8sddrhklt1r0wmotkbejnxvln54nzinesjkkhntrlgjtjkmuxajecfldssm91gxmhyvi8c0stftod2zsocgsbzpir2ppzsyfhkziox6exhd0pz1huj/xjpf4ksvczwkciqpbwwppbjtyvc0izrwlamljgthd6u8zjpxthewbaivvydpq71api2710el2vkkjswyq3bjpqmqy1ijl+sk1aknd+sjvjbx59f5dt6c9/hxgwfss01+lfp5dw/roai=</latexit> <latexit sha1_base64="afm6umpq4lc1x8gat7ozmgrc8+k=">aaacgnicbzbns8naeiy3flu/qh69lbzbeupabpugfl14vlbwaerzbcfn0s0m7e6kjer/epgvepgg4k28+g/cfhzuordw8l47zmzrj1iydn0vz2z2bn5hcwm5slk6tr5r3ny6nxgqodr5lgn95zmduiioo0ajd4kgfvksgn7vyug3+qcninundhjorayrrca4qyu1i1wvdzwlc3pgpwqqpj6eapdpg46m+5wejsnpqadfn8sddrhklt1r0wmotkbejnxvln54nzinesjkkhntrlgjtjkmuxajecfldssm91gxmhyvi8c0stftod2zsocgsbzpir2ppzsyfhkziox6exhd0pz1huj/xjpf4ksvczwkciqpbwwppbjtyvc0izrwlamljgthd6u8zjpxthewbaivvydpq71api2710el2vkkjswyq3bjpqmqy1ijl+sk1aknd+sjvjbx59f5dt6c9/hxgwfss01+lfp5dw/roai=</latexit> <latexit sha1_base64="afm6umpq4lc1x8gat7ozmgrc8+k=">aaacgnicbzbns8naeiy3flu/qh69lbzbeupabpugfl14vlbwaerzbcfn0s0m7e6kjer/epgvepgg4k28+g/cfhzuordw8l47zmzrj1iydn0vz2z2bn5hcwm5slk6tr5r3ny6nxgqodr5lgn95zmduiioo0ajd4kgfvksgn7vyug3+qcninundhjorayrrca4qyu1i1wvdzwlc3pgpwqqpj6eapdpg46m+5wejsnpqadfn8sddrhklt1r0wmotkbejnxvln54nzinesjkkhntrlgjtjkmuxajecfldssm91gxmhyvi8c0stftod2zsocgsbzpir2ppzsyfhkziox6exhd0pz1huj/xjpf4ksvczwkciqpbwwppbjtyvc0izrwlamljgthd6u8zjpxthewbaivvydpq71api2710el2vkkjswyq3bjpqmqy1ijl+sk1aknd+sjvjbx59f5dt6c9/hxgwfss01+lfp5dw/roai=</latexit> NEURAL NETWORK COMPONENTS Typically have several hidden units, i.e., 0 1 h i = tanh @ X w ij x j + b i A j each with own weights (w i ) and bias term (b i ) can be expressed using matrix & vector operators where W is a matrix comprising the unit weight vectors, and b is a vector of all the bias terms and tanh applied element-wise to a vector Very efficient and simple implementation ~ h = tanh W ~x + ~ b
ANN IN PICTURES
- Pictorial representation of a single unit, computing y from inputs $x_1, x_2, x_3$ via weights $w_1, w_2, w_3$, a bias b, and a non-linearity $\sigma$
- Typical networks have several units, and additional layers, e.g., an output layer for the classification target
(Figs from JM3 Ch 8)
BEHIND THE NON-LINEARITY
- Non-linearity lends expressive power beyond logistic regression (a linear ANN is otherwise equivalent)
- (Plots of step(x), tanh(x), and relu(x); figure from Neubig (2017))
- Non-linearity allows complex functions to be learned
  - e.g., logical operations: AND, OR, NOT, etc.
  - a single hidden layer ANN is a universal approximator: it can represent any function with a sufficiently large hidden state
EXAMPLE NETWORKS
- What do these units do?
  - using the step activation function
  - consider binary values for $x_1$ and $x_2$
(Figure: two hidden units $h_1$ and $h_2$ computed from $x_1$, $x_2$ and a bias input, with the weights shown on the edges)
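The edge weights in the original figure are not recoverable here, but as an illustration of what such units can do, one standard (hypothetical) choice of weights makes step units compute AND and OR over binary inputs:

```python
def step(z):
    return 1 if z >= 0 else 0

# Illustrative weights (not necessarily those in the original figure):
# with inputs x1, x2 in {0, 1} and a bias input of +1,
#   AND fires only when both inputs are on; OR fires when at least one is on.
def and_unit(x1, x2):
    return step(1 * x1 + 1 * x2 - 1.5)

def or_unit(x1, x2):
    return step(1 * x1 + 1 * x2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", and_unit(x1, x2), "OR:", or_unit(x1, x2))
```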
<latexit sha1_base64="ni/vzggiemypyfsdk3gv91wspla=">aaacw3icbvhratrafj3earex2lxxyzels6uilfkrqg9c0rcfkxi3sfmwyeznzuhkjszcfjeqn/stl/6ktripaoufgcm5c+beeyarlhqux9dbeg/v/op90uh08pdr46pxk6ffnamtweqyzexfxh0qqtehsqovkou8zbtos8vpnt6/quuk0d9ow+gy5bstcyk4ewo1dukviqzo4dvhsinralkfoz3ahhrlrwtvdihribvyu9brsnoop7y7m6nqcjjw8xibz3iqubcn7yrw0+hghq3gk3ga9wv3wwwaezbu+wr8m10buzeossju3giwv7rsucupflzrwjusuljkg1x42m3hlk0ftgvhnlldbqw/mqbn/3y0vhruw/r1jktohbutdet/tevn+ftli3vve2qxa5txcshalzsspuvbausbf1b6wueu3hjb/j+6ega3v74lkrftd9p467vj2achjrf7wv6yezzjp+ymfwhnlggcxbpfwsg4ch6fe2euhu6uhshgecb+qfd5hyd2s3y=</latexit> COUPLING THE OUTPUT LAYER To make this into a classifier, need to produce a classification output e.g., vector of probabilities for the next word Add another layer, which takes h as input, and maps to classification target (e.g., V) to ensure probabilities, apply softmax transform softmax applies exponential, and normalises, i.e., applied to vector v apple exp(v1 ) P i exp(v i), exp(v 2 ) P i exp(v i),... exp(v m ) P i exp(v i) W U y 1 h 1 h 2 x 1 x 2 y 2 h 3 x din y dout ~ h = tanh W ~x + ~ b ~y = softmax U ~ h h dh +1 b
DEEP STRUCTURES
- Can stack several hidden layers, e.g.:
  1. map from 1-hot words, w, to word embeddings, e (lookup)
  2. transform e to hidden state h_1 (with non-linearity)
  3. transform h_1 to hidden state h_2 (with non-linearity)
  4. repeat
  5. transform h_n to the output classification space y (with softmax)
- Each layer is typically fully connected to the next lower layer, i.e., each unit is connected to all input elements
- Depth allows for more complex functions, e.g., longer logical expressions
LEARNING WITH SGD
- How do we learn the parameters from data?
  - parameters = sets of weights, biases, embeddings
- Consider how well the model fits the training data, in terms of the probability it assigns to the correct output
  - e.g., $\prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})$ for a sequence of m words
  - want to maximise this probability, equivalently minimise its negative log
  - $-\log P$ is known as the loss function
- Trained using gradient-based methods, much like logistic regression (tools like tensorflow, theano, torch etc. use autodiff to compute gradients automatically)
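A sketch of this loss computation for a trigram-context model; `model_prob` is a stand-in for any of the models above (e.g., the count-based `prob` from the earlier sketch):

```python
import math

def nll_loss(words, model_prob):
    """Negative log-likelihood of a word sequence.

    model_prob(word, (w_prev2, w_prev1)) is assumed to return P(w_i | w_{i-2}, w_{i-1}).
    """
    tokens = ["<s>", "<s>"] + words
    loss = 0.0
    for i in range(2, len(tokens)):
        p = model_prob(tokens[i], (tokens[i - 2], tokens[i - 1]))
        loss -= math.log(p)    # minimising this maximises the sequence probability
    return loss

# e.g., nll_loss(["the", "cow", "eats", "grass"], prob) with the count-based prob sketched earlier
```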
FFNNLM
- Application to language modelling: the feed-forward neural network LM of Bengio et al. (2003)
(Figure: context words $w_{t-n+1}, \dots, w_{t-2}, w_{t-1}$ are looked up in a shared embedding matrix C, concatenated, passed through a tanh hidden layer, and a softmax output gives the i-th output $P(w_t = i \mid \text{context})$; most of the computation is in the softmax output layer)
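A PyTorch sketch in the spirit of the Bengio et al. architecture; hyperparameters, layer sizes, and names are arbitrary illustrations, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Embed n-1 context words, tanh hidden layer, softmax output over the vocabulary."""

    def __init__(self, vocab_size, context_size=2, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # the shared lookup matrix C
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)           # un-embed to vocabulary scores

    def forward(self, context_ids):
        # context_ids: (batch, context_size) word indices
        e = self.embed(context_ids).flatten(start_dim=1)          # concatenate context embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.output(h), dim=-1)          # log P(w_t | context)

# toy usage with made-up word ids
model = FeedForwardLM(vocab_size=1000)
log_probs = model(torch.tensor([[11, 42]]))                       # context (w_{t-2}=11, w_{t-1}=42)
print(log_probs.shape)                                            # torch.Size([1, 1000])
```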
RECURRENT NNLMS
- What if we structure the network differently, e.g., according to the sequence, with Recurrent Neural Networks (RNNs)?
RECURRENT NNLMS
- Start with an initial hidden state $h_0$
- For each word, $w_i$, in order i = 1..m:
  - embed the word to produce a vector, $e_i$
  - compute the hidden state $h_i = \tanh(W e_i + V h_{i-1} + b)$
  - compute the output $P(w_{i+1}) = \mathrm{softmax}(U h_i + c)$
- Train to minimise $-\sum_i \log P(w_i)$ to learn the parameters W, V, U, b, c, $h_0$
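A PyTorch sketch that writes this recurrence out explicitly, using the slide's W, V, U naming; dimensions are arbitrary, and the Linear biases play the roles of b and c:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """RNN LM: h_i = tanh(W e_i + V h_{i-1} + b), P(w_{i+1}) = softmax(U h_i + c)."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.W = nn.Linear(embed_dim, hidden_dim, bias=False)     # input-to-hidden
        self.V = nn.Linear(hidden_dim, hidden_dim)                # hidden-to-hidden (its bias is b)
        self.U = nn.Linear(hidden_dim, vocab_size)                # hidden-to-output (its bias is c)
        self.h0 = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, word_ids):
        # word_ids: (sequence_length,) indices w_1..w_m
        h = self.h0
        log_probs = []
        for e in self.embed(word_ids):                            # one word at a time
            h = torch.tanh(self.W(e) + self.V(h))                 # h_i = tanh(W e_i + V h_{i-1} + b)
            log_probs.append(torch.log_softmax(self.U(h), dim=-1))  # log P(w_{i+1} | w_1..w_i)
        return torch.stack(log_probs)

model = RNNLM(vocab_size=1000)
print(model(torch.tensor([5, 17, 3])).shape)                      # torch.Size([3, 1000])
```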
RNNS
- Can result in very deep networks, which are difficult to train due to gradient explosion or vanishing
  - variant RNNs designed to behave better, e.g., GRU, LSTM
- RNNs used widely as sentence encodings
  - an RNN processes the sentence one word at a time; its final state is used as a fixed-dimensional representation of the sentence (of any length)
  - can also run another RNN over the reversed sentence, and concatenate both final representations
- Used in translation, summarisation, generation, text classification, and more
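One possible way to build the concatenated forward/backward sentence encoding described above, sketched with PyTorch's built-in RNN; all sizes and names are illustrative:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, vocab_size = 100, 200, 1000
embed = nn.Embedding(vocab_size, embed_dim)
fwd_rnn = nn.RNN(embed_dim, hidden_dim, nonlinearity="tanh")
bwd_rnn = nn.RNN(embed_dim, hidden_dim, nonlinearity="tanh")

def encode_sentence(word_ids):
    """Fixed-size sentence encoding: final forward state concatenated with final backward state."""
    e = embed(word_ids).unsqueeze(1)                       # (seq_len, batch=1, embed_dim)
    _, h_fwd = fwd_rnn(e)                                  # final hidden state, left-to-right
    _, h_bwd = bwd_rnn(e.flip(0))                          # final hidden state over the reversed sentence
    return torch.cat([h_fwd.squeeze(), h_bwd.squeeze()])   # (2 * hidden_dim,)

print(encode_sentence(torch.tensor([5, 17, 3, 9])).shape)  # torch.Size([400])
```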
FINAL WORDS
- NNet models
  - robust to word variation, typos, etc.
  - excellent generalization, especially RNNs
  - flexible: forms the basis for many other models (translation, summarization, generation, tagging, etc.)
- Cons
  - much slower than counts, but hardware acceleration helps
  - need to limit the vocabulary; not so good with rare words
  - not as good at memorizing fixed sequences
  - data hungry; not so good on tiny data sets
REQUIRED READING
- J&M3 Ch. 8
- Neubig 2017, Neural Machine Translation and Sequence-to-sequence Models: A Tutorial, Sections 5 & 6