Generative Models for Sentences
Amjad Almahairi, PhD student
August 16th, 2014
Outline
1. Motivation
   - Language modelling
   - Full sentence embeddings
2. Approach
   - Bayesian networks
   - Variational Autoencoders (VAE)
   - VAE variants for modelling sentences
3. Preliminary results
Motivation 1: Language Modelling
Traditional approaches to language modelling are mainly based on an approximation of the chain rule:
$$P(w_0, w_1, \ldots, w_n) \approx \prod_{i=0}^{n} P(w_i \mid w_{i-1}, \ldots, w_{i-C})$$
We end up learning a model of a word given its previous context. Do we take into account the global coherence of the sentence?
Motivation 1: Language Modelling
Traditional approaches to language modelling are mainly based on an approximation of the chain rule:
$$P(w_0, w_1, \ldots, w_n) \approx \prod_{i=0}^{n} P(w_i \mid w_{i-1}, \ldots, w_{i-C})$$
Intuitively, people map an internal semantic form (some idea) into a syntactic form, which is then linearized into words.
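As a concrete picture of the traditional factorization above, here is a minimal count-based sketch with a context window of C previous words. The helper names and the toy corpus are purely illustrative, not part of the original material:

```python
from collections import defaultdict

C = 2  # context window size (illustrative choice)

def train_ngram(sentences):
    """Count (context, word) pairs, where the context is the previous C words."""
    counts = defaultdict(lambda: defaultdict(int))
    for words in sentences:
        padded = ["<s>"] * C + words
        for i in range(C, len(padded)):
            counts[tuple(padded[i - C:i])][padded[i]] += 1
    return counts

def prob(counts, context, word):
    """P(word | context) by maximum likelihood; 0 if the context was never seen."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

def sentence_prob(counts, words):
    """Approximate P(w_0, ..., w_n) as the product of the local conditionals."""
    padded = ["<s>"] * C + words
    p = 1.0
    for i in range(C, len(padded)):
        p *= prob(counts, tuple(padded[i - C:i]), padded[i])
    return p

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]  # toy data
model = train_ngram(corpus)
print(sentence_prob(model, ["the", "dog", "sleeps"]))
```

The point of the motivation is that such a model only ever sees a local window, so nothing in it scores the coherence of the sentence as a whole.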
Motivation 2: Sentence Embeddings
Word embeddings have been very successful in many NLP tasks:
- Train a model on a very general task such that it finds a good representation for words
- Use them in another task, and possibly fine-tune them
We would like to do the same for sentences:
- Learn a fixed representation that encodes syntax and semantics
- This can be very useful for tasks that condition on the full sentence (e.g. machine translation)
Goals
Learn a joint probabilistic model P(X, Z) of sentences and representations, and query it in both directions:
- Given a representation Z, what is X ~ P(X | Z)? Generate new sentences.
- Given a sentence X, what is Z ~ P(Z | X)? Use Z for another task, or use Z to generate similar sentences from P(X | Z).
- Find whether a given X is probable under P(X), or obtain an estimate of P(X).
Bayesian Networks with Latent Variables
Directed probabilistic graphical models.
Causal model: models flow from cause (latent representation) to effect (observed words).
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Pa}(x_i))$$
It is easy to generate unbiased samples (ancestral sampling), but very hard to infer the state of the latent variables or to sample from the posterior. Consequently, learning is very hard too.
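A minimal sketch of ancestral sampling in a toy directed model Z → X (the specific distributions here are hypothetical): draw each node given its already-sampled parents, from the root downward.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ancestral():
    """Ancestral sampling: sample each variable given its already-sampled parents."""
    z = rng.normal(0.0, 1.0)          # root latent variable: Z ~ N(0, 1)
    x = rng.normal(2.0 * z, 0.5)      # observed child: X | Z ~ N(2z, 0.5) (toy conditional)
    return z, x

print([sample_ancestral() for _ in range(5)])
```

Sampling forward like this is trivial; the hard direction is inferring Z from an observed X, which is exactly the problem the VAE's approximate posterior addresses next.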
Variational Autoencoders (VAE) (Kingma and Welling, 2014)
Defined for a very general setting:
- X: observed variables (continuous/discrete)
- Z: latent variables (continuous)
- p_θ(Z | X): intractable posterior
The VAE deals with the inference problem by learning an approximate (but tractable) posterior q_φ(Z | X).
Using q_φ(Z | X) we can define a lower bound on log p_θ(X):
$$\mathcal{L}(x) = -D_{KL}\big(q_\phi(Z \mid X) \,\|\, p_\theta(Z)\big) + E_{q_\phi(Z \mid X)}\big[\log p_\theta(X \mid Z)\big]$$
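For the common choice (as in Kingma and Welling) where q_φ(Z | X) is a diagonal Gaussian and p_θ(Z) is a standard normal, the KL term of the bound has a closed form. A small sketch, with toy posterior parameters that are purely illustrative:

```python
import numpy as np

def kl_diag_gaussian_vs_std_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

# Toy posterior parameters for a 3-dimensional latent Z
mu = np.array([0.2, -0.1, 0.5])
log_var = np.array([-0.3, 0.1, -0.2])
print(kl_diag_gaussian_vs_std_normal(mu, log_var))
```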
Variational Autoencoders (VAE) (Kingma and Welling, 2014)
The new idea here is the reparameterization trick: for Z ~ q_φ(Z | X), assume Z = g_φ(X, ε), where ε ~ p(ε) is independent noise.
Now we can write:
$$E_{q_\phi(Z \mid X)}\big[\log p_\theta(X \mid Z)\big] = E_{p(\varepsilon)}\big[\log p_\theta(X \mid g_\phi(X, \varepsilon))\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(X \mid g_\phi(X, \varepsilon^{(l)})\big)$$
So we can back-propagate through the model and optimize the lower bound w.r.t. θ and φ.
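A sketch of the trick for a Gaussian q_φ(Z | X): instead of sampling Z directly, sample noise ε and transform it deterministically, so the Monte Carlo estimate stays differentiable in the variational parameters. The tiny Gaussian decoder below is a hypothetical stand-in for p_θ(X | Z):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, L=10):
    """Draw L samples Z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=(L, mu.shape[0]))     # noise, independent of phi
    return mu + np.exp(0.5 * log_var) * eps     # differentiable in mu and log_var

def log_p_x_given_z(x, z, W):
    """Toy Gaussian decoder X | Z ~ N(Wz, I); log-density up to a constant."""
    return -0.5 * np.sum((x - W @ z) ** 2)

# Monte Carlo estimate of E_q[log p(X|Z)] with L reparameterized samples
mu, log_var = np.array([0.1, -0.2]), np.array([-0.5, 0.0])
W = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
x = np.array([0.3, 0.1, -0.2])
zs = reparameterize(mu, log_var, L=10)
print(np.mean([log_p_x_given_z(x, z, W) for z in zs]))
```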
Variational Autoencoders (VAE) (Kingma and Welling, 2014)
In a VAE, neural networks are used to parameterize q_φ(Z | X) and p_θ(X | Z).
In our case X is a variable-length sentence, and standard feed-forward NNs can't deal with that.
Solution 1: Tree-Structured VAE
Use a recursive NN: combine nodes in a tree structure according to the sentence parse tree.
Requires a pre-specified tree structure, for both inference q_φ(Z | X) and generation p_θ(X | Z)!
A tree can be very deep: O(#words). The depth of the full model is depth(q_φ(Z | X)) + depth(p_θ(X | Z)).
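A minimal sketch of the recursive composition step, assuming a standard recursive NN in which a parent vector is computed from its two children with a shared weight matrix. The parse tree, dimensions, and embeddings below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # embedding dimension (illustrative)
W = rng.normal(scale=0.1, size=(d, 2 * d))   # shared composition weights

def compose(left, right):
    """Parent representation from its two children: tanh(W [left; right])."""
    return np.tanh(W @ np.concatenate([left, right]))

def encode(tree, embeddings):
    """Recursively encode a binary parse tree, given leaf word embeddings."""
    if isinstance(tree, str):                # leaf: a word
        return embeddings[tree]
    left, right = tree
    return compose(encode(left, embeddings), encode(right, embeddings))

embeddings = {w: rng.normal(size=d) for w in ["the", "dog", "barks"]}
parse = (("the", "dog"), "barks")            # hypothetical parse: ((the dog) barks)
print(encode(parse, embeddings))
```

The depth of this computation is the depth of the parse tree, which in the worst case grows linearly with the number of words.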
Solution 2: Pyramid-Structured VAE
Use the Gated Recursive Convolutional Network: recursively run a binary convolution.
The activation of a node is a weighted sum of a new activation, its left child, and its right child.
Very deep (always #words - 1 layers), but gating can help by shortcutting paths.
Gating can be seen as a way to learn a (soft) tree structure.
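A sketch of one node update in this gated binary convolution: the parent is a softmax-gated mixture of a new candidate activation and its two children, and the layer is applied to all adjacent pairs so the sequence shrinks by one per layer. The weight names and dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_l = rng.normal(scale=0.1, size=(d, d))
W_r = rng.normal(scale=0.1, size=(d, d))
G = rng.normal(scale=0.1, size=(3, 2 * d))   # produces the three gate logits

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def grconv_node(h_left, h_right):
    """Gated mix of a new activation, the left child, and the right child."""
    h_new = np.tanh(W_l @ h_left + W_r @ h_right)           # candidate activation
    w_new, w_l, w_r = softmax(G @ np.concatenate([h_left, h_right]))
    return w_new * h_new + w_l * h_left + w_r * h_right     # soft structural choice

def grconv_layer(h):
    """Binary convolution over adjacent pairs: length n -> n - 1."""
    return [grconv_node(h[i], h[i + 1]) for i in range(len(h) - 1)]

# Reduce a toy 4-word sentence to a single vector (3 layers, i.e. #words - 1)
h = [rng.normal(size=d) for _ in range(4)]
while len(h) > 1:
    h = grconv_layer(h)
print(h[0])
```

Because the gates sum to one, a node can simply copy one child, which shortcuts the path through the deep stack and effectively carves out a soft tree structure.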
MSR Sentence Completion Task
Given a sentence missing a word, select the correct replacement from five alternatives:
``I have seen it on him, and could ___ to it.''
1. write  2. migrate  3. climb  4. swear  5. contribute
The test set is a public dataset of 1,040 sentences derived from 19th-century novels.
The training set is a collection of (also 19th-century) novels with 2.2M sentences and 46M words.
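Given any model that assigns a score to a full sentence, the task reduces to scoring the five filled-in candidates and keeping the best one. A hedged sketch: the unigram scorer below is a toy placeholder for log P(X) (or the VAE's lower bound), and the small corpus is purely illustrative.

```python
import math
from collections import Counter

# Toy unigram scorer standing in for log P(X); a real system would score each
# completed sentence with the trained sentence model instead.
corpus = "i could swear to it he said slowly".split()
unigram = Counter(corpus)
total, vocab = sum(unigram.values()), len(unigram)

def score_sentence(words):
    """Add-one-smoothed sum of log unigram probabilities (placeholder for log P(X))."""
    return sum(math.log((unigram[w] + 1) / (total + vocab)) for w in words)

def complete(sentence, candidates, blank="___"):
    """Fill the blank with each candidate and keep the highest-scoring completion."""
    return max(candidates, key=lambda c: score_sentence([c if w == blank else w for w in sentence]))

sentence = "i have seen it on him and could ___ to it".split()
print(complete(sentence, ["write", "migrate", "climb", "swear", "contribute"]))
```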
Experiments on Short Phrases
Dataset: 70K phrases of length 4 words from the Holmes dataset (19th-century novels).
Samples of training data:
`` gracious goodness!
`` of course!
asked mr. swift.
will you?
asked Tom again.
questioned the boy.
Experiments on Short Phrases
Trained model #2, no tricks:
eat marcia umbrella ''
it oh?!
exclaimed out!.
asked nonsense arnold!
she you sat!
Experiments on Short Phrases
Trained model #2, pretrained with an autoencoder (learn the p_θ(X | Z) parameters, fix q_φ(Z | X) for the first 20 epochs):
`` yes! ''
`` what! ''
said the man.
do you? '
he went on.
said the voice.
he he spoke.
Experiments on Short Phrases
More samples from the pretrained model:
i whispered eagerly.
i said again.
the lady nodded.
the lady asked.
i cried indignantly.
he said carelessly.
`` no. '
`` yes. ''
`` ah! ''
`` oh! '
our eyes met.
five minutes passed.
Solution 3: Tree-Convolutional VAE
A combination of the previous two approaches. Iterate:
- a convolutional layer
- a pooling layer
This makes the depth O(log #words).
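A sketch of why alternating convolution and pooling gives logarithmic depth: each pooling step roughly halves the sequence length, so only O(log #words) layer pairs are needed before a single vector remains. The weights and dimensions below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.normal(scale=0.1, size=(d, 2 * d))   # shared binary-convolution weights

def conv_layer(h):
    """Pairwise convolution over adjacent positions: length n -> n - 1."""
    return [np.tanh(W @ np.concatenate([h[i], h[i + 1]])) for i in range(len(h) - 1)]

def pool_layer(h):
    """Max-pool adjacent pairs: roughly halves the sequence length."""
    return [np.maximum(h[i], h[i + 1]) for i in range(0, len(h) - 1, 2)]

def encode(h):
    """Alternate convolution and pooling until one vector remains: O(log n) depth."""
    while len(h) > 1:
        h = conv_layer(h)
        if len(h) > 1:
            h = pool_layer(h)
    return h[0]

print(encode([rng.normal(size=d) for _ in range(8)]))
```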
Thank you!