Generative Models for Sentences

Generative Models for Sentences. Amjad Almahairi, PhD student. August 16th, 2014

Outline: 1. Motivation (language modelling, full sentence embeddings); 2. Approach (Bayesian networks, Variational Autoencoders (VAE), VAE variants for modelling sentences); 3. Preliminary results

Motivation 1: Language Modelling. Traditional approaches to language modelling are mainly based on an approximation of the chain rule: $P(w_0, w_1, \dots, w_n) \approx \prod_{i=0}^{n} P(w_i \mid w_{i-1}, \dots, w_{i-C})$. We end up learning a model of a word given its previous context. Do we take into account the global coherence of the sentence?
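For concreteness, a minimal sketch (my illustration, not from the talk) of this fixed-context factorization; the toy probability table and the `<s>`/`</s>` padding tokens are assumptions:

```python
import math

# Hypothetical toy conditional probabilities P(w_i | previous C=2 words).
# In a real model these would be estimated from a corpus or by a neural LM.
cond_prob = {
    ("<s>", "<s>", "the"): 0.4,
    ("<s>", "the", "cat"): 0.2,
    ("the", "cat", "sat"): 0.3,
    ("cat", "sat", "</s>"): 0.5,
}

def sentence_log_prob(words, C=2):
    """Log P(w_0..w_n) under the fixed-context (Markov) approximation."""
    padded = ["<s>"] * C + words + ["</s>"]
    logp = 0.0
    for i in range(C, len(padded)):
        context = tuple(padded[i - C:i])
        p = cond_prob.get(context + (padded[i],), 1e-8)  # crude floor for unseen events
        logp += math.log(p)
    return logp

print(sentence_log_prob(["the", "cat", "sat"]))
```

Each word only sees its C-word context, which is exactly why global sentence coherence is not modelled directly.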

Motivation 1: Language Modelling. Traditional approaches to language modelling are mainly based on an approximation of the chain rule: $P(w_0, w_1, \dots, w_n) \approx \prod_{i=0}^{n} P(w_i \mid w_{i-1}, \dots, w_{i-C})$. Intuitively, people map an internal semantic form into a syntactic form, which is then linearized into words. (Slide figure: "some idea" mapped to a syntactic form, then linearized into a sequence of words.)

Motivation 2: Sentence Embeddings. Word embeddings have been very successful in many NLP tasks: train a model on a very general task so that it finds a good representation for words, then use the embeddings in another task, possibly fine-tuning them. We would like to do the same for sentences: learn a fixed representation that encodes syntax and semantics. This can be very useful for tasks that condition on the full sentence (e.g. machine translation). (Slide figure: words mapped to a word embedding.)

Goals. Learn a joint probabilistic model P(X, Z) of sentences and representations, and query it in both directions. Given a representation Z, what is X ~ P(X | Z)? (Generate new sentences.) Given a sentence X, what is Z ~ P(Z | X)? (Use Z for another task, or use Z to generate similar sentences from P(X | Z).) Find whether a given X is probable under P(X), or obtain an estimate of P(X). (Slide figure: X and Z, analogous to words and their word embedding.)

Bayesian Networks with latent variables. Directed probabilistic graphical models; a causal model in which influence flows from cause to effect: $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Pa}(x_i))$. It is easy to generate unbiased samples (ancestral sampling), but it is very hard to infer the state of the latent variables or to sample from the posterior; consequently, learning is very hard too. (Slide figure: a latent representation node with arrows to the observed words.)
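As an illustration (not from the talk), a minimal ancestral-sampling sketch for a toy directed model with one latent "topic" variable generating the observed words; the prior, vocabulary, and conditional tables below are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z ~ Categorical(prior), each word w_t | z ~ Categorical(word_probs[z]).
prior = np.array([0.5, 0.5])                     # two latent "topics"
vocab = ["yes", "no", "said", "the", "man"]
word_probs = np.array([
    [0.4, 0.4, 0.1, 0.05, 0.05],                 # topic 0: exclamations
    [0.05, 0.05, 0.3, 0.3, 0.3],                 # topic 1: narration
])

def ancestral_sample(n_words=4):
    """Sample parents before children: first z, then each word given z."""
    z = rng.choice(len(prior), p=prior)
    words = [vocab[rng.choice(len(vocab), p=word_probs[z])] for _ in range(n_words)]
    return z, words

print(ancestral_sample())
```

Sampling forward like this is trivial; the hard direction is inferring z from an observed sentence, which is what motivates the VAE below.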

Variational Autoencoders (VAE), Kingma and Welling 2014. Defined for a very general setting: X are observed variables (continuous/discrete), Z are latent variables (continuous), and $p_\theta(Z \mid X)$ is an intractable posterior. (Slide figure: graphical model with parameters $\phi$ and $\theta$ linking X and Z.) The VAE deals with the inference problem by learning an approximate (but tractable) posterior $q_\phi(Z \mid X)$. Using $q_\phi(Z \mid X)$ we can define a lower bound on $\log p_\theta(X)$: $\mathcal{L}(x) = -D_{KL}\big(q_\phi(Z \mid X) \,\|\, p_\theta(Z)\big) + \mathbb{E}_{q_\phi(Z \mid X)}\big[\log p_\theta(X \mid Z)\big]$.
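For reference, the standard identity (not spelled out on the slide) that makes $\mathcal{L}(x)$ a lower bound on the marginal log-likelihood:

$$\log p_\theta(x) = \mathcal{L}(x) + D_{KL}\big(q_\phi(Z \mid x)\,\|\,p_\theta(Z \mid x)\big) \;\ge\; \mathcal{L}(x),$$

since the KL divergence is non-negative; the bound is tight exactly when $q_\phi(Z \mid x)$ matches the true posterior.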

Variational Autoencoders (VAE), Kingma and Welling 2014. The new idea here is the reparameterization trick: for $Z \sim q_\phi(Z \mid X)$, assume $Z = g_\phi(X, \varepsilon)$ where $\varepsilon \sim p(\varepsilon)$ is independent noise. Now we can write $\mathbb{E}_{q_\phi(Z \mid X)}[\log p_\theta(X \mid Z)] = \mathbb{E}_{p(\varepsilon)}\big[\log p_\theta(X \mid g_\phi(X, \varepsilon))\big] \approx \frac{1}{L}\sum_{l=1}^{L} \log p_\theta\big(X \mid g_\phi(X, \varepsilon^{(l)})\big)$. So we can back-propagate through the model and optimize the lower bound w.r.t. $\theta$ and $\phi$.
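A minimal sketch of the reparameterization trick for a diagonal-Gaussian $q_\phi(Z \mid X)$, written in PyTorch (my choice of framework, not the talk's); the dummy decoder log-likelihood is a stand-in:

```python
import torch

def reparameterize(mu, logvar, n_samples=1):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so gradients flow to mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn(n_samples, *mu.shape)
    return mu + std * eps  # shape: (n_samples, batch, latent_dim)

# Toy check: Monte Carlo estimate of E_q[log p(x | z)] with a dummy decoder term.
mu = torch.zeros(2, 8, requires_grad=True)       # batch of 2, latent_dim 8
logvar = torch.zeros(2, 8, requires_grad=True)
z = reparameterize(mu, logvar, n_samples=10)
fake_log_lik = -(z ** 2).mean()                  # stand-in for log p_theta(x | z)
fake_log_lik.backward()                          # gradients reach mu and logvar
print(mu.grad.shape, logvar.grad.shape)
```

Because the randomness is pushed into $\varepsilon$, the Monte Carlo estimate is differentiable with respect to both $\phi$ (through mu, logvar) and $\theta$ (through the decoder).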

Variational Autoencoders (VAE), Kingma and Welling 2014. In a VAE, a neural network is used to parameterize $q_\phi(Z \mid X)$ and $p_\theta(X \mid Z)$. (Slide figure: an encoder network $q_\phi(Z \mid X)$ mapping X to Z and a decoder network $p_\theta(X \mid Z)$ mapping Z back to X.) In our case X is a variable-length sentence, and plain feed-forward NNs can't deal with that.
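To make the parameterization concrete, a minimal fixed-size Gaussian VAE sketch in PyTorch (my own illustration; the layer sizes and the Bernoulli decoder are assumptions, not the talk's sentence model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, x_dim=100, h_dim=64, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)          # encoder: parameterizes q_phi(Z | X)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Linear(z_dim, h_dim)          # decoder: parameterizes p_theta(X | Z)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        logits = self.dec_out(torch.tanh(self.dec(z)))
        # Negative ELBO: reconstruction term + analytic KL(q || N(0, I)).
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())
        return (recon + kl) / x.size(0)

model = ToyVAE()
x = torch.rand(8, 100).round()                      # dummy binary data
loss = model(x)
loss.backward()
print(float(loss))
```

The solutions below replace these simple linear encoder/decoder layers with architectures that can handle variable-length sentences.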

Solution 1: Tree-structured VAE. Use a recursive NN: combine nodes in a tree structure according to the sentence's parse tree. This requires a pre-specified tree structure, for both inference and generation! A tree can be very deep, O(#words), and the depth of the full model is depth($q_\phi(Z \mid X)$) + depth($p_\theta(X \mid Z)$). (Slide figure: a tree-shaped encoder $q_\phi(Z \mid X)$ feeding a tree-shaped decoder $p_\theta(X \mid Z)$.)
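A minimal sketch of the recursive composition step over a parse tree (an illustration with an assumed tanh composition function, not the exact architecture from the talk):

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """Bottom-up recursive NN: each internal node combines its two children."""
    def __init__(self, dim=32, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.compose = nn.Linear(2 * dim, dim)

    def encode(self, tree):
        # A tree is either a word id (leaf) or a (left_subtree, right_subtree) pair.
        if isinstance(tree, int):
            return self.embed(torch.tensor(tree))
        left, right = tree
        h = torch.cat([self.encode(left), self.encode(right)], dim=-1)
        return torch.tanh(self.compose(h))

enc = RecursiveEncoder()
# Parse tree for a 4-word phrase: ((w0 w1) (w2 w3)); word ids are arbitrary.
root = enc.encode(((3, 17), (42, 7)))
print(root.shape)   # a fixed-size vector for the whole phrase
```

The recursion depth equals the parse-tree depth, which for skewed trees grows linearly with the number of words.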

Solution 2: Pyramid-structured VAE. Use the Gated Recursive Convolutional Network: it recursively runs a binary convolution, and the activation of each node is a weighted sum of a new candidate activation, the left child, and the right child. It is very deep (always #words - 1 layers), but gating can help by shortcutting paths. Gating can be seen as a way to learn a (soft) tree structure. (Slide figure: pyramid-shaped encoder $q_\phi(Z \mid X)$ and decoder $p_\theta(X \mid Z)$.)
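A minimal sketch of one gated binary-convolution step in the spirit of the gated recursive convolutional unit (the exact gate parameterization below is my assumption for illustration):

```python
import torch
import torch.nn as nn

class GatedBinaryConv(nn.Module):
    """One pyramid layer: each output mixes a new candidate with its two children."""
    def __init__(self, dim=32):
        super().__init__()
        self.candidate = nn.Linear(2 * dim, dim)
        self.gates = nn.Linear(2 * dim, 3)   # weights for (candidate, left, right)

    def forward(self, h):                    # h: (seq_len, dim)
        left, right = h[:-1], h[1:]          # adjacent pairs -> seq_len - 1 outputs
        pair = torch.cat([left, right], dim=-1)
        cand = torch.tanh(self.candidate(pair))
        w = torch.softmax(self.gates(pair), dim=-1)   # convex combination weights
        return w[:, 0:1] * cand + w[:, 1:2] * left + w[:, 2:3] * right

layer = GatedBinaryConv()
h = torch.randn(4, 32)                       # 4 word vectors
while h.size(0) > 1:                         # #words - 1 applications reach the root
    h = layer(h)
print(h.shape)                               # (1, 32): sentence representation
```

When a gate puts most weight on one child, the unit effectively copies that child upward, which is the "shortcut" that lets the network mimic a tree over the flat pyramid.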

MSR Sentence Completion task. Given a sentence missing a word, select the correct replacement from five alternatives: ``I have seen it on him, and could ___ to it.'' 1. write 2. migrate 3. climb 4. swear 5. contribute. The test set is a public dataset of 1,040 sentences derived from 19th-century novels. The training set is also a collection of 19th-century novels, with 2.2M sentences and 46M words.

Experiments on short phrases. Dataset: 70K four-word phrases from the Holmes dataset (19th-century novels). Samples of training data: `` gracious goodness! `` of course! asked mr. swift. will you? asked Tom again. questioned the boy.

Experiments on short phrases. Trained model #2, no tricks. Samples: eat marcia umbrella '' it oh?! exclaimed out!. asked nonsense arnold! she you sat!

Experiments on short phrases. Trained model #2. No tricks: eat marcia umbrella '' it oh?! exclaimed out!. asked nonsense arnold! she you sat! Pretrained with an autoencoder (learn the $p_\theta(X \mid Z)$ parameters, fix $q_\phi(Z \mid X)$ for the first 20 epochs): `` yes! '' `` what! '' said the man. do you? ' he went on. said the voice. he he spoke.

Experiments on short phrases. Trained model #2. No tricks: eat marcia umbrella '' it oh?! exclaimed out!. asked nonsense arnold! she you sat! Pretrained with an autoencoder (learn the $p_\theta(X \mid Z)$ parameters, fix $q_\phi(Z \mid X)$ for the first 20 epochs): i whispered eagerly. i said again. the lady nodded. the lady asked. i cried indignantly. he said carelessly. `` no. ' `` yes. '' `` ah! '' `` oh! ' our eyes met. five minutes passed.

Solution 3: Tree-Convolutional VAE. A combination of the previous two approaches: iterate a convolutional layer followed by a pooling layer. This makes the depth O(log #words). (Slide figure: encoder $q_\phi(Z \mid X)$ and decoder $p_\theta(X \mid Z)$ built from alternating convolution and pooling.)
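A minimal sketch of the alternating convolution/pooling encoder (my illustration; the kernel sizes, padding, and stopping rule are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvPoolEncoder(nn.Module):
    """Alternate convolution and pooling; halving the length gives O(log #words) depth."""
    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=2, padding=1)

    def forward(self, h):                       # h: (batch, dim, seq_len)
        while h.size(-1) > 1:
            h = torch.tanh(self.conv(h))        # local binary convolution
            h = F.max_pool1d(h, kernel_size=2)  # pooling halves the sequence length
        return h.squeeze(-1)                    # (batch, dim): sentence representation

enc = ConvPoolEncoder()
words = torch.randn(1, 32, 8)                   # 8 word vectors, dim 32
print(enc(words).shape)                         # torch.Size([1, 32]) after ~log2(8) rounds
```

Each convolution-plus-pooling round roughly halves the sequence, so the number of layers grows logarithmically in the number of words instead of linearly as in the pyramid model.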

Thank you!