Towards Universal Sentence Embeddings


1 Towards Universal Sentence Embeddings
Towards Universal Paraphrastic Sentence Embeddings. J. Wieting, M. Bansal, K. Gimpel and K. Livescu, ICLR 2016
A Simple But Tough-To-Beat Baseline For Sentence Embeddings. S. Arora, Y. Liang and T. Ma, ICLR 2017
Presented by Chunyuan Li

2 Outline
1 Paragram-Phrase Embeddings
2

3 Paragram-phrase Embeddings
Goal of sentence embeddings: embed sentences into a low-dimensional space such that cosine similarity in that space corresponds to the strength of the paraphrase relationship between the sentences.
A word sequence: $x = \langle x_1, x_2, \ldots, x_n \rangle$
Model 1: Paragram-phrase (PP) embeddings
$$g_{\text{Paragram-phrase}}(x) = \frac{1}{n} \sum_{i=1}^{n} W_w^{x_i} \tag{1}$$
where $W_w^{x_i}$ is the word embedding for word $x_i$.
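A minimal sketch of Model 1, assuming the embedding matrix is a NumPy array with one row per word. The toy vocabulary, the 2-dimensional vectors, and the helper names `paragram_phrase` and `cosine` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def paragram_phrase(tokens, W_w, vocab):
    """Eq. (1): average the word embeddings of the tokens in the sequence."""
    rows = [vocab[t] for t in tokens]   # look up the row index of each word
    return W_w[rows].mean(axis=0)       # (1/n) * sum_i W_w^{x_i}

def cosine(u, v):
    """Cosine similarity, the score used to compare two sentence embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy data (assumed for illustration only).
vocab = {"a": 0, "cat": 1, "dog": 2, "sat": 3}
W_w = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.8],
                [1.0, 1.0]])

emb = paragram_phrase(["a", "cat"], W_w, vocab)
sim = cosine(emb, paragram_phrase(["a", "dog"], W_w, vocab))
```

Because the model is a plain average, the only trainable parameters are the word embeddings themselves.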

4 More Embeddings
Model 2: Adding a projection
$$g_{\text{proj}}(x) = W_p \left( \frac{1}{n} \sum_{i=1}^{n} W_w^{x_i} \right) + b \tag{2}$$
where $W_p$ is the projection matrix and $b$ is a bias vector.
Model 3: Generalization of Models 1 and 2 to multiple layers and nonlinear activation functions, i.e., the deep averaging network (DAN).
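Equation (2) can be sketched as an average followed by an affine map. The function name `g_proj` and the toy shapes below are assumptions chosen for illustration.

```python
import numpy as np

def g_proj(word_vectors, W_p, b):
    """Eq. (2): project the averaged word embeddings with W_p and add bias b."""
    avg = word_vectors.mean(axis=0)     # same average as Model 1
    return W_p @ avg + b

# Toy inputs (assumed): two words with 2-d embeddings, projected to 3-d.
word_vectors = np.array([[1.0, 2.0],
                         [3.0, 4.0]])
W_p = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])

projected = g_proj(word_vectors, W_p, b)
```

With `W_p` set to the identity and `b` to zero, the function returns the Model 1 average unchanged.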

5 More Embeddings
Model 4: Standard RNN
$$h_t = f(W_x W_w^{x_t} + W_h h_{t-1} + b) \tag{3}$$
$$g_{\text{RNN}}(x) = h_{-1} \tag{4}$$
where $\{W_x, W_h, b\}$ are the parameters of the standard RNN, $f$ is the activation function, and $h_{-1}$ is the hidden vector of the last token.
Model 5: Identity-RNN (iRNN): $W_x = W_h = I$, $b = 0$, $f(x) = x$. The output is averaged: $h_{-1}/n$.
Intuition: a richer architecture that can take word order into account.
Model 6: Replace the standard RNN module in Model 4 with an LSTM.
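The recurrence in Eq. (3) and the identity settings of Model 5 can be sketched as follows. The zero initial hidden state and the helper names `rnn_embed` and `irnn_embed` are assumptions; the slides do not state how the hidden state is initialized.

```python
import numpy as np

def rnn_embed(word_vectors, W_x, W_h, b, f=np.tanh):
    """Eqs. (3)-(4): run the recurrence and return the last hidden vector."""
    h = np.zeros(W_h.shape[0])          # assumed zero initial state
    for v in word_vectors:
        h = f(W_x @ v + W_h @ h + b)
    return h

def irnn_embed(word_vectors):
    """Model 5 with W_x = W_h = I, b = 0, f(x) = x, output divided by n."""
    d = word_vectors.shape[1]
    identity = np.eye(d)
    h_last = rnn_embed(word_vectors, identity, identity, np.zeros(d), f=lambda z: z)
    return h_last / len(word_vectors)

word_vectors = np.array([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])
irnn_out = irnn_embed(word_vectors)
rnn_out = rnn_embed(word_vectors, 0.1 * np.eye(2), 0.1 * np.eye(2), np.zeros(2))
```

With exactly these identity settings the recurrence is a running sum, so `irnn_out` coincides with the plain average of Model 1; word order only matters once the parameters move away from the identity.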

6 Training
Notation:
1 $W_w$: trainable word embedding parameters
2 $W_c$: all other trainable parameters (the "compositional parameters")
3 Training data: a set $X$ of phrase pairs $\langle x_1, x_2 \rangle$; $t_1$ and $t_2$ are carefully selected negative examples (explained later)
Training objective (intuitions for each term):
1 First two terms: the two phrases should be more similar to each other, as measured by $\cos(g(x_1), g(x_2))$, than either is to its respective negative example $t_1$ or $t_2$, by a margin of at least $\delta$.
2 Third term, on $W_c$: weight decay.
3 Last term, on $W_w$: the word embeddings should not move far from the pretrained embeddings $W_{w_{\text{initial}}}$ learned from large corpora.
Dataset: Paraphrase Database (PPDB; Ganitkevitch et al., 2013).
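The four terms described above can be sketched as a hinge loss plus two squared-norm penalties. This is one reading of the description, not a transcription of the slide's equation: the regularization weights `lam_c` and `lam_w`, the use of squared L2 norms, and the function names are assumptions.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_loss(g_x1, g_x2, g_t1, g_t2, delta):
    """First two terms: each phrase must beat its negative example by margin delta."""
    pos = cos(g_x1, g_x2)
    term1 = max(0.0, delta - pos + cos(g_x1, g_t1))
    term2 = max(0.0, delta - pos + cos(g_x2, g_t2))
    return term1 + term2

def regularizer(W_c, W_w, W_w_initial, lam_c, lam_w):
    """Third term: weight decay on W_c. Last term: keep W_w near W_w_initial."""
    return lam_c * np.sum(W_c ** 2) + lam_w * np.sum((W_w_initial - W_w) ** 2)

# Toy embeddings (assumed): a well-separated pair and a badly separated pair.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
easy = pair_loss(e1, e1, e2, e2, delta=0.4)   # positives identical, negatives orthogonal
hard = pair_loss(e1, e2, e1, e2, delta=0.4)   # negatives closer than the positive
reg = regularizer(np.ones((2, 2)), np.zeros((3, 2)), np.ones((3, 2)), lam_c=0.5, lam_w=0.1)
```

The loss is zero once both margins are satisfied, so well-separated pairs stop contributing gradient.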

7 Selecting negative examples
Two methods:
1 MAX: chooses the most similar phrase $t_1$ in some set of phrases (other than those in the given phrase pair $\langle x_1, x_2 \rangle$):
$$t_1 = \arg\max_{t:\ \langle t, \cdot \rangle \in X_b \setminus \{\langle x_1, x_2 \rangle\}} \cos(g(x_1), g(t)) \tag{5}$$
where $X_b \subseteq X$ is the current mini-batch.
2 MIX: selects negative examples using MAX with probability 0.5 and selects them randomly from the mini-batch otherwise.
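A small sketch of the two selection rules. It assumes `candidates` already holds the embeddings of the mini-batch phrases with the current pair excluded; the function names and the toy vectors are illustrative.

```python
import numpy as np

def max_negative(anchor, candidates):
    """MAX, Eq. (5): index of the candidate most cosine-similar to the anchor."""
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmax(c @ a))

def mix_negative(anchor, candidates, rng):
    """MIX: use MAX with probability 0.5, otherwise pick a candidate at random."""
    if rng.random() < 0.5:
        return max_negative(anchor, candidates)
    return int(rng.integers(len(candidates)))

anchor = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0],
                       [0.9, 0.1],
                       [-1.0, 0.0]])
rng = np.random.default_rng(0)

hardest = max_negative(anchor, candidates)
mixed = mix_negative(anchor, candidates, rng)
```

MAX always returns the hardest negative in the batch, while MIX trades some of that difficulty for variety.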

8 Results on transfer learning
Figure: textual similarity (Pearson's r × 100).
Performance ranking: PP, Proj. > iRNN > LSTM > others


9 Results as initialization and regularization
Tasks: the SICK similarity task, the SICK entailment task, and the Stanford Sentiment Treebank (SST) binary classification task.
Initialize each respective model with the parameters learned from PPDB.
Regularize the task-dependent objective using the learned parameters.
Use the embeddings as features, without updating them.

10 Re-thinking Paragram-phrase embeddings
Pros: simply averaging word embeddings achieves impressive results.
Cons: paired phrases are needed for training.
Towards universal sentence embeddings with large unlabeled training sets?

11 Random walk model for word embeddings
Latent variable generative model for text (Arora et al., 2016).
Latent variables:
1 A discourse vector $c_t \in \mathbb{R}^d$ representing "what is being talked about";
2 A vector representation $v_w$ of each word $w$.
The probability of observing a word $w$ at time $t$:
$$\Pr[w \text{ emitted at time } t \mid c_t] \propto \exp(\langle c_t, v_w \rangle) \tag{6}$$
The discourse vector $c_t$ does a slow random walk, so that nearby words are generated under similar discourses. The model produces behavior that fits empirical methods like word2vec in terms of word-word co-occurrence probabilities.
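The generative process can be sketched as a softmax emission followed by a small step of the discourse vector. The Gaussian step, the `step_size` parameter, the random unit-norm starting vector, and the function names are assumptions; the slides only say that the walk is slow.

```python
import numpy as np

def emission_probs(c_t, V):
    """Eq. (6): probabilities proportional to exp(<c_t, v_w>) over the vocabulary."""
    scores = V @ c_t
    scores = scores - scores.max()      # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def generate(V, steps, step_size, rng):
    """Sample word indices while the discourse vector drifts slowly."""
    d = V.shape[1]
    c = rng.standard_normal(d)
    c /= np.linalg.norm(c)
    words = []
    for _ in range(steps):
        words.append(int(rng.choice(len(V), p=emission_probs(c, V))))
        c = c + step_size * rng.standard_normal(d)   # assumed small Gaussian step
    return words

V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])
probs = emission_probs(np.array([2.0, 0.0]), V)
sample = generate(V, steps=10, step_size=0.05, rng=np.random.default_rng(0))
```

Words whose vectors align with the current discourse vector receive the highest probability, which is why neighbouring words tend to be related.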

12 Random walk model for sentence embeddings
A single discourse vector $c_s$ governs a sentence.
The probability of observing a word $w$ in sentence $s$:
$$\Pr[w \text{ emitted in sentence } s \mid c_s] = \alpha p(w) + (1-\alpha) \frac{\exp(\langle \tilde{c}_s, v_w \rangle)}{Z_{\tilde{c}_s}} \tag{7}$$
where $\tilde{c}_s = \beta c_0 + (1-\beta) c_s$, $c_0 \perp c_s$, and $Z_{\tilde{c}_s}$ is the normalizing constant.
Two smoothing terms allow a word $w$ unrelated to the discourse $c_s$ to be emitted ($\alpha = \beta = 0$ reduces to the original model):
1 $c_0$ is introduced as a common discourse, accounting for some frequent words (presumably "the", "and", etc.)
2 $p(w)$ allows some words to occur out of context, even if they have low inner products with $c_s$
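Equation (7) can be evaluated directly for a toy vocabulary. The sketch assumes `p_unigram` is a probability vector over the vocabulary playing the role of p(w); the function name and all numbers are illustrative.

```python
import numpy as np

def sentence_word_probs(c_s, c_0, V, p_unigram, alpha, beta):
    """Eq. (7): mix the unigram term with a softmax over the shifted discourse."""
    c_tilde = beta * c_0 + (1.0 - beta) * c_s
    scores = V @ c_tilde
    softmax = np.exp(scores - scores.max())
    softmax /= softmax.sum()            # exp(<c_tilde, v_w>) / Z_{c_tilde}
    return alpha * p_unigram + (1.0 - alpha) * softmax

V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])
p_unigram = np.array([0.7, 0.2, 0.1])
c_s = np.array([1.0, 0.0])
c_0 = np.array([0.0, 1.0])              # orthogonal to c_s, as the model requires

smoothed = sentence_word_probs(c_s, c_0, V, p_unigram, alpha=0.3, beta=0.2)
original = sentence_word_probs(c_s, c_0, V, p_unigram, alpha=0.0, beta=0.0)
```

Setting both `alpha` and `beta` to zero recovers the plain log-linear emission of the word-level model.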

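13 Computing sentence embeddings
Assume that the word vectors $v_w$ are roughly uniformly dispersed; then $Z_{\tilde{c}_s}$ is roughly the same for all $\tilde{c}_s$ and is denoted $Z$.
$$L = \log p[s \mid c_s] = \sum_{w \in s} \log p(w \mid c_s) = \sum_{w \in s} f_w(\tilde{c}_s) \tag{8}$$
$$f_w(\tilde{c}_s) = \log\left[\alpha p(w) + (1-\alpha) \frac{\exp(\langle \tilde{c}_s, v_w \rangle)}{Z}\right] \tag{9}$$
By Taylor expansion around $\tilde{c}_s = 0$:
$$f_w(\tilde{c}_s) \approx \text{constant} + \frac{(1-\alpha)/(\alpha Z)}{p(w) + (1-\alpha)/(\alpha Z)} \langle \tilde{c}_s, v_w \rangle \tag{10}$$
Maximum likelihood estimator for $\tilde{c}_s$:
$$\arg\max_{\tilde{c}_s} \sum_{w \in s} f_w(\tilde{c}_s) \propto \sum_{w \in s} \frac{a}{p(w) + a} v_w, \quad \text{where } a = \frac{1-\alpha}{\alpha Z} \tag{11}$$
Estimate $c_0$ by computing the first principal component of the $\tilde{c}_s$'s over a set of sentences.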
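The linear coefficient in the Taylor step can be checked numerically: the gradient of the per-word log-likelihood at zero should equal a / (p(w) + a) times the word vector. The values of `alpha`, `Z`, `p_w`, and `v_w` below are arbitrary assumptions chosen only for the check.

```python
import numpy as np

def f_w(c, v_w, p_w, alpha, Z):
    """Per-word log-likelihood: log[alpha*p(w) + (1-alpha)*exp(<c, v_w>)/Z]."""
    return np.log(alpha * p_w + (1.0 - alpha) * np.exp(c @ v_w) / Z)

def linear_coeff(p_w, alpha, Z):
    """Coefficient of <c, v_w> in the first-order expansion: a / (p(w) + a)."""
    a = (1.0 - alpha) / (alpha * Z)
    return a / (p_w + a)

v_w = np.array([0.3, -0.2])
p_w, alpha, Z = 0.01, 0.5, 50.0

# Central finite differences approximate the gradient of f_w at c = 0.
eps = 1e-6
grad = np.array([
    (f_w(eps * e, v_w, p_w, alpha, Z) - f_w(-eps * e, v_w, p_w, alpha, Z)) / (2 * eps)
    for e in np.eye(len(v_w))
])
predicted = linear_coeff(p_w, alpha, Z) * v_w
```

The weight shrinks as p(w) grows, which is what down-weights frequent words in the resulting average.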

14 Algorithm
1 Compute the weighted average of the word vectors in each sentence, using smooth inverse frequency (SIF) weights.
2 Remove the projections of the average vectors on their first principal component ("common component removal").
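The two steps can be sketched with NumPy. The sketch assumes the first principal component is taken as the first right singular vector of the uncentered sentence-embedding matrix, and that `a = 1e-3` is a reasonable default; the toy word vectors, the frequencies, and the function name `sif_embeddings` are illustrative.

```python
import numpy as np

def sif_embeddings(sentences, vectors, p, a=1e-3):
    """SIF-weighted averages followed by common component removal."""
    # Step 1: weighted average with weights a / (a + p(w)).
    emb = np.stack([
        np.mean([a / (a + p[w]) * vectors[w] for w in sent], axis=0)
        for sent in sentences
    ])
    # Step 2: first principal direction of the sentence embeddings (assumed: via SVD).
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    # Subtract each embedding's projection onto u.
    return emb - np.outer(emb @ u, u), u

vectors = {"the": np.array([1.0, 0.1]),
           "cat": np.array([0.2, 1.0]),
           "dog": np.array([0.3, 0.9]),
           "runs": np.array([0.8, -0.5])}
p = {"the": 0.5, "cat": 0.01, "dog": 0.01, "runs": 0.005}
sentences = [["the", "cat", "runs"], ["the", "dog", "runs"], ["the", "cat"]]

sent_emb, common = sif_embeddings(sentences, vectors, p)
```

No parameters are trained here: the method only needs pretrained word vectors and word frequency estimates from an unlabeled corpus.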

15 Results
Transfer learning and supervised learning results:
(a) Random walk (WR)-based sentence embeddings are a more effective way to use word embeddings and can achieve state-of-the-art results.
(b) WR-based initialization is significantly better for downstream supervised tasks.

16 Summary
Simple manipulation of word embeddings leads to state-of-the-art sentence embeddings:
1 Averaging (perhaps with linear projection)
2 Weighted averaging: smooth inverse frequency, common component removal
