Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287

1 Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287

2 Review: Neural Networks One-layer multi-layer perceptron architecture: NN_MLP1(x) = g(xW^1 + b^1)W^2 + b^2 (compare the perceptron, xW + b). x is the dense representation in R^{1×d_in}. W^1 ∈ R^{d_in×d_hid}, b^1 ∈ R^{1×d_hid}; first affine transformation. W^2 ∈ R^{d_hid×d_out}, b^2 ∈ R^{1×d_out}; second affine transformation. g : R^{d_hid} → R^{d_hid} is an activation non-linearity (often pointwise). g(xW^1 + b^1) is the hidden layer.

3 Review: Non-Linearities Hyperbolic Tangent: tanh(t) = (exp(t) - exp(-t)) / (exp(t) + exp(-t)). Intuition: similar to the sigmoid, but with range between -1 and 1.

4 Review: Backpropagation For each module f_{i+1}(·; θ_{i+1}) in the composition: the forward pass maps the input f_i(... f_1(x_0)) to the output f_{i+1}(f_i(... f_1(x_0))); the backward pass takes ∂L/∂f_{i+1}(... f_1(x_0)) and produces the parameter gradient ∂L/∂θ_{i+1} and the input gradient ∂L/∂f_i(... f_1(x_0)).

5 Quiz One common class of operations in neural network models is known as pooling. Informally, a pooling layer consists of an aggregation unit, typically unparameterized, that reduces the input to a smaller size. Consider three pooling functions of the form f : R^n → R: 1. f(x) = max_i x_i 2. f(x) = min_i x_i 3. f(x) = Σ_i x_i / n. What action does each of these functions perform? What are their gradients? How would you implement backpropagation for these units?

6 Quiz Max pooling: f(x) = max_i x_i. Keeps only the most activated input. Fprop is simple; however we must store the arg max (the "switch"). Bprop gradient is zero except at the switch, which receives gradOutput. Min pooling: f(x) = min_i x_i. Keeps only the least activated input. Fprop is simple; however we must store the arg min (the "switch"). Bprop gradient is zero except at the switch, which receives gradOutput. Avg pooling: f(x) = Σ_i x_i / n. Keeps the average of the input activations. Fprop is simply the mean; gradOutput is scaled by 1/n and passed to all inputs.
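A minimal numpy sketch of these fprop/bprop rules (the function names and example vector are illustrative, not from any particular framework):

```python
import numpy as np

def max_pool_forward(x):
    """Fprop: keep only the most activated input; remember its index (the 'switch')."""
    switch = int(np.argmax(x))
    return x[switch], switch

def max_pool_backward(grad_output, switch, n):
    """Bprop: gradient is zero everywhere except at the switch, which gets grad_output."""
    grad_input = np.zeros(n)
    grad_input[switch] = grad_output
    return grad_input

def avg_pool_forward(x):
    """Fprop: simply the mean of the inputs."""
    return x.mean()

def avg_pool_backward(grad_output, n):
    """Bprop: grad_output is scaled by 1/n and passed to every input."""
    return np.full(n, grad_output / n)

x = np.array([0.2, -1.5, 3.0, 0.7])
y, switch = max_pool_forward(x)                  # y = 3.0, switch = 2
print(max_pool_backward(1.0, switch, x.size))    # [0. 0. 1. 0.]
print(avg_pool_backward(1.0, x.size))            # [0.25 0.25 0.25 0.25]
```

Min pooling is identical to max pooling with argmin in place of argmax.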

9 Contents Embedding Motivation C&W Embeddings word2vec Evaluating Embeddings

10 1. Use dense representations instead of sparse 2. Use windowed area instead of sequence models 3. Use neural networks to model windowed interactions

11 What about rare words?

12 Word Embeddings Embedding layer: x^0 W^0. x^0 ∈ R^{1×d_0} is a one-hot word vector. W^0 ∈ R^{d_0×d_in}, with d_0 = |V|. Notes: d_0 >> d_in, e.g. d_0 = 10000, d_in = 50.
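Multiplying a one-hot vector by W^0 simply selects a row of W^0, so in practice the embedding layer is a table lookup. A small numpy sketch (sizes and the word index are arbitrary placeholders):

```python
import numpy as np

V, d_in = 10000, 50                        # d_0 = |V| = 10000, d_in = 50
W0 = np.random.randn(V, d_in) * 0.01       # embedding matrix, one row per word type

word_index = 42                            # index of some word in the vocabulary
x0 = np.zeros(V)                           # one-hot encoding of that word
x0[word_index] = 1.0

# The matrix product and the row lookup give the same embedding.
emb_matmul = x0 @ W0
emb_lookup = W0[word_index]
assert np.allclose(emb_matmul, emb_lookup)
```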

13 Pretraining Representations We would like strong shared representations of words. However, the PTB has only about 1M labeled words, which is relatively small. Collobert et al. (2008, 2011) use a semi-supervised method. (Close connection to Bengio et al. (2003), next topic.)

14 Semi-Supervised Training Idea: train representations separately on more data. 1. Pretrain word embeddings W^0 first. 2. Substitute them in as the first NN layer. 3. Fine-tune the embeddings for the final task: modify the first layer based on supervised gradients. (Optional; some work skips this step.)

15 Large Corpora To learn rare word embeddings we need many more tokens. C&W: English Wikipedia (631 million word tokens) + Reuters Corpus (221 million word tokens); total vocabulary size: 130,000 word types. word2vec: Google News (6 billion word tokens); total vocabulary size: 1M word types. But this data has no labels...

16 Contents Embedding Motivation C&W Embeddings word2vec Evaluating Embeddings

17 C&W Embeddings Assumption: Text in Wikipedia is coherent (in some sense). Most randomly corrupted text is incoherent. Embeddings should distinguish coherence. Common idea in unsupervised learning (distributional hypothesis).

18 C&W Setup Let V be the vocabulary of English and let s score any window of size d_win = 5. If we see the phrase [ the dog walks to the ], it should score higher under s than [ the dog house to the ], [ the dog cats to the ], [ the dog skips to the ], ...

19 C&W Setup Can estimate the score s as a windowed neural network: s(w_1, ..., w_{d_win}) = hardtanh(xW^1 + b^1)W^2 + b, with x = [v(w_1) v(w_2) ... v(w_{d_win})] (concatenated embeddings). d_in = d_win × 50, d_hid = 100, d_win = 11, d_out = 1! Example: x = [v(w_3) v(w_4) v(w_5) v(w_6) v(w_7)], where each v(w_i) contributes d_in/d_win dimensions.
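A toy numpy sketch of this windowed scorer (the embedding table and parameters are random placeholders rather than trained values, and a window of 5 is used for the example):

```python
import numpy as np

V, d_win, d_emb, d_hid = 130000, 5, 50, 100
d_in = d_win * d_emb                       # concatenated window of embeddings

rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_emb))       # word embeddings v(w)
W1 = rng.normal(0, 0.01, (d_in, d_hid))    # first affine layer
b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.01, (d_hid, 1))       # second affine layer, d_out = 1
b2 = np.zeros(1)

def score(window_indices):
    """s(w_1, ..., w_dwin) = hardtanh(x W1 + b1) W2 + b for one window of word ids."""
    x = W0[window_indices].reshape(-1)     # x = [v(w_1) ... v(w_dwin)], shape (d_in,)
    h = np.clip(x @ W1 + b1, -1.0, 1.0)    # hardtanh hidden layer
    return float(h @ W2 + b2)

good = [11, 22, 33, 44, 55]                # [the dog walks to the] as word ids (illustrative)
corrupt = [11, 22, 999, 44, 55]            # middle word replaced
print(score(good), score(corrupt))
```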

20 Training? Different setup than previous experiments: there is no direct supervision y. Train to rank good examples higher than corrupted ones.

21 Ranking Loss Given only examples {x_1, ..., x_n}, and for each example a set D(x) of corrupted alternatives: L(θ) = Σ_i Σ_{x' ∈ D(x_i)} L_ranking(s(x_i; θ), s(x'; θ)), where L_ranking(y, ŷ) = max{0, 1 - (y - ŷ)}. Example: C&W ranking with x = [the dog walks to the] and D(x) = { [the dog skips to the], [the dog in to the], ... }. (Torch nn.MarginRankingCriterion; note: slightly different setup.)
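A hedged numpy sketch of the margin ranking loss with a single sampled corruption (the corruption scheme and the stand-in scorer are illustrative assumptions, not the authors' exact setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def ranking_loss(s_good, s_corrupt):
    """L_ranking(y, y_hat) = max{0, 1 - (y - y_hat)}: zero once the good window
    scores at least a margin of 1 higher than the corrupted one."""
    return max(0.0, 1.0 - (s_good - s_corrupt))

def sampled_update_loss(window, score, V=130000):
    """Sample one element of D(x) by replacing the middle word with a random word type."""
    corrupted = list(window)
    corrupted[len(window) // 2] = int(rng.integers(V))
    return ranking_loss(score(window), score(corrupted))

dummy_score = lambda w: -abs(w[2] - 33) / 100.0   # stand-in scorer, just for illustration
print(sampled_update_loss([11, 22, 33, 44, 55], dummy_score))
```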

22 C&W Embeddings in Practice Vocabulary size |D(x)| > 100,000. Training time of 4 weeks. (Collobert is a main author of Torch.)

23 Sampling (Sketch of Wsabie (Weston, 2011)) Observation: in many contexts L_ranking(y, ŷ) = max{0, 1 - (y - ŷ)} = 0, particularly later in training. For difficult contexts, it may be easy to find an x' ∈ D(x) with L_ranking(y, ŷ) = max{0, 1 - (y - ŷ)} > 0. We can therefore sample from D(x) to find an update.

24 C&W Results 1. Use dense representations instead of sparse 2. Use windowed area instead of sequence models 3. Use neural networks to model windowed interactions 4. Use semi-supervised learning to pretrain representations.

25 Contents Embedding Motivation C&W Embeddings word2vec Evaluating Embeddings

26 word2vec Contributions: scales the embedding process to massive sizes; experiments with several architectures; empirical evaluations of embeddings; influential release of software/data. Differences with C&W: instead of an MLP, uses a (bi)linear model (called linear in the paper); instead of a ranking model, directly predicts the word (cross-entropy); various other extensions. Two different models: 1. Continuous Bag-of-Words (CBOW) 2. Continuous Skip-gram

28 word2vec (Bilinear Model) Back to a pure bilinear model, but with a much bigger output space: ŷ = softmax(((Σ_i x^0_i W^0) / (d_win - 1)) W^1). x^0_i ∈ R^{1×d_0} are the input words as one-hot vectors. W^0 ∈ R^{d_0×d_in}; d_0 = |V|, word embeddings. W^1 ∈ R^{d_in×d_out}; d_out = |V|, output embeddings. Notes: bilinear parameter interaction; d_0 >> d_in, e.g. 50 ≤ d_in ≤ 1000, |V| ≈ 1M or more.

29 word2vec (Mikolov, 2013)

30 Continuous Bag-of-Words (CBOW) ŷ = softmax(((Σ_i x^0_i W^0) / (d_win - 1)) W^1). Attempt to predict the middle word. Example: CBOW on [ the dog walks to the ]: x = (v(w_3) + v(w_4) + v(w_6) + v(w_7)) / (d_win - 1), y = δ(w_5). W^1 is no longer partitioned by row (order is lost).
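A toy numpy sketch of the CBOW forward pass and its cross-entropy loss (vocabulary size and dimensions are illustrative; the full softmax over |V| is exactly the bottleneck discussed on the later slides):

```python
import numpy as np

V, d_in = 1000, 50                          # small |V| for illustration
rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_in))         # input word embeddings
W1 = rng.normal(0, 0.01, (d_in, V))         # output embeddings

def cbow_forward(context_ids, target_id):
    """Average the context embeddings, project to the vocabulary, softmax, NLL loss."""
    x = W0[context_ids].mean(axis=0)        # (sum_i v(w_i)) / (d_win - 1)
    z = x @ W1                              # scores over the whole vocabulary
    z -= z.max()                            # shift for numerical stability
    y_hat = np.exp(z) / np.exp(z).sum()     # softmax
    return -np.log(y_hat[target_id]), y_hat

context = [3, 4, 6, 7]                      # w_3, w_4, w_6, w_7
loss, y_hat = cbow_forward(context, target_id=5)   # predict the middle word w_5
print(loss)
```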

31 Continuous Skip-gram ŷ = softmax((x^0 W^0) W^1). Also a bilinear model. Attempt to predict each context word from the middle word of [ the dog walks to the ]. Example: Skip-gram x = v(w_5), y = δ(w_3); done for each word in the window.
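For contrast, a corresponding skip-gram sketch with the same illustrative shapes (one loss term per context word):

```python
import numpy as np

V, d_in = 1000, 50
rng = np.random.default_rng(0)
W0 = rng.normal(0, 0.01, (V, d_in))         # input (center-word) embeddings
W1 = rng.normal(0, 0.01, (d_in, V))         # output (context-word) embeddings

def skipgram_loss(center_id, context_ids):
    """Predict each context word independently from the center word's embedding."""
    x = W0[center_id]
    z = x @ W1
    z -= z.max()
    y_hat = np.exp(z) / np.exp(z).sum()
    # sum the negative log-likelihoods over all context positions
    return -sum(np.log(y_hat[c]) for c in context_ids)

print(skipgram_loss(center_id=5, context_ids=[3, 4, 6, 7]))
```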

32 Additional aspects The window size d_win is sampled for each SGD step. SGD updates are made less often for frequent words. We have slightly simplified the training objective.

33 Softmax Issues Use a softmax to force a distribution: softmax(z) = exp(z) / Σ_{c ∈ C} exp(z_c), log softmax(z) = z - log Σ_{c ∈ C} exp(z_c). Issue: the class set C is huge. For C&W, |C| ≈ 100,000; for word2vec, ≈ 1,000,000 types. Note the largest dataset is 6 billion words.
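A short numpy sketch of log-softmax with the standard max-shift for stability; the point is that the normalizer log Σ_c exp(z_c) touches every one of the |C| classes, which is what the next slides try to avoid:

```python
import numpy as np

def log_softmax(z):
    """log softmax(z) = z - log sum_c exp(z_c), computed with a max shift for stability.
    The log-sum-exp term requires a full pass over all |C| classes."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

z = np.random.randn(1_000_000)         # scores over a word2vec-sized vocabulary
log_probs = log_softmax(z)             # O(|C|) work just to normalize
print(log_probs[42])
```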

34 Two-Layer Softmax First, cluster words into hard classes (for instance Brown clusters), grouping words into classes based on word context, e.g. {dog, cat, horse, ...} and {car, truck, motorcycle, ...}.

35 Two-Layer Softmax Assume that we first generate a class C and then a word: p(Y | X) = P(Y | C, X; θ) P(C | X; θ). Estimate the two distributions with a shared embedding layer: P(C | X; θ): ŷ^1 = softmax((x^0 W^0) W^1 + b); P(Y | C = class, X; θ): ŷ^2 = softmax((x^0 W^0) W^class + b).

36 Softmax as Tree Tree with classes as internal nodes, e.g. {dog, cat, horse, ...} and {car, truck, motorcycle, ...}. ŷ^(1) = softmax((x^0 W^0) W^1 + b), ŷ^(2) = softmax((x^0 W^0) W^class + b). L_2SM(y^(1), y^(2), ŷ^(1), ŷ^(2)) = -log p(y | x, class(y)) - log p(class(y) | x) = -log ŷ^(1)_{c_1} - log ŷ^(2)_{c_2}.
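A hedged numpy sketch of the two-layer softmax loss; the class assignments, dimensions, and parameter shapes are illustrative placeholders, and a single input word stands in for x^0 W^0:

```python
import numpy as np

V, n_classes, d_in = 1000, 10, 50
rng = np.random.default_rng(0)
word_class = rng.integers(n_classes, size=V)      # hard class of each word (e.g. Brown-style)
within = {c: np.flatnonzero(word_class == c) for c in range(n_classes)}

W0 = rng.normal(0, 0.01, (V, d_in))               # shared embedding layer
W1 = rng.normal(0, 0.01, (d_in, n_classes))       # class scores
Wc = {c: rng.normal(0, 0.01, (d_in, len(within[c]))) for c in range(n_classes)}  # per-class word scores

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def two_layer_softmax_loss(x_id, y_id):
    """L_2SM = -log p(class(y) | x) - log p(y | x, class(y))."""
    h = W0[x_id]                                   # input representation (single word here, for simplicity)
    c = word_class[y_id]
    y1 = softmax(h @ W1)                           # distribution over classes
    y2 = softmax(h @ Wc[c])                        # distribution over words within class c
    j = int(np.flatnonzero(within[c] == y_id)[0])  # position of y within its class
    return -np.log(y1[c]) - np.log(y2[j])

print(two_layer_softmax_loss(x_id=7, y_id=123))
```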

37 Speed With the same class tree (e.g. {dog, cat, horse, ...} and {car, truck, motorcycle, ...}), computing the loss only requires walking the path to the target word. With a two-layer balanced tree, computing the loss requires O(√|V|). (Note: computing the full distribution still requires O(|V|).)

38 Hierarchical Softmax (HSM) Build a tree with multiple layers. L_HSM(y^(1), ..., y^(C), ŷ^(1), ..., ŷ^(C)) = -Σ_i log ŷ^(i)_{c_i}. A balanced tree only requires O(log_2 |V|). Experiments on website (Mnih and Hinton, 2008).
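A sketch of the HSM loss along a root-to-leaf path, under the common assumption of a binary tree where each internal node makes a 2-way (sigmoid) decision; the slide's formulation allows general per-node softmaxes, and the node ids and path here are illustrative:

```python
import numpy as np

d_in, depth = 50, 10                             # depth ~ log2 |V| for a balanced binary tree
rng = np.random.default_rng(0)
node_W = rng.normal(0, 0.01, (2**depth, d_in))   # one parameter vector per internal node

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hsm_loss(h, path):
    """L_HSM = -sum_i log y_hat^(i)_{c_i}, walking the tree from root to leaf.
    `path` is a list of (node_id, branch) pairs with branch in {0, 1}: the sequence
    of binary decisions c_i that identifies the target word."""
    loss = 0.0
    for node_id, branch in path:
        p_right = sigmoid(h @ node_W[node_id])   # 2-way softmax at this internal node
        p_branch = p_right if branch == 1 else 1.0 - p_right
        loss -= np.log(p_branch)
    return loss

h = rng.normal(0, 0.1, d_in)                     # shared input representation x^0 W^0
path = [(0, 1), (2, 0), (5, 1)]                  # illustrative root-to-leaf path
print(hsm_loss(h, path))                         # only O(depth) = O(log2 |V|) work
```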

39 HSM with Huffman Encoding Requires O(log_2 perp(unigram)). Reduces time to only 1 day for 1.6 million tokens.

41 Contents Embedding Motivation C&W Embeddings word2vec Evaluating Embeddings

42 How good are embeddings? Qualitative Analysis/Visualization Analogy task Extrinsic Metrics

43 Metrics Dot product: x_cat x_dog^⊤. Cosine similarity: x_cat x_dog^⊤ / (||x_cat|| ||x_dog||).
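A small numpy sketch of both metrics (the vectors are random stand-ins for trained embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
x_cat, x_dog = rng.normal(size=50), rng.normal(size=50)    # stand-in embeddings

dot = x_cat @ x_dog                                        # dot product
cosine = dot / (np.linalg.norm(x_cat) * np.linalg.norm(x_dog))
print(dot, cosine)                                         # cosine lies in [-1, 1]
```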

44 k-nearest neighbors (cosine sim) of dog: cat, dogs, horse, puppy, pet, rabbit, pig, snake. Intuition: trained to match words that act the same.

45 Empirical Measures: Analogy task Analogy questions of the form A:B::C:?, with 5 types of semantic questions and 9 types of syntactic questions.

46 Embedding Tasks

47 Analogy Prediction A:B::C:? Project to the closest word: x = x_B - x_A + x_C, then predict arg max_{D ∈ V} x_D x^⊤ / (||x_D|| ||x||). Code example below.
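A sketch of this prediction rule with a toy embedding matrix; the vocabulary and embeddings are placeholders, and excluding A, B, C from the candidates is a common convention assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "dog", "cat"]       # toy vocabulary
E = rng.normal(size=(len(vocab), 50))                          # toy embedding matrix
E /= np.linalg.norm(E, axis=1, keepdims=True)                  # unit-normalize rows

def analogy(a, b, c):
    """A:B::C:?  ->  x = x_B - x_A + x_C, return arg max_D cosine(x_D, x)."""
    idx = {w: i for i, w in enumerate(vocab)}
    x = E[idx[b]] - E[idx[a]] + E[idx[c]]
    sims = E @ (x / np.linalg.norm(x))                         # cosine similarity to every word
    for w in (a, b, c):                                        # exclude the question words
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

print(analogy("man", "king", "woman"))                         # ideally "queen" with real embeddings
```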

48 Extrinsic Tasks Text classification Part-of-speech tagging Many, many others over last couple years

49 Conclusion Word Embeddings Scaling issues and tricks Next Class: Language Modeling
