Sparse Word Embeddings Using ℓ1 Regularized Online Learning

1 Sparse Word Embeddings Using ℓ1 Regularized Online Learning Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng July 14, 2016 ofey.sunfei@gmail.com, {guojiafeng, lanyanyan, junxu, cxq}@ict.ac.cn CAS Key Lab of Network Data Science and Technology Institute of Computing Technology, Chinese Academy of Sciences

2 Distributed Word Representation Applications: Machine Translation [Kalchbrenner and Blunsom, 2013], Sentiment Analysis [Maas et al., 2011], Language Modeling [Bengio et al., 2003], POS Tagging [Collobert et al., 2011], Parsing [Socher et al., 2011], Word-Sense Disambiguation [Collobert et al., 2011]. Distributed word representations are a hot topic in the NLP community. 1

3 [Figure: architecture diagrams of neural embedding models, including NPLM, LBL, C&W, Huang's global-context model, and Word2Vec's CBOW and Skip-gram.] State-Of-The-Art: CBOW and SG. 2

4 Dense Representation and Interpretability Example 1 man [ ,..., ,..., ] woman [ ,..., ,..., ] dog [ ,..., ,..., ] computer [ ,..., ,..., ] 1 Vectors from GoogleNews-vectors-negative300.bin. 3

5 Dense Representation and Interpretability Example 1 man [ ,..., ,..., ] woman [ ,..., ,..., ] dog [ ,..., ,..., ] computer [ ,..., ,..., ] Which dimension represents the gender of man and woman? 1 Vectors from GoogleNews-vectors-negative300.bin. 3

6 Dense Representation and Interpretability Example 1 man [ ,..., ,..., ] woman [ ,..., ,..., ] dog [ ,..., ,..., ] computer [ ,..., ,..., ] Which dimension represents the gender of man and woman? What sort of value indicates male or female? 1 Vectors from GoogleNews-vectors-negative300.bin. 3

7 Dense Representation and Interpretability Example 1 man [ ,..., ,..., ] woman [ ,..., ,..., ] dog [ ,..., ,..., ] computer [ ,..., ,..., ] Which dimension represents the gender of man and woman? What sort of value indicates male or female? Gender dimension(s) would be active in all the word vectors including irrelevant words like computer. 1 Vectors from GoogleNews-vectors-negative300.bin. 3

8 Dense Representation and Interpretability Example 1 man [ ,..., ,..., ] woman [ ,..., ,..., ] dog [ ,..., ,..., ] computer [ ,..., ,..., ] Which dimension represents the gender of man and woman? What sort of value indicates male or female? Gender dimension(s) would be active in all the word vectors including irrelevant words like computer. Difficult to interpret and uneconomical to store. 1 Vectors from GoogleNews-vectors-negative300.bin. 3

9 Sparse Word Representation Non-Negative Sparse Embedding (NNSE) [Murphy et al., 2012]; Sparse Coding (Word2Vec) [Faruqui et al., 2015]
$$\arg\min_{A \in \mathbb{R}^{m \times k},\, D \in \mathbb{R}^{k \times n}} \sum_{i=1}^{m} \left( \|X_i - A_i D\|^2 + \lambda \|A_i\|_1 \right)$$
where $A_{i,j} \ge 0$ for $1 \le i \le m$, $1 \le j \le k$, and $D_i D_i^{\top} \le 1$ for $1 \le i \le k$.
[Figure: sparse coding maps the V×L matrix of initial dense vectors X to V×K sparse overcomplete vectors A; non-negative sparse coding plus a projection yields sparse, binary overcomplete vectors B.]
Introduce sparse and non-negative constraints into MF. Convert dense vectors using sparse coding in a post-processing way. 4

10 Sparse Word Representation Non-Negative Sparse Embedding (NNSE) [Murphy et al., 2012]; Sparse Coding (Word2Vec) [Faruqui et al., 2015]
$$\arg\min_{A \in \mathbb{R}^{m \times k},\, D \in \mathbb{R}^{k \times n}} \sum_{i=1}^{m} \left( \|X_i - A_i D\|^2 + \lambda \|A_i\|_1 \right)$$
where $A_{i,j} \ge 0$ for $1 \le i \le m$, $1 \le j \le k$, and $D_i D_i^{\top} \le 1$ for $1 \le i \le k$.
[Figure: sparse coding maps the V×L matrix of initial dense vectors X to V×K sparse overcomplete vectors A; non-negative sparse coding plus a projection yields sparse, binary overcomplete vectors B.]
Introduce sparse and non-negative constraints into MF. Convert dense vectors using sparse coding in a post-processing way.
They are difficult to train on large-scale data due to heavy memory usage! 4
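The NNSE and SC baselines solve this dictionary-learning problem with dedicated solvers (the experiments below mention the SPAMS toolbox). Purely as an illustration of the objective, here is a minimal numpy sketch that alternates a proximal-gradient (soft-thresholding) step on the codes A with a projected gradient step on the dictionary D; the function names, step sizes, and the simple row-norm projection are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X, k, lam=0.5, n_iter=200, nonneg=False, seed=0):
    """Alternating ISTA sketch of  min_{A,D} sum_i ||X_i - A_i D||^2 + lam * ||A_i||_1.

    X: (m, n) dense embeddings; A: (m, k) sparse codes; D: (k, n) dictionary.
    Dictionary rows are projected back onto the unit ball after each update,
    a simple surrogate for the D_i D_i^T <= 1 constraint on the slide.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A = rng.standard_normal((m, k)) * 0.01
    D = rng.standard_normal((k, n)) * 0.01
    for _ in range(n_iter):
        # ISTA step on A: gradient step with step 1 / (2 ||D||_2^2), then soft-threshold.
        R = A @ D - X
        sD = np.linalg.norm(D, 2) ** 2 + 1e-8
        A = soft_threshold(A - (R @ D.T) / sD, lam / (2.0 * sD))
        if nonneg:                      # NNSE-style non-negative codes
            A = np.maximum(A, 0.0)
        # Gradient step on D, then project each row onto the unit ball.
        R = A @ D - X
        sA = np.linalg.norm(A, 2) ** 2 + 1e-8
        D = D - (A.T @ R) / sA
        D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)
    return A, D
```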

11 Our Motivation Good Performance Fast Large Scale Interpretable 5

12 Our Motivation Good Performance Fast Large Scale Interpretable Word2Vec 5

13 Our Motivation Good Performance Fast Large Scale Interpretable Word2Vec Sparse 5

14 Our Motivation Good Performance Fast Large Scale Interpretable Word2Vec Sparse Sparse CBOW Directly apply the sparse constraint to CBOW 5

15 CBOW
[Figure: CBOW architecture: the context words $c_{i-2}, c_{i-1}, c_{i+1}, c_{i+2}$ are averaged to predict the target word $w_i$.]
$$\mathcal{L}_{cbow} = \sum_{i=1}^{N} \log p(w_i \mid h_i), \qquad p(w_i \mid h_i) = \frac{\exp(\vec{w}_i^{\top} \vec{h}_i)}{\sum_{w \in W} \exp(\vec{w}^{\top} \vec{h}_i)}, \qquad \vec{h}_i = \frac{1}{2l} \sum_{\substack{j=i-l \\ j \ne i}}^{i+l} \vec{c}_j$$
With negative sampling:
$$\mathcal{L}^{ns}_{cbow} = \sum_{i=1}^{N} \Big( \log \sigma(\vec{w}_i^{\top} \vec{h}_i) + k\, \mathbb{E}_{w \sim P}\big[\log \sigma(-\vec{w}^{\top} \vec{h}_i)\big] \Big)$$
6
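As a concrete reading of $\mathcal{L}^{ns}_{cbow}$, the sketch below computes the negative-sampling term for one position, with the expectation over the noise distribution $P$ approximated by $k$ sampled words. This is a hedged illustration; the variable names and the NumPy formulation are mine, not the released word2vec code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_ns_loss(context_vecs, target_vec, negative_vecs):
    """Negative-sampling objective for one position, following L^ns_cbow above.

    context_vecs:  (2l, d) input vectors of the surrounding words c_j
    target_vec:    (d,)    output vector of the target word w_i
    negative_vecs: (k, d)  output vectors of k words drawn from the noise
                           distribution P (e.g. unigram counts raised to 0.75)
    Returns the per-position log-likelihood term (to be maximised).
    """
    h = context_vecs.mean(axis=0)                      # h_i: average of context vectors
    pos = np.log(sigmoid(target_vec @ h))              # log sigma(w_i . h_i)
    neg = np.log(sigmoid(-(negative_vecs @ h))).sum()  # Monte Carlo estimate of k * E[...]
    return pos + neg
```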

16 Sparse CBOW
$$\mathcal{L}^{ns}_{s\text{-}cbow} = \mathcal{L}^{ns}_{cbow} - \lambda \sum_{w \in W} \|\vec{w}\|_1$$
7

17 Sparse CBOW
$$\mathcal{L}^{ns}_{s\text{-}cbow} = \mathcal{L}^{ns}_{cbow} - \lambda \sum_{w \in W} \|\vec{w}\|_1$$
Online, SGD 7

18 Sparse CBOW
$$\mathcal{L}^{ns}_{s\text{-}cbow} = \mathcal{L}^{ns}_{cbow} - \lambda \sum_{w \in W} \|\vec{w}\|_1$$
Online, SGD: failed (plain subgradient steps rarely drive coordinates exactly to zero). 7

19 Sparse CBOW
$$\mathcal{L}^{ns}_{s\text{-}cbow} = \mathcal{L}^{ns}_{cbow} - \lambda \sum_{w \in W} \|\vec{w}\|_1$$
Online, SGD: failed (plain subgradient steps rarely drive coordinates exactly to zero).
Regularized Dual Averaging (RDA) [Xiao, 2009]: truncating using the online average subgradients. 7
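The contrast can be seen with a toy update (all numbers below are made-up illustrations, and the fixed step size and minimisation-style signs are simplifications of the schedule used in the actual algorithm on the next slide): a subgradient step on the $\ell_1$ term shrinks coordinates but almost never lands exactly on zero, while an RDA-style truncation of the averaged subgradient produces exact zeros.

```python
import numpy as np

eta, lam = 0.05, 0.1
w = np.array([0.03, -0.8, 0.5])      # current coordinates of one word vector
g = np.array([0.2, -0.1, 0.05])      # this step's loss gradient (toy values)

# Plain l1-subgradient SGD step: coordinates shrink but stay non-zero.
w_sgd = w - eta * (g + lam * np.sign(w))
print(w_sgd)                          # [ 0.015  -0.79    0.4925]  -> no exact zeros

# RDA-style truncation applied to the *running average* of subgradients.
g_bar = np.array([0.02, -0.6, 0.4])   # averaged subgradients so far (toy values)
w_rda = np.where(np.abs(g_bar) <= lam,
                 0.0,                                    # small average gradient -> exact zero
                 -eta * (g_bar - lam * np.sign(g_bar)))  # otherwise a shrunk step
print(w_rda)                          # [ 0.     0.025 -0.015]     -> exact sparsity
```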

20 RDA algorithm for Sparse CBOW
1: procedure SparseCBOW(C)
2:   Initialize $\vec{w}$ for all $w \in W$ and $\vec{c}$ for all $c \in C$; set $\bar{g}^0_w = \vec{0}$ for all $w \in W$
3:   for $i = 1, 2, 3, \dots$ do
4:     $t \leftarrow$ update time (count) of word $w_i$
5:     $\vec{h}_i = \frac{1}{2l} \sum_{j=i-l,\, j \ne i}^{i+l} \vec{c}_j$
6:     $g^t_{w_i} = \big[\mathbb{1}_{h_i}(w_i) - \sigma(\vec{w}^{t\top}_i \vec{h}_i)\big]\, \vec{h}_i$
7:     $\bar{g}^t_{w_i} = \frac{t-1}{t}\, \bar{g}^{t-1}_{w_i} + \frac{1}{t}\, g^t_{w_i}$   ▷ keep track of the online average subgradients
8:     Update $\vec{w}_i$ element-wise, for $j = 1, 2, \dots, d$:   ▷ truncating
9:       $w^{t+1}_{i,j} = 0$ if $|\bar{g}^t_{w_i,j}| \le \lambda / \#(w_i)$, and $w^{t+1}_{i,j} = \eta_t \big( \bar{g}^t_{w_i,j} - \frac{\lambda}{\#(w_i)} \,\mathrm{sgn}(\bar{g}^t_{w_i,j}) \big)$ otherwise
10:    for $k = -l, \dots, -1, 1, \dots, l$ do
11:      update $\vec{c}_{i+k}$ according to
12:        $\vec{c}_{i+k} := \vec{c}_{i+k} + \alpha\, \frac{1}{2l} \big[\mathbb{1}_{h_i}(w_i) - \sigma(\vec{w}^{t\top}_i \vec{h}_i)\big]\, \vec{w}^t_i$
13:    end for
14:  end for
15: end procedure
8
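A minimal Python sketch of the per-word part of this procedure (lines 6-9) is given below. The step-size schedule $\eta_t$, the maximisation-style sign of the update, and the frequency-scaled regulariser $\lambda / \#(w)$ follow the reconstruction above; everything else (names, shapes, the fixed $\eta_t$ value) is an assumption of the sketch rather than the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rda_update_word(w_t, g_bar_prev, t, h, label, lam_w, eta_t):
    """One RDA step for a single (word, context) pair, following lines 6-9 above.

    w_t:        (d,) current output vector of the word being updated
    g_bar_prev: (d,) running average of this word's subgradients before the step
    t:          update count of this word (1-based)
    h:          (d,) averaged context vector h_i
    label:      1 if the word is the true target (the indicator 1_{h_i}), 0 for a noise sample
    lam_w:      effective l1 strength for this word, e.g. lam / count(w)
    eta_t:      step size at time t (a fixed value is an assumption here)
    """
    grad = (label - sigmoid(w_t @ h)) * h                    # line 6
    g_bar = ((t - 1) / t) * g_bar_prev + (1.0 / t) * grad    # line 7: online average
    shrunk = eta_t * (g_bar - lam_w * np.sign(g_bar))
    w_next = np.where(np.abs(g_bar) <= lam_w, 0.0, shrunk)   # lines 8-9: truncation
    return w_next, g_bar
```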

21 Evaluation

22 Baselines & Tasks
Baselines:
Dense representation models: GloVe [Pennington et al., 2014]; CBOW and SG [Mikolov et al., 2013]
Sparse representation models: Sparse Coding (SC) [Faruqui et al., 2015]; Positive Pointwise Mutual Information (PPMI) [Bullinaria and Levy, 2007]; NNSE [Murphy et al., 2012]
Tasks:
Interpretability: Word Intrusion
Expressive Power: Word Analogy, Word Similarity 9

23 Experimental Settings
Corpus: Wikipedia 2010 (1B words)
Parameter settings: window size, number of negative samples, iterations; λ chosen by grid search; learning rate 0.05; noise distribution #(w)^0.75
Baselines:
GloVe, CBOW, SG: same settings as the released tools
SC: embeddings of CBOW as input
NNSE: PPMI matrix, 40,000 words, SPAMS toolbox

24 Interpretability

25 Word Intrusion [Chang et al., 2009]
[Figure: example for one dimension i: its top-5 words (poisson, parametric, markov, bayesian, stochastic) are mixed with the intruder word jodel in a shuffled list, and the annotator must pick the intruder out.]
1. Sort dimension i in descending order.
2. Build a set: the top 5 words of dimension i, plus 1 intruder word drawn from the bottom 50% of dimension i that is in the top 10% of some other dimension j (j ≠ i).
3. Pick out the intruder word.
The more interpretable a dimension is, the easier the intruder is to pick out. 11
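A small sketch of step 2 (building one intrusion instance per dimension) is shown below; the thresholds follow the slide, while the sampling of the other dimension and the tie-breaking are assumptions of the sketch.

```python
import numpy as np

def build_intrusion_set(E, dim, k=5, rng=None):
    """Build one word-intrusion instance for dimension `dim` (step 2 above).

    E:   (V, d) embedding matrix
    Returns (ids of the top-k words on `dim`, id of the intruder word).
    The intruder lies in the bottom 50% of `dim` but in the top 10% of another dimension.
    """
    rng = rng or np.random.default_rng()
    V, d = E.shape
    order = np.argsort(-E[:, dim])                 # step 1: sort dimension dim, descending
    top_k = order[:k]
    bottom_half = set(order[V // 2:])              # bottom 50% on dimension dim
    other = rng.integers(d)
    while other == dim:                            # pick a different dimension j != dim
        other = rng.integers(d)
    top_other = np.argsort(-E[:, other])[: max(1, V // 10)]   # top 10% of dimension j
    candidates = [w for w in top_other if w in bottom_half]
    intruder = rng.choice(candidates) if candidates else rng.choice(list(bottom_half))
    return top_k, intruder
```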

26 Evaluation Metric
The traditional metric relies on human assignment: subjective and costly. We instead define
$$IntraDist_i = \sum_{w_j \in top_k(i)} \sum_{\substack{w_k \in top_k(i) \\ w_k \ne w_j}} \frac{dist(w_j, w_k)}{k(k-1)}, \qquad InterDist_i = \sum_{w_j \in top_k(i)} \frac{dist(w_j, w_{b_i})}{k}$$
$$DistRatio = \frac{1}{d} \sum_{i=1}^{d} \frac{InterDist_i}{IntraDist_i}$$
IntraDist_i: average distance between the top k words of dimension i.
InterDist_i: average distance between the top k words and the intruder (bottom) word w_{b_i}.
Intuition: the intruder word should be dissimilar to the top words, while the top words should be similar to each other.
In this work, we use Euclidean distance as dist(·, ·). 12
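The metric is straightforward to compute once the intrusion sets are built; a minimal sketch (reusing the hypothetical build_intrusion_set helper above, with Euclidean distance as on the slide):

```python
import numpy as np

def dist_ratio(E, intrusion_sets):
    """DistRatio over all dimensions, following the definitions above.

    E:              (V, d) embedding matrix
    intrusion_sets: one (top_k ids, intruder id) pair per dimension,
                    e.g. produced by build_intrusion_set
    """
    ratios = []
    for top_k, intruder in intrusion_sets:
        T = E[top_k]                                       # (k, d) top-word vectors
        k = len(top_k)
        # IntraDist: average pairwise Euclidean distance among the top k words.
        pair = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=-1)
        intra = pair.sum() / (k * (k - 1))                 # diagonal terms are zero
        # InterDist: average distance between the top words and the intruder.
        inter = np.linalg.norm(T - E[intruder], axis=-1).mean()
        ratios.append(inter / intra)
    return float(np.mean(ratios))
```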

27 Word Intrusion Results
Table 1: 300 dimensions, run 10 times.
Model          Sparsity   DistRatio
GloVe          0%         1.07
CBOW           0%         1.09
SG             0%         1.12
NNSE (PPMI)    89.15%     1.55
SC (CBOW)      88.34%     1.24
Sparse CBOW    90.06%     1.39
Sparse representations are more interpretable than dense ones.
Sparse CBOW vs. SC (CBOW): the separate sparse coding step loses information.
A non-negative constraint might also be a good choice. 13

28 Case Study
Model          Top 5 Words
CBOW           beat, finish, wedding, prize, read
               rainfall, footballer, breakfast, weekdays, angeles
               landfall, interview, asked, apology, dinner
               becomes, died, feels, resigned, strained
               best, safest, iucn, capita, tallest
Sparse CBOW    poisson, parametric, markov, bayesian, stochastic      (statistical learning)
               ntfs, gzip, myfile, filenames, subdirectories          (file system)
               hugely, enormously, immensely, wildly, tremendously    (adverbs of degree)
               earthquake, quake, uprooted, levees, spectacularly     (disasters)
               bosons, accretion, higgs, neutrinos, quarks            (particles)
The dimensions of Sparse CBOW reveal clear and consistent semantic meanings. 14

29 Expressive Power

30 Expressive Power
Word Analogy:
Testset: Google [Mikolov et al., 2013]
Example: Beijing : China :: Paris : ?;  big : bigger :: deep : ?
Solution: 3CosMul [Levy and Goldberg, 2014]
Metric: Precision (%)
Word Similarity:
Testsets: WordSim-353 (WS-353) [Finkelstein et al., 2002], SimLex-999 (SL-999) [Hill et al., 2015], Rare Word (RW) [Luong et al., 2013]
Example: (tiger, cat, 7.35)
Solution: cosine similarity
Metric: Spearman rank correlation 15
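For reference, a minimal sketch of the 3CosMul solver used for the analogy task; the [0, 1] shift of the cosine similarities (to keep the multiplicative combination well-behaved) and the vocabulary handling are implementation assumptions of this sketch.

```python
import numpy as np

def three_cos_mul(E, vocab, a, b, c, eps=1e-3):
    """Answer the analogy "a : b :: c : ?" with 3CosMul [Levy and Goldberg, 2014].

    E:     (V, d) embedding matrix; vocab[i] is the word for row i.
    Returns the best candidate word, excluding the three query words.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    N = E / np.linalg.norm(E, axis=1, keepdims=True)        # unit-normalise rows

    def cos_all(w):
        # Cosine similarity of every word to w, shifted from [-1, 1] to [0, 1].
        return (N @ N[idx[w]] + 1.0) / 2.0

    score = cos_all(b) * cos_all(c) / (cos_all(a) + eps)
    for w in (a, b, c):                                     # never return a query word
        score[idx[w]] = -np.inf
    return vocab[int(np.argmax(score))]

# e.g. three_cos_mul(E, vocab, "Beijing", "China", "Paris")  ->  expected "France"
```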

31 Results
Table 2: Precision (%) for analogy and Spearman correlation for similarity.
Model          Dim   Sparsity   Sem   Syn   Total   WS-353   SL-999   RW
GloVe          300   0%
CBOW           300   0%
SG             300   0%
PPMI(W-C)
PPMI(W-C)
NNSE (PPMI) 2
SC (CBOW)
SC (CBOW)
Sparse CBOW
Sparse CBOW is a competitive model using much less memory. It reaches similar analogy performance if its sparsity level is reduced below 85%.
2 The input matrix of NNSE is the dimensional representations of PPMI in the fourth row. 16

32 Summary A sparse word representation model. A new evaluation metric for the word intrusion task. Improvement in word vector interpretability. Similar performance with less memory usage. 17

33 Thanks! Q & A 17

34 References I
Bullinaria, J. A. and Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3).
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., and Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In NIPS. Curran Associates, Inc.
Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015). Sparse overcomplete word vector representations. In Proceedings of ACL.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Trans. Inf. Syst., 20(1).

35 References II
Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
Levy, O. and Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL.
Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of ICLR.
Murphy, B., Talukdar, P., and Mitchell, T. (2012). Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING.

36 References III
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP.
Xiao, L. (2009). Dual averaging method for regularized stochastic learning and online optimization. In Proceedings of NIPS.
