Word2Vec and Word Embedding


1. Introduction
1.1 About BEDORE
The author works at BEDORE (contact: katayama@bedore.jp), where word embeddings are a basic building block of natural-language-processing systems such as automated FAQ response.
1.2 What is an embedding?
An embedding maps a discrete symbol to a dense real-valued vector. A word embedding assigns such a vector to every word in the vocabulary so that relations between word meanings are reflected in the vector space. This article surveys word-embedding methods, with Word2Vec as the starting point, and then discusses practical issues in using them for Japanese text.

Word2Vec [1], proposed in 2013, is the best-known word-embedding method. Its vectors famously support analogical arithmetic such as queen - woman + man = king (Figure 1), and its publication triggered the current wave of interest in word embeddings. Section 2 reviews Word2Vec and related embedding methods, and Section 3 discusses how to use them in practice.
1.3 Beyond word embeddings
Embeddings are not limited to words: Skip-thought Vectors [2], for example, use a Recurrent Neural Network (RNN) to embed whole sentences. This article, however, focuses on word embeddings.
2. Word-embedding methods
2.1 One-hot representation and Bag-of-words
The simplest vector representation of a word is the one-hot vector, which has one dimension per vocabulary item and a single 1 in the position of the word. A sentence can then be represented as a Bag-of-words (BoW) vector, the sum of the one-hot vectors of its words; a short sketch and worked examples (Tables 1 and 2) follow.
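As a minimal sketch of this construction (the lemmatized token lists and the shared vocabulary are assumptions chosen to match Tables 1 and 2 below):

    # Minimal sketch: one-hot and Bag-of-words vectors for the two example sentences.
    # The lemmatized tokens are an assumption chosen to match Tables 1 and 2.
    import numpy as np

    vocab = ["the", "pen", "be", "mighty", "than", "sword",
             "i", "wear", "my", "as", "others", "do", "their"]
    index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """Return the one-hot vector of a single word."""
        v = np.zeros(len(vocab), dtype=int)
        v[index[word]] = 1
        return v

    def bag_of_words(tokens):
        """A BoW vector is simply the sum of the one-hot vectors of the tokens."""
        return sum(one_hot(t) for t in tokens)

    s1 = ["the", "pen", "be", "mighty", "than", "the", "sword"]
    s2 = ["i", "wear", "my", "pen", "as", "others", "do", "their", "sword"]
    print(bag_of_words(s1))   # "the" appears twice, so its position holds 2
    print(bag_of_words(s2))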

Tables 1 and 2: BoW representations of two example sentences over a shared lemmatized vocabulary.

  Table 1: "The pen is mightier than the sword"
    the 2, pen 1, be 1, mighty 1, than 1, sword 1, I 0, wear 0, my 0, as 0, others 0, do 0, their 0
  Table 2: "I wear my pen as others do their sword"
    the 0, pen 1, be 0, mighty 0, than 0, sword 1, I 1, wear 1, my 1, as 1, others 1, do 1, their 1

2.2 The distributional hypothesis
A one-hot (or BoW) representation treats every pair of distinct words as equally unrelated, so it captures no notion of word similarity. The distributional hypothesis [3], the idea that words occurring in similar contexts have similar meanings, underlies essentially all word-embedding methods; the ALAGIN forum [4] run by NICT distributes language resources built on such co-occurrence statistics, for example similarity measures based on the Dice coefficient. Methods built on the hypothesis are commonly divided into count-based and predictive approaches [5]. Count-based methods collect a word-context co-occurrence matrix X from a large corpus such as Wikipedia, where the context of a word is typically the surrounding n-word window, and derive word vectors from X. In Hinton's terms, a one-hot vector is a local representation (one unit per word), whereas a dense vector whose dimensions each participate in representing many words is a distributed representation; the related terms distributional representation (count-based) and distributed representation (predictive) are sometimes contrasted as well [6].
LSI
Latent Semantic Indexing (LSI) [7] is a representative count-based method. Let X be the co-occurrence matrix whose entry X_ij counts how often word i occurs with context j. LSI computes the singular value decomposition X = U Σ V^T and keeps only the r largest singular values, giving the best rank-r approximation X_r of X; the i-th row u_i of the truncated U is used as the vector of word i, and the j-th row v_j of the truncated V as the vector of context j.
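A toy illustration of the LSI pipeline, assuming a three-sentence corpus and rank r = 2:

    # Sketch of LSI: factorize a word-context co-occurrence matrix X with a
    # truncated SVD, X ~ U_r Sigma_r V_r^T, and use rows of U_r Sigma_r as
    # word vectors. The tiny corpus and the rank r = 2 are assumptions.
    import numpy as np

    corpus = [["king", "rules", "kingdom"],
              ["queen", "rules", "kingdom"],
              ["dog", "chases", "cat"]]
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    # Co-occurrence counts within a sentence-sized window.
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for w in sent:
            for c in sent:
                if w != c:
                    X[idx[w], idx[c]] += 1

    U, S, Vt = np.linalg.svd(X)        # X = U diag(S) V^T
    r = 2                              # keep the r largest singular values
    word_vecs = U[:, :r] * S[:r]       # rank-r word vectors
    print(dict(zip(vocab, word_vecs.round(2))))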

Probabilistic LSI (pLSI) [8] and Latent Dirichlet Allocation (LDA) [9] reinterpret this factorization probabilistically: pLSI generates each word-context pair by (i) choosing a document (context) d, (ii) choosing a latent topic conditioned on d, and (iii) generating a word conditioned on the topic, and LDA additionally places Dirichlet priors on these distributions. The resulting topic distributions play roughly the role of the matrices U and V in LSI.
Word2Vec
Word2Vec is the representative predictive method, proposed by Mikolov et al. in 2013; its most widely used variant is Skip-gram with Negative Sampling (SGNS). Given a corpus w_1, ..., w_T and a window size c, the skip-gram model maximizes the average log probability

  (1/T) Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t)          (1)

Each word w has two vectors, an input vector v_w and an output vector v'_w; the Python library gensim [10] exposes the input vectors as the learned word embeddings. Skip-gram defines the probability of an output (context) word w_O given an input (center) word w_I by a softmax over the vocabulary of size V:

  P(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / Σ_{w=1}^{V} exp(v'_w^T v_{w_I})          (2)

The model can be viewed as a two-layer neural network in which each layer computes f(Wx + b) with b = 0. The input is the one-hot vector of a word w_k; the first layer's weight matrix stores the input vectors, so the hidden layer is h = v_{w_k}; the second layer's weight matrix stores the output vectors, and a softmax, softmax(x)_i = exp(x_i) / Σ_j exp(x_j), is applied to its V outputs. The l-th output of the network is then

  g(w_k)_l = exp(v'_{w_l}^T v_{w_k}) / Σ_{w=1}^{V} exp(v'_w^T v_{w_k}),

which, with w_k = w_I and w_l = w_O, equals P(w_O | w_I) in (2). Evaluating this softmax requires a sum over all V words, which is expensive for large vocabularies; Negative Sampling [11], described next, avoids this cost.
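Training SGNS vectors with gensim [10] can be sketched as follows (gensim 4.x API; the toy corpus and the hyperparameters are placeholder assumptions):

    # Sketch: training skip-gram with negative sampling (SGNS) using gensim 4.x.
    # The tokenized corpus and hyperparameters are placeholder assumptions.
    from gensim.models import Word2Vec

    sentences = [["the", "king", "rules", "the", "kingdom"],
                 ["the", "queen", "rules", "the", "kingdom"]]

    model = Word2Vec(
        sentences,
        vector_size=300,   # embedding dimension (the experiments in Section 3.4 use 300)
        window=5,          # context window size c in Eq. (1)
        sg=1,              # 1 = skip-gram (0 would be CBOW)
        negative=5,        # number of negative samples k in Eq. (3) below
        min_count=1,
    )

    # model.wv holds the *input* vectors v_w, as noted in the text.
    print(model.wv["king"][:5])
    print(model.wv.most_similar("king", topn=3))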

Writing the sigmoid as σ(x) = 1 / (1 + exp(-x)), SGNS replaces log P(w_O | w_I) with

  log σ(v'_{w_O}^T v_{w_I}) + Σ_{i=1}^{k} E_{w_i ~ P_n(w)} [ log σ(-v'_{w_i}^T v_{w_I}) ]          (3)

where P_n(w) is the noise distribution from which the k negative samples are drawn; in practice the unigram distribution raised to the 3/4 power is used. Because (3) involves only the target word and k sampled words instead of the full softmax over V words, training becomes far cheaper. Word2Vec spread rapidly after its publication and, as of 2017, remains the de facto standard word-embedding method.
GloVe
GloVe [12], proposed in 2014, combines the count-based and predictive viewpoints. With X_ij the number of times word w_j appears within the window of word w_i, it minimizes

  Σ_{i,j=1}^{V} f(X_ij) ( v_{w_i}^T v'_{w_j} + b_i + b'_j - log X_ij )^2

where the b terms are biases and f is a weighting function; as in Word2Vec, each word has two vectors. Levy and Goldberg showed that SGNS itself implicitly factorizes a shifted Pointwise Mutual Information (PMI) matrix [13], where PMI(x, y) = log P(x, y) / (P(x) P(y)) and the shift subtracts log k; this connects the count-based and predictive families. LexVec [14] builds embeddings by factorizing a PMI matrix explicitly.
fastText
fastText [15], released by Facebook in 2016, extends skip-gram with character n-grams (sub-words). Each word is decomposed into n-grams of length 3 to 6 (with boundary markers), and the word vector is the sum of its sub-word vectors. For example, "egg" yields the 3-grams <eg, egg, gg>, the 4-grams <egg, egg>, and the 5-gram <egg>. Because rare or unseen words still share sub-words with known words, fastText handles them gracefully: for "english-born", fastText's nearest neighbours include "british-born" and "polish-born", whereas plain skip-gram returns unrelated words such as "most-capped" and "ex-scotland". The cost is a larger model, since vectors must be stored for every sub-word.
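The sub-word decomposition is easy to reproduce; the following sketch extracts the boundary-marked character n-grams for the "egg" example, using the 3- to 6-gram range described above:

    # Sketch: character n-gram (sub-word) extraction in the style of fastText.
    # Boundary markers '<' and '>' are added before extracting n-grams.
    def char_ngrams(word, n_min=3, n_max=6):
        marked = f"<{word}>"
        grams = []
        for n in range(n_min, n_max + 1):
            grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
        return grams

    print(char_ngrams("egg"))
    # ['<eg', 'egg', 'gg>', '<egg', 'egg>', '<egg>']
    # A word's fastText vector is the sum of the vectors of these sub-words.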

Character-based Embedding
Embeddings can also be built below the sub-word level, directly from characters, for example with an RNN over character sequences. For English this is straightforward: the ASCII character set has only 128 symbols, so characters can simply be one-hot encoded. Japanese and Chinese have thousands of characters, but many share meaningful visual components; Liu et al. [16] therefore render each character as an image and apply a Convolutional Neural Network to obtain character embeddings.
Combining and extending embeddings
Several lines of work build on top of word embeddings: combining multiple distributed word representations [17], learning meta-embeddings from ensembles of embedding sets [18], extending Word2Vec vectors to WordNet synsets and lexemes with AutoExtend [19], and embedding hierarchical structures such as the WordNet hypernym tree in hyperbolic space with Poincaré Embeddings [20].
3. Using word embeddings in practice
3.1 Corpus and tokenization
Word embeddings for Japanese are typically trained on a large corpus such as Wikipedia with a tool like gensim. Because Japanese text is not whitespace-delimited, it must first be segmented by a morphological analyzer such as ChaSen, JUMAN, or MeCab. With MeCab, the choice of dictionary matters: the default ipadic dictionary splits many neologisms and named entities into fragments, whereas mecab-ipadic-neologd [21] keeps them as single tokens; for training word embeddings, neologd is usually the better choice. A minimal tokenization sketch follows.
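A minimal sketch of tokenization with MeCab from Python, assuming the mecab-python3 binding and an installed neologd dictionary (the dictionary path is an assumption that depends on the installation):

    # Sketch: word segmentation with MeCab, using the mecab-ipadic-neologd
    # dictionary. The dictionary path is an assumed installation location;
    # adjust it (or omit the -d option) for your environment.
    import MeCab

    NEOLOGD_DIR = "/usr/lib/mecab/dic/mecab-ipadic-neologd"  # assumption
    tagger = MeCab.Tagger(f"-Owakati -d {NEOLOGD_DIR}")

    def tokenize(text):
        """Return a list of surface tokens (wakati-gaki segmentation)."""
        return tagger.parse(text).split()

    print(tokenize("自然言語処理にはWord2Vecがよく使われる"))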

3.2 Choosing an embedding
Compared with one-hot and BoW features, Word2Vec-style embeddings give dense, low-dimensional inputs. Publicly available vectors trained on general corpora such as Wikipedia are a convenient starting point, but their vocabulary and word usage do not always match the target domain, so it is often worth training or fine-tuning embeddings on in-domain text.
3.3 Updating embeddings
SGNS with negative sampling is normally trained in batch, so adding new text means retraining from scratch. Kaji and Kobayashi [22] proposed an incremental SGNS that updates existing embeddings as new data arrives.
3.4 An experiment: FAQ response selection
To compare embeddings in a realistic setting, we ran a response-selection experiment on an in-house FAQ dataset at BEDORE with roughly 400 FAQs and 905 user questions, evaluated with fold-wise cross-validation. Word2Vec and fastText vectors (300 dimensions) were trained on Japanese Wikipedia tokenized with MeCab/mecab-ipadic-neologd, and variants fine-tuned on the BEDORE data were also tested. Each question was encoded by running a Long Short-Term Memory (LSTM) RNN over its embedding sequence, followed by a softmax over the FAQs; the baseline feeds a TF-IDF-weighted BoW vector into a feed-forward neural network with a softmax output (BoW+NN). Table 3 summarizes the results.

  Table 3: FAQ response-selection accuracy
    Word2Vec + LSTM               0.39
    fastText + LSTM               0.41
    fine-tuned Word2Vec + LSTM    0.43
    fine-tuned fastText + LSTM    0.42
    BoW + NN                      0.32

Word2Vec and fastText perform comparably, fine-tuning on in-domain data helps both, and every embedding-based model clearly outperforms the BoW baseline. A sketch of the embedding + LSTM classifier follows.
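A rough sketch of the embedding + LSTM classifier, assuming TensorFlow/Keras and placeholder sizes (vocabulary, sequence length, number of FAQ classes); in practice the pretrained matrix would be copied from the Word2Vec or fastText model:

    # Sketch: an LSTM classifier over pretrained 300-dim word embeddings,
    # in the spirit of the Word2Vec+LSTM setup of Section 3.4. All sizes
    # (vocabulary, sequence length, number of FAQ classes) are assumptions.
    import numpy as np
    import tensorflow as tf

    vocab_size, max_len, emb_dim, n_faqs = 50000, 30, 300, 400

    # Pretrained embedding matrix, e.g. copied from a gensim model;
    # random values here as a stand-in.
    emb_matrix = np.random.normal(size=(vocab_size, emb_dim)).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size, emb_dim,
            embeddings_initializer=tf.keras.initializers.Constant(emb_matrix),
            trainable=True,            # allow fine-tuning on in-domain data
        ),
        tf.keras.layers.LSTM(128),     # encode the token sequence
        tf.keras.layers.Dense(n_faqs, activation="softmax"),  # choose an FAQ
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    dummy = np.random.randint(0, vocab_size, size=(2, max_len))
    print(model(dummy).shape)   # (2, 400): a distribution over the FAQs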

4. Conclusion
This article surveyed word-embedding methods, from one-hot and count-based representations to Word2Vec and its successors, and discussed practical points for using embeddings in Japanese NLP. Word embeddings have become a standard first step in building NLP systems, and the field continues to develop rapidly.

References
[1] T. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint.
[2] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba and S. Fidler, Skip-thought vectors, In Advances in Neural Information Processing Systems, 28.
[3] M. Sahlgren, The distributional hypothesis, Italian Journal of Linguistics, 20.
[4] ALAGIN, Advanced LAnGuage INformation Forum.
[5] M. Baroni, G. Dinu and G. Kruszewski, Don't count, predict!: A systematic comparison of context-counting vs. context-predicting semantic vectors, In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
[6] J. Turian, L. Ratinov and Y. Bengio, Word representations: A simple and general method for semi-supervised learning, In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
[7] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41.
[8] T. Hofmann, Probabilistic latent semantic indexing, In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[9] D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 3.
[10] R. Řehůřek and P. Sojka, Software framework for topic modelling with large corpora, In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
[11] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean, Distributed representations of words and phrases and their compositionality, In Advances in Neural Information Processing Systems, 26.
[12] J. Pennington, R. Socher and C. D. Manning, GloVe: Global vectors for word representation, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
[13] O. Levy and Y. Goldberg, Neural word embedding as implicit matrix factorization, In Advances in Neural Information Processing Systems, 27.
[14] A. Salle, A. Villavicencio and M. Idiart, Matrix factorization using window sampling and negative sampling for improved word representations, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
[15] P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics, 5.
[16] F. Liu, H. Lu, C. Lo and G. Neubig, Learning character-level compositionality with visual features, arXiv preprint.
[17] J. Garten, K. Sagae, V. Ustun and M. Dehghani, Combining distributed vector representations for words, In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.
[18] W. Yin and H. Schütze, Learning word meta-embeddings, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
[19] S. Rothe and H. Schütze, AutoExtend: Extending word embeddings to embeddings for synsets and lexemes, arXiv preprint.
[20] M. Nickel and D. Kiela, Poincaré embeddings for learning hierarchical representations, arXiv preprint.
[21] T. Sato, Neologism dictionary based on the language resources on the web for MeCab, github.com/neologd/mecab-ipadic-neologd.
[22] N. Kaji and H. Kobayashi, Incremental skip-gram model with negative sampling, arXiv preprint.

Copyright © by ORSJ. Unauthorized reproduction of this article is prohibited.
