
Word Embedding

1. Introduction

1.1 About BEDORE

The author is with BEDORE (2-35-10, 4F, 113-0033; katayama@bedore.jp), where word embeddings are used in natural-language processing work.

1.2 What is an embedding?

An embedding is a mapping from discrete symbols, such as words, to dense real-valued vectors. A word embedding therefore assigns every word in the vocabulary a point in a continuous vector space, so that relations between words, such as similarity, can be expressed as geometric relations between vectors.

Word embeddings attracted wide attention when Word2Vec was proposed by Mikolov et al. in 2013 [1]. Word2Vec learns vectors in which semantic regularities appear as simple arithmetic, the best-known example being queen − woman + man ≈ king (Figure 1). Section 2 of this article reviews Word2Vec and the embedding methods that have followed it since 2013, and Section 3 discusses how to use such embeddings in practice.

1.3 Embeddings beyond words

Not only words but also larger units such as sentences and documents can be embedded. Skip-thought Vectors [2], for example, use a Recurrent Neural Network (RNN) to learn sentence vectors in a spirit similar to Word2Vec. The rest of this article, however, concentrates on word embeddings.

2. How word embeddings are obtained

2.1 One-hot vectors and bag-of-words

The simplest way to turn a word into a vector is the one-hot representation: each word in the vocabulary is assigned a vector whose dimensionality equals the vocabulary size and which contains a single 1. Summing the one-hot vectors of the words in a sentence or document gives its bag-of-words (BoW) vector. Tables 1 and 2 show the one-hot vector of every token and the resulting BoW vector for the sentences "The pen is mightier than the sword" and "I wear my pen as others do their sword" (tokens are lemmatized). One-hot and BoW vectors are easy to build, but they are high-dimensional and sparse, and every pair of distinct words is equally dissimilar, so they carry no information about word meaning.

Table 1: One-hot vector of each token and the BoW vector for "The pen is mightier than the sword".

  word     w1 w2 w3 w4 w5 w6 w7 | BoW
  the       1  0  0  0  0  1  0 |  2
  pen       0  1  0  0  0  0  0 |  1
  be        0  0  1  0  0  0  0 |  1
  mighty    0  0  0  1  0  0  0 |  1
  than      0  0  0  0  1  0  0 |  1
  sword     0  0  0  0  0  0  1 |  1
  I         0  0  0  0  0  0  0 |  0
  wear      0  0  0  0  0  0  0 |  0
  my        0  0  0  0  0  0  0 |  0
  as        0  0  0  0  0  0  0 |  0
  other     0  0  0  0  0  0  0 |  0
  do        0  0  0  0  0  0  0 |  0
  their     0  0  0  0  0  0  0 |  0

Table 2: One-hot vector of each token and the BoW vector for "I wear my pen as others do their sword".

  word     w1 w2 w3 w4 w5 w6 w7 w8 w9 | BoW
  the       0  0  0  0  0  0  0  0  0 |  0
  pen       0  0  0  1  0  0  0  0  0 |  1
  be        0  0  0  0  0  0  0  0  0 |  0
  mighty    0  0  0  0  0  0  0  0  0 |  0
  than      0  0  0  0  0  0  0  0  0 |  0
  sword     0  0  0  0  0  0  0  0  1 |  1
  I         1  0  0  0  0  0  0  0  0 |  1
  wear      0  1  0  0  0  0  0  0  0 |  1
  my        0  0  1  0  0  0  0  0  0 |  1
  as        0  0  0  0  1  0  0  0  0 |  1
  other     0  0  0  0  0  1  0  0  0 |  1
  do        0  0  0  0  0  0  1  0  0 |  1
  their     0  0  0  0  0  0  0  1  0 |  1

2.2 The distributional hypothesis

The starting point for going beyond one-hot vectors is the distributional hypothesis: words that occur in similar contexts tend to have similar meanings [3]. Lexical resources built on this idea, such as those distributed through NICT's ALAGIN forum, are also available [4]. Methods that exploit the hypothesis are commonly divided into count-based and predictive approaches [5]. Count-based methods start from a corpus such as Wikipedia and build a word-context matrix X whose entry X_ij records how often word i occurs with context j, where a context is, for example, the surrounding words within a fixed n-gram window. With v vocabulary words and c context types, X is a v × c matrix.
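To make these representations concrete, here is a minimal Python sketch (not taken from the article) that builds the one-hot and BoW vectors of Tables 1 and 2 and a small count-based word-word co-occurrence matrix. The lemmatized token lists and the window size of 2 are illustrative choices.

```python
import numpy as np

# Lemmatized tokens of the two example sentences from Tables 1 and 2.
sent1 = ["the", "pen", "be", "mighty", "than", "the", "sword"]
sent2 = ["i", "wear", "my", "pen", "as", "other", "do", "their", "sword"]

vocab = sorted(set(sent1) | set(sent2))
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector of a single word over the shared vocabulary."""
    v = np.zeros(len(vocab), dtype=int)
    v[index[word]] = 1
    return v

def bag_of_words(tokens):
    """BoW vector = sum of the one-hot vectors of the tokens."""
    return sum(one_hot(w) for w in tokens)

print(dict(zip(vocab, bag_of_words(sent1).tolist())))

# Count-based starting point: a word-word co-occurrence matrix X that counts
# pairs appearing within a window of +/-2 tokens.
window = 2
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for tokens in (sent1, sent2):
    for t, w in enumerate(tokens):
        left = tokens[max(0, t - window):t]
        right = tokens[t + 1:t + 1 + window]
        for u in left + right:
            X[index[w], index[u]] += 1
```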

The two families differ in how they turn co-occurrence information into vectors. In Hinton's terminology, a one-hot vector is a local representation, in which each word corresponds to exactly one dimension, whereas an embedding is a distributed representation, in which meaning is spread over many dimensions. Relatedly, the vectors produced by count-based methods are often called distributional representations, while those produced by predictive methods are called distributed representations [6].

2.2.1 LSI

A representative count-based method is Latent Semantic Indexing (LSI) [7]. Starting from the matrix X above (or from a word-document matrix), whose entry X_ij is the co-occurrence count of word i and context j, LSI computes a rank-r approximation X_r of X by truncated singular value decomposition,

$$X \approx X_r = U \Sigma V^{\top},$$

where U is a v × r matrix, V is a c × r matrix, and Σ is an r × r diagonal matrix holding the r largest singular values. The i-th row u_i of U can be used as an r-dimensional representation of word i, and the j-th row v_j of V as a representation of context (or document) j.

LSI has probabilistic descendants: probabilistic LSI (pLSI) [8] and Latent Dirichlet Allocation (LDA) [9]. With m latent topics, pLSI assumes the generative process (i) choose a document d, (ii) choose a topic given d, and (iii) choose a word given the topic. The resulting word-topic and document-topic distributions play roles analogous to U and V in LSI, and the word-topic distribution gives each word an m-dimensional representation. LDA extends pLSI by placing Dirichlet priors on these distributions.
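The LSI step itself is a few lines of linear algebra. The following sketch illustrates it under stated assumptions: a small random count matrix stands in for a real word-context matrix X, and NumPy's SVD is truncated to rank r to obtain r-dimensional word and context vectors.

```python
import numpy as np

# Illustrative LSI sketch (not the article's code): factorize a small
# word-context count matrix X (v x c) with a rank-r truncated SVD.
rng = np.random.default_rng(0)
v, c, r = 13, 20, 5                      # vocabulary size, context count, target rank
X = rng.poisson(0.3, size=(v, c)).astype(float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_r, S_r, Vt_r = U[:, :r], S[:r], Vt[:r, :]

word_vectors = U_r * S_r                 # r-dimensional representation of word i (row u_i)
context_vectors = Vt_r.T * S_r           # r-dimensional representation of context j (row v_j)

def cosine(a, b):
    """Cosine similarity of two vectors in the latent space."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

print(cosine(word_vectors[0], word_vectors[1]))
```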

2.2.2 Word2Vec

The representative predictive method is Word2Vec, proposed by Mikolov et al. in 2013 [1]; here we focus on the Skip-gram with Negative Sampling (SGNS) variant. Given a training corpus w_1, ..., w_T and a context window of size c, Skip-gram maximizes the average log-probability of the surrounding words given the center word,

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\; j \ne 0}\log P(w_{t+j}\mid w_t). \qquad (1)$$

Each word w has two vectors, an input vector v_w and an output vector v'_w; libraries such as the Python package gensim [10] return the input vectors as the word embeddings. The conditional probability in (1) is modeled with a softmax over the whole vocabulary of size V,

$$P(w_O \mid w_I)=\frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V}\exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}. \qquad (2)$$

Skip-gram can be viewed as a two-layer neural network. A layer computes f(Wx + b) for an input x, a weight matrix W, a bias b and an activation f; here the biases are zero. The input is the one-hot vector of a word w_k, so multiplying by the first weight matrix W, whose rows are the input vectors, simply selects the hidden vector h = v_{w_k}. The second layer multiplies h by a matrix W' whose columns are the output vectors and applies the softmax function, softmax(x)_i = exp(x_i) / Σ_j exp(x_j), over the V outputs. The l-th output is therefore

$$g(w_k)_l=\frac{\exp\!\left({v'_{w_l}}^{\top} v_{w_k}\right)}{\sum_{w=1}^{V}\exp\!\left({v'_{w}}^{\top} v_{w_k}\right)},$$

which coincides with P(w_O | w_I) in (2) when w_k = w_I and w_l = w_O.

Evaluating this softmax requires a sum over all V words and is expensive for realistic vocabularies. Negative Sampling [11] avoids it: with the sigmoid σ(x) = 1/(1 + exp(−x)), the term log P(w_O | w_I) is replaced by

$$\log\sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)+\sum_{i=1}^{k}\mathbb{E}_{w_i\sim P_n(w)}\!\left[\log\sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right], \qquad (3)$$

where P_n(w) is a noise distribution, in practice the unigram distribution raised to the 3/4 power, and k is the number of negative samples, typically on the order of 5 to 20. Instead of normalizing over the whole vocabulary, each observed word-context pair is contrasted with only k sampled negative words.
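In practice SGNS is rarely implemented by hand; the gensim package mentioned above [10] provides it directly. The sketch below shows a minimal training call (parameter names follow gensim 4.x; the two-sentence corpus and the hyperparameter values are placeholders, not settings from the article).

```python
from gensim.models import Word2Vec

# Toy corpus: lists of tokens, one list per sentence.
corpus = [
    ["the", "pen", "be", "mighty", "than", "the", "sword"],
    ["i", "wear", "my", "pen", "as", "other", "do", "their", "sword"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # dimensionality of v_w
    window=5,          # context size c in Eq. (1)
    sg=1,              # 1 = Skip-gram (0 = CBOW)
    negative=5,        # k negative samples per positive pair, Eq. (3)
    ns_exponent=0.75,  # P_n(w): unigram distribution raised to the 3/4 power
    min_count=1,
)

vec = model.wv["pen"]                       # input vector v_w of a word
print(model.wv.most_similar("pen", topn=3)) # nearest neighbours by cosine similarity
```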

2.3 Developments since Word2Vec

Word2Vec triggered a wave of follow-up work, and by 2017 a number of embedding methods building on or reinterpreting it had appeared. This section introduces representative ones.

2.3.1 GloVe

GloVe, proposed in 2014 [12], combines the count-based and predictive viewpoints. Let X_ij be the number of times word w_j occurs within a window of word w_i. GloVe fits word vectors and biases by minimizing the weighted least-squares objective

$$\sum_{i,j=1}^{V} f(X_{ij})\left({v'_{w_j}}^{\top} v_{w_i}+b_i+b'_j-\log X_{ij}\right)^{2},$$

where the b terms are biases, each word has an input and an output vector as in Word2Vec, and f is a weighting function that damps the influence of very frequent and very rare co-occurrences.

2.3.2 SGNS as shifted PMI

Levy and Goldberg [13] showed that SGNS is closely related to count-based methods: it implicitly factorizes a word-context matrix whose entries are the pointwise mutual information,

$$\mathrm{PMI}(x,y)=\log\frac{P(x,y)}{P(x)\,P(y)},$$

shifted by a constant that depends on the number of negative samples (the "shifted PMI"). This result connects the predictive and count-based families, and embedding methods that explicitly factorize PMI-based matrices, such as LexVec [14], have been built on top of it.

2.3.3 fastText

fastText, released by Facebook in 2016 [15], enriches Skip-gram with sub-word information. Each word is decomposed into character n-grams (sub-words), typically of length 3 to 6, in addition to the word itself. With boundary markers, the sub-words of "egg" are the 3-grams <eg, egg, gg>, the 4-grams <egg and egg>, and the 5-gram <egg>. The vector of a word is the sum of the vectors of its sub-words, so morphologically related words share parameters and even unseen words can be assigned a vector. For example, for the query "english-born", fastText returns "british-born" and "polish-born" as nearest neighbours, whereas plain Skip-gram returns less related words such as "most-capped" and "ex-scotland".
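The sub-word decomposition is easy to reproduce. The following sketch extracts character n-grams with boundary markers in the way described above; it is an illustration, not the fastText implementation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams (sub-words) of a word with boundary markers '<' and '>'."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams

# For "egg": ['<eg', 'egg', 'gg>', '<egg', 'egg>', '<egg>']
print(char_ngrams("egg", 3, 6))

# In fastText the vector of a word is the sum of the vectors of its sub-words
# (plus the word itself), so rare or unseen words still get a representation.
```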

2.3.4 Character-based embeddings

One can go below sub-words and embed individual characters, feeding the character sequence of a word or sentence to a neural network such as an RNN (character-based embedding). For English this is straightforward: the roughly 10^2 characters of ASCII can be represented as 128-dimensional one-hot vectors. For languages such as Japanese, whose character inventory is far larger, character-based approaches are also attractive because they avoid explicit word segmentation with tools such as MeCab; one proposal renders each character as an image and learns character representations, including the compositionality of visually similar characters, with a Convolutional Neural Network [16].

2.3.5 Combining and extending word embeddings

Several lines of work combine existing embeddings or enrich them with external resources. Multiple distributed representations of the same word, for example embeddings trained with different methods or corpora, can be combined into a single representation [17], and meta-embeddings [18] learn such a combination from several pretrained embedding sets. Lexical resources such as WordNet can also be exploited: AutoExtend [19] extends word embeddings such as Word2Vec to embeddings of WordNet synsets and lexemes, and Poincaré Embeddings [20] embed hierarchical structures such as the WordNet hypernym hierarchy into hyperbolic space.

3. Using word embeddings in practice

3.1 Preprocessing Japanese text

To train word embeddings for Japanese, for example on Wikipedia with gensim, the text must first be segmented into words. Morphological analyzers such as ChaSen, JUMAN and MeCab are commonly used. With MeCab, the choice of dictionary matters: the standard ipadic dictionary can be replaced by mecab-ipadic-neologd [21], which adds large numbers of neologisms and named entities harvested from the web. For word embeddings, neologd is often preferable because new words and proper nouns are kept as single tokens instead of being split into fragments.

3.2 Obtaining the embeddings

Unlike one-hot vectors, which only require a vocabulary, embeddings such as Word2Vec have to be trained on a corpus. A common approach is to train on a large general-purpose corpus such as Wikipedia; when the embeddings are used in a specific application, adapting them with in-domain text can improve them further (see Section 3.4).
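A minimal tokenization sketch follows, assuming MeCab and its Python binding are installed; the neologd dictionary path and the example sentence are placeholders.

```python
import MeCab

# Segment Japanese text with MeCab, comparing the default ipadic dictionary
# with mecab-ipadic-neologd [21]. The neologd path depends on your installation.
NEOLOGD_DIR = "/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"  # assumed path
wakati_neologd = MeCab.Tagger(f"-Owakati -d {NEOLOGD_DIR}")
wakati_ipadic = MeCab.Tagger("-Owakati")

def tokenize(text, tagger=wakati_neologd):
    """Return the surface tokens (wakati-gaki) of one sentence."""
    return tagger.parse(text).strip().split()

sentence = "単語の分散表現は自然言語処理で広く使われている。"  # example sentence
print(tokenize(sentence, wakati_ipadic))
print(tokenize(sentence, wakati_neologd))

# The resulting token lists, e.g. over a Wikipedia dump plus in-domain text,
# can be passed directly to gensim's Word2Vec as in the earlier sketch.
```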

3.3 Updating the embeddings

In a running service, new words and new text keep arriving, and retraining SGNS from scratch every time is costly. Kaji and Kobayashi [22] proposed an incremental SGNS that updates the embeddings, including the negative-sampling noise distribution, as new data arrives, which makes it practical to keep embeddings up to date.

3.4 Experiments: FAQ classification

To compare embeddings in a practical setting, Word2Vec and fastText were evaluated on an in-house FAQ task at BEDORE: given a user question, predict the corresponding FAQ out of about 400 entries, with performance measured by 3-fold cross-validation. Embeddings of 300 dimensions were trained on Wikipedia plus BEDORE's in-domain text, tokenized with MeCab and mecab-ipadic-neologd; the fine-tuned variants additionally adapt the pretrained vectors to the in-domain data. A sentence vector could be formed, for example, as a TF-IDF-weighted average of the word vectors, but here the embedded word sequence of a question is fed into a Recurrent Neural Network with Long Short-Term Memory (LSTM) units, followed by a softmax over the FAQ classes. As a baseline, a 905-dimensional bag-of-words representation of the question is fed into a feed-forward neural network with a softmax output (BoW+NN). Table 3 summarizes the results.

Table 3: FAQ classification results (3-fold cross-validation).

  Method                        Score
  Word2Vec + LSTM               0.39
  fastText + LSTM               0.41
  fine-tuned Word2Vec + LSTM    0.43
  fine-tuned fastText + LSTM    0.42
  BoW + NN                      0.32

The embedding-based models clearly outperform the BoW baseline, and fine-tuning the embeddings on in-domain data improves both Word2Vec and fastText.
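As an illustration of the embedding-plus-LSTM setup, the sketch below builds a simple classifier in PyTorch. It is not the authors' architecture: the vocabulary size, sequence length, hidden size and the random matrix standing in for the pretrained Word2Vec/fastText vectors are all assumptions.

```python
import torch
import torch.nn as nn

# Placeholder sizes; `pretrained` stands for a (vocab_size x 300) matrix
# built from Word2Vec or fastText vectors. Random here so the snippet runs.
vocab_size, embed_dim, hidden_dim, num_classes = 20000, 300, 128, 400
pretrained = torch.randn(vocab_size, embed_dim)

class FaqClassifier(nn.Module):
    def __init__(self, freeze_embeddings=True):
        super().__init__()
        # freeze=True keeps the pretrained vectors fixed during training.
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=freeze_embeddings)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)  # softmax applied via cross-entropy loss

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)                    # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)                   # final hidden state of the LSTM
        return self.out(h_n[-1])                       # unnormalized class scores

model = FaqClassifier()
scores = model(torch.randint(0, vocab_size, (2, 30)))  # 2 questions of 30 token ids
print(scores.shape)                                    # torch.Size([2, 400])
```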

4. Conclusion

This article surveyed word-embedding methods, starting from one-hot and count-based representations and centering on Word2Vec and its successors, and discussed practical issues such as tokenization, training, updating and fine-tuning when embeddings are used in real applications.

References

[1] T. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient estimation of word representations in vector space, arXiv:1301.3781, 2013.
[2] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba and S. Fidler, Skip-thought vectors, In Advances in Neural Information Processing Systems, 28, pp. 3294–3302, 2015.
[3] M. Sahlgren, The distributional hypothesis, Italian Journal of Linguistics, 20, pp. 33–53, 2008.
[4] ALAGIN, Advanced LAnGuage INformation Forum, 2011.
[5] M. Baroni, G. Dinu and G. Kruszewski, Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors, In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 238–247, 2014.
[6] J. Turian, L. Ratinov and Y. Bengio, Word representations: A simple and general method for semi-supervised learning, In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394, 2010.
[7] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41, pp. 391–407, 1990.
[8] T. Hofmann, Probabilistic latent semantic indexing, In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57, 1999.
[9] D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 3, pp. 993–1022, 2003.
[10] R. Řehůřek and P. Sojka, Software framework for topic modelling with large corpora, In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, 2010.
[11] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. Dean, Distributed representations of words and phrases and their compositionality, In Advances in Neural Information Processing Systems, 26, pp. 3111–3119, 2013.
[12] J. Pennington, R. Socher and C. D. Manning, GloVe: Global vectors for word representation, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.
[13] O. Levy and Y. Goldberg, Neural word embedding as implicit matrix factorization, In Advances in Neural Information Processing Systems, 27, pp. 2177–2185, 2014.
[14] A. Salle, A. Villavicencio and M. Idiart, Matrix factorization using window sampling and negative sampling for improved word representations, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 419–424, 2016.
[15] P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics, 5, pp. 135–146, 2017.
[16] F. Liu, H. Lu, C. Lo and G. Neubig, Learning character-level compositionality with visual features, arXiv:1704.04859, 2017.
[17] J. Garten, K. Sagae, V. Ustun and M. Dehghani, Combining distributed vector representations for words, In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 95–101, 2015.
[18] W. Yin and H. Schütze, Learning word meta-embeddings, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1351–1360, 2016.
[19] S. Rothe and H. Schütze, AutoExtend: Extending word embeddings to embeddings for synsets and lexemes, arXiv:1507.01127, 2015.
[20] M. Nickel and D. Kiela, Poincaré embeddings for learning hierarchical representations, arXiv:1705.08039, 2017.
[21] T. Sato, Neologism dictionary based on the language resources on the web for MeCab, https://github.com/neologd/mecab-ipadic-neologd, 2015.
[22] N. Kaji and H. Kobayashi, Incremental skip-gram model with negative sampling, arXiv:1704.03956, 2017.