Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations


Learning Word Representations by Jointly Modeling Syntagmatic and Paradigmatic Relations
Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng
CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences
July 20, 2015

Word Representations: Applications
- Machine Translation [Kalchbrenner and Blunsom, 2013]
- Sentiment Analysis [Maas et al., 2011]
- Language Modeling [Bengio et al., 2003]
- POS Tagging [Collobert et al., 2011]
- Parsing [Socher et al., 2011]
- Word-Sense Disambiguation [Collobert et al., 2011]

Word Representation Models (timeline, roughly 1940-2020)
BOW [Harris, 1954]; LSI [Deerwester et al., 1990]; HAL [Lund et al., 1995]; LDA [Blei et al., 2003]; NPLMs [Bengio et al., 2003]; LBL [Mnih and Hinton, 2007]; Word2Vec [Mikolov et al., 2013]; GloVe [Pennington et al., 2014]
What relations do these models capture?

Relations: One Hypothesis, Two Interpretations

The Distributional Hypothesis [Harris, 1954; Firth, 1957]
"You shall know a word by the company it keeps." (J. R. Firth)

Syntagmatic and Paradigmatic Relations [Gabrilovich and Markovitch, 2007]
Example: "Albert Einstein was a physicist." / "Richard Feynman was a physicist."
- Syntagmatic: words that co-occur in the same text region (e.g., "Einstein" and "physicist").
- Paradigmatic: words that occur in the same contexts, though not necessarily at the same time (e.g., "Einstein" and "Feynman").

Syntagmatic Relations
Albert Einstein was a physicist. / Richard Feynman was a physicist.
Word-document matrix:
             d1   d2
  Einstein    1    0
  Feynman     0    1
  physicist   1    1
(Figure: the three words plotted in the (d1, d2) document space.)
"Einstein" and "Feynman" never co-occur, but each co-occurs with "physicist" in the same text region.
Models based on this view: LSI, LDA, PV-DBOW.
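To make the syntagmatic view concrete, here is a minimal sketch (Python; the toy two-document corpus and the variable names are illustrative assumptions, not the authors' code) that builds exactly this word-document count matrix:

```python
from collections import Counter

# Hypothetical toy corpus: the slide's two example sentences, one per document.
docs = ["albert einstein was a physicist",
        "richard feynman was a physicist"]

# Word-document count matrix: one row per word, one column per document.
counts = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*counts))
matrix = {w: [c[w] for c in counts] for w in vocab}

print(matrix["einstein"])   # [1, 0]
print(matrix["feynman"])    # [0, 1]
print(matrix["physicist"])  # [1, 1]
```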

Paradigmatic Relations
Albert Einstein was a physicist. / Richard Feynman was a physicist.
Word-word co-occurrence matrix:
             Einstein  Feynman  physicist
  Einstein       0        0         1
  Feynman        0        0         1
  physicist      1        1         0
(Figure: the three words plotted by their co-occurrence vectors.)
"Einstein" and "Feynman" never co-occur, but they share the context word "physicist", so their co-occurrence rows are similar.
Models based on this view: NLMs, Word2Vec, GloVe.
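The paradigmatic view can be sketched the same way: a toy word-word co-occurrence count over a context window (Python; the window size of 4 and the helper names are assumptions for illustration):

```python
from collections import defaultdict

# Hypothetical toy corpus matching the slide's example sentences.
sentences = [["albert", "einstein", "was", "a", "physicist"],
             ["richard", "feynman", "was", "a", "physicist"]]

window = 4  # assumed symmetric context window size
cooc = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[w][sent[j]] += 1

# "einstein" and "feynman" never co-occur, yet both co-occur with "physicist",
# so their co-occurrence rows are similar: a paradigmatic relation.
print(cooc["einstein"]["feynman"], cooc["einstein"]["physicist"])  # 0 1
```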

Motivation
Albert Einstein was a physicist. / Richard Feynman was a physicist.
Word2Vec captures the paradigmatic relation between the two sentences; PV-DBOW captures the syntagmatic relation within each sentence. Neither captures both, which motivates modeling the two relations jointly.

Model

Parallel Document Context Model (PDC)
(Diagram: the context words "the cat ... on the" and the document both predict the target word "sat".)
The target word is predicted both from its surrounding context (paradigmatic) and from the document it occurs in (syntagmatic):
$$\ell = \sum_{n=1}^{N} \sum_{w_i^n \in d_n} \Big( \log p(w_i^n \mid h_i^n) + \log p(w_i^n \mid d_n) \Big)$$
where $h_i^n$ denotes the context of $w_i^n$ in document $d_n$. With negative sampling this becomes
$$\ell = \sum_{n=1}^{N} \sum_{w_i^n \in d_n} \Big( \log \sigma(\vec{w}_i^n \cdot \vec{h}_i^n) + \log \sigma(\vec{w}_i^n \cdot \vec{d}_n) + k\,\mathbb{E}_{w \sim P_{nw}}\big[\log \sigma(-\vec{w} \cdot \vec{h}_i^n)\big] + k\,\mathbb{E}_{w \sim P_{nw}}\big[\log \sigma(-\vec{w} \cdot \vec{d}_n)\big] \Big), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}$$
PDC can be viewed as a matrix factorization of both the word-document and word-context matrices; the relation of PV-DM to such a factorization is not clear.
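For intuition, a minimal sketch of the per-word negative-sampling term above (Python/NumPy; the function name pdc_term and the use of an averaged context vector for h_i are assumptions, not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pdc_term(w_vec, h_vec, d_vec, noise_vecs):
    """Negative-sampling contribution of one target word w_i in document d_n.

    w_vec      : output vector of the target word w_i
    h_vec      : projected context vector h_i (e.g., average of the surrounding words)
    d_vec      : document vector d_n
    noise_vecs : k word vectors drawn from the noise distribution P_nw
    """
    positive = np.log(sigmoid(w_vec @ h_vec)) + np.log(sigmoid(w_vec @ d_vec))
    negative = sum(np.log(sigmoid(-v @ h_vec)) + np.log(sigmoid(-v @ d_vec))
                   for v in noise_vecs)
    # The full objective sums this quantity over all words in all documents.
    return positive + negative
```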

Hierarchical Document Context Model (HDC)
(Diagram: the document predicts the target word "sat", which in turn predicts its context words "the cat ... on the".)
The document predicts the target word (syntagmatic), and the target word predicts its surrounding context words, skip-gram style (paradigmatic):
$$\ell = \sum_{n=1}^{N} \sum_{w_i^n \in d_n} \Big( \log p(w_i^n \mid d_n) + \sum_{\substack{j=i-L \\ j \neq i}}^{i+L} \log p(c_j^n \mid w_i^n) \Big)$$
With negative sampling:
$$\ell = \sum_{n=1}^{N} \sum_{w_i^n \in d_n} \Big( \sum_{\substack{j=i-L \\ j \neq i}}^{i+L} \big[ \log \sigma(\vec{c}_j^n \cdot \vec{w}_i^n) + k\,\mathbb{E}_{c \sim P_{nc}}[\log \sigma(-\vec{c} \cdot \vec{w}_i^n)] \big] + \log \sigma(\vec{w}_i^n \cdot \vec{d}_n) + k\,\mathbb{E}_{w \sim P_{nw}}[\log \sigma(-\vec{w} \cdot \vec{d}_n)] \Big), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}$$
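A corresponding sketch of the HDC per-word term (again an illustrative assumption rather than the authors' code; for brevity the same noise samples are reused for every context word, whereas fresh samples would normally be drawn for each):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hdc_term(w_vec, context_vecs, d_vec, noise_ctx, noise_words):
    """Negative-sampling contribution of one target word w_i in document d_n (HDC).

    context_vecs : output vectors of the surrounding words c_j in the window
    noise_ctx    : k noise vectors for the paradigmatic (skip-gram) part
    noise_words  : k noise vectors for the syntagmatic (document) part
    """
    # Paradigmatic part: the target word predicts each surrounding word.
    paradigmatic = sum(np.log(sigmoid(c @ w_vec)) +
                       sum(np.log(sigmoid(-v @ w_vec)) for v in noise_ctx)
                       for c in context_vecs)
    # Syntagmatic part: the document predicts the target word.
    syntagmatic = np.log(sigmoid(w_vec @ d_vec)) + \
                  sum(np.log(sigmoid(-v @ d_vec)) for v in noise_words)
    return paradigmatic + syntagmatic
```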

Relation with Existing Models
CBOW, SG, and PV-DBOW can each be obtained as sub-models of PDC/HDC. The Global Context-Aware Neural Language Model [Huang et al., 2012] also uses document-level information, but through a neural network over the weighted average of all word vectors in the document.

Experiments

Experiments Plan
- Qualitative evaluations: verify the word representations learned from the different relations.
- Quantitative evaluations: word analogy task, word similarity task.

Experimental Settings
Corpora:
  Model                                        Corpus                          Size
  C&W [Collobert et al., 2011]                 Wikipedia 2007 + Reuters RCV1   0.85B
  HPCA [Lebret and Collobert, 2014]            Wikipedia 2012                  1.6B
  GloVe                                        Wikipedia 2014 + Gigaword5      6B
  GCANLM, CBOW, SG, PV-DBOW, PV-DM, PDC, HDC   Wikipedia 2010                  1B
Parameter settings:
  window = 10, negative samples = 10, iterations = 20
  learning rate = 0.025 for SG, PV-DBOW, HDC; 0.05 for CBOW, PV-DM, PDC
  noise distribution proportional to #(w)^0.75
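As an illustration of the #(w)^0.75 noise distribution in the table, a small sketch (Python/NumPy; the function name and the toy counts are assumptions):

```python
import numpy as np

def noise_distribution(word_counts, power=0.75):
    """Unigram counts raised to the 0.75 power and normalized,
    matching the #(w)^0.75 noise distribution in the settings table."""
    words = list(word_counts)
    probs = np.array([word_counts[w] for w in words], dtype=float) ** power
    return words, probs / probs.sum()

# Hypothetical usage: draw k = 10 negative samples.
words, probs = noise_distribution({"the": 100, "cat": 7, "physicist": 3})
negatives = np.random.choice(words, size=10, p=probs)
```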

Qualitative Evaluations
Top 5 words most similar to "Feynman":
  CBOW        SG             PDC               HDC              PV-DBOW
  einstein    schwinger      geometrodynamics  schwinger        physicists
  schwinger   quantum        bethe             electrodynamics  spacetime
  bohm        bethe          semiclassical     bethe            geometrodynamics
  bethe       einstein       schwinger         semiclassical    tachyons
  relativity  semiclassical  perturbative      quantum          einstein
Paradigmatic neighbors are other physicists (e.g., schwinger, bethe); syntagmatic neighbors are topically related words (e.g., physicists, spacetime).

Word Analogy
Test set: Google [Mikolov et al., 2013]
  Semantic: Beijing is to China as Paris is to ___
  Syntactic: big is to bigger as deep is to ___
Solution: $\arg\max_{x \in W,\ x \notin \{a, b, c\}} (\vec{b} + \vec{c} - \vec{a})^{\top} \vec{x}$
Metric: percentage of questions answered correctly
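A direct sketch of this solution procedure (Python/NumPy; assumes vectors is a dict of unit-normalized embeddings, so the dot product acts as cosine similarity):

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return argmax_x of (b + c - a) . x over the vocabulary, excluding a, b, c.

    vectors: dict mapping each word to a (unit-normalized) NumPy vector.
    """
    target = vectors[b] + vectors[c] - vectors[a]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = float(target @ vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# e.g. solve_analogy("big", "bigger", "deep", vectors) should yield "deeper"
```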

Word Analogy Results
(Figure: precision, i.e. the percentage of questions answered correctly, on the Semantic, Syntactic, and Total question sets for vector dimensions 50, 100, and 300, comparing C&W, GCANLM, HPCA, GloVe, PV-DM, PV-DBOW, SG, HDC, CBOW, and PDC.)
Word2Vec and GloVe are very strong baselines.
PDC and HDC outperform CBOW and SG, respectively.

Case Study
Analogy: big is to bigger as deep is to ___
CBOW answers "shallower"; PDC answers the correct "deeper".
(Figure: candidate words, including "crevasses", scored under CBOW and PDC.)

Word Similarity
Test sets: WordSim-353 [Finkelstein et al., 2002], Stanford's Contextual Word Similarities (SCWS) [Huang et al., 2012], Rare Word (RW) [Luong et al., 2013]
Evaluation metric: Spearman rank correlation
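The evaluation itself is straightforward; a sketch (Python with SciPy; the function name and argument layout are assumptions) of computing the Spearman correlation between model and human similarity scores:

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_evaluation(pairs, human_scores, vectors):
    """Spearman rank correlation between cosine similarities and human ratings.

    pairs        : list of (word1, word2) tuples, e.g. from WordSim-353
    human_scores : human similarity ratings aligned with pairs
    vectors      : dict mapping words to embedding vectors
    """
    model_scores = []
    for w1, w2 in pairs:
        v1, v2 = vectors[w1], vectors[w2]
        model_scores.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```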

Word Similarity Results
(Figure: Spearman rank correlation ρ (×100) on WordSim-353, SCWS, and RW for vector dimensions 50, 100, and 300, comparing the same models as above.)
PV-DBOW does well.
PDC and HDC outperform CBOW and SG, respectively.

Summary
- Revisited word representation models through the lens of syntagmatic and paradigmatic relations.
- Proposed two novel models (PDC and HDC) that model syntagmatic and paradigmatic relations simultaneously.
- Achieved state-of-the-art results on word analogy and word similarity tasks.

Thanks. Q & A

References I
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3:1137-1155.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493-2537.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Trans. Inf. Syst., 20(1):116-131.
Firth, J. R. (1957). A synopsis of linguistic theory 1930-55. Studies in Linguistic Analysis (special volume of the Philological Society), 1952-59:1-32.

References II
Gabrilovich, E. and Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606-1611, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Harris, Z. (1954). Distributional structure. Word, 10(2-3):146-162.
Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL'12, pages 873-882, Stroudsburg, PA, USA. Association for Computational Linguistics.
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700-1709, Seattle, Washington, USA. Association for Computational Linguistics.
Lebret, R. and Collobert, R. (2014). Word embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482-490. Association for Computational Linguistics.

References III
Lund, K., Burgess, C., and Atchley, R. A. (1995). Semantic and associative priming in a high-dimensional semantic space. In Proceedings of the 17th Annual Conference of the Cognitive Science Society, pages 660-665.
Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104-113. Association for Computational Linguistics.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT'11, pages 142-150, Stroudsburg, PA, USA. Association for Computational Linguistics.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of the ICLR Workshop.
Mnih, A. and Hinton, G. (2007). Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, ICML'07, pages 641-648, New York, NY, USA. ACM.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532-1543, Doha, Qatar.

References IV
Socher, R., Lin, C. C., Manning, C., and Ng, A. Y. (2011). Parsing natural scenes and natural language with recursive neural networks. In Getoor, L. and Scheffer, T., editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129-136, New York, NY, USA. ACM.