Neural Word Embeddings from Scratch

Xin Li (1,2)
(1) NLP Center, Tencent AI Lab
(2) Dept. of Systems Engineering & Engineering Management, The Chinese University of Hong Kong
2018-04-09

Outline

1. What is Word Embedding?
2. Neural Word Embeddings Revisited: Classical NLM, Word2Vec, GloVe
3. Bridging Skip-Gram and Matrix Factorization: SG-NS as Implicit Matrix Factorization, SVD over the shifted PPMI matrix
4. Advanced Techniques for Learning Word Representations: General-Purpose Word Representations (by Ziyi), Task-Specific Word Representations (by Deng)

What is Word Embedding?

Word Embedding refers to low-dimensional, real-valued dense vectors encoding the semantic information of a word. Generally, the concepts Word Embeddings, Distributed Word Representations, and Dense Word Vectors can be used interchangeably.

John Rupert Firth (linguist): "You shall know a word by the company it keeps."

Karl Marx (philosopher): "The human essence is no abstraction inherent in each single individual. In its reality it is the ensemble of the social relations."

What is Word Embedding?

Word Embedding is a by-product of training a neural language model. Definition of a language model:

$p(w_{1:T}) = \prod_{t=1}^{T} p(w_t \mid w_{1:t-1})$

A Neural Language Model (NLM) is a language model in which the conditional probability is modeled by neural networks.
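
As a concrete illustration of the chain-rule factorization above, here is a minimal sketch that scores a toy sentence; the corpus, the bigram truncation of the history, and the helper names (`cond_prob`, `sentence_prob`) are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

# Toy corpus; any model of p(w_t | history) could be plugged in here.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def cond_prob(word, prev):
    """p(w_t | w_{t-1}) with add-one smoothing (the history is truncated to one word)."""
    return (bigram[(prev, word)] + 1) / (unigram[prev] + len(unigram))

def sentence_prob(words):
    """p(w_{1:T}) = prod_t p(w_t | w_{1:t-1}), here approximated with bigram conditionals."""
    p = unigram[words[0]] / len(corpus)          # p(w_1)
    for prev, word in zip(words, words[1:]):
        p *= cond_prob(word, prev)
    return p

print(sentence_prob("the dog sat on the mat .".split()))
```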

MLP-LM (Bengio et al., JMLR 2003)

$p(w_t \mid w_{t-3}, \ldots, w_{t-1}) = \mathrm{softmax}(P h_{t-1} + b)$, where $h_{t-1} = W [x_{t-3}; x_{t-2}; x_{t-1}]$ and $x_{t-i} = X w_{t-i}$ (embedding lookup).

Training objective is to maximize the log-likelihood:

$L = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n+1:t-1})$

Figure: MLP-LM. n is set to 3.
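
A minimal numpy sketch of this forward pass, written in the slide's notation ($X$ embedding matrix, $W$ hidden weights, $P$ output weights); the sizes, the random initialization, and the tanh nonlinearity (used in Bengio et al. but not shown on the slide) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H, n_ctx = 10_000, 64, 128, 3       # vocab size, embedding dim, hidden dim, context words (illustrative)

X = rng.normal(size=(V, d))               # embedding matrix: x_{t-i} = X w_{t-i}
W = rng.normal(size=(H, n_ctx * d))       # hidden-layer weights
P = rng.normal(size=(V, H))               # output-layer weights
b = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_lm_step(context_ids):
    """p(w_t | previous n_ctx words) for one position."""
    x = np.concatenate([X[i] for i in context_ids])   # [x_{t-3}; x_{t-2}; x_{t-1}]
    h = np.tanh(W @ x)                                # h_{t-1} = tanh(W [x_{t-3}; x_{t-2}; x_{t-1}])
    return softmax(P @ h + b)                         # softmax(P h_{t-1} + b)

probs = mlp_lm_step([12, 7, 42])
print(probs.shape, probs.sum())                       # (10000,) ~1.0
```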

Conv-LM (Collobert and Weston, ICML 2008)

Figure: Conv-LM. $x_t$ denotes the middle word of the window $S$; $S^x$ is obtained by replacing the middle word of $S$ with $x$.

Training objective is to minimize the ranking-type loss:

$L = \sum_{S \in D} \sum_{x \in V} \max\big(0,\; 1 - p(y = 1 \mid S) + p(y = 1 \mid S^x)\big)$
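
A minimal sketch of the pairwise hinge term inside this sum; `score_true` and `score_corrupt` stand for the model's outputs on $S$ and $S^x$ (written as probabilities on the slide, though any real-valued score works the same way), and the numbers are made up for illustration.

```python
def hinge_rank_term(score_true, score_corrupt):
    """max(0, 1 - f(S) + f(S^x)): the genuine window should outscore the corrupted one by a margin of 1."""
    return max(0.0, 1.0 - score_true + score_corrupt)

print(hinge_rank_term(score_true=2.3, score_corrupt=0.4))   # 0.0: margin satisfied, no loss
print(hinge_rank_term(score_true=0.9, score_corrupt=0.7))   # 0.8: violation, penalized
```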

RNN-LM (Mikolov et al., INTERSPEECH 2010)

The training objective is the same as for MLP-LM (i.e., maximum likelihood).

$p(w_t \mid w_1, \ldots, w_{t-1}) = \mathrm{softmax}(P h_{t-1} + b)$, where $h_{t-1} = f(W x_{t-1} + U h_{t-2} + b)$ and $x_{t-1} = X w_{t-1}$.

Figure: RNN-LM.
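
A minimal numpy sketch of the recurrence, again in the slide's notation; the sizes, the random initialization, and the choice of tanh for $f$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, H = 10_000, 64, 128                 # vocab size, embedding dim, hidden dim (illustrative)

X = rng.normal(size=(V, d))               # embedding matrix
W = rng.normal(size=(H, d))               # input-to-hidden weights
U = rng.normal(size=(H, H))               # hidden-to-hidden (recurrent) weights
P = rng.normal(size=(V, H))               # hidden-to-output weights
b_h, b_o = np.zeros(H), np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_ids):
    """Return p(w_t | w_1, ..., w_{t-1}) after reading the whole prefix word_ids."""
    h = np.zeros(H)
    for w in word_ids:
        h = np.tanh(W @ X[w] + U @ h + b_h)   # h_{t-1} = f(W x_{t-1} + U h_{t-2} + b)
    return softmax(P @ h + b_o)               # softmax(P h_{t-1} + b)

print(rnn_lm([3, 17, 256]).argmax())          # index of the most likely next word
```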

Word2Vec (Mikolov et al., ICLR 2013 & NIPS 2013)

Word2Vec involves two different models, namely CBOW and SG:

1. CBOW (Continuous Bag-of-Words): use the context words to predict the middle word.
2. SG (Skip-Gram): use the middle word to predict the context words.

Word2Vec (Mikolov et al., ICLR 2013 & NIPS 2013)

The main feature of Word2Vec (IMO) is that it is a non-MLE framework, i.e., its aim is not to model the joint probability of the input words.

CBOW: $L = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-c:t-1}, w_{t+1:t+c})$

SG: $L = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$

Word2Vec is the first model designed specifically for learning word embeddings from unlabeled data (rather than obtaining them as a by-product of language modeling)!
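
A minimal numpy sketch of the Skip-Gram term $\log p(w_{t+j} \mid w_t)$ under a full softmax; the vocabulary size, the dimensionality, the separate "input" and "output" embedding tables, and the toy (center, context) pairs are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 5_000, 100                            # illustrative sizes
W_in = rng.normal(scale=0.1, size=(V, d))    # center-word ("input") embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # context-word ("output") embeddings

def sg_log_prob(center, context):
    """log p(w_{t+j} | w_t) under the full-softmax Skip-Gram model."""
    scores = W_out @ W_in[center]            # dot product with every output vector
    scores -= scores.max()                   # numerical stability
    return scores[context] - np.log(np.exp(scores).sum())

# Average log-likelihood over a few (center, context) pairs from one window.
pairs = [(10, 11), (10, 9), (11, 10)]
L = np.mean([sg_log_prob(c, o) for c, o in pairs])
print(L)
```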

Word2Vec (Mikolov et al., ICLR 2013 & NIPS 2013)

Some extensions for improving Word2Vec: HSoftmax (Hierarchical Softmax).

1. The full softmax layer is too expensive, since it needs to evaluate $|V|$ output nodes (generally more than 1M for a large corpus).
2. HSoftmax exploits a binary-tree representation of the output layer and only needs to evaluate about $\log_2 |V|$ nodes.
3. $p(w_2 \mid w_i) = \prod_{j=1}^{L(w_2)-1} \sigma\big( [\![ n(w_2, j{+}1) = \mathrm{ch}(n(w_2, j)) ]\!] \cdot v_{n(w_2, j)}^{\top} x_i \big)$, where $n(w, j)$ is the $j$-th node on the path from the root to $w$, $\mathrm{ch}(n)$ is a fixed child of $n$, and $[\![ \cdot ]\!]$ is $1$ if its argument is true and $-1$ otherwise.

Figure: Binary Tree. Figure: Huffman Tree example. Mikolov uses a Huffman tree to construct the hierarchical structure.
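
A minimal sketch of evaluating this product of sigmoids along one root-to-leaf path; the tiny vocabulary, the random vectors, and the explicit `(inner_node, sign)` encoding of the path are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 8, 16                                     # tiny illustrative vocabulary
inner = rng.normal(scale=0.1, size=(V - 1, d))   # one vector v_n per inner node of the binary tree
emb = rng.normal(scale=0.1, size=(V, d))         # input word embeddings x_i

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_prob(path, center):
    """p(w | w_i): product of sigmoids along w's root-to-leaf path.

    `path` is a list of (inner_node_id, sign) pairs; sign is +1 when the path takes the
    child designated by ch(.) at that node and -1 otherwise (i.e., the [[...]] indicator)."""
    p = 1.0
    for node, sign in path:
        p *= sigmoid(sign * (inner[node] @ emb[center]))
    return p

# A word whose (hypothetical) Huffman code is "designated child, other child, designated child":
print(hs_prob([(0, +1), (1, -1), (4, +1)], center=2))
```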

Word2Vec (Mikolov et al., ICLR 2013 & NIPS 2013)

NS (Negative Sampling) is another alternative for speeding up training.

1. It reformulates the $|V|$-class classification problem as a binary classification problem (word prediction becomes co-occurrence prediction).
2. It trains with $k$ additional corrupted (negative) samples for each positive sample:

$L = \log \sigma(x_O^{\top} x_I) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\big[\log \sigma(-x_i^{\top} x_I)\big]$

Subsampling: the most frequent words usually provide less information, so randomly discarding them speeds up training and can improve performance.
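
A minimal numpy sketch of this objective for a single positive pair; the sizes are illustrative, and the noise distribution $P_n(w)$ is replaced here by a uniform sampler rather than the smoothed unigram distribution used in practice.

```python
import numpy as np

rng = np.random.default_rng(4)
V, d, k = 5_000, 100, 5                      # vocab, dimension, number of negatives (illustrative)
W_in = rng.normal(scale=0.1, size=(V, d))    # x_I: center-word embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # x_O: context-word embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_objective(center, context):
    """log sigma(x_O . x_I) + sum_i log sigma(-x_i . x_I), estimated with k sampled negatives."""
    pos = np.log(sigmoid(W_out[context] @ W_in[center]))
    negs = rng.integers(0, V, size=k)        # stand-in for w_i ~ P_n(w)
    neg = np.log(sigmoid(-(W_out[negs] @ W_in[center]))).sum()
    return pos + neg

print(sgns_objective(center=10, context=11))
```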

GloVe (Pennington et al., EMNLP 2014)

Motivations of GloVe:

- Global co-occurrence counts are the primary information for generating word embeddings (Word2Vec ignores this kind of global information).
- Using co-occurrence information alone is not enough to distinguish relevant words from irrelevant words.
- The appropriate starting point for learning word vectors should be ratios of co-occurrence probabilities!

GloVe (Pennington et al., EMNLP 2014)

$F(x_i, x_j, \hat{x}_k) = P_{ik}/P_{jk} \;\Rightarrow\; F(x_i - x_j, \hat{x}_k) = P_{ik}/P_{jk} \;\Rightarrow\; F\big((x_i - x_j)^{\top} \hat{x}_k\big) = P_{ik}/P_{jk}$  (1)

The roles of word and context word should be exchangeable, thus:

$F\big((x_i - x_j)^{\top} \hat{x}_k\big) = F(x_i^{\top} \hat{x}_k) / F(x_j^{\top} \hat{x}_k)$  (2)

According to (1) and (2), taking $F = \exp$ and $P_{ik} = C_{ik}/C_i$:

$F(x_i^{\top} \hat{x}_k) = P_{ik} \;\Rightarrow\; x_i^{\top} \hat{x}_k = \log(C_{ik}) - \log(C_i) \;\Rightarrow\; x_i^{\top} \hat{x}_k + b_i + \hat{b}_k = \log(C_{ik})$  (3)

Training objective:

$J = \sum_{i,k=1}^{V} f(C_{ik}) \big(x_i^{\top} \hat{x}_k + b_i + \hat{b}_k - \log C_{ik}\big)^2$  (4)

where $f(C_{ik})$ is a weighting function that filters out the noise from rare co-occurrences.
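
A minimal numpy sketch of one summand of objective (4); the vector sizes, the toy nonzero counts, and the weighting-function constants ($x_{\max} = 100$, $\alpha = 3/4$, the values suggested in the GloVe paper) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 1_000, 50                              # illustrative sizes
x_max, alpha = 100.0, 0.75                    # weighting-function constants from the GloVe paper
W = rng.normal(scale=0.1, size=(V, d))        # word vectors x_i
W_ctx = rng.normal(scale=0.1, size=(V, d))    # context vectors x_hat_k
b, b_ctx = np.zeros(V), np.zeros(V)

def f_weight(c):
    """f(C_ik): down-weight rare co-occurrences and cap very frequent ones."""
    return (c / x_max) ** alpha if c < x_max else 1.0

def glove_term(i, k, c_ik):
    """One summand of J: f(C_ik) * (x_i . x_hat_k + b_i + b_hat_k - log C_ik)^2."""
    err = W[i] @ W_ctx[k] + b[i] + b_ctx[k] - np.log(c_ik)
    return f_weight(c_ik) * err ** 2

# J sums these terms over all pairs with C_ik > 0, e.g. for a toy co-occurrence dictionary:
cooc = {(3, 7): 12.0, (3, 9): 1.0, (8, 7): 250.0}
print(sum(glove_term(i, k, c) for (i, k), c in cooc.items()))
```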

SG-NS as Implicit Matrix Factorization (Levy and Goldberg, NIPS 2014 & TACL 2015)

Skip-Gram with Negative Sampling (SG-NS) can be trained efficiently and achieves state-of-the-art results. The outputs of SG-NS are the word embeddings $X^w \in \mathbb{R}^{V_w \times d}$ and the context-word embeddings $X^c \in \mathbb{R}^{V_c \times d}$ (usually ignored).

Figure: SG-NS viewed as factorizing an implicit matrix $M \in \mathbb{R}^{V_w \times V_c}$ into $X^w (X^c)^{\top}$. Figure: SVD for a rank-$d$ factorization, $A \in \mathbb{R}^{m \times n} \approx U \Sigma V^{\top}$ with $U \in \mathbb{R}^{m \times d}$, $\Sigma \in \mathbb{R}^{d \times d}$, $V^{\top} \in \mathbb{R}^{d \times n}$.

SG-NS as Implicit Matrix Factorization (Levy and Goldberg, NIPS 2014 & TACL 2015)

Training objective of SG-NS:

$\ell(I, O) = \log \sigma(x_O^{\top} x_I) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\big[\log \sigma(-x_i^{\top} x_I)\big]$

$L = \sum_{I \in V_w} \sum_{O \in V_c} C_{IO}\, \ell(I, O)$

The optimal value is attained at:

$y = x_O^{\top} x_I = \log\Big(\frac{C_{IO} \cdot |D|}{C_I \cdot C_O}\Big) - \log k$

The first term is the pointwise mutual information (PMI) of the word pair $(I, O)$.

SG-NS as Implicit Matrix Factorization (Levy and Goldberg, NIPS 2014 & TACL 2015)

The matrix $M^{\mathrm{PMI}_k}$, called the shifted PMI matrix, emerges as the optimal solution of the SG-NS objective. Each cell of the matrix is defined as:

$M^{\mathrm{PMI}_k}_{ij} = X^w_i \cdot X^c_j = x_i^{\top} x_j = \mathrm{PMI}(i, j) - \log k$

The objective of SG-NS can therefore be regarded as a weighted matrix factorization problem over $M^{\mathrm{PMI}_k}$.

The matrices $M^{\mathrm{PMI}_k}_0$ and $M^{\mathrm{PPMI}_k}$ can be better alternatives to $M^{\mathrm{PMI}_k}$:

$(M^{\mathrm{PMI}_k}_0)_{ij} = \begin{cases} 0 & \text{if } C_{ij} = 0 \\ M^{\mathrm{PMI}_k}_{ij} & \text{otherwise} \end{cases}$

$M^{\mathrm{PPMI}_k}_{ij} = \mathrm{PPMI}(i, j) - \log k, \qquad \mathrm{PPMI}(i, j) = \max(\mathrm{PMI}(i, j), 0)$
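
To make these definitions concrete, the sketch below builds the shifted PMI matrix and its two variants from a toy co-occurrence count matrix; the counts and the choice of $k$ are made up for illustration.

```python
import numpy as np

k = 5                                              # number of negative samples (illustrative)
C = np.array([[10., 2., 0.],                       # toy co-occurrence counts C_ij
              [ 2., 8., 1.],
              [ 0., 1., 6.]])

D = C.sum()                                        # total number of (word, context) pairs
Pw = C.sum(axis=1, keepdims=True) / D              # marginal p(word)
Pc = C.sum(axis=0, keepdims=True) / D              # marginal p(context)

with np.errstate(divide="ignore"):
    pmi = np.log((C / D) / (Pw * Pc))              # PMI(i, j); -inf where C_ij = 0

M_pmi_k = pmi - np.log(k)                          # M^{PMI_k}
M_pmi_k_0 = np.where(C == 0, 0.0, M_pmi_k)         # M^{PMI_k}_0: zero for unobserved pairs
M_ppmi_k = np.maximum(pmi, 0.0) - np.log(k)        # M^{PPMI_k} = PPMI(i, j) - log k

print(np.round(M_pmi_k_0, 3))
```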

SVD over the shifted PPMI matrix (Levy and Goldberg, NIPS 2014 & TACL 2015)

Shifted PPMI matrix: $M^{\mathrm{SPPMI}_k}_{ij} = \mathrm{SPPMI}_k(i, j)$, where $\mathrm{SPPMI}_k(i, j) = \max\big(\mathrm{PMI}(i, j) - \log k,\; 0\big)$.

Perform SVD over $M^{\mathrm{SPPMI}_k}$ and treat $U \Sigma$ as the word representations $X^w$.

This method outperforms SG-NS on word similarity tasks!
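
A minimal numpy sketch of this pipeline: build the SPPMI matrix from toy counts, take a rank-$d$ truncated SVD, and keep $U_d \Sigma_d$ as the word vectors; the vocabulary size, the counts, $k$, and $d$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
k, d = 5, 3                                        # shift and embedding dimension (illustrative)
C = rng.integers(0, 20, size=(8, 8)).astype(float) # toy word-context co-occurrence counts

D = C.sum()
Pw = C.sum(axis=1, keepdims=True) / D
Pc = C.sum(axis=0, keepdims=True) / D
with np.errstate(divide="ignore"):
    pmi = np.log((C / D) / (Pw * Pc))              # -inf where C_ij = 0
sppmi = np.maximum(pmi - np.log(k), 0.0)           # SPPMI_k(i, j) = max(PMI(i, j) - log k, 0)

U, S, Vt = np.linalg.svd(sppmi)                    # full SVD, then truncate to rank d
X_w = U[:, :d] * S[:d]                             # word representations X^w = U_d Sigma_d

print(X_w.shape)                                   # (vocab size, d)
```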
