An overview of word2vec

Benjamin Wilson, Berlin ML Meetup, July 8, 2014

Outline
1. Introduction
2. Background & significance
3. Architecture
4. CBOW word representations
5. Model scalability
6. Applications

Introduction: word2vec associates words with points in space
- word meaning and relationships between words are encoded spatially
- learns from input texts
- developed by Mikolov, Sutskever, Chen, Corrado and Dean in 2013 at Google Research

Introduction: Similar words are closer together
- spatial distance corresponds to word similarity
- words are close together ⟺ their "meanings" are similar
- notation: word w has vec[w], its point in space, as a position vector, e.g. vec[woman] = (0.1, 1.3)

Introduction: Word relationships are displacements
- the displacement (vector) between the points of two words represents the word relationship
- same word relationship ⟺ same vector, e.g.
  vec[queen] - vec[king] = vec[woman] - vec[man]
- Source: Linguistic Regularities in Continuous Space Word Representations, Mikolov et al., 2013
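
As a hedged illustration of these lookups and displacement analogies in practice, the sketch below uses the gensim library with a hypothetical pre-trained vectors file; the file name and path are assumptions, not part of the talk.

```python
# Minimal sketch using gensim's KeyedVectors; the vectors file path is hypothetical.
from gensim.models import KeyedVectors

# Load pre-trained word2vec vectors (e.g. a binary word2vec-format file).
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

print(kv["woman"][:5])                 # vec[woman]: first few coordinates
print(kv.similarity("queen", "king"))  # cosine similarity between two words

# Analogy via displacement: vec[king] + (vec[woman] - vec[man]) should be near vec[queen]
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```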

Introduction: What's in a name?
- How can a machine learn the meaning of a word? Machines only understand symbols!
- Assume the Distributional Hypothesis (D.H.) (Harris, 1954): words are characterised by the company that they keep
- Suppose we read the word cat. What is the probability P(w | cat) that we'll read the word w nearby?
- D.H.: the meaning of cat is captured by the probability distribution P(· | cat).
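
A minimal sketch of this distributional idea, estimating P(w | cat) by counting the words that occur within a small window around "cat" in a toy corpus; the corpus and the window size are illustrative assumptions.

```python
# Sketch: estimate P(w | "cat") by counting co-occurrences in a +/-2 word window.
from collections import Counter

corpus = "the cat sat on the mat the cat chased the mouse the dog chased the cat".split()
window = 2  # illustrative choice

counts = Counter()
for i, word in enumerate(corpus):
    if word == "cat":
        context = corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window]
        counts.update(context)

total = sum(counts.values())
p_given_cat = {w: c / total for w, c in counts.items()}
print(p_given_cat)  # empirical estimate of P(w | cat)
```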

Background & significance: word2vec as shallow learning
- word2vec is a successful example of shallow learning
- word2vec can be trained as a very simple neural network: a single hidden layer with no non-linearities, and no unsupervised pre-training of layers (i.e. no deep learning)
- word2vec demonstrates that, for vectorial representations of words, shallow learning can give great results.

Background & significance: word2vec focuses on vectorization
- word2vec builds on existing research
- the architecture is essentially that of Mnih and Hinton's log-bilinear model
- change of focus: vectorization, not language modelling.

Background & significance: word2vec scales
- word2vec scales very well, allowing models to be trained using more data
- training is sped up by employing one of: hierarchical softmax (more on this later) or negative sampling (for another day)
- runs on a single machine: you can train a model at home
- the implementation is published

Architecture: Learning from text
- word2vec learns from input text
- considers each word w_0 in turn, along with its context C
- context = neighbouring words (here, for simplicity, 2 words forward and back)

  sample #   w_0     context C
  1          once    {upon, a}
  4          time    {upon, a, in, a}
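
A minimal sketch of this (word, context) extraction with a symmetric window of 2 words on each side; the example sentence is an assumption chosen to be consistent with the rows above.

```python
# Sketch: extract (current word, context) pairs with a symmetric window of 2.
text = "once upon a time in a galaxy far far away".split()  # illustrative text
window = 2

samples = []
for i, w0 in enumerate(text):
    context = text[max(0, i - window):i] + text[i + 1:i + 1 + window]
    samples.append((w0, context))

print(samples[0])  # ('once', ['upon', 'a'])
print(samples[3])  # ('time', ['upon', 'a', 'in', 'a'])
```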

Architecture: Two approaches, CBOW and Skip-gram
- word2vec can learn the word vectors via two distinct learning tasks, CBOW and Skip-gram
- CBOW: predict the current word w_0 given only C
- Skip-gram: predict the words of C given w_0
- Skip-gram produces better word vectors for infrequent words
- CBOW is faster by a factor of the window size, and is more appropriate for larger corpora
- We will speak only of CBOW (life is short).

Architecture: CBOW learning task
- Given only the current context C, e.g. C = {upon, a, in, a}, predict which of all possible words is the current word w_0, e.g. w_0 = time
- multiclass classification over the vocabulary W
- the output ŷ = ŷ(C) = P(· | C) is a probability distribution on W
- train so that ŷ approximates the target distribution y, which is one-hot on the current word (e.g. all of its mass on time).

Architecture: training CBOW with softmax regression
- Model:
  $$\hat{y} = P(\cdot \mid C;\, \alpha, \beta) = \mathrm{softmax}_\beta\Big(\sum_{w \in C} \alpha_w\Big),$$
  where α, β are families of parameter vectors.
- Pictorially: [network diagram not reproduced in the transcription]
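
A minimal numpy sketch of this forward pass, under illustrative assumptions: a tiny vocabulary, random parameters, and the context vectors summed as in the formula above.

```python
# Sketch of the CBOW forward pass: hidden = sum of context vectors, output = softmax.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["once", "upon", "a", "time", "in"]          # toy vocabulary (assumption)
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8                                  # vocabulary size, embedding dimension

alpha = rng.normal(scale=0.1, size=(V, d))            # input vectors, one row per word
beta = rng.normal(scale=0.1, size=(V, d))             # output vectors, one row per word

def softmax(z):
    z = z - z.max()                                   # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cbow_forward(context):
    v = sum(alpha[word_to_id[w]] for w in context)    # hidden layer: sum of context vectors
    return softmax(beta @ v)                          # distribution over the vocabulary

y_hat = cbow_forward(["upon", "a", "in", "a"])
print(dict(zip(vocab, y_hat.round(3))))               # P(. | C) for each candidate word
```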

Architecture: stochastic gradient descent
- learn the model parameters (here, the linear transforms)
- minimize the difference between the output distribution ŷ and the target distribution y, measured using the cross-entropy H:
  $$H(y, \hat{y}) = -\sum_{w \in W} y_w \log \hat{y}_w$$
- given y is one-hot, this is the same as maximizing the probability of the correct outcome, ŷ_{w_0} = P(w_0 | C; α, β)
- use stochastic gradient descent: for each (current word, context) pair, update all the parameters once.
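
A hedged sketch of one such SGD update for the full-softmax CBOW model, continuing the toy variables from the previous snippet; the learning rate is arbitrary and the code is illustrative, not the reference implementation.

```python
# Sketch: one SGD step for CBOW with a full softmax (continues the variables above).
eta = 0.05  # learning rate (illustrative)

def cbow_sgd_step(context, target, alpha, beta):
    v = sum(alpha[word_to_id[w]] for w in context)   # hidden layer
    y_hat = softmax(beta @ v)                        # predicted distribution over W
    y = np.zeros(V)
    y[word_to_id[target]] = 1.0                      # one-hot target

    d_scores = y_hat - y                             # softmax + cross-entropy gradient wrt scores
    d_v = beta.T @ d_scores                          # gradient flowing into the hidden layer
    beta -= eta * np.outer(d_scores, v)              # in-place update of ALL |W| output vectors
    for w in context:
        alpha[word_to_id[w]] -= eta * d_v            # update only the context words' input vectors

    return -np.log(y_hat[word_to_id[target]])        # cross-entropy loss for this pair

print(cbow_sgd_step(["upon", "a", "in", "a"], "time", alpha, beta))
```

Note how the second-layer update touches every row of beta; this is exactly the cost that hierarchical softmax and negative sampling (discussed later) avoid.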

CBOW word representations: the word2vec word representation
- Post-training, associate every word w ∈ W with a vector vec[w]:
- vec[w] is the vector of synaptic strengths connecting the input layer unit w to the hidden layer
- more meaningfully, vec[w] is the hidden-layer representation of the single-word context C = {w}
- vectors are (artificially) normed to unit length (Euclidean norm), post-training.
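
In the toy numpy setup above, vec[w] would simply be the corresponding row of the input-side matrix; below is a sketch of the post-training unit-length normalisation mentioned on the slide, applied to the illustrative alpha matrix.

```python
# Sketch: read off vec[w] and normalise all word vectors to unit Euclidean length.
vec = alpha / np.linalg.norm(alpha, axis=1, keepdims=True)

print(vec[word_to_id["time"]])      # vec[time], now unit length
print(np.linalg.norm(vec, axis=1))  # all norms are 1.0
```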

CBOW word representations: word vectors encode meaning
- Consider words w, w′ ∈ W:
  w ≈ w′
  ⟺ P(· | w) ≈ P(· | w′)    (by the Distributional Hypothesis)
  ⟺ softmax_β(vec[w]) ≈ softmax_β(vec[w′])    (if the model is well-trained)
  ⟺ vec[w] ≈ vec[w′]
- The last equivalence is tricky to show...

CBOW word representations: word vectors encode meaning (cont.)
- We compare output distributions using the cross-entropy H(softmax_β(u), softmax_β(v)):
- that closeness of the vectors implies closeness of the output distributions follows from continuity in u, v
- the converse can be argued for from the convexity in v when u is fixed.

CBOW word representations: word relationship encoding
- Given two examples of a single word relationship, e.g. queen is to king as aunt is to uncle:
- Find the closest point to vec[queen] + (vec[uncle] - vec[aunt]). It should be vec[king].
- Perform this test for many word relationship examples. CBOW & Skip-gram give the correct answer in 58-69% of cases.
- Cosine distance is used (justified empirically!). What is the natural metric?
- Source: Efficient estimation of word representations in vector space, Mikolov et al., 2013
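
A minimal sketch of this analogy test using cosine similarity over a matrix of unit-normalised word vectors; the word list and random vectors below are placeholders for vectors learned as above.

```python
# Sketch: answer "queen is to king as aunt is to ?" by nearest cosine neighbour.
import numpy as np

def analogy(a, b, c, vec, words, word_to_id):
    """Return the word closest to vec[a] + (vec[c] - vec[b]), excluding the inputs.
    With a='queen', b='aunt', c='uncle', a well-trained model should return 'king'."""
    query = vec[word_to_id[a]] + (vec[word_to_id[c]] - vec[word_to_id[b]])
    query = query / np.linalg.norm(query)
    sims = vec @ query                       # cosine similarity: rows of vec are unit length
    for w in (a, b, c):
        sims[word_to_id[w]] = -np.inf        # never return one of the query words
    return words[int(np.argmax(sims))]

# Placeholder usage with random unit vectors (real, trained vectors are needed
# for the answer to actually be 'king').
rng = np.random.default_rng(0)
words = ["queen", "king", "aunt", "uncle", "woman", "man"]
word_to_id = {w: i for i, w in enumerate(words)}
vec = rng.normal(size=(len(words), 50))
vec = vec / np.linalg.norm(vec, axis=1, keepdims=True)
print(analogy("queen", "aunt", "uncle", vec, words, word_to_id))
```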

Model scalability: softmax implementations are slow
- Updating all second-layer parameters for every (current word, context) pair (w_0, C) is very costly.
- Softmax models
  $$P(w_0 \mid C;\, \alpha, \beta) = \frac{\exp(\beta_{w_0} \cdot v)}{\sum_{w' \in W} \exp(\beta_{w'} \cdot v)},$$
  where v = Σ_{w ∈ C} α_w, and (α_w)_{w ∈ W}, (β_w)_{w ∈ W} are the model parameters.
- For each (w_0, C) pair, we must update O(|W|) ≈ 100k parameters.

Model scalability: alternative models with fewer parameter updates
- word2vec offers two alternatives to replace the softmax:
- hierarchical softmax (H.S.) (Morin & Bengio, 2005)
- negative sampling, an adaptation of noise contrastive estimation (Gutmann & Hyvärinen, 2012) (skipped today)
- negative sampling scales better in vocabulary size; the quality of the word vectors is comparable
- both make significantly fewer parameter updates in the second layer (though they have no fewer parameters).

Model scalability: hierarchical softmax
- choose an arbitrary binary tree with # leaves = vocabulary size
- then P(· | C) induces a weighting of the edges
- think of each parent node n as a Bernoulli distribution P_n on its children. Then, e.g.
  P(time | C) = P_{n_0}(left | C) · P_{n_1}(right | C) · P_{n_2}(left | C)

Model scalability: hierarchical softmax modelling
- In H.S., the probability of any outcome depends on only O(log |W|) parameters.
- For each parent node n, assume that its Bernoulli distribution is modelled by
  $$P_n(\mathrm{left} \mid C) = \sigma(\theta_n \cdot v),$$
  where:
  - θ_n is a vector of parameters
  - v = Σ_{w ∈ C} α_w is the context vector
  - σ is the sigmoid function σ(z) = 1 / (1 + exp(-z))
- Then, e.g.
  $$P(\mathrm{time} \mid C) = \sigma(\theta_{n_0} \cdot v)\,\bigl(1 - \sigma(\theta_{n_1} \cdot v)\bigr)\,\sigma(\theta_{n_2} \cdot v)$$
  depends on only 3 = log_2 |W| of the parameter vectors θ_n.
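
A minimal sketch of this computation, assuming we are handed a word's root-to-leaf path as a list of (node parameter vector, go-left?) decisions; the tree construction itself is sketched after the next slide, and all vectors here are random placeholders.

```python
# Sketch: probability of a word under hierarchical softmax, given its root-to-leaf path.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hs_probability(path, v):
    """path: list of (theta_n, go_left) pairs along the word's root-to-leaf path.
    v: context vector, v = sum of alpha_w for w in C."""
    p = 1.0
    for theta_n, go_left in path:
        p_left = sigmoid(theta_n @ v)
        p *= p_left if go_left else (1.0 - p_left)
    return p

# Illustrative usage: a 3-deep path (left, right, left), as in P(time | C) above.
rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)                    # context vector (placeholder)
path = [(rng.normal(size=d), True),       # n0: go left
        (rng.normal(size=d), False),      # n1: go right
        (rng.normal(size=d), True)]       # n2: go left
print(hs_probability(path, v))
```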

Model scalability: word2vec H.S. uses a Huffman tree
- could use any binary tree with # leaves = vocabulary size
- word2vec uses a Huffman tree: frequent words have shorter paths in the tree
- this results in an even faster implementation

  word       count
  fat        3
  fridge     2
  zebra      1
  potato     3
  and        14
  in         7
  today      4
  kangaroo   2
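
A hedged sketch of building such a Huffman tree from the counts above using Python's heapq, then reading off each word's path length; frequent words like "and" end up with shorter paths.

```python
# Sketch: build a Huffman tree over the example word counts and print path lengths.
import heapq
from itertools import count

counts = {"fat": 3, "fridge": 2, "zebra": 1, "potato": 3,
          "and": 14, "in": 7, "today": 4, "kangaroo": 2}

# Heap entries: (total count, tie-breaker, tree), where a tree is a word or a (left, right) pair.
tie = count()
heap = [(c, next(tie), w) for w, c in counts.items()]
heapq.heapify(heap)
while len(heap) > 1:
    c1, _, t1 = heapq.heappop(heap)
    c2, _, t2 = heapq.heappop(heap)
    heapq.heappush(heap, (c1 + c2, next(tie), (t1, t2)))
root = heap[0][2]

def path_lengths(tree, depth=0):
    if isinstance(tree, str):                    # leaf: a word
        return {tree: depth}
    left, right = tree
    return {**path_lengths(left, depth + 1), **path_lengths(right, depth + 1)}

for w, length in sorted(path_lengths(root).items(), key=lambda x: x[1]):
    print(f"{w:10s} count={counts[w]:2d}  path length={length}")
```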

Applications: application to machine translation
- train word representations for e.g. English and Spanish separately
- the word vectors are similarly arranged!
- learn a linear transform that (approximately) maps the word vectors of English to the word vectors of their translations in Spanish
- the same transform is used for all vectors
- Source: Exploiting Similarities among Languages for Machine Translation, Mikolov, Le and Sutskever, 2013
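
A minimal sketch of learning that linear transform by least squares, assuming we already have matrices of English vectors X and Spanish vectors Z for a small seed dictionary of translation pairs; all data below is random placeholder data, and the dictionary size and dimension are assumptions.

```python
# Sketch: fit W so that W x_en is close to z_es for a seed dictionary of translation pairs,
# then translate a new English vector by mapping it and taking the nearest Spanish word.
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d = 500, 50                # seed dictionary size and vector dimension (assumptions)
X = rng.normal(size=(n_pairs, d))   # English vectors of the seed pairs (placeholder)
Z = rng.normal(size=(n_pairs, d))   # Spanish vectors of their translations (placeholder)

# Least-squares solution of X @ W_T ~ Z, i.e. W x ~ z for each translation pair.
W_T, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(x_en, spanish_vecs):
    """Map an English vector into the Spanish space and return the index of the
    most cosine-similar Spanish word vector."""
    z = x_en @ W_T
    spanish_unit = spanish_vecs / np.linalg.norm(spanish_vecs, axis=1, keepdims=True)
    return int(np.argmax(spanish_unit @ (z / np.linalg.norm(z))))

print(translate(X[0], Z))           # with real data, ideally the correct translation's index
```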

Applications: machine translation results
- English to Spanish: the correct translation is guessed in 33% - 35% of cases.
- Source: Exploiting Similarities among Languages for Machine Translation, Mikolov, Le and Sutskever, 2013