Lecture 6: Neural Networks for Representing Word Meaning

Mirella Lapata
School of Informatics, University of Edinburgh
mlap@inf.ed.ac.uk
February 7, 2017

Logistic Regression
- Input is a feature vector; the output is one (binary classification) or many (multinomial distribution).
- A weight matrix (or vector) directly connects the inputs to the output.
- Trained by gradient descent (a neural network with no hidden units).
[Diagram: inputs x_1 ... x_5 connected by weights w_1 ... w_5 directly to the output y]

Logistic Functions
Logistic function: f(x) = L / (1 + e^{-k(x - x_0)})
Sigmoid: f(x) = 1 / (1 + e^{-x})
Softmax: P(y = j | x) = e^{x^T w_j} / sum_{k=1}^{K} e^{x^T w_k}
- e is the base of the natural logarithm, x_0 the x-value of the sigmoid's midpoint, L the curve's maximum value, and k the steepness of the curve.
- The sigmoid is a special case of the logistic function (L = 1, k = 1, x_0 = 0).
- The softmax is a generalization of the logistic function to multiple classes.
- x^T w denotes the inner product of x and w.
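
A minimal numpy sketch of the two functions above (the function names and the three-class scores are illustrative; subtracting the maximum score is a standard numerical-stability trick not shown on the slide):

    import numpy as np

    def sigmoid(x):
        # 1 / (1 + e^{-x})
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(scores):
        # Shift by the max score for numerical stability; the result is unchanged.
        exp = np.exp(scores - np.max(scores))
        return exp / exp.sum()

    print(sigmoid(np.array([0.0, 2.0])))       # [0.5, ~0.88]
    print(softmax(np.array([2.0, 1.0, 0.1])))  # three probabilities summing to 1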

Neural Networks
[Diagram: inputs x_1 ... x_5 feeding a hidden layer, which in turn feeds the output y]
- W: input weights (a matrix)
- I: input signal, a feature vector (one per example)
- Activation function: sigmoid, tanh
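
A sketch of what such a network computes in one forward pass, assuming a sigmoid hidden layer and a single sigmoid output (all sizes and random weights are illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)            # input signal I: a feature vector with 5 inputs
    W_in = rng.normal(size=(3, 5))    # input weights W: 3 hidden units
    w_out = rng.normal(size=3)        # hidden-to-output weights

    h = sigmoid(W_in @ x)             # hidden-layer state
    y = sigmoid(w_out @ h)            # network output
    print(y)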

Training of Neural Networks
Forward Pass
- The input signal is presented first.
- The hidden-layer state is computed (a vector-matrix operation followed by a non-linear activation).
- The outputs are computed (a vector-matrix operation and usually a non-linear activation).
Backpropagation
- To train the network, we need to compute the gradient of the error.
- The gradients are sent back using the same weights that were used in the forward pass.
Gradient Descent
- Batch gradient descent: compute the gradient from the full batch of N training examples.
- Stochastic gradient descent: compute the gradient from 1 training example at a time.
- Minibatch: compute the gradient from a minibatch of M training examples (1 < M < N).
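
A compact sketch tying the three parts above together (forward pass, gradients sent back through the same weights, minibatch updates) for one sigmoid hidden layer and a cross-entropy loss; the toy data, layer sizes, learning rate, and batch size are all assumptions for illustration:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                     # N = 1000 examples, 5 features
    y = (X[:, 0] * X[:, 1] > 0).astype(float)          # a non-linear toy target

    W1 = rng.normal(scale=0.5, size=(5, 8))            # input-to-hidden weights
    w2 = rng.normal(scale=0.5, size=8)                 # hidden-to-output weights
    lr, M = 0.5, 32                                    # learning rate, minibatch size

    for epoch in range(50):
        perm = rng.permutation(len(X))
        for start in range(0, len(X), M):
            idx = perm[start:start + M]
            Xb, yb = X[idx], y[idx]
            h = sigmoid(Xb @ W1)                       # forward pass: hidden layer
            p = sigmoid(h @ w2)                        # forward pass: output
            d_out = p - yb                             # error gradient at the output
            d_hid = np.outer(d_out, w2) * h * (1 - h)  # sent back through w2
            w2 -= lr * h.T @ d_out / len(idx)          # minibatch gradient steps
            W1 -= lr * Xb.T @ d_hid / len(idx)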

Training of Neural Networks
Learning Rate:
- Controls how much we change the weights.
- Too small a value results in long training times; too large a value erases previously learned patterns.
- We typically start with a high learning rate and reduce it during training.
Training Epochs:
- Several passes over the training data are often performed (an epoch is one pass over the full data set).
Regularization:
- The network often overfits (it fails to generalize at test time).
- Large weights end up modelling only some small subset of the data.
- We can force the weights to stay small during training to avoid this problem (L1 and L2 regularization).
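
A small sketch of how a decaying learning rate and an L2 penalty might enter the update step (the schedule, penalty strength, and stand-in gradient are illustrative assumptions, not from the slide):

    import numpy as np

    w = np.array([0.8, -1.2, 0.3])             # some weight vector being trained
    lr0, decay, lam = 0.5, 0.1, 1e-3           # initial rate, decay, L2 strength

    for epoch in range(5):
        lr = lr0 / (1.0 + decay * epoch)       # start high, reduce during training
        grad = np.array([0.1, -0.2, 0.05])     # stand-in for a backpropagated gradient
        w -= lr * (grad + lam * w)             # the L2 term pulls weights toward zero
        print(epoch, round(lr, 3), w)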

Model Parameters
The choice of hyper-parameters has to be made manually:
- the type of activation function
- the choice of architecture (how many hidden layers, and their sizes)
- the learning rate and number of training epochs
- which features are presented at the input layer
- how to regularize

N-grams
N-gram-based language models are often used to compute the probability of a sentence W:
P(W) = prod_i P(w_i | w_1 ... w_{i-1})
Often simplified to trigrams:
P(W) = prod_i P(w_i | w_{i-2}, w_{i-1})
P(this is a sentence) = P(this) * P(is | this) * P(a | this, is) * P(sentence | is, a)
P(a | this, is) = C(this is a) / C(this is)
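
A small sketch of the count-based trigram estimate above (the toy corpus is made up for illustration):

    from collections import Counter

    corpus = "this is a sentence and this is a test".split()

    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_trigram(w, u, v):
        # P(w | u, v) = C(u v w) / C(u v), the maximum-likelihood estimate
        return trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0

    print(p_trigram("a", "this", "is"))   # C(this is a) / C(this is) = 2/2 = 1.0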

One-hot Representations
A simple way to encode discrete concepts such as words; also known as 1-of-N coding, where N is the size of the vocabulary.
vocabulary = (Monday, Tuesday, is, a, today)
Monday  = [1 0 0 0 0]
Tuesday = [0 1 0 0 0]
is      = [0 0 1 0 0]
a       = [0 0 0 1 0]
today   = [0 0 0 0 1]
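
A direct translation of the slide's vocabulary into code (a minimal sketch):

    import numpy as np

    vocabulary = ["Monday", "Tuesday", "is", "a", "today"]
    index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        # 1-of-N vector: all zeros except a single 1 at the word's index
        v = np.zeros(len(vocabulary), dtype=int)
        v[index[word]] = 1
        return v

    print(one_hot("is"))   # [0 0 1 0 0]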

Bag-of-Words Representations
Ignores word order; a text is represented as the sum of the one-hot vectors of its words. Can be extended to bags of n-grams to capture the local ordering of words.
vocabulary = (Monday, Tuesday, is, a, today)
Monday Monday      = [2 0 0 0 0]
today is a Monday  = [1 0 1 1 1]
today is a Tuesday = [0 1 1 1 1]
is a Monday today  = [1 0 1 1 1]
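
A minimal sketch of the same representation in code; note that the last two examples on the slide come out identical because word order is lost:

    from collections import Counter
    import numpy as np

    vocabulary = ["Monday", "Tuesday", "is", "a", "today"]

    def bag_of_words(text):
        # Sum of one-hot vectors: per-word counts, order ignored
        counts = Counter(text.split())
        return np.array([counts[w] for w in vocabulary])

    print(bag_of_words("today is a Monday"))   # [1 0 1 1 1]
    print(bag_of_words("is a Monday today"))   # the same vector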

A Basic Neural Language Model
[Diagram: PREVIOUS WORD -> HIDDEN LAYER -> CURRENT WORD]
- A bigram neural language model: the previous word predicts the current word via a hidden layer.
- There are as many outputs as there are words in the vocabulary.
- The model learns compressed, continuous representations of words (usually the matrix of weights between the input and hidden layers).
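
A sketch of the forward pass of such a bigram model (vocabulary and hidden-layer sizes are toy values; because the input is one-hot, the first multiplication reduces to picking one row of the input weight matrix, which is exactly the learned word representation):

    import numpy as np

    V, H = 5, 3                             # toy vocabulary and hidden-layer sizes
    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(V, H))          # rows double as word representations
    W_out = rng.normal(size=(H, V))

    def predict_next(prev_word_id):
        h = W_in[prev_word_id]              # one-hot input times W_in selects one row
        scores = h @ W_out
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()              # distribution over the whole vocabulary

    print(predict_next(2))                  # P(current word | previous word = id 2)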

Another Neural Language Model
Bengio et al. (2003, JMLR)
- Input layer, projection layer, hidden layer, and output layer.
- The projection layer is linear; the hidden layer is non-linear.
- A softmax at the output computes a probability distribution over the whole vocabulary: g(o)_i = e^{o_i} / sum_k e^{o_k}
- The model is computationally very expensive.

Continuous Bag-of-Words Model (CBOW)
Mikolov et al. (2013, ICLR)
- CBOW adds up inputs from the words within a short window to predict the current word.
- The weights for different positions are shared.
- Computationally much more efficient than the standard NNLM.
- The hidden layer is just linear.
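
A sketch of the CBOW forward pass under these assumptions (a shared input matrix, a linear hidden layer that sums the context rows; sizes and ids are toy values):

    import numpy as np

    V, d = 5, 3                               # toy vocabulary and embedding sizes
    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(V, d))            # shared across all context positions
    W_out = rng.normal(size=(d, V))

    def cbow_predict(context_ids):
        h = W_in[context_ids].sum(axis=0)     # linear hidden layer: sum of context rows
        scores = h @ W_out
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()                # distribution over the current word

    print(cbow_predict([0, 2, 3]))            # context word ids are illustrative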

Skip-gram NNLM
- We can reformulate the CBOW model, predicting the surrounding words from the current word.
- The aim is to find word representations that are useful for predicting the surrounding words in a sentence or document.

Skip-gram NNLM
[Architecture diagram]

The Hidden Layer
- Suppose we wish to learn word vectors with 300 features.
- The hidden layer is then a weight matrix with 10,000 rows (one for every word in the vocabulary) and 300 columns (one for every hidden neuron).
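
A sketch of why this matrix acts as a lookup table of word vectors (the word index 42 is arbitrary):

    import numpy as np

    vocab_size, embedding_dim = 10_000, 300
    rng = np.random.default_rng(0)
    W = rng.normal(size=(vocab_size, embedding_dim))   # hidden-layer weight matrix

    word_id = 42
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0

    # Multiplying a one-hot vector by W simply selects row 42,
    # i.e. that word's 300-dimensional vector.
    assert np.allclose(one_hot @ W, W[word_id])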

Output Weights
[Diagram of the hidden-to-output weight matrix]

Skip-gram Details
- Each word w ∈ W is associated with a vector v_w ∈ R^d, and each context c ∈ C with a vector v_c ∈ R^d.
- W is the word vocabulary, C the context vocabulary, and d the embedding dimensionality.
- The vector components are latent (parameters to be learnt).
- The objective function maximizes the probability of the corpus D (the set of all word-context pairs extracted from the text):
argmax_θ prod_{(w,c) ∈ D} p(c | w; θ)

Skip-gram Details
The basic skip-gram formulation defines p(c | w; θ) using the softmax function:
p(c | w; θ) = e^{v_c · v_w} / sum_{c' ∈ C} e^{v_{c'} · v_w}
where v_c and v_w are the vector representations of c and w, and C is the set of all available contexts.
- We seek parameter values (i.e., vector representations for both words and contexts) such that the dot product v_w · v_c is maximized for good word-context pairs.
- The formulation is impractical because of the summation sum_{c' ∈ C} e^{v_{c'} · v_w} over all contexts c'.
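
A direct sketch of the softmax above, which also makes the cost visible: every probability requires a dot product with all |C| context vectors (all sizes and ids are illustrative):

    import numpy as np

    V, C_size, d = 1_000, 1_000, 50                # toy vocabulary sizes and dimension
    rng = np.random.default_rng(0)
    word_vecs = rng.normal(scale=0.1, size=(V, d))
    context_vecs = rng.normal(scale=0.1, size=(C_size, d))

    def p_context_given_word(c, w):
        scores = context_vecs @ word_vecs[w]       # one dot product per context in C
        exp = np.exp(scores - scores.max())
        return exp[c] / exp.sum()

    print(p_context_given_word(c=7, w=3))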

Training
- Stochastic gradient descent and backpropagation.
- It is useful to sub-sample the frequent words (e.g., the, is, a): words are discarded with a probability proportional to their frequency, which makes training faster and reduces the influence of frequent words (similar in spirit to IDF weighting).
- A non-linearity does not seem to improve the performance of these models, so the hidden layer uses no activation function.
- Problem: the output layer is very large; its size equals the vocabulary size, which can easily be in the order of millions (too many outputs to evaluate).
- Solution: negative sampling (also hierarchical softmax).
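
The slide does not give the sub-sampling formula; a sketch using the discard probability 1 - sqrt(t / f(w)) from Mikolov et al. (2013), with t and the word frequencies chosen for illustration:

    import numpy as np

    def keep_probability(freq, t=1e-5):
        # Probability of keeping a word whose relative frequency is `freq`
        return min(1.0, np.sqrt(t / freq))

    for word, freq in [("the", 0.05), ("is", 0.01), ("zebra", 1e-6)]:
        print(word, round(keep_probability(freq), 4))   # frequent words are mostly dropped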

Negative Sampling
- Instead of propagating the signal from the hidden layer to the whole output layer, only the output neuron that represents the positive class plus a few randomly sampled neurons are evaluated.
- The output neurons are treated as independent logistic regression classifiers.
- This makes the training speed independent of the vocabulary size (and can easily be parallelized).

Negative Sampling Objective
- Let D denote the dataset of observed (w, c) pairs and D' a set of randomly sampled negative (w, c) pairs.
- p(D = 1 | w, c) denotes the probability that (w, c) came from D; p(D = 0 | w, c) = 1 - p(D = 1 | w, c) the probability that it did not.
argmax_θ prod_{(w,c) ∈ D} p(D = 1 | w, c; θ) · prod_{(w,c) ∈ D'} p(D = 0 | w, c; θ)
= argmax_θ sum_{(w,c) ∈ D} log p(D = 1 | w, c; θ) + sum_{(w,c) ∈ D'} log(1 - p(D = 1 | w, c; θ))
= argmax_θ sum_{(w,c) ∈ D} log 1/(1 + e^{-v_c · v_w}) + sum_{(w,c) ∈ D'} log 1/(1 + e^{v_c · v_w})
See Goldberg and Levy (2014) for the full derivation.
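
A sketch of one negative-sampling update for a single (w, c) pair with k random negatives, following the objective above (the sizes, k, learning rate, and ids are illustrative, and the negatives are drawn uniformly rather than from the smoothed unigram distribution used in practice):

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, k, lr = 1_000, 50, 5, 0.025
    word_vecs = rng.normal(scale=0.1, size=(V, d))
    context_vecs = rng.normal(scale=0.1, size=(V, d))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_pair(w, c):
        negatives = rng.integers(0, V, size=k)        # randomly sampled contexts
        ids = np.concatenate(([c], negatives))
        labels = np.zeros(k + 1)
        labels[0] = 1.0                               # the observed pair is the positive
        vecs = context_vecs[ids]
        preds = sigmoid(vecs @ word_vecs[w])          # k+1 independent logistic units
        errors = preds - labels
        context_vecs[ids] -= lr * np.outer(errors, word_vecs[w])
        word_vecs[w] -= lr * (errors @ vecs)

    train_pair(w=3, c=7)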

Model Evaluation
Type of relationship   | Word Pair 1          | Word Pair 2
Common capital city    | Athens Greece        | Oslo Norway
All capital cities     | Astana Kazakhstan    | Harare Zimbabwe
Currency               | Angola kwanza        | Iran rial
City-in-state          | Chicago Illinois     | Stockton California
Man-Woman              | brother sister       | grandson granddaughter
Adjective to adverb    | apparent apparently  | rapid rapidly
Opposite               | possibly impossibly  | ethical unethical
Comparative            | great greater        | tough tougher
Superlative            | easy easiest         | lucky luckiest
Present Participle     | think thinking       | read reading
Nationality adjective  | Switzerland Swiss    | Cambodia Cambodian
Past tense             | walking walked       | swimming swam
Plural nouns           | mouse mice           | dollar dollars
Plural verbs           | work works           | speak speaks
20K questions; models trained on 6B words with a 1M-word vocabulary.
Find the word closest to the question word (e.g., Greece); consider it correct if it matches the answer.

Model Evaluation
Models trained on the same data (640-dimensional word vectors), Semantic-Syntactic Word Relationship test set:
Architecture | Semantic Accuracy [%] | Syntactic Accuracy [%]
RNNLM        | 9                     | 36
NNLM         | 23                    | 53
CBOW         | 24                    | 64
Skip-gram    | 55                    | 59
Can you think of additional baseline comparisons?

Model Evaluation
Subtract two word vectors and add the result to a third word's vector; the word nearest to the resulting vector answers the analogy.
Check out more examples at: http://rare-technologies.com/word2vec-tutorial/#app
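
A self-contained sketch of that vector arithmetic; the four 2-d vectors are made up purely to show the mechanics (real analogies need trained embeddings):

    import numpy as np

    vectors = {"king":  np.array([0.9, 0.8]), "queen": np.array([0.9, 0.2]),
               "man":   np.array([0.1, 0.8]), "woman": np.array([0.1, 0.2])}

    def analogy(a, b, c):
        # b - a + c, then return the nearest remaining word by cosine similarity
        target = vectors[b] - vectors[a] + vectors[c]
        cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        candidates = [w for w in vectors if w not in (a, b, c)]
        return max(candidates, key=lambda w: cos(vectors[w], target))

    print(analogy("man", "king", "woman"))   # "queen" with these toy vectors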

Summary
Don't count, predict! (Baroni et al., ACL 2014)

Why Does Skip-gram Work?
- We don't really know!
- The objective tries to increase the quantity v_w · v_c for good word-context pairs and to decrease it for bad ones.
- Words that share many contexts will end up similar to each other.
- There are lots of design choices here too: the choice of contexts, sub-sampling, rare-word pruning, and the dimensionality of the vectors.

Resources
- word2vec is available at https://code.google.com/p/word2vec/
- A tool for training word vectors with the CBOW and skip-gram architectures; it supports both negative sampling and hierarchical softmax.
- Optimized for very large datasets (billions of training words).
- Pre-trained vectors, trained on large datasets (100B words), are also available.