DISTRIBUTIONAL SEMANTICS


COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS

LEXICAL DATABASES - PROBLEMS
Manually constructed: expensive, and human annotation can be biased and noisy. Language is dynamic: new words (slang, terminology, etc.) and new senses appear constantly. The Internet provides us with massive amounts of text. Can we use that to obtain word meanings?

DISTRIBUTIONAL SEMANTICS
"You shall know a word by the company it keeps" (Firth). Document co-occurrence is often indicative of topic (document as context), e.g. voting and politics. Local context reflects a word's semantic class (word window as context), e.g. eat a pizza, eat a burger. Two approaches: count-based (Vector Space Models) and prediction-based.

THE VECTOR SPACE MODEL
Fundamental idea: represent meaning as a vector. Consider documents as context (e.g. tweets). One matrix, two viewpoints: documents represented by their words (web search), and words represented by their documents (text analysis).

         state  fun  heaven
  425      0     1     0
  426      3     0     0
  427      0     0     0
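As a rough illustration (plain NumPy, with identifiers invented here), the same count matrix gives document vectors as rows and word vectors as columns:

```python
import numpy as np

# Term-document counts from the table above: rows are documents 425-427,
# columns are the words state, fun, heaven.
A = np.array([[0, 1, 0],
              [3, 0, 0],
              [0, 0, 0]])

doc_426 = A[1, :]       # document 426 represented by its word counts (row view)
word_state = A[:, 0]    # "state" represented by the documents it occurs in (column view)
print(doc_426, word_state)   # [3 0 0] [0 3 0]
```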

MANIPULATING THE VSM
Weighting the values, creating low-dimensional dense vectors, and comparing vectors.

TF-IDF
Standard weighting scheme for information retrieval; it also discounts common words.

  tf matrix     the  country  hell
  425            43      5      1
  426            24      1      0
  427            37      0      3
  df            500     14      7

  tf-idf_{t,d} = tf_{t,d} \cdot \log_2 \frac{|D|}{df_t}

  tf-idf matrix  the  country  hell
  425              0    25.8    6.2
  426              0     5.2    0
  427              0     0     18.5
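A minimal sketch of the weighting above, assuming the collection size |D| = 500 implied by "the" receiving weight 0; it reproduces the tf-idf matrix on the slide:

```python
import numpy as np

# tf counts (rows = docs 425-427, columns = "the", "country", "hell") and document frequencies.
tf = np.array([[43, 5, 1],
               [24, 1, 0],
               [37, 0, 3]], dtype=float)
df = np.array([500, 14, 7], dtype=float)
N = 500  # collection size; implied by "the" getting weight 0, since log2(500/500) = 0

tfidf = tf * np.log2(N / df)
print(tfidf.round(1))
# [[ 0.  25.8  6.2]
#  [ 0.   5.2  0. ]
#  [ 0.   0.  18.5]]
```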

DIMENSIONALITY REDUCTION
Term-document matrices are very sparse. Dimensionality reduction creates shorter, denser vectors: more practical (fewer features) and it removes noise (less overfitting).

SINGULAR VALUE DECOMPOSITION
A = U \Sigma V^T, with m = rank(A), where A is the term-document matrix, U is the new term matrix (|V| x m), \Sigma is the m x m diagonal matrix of singular values, and V^T is the new document matrix (m x |D|). [Figure: worked example showing A decomposed into U, Σ and V^T.]

TRUNCATING LATENT SEMANTIC ANALYSIS
Truncating U, \Sigma and V^T to k dimensions produces the best possible rank-k approximation of the original matrix. The truncated U_k (or V_k^T) is then a new low-dimensional representation of each word (or document). Typical values for k are 100-5000. [Figure: U truncated from m to k columns.]
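A sketch of truncated SVD with NumPy; the random matrix A and the variable names are placeholders rather than the slide's example:

```python
import numpy as np

# A: term-document matrix (|V| x |D|); random counts here purely for illustration.
rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(1000, 200)).astype(float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 100                 # truncation rank (typical values 100-5000)
U_k = U[:, :k]          # low-dimensional word representations (one row per word)
Vt_k = Vt[:k, :]        # low-dimensional document representations (one column per document)
A_k = U_k @ np.diag(S[:k]) @ Vt_k   # best rank-k approximation of A
print(U_k.shape, Vt_k.shape, A_k.shape)   # (1000, 100) (100, 200) (1000, 200)
```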

WORDS AS CONTEXT

           the  country  hell
  state   1973    10      1
  fun       54     2      0
  heaven    55     1      3

Lists how often words appear with other words, in some predefined context (usually a window). The obvious problem with raw frequency: it is dominated by common words.
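A small sketch (hypothetical helper, pure Python) of how such window-based counts can be collected from running text:

```python
from collections import Counter

def window_cooccurrence(tokens, L=2):
    """Count how often each word pair co-occurs within a window of L positions."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - L), min(len(tokens), i + L + 1)):
            if i != j:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "bereft of life he rests in peace".split()
counts = window_cooccurrence(tokens, L=2)
print(counts[("rests", "peace")])   # 1
```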

POINTWISE MUTUAL INFORMATION
For two events x and y, pointwise mutual information (PMI) compares the actual joint probability of the two events (as seen in the data) with the probability expected under the assumption of independence:

  PMI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}

CALCULATING PMI

           the     country  hell      Σ
  state    1973      10      1      12786
  fun        54       2      0        633
  heaven     55       1      3        627
  Σ      1047519    3617    780   15871304

  P(x, y) = count(x, y) / Σ,   P(x) = Σ_x / Σ,   P(y) = Σ_y / Σ

For x = state, y = country:
  P(x, y) = 10 / 15871304 = 6.3 \times 10^{-7}
  P(x) = 12786 / 15871304 = 8.0 \times 10^{-4}
  P(y) = 3617 / 15871304 = 2.3 \times 10^{-4}
  PMI(x, y) = \log_2 \frac{6.3 \times 10^{-7}}{(8.0 \times 10^{-4})(2.3 \times 10^{-4})} = 1.78
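The same worked calculation as a short Python snippet, using the marginal counts from the table:

```python
import math

total = 15871304          # grand total of all co-occurrence counts
count_xy = 10             # count(state, country)
count_x = 12786           # row marginal for "state"
count_y = 3617            # column marginal for "country"

p_xy = count_xy / total   # ~6.3e-7
p_x = count_x / total     # ~8.0e-4
p_y = count_y / total     # ~2.3e-4

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))      # 1.78
```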

PMI MATRIX

           the  country  hell
  state   1.22   1.78    0.63
  fun     0.37   3.79    -inf
  heaven  0.41   2.80    6.60

PMI does a better job of capturing interesting semantics (e.g. heaven and hell), but it is obviously biased towards rare words, and it doesn't handle zeros well.

PMI TRICKS
Zero out all negative values (PPMI): avoids -inf and unreliable negative values. Counter the bias towards rare events: artificially increase the marginal probabilities, or smooth the probabilities.
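A sketch of a PPMI transform along these lines, using context-distribution smoothing (raising context counts to a power alpha < 1) as one common way to counter the rare-word bias. It uses only the 3x3 block above as a toy input, so its output will not match the PMI matrix on the previous slide, which uses full corpus marginals:

```python
import numpy as np

def ppmi(counts, alpha=0.75):
    """Positive PMI; alpha < 1 smooths the context distribution, lifting rare contexts."""
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total
    p_y = counts.sum(axis=0, keepdims=True) ** alpha
    p_y = p_y / p_y.sum()
    with np.errstate(divide="ignore"):         # log2(0) -> -inf, clipped below
        pmi = np.log2(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0)                  # zero out negatives (and -inf)

# Toy input: the state/fun/heaven x the/country/hell block only.
counts = np.array([[1973, 10, 1],
                   [  54,  2, 0],
                   [  55,  1, 3]], dtype=float)
print(ppmi(counts).round(2))
```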

SIMILARITY
Regardless of the vector representation, the classic use of a vector is comparison with another vector (though vectors can also be used directly as features). For IR: find the documents most similar to a query.

COSINE SIMILARITY
The cosine of the angle between two vectors is the dot product of the two vectors divided by the product of their norms:

  \cos \theta = \frac{u \cdot v}{|u|\,|v|}

where u \cdot v = \sum_i u_i v_i and |u| = \sqrt{\sum_i u_i^2}.

COSINE SIMILARITY EXAMPLE
  u = [0, 0, 1, 0, 1, 1, 1]
  v = [0, 1, 1, 0, 1, 1, 0]
  u \cdot v = 1 + 1 + 1 = 3
  |u| = |v| = \sqrt{1 + 1 + 1 + 1} = 2
  \cos \theta = \frac{3}{2 \times 2} = 0.75
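The same example as code (hypothetical helper, standard library only):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [0, 0, 1, 0, 1, 1, 1]
v = [0, 1, 1, 0, 1, 1, 0]
print(cosine(u, v))   # 3 / (2 * 2) = 0.75
```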

EMBEDDINGS FROM PREDICTIONS
Neural-network-inspired approaches seek to learn vector representations of words and their contexts. Key idea: word embeddings should be similar to the embeddings of neighbouring words, and dissimilar to the embeddings of words that don't occur nearby. Vector comparison uses the dot product, u \cdot v = \sum_j u_j v_j, as part of a classifier over a word and its immediate context.

EMBEDDINGS FROM PREDICTIONS
Framed as learning a classifier. Skip-gram: predict the words in the local context surrounding a given word, e.g. P(life | rests), P(he | rests), P(in | rests), P(peace | rests) for the centre word rests in "Bereft of life he rests in peace! If you hadn't nailed him...". CBOW: predict the word in the centre given the words in the local surrounding context, e.g. P(rests | {he, in, life, peace}). Local context means words within L positions, e.g. L = 2.

SKIP-GRAM MODEL
Generates each word in the context given the centre word, e.g. P(life | rests), P(he | rests), P(in | rests), P(peace | rests) for "Bereft of life he rests in peace! If you hadn't nailed him...". The total probability is defined as

  \prod_{l \in \{-L, \dots, -1, 1, \dots, L\}} P(w_{t+l} \mid w_t)

where the subscript denotes position in the running text. For each word pair,

  P(w_k \mid w_j) = \frac{\exp(c_{w_k} \cdot v_{w_j})}{\sum_{w' \in V} \exp(c_{w'} \cdot v_{w_j})}

EMBEDDING PARAMETERISATION
Two parameter matrices, each holding a d-dimensional embedding for every word in the vocabulary: W (|V| x d) holds the target-word embeddings and C (|V| x d) holds the context-word embeddings (Fig 19.17, JM). Words are numbered, e.g. by sorting the vocabulary and using a word's position as its index.

ONE-HOT VECTORS AND EMBEDDINGS
Words are integer numbers, e.g. cat = the 17235th word. The embeddings for cat are then
  V_{17235} = [1.23, 0.8, -0.15, 0.7, 1.1, -1.3, ...] (d-dimensional vector)
  C_{17235} = [0.32, 0.1, 0.27, 2.5, -0.1, 0.45, ...] (d-dimensional vector)
i.e. a separate embedding for cat appearing as the centre word and for cat appearing in the context of another word. A one-hot vector is all 0s with a single 1 at index i, e.g. x = cat = [0, 0, 0, ..., 0, 1, 0, ..., 0] where index 17235 is set to 1 and all other V-1 entries are 0. This allows us to write V_{cat} as Vx.
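A toy illustration of why the one-hot notation is convenient: multiplying a one-hot vector by an embedding matrix just selects one row. W and C here stand for the target and context matrices (the slide writes the target matrix as V, which clashes with the vocabulary size), and all the numbers are random placeholders:

```python
import numpy as np

V, d = 20000, 100                      # vocabulary size and embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))            # target-word embedding matrix
C = rng.normal(size=(V, d))            # context-word embedding matrix

cat = 17235                            # integer index of "cat"
x = np.zeros(V)
x[cat] = 1.0                           # one-hot vector for "cat"

# Multiplying the one-hot vector by W simply picks out row 17235 (same holds for C).
assert np.allclose(x @ W, W[cat])
```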

SKIP-GRAM MODEL
[Figure (Fig 19.18, JM): skip-gram network. Input layer: one-hot vector x for the centre word w_t (length |V|). Projection layer: d-dimensional embedding of w_t via W. Output layer: probabilities over the vocabulary for each context word (w_{t-1}, w_{t+1}, ...) via C.]

SKIP-GRAM COMPONENTS
1. Look up the embedding of the centre word from the target matrix: v_j = Wx.
2. Compute the dot product with all possible context words: v_j \cdot c_k for every word k \in V.
3. Normalise the output vector so that values are positive and sum to one, using the softmax transformation

  softmax(z)_i = \frac{\exp(z_i)}{\sum_{i'} \exp(z_{i'})}

These values can now be treated as probabilities; we hope that the probability of the observed context words ends up higher than that of other words.
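A sketch of steps 1-3 as a full-softmax forward pass; the matrices, sizes and helper name are illustrative, not the word2vec implementation:

```python
import numpy as np

def skipgram_context_probs(j, W, C):
    """P(w_k | w_j) for every word k in the vocabulary (full softmax)."""
    v_j = W[j]                      # 1. look up the centre-word embedding
    scores = C @ v_j                # 2. dot product with every context embedding
    scores -= scores.max()          #    (stabilise the exponentials)
    exp_scores = np.exp(scores)     # 3. softmax normalisation
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(1000, 50))   # toy vocabulary of 1000 words, d = 50
C = rng.normal(scale=0.1, size=(1000, 50))
probs = skipgram_context_probs(j=17, W=W, C=C)
print(probs.sum())                            # ~1.0
```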

TRAINING THE SKIP-GRAM MODEL
The only data requirement is raw text. Train to maximise the likelihood of the text using gradient descent. But this is too slow: obtaining the probability of a word requires a sum over the entire vocabulary. Alternative: negative sampling, which reduces the problem to binary classification: distinguish each observed (centre, context) pair from a handful of randomly sampled ones. Much more efficient than full maximum likelihood estimation.
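A minimal sketch of the negative-sampling objective for a single (centre, context) pair with a few sampled noise words; in practice negatives are drawn from a smoothed unigram distribution and the embeddings are updated by stochastic gradient descent, which is omitted here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(v_centre, c_pos, c_negs):
    """Score the observed pair as positive and the sampled noise words as negative;
    no sum over the whole vocabulary is needed."""
    pos = np.log(sigmoid(c_pos @ v_centre))
    neg = np.sum(np.log(sigmoid(-(c_negs @ v_centre))))
    return -(pos + neg)   # minimised during training

rng = np.random.default_rng(0)
d = 50
v_centre = rng.normal(scale=0.1, size=d)       # embedding of the centre word
c_pos = rng.normal(scale=0.1, size=d)          # embedding of an observed context word
c_negs = rng.normal(scale=0.1, size=(5, d))    # embeddings of 5 sampled "noise" words
print(neg_sampling_loss(v_centre, c_pos, c_negs))
```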

CBOW ARCHITECTURE
[Figure: CBOW network. Input layer: one-hot vectors for each context word (w_{t-1}, w_{t+1}, ...). Projection layer: sum of the embeddings of the context words. Output layer: probability of the centre word w_t.]

PROPERTIES
Skip-gram and CBOW both perform fairly well; there is no clear reason to prefer one over the other, and the choice is task dependent. Both are very fast to train using the negative sampling approximation. In fact, skip-gram with negative sampling is related to LSA: it can be viewed as a factorisation of the PMI matrix over words and their contexts. See Levy and Goldberg (2014) for details.

EVALUATING WORD VECTORS
Lexicon-style tasks: WordSim-353 consists of noun pairs with judged relatedness; SimLex-999 also covers verbs and adjectives; TOEFL asks for the closest synonym as multiple choice. Word analogy task: man is to king as woman is to ???; France is to Paris as Italy is to ??? Evaluate where the correct answer falls in the ranked predictions, given tables of known relations.
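A sketch of the analogy evaluation by vector arithmetic, assuming some vocabulary list vocab and embedding matrix E with matching row order (both hypothetical here):

```python
import numpy as np

def analogy(a, b, c, vocab, E):
    """Rank words by cosine similarity to vec(b) - vec(a) + vec(c)."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ (target / np.linalg.norm(target))
    ranked = [vocab[i] for i in np.argsort(-sims) if vocab[i] not in (a, b, c)]
    return ranked[:5]

# e.g. analogy("man", "king", "woman", vocab, E) should rank "queen" near the top.
```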

POINTERS TO SOFTWARE
Word2Vec: C implementation of Skip-gram and CBOW, https://code.google.com/archive/p/word2vec/
Gensim: Python library with many methods including LSI, topic models and Skip-gram/CBOW, https://radimrehurek.com/gensim/index.html
GloVe: http://nlp.stanford.edu/projects/glove/
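A minimal Gensim usage sketch for Skip-gram with negative sampling; the toy corpus is invented, and the parameter names follow Gensim 4.x (older releases use size instead of vector_size):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences; any iterable of token lists works.
sentences = [["bereft", "of", "life", "he", "rests", "in", "peace"],
             ["he", "is", "an", "ex", "parrot"]]

# sg=1 selects Skip-gram (sg=0 is CBOW); negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1, negative=5)

print(model.wv["rests"][:5])                   # first 5 dimensions of the learned vector
print(model.wv.most_similar("rests", topn=3))  # nearest neighbours by cosine similarity
```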

FURTHER READING
JM3 Ch 15; JM3 Ch 16.1-16.2.
Optional: Turney, P. D., and Pantel, P. 2010. From Frequency to Meaning: Vector Space Models of Semantics.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. NIPS 2013. Distributed Representations of Words and Phrases and their Compositionality.
Pennington, J., Socher, R., and Manning, C. D. EMNLP 2014. GloVe: Global Vectors for Word Representation.
Levy, O., and Goldberg, Y. NIPS 2014. Neural Word Embedding as Implicit Matrix Factorization.