Deep Learning. Ali Ghodsi. University of Waterloo


University of Waterloo

Language Models A language model computes a probability for a sequence of words: P(w_1, ..., w_T). This is useful for machine translation, both for word ordering, p(the cat is small) > p(small the is cat), and for word choice, p(walking home after school) > p(walking house after school).
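To make this concrete, here is a minimal sketch of a count-based bigram language model; the toy corpus, the add-one smoothing, and the <s>/</s> boundary markers are illustrative choices, not part of the slides.

```python
# Minimal sketch of a count-based bigram language model (illustrative only).
from collections import Counter

corpus = ["the cat is small", "walking home after school"]  # assumed toy corpus
tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))
V = len(unigrams)

def p_bigram(prev, word):
    # Add-one smoothed estimate of P(word | prev).
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def p_sentence(sentence):
    # P(w_1, ..., w_T) approximated as a product of bigram probabilities.
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= p_bigram(a, b)
    return p

print(p_sentence("the cat is small") > p_sentence("small the is cat"))  # True
```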

One-hot representation Most statistical NLP work regards words as atomic symbols: book, conference, university. In vector space, this is a sparse vector with a single 1 and a lot of zeros, e.g. book = [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0] and library = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]. Problems: 1. The dimensionality of the vector is the size of the vocabulary, e.g. 13M for Google 1T and 500K for a big vocabulary. 2. Any two distinct one-hot vectors, such as those for book and library, are orthogonal, so the representation carries no notion of similarity.
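A small sketch of one-hot vectors over an assumed four-word vocabulary; it shows both the dimensionality issue and the orthogonality of distinct one-hot vectors.

```python
# One-hot word vectors (minimal sketch; the vocabulary is illustrative).
import numpy as np

vocab = ["book", "conference", "university", "library"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))   # dimensionality = vocabulary size
    v[index[word]] = 1.0
    return v

book, library = one_hot("book"), one_hot("library")
print(book @ library)  # 0.0: distinct one-hot vectors are orthogonal (no similarity signal)
```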

Problem The dimensionality of the vectors increases with the vocabulary size, so the vectors will be very high dimensional.

Idea 1: Reduce the dimensionality (dimensionality reduction) via the Singular Value Decomposition (SVD), X = UΣV^T. The columns of U contain the eigenvectors of XX^T, the columns of V contain the eigenvectors of X^T X, and Σ is a diagonal matrix whose diagonal values (the singular values) are the square roots of the eigenvalues of XX^T or X^T X.
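A minimal numpy sketch of this idea on an assumed toy co-occurrence matrix X; it also checks that the singular values are the square roots of the eigenvalues of XX^T.

```python
# SVD of a toy word-word co-occurrence matrix (sketch; X is illustrative).
import numpy as np

X = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # toy co-occurrence counts

U, S, Vt = np.linalg.svd(X)              # X = U @ diag(S) @ Vt
k = 2
word_vectors = U[:, :k] * S[:k]          # keep the top-k dimensions as dense word vectors

# Sanity check: singular values are the square roots of the eigenvalues of X X^T.
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]   # descending order
print(np.allclose(S**2, eigvals))             # True
```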

Nonlinear dimensionality reduction by locally linear embedding. Sam Roweis and Lawrence Saul, Science, vol. 290, 2000. Figure: Arranging words in a continuous semantic space. Each word was initially represented by a high-dimensional vector that counted the number of times it appeared in different encyclopedia articles. LLE was applied to these word-document count vectors.

Computational Cost The computational cost of SVD scales quadratically for a d × n matrix: O(nd^2) (when d < n). This is not feasible for a large number of words or documents, and it is hard to handle out-of-sample words or documents.
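One common workaround, not discussed on the slides, is to compute only a truncated (randomized) SVD instead of the full decomposition; the sketch below uses scikit-learn's TruncatedSVD on an assumed random count matrix purely for illustration.

```python
# Truncated SVD as a cheaper alternative to the full decomposition (sketch).
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(5000, 300)).astype(float)  # assumed count-style matrix

svd = TruncatedSVD(n_components=50, random_state=0)   # keep only the top 50 components
low_dim = svd.fit_transform(X)                        # shape: (5000, 50)
print(low_dim.shape)
```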

Idea 2: Learn low-dimensional vectors directly Learning representations by back-propagating errors (Rumelhart, Hinton, Williams, 1986). A Neural Probabilistic Language Model (Bengio et al., 2003). Natural Language Processing (Almost) from Scratch (Collobert et al., 2011). Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013). GloVe: Global Vectors for Word Representation (Pennington et al., 2014).

Idea 2 Predict the surrounding words of every word (word2vec), or capture co-occurrence counts directly (GloVe). These methods are fast and can incorporate a new sentence/document or add a word to the vocabulary.

Distributional Hypothesis The Distributional Hypothesis in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to have similar meanings. Harris, Z. (1954), Distributional structure. "A word is characterized by the company it keeps." Firth, J.R. (1957), A synopsis of linguistic theory 1930-1955.

Co-occurrence matrix Co-occurrence can be interpreted as an indicator of semantic proximity of words. "Silence is the language of God, all else is poor translation." Rumi (1207-1273). Counts:

              silence  is  the  language  of  God
    silence      0      1   0      0      0   0
    is           1      0   1      0      0   0
    the          0      1   0      1      0   0
    language     0      0   1      0      1   0
    of           0      0   0      1      0   1
    God          0      0   0      0      1   0
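A short sketch that reproduces the counts above from the six-word phrase, assuming a symmetric window of size 1.

```python
# Window-1 co-occurrence counts for the Rumi example (minimal sketch).
import numpy as np

tokens = "silence is the language of god".split()
vocab = sorted(set(tokens), key=tokens.index)      # keep first-occurrence order
idx = {w: i for i, w in enumerate(vocab)}

C = np.zeros((len(vocab), len(vocab)), dtype=int)
window = 1
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            C[idx[w], idx[tokens[j]]] += 1

print(vocab)
print(C)   # matches the counts table: each word co-occurs with its immediate neighbours
```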

Vectors for Word Representation word2vec and GloVe learn relationships between words. The output vectors have interesting relationships, for example: vec(king) - vec(man) + vec(woman) ≈ vec(queen), or vec(Montreal Canadiens) - vec(Montreal) + vec(Toronto) ≈ vec(Toronto Maple Leafs).
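A sketch of the analogy arithmetic using cosine similarity; the 3-dimensional vectors below are made up purely for illustration and are not real word2vec or GloVe embeddings.

```python
# Word-analogy arithmetic with cosine similarity (toy vectors, illustrative only).
import numpy as np

emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "book":  np.array([0.1, 0.2, 0.1]),   # distractor word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]   # vec(king) - vec(man) + vec(woman)
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)   # "queen" under these toy vectors
```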

Continuous bag-of-words and skip-gram Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov et al., arXiv 2013. Figure: The CBOW architecture predicts the current word based on the context, and the skip-gram predicts surrounding words given the current word.

A very simple CBOW model Assume that c = 1, i.e. the length of the window is 1: predict one target word, given one context word. This is like a bigram model. Credit: Xin Rong. Figure: A very simple CBOW model.

A very simple CBOW model Assume only x_k = 1, i.e. x^T = [0 0 0 ... 1 ... 0 0]. Then h = x^T W is the k-th row of W, W_{k,:}. Let's define u_c = h as the vector representation of the central (input) word. Credit: Xin Rong.
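A numpy sketch of the forward pass of this simple CBOW model, assuming random weight matrices W and W' (named W_prime below) and a toy vocabulary of size V = 6.

```python
# Forward pass of the very simple CBOW model (c = 1), as a numpy sketch.
# V = vocabulary size, N = embedding size; the weights here are random, untrained values.
import numpy as np

V, N = 6, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input -> hidden weights
W_prime = rng.normal(size=(N, V))  # hidden -> output weights

k = 2                              # index of the (one-hot) context word
x = np.zeros(V); x[k] = 1.0

h = x @ W                          # equals W[k, :], the k-th row of W (the word's vector)
scores = h @ W_prime               # one score per vocabulary word
p = np.exp(scores) / np.exp(scores).sum()   # softmax: P(target word | context word)

print(np.allclose(h, W[k]), round(p.sum(), 6))   # True, 1.0
```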

Details (Blackboard)

The skip-gram model and negative sampling From the paper: Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013):

\[
\log \sigma\!\left(v_{w_O}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-v_{w_i}^{\top} v_{w_I}\right)\right]
\]

where k is the number of negative samples and σ(x) = 1/(1 + e^{-x}) is the sigmoid function. The first log term maximizes the probability of the two words (the center word and a real outside word) co-occurring.

The skip-gram model and negative sampling

\[
\log \sigma\!\left(v_{w_O}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-v_{w_i}^{\top} v_{w_I}\right)\right]
\]

Maximize the probability that the real outside word appears, and minimize the probability that random words appear around the center word. P_n(w) = U(w)^{3/4}/Z, the unigram distribution U(w) raised to the 3/4 power (we provide this function in the starter code). The power makes less frequent words be sampled more often.
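A numpy sketch of this objective for a single (center word, outside word) pair; the vectors and the number of negative samples are assumed toy values rather than trained embeddings.

```python
# Skip-gram negative-sampling objective for one (center, outside) pair (sketch).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, k = 5, 3                                   # embedding size, number of negative samples
v_center = rng.normal(size=dim)                 # vector of the center word
u_outside = rng.normal(size=dim)                # vector of the true outside word
u_negs = rng.normal(size=(k, dim))              # vectors of k words drawn from P_n(w) ~ U(w)^{3/4}

objective = (np.log(sigmoid(u_outside @ v_center))            # real pair: push probability up
             + np.log(sigmoid(-u_negs @ v_center)).sum())     # negatives: push probability down
print(objective)   # gradient ascent on this quantity is what trains the embeddings
```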

What to do with the two sets of vectors? We end up with L and L' from all the vectors v and v'. Both capture similar co-occurrence information. It turns out the best solution is to simply sum them up: L_final = L + L'. This is one of many hyperparameters explored in GloVe: Global Vectors for Word Representation (Pennington et al., 2014).

PMI and PPMI The PMI of two words may be negative, which is not interpretable. For instance, if two words w_i, w_j are never involved in a single document, we would have #(w_i, w_j) = 0 and PMI(w_i, w_j) = log(0) = -∞. A convenient way to cope with this difficulty is to replace negative values by zeros, namely Positive PMI: PPMI(w_i, w_j) = max(PMI(w_i, w_j), 0).
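A sketch that computes PMI and PPMI from the toy co-occurrence matrix of the Rumi example; zero counts give PMI = -∞, which the max with 0 removes.

```python
# PMI / PPMI from a co-occurrence count matrix (sketch; C is the toy matrix above).
import numpy as np

C = np.array([[0, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 0]], dtype=float)

total = C.sum()
p_ij = C / total                              # joint probabilities
p_i = p_ij.sum(axis=1, keepdims=True)         # row marginals
p_j = p_ij.sum(axis=0, keepdims=True)         # column marginals

with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / (p_i * p_j))          # -inf wherever the count is zero
ppmi = np.maximum(pmi, 0.0)                   # PPMI(w_i, w_j) = max(PMI(w_i, w_j), 0)

print(ppmi.round(2))
```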

PMI and word2vec Levy and Goldberg (2014) proved the equivalence between word2vec (skip-gram with negative sampling) and a matrix factorization model. Since

\[
\mathbb{E}_{w_n \sim P_n}\!\left[\log \sigma(-v_i^{\top} u_n)\right] = \sum_{n} \frac{\#(w_n)}{T} \log \sigma(-v_i^{\top} u_n),
\]

the objective for a specific word pair (w_i, w_j) can be rewritten as

\[
J(w_i, w_j) = \#(w_i, w_j)\, \log \sigma(v_i^{\top} u_j) + k\, \#(w_i)\, \frac{\#(w_j)}{T}\, \log \sigma(-v_i^{\top} u_j).
\]

PMI and word2vec Let x = v_i^{\top} u_j and optimize with respect to x. Setting ∂J(w_i, w_j)/∂x = 0 gives an equation that is quadratic in e^x:

\[
e^{2x} - \left(\frac{\#(w_i, w_j)\, T}{k\, \#(w_i)\, \#(w_j)} - 1\right) e^{x} - \frac{\#(w_i, w_j)\, T}{k\, \#(w_i)\, \#(w_j)} = 0. \tag{1}
\]

PMI and word2vec Equation (1) has two solutions in e^x. One, e^x = -1, is not valid since x = v_i^{\top} u_j is real; the other is

\[
e^{x} = \frac{\#(w_i, w_j)\, T}{\#(w_i)\, \#(w_j)} \cdot \frac{1}{k},
\qquad
v_i^{\top} u_j = \log\!\left(\frac{\#(w_i, w_j)\, T}{\#(w_i)\, \#(w_j)}\right) - \log k.
\]

Skip-gram is therefore equivalent to a matrix factorization model with the matrix entry being \(\log\!\left(\frac{\#(w_i, w_j)\, T}{\#(w_i)\, \#(w_j)}\right) - \log k\), i.e. PMI(w_i, w_j) - log k.
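For completeness, a worked version of the derivation above (a sketch following Levy and Goldberg's argument, with the shorthand a = #(w_i, w_j) and b = k #(w_i) #(w_j)/T):

\[
\text{Let } x = v_i^{\top} u_j,\qquad
J = a \log\sigma(x) + b \log\sigma(-x),\qquad
\frac{\partial J}{\partial x} = a\,\sigma(-x) - b\,\sigma(x) = 0 .
\]
Rearranging gives a quadratic in \(e^{x}\):
\[
e^{2x} - \Bigl(\frac{a}{b} - 1\Bigr) e^{x} - \frac{a}{b}
= \bigl(e^{x} + 1\bigr)\Bigl(e^{x} - \frac{a}{b}\Bigr) = 0 ,
\]
so \(e^{x} = -1\) (invalid) or
\[
v_i^{\top} u_j = \log\frac{a}{b}
= \log\!\Bigl(\frac{\#(w_i, w_j)\,T}{\#(w_i)\,\#(w_j)}\Bigr) - \log k
= \mathrm{PMI}(w_i, w_j) - \log k .
\]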