CME323 Distributed Algorithms and Optimization
GloVe on Spark
Alex Adamson (SUNet ID: aadamson)
June 6, 2016


Introduction

Pennington et al. propose a novel word representation algorithm called GloVe (Global Vectors for Word Representation) that synthesizes the two primary model families for learning word vectors: matrix factorization methods over term-document matrices such as LSA (Deerwester et al., 1990) and context-window methods such as word2vec (Mikolov et al., 2014). The goal of GloVe is to embed representations of the words in a corpus into a continuous vector space in such a manner that parallel semantic relationships between words are modelled by vector offsets. In other words, we desire that the produced vector space encodes equivalent semantic relationships via linear offsets:

$$W_{queen} - W_{king} \approx W_{wife} - W_{husband}, \qquad W_{dog} - W_{puppy} \approx W_{cat} - W_{kitten}$$

Similarly, we desire that the produced vector space encodes equivalent functional relationships via linear offsets:

$$W_{king} - W_{kings} \approx W_{queen} - W_{queens}, \qquad W_{dog} - W_{dogs} \approx W_{cat} - W_{cats}$$

GloVe begins by forming a word-word co-occurrence matrix $X$, where $X_{ij}$ is the distance-weighted co-occurrence count of word $i$ and word $j$ within a context window of size $n$. By distance-weighted we mean that if, in a context window, word $i$ is found $m$ words from word $j$, that co-occurrence contributes $\frac{n - m + 1}{n}$ to $X_{ij}$. To recover continuous vector space embeddings for the words, we seek to factor $\log X$ into $W^\top \tilde{W} + b + \tilde{b}$, where $W \in \mathbb{R}^{n \times V}$ is taken as the final word embeddings, $\tilde{W} \in \mathbb{R}^{n \times V}$ holds the context word embeddings, and $b$ and $\tilde{b}$ are bias terms meant to account for co-occurrences caused by the overall frequency of a word in the corpus rather than by any relationship between the words.
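To make the distance weighting concrete, the following is a minimal Spark sketch (in Scala, not the report's actual code) of how such a co-occurrence matrix could be accumulated. The names `corpus`, `windowSize`, and the assumption that sentences arrive as arrays of frequency-sorted integer word ids are illustrative choices, not part of the original implementation.

```scala
import org.apache.spark.rdd.RDD

object Cooccurrence {
  // Distance-weighted co-occurrence counting: a co-occurrence at distance m within a
  // window of size n contributes (n - m + 1) / n to X_ij, as described above.
  // Only the forward direction of each pair is emitted here, for brevity.
  def build(corpus: RDD[Array[Int]], windowSize: Int): RDD[((Int, Int), Double)] =
    corpus.flatMap { sentence =>
      for {
        i <- sentence.indices
        j <- (i + 1) until math.min(i + windowSize + 1, sentence.length)
        m = j - i // distance between the two word positions
      } yield ((sentence(i), sentence(j)), (windowSize - m + 1).toDouble / windowSize)
    }.reduceByKey(_ + _) // accumulate X_ij as ((i, j), weight) pairs
}
```

The resulting ((i, j), X_ij) pairs could then be grouped into the block matrix representation discussed in the following sections.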

Approach

We approach the problem by casting the factorization objective above into a weighted least squares problem:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2 \qquad (1)$$

where $f$ is a weighting function defined by

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

and intended to prevent very frequent words from dominating the descent direction. Instead of solving via alternating least squares, we can take advantage of the fact that the co-occurrence matrix $X$ is likely to be very sparse, since even in large corpora word frequencies roughly follow a power law, and in turn $f(X)$ is likely to be very sparse. We instead solve either by stochastic descent, where at each iteration we update the parameters based on the gradient observed for a particular word-word co-occurrence (as in Pennington et al.), or by full-batch methods (as we seek to develop here). This approach allows us to update all parameters in a single iteration rather than holding all but one block of parameters constant for the sake of casting each iteration as a valid quadratic program.

Pennington et al. train the log-bilinear model using Hogwild descent via AdaGrad: each thread is assigned some subset of the word pairs that have a positive number of co-occurrences and, in parallel and without locking, performs the updates to the descent direction and to the word matrices and biases, which are stored in parameters shared across all threads. In practice, this method has been shown to be numerically stable and robust to choices of model parameters (the learning rate $\eta$, $x_{\max}$, $\alpha$, and the size of the word representations, $n$). Hence, there is little reason to prefer full-batch methods over the Hogwild implementation other than possible speedup.

Distributing GloVe

We seek to distribute the training portion of the GloVe algorithm (i.e. fitting the log-bilinear model given a co-occurrence matrix) using the Spark framework. Our main goal is to find a method that exploits the sparsity of the co-occurrence matrix to avoid performing full matrix-matrix multiplications while still performing full-batch updates. Spark's data model makes it non-trivial to implement any sort of Hogwild update, because sharing parameters stored in distributed matrices or other data structures requires synchronization and hence removes the advantages of Hogwild updates, so we instead seek a way to compute and apply the updates using operations on the full parameter matrices.

Before we begin the training iterations, we calculate and cache $f(X)$ in block matrix form and do the same for $\log X$. Both operations take $O((V/p)^2)$ time, where $p$ is the number of partitions, and require no communication (since we have already partitioned $X$ and the operation here is just a map over block values).
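As a hypothetical illustration of this caching step, the sketch below precomputes f(X) and log X from a block-partitioned representation of X using only mapValues, so the partitioning of X is preserved and no shuffle occurs. The block representation (flattened Array[Double] values keyed by (row block index, column block index)) and all names here are assumptions for illustration, not the report's actual code; entries inside a stored block are assumed positive so that the logarithm is defined.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object Precompute {
  type BlockId = (Int, Int) // (row block index, column block index)

  // GloVe weighting function f from Eq. (1).
  def f(x: Double, xMax: Double, alpha: Double): Double =
    if (x < xMax) math.pow(x / xMax, alpha) else 1.0

  // Compute and cache f(X) and log X block by block; completely empty blocks are
  // simply absent from the input RDD and therefore from both outputs.
  def precompute(xBlocks: RDD[(BlockId, Array[Double])], xMax: Double, alpha: Double)
      : (RDD[(BlockId, Array[Double])], RDD[(BlockId, Array[Double])]) = {
    val fOfX = xBlocks
      .mapValues(_.map(f(_, xMax, alpha))) // map-values only: no shuffle
      .persist(StorageLevel.MEMORY_AND_DISK)
    val logX = xBlocks
      .mapValues(_.map(math.log))          // same partitioning as X
      .persist(StorageLevel.MEMORY_AND_DISK)
    (fOfX, logX)
  }
}
```

The blocks could equally be stored as Breeze or MLlib local matrices; only the fact that both transforms are mapValues matters for the no-communication claim above.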

We now explore how to efficiently calculate and apply the full-batch updates. First, note that we can find the gradient with respect to the factorization residual $W^\top \tilde{W} + B + \tilde{B} - \log X$ as follows:

$$\frac{\partial J}{\partial (W^\top \tilde{W} + B + \tilde{B} - \log X)} = f(X) \circ (W^\top \tilde{W} + B + \tilde{B} - \log X) \qquad (2)$$

(absorbing the constant factor of 2 into the learning rate). Having calculated this, we then update $W$ using gradient descent as

$$W \leftarrow W - \eta\, \frac{\partial J}{\partial (W^\top \tilde{W} + B + \tilde{B} - \log X)} \cdot \frac{\partial (W^\top \tilde{W} + B + \tilde{B} - \log X)}{\partial W} = W - \eta \left( f(X) \circ (W^\top \tilde{W} + B + \tilde{B} - \log X) \right) \tilde{W}$$

and update $\tilde{W}$, $b$, and $\tilde{b}$ analogously. The matrix-matrix operations present here are the elementwise multiply between $f(X)$ and the factorization residual, the elementwise additions between $W^\top \tilde{W}$, $B$, $\tilde{B}$, and $\log X$, and the matrix-matrix multiplies: $W^\top \tilde{W}$, and the product of the partial cost gradient with $\tilde{W}$ (and, for the analogous updates, with $W$).

We would like to minimize both the communication and the computation required during the matrix-matrix multiplies. Here we exploit the sparsity of $f(X)$. We model $f(X)$ as a block matrix and partition it by storing each block on a machine along with its neighboring blocks. Note that before building $X$, we sorted the words in the corpus such that word $i$ is the $i$th most frequent word in the corpus, so we expect the bottom right corner of $X$ (the co-occurrence counts between the rarest words) to be relatively sparse. With a sufficiently small block size, we will avoid having to calculate the local matrix multiplies for several blocks.

We partition $(W^\top \tilde{W} + B + \tilde{B} - \log X)$ identically to $f(X)$ and lazily evaluate it after we apply the elementwise product with $f(X)$. We implement the elementwise product by performing an inner join over the blocks of the two matrices (represented as ((row block index, column block index), sub-matrix) pairs) and performing a local elementwise product on the sub-matrices. Since the matrices are partitioned identically, no network communication is required, and since we do not explicitly represent empty entries, we preserve the block-level sparsity pattern in the result. Because we lazily evaluate the local matrices that compose the blocks of $(W^\top \tilde{W} + B + \tilde{B} - \log X)$, and in particular lazily evaluate the blocks of $W^\top \tilde{W}$, we avoid having to actually calculate the local matrix multiplies, or communicate the contents of the block matrices over the network, for any multiply involving a completely sparse block.

Let $b_c$ and $b_r$ be the number of row and column blocks in the left matrix respectively (so $b_r$ and $b_c$ are the number of row and column blocks in the right matrix), and assume both matrices are partitioned into a grid where each block of the grid is on the same machine. Assume that the left matrix is in $\mathbb{R}^{V \times n}$ and the right matrix is in $\mathbb{R}^{n \times V}$. Before accounting for the sparsity of $f(X)$, the shuffle size (in matrix elements) for the matrix multiply is

$$\frac{Vn}{b_c}\left(\frac{b_c}{p} - 1\right) + \frac{Vn}{b_r}\left(\frac{b_r}{p} - 1\right)$$

where $p$ is the number of machines across which the blocks are partitioned. Suppose $s$ is the number of sparse blocks in the left matrix (distributed randomly across the matrix). Then, in expectation, the shuffle size for the matrix multiply decreases to

$$\left(1 - \frac{s}{b_c b_r}\right)\left(\frac{Vn}{b_c}\left(\frac{b_c}{p} - 1\right) + \frac{Vn}{b_r}\left(\frac{b_r}{p} - 1\right)\right)$$
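The block-sparse elementwise product described above might look something like the following sketch, under the same illustrative assumptions as before: blocks are keyed by (row block index, column block index) and stored as flattened Array[Double]s, `fOfX` holds only the nonzero blocks of f(X), `residual` holds the blocks of the factorization residual, and both RDDs share a partitioner so the inner join is shuffle-free. None of these names come from the report's actual code.

```scala
import org.apache.spark.rdd.RDD

object WeightedResidual {
  type BlockId = (Int, Int) // (row block index, column block index)

  // Elementwise product f(X) o (W^T W~ + B + B~ - log X), computed block by block.
  // Only blocks present in both inputs survive the inner join, so completely sparse
  // blocks of f(X) are never computed, joined, or shuffled, which is what preserves
  // the block-level sparsity pattern in the result.
  def apply(fOfX: RDD[(BlockId, Array[Double])],
            residual: RDD[(BlockId, Array[Double])]): RDD[(BlockId, Array[Double])] =
    fOfX.join(residual).mapValues { case (fBlock, rBlock) =>
      // Local elementwise product on two co-located, equally sized sub-matrix blocks.
      fBlock.zip(rBlock).map { case (a, b) => a * b }
    }
}
```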

Results

V       n    Time         Cost
4000    25   2m2.938s     0.0615
8000    25   2m12.139s    0.0291
16000   25   5m27.746s    0.0481
4000    50   2m16.826s    0.0546
8000    50   4m55.088s    0.0477
16000   50   6m51.855s    0.0481

Table 1: Results for the Hogwild implementation

V       n    Time         Cost
4000    25   2m30.405s    0.00444
8000    25   3m49.019s    0.00213
16000   25   8m11.923s    0.00178
4000    50   2m40.416s    0.00173
8000    50   4m26.961s    0.00173
16000   50   10m16.377s   0.00173

Table 2: Results for the Spark implementation using a block size of 128 rows/cols per block

Our Spark implementation scales significantly worse with the size of the vocabulary than the implementation provided by the GloVe authors. This makes sense: our implementation depends on leveraging the sparsity of $f(X)$ by avoiding communication and computation involving completely empty blocks, but for all but the blocks involving the rarest words it is unlikely that a block is completely sparse, and we fail to avoid the computation and shuffle for a block if even one of its entries is nonzero. Even with a block size of 128, if we are using the 8000 most common words of a corpus of over a million sentences and over 72000 distinct words with a window size of 15 (as we are doing here), it is staggeringly unlikely that no pair of words among the 128 least common of the 8000 ever co-occurs. For instance, in the case where we use 16000 words, 15623 of the 15625 possible blocks contain a nonzero value. Additionally, as we lower the block size, the size of the join during the matrix multiplication (i.e. the number of blocks output by the join) increases cubically (although the number of local matrices sent over the network remains the same), so there is a trade-off between the lower communication of raw matrices afforded by a smaller block size and the increased number of local matrix routines that need to be called (i.e. the number of records output by the join/cogroup in the matrix multiply).

The Spark implementation scales substantially better with the size of the word representations than the Hogwild implementation. This is likely because the word representation size is smaller than the block size we use (128), so increasing the size of the word representation does not lead to the shuffling of more elements or to more calls to BLAS routines (although it does increase the cost of the sparse matrix multiplies in BLAS). We use BLAS via netlib-java when evaluating the product of blocks, whereas the Hogwild implementation computes the analogous result ($w_i^\top \tilde{w}_j$ for each co-occurring pair) via a basic dot-product implementation rather than SIMD or some other optimization.

Finally, note that our algorithm appears to converge much more quickly than the Hogwild implementation, so in some sense, while the runtime of an iteration of our algorithm does not scale as well as an iteration of the Hogwild implementation, we do more work towards the objective per iteration (and per unit time) than the Hogwild implementation.[1]

To illustrate how the algorithm scales with sparsity, we run the algorithm after zeroing all entries in the co-occurrence matrix that fall below a threshold.

x_min    % sparse blocks    Time         Cost
0.01     0.961              2m41.287s    1.126E-16
0.025    0.677              1m12.681s    8.11E-17
0.05     0.305              40.601s      5.76E-17

Table 3: Results for the Spark implementation using a block size of 128 rows/cols per block, 4000 words, and 25-dimensional vectors, varying the minimum threshold for clipping the co-occurrence matrix entries. Note that these costs are not comparable to those in the tables above, since the clipping in general should make the regression easier (there are fewer terms in the summation).

The results as we increase the block sparsity of X are as expected: even for modest increases in the block sparsity, we see significant speedups. We expect that for larger vocabularies than we are using here (i.e. the sort of vocabularies that would be of actual interest in NLP tasks), the co-occurrence matrix would be far sparser at the block level and these savings would be correspondingly larger.

Future Work

Given more time, we would have liked to compare the scalability of the two implementations on much larger vocabularies (relative to the size of the corpus, since that ratio is the primary factor in the sparsity of the co-occurrence matrix), since our expectation is that the Spark implementation's scaling relative to the vocabulary size would dovetail with that of the Hogwild implementation as the number of sparse entries in sparse blocks approaches the total number of sparse entries.

References

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. word2vec, 2014.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In EMNLP, volume 14, pages 1532-1543, 2014.

[1] We have not performed an exhaustive grid search, and it is possible that with the right learning rates the iterations to convergence of the two algorithms are closer.