Semantics with Dense Vectors. Dorota Glowacka

1 Semantics with Dense Vectors Dorota Glowacka

2 Previous lectures
- how to represent a word as a sparse vector with dimensions corresponding to the words in the vocabulary
- the values in the vector were a function of the count of the word co-occurring with each neighbouring word
- each word is thus represented with a vector that is long (with vocabularies of 20,000 to 50,000 words) and sparse (with most elements of the vector for each word equal to zero)

3 Today's Lecture
How to represent a word with vectors that are short (with length of 50-1,000) and dense (most values are non-zero).
Why short vectors?
- easier to include as features in machine learning systems
- because they contain fewer parameters, they generalize better and are less prone to overfitting
- dense vectors can be better at capturing synonymy than sparse vectors, in which synonyms such as car and automobile are distinct, unrelated dimensions

4 Singular Value Decomposition (SVD)
- SVD is a method for finding the most important dimensions of a dataset, i.e. those along which the data varies the most.
- It can be applied to any rectangular matrix.
- SVD belongs to a family of methods that can approximate an N-dimensional dataset using fewer dimensions, such as Principal Component Analysis (PCA) or Factor Analysis.
- It was first applied in Latent Semantic Analysis (LSA) to the task of generating embeddings from term-document matrices.

5 Singular Value Decomposition (SVD)
- Dimensionality reduction methods first rotate the axes of the original dataset into a new space.
- The new space is chosen so that the highest-order dimension captures the most variance in the original dataset, the next dimension captures the next most variance, and so on.
- While some information about the relationship between the original points is necessarily lost in the new transformation, the remaining dimensions preserve as much as possible of the original setting.

7 Latent Semantic Analysis (LSA)
LSA is a particular application of SVD to a V × c term-document matrix X representing V words and their co-occurrence with c documents. SVD factorizes X into the product of three matrices:
1. The V × m matrix W, where each row represents a word and each column represents one of the m dimensions of a latent space. The m column vectors are orthogonal to each other and are ordered by the amount of variance each accounts for in the original dataset. Here m is the rank of X (the number of linearly independent rows).

8 Latent Semantic Analysis (LSA)
2. Σ, a diagonal m × m matrix with the singular values along the diagonal, expressing the importance of each dimension.
3. The m × c matrix C, where each row represents one of the latent dimensions and the m row vectors are orthogonal to each other.
By using only the first k dimensions of W, Σ and C, the product of these three matrices becomes a least-squares approximation to the original X. Since the first dimensions encode the most variance, the truncated SVD models the most important information in the original X.

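9
$$
\underset{V \times c}{X} \;=\; \underset{V \times m}{W} \; \underset{m \times m}{\begin{bmatrix} \sigma_1 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_m \end{bmatrix}} \; \underset{m \times c}{C}
$$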

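10 Taking only the top k ≤ m dimensions after SVD is applied to the co-occurrence matrix X:
$$
\underset{V \times c}{X} \;\approx\; \underset{V \times k}{W_k} \; \underset{k \times k}{\begin{bmatrix} \sigma_1 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_k \end{bmatrix}} \; \underset{k \times c}{C_k}
$$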

11 SVD and LSA
- Using only the top k dimensions leads to a reduced W matrix, with one k-dimensional row per word.
- This row acts as a dense k-dimensional vector (embedding) representing that word.
- LSA embeddings generally set k = 300.
- LSA applies a particular weighting to each co-occurrence cell that multiplies two weights: local and global.
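The truncation described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a made-up 5 × 4 count matrix and k = 2 (real LSA uses weighted counts and k around 300); the variable names are illustrative only. Note that `np.linalg.svd` with `full_matrices=False` returns min(V, c) dimensions, which can include zero singular values when the rank of X is smaller.

```python
import numpy as np

# Toy term-document count matrix X (made-up values): V = 5 words, c = 4 documents.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

# Thin SVD: X = W @ diag(sigma) @ C, with singular values in descending order.
W, sigma, C = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions kept (toy value)
W_k, sigma_k, C_k = W[:, :k], sigma[:k], C[:k, :]

# Rank-k least-squares approximation of X.
X_k = W_k @ np.diag(sigma_k) @ C_k

# Each row of W_k is the dense k-dimensional embedding of one word.
embeddings = W_k
print(embeddings.shape)
```

The approximation error in the Frobenius norm equals the square root of the sum of the squared singular values that were dropped, which is why keeping the largest ones loses the least information.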

12 LSA term weighting
The local weight of each term i in document j is its log frequency:
$$ \log f(i, j) + 1 $$
The global weight of term i is a version of its entropy:
$$ 1 + \frac{\sum_j p(i, j) \log p(i, j)}{\log D} $$
where D is the number of documents.
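A possible implementation of this weighting is sketched below. Two points are assumptions rather than something the slide states: the local weight is read as log(f(i, j) + 1), so that zero counts map to zero, and p(i, j) is taken to be f(i, j) divided by the total count of term i over all documents. The function name `log_entropy_weight` is hypothetical.

```python
import numpy as np

def log_entropy_weight(F):
    """Weight a term-document count matrix F (terms x documents, D > 1)."""
    F = np.asarray(F, dtype=float)
    D = F.shape[1]                                   # number of documents
    local = np.log(F + 1.0)                          # assumed reading: log(f(i, j) + 1)
    row_totals = F.sum(axis=1, keepdims=True)
    # Assumed definition: p(i, j) = f(i, j) / sum_j f(i, j).
    P = np.divide(F, row_totals, out=np.zeros_like(F), where=row_totals > 0)
    # Treat 0 * log 0 as 0.
    plogp = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    global_weight = 1.0 + plogp.sum(axis=1) / np.log(D)
    return local * global_weight[:, None]

# Term 0 is spread evenly over all documents, term 1 occurs in one document only.
weighted = log_entropy_weight([[1, 1, 1, 1], [4, 0, 0, 0]])
print(weighted)
```

A term spread evenly over all documents gets global weight 0 (it does not discriminate between documents), while a term concentrated in a single document gets global weight 1.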

13 SVD and word-context
- In LSA, SVD is applied to the term-document matrix.
- An alternative is to apply SVD to the word-word or word-context matrix, in which the context dimensions are words (rather than documents as in LSA).
- This relies on a PPMI-weighted word-word matrix.
- Only the top dimensions are used (truncated SVD).
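As a rough sketch of this pipeline, the code below computes PPMI weights for a small word-word count matrix and keeps the top k left singular vectors as embeddings. The counts are made up, and the helper names `ppmi` and `svd_embeddings` are hypothetical; keeping only the truncated left matrix (without scaling by the singular values) is one common choice, assumed here.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information for a word-by-context count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    expected = p_w * p_c
    # Cells with a zero count get ratio 1, so their PMI (and PPMI) is 0.
    ratio = np.divide(p_wc, expected, out=np.ones_like(p_wc), where=p_wc > 0)
    return np.maximum(np.log2(ratio), 0.0)

def svd_embeddings(counts, k):
    """Truncated SVD of the PPMI matrix; rows are k-dimensional word embeddings."""
    U, s, Vt = np.linalg.svd(ppmi(counts), full_matrices=False)
    return U[:, :k]

# Made-up symmetric word-word co-occurrence counts for a 4-word vocabulary.
toy_counts = np.array([
    [0, 4, 1, 0],
    [4, 0, 0, 1],
    [1, 0, 0, 5],
    [0, 1, 5, 0],
])
emb = svd_embeddings(toy_counts, k=2)
print(emb.shape)
```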

15 Skip-gram and CBOW
- Methods for generating dense embeddings inspired by neural network models.
- Neural network language models are given a word and predict its context; this process can be used to learn word embeddings.
- The intuition is that words with similar meanings tend to occur near each other in text.
- The process for learning these embeddings has a strong relationship with SVD factorization and dot-product similarity metrics.

16 Skip-gram Model
- Learns two separate embeddings for each word w: a word embedding v and a context embedding c.
- The embeddings are encoded in two matrices: the word matrix W and the context matrix C.
- Each row i of the word matrix W is the 1 × d vector embedding v_i for word i in vocabulary V.
- Each column i of the context matrix C is the d × 1 vector embedding c_i for word i in vocabulary V.

17 Prediction with Skip-grams
- The skip-gram model predicts each neighbouring word in a context window of L words on either side of the current word, e.g. for a context window L = 2 the context is $[w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}]$.
- We want to predict each of the context words w_k from the target word w_j.
- Skip-gram calculates the probability p(w_k | w_j) by computing the dot product between the context vector c_k of word w_k and the target vector v_j of word w_j.
- The higher the dot product between two vectors, the more similar they are.

18 Prediction with Skip-grams

19 Prediction with Skip-grams
- The dot product c_k · v_j is a number ranging from −∞ to +∞.
- We use the softmax function to normalize the dot products into probabilities:
$$ p(w_k \mid w_j) = \frac{\exp(c_k \cdot v_j)}{\sum_{i \in V} \exp(c_i \cdot v_j)} $$
- Computing the denominator requires computing the dot product between every word in V and the target word w_j, which may take a long time.
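The softmax above might be computed as follows. This is a minimal sketch, assuming the context matrix C is stored as d × V with one column per context embedding (as on the skip-gram model slide); the function name `context_probabilities` is hypothetical. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def context_probabilities(v_j, C):
    """Return p(w_k | w_j) for every k, given target vector v_j (d,) and context matrix C (d, V)."""
    scores = v_j @ C                   # dot products c_k . v_j, shape (V,)
    scores = scores - scores.max()     # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # the sum runs over the whole vocabulary

# Toy example: d = 2, V = 2, so the scores are 0 and log 3.
v = np.array([1.0, 0.0])
C_toy = np.array([[0.0, np.log(3.0)],
                  [5.0, 5.0]])
probs = context_probabilities(v, C_toy)
print(probs)
```

The denominator touches every column of C, which is the cost that negative sampling is designed to avoid.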

20 Skip-gram with negative sampling
- Faster than using the softmax function.
- In the training phase, for each target word the algorithm chooses the surrounding context words as positive examples.
- For each positive example, the algorithm samples k noise examples, or negative examples, from non-neighbour words according to their weighted unigram probability.
- The goal is to move the embeddings towards the neighbour words and away from the noise words.

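21 Skip-gram with negative sampling
Example: lemon, a [tablespoon of apricot preserves or] jam
Here the target word is w = apricot and the context words are c1 = tablespoon, c2 = of, c3 = preserves, c4 = or.
Goal: learn an embedding whose dot product with each context word is high.
We select 2 noise words for each of the context words: [cement physical dear coaxial apricot attendant hence forever puddle]
with n1 = cement, n2 = physical, n3 = dear, n4 = coaxial, w = apricot, n5 = attendant, n6 = hence, n7 = forever, n8 = puddle.
We want the noise words n to have a low dot product with the target embedding w.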
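Drawing the noise words could look roughly like the sketch below. The corpus counts are made up, the function name `sample_noise_words` is hypothetical, and the exponent alpha = 0.75 is the weighting used in the original word2vec implementation rather than something stated on the slides.

```python
import numpy as np

def sample_noise_words(counts, neighbours, k, rng, alpha=0.75):
    """Draw k noise words from the non-neighbour words.

    counts: dict mapping word -> corpus count (toy values here)
    neighbours: words that must not be sampled (the target and its context words)
    alpha: exponent applied to the unigram counts (assumed, as in word2vec)
    """
    words = [w for w in counts if w not in neighbours]
    weights = np.array([counts[w] for w in words], dtype=float) ** alpha
    probs = weights / weights.sum()
    return [str(w) for w in rng.choice(words, size=k, replace=True, p=probs)]

# Made-up counts for the words in the example.
toy_counts = {
    "tablespoon": 20, "of": 900, "apricot": 5, "preserves": 8, "or": 400,
    "cement": 30, "physical": 60, "dear": 45, "coaxial": 3,
    "attendant": 12, "hence": 25, "forever": 40, "puddle": 6,
}
neighbours = {"tablespoon", "of", "apricot", "preserves", "or"}

rng = np.random.default_rng(0)
# 4 context words x 2 noise words each = 8 noise words.
noise = sample_noise_words(toy_counts, neighbours, k=8, rng=rng)
print(noise)
```

Raising the counts to a power below 1 flattens the distribution, so rare words are sampled somewhat more often than their raw frequency would suggest.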

22 Skip-gram with negative sampling
More formally, the learning objective is:
$$ \log \sigma(c \cdot w) + \sum_{i=1}^{k} \mathbb{E}_{n_i \sim p(w)} \left[ \log \sigma(-n_i \cdot w) \right] $$
where σ is the sigmoid function, applied to the dot product. Learning starts with randomly initialized W and C matrices and then walks through the training corpus, updating the embeddings to maximize the objective function.
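For a single (target, context) pair, the objective can be evaluated as in the sketch below, where the expectation is replaced by k noise vectors that have already been sampled (an assumption of this example, and the usual practice). The function names are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # log(1 / (1 + exp(-x))), computed stably as -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def sgns_objective(w, c, noise):
    """log sigma(c . w) + sum_i log sigma(-n_i . w) for one training pair.

    w: target embedding, shape (d,)
    c: context embedding, shape (d,)
    noise: k sampled noise embeddings, shape (k, d)
    """
    return log_sigmoid(c @ w) + np.sum(log_sigmoid(-(noise @ w)))

w = np.array([1.0, 0.0])

# Uninformative embeddings: every dot product is 0, so each term is log(1/2).
baseline = sgns_objective(w, np.zeros(2), np.zeros((2, 2)))

# Context aligned with the target, noise pointing away from it.
better = sgns_objective(w, 3.0 * w, -3.0 * np.tile(w, (2, 1)))
print(baseline, better)
```

The objective rises when the context vector has a high dot product with the target and the noise vectors have a low one, which is the behaviour the training procedure pushes towards.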

24 Skip-gram as neural network
The input vector x for word w_j is a one-hot vector (one element equals 1 and all the others equal 0). The probability of each output word is predicted in 3 steps:
1. Select the embedding from W: x is multiplied by W to give the hidden (projection) layer.
2. Compute the dot products c_k · v_j: the projection vector is multiplied by the context matrix C. This produces a 1 × V dimensional output vector with a score for each word in V.
3. Normalize the dot products into probabilities:
$$ p(w_k \mid w_j) = y_k = \frac{\exp(c_k \cdot v_j)}{\sum_{i \in V} \exp(c_i \cdot v_j)} $$
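The three steps can be traced with random toy matrices, as in the sketch below. The sizes V = 6 and d = 3, the word index j, and the random initialization are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                       # toy vocabulary size and embedding size
W = rng.normal(size=(V, d))       # word matrix: row i is v_i
C = rng.normal(size=(d, V))       # context matrix: column i is c_i

j = 2                             # index of the input word w_j (arbitrary)
x = np.zeros(V)
x[j] = 1.0                        # one-hot input vector

h = x @ W                         # step 1: projection layer, equal to row j of W
scores = h @ C                    # step 2: 1 x V vector of dot products c_k . v_j
y = np.exp(scores - scores.max())
y = y / y.sum()                   # step 3: softmax gives p(w_k | w_j) for every k
print(y)
```

Multiplying by a one-hot vector simply selects a row of W, so practical implementations index the row directly instead of performing the multiplication.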

25 Properties of embeddings
Examples of the closest tokens to some target words using a phrase-based extension of the skip-gram algorithm (Mikolov et al. 2013):
- Redmond: Redmond Wash, Redmond Washington, Microsoft
- Havel: Vaclav Havel, President Vaclav Havel, Velvet Revolution
- ninjutsu: ninja, martial arts, swordsmanship
- graffiti: spray paint, graffiti, taggers
- capitulate: capitulation, capitulated, capitulating

26 Properties of Embeddings
- Offsets between embeddings can capture relations between words, e.g. vector(king) − vector(man) + vector(woman) is close to vector(queen).
- Offsets can also capture grammatical number.
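The offset arithmetic can be illustrated with hand-built toy vectors. These are not learned embeddings: the four dimensions are invented so that the relation holds exactly, whereas with real embeddings the result is only approximately close to the expected word. The helper names are hypothetical.

```python
import numpy as np

# Hand-built toy vectors (assumed dimensions: royalty, male, female, fruit).
vectors = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest(query, vectors, exclude=()):
    """Word whose vector has the highest cosine similarity to the query."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

offset = vectors["king"] - vectors["man"] + vectors["woman"]
# The input words are excluded from the search, as is usual in analogy tests.
answer = closest(offset, vectors, exclude={"king", "man", "woman"})
print(answer)
```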

27 Brown clustering
- A method of grouping words into clusters based on their relationship with preceding and following words.
- Brown clusters can be used to create bit vectors for a word that can function as a syntactic representation.
- The algorithm makes use of a class-based language model, where each word w belongs to a class c with probability P(w | c).
- Probabilities are assigned to a pair of words by modelling the transition between classes rather than between words.
- A class-based LM can assign a probability to an entire corpus given a particular clustering C:
$$ P(\text{corpus} \mid C) = \prod_{i=1}^{n} P(c_i \mid c_{i-1}) \, P(w_i \mid c_i) $$
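The corpus probability under a given clustering can be computed in log space as sketched below. Two assumptions are made that the slide does not specify: the probabilities are maximum-likelihood estimates taken from the same corpus, and the class preceding the first word is a start symbol `<s>`. The function name is hypothetical.

```python
import math
from collections import Counter

def class_lm_log_likelihood(corpus, cluster_of):
    """log P(corpus | C) = sum_i [log P(c_i | c_{i-1}) + log P(w_i | c_i)]."""
    classes = [cluster_of[w] for w in corpus]
    prev_classes = ["<s>"] + classes[:-1]          # c_0 is an assumed start symbol
    word_count = Counter(corpus)
    class_count = Counter(classes)
    transition_count = Counter(zip(prev_classes, classes))
    prev_count = Counter(prev_classes)

    log_p = 0.0
    for w, c_prev, c in zip(corpus, prev_classes, classes):
        log_p += math.log(transition_count[(c_prev, c)] / prev_count[c_prev])  # P(c_i | c_{i-1})
        log_p += math.log(word_count[w] / class_count[c])                      # P(w_i | c_i)
    return log_p

corpus = "a b a b".split()
one_cluster = class_lm_log_likelihood(corpus, {"a": "X", "b": "X"})
own_clusters = class_lm_log_likelihood(corpus, {"a": "A", "b": "B"})
print(one_cluster, own_clusters)
```

With every word in its own cluster the toy corpus is perfectly predictable, while merging both words into one cluster lowers the likelihood, since each emission now has probability 1/2.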

28 Brown clustering
Brown clustering is a hierarchical algorithm:
1. Each word is initially assigned to its own cluster.
2. We consider merging each pair of clusters. The pair whose merger results in the smallest decrease in the likelihood of the corpus is merged.
3. Clustering proceeds until all words are in one big cluster.
Two words are most likely to be clustered if they have similar probabilities for preceding and following words.
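The three steps above can be sketched as a naive greedy loop. This is only an illustration: it recomputes the corpus likelihood from scratch for every candidate merge, which is far too slow for real vocabularies, and it assumes maximum-likelihood estimates and a start symbol `<s>` for the first transition. The function names and the toy corpus are made up.

```python
import itertools
import math
from collections import Counter

def log_likelihood(corpus, cluster_of):
    """Class-based LM log-likelihood with MLE estimates (assumed start symbol <s>)."""
    classes = [cluster_of[w] for w in corpus]
    prev = ["<s>"] + classes[:-1]
    wc, cc = Counter(corpus), Counter(classes)
    tc, pc = Counter(zip(prev, classes)), Counter(prev)
    return sum(math.log(tc[(p, c)] / pc[p]) + math.log(wc[w] / cc[c])
               for w, p, c in zip(corpus, prev, classes))

def brown_merges(corpus):
    """Return the sequence of merged cluster pairs, from first merge to last."""
    cluster_of = {w: (w,) for w in set(corpus)}     # step 1: one cluster per word
    merges = []
    while len(set(cluster_of.values())) > 1:        # step 3: until one cluster is left
        clusters = sorted(set(cluster_of.values()))
        best = None
        for a, b in itertools.combinations(clusters, 2):   # step 2: try every pair
            trial = {w: (a + b if c in (a, b) else c) for w, c in cluster_of.items()}
            score = log_likelihood(corpus, trial)
            if best is None or score > best[0]:     # smallest decrease = highest likelihood
                best = (score, a, b, trial)
        _, a, b, cluster_of = best
        merges.append((a, b))
    return merges

toy_corpus = "the dog runs the cat runs the dog walks the cat walks".split()
merges = brown_merges(toy_corpus)
for a, b in merges:
    print(a, "+", b)
```

Reading the merge list from last to first gives the binary tree from the root down.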

29 Brown clustering
- By tracing the order in which clusters are merged, the model builds a binary tree whose leaves are words.
- A word can be represented by the binary string that corresponds to its path from the root, with 0 for left and 1 for right.

30 Brown clustering
We can extract useful features by taking binary prefixes of the bit string. For example:
- 01 is the cluster of month names {November, October}
- another prefix picks out the cluster of common nouns for corporate executives {chairman, president}
- 1 is verbs {run, sprint, walk}
- 0 is nouns
The shorter the prefix, the more abstract the cluster.
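Prefix features might be extracted as in the sketch below. The bit strings are hypothetical: they are chosen only to agree with the prefixes stated above (01 for months, 1 for verbs, 0 for nouns) and are not taken from a real clustering. The function names and the prefix lengths are likewise illustrative.

```python
# Hypothetical root-to-leaf paths (0 = left, 1 = right).
paths = {
    "November": "010", "October": "011",
    "chairman": "0010", "president": "0011",
    "run": "100", "sprint": "101", "walk": "11",
}

def prefix_features(word, paths, lengths=(1, 2, 4)):
    """Features named by prefix length; lengths longer than the path are skipped."""
    bits = paths[word]
    return {f"prefix{n}": bits[:n] for n in lengths if n <= len(bits)}

def words_with_prefix(prefix, paths):
    """All words whose path starts with the given prefix, i.e. one cluster."""
    return sorted(w for w, bits in paths.items() if bits.startswith(prefix))

print(prefix_features("chairman", paths))
print(words_with_prefix("01", paths))
print(words_with_prefix("1", paths))
```

Using several prefix lengths at once gives a downstream model both coarse and fine-grained views of the same word.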
