ATASS: Word Embeddings — Lee Gao, April 22, 2016
Guideline
- Bag-of-words (bag-of-n-grams): high-dimensional, sparse representation
- Dimension reductions: LSA, LDA, MNIR
- Today:
  - Neural networks: backpropagation algorithm, convolutional neural networks, recurrent neural networks
  - Word embeddings: continuous bag-of-words, skip-gram
  - Downstream predictions
Neural Networks
(For more information, see A Tutorial on Deep Learning, Quoc V. Le, https://cs.stanford.edu/~quocle)
Example
- Should I watch the movie Gravity? Both Mary and John rated it 3/5.
- Historical ratings (O: I like the movie; X: I do not like the movie).
Decision Function
- Features: $x_1$: Mary's rating, $x_2$: John's rating
- Decision function: $h(x;\theta,b) = g(\theta^T x + b)$, where $g(z) = \frac{1}{1+\exp(-z)}$
- Objective:
$$\min_{\theta,b} \sum_{i=1}^{m} \left[ h(x^{(i)};\theta,b) - y^{(i)} \right]^2$$
Neural network illustration
- Learning: stochastic gradient descent. $\alpha$ is the learning rate; a large $\alpha$ gives aggressive updates, a small $\alpha$ gives conservative updates.
$$\theta_1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1}, \qquad \theta_2 = \theta_2 - \alpha \frac{\partial}{\partial \theta_2}, \qquad b = b - \alpha \frac{\partial}{\partial b}$$
- The partial derivative (at example $i$) for $\theta_1$ is given below; the derivatives for $\theta_2$ and $b$ are similar.
$$\frac{\partial}{\partial \theta_1} = \frac{\partial}{\partial \theta_1}\left( h(x^{(i)};\theta,b) - y^{(i)} \right)^2 = 2\left[ g(\theta^T x^{(i)} + b) - y^{(i)} \right]\left[ 1 - g(\theta^T x^{(i)} + b) \right] g(\theta^T x^{(i)} + b) \, x_1^{(i)}$$
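As an illustration, here is a minimal NumPy sketch of these updates; the ratings and labels are made-up toy data, not from the slides.

```python
import numpy as np

def g(z):
    """Sigmoid activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(theta, b, x_i, y_i, alpha):
    """One stochastic gradient descent update on example (x_i, y_i)."""
    h = g(theta @ x_i + b)                    # prediction h(x; theta, b)
    common = 2.0 * (h - y_i) * h * (1.0 - h)  # 2(h - y) g'(z), shared by all gradients
    theta = theta - alpha * common * x_i      # d/d theta_j carries an extra x_j factor
    b = b - alpha * common                    # d/d b
    return theta, b

# Toy run: x = (Mary's rating, John's rating), y = 1 if "like".
theta, b = np.zeros(2), 0.0
for x_i, y_i in [(np.array([5.0, 4.0]), 1.0), (np.array([1.0, 2.0]), 0.0)]:
    theta, b = sgd_step(theta, b, x_i, y_i, alpha=0.1)
```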
The limitations of linear decision function
- The samples are not linearly separable.
- Problem decomposition: two simpler problems, each of which can be solved using linear models.
Neural network illustration
- Suppose the two decision functions are $h_1(x;(\theta_1,\theta_2),b_1)$ and $h_2(x;(\theta_3,\theta_4),b_2)$.
- Objective:
$$\min_{w,c} \sum_{i=1}^{m} \left[ h\left( \left( h_1(x^{(i)}), h_2(x^{(i)}) \right); w, c \right) - y^{(i)} \right]^2$$
The backpropagation algorithm
- Goal: compute the parameter gradients.
- An implementation of the chain rule specifically designed for neural networks.
- Generalized parameters: $\theta$ for weights, $b$ for biases; layers indexed by $1$ (input), $2, \dots, L$ (output).
- $\theta^{(l)}_{ij}$: weight connecting neuron $i$ in layer $l$ to neuron $j$ in layer $l+1$. $b^{(l)}_i$: bias of neuron $i$ in layer $l$.
- Decision functions:
$$h^{(1)} = x$$
$$h^{(2)} = g\left( (\theta^{(1)})^T h^{(1)} + b^{(1)} \right)$$
$$\vdots$$
$$h^{(L-1)} = g\left( (\theta^{(L-2)})^T h^{(L-2)} + b^{(L-2)} \right)$$
$$h(x) = h^{(L)} = g\left( (\theta^{(L-1)})^T h^{(L-1)} + b^{(L-1)} \right)$$
The backpropagation algorithm
1. Perform a feed-forward pass to compute $h^{(1)}, h^{(2)}, \dots, h^{(L)}$.
2. For the output layer, compute
$$\delta^{(L)} = 2\left( h^{(L)} - y \right) g'\left( \sum_{i=1}^{s_{L-1}} \theta^{(L-1)}_{i1} h^{(L-1)}_i + b^{(L-1)}_1 \right)$$
where $s_l$ is the number of neurons in layer $l$.
3. Perform a backward pass: for $l = L-1, L-2, \dots, 2$ and each node $j$ in layer $l$, compute
$$\delta^{(l)}_j = \left( \sum_{k=1}^{s_{l+1}} \theta^{(l)}_{jk} \delta^{(l+1)}_k \right) g'\left( \sum_{i=1}^{s_{l-1}} \theta^{(l-1)}_{ij} h^{(l-1)}_i + b^{(l-1)}_j \right)$$
4. The desired partial derivatives can be computed as
$$\frac{\partial}{\partial \theta^{(l)}_{ij}} = h^{(l)}_i \delta^{(l+1)}_j, \qquad \frac{\partial}{\partial b^{(l)}_i} = \delta^{(l+1)}_i$$
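A minimal NumPy sketch of steps 1-4, assuming sigmoid activations and the squared loss above; the index conventions follow the slides (0-indexed in code), and all names are illustrative.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(thetas, biases, x, y):
    """Gradients of (h(x) - y)^2 for a sigmoid multilayer network.

    thetas[l] has shape (s_l, s_{l+1}): thetas[l][i, j] connects neuron i
    in layer l to neuron j in layer l+1, as in the slides.
    """
    # 1. Feed-forward pass: store activations h^(1), ..., h^(L).
    hs = [x]
    for W, b in zip(thetas, biases):
        hs.append(g(hs[-1] @ W + b))

    # 2. Output-layer delta: 2(h^(L) - y) g'(z), with g'(z) = h (1 - h).
    delta = 2.0 * (hs[-1] - y) * hs[-1] * (1.0 - hs[-1])

    # 3.-4. Backward pass, collecting the partial derivatives per layer.
    grads_W, grads_b = [], []
    for l in range(len(thetas) - 1, -1, -1):
        grads_W.append(np.outer(hs[l], delta))  # d/d theta^(l)_{ij} = h^(l)_i delta^(l+1)_j
        grads_b.append(delta)                   # d/d b^(l)_i = delta^(l+1)_i
        if l > 0:                               # propagate delta to layer l
            delta = (thetas[l] @ delta) * hs[l] * (1.0 - hs[l])
    return grads_W[::-1], grads_b[::-1]
```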
Deep vs. shallow networks
- Deep networks are more computationally attractive than shallow networks: many fewer connections.
Convolutional neural networks
- In the networks seen so far, every neuron in the first hidden layer connects to all of the input neurons. This does not work when $x$ is high dimensional.
- Convolutional neural network (CNN): locally connected neural networks.
- Weight sharing: $w_1 = w_4 = w_7$, $w_2 = w_5 = w_8$, $w_3 = w_6 = w_9$ (see the sketch below).
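A minimal sketch of this local connectivity and weight sharing as a one-dimensional convolution; the input and filter values are made up for illustration.

```python
import numpy as np

def conv1d(x, w, b):
    """Slide a shared filter w over x: each output neuron sees only a
    local window of the input, and every window reuses the same weights
    (the slide's w1 = w4 = w7, etc.)."""
    k = len(w)
    return np.array([x[i:i + k] @ w + b for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
w = np.array([0.5, -0.25, 0.1])   # one shared 3-weight filter
print(conv1d(x, w, b=0.0))
```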
Recurrent neural networks
- $x_0, x_1, \dots, x_T$ are labels (e.g., a sequence of words); $h_0, h_1, \dots, h_T$ are the hidden states of the recurrent network.
- Three sets of parameters: input-to-hidden weights $W$, hidden-to-hidden weights $U$, hidden-to-output weights $V$.
- Data generating process:
$$h_0 = \sigma(W x_0), \qquad h_t = \sigma(U h_{t-1} + W x_t) \ \text{ for } t = 1, \dots, T, \qquad f(x) = V h_T$$
- Objective: minimize $(y - f(x))^2$.
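A minimal NumPy sketch of this data generating process; the dimensions and random parameters are toy choices for illustration.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W, U, V):
    """Forward pass of the recurrent network on the slide:
    h_0 = sigma(W x_0), h_t = sigma(U h_{t-1} + W x_t), f(x) = V h_T."""
    h = sigma(W @ xs[0])
    for x_t in xs[1:]:
        h = sigma(U @ h + W @ x_t)
    return V @ h

# Toy dimensions: inputs in R^4 (e.g., one-hot words), hidden states in R^3.
rng = np.random.default_rng(0)
W, U, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(1, 3))
xs = [np.eye(4)[i] for i in [0, 2, 1]]   # a length-3 "sentence"
print(rnn_forward(xs, W, U, V))
```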
Word Embeddings
Word embeddings
- Weaknesses of the bag-of-words model:
  - Word order information is lost: "AlphaGo beats Lee" / "Lee beats AlphaGo" -> [AlphaGo, beats, Lee]
  - Semantic information is lost: it cannot distinguish the relation between "stock" and "returns" from the relation between "stock" and "Africa".
  - High dimensionality.
- Word embeddings: map words into a low-dimensional space (relative to vocabulary size).
- Neural network models: continuous bag-of-words (CBOW), skip-gram.
Continuous bag-of-words
- Idea: find word vector representations that are useful for predicting a certain word using the surrounding words in a sentence or a document.
- Embedding vectors: word $v_w \in \mathbb{R}^r$ for word $w \in V_W$; context $v_c \in \mathbb{R}^r$ for context $c \in V_C$. $r$ is the embedding dimensionality, a hyperparameter.
- The probability for word $w$ to appear in context $c = (w_{t-l}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+l})$, where $v_c = \frac{1}{2l} \sum_{i=1}^{l} \left( v_{w_{t-i}} + v_{w_{t+i}} \right)$:
$$p(w \mid c) = \sigma(v_w \cdot v_c) = \frac{1}{1 + \exp(-v_w \cdot v_c)}$$
CBOW Objective
- Negative sampling: choose $v_w$ ($v_c$ is a deterministic function of the $v_w$) to maximize
$$\log \sigma(v_w \cdot v_c) + k \, \mathbb{E}_{w_N \sim P(w)} \log \sigma(-v_{w_N} \cdot v_c)$$
- $k$: hyperparameter controlling the penalization of $(w,c)$ pairs not appearing in the corpus.
- Negative words $w_N$ are drawn according to the empirical distribution $P(w) = \frac{\#(w)}{|D|}$, where $D$ is the set of $(w,c)$ pairs.
- Global objective:
$$L = \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \left[ \log \sigma(v_w \cdot v_c) + k \, \mathbb{E}_{w_N \sim P(w)} \log \sigma(-v_{w_N} \cdot v_c) \right]$$
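A minimal NumPy sketch of this per-pair objective, where the expectation over $w_N$ is replaced by a Monte Carlo sum over $k$ sampled negative words (the usual word2vec shortcut); the array names are hypothetical.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_objective(V_emb, center, context, neg_samples):
    """Negative-sampling objective for one (w, c) pair (to be maximized).

    V_emb: (vocab_size, r) word-vector matrix; the context vector is the
    average of the surrounding words' vectors, as on the slide.
    neg_samples: k word ids drawn from P(w) = #(w)/|D|; summing over them
    approximates k times the expectation over w_N.
    """
    v_c = V_emb[context].mean(axis=0)                       # context vector
    pos = np.log(sigma(V_emb[center] @ v_c))                # observed pair term
    neg = np.log(sigma(-V_emb[neg_samples] @ v_c)).sum()    # negative terms
    return pos + neg
```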
CBOW-Doc Architecture
- Idea: not only each individual word but also each document is represented by a dense vector, which is trained to predict words in the document. This provides a direct way to embed a document into a vector space.
- The document embedding vector $v_d \in \mathbb{R}^r$ is directly learned from the neural network model.
- The probability for word $w$ to appear in context $c$ and document $d$:
$$p(w \mid c, d) = \sigma\left( v_w \cdot \left( \alpha v_c + (1-\alpha) v_d \right) \right)$$
- $\alpha \in [0,1]$ is the weight assigned to the context vector $v_c$.
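For reference, gensim's Doc2Vec implements a closely related paragraph-vector model (not the exact $\alpha$-weighted combination above); a minimal usage sketch, assuming gensim >= 4 and a made-up toy corpus.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is a list of tokens plus a tag.
docs = [TaggedDocument(words=["stock", "returns", "rose"], tags=["doc0"]),
        TaggedDocument(words=["oil", "prices", "fell"], tags=["doc1"])]

model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=20)
v_d = model.dv["doc0"]   # the learned document embedding vector v_d
```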
Skip-gram
- Idea: find word vector representations that are useful for predicting the surrounding words given a center word in a sentence or a document.
- The probability for context $c = (w_{t-l}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+l})$ to appear around word $w$:
$$p(c \mid w) = \sigma(v_c \cdot v_w) = \frac{1}{1 + \exp(-v_c \cdot v_w)}$$
Skip-gram Objective
- Negative sampling: choose $v_w$ to maximize
$$\log \sigma(v_c \cdot v_w) + k \, \mathbb{E}_{c_N \sim P(c)} \log \sigma(-v_{c_N} \cdot v_w)$$
- $k$: hyperparameter controlling the penalization of $(w,c)$ pairs not appearing in the corpus.
- Negative contexts $c_N$ are drawn according to the empirical distribution $P(c) = \frac{\#(c)}{|D|}$, where $D$ is the set of $(c,w)$ pairs.
- Global objective:
$$L = \sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \left[ \log \sigma(v_c \cdot v_w) + k \, \mathbb{E}_{c_N \sim P(c)} \log \sigma(-v_{c_N} \cdot v_w) \right]$$
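In practice, skip-gram with negative sampling is available off the shelf; a minimal sketch using gensim's Word2Vec (assuming gensim >= 4; the toy corpus is made up, so the similarities are not meaningful).

```python
from gensim.models import Word2Vec

sentences = [["alphago", "beats", "lee"], ["stock", "returns", "rose"]]

# sg=1 selects skip-gram (sg=0 is CBOW); negative=k is the number of
# negative samples; window=l is the context half-width.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, negative=5)
print(model.wv.most_similar("stock", topn=3))
```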
Matrix Factorization
- The word-context matrix is a $|V_W| \times |V_C|$ matrix $M$: each row corresponds to a word $w \in V_W$, each column corresponds to a context $c \in V_C$, and each element $M_{wc}$ measures the association between a word and a context.
- Word embedding: factorize $M$ into a $|V_W| \times r$ word embedding matrix $W$ and a $|V_C| \times r$ context embedding matrix $C$.
- CBOW/Skip-gram: $M_{wc} = \log\left( \frac{\#(w,c) \, |D|}{\#(w) \, \#(c)} \right) - \log k$, called shifted point-wise mutual information.
- The stochastic-gradient-based training method is similar to a symmetric SVD: $W^{SVD} = U_r \Sigma_r^{1/2}$, where $M = U \Sigma V^T$.
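A minimal NumPy sketch of this symmetric rank-$r$ factorization, assuming the matrix M has already been filled with shifted PMI values.

```python
import numpy as np

def embeddings_from_spmi(M, r):
    """Rank-r symmetric SVD of a shifted-PMI matrix M = U S V^T:
    word embeddings W = U_r S_r^{1/2}, context embeddings C = V_r S_r^{1/2}."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    W = U[:, :r] * np.sqrt(S[:r])   # |V_W| x r word embedding matrix
    C = Vt[:r].T * np.sqrt(S[:r])   # |V_C| x r context embedding matrix
    return W, C
```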
Cosine similarity
- Word similarities:
$$\text{Similarity}(w, w') = \frac{v_w \cdot v_{w'}}{\| v_w \| \, \| v_{w'} \|}$$
- Example (letters to shareholders from N-CSR files): nearest neighbors by cosine similarity.

        china        oil          politics        shareholder
  1     chinese      commodity    terrorism       shareholders
  2     indonesia    energy       rhetoric        stockholders
  3     brazil       gasoline     political       stockholder
  4     russia       cotton       standoff        shareowner
  5     japan        fuel         presidential    trustees
  6     asia         gold         partisan        shareowners
  7     turkey       brent        debate          classify
  8     states       natural      threats         directors
  9     population   food         uncertainties   mergers
 10     india        ore          attacks         semiannual
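A minimal NumPy sketch of this similarity and the nearest-neighbor lookup behind the table; V_emb and the vocab word-to-id mapping are hypothetical names.

```python
import numpy as np

def similarity(v_w1, v_w2):
    """Cosine similarity between two word vectors."""
    return (v_w1 @ v_w2) / (np.linalg.norm(v_w1) * np.linalg.norm(v_w2))

def most_similar(V_emb, vocab, query, topn=10):
    """Rank the vocabulary by cosine similarity to a query word."""
    q = V_emb[vocab[query]]
    sims = V_emb @ q / (np.linalg.norm(V_emb, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    inv = {i: w for w, i in vocab.items()}
    return [(inv[i], sims[i]) for i in order[1:topn + 1]]  # skip the query itself
```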
Word Clouds by Sentiments
- Reduce 300-dimensional word vectors to 2-dimensional vectors using t-distributed stochastic neighbor embedding (t-SNE).
- Top 30 words similar to "good" and "bad".
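A minimal sketch of this reduction using scikit-learn's TSNE; the word-vector matrix here is random placeholder data standing in for the trained 300-dimensional embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for a (n_words, 300) matrix of trained word vectors.
V_emb = np.random.default_rng(0).normal(size=(100, 300))

coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(V_emb)
# coords[:, 0], coords[:, 1] give the 2-D positions used to plot the word clouds.
```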
Word Clouds by Topics
- Top 30 words similar to "region", "politics", "investment", "macro", "index", "commodity", "shareholder", "industry".
Downstream predictions
- For downstream predictions, we may need document-level feature vectors:
$$y = \beta_0 + \beta_x X + \beta_d v_d$$
where $v_d$ is the document embedding vector.
- Generating document-level features:
  - Direct learning: a document vector is directly learned from the neural network.
  - Taking the average: a document vector is the average of the word vectors. Denote the word vectors as $v_w \in \mathbb{R}^r$, where $r$ is the dimensionality of the word embedding space; then $v_d = \frac{1}{\#(w \in d)} \sum_{w \in d} v_w$.
  - Clustering: K-means, spectral clustering, etc. Cluster words using their embedding vectors, and represent documents using bags of clusters (see the sketch below).
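Minimal sketches of the "taking the average" and "clustering" strategies, assuming a NumPy word-vector matrix V_emb and a document given as a list of word ids; scikit-learn's KMeans stands in for the clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def doc_vector_avg(V_emb, word_ids):
    """Taking the average: document vector = mean of its word vectors."""
    return V_emb[word_ids].mean(axis=0)

def bag_of_clusters(V_emb, word_ids, n_clusters=50, seed=0):
    """Clustering: cluster the vocabulary with K-means, then represent a
    document by its normalized histogram over word clusters."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(V_emb)
    hist = np.bincount(labels[word_ids], minlength=n_clusters)
    return hist / hist.sum()
```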