STA141C: Big Data & High Performance Statistical Computing


1 STA141C: Big Data & High Performance Statistical Computing Lecture 9: Dimension Reduction/Word2vec Cho-Jui Hsieh UC Davis May 15, 2018

2 Principal Component Analysis


3 Principal Component Analysis (PCA) Data matrices can be big. Example: the bag-of-words model. Each document is represented by a d-dimensional vector $x$, where $x_i$ is the number of occurrences of word $i$. Number of features = number of potential words $\approx$ 10,000.

4 Feature generation for documents Bag of n-gram features (n = 2): 10,000 words $\Rightarrow$ $10{,}000^2$ potential features.
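To make the feature blow-up concrete, here is a minimal sketch of bigram counting in plain Python. The helper name `ngram_counts` and the toy sentence are illustrative assumptions, not part of the lecture; the last two lines only compute the number of possible ordered word pairs for a 10,000-word vocabulary.

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Count contiguous n-grams in a list of tokens."""
    # zip over n shifted copies of the token list to get sliding windows
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Toy document (an assumption for illustration)
tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(tokens, n=2)

# Number of possible ordered word pairs for a 10,000-word vocabulary
vocab_size = 10_000
potential_bigrams = vocab_size ** 2
```

Only the bigrams that actually occur are stored, which is why such feature matrices are kept sparse in practice.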

5 Data Matrix (document) Use the bag-of-words matrix or its normalized version (TF-IDF) for a dataset (denoted by $D$):
$$\mathrm{tfidf}(\mathrm{doc}, \mathrm{word}, D) = \mathrm{tf}(\mathrm{doc}, \mathrm{word}) \cdot \mathrm{idf}(\mathrm{word}, D)$$
tf(doc, word): term frequency, (word count in the document) / (total number of terms in the document).
idf(word, D): inverse document frequency, log((number of documents) / (number of documents containing this word)).
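The TF-IDF definition on this slide can be sketched directly in plain Python. The function name `tfidf` and the three toy documents are assumptions for illustration; library implementations often use smoothed variants of idf, which this sketch does not.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: tf-idf weight} dict per tokenized document.

    tf  = (count of word in doc) / (total number of terms in doc)
    idf = log(number of docs / number of docs containing the word)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # each document counts a word at most once
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

# Toy corpus (an assumption for illustration)
docs = [["big", "data", "big"], ["data", "science"], ["big", "compute"]]
scores = tfidf(docs)
```

A word that appears in every document gets idf = log(1) = 0, so it contributes nothing to the representation.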

6 PCA: Motivation Data can have huge dimensionality:
Reuters text collection (rcv1): 677,399 documents, 47,236 features (words).
Pubmed abstract collection: 8,200,000 documents, 141,043 features (words).
Can we find a low-dimensional representation for each document? Such a representation can enable many learning algorithms to run efficiently, sometimes achieve better prediction performance (de-noising), and help visualize the data.

7 PCA: Motivation Orthogonal projection of the data onto a lower-dimensional linear space that maximizes the variance of the projected data (preserving as much information as possible) and minimizes the reconstruction error.

8 PCA: Formulation Given the data $x_1, \dots, x_n \in \mathbb{R}^d$, compute the principal vector $w$ by
$$w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - w^T \bar{x})^2,$$
where $\bar{x} = \sum_i x_i / n$ is the mean. First, shift the data so that $\hat{x}_i = x_i - \bar{x}$, so
$$w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T \hat{x}_i)^2 = \arg\max_{\|w\|=1} \frac{1}{n} w^T \hat{X} \hat{X}^T w,$$
where each column of $\hat{X}$ is $\hat{x}_i$. The first principal component $w$ is the leading eigenvector of $\hat{X} \hat{X}^T$ (the eigenvector corresponding to the largest eigenvalue).
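The formulation above can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the data, the planted direction, and all variable names are assumptions for illustration. It centers the data, takes the leading eigenvector of the scaled matrix $\hat{X}\hat{X}^T / n$, and computes the variance of the projected data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500

# Synthetic data (assumption): columns are samples, with most of the
# variance planted along the unit direction (1, 1, 0) / sqrt(2).
direction = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
X = (np.outer(direction, 5.0 * rng.standard_normal(n))
     + 0.1 * rng.standard_normal((d, n))
     + 2.0)  # constant offset, removed again by centering

# Shift the data so that each feature has zero mean
x_bar = X.mean(axis=1, keepdims=True)
X_hat = X - x_bar

# Leading eigenvector of (1/n) X_hat X_hat^T; eigh returns eigenvalues
# in ascending order, so the last column belongs to the largest one.
C = X_hat @ X_hat.T / n
eigvals, eigvecs = np.linalg.eigh(C)
w = eigvecs[:, -1]

# Variance of the data projected onto w: (1/n) * sum_i (w^T x_hat_i)^2
projected_variance = np.mean((w @ X_hat) ** 2)
```

The projected variance equals the largest eigenvalue, and `w` recovers the planted direction up to sign.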

9 PCA: Formulation 2nd principal component $w_2$: perpendicular to $w_1$; again, largest variance; the eigenvector corresponding to the second-largest eigenvalue. Top k principal components $w_1, \dots, w_k$: the top k eigenvectors, which span the k-dimensional subspace with the largest variance:
$$W = \arg\max_{W \in \mathbb{R}^{d \times k},\; W^T W = I} \left\{ \sum_{r=1}^{k} \frac{1}{n} w_r^T \hat{X} \hat{X}^T w_r \right\}$$

10 PCA: illustration

11 PCA: Computation PCA: top-k eigenvectors of $\hat{X} \hat{X}^T$. Assume $\hat{X} = U \Sigma V^T$; then the principal components are $U_k$ (the top-k left singular vectors of $\hat{X}$). Projection of $\hat{X}$ onto $U_k$: $U_k^T \hat{X} = \Sigma_k V_k^T$ (a k-by-n matrix). Each column is the k-dimensional feature vector for one example. PCA can be computed in two ways: top-k SVD of $\hat{X}$, or top-k SVD (equivalently, eigendecomposition) of $\hat{X} \hat{X}^T$ (explicitly form the matrix only when d is small). Either way, a large-scale SVD solver for dense or sparse matrices is needed.
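As a sanity check on the two computation routes, here is a small dense NumPy sketch. The random data and variable names are assumptions for illustration. It verifies the identity $U_k^T \hat{X} = \Sigma_k V_k^T$ and compares the SVD route against the eigendecomposition of $\hat{X}\hat{X}^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 5, 200, 2

# Synthetic d-by-n data (assumption): rows have clearly different scales,
# so the leading singular values are well separated.
scales = np.array([[5.0], [3.0], [1.0], [0.5], [0.1]])
X = rng.standard_normal((d, n)) * scales
X_hat = X - X.mean(axis=1, keepdims=True)

# Route 1: thin SVD of the centered data matrix
U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
U_k = U[:, :k]
Z = U_k.T @ X_hat                      # k-by-n projected features
Z_from_svd = np.diag(s[:k]) @ Vt[:k]   # Sigma_k V_k^T, same matrix

# Route 2: eigendecomposition of the d-by-d matrix X_hat X_hat^T
# (only sensible when d is small); reverse to get descending order.
eigvals, eigvecs = np.linalg.eigh(X_hat @ X_hat.T)
top_eigvals = eigvals[::-1][:k]
top_eigvecs = eigvecs[:, ::-1][:, :k]
```

The eigenvalues of $\hat{X}\hat{X}^T$ are the squared singular values of $\hat{X}$, and the eigenvectors match the left singular vectors up to sign. For large sparse matrices, a truncated solver such as `scipy.sparse.linalg.svds` would likely replace the dense `np.linalg.svd` call used here.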

12 Word2vec: Learning Word Representations

13 Word2vec: Motivation Goal: understand the meaning of a word. Given a large text corpus, how to learn low-dimensional features to represent a word? Skip-gram model: for each word $w_i$, define the contexts of the word as the words surrounding it in an L-sized window:
$$\dots, w_{i-L-2}, w_{i-L-1}, \underbrace{w_{i-L}, \dots, w_{i-1}}_{\text{contexts of } w_i}, w_i, \underbrace{w_{i+1}, \dots, w_{i+L}}_{\text{contexts of } w_i}, w_{i+L+1}, \dots$$
Get a collection of (word, context) pairs, denoted by $D$.
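The window definition can be sketched as a short function that collects the (word, context) pairs. The name `skipgram_pairs` and the toy sentence are assumptions for illustration; windows are simply truncated at the start and end of the token list.

```python
def skipgram_pairs(tokens, L=2):
    """Collect (word, context) pairs using a window of L words on each side."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - L)                # clip the window at the start
        hi = min(len(tokens), i + L + 1)  # clip the window at the end
        for j in range(lo, hi):
            if j != i:                    # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs

# Toy sentence (an assumption for illustration)
tokens = "the quick brown fox jumps".split()
D = skipgram_pairs(tokens, L=1)
```

With `L=1`, each interior word contributes two pairs and each boundary word contributes one.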

14 Skip-gram model (Figure from word2vec-tutorial-the-skip-gram-model/)

15 Use bag-of-word model Idea 1: use the bag-of-words model to describe each word. Assume we have context words $c_1, \dots, c_d$ in the corpus, and compute $\#(w, c_i)$ := the number of times the pair $(w, c_i)$ appears in $D$. For each word $w$, form a d-dimensional (sparse) vector to describe $w$: $(\#(w, c_1), \dots, \#(w, c_d))$.
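A sparse count vector per word can be sketched with dictionaries, storing only the nonzero entries $\#(w, c)$. The pair list `D` below is a hand-written toy example, and the name `word_vectors` is an assumption for illustration.

```python
from collections import Counter, defaultdict

# Toy collection of (word, context) pairs (an assumption for illustration)
D = [("cat", "sat"), ("cat", "the"), ("dog", "sat"),
     ("cat", "sat"), ("dog", "the")]

# #(w, c): number of times each pair appears in D
pair_counts = Counter(D)

# One sparse vector per word: only contexts with a nonzero count are stored
word_vectors = defaultdict(dict)
for (w, c), count in pair_counts.items():
    word_vectors[w][c] = count
```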

16 PMI/PPMI Representation Similar to TF-IDF: we need to account for the frequency of each word and each context. Instead of using the co-occurrence count $\#(w, c)$, we can define the pointwise mutual information:
$$\mathrm{PMI}(w, c) = \log \frac{\hat{P}(w, c)}{\hat{P}(w)\,\hat{P}(c)} = \log \frac{\#(w, c)\,|D|}{\#(w)\,\#(c)},$$
where $\#(w) = \sum_c \#(w, c)$ is the number of times word $w$ occurred in $D$, $\#(c) = \sum_w \#(w, c)$ is the number of times context $c$ occurred in $D$, and $|D|$ is the number of pairs in $D$. Positive PMI (PPMI) usually achieves better performance:
$$\mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0)$$
$M^{\mathrm{PPMI}}$: an n-by-d word feature matrix; each row is a word and each column is a context.
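The PMI and PPMI definitions translate almost line by line into code. This is a minimal sketch in plain Python; the function name `ppmi` and the four-pair toy collection are assumptions for illustration. Pairs that never co-occur are absent from the result, which corresponds to a zero entry in the sparse matrix.

```python
import math
from collections import Counter

def ppmi(pairs):
    """Return the positive PPMI entries as a {(word, context): value} dict."""
    pair_counts = Counter(pairs)                # #(w, c)
    word_counts = Counter(w for w, _ in pairs)  # #(w) = sum_c #(w, c)
    ctx_counts = Counter(c for _, c in pairs)   # #(c) = sum_w #(w, c)
    total = len(pairs)                          # |D|
    entries = {}
    for (w, c), n_wc in pair_counts.items():
        pmi = math.log(n_wc * total / (word_counts[w] * ctx_counts[c]))
        if pmi > 0:                             # PPMI = max(PMI, 0)
            entries[(w, c)] = pmi
    return entries

# Toy collection of (word, context) pairs (an assumption for illustration)
D = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "x")]
M = ppmi(D)
```

In this toy example the pair ("b", "x") has negative PMI, so it is clipped to zero and dropped from the sparse result.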

17 PPMI Matrix

18 Low-dimensional embedding (Word2vec) Advantages of extracting low-dimensional dense representations: improved computational efficiency for end applications; better visualization; better performance (?). Perform PCA/SVD on the sparse feature matrix: $M^{\mathrm{PPMI}} \approx U_k \Sigma_k V_k^T$. Then $W^{\mathrm{SVD}} = U_k \Sigma_k$ is the context representation of each word (each row is a k-dimensional feature vector for a word). This is one of the word2vec algorithms.
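A minimal dense sketch of the SVD embedding step follows. The small matrix below is a made-up stand-in for a PPMI matrix, and the variable names are assumptions for illustration. It forms $W^{\mathrm{SVD}} = U_k \Sigma_k$ and measures how well the rank-k factorization reconstructs the matrix.

```python
import numpy as np

# Made-up 4-by-5 nonnegative matrix standing in for M^PPMI (assumption):
# rows are words, columns are contexts.
M_ppmi = np.array([
    [1.2, 0.0, 0.3, 0.0, 0.9],
    [1.0, 0.1, 0.0, 0.0, 1.1],
    [0.0, 0.8, 0.0, 1.4, 0.0],
    [0.0, 1.0, 0.2, 1.3, 0.0],
])

k = 2
U, s, Vt = np.linalg.svd(M_ppmi, full_matrices=False)

# W^SVD = U_k Sigma_k: one k-dimensional row per word
W_svd = U[:, :k] * s[:k]

# Rank-k reconstruction U_k Sigma_k V_k^T and its squared Frobenius error
M_k = W_svd @ Vt[:k]
reconstruction_error = np.linalg.norm(M_ppmi - M_k) ** 2
```

The squared Frobenius error equals the sum of the squared discarded singular values. For a realistically sized sparse PPMI matrix, a truncated sparse solver such as `scipy.sparse.linalg.svds` would likely be used instead of the dense call here.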

19 Generalized Low-rank Embedding The SVD basis minimizes
$$\min_{W, V} \|M^{\mathrm{PPMI}} - W V^T\|_F^2$$
Extensions (GloVe, Google W2V, ...): use a different loss function (instead of $\|\cdot\|_F$); negative sampling (less weight on the 0s in $M^{\mathrm{PPMI}}$); adding bias terms: $M^{\mathrm{PPMI}} \approx W V^T + b_w e^T + e b_c^T$. Details and comparisons: Improving Distributional Similarity with Lessons Learned from Word Embeddings, Levy et al., ACL. GloVe: Global Vectors for Word Representation, Pennington et al., EMNLP 2014.
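To illustrate the low-rank objective as an optimization problem rather than a closed-form SVD, here is a sketch that minimizes the plain Frobenius loss by alternating least squares. This is an assumption-laden toy: it is not the training procedure of GloVe or of Google's word2vec, the random matrix only stands in for $M^{\mathrm{PPMI}}$, and no bias terms or reweighting are included.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 8, 2

# Random nonnegative matrix standing in for M^PPMI (assumption)
M = np.maximum(rng.standard_normal((n, d)), 0.0)

def loss(W, V):
    """Squared Frobenius loss ||M - W V^T||_F^2."""
    return np.linalg.norm(M - W @ V.T) ** 2

W = rng.standard_normal((n, k))
V = rng.standard_normal((d, k))
initial_loss = loss(W, V)

# Alternating least squares: each update solves a linear least-squares
# problem exactly with the other factor held fixed, so the loss never rises.
for _ in range(50):
    W = np.linalg.lstsq(V, M.T, rcond=None)[0].T  # fix V, solve for W
    V = np.linalg.lstsq(W, M, rcond=None)[0].T    # fix W, solve for V
final_loss = loss(W, V)

# Best achievable value for rank k: sum of squared discarded singular values
s = np.linalg.svd(M, compute_uv=False)
svd_optimum = np.sum(s[k:] ** 2)
```

No rank-k factorization can beat `svd_optimum` under this loss. The extensions listed on the slide change the objective itself, for example by weighting entries differently, at which point the SVD is no longer the minimizer and an iterative method like this becomes necessary.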

20 Results The low-dimensional embeddings are (often) meaningful: (Figure from

21 Coming up Tree-based algorithms Questions?
