STA141C: Big Data & High Performance Statistical Computing
Lecture 6: Numerical Linear Algebra: Applications in Machine Learning
Cho-Jui Hsieh
UC Davis
April 27, 2017
Principal Component Analysis
Principal Component Analysis (PCA)
Data matrices can be big. Example: the bag-of-words model.
Each document is represented by a d-dimensional vector $x$, where $x_i$ is the number of occurrences of word $i$.
Number of features = number of potential words (e.g., 10,000).
Feature generation for documents
Bag of n-gram features (n = 2): 10,000 words $\Rightarrow$ $10{,}000^2$ potential features.
Data Matrix (document)
Use the bag-of-words matrix or its normalized version (TF-IDF) for a dataset (denoted by D):
$\text{tfidf}(doc, word, D) = \text{tf}(doc, word) \times \text{idf}(word, D)$
tf(doc, word): term frequency = (word count in the document) / (total number of terms in the document)
idf(word, D): inverse document frequency = log((number of documents) / (number of documents containing this word))
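As a concrete illustration, here is a minimal numpy sketch of the TF-IDF formula above; the toy count matrix and the function name `tfidf` are made up for this example.

```python
import numpy as np

def tfidf(counts):
    """TF-IDF for a (n_docs x n_words) term-count matrix."""
    tf = counts / counts.sum(axis=1, keepdims=True)  # word count / total terms in doc
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)                    # docs containing each word
    idf = np.log(n_docs / df)                        # assumes every word appears somewhere
    return tf * idf                                  # tfidf = tf * idf

# toy corpus: 3 documents, 4 words
D = np.array([[2., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 0., 1., 3.]])
print(tfidf(D))
```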
PCA: Motivation
Data can have huge dimensionality:
Reuters text collection (rcv1): 677,399 documents, 47,236 features (words)
PubMed abstract collection: 8,200,000 documents, 141,043 features (words)
Can we find a low-dimensional representation for each document? This would:
enable many learning algorithms to run efficiently,
sometimes achieve better prediction performance (de-noising),
and let us visualize the data.
PCA: Motivation
Orthogonal projection of the data onto a lower-dimensional linear space that:
maximizes the variance of the projected data (preserving as much information as possible), and
minimizes the reconstruction error.
PCA: Formulation
Given data $x_1, \dots, x_n \in \mathbb{R}^d$, compute the principal vector $w$ by
$w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^n (w^T x_i - w^T \bar{x})^2$,
where $\bar{x} = \sum_i x_i / n$ is the mean.
First, shift the data so that $\hat{x}_i = x_i - \bar{x}$; then
$w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^n (w^T \hat{x}_i)^2 = \arg\max_{\|w\|=1} \frac{1}{n} w^T \hat{X} \hat{X}^T w$,
where each column of $\hat{X}$ is $\hat{x}_i$.
The first principal component $w$ is the leading eigenvector of $\hat{X}\hat{X}^T$ (the eigenvector corresponding to the largest eigenvalue).
PCA: Formulation
Second principal component $w_2$:
perpendicular to $w_1$,
again captures the largest variance,
is the eigenvector corresponding to the second largest eigenvalue.
Top k principal components $w_1, \dots, w_k$: the top k eigenvectors, spanning the k-dimensional subspace with the largest variance:
$W = \arg\max_{W \in \mathbb{R}^{d \times k},\, W^T W = I} \sum_{r=1}^k \frac{1}{n} w_r^T \hat{X} \hat{X}^T w_r$
PCA: illustration
PCA: Computation
PCA: top-k eigenvectors of $\hat{X}\hat{X}^T$.
Assume $\hat{X} = U \Sigma V^T$; then the principal components are $U_k$ (the top-k left singular vectors of $\hat{X}$).
Projection of $\hat{X}$ onto $U_k$: $U_k^T \hat{X} = \Sigma_k V_k^T$ (a k-by-n matrix).
Each column is the k-dimensional feature vector for an example.
PCA can be computed in two ways:
top-k SVD of $\hat{X}$, or
top-k SVD of $\hat{X}\hat{X}^T$ (explicitly form this matrix only when d is small).
Either way, we need a large-scale SVD solver for dense or sparse matrices.
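A minimal scipy sketch of the SVD route (the function name and toy data are illustrative; for a truly sparse $\hat{X}$ one would avoid explicit mean subtraction, since shifting densifies the matrix):

```python
import numpy as np
from scipy.sparse.linalg import svds

def pca_topk(X, k):
    """Top-k PCA of a (d x n) matrix with one example per column."""
    Xhat = X - X.mean(axis=1, keepdims=True)  # shift data to zero mean
    U, s, Vt = svds(Xhat, k=k)                # top-k singular triplets
    order = np.argsort(s)[::-1]               # svds returns ascending order
    U, s, Vt = U[:, order], s[order], Vt[order]
    return U, s[:, None] * Vt                 # U_k, and Sigma_k V_k^T (k x n features)

X = np.random.randn(50, 200)                  # d = 50 features, n = 200 examples
Uk, Z = pca_topk(X, k=5)
print(Uk.shape, Z.shape)                      # (50, 5) (5, 200)
```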
Word2vec: Learning Word Representations
Word2vec: Motivation
Goal: understand the meaning of a word.
Given a large text corpus, how can we learn low-dimensional features to represent a word?
Skip-gram model: for each word $w_i$, define the contexts of the word as the words surrounding it in an L-sized window:
$\dots, w_{i-L-1}, \underbrace{w_{i-L}, \dots, w_{i-1}}_{\text{contexts of } w_i}, \; w_i, \; \underbrace{w_{i+1}, \dots, w_{i+L}}_{\text{contexts of } w_i}, w_{i+L+1}, \dots$
Collect the (word, context) pairs into a set denoted by D (see the sketch below).
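A minimal sketch of collecting the (word, context) pairs D from a tokenized corpus (the function name and toy sentence are illustrative):

```python
def skipgram_pairs(tokens, L=2):
    """Collect (word, context) pairs from an L-sized window around each word."""
    D = []
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - L), min(len(tokens), i + L + 1)
        for j in range(lo, hi):
            if j != i:                    # every other word in the window is a context
                D.append((w, tokens[j]))
    return D

print(skipgram_pairs("the quick brown fox jumps".split(), L=2))
```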
Skip-gram model
(Figure from http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
Use the bag-of-words model
Idea 1: use the bag-of-words model to describe each word.
Assume we have context words $c_1, \dots, c_d$ in the corpus; compute $\#(w, c_i)$ := number of times the pair $(w, c_i)$ appears in D.
For each word w, form a d-dimensional (sparse) vector $[\#(w, c_1), \dots, \#(w, c_d)]$ to describe w.
PMI/PPMI Representation
Similar to TF-IDF: we need to account for the frequency of each word and each context.
Instead of the raw co-occurrence count $\#(w, c)$, we can use pointwise mutual information:
$\text{PMI}(w, c) = \log \frac{\hat{P}(w, c)}{\hat{P}(w)\,\hat{P}(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w)\,\#(c)}$
$\#(w) = \sum_c \#(w, c)$: number of times word w occurs in D
$\#(c) = \sum_w \#(w, c)$: number of times context c occurs in D
$|D|$: number of pairs in D
Positive PMI (PPMI) usually achieves better performance:
$\text{PPMI}(w, c) = \max(\text{PMI}(w, c), 0)$
$M_{\text{PPMI}}$: an n-by-d word feature matrix; each row is a word and each column is a context.
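A minimal numpy sketch of the PPMI computation, assuming the counts are already collected into a dense matrix C with $C_{wc} = \#(w, c)$ (a real co-occurrence matrix would be sparse):

```python
import numpy as np

def ppmi(C):
    """PPMI of a (n_words x d_contexts) co-occurrence count matrix."""
    total = C.sum()                       # |D|
    nw = C.sum(axis=1, keepdims=True)     # #(w)
    nc = C.sum(axis=0, keepdims=True)     # #(c)
    with np.errstate(divide="ignore"):    # log(0) -> -inf, clipped below
        pmi = np.log(C * total / (nw * nc))
    return np.maximum(pmi, 0)             # PPMI = max(PMI, 0)

C = np.array([[4., 1., 0.],
              [1., 2., 3.],
              [0., 1., 5.]])
print(ppmi(C))
```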
PPMI Matrix
Low-dimensional embedding (Word2vec)
Advantages of extracting low-dimensional dense representations:
improved computational efficiency for end applications,
better visualization,
and (possibly) better performance.
Perform PCA/SVD on the sparse feature matrix: $M_{\text{PPMI}} \approx U_k \Sigma_k V_k^T$.
Then $W_{\text{SVD}} = U_k \Sigma_k$ gives the representation of each word (each row is a k-dimensional feature vector for a word).
This is one of the word2vec algorithms.
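The SVD step itself is short; a sketch with a random stand-in for the real PPMI matrix:

```python
import numpy as np
from scipy.sparse.linalg import svds

M = np.random.rand(100, 300)   # stand-in for an n x d PPMI matrix
U, s, Vt = svds(M, k=10)       # M ~ U_k Sigma_k V_k^T
W_svd = U * s                  # multiply column j of U_k by sigma_j
print(W_svd.shape)             # each of the 100 rows is a 10-dim word vector
```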
Generalized Low-rank Embedding
The SVD basis minimizes $\min_{W,V} \|M_{\text{PPMI}} - W V^T\|_F^2$.
Extensions (GloVe, Google word2vec, ...):
use a different loss function (instead of the Frobenius norm),
negative sampling (placing less weight on the zeros in $M_{\text{PPMI}}$),
adding bias terms: $M_{\text{PPMI}} \approx W V^T + b_w e^T + e b_c^T$.
Details and comparisons:
Improving Distributional Similarity with Lessons Learned from Word Embeddings, Levy et al., TACL 2015.
GloVe: Global Vectors for Word Representation, Pennington et al., EMNLP 2014.
Results The low-dimensional embeddings are (often) meaningful: (Figure from https://www.tensorflow.org/tutorials/word2vec)
PageRank/Hubs and Authorities
Ranking Websites
Text-based ranking systems (the dominant approach in the early 90s):
compute the similarity between the query and websites (documents).
Keywords are a very limited way to express a complex information need.
Need to rank websites by popularity, authority, ...
PageRank: developed by Brin and Page (1999); determines authority and popularity from hyperlinks.
PageRank Main idea: estimate the ranking of websites by the link structure.
Topology of Websites
Transform the hyperlinks into a directed graph:
the adjacency matrix A with $A_{ij} = 1$ if page j points to page i.
Transition Matrix
Normalize the adjacency matrix so that it becomes a stochastic matrix (each column sums to 1).
$P_{ij}$: probability of arriving at page i from page j.
P is called a stochastic matrix or a transition matrix.
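A minimal sketch of building P from a toy adjacency matrix (note that a real web graph also needs special handling for dangling pages with no out-links, which would give a zero column):

```python
import numpy as np

# A[i, j] = 1 if page j points to page i (the convention from the previous slide)
A = np.array([[0., 0., 1., 0.],
              [1., 0., 0., 0.],
              [1., 1., 0., 1.],
              [1., 1., 0., 0.]])

out_degree = A.sum(axis=0)     # number of out-links of each page j
P = A / out_degree             # divide each column by its out-degree
print(P.sum(axis=0))           # [1. 1. 1. 1.] -- column-stochastic
```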
Random walk: step 1
Random walk through the transition matrix.
Start from $x_t = [1, 0, 0, 0]^T$ (any initialization can be used).
Random walk: step 1
Random walk through the transition matrix: $x_{t+1} = P x_t$
Random walk: step 2
Random walk through the transition matrix.
Random walk: step 2
Random walk through the transition matrix: $x_{t+2} = P x_{t+1}$
PageRank (convergence)
PageRank Algorithm:
Start from an initial vector x with $\sum_i x_i = 1$ (the initial distribution).
For t = 1, 2, ...: $x_{t+1} = P x_t$.
Each $x_t$ is a probability distribution (sums to 1).
The iteration converges to a stationary distribution $\pi$ with $\pi = P\pi$ if P satisfies the following two conditions:
1. P is irreducible: for all i, j, there exists some t such that $(P^t)_{ij} > 0$.
2. P is aperiodic: for all i, j, $\gcd\{t : (P^t)_{ij} > 0\} = 1$.
$\pi$ is the unique right eigenvector of P with eigenvalue 1 ($\pi$ is not a right singular vector, because P is not symmetric).
PageRank
How do we guarantee convergence? Add the possibility of jumping to a random node with small probability $1 - \alpha$; this gives the commonly used PageRank:
$\pi = (\alpha P + (1 - \alpha) v e^T)\,\pi$
$v = \frac{1}{n} e = [\frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n}]^T$ is commonly used.
Personalized PageRank: $v = e_i$.
PageRank: Summary
Input: transition matrix P and personalization vector v.
1. Initialize $x_i^{(0)} = \frac{1}{n}$ for i = 1, 2, ..., n.
2. For t = 1, 2, ...: $x^{(t+1)} \leftarrow \alpha P x^{(t)} + (1 - \alpha) v$.
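Putting it together, a minimal power-iteration sketch (the function name, toy matrix, and fixed iteration count are illustrative; in practice one iterates until $\|x^{(t+1)} - x^{(t)}\|$ is small):

```python
import numpy as np

def pagerank(P, alpha=0.85, v=None, iters=100):
    """Power iteration x <- alpha * P x + (1 - alpha) * v."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n) if v is None else v   # uniform teleport by default
    x = np.full(n, 1.0 / n)                       # x_i^(0) = 1/n
    for _ in range(iters):
        x = alpha * (P @ x) + (1 - alpha) * v
    return x

# the toy column-stochastic transition matrix from the earlier sketch
P = np.array([[0.,  0.,  1., 0.],
              [1/3, 0.,  0., 0.],
              [1/3, 0.5, 0., 1.],
              [1/3, 0.5, 0., 0.]])
print(pagerank(P))                                # a probability distribution
```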
Hubs and Authorities
Proposed by Kleinberg (1999).
Also identifies important websites on the Internet.
Main idea: two types of scores:
authority: a web page with authoritative content; it is pointed to by many hub pages.
hub: a web page pointing to many authoritative web pages.
Let $h \in \mathbb{R}^n$ be the hub scores and $a \in \mathbb{R}^n$ be the authority scores for n web pages.
Initialize $h = [1, 1, \dots, 1]^T$, $a = [1, 1, \dots, 1]^T$.
$M \in \mathbb{R}^{n \times n}$ is the network graph: $M_{ij} = 1$ if page i links to page j, and $M_{ij} = 0$ otherwise.
Hubs and Authorities
Authority of a page: count the in-links to each page, weighted by hub scores:
$a_i = \sum_{j=1}^n M_{ji} h_j$, i.e., $a = M^T h$
Hubs and Authorities
A hub page should link to many pages with high authority: a page's hub value is the sum of the authority scores of all the pages it links to:
$h_i = \sum_{j=1}^n M_{ij} a_j$, i.e., $h = M a$
Hubs and Authorities
Re-compute the authority: each page's new authority score is the sum of the hub scores of the pages pointing to it: $a = M^T h$
Hubs and Authorities
Normalize a and h after each iteration. After infinitely many iterations:
$a \propto (M^T M)^{\infty} M^T \mathbf{1}, \qquad h \propto (M M^T)^{\infty} \mathbf{1}$
Therefore:
the authority score a is the leading eigenvector of $M^T M$, i.e., the leading right singular vector of M;
the hub score h is the leading eigenvector of $M M^T$, i.e., the leading left singular vector of M.
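A minimal sketch of the full HITS iteration (the toy link matrix and function name are illustrative):

```python
import numpy as np

def hits(M, iters=100):
    """HITS: alternate a <- M^T h and h <- M a, normalizing each step."""
    h = np.ones(M.shape[0])
    for _ in range(iters):
        a = M.T @ h
        a /= np.linalg.norm(a)    # normalize authority scores
        h = M @ a
        h /= np.linalg.norm(h)    # normalize hub scores
    return a, h

# M[i, j] = 1 if page i links to page j
M = np.array([[0., 1., 1., 0.],
              [0., 0., 1., 0.],
              [1., 0., 0., 1.],
              [0., 0., 1., 0.]])
a, h = hits(M)
print(a, h)                       # leading singular vectors of M (up to sign)
```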
Coming up
Linear systems, regression.
Questions?