STA141C: Big Data & High Performance Statistical Computing

Lecture 6: Numerical Linear Algebra: Applications in Machine Learning
Cho-Jui Hsieh, UC Davis
April 27, 2017

Principal Component Analysis

Principal Component Analysis (PCA)
Data matrices can be big. Example: bag-of-words model.
Each document is represented by a d-dimensional vector x, where $x_i$ is the number of occurrences of word i.
Number of features = number of potential words (e.g., 10,000).

Feature generation for documents
Bag of n-gram features (n = 2): 10,000 words give $10{,}000^2$ potential features.

Data Matrix (documents)
Use the bag-of-words matrix or its normalized version (TF-IDF) for a dataset D:
$$ \mathrm{tfidf}(doc, word, D) = \mathrm{tf}(doc, word) \times \mathrm{idf}(word, D) $$
tf(doc, word): term frequency = (word count in the document) / (total number of terms in the document)
idf(word, D): inverse document frequency = log((number of documents) / (number of documents containing this word))
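
To make the weighting concrete, here is a minimal Python sketch of the TF-IDF computation above, assuming the corpus is given as a list of token lists; the tfidf helper and the toy documents are illustrative, not from the lecture.

```python
# Minimal TF-IDF sketch (illustrative assumption: docs is a list of token lists).
import math
from collections import Counter

def tfidf(docs):
    n_docs = len(docs)
    # document frequency: number of documents containing each word
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in df}
    features = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # tf(doc, word) * idf(word, D) for each word in the document
        features.append({w: (c / total) * idf[w] for w, c in counts.items()})
    return features

docs = [["big", "data", "course"], ["big", "matrix", "data", "data"]]
print(tfidf(docs))
```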

PCA: Motivation
Data can have huge dimensionality:
Reuters text collection (rcv1): 677,399 documents, 47,236 features (words)
PubMed abstract collection: 8,200,000 documents, 141,043 features (words)
Can we find a low-dimensional representation for each document?
Enables many learning algorithms to run efficiently
Sometimes achieves better prediction performance (de-noising)
Helps visualize the data

PCA: Motivation
Orthogonal projection of the data onto a lower-dimensional linear subspace that:
Maximizes the variance of the projected data (preserving as much information as possible)
Minimizes the reconstruction error

PCA: Formulation
Given the data $x_1, \ldots, x_n \in \mathbb{R}^d$, compute the principal vector w by
$$ w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} \left( w^T x_i - w^T \bar{x} \right)^2, $$
where $\bar{x} = \sum_i x_i / n$ is the mean.
First, shift the data so that $\hat{x}_i = x_i - \bar{x}$; then
$$ w = \arg\max_{\|w\|=1} \frac{1}{n} \sum_{i=1}^{n} (w^T \hat{x}_i)^2 = \arg\max_{\|w\|=1} \frac{1}{n} w^T \hat{X} \hat{X}^T w, $$
where each column of $\hat{X}$ is $\hat{x}_i$.
The first principal component w is the leading eigenvector of $\hat{X}\hat{X}^T$ (the eigenvector corresponding to the largest eigenvalue).
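
As a quick illustration of the formulation, the sketch below computes the first principal direction as the leading eigenvector of $\frac{1}{n}\hat{X}\hat{X}^T$ on random data; the data and variable names are assumptions for the example, not the lecture's code.

```python
# Minimal sketch: first principal component via the eigendecomposition of
# (1/n) X_hat X_hat^T, where columns of X_hat are centered data points.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))               # d = 5 features, n = 100 samples (columns)
X_hat = X - X.mean(axis=1, keepdims=True)   # subtract the mean of each feature

C = X_hat @ X_hat.T / X.shape[1]            # (1/n) X_hat X_hat^T
eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
w = eigvecs[:, -1]                          # leading eigenvector = first principal direction
print(eigvals[-1], w)
```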

PCA: Formulation
Second principal component $w_2$: perpendicular to $w_1$, again with the largest variance; it is the eigenvector corresponding to the second-largest eigenvalue.
Top k principal components $w_1, \ldots, w_k$: the top k eigenvectors, spanning the k-dimensional subspace with the largest variance:
$$ W = \arg\max_{W \in \mathbb{R}^{d \times k},\, W^T W = I} \; \sum_{r=1}^{k} \frac{1}{n} w_r^T \hat{X} \hat{X}^T w_r $$

PCA: illustration

PCA: Computation
PCA: top-k eigenvectors of $\hat{X}\hat{X}^T$.
If $\hat{X} = U \Sigma V^T$, the principal components are $U_k$ (the top-k left singular vectors of $\hat{X}$).
Projection of $\hat{X}$ onto $U_k$: $U_k^T \hat{X} = \Sigma_k V_k^T$ (a k-by-n matrix); each column is the k-dimensional feature vector for one example.
PCA can be computed in two ways:
Top-k SVD of $\hat{X}$
Top-k SVD of $\hat{X}\hat{X}^T$ (explicitly form this matrix only when d is small)
Either way, we need a large-scale SVD solver for dense or sparse matrices.
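
A minimal sketch of the first route, assuming SciPy's truncated SVD solver (svds) and a matrix whose columns are already centered examples; the sizes and variable names are made up for illustration.

```python
# Minimal sketch: top-k PCA via truncated SVD of X_hat, without forming
# X_hat X_hat^T explicitly.
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
X_hat = rng.normal(size=(1000, 200))   # d = 1000 features, n = 200 examples (columns), assumed centered

k = 10
U, s, Vt = svds(X_hat, k=k)            # top-k singular triplets (not sorted)
order = np.argsort(s)[::-1]            # sort by singular value, descending
U_k, s_k, Vt_k = U[:, order], s[order], Vt[order, :]

Z = np.diag(s_k) @ Vt_k                # k-by-n projected features: U_k^T X_hat = Sigma_k V_k^T
print(Z.shape)
```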

Word2vec: Learning Word Representations

Word2vec: Motivation
Goal: understand the meaning of a word.
Given a large text corpus, how can we learn low-dimensional features to represent each word?
Skip-gram model: for each word $w_i$, define its contexts as the words surrounding it in an L-sized window:
$$ \ldots, w_{i-L-2}, w_{i-L-1}, \underbrace{w_{i-L}, \ldots, w_{i-1}}_{\text{contexts of } w_i}, \; w_i, \; \underbrace{w_{i+1}, \ldots, w_{i+L}}_{\text{contexts of } w_i}, w_{i+L+1}, \ldots $$
Collect all (word, context) pairs into a set denoted by D.
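
A small sketch of collecting the (word, context) pairs D with an L-sized window; the skipgram_pairs helper and the toy sentence are illustrative assumptions, not lecture code.

```python
# Minimal sketch: extract skip-gram (word, context) pairs D from a tokenized
# corpus with window size L.
def skipgram_pairs(tokens, L=2):
    pairs = []
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - L), min(len(tokens), i + L + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((w, tokens[j]))   # (word, context)
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
D = skipgram_pairs(tokens, L=2)
print(D[:6])
```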

Skip-gram model (Figure from http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)

Use the bag-of-words model
Idea 1: use the bag-of-words model to describe each word.
Assume we have context words $c_1, \ldots, c_d$ in the corpus, and compute
$\#(w, c_i)$ := number of times the pair $(w, c_i)$ appears in D.
For each word w, form a d-dimensional (sparse) vector describing w:
$[\#(w, c_1), \ldots, \#(w, c_d)]$

PMI/PPMI Representation
Similar to TF-IDF, we need to account for the frequency of each word and each context.
Instead of using the raw co-occurrence count $\#(w, c)$, define the pointwise mutual information:
$$ \mathrm{PMI}(w, c) = \log \frac{\hat{P}(w, c)}{\hat{P}(w)\,\hat{P}(c)} = \log \frac{\#(w, c)\,|D|}{\#(w)\,\#(c)}, $$
$\#(w) = \sum_c \#(w, c)$: number of times word w occurred in D
$\#(c) = \sum_w \#(w, c)$: number of times context c occurred in D
$|D|$: number of pairs in D
Positive PMI (PPMI) usually achieves better performance:
$$ \mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0) $$
$M_{\mathrm{PPMI}}$: an n-by-d word feature matrix; each row is a word and each column is a context.
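
A minimal sketch of building $M_{\mathrm{PPMI}}$ from the pair collection D, assuming the skipgram_pairs helper from the sketch above; the ppmi_matrix name is hypothetical.

```python
# Minimal sketch: PPMI matrix from a list of (word, context) pairs.
import numpy as np

def ppmi_matrix(pairs):
    words = sorted({w for w, _ in pairs})
    contexts = sorted({c for _, c in pairs})
    wi = {w: i for i, w in enumerate(words)}
    ci = {c: j for j, c in enumerate(contexts)}

    counts = np.zeros((len(words), len(contexts)))
    for w, c in pairs:
        counts[wi[w], ci[c]] += 1      # #(w, c)

    total = counts.sum()               # |D|
    nw = counts.sum(axis=1)            # #(w)
    nc = counts.sum(axis=0)            # #(c)
    with np.errstate(divide="ignore"):
        pmi = np.log(counts * total / np.outer(nw, nc))
    return np.maximum(pmi, 0), words, contexts   # PPMI = max(PMI, 0)
```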

PPMI Matrix

Low-dimensional embedding (Word2vec)
Advantages of extracting low-dimensional dense representations:
Improved computational efficiency for end applications
Better visualization
Better performance (?)
Perform PCA/SVD on the sparse feature matrix: $M_{\mathrm{PPMI}} \approx U_k \Sigma_k V_k^T$.
Then $W^{\mathrm{SVD}} = U_k \Sigma_k$ gives the low-dimensional representation of each word (each row is a k-dimensional feature vector for a word).
This is one of the word2vec-style algorithms.
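
Continuing the sketches above, the SVD-based embedding $W^{\mathrm{SVD}} = U_k \Sigma_k$ can be obtained as follows; this assumes the toy D and the hypothetical ppmi_matrix helper from the earlier sketches, and k is an arbitrary embedding size.

```python
# Minimal sketch: SVD-based word embeddings W = U_k Sigma_k from the PPMI matrix.
import numpy as np

M_ppmi, words, contexts = ppmi_matrix(D)   # from the sketches above
k = min(2, min(M_ppmi.shape))              # tiny k for this toy corpus
U, s, Vt = np.linalg.svd(M_ppmi, full_matrices=False)
W_svd = U[:, :k] * s[:k]                   # each row: k-dimensional embedding of a word
print(dict(zip(words, W_svd.round(2))))
```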

Generalized Low-rank Embedding
The SVD basis minimizes
$$ \min_{W, V} \; \| M_{\mathrm{PPMI}} - W V^T \|_F^2 $$
Extensions (GloVe, Google word2vec, ...):
Use a different loss function (instead of the Frobenius norm)
Negative sampling (give less weight to the zeros in $M_{\mathrm{PPMI}}$)
Add bias terms: $M_{\mathrm{PPMI}} \approx W V^T + b_w e^T + e b_c^T$
Details and comparisons:
Improving Distributional Similarity with Lessons Learned from Word Embeddings, Levy et al., ACL 2015.
GloVe: Global Vectors for Word Representation, Pennington et al., EMNLP 2014.

Results The low-dimensional embeddings are (often) meaningful: (Figure from https://www.tensorflow.org/tutorials/word2vec)

PageRank/Hubs and Authorities

Ranking Websites
Text-based ranking systems (the dominant approach in the early 90s):
Compute the similarity between the query and websites (documents)
Keywords are a very limited way to express a complex information need
We also need to rank websites by popularity, authority, ...
PageRank: developed by Brin and Page (1999)
Determines authority and popularity from hyperlinks

PageRank Main idea: estimate the ranking of websites by the link structure.

Topology of Websites
Represent the hyperlinks as a directed graph:
the adjacency matrix A has $A_{ij} = 1$ if page j points to page i.

Transition Matrix
Normalize the adjacency matrix so that it becomes a stochastic matrix (each column sums to 1).
$P_{ij}$: probability of arriving at page i from page j
P: a stochastic matrix, or transition matrix
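
A minimal sketch of turning an adjacency matrix into the column-stochastic transition matrix P; the small 4-page graph is made up for illustration, and dangling pages with no out-links would need extra handling.

```python
# Minimal sketch: column-stochastic transition matrix P from adjacency A,
# where A[i, j] = 1 if page j links to page i.
import numpy as np

A = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 1, 0, 0]], dtype=float)

out_degree = A.sum(axis=0)   # column sums = out-degree of each page (all nonzero here)
P = A / out_degree           # divide each column by its out-degree
print(P.sum(axis=0))         # each column now sums to 1
```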

Random walk: step 1
Random walk through the transition matrix: start from $x_0 = [1, 0, 0, 0]$ (any initialization can be used) and compute $x_{t+1} = P x_t$.

Random walk: step 2
Continue the random walk through the transition matrix: $x_{t+2} = P x_{t+1}$.

PageRank (convergence)
PageRank algorithm: start from an initial vector x with $\sum_i x_i = 1$ (the initial distribution), and for t = 1, 2, ... compute $x_{t+1} = P x_t$.
Each $x_t$ is a probability distribution (sums to 1).
The iteration converges to a stationary distribution $\pi$ with $\pi = P\pi$ if P satisfies the following two conditions:
1. P is irreducible: for all i, j, there exists some t such that $(P^t)_{ij} > 0$
2. P is aperiodic: for all i, j, $\gcd\{t : (P^t)_{ij} > 0\} = 1$
$\pi$ is the unique right eigenvector of P with eigenvalue 1; it is not a right singular vector of P, because P is not symmetric.
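
A sketch of this power iteration, assuming the transition matrix P from the earlier sketch; the tolerance and iteration cap are arbitrary choices for illustration.

```python
# Minimal sketch: power iteration to the stationary distribution pi = P pi.
import numpy as np

def power_iteration(P, tol=1e-10, max_iter=1000):
    n = P.shape[0]
    x = np.full(n, 1.0 / n)            # start from the uniform distribution
    for _ in range(max_iter):
        x_new = P @ x                   # x_{t+1} = P x_t
        if np.linalg.norm(x_new - x, 1) < tol:
            break
        x = x_new
    return x_new

# pi = power_iteration(P); print(pi, P @ pi)   # pi should satisfy pi = P pi
```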

PageRank
How to guarantee convergence? Add the possibility of jumping to a random node with a small probability $(1 - \alpha)$; this gives the commonly used PageRank:
$$ \pi = \big( \alpha P + (1-\alpha) v e^T \big) \pi $$
$v = \frac{1}{n} e = [\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}]^T$ is commonly used.
Personalized PageRank: $v = e_i$.

PageRank: Summary
Input: transition matrix P and personalization vector v
1. Initialize $x^{(0)}_i = \frac{1}{n}$ for i = 1, 2, ..., n
2. for t = 1, 2, ... do
   $x^{(t+1)} \leftarrow \alpha P x^{(t)} + (1-\alpha) v$
3. end for
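
The summary maps directly to a few lines of NumPy; this sketch assumes a column-stochastic P and uses $\alpha = 0.85$ as a common (not lecture-specified) choice.

```python
# Minimal sketch of the damped PageRank iteration x <- alpha*P*x + (1-alpha)*v.
import numpy as np

def pagerank(P, alpha=0.85, v=None, tol=1e-10, max_iter=1000):
    n = P.shape[0]
    if v is None:
        v = np.full(n, 1.0 / n)        # uniform personalization vector
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        x_new = alpha * (P @ x) + (1 - alpha) * v
        if np.linalg.norm(x_new - x, 1) < tol:
            break
        x = x_new
    return x_new

# scores = pagerank(P); print(scores.argsort()[::-1])   # pages ranked by score
```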

Hubs and Authorities
Proposed by Kleinberg (1999); also identifies important websites on the Internet.
Main idea: two types of scores.
authority: a web page with authoritative content; it is pointed to by many hub pages
hub: a web page pointing to many authoritative web pages
Let $h \in \mathbb{R}^n$ be the hub scores and $a \in \mathbb{R}^n$ be the authority scores for n web pages.
Initialize $h = [1, 1, \ldots, 1]$ and $a = [1, 1, \ldots, 1]$.
$M \in \mathbb{R}^{n \times n}$ is the network graph: $M_{ij} = 1$ if page i links to page j, and $M_{ij} = 0$ otherwise.

Hubs and Authorities
Authority of a page: count the (hub-weighted) in-links to each page:
$$ a_i = \sum_{j=1}^{n} M_{ji} h_j, \qquad a = M^T h $$

Hubs and Authorities
A hub page should link to many pages with high authority: a page's hub value is the sum of the authority scores of all the pages it links to:
$$ h_i = \sum_{j=1}^{n} M_{ij} a_j, \qquad h = M a $$

Hubs and Authorities
Re-compute authority: each page's new authority score equals the sum of the hub scores of the pages that point to it:
$$ a = M^T h $$

Hubs and Authorities
Normalize a and h after each iteration. After a large number of iterations (in the limit),
$$ a \propto (M^T M)^k M^T \mathbf{1}, \qquad h \propto (M M^T)^k \mathbf{1} \quad (k \to \infty). $$
Therefore:
Authority score a is the leading eigenvector of $M^T M$, i.e. the leading right singular vector of M.
Hub score h is the leading eigenvector of $M M^T$, i.e. the leading left singular vector of M.
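
A minimal sketch of the HITS iteration with per-step normalization; the small link matrix M is made up for illustration.

```python
# Minimal sketch: HITS iteration (a <- M^T h, h <- M a, normalize each step),
# with M[i, j] = 1 if page i links to page j.
import numpy as np

def hits(M, n_iter=100):
    n = M.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(n_iter):
        a = M.T @ h                    # authority: sum of hub scores of in-links
        a /= np.linalg.norm(a)
        h = M @ a                      # hub: sum of authority scores of out-links
        h /= np.linalg.norm(h)
    return a, h

M = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1]], dtype=float)
a, h = hits(M)
print("authorities:", a.round(3), "hubs:", h.round(3))
```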

Coming up: Linear Systems, Regression.
Questions?