Machine Learning and Data Mining. Dimensionality Reduction; PCA & SVD. Prof. Alexander Ihler


Motivation. High-dimensional data: images of faces, text from articles, all S&P 500 stocks. Can we describe them in a simpler way? Embedding: place data in R^d such that similar data are close. Ex: embedding images in 2D; embedding movies in 2D. [Figure: movies (The Color Purple, Amadeus, Braveheart, Sense and Sensibility, Ocean's 11, Lethal Weapon, The Princess Diaries, The Lion King, Independence Day, Dumb and Dumber) placed on a 2-D plane with a serious-to-escapist axis and a "chick flicks?" annotation.]

Motivation. High-dimensional data: images of faces, text from articles, all S&P 500 stocks. Can we describe them in a simpler way? Embedding: place data in R^d such that similar data are close. Ex: the S&P 500 is a vector of 500 (change in) values per day. But there is lots of structure: some elements tend to change together, so maybe we only need a few values to approximate it ("tech stocks up 2x, manufacturing up 1.5x, ..."). How can we access that structure?

Dimensionality reduction. Ex: data with two real values [x1, x2]. We'd like to describe each point using only one value [z1]. We'll communicate a model to convert: [x1, x2] ~ f(z1). Ex: a linear function f(z): [x1, x2] = µ + z * v = µ + z * [v1, v2]. Here µ and v are the same for all data points (communicated once), and z tells us the closest point on the line along v to the original point [x1, x2]. [Figure: two scatter plots of x2 vs. x1; the right one shows each point x(i) approximated by its projection z(i) * v + µ onto the line through µ along v.]
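
As a concrete sketch of this linear model (the toy data and all variable names here are my own, not from the slides), each 2-D point is encoded as a single coefficient z and decoded back as µ + z * v:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1)) * [3., 1.5] + rng.normal(scale=0.2, size=(100, 2))   # 2-D data lying near a line

    mu = X.mean(axis=0)                                  # shared offset µ (communicated once)
    v = np.array([3., 1.5]); v /= np.linalg.norm(v)      # shared unit-length direction v

    z = (X - mu) @ v                                     # encode: one value per point
    Xhat = mu + np.outer(z, v)                           # decode: closest point on the line through µ along v

    print(np.mean(np.sum((X - Xhat)**2, axis=1)))        # small average reconstruction error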

Principal Components Analysis. How should we find v? Assume X is zero mean (or subtract the mean first). Find v as the direction of maximum spread (variance); the solution is the eigenvector with the largest eigenvalue. Project X onto v: z = X v. The variance of the projected points is (1/m) z^T z = (1/m) v^T X^T X v, and the best direction v, maximizing this over unit vectors, is the largest eigenvector of X^T X. Equivalently, this v also leaves the smallest residual variance (the "error"): maximizing the variance captured along v is the same as minimizing the squared residuals off the line. [Figure: the data scatter with the principal direction v drawn through the mean and the residuals from each point to the line.]
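
A small numerical check of this claim (the synthetic data and names are mine, not from the slides): the largest eigenvector of the data covariance captures more variance than any random direction.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0, 0], [[9, 5], [5, 4]], size=500)
    X0 = X - X.mean(axis=0)                      # zero-mean data
    m = X0.shape[0]

    S = X0.T @ X0 / m                            # data covariance
    evals, evecs = np.linalg.eigh(S)             # eigh: for symmetric matrices
    v = evecs[:, np.argmax(evals)]               # largest eigenvector

    def proj_var(direction):
        z = X0 @ direction                       # project onto a unit direction
        return z.var()

    print("variance along top eigenvector:", proj_var(v))
    for _ in range(3):                           # random unit directions capture less variance
        r = rng.normal(size=2)
        print("variance along random direction:", proj_var(r / np.linalg.norm(r)))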

Another interpretation. The data covariance Σ = (1/m) Σ_i (x_i - µ)(x_i - µ)^T describes the spread of the data; we can draw it as an ellipse. The Gaussian is N(x; µ, Σ) ∝ exp( -½ (x - µ)^T Σ^{-1} (x - µ) ), and the ellipse shows a contour where (x - µ)^T Σ^{-1} (x - µ) = constant. [Figure: data scatter with the constant-density ellipse of the fitted Gaussian overlaid.]

Geometry of the Gaussian. The oval shows a constant value of Δ² = (x - µ)^T Σ^{-1} (x - µ). Write Σ in terms of its eigenvectors: Σ = V D V^T, with V orthonormal and D = diag(λ1, ..., λd). Then Σ^{-1} = V D^{-1} V^T, so Δ² = Σ_i (v_i^T (x - µ))² / λ_i: the ellipse axes point along the eigenvectors v_i, with lengths proportional to √λ_i.
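
A small numerical illustration of this geometry (the example covariance is my own choice): the eigenvectors of Σ give the directions of the ellipse axes, and the square roots of the eigenvalues give their relative lengths.

    import numpy as np

    Sigma = np.array([[9., 5.],
                      [5., 4.]])                  # example covariance matrix
    lam, V = np.linalg.eigh(Sigma)                # Sigma = V diag(lam) V^T

    print(np.allclose(V @ np.diag(lam) @ V.T, Sigma))   # True
    for i in range(2):
        print("axis direction:", V[:, i], " half-length ~", np.sqrt(lam[i]))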

PCA representation. Subtract the data mean from each point and (typically) scale each dimension by its variance; this helps pay less attention to the magnitude of each variable. Compute the covariance matrix S = (1/m) Σ_i (x_i - µ)(x_i - µ)^T, then compute the k largest eigenvectors of S, S = V D V^T:

    import numpy as np

    mu = np.mean( X, axis=0, keepdims=True )   # find mean over data points
    X0 = X - mu                                # zero-center the data
    m  = X0.shape[0]                           # number of data points
    S  = X0.T.dot( X0 ) / m                    # S = np.cov( X.T ), data covariance
    D,V = np.linalg.eig( S )                   # find eigenvalues/vectors: can be slow!
    pi = np.argsort(D)[::-1]                   # sort eigenvalues largest to smallest
    D,V = D[pi], V[:,pi]                       # reorder eigenvalues & eigenvectors
    D,V = D[0:k], V[:,0:k]                     # and keep the k largest
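
Given the k eigenvectors above, the coefficients and the reconstruction each take one line; a minimal continuation of that snippet (the names Z, Xhat, and err are mine, not from the slides):

    Z    = X0.dot( V )                         # coefficients: project centered data onto the k directions
    Xhat = Z.dot( V.T ) + mu                   # reconstruct: map back to the original space, re-add the mean
    err  = np.mean( np.sum( (X - Xhat)**2, axis=1 ) )   # mean squared reconstruction error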

Singular Value Decomposition. An alternative way to calculate the same thing (still subtract the mean first). Decompose X = U S V^T with U, V orthogonal; then X^T X = V S S V^T = V D V^T and X X^T = U S S U^T = U D U^T. The matrix U S provides the coefficients: x_i = U_{i,1} S_{11} v_1 + U_{i,2} S_{22} v_2 + ... Keeping only the k largest singular values gives the least-squares approximation to X of this form: X (m x n) ≈ U (m x k) S (k x k) V^T (k x n).
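
As a quick numerical check of these relationships (a self-contained sketch; the random data and variable names are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X0 = rng.normal(size=(50, 4))
    X0 = X0 - X0.mean(axis=0)                    # zero-mean data, as assumed above

    U, s, Vh = np.linalg.svd(X0, full_matrices=False)

    # X^T X = V S S V^T: V holds the eigenvectors of X^T X,
    # and the squared singular values are its eigenvalues
    print(np.allclose(X0.T @ X0, Vh.T @ np.diag(s**2) @ Vh))

    # U*S provides the coefficients of each row of X0 in the basis V
    print(np.allclose(U * s, X0 @ Vh.T))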

SVD for PCA. Subtract the data mean from each point and (typically) scale each dimension by its variance; this helps pay less attention to the magnitude of each variable. Then compute the SVD of the data matrix:

    import numpy as np
    import scipy.linalg

    mu = np.mean( X, axis=0, keepdims=True )   # find mean over data points
    X0 = X - mu                                # zero-center the data
    U,S,Vh = scipy.linalg.svd( X0, False )     # X0 = U * diag(S) * Vh
    Xhat = U[:,0:k].dot( np.diag(S[0:k]) ).dot( Vh[0:k,:] )   # approx X0 using k largest eigendirections

Some uses of latent spaces. Data compression: a cheaper, low-dimensional representation. Noise removal: model the data as simple "true" data plus noise. Supervised learning, e.g. regression: remove collinear / nearly collinear features; reduce feature dimension to combat overfitting.

Applications of SVD. Eigen-faces: represent image data (faces) using PCA. LSI / topic models: represent text data (bag of words) using PCA. Collaborative filtering: represent a ratings data matrix using PCA. And more.

Eigen-faces. "Eigen-X" = represent X using PCA. Ex: Viola-Jones data set, 24 x 24 images of faces = 576-dimensional measurements, stacked into an m x n data matrix X.

Eigen-faces. Take the first K PCA components: X (m x n) ≈ U (m x k) S (k x k) V^T (k x n). [Figure: the mean face and the first few eigenvector images V[0,:], V[1,:], V[2,:], V[3,:].]

Eigen-faces. Take the first K PCA components and project the data onto the first k dimensions. [Figure: the mean face and directions 1-4, plus reconstructions of a face x_i using k = 5, 10, 50 components.]

Eigen-faces. Take the first K PCA components and project the data onto the first k dimensions. [Figure: the face data scattered against directions 1 and 2.]

Text representations. Bag of words: remember word counts but not order. Example: "Rain and chilly weather didn't keep thousands of paradegoers from camping out Friday night for the 111th Tournament of Roses. Spirits were high among the street party crowd as they set up for curbside seats for today's parade. 'I want to party all night,' said Tyne Gaudielle, 15, of Glendale, who spent the last night of the year along Colorado Boulevard with a group of friends. Whether they came for the partying or the parade, campers were in for a long night. Rain continued into the evening and temperatures were expected to dip down into the low 0s."

Text representations. Bag of words: remember word counts but not order. Example (### nyt/2000-01-01.0015.txt): the same article as before, reduced to its list of tokens: rain, chilly, weather, didn, keep, thousands, paradegoers, camping, out, friday, night, 111th, tournament, roses, spirits, high, among, street, ...

Text representations. Bag of words: remember word counts but not order. Example: a numbered VOCABULARY (0001 ability, 0002 able, 0003 accept, 0004 accepted, 0005 according, 0006 account, 0007 accounts, 0008 accused, 0009 act, 0010 acting, 0011 action, 0012 active, ...) and the observed data (text docs) stored as a sparse list of (doc #, word #, count) triples, one row per word appearing in a document.
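
A minimal sketch of building such a doc-by-word count matrix from raw text (the tiny corpus and variable names here are my own, not from the slides):

    import numpy as np
    from collections import Counter

    docs = ["rain and chilly weather", "party all night", "rain rain go away"]
    tokens = [d.lower().split() for d in docs]              # crude tokenization
    vocab = sorted(set(w for t in tokens for w in t))       # numbered vocabulary
    index = {w: j for j, w in enumerate(vocab)}

    X = np.zeros((len(docs), len(vocab)), dtype=int)        # docs x words count matrix
    for i, t in enumerate(tokens):
        for w, c in Counter(t).items():
            X[i, index[w]] = c                              # counts only; word order is forgotten

    print(vocab)
    print(X)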

Latent Semantic Indexing (LSI). PCA for text data: create a giant matrix of word counts in documents, where word j is feature x_j and document i is data example i. The matrix is huge and mostly zeros. Typically normalize rows to sum to one, to control for short documents; typically don't subtract the mean or normalize columns by variance; the counts might be transformed in some way (log, etc.). PCA on this matrix provides a new representation, useful for document comparison and fuzzy search ("concept" matching instead of word matching). [Figure: sketch of the doc-by-word matrix, with entry (i, j) indicating whether word j appears in doc i.]
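
A minimal sketch of this pipeline on a toy count matrix (the row normalization follows the slide; the data and the choice k = 2 are illustrative):

    import numpy as np

    X = np.array([[2., 1., 0., 0.],            # toy doc-by-word count matrix
                  [1., 0., 1., 0.],
                  [0., 0., 2., 3.]])
    X = X / X.sum(axis=1, keepdims=True)       # rows sum to one (control for doc length)

    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    k = 2
    doc_vecs = U[:, :k] * s[:k]                # k-dimensional "concept" representation per doc

    # compare two documents by cosine similarity in the latent space
    a, b = doc_vecs[0], doc_vecs[1]
    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))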

Matrices are big, but data are sparse. Typical example: number of docs D ~ 10^6, number of unique words in the vocab W ~ 10^5; full storage required ~ 10^11 entries, sparse storage required ~ 10^9. [Figure: spy plot of the D x W matrix (# docs x # words); it looks dense, but that's just the plotting.] Each entry is non-negative, typically integer / count data.
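
A sketch of the storage difference using SciPy's sparse format (the sizes are scaled down so it runs instantly; only the ratio matters, and the random fill pattern is illustrative):

    import numpy as np
    import scipy.sparse as sp

    D, W, nnz = 10_000, 1_000, 100_000               # scaled-down docs, words, nonzero counts
    rows = np.random.randint(0, D, size=nnz)
    cols = np.random.randint(0, W, size=nnz)
    vals = np.random.randint(1, 5, size=nnz)

    X = sp.csr_matrix((vals, (rows, cols)), shape=(D, W))
    dense_bytes  = D * W * 8                         # full matrix of float64
    sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
    print(dense_bytes, sparse_bytes)                 # sparse is orders of magnitude smaller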

Latent Semantic Indexing (LSI). What do the principal components look like? Principal component 1 (top words, in decreasing weight): genetic, gene, snp, disease, genome_wide, cell, variant, risk, population, analysis, expression, gene_expression, gwas, control, human, cancer.

Latent Semantic Indexing (LSI). What do the principal components look like? Besides component 1 above, principal component 2 (top words by magnitude; some weights are negative): snp, cell (negative), variant, risk, gwas, population, genome_wide, genetic, loci, mir (negative), expression (negative), allele, schizophrenia, disease, mirnas (negative), protein (negative). Q: But what does a negative weight such as "-0.196 cell" mean?

Collaborative filtering (Netflix). From Y. Koren of the BellKor team. [Figure: a users x movies ratings matrix with entries from 1 to 5 and many missing values ("?").] Factor the ratings matrix as X (N x D) ≈ U (N x K) S (K x K) V^T (K x D).

Latent space models. From Y. Koren of the BellKor team. Model the ratings matrix as user and movie positions: infer the latent values from the known ratings, then extrapolate to the unrated entries. [Figure: the users x items ratings matrix approximated by the product of a users x K factor matrix and a K x items factor matrix.]

Latent space models. From Y. Koren of the BellKor team. [Figure: movies embedded in a 2-D latent space with a serious-to-escapist axis and a "chick flicks?" annotation: The Color Purple, Amadeus, Braveheart, Sense and Sensibility, Ocean's 11, Lethal Weapon, The Princess Diaries, The Lion King, Independence Day, Dumb and Dumber.]

Some SVD dimensions (see timelydevelopment.com).
Dimension 1: Offbeat / Dark-Comedy (Lost in Translation, The Royal Tenenbaums, Dogville, Eternal Sunshine of the Spotless Mind, Punch-Drunk Love) vs. Mass-Market / 'Beniffer' movies (Pearl Harbor, Armageddon, The Wedding Planner, Coyote Ugly, Miss Congeniality).
Dimension 2: Good (VeggieTales: Bible Heroes: Lions, The Best of Friends: Season, Felicity: Season, Friends: Season, Friends: Season 5) vs. Twisted (The Saddest Music in the World, Wake Up, I Heart Huckabees, Freddy Got Fingered, House of 1).
Dimension 3: What a 10-year-old boy would watch (Dragon Ball Z: Vol. 17: Super Saiyan, Battle Athletes Victory: Vol.: Spaceward Ho!, Battle Athletes Victory: Vol. 5: No Looking Back, Battle Athletes Victory: Vol. 7: The Last Dance, Battle Athletes Victory: Vol.: Doubt and Conflict) vs. what a liberal woman would watch (Fahrenheit 9/11, The Hours, Going Upriver: The Long War of John Kerry, Sex and the City: Season, Bowling for Columbine).

Latent space models. The latent representation encodes some meaning: what kind of movie is this? Which movies is it similar to? The ratings matrix is full of missing data, so it is hard to take the SVD directly; typically we solve by gradient descent instead. It is an easy algorithm (see the Netflix challenge forum):

    # for user u, movie m, find the kth eigenvector & coefficient by iterating:
    predict_um = U[m,:].dot( V[:,u] )     # predict: vector-vector product
    err = ( rating[u,m] - predict_um )    # find error residual
    V_ku, U_mk = V[k,u], U[m,k]           # make copies for update
    U[m,k] += alpha * err * V_ku          # update our matrices
    V[k,u] += alpha * err * U_mk          # (compare to least-squares gradient)
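
A self-contained sketch wrapping this update in a full training loop on a toy ratings matrix (the data, learning rate, and sweep count are illustrative choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    rating = np.array([[5., 4., 0., 1.],
                       [4., 0., 0., 1.],
                       [1., 1., 0., 5.]])        # users x movies; 0 marks "missing"
    n_users, n_movies, K = rating.shape[0], rating.shape[1], 2

    U = 0.1 * rng.normal(size=(n_movies, K))     # movie factors, as in the snippet above
    V = 0.1 * rng.normal(size=(K, n_users))      # user factors
    alpha = 0.02                                 # learning rate

    for sweep in range(200):                     # repeat the update over all known ratings
        for u, m in np.argwhere(rating > 0):
            for k in range(K):
                predict_um = U[m,:].dot( V[:,u] )
                err = rating[u,m] - predict_um
                V_ku, U_mk = V[k,u], U[m,k]
                U[m,k] += alpha * err * V_ku
                V[k,u] += alpha * err * U_mk

    print(U.dot(V).T[rating > 0])                # reconstructed values at the known ratings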

Nonlinear latent spaces. Latent space: any alternative representation (usually smaller) from which we can (approximately) recover the data. Linear: encode Z = X V^T; decode X ≈ Z V. Ex: auto-encoders use a neural network with few internal nodes, trained to recover the input x (figure from stats.stackexchange.com). Related: word2vec trains a neural network to recover the context of words, then uses the internal hidden-node responses as a vector representation of the word.
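
A minimal sketch of the auto-encoder idea in NumPy (a single linear hidden layer trained by gradient descent to reproduce its input; with a purely linear code it can at best recover the PCA subspace, so this only illustrates the encode/decode structure, and all names and data are my own):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))   # data with low-dimensional structure
    X = X - X.mean(axis=0)

    k, lr = 3, 0.01
    W_enc = 0.1 * rng.normal(size=(10, k))       # encoder weights
    W_dec = 0.1 * rng.normal(size=(k, 10))       # decoder weights

    for step in range(2000):
        Z = X @ W_enc                            # encode: latent representation
        Xhat = Z @ W_dec                         # decode: reconstruction
        R = Xhat - X                             # residual
        grad_dec = Z.T @ R / len(X)              # gradient of mean squared error w.r.t. W_dec
        grad_enc = X.T @ (R @ W_dec.T) / len(X)  # gradient w.r.t. W_enc
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc

    print(np.mean(R ** 2))                       # reconstruction error after training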

Summary. Dimensionality reduction; representation via basis vectors & coefficients; linear decomposition: PCA / eigendecomposition and the singular value decomposition. Examples and data sets: face images, text documents (latent semantic indexing), movie ratings.