Lecture 8: Principal Component Analysis; Kernel PCA


1 Lecture 8: Principal Component Analysis; Kernel PCA Lester Mackey April 23, 2014 Stats 306B: Unsupervised Learning

2 PCA example: digit data (Sta306b, April 19, 2011, Principal components)
130 threes, a subset of 638 such threes and part of the handwritten digit dataset.
Each three is a greyscale image, and the variables $X_j$, $j = 1, \ldots, 256$, are the greyscale values for each pixel.

3 PCA example: digit data
[Figure: plot with axes labeled First Principal Component and Second Principal Component]

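4 PCA example: digit data
Two-component model has the form
$\hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2$,
where $\bar{x}$ is the mean image. [On the slide, the right-hand side is repeated with $\bar{x}$, $v_1$, and $v_2$ drawn as images.]
Here we have displayed the first two principal component directions, $v_1$ and $v_2$, as images.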
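A minimal NumPy sketch of the two-component model, assuming images are stored as rows of a matrix. Random numbers stand in for the 130 threes, and the names `fit_pca` and `two_component_model` are illustrative, not from the lecture.

```python
import numpy as np

def fit_pca(X, k):
    """X is n x d (one observation per row).
    Returns the mean, the top-k directions (d x k), and the scores (n x k)."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    # Rows of Vt are principal component directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T
    scores = Xc @ V          # lambda values for each observation
    return x_bar, V, scores

def two_component_model(x_bar, V, lam):
    """f_hat(lambda) = x_bar + lambda_1 v_1 + lambda_2 v_2."""
    return x_bar + lam[0] * V[:, 0] + lam[1] * V[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 256))   # synthetic stand-in: 130 images, 256 pixels each
x_bar, V, scores = fit_pca(X, k=2)

# Best two-component approximation of the first image.
approx = two_component_model(x_bar, V, scores[0])
```

Setting both λ values to zero returns the mean image, and moving along λ₁ or λ₂ traces out the variation captured by each direction.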

5 PCA in the wild: Eigenfaces
Courtesy: Percy Liang. Turk and Pentland, 1991
d = number of pixels
Each $x_i \in \mathbb{R}^d$ is a face image
$x_{ji}$ = intensity of the j-th pixel in image i
$X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, with $X = (x_1 \cdots x_n)$ and $Z = (z_1 \cdots z_n)$
Idea: $z_i$ is a more meaningful representation of the i-th face than $x_i$
Can use $z_i$ for nearest-neighbor classification
Much faster: $O(dk + nk)$ time instead of $O(dn)$ when $n, d \gg k$
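A rough sketch of nearest-neighbor classification in the reduced space, assuming one face per column as on the slide. The data and the `labels` array are synthetic placeholders, and `classify` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 400, 50, 5
X = rng.normal(size=(d, n))        # synthetic "faces", one image per column
labels = np.arange(n) % 5          # made-up identity labels for illustration

mu = X.mean(axis=1, keepdims=True)
U_full, _, _ = np.linalg.svd(X - mu, full_matrices=False)
U = U_full[:, :k]                  # d x k: top-k principal directions (eigenfaces)
Z = U.T @ (X - mu)                 # k x n: reduced representation z_1, ..., z_n

def classify(x_new):
    """Nearest neighbor in the k-dimensional space."""
    z_new = U.T @ (x_new - mu[:, 0])                    # projection costs O(dk)
    dists = np.sum((Z - z_new[:, None]) ** 2, axis=0)   # comparison costs O(nk)
    return labels[np.argmin(dists)]

pred = classify(X[:, 7])           # query with a training image as a sanity check
```

The one-time SVD is the expensive part; each query afterwards only touches `U` and `Z`, which is where the O(dk + nk) cost comes from.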

6 PCA in the wild: Latent semantic analysis
Courtesy: Percy Liang. Deerwester/Dumais/Harshman, 1990
d = number of words in the vocabulary
Each $x_i \in \mathbb{R}^d$ is a vector of word counts
$x_{ji}$ = frequency of word j in document i
$X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, with $Z = (z_1 \cdots z_n)$
Example entries of X (rows are words, two document columns shown):
  stocks:   2  0
  chairman: 4  1
  the:      [values missing]
  wins:     0  2
  game:     1  3
How to measure similarity between two documents?
$z_1^\top z_2$ is probably better than $x_1^\top x_2$
Applications: information retrieval
Note: no computational savings; original x is already sparse
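A small sketch comparing the two similarity measures. The counts below are made up for illustration (they are not the slide's matrix), and the truncated SVD is applied without centering, which is an assumption of this sketch.

```python
import numpy as np

vocab = ["stocks", "chairman", "wins", "game"]
# Made-up counts: rows are words, columns are documents.
# Documents 0-2 are about finance, documents 3-5 about sports.
X = np.array([
    [2, 0, 2, 0, 0, 0],   # stocks
    [0, 2, 2, 0, 0, 0],   # chairman
    [0, 0, 0, 2, 0, 3],   # wins
    [0, 0, 0, 0, 2, 3],   # game
], dtype=float)

k = 2
U_full, _, _ = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :k]     # d x k
Z = U.T @ X           # k x n latent representation

raw_similarity = X[:, 0] @ X[:, 1]      # documents 0 and 1 share no words
latent_similarity = Z[:, 0] @ Z[:, 1]   # same pair, compared in latent space
cross_topic = Z[:, 0] @ Z[:, 3]         # finance document vs sports document
```

Documents 0 and 1 have no words in common, so their raw inner product is zero, yet both co-occur with document 2 and end up close in the latent space. The finance and sports documents stay orthogonal.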

7 PCA in the wild: Anomaly detection
Courtesy: Percy Liang. Lakhina/Crovella/Diot, '04
$x_{ji}$ = amount of traffic on link j in the network during time interval i
Model assumption: total traffic is a sum of flows along a few paths
Apply PCA: each principal component intuitively represents a path
Anomaly when traffic deviates from the first few principal components
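One way to make "deviates from the first few principal components" concrete is to score each time interval by the norm of its residual after projecting onto the top-k subspace. The sketch below does this on synthetic traffic; the routing matrix, flow sizes, noise level, and injected spike are all assumptions made for illustration, not data from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 20, 200, 3                     # links, time intervals, paths

routing = rng.integers(0, 2, size=(d, k)).astype(float)  # which links each path uses
flows = rng.uniform(1.0, 10.0, size=(k, n))              # traffic per path per interval
X = routing @ flows + 0.01 * rng.normal(size=(d, n))     # link traffic plus small noise

anomaly_time = 120
X[4, anomaly_time] += 5.0                # injected spike on a single link

mu = X.mean(axis=1, keepdims=True)
Xc = X - mu
U_full, _, _ = np.linalg.svd(Xc, full_matrices=False)
U = U_full[:, :k]                        # top-k principal directions

# Residual: the part of each interval's traffic outside the principal subspace.
residual = Xc - U @ (U.T @ Xc)
score = np.linalg.norm(residual, axis=0)
flagged = int(np.argmax(score))
```

Normal intervals leave only noise in the residual, so the spiked interval stands out by a wide margin. In practice a threshold on `score` would replace the simple `argmax`.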

8 PCA in the wild: Part-of-speech tagging
Courtesy: Percy Liang. Schütze, '95
Part-of-speech (POS) tagging task:
  Input: I like reducing the dimensionality of data .
  Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .
Each $x_i$ is (the context distribution of) a word.
$x_{ji}$ is the number of times word i appeared in context j
Key idea: words appearing in similar contexts tend to have the same POS tags, so cluster using the contexts of each word type
Problem: contexts are too sparse
Solution: run PCA first, then cluster using the new representation

9 PCA in the wild: Multi-task learning
Courtesy: Percy Liang. Ando & Zhang, '05
Have n related tasks (classify documents for various users)
Each task has a linear classifier with weights $x_i$
Want to share structure between classifiers
One step of their procedure: given n linear classifiers $x_1, \ldots, x_n$, run PCA to identify shared structure:
$X = (x_1 \cdots x_n) \approx UZ$
Each column of U (each principal component) is an eigen-classifier
Other step of their procedure: retrain classifiers, regularizing towards subspace U
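A sketch of the PCA step only, on synthetic weight vectors that are constructed to lie in a shared k-dimensional subspace. The `basis` and `coeffs` matrices are hypothetical, the SVD is taken without centering (an assumption of this sketch), and the retraining step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 30, 12, 3                       # features, tasks, shared dimensions

# Hypothetical shared structure: an orthonormal d x k basis.
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))
coeffs = rng.normal(size=(k, n))
X = basis @ coeffs                        # column i is the weight vector x_i of task i

U_full, _, _ = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :k]                         # columns of U are the eigen-classifiers
Z = U.T @ X                               # each task's coordinates in the shared subspace
```

Because the synthetic classifiers are noise-free, `U @ Z` reproduces `X` and the span of `U` matches the span of `basis`. With real classifiers the fit would only be approximate.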

10 Choosing a number of components
As in the clustering setting, an important problem with no single solution
May be constrained by goals (visualization), resources, or minimum fraction of variance to be explained
Note: eigenvalue magnitudes determine explained variance
e.g., eigenvalues from face image dataset [Plot: eigenvalues against index i]
Rapid decay to zero ⇒ variance explained by a few components
Could look for an elbow or compare with a reference distribution
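The "minimum fraction of variance" rule can be sketched as follows, assuming data points are stored as columns. The helper names `explained_variance_fractions` and `choose_k` are illustrative.

```python
import numpy as np

def explained_variance_fractions(X):
    """X is d x n (one data point per column).
    Returns the covariance eigenvalues (descending) and cumulative variance fractions."""
    Xc = X - X.mean(axis=1, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    eigvals = s ** 2 / (X.shape[1] - 1)
    return eigvals, np.cumsum(eigvals) / eigvals.sum()

def choose_k(cum_frac, min_fraction):
    """Smallest k whose top-k components explain at least min_fraction of the variance."""
    k = int(np.searchsorted(cum_frac, min_fraction)) + 1
    return min(k, len(cum_frac))

# Toy spectrum: eigenvalues 5, 3, 1, 1 give cumulative fractions 0.5, 0.8, 0.9, 1.0.
toy_cum = np.cumsum([5.0, 3.0, 1.0, 1.0]) / 10.0
k_85 = choose_k(toy_cum, 0.85)   # three components are needed to reach 85%

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))     # synthetic data for illustration
eigvals, cum_frac = explained_variance_fractions(X)
```

Plotting `eigvals` against the index gives the kind of decay curve the slide refers to, which is where one would look for an elbow.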

11 PCA limitations and extensions
Squared Euclidean reconstruction error not appropriate for all data types
Various extensions, like exponential family PCA, have been developed for binary, categorical, count, and nonnegative data (e.g., Collins/Dasgupta/Schapire, A Generalization of Principal Component Analysis to the Exponential Family)
PCA can only find linear compressions of data
What if data is best summarized in a non-linear fashion?
Kernel PCA allows us to perform such non-linear dimensionality reduction
[Figure: PCA vs. Kernel PCA. Credit: Percy Liang]
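A minimal kernel PCA sketch, assuming data points are stored as rows here (unlike the column convention on the earlier slides) and an RBF kernel with an arbitrary `gamma`. It follows the standard recipe of centering the kernel matrix and taking its top eigenvectors; the blackboard derivation itself is not reproduced.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the rows x_i of X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(D2, 0.0))

def kernel_pca(K, k):
    """Top-k kernel PCA scores (n x k) from an n x n kernel matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Kc = H @ K @ H                            # kernel of feature-space-centered data
    eigvals, eigvecs = np.linalg.eigh(Kc)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]
    lam, V = eigvals[idx], eigvecs[:, idx]
    return V * np.sqrt(np.maximum(lam, 0.0))  # scale eigenvectors to get scores

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                  # synthetic data, one point per row

Z_rbf = kernel_pca(rbf_kernel(X, gamma=0.5), k=2)   # non-linear embedding
Z_lin = kernel_pca(X @ X.T, k=2)                    # linear kernel
```

With the linear kernel, the scores agree with ordinary PCA scores up to the sign of each column, which is a convenient sanity check before switching to a non-linear kernel.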

12 Blackboard discussion See lecture notes
