CS168: The Modern Algorithmic Toolbox
Lecture #8: PCA and the Power Iteration Method
Tim Roughgarden & Gregory Valiant
April 15, 2015

This lecture began with an extended recap of Lecture 7. Recall that if we have a matrix $X$ whose rows represent the data points, then we can find an orthonormal matrix $Q$ (its rows/columns are orthogonal to each other and have norm 1) and a diagonal matrix $D$ such that $X^\top X = Q^\top D Q$. Given such a decomposition, the rows of $Q$ are the eigenvectors of $X^\top X$, with associated eigenvalues given by the corresponding entries of the diagonal matrix $D$. The decomposition $X^\top X = Q^\top D Q$ is unique, up to permuting the diagonal entries of $D$ (and applying the corresponding permutation to the rows of $Q$). Henceforth we will assume that we choose the representation in which the entries of $D$ are ordered in decreasing order from the upper left to the lower right.

Recall that there are several uses of PCA:

- As a tool for dimension reduction: e.g., if we want to represent high-dimensional data as $k$-dimensional data, one natural choice is to pick the $k$-dimensional subspace that captures as much of the lengths of the datapoints as possible. This is the space spanned by the top $k$ principal components. Note that this subspace depends on how we scale the data (as opposed to the random-projection (JL) dimension reduction machinery we saw in Lecture 4).

- As a tool for de-noising data: the hope is that the data is actually low dimensional, but with noise. By keeping only the low-dimensional structure, we remove the unstructured noise.

- As a tool for revealing structure in the data. For example, the European genomes example I gave (see the links from the course page) shows that the top two principal components of genetic SNP data correspond to geographic latitude/longitude.

1 The Power Iteration Method

How does one compute the top $k$ principal components? There are two main approaches: 1) an explicit computation of the decomposition $X^\top X = Q^\top D Q$, and 2) the power iteration method. The first approach, which can be done symbolically, is basically only feasible for small matrices (and involves computing roots of the characteristic polynomial of the matrix $X^\top X$).
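To make the recap concrete, here is a minimal NumPy sketch of the first approach, with a numerical eigendecomposition routine standing in for the symbolic computation; the helper name pca_by_eigendecomposition and the example data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def pca_by_eigendecomposition(X, k):
    """Illustrative sketch: top-k principal components of the rows of X,
    via an explicit eigendecomposition of X^T X (X is assumed to already
    be centered/scaled however you see fit)."""
    M = X.T @ X                           # the d x d matrix X^T X
    eigvals, eigvecs = np.linalg.eigh(M)  # symmetric eigendecomposition; eigenvalues ascending
    order = np.argsort(eigvals)[::-1]     # re-sort so the entries of D are decreasing, as in lecture
    D = eigvals[order]
    Q = eigvecs[:, order].T               # rows of Q are the eigenvectors of X^T X
    return D[:k], Q[:k]                   # top-k eigenvalues and principal components

# Example: represent 10-dimensional data points in 2 dimensions.
X = np.random.randn(100, 10)
vals, components = pca_by_eigendecomposition(X, k=2)
X_reduced = X @ components.T              # 100 x 2 representation of the data
```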

The power iteration method is what is typically used in practice (especially in settings where we only want the top few principal components), and it gives some additional intuition as to what the principal components actually mean.

The main intuition behind the power method is that if one regards the matrix $M = X^\top X$ as a function that maps the unit sphere to an ellipse, then the longest axis of the ellipse corresponds to the top principal component. Given that the top principal component corresponds to the direction in which multiplication by $M$ stretches a vector the most, it is natural to hope that if we start with a random vector $v$ and keep applying $M$ over and over, we will end up having stretched $v$ so much in the direction of $M$'s top principal component that the image of $v$ will lie almost entirely in the direction of this top principal component. This is the power iteration algorithm. We now restate it formally.

Algorithm 1 (Power Iteration)
Given the matrix $M = X^\top X$:
- Select a random unit vector $v_0$.
- For $i = 1, 2, \ldots$: set $v_i = M^i v_0$; if $v_i/\|v_i\| \approx v_{i-1}/\|v_{i-1}\|$, then return $v_i/\|v_i\|$.

Often, rather than computing $Mv_0, M^2 v_0, M^3 v_0, M^4 v_0, M^5 v_0, \ldots$, one instead uses the repeated squaring trick and computes $Mv_0, M^2 v_0, M^4 v_0, M^8 v_0, \ldots$.

To see why the algorithm works, we first establish that if $M = Q^\top D Q$, then $M^i = Q^\top D^i Q$; namely, $M^i$ has the same orientation (i.e., eigenvectors) as $M$, but all of the eigenvalues are raised to the $i$-th power (and are hence exaggerated: e.g., if $\lambda_1 > 2\lambda_2$, then $\lambda_1^{10} > 1000\,\lambda_2^{10}$).

Claim 1.1 If $M = Q^\top D Q$, then $M^i = Q^\top D^i Q$.

Proof: We prove this by induction on $i$. The base case $i = 1$ is immediate. Assuming that the statement holds for some specific $i \ge 1$, we have
$$M^{i+1} = M^i M = Q^\top D^i Q\, Q^\top D Q = Q^\top D^i D Q = Q^\top D^{i+1} Q,$$
where we used the induction hypothesis for the second equality, and the third equality follows from the orthogonality of $Q$, which allows us to replace $Q Q^\top$ by the identity matrix.
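Below is a minimal NumPy sketch of Algorithm 1 (an illustrative implementation, not code from the course): it repeatedly applies $M$ and renormalizes, stopping once the direction stops changing. The helper name power_iteration, the iteration cap, and the tolerance are assumptions made for the example.

```python
import numpy as np

def power_iteration(M, num_iters=1000, tol=1e-10):
    """Illustrative power iteration: approximate top eigenvector of the
    symmetric PSD matrix M = X^T X, together with an eigenvalue estimate."""
    d = M.shape[0]
    v = np.random.randn(d)                 # a Gaussian vector, normalized, is a random unit vector
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = M @ v                          # apply M once
        w /= np.linalg.norm(w)             # renormalize to stay on the unit sphere
        done = np.linalg.norm(w - v) < tol # has the direction essentially stopped changing?
        v = w
        if done:
            break
    eigval = v @ M @ v                     # Rayleigh quotient: estimate of lambda_1
    return v, eigval

# Example usage on M = X^T X:
X = np.random.randn(200, 30)
w1, lam1 = power_iteration(X.T @ X)
```

Renormalizing at every step produces the same direction as computing $M^i v_0$ directly, but avoids numerical overflow when $\lambda_1 > 1$; the repeated-squaring variant mentioned above would instead square $M$ itself at each step.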

We can now quantify the performance of the power iteration algorithm.

Theorem 1.2 With probability at least $1/2$ over the choice of $v_0$, for all $t \ge 1$,
$$\left\langle \frac{M^t v_0}{\|M^t v_0\|},\; w_1 \right\rangle^2 \;\ge\; 1 - 4d\left(\frac{\lambda_2}{\lambda_1}\right)^{2t},$$
where $w_1$ is the top eigenvector of $M$, with eigenvalue $\lambda_1$, and $\lambda_2$ is the second-largest eigenvalue of $M$. In particular, if $t > \frac{10\log d}{\log(\lambda_1/\lambda_2)}$, then $\langle v_t/\|v_t\|,\, w_1\rangle^2 \ge 1 - 4d^{-19} > 0.99999$; namely, $v_t$ is essentially pointing in the same direction as $w_1$.

Proof: Let $w_1, w_2, \ldots$ denote the eigenvectors of $M$, with associated eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots$. We can consider the representation of our random vector $v_0$ in the (unknown) basis defined by $w_1, w_2, \ldots$; namely, $v_0 = c_1 w_1 + c_2 w_2 + \cdots$. We claim that, with probability at least $1/2$, $c_1^2 \ge \frac{1}{4d}$. This is a simple fact about random vectors, and follows from the fact that we can choose a random unit vector by selecting each coordinate independently from a Gaussian of variance $1/d$, then normalizing (by a factor that will be very close to 1). Given that $c_1^2 \ge \frac{1}{4d}$,
$$\left\langle \frac{M^t v_0}{\|M^t v_0\|},\; w_1 \right\rangle^2 \;=\; \frac{c_1^2 \lambda_1^{2t}}{\sum_{i=1}^d c_i^2 \lambda_i^{2t}} \;\ge\; \frac{c_1^2 \lambda_1^{2t}}{c_1^2 \lambda_1^{2t} + \lambda_2^{2t}} \;\ge\; 1 - \frac{\lambda_2^{2t}}{c_1^2 \lambda_1^{2t}} \;\ge\; 1 - 4d\left(\frac{\lambda_2}{\lambda_1}\right)^{2t},$$
where the first inequality uses $\sum_{i \ge 2} c_i^2 \lambda_i^{2t} \le \lambda_2^{2t} \sum_{i \ge 2} c_i^2 \le \lambda_2^{2t}$.

Note that the "with probability at least $1/2$" statement can be replaced by "with probability at least $1 - 1/2^{100}$" by repeating the above algorithm 100 times (for independent choices of $v_0$), and outputting the recovered vector $v$ that maximizes $\|Mv\|$.

The above theorem shows that the number of power iterations we require scales as $\frac{\log d}{\log(\lambda_1/\lambda_2)}$. If $\lambda_1/\lambda_2$ is large, then we are in excellent shape. If $\lambda_1 \approx \lambda_2$, then the algorithm might take a long time to find $w_1$ (or might never find it). Note that if $\lambda_1 = \lambda_2$, then $w_1$ and $w_2$ are not uniquely defined: any pair of orthonormal vectors that lie in this 2-dimensional eigenspace will be eigenvectors with eigenvalue $\lambda_1 = \lambda_2$. Note that in this case, the power iteration algorithm will simply return a vector that lies in this subspace, which is the correct thing to do.

1.1 Finding Lots of Principal Components

The power iteration algorithm finds the top principal component. If we want to find the top $k$, we can do the following (a code sketch follows this list):

1. Find the top component, $w_1$, using power iteration.
2. Project the data orthogonally to $w_1$ by replacing each datapoint $x$ with $x - \langle x, w_1\rangle\, w_1$.
3. Recurse by finding the top $k-1$ principal components of the new data (in which the direction $w_1$ has been removed).

The correctness of this greedy algorithm follows immediately from the fact that the $k$-dimensional subspace that maximizes the norms of the projections of a data matrix $X$ contains the $(k-1)$-dimensional subspace that maximizes the norms of the projections.
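Here is an illustrative NumPy sketch of this greedy procedure (the helper name top_k_components is an assumption, and it reuses the power_iteration sketch from earlier); after each component is found, it is projected out of the data before recursing.

```python
import numpy as np

def top_k_components(X, k, num_iters=1000):
    """Illustrative greedy top-k PCA: find the top component of the current
    data with power iteration, project it out, and repeat."""
    X = np.array(X, dtype=float)
    components = []
    for _ in range(k):
        w, _ = power_iteration(X.T @ X, num_iters=num_iters)  # top component of current data
        components.append(w)
        X = X - np.outer(X @ w, w)    # replace each row x with x - <x, w> w
    return np.array(components)       # k x d matrix whose rows are the principal components
```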

1.2 Eigen-spectrum

How do you know how many principal components to find? The simple answer is that you don't. In general, it is worth computing a lot of them and considering the eigen-spectrum, namely the set of eigenvalues. Often, if you plot the eigenvalues, you will see that after some point they are all fairly small (e.g., your data might have 200 dimensions, but after the first 50 eigenvalues, the rest are all tiny). Looking at this plot might give you some heuristic sense of how to choose the number of components so as to maximize the signal of your data while preserving low dimensionality.

2 Failure Cases

When and why does PCA fail?

1. You messed up the scaling/normalizing. PCA is sensitive to different scalings/normalizations of the coordinates. Much of the artistry of getting good results from PCA involves choosing an appropriate scaling for the different data coordinates.

2. Non-linear structure. PCA is all about finding linear structure in your data. If your data has some low-dimensional but non-linear structure, e.g. two data coordinates $x, y$ with $y \approx x^2$, then PCA will not find it. [Note: of course you can always add extra non-linear dimensions to your data, i.e., add coordinates $x^2$, $y^2$, $xy$, etc., and then apply PCA to this higher-dimensional data (a small sketch of this idea appears after this list). You can even do this in a kernelized manner; if you haven't seen the kernel trick in another class, don't worry about this last comment.]

3. Non-orthogonal structure. It is nice to have coordinates that are interpretable. Even if your data is essentially linear, the fact that the principal components are all orthogonal will often mean that after the top 3 or 4 components, it will be almost impossible to interpret the meanings of the components. This is because, say, the 5th component is FORCED to be orthogonal to the top 4 components, which is a rather artificial constraint to put on an interpretable feature.
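As a concrete, made-up illustration of the non-linear failure case in item 2 and the feature-augmentation workaround: the sketch below generates data near the parabola $y = x^2$, where neither eigenvalue of $X^\top X$ is small (so no one-dimensional linear subspace captures the data), and then adds the coordinates $x^2$, $y^2$, $xy$, after which a near-zero eigenvalue exposes the hidden relation. All data and parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = x**2 + 0.01 * rng.standard_normal(500)    # low-dimensional but non-linear structure

A = np.column_stack([x, y])
A = A - A.mean(axis=0)                         # center the data
print(np.linalg.eigvalsh(A.T @ A))             # neither eigenvalue is near zero:
                                               # PCA finds no 1-d linear structure

# Augment with the extra coordinates x^2, y^2, xy and look at the spectrum again.
B = np.column_stack([x, y, x**2, y**2, x*y])
B = B - B.mean(axis=0)
print(np.linalg.eigvalsh(B.T @ B))             # the smallest eigenvalue is nearly 0, exposing the
                                               # (approximate) linear relation y - x^2 = 0
                                               # among the augmented coordinates
```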

3 Power iteration, principal components, and Markov Chains

I didn't have time to mention this in class, but wanted to make a note of it here because someone asked after class. (Don't worry, this won't be on the exam.)

You might recall the expression $M^t v$ from Lecture 6 on Markov Chains. In that lecture, $M$ was the transition matrix of a Markov chain, $v$ was an initial state (or distribution over states), and $\lim_{t\to\infty} M^t v = \pi$, which satisfies $M\pi = \pi$, is the stationary distribution for the Markov chain, namely the limiting distribution, provided such a distribution exists (recall the two conditions of the fundamental theorem of Markov chains: every state must be eventually reachable from every other state, and the chain cannot be periodic). After today's lecture, it should be clear that $\pi$ is the top eigenvector of the transition matrix $M$, and that running MCMC and explicitly calculating $M^t v$ are both essentially applying the power iteration method to find this limiting stationary distribution.

In this lecture $M = X^\top X$ is symmetric, whereas a typical transition matrix for a Markov chain might not be symmetric. The main difference is that a transition matrix $M$ that is not symmetric can have negative (or even complex) eigenvalues, and the interpretation/intuition gets a bit more complicated, though many of the ideas still apply. We will talk more about what happens when one applies PCA to transition matrices in the last week of the course.
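To make the connection concrete, here is a small made-up example: a 3-state column-stochastic transition matrix, where repeatedly applying $M$ to an initial distribution (exactly the power-iteration computation, with no renormalization needed since $M$ preserves the sum of the entries) converges to the stationary distribution $\pi$. The specific chain is an illustrative assumption, not one from the course.

```python
import numpy as np

# A made-up 3-state Markov chain; each column sums to 1, so M @ v maps a
# distribution over states to the distribution after one step of the chain.
M = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

v = np.array([1.0, 0.0, 0.0])        # start in state 0
for _ in range(200):                 # power iteration: v <- M v
    v = M @ v
print(v)                             # approximately the stationary distribution pi

# Check: pi is the eigenvector of M whose eigenvalue is 1 (the top eigenvalue).
eigvals, eigvecs = np.linalg.eig(M)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print(pi / pi.sum())                 # matches the limit computed above
```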