CS281 Section 4: Factor Analysis and PCA

Scott Linderman

At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we have discussed linear regression for fitting continuous outputs given their corresponding features, and classification methods that learn a linear decision boundary, like logistic regression. Finally, we have extended these methods to include nonlinear functions of the input using generalized linear models. In many settings, however, we do not have a set of outputs and features; we just have a set of data. This is the domain of unsupervised machine learning. Our goals in this setting are less clear. Rarely is our data entirely random. We often believe there is some latent, low-dimensional structure in the data that is corrupted by noise. Unsupervised machine learning is about finding this latent structure, and today we will discuss some of the most widely used methods for doing so.

1. Principal Component Analysis (PCA)

Imagine we are presented with a bunch of data {x_n}, where each x_n lives in R^D. For example, in Figure 1 we have a cloud of points in R^2. In many cases we believe the data is actually lower dimensional, 1-dimensional in this case. That is, the 2-D data points are well approximated by a line, and each data point is described by a 1-D value indicating its location on the line. How could we find the line that best explains these data? This is very similar to linear regression, except in this case we do not know the latent position along the line.

[Figure 1: A cloud of data in R^2 that we would like to approximate with a lower-dimensional representation.]

(a) Heuristic motivation

One way is to assume the data is Gaussian distributed and find the MLE covariance, which corresponds to the sample covariance. Then, intuitively, the best line will be parallel to the long axis of the ellipse corresponding to the covariance matrix. That is, the best line will be parallel to the eigenvector of the covariance matrix with the largest eigenvalue. Each point will be approximated by its projection onto this eigenvector. This is the intuition behind PCA: we find the eigenvectors of the sample covariance matrix, and then summarize the observed data as projections onto the top eigenvectors, or principal components. The size of this projection for a given datapoint x_n is called its score.
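To make this recipe concrete, here is a minimal NumPy sketch (not part of the original notes; the function and variable names are our own) that computes the top-L principal components as eigenvectors of the sample covariance and the scores as projections onto them:

```python
import numpy as np

def pca(X, L):
    """Basic PCA via eigendecomposition of the sample covariance.

    X : (D, N) data matrix whose columns are the data points x_n.
    L : number of principal components to keep.
    Returns W (D, L) with orthonormal columns and scores Z (L, N).
    """
    # Center the data (the notes ignore mean shifts, so we remove the empirical mean here).
    Xc = X - X.mean(axis=1, keepdims=True)

    # Sample covariance matrix, shape (D, D).
    Sigma = Xc @ Xc.T / Xc.shape[1]

    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(Sigma)

    # Keep the L eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:L]]

    # Scores are projections of the (centered) data onto the components.
    Z = W.T @ Xc
    return W, Z
```

For the 2-D example in Figure 1, pca(X, 1) would return the direction of the best-fitting line and each point's 1-D coordinate along it.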

(b) Minimizing reconstruction error

Can we formalize the process by which we arrived at this algorithm? Under what objective function are the eigenvectors with the largest eigenvalues the best principal components? It turns out that, given a data matrix X ∈ R^{D×N} (whose columns are the x_n), PCA finds principal components W ∈ R^{D×L} and scores Z ∈ R^{L×N} that minimize the reconstruction error

    J(W, Z) = \frac{1}{N} \| X - W Z \|_F^2,

where W is constrained to be orthonormal and the error is measured with the Frobenius norm.

Let us solve for the optimal first principal component w_1 and scores z_1 under this objective:

    J(w_1, z_1) = \frac{1}{N} \sum_n \| x_n - z_{1,n} w_1 \|^2
                = \frac{1}{N} \sum_n \left[ x_n^T x_n - 2 z_{1,n} w_1^T x_n + z_{1,n}^2 w_1^T w_1 \right]
                = \frac{1}{N} \sum_n \left[ x_n^T x_n - 2 z_{1,n} w_1^T x_n + z_{1,n}^2 \right],

where the last line uses the constraint w_1^T w_1 = 1. Taking partial derivatives with respect to z_{1,n} and setting them to zero yields

    \frac{\partial J(w_1, z_1)}{\partial z_{1,n}} = \frac{2}{N} \left( z_{1,n} - w_1^T x_n \right) = 0
    \implies z_{1,n} = w_1^T x_n.

Hence the optimal score z_{1,n} is the projection of the data onto the first principal component.
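This closed-form score can be sanity-checked numerically. The following sketch (our own, purely illustrative) compares the projection w_1^T x_n against a brute-force grid search over candidate scores for a random data point and a random unit-norm direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random data point in R^D and a random unit-norm direction w_1.
D = 5
x_n = rng.normal(size=D)
w1 = rng.normal(size=D)
w1 /= np.linalg.norm(w1)

# Closed-form score: the projection of x_n onto w_1.
z_closed = w1 @ x_n

# Brute-force search: reconstruction error ||x_n - z * w_1||^2 over a grid of z.
zs = np.linspace(-10, 10, 200001)
errors = np.linalg.norm(x_n[:, None] - np.outer(w1, zs), axis=0) ** 2
z_brute = zs[np.argmin(errors)]

print(z_closed, z_brute)  # the two scores agree up to the grid resolution
```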

Plugging this back in, we get

    J(w_1) = \frac{1}{N} \sum_n \left[ x_n^T x_n - 2 z_{1,n} w_1^T x_n + z_{1,n}^2 \right]
           = \frac{1}{N} \sum_n \left[ x_n^T x_n - 2 z_{1,n}^2 + z_{1,n}^2 \right]
           = \mathrm{const} - \frac{1}{N} \sum_n z_{1,n}^2
           = \mathrm{const} - \mathrm{var}[z_1],

where the last line follows because E[z_1] = E[x^T w_1] = E[x]^T w_1 = 0 (the data is assumed to be centered).

Note: It is tempting to make the following substitution,

    J(w_1) = \frac{1}{N} \sum_n \left[ x_n^T x_n - 2 z_{1,n} w_1^T x_n + z_{1,n}^2 \right]
           = \frac{1}{N} \sum_n \left[ x_n^T x_n - 2 (w_1^T x_n)(w_1^T x_n) + z_{1,n}^2 \right]
           = \frac{1}{N} \sum_n \left[ x_n^T x_n - 2 x_n^T (w_1 w_1^T) x_n + z_{1,n}^2 \right],

and then claim that w_1 w_1^T = 1 since w_1 has unit norm, but this is false! Instead, w_1 w_1^T is a rank-one D × D matrix with entries (w_1 w_1^T)_{ij} = w_{1,i} w_{1,j}.

Returning to our objective function, we see that minimizing J(w_1) yields a first principal component w_1 in the direction that maximizes the variance of the projected data. What is this variance? Plugging back in for z_{1,n} yields

    \mathrm{var}[z_1] = \frac{1}{N} \sum_n w_1^T x_n x_n^T w_1 = w_1^T \Sigma w_1,

where Σ is the empirical covariance matrix. If the magnitude of w_1 is unconstrained then the variance is trivially maximized by letting \|w_1\| → ∞, but by the orthonormality of W we have the constraint \|w_1\| = 1. This can be added to our objective function with a Lagrange multiplier,

    J(w_1) = w_1^T \Sigma w_1 + \lambda_1 (w_1^T w_1 - 1).

Taking partial derivatives with respect to w_1 and setting them to zero yields

    \Sigma w_1 = \lambda_1 w_1.

We see that the optimal w_1 is an eigenvector of the covariance matrix. Left-multiplying by w_1^T, we see that the variance of the projected data is the eigenvalue corresponding to w_1, so to maximize the variance we choose the eigenvector with the largest eigenvalue λ_1.
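To see the maximum-variance characterization numerically, here is a small sketch (our own construction on synthetic data) comparing the variance of projections onto the top eigenvector of the empirical covariance with the variance along random unit directions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-D data with 1-D latent structure plus noise, then centered.
N = 1000
t = rng.normal(size=N)
X = np.stack([2.0 * t, 1.0 * t]) + 0.3 * rng.normal(size=(2, N))
X = X - X.mean(axis=1, keepdims=True)

# Empirical covariance and its top eigenvector w_1.
Sigma = X @ X.T / N
eigvals, eigvecs = np.linalg.eigh(Sigma)
w1 = eigvecs[:, -1]

# The variance of the projections onto w_1 equals the top eigenvalue...
print(np.var(w1 @ X), eigvals[-1])

# ...and is at least as large as the variance along any other unit direction.
for _ in range(5):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)
    print(np.var(u @ X) <= np.var(w1 @ X) + 1e-12)
```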

(c) Generative model approach

It would be nice if we could formulate PCA as a generative model and thereby glean some intuition for why the eigenvectors of the empirical covariance matrix are good principal components. If we were to invert the process described above, we would first sample projections z_n and then transform them with W in order to get the corresponding observed data x_n. There was no noise in the reconstruction error formulation; the error stemmed only from the fact that we had fewer principal components than dimensions (L < D). Hence, in the generative model, x_n would be deterministic given z_n and W. We can relax this assumption a bit and add isotropic Gaussian noise, such that

    x_n ~ N(W z_n, σ² I),

where W is still constrained to be orthogonal and where we are ignoring mean shifts for simplicity. If we assume z_n ~ N(0, I), then we have a model known as probabilistic PCA. It turns out that in the limit σ² → 0, the MLE estimates of W and z_n recover the classical PCA solution.

2. Factor Analysis

Factor analysis (FA) is another dimensionality reduction technique with a long history in statistics, psychology, and other fields. It turns out that both PCA and FA can be viewed as special cases of the generative model described above. In factor analysis, however, we have the following model:

    z_n ~ N(0, I),
    x_n | z_n ~ N(W z_n, Ψ),

where Ψ is restricted to be diagonal. This is just like probabilistic PCA except that here we allow different dimensions of our observations to have different variances. Instead of principal components, we will call the z_n's latent factors.

The model is best motivated with an example. Suppose we have a set of ratings vectors {x_n}_{n=1}^N that N users gave to each of D jokes. Our goal is to predict a user's rating of a joke she hasn't heard, based on her ratings for other jokes and other users' ratings for the new joke. This is the collaborative filtering problem. We will assume that each user has a latent vector of preferences z_n along L dimensions, e.g. length of the joke, knock-knock-iness, or years of math required to understand the joke. Each joke will have a weighting of these dimensions, w_d ∈ R^L. Users' ratings are weighted sums, w_d^T z_n, of how well the joke aligns with their latent preferences, plus noise. If we knew the preference dimensions, weights, and user preferences, we could easily describe the distribution over joke ratings in terms of additive noise; otherwise we have to model the potentially complex distribution over ratings directly.
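To make the generative story concrete, the following sketch (with arbitrary, purely hypothetical parameter values) samples synthetic joke ratings from the model just stated, ignoring mean shifts:

```python
import numpy as np

rng = np.random.default_rng(2)

D, L, N = 20, 3, 500   # D jokes, L latent preference dimensions, N users

# Arbitrary illustrative parameters (not estimated from any real data).
W = rng.normal(size=(D, L))          # row w_d weights the latent dimensions for joke d
psi = rng.uniform(0.1, 1.0, size=D)  # diagonal of Psi: per-joke noise variance

# Generative model: z_n ~ N(0, I), x_n | z_n ~ N(W z_n, Psi).
Z = rng.normal(size=(L, N))                                    # latent preferences, one column per user
X = W @ Z + rng.normal(size=(D, N)) * np.sqrt(psi)[:, None]    # (D, N) synthetic ratings
```

Each column of X plays the role of one user's ratings vector x_n.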

Take the simplest case where x_n = W z_n + µ + noise, i.e. x is a linear function of z, with W ∈ R^{D×L}. Furthermore, assume that both the z's and the noise follow multivariate Gaussian distributions:

    z_n ~ N(z_n | µ_0, Σ_0),
    x_n | z_n ~ N(W z_n + µ, Ψ).

Since both are Gaussian, their joint distribution will be Gaussian, and the marginal distribution over x will be Gaussian. Our goal is to explain the data in terms of the latent variables z. In the Gaussian case this means explaining covariance in x in terms of covariance in z, rather than in terms of noise. Hence we restrict Ψ to be diagonal,

    Ψ = diag(σ_{11}, ..., σ_{DD}).

Under this restriction, the model is specified by θ = {µ_0, Σ_0, W, µ, Ψ}, with L + L(L+1)/2 + LD + D + D parameters. This is actually still overparameterized, because we can take µ_0 = 0 and Σ_0 = I without loss of generality by pushing them into µ and W, respectively. This leaves us with θ = {W, µ, Ψ}, with LD + D + D = O(LD) parameters. With the standard trick of augmenting our factors with a constant, we can roll µ into W. By these arguments we have arrived at the generative model for factor analysis posited above.

(a) Learning

In order to learn the parameters W and Ψ and infer the latent factors z_n of the model, we make use of the EM algorithm. In the E step we update the latent factors given the current weight and noise matrices, and in the M step we set the weight and noise matrices to their MAP estimates under the current factors. One benefit of the EM algorithm is that it is easy to handle missing data: we simply estimate it during the E step.

(b) Unidentifiability

Just as in mixture models, the learned weight matrix is not uniquely defined. We can rotate by any orthogonal matrix R and the likelihood will not change. Hence one must take care when interpreting the factors.
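This rotation invariance is easy to verify numerically: replacing W with W R for an orthogonal R leaves the marginal covariance W W^T + Ψ, and hence the Gaussian likelihood, unchanged. Below is a small sketch (our own, with arbitrary parameters) demonstrating this:

```python
import numpy as np

rng = np.random.default_rng(3)
D, L, N = 6, 2, 100

# Arbitrary factor analysis parameters and data drawn from the marginal N(mu, W W^T + Psi).
W = rng.normal(size=(D, L))
mu = rng.normal(size=D)
Psi = np.diag(rng.uniform(0.5, 1.5, size=D))
X = rng.multivariate_normal(mu, W @ W.T + Psi, size=N)   # shape (N, D)

def loglik(W):
    """Marginal Gaussian log-likelihood of X under N(mu, W W^T + Psi)."""
    cov = W @ W.T + Psi
    diff = X - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(cov), diff)
    return -0.5 * (N * (D * np.log(2 * np.pi) + logdet) + quad.sum())

# A random orthogonal matrix R via the QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(L, L)))

print(loglik(W), loglik(W @ R))  # identical up to floating-point error
```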