COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017


Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University

PRINCIPAL COMPONENT ANALYSIS

DIMENSIONALITY REDUCTION We're given data x_1, ..., x_n, where each x_i ∈ R^d. This data is often high-dimensional, but the information doesn't use the full d dimensions. For example, the images shown on the slide could be represented with three numbers, since they have three degrees of freedom: two for shifts and a third for rotation. Principal component analysis (PCA) can be thought of as a way of automatically mapping the data x_i into a new low-dimensional coordinate system. It captures most of the information in the data in a few dimensions. Extensions allow us to handle missing data and to "unwrap" nonlinear structure in the data.

PRINCIPAL COMPONENT ANALYSIS Example: How can we approximate this data using a single vector q? Here q is a unit-length vector, q^T q = 1. Red dot: the point at distance q^T x_i along the axis after projecting x_i onto the line defined by q; the vector (q^T x_i) q takes q and stretches it to the corresponding red dot. So what's a good q? How about minimizing the squared approximation error,

q = arg min_q \sum_{i=1}^n || x_i - q q^T x_i ||^2   subject to q^T q = 1,

where q q^T x_i = (q^T x_i) q is the approximation of x_i obtained by stretching q to the red dot.

PCA: THE FIRST PRINCIPAL COMPONENT This is related to the problem of finding the largest eigenvalue,

q = arg min_q \sum_{i=1}^n || x_i - q q^T x_i ||^2   s.t.  q^T q = 1
  = arg min_q \sum_{i=1}^n x_i^T x_i - q^T ( \sum_{i=1}^n x_i x_i^T ) q,    where \sum_{i=1}^n x_i x_i^T = X X^T.

We've defined X = [x_1, ..., x_n]. Since the first term doesn't depend on q and the second term has a negative sign in front of it, we can equivalently solve

q = arg max_q  q^T (X X^T) q   subject to q^T q = 1.

This is the eigendecomposition problem: q is the first eigenvector of X X^T, and λ = q^T (X X^T) q is the first eigenvalue.

PCA: GENERAL The general form of PCA considers K eigenvectors,

q_1, ..., q_K = arg min_{q_1,...,q_K} \sum_{i=1}^n || x_i - \sum_{k=1}^K (x_i^T q_k) q_k ||^2    s.t.  q_k^T q_{k'} = 1 if k = k', and 0 if k ≠ k'
             = arg min_{q_1,...,q_K} \sum_{i=1}^n x_i^T x_i - \sum_{k=1}^K q_k^T ( \sum_{i=1}^n x_i x_i^T ) q_k,    where \sum_{i=1}^n x_i x_i^T = X X^T.

Here \sum_{k=1}^K (x_i^T q_k) q_k approximates x_i. The vectors in Q = [q_1, ..., q_K] give us a K-dimensional subspace with which to represent the data:

x_proj = [ q_1^T x, ..., q_K^T x ]^T,    x ≈ \sum_{k=1}^K (q_k^T x) q_k = Q x_proj.

The eigenvectors of X X^T can be learned using built-in software.
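As a concrete illustration, here is a minimal numpy sketch (not from the slides; the function and variable names are ours) of PCA via the eigendecomposition of X X^T:

```python
import numpy as np

def pca(X, K):
    """PCA on a data matrix X with columns x_1, ..., x_n (shape d x n).
    Returns the top-K eigenvectors Q (d x K), their eigenvalues, and the projections (K x n)."""
    Xc = X - X.mean(axis=1, keepdims=True)     # subtract the mean of each dimension
    lam, Q = np.linalg.eigh(Xc @ Xc.T)         # eigendecomposition of XX^T (symmetric PSD)
    order = np.argsort(lam)[::-1][:K]          # keep the K largest eigenvalues
    Q, lam = Q[:, order], lam[order]
    X_proj = Q.T @ Xc                          # new K-dimensional coordinates
    return Q, lam, X_proj

# Example: project 100 points in R^5 down to K = 2 dimensions.
X = np.random.randn(5, 100)
Q, lam, X_proj = pca(X, K=2)
```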

EIGENVALUES, EIGENVECTORS AND THE SVD An equivalent formulation of the problem is to find pairs (λ, q) such that

(X X^T) q = λ q.

Since X X^T is a PSD matrix, there are r ≤ min{d, n} such pairs, with

λ_1 ≥ λ_2 ≥ ... ≥ λ_r > 0,    q_k^T q_k = 1,    q_k^T q_{k'} = 0 for k ≠ k'.

Why is X X^T PSD? Using the SVD, X = U S V^T, we have that

X X^T = U S^2 U^T    ⟹    Q = U,   λ_i = (S^2)_{ii} ≥ 0.

Preprocessing: Usually we first subtract off the mean of each dimension of x.
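A quick numerical check of this relationship, as a sketch with made-up data (not part of the lecture):

```python
import numpy as np

X = np.random.randn(5, 100)
X = X - X.mean(axis=1, keepdims=True)              # subtract the mean of each dimension

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T
lam, Q = np.linalg.eigh(X @ X.T)                   # eigendecomposition of XX^T

# The eigenvalues of XX^T equal the squared singular values of X,
# and the eigenvectors match the left singular vectors (up to sign).
print(np.allclose(np.sort(lam)[::-1], s**2))
print(np.allclose(np.abs(Q[:, np.argsort(lam)[::-1]]), np.abs(U)))
```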

PCA: EXAMPLE OF PROJECTING FROM R^3 TO R^2 For this data, most information (structure in the data) can be captured in R^2. (left) The original data in R^3; the hyperplane is defined by q_1 and q_2. (right) The new coordinates for the data: x_i → x_proj,i = [ x_i^T q_1, x_i^T q_2 ]^T.

EXAMPLE: DIGITS Data: 16 × 16 images of handwritten 3's (as vectors in R^256). Above: the mean image and the first four eigenvectors q_k with their eigenvalues, λ_1 = 3.4 × 10^5, λ_2 = 2.8 × 10^5, λ_3 = 2.4 × 10^5, λ_4 = 1.6 × 10^5. Above: reconstructing a 3 using the mean plus the first M − 1 eigenvectors (original, M = 1, M = 10, M = 50, M = 250), with approximation

x ≈ mean + \sum_{k=1}^{M-1} (x^T q_k) q_k.
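A small sketch of this reconstruction step, assuming we already have the mean image and eigenvectors (the names and random stand-in data are ours, not from the lecture):

```python
import numpy as np

def reconstruct(x, mean, Q, M):
    """Approximate x by the mean plus its projection onto the first M - 1 eigenvectors.
    Q holds the eigenvectors as columns, sorted by decreasing eigenvalue."""
    xc = x - mean
    coeffs = Q[:, :M - 1].T @ xc            # coefficients x^T q_k
    return mean + Q[:, :M - 1] @ coeffs     # mean + sum_k (x^T q_k) q_k

# Example with random stand-ins for a 16 x 16 digit (d = 256).
d = 256
Q, _ = np.linalg.qr(np.random.randn(d, d))  # orthonormal columns standing in for eigenvectors
x, mean = np.random.randn(d), np.random.randn(d)
x_hat = reconstruct(x, mean, Q, M=50)
```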

PROBABILISTIC PCA

PCA AND THE SVD We've discussed how any matrix X has a singular value decomposition,

X = U S V^T,    U^T U = I,    V^T V = I,

where S is a diagonal matrix with non-negative entries. Therefore,

X X^T = U S^2 U^T    ⟹    (X X^T) U = U S^2.

U is a matrix of eigenvectors, and S^2 is a diagonal matrix of eigenvalues.

A MODELING APPROACH TO PCA Using the SVD perspective of PCA, we can also derive a probabilistic model for the problem and use the EM algorithm to learn it. This model has several advantages: it handles the problem of missing data; it allows us to learn additional parameters such as the noise variance; it provides a framework that can be extended to more complex models; and it gives distributions that characterize uncertainty in predictions, etc.

PROBABILISTIC PCA In effect, this is a new matrix factorization model. With the SVD, we had X = U S V^T. We now approximate X ≈ W Z, where W is a d × K matrix. In different settings W is called a factor loadings matrix, or a dictionary. It's like the matrix of eigenvectors, but with no orthonormality constraint. The ith column of Z is called z_i ∈ R^K. Think of it as a low-dimensional representation of x_i. The generative process of probabilistic PCA is

x_i ~ N(W z_i, σ^2 I),    z_i ~ N(0, I).

In this case, we don't know W or any of the z_i.
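To make the generative process concrete, here is a minimal numpy sketch that samples data from this model (the dimensions and variable names are our own choices):

```python
import numpy as np

d, K, n, sigma2 = 10, 3, 500, 0.1

W = np.random.randn(d, K)                             # factor loadings / dictionary (unknown in practice)
Z = np.random.randn(K, n)                             # z_i ~ N(0, I), one column per data point
X = W @ Z + np.sqrt(sigma2) * np.random.randn(d, n)   # x_i ~ N(W z_i, sigma^2 I)
```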

THE LIKELIHOOD Maximum likelihood: Our goal is to find the maximum likelihood solution of the matrix W under the marginal distribution, i.e., with the z_i vectors integrated out,

W_ML = arg max_W ln p(x_1, ..., x_n | W) = arg max_W \sum_{i=1}^n ln p(x_i | W).

This is intractable to maximize directly because

p(x_i | W) = N(x_i | 0, σ^2 I + W W^T),
N(x_i | 0, σ^2 I + W W^T) = (2π)^{-d/2} |σ^2 I + W W^T|^{-1/2} exp( -(1/2) x_i^T (σ^2 I + W W^T)^{-1} x_i ).

We can instead set up an EM algorithm that uses the vectors z_1, ..., z_n.
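For reference, a sketch of how this marginal log-likelihood might be evaluated with numpy/scipy (our own helper, not part of the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_log_likelihood(X, W, sigma2):
    """Sum of ln N(x_i | 0, sigma^2 I + W W^T) over the columns of X."""
    d = X.shape[0]
    cov = sigma2 * np.eye(d) + W @ W.T   # marginal covariance with z_i integrated out
    return multivariate_normal(mean=np.zeros(d), cov=cov).logpdf(X.T).sum()
```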

EM FOR PROBABILISTIC PCA Setup: The marginal log likelihood can be expressed using the EM decomposition as

ln ∫ p(x_i, z_i | W) dz_i = ∫ q(z_i) ln [ p(x_i, z_i | W) / q(z_i) ] dz_i  +  ∫ q(z_i) ln [ q(z_i) / p(z_i | x_i, W) ] dz_i
                          =  L  +  KL.

EM algorithm: Remember that EM has two iterated steps. 1. Set q(z_i) = p(z_i | x_i, W) for each i (making KL = 0) and calculate L. 2. Maximize L with respect to W. Again, for this to work well we need to be able to calculate the posterior distribution p(z_i | x_i, W), and maximizing L must be easy, i.e., we update W using a simple equation.

THE ALGORITHM EM for Probabilistic PCA
Given: Data x_1:n with x_i ∈ R^d, and the model x_i ~ N(W z_i, σ^2 I), z_i ~ N(0, I), z_i ∈ R^K.
Output: Point estimate of W and the posterior distribution of each z_i.
E-step: Set each q(z_i) = p(z_i | x_i, W) = N(z_i | μ_i, Σ_i), where

Σ_i = (I + W^T W / σ^2)^{-1},    μ_i = Σ_i W^T x_i / σ^2.

M-step: Update W by maximizing the objective L from the E-step,

W = [ \sum_{i=1}^n x_i μ_i^T ] [ σ^2 I + \sum_{i=1}^n ( μ_i μ_i^T + Σ_i ) ]^{-1}.

Iterate the E and M steps until the increase in \sum_{i=1}^n ln p(x_i | W) is small.
Comment: The probabilistic framework gives a way to learn K and σ^2 as well.
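A compact numpy sketch of these two updates, following the slide's equations as written (including the σ^2 I term in the M-step); the function and variable names are ours, and σ^2 is treated as fixed:

```python
import numpy as np

def ppca_em(X, K, sigma2, num_iters=100):
    """EM for probabilistic PCA on a data matrix X (d x n) with fixed noise variance sigma2.
    Returns a point estimate of W and the posterior means mu_i (columns of mu)."""
    d, n = X.shape
    W = np.random.randn(d, K)
    for _ in range(num_iters):
        # E-step: q(z_i) = N(mu_i, Sigma); Sigma is shared because it does not depend on x_i.
        Sigma = np.linalg.inv(np.eye(K) + W.T @ W / sigma2)
        mu = Sigma @ W.T @ X / sigma2
        # M-step: W = [sum_i x_i mu_i^T][sigma^2 I + sum_i (mu_i mu_i^T + Sigma_i)]^{-1}
        B = sigma2 * np.eye(K) + mu @ mu.T + n * Sigma
        W = (X @ mu.T) @ np.linalg.inv(B)
    return W, mu
```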

EXAMPLE: IMAGE PROCESSING (Figure: an 8 × 8 patch extracted from an image; the vectorized patches form the columns of a data matrix X, e.g., 64 × 262,144.) For image problems such as denoising or inpainting (missing data): Extract overlapping patches (e.g., 8 × 8) and vectorize them to construct X. Model X with a factor model such as probabilistic PCA. Approximate x_i ≈ W μ_i, where μ_i is the posterior mean of z_i. Reconstruct the image by replacing each x_i with W μ_i (and averaging overlapping patches).
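A rough sketch of the patch-based pipeline described above (the patch size, helper names, and simple averaging scheme are our own choices; X_hat would be W @ mu from the EM fit):

```python
import numpy as np

def extract_patches(img, p=8):
    """Vectorize all overlapping p x p patches of a grayscale image into columns of X."""
    H, W_ = img.shape
    cols = [img[i:i + p, j:j + p].ravel()
            for i in range(H - p + 1) for j in range(W_ - p + 1)]
    return np.stack(cols, axis=1)                    # shape (p*p, num_patches)

def reconstruct_image(X_hat, shape, p=8):
    """Place reconstructed patches (columns of X_hat) back and average where they overlap."""
    H, W_ = shape
    out, counts = np.zeros(shape), np.zeros(shape)
    k = 0
    for i in range(H - p + 1):
        for j in range(W_ - p + 1):
            out[i:i + p, j:j + p] += X_hat[:, k].reshape(p, p)
            counts[i:i + p, j:j + p] += 1
            k += 1
    return out / counts
```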

EXAMPLE: DENOISING (left) Noisy image, (right) denoised image. The noise variance parameter σ^2 was learned for this example.

EXAMPLE: MISSING DATA Another somewhat extreme example: The image is 480 × 320 × 3 (RGB dimensions). Throw away 80% of the pixels at random. (left) Missing data, (middle) reconstruction, (right) original image.

KERNEL PCA

KERNEL PCA We've seen how we can take an algorithm that uses dot products, x^T x', and generalize it with a nonlinear kernel. This generalization can be made to PCA as well. Recall: With PCA we find the eigenvectors of the matrix \sum_{i=1}^n x_i x_i^T = X X^T. Let φ(x) be a feature mapping from R^d to R^D, where D ≫ d. We want to solve the eigendecomposition

[ \sum_{i=1}^n φ(x_i) φ(x_i)^T ] q_k = λ_k q_k

without having to work in the higher dimensional space. That is, how can we do PCA without explicitly using φ(·) and q?

KERNEL PCA Notice that we can reorganize the operations of the eigendecomposition:

\sum_{i=1}^n φ(x_i) ( φ(x_i)^T q_k / λ_k ) = q_k,    define a_ki := φ(x_i)^T q_k / λ_k.

That is, the eigenvector q_k = \sum_{i=1}^n a_ki φ(x_i) for some vector a_k ∈ R^n. The trick is that instead of learning q_k, we'll learn a_k. Plug this equation for q_k back into the first equation,

\sum_{i=1}^n φ(x_i) \sum_{j=1}^n a_kj φ(x_i)^T φ(x_j) = λ_k \sum_{i=1}^n a_ki φ(x_i),    where φ(x_i)^T φ(x_j) = K(x_i, x_j),

and multiply both sides by φ(x_l)^T for each l ∈ {1, ..., n}.

KERNEL PCA When we multiply both sides of

\sum_{i=1}^n φ(x_i) \sum_{j=1}^n a_kj K(x_i, x_j) = λ_k \sum_{i=1}^n a_ki φ(x_i)

by φ(x_l)^T for l = 1, ..., n, we get a new set of linear equations,

K^2 a_k = λ_k K a_k    ⟹    K a_k = λ_k a_k,

where K is the n × n kernel matrix constructed on the data. Because K is a matrix of dot products, it is guaranteed to be PSD, and the left-hand and right-hand equations above share a solution for (λ_k, a_k). Now perform regular PCA, but on the kernel matrix K instead of the data matrix X X^T. We summarize the algorithm on the following slide.

KERNEL PCA ALGORITHM Kernel PCA
Given: Data x_1, ..., x_n with x_i ∈ R^d, and a kernel function K(x_i, x_j).
Construct: The kernel matrix on the data, e.g., K_ij = b exp{ -||x_i - x_j||^2 / c }.
Solve: The eigendecomposition K a_k = λ_k a_k for the first r ≤ n eigenvector/eigenvalue pairs (λ_1, a_1), ..., (λ_r, a_r).
Output: A new coordinate system for each x_i, obtained by (implicitly) mapping to φ(x_i) and then projecting onto each q_k, since q_k^T φ(x_i) = λ_k a_ki:

x_i → [ λ_1 a_1i, ..., λ_r a_ri ]^T,

where a_ki is the ith dimension of the kth eigenvector a_k.
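A minimal numpy sketch of this algorithm with the Gaussian kernel as written on the slide (the parameters b and c, and the absence of kernel centering, follow the slide; the function name is ours):

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_pca(X, r, b=1.0, c=2.0):
    """Kernel PCA with the Gaussian kernel K_ij = b * exp(-||x_i - x_j||^2 / c).
    X has one data point per column (d x n); returns the new coordinates (r x n)."""
    D2 = cdist(X.T, X.T, metric='sqeuclidean')   # pairwise squared distances
    K = b * np.exp(-D2 / c)                      # n x n kernel matrix
    lam, A = np.linalg.eigh(K)                   # eigenvectors a_k as columns of A
    order = np.argsort(lam)[::-1][:r]            # top r eigenvalue/eigenvector pairs
    lam, A = lam[order], A[:, order]
    return lam[:, None] * A.T                    # entry (k, i) is lambda_k * a_ki
```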

KERNEL PCA AND NEW DATA Q: How do we handle new data, x_0? Before, we could take the eigenvectors q_k and compute the projection x_0^T q_k, but a_k is different here.
A: Recall the relationship of a_k to q_k in kernel PCA is q_k = \sum_{i=1}^n a_ki φ(x_i). We used the kernel trick to avoid working with, or even defining, φ(x_i). As with regular PCA, after mapping x_0 we want to project onto the eigenvectors,

x_0 → [ φ(x_0)^T q_1, ..., φ(x_0)^T q_r ]^T.

Plugging in for q_k:

φ(x_0)^T q_k = \sum_{i=1}^n a_ki φ(x_0)^T φ(x_i) = \sum_{i=1}^n a_ki K(x_0, x_i).
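Continuing the sketch above (same assumed names and Gaussian kernel), a new point could be mapped into the kernel PCA coordinates like this:

```python
import numpy as np

def kpca_project_new(x0, X, A, b=1.0, c=2.0):
    """Project a new point x0 onto the kernel PCA eigenvectors: phi(x0)^T q_k = sum_i a_ki K(x0, x_i).
    X holds the training points as columns (d x n); A holds the eigenvectors a_k as columns (n x r)."""
    k0 = b * np.exp(-np.sum((X - x0[:, None]) ** 2, axis=0) / c)   # K(x0, x_i) for i = 1..n
    return A.T @ k0                                                # coordinate k is sum_i a_ki K(x0, x_i)
```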

EXAMPLE RESULTS An example of kernel PCA using the Gaussian kernel. (left) Original data, colored for reference (the colors could correspond to classes). (middle) New coordinates using kernel width c = 2. (right) New coordinates using kernel width c = 10. Terminology: What we are doing is closely related to spectral clustering and can be considered an instance of manifold learning.