Machine Learning - MT 2016, 13 & 14: PCA and MDS


Machine Learning - MT 2016, 13 & 14: PCA and MDS. Varun Kanade, University of Oxford. November 21 & 23, 2016.

Announcements. Sheet 4 is due this Friday by noon. Practical 3 is this week (continue next week if necessary). Revision class for M.Sc. + D.Phil.: Thursday of Week 9 (2pm & 3pm); work through the ML HT2016 exam (Problem 3 is optional).

Supervised Learning: Summary. Training data is of the form $(x_i, y_i)$, where $x_i$ are features and $y_i$ is the target. We formulate a model: generative or discriminative. Choose a suitable training criterion (loss function, maximum likelihood). Use an optimisation procedure to learn parameters. Use regularization or other techniques to reduce overfitting. Use the trained classifier to predict targets/labels on unseen $x_{\text{new}}$.

Unsupervised Learning. Training data is of the form $x_1, \ldots, x_N$; we infer properties of the data. Search: identify patterns in the data. Density estimation: learn the underlying distribution generating the data. Clustering: group similar points together. Today: dimensionality reduction.

Outline. Today, we'll study a technique for dimensionality reduction. Principal Component Analysis (PCA) identifies a small number of directions that explain most of the variation in the data. PCA can be kernelised. Dimensionality reduction is important both for visualisation and as a preprocessing step before applying other (typically unsupervised) learning algorithms.

Principal Component Analysis (PCA)

PCA: Maximum Variance View. PCA is a linear dimensionality reduction technique: find the directions of maximum variance in the data $(x_i)_{i=1}^N$. Assume that the data is centered, i.e., $\sum_i x_i = 0$. Find a set of orthogonal vectors $v_1, \ldots, v_k$: the first principal component (PC) $v_1$ is the direction of largest variance; the second PC $v_2$ is the direction of largest variance orthogonal to $v_1$; the $i$-th PC $v_i$ is the direction of largest variance orthogonal to $v_1, \ldots, v_{i-1}$. The matrix $V \in \mathbb{R}^{D \times k}$ whose columns are these vectors gives the projection $z_i = V^T x_i$ for a datapoint $x_i$, or $Z = XV$ for the entire dataset.

PCA: Maximum Variance View. We are given i.i.d. data $(x_i)_{i=1}^N$ with data matrix $X$. We want to find $v_1 \in \mathbb{R}^D$, $\|v_1\| = 1$, that maximises $\|Xv_1\|^2$. Let $z = Xv_1$, so $z_i = x_i \cdot v_1$; we wish to find $v_1$ so that $\sum_{i=1}^N z_i^2$ is maximised. Now
$$\sum_{i=1}^N z_i^2 = z^T z = v_1^T X^T X v_1.$$
The maximum value attained by $v_1^T X^T X v_1$ subject to $\|v_1\| = 1$ is the largest eigenvalue of $X^T X$, and the argmax is the corresponding eigenvector $v_1$. Then find $v_2, v_3, \ldots, v_k$ that are successively orthogonal to the previous directions and maximise the as-yet-unexplained variance.
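
To make the eigenvalue claim concrete, here is a minimal NumPy sketch (my own illustration on synthetic data, not from the slides): the leading eigenvector of $X^T X$ attains at least as much variance as any random unit direction.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X = X - X.mean(axis=0)                 # centre the data

    C = X.T @ X                            # the D x D matrix X^T X
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

    U = rng.normal(size=(1000, 5))
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # random unit vectors

    print(v1 @ C @ v1)                     # equals the largest eigenvalue of X^T X
    print(((U @ C) * U).sum(axis=1).max()) # never exceeds the value above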

PCA: Best Reconstruction. We have i.i.d. data $(x_i)_{i=1}^N$ with data matrix $X$. Find a $k$-dimensional linear projection that best represents the data. Suppose $V_k \in \mathbb{R}^{D \times k}$ has orthonormal columns; project the data $X$ onto the subspace defined by $V_k$: $Z = XV_k$. Minimise the reconstruction error
$$\sum_{i=1}^N \|x_i - V_k V_k^T x_i\|^2.$$

Principal Component Analysis (PCA)

Equivalence between the Two Objectives: One-PC Case. Let $v_1$ be the direction of projection, with $\|v_1\| = 1$; the point $x$ is mapped to $\tilde{x} = (v_1 \cdot x)\,v_1$. Maximum variance: find $v_1$ that maximises $\sum_{i=1}^N (v_1 \cdot x_i)^2$. Best reconstruction: find $v_1$ that minimises
$$\sum_{i=1}^N \|x_i - \tilde{x}_i\|^2 = \sum_{i=1}^N \left( \|x_i\|^2 - 2(x_i \cdot \tilde{x}_i) + \|\tilde{x}_i\|^2 \right) = \sum_{i=1}^N \left( \|x_i\|^2 - 2(v_1 \cdot x_i)^2 + (v_1 \cdot x_i)^2 \|v_1\|^2 \right) = \sum_{i=1}^N \|x_i\|^2 - \sum_{i=1}^N (v_1 \cdot x_i)^2.$$
Since $\sum_i \|x_i\|^2$ does not depend on $v_1$, minimising the reconstruction error is the same as maximising the variance, so the same $v_1$ satisfies both objectives.
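
The algebra above can be checked numerically. The following sketch (mine, with synthetic data) verifies that for any unit vector $v$, the projected variance plus the reconstruction error equals the total sum of squares $\sum_i \|x_i\|^2$.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 4))
    X = X - X.mean(axis=0)

    v = rng.normal(size=4)
    v = v / np.linalg.norm(v)                # an arbitrary unit direction

    recon = (X @ v)[:, None] * v             # x~_i = (v . x_i) v
    variance_term = np.sum((X @ v) ** 2)     # sum_i (v . x_i)^2
    recon_error = np.sum((X - recon) ** 2)   # sum_i ||x_i - x~_i||^2
    total = np.sum(X ** 2)                   # sum_i ||x_i||^2

    assert np.isclose(variance_term + recon_error, total)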

Finding Principal Components: SVD. Let $X$ be the $N \times D$ data matrix. A pair of singular vectors $u \in \mathbb{R}^N$, $v \in \mathbb{R}^D$ and a singular value $\sigma \in \mathbb{R}_+$ satisfy $\sigma u = Xv$ and $\sigma v = X^T u$. Then $v$ is an eigenvector of $X^T X$ with eigenvalue $\sigma^2$, and $u$ is an eigenvector of $XX^T$ with eigenvalue $\sigma^2$.

Finding Principal Components: SVD. Write $X = U\Sigma V^T$ (say $N > D$). Thin SVD: $U$ is $N \times D$, $\Sigma$ is $D \times D$, $V$ is $D \times D$, with $U^T U = V^T V = I_D$; $\Sigma$ is diagonal with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_D \geq 0$. The first $k$ principal components are the first $k$ columns of $V$. Full SVD: $U$ is $N \times N$, $\Sigma$ is $N \times D$, $V$ is $D \times D$; $U$ and $V$ are orthogonal matrices.
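
In practice the whole computation is a thin SVD of the centred data matrix. A minimal sketch (assuming NumPy; the data here is synthetic):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 10))
    X = X - X.mean(axis=0)                    # centre the data first

    # Thin SVD: U is N x D, s holds sigma_1 >= ... >= sigma_D, Vt is D x D
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 3
    V_k = Vt[:k].T                            # first k principal components (D x k)
    Z = X @ V_k                               # projections; equivalently U_k Sigma_k

    assert np.allclose(Z, U[:, :k] * s[:k])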

Algorithm for finding PCs (when $N > D$). Constructing the matrix $X^T X$ takes time $O(D^2 N)$, and its eigenvectors can be computed in time $O(D^3)$. Iterative methods obtain the top $k$ (right) singular vectors directly: initialise $v_0$ to be a random unit-norm vector, then iterate $\tilde{v}_{t+1} = X^T X v_t$, $v_{t+1} = \tilde{v}_{t+1} / \|\tilde{v}_{t+1}\|$ until (approximate) convergence. The update step only takes $O(ND)$ time (compute $Xv_t$ first, then $X^T(Xv_t)$). This gives the singular vector corresponding to the largest singular value; a sketch follows below. Subsequent singular vectors are obtained by choosing $v_0$ orthogonal to the previously identified singular vectors (the re-orthogonalisation needs to be repeated at each iteration to avoid numerical errors creeping in).
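
A sketch of this power iteration (my own, assuming NumPy), compared against the leading right singular vector returned by a full SVD:

    import numpy as np

    def top_right_singular_vector(X, n_iter=500, seed=0):
        """Power iteration for the leading right singular vector of X."""
        rng = np.random.default_rng(seed)
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            w = X.T @ (X @ v)            # O(ND): compute X v first, then X^T (X v)
            v = w / np.linalg.norm(w)
        return v

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 20))
    X = X - X.mean(axis=0)

    v1 = top_right_singular_vector(X)
    v_svd = np.linalg.svd(X, full_matrices=False)[2][0]
    assert np.isclose(abs(v1 @ v_svd), 1.0, atol=1e-5)   # same direction up to sign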

Algorithm for finding PCs (when $D \gg N$). Constructing the matrix $XX^T$ takes time $O(N^2 D)$, and its eigenvectors can be computed in time $O(N^3)$. These eigenvectors are the left singular vectors $u_i$ of $X$. To obtain $v_i$, we use the fact that $v_i = \sigma_i^{-1} X^T u_i$. The iterative method can be used directly, as in the case $N > D$.
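
A sketch of this approach (mine, with synthetic data where $D \gg N$): eigendecompose the $N \times N$ matrix $XX^T$ and recover the right singular vectors via $v_i = \sigma_i^{-1} X^T u_i$.

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(50, 2000))              # D >> N
    X = X - X.mean(axis=0)

    G = X @ X.T                                  # the N x N matrix XX^T
    eigvals, U = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1]            # sort eigenpairs in decreasing order
    eigvals, U = eigvals[order], U[:, order]

    sigma = np.sqrt(np.clip(eigvals, 0.0, None)) # singular values of X
    k = 5
    V_k = X.T @ U[:, :k] / sigma[:k]             # columns v_i = sigma_i^{-1} X^T u_i

    # Agrees (up to sign) with the right singular vectors from the thin SVD
    Vt = np.linalg.svd(X, full_matrices=False)[2]
    assert np.allclose(np.abs(np.sum(V_k * Vt[:k].T, axis=0)), 1.0)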

PCA: Reconstruction Error. We have the thin SVD $X = U\Sigma V^T$. Let $V_k$ be the matrix containing the first $k$ columns of $V$. Projection onto $k$ PCs: $Z = XV_k = U_k \Sigma_k$, where $U_k$ is the matrix of the first $k$ columns of $U$ and $\Sigma_k$ is the $k \times k$ diagonal submatrix of $\Sigma$ containing the top $k$ singular values. Reconstruction: $\hat{X} = ZV_k^T = U_k \Sigma_k V_k^T$. The reconstruction error is
$$\sum_{i=1}^N \|x_i - V_k V_k^T x_i\|^2 = \sum_{j=k+1}^D \sigma_j^2.$$
This follows from the following calculations:
$$X = U\Sigma V^T = \sum_{j=1}^D \sigma_j u_j v_j^T, \qquad \hat{X} = U_k \Sigma_k V_k^T = \sum_{j=1}^k \sigma_j u_j v_j^T, \qquad \|X - \hat{X}\|_F^2 = \sum_{j=k+1}^D \sigma_j^2.$$
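
A quick numerical check of this identity (a sketch of mine with synthetic data): the squared Frobenius norm of $X - \hat{X}$ equals the sum of the squared discarded singular values.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(300, 8))
    X = X - X.mean(axis=0)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 3
    V_k = Vt[:k].T
    X_hat = X @ V_k @ V_k.T                  # rank-k reconstruction V_k V_k^T x_i

    error = np.sum((X - X_hat) ** 2)         # squared Frobenius norm
    assert np.isclose(error, np.sum(s[k:] ** 2))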

Reconstruction of an Image using PCA

How many principal components to pick? Look for an elbow in the curve of reconstruction error versus the number of PCs.
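
One way to produce such a curve (a sketch of mine; the synthetic data and the choice to report the fraction of variance explained are illustrative):

    import numpy as np

    rng = np.random.default_rng(6)
    # Synthetic data with unequal variances along the 12 coordinates
    X = rng.normal(size=(400, 12)) @ np.diag(np.linspace(3.0, 0.2, 12))
    X = X - X.mean(axis=0)

    s = np.linalg.svd(X, compute_uv=False)
    total = np.sum(s ** 2)
    for k in range(len(s) + 1):
        recon_error = np.sum(s[k:] ** 2)               # error when keeping k PCs
        print(k, round(1.0 - recon_error / total, 3))  # fraction of variance explained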

Application: Eigenfaces. A popular application of PCA for face detection and recognition is known as Eigenfaces. Face detection: identify faces in a given image. Face recognition: a classification (or search) problem to identify a certain person.

Application: Eigenfaces. PCA on a dataset of face images: each principal component can be thought of as an element of a face. Source: http://vismod.media.mit.edu/vismod/demos/facerec/basic.html

Application: Eigenfaces. Detection: each patch of the image can be checked to identify whether there is a face in it. Recognition: map all faces in terms of their principal components, then use some distance measure on the projections to find the faces most like the input image. Why use PCA for face detection? Even though images can be large, we can use the $D \gg N$ approach to be efficient; the final model (the PCs) can be quite compact and can fit on cameras and phones; and it works very well given the simplicity of the model.

Application: Latent Semantic Analysis. $X$ is an $N \times D$ matrix, where $D$ is the size of the dictionary and $x_i$ is a vector of word counts (bag of words). Reconstruction using $k$ eigenvectors: $X \approx ZV_k^T$, where $Z = XV_k$. The inner product $\langle z_i, z_j \rangle$ is probably a better notion of similarity than $\langle x_i, x_j \rangle$. Non-negative matrix factorisation has a more natural interpretation, but is harder to compute.
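
A toy sketch of LSA (mine; the tiny count matrix is made up purely for illustration): project bag-of-words vectors onto $k$ right singular vectors and compare documents in the latent space.

    import numpy as np

    # Rows are documents, columns are dictionary words (toy counts).
    X = np.array([
        [2, 1, 0, 0, 0],
        [1, 2, 1, 0, 0],
        [0, 0, 0, 2, 1],
        [0, 0, 1, 1, 2],
    ], dtype=float)

    # Plain truncated SVD (LSA typically skips the centring step used in PCA)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    V_k = Vt[:k].T
    Z = X @ V_k                               # latent coordinates, one row per document

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(Z[0], Z[1]))                 # documents sharing vocabulary: high similarity
    print(cosine(Z[0], Z[2]))                 # documents with disjoint vocabulary: low similarity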

PCA: Beyond Linearity

Projection: Linear PCA

Projection: Kernel PCA

Kernel PCA. Suppose our original data is, for example, $x \in \mathbb{R}^2$. We could perform a degree-2 polynomial basis expansion as
$$\phi(x) = [1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2]^T.$$
Recall that we can compute the inner products $\phi(x) \cdot \phi(x')$ efficiently using the kernel trick:
$$\phi(x) \cdot \phi(x') = 1 + 2x_1x_1' + 2x_2x_2' + x_1^2(x_1')^2 + x_2^2(x_2')^2 + 2x_1x_2x_1'x_2' = (1 + x_1x_1' + x_2x_2')^2 = (1 + x \cdot x')^2 =: \kappa(x, x').$$
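
A quick check of this identity (a sketch of mine, assuming NumPy): the explicit feature map and the kernel give the same inner products.

    import numpy as np

    def phi(x):
        """Degree-2 polynomial feature map for x in R^2."""
        x1, x2 = x
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    def kappa(x, xp):
        return (1.0 + x @ xp) ** 2

    rng = np.random.default_rng(7)
    x, xp = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(xp), kappa(x, xp))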

Kernel PCA. Suppose we use the feature map $\phi: \mathbb{R}^D \to \mathbb{R}^M$. Let $\phi(X)$ be the $N \times M$ matrix whose rows are $\phi(x_i)$. We want to find the singular vectors of $\phi(X)$ (eigenvectors of $\phi(X)^T \phi(X)$). However, in general $M \gg N$ (in fact $M$ could be infinite for some kernels). Instead we'll find the eigenvectors of $\phi(X)\phi(X)^T$, the kernel matrix.

Kernel PCA. Recall that the kernel matrix is
$$K = \phi(X)\phi(X)^T = \begin{bmatrix} \kappa(x_1, x_1) & \kappa(x_1, x_2) & \cdots & \kappa(x_1, x_N) \\ \kappa(x_2, x_1) & \kappa(x_2, x_2) & \cdots & \kappa(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(x_N, x_1) & \kappa(x_N, x_2) & \cdots & \kappa(x_N, x_N) \end{bmatrix}.$$
Let $u \in \mathbb{R}^N$ be an eigenvector of $K$ (a left singular vector of $\phi(X)$). The corresponding principal component $v \in \mathbb{R}^M$ is $\sigma^{-1}\phi(X)^T u$. We won't express $v$ explicitly; instead we can compute projections of a new datapoint $x_{\text{new}}$ onto the principal component $v$ using the kernel function:
$$\phi(x_{\text{new}})^T v = \sigma^{-1}\phi(x_{\text{new}})^T \phi(X)^T u = \sigma^{-1}[\kappa(x_{\text{new}}, x_1), \kappa(x_{\text{new}}, x_2), \ldots, \kappa(x_{\text{new}}, x_N)]\, u.$$
So in order to compute projections onto principal components, we do not need to store the principal components explicitly!
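
Because $M$ is finite for the degree-2 polynomial kernel, we can sanity-check the kernel-only projection against the explicit feature-space computation. A sketch (mine, reusing $\phi$ and $\kappa$ from above; centring is ignored here and handled on the next slide):

    import numpy as np

    def phi(x):
        x1, x2 = x
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    def kappa(x, xp):
        return (1.0 + x @ xp) ** 2

    rng = np.random.default_rng(8)
    X = rng.normal(size=(30, 2))
    Phi = np.array([phi(x) for x in X])              # the N x M feature matrix

    K = Phi @ Phi.T                                  # kernel matrix, K_ij = kappa(x_i, x_j)
    eigvals, U = np.linalg.eigh(K)
    u, lam = U[:, -1], eigvals[-1]                   # top eigenpair of K
    sigma = np.sqrt(lam)

    x_new = rng.normal(size=2)
    k_new = np.array([kappa(x_new, x) for x in X])   # [kappa(x_new, x_1), ..., kappa(x_new, x_N)]
    proj_kernel = (k_new @ u) / sigma                # projection using only kernel evaluations

    v = Phi.T @ u / sigma                            # explicit principal component in feature space
    assert np.isclose(proj_kernel, phi(x_new) @ v)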

Kernel PCA. For PCA, we assumed that the data matrix $X$ is centered, i.e., $\sum_i x_i = 0$. However, this is not the case for the matrix $\phi(X)$. Instead we can consider
$$\tilde{\phi}(x_i) = \phi(x_i) - \frac{1}{N}\sum_{k=1}^N \phi(x_k).$$
The corresponding matrix $\tilde{K}$ has entries
$$\tilde{K}_{ij} = \kappa(x_i, x_j) - \frac{1}{N}\sum_{l=1}^N \kappa(x_i, x_l) - \frac{1}{N}\sum_{l=1}^N \kappa(x_j, x_l) + \frac{1}{N^2}\sum_{k=1}^N \sum_{l=1}^N \kappa(x_k, x_l).$$
Succinctly, if $O$ is the matrix with every entry $1/N$, i.e., $O = \mathbf{1}\mathbf{1}^T/N$, then $\tilde{K} = K - OK - KO + OKO$. To perform kernel PCA, we need to find the eigenvectors of $\tilde{K}$.
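
The double-centring formula can be verified directly when the feature map is explicit. This sketch (mine) checks that $\tilde{K} = K - OK - KO + OKO$ equals the Gram matrix of explicitly centred feature vectors for the degree-2 polynomial kernel.

    import numpy as np

    def phi(x):
        x1, x2 = x
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    rng = np.random.default_rng(9)
    X = rng.normal(size=(40, 2))

    K = (1.0 + X @ X.T) ** 2                         # kernel matrix for the degree-2 polynomial kernel
    N = K.shape[0]
    O = np.full((N, N), 1.0 / N)                     # O = 1 1^T / N
    K_tilde = K - O @ K - K @ O + O @ K @ O          # double-centred kernel matrix

    Phi = np.array([phi(x) for x in X])
    Phi_c = Phi - Phi.mean(axis=0)                   # explicitly centred feature vectors
    assert np.allclose(K_tilde, Phi_c @ Phi_c.T)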

Projection: PCA vs Kernel PCA

Kernel PCA Applications. Kernel PCA is not necessarily very useful for visualisation. Also, kernel PCA does not directly give a useful way to construct a low-dimensional reconstruction of the original data. The most powerful uses of kernel PCA are in other machine learning applications: after kernel PCA preprocessing, we may get higher accuracy for classification, clustering, etc.

PCA Summary. Algorithm: we've expressed PCA as the SVD of the data matrix $X$; equivalently, we can use the eigendecomposition of the matrix $X^T X$. Running time: $O(NDk)$ to compute $k$ principal components (avoiding explicit construction of the matrix $X^T X$). PCs are uncorrelated, but there may be non-linear (higher-order) effects. PCA depends on the scale or units of measurement, so it may be a good idea to standardise the data. PCA is sensitive to outliers. PCA can be kernelised: this is useful as preprocessing for further ML applications, rather than for visualisation.