Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen. PCA. Tobias Scheffer

Overview
- Principal Component Analysis (PCA)
- Kernel PCA
- Fisher Linear Discriminant Analysis
- t-SNE

PCA: Motivation
- Data compression
- Preprocessing (feature selection, noisy features)
- Data visualization

PCA: Example
- Representation of digits as an $m \times m$ pixel matrix.
- The actual number of degrees of freedom is significantly smaller, because many features are meaningless or are composites of several pixels.
- Goal: reduce to a $d$-dimensional subspace.

PCA: Example
- Representation of faces as an $m \times m$ pixel matrix.
- The actual number of degrees of freedom is significantly smaller, because many combinations of pixels cannot occur in faces.
- Goal: reduce to a $d$-dimensional subspace.

PCA: Projection
- A projection is an idempotent linear transformation.
- Center point: $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$
- Covariance: $\Sigma = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$
- Data points $x_i$ are projected onto a direction $u$: $\mathrm{proj}_u(x_i) = u^T x_i$

PCA: Projection
- A projection is an idempotent linear transformation.
- Let $u_1 \in \mathbb{R}^m$ with $u_1^T u_1 = 1$. Then $y_1(x) = u_1^T x$ constitutes a projection onto a one-dimensional subspace.
- For the projected data it follows that:
  - Center (mean): $\overline{y_1(x)} = u_1^T \bar{x}$
  - Variance: $\frac{1}{n} \sum_{i=1}^n \left(u_1^T x_i - u_1^T \bar{x}\right)^2 = u_1^T \Sigma u_1$
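The identity for the projected variance is easy to check numerically. A minimal sketch, assuming NumPy and randomly generated toy data (all variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # toy data, n x m

Xc = X - X.mean(axis=0)                 # center the data
Sigma = Xc.T @ Xc / len(Xc)             # empirical covariance (m x m)

u = rng.normal(size=3)
u /= np.linalg.norm(u)                  # unit-length direction u

proj = Xc @ u                           # projected coordinates u^T x_i
print(np.var(proj), u @ Sigma @ u)      # the two numbers agree
```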

PCA: Problem Setting
- Given: data $X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix}$
- Find a matrix $U = (u_1, \dots, u_d)$ whose vectors $u_i$ form an orthonormal basis, such that:
  - Vector $u_1$ preserves the maximal variance of the data: $\max_{u_1 : \|u_1\| = 1} u_1^T \frac{1}{n} X^T X\, u_1$
  - Each vector $u_i$ preserves the maximal residual variance: $\max_{u_i : \|u_i\| = 1,\ u_i \perp u_1, \dots, u_i \perp u_{i-1}} u_i^T \frac{1}{n} X^T X\, u_i$

PCA: Problem Setting
- Given: data $X$, find a matrix $U = (u_1, \dots, u_d)$ such that:
  - The value $u_k^T x_i$ is the projection of $x_i$ onto dimension $u_k$.
  - The vector $U^T x_i$ is the projection of $x_i$ onto the coordinates $U$.
  - The matrix $Y = XU$ is the projection of $X$ onto the coordinates $U$: $Y = XU = \begin{pmatrix} x_1^T u_1 & \cdots & x_1^T u_d \\ \vdots & & \vdots \\ x_n^T u_1 & \cdots & x_n^T u_d \end{pmatrix}$

PCA: Assumption
- To simplify the notation, assume centered data: $\bar{x} = 0$.
- This can be achieved by subtracting the mean value: $x_i^c = x_i - \bar{x}$.

PCA: Direction of Maximum Variance
- Find the direction $u_1$ that maximizes the projected variance.
- Instances $x \sim p(X)$ (assume mean $\bar{x} = 0$). The projected variance onto (normalized) $u_1$ is $E\left[\mathrm{proj}_{u_1}(x)^2\right] = E\left[u_1^T x x^T u_1\right] = u_1^T \underbrace{E\left[x x^T\right]}_{\Sigma_{xx}} u_1$
- Empirical covariance matrix (centered data): $\Sigma_{xx} = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} x_{i1} \\ \vdots \\ x_{im} \end{pmatrix} \begin{pmatrix} x_{i1} & \cdots & x_{im} \end{pmatrix} = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} x_{i1}^2 & \cdots & x_{i1} x_{im} \\ \vdots & & \vdots \\ x_{im} x_{i1} & \cdots & x_{im}^2 \end{pmatrix}$

PCA: Direction of Maximum Variance
- Find the direction $u_1$ that maximizes the projected variance.
- Instances $x \sim p(X)$ (assume mean $\bar{x} = 0$). The projected variance onto (normalized) $u_1$ is $E\left[\mathrm{proj}_{u_1}(x)^2\right] = E\left[u_1^T x x^T u_1\right] = u_1^T \underbrace{E\left[x x^T\right]}_{\Sigma_{xx}} u_1$
- The empirical covariance matrix (of centered data) is $\Sigma_{xx} = \frac{1}{n} X^T X$.
- How can we find the direction $u_1$ that maximizes $u_1^T \Sigma_{xx} u_1$? How can we kernelize it?

PCA: Optimization Problem
- Solution for $u_1$: maximize the variance of the projected data: $\max_{u_1} u_1^T \Sigma_{xx} u_1$, such that $u_1^T u_1 = 1$
- Lagrangian: $u_1^T \Sigma_{xx} u_1 + \lambda_1 \left(1 - u_1^T u_1\right)$

PCA: Optimization Problem
- Solution for $u_1$: maximize the variance of the projected data: $\max_{u_1} u_1^T \Sigma_{xx} u_1$, such that $u_1^T u_1 = 1$
- Lagrangian: $u_1^T \Sigma_{xx} u_1 + \lambda_1 \left(1 - u_1^T u_1\right)$
- Taking its derivative and setting it to 0: $\Sigma_{xx} u_1 = \lambda_1 u_1$
- The solution $u_1$ must be an eigenvector of $\Sigma_{xx}$.
- Variance of the projected data: $u_1^T \Sigma_{xx} u_1 = u_1^T \lambda_1 u_1 = \lambda_1$
- The solution is the eigenvector $u_1$ of $\Sigma_{xx}$ with the greatest eigenvalue $\lambda_1$, called the 1st principal component.

PCA: Optimization Problem
- Solution for $u_i$: maximize the variance of the projected data: $\max_{u_i} u_i^T \Sigma_{xx} u_i$, such that $u_i^T u_i = 1$ and $u_i \perp u_1, \dots, u_i \perp u_{i-1}$
- Lagrangian: $u_i^T \Sigma_{xx} u_i + \lambda_i \left(1 - u_i^T u_i\right)$
- Taking its derivative and setting it to 0: $\Sigma_{xx} u_i = \lambda_i u_i$
- The solution $u_i$ must be an eigenvector of $\Sigma_{xx}$.
- To maximize the variance of the projected data: $u_i^T \Sigma_{xx} u_i = u_i^T \lambda_i u_i = \lambda_i$
- To ensure that the $u_i$ are orthogonal: $u_i$ is the eigenvector with the next-best eigenvalue $\lambda_i \le \lambda_{i-1}$.

PCA: Optimization Problem
- The eigenvector decomposition implies $\Sigma_{xx} = U \Lambda U^T$ with $\Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_m \end{pmatrix}$
- However, if $U_{1:d}$ contains only the first $d$ eigenvectors, then $Y = X U_{1:d}$ retains only a fraction of the variance: $\frac{\sum_{i=1}^d \lambda_i}{\mathrm{tr}(\Sigma_{xx})}$
- Choose $d$ smaller than $m$ but large enough to cover most of the variance.
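In practice, $d$ can be chosen by looking at the cumulative fraction of variance. A minimal NumPy sketch; the 95% threshold is an illustrative choice, not part of the lecture:

```python
import numpy as np

def choose_d(X, threshold=0.95):
    """Smallest d whose eigenvalues cover `threshold` of the total variance."""
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / len(Xc)
    eigvals = np.linalg.eigvalsh(Sigma)[::-1]        # eigenvalues, sorted descending
    ratio = np.cumsum(eigvals) / eigvals.sum()       # cumulative variance fraction
    return int(np.searchsorted(ratio, threshold)) + 1
```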

PCA
- Projection of $x$ onto the eigenspace: $y_1(x) = u_1^T x$, $y(x) = U^T x$ with $U = (u_1, \dots, u_d)$, $Y = XU$
- The eigenvector with the largest eigenvalue is the 1st principal component.
- The remaining principal components are orthogonal directions which maximize the residual variance.
- The $d$ principal components are the eigenvectors of the $d$ largest eigenvalues.

PCA: Reverse Projection
- Observation: the $u_j$ form a basis for $\mathbb{R}^m$, and the $y_j(x)$ are the coordinates of $x$ in that basis.
- Data $x_i$ can thus be reconstructed in that basis: $x_i = \sum_{j=1}^m \left(x_i^T u_j\right) u_j$, or $X = X U U^T$.
- If the data lies (mostly) in the $d$-dimensional principal subspace, we can also reconstruct the data there: $x_i \approx \sum_{j=1}^d \left(x_i^T u_j\right) u_j$, or $X \approx X U_{1:d} U_{1:d}^T$, where $U_{1:d}$ is the matrix of the first $d$ eigenvectors.
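As a quick illustrative check of this reconstruction (assuming the NumPy conventions used above, with instances as rows of X; the function name is hypothetical):

```python
import numpy as np

def reconstruct(X, d):
    """Project X onto the first d principal components and map back."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(Xc)
    eigvals, U = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    U_d = U[:, ::-1][:, :d]                     # first d principal axes
    return mu + Xc @ U_d @ U_d.T                # X ≈ X U_{1:d} U_{1:d}^T (plus mean)
```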

Reverse Projection: Example
- Morphace (Universität Basel): 3D face model of 200 persons (150,000 features).
- PCA with 199 principal components.

PCA: Algorithm
- PCA finds the dataset's principal components, which maximize the projected variance.
- Algorithm (a sketch follows below):
  1. Compute the data's mean: $\mu = \frac{1}{n} \sum_{i=1}^n x_i$
  2. Compute the data's covariance: $\Sigma_{xx} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T$
  3. Find the principal axes: $U = \mathrm{eigenvectors}(\Sigma_{xx})$
  4. Project the data onto the first $d$ eigenvectors: $x_i \mapsto U_{1:d}^T (x_i - \mu)$
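A minimal NumPy sketch of these four steps (function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def pca(X, d):
    """Return the projected data Y (n x d) and the principal axes U_d (m x d)."""
    mu = X.mean(axis=0)                              # 1. mean
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(Xc)                      # 2. covariance
    eigvals, U = np.linalg.eigh(Sigma)               # 3. eigendecomposition (ascending)
    order = np.argsort(eigvals)[::-1]                #    sort eigenvalues descending
    U_d = U[:, order[:d]]                            #    first d principal axes
    Y = Xc @ U_d                                     # 4. projection U_{1:d}^T (x_i - mu)
    return Y, U_d

# Usage: Y, U_d = pca(np.random.default_rng(0).normal(size=(100, 5)), d=2)
```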

Difference Between PCA and Autoencoders
- PCA: linear mapping $y(x) = U^T x$ from $x$ to $y$.
- Autoencoder: linear mapping from $x$ to a hidden representation $h$, followed by a nonlinear activation function $\sigma$.
- An autoencoder with squared loss and linear activation function is equivalent to PCA.
- Stacked autoencoders: more nonlinearity, more complex mappings.
- Kernel PCA: linear mapping in a feature space $\Phi$.
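To make the PCA/autoencoder comparison concrete, here is a small sketch of a linear autoencoder trained with squared loss, assuming PyTorch; layer sizes, learning rate, and iteration count are arbitrary illustrative choices. With linear activations, the learned code spans the same subspace as the top-$d$ principal components (up to an invertible linear transformation):

```python
import torch
from torch import nn

m, d = 5, 2
X = torch.randn(200, m)                      # toy data (n x m), assumed centered

encoder = nn.Linear(m, d, bias=False)        # linear compression x -> h
decoder = nn.Linear(d, m, bias=False)        # linear reconstruction h -> x_hat
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = ((decoder(encoder(X)) - X) ** 2).mean()   # squared reconstruction loss
    loss.backward()
    opt.step()
```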

Overview
- Principal Component Analysis (PCA)
- Kernel PCA
- Fisher Linear Discriminant Analysis
- t-SNE

Kernel PCA
- PCA can only capture linear subspaces.
- More complex features can capture non-linearity.
- We want to use PCA in high-dimensional feature spaces.
- Requirement: the data interact only through inner products.

Kernel PCA: Example
- [Figure: two-dimensional data arranged as two concentric rings.]

Kernel PCA: Example
- PCA fails to capture the data's two-ring structure; the rings are not separated in the first two components.

Kernel PCA: Kernel Recap
- Linear classifiers are often adequate, but not always.
- Idea: the data are implicitly mapped into another space in which they are linearly classifiable.
- Feature mapping: $x \mapsto \varphi(x)$
- Associated kernel: $\kappa(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
- Kernel = inner product = similarity of examples.
- [Figure: classes that are not linearly separable in input space become linearly separable after the mapping.]
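For concreteness, a common choice is the RBF kernel. A small NumPy sketch of computing the kernel matrix (the bandwidth `gamma` is an illustrative parameter, not from the lecture):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """K_ij = exp(-gamma * ||x_i - x_j||^2) for all pairs of rows of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)
```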

Kernel PCA
- Covariance of centered data: $\Sigma_{xx} = \frac{1}{n} \sum_i x_i x_i^T$; eigenvectors: $\Sigma_{xx} u = \lambda u$.
- In feature space (centered): $\Sigma_{\varphi(x)\varphi(x)} = \frac{1}{n} \sum_i \varphi(x_i) \varphi(x_i)^T$; eigenvectors: $\Sigma_{\varphi(x)\varphi(x)} u = \lambda u$, i.e. $\frac{1}{n} \sum_i \varphi(x_i) \varphi(x_i)^T u = \lambda u$.
- All solutions live in the span of $\varphi(x_1), \dots, \varphi(x_n)$.

Kernel PCA
- All solutions live in the span of $\varphi(x_1), \dots, \varphi(x_n)$.
- Hence, every eigenvector $u$ must be a linear combination of $\varphi(x_1), \dots, \varphi(x_n)$: $\exists \alpha:\ u = \sum_{i=1}^n \alpha_i \varphi(x_i)$
- Hence, $\Sigma_{\varphi(x)\varphi(x)} u = \lambda u$ is satisfied if the $n$ projected equations are satisfied: $\forall i:\ \varphi(x_i)^T \Sigma_{\varphi(x)\varphi(x)} u = \lambda\, \varphi(x_i)^T u$, i.e. $\frac{1}{n}\, \varphi(x_i)^T \sum_j \varphi(x_j) \varphi(x_j)^T \sum_{k=1}^n \alpha_k \varphi(x_k) = \lambda\, \varphi(x_i)^T \sum_{k=1}^n \alpha_k \varphi(x_k)$

Kernel PCA
- The $n$ projected equations: $\forall i:\ \varphi(x_i)^T \Sigma_{\varphi(x)\varphi(x)} u = \lambda\, \varphi(x_i)^T u$
- Expanded: $\frac{1}{n} \sum_{j,k} \alpha_k\, \varphi(x_i)^T \varphi(x_j)\, \varphi(x_j)^T \varphi(x_k) = \lambda \sum_{k=1}^n \alpha_k\, \varphi(x_i)^T \varphi(x_k)$
- In matrix form: $K^2 \alpha = n \lambda K \alpha$, hence $K \alpha = n \lambda \alpha$, with $K = \varphi(X) \varphi(X)^T$.

Kernel PCA
- This results in the eigenvalue problem $K \alpha = n \lambda \alpha$.
- Centering the data in feature space: $K^c_{ij} = \left(\varphi(x_i) - \frac{1}{n} \sum_k \varphi(x_k)\right)^T \left(\varphi(x_j) - \frac{1}{n} \sum_k \varphi(x_k)\right) = K_{ij} - \bar{k}_i - \bar{k}_j + \bar{k}$, with $\bar{k}_i = \frac{1}{n} \sum_k K_{ik}$ and $\bar{k} = \frac{1}{n^2} \sum_{j,k} K_{jk}$.

Kernel-PCA Algorithm
- Kernel PCA finds the dataset's principal components in an implicitly defined feature space.
- Algorithm (a sketch follows below):
  1. Compute the kernel matrix $K$: $K_{ij} = \kappa(x_i, x_j)$
  2. Center the kernel matrix: $K \leftarrow K - \frac{1}{n} \mathbf{1}\mathbf{1}^T K - \frac{1}{n} K \mathbf{1}\mathbf{1}^T + \frac{1}{n^2} \left(\mathbf{1}^T K \mathbf{1}\right) \mathbf{1}\mathbf{1}^T$
  3. Find its eigenvectors and eigenvalues: $(U, \Lambda) = \mathrm{eig}(K)$
  4. Find the dual vectors: $\alpha_k = \lambda_k^{-\frac{1}{2}} u_k$
  5. Project the data onto the subspace: $x_j \mapsto \left( \sum_{i=1}^n \alpha_{k,i} K_{ij} \right)_{k=1}^d = \left( \alpha_k^T K_{\cdot, j} \right)_{k=1}^d$
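A compact NumPy sketch of these steps, using the RBF kernel from above (function name and the choice of `gamma` are illustrative; the small clipping constant guards against numerically zero eigenvalues and is not part of the lecture):

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Return the d-dimensional kernel-PCA projection of the rows of X."""
    n = len(X)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq_dists)                          # 1. kernel matrix
    one = np.ones((n, n)) / n
    K = K - one @ K - K @ one + one @ K @ one              # 2. centering
    eigvals, U = np.linalg.eigh(K)                         # 3. eigendecomposition (ascending)
    eigvals, U = eigvals[::-1][:d], U[:, ::-1][:, :d]      #    keep the d largest
    alphas = U / np.sqrt(np.maximum(eigvals, 1e-12))       # 4. dual vectors alpha_k = u_k / sqrt(lambda_k)
    return K @ alphas                                      # 5. projections (n x d)
```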

Kernel PCA: Ring Data Example
- Kernel PCA (with an RBF kernel) does capture the data's structure; the resulting projections separate the two rings.

Overview
- Principal Component Analysis (PCA)
- Kernel PCA
- Fisher Linear Discriminant Analysis
- t-SNE

Fisher Discriminant Analysis (FDA)
- The subspace induced by PCA maximally captures the variance of all data.
- This is not the correct criterion for classification.
- [Figure: original space vs. PCA subspace (projection onto $u_{PCA}$), where $\Sigma u_{PCA} = \lambda_{PCA} u_{PCA}$.]

Fisher Discriminant Analysis (FDA)
- Optimization criterion of PCA: maximize the data's variance in the subspace: $\max_{u_i} u_i^T \Sigma u_i$, where $u_i^T u_i = 1$ and $u_i \perp u_j$.
- Optimization criterion of FDA: maximize the between-class variance and minimize the within-class variance within the subspace: $\max_u \frac{u^T \Sigma_b u}{u^T \Sigma_w u}$, where $\Sigma_w = \Sigma_{+1} + \Sigma_{-1}$ (sum of the per-class covariances) and $\Sigma_b = \left(\bar{x}_{+1} - \bar{x}_{-1}\right)\left(\bar{x}_{+1} - \bar{x}_{-1}\right)^T$.

Fisher Discriminant Analysis (FDA)
- Optimization criterion of FDA for $k$ classes: maximize the between-class variance and minimize the within-class variance within the subspace: $\max_u \frac{u^T \Sigma_b u}{u^T \Sigma_w u}$, where $\Sigma_w = \Sigma_1 + \dots + \Sigma_k$ and $\Sigma_b = \sum_{i=1}^k n_i \left(\bar{x}_i - \bar{x}\right)\left(\bar{x}_i - \bar{x}\right)^T$, with $n_i$ the number of instances per class.
- This leads to the generalized eigenvalue problem $\Sigma_b u = \lambda \Sigma_w u$.
- The generalized eigenvalue problem has $k - 1$ different solutions (a sketch follows below).
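A minimal sketch of multi-class FDA along these lines, assuming NumPy/SciPy; `scipy.linalg.eigh(a, b)` solves the generalized symmetric eigenvalue problem. The small ridge term added to $\Sigma_w$ is only for numerical stability and is not part of the lecture:

```python
import numpy as np
from scipy.linalg import eigh

def fda(X, y, d=None):
    """Return up to k-1 Fisher directions for data X (rows) with class labels y."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    m = X.shape[1]
    Sw, Sb = np.zeros((m, m)), np.zeros((m, m))
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += np.cov(Xc, rowvar=False, bias=True)      # within-class: sum of per-class covariances
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * diff @ diff.T                  # between-class: n_i (mean_i - mean)(...)^T
    # generalized eigenvalue problem Sb u = lambda Sw u
    eigvals, U = eigh(Sb, Sw + 1e-6 * np.eye(m))
    d = d or len(classes) - 1
    return U[:, ::-1][:, :d]                           # directions with the largest eigenvalues
```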

Fisher Discriminant Analysis (FDA)
- The subspace induced by PCA maximally captures the variance of all data.
- This is not the correct criterion for classification.
- [Figure: original space, PCA subspace, and Fisher subspace; $\Sigma u_{PCA} = \lambda_{PCA} u_{PCA}$ versus $\Sigma_b u_{FIS} = \lambda_{FIS} \Sigma_w u_{FIS}$.]

Overview
- Principal Component Analysis (PCA)
- Kernel PCA
- Fisher Linear Discriminant Analysis
- t-SNE

t-SNE
- It is impossible to preserve all distances when projecting data into a lower-dimensional space.
- PCA: preserve maximum variance. Variance is squared distance, so the sum is dominated by instances that are far apart. Instances that are far apart from each other shall remain equally far apart in the projected space.
- Idea of t-SNE: preserve the local neighborhood. Instances that are close to each other shall remain close in the projected space; instances that are far apart may be moved further apart by the projection.

2D PCA for MNIST Handwritten Digits
- PCA is poor at preserving closeness between similar bitmaps.

Local Neighborhood in the Original Space
- Probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked according to a Gaussian distribution centered at $x_i$: $p_{j|i} = \frac{\exp\left(-\frac{\|x_i - x_j\|^2}{2 \sigma_i^2}\right)}{\sum_{k \neq i} \exp\left(-\frac{\|x_i - x_k\|^2}{2 \sigma_i^2}\right)}$
- Set each $\sigma_i$ such that the conditional distribution has a fixed entropy.
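A small NumPy sketch of these conditional probabilities, using one shared bandwidth for simplicity; t-SNE instead tunes each $\sigma_i$ so that the row has a fixed entropy (perplexity), which is omitted here:

```python
import numpy as np

def conditional_p(X, sigma):
    """p_{j|i} for all pairs of rows of X, with one shared sigma."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dists / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)           # an instance never picks itself
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)     # normalize each row i over j != i
```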

Distance in the Projected Space
- Probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked according to a Student's t-distribution centered at $x_i$: $q_{j|i} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \|y_i - y_k\|^2\right)^{-1}}$
- The Student's t-distribution has heavier tails: very large distances are more likely than under a Gaussian. Moving far-apart instances further apart incurs less penalty.

t-SNE: Optimization Criterion
- Move instances around in the projected space to minimize the Kullback-Leibler divergence: $\mathrm{KL}(p \,\|\, q) = \sum_i \sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$
- If $p_{j|i}$ is large but $q_{j|i}$ is small: large penalty.
- If $q_{j|i}$ is large but $p_{j|i}$ is small: smaller penalty.
- Hence, the local neighborhood structure of the data is preserved.

t-SNE: Optimization
- Move instances around in the projected space to minimize the Kullback-Leibler divergence.
- Gradient for a projected instance $y_i$: $\frac{\partial\, \mathrm{KL}(p \,\|\, q)}{\partial y_i} = 4 \sum_{j \neq i} \left(p_{j|i} - q_{j|i}\right) \left(1 + \|y_i - y_j\|^2\right)^{-1} \left(y_i - y_j\right)$
- Implementation for large samples: build a quadtree over the data and approximate the $p_{j|i}$ and $q_{j|i}$ of instances in distinct branches by the distances between their centers of mass.
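A bare-bones gradient-descent sketch of this update, using the conditional probabilities exactly as written on the slides. The published t-SNE additionally symmetrizes $p$, tunes $\sigma_i$ by perplexity, and uses momentum and early exaggeration; learning rate and iteration count here are illustrative:

```python
import numpy as np

def tsne_sketch(P, n_iter=500, lr=10.0, d=2, seed=0):
    """Minimize KL(p || q) by plain gradient descent; P is the matrix of p_{j|i}."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(len(P), d))         # random low-dimensional initialization
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]             # y_i - y_j for all pairs
        inv = 1.0 / (1.0 + (diff ** 2).sum(axis=-1))     # (1 + ||y_i - y_j||^2)^{-1}
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum(axis=1, keepdims=True)         # q_{j|i}, normalized per row
        grad = 4.0 * ((P - Q) * inv)[:, :, None] * diff  # the slide's gradient, per pair
        Y -= lr * grad.sum(axis=1)                       # sum over j, then gradient step
    return Y

# Usage (with conditional_p from the earlier sketch): Y = tsne_sketch(conditional_p(X, sigma=1.0))
```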

2D t-SNE for MNIST Handwritten Digits
- Local similarities are preserved better.

Summary
- PCA constructs a lower-dimensional space that preserves most of the variance.
- Kernel PCA works on the kernel matrix; it is a good choice when there are fewer instances than features.
- Fisher linear discriminant analysis maximizes the between-class and minimizes the within-class variance.
- t-SNE finds a projection that preserves local neighborhood relations.