Universität Potsdam
Institut für Informatik, Lehrstuhl Maschinelles Lernen

PCA
Tobias Scheffer
Overview
- Principal Component Analysis (PCA)
- Kernel PCA
- Fisher Linear Discriminant Analysis
- t-SNE
PCA: Motivation
- Data compression
- Preprocessing (feature selection / noisy features)
- Data visualization
PCA: Example
- Representation of digits as an m × m pixel matrix.
- The actual number of degrees of freedom is significantly smaller, because many features
  - are meaningless, or
  - are composites of several pixels.
- Goal: reduce to a d-dimensional subspace.
PCA: Example
- Representation of faces as an m × m pixel matrix.
- The actual number of degrees of freedom is significantly smaller, because many combinations of pixels cannot occur in faces.
- Goal: reduce to a d-dimensional subspace.
PCA: Projection
- A projection is an idempotent linear transformation.
- Center point: $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$
- Covariance: $\Sigma = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$
(Figure: data points $x_i$, a direction $u$, and the projected data points $\mathrm{proj}_u(x_i) = (u^T x_i)\, u$.)
PCA: Projection
- A projection is an idempotent linear transformation.
- Let $u_1 \in \mathbb{R}^m$ with $u_1^T u_1 = 1$. Then $y_1(x) = u_1^T x$ constitutes a projection onto a one-dimensional subspace.
- For the data in the projected space, it follows that:
  - Center (mean): $\bar{y}_1 = u_1^T \bar{x}$
  - Variance: $\frac{1}{n} \sum_{i=1}^n \left(u_1^T x_i - u_1^T \bar{x}\right)^2 = u_1^T \Sigma\, u_1$
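A quick numerical check of these two identities (a minimal NumPy sketch added here, not part of the original slides; the data and the direction are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([2.0, 1.0, 0.3])  # toy data, n x m
u = np.array([1.0, 1.0, 0.0])
u /= np.linalg.norm(u)                      # normalize so that u^T u = 1

y = X @ u                                   # projections y_i = u^T x_i
x_bar = X.mean(axis=0)
Sigma = (X - x_bar).T @ (X - x_bar) / len(X)

print(np.isclose(y.mean(), u @ x_bar))      # projected mean equals u^T x_bar
print(np.isclose(y.var(), u @ Sigma @ u))   # projected variance equals u^T Sigma u
```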
PCA: Problem Setting
Given: data $X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix}$
Find a matrix $U = (u_1, \dots, u_d)$ such that
- the vectors $u_i$ form an orthonormal basis,
- vector $u_1$ preserves maximal variance of the data: $\max_{u_1 : \|u_1\| = 1} u_1^T \left(\frac{1}{n} X^T X\right) u_1$,
- each vector $u_i$ preserves maximal residual variance: $\max_{u_i : \|u_i\| = 1,\, u_i \perp u_1, \dots, u_i \perp u_{i-1}} u_i^T \left(\frac{1}{n} X^T X\right) u_i$.
PCA: Problem Setting
Given: data $X$ as above; find $U = (u_1, \dots, u_d)$ such that
- the value $u_k^T x_i$ is the projection of $x_i$ onto dimension $u_k$,
- the vector $U^T x_i$ is the projection of $x_i$ onto the coordinates $U$,
- the matrix $Y = XU$ is the projection of $X$ onto the coordinates $U$:
$Y = XU = \begin{pmatrix} x_1^T u_1 & \cdots & x_1^T u_d \\ \vdots & & \vdots \\ x_n^T u_1 & \cdots & x_n^T u_d \end{pmatrix}$
PCA: Assumption
To simplify the notation, assume centered data: $\bar{x} = 0$.
This can be achieved by subtracting the mean value: $x_i^c = x_i - \bar{x}$.
PCA: Direction of Maximum Variance
Find the direction $u_1$ that maximizes the projected variance.
Instances $x \sim p(X)$ (assume mean $\bar{x} = 0$). The projected variance onto a (normalized) $u_1$ is
$E\left[\mathrm{proj}_{u_1}(x)^2\right] = E\left[u_1^T x x^T u_1\right] = u_1^T \underbrace{E\left[x x^T\right]}_{\Sigma_{xx}} u_1$
Covariance matrix:
$E\left[x x^T\right] \approx \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} x_{i1} \\ \vdots \\ x_{im} \end{pmatrix} \begin{pmatrix} x_{i1} & \cdots & x_{im} \end{pmatrix} = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} x_{i1}^2 & \cdots & x_{i1} x_{im} \\ \vdots & & \vdots \\ x_{im} x_{i1} & \cdots & x_{im}^2 \end{pmatrix}$
PCA: Direction of Maximum Variance
Find the direction $u_1$ that maximizes the projected variance.
Instances $x \sim p(X)$ (assume mean $\bar{x} = 0$). The projected variance onto a (normalized) $u_1$ is
$E\left[\mathrm{proj}_{u_1}(x)^2\right] = E\left[u_1^T x x^T u_1\right] = u_1^T \underbrace{E\left[x x^T\right]}_{\Sigma_{xx}} u_1$
The empirical covariance matrix (of centered data) is $\Sigma_{xx} = \frac{1}{n} X^T X$.
How can we find the direction $u_1$ that maximizes $u_1^T \Sigma_{xx} u_1$? How can we kernelize it?
PCA: Optimization Problem
Solution for $u_1$: maximize the variance of the projected data:
$\max_{u_1} u_1^T \Sigma_{xx} u_1$, such that $u_1^T u_1 = 1$
Lagrangian: $u_1^T \Sigma_{xx} u_1 + \lambda_1 \left(1 - u_1^T u_1\right)$
PCA: Optimization Problem
Solution for $u_1$: maximize the variance of the projected data:
$\max_{u_1} u_1^T \Sigma_{xx} u_1$, such that $u_1^T u_1 = 1$
Lagrangian: $u_1^T \Sigma_{xx} u_1 + \lambda_1 \left(1 - u_1^T u_1\right)$
Taking its derivative and setting it to 0: $\Sigma_{xx} u_1 = \lambda_1 u_1$
The solution $u_1$ must be an eigenvector of $\Sigma_{xx}$.
Variance of the projected data: $u_1^T \Sigma_{xx} u_1 = u_1^T \lambda_1 u_1 = \lambda_1$
The solution is the eigenvector $u_1$ of $\Sigma_{xx}$ with the greatest eigenvalue $\lambda_1$, called the 1st principal component.
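To illustrate this result (a sketch added here, using made-up toy data with anisotropic variance): the leading eigenvector of $\Sigma_{xx}$ attains the projected variance $\lambda_1$, and no other unit vector exceeds it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])  # toy data
X -= X.mean(axis=0)                          # center
Sigma = X.T @ X / len(X)

eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
u1 = eigvecs[:, -1]                          # 1st principal component
print(u1 @ Sigma @ u1, eigvals[-1])          # projected variance equals lambda_1

v = rng.normal(size=4)
v /= np.linalg.norm(v)                       # arbitrary unit vector
assert v @ Sigma @ v <= eigvals[-1] + 1e-12  # cannot beat lambda_1
```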
PCA: Optimization Problem
Solution for $u_i$: maximize the variance of the projected data:
$\max_{u_i} u_i^T \Sigma_{xx} u_i$, such that $u_i^T u_i = 1$ and $u_i \perp u_1, \dots, u_i \perp u_{i-1}$
Lagrangian (norm constraint): $u_i^T \Sigma_{xx} u_i + \lambda_i \left(1 - u_i^T u_i\right)$
Taking its derivative and setting it to 0: $\Sigma_{xx} u_i = \lambda_i u_i$
The solution $u_i$ must be an eigenvector of $\Sigma_{xx}$.
To maximize the variance of the projected data: $u_i^T \Sigma_{xx} u_i = u_i^T \lambda_i u_i = \lambda_i$
To ensure that the $u_i$ are orthogonal: $u_i$ is the eigenvector with the next-largest eigenvalue $\lambda_i \le \lambda_{i-1}$.
PCA: Optimization Problem
The eigenvector decomposition implies $\Sigma_{xx} = U \Lambda U^T$ with
$\Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_m \end{pmatrix}$
However, if $U_{1:d}$ contains only the first $d$ eigenvectors, then $Y = X U_{1:d}$ retains only a fraction of the variance:
$\frac{\sum_{i=1}^d \lambda_i}{\mathrm{tr}(\Sigma_{xx})}$
Choose $d$ smaller than $m$ but large enough to cover most of the variance.
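One way to pick $d$ in practice (a minimal sketch of my own; the 95% threshold is an arbitrary example, not from the slides):

```python
import numpy as np

def choose_d(Sigma, coverage=0.95):
    """Smallest d whose top eigenvalues cover the given fraction of tr(Sigma)."""
    eigvals = np.linalg.eigvalsh(Sigma)[::-1]        # descending eigenvalues
    fraction = np.cumsum(eigvals) / np.trace(Sigma)  # covered variance per d
    return int(np.searchsorted(fraction, coverage)) + 1
```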
PCA
Projection of $x$ to the eigenspace:
$y_1(x) = u_1^T x$, and $y(x) = U^T x$ with $U = (u_1, \dots, u_d)$; for the whole dataset, $Y = XU$.
The eigenvector with the largest eigenvalue is the 1st principal component.
The remaining principal components are orthogonal directions that maximize the residual variance.
The $d$ principal components are the eigenvectors of the $d$ largest eigenvalues.
PCA: Reverse Projection
Observation: the $u_j$ form a basis of $\mathbb{R}^m$, and the $y_j(x)$ are the coordinates of $x$ in that basis.
Data $x_i$ can thus be reconstructed in that basis:
$x_i = \sum_{j=1}^m \left(x_i^T u_j\right) u_j$, or $X = X U U^T$
If the data lies (mostly) in the $d$-dimensional principal subspace, we can also reconstruct the data there:
$x_i \approx \sum_{j=1}^d \left(x_i^T u_j\right) u_j$, or $X \approx X U_{1:d} U_{1:d}^T$
where $U_{1:d}$ is the matrix of the first $d$ eigenvectors.
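A reconstruction sketch under these definitions (the function name and the re-centering step are my additions; for uncentered data the mean has to be added back):

```python
import numpy as np

def reconstruct(X, d):
    """Project X onto the first d principal components and map back."""
    mu = X.mean(axis=0)
    Xc = X - mu                                # center the data
    Sigma = Xc.T @ Xc / len(Xc)
    _, U = np.linalg.eigh(Sigma)               # ascending eigenvalues
    U_d = U[:, ::-1][:, :d]                    # first d eigenvectors
    return mu + Xc @ U_d @ U_d.T               # x_hat = mu + sum_j (x^T u_j) u_j
```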
Reverse Projection: Example
Morphace (Universität Basel): a 3D face model of 200 persons (150,000 features).
PCA with 199 principal components.
PCA: Algorithm
PCA finds the dataset's principal components, which maximize the projected variance.
Algorithm:
1. Compute the data's mean: $\mu = \frac{1}{n} \sum_{i=1}^n x_i$
2. Compute the data's covariance: $\Sigma_{xx} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T$
3. Find the principal axes: $U = \mathrm{eigenvectors}(\Sigma_{xx})$
4. Project the data onto the first $d$ eigenvectors: $x_i \mapsto U_{1:d}^T (x_i - \mu)$
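The four steps translate directly into NumPy (a sketch; the function signature is my own, not from the slides):

```python
import numpy as np

def pca(X, d):
    """Rows of X are instances; returns projections, axes, and the mean."""
    mu = X.mean(axis=0)                        # 1. mean
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)                 # 2. covariance
    _, U = np.linalg.eigh(Sigma)               # 3. eigenvectors, ascending order
    U = U[:, ::-1]                             #    reorder to descending eigenvalues
    return Xc @ U[:, :d], U[:, :d], mu         # 4. project onto 1st d eigenvectors
```

Usage would be, for example, `Y, U_d, mu = pca(X, 2)` for a 2D visualization.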
Difference Between PCA and Autoencoder
- PCA: linear mapping $y(x) = U^T x$ from $x$ to $y$.
- Autoencoder: linear mapping from $x$ to $h$, then a nonlinear activation function $y = \sigma(h)$.
- An autoencoder with squared loss and linear activation function spans the same subspace as PCA.
- Stacked autoencoder: more nonlinearity, more complex mappings.
- Kernel PCA: linear mapping in a feature space $\Phi$.
Overview
- Principal Component Analysis (PCA)
- Kernel PCA
- Fisher Linear Discriminant Analysis
- t-SNE
Kernel PCA
- PCA can only capture linear subspaces.
- More complex features can capture non-linearity.
- We want to use PCA in high-dimensional feature spaces.
- Requirement: the data interact only through inner products.
Kernel PCA: Example
(Figure: a dataset consisting of two concentric rings.)
Kernel PCA: Example
PCA fails to capture the data's two-ring structure: the rings are not separated in the first two components.
Kernel PCA: Kernel Recap
- Linear classifiers: often adequate, but not always.
- Idea: the data are implicitly mapped to another space, in which they are linearly classifiable.
- Feature mapping: $x \mapsto \phi(x)$
- Associated kernel: $\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
- Kernel = inner product = similarity of examples.
(Figure: data that are not linearly separable become separable after the mapping $\phi$.)
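As a concrete instance (my sketch; the RBF kernel is the one used in the ring example below, and `gamma` is an assumed bandwidth parameter):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """kappa(x, x') = exp(-gamma ||x - x'||^2) = phi(x)^T phi(x')."""
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)
```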
Kernel PCA
Covariance of centered data: $\Sigma_{xx} = \frac{1}{n} \sum_i x_i x_i^T$; eigenvectors: $\Sigma_{xx} u = \lambda u$.
In feature space (centered): $\Sigma_{\phi(x)\phi(x)} = \frac{1}{n} \sum_i \phi(x_i) \phi(x_i)^T$
Eigenvectors: $\Sigma_{\phi(x)\phi(x)} u = \lambda u$, i.e. $\frac{1}{n} \sum_i \phi(x_i) \phi(x_i)^T u = \lambda u$
All solutions live in the span of $\phi(x_1), \dots, \phi(x_n)$.
Kernel PCA
All solutions live in the span of $\phi(x_1), \dots, \phi(x_n)$.
Hence, every eigenvector $u$ must be a linear combination of $\phi(x_1), \dots, \phi(x_n)$:
$\exists \alpha: u = \sum_{i=1}^n \alpha_i \phi(x_i)$
Hence, $\Sigma_{\phi(x)\phi(x)} u = \lambda u$ is satisfied if the $n$ projected equations are satisfied:
$\forall i: \phi(x_i)^T \Sigma_{\phi(x)\phi(x)} u = \lambda\, \phi(x_i)^T u$
$\frac{1}{n}\, \phi(x_i)^T \sum_j \phi(x_j) \phi(x_j)^T \sum_{k=1}^n \alpha_k \phi(x_k) = \lambda\, \phi(x_i)^T \sum_{k=1}^n \alpha_k \phi(x_k)$
Kernel PCA
The $n$ projected equations:
$\forall i: \phi(x_i)^T \Sigma_{\phi(x)\phi(x)} u = \lambda\, \phi(x_i)^T u$
$\frac{1}{n} \sum_{j,k} \alpha_k\, \phi(x_i)^T \phi(x_j)\, \phi(x_j)^T \phi(x_k) = \lambda \sum_{k=1}^n \alpha_k\, \phi(x_i)^T \phi(x_k)$
In matrix form, with $K = \phi(X) \phi(X)^T$:
$K^2 \alpha = n \lambda K \alpha \;\Rightarrow\; K \alpha = n \lambda \alpha$
Kernel PCA
Results in the eigenvalue problem $K \alpha = n \lambda \alpha$.
Centering the data in feature space:
$K^c_{ij} = \left(\phi(x_i) - \frac{1}{n} \sum_k \phi(x_k)\right)^T \left(\phi(x_j) - \frac{1}{n} \sum_k \phi(x_k)\right) = K_{ij} - k_i - k_j + \bar{k}$
with $k_i = \frac{1}{n} \sum_k K_{ik}$ and $\bar{k} = \frac{1}{n^2} \sum_{j,k} K_{jk}$.
Kernel PCA: Algorithm
Kernel PCA finds the dataset's principal components in an implicitly defined feature space.
Algorithm:
1. Compute the kernel matrix $K$: $K_{ij} = \kappa(x_i, x_j)$
2. Center the kernel matrix: $K^c = K - \frac{1}{n} \mathbf{1}\mathbf{1}^T K - \frac{1}{n} K \mathbf{1}\mathbf{1}^T + \frac{\mathbf{1}^T K \mathbf{1}}{n^2} \mathbf{1}\mathbf{1}^T$
3. Find its eigenvalues and eigenvectors: $(\lambda_k, u_k) = \mathrm{eig}(K^c)$
4. Find the dual vectors: $\alpha_k = \lambda_k^{-1/2} u_k$
5. Project the data onto the subspace: $x_j \mapsto \left(\sum_{i=1}^n \alpha_{k,i} K_{ij}\right)_{k=1}^d = \left(\alpha_k^T K_{\cdot,j}\right)_{k=1}^d$
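Putting the five steps together for the training data (a sketch under the assumptions of an RBF kernel and strictly positive leading eigenvalues; function and variable names are mine):

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Project the training data onto d kernel principal components."""
    n = len(X)
    sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq)                              # 1. RBF kernel matrix
    one_n = np.ones((n, n)) / n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # 2. centering
    eigvals, U = np.linalg.eigh(Kc)                      # 3. ascending eigenvalues
    eigvals, U = eigvals[::-1][:d], U[:, ::-1][:, :d]    #    keep the d largest
    alphas = U / np.sqrt(eigvals)                        # 4. dual vectors alpha_k
    return Kc @ alphas                                   # 5. projected training data
```

For projecting a new point, step 5 would use a centered kernel row between the new point and the training data instead of `Kc`.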
Kernel PCA: Ring Data Example
Kernel PCA (with an RBF kernel) does capture the data's structure: the resulting projections separate the two rings.
Overview
- Principal Component Analysis (PCA)
- Kernel PCA
- Fisher Linear Discriminant Analysis
- t-SNE
Fisher Discriminant Analysis (FDA)
The subspace induced by PCA maximally captures the variance of all data. This is not the correct criterion for classification.
(Figure: original space vs. PCA subspace; the projection $X u_{PCA}$, with $\Sigma u_{PCA} = \lambda_{PCA} u_{PCA}$, mixes the two classes.)
Fisher Discriminant Analysis (FDA)
Optimization criterion of PCA: maximize the data's variance in the subspace:
$\max_{u_i} u_i^T \Sigma u_i$, where $u_i^T u_i = 1$ and $u_i \perp u_j$
Optimization criterion of FDA: maximize the between-class variance and minimize the within-class variance within the subspace:
$\max_u \frac{u^T \Sigma_b u}{u^T \Sigma_w u}$, where
$\Sigma_w = \Sigma_{+1} + \Sigma_{-1}$ (variance per class)
$\Sigma_b = (\bar{x}_{+1} - \bar{x}_{-1})(\bar{x}_{+1} - \bar{x}_{-1})^T$
Fisher Discriminant Analysis (FDA)
Optimization criterion of FDA for $k$ classes: maximize the between-class variance and minimize the within-class variance within the subspace:
$\max_u \frac{u^T \Sigma_b u}{u^T \Sigma_w u}$, where
$\Sigma_w = \Sigma_1 + \dots + \Sigma_k$
$\Sigma_b = \sum_{i=1}^k n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$ ($n_i$: number of instances per class)
This leads to the generalized eigenvalue problem $\Sigma_b u = \lambda \Sigma_w u$, which has $k - 1$ different solutions.
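A two-class sketch of solving this generalized eigenvalue problem (my own addition; it assumes SciPy is available and that $\Sigma_w$ is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, y):
    """Fisher direction for labels y in {-1, +1}; rows of X are instances."""
    Xp, Xm = X[y == +1], X[y == -1]
    Sw = np.cov(Xp.T, bias=True) + np.cov(Xm.T, bias=True)  # within-class
    diff = Xp.mean(axis=0) - Xm.mean(axis=0)
    Sb = np.outer(diff, diff)                               # between-class
    _, U = eigh(Sb, Sw)             # generalized problem Sb u = lambda Sw u
    return U[:, -1]                 # eigenvector with the largest eigenvalue
```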
Fisher Discriminant Analysis (FDA)
The subspace induced by PCA maximally captures the variance of all data. This is not the correct criterion for classification.
(Figure: original space vs. Fisher subspace; the projection obtained from $\Sigma_b u_{FIS} = \lambda_{FIS} \Sigma_w u_{FIS}$ separates the two classes.)
Overview
- Principal Component Analysis (PCA)
- Kernel PCA
- Fisher Linear Discriminant Analysis
- t-SNE
t-sne Impossible to preserve all distances when projecting data into lower-dimensional space. PCA: Preserve maximum variance. Variance is squared distance. Sum is dominated by instances that are far apart. Instances that are far apart from each other shall remain as far apart in the projected space. Idea of t-sne: Preserve local neighborhood. Instances that are close to each other shall remain close in the projected space. Instances that are far apart may be moved further apart by the projection. 45
2D PCA for MNIST Handwritten Digits
PCA is poor at preserving closeness between similar bitmaps.
Local Neighborhood in Original Space
Probability that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked according to a Gaussian distribution centered at $x_i$:
$p_{j|i} = \frac{\exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right)}{\sum_{k \ne i} \exp\left(-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}\right)}$
Set each $\sigma_i$ such that the conditional distribution has a fixed entropy (i.e., a fixed perplexity).
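Calibrating $\sigma_i$ is commonly done by binary search over $\beta_i = 1/(2\sigma_i^2)$ until the entropy matches $\log(\text{perplexity})$; a minimal sketch (the names and the geometric bracketing are my choices, not from the slides):

```python
import numpy as np

def conditional_p(sq_dists_i, perplexity=30.0, tol=1e-5):
    """p_{.|i} for one point; sq_dists_i holds ||x_i - x_j||^2 for all j != i."""
    target = np.log(perplexity)                 # desired entropy
    lo, hi = 1e-20, 1e20                        # bracket for beta = 1/(2 sigma_i^2)
    for _ in range(100):
        beta = np.sqrt(lo * hi)                 # geometric midpoint
        p = np.exp(-sq_dists_i * beta)
        p /= p.sum()
        entropy = -(p * np.log(p + 1e-12)).sum()
        if abs(entropy - target) < tol:
            break
        if entropy > target:
            lo = beta                           # too flat: decrease sigma_i
        else:
            hi = beta                           # too peaked: increase sigma_i
    return p
```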
Distance in Projected Space
Probability that $y_i$ would pick $y_j$ as its neighbor if neighbors were picked according to a Student's t-distribution centered at $y_i$:
$q_{j|i} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \ne i} \left(1 + \|y_i - y_k\|^2\right)^{-1}}$
Student's t-distribution has heavier tails: very large distances are more likely than under a Gaussian. Moving far-apart instances further apart incurs less penalty.
t-sne: Optimization Criterion Move instances around in projected space to minimize Kullback-Leibler divergence: KL(p q = p j i log p j i j i q j i If p j i is large but q j i is small: large penalty. If q j i is large but p j i is small: smaller penalty. i Hence, preserves local neighborhood structure of the data. 49
t-sne: Optimization Move instances around in projected space to minimize Kullback-Leibler divergence Gradient for projected instance y i : KL(p q y i = 4 (pj i qj i) 1 + yi yj 2 1 (y i y j ) j i Implementation for large samples: Build quadtree over data Approximate p j i and q j i of instances in distinct branches by distances between centers of mass. 50
2D t-sne for MNIST Handwritten Digits Local similarities are preserved better. 51
Summary
- PCA constructs a lower-dimensional space that preserves most of the variance.
- Kernel PCA works on the kernel matrix; useful when there are fewer instances than features.
- Fisher linear discriminant analysis maximizes the between-class variance and minimizes the within-class variance.
- t-SNE finds a projection that preserves local neighborhood relations.