Principal Component Analysis (PCA) for Sparse High-Dimensional Data

1 Principal Component Analysis (PCA) for Sparse High-Dimensional Data
Tapani Raiko, Alexander Ilin, and Juha Karhunen
Helsinki University of Technology, Finland
Adaptive Informatics Research Center

2 Principal Component Analysis
Data X consists of n d-dimensional vectors
Matrix X is decomposed into a product of smaller matrices such that the squared reconstruction error is minimized:
$$X \approx AS, \qquad C = \|X - AS\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{n} \Bigl( x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj} \Bigr)^2$$
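To make the cost C concrete, here is a minimal NumPy sketch for fully observed data. It assumes X is stored as a d x n array and uses a truncated SVD to obtain one optimal rank-c pair (A, S). The function names `pca_factorize` and `reconstruction_cost` are illustrative, not from the slides, and no mean is subtracted because the cost on the slide has no mean term.

```python
import numpy as np

def pca_factorize(X, c):
    """Rank-c factorization X ~ A @ S from a truncated SVD (X is d x n)."""
    U, sing, Vt = np.linalg.svd(X, full_matrices=False)
    A = U[:, :c] * sing[:c]   # d x c, columns scaled by singular values
    S = Vt[:c, :]             # c x n
    return A, S

def reconstruction_cost(X, A, S):
    """C = ||X - A S||_F^2, the squared reconstruction error."""
    return float(np.sum((X - A @ S) ** 2))

# Toy data (assumption): d = 5, n = 40, exactly rank 3.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 40))

A, S = pca_factorize(X, c=3)
cost_c3 = reconstruction_cost(X, A, S)      # numerically zero for rank-3 data

A2, S2 = pca_factorize(X, c=2)
cost_c2 = reconstruction_cost(X, A2, S2)    # equals the discarded squared singular values
```

With c equal to the true rank the cost vanishes up to rounding; with fewer components it equals the sum of the squared singular values that were dropped.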

5 Algorithms for PCA
Eigenvalue decomposition (standard approach)
  Compute the covariance matrix and its eigenvectors
EM algorithm
  Iterates between:
  $$A \leftarrow X S^T (S S^T)^{-1}, \qquad S \leftarrow (A^T A)^{-1} A^T X.$$
Minimization of cost C (Oja's subspace rule)
  $$A \leftarrow A + \gamma (X - AS) S^T, \qquad S \leftarrow S + \gamma A^T (X - AS).$$
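The two iterative schemes on this slide can be sketched in a few lines of NumPy for fully observed data. This is an illustration under assumptions: the names `em_pca` and `subspace_gradient_step`, the random initialization, and the iteration count are not from the slides, and linear systems are solved with `np.linalg.solve` instead of forming explicit inverses.

```python
import numpy as np

def em_pca(X, c, n_iter=50, seed=0):
    """Alternate the two updates S <- (A^T A)^{-1} A^T X and A <- X S^T (S S^T)^{-1}."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    A = rng.standard_normal((d, c))                # random start (assumption)
    for _ in range(n_iter):
        S = np.linalg.solve(A.T @ A, A.T @ X)      # c x n
        A = np.linalg.solve(S @ S.T, S @ X.T).T    # d x c, uses symmetry of S S^T
    return A, S

def subspace_gradient_step(X, A, S, gamma):
    """One simultaneous step A <- A + gamma (X - AS) S^T, S <- S + gamma A^T (X - AS)."""
    E = X - A @ S
    return A + gamma * E @ S.T, S + gamma * A.T @ E

# Toy data (assumption): exactly rank 3, so c = 3 can reach zero error.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 40))

A_em, S_em = em_pca(X, c=3)
cost_em = float(np.sum((X - A_em @ S_em) ** 2))

A0 = 0.1 * rng.standard_normal((5, 3))
S0 = 0.1 * rng.standard_normal((3, 40))
cost_before = float(np.sum((X - A0 @ S0) ** 2))
A1, S1 = subspace_gradient_step(X, A0, S0, gamma=1e-3)
cost_after = float(np.sum((X - A1 @ S1) ** 2))
```

The gradient rule needs a suitably small step size γ, whereas each EM update solves its least-squares subproblem exactly.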

6 PCA with Missing Values
[Figure: red and blue data points are reconstructed based on only one of the two dimensions]

7 Adapting the Algorithms for Missing Values
Iterative imputation
  Alternately 1) fill in missing values and 2) solve normal PCA with the standard approach
EM algorithm becomes computationally heavier
  $$S: \quad s_{:j} = (A_j^T A_j)^{-1} A_j^T X_{:j}, \qquad j = 1, \dots, n$$
  $$A: \quad A_{i:}^T = X_{i:}^T S_i^T (S_i S_i^T)^{-1}, \qquad i = 1, \dots, d$$
Minimization of cost C
  Easy to adapt: take error over observed values only
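The reason the EM updates become heavier is that every column of S and every row of A needs its own least-squares solve over the entries observed for it. The sketch below illustrates this, reading A_j as the rows of A whose entries are observed in column j and S_i as the columns of S observed in row i (an interpretation of the slide's notation). The boolean mask `M`, the function names, and the use of `np.linalg.lstsq` in place of the explicit inverse are assumptions made for the example.

```python
import numpy as np

def masked_cost(X, M, A, S):
    """Squared error summed over observed entries only (M is True where observed)."""
    E = np.where(M, X - A @ S, 0.0)
    return float(np.sum(E ** 2))

def update_S_observed(X, M, A):
    """Solve one small least-squares problem per column j, using observed rows only."""
    c, n = A.shape[1], X.shape[1]
    S = np.zeros((c, n))
    for j in range(n):
        rows = M[:, j]
        if not rows.any():          # nothing observed in this column: leave zeros
            continue
        S[:, j] = np.linalg.lstsq(A[rows, :], X[rows, j], rcond=None)[0]
    return S

def update_A_observed(X, M, S):
    """Solve one small least-squares problem per row i, using observed columns only."""
    d, c = X.shape[0], S.shape[0]
    A = np.zeros((d, c))
    for i in range(d):
        cols = M[i, :]
        if not cols.any():          # nothing observed in this row: leave zeros
            continue
        A[i, :] = np.linalg.lstsq(S[:, cols].T, X[i, cols], rcond=None)[0]
    return A

# Toy data (assumption): rank-2 matrix with roughly 70% of the entries observed.
rng = np.random.default_rng(1)
d, n, c = 8, 30, 2
X = rng.standard_normal((d, c)) @ rng.standard_normal((c, n))
M = rng.random((d, n)) < 0.7
A = rng.standard_normal((d, c))
costs = []
for _ in range(20):
    S = update_S_observed(X, M, A)
    A = update_A_observed(X, M, S)
    costs.append(masked_cost(X, M, A, S))
```

Each update minimizes the masked cost over its own block of parameters, so the recorded costs should not increase from one iteration to the next.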

8 Speeding up Gradient Descent
Newton's method is known to converge fast, but it requires computing the Hessian matrix, which is computationally too demanding in high-dimensional problems
We propose using only the diagonal part of the Hessian
We also include a control parameter α to interpolate between standard gradient descent (α = 0) and the diagonal Newton's method (α = 1)

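11 The cost function:
$$C = \sum_{(i,j) \in O} e_{ij}^2, \qquad \text{with} \quad e_{ij} = x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj}.$$
Its partial derivatives:
$$\frac{\partial C}{\partial a_{il}} = -2 \sum_{j \mid (i,j) \in O} e_{ij} s_{lj}, \qquad \frac{\partial C}{\partial s_{lj}} = -2 \sum_{i \mid (i,j) \in O} e_{ij} a_{il}.$$
Update rules:
$$a_{il} \leftarrow a_{il} - \gamma' \left( \frac{\partial^2 C}{\partial a_{il}^2} \right)^{-\alpha} \frac{\partial C}{\partial a_{il}} = a_{il} + \gamma \, \frac{\sum_{j \mid (i,j) \in O} e_{ij} s_{lj}}{\bigl( \sum_{j \mid (i,j) \in O} s_{lj}^2 \bigr)^{\alpha}},$$
$$s_{lj} \leftarrow s_{lj} - \gamma' \left( \frac{\partial^2 C}{\partial s_{lj}^2} \right)^{-\alpha} \frac{\partial C}{\partial s_{lj}} = s_{lj} + \gamma \, \frac{\sum_{i \mid (i,j) \in O} e_{ij} a_{il}}{\bigl( \sum_{i \mid (i,j) \in O} a_{il}^2 \bigr)^{\alpha}}.$$
Here O is the set of observed entries, and the constant factor is absorbed into the step size: γ = 2^{1-α} γ'.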
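The update rules translate almost directly into matrix operations. The following NumPy sketch is one possible vectorization, not the authors' implementation: the mask `M` (True where x_ij is observed), the function names, and the small constant `eps` that guards against division by zero for rows or columns without observations are all assumptions.

```python
import numpy as np

def observed_errors(X, M, A, S):
    """e_ij on observed entries, 0 elsewhere."""
    return np.where(M, X - A @ S, 0.0)

def speedup_step(X, M, A, S, gamma, alpha, eps=1e-12):
    """One simultaneous update of A and S; alpha=0 is plain gradient descent,
    alpha=1 scales each step by the inverse diagonal of the Hessian."""
    E = observed_errors(X, M, A, S)
    Mf = M.astype(float)
    num_A = E @ S.T                 # (i, l): sum over observed j of e_ij * s_lj
    den_A = Mf @ (S ** 2).T         # (i, l): sum over observed j of s_lj^2
    num_S = A.T @ E                 # (l, j): sum over observed i of e_ij * a_il
    den_S = (A ** 2).T @ Mf         # (l, j): sum over observed i of a_il^2
    A_new = A + gamma * num_A / (den_A + eps) ** alpha
    S_new = S + gamma * num_S / (den_S + eps) ** alpha
    return A_new, S_new

# Toy data (assumption): rank-2 matrix with roughly 60% of the entries observed.
rng = np.random.default_rng(2)
d, n, c = 6, 30, 2
X = rng.standard_normal((d, c)) @ rng.standard_normal((c, n))
M = rng.random((d, n)) < 0.6
A0 = rng.standard_normal((d, c))
S0 = rng.standard_normal((c, n))

A1, S1 = speedup_step(X, M, A0, S0, gamma=1e-3, alpha=1.0)
cost0 = float(np.sum(observed_errors(X, M, A0, S0) ** 2))
cost1 = float(np.sum(observed_errors(X, M, A1, S1) ** 2))
```

Only the masked products depend on the data, which is why the per-iteration cost grows with the number of observed values rather than with the full size of X when the mask is stored sparsely (the dense mask here is for brevity).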

12 Overfitting in Case of Sparse Data
[Figure: two panels, "Overfitted solution" and "Regularized solution"]

13 Regularization against Overfitting
Penalizing the use of large parameter values
Estimating the distribution of unknown parameters (Variational Bayesian learning)
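As an illustration of the first option, a quadratic penalty on A and S can be added to the cost over observed values. The slides do not give the exact form of the penalty, so the single weight `lam`, the squared-norm penalty, and the function names below are assumptions; the Variational Bayesian variant is not sketched here.

```python
import numpy as np

def regularized_cost(X, M, A, S, lam):
    """Observed squared error plus lam * (||A||_F^2 + ||S||_F^2)."""
    E = np.where(M, X - A @ S, 0.0)
    return float(np.sum(E ** 2) + lam * (np.sum(A ** 2) + np.sum(S ** 2)))

def regularized_gradient_step(X, M, A, S, gamma, lam):
    """Gradient step on the penalized cost (the factor 2 is absorbed into gamma)."""
    E = np.where(M, X - A @ S, 0.0)
    A_new = A + gamma * (E @ S.T - lam * A)   # data term pulls toward fit, penalty shrinks A
    S_new = S + gamma * (A.T @ E - lam * S)   # same for S
    return A_new, S_new

# Toy check (assumption): with a perfect fit, only the shrinkage term acts.
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 2))
S = rng.standard_normal((2, 10))
X = A @ S
M = np.ones(X.shape, dtype=bool)
A_new, S_new = regularized_gradient_step(X, M, A, S, gamma=0.1, lam=0.5)
```

When the reconstruction is already exact, the step simply scales A and S by (1 - γ·lam), which shows how the penalty discourages large parameter values.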

14 Experiments with Netflix Data
www.netflixprize.
Collaborative filtering task: predict people's preferences based on other people's preferences
d = 18 000 movies, n = 500 000 customers, N = [...] movie ratings from 1 to [...]
[...] % of the values are missing
Find c = 15 principal components

15 Computational Performance

Method          Complexity        Seconds/Iter   Hours to E_O = 0.85
Gradient        O(Nc + nc)
Speed-up        O(Nc + nc)
Natural Grad.   O(Nc + nc^2)
Imputation      O(nd^2)
EM              O(Nc^2 + nc^3)

Summary of the computational performance of the different methods.
N: # of ratings; c = 15: # of components; n: # of people; d = 18 000: # of movies

16 Error on Training Data
[Figure: error on training data against computation time in hours. Curves: Gradient, Speed-up, Natural Grad., Imputation, EM]

17 Error on Validation Data
[Figure: error on validation data against computation time in hours. Curves: Gradient, Speed-up, Natural Grad., Regularized, VB1, VB]

18 Summary
PCA with sparse data and high dimensionality has two problems that require attention:
  Standard algorithms are computationally inefficient
  An overfitted model does not generalize well to new data
We proposed solutions to both problems:
  Using gradient descent to minimize the cost, and speeding it up with an approximate Newton's method
  Regularizing the model by Variational Bayesian learning
