Principal Component Analysis (PCA) for Sparse High-Dimensional Data
Tapani Raiko
Helsinki University of Technology, Finland
Adaptive Informatics Research Center
The Data Explosion
We are facing an enormous challenge in the ever-increasing amount of data in electronic form.
First wave: text; second wave: real-world data.
Basically, any information that may have value will be made available, e.g., through the Web.
We need adaptive informatics, which adds intelligence at the access point.
Adaptive Informatics
A field of research where automated learning algorithms are used to discover informative concepts, components, and their mutual relations from large amounts of real-world data.
The goal is to understand the underlying phenomena, structures, and patterns buried in the large data sets, in order to make the information usable.
Retrieval of multimodal objects:
Proactive Information Retrieval
Principal Component Analysis
Data X consists of n d-dimensional vectors. The matrix X is decomposed into a product of smaller matrices such that the squared reconstruction error is minimized:
$X \approx AS, \qquad C = \|X - AS\|_F^2 = \sum_{i=1}^{d} \sum_{j=1}^{n} \Big( x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj} \Big)^2$
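To make the decomposition concrete, here is a minimal numpy sketch (my own illustration, not from the talk; shapes and values are made up) that computes the rank-c factorization minimizing the squared Frobenius error via the truncated SVD and evaluates the cost C:

```python
# Minimal PCA-as-factorization sketch (illustrative).
# X (d x n) is approximated by A (d x c) times S (c x n); the truncated SVD
# gives the rank-c factorization minimizing the squared Frobenius error.
import numpy as np

rng = np.random.default_rng(0)
d, n, c = 20, 100, 3
X = rng.standard_normal((d, c)) @ rng.standard_normal((c, n))  # near-rank-c data
X += 0.01 * rng.standard_normal((d, n))                        # small noise

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
A = U[:, :c] * sv[:c]            # d x c loadings
S = Vt[:c, :]                    # c x n components
C = np.sum((X - A @ S) ** 2)     # squared reconstruction error ||X - AS||_F^2
print(f"reconstruction error C = {C:.4f}")
```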
Algorithms for PCA
Eigenvalue decomposition (standard approach): compute the covariance matrix and its eigenvectors.
EM algorithm, which iterates between
$A \leftarrow X S^T (S S^T)^{-1}, \qquad S \leftarrow (A^T A)^{-1} A^T X.$
Minimization of the cost C (Oja's subspace rule):
$A \leftarrow A + \gamma (X - AS) S^T, \qquad S \leftarrow S + \gamma A^T (X - AS).$
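As a rough sketch of the EM iteration above (my own illustration for the fully observed case; the random initialization is an assumption):

```python
# Sketch of the EM iteration for PCA quoted above (fully observed data).
import numpy as np

def em_pca(X, c, n_iter=100, seed=0):
    d, n = X.shape
    A = np.random.default_rng(seed).standard_normal((d, c))  # random init (assumption)
    for _ in range(n_iter):
        S = np.linalg.solve(A.T @ A, A.T @ X)    # S <- (A^T A)^{-1} A^T X
        A = np.linalg.solve(S @ S.T, S @ X.T).T  # A <- X S^T (S S^T)^{-1}
    return A, S
```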
PCA with Missing Values
[Figure: red and blue data points reconstructed based on only one of the two dimensions]
Adapting the Algorithms for Missing Values
Iterative imputation: alternate between 1) filling in the missing values and 2) solving normal PCA with the standard approach.
The EM algorithm becomes computationally heavier:
$s_{:j} \leftarrow (A_j^T A_j)^{-1} A_j^T x_{:j}, \quad j = 1, \dots, n, \qquad a_{i:} \leftarrow x_{i:} S_i^T (S_i S_i^T)^{-1}, \quad i = 1, \dots, d,$
where $A_j$ and $S_i$ contain only the rows and columns corresponding to the observed entries of column j and row i, respectively.
Minimization of the cost C is easy to adapt: take the error over observed values only.
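A sketch of these heavier per-column and per-row updates (my own illustration; O is assumed to be a boolean mask of observed entries, and missing entries of X are never read):

```python
# EM-style updates with missing values: each column of S and each row of A
# solves its own least-squares problem over the observed entries only.
import numpy as np

def em_pca_missing(X, O, c, n_iter=50, seed=0):
    d, n = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, c))
    S = rng.standard_normal((c, n))
    for _ in range(n_iter):
        for j in range(n):   # s_:j from the observed rows of column j
            m = O[:, j]
            S[:, j] = np.linalg.lstsq(A[m], X[m, j], rcond=None)[0]
        for i in range(d):   # a_i: from the observed columns of row i
            m = O[i, :]
            A[i] = np.linalg.lstsq(S[:, m].T, X[i, m], rcond=None)[0]
    return A, S
```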
Speeding up Gradient Descent
Newton's method is known to converge fast, but it requires computing the Hessian matrix, which is computationally too demanding in high-dimensional problems.
We propose using only the diagonal part of the Hessian.
We also include a control parameter $\alpha$ to interpolate between standard gradient descent ($\alpha = 0$) and the diagonal Newton's method ($\alpha = 1$).
The cost function: $C = \sum_{(i,j) \in O} e_{ij}^2$, with $e_{ij} = x_{ij} - \sum_{k=1}^{c} a_{ik} s_{kj}$.
Its partial derivatives:
$\frac{\partial C}{\partial a_{il}} = -2 \sum_{j \mid (i,j) \in O} e_{ij} s_{lj}, \qquad \frac{\partial C}{\partial s_{lj}} = -2 \sum_{i \mid (i,j) \in O} e_{ij} a_{il}.$
Update rules (constant factors absorbed into the learning rate $\gamma$):
$a_{il} \leftarrow a_{il} - \gamma \Big( \frac{\partial^2 C}{\partial a_{il}^2} \Big)^{-\alpha} \frac{\partial C}{\partial a_{il}} = a_{il} + \gamma \, \frac{\sum_{j \mid (i,j) \in O} e_{ij} s_{lj}}{\big( \sum_{j \mid (i,j) \in O} s_{lj}^2 \big)^{\alpha}},$
$s_{lj} \leftarrow s_{lj} - \gamma \Big( \frac{\partial^2 C}{\partial s_{lj}^2} \Big)^{-\alpha} \frac{\partial C}{\partial s_{lj}} = s_{lj} + \gamma \, \frac{\sum_{i \mid (i,j) \in O} e_{ij} a_{il}}{\big( \sum_{i \mid (i,j) \in O} a_{il}^2 \big)^{\alpha}}.$
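One way to implement these updates in vectorized form (my own sketch; the values of gamma and alpha are illustrative, and the small eps guarding nearly empty rows or columns is my addition):

```python
# One step of the alpha-interpolated update: alpha = 0 is plain gradient
# descent, alpha = 1 the diagonal Newton step (constants absorbed into gamma).
import numpy as np

def speedup_step(X, O, A, S, gamma=0.01, alpha=0.6, eps=1e-12):
    Of = O.astype(float)             # observation mask as 0/1 weights
    E = np.where(O, X - A @ S, 0.0)  # e_ij on observed entries, 0 elsewhere
    grad_A = E @ S.T                 # sum over observed j of e_ij s_lj
    diag_A = Of @ (S.T ** 2)         # sum over observed j of s_lj^2
    A = A + gamma * grad_A / (diag_A + eps) ** alpha
    E = np.where(O, X - A @ S, 0.0)  # recompute errors with the new A
    grad_S = A.T @ E                 # sum over observed i of e_ij a_il
    diag_S = (A ** 2).T @ Of         # sum over observed i of a_il^2
    S = S + gamma * grad_S / (diag_S + eps) ** alpha
    return A, S
```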
Overfitting in Case of Sparse Data
[Figure: an overfitted solution vs. a regularized solution]
Regularization against Overfitting
Penalizing the use of large parameter values (see the sketch below)
Estimating the distribution of unknown parameters (variational Bayesian learning)
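For the first option, one common reading is a quadratic (weight-decay) penalty; a minimal sketch of my own, assuming the penalized cost $C + \lambda (\|A\|_F^2 + \|S\|_F^2)$ with an illustrative value of lam:

```python
# Gradient step on the penalized cost C + lam * (||A||_F^2 + ||S||_F^2);
# the penalty simply shrinks the parameters toward zero on every step.
import numpy as np

def regularized_step(X, O, A, S, gamma=0.01, lam=0.1):
    E = np.where(O, X - A @ S, 0.0)      # errors over observed entries only
    A = A + gamma * (E @ S.T - lam * A)  # descent on C w.r.t. A plus shrinkage
    S = S + gamma * (A.T @ E - lam * S)  # descent on C w.r.t. S plus shrinkage
    return A, S
```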
Experiments with Netflix Data (www.netflixprize.com)
Collaborative filtering task: predict people's preferences based on other people's preferences.
d = 18 000 movies, n = 500 000 customers, N = 100 000 000 movie ratings from 1 to 5; 98.8% of the values are missing.
Find c = 15 principal components.
Computational Performance

Method         Complexity        Seconds/Iter   Hours to E_O = 0.85
Gradient       O(Nc + nc)        58             1.9
Speed-up       O(Nc + nc)        110            0.22
Natural Grad.  O(Nc + nc^2)      75             3.5
Imputation     O(nd^2)           110000         64
EM             O(Nc^2 + nc^3)    45000          58

Summary of the computational performance of the different methods. N = 100 000 000 ratings, c = 15 components, n = 500 000 people, d = 18 000 movies.
Error on Training Data against computation time in hours
[Figure: training error (0.76-1.04) vs. hours (0-64) for Gradient, Speed-up, Natural Grad., Imputation, and EM]
Error on Validation Data against computation time in hours
[Figure: validation error (0.95-1.1) vs. hours (0-32) for Gradient, Speed-up, Natural Grad., Regularized, VB1, and VB2]
Variational Bayesian Learning
The main issue in probabilistic machine learning models is to find the posterior distribution over the model parameters and latent variables.
Using a point estimate might overfit.
Sampling is prohibitively slow for large latent-variable models.
Variational Bayesian (VB) learning is a good compromise.
Overfitting
An overfitted model explains the current data but does not generalize well to new data.
[Figure: a 6th-order polynomial fitted to 10 points by maximum likelihood and by sampling]
Posterior Mass Matters
You want to make predictions about new data Y based on existing data X. This is solved by fitting a model to the data and then predicting based on that:
$p(Y \mid X) = \int p(Y \mid X, Z, \theta) \, p(Z, \theta \mid X) \, dZ \, d\theta$
Note how you need to integrate over the posterior $p(Z, \theta \mid X)$. If you need to select a single solution $(Z, \theta)$, it should represent the posterior mass well.
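A toy illustration of why the integral differs from plugging in a point estimate (entirely my own; the one-parameter Gaussian model is hypothetical): average the likelihood over posterior samples instead of evaluating it at the posterior mean.

```python
# Monte Carlo version of the predictive integral for a toy 1-D model:
# p(y | X) ~= mean over posterior samples theta_s of p(y | theta_s).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(0.5, 0.3, size=100_000)  # assumed posterior p(theta | X)

def lik(y, t):                              # p(y | theta) = N(y; theta, 1)
    return np.exp(-0.5 * (y - t) ** 2) / np.sqrt(2 * np.pi)

y = 1.0
predictive = lik(y, theta).mean()           # integrates over the posterior mass
plug_in = lik(y, theta.mean())              # point-estimate shortcut
print(predictive, plug_in)                  # the two generally differ
```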
Why early stopping might help
Variational Bayes
VB works by fitting a distribution q over the unknown variables to the true posterior by minimizing the KL divergence:
$\mathrm{KL}\big(q(Z, \theta) \,\|\, p(Z, \theta \mid X)\big) = \mathrm{E}_{q(Z, \theta)} \Big\{ \ln \frac{q(Z, \theta)}{p(Z, \theta \mid X)} \Big\}$
The form of q can be chosen such that the expectations are tractable. For instance, $q(Z, \theta) = q(Z) q(\theta)$ is assumed almost always, allowing the VB-EM algorithm.
The KL divergence can also be used for model comparison.
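In the simplest case where both q and the posterior are univariate Gaussians, the KL divergence being minimized has a closed form; a small sketch of my own using the standard Gaussian identity:

```python
# KL(N(m, s^2) || N(mu, sigma^2)) in closed form (standard Gaussian identity).
import numpy as np

def kl_gauss(m, s, mu, sigma):
    return np.log(sigma / s) + (s**2 + (m - mu)**2) / (2 * sigma**2) - 0.5

print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
print(kl_gauss(1.0, 0.5, 0.0, 1.0))  # mismatch -> positive divergence
```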
VB-EM Algorithm
The VB-EM algorithm alternates between updates for the latent variables and the parameters. The steps are symmetric and resemble the E-step of the EM algorithm:
VB-E step: $q(Z) \leftarrow \arg\min_{q(Z)} \mathrm{E}_{q(\theta)} \{ \mathrm{KL}(q(Z) \,\|\, p(Z \mid X, \theta)) \}$
VB-M step: $q(\theta) \leftarrow \arg\min_{q(\theta)} \mathrm{E}_{q(Z)} \{ \mathrm{KL}(q(\theta) \,\|\, p(\theta \mid X, Z)) \}$
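As a toy instance of such alternating factor updates (my own sketch in the spirit of the classic mean-field example for a Gaussian target; all numbers are made up):

```python
# Mean-field coordinate updates for a 2-D Gaussian target with precision Lam:
# each factor q(z_i) = N(m_i, 1/Lam_ii) is updated given the other, VB-EM style.
import numpy as np

mu = np.array([1.0, -1.0])                 # target mean
Lam = np.array([[2.0, 1.5], [1.5, 2.0]])   # target precision (correlated)

m = np.zeros(2)                            # initial factor means
for _ in range(50):                        # alternate the two factor updates
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

v = 1.0 / np.diag(Lam)                     # mean-field variances (underestimate the marginals)
print(m, v)                                # the means converge to mu
```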
[Figure 2.1: Posterior distributions of x and y shown in black contours. The maximum a posteriori estimate is plotted as a red star, the Bayes estimate (the expectation over the posterior) as a red circle, and the variational Bayesian solution with a Gaussian posterior approximation with diagonal covariance in blue as a dot surrounded by ellipses. Left: model p(z) = N(z; xy, 0.02), observation z = 1, priors p(x) = N(x; 0, 1), p(y) = N(y; 0, 1). Right: model p(z) = N(z; y, exp(-x)), observation z = 2, priors p(x) = N(x; 1, 5), p(y) = N(y; 0, 5). The models are not particularly meaningful (having just two unknown variables), but they are chosen to highlight differences in the various posterior approximations.]
By restricting the form of q(Z), the inference (E-step) can be made faster.
[Figure: three approximations of a posterior over nodes A-G: completely factorized, tree-structured, and exact]
Pros and Cons of VB
+ Robust against overfitting
+ Fast (compared to sampling)
+ Applicable to a large family of models
- Involved formulae (lots of integrals)
- Prone to bad but locally optimal solutions (a lot of work with arranging good initializations and other tricks to avoid them)
Bayes Blocks Software Package
Bayes Blocks by Valpola et al.:
concentrates on continuous values
fully factorial posterior approximation
includes nonlinearities
allows for variance modelling
algorithm: message passing with line searches for speed-up