PCA, ICA and beyond Summer School on Manifold Learning in Image and Signal Analysis, August 17-21, 2009, Hven Technical University of Denmark (DTU) & University of Copenhagen (KU) August 18, 2009
Outline: Motivation - multivariate data. Principal component analysis (PCA) - classic. Model-based approach - probabilistic PCA (pPCA). Identifiability - independent component analysis (ICA). InfoMax, smoothness and beyond.
Motivation: multivariate data
Data is often (but not always) represented as a matrix of d features and N samples: size(X) = [d N]. In statistics d = p, N = n, and the data matrix is transposed: X -> X^T.
Collaborative filtering: X = item-user matrix. Gene expression: X = gene-tissue matrix. Text analysis: X = term-document matrix.
Collaborative filtering
Collaborative filtering
Netflix - online movie rental (DVDs). Collaborative filtering: predict a user's rating from the past behavior of users. Improve on Netflix's own system by 10% to win.
training.txt: R = 10^8 ratings, on a scale of 1 to 5, for d = 17,770 movies and N = 480,189 users.
qualifying.txt: 2,817,131 movie-user pairs; (continuous) predictions are submitted to Netflix, which returns an RMSE.
The rating matrix X is mostly missing values: 98.5%.
Collaborative filtering
The collaborative filtering task:
Relatively large data set - 10^8 data points.
Very heterogeneous - viewers and movies with few ratings.
Ratings {1, 2, 3, 4, 5} are noisy (subjective use of the scale, non-stationary, ...).
A complex model is needed to capture latent structure. Regularization!
Collaborative filtering
Netflix prize - some key performance numbers:
Method         RMSE        % Improv.
Cinematch      0.9514      0%
PCA            0.89-0.92   5-6%
Grand prize    0.8563      10%
RMSE = root mean squared error. Two teams (Ensemble and BellKor's Pragmatic Chaos) are above 10%, but the prize has not been handed out yet (as of Aug 2009).
Gene Expression
DNA micro-array: d ~ 6000 genes and N ~ 60 cancer tissues.
[Figure: d x N gene-tissue expression matrix.]
Gene Expression
Protein signalling network (textbook view). Sachs et al., Science, 2005.
Gene Expression
Single-cell flow cytometry measurements of 11 phosphorylated proteins and phospholipids. Data was generated from a series of stimulatory cues and inhibitory interventions. Observational data: 1755 general stimulatory conditions. Experimental (interventional) data: 80%, not used in our approach. Not small n, large p!
Latent semantic analysis (LSA)
Bag-of-words representation: the term-document matrix.
Principal component analysis
Principal components (PCs): the orthogonal directions with most variance.
Empirical covariance of (centered) data: S = (1/N) X X^T, size(S) = [d d].
PCs: eigenvectors of S: S u_i = lambda_i u_i.
[Figure: 2-D data cloud with principal axes lambda_i u_i.]
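The eigendecomposition route above can be sketched in a few lines of NumPy (a toy illustration, not code from the slides; the data and variable names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N = 500 samples of a correlated 2-D Gaussian, stored as
# columns of X (d x N), matching the slides' convention.
N = 500
A = np.array([[2.0, 0.0], [1.0, 0.5]])
X = A @ rng.standard_normal((2, N))
X = X - X.mean(axis=1, keepdims=True)   # center the data

S = X @ X.T / N                         # empirical covariance, d x d
lam, U = np.linalg.eigh(S)              # eigendecomposition of S
order = np.argsort(lam)[::-1]           # sort by decreasing variance
lam, U = lam[order], U[:, order]

# Columns of U are the PCs; lam[i] is the variance along u_i.
```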
Principal component analysis
[Figure: typical steps in PCA.]
PCA maximum variance formulation
Project data {x_n}_{n=1,...,N} onto directions {u_i}_{i=1,...,M}. We find the directions sequentially, u_1 first.
Mean of projected data: u_1^T xbar with xbar = (1/N) sum_{n=1}^N x_n.
Variance of projected data: (1/N) sum_{n=1}^N { u_1^T (x_n - xbar) }^2 = u_1^T S u_1.
PCA maximum variance formulation
Maximize the variance u_1^T S u_1 with respect to u_1. But we need a constraint to avoid ||u_1|| growing without bound:
maximize u_1^T S u_1 + lambda_1 (1 - u_1^T u_1), with lambda_1 a Lagrange multiplier.
Solution - an eigenvalue problem: S u_1 = lambda_1 u_1.
Variance: u_1^T S u_1 = u_1^T lambda_1 u_1 = lambda_1.
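The eigenvalue problem for the leading direction u_1 can also be solved iteratively; power iteration is a simple sketch (my illustration, not an algorithm from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with clearly different variances along the three axes.
X = rng.standard_normal((3, 400)) * np.array([[3.0], [1.0], [0.5]])
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / X.shape[1]

# Power iteration: repeatedly apply S and renormalize; this converges to
# the eigenvector with the largest eigenvalue, i.e. the first PC u_1.
u = rng.standard_normal(3)
for _ in range(200):
    u = S @ u
    u = u / np.linalg.norm(u)

lam1 = u @ S @ u   # projected variance = largest eigenvalue lambda_1
```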
PCA minimum error reconstruction
[Figure: data points x_n in the (x_1, x_2) plane, with their projections onto u_1.]
Find the best reconstructing orthonormal directions {u_i}.
PCA minimum error reconstruction
Orthonormal directions {u_i}: u_i^T u_j = delta_ij.
Lower-dimensional subspace: xtilde_n = sum_{i=1}^M alpha_ni u_i + sum_{i=M+1}^D b_i u_i.
Minimize J({u_i}) = (1/N) sum_{n=1}^N ||x_n - xtilde_n||^2 = ... = sum_{i=M+1}^D u_i^T S u_i = sum_{i=M+1}^D lambda_i.
PCA minimum error reconstruction
Database of N images, d = 28 x 28 = 784 pixel values. Mean and first four PCs.
Reconstruction: xtilde_n = sum_{i=1}^M (x_n^T u_i) u_i + sum_{i=M+1}^D (xbar^T u_i) u_i = xbar + sum_{i=1}^M [ (x_n - xbar)^T u_i ] u_i.
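The reconstruction formula and the error identity J = sum of the discarded eigenvalues can be checked numerically (a sketch on synthetic data, not the slides' image database):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, M = 5, 1000, 2

# Correlated toy data, columns are samples.
X = rng.standard_normal((d, d)) @ rng.standard_normal((d, N))
xbar = X.mean(axis=1, keepdims=True)
Xc = X - xbar
S = Xc @ Xc.T / N
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]          # descending eigenvalues

UM = U[:, :M]
Xtil = xbar + UM @ (UM.T @ Xc)          # reconstruction from the top M PCs

# Average squared reconstruction error J({u_i}).
J = np.mean(np.sum((X - Xtil) ** 2, axis=0))
# J should equal the sum of the discarded eigenvalues lam[M:].
```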
PCA minimum error reconstruction
Where is the signal?
[Figure: panels (a) and (b), quantities plotted against component index; the original image.]
Singular Value Decomposition (SVD)
SVD - a simpler way to do PCA: [U, D, V] = SVD(X), X = U D V^T.
U and V are d x d and N x N orthonormal matrices: U^T U = I_d, V^T V = I_N.
D is a diagonal d x N matrix of singular values (>= 0), sorted.
S = (1/N) X X^T = (1/N) U D V^T V D^T U^T = (1/N) U D D^T U^T.
Columns of U are the eigenvectors of S, the PCs: S u_i = (D_ii^2 / N) u_i and lambda_i = D_ii^2 / N.
Projection of all data onto the ith PC: u_i^T X = u_i^T U D V^T = D_ii v_i^T.
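The SVD-PCA correspondence is easy to verify in NumPy (an illustrative sketch; `np.linalg.svd` returns V transposed, unlike the slide's [U, D, V] notation):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 4, 300
X = rng.standard_normal((d, N))
X = X - X.mean(axis=1, keepdims=True)

# Thin SVD: X = U diag(s) Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of S = (1/N) X X^T are s_i^2 / N; eigenvectors are columns of U.
lam_svd = s ** 2 / N
S = X @ X.T / N
lam_eig = np.sort(np.linalg.eigvalsh(S))[::-1]
```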
Singular Value Decomposition (SVD)
Project onto the first M PCs, U_M of size d x M:
X_M = U_M^T X = U_M^T U D V^T = D_M V_M^T, with D_M of size M x M and V_M of size N x M.
Covariance of the projected data: Stilde = (1/N) X_M X_M^T = (1/N) D_M V_M^T V_M D_M = (1/N) D_M^2.
Whitening: Xhat_M = sqrt(N) D_M^{-1} U_M^T X gives Shat = I_M.
Lossy projection back to d-dimensional space: Y_d ~ U_M Y_M.
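The whitening transformation can be checked directly - after rescaling by sqrt(N) D_M^{-1}, the projected data has identity covariance (a sketch on made-up data):

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, M = 5, 2000, 3

# Correlated toy data, columns are samples.
X = rng.standard_normal((d, d)) @ rng.standard_normal((d, N))
X = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
UM, sM = U[:, :M], s[:M]

# Whitening: project onto the top M PCs and rescale each direction so the
# covariance of the whitened data becomes the identity I_M.
Xhat = np.sqrt(N) * np.diag(1.0 / sM) @ UM.T @ X
Shat = Xhat @ Xhat.T / N
```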
Continuous Latent Variable Models
The latent variable z is unobserved but can be learned from data. Translations (2) and rotation (1) are the latent variables in the example. The mapping (W, z_n) -> x_n is in general non-linear. We often consider simpler linear models: x_n = W z_n + eps_n.
Continuous Latent Variable Models
Explain data by latent variables + noise: x = W z + eps, or in matrix form X = W Z + E.
x: d-dimensional data vector, elements x_in.
W: d x M mixing matrix, elements W_im, i = 1,...,d, m = 1,...,M.
z: M-dimensional latent variable or source vector, elements z_mn, n = 1,...,N.
eps: d-dimensional noise or residual vector, variances sigma_i^2.
Often d > M - we want to come up with a compact representation of the data.
Continuous Latent Variable Models
Linear latent variable model for collaborative filtering:
v_n: M-dimensional taste vector of viewer n.
u_m: M-dimensional profile vector of movie m.
Latent variable model for the rating h_mn: h_mn = u_m^T v_n + eps_mn, with eps_mn ~ N(0, sigma^2), where sigma^2 is the noise level.
Learn U and V from training data and predict r_mn ~ u_m^T v_n.
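A minimal sketch of this factorization, fitting u_m and v_n by stochastic gradient descent on the observed entries only (my toy setup with synthetic low-rank "ratings"; the slides do not prescribe a particular fitting algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)
M_lat, n_movies, n_users = 3, 20, 30

# Generate a low-rank "true" rating structure and observe ~50% of entries.
U_true = rng.standard_normal((n_movies, M_lat))
V_true = rng.standard_normal((n_users, M_lat))
R = U_true @ V_true.T
mask = rng.random((n_movies, n_users)) < 0.5
obs = np.argwhere(mask)                      # observed (movie, user) pairs

# SGD on the squared error of observed ratings, with L2 regularization.
U = 0.1 * rng.standard_normal((n_movies, M_lat))
V = 0.1 * rng.standard_normal((n_users, M_lat))
lr, reg = 0.02, 0.001
for epoch in range(200):
    rng.shuffle(obs)
    for m, n in obs:
        err = R[m, n] - U[m] @ V[n]
        U[m] += lr * (err * V[n] - reg * U[m])
        V[n] += lr * (err * U[m] - reg * V[n])

# Training RMSE on the observed entries after fitting.
rmse = np.sqrt(np.mean([(R[m, n] - U[m] @ V[n]) ** 2 for m, n in obs]))
```

Regularization (the `reg` term) matters in practice: with mostly-missing rating matrices, unregularized factors overfit viewers and movies that have few ratings.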
Probabilistic PCA
Let us try to understand PCA as a latent variable model: x = W z + eps.
SVD gives a hint: X = W Z + E = U D V^T = U_M D_M V_M^T + E, so W = U_M D_M and Z = V_M^T.
Probabilistic PCA
Tipping and Bishop considered a specific assumption:
P(z) = Norm(z; 0, I), P(eps; sigma^2) = Norm(eps; 0, sigma^2 I).
Under this model x is Gaussian with
<x> = <W z + eps> = 0, <x x^T> = W <z z^T> W^T + <eps eps^T> = W W^T + sigma^2 I.
Distribution of a datum: P(x; W, sigma^2) = Norm(x; 0, W W^T + sigma^2 I).
Probabilistic PCA
The log likelihood for W and sigma^2 is the joint distribution of all data:
log L(theta; X) = sum_n log P(x_n | W, sigma^2) = -(N/2) { log det(2 pi Sigma) + Tr[ Sigma^{-1} S ] }.
Model covariance: Sigma = W W^T + sigma^2 I. Empirical covariance: S = (1/N) X X^T.
The solution: W spans the M PCs with the largest eigenvalues. The remaining variance is explained by sigma^2 I.
An example of structured covariance estimation.
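Tipping and Bishop's maximum-likelihood solution has a closed form: sigma^2 is the average of the discarded eigenvalues and W = U_M (Lambda_M - sigma^2 I)^{1/2} R for an arbitrary rotation R (here R = I). A sketch on synthetic pPCA data:

```python
import numpy as np

rng = np.random.default_rng(6)
d, N, M = 6, 5000, 2

# Sample data from a true pPCA model so the ML fit has signal to find.
W_true = rng.standard_normal((d, M))
X = W_true @ rng.standard_normal((M, N)) + 0.3 * rng.standard_normal((d, N))
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / N

lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]            # descending eigenvalues

# Closed-form ML solution (with the rotation R chosen as the identity).
sigma2 = lam[M:].mean()                   # average discarded variance
W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))
Sigma = W @ W.T + sigma2 * np.eye(d)      # model covariance W W^T + sigma^2 I
```

The model covariance Sigma then matches S exactly in the top-M eigendirections and flattens the rest to sigma^2 - the "structured covariance estimation" the slide mentions.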
Probabilistic PCA
Let us try to solve the cocktail party problem: Recordings = Mixing x Speakers, or x = W z. Use pPCA to estimate W and z. Ignore the complications of room acoustics.
Probabilistic PCA
Stop sign! Non-uniqueness of the solution! The likelihood only depends upon W through Sigma = W W^T + sigma^2 I.
Rotating W, Wtilde = W U with U orthogonal, leaves the covariance unchanged: Wtilde Wtilde^T = W U U^T W^T = W W^T.
This can also be seen directly from the model: W z = W U U^T z = Wtilde ztilde with ztilde = U^T z.
The distribution of z is invariant: <ztilde> = U^T <z> = 0 and <ztilde ztilde^T> = U^T <z z^T> U = U^T I U = I.
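The rotation invariance is a two-line numerical check (my illustration): any orthogonal U leaves W W^T, and hence the pPCA likelihood, unchanged.

```python
import numpy as np

rng = np.random.default_rng(7)
d, M = 5, 3
W = rng.standard_normal((d, M))

# A random orthogonal matrix U (Q factor of a QR decomposition).
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))

# W and W @ Q are different mixing matrices with identical covariance,
# so the likelihood cannot distinguish them.
W_rot = W @ Q
```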
Independent component analysis (ICA)
Prior knowledge to the rescue! Real signals are not Gaussian.
Example: x = w_1 z_1 + w_2 z_2 with z_1 and z_2 independent and heavy tailed. We exploit this information by putting it into our model!
[Figure: scatter plot of heavy-tailed mixed data.]
Independent component analysis (ICA)
Allow for a more general z-distribution. Still assume independent, identically distributed (iid) sources: P(Z) = prod_mn P(z_mn).
Many choices are possible: heavy tailed (positive kurtosis), uniform, discrete (think of wireless communication), positive (used to decompose spectra, images, etc.) and negative kurtosis.
Extension to temporal/spatial correlations (time series, images, etc.).
Independent component analysis (ICA)
Summary - linear generative models x = W z + eps:
Probabilistic PCA: p(z, eps) = N(z; 0, I) N(eps; 0, sigma^2 I).
Factor analysis: p(z, eps) = N(z; 0, I) N(eps; 0, diag(sigma_1^2, ..., sigma_D^2)).
Independent component analysis: p(z, eps) = prod_{m=1}^M p(z_m) p(eps).
Encode a priori knowledge in p(z_m), e.g. heavy tails.
Independent component analysis (ICA)
Bell and Sejnowski algorithm, aka InfoMax. Assumption: square mixing and no noise, x = W z with W: d x d.
Likelihood - one sample: p(x | W) = int dz P(x | W, z) P(z) = int dz delta(x - W z) P(z).
Make the change of variables y = W z, dy = |det W| dz:
p(x | W) = (1 / |det W|) int dy delta(x - y) P(W^{-1} y) = (1 / |det W|) P(W^{-1} x).
Maximize the log likelihood: sum_n log p(x_n | W).
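A minimal InfoMax-style sketch: maximize the likelihood above by natural-gradient ascent. The tanh nonlinearity (matching super-Gaussian sources), the whitening preprocessing, and all parameter values are my choices for the toy example, not prescriptions from the slides:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 5000

# Two independent heavy-tailed (Laplacian) sources, linearly mixed.
Z = rng.laplace(size=(2, N))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = A @ Z

# Whiten the mixtures first (standard ICA preprocessing).
X = X - X.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(X, full_matrices=False)
Xw = np.sqrt(N) * np.diag(1.0 / s) @ U.T @ X

# Natural-gradient InfoMax: for super-Gaussian sources phi(y) = tanh(y),
# and dW ~ (I - <phi(y) y^T>) W climbs the log likelihood.
W = np.eye(2)
lr = 0.1
for _ in range(300):
    Y = W @ Xw
    W += lr * (np.eye(2) - np.tanh(Y) @ Y.T / N) @ W

Y = W @ Xw   # estimated sources, up to permutation and scaling
```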
Independent component analysis (ICA)
Non-iid data - temporal/spatial correlations: <z_mn z_m'n'> = delta_mm' K_m,nn'.
It is easy to prove that a rotation of z, U z, will no longer leave the statistics of z unchanged if the kernels are different, K_m != K_m' for different variables m != m'.
Second-order statistics alone are therefore enough for identifiability!
The Molgedey and Schuster algorithm (aka MAF) is one example using second-order statistics; Gaussian processes (GPs) are another.
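The second-order idea can be illustrated in the Molgedey-Schuster spirit: whiten, then diagonalize a symmetrized time-lagged covariance; its eigenvectors supply the remaining rotation. The AR(1) toy sources, the lag choice, and the setup are my assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(9)
N, tau = 20000, 1

# Two sources with different temporal structure: AR(1) processes whose
# lag-tau autocovariances differ (coefficients 0.9 vs 0.2).
def ar1(a, n, rng):
    z = np.zeros(n)
    for t in range(1, n):
        z[t] = a * z[t - 1] + rng.standard_normal()
    return z

Z = np.vstack([ar1(0.9, N, rng), ar1(0.2, N, rng)])
A = np.array([[1.0, 0.8], [0.3, 1.0]])
X = A @ Z

# Whiten the mixtures.
X = X - X.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(X, full_matrices=False)
Xw = np.sqrt(N) * np.diag(1.0 / s) @ U.T @ X

# Symmetrized lag-tau covariance; for independent sources with different
# kernels K_m it is diagonalized by the unmixing rotation.
C_tau = Xw[:, :-tau] @ Xw[:, tau:].T / (N - tau)
C_tau = (C_tau + C_tau.T) / 2
_, V = np.linalg.eigh(C_tau)
Y = V.T @ Xw   # estimated sources, up to permutation and sign
```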
Beyond PCA and ICA
Kernel PCA: component analysis in feature space, x -> Phi(x).
Nonlinear latent variable model: x = W f(z) + eps.
Fully probabilistic (and Bayesian), rather than one hammer (SVD) for all data.
Sparsity, Bayesian networks and latent variable models (Ricardo's talk).
[Figure: nonlinear manifold example.]
Summary and reading
PCA and SVD. Generative models - continuous latent variables. Probabilistic PCA and ICA. Non-linear and Bayesian extensions.
Books: C. Bishop, Pattern Recognition and Machine Learning, Springer; D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge.