PCA, ICA and beyond. Summer School on Manifold Learning in Image and Signal Analysis, August 17-21, 2009, Hven. Technical University of Denmark (DTU) & University of Copenhagen (KU). August 18, 2009.

Motivation: multivariate data. Principal component analysis (PCA): the classic view. Model-based approach: probabilistic PCA (pPCA). Identifiability: independent component analysis (ICA). InfoMax, smoothness and beyond.

Motivation: multivariate data. Data is often (but not always) represented as a matrix of d features and N samples: size(X) = [d, N]. In statistics d = p, N = n, and the data matrix is transposed ($X \to X^T$). Examples: collaborative filtering, X = item-user matrix; gene expression, X = gene-tissue matrix; text analysis, X = term-document matrix.

Collaborative filtering

Collaborative filtering. Netflix: online movie rental (DVDs). Collaborative filtering: predict a user's rating from the user's past behaviour. Improve on Netflix's own system by 10% to win. training.txt: $R \approx 10^8$ ratings on a scale of 1 to 5, for d = 17,770 movies and N = 480,189 users. qualifying.txt: 2,817,131 movie-user pairs; (continuous) predictions are submitted and Netflix returns an RMSE. The rating matrix X is mostly missing values (98.5%).

Collaborative filtering. The collaborative filtering task: a relatively large data set ($10^8$ data points); very heterogeneous (viewers and movies with few ratings); ratings {1, 2, 3, 4, 5} are noisy (subjective use of the scale, non-stationarity, ...); a complex model is needed to capture the latent structure; regularization!

Collaborative filtering. Netflix prize, some key performance numbers (RMSE = root mean squared error):
Method        RMSE        % Improvement
Cinematch     0.9514      0%
PCA           0.89-0.92   5-6%
Grand prize   0.8563      10%
Two teams (The Ensemble and BellKor's Pragmatic Chaos) are above 10%, but the prize has not been handed out yet (as of Aug 2009).

Gene expression. DNA micro-array data: $d \approx 6000$ genes and $N \approx 60$ cancer tissues. [Figure: gene-tissue expression matrix, roughly 6000 genes by 60 tissues.]

Gene expression. Protein signalling network ("textbook" network), Sachs et al., Science, 2005.

Gene expression. Single-cell flow cytometry measurements of 11 phosphorylated proteins and phospholipids. The data were generated from a series of stimulatory cues and inhibitory interventions. Observational data: 1755 general stimulatory conditions; the experimental (interventional) data, about 80% of the total, is not used in our approach. Not "small n, large p"!

Latent semantic analysis (LSA). Bag-of-words representation: the term-document matrix.

Principal component analysis. Principal components (PCs): the orthogonal directions with most variance. Empirical covariance of the (centered) data: $S = \frac{1}{N} X X^T$, size(S) = [d, d]. PCs: the eigenvectors of S, $S u_i = \lambda_i u_i$. [Figure: 2-D scatter plot with the axes $\lambda_i u_i$ overlaid.]
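As an illustration of this eigendecomposition view, here is a minimal NumPy sketch (not from the slides; the helper name `pca_eig`, the toy 2-D Gaussian data and the choice M = 1 are assumptions for illustration):

```python
import numpy as np

def pca_eig(X, M):
    """PCA via eigendecomposition of the empirical covariance.

    X : data matrix of shape (d, N), one column per sample.
    M : number of principal components to keep.
    Returns (mean, PCs as columns of U_M, leading eigenvalues).
    """
    d, N = X.shape
    xbar = X.mean(axis=1, keepdims=True)        # sample mean, shape (d, 1)
    Xc = X - xbar                               # centered data
    S = Xc @ Xc.T / N                           # empirical covariance, (d, d)
    lam, U = np.linalg.eigh(S)                  # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]               # re-sort by decreasing variance
    return xbar, U[:, order[:M]], lam[order[:M]]

# toy usage on a correlated 2-D Gaussian cloud
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=500).T
xbar, U_M, lam = pca_eig(X, M=1)
print(lam)   # variance captured by the first PC
```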

Principal component analysis. [Figure: typical steps in PCA.]

PCA, maximum variance formulation. Project the data $\{x_n\}_{n=1,\dots,N}$ onto directions $\{u_i\}_{i=1,\dots,M}$. We find the directions sequentially, $u_1$ first. Mean of the projected data: $u_1^T \bar{x}$ with $\bar{x} = \frac{1}{N}\sum_{n=1}^N x_n$. Variance of the projected data: $\frac{1}{N}\sum_{n=1}^N \{u_1^T (x_n - \bar{x})\}^2 = u_1^T S u_1$.

PCA, maximum variance formulation. Maximize the variance $u_1^T S u_1$ with respect to $u_1$. But we need a constraint to keep $\|u_1\|$ bounded: maximize $u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$, with $\lambda_1$ a Lagrange multiplier. Solution: the eigenvalue problem $S u_1 = \lambda_1 u_1$. Variance: $u_1^T S u_1 = u_1^T \lambda_1 u_1 = \lambda_1$.

PCA, minimum error reconstruction. [Figure: a data point $x_n$ in the $(x_1, x_2)$ plane and its projection $\tilde{x}_n$ onto the direction $u_1$.] Find the best reconstructing orthonormal directions $\{u_i\}$.

PCA, minimum error reconstruction. Orthonormal directions $\{u_i\}$: $u_i^T u_j = \delta_{ij}$. Lower-dimensional subspace: $\tilde{x}_n = \sum_{i=1}^{M} \alpha_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$. Minimize $J(\{u_i\}) = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \dots = \sum_{i=M+1}^{D} u_i^T S u_i = \sum_{i=M+1}^{D} \lambda_i$.

PCA, minimum error reconstruction. A database of N images, each with $d = 28 \times 28 = 784$ pixel values. [Figure: mean and first four PCs; reconstructions.] Reconstruction: $\tilde{x}_n = \sum_{i=1}^{M}(x_n^T u_i)\, u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i)\, u_i = \bar{x} + \sum_{i=1}^{M}\left[(x_n - \bar{x})^T u_i\right] u_i$.
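A hedged sketch of this reconstruction formula, reusing the hypothetical `pca_eig` helper from the earlier snippet; the image data itself is not included here, and the shapes follow the slides' [d, N] convention:

```python
import numpy as np

def pca_reconstruct(X, xbar, U_M):
    """Reconstruct each column of X from its projection onto the retained PCs:
    x_tilde = xbar + sum_i ((x - xbar)^T u_i) u_i
    """
    coeffs = U_M.T @ (X - xbar)        # (M, N) projection coefficients
    return xbar + U_M @ coeffs         # (d, N) reconstructions

# e.g. with 784-pixel image vectors stacked as columns of X and M = 4:
# xbar, U_M, lam = pca_eig(X, M=4)
# X_tilde = pca_reconstruct(X, xbar, U_M)
```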

PCA, minimum error reconstruction. Where is the signal? [Figure: panels (a) and (b) showing the eigenvalue spectrum of the image data over the first ~600 components.]

Singular value decomposition (SVD). SVD is a simpler way to do PCA: [U, D, V] = svd(X), $X = U D V^T$. U and V are $d \times d$ and $N \times N$ orthonormal matrices: $U^T U = I_d$, $V^T V = I_N$. D is a diagonal $d \times N$ matrix of singular values ($\geq 0$), sorted. Then $S = \frac{1}{N} X X^T = \frac{1}{N} U D V^T V D^T U^T = \frac{1}{N} U D D^T U^T$. The columns of U are the eigenvectors of S, the PCs: $S u_i = \frac{D_{ii}^2}{N} u_i$ and $\lambda_i = \frac{D_{ii}^2}{N}$. Projection of all the data onto the $i$th PC: $u_i^T X = u_i^T U D V^T = D_{ii} v_i^T$.
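A small sketch (not from the slides) checking the SVD route against the covariance route; the toy data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 5, 200
X = rng.standard_normal((d, N))
Xc = X - X.mean(axis=1, keepdims=True)

# SVD route: columns of U are the PCs, lambda_i = D_ii^2 / N
U, Dvals, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_svd = Dvals**2 / N

# covariance route: eigenvalues of S = (1/N) X X^T
S = Xc @ Xc.T / N
lam_eig = np.linalg.eigvalsh(S)[::-1]               # descending, to match the SVD order

print(np.allclose(lam_svd, lam_eig))                # True: same spectrum
print(np.allclose(U[:, 0] @ Xc, Dvals[0] * Vt[0]))  # u_1^T X = D_11 v_1^T
```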

Singular value decomposition (SVD). Project onto the PCs with $U_M$ ($d \times M$): $X_M \equiv U_M^T X = U_M^T U D V^T = D_M V_M^T$, with $D_M$ ($M \times M$) and $V_M$ ($N \times M$). Covariance of the projected data: $S_M = \frac{1}{N} X_M X_M^T = \frac{1}{N} D_M V_M^T V_M D_M = \frac{1}{N} D_M^2$. Whitening: $\hat{X}_M \equiv \sqrt{N}\, D_M^{-1} U_M^T X = \sqrt{N}\, V_M^T$, so $\hat{S} = I_M$. Lossy projection back to the d-dimensional space: $Y_d \approx U_M Y_M$.
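A hedged NumPy sketch of the projection and whitening steps; M and the toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, M = 6, 1000, 3
X = rng.standard_normal((d, d)) @ rng.standard_normal((d, N))   # correlated toy data
Xc = X - X.mean(axis=1, keepdims=True)

U, Dvals, Vt = np.linalg.svd(Xc, full_matrices=False)
U_M, D_M = U[:, :M], np.diag(Dvals[:M])

X_M = U_M.T @ Xc                                   # projection onto the first M PCs
X_white = np.sqrt(N) * np.linalg.inv(D_M) @ X_M    # whitening: covariance becomes I_M
print(np.round(X_white @ X_white.T / N, 3))        # approximately the identity I_M

X_back = U_M @ X_M                                 # lossy projection back to d dimensions
```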

Continuous latent variable models. The latent variable z is unobserved but can be learned from data. In the example, two translations and one rotation are the latent variables. The mapping $(W, z_n) \mapsto x_n$ is in general non-linear. We often consider simpler linear models: $x_n = W z_n + \epsilon_n$.

Continuous latent variable models. Explain the data by latent variables plus noise: $x = Wz + \epsilon$, or for the whole data set $X = WZ + E$. x: d-dimensional data vector. W: $d \times M$ mixing matrix. z: M-dimensional latent variable or source vector. $\epsilon$: d-dimensional noise or residual vector, with noise level $\sigma_i^2$. Often d > M: we want a compact representation of the data. [Figure: graphical model with plates over $i = 1,\dots,d$, $m = 1,\dots,M$, $n = 1,\dots,N$ and nodes $W_{im}$, $z_{mn}$, $x_{in}$, $\epsilon_{in}$, $\sigma_i^2$.]

Continuous latent variable models. A linear latent variable model for collaborative filtering: $v_n$, the M-dimensional taste vector of viewer n; $u_m$, the M-dimensional profile vector of movie m. Latent variable $h_{mn}$: $h_{mn} = u_m^T v_n + \epsilon_{mn}$ with $\epsilon_{mn} \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is the noise level. Learn U and V from the training data and predict $r_{mn} \approx u_m^T v_n$.
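This is not the method used for the Netflix submissions described above, just a minimal stochastic-gradient sketch of fitting the $u_m^T v_n$ model on observed (movie, user, rating) triples; the function name and hyperparameters are illustrative assumptions:

```python
import numpy as np

def fit_latent_factors(triples, n_movies, n_users, M=10, lr=0.01, reg=0.05,
                       epochs=20, seed=0):
    """SGD on squared error for r_mn ~ u_m^T v_n with L2 regularization.

    triples : list of (movie_index, user_index, rating) for the observed ratings.
    Returns movie profiles U (n_movies, M) and viewer tastes V (n_users, M).
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_movies, M))
    V = 0.1 * rng.standard_normal((n_users, M))
    for _ in range(epochs):
        for m, n, r in triples:
            err = r - U[m] @ V[n]                    # error on this observed rating
            U[m] += lr * (err * V[n] - reg * U[m])   # gradient step with shrinkage
            V[n] += lr * (err * U[m] - reg * V[n])
    return U, V

# predict an unseen rating r_mn as U[m] @ V[n]
```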

Probabilistic PCA. Let us try to understand PCA as a latent variable model: $x = Wz + \epsilon$. The SVD gives a hint: $X = WZ + E = U D V^T = U_M D_M V_M^T + E$, i.e. $W = U_M D_M$ and $Z = V_M^T$.

Probabilistic PCA. Tipping and Bishop considered a specific assumption: $P(z) = \mathcal{N}(z; 0, I)$ and $P(\epsilon; \sigma^2) = \mathcal{N}(\epsilon; 0, \sigma^2 I)$. Under this model x is Gaussian with $\langle x \rangle = \langle Wz + \epsilon \rangle = 0$ and $\langle x x^T \rangle = W \langle z z^T \rangle W^T + \langle \epsilon \epsilon^T \rangle = W W^T + \sigma^2 I$. Distribution of a datum: $P(x; W, \sigma^2) = \mathcal{N}(x; 0, W W^T + \sigma^2 I)$.
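A short sketch sampling from this generative model and checking that the empirical covariance approaches $W W^T + \sigma^2 I$; W, $\sigma^2$ and the sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, M, N, sigma2 = 4, 2, 50000, 0.1
W = rng.standard_normal((d, M))

Z = rng.standard_normal((M, N))                    # z ~ N(0, I)
E = np.sqrt(sigma2) * rng.standard_normal((d, N))  # eps ~ N(0, sigma^2 I)
X = W @ Z + E                                      # x = W z + eps

S_emp = X @ X.T / N                                # empirical covariance
Sigma = W @ W.T + sigma2 * np.eye(d)               # model covariance WW^T + sigma^2 I
print(np.round(S_emp - Sigma, 2))                  # close to zero for large N
```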

Probabilistic PCA. The log likelihood for W and $\sigma^2$ is the joint distribution of all the data: $\log L(\theta; X) = \sum_n \log P(x_n \mid W, \sigma^2) = -\frac{N}{2}\left\{\log\det 2\pi\Sigma + \mathrm{Tr}\left[\Sigma^{-1} S\right]\right\}$. Model covariance: $\Sigma = W W^T + \sigma^2 I$. Empirical covariance: $S = \frac{1}{N} X X^T$. The solution: the M PCs with the largest eigenvalues; the remaining variance is explained by $\sigma^2 I$. An example of structured covariance estimation.
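The maximum-likelihood solution has a well-known closed form (Tipping and Bishop): $\sigma^2_{ML}$ is the average of the discarded eigenvalues and $W_{ML} = U_M(\Lambda_M - \sigma^2 I)^{1/2} R$ for an arbitrary rotation R. A sketch, assuming a centered data matrix X of shape [d, N]:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form ML estimates for probabilistic PCA (Tipping & Bishop).

    X : centered data matrix, shape (d, N), with M < d.
    Returns W_ml of shape (d, M) and the scalar sigma2_ml.
    """
    d, N = X.shape
    S = X @ X.T / N                                     # empirical covariance
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]                      # descending eigenvalues
    sigma2 = lam[M:].mean()                             # average discarded variance
    W = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2))   # defined only up to a rotation R
    return W, sigma2
```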

Probabilistic PCA. Let us try to solve the cocktail party problem: Recordings = Mixing x Speakers, or $x = Wz$. Use pPCA to estimate W and z. Ignore the complications of room acoustics.

Probabilistic PCA. Stop sign: non-uniqueness of the solution! The likelihood depends on W only through $\Sigma = W W^T + \sigma^2 I$. Rotating W, $\tilde{W} = W U$ with U orthogonal, leaves the covariance unchanged: $\tilde{W} \tilde{W}^T = W U U^T W^T = W W^T$. This can also be seen directly from the model: $Wz = W U U^T z = \tilde{W} \tilde{z}$ with $\tilde{z} = U^T z$. The distribution of z is invariant: $\langle \tilde{z} \rangle = U^T \langle z \rangle = 0$ and $\langle \tilde{z} \tilde{z}^T \rangle = U^T \langle z z^T \rangle U = U^T I U = I$.
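A two-line numerical check of this rotational non-uniqueness; the random W and orthogonal matrix are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d, M = 5, 3
W = rng.standard_normal((d, M))
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))   # a random orthogonal matrix

W_rot = W @ Q                                      # rotated mixing matrix
print(np.allclose(W @ W.T, W_rot @ W_rot.T))       # True: same covariance, same likelihood
```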

Independent component analysis (ICA). Prior knowledge to the rescue: real signals are not Gaussian. Example: $x = w_1 z_1 + w_2 z_2$ with $z_1$ and $z_2$ independent and heavy-tailed. We exploit this information by putting it into our model! [Figure: scatter plot of a heavy-tailed two-source mixture.]

Independent component analysis (ICA). Allow a more general z-distribution. Still assume independent, identically distributed (iid) sources: $P(Z) = \prod_{mn} P(z_{mn})$. Many choices are possible: heavy-tailed (positive kurtosis), uniform, discrete (think of wireless communication), positive (used to decompose spectra, images, etc.) and negative kurtosis. Extensions to temporal/spatial correlations (time series, images, etc.).

Independent component analysis (ICA). Summary of linear generative models $x = Wz + \epsilon$. Probabilistic PCA: $p(z, \epsilon) = \mathcal{N}(z; 0, I)\,\mathcal{N}(\epsilon; 0, \sigma^2 I)$. Factor analysis: $p(z, \epsilon) = \mathcal{N}(z; 0, I)\,\mathcal{N}(\epsilon; 0, \mathrm{diag}(\sigma_1^2, \dots, \sigma_D^2))$. Independent component analysis: $p(z, \epsilon) = \prod_{m=1}^{M} p(z_m)\, p(\epsilon)$. Encode a priori knowledge in $p(z_m)$, e.g. heavy tails.

Independent component analysis (ICA). The Bell and Sejnowski algorithm, aka InfoMax. Assumption: square mixing and no noise, $x = Wz$ with W of size $d \times d$. Likelihood of one sample: $p(x \mid W) = \int dz\, P(x \mid W, z)\, p(z) = \int dz\, \delta(x - Wz)\, P(z)$. Make the change of variables $y = Wz$, $dy = |W|\, dz$: $p(x \mid W) = \frac{1}{|W|} \int dy\, \delta(x - y)\, P(W^{-1} y) = \frac{1}{|W|} P(W^{-1} x)$. Maximize the log likelihood $\sum_n \log P(x_n \mid W)$.
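A hedged sketch of an InfoMax-style estimator. Note that it learns the unmixing matrix (the inverse of the mixing W above), uses the common tanh score function appropriate for heavy-tailed sources, and takes natural-gradient steps; the batch size, learning rate and toy mixture are assumptions:

```python
import numpy as np

def infomax_ica(X, lr=0.01, iters=5000, seed=0):
    """Natural-gradient InfoMax ICA for the noiseless square model x = W z.

    X : mixed observations, shape (d, N). Returns an unmixing matrix B such
    that B @ X approximates the sources up to scale and permutation.
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    B = np.eye(d)
    for _ in range(iters):
        idx = rng.integers(0, N, size=128)              # mini-batch of samples
        Y = B @ X[:, idx]                               # current source estimates
        g = np.tanh(Y)                                  # score for heavy-tailed sources
        B += lr * (np.eye(d) - g @ Y.T / idx.size) @ B  # natural-gradient update
    return B

# toy usage: two heavy-tailed (Laplace) sources and a random square mixing
rng = np.random.default_rng(1)
Z = rng.laplace(size=(2, 5000))
A = rng.standard_normal((2, 2))
B = infomax_ica(A @ Z)
print(np.round(B @ A, 2))   # roughly a scaled permutation matrix
```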

Independent component analysis (ICA). Non-iid data with temporal/spatial correlations: $\langle z_{mn} z_{m'n'} \rangle = \delta_{mm'} K_{m, nn'}$. It is easy to prove that a rotation of z, $Uz$, no longer leaves the statistics of z unchanged if the kernels are different, $K_m \neq K_{m'}$ for different variables $m \neq m'$. Second-order statistics alone are therefore enough for identifiability! The Molgedey and Schuster algorithm (aka MAF) is one example using second-order statistics; the use of Gaussian processes (GPs) is another.
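A hedged sketch of second-order separation in the spirit of Molgedey and Schuster: whiten the data, then diagonalize a single time-lagged covariance; the lag and the toy signals (one smooth source, one white source) are illustrative assumptions:

```python
import numpy as np

def second_order_ica(X, lag=1):
    """Separation from temporal structure (Molgedey-Schuster / AMUSE style).

    X : observations, shape (d, N), treated as d parallel time series.
    Returns estimated sources of shape (d, N), up to scale and permutation.
    """
    d, N = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    # 1. whiten with the SVD
    U, Dvals, _ = np.linalg.svd(Xc, full_matrices=False)
    Y = np.sqrt(N) * np.diag(1.0 / Dvals) @ U.T @ Xc
    # 2. symmetrized time-lagged covariance of the whitened data
    C = Y[:, :-lag] @ Y[:, lag:].T / (N - lag)
    C = (C + C.T) / 2
    # 3. its eigenvectors give the remaining rotation
    _, V = np.linalg.eigh(C)
    return V.T @ Y

# toy usage: one smooth and one white source, mixed by a random square matrix
rng = np.random.default_rng(5)
N = 10000
s_smooth = np.convolve(rng.standard_normal(N), np.ones(20) / 20, mode="same")
s_white = rng.standard_normal(N)
X = rng.standard_normal((2, 2)) @ np.vstack([s_smooth, s_white])
S_hat = second_order_ica(X, lag=2)
```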

Beyond PCA and ICA. Kernel PCA: component analysis in feature space, $x \to \Phi(x)$. Nonlinear latent variable models: $x = W f(z) + \epsilon$. Fully probabilistic (and Bayesian) approaches rather than one hammer (the SVD) for all data. Sparsity, Bayesian networks and latent variable models (Ricardo's talk). [Figure: nonlinear manifold example.]
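As a pointer to the kernel PCA idea, a compact sketch with an RBF kernel; the kernel choice, gamma and the row-per-sample convention are assumptions, not the slides':

```python
import numpy as np

def kernel_pca(X, M=2, gamma=1.0):
    """Kernel PCA with an RBF kernel. X: shape (N, d), one row per sample.

    Returns the projections of the N training points onto the top M kernel PCs.
    """
    N = X.shape[0]
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # Gram matrix
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one      # centering in feature space
    lam, A = np.linalg.eigh(Kc)
    lam, A = lam[::-1][:M], A[:, ::-1][:, :M]       # top M eigenpairs of Kc
    A = A / np.sqrt(np.maximum(lam, 1e-12))         # scale coefficients alpha_i / sqrt(lam_i)
    return Kc @ A                                   # projections onto the kernel PCs
```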

Summary and reading. PCA and SVD. Generative models with continuous latent variables. Probabilistic PCA and ICA. Non-linear and Bayesian extensions. Books: C. Bishop, Pattern Recognition and Machine Learning, Springer; D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press.