CIFAR Lectures: Non-Gaussian statistics and natural images

Dept of Computer Science, University of Helsinki, Finland

Outline
- Part I: Theory of ICA
  - Definition and difference to PCA
  - Importance of non-gaussianity
- Part II: Natural images and ICA
  - Application of ICA and sparse coding on natural images
  - Extensions of ICA with dependent components
- Part III: Estimation of unnormalized models
  - Motivation by extensions of ICA
  - Score matching
  - Noise-contrastive estimation
- Part IV: Recent extensions of ICA and natural image statistics
  - A three-layer model, towards deep learning

Part I: Theory of ICA
- Definition of ICA as non-gaussian generative model
- Importance of non-gaussianity
- Fundamental difference to PCA
- Estimation by maximization of non-gaussianity
- Measures of non-gaussianity

Problem of blind source separation
- There are a number of source signals.
- Due to some external circumstances, only linear mixtures of the source signals are observed.
- Estimate (separate) the original signals!

A solution is possible
- PCA does not recover the original signals
- Use information on statistical independence to recover them.

Independent Component Analysis (Hérault and Jutten, 1984-1991)
- Observed random variables $x_i$ are modelled as linear sums of hidden variables:
  $x_i = \sum_{j=1}^{m} a_{ij} s_j, \quad i = 1, \dots, n$  (1)
- Mathematical formulation of the blind source separation problem
- Not unlike factor analysis
- The matrix of the $a_{ij}$ is the parameter matrix, called the mixing matrix.
- The $s_i$ are hidden random variables called independent components, or source signals
- Problem: Estimate both $a_{ij}$ and $s_j$, observing only $x_i$.
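As a hedged illustration of the generative model in equation (1), the sketch below draws two independent uniform sources and mixes them linearly. The names `s`, `A`, `x`, the sample size and the entries of the mixing matrix are assumptions made for illustration only; they are not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Two independent, nongaussian (uniform) sources with unit variance.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n_samples))

# Hypothetical mixing matrix (a_ij); any invertible matrix would do.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])

# Equation (1) in matrix form: each observed x_i is a linear sum of the s_j.
x = A @ s

# With A known the sources are recovered exactly; ICA has to do this blindly,
# observing only x.
s_recovered = np.linalg.solve(A, x)

print("correlation between the two mixtures:", np.corrcoef(x)[0, 1])
```

The mixtures come out clearly correlated even though the sources are independent, which is the situation the estimation problem starts from.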

When can the ICA model be estimated?
- Must assume:
  - The $s_i$ are mutually statistically independent
  - The $s_i$ are nongaussian (non-normal)
  - (Optional:) Number of independent components is equal to number of observed variables
- Then: mixing matrix and components can be identified (Comon, 1994)
- A very surprising result!

Reminder: Principal component analysis
- Basic idea: find directions $\sum_i w_i x_i$ of maximum variance
- We must constrain the norm of $w$: $\sum_i w_i^2 = 1$, otherwise the solution is that the $w_i$ are infinite.
- For more than one component, find the direction of maximum variance orthogonal to the components previously found.
- Classic factor analysis has essentially the same idea as PCA: explain maximal variance with a limited number of components
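A minimal sketch of the PCA idea, assuming the standard route through the eigenvectors of the sample covariance matrix (the slide itself does not name a computation). The data, the matrix used to generate it and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative correlated 2-D data (rows are variables, columns are samples).
x = np.array([[2.0, 0.0], [1.0, 0.5]]) @ rng.standard_normal((2, 5000))
x = x - x.mean(axis=1, keepdims=True)
C = np.cov(x)

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
w1, w2 = eigvecs[:, 0], eigvecs[:, 1]

# For a unit-norm w, the variance of the projection w^T x equals w^T C w.
var_w1 = w1 @ C @ w1

# Compare against many random unit-norm directions.
random_w = rng.standard_normal((1000, 2))
random_w /= np.linalg.norm(random_w, axis=1, keepdims=True)
random_vars = np.einsum("ki,ij,kj->k", random_w, C, random_w)

print("variance along w1:", var_w1)
print("largest variance among random directions:", random_vars.max())
```

The second direction `w2` is orthogonal to `w1`, matching the rule of searching orthogonally to the components already found.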

Comparison of ICA, factor analysis and principal component analysis
- ICA is nongaussian FA with no noise or specific factors.
  - So many components that all variance is explained by them.
- No factor rotation left unknown because of the identifiability result
- In contrast to FA and PCA, components really give the original source signals or underlying hidden variables
- Catch: only works when components are nongaussian
  - Many psychological hidden variables (e.g. "intelligence") may be (practically) gaussian because they are sums of many independent variables (central limit theorem).
  - But signals measured by sensors are usually quite nongaussian

Some examples of nongaussianity
[Figure: six panels of example plots]

Why classic methods cannot find original components or sources
- In PCA and FA: find components $y_i$ which are uncorrelated,
  $\mathrm{cov}(y_i, y_j) = E\{y_i y_j\} - E\{y_i\}E\{y_j\} = 0$  (2)
  and maximize explained variance (or variance of components)
- Such methods need only the covariances, $\mathrm{cov}(x_i, x_j)$
- However, there are many different component sets that are uncorrelated, because
  - The number of covariances is $\approx n^2/2$ due to symmetry
  - So, we cannot solve the $n^2$ factor loadings, not enough information! ("More variables than equations")

Nongaussianity, with independence, gives more information
- For independent variables we have
  $E\{h_1(y_1) h_2(y_2)\} - E\{h_1(y_1)\}E\{h_2(y_2)\} = 0.$  (3)
- For nongaussian variables, nonlinear covariances give more information than just covariances.
- This is not true for the multivariate gaussian distribution
  - Distribution is completely determined by covariances
  - Uncorrelated gaussian variables are independent, and their distribution (standardized) is the same in all directions (see below)
  - ICA model cannot be estimated for gaussian data.
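A hedged numerical check of the point made with equation (3): the sketch rotates independent unit-variance sources by 45 degrees, so the results stay uncorrelated, and then compares the ordinary covariance with a nonlinear one. The choice $h_1(y) = h_2(y) = y^2$, the rotation angle and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def cov(a, b):
    # Sample version of E{ab} - E{a}E{b}.
    return np.mean(a * b) - np.mean(a) * np.mean(b)

# A 45-degree rotation: an uncorrelated but "wrong" set of components.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

s_uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))
s_gauss = rng.standard_normal((2, n))
y_uniform = R @ s_uniform
y_gauss = R @ s_gauss

# Ordinary covariances: close to zero in both cases.
lin_uniform = cov(y_uniform[0], y_uniform[1])
lin_gauss = cov(y_gauss[0], y_gauss[1])

# Nonlinear covariances with h(y) = y^2: only the nongaussian case reveals
# that the rotated components are not independent.
sq_uniform = cov(y_uniform[0] ** 2, y_uniform[1] ** 2)
sq_gauss = cov(y_gauss[0] ** 2, y_gauss[1] ** 2)

print("uniform: cov =", lin_uniform, " cov of squares =", sq_uniform)
print("gauss:   cov =", lin_gauss, " cov of squares =", sq_gauss)
```

For the uniform sources the covariance of squares is clearly negative, while for gaussian sources it stays near zero: the rotation cannot be detected in gaussian data.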

Illustration
- Two components with uniform distributions:
  [Figure: original components, observed mixtures, PCA, ICA]
- PCA does not find the original coordinates, ICA does!

Illustration of problem with gaussian distributions
[Figure: original components, observed mixtures, PCA]
- Distribution after PCA is the same as distribution before mixing!
- Factor rotation problem in classic FA

Basic intuitive principle of ICA estimation
- Inspired by the Central Limit Theorem:
  - Average of many independent random variables will have a distribution that is close(r) to gaussian
  - In the limit of an infinite number of random variables, the distribution tends to gaussian
- Consider a linear combination $\sum_i w_i x_i = \sum_i q_i s_i$
- Because of the theorem, $\sum_i q_i s_i$ should be more gaussian than the $s_i$.
- Maximizing the nongaussianity of $\sum_i w_i x_i$, we can find the $s_i$.
- Also known as projection pursuit.
- Cf. principal component analysis: maximize variance of $\sum_i w_i x_i$.
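The principle can be sketched numerically by scanning unit vectors $w$ and measuring how nongaussian each projection is. The sketch below assumes kurtosis as the nongaussianity measure and a 30-degree rotation as the mixing, so that the data are already white; both choices, and all names, are illustrative assumptions rather than part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def kurt(y):
    # Fourth-order cumulant of a centered sample.
    y = y - y.mean()
    return np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2

# Two independent unit-variance uniform sources.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))

# Assumed mixing: a rotation by 30 degrees, so x stays white.
a = np.deg2rad(30)
A = np.array([[np.cos(a), -np.sin(a)],
              [np.sin(a),  np.cos(a)]])
x = A @ s

# Scan w(phi) = (cos phi, sin phi) and record the kurtosis of w^T x.
degrees = np.arange(180)
kurts = np.array([
    kurt(np.cos(np.deg2rad(d)) * x[0] + np.sin(np.deg2rad(d)) * x[1])
    for d in degrees
])
best_deg = int(degrees[np.argmax(np.abs(kurts))])

print("most nongaussian direction (degrees):", best_deg)
print("|kurtosis| there:", abs(kurts[best_deg]))
print("|kurtosis| half-way between the sources:", abs(kurts[75]))
```

Nongaussianity peaks where $w$ lines up with one of the sources (near 30 or 120 degrees here) and is weakest for an even mixture of the two, in line with the central limit argument.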

Illustration of changes in nongaussianity
[Figure: histogram and scatterplot, original uniform distributions]
[Figure: histogram and scatterplot, mixtures given by PCA]

Sparsity is the dominant form of non-gaussianity
- In natural signals, the fundamental non-gaussianity is sparsity
- Sparsity = probability density has heavy tails and a peak at zero:
  [Figure: panels labelled "gaussian" and "sparse"]
- (Another form of non-gaussianity is skewness or asymmetry)
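A small sketch of what "heavy tails and a peak at zero" means in a sample. It assumes a unit-variance Laplacian as the sparse density; the thresholds 0.1 and 3 and the helper names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

gauss = rng.standard_normal(n)
# Laplacian with scale 1/sqrt(2) has unit variance.
sparse = rng.laplace(scale=1 / np.sqrt(2), size=n)

def frac_near_zero(x, eps=0.1):
    # Share of samples in the peak around zero.
    return np.mean(np.abs(x) < eps)

def frac_in_tails(x, thr=3.0):
    # Share of samples far out in the tails.
    return np.mean(np.abs(x) > thr)

for name, x in [("gaussian", gauss), ("sparse", sparse)]:
    print(name, "near zero:", frac_near_zero(x), "in tails:", frac_in_tails(x))
```

With equal variance, the sparse sample has more values both very close to zero and far out in the tails than the gaussian one.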

Kurtosis as nongaussianity measure
- Problem: how to measure nongaussianity (sparsity)?
- Definition:
  $\mathrm{kurt}(x) = E\{x^4\} - 3(E\{x^2\})^2$  (4)
  if variance is constrained to unity, essentially the 4th moment.
- Simple algebraic properties because it's a cumulant (for independent $s_1$, $s_2$):
  $\mathrm{kurt}(s_1 + s_2) = \mathrm{kurt}(s_1) + \mathrm{kurt}(s_2)$  (5)
  $\mathrm{kurt}(\alpha s_1) = \alpha^4 \mathrm{kurt}(s_1)$  (6)
- Zero for a gaussian RV, non-zero for most nongaussian RVs.
- Positive vs. negative kurtosis have typical forms of pdf.
- Variance must be constrained to measure non-gaussianity
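The definition (4) and the properties (5) and (6) can be checked on samples. This is a minimal sketch: the sample size, the choice of a Laplacian and a uniform variable, and the centering step inside `kurt` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def kurt(x):
    # Definition (4), applied to the centered sample.
    x = x - x.mean()
    return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

gauss = rng.standard_normal(n)
laplace = rng.laplace(scale=1 / np.sqrt(2), size=n)     # unit variance
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)  # unit variance

k_gauss = kurt(gauss)
k_laplace = kurt(laplace)
k_uniform = kurt(uniform)

# Property (5): additivity for independent variables (approximate in a sample).
k_sum = kurt(laplace + uniform)

# Property (6): scaling by alpha multiplies kurtosis by alpha^4.
alpha = 2.0
k_scaled = kurt(alpha * uniform)

print("gaussian:", k_gauss, "laplacian:", k_laplace, "uniform:", k_uniform)
print("kurt(s1 + s2):", k_sum, "vs sum of kurtoses:", k_laplace + k_uniform)
print("kurt(2 s):", k_scaled, "vs 16 kurt(s):", alpha ** 4 * k_uniform)
```

The scaling property also shows why variance has to be fixed: rescaling a variable changes its kurtosis without changing the shape of its distribution.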

Illustration of positive and negative kurtosis
[Figure] Left: Laplacian pdf, positive kurtosis ("supergaussian"). Right: Uniform pdf, negative kurtosis ("subgaussian").

Why kurtosis is not optimal
- Sensitive to outliers: Consider a sample of 1000 values with unit variance, and one value equal to 10.
  Kurtosis equals at least $10^4/1000 - 3 = 7$.
- For supergaussian variables, statistical performance is not optimal even without outliers.
- Other measures of nongaussianity should be considered.
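The outlier arithmetic can be reproduced directly. In this sketch the 999 ordinary values are assumed gaussian and are rescaled so that the full sample of 1000 values, including the single value 10, has mean square exactly 1; the construction and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kurt(x):
    # E{x^4} - 3 (E{x^2})^2, assuming x is (approximately) zero-mean.
    return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

# Reference: 1000 gaussian values scaled to unit mean square, no outlier.
clean = rng.standard_normal(1000)
kurt_clean = kurt(clean / np.sqrt(np.mean(clean ** 2)))

# 999 ordinary values with total sum of squares 900, plus one value of 10,
# so the sum of squares is 900 + 10^2 = 1000 and the mean square is 1.
rest = rng.standard_normal(999)
rest *= np.sqrt(900.0 / np.sum(rest ** 2))
x = np.append(rest, 10.0)
kurt_outlier = kurt(x)

print("mean square with outlier:", np.mean(x ** 2))
print("kurtosis without outlier:", kurt_clean)
print("kurtosis with outlier:   ", kurt_outlier)
```

The single outlier alone contributes $10^4/1000 = 10$ to the fourth moment, so the kurtosis estimate cannot fall below 7 whatever the other 999 values are.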

Differential entropy as nongaussianity measure
- Generalization of ordinary discrete Shannon entropy:
  $H(x) = E\{-\log p(x)\}$  (7)
- For fixed variance, maximized by the gaussian distribution.
- Often normalized to give negentropy
  $J(x) = H(x_{\mathrm{gauss}}) - H(x)$  (8)
- Good statistical properties, but computationally difficult.

Approximation of negentropy
- Approximations of negentropy (Hyvärinen, 1998):
  $J_G(x) = (E\{G(x)\} - E\{G(x_{\mathrm{gauss}})\})^2$  (9)
  where $G$ is a nonquadratic function.
- Generalization of (square of) kurtosis (which is $G(x) = x^4$).
- A good compromise?
  - statistical properties not bad (for suitable choice of $G$)
  - computationally simple
- Further possibility: Skewness (for nonsymmetric ICs)
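A minimal sketch of the approximation (9). The slide leaves $G$ open; here $G(u) = \log\cosh(u)$ is assumed as one commonly used nonquadratic choice, and $E\{G(x_{\mathrm{gauss}})\}$ is computed by a simple Riemann sum. The function and variable names are hypothetical.

```python
import numpy as np

def G(u):
    # log cosh(u), written via logaddexp to avoid overflow for large |u|.
    return np.logaddexp(u, -u) - np.log(2.0)

# E{G(x_gauss)} for a standard gaussian, by a Riemann sum on a fine grid.
grid = np.linspace(-10.0, 10.0, 20001)
pdf = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
EG_GAUSS = np.sum(G(grid) * pdf) * (grid[1] - grid[0])

def negentropy_approx(x):
    # J_G(x) from (9), after standardizing x to zero mean and unit variance.
    x = (x - x.mean()) / x.std()
    return (np.mean(G(x)) - EG_GAUSS) ** 2

rng = np.random.default_rng(0)
n = 200_000
J_gauss = negentropy_approx(rng.standard_normal(n))
J_laplace = negentropy_approx(rng.laplace(size=n))
J_uniform = negentropy_approx(rng.uniform(-1.0, 1.0, size=n))

print("gaussian:", J_gauss, "laplacian:", J_laplace, "uniform:", J_uniform)
```

Because $\log\cosh$ grows only linearly for large arguments, a single extreme value influences this measure far less than it influences the fourth moment.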

Basic ICA estimation procedure
1. Whiten the data to give $z$.
2. Set iteration count $i = 1$.
3. Take a random vector $w_i$.
4. Maximize nongaussianity of $w_i^T z$, under constraints $\|w_i\|^2 = 1$ and $w_i^T w_j = 0, j < i$
5. Increment iteration count by 1, go back to 3
Alternatively: maximize all the $w_i$ in parallel, keeping them orthogonal.
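The steps above can be sketched as follows. The procedure does not fix how step 4 is carried out; this sketch assumes kurtosis as the nongaussianity measure and a kurtosis-based fixed-point update, with Gram-Schmidt orthogonalization for the constraints. The function names `whiten` and `ica_deflation`, the test sources and the mixing matrix are all hypothetical.

```python
import numpy as np

def whiten(x):
    # Step 1: center and transform so that the covariance is the identity.
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    return E @ np.diag(d ** -0.5) @ E.T @ x

def ica_deflation(z, n_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n = z.shape[0]
    W = np.zeros((n, n))
    for i in range(n):                      # steps 2 and 5: loop over i
        w = rng.standard_normal(n)          # step 3: random start
        w -= W[:i].T @ (W[:i] @ w)          # enforce w_i^T w_j = 0 for j < i
        w /= np.linalg.norm(w)
        for _ in range(n_iter):             # step 4 (assumed update rule)
            w_new = (z * (w @ z) ** 3).mean(axis=1) - 3 * w
            w_new -= W[:i].T @ (W[:i] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1) < tol
            w = w_new
            if converged:
                break
        W[i] = w
    return W

# Illustrative data: one uniform and one Laplacian source, mixed linearly.
rng = np.random.default_rng(1)
n_samples = 20_000
s = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), n_samples),
               rng.laplace(scale=1 / np.sqrt(2), size=n_samples)])
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
z = whiten(A @ s)
W = ica_deflation(z)
y = W @ z

# Absolute correlations between estimated components (rows) and sources.
match = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(np.round(match, 3))
```

Each estimated component should correlate strongly with exactly one source; the order and sign of the components are not determined by the model.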

Development of ICA algorithms
- Nongaussianity measure: essential ingredient
  - Kurtosis: global consistency, but nonrobust.
  - Differential entropy: statistically justified, but difficult to compute.
    - Essentially same as likelihood (Pham et al, 1992/97) or infomax (Bell and Sejnowski, 1995)
  - Rough approximations of entropy: compromise
- Optimization methods
  - Gradient methods (e.g. natural gradient; Amari et al, 1996)
  - Fast fixed-point algorithm, FastICA (Hyvärinen, 1999)

Conclusion: Theory of ICA
- ICA is a non-gaussian factor analysis
- Basic principle: maximize non-gaussianity of components
  - (Really very different from PCA: maximize variance of components)
- Sparsity is a form of non-gaussianity prevalent in natural signals
- Measures of non-gaussianity crucial: kurtosis vs. differential entropy

Part II: Natural images and ICA
- Natural images have statistical regularities
- Statistical models show optimal processing
- Basic model is independent component analysis
- Components are not really independent
- Need and opportunit
- Instead of nongaussianity we could use temporal correlations
- A unifying framework: bubbles

Linear statistical models of images
[Figure: image patch] = $s_1$ [basis vector 1] + $s_2$ [basis vector 2] + ... + $s_k$ [basis vector k]
- Each image (patch) is a linear sum of basis vectors (features)
- What are the best basis vectors for natural images?

The visual cortex of the brain
[Figure: retina, LGN, V1]
Receptive field of a simple cell in V1: [Figure]

Sparse coding
- Sparse coding means: for a random vector $x$, find a linear representation
  $x = As$  (10)
  so that the components $s_i$ are as sparse (= supergaussian) as possible.
- Important property: a given data point is represented using only a limited number of "active" (clearly non-zero) components $s_i$.
  - In contrast to PCA, the active components change from one image patch to another.
  - Cf. the vocabulary of a language, which can describe many different things by combining a small number of active words.
- Maximizes non-gaussianity, therefore like ICA!

ICA / sparse coding of natural images
(Olshausen and Field, 1996; Bell and Sejnowski, 1997)
- Features similar to wavelets, Gabor functions, simple cells.

Dependence of independent components
- Components estimated from natural images are not really independent
- Next, we model some of the dependencies
- Independent subspaces + Topographic ICA

Correlation of squares
- What kind of dependence remains between the components?
- Answer: Squares $s_i^2$ and $s_j^2$ are correlated inside a subspace
- Dependence through variances
- Similar to the models by Simoncelli et al on wavelet coefficients; Valpola et al on variance sources
[Figure] Two signals that are uncorrelated but whose squares are correlated.
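A hedged sketch of how two signals can be uncorrelated while their squares are correlated. It assumes the simplest form of "dependence through variances": two independent gaussian signals multiplied by one shared, random scale variable `v`. This construction and all names are illustrative assumptions, not the specific model of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Shared variance variable: large v makes both signals large at the same time.
v = np.abs(rng.standard_normal(n))

# Independent gaussian signals modulated by the common v.
s1 = v * rng.standard_normal(n)
s2 = v * rng.standard_normal(n)

corr_linear = np.corrcoef(s1, s2)[0, 1]
corr_squares = np.corrcoef(s1 ** 2, s2 ** 2)[0, 1]

print("correlation of s1 and s2:    ", corr_linear)
print("correlation of s1^2 and s2^2:", corr_squares)
```

The signs of the two signals are unrelated, so the linear correlation vanishes, but their magnitudes rise and fall together, which shows up in the squares.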

Grouping components (Cardoso, 1998; Hyvärinen and Hoyer, 2000)
- Assumption: the $s_i$ can be divided into groups (subspaces), such that
  - the $s_i$ in the same group are dependent on each other
  - dependencies between different groups are not allowed
- We also need to specify the distributions inside the groups
- Invariant features given by norms of projections on the subspaces
- Spherically symmetric inside subspaces

Independent subspaces of natural images
- Emergence of phase-invariance, as in complex cells in V1.

Topographic ICA (Hyvärinen, Hoyer and Inki, 2001)
- Components are arranged on a two-dimensional lattice
- Statistical dependency follows topography: the squares $s_i^2$ are correlated for near-by components
- Each local region is like a subspace

Topographic ICA on natural images
- Topography similar to what is found in the cortex.

Temporally coherent components (Hurri and Hyvärinen, 2003)
- In image sequences (video) we can look at the temporal correlations
- An alternative to nongaussianity
- Linear correlations give only Fourier-like receptive fields
- We proposed temporal correlations of squares
- Similar to source separation using nonstationary variance (Matsuoka et al, 1995)

Temporally coherent features on natural image sequences
- Features similar to those obtained by ICA

Bubbles: a unifying framework
- Correlation of squares both over time and over components
- Spatiotemporal modulating variance variables
- A simple approximation of the likelihood can be obtained, as in topographic ICA:
  $\sum_{t=1}^{T} \sum_{j=1}^{n} G\Big(\sum_{i=1}^{n} \sum_{\tau} h(i,j,\tau)\,(w_i^T x(t-\tau))^2\Big) + T \log|\det W|$  (11)
  where $h(i,j,\tau)$ is the neighbourhood function, and $G$ a nonlinear function.

Illustration of four types of representation
[Figure: four panels titled "sparse", "sparse topographic", "sparse temporally coherent" and "bubbles"; axes: position of filter, time]

Conclusion: Natural images and ICA
- ICA is a non-gaussian factor analysis
  - Basic principle: maximize non-gaussianity of components
  - Measures of non-gaussianity crucial: kurtosis vs. differential entropy
- ICA and related models show optimal features for natural images.
  - ICA models basic linear features.
  - Independent subspaces and topographic ICA model basic dependencies or nonlinearities.
  - Temporal coherence is an alternative approach, leading to bubbles.