Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II

Gatsby Unit, University College London, 27 Feb 2017

Outline
Part I: Theory of ICA
  Definition and difference to PCA
  Importance of non-gaussianity
Part II: Natural images and ICA
  Application of ICA and sparse coding on natural images
  Extensions of ICA with dependent components
Part III: Estimation of unnormalized models
  Motivation by extensions of ICA
  Score matching
  Noise-contrastive estimation
Part IV: Recent extensions of ICA and natural image statistics
  A three-layer model, towards deep learning

Part I: Theory of ICA
  Definition of ICA as non-gaussian generative model
  Importance of non-gaussianity
  Fundamental difference to PCA
  Estimation by maximization of non-gaussianity
  Measures of non-gaussianity

Blind source separation Linear generative model Comparison to PCA Identifiability by nongaussianity
Problem of blind source separation
There are a number of source signals.
Due to some external circumstances, only linear mixtures of the source signals are observed.
Estimate (separate) the original signals!

A solution is possible
PCA does not recover the original signals.
Use information on statistical independence to recover them.

Independent Component Analysis (Hérault and Jutten, 1984-1991)
Observed random variables $x_i$ are modelled as linear sums of hidden variables:
$x_i = \sum_{j=1}^{m} a_{ij} s_j, \quad i = 1, \ldots, n$   (1)
Mathematical formulation of the blind source separation problem
Not unlike factor analysis
The matrix of the $a_{ij}$ is the parameter matrix, called the mixing matrix.
The $s_j$ are hidden random variables called independent components, or source signals.
Problem: estimate both the $a_{ij}$ and the $s_j$, observing only the $x_i$.
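As a minimal sketch of the generative model in (1), the following NumPy snippet draws two nongaussian sources and mixes them linearly. The particular source distributions, the mixing matrix `A`, the sample size and the seed are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_samples = 1000
# Two hypothetical nongaussian sources, one per row: uniform and Laplacian.
s = np.vstack([
    rng.uniform(-1.0, 1.0, n_samples),
    rng.laplace(0.0, 1.0, n_samples),
])

# Hypothetical mixing matrix A = (a_ij); in a real problem it is unknown.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])

# Observed mixtures: x_i = sum_j a_ij s_j, i.e. x = A s for every sample.
x = A @ s
```

In the estimation problem only `x` would be available; `A` and `s` are what ICA tries to recover.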

When can the ICA model be estimated?
Must assume:
  The $s_i$ are mutually statistically independent
  The $s_i$ are nongaussian (non-normal)
  (Optional:) The number of independent components is equal to the number of observed variables
Then: the mixing matrix and the components can be identified (Comon, 1994)
A very surprising result!

Reminder: Principal component analysis
Basic idea: find directions $\sum_i w_i x_i$ of maximum variance
We must constrain the norm of $w$: $\sum_i w_i^2 = 1$, otherwise the solution is that the $w_i$ are infinite.
For more than one component, find the direction of maximum variance orthogonal to the components previously found.
Classic factor analysis has essentially the same idea as PCA: explain maximal variance with a limited number of components

Comparison of ICA, PCA, and factor analysis
In contrast to PCA and factor analysis, the components of ICA really give the original source signals or underlying hidden variables
In PCA and factor analysis, only a subspace is properly determined (although an arbitrary basis is given as output)
Catch: ICA only works when the components are nongaussian
  Many psychological or social-science hidden variables (e.g. "intelligence") may be (practically) gaussian because they are sums of many independent variables (central limit theorem).
  But signals measured by sensors are usually quite nongaussian

Some examples of nongaussianity
[Figure: three example signals (top row) and their histograms (bottom row)]

Why classic methods cannot find original components or sources
In PCA and FA: find components $y_i$ which are uncorrelated
$\mathrm{cov}(y_i, y_j) = E\{y_i y_j\} - E\{y_i\} E\{y_j\} = 0$   (2)
and maximize explained variance (or variance of components)
Such methods need only the covariances, $\mathrm{cov}(x_i, x_j)$
However, there are many different component sets that are uncorrelated, because
  The number of covariances is about $n^2/2$ due to symmetry
  So, we cannot solve for the $n^2$ mixing coefficients: not enough information! ("More variables than equations")

Nongaussianity, with independence, gives more information
For independent variables we have
$E\{h_1(y_1) h_2(y_2)\} - E\{h_1(y_1)\} E\{h_2(y_2)\} = 0$   (3)
for any functions $h_1$ and $h_2$.
For nongaussian variables, nonlinear covariances give more information than just covariances.
This is not true for the multivariate gaussian distribution
  The distribution is completely determined by covariances
  Uncorrelated gaussian variables are independent
The ICA model cannot be estimated for gaussian data.

Whitening as preprocessing for ICA
Whitening is usually done before ICA
Whitening means decorrelation and standardization, $E\{xx^T\} = I$.
After whitening, $A$ can be considered orthogonal. With unit-variance components, $E\{ss^T\} = I$, so
$E\{xx^T\} = I = A E\{ss^T\} A^T = A A^T$   (4)
Half of the parameters estimated! (and other technical benefits)
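Whitening can be sketched with an eigendecomposition of the sample covariance. The function name `whiten`, the symmetric form of the whitening matrix `V`, and the toy data below are assumptions made for illustration; other whitening matrices (for example the PCA-based one) would serve equally well.

```python
import numpy as np

def whiten(x):
    """Whiten data x of shape (n_variables, n_samples).

    Returns z = V (x - mean), whose sample covariance is the identity,
    together with the whitening matrix V.
    """
    xc = x - x.mean(axis=1, keepdims=True)          # remove the mean
    cov = xc @ xc.T / xc.shape[1]                   # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # cov = E diag(d) E^T
    V = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return V @ xc, V

# Toy data: two uniform sources mixed by a hypothetical matrix A.
rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(2, 5000))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
z, V = whiten(A @ s)

# Sample version of E{z z^T}; it should be numerically equal to I.
cov_z = z @ z.T / z.shape[1]
```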

Illustration
Two components with uniform distributions:
[Figure: original components, observed mixtures, PCA, ICA]
PCA does not find the original coordinates, ICA does!

Illustration of problem with gaussian distributions
[Figure: original components, observed mixtures, PCA]
The distribution after PCA is the same as the distribution before mixing!
"Factor rotation problem" in classic factor analysis

Maximization of non-gaussianity Measures of non-gaussianity
Basic intuitive principle of ICA estimation
Inspired by the Central Limit Theorem:
  The average of many independent random variables has a distribution that is close(r) to gaussian
  In the limit of an infinite number of random variables, the distribution tends to gaussian
Consider a linear combination $\sum_i w_i x_i = \sum_i q_i s_i$
Because of the theorem, $\sum_i q_i s_i$ should be more gaussian than the $s_i$.
By maximizing the nongaussianity of $\sum_i w_i x_i$, we can find the $s_i$.
Also known as projection pursuit.
Cf. principal component analysis: maximize the variance of $\sum_i w_i x_i$.
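The intuition can be checked numerically. The sketch below assumes two unit-variance uniform sources and the weights $q = (1/\sqrt{2}, 1/\sqrt{2})$, and uses kurtosis (defined later in this part) as the nongaussianity measure; all of these choices are illustrative.

```python
import numpy as np

def kurt(y):
    """Sample kurtosis E{y^4} - 3 (E{y^2})^2 of the centered signal y."""
    y = y - y.mean()
    return np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(0)
n = 100_000
# Two independent uniform sources scaled to unit variance.
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, n))

# A unit-norm combination sum_i q_i s_i of the sources.
q = np.array([1.0, 1.0]) / np.sqrt(2.0)
y = q @ s

kurt_source = kurt(s[0])   # theoretical value for a uniform source: -1.2
kurt_mix = kurt(y)         # theoretical value for this mixture: -0.6
```

The mixture has kurtosis closer to zero, that is, it is closer to gaussian than either source.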

Illustration of changes in nongaussianity
[Figure: histogram and scatterplot, original uniform distributions]

Sparsity is the dominant form of non-gaussianity
In natural signals, the fundamental form of non-gaussianity is sparsity
Sparsity = the probability density has heavy tails and a peak at zero:
[Figure: gaussian vs. sparse]
(Another form of non-gaussianity is skewness or asymmetry)

Kurtosis as nongaussianity measure
Problem: how to measure nongaussianity (sparsity)?
Definition (for zero-mean $x$):
$\mathrm{kurt}(x) = E\{x^4\} - 3 (E\{x^2\})^2$   (5)
If the variance is constrained to unity, this is essentially the 4th moment.
Simple algebraic properties because it is a cumulant (for independent $s_1$, $s_2$):
$\mathrm{kurt}(s_1 + s_2) = \mathrm{kurt}(s_1) + \mathrm{kurt}(s_2)$   (6)
$\mathrm{kurt}(\alpha s_1) = \alpha^4 \mathrm{kurt}(s_1)$   (7)
Zero for a gaussian RV, non-zero for most nongaussian RVs.
Positive vs. negative kurtosis have typical forms of pdf.
The variance must be constrained to measure non-gaussianity
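A short numerical check of definition (5) and properties (6) and (7), as a sketch. The sample size, the seed and the choice of test distributions (gaussian, Laplacian, uniform, all with unit variance) are assumptions for illustration; sample estimates only approximate the theoretical values given in the comments.

```python
import numpy as np

def kurt(x):
    """Sample version of kurt(x) = E{x^4} - 3 (E{x^2})^2 after centering."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.mean(x ** 4) - 3.0 * np.mean(x ** 2) ** 2

rng = np.random.default_rng(0)
n = 200_000
gauss = rng.standard_normal(n)                             # kurtosis 0
laplace = rng.laplace(0.0, 1.0 / np.sqrt(2.0), n)          # kurtosis 3
uniform = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n)      # kurtosis -1.2

k_gauss, k_laplace, k_uniform = kurt(gauss), kurt(laplace), kurt(uniform)

# Property (7): scaling by alpha multiplies the kurtosis by alpha^4.
alpha = 2.0
k_scaled = kurt(alpha * laplace)

# Property (6): kurtosis is additive for independent variables
# (only approximately so for finite samples).
k_sum = kurt(laplace + uniform)
```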

Illustration of positive and negative kurtosis
Left: Laplacian pdf, positive kurtosis ("supergaussian"). Right: Uniform pdf, negative kurtosis ("subgaussian").

Why kurtosis is not optimal
Sensitive to outliers: Consider a sample of 1000 values with unit variance, and one value equal to 10. The kurtosis equals at least $10^4/1000 - 3 = 7$.
For supergaussian variables, statistical performance is not optimal even without outliers.
Other measures of nongaussianity should be considered.
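The outlier argument can be reproduced as follows. This is a sketch under assumptions: the clean sample is gaussian, one of its values is replaced by 10, and the sample is standardized before the kurtosis is computed, so the simulated value will not coincide exactly with the bound.

```python
import numpy as np

def kurt_standardized(x):
    """Kurtosis E{x^4} - 3 after standardizing x to zero mean, unit variance."""
    x = (x - x.mean()) / x.std()
    return np.mean(x ** 4) - 3.0

# Lower bound from the slide: the outlier alone contributes 10^4 / 1000
# to the fourth moment of a unit-variance sample of 1000 values.
bound = 10 ** 4 / 1000 - 3

rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)      # gaussian sample, kurtosis near 0
contaminated = clean.copy()
contaminated[0] = 10.0                 # replace one value by the outlier

k_clean = kurt_standardized(clean)
k_contaminated = kurt_standardized(contaminated)
```

A single value out of 1000 moves the estimate from roughly zero to a strongly supergaussian value, which is the non-robustness the slide refers to.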

Differential entropy as nongaussianity measure
Generalization of ordinary discrete Shannon entropy:
$H(x) = E\{-\log p(x)\}$   (8)
For fixed variance, maximized by the gaussian distribution.
Often normalized to give negentropy
$J(x) = H(x_{\mathrm{gauss}}) - H(x)$   (9)
Good statistical properties, but computationally difficult.

Approximation of negentropy
Approximations of negentropy (Hyvärinen, 1998):
$J_G(x) = (E\{G(x)\} - E\{G(x_{\mathrm{gauss}})\})^2$   (10)
where $G$ is a nonquadratic function.
Generalization of (the square of) kurtosis (which is $G(x) = x^4$).
A good compromise?
  statistical properties not bad (for a suitable choice of $G$)
  computationally simple
Further possibility: If we know the data is sparse, the negative of the L1 norm, $-E\{|x|\}$, may be enough.
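Approximation (10) is easy to evaluate on samples. The sketch below assumes $G(u) = \log\cosh u$, one commonly used nonquadratic function (the slide does not fix $G$), and obtains $E\{G(x_{\mathrm{gauss}})\}$ by a simple Riemann sum; the names `G`, `EG_GAUSS` and `negentropy_approx` are hypothetical.

```python
import numpy as np

def G(u):
    """Assumed nonquadratic function G(u) = log cosh(u)."""
    return np.log(np.cosh(u))

# E{G(x_gauss)} for a standard gaussian, by a Riemann sum on a fine grid.
_grid = np.linspace(-10.0, 10.0, 20001)
_du = _grid[1] - _grid[0]
_pdf = np.exp(-0.5 * _grid ** 2) / np.sqrt(2.0 * np.pi)
EG_GAUSS = np.sum(G(_grid) * _pdf) * _du

def negentropy_approx(x):
    """J_G(x) = (E{G(x)} - E{G(x_gauss)})^2 for x standardized to unit variance."""
    x = (x - x.mean()) / x.std()
    return (np.mean(G(x)) - EG_GAUSS) ** 2

rng = np.random.default_rng(0)
n = 100_000
J_gauss = negentropy_approx(rng.standard_normal(n))       # close to zero
J_laplace = negentropy_approx(rng.laplace(0.0, 1.0, n))   # sparse source
J_uniform = negentropy_approx(rng.uniform(-1.0, 1.0, n))  # subgaussian source
```

Both nongaussian samples give a clearly larger value than the gaussian sample, whose value differs from zero only through sampling error.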

Basic ICA estimation procedure
1. Whiten the data to give $z$.
2. Set iteration count $i = 1$.
3. Take a random vector $w_i$.
4. Maximize the nongaussianity of $w_i^T z$, under the constraints $\|w_i\|^2 = 1$ and $w_i^T w_j = 0, j < i$.
5. Increment the iteration count by 1, go back to 3.
Alternatively: maximize all the $w_i$ in parallel, keeping them orthogonal.
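The steps above can be sketched in NumPy. The slide does not prescribe how step 4 is carried out; this sketch assumes kurtosis as the nongaussianity measure and a kurtosis-based fixed-point update of the FastICA type, $w \leftarrow E\{z (w^T z)^3\} - 3w$, followed by Gram-Schmidt orthogonalization and normalization. The function names, the toy sources and the mixing matrix are hypothetical.

```python
import numpy as np

def whiten(x):
    """Step 1: center x (variables x samples) and make its covariance I."""
    xc = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(xc @ xc.T / xc.shape[1])
    return (E @ np.diag(d ** -0.5) @ E.T) @ xc

def ica_deflation(z, n_components, n_iter=200, tol=1e-8, seed=0):
    """Estimate the rows w_i one by one from whitened data z."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_components, z.shape[0]))
    for i in range(n_components):                 # steps 2 and 5
        w = rng.standard_normal(z.shape[0])       # step 3: random start
        w /= np.linalg.norm(w)
        for _ in range(n_iter):                   # step 4
            y = w @ z
            # Assumed update: kurtosis-based fixed-point iteration.
            w_new = (z * y ** 3).mean(axis=1) - 3.0 * w
            # Constraint w_i^T w_j = 0 for j < i (Gram-Schmidt).
            w_new -= W[:i].T @ (W[:i] @ w_new)
            # Constraint ||w_i||^2 = 1.
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:
                break
        W[i] = w
    return W

# Toy problem: a uniform and a Laplacian source, both with unit variance.
rng = np.random.default_rng(1)
n = 10_000
s = np.vstack([rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n),
               rng.laplace(0.0, 1.0 / np.sqrt(2.0), n)])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
z = whiten(A @ s)
W = ica_deflation(z, n_components=2)
y = W @ z

# Absolute correlations between estimated components (rows) and sources
# (columns); the sources are recovered only up to order and sign.
match = np.abs(np.corrcoef(y, s)[:2, 2:])
```

Each row of `match` should contain one entry close to 1, reflecting that ICA identifies the components up to permutation and sign.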

Development of ICA algorithms
Nongaussianity measure: Essential ingredient
  Kurtosis: global consistency, but nonrobust.
  Differential entropy: statistically justified, but difficult to compute.
    Essentially the same as likelihood (Pham et al, 1992/97) or infomax (Bell and Sejnowski, 1995)
  Rough approximations of entropy: compromise
Optimization methods
  Gradient methods (e.g. natural gradient; Amari et al, 1996)
  Fast fixed-point algorithm, FastICA (Hyvärinen, 1999)

Conclusion: Theory of ICA
ICA is a non-gaussian linear generative model (factor analysis)
Basic principle: maximize non-gaussianity of components
  (Really very different from PCA: maximize variance of components)
Sparsity is a form of non-gaussianity prevalent in natural signals
Measures of non-gaussianity crucial: kurtosis vs. differential entropy

Part II: Natural images and ICA
  Natural images have statistical regularities
  Statistical models show optimal processing
  Basic model is independent component analysis
  Components are not really independent
  Need and opportunity for better models
  Instead of nongaussianity we could use temporal correlations
  A unifying framework: bubbles

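Linear statistical models of images
[Figure: image patch $= s_1 \cdot$ (basis vector 1) $+ s_2 \cdot$ (basis vector 2) $+ \cdots + s_k \cdot$ (basis vector $k$)]
Each image (patch) is a linear sum of basis vectors (features)
What are the best basis vectors for natural images?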

The visual cortex of the brain
[Figure: retina, LGN, V1]
Receptive field of a simple cell in V1: [Figure]

Sparse coding
Sparse coding means: For a random vector $x$, find a linear representation
$x = As$   (11)
so that the components $s_i$ are as sparse (= supergaussian) as possible.
Important property: a given data point is represented using only a limited number of "active" (clearly non-zero) components $s_i$.
In contrast to PCA, the active components change from one image patch to another.
Cf. the vocabulary of a language, which can describe many different things by combining a small number of active words.
Maximizes non-gaussianity, therefore like ICA!

Independent subspaces Topographic ICA
ICA / sparse coding of natural images (Olshausen and Field, 1996; Bell and Sejnowski, 1997)
Features similar to wavelets, Gabor functions, simple cells.

Dependence of independent components
Components estimated from natural images are not really independent
Next, we model some of the dependencies
Independent subspaces + Topographic ICA

Correlation of squares
What kind of dependence remains between the components?
Answer: The squares $s_i^2$ and $s_j^2$ are correlated inside a subspace
Dependence through variances
Similar to the models by Simoncelli et al on wavelet coefficients; Valpola et al on variance sources
[Figure: two signals that are uncorrelated but whose squares are correlated]
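Such signals are easy to simulate. In this sketch, two independent gaussian signals are multiplied by a shared random scale `d`; this construction, and the distribution chosen for `d`, are assumptions used only to illustrate dependence through variances.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical common variance variable shared by both components.
d = np.abs(rng.standard_normal(n))
# Independent gaussian signals, modulated by the same d.
g = rng.standard_normal((2, n))
s = d * g

# Ordinary (linear) correlation: close to zero.
corr_linear = np.corrcoef(s[0], s[1])[0, 1]
# Correlation of the squares: clearly positive
# (theoretical value 0.25 for this particular construction).
corr_squares = np.corrcoef(s[0] ** 2, s[1] ** 2)[0, 1]
```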

Grouping components (Cardoso, 1998; Hyvärinen and Hoyer, 2000)
Assumption: the $s_i$ can be divided into groups (subspaces), such that
  the $s_i$ in the same group are dependent on each other
  dependencies between different groups are not allowed
We also need to specify the distributions inside the groups, achieving correlations of squares
Invariant features given by norms of projections on the subspaces
  spherically symmetric inside subspaces
Inside each subspace, minimize
$\sum_{i=1}^{k} (w_i^T x)^2$   (12)

Computation of invariant features
[Figure: the input I is projected on the filters, $\langle w_1, I \rangle, \ldots, \langle w_8, I \rangle$; each projection is squared, $(\cdot)^2$, and the squares are summed, $\Sigma$, within each group of four filters]
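The computation in the diagram (project, square, sum within each subspace) can be written compactly. In this sketch the filters `W` are simply the rows of an identity matrix and the input `I` is a toy vector; real filters would be estimated from image data. The function name `subspace_energies` is hypothetical.

```python
import numpy as np

def subspace_energies(W, I, subspace_size):
    """Sum of squared projections <w_i, I>^2 within each group of filters."""
    proj = W @ I                                   # <w_i, I> for every filter
    return (proj ** 2).reshape(-1, subspace_size).sum(axis=1)

W = np.eye(8)           # hypothetical filters: 8 orthonormal vectors
I = np.arange(8.0)      # hypothetical input
energies = subspace_energies(W, I, subspace_size=4)

# Rotating the input inside the first subspace changes the individual
# projections but leaves the pooled energies unchanged.
theta = 0.7
R = np.eye(8)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta), np.cos(theta)]]
energies_rotated = subspace_energies(W, R @ I, subspace_size=4)
```

The pooled value depends only on the norm of the projection onto each subspace, which is the sense in which the feature is invariant.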

Independent subspaces of natural images
Emergence of phase-invariance, as in complex cells in V1.

Topographic ICA (Hyvärinen, Hoyer and Inki, 2001)
Components are arranged on a two-dimensional lattice
Statistical dependency follows topography: the squares $s_i^2$ are correlated for near-by components
Each local region is like a subspace

Topographic ICA on natural images
Topography similar to what is found in the cortex.

Temporally coherent components (Hurri and Hyvärinen, 2003)
In image sequences (video) we can look at the temporal correlations
An alternative to nongaussianity
Linear correlations give only Fourier-like receptive fields
We proposed temporal correlations of squares
Similar to source separation using nonstationary variance (Matsuoka et al, 1995)

Temporal coherence Bubbles
Temporally coherent features on natural image sequences
Features similar to those obtained by ICA

Bubbles: a unifying framework
Correlation of squares both over time and over components
  spatiotemporal modulating variance variables
A simple objective can be obtained where we pool squares over space and time:
$\sum_{t=1}^{T} \sum_{j=1}^{n} G\left( \sum_{i=1}^{n} \sum_{\tau} h(i,j,\tau) \, (w_i^T x(t-\tau))^2 \right)$   (13)
where $h(i,j,\tau)$ is a neighbourhood function, and $G$ a nonlinear function.
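To make the pooling in (13) concrete, here is a sketch that evaluates the objective for given filters. Several details are assumptions: the neighbourhood function is stored as an array `h[i, j, tau]` with lags `tau = 0, ..., L-1`, time points without a full lag window are skipped, and `G` is taken to be the square root purely for the toy example.

```python
import numpy as np

def bubbles_objective(W, x, h, G):
    """Evaluate sum_t sum_j G(sum_i sum_tau h[i, j, tau] (w_i^T x(t - tau))^2).

    W: filters, shape (n, d); x: data, shape (d, T);
    h: neighbourhood function, shape (n, n, L); G: nonlinear function.
    """
    e = (W @ x) ** 2                        # squared outputs, shape (n, T)
    n_lags = h.shape[2]
    total = 0.0
    for t in range(n_lags - 1, e.shape[1]):
        # window[i, tau] = (w_i^T x(t - tau))^2 for tau = 0, ..., L-1
        window = e[:, t - n_lags + 1:t + 1][:, ::-1]
        pooled = np.einsum('ijk,ik->j', h, window)   # pool over i and tau
        total += G(pooled).sum()                     # sum over j
    return total

# Toy example: two filters, three time points, pooling over two lags
# of the same component only (h[i, j, tau] = 1 if i == j, else 0).
W = np.eye(2)
x = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])
h = np.stack([np.eye(2), np.eye(2)], axis=2)
total = bubbles_objective(W, x, h, np.sqrt)
```

With a neighbourhood function that also spreads over nearby components `j`, the same code pools over space as well as time.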

Illustration of four types of representation
[Figure: four panels (sparse; sparse topographic; sparse temporally coherent; bubbles), each plotting position of filter against time]

Conclusion: Natural images and ICA
ICA is a non-gaussian factor analysis
  Basic principle: maximize non-gaussianity of components
  Measures of non-gaussianity crucial: kurtosis vs. differential entropy
ICA and related models show optimal features for natural images.
  ICA models basic linear features.
  Independent subspaces and topographic ICA model basic dependencies or nonlinearities.
  Temporal coherence is an alternative approach, leading to bubbles.