Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II

Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images, Parts I-II. Gatsby Unit, University College London, 27 Feb 2017.

Outline
Part I: Theory of ICA
- Definition and difference to PCA
- Importance of non-gaussianity
Part II: Natural images and ICA
- Application of ICA and sparse coding on natural images
- Extensions of ICA with dependent components
Part III: Estimation of unnormalized models
- Motivation by extensions of ICA
- Score matching
- Noise-contrastive estimation
Part IV: Recent extensions of ICA and natural image statistics
- A three-layer model, towards deep learning

Part I: Theory of ICA
- Definition of ICA as a non-gaussian generative model
- Importance of non-gaussianity
- Fundamental difference to PCA
- Estimation by maximization of non-gaussianity
- Measures of non-gaussianity

Problem of blind source separation
There are a number of source signals. Due to some external circumstances, only linear mixtures of the source signals are observed.
Estimate (separate) the original signals!

A solution is possible
PCA does not recover the original signals; use information on statistical independence to recover them.

Independent Component Analysis (Hérault and Jutten, 1984-1991)
Observed random variables x_i are modelled as linear sums of hidden variables:
x_i = Σ_{j=1}^m a_ij s_j,  i = 1, ..., n.   (1)
This is a mathematical formulation of the blind source separation problem, not unlike factor analysis.
The matrix of the a_ij is the parameter matrix, called the mixing matrix. The s_j are hidden random variables called independent components, or source signals.
Problem: estimate both the a_ij and the s_j, observing only the x_i.
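To make the generative model in Eq. (1) concrete, here is a minimal NumPy sketch (not part of the lecture): two independent, sparse (Laplacian) sources are drawn and mixed with a hypothetical mixing matrix A; the blind source separation problem is to recover A and s from x alone.

import numpy as np

rng = np.random.default_rng(0)

# Two independent, nongaussian (Laplacian, i.e. sparse) source signals s_j.
n_samples = 10000
s = rng.laplace(size=(2, n_samples))

# A hypothetical mixing matrix a_ij (unknown in the BSS setting).
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

# Observed signals: x_i = sum_j a_ij s_j  (Eq. 1).
x = A @ s

# We observe only x; the goal of ICA is to estimate both A and s.
print("covariance of observed mixtures:\n", np.cov(x))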

When can the ICA model be estimated?
Must assume:
- The s_i are mutually statistically independent
- The s_i are nongaussian (non-normal)
- (Optional) The number of independent components equals the number of observed variables
Then the mixing matrix and the components can be identified (Comon, 1994). A very surprising result!

Reminder: Principal component analysis
Basic idea: find directions Σ_i w_i x_i of maximum variance.
We must constrain the norm of w, Σ_i w_i^2 = 1, otherwise the solution is that the w_i become infinite.
For more than one component, find the direction of maximum variance orthogonal to the components previously found.
Classic factor analysis has essentially the same idea as PCA: explain maximal variance with a limited number of components.
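For comparison, a minimal sketch of PCA as just described (the toy data is illustrative): the unit-norm direction maximizing the variance of Σ_i w_i x_i is the leading eigenvector of the covariance matrix.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ rng.laplace(size=(2, 10000))  # correlated toy data

# Covariance matrix of the centered data.
xc = x - x.mean(axis=1, keepdims=True)
C = xc @ xc.T / xc.shape[1]

# The direction of maximum variance under ||w|| = 1 is the leading eigenvector of C.
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
w_pca = eigvecs[:, -1]

print("first principal direction:", w_pca)
print("variance along it:", w_pca @ C @ w_pca)   # equals the largest eigenvalue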

Comparison of ICA, PCA, and factor analysis
- In contrast to PCA and factor analysis, in ICA the components really give the original source signals or underlying hidden variables.
- In PCA and factor analysis, only a subspace is properly determined (although an arbitrary basis is given as output).
- Catch: ICA only works when the components are nongaussian.
- Many psychological or social-science hidden variables (e.g. "intelligence") may be (practically) gaussian because they are sums of many independent variables (central limit theorem). But signals measured by sensors are usually quite nongaussian.

Some examples of nongaussianity
(Figure: example nongaussian signals and their histograms; axis tick labels omitted.)

Why classic methods cannot find the original components or sources
In PCA and FA: find components y_i which are uncorrelated,
cov(y_i, y_j) = E{y_i y_j} - E{y_i} E{y_j} = 0,   (2)
and maximize explained variance (or the variance of the components).
Such methods need only the covariances cov(x_i, x_j).
However, there are many different component sets that are uncorrelated, because the number of covariances is only about n^2/2 due to symmetry. So we cannot solve for the n^2 mixing coefficients: not enough information! ("More variables than equations")

Nongaussianity, with independence, gives more information
For independent variables we have, for any (nonlinear) functions h_1, h_2:
E{h_1(y_1) h_2(y_2)} - E{h_1(y_1)} E{h_2(y_2)} = 0.   (3)
For nongaussian variables, such nonlinear covariances give more information than the ordinary covariances alone.
This is not true for the multivariate gaussian distribution:
- the distribution is completely determined by the covariances,
- uncorrelated gaussian variables are independent,
- so the ICA model cannot be estimated for gaussian data.
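A small numerical demonstration of Eq. (3), assuming Laplacian sources and using the square as the nonlinearity (all choices illustrative): for independent nongaussian variables the nonlinear covariance is close to zero, while for merely uncorrelated linear mixtures of them it is clearly non-zero.

import numpy as np

rng = np.random.default_rng(0)
h1 = h2 = np.square   # nonlinear functions h_1, h_2 (here: the square)

def nonlinear_cov(y1, y2):
    # E{h_1(y_1) h_2(y_2)} - E{h_1(y_1)} E{h_2(y_2)}  (Eq. 3), estimated from samples.
    return np.mean(h1(y1) * h2(y2)) - np.mean(h1(y1)) * np.mean(h2(y2))

# Two independent nongaussian (Laplacian) sources.
s = rng.laplace(size=(2, 200000))

# An orthogonal rotation keeps the signals uncorrelated but makes them dependent.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
y = R @ s

print("independent sources  :", nonlinear_cov(s[0], s[1]))   # ~ 0
print("uncorrelated mixtures:", nonlinear_cov(y[0], y[1]))   # clearly non-zero
print("ordinary covariance  :", np.cov(y)[0, 1])             # ~ 0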

Whitening as preprocessing for ICA
Whitening is usually done before ICA. Whitening means decorrelation and standardization: E{xx^T} = I.
After whitening, A can be considered orthogonal:
E{xx^T} = I = A E{ss^T} A^T = A A^T.   (4)
Half of the parameters are thereby estimated! (and there are other technical benefits)
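A minimal whitening sketch (toy data as in the earlier snippets): decorrelate and standardize x via the eigendecomposition of its covariance, so that E{zz^T} = I.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ rng.laplace(size=(2, 10000))

# Center the data and compute its covariance.
xc = x - x.mean(axis=1, keepdims=True)
C = xc @ xc.T / xc.shape[1]

# Whitening matrix V = D^(-1/2) E^T from the eigendecomposition C = E D E^T.
d, E = np.linalg.eigh(C)
V = np.diag(d ** -0.5) @ E.T

# Whitened data: E{zz^T} = I, so the remaining mixing matrix is orthogonal.
z = V @ xc
print("covariance after whitening:\n", np.round(np.cov(z), 2))   # ~ identity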

Illustration
Two components with uniform distributions: original components, observed mixtures, PCA, ICA.
PCA does not find the original coordinates; ICA does!

Illustration of the problem with gaussian distributions
Original components, observed mixtures, PCA.
The distribution after PCA is the same as the distribution before mixing!
This is the factor rotation problem in classic factor analysis.

Basic intuitive principle of ICA estimation
Inspired by the Central Limit Theorem: an average of many independent random variables will have a distribution that is close(r) to gaussian; in the limit of an infinite number of random variables, the distribution tends to gaussian.
Consider a linear combination Σ_i w_i x_i = Σ_i q_i s_i.
Because of the theorem, Σ_i q_i s_i should be more gaussian than the individual s_i.
By maximizing the nongaussianity of Σ_i w_i x_i, we can find the s_i.
Also known as projection pursuit.
Cf. principal component analysis: maximize the variance of Σ_i w_i x_i.
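A quick numerical illustration of this principle (a sketch, using the kurtosis measure defined below in Eq. (5) as the nongaussianity measure; the sources and weights are illustrative): a mixture Σ_i q_i s_i of independent sparse sources is less nongaussian than the sources themselves.

import numpy as np

rng = np.random.default_rng(0)

def kurt(u):
    # kurt(u) = E{u^4} - 3 (E{u^2})^2, the measure defined below (Eq. 5).
    return np.mean(u ** 4) - 3 * np.mean(u ** 2) ** 2

# Two independent sparse (Laplacian) sources, normalized to unit variance.
s = rng.laplace(size=(2, 100000)) / np.sqrt(2)

# A generic linear combination sum_i q_i s_i with unit-variance weights.
q = np.array([0.8, 0.6])            # q1^2 + q2^2 = 1
mix = q @ s

print("kurtosis of s1 :", kurt(s[0]))     # ~ 3 for a Laplacian source
print("kurtosis of s2 :", kurt(s[1]))
print("kurtosis of mix:", kurt(mix))      # closer to 0, i.e. more gaussian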

Illustration of changes in nongaussianity
(Figure: histogram and scatterplot of the original uniform distributions; axis labels omitted.)

Sparsity is the dominant form of non-gaussianity
In natural signals, the fundamental form of non-gaussianity is sparsity.
Sparsity = the probability density has heavy tails and a peak at zero.
(Figure: a gaussian and a sparse signal with their probability densities.)
(Another form of non-gaussianity is skewness, i.e. asymmetry.)

Kurtosis as a nongaussianity measure
Problem: how to measure nongaussianity (sparsity)?
Definition: kurt(x) = E{x^4} - 3 (E{x^2})^2.   (5)
If the variance is constrained to unity, this is essentially the 4th moment.
Simple algebraic properties because it is a cumulant (for independent s_1, s_2):
kurt(s_1 + s_2) = kurt(s_1) + kurt(s_2)   (6)
kurt(α s_1) = α^4 kurt(s_1)   (7)
Zero for a gaussian RV, non-zero for most nongaussian RVs.
Positive vs. negative kurtosis have typical forms of pdf.
The variance must be constrained to measure non-gaussianity.
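A small check of Eq. (5) and of the sign behaviour (a sketch; the distributions are chosen for illustration): a Laplacian variable has positive kurtosis, a uniform one negative, and a gaussian one approximately zero.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def kurt(u):
    # kurt(u) = E{u^4} - 3 (E{u^2})^2  (Eq. 5).
    return np.mean(u ** 4) - 3 * np.mean(u ** 2) ** 2

gauss   = rng.normal(size=n)
laplace = rng.laplace(size=n) / np.sqrt(2)          # unit variance, sparse
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # unit variance, flat

print("gaussian :", kurt(gauss))     # ~ 0
print("laplacian:", kurt(laplace))   # ~ +3   (supergaussian)
print("uniform  :", kurt(uniform))   # ~ -1.2 (subgaussian)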

Illustration of positive and negative kurtosis
Left: Laplacian pdf, positive kurtosis ("supergaussian"). Right: uniform pdf, negative kurtosis ("subgaussian").

Why kurtosis is not optimal
Sensitive to outliers: consider a sample of 1000 values with unit variance, and one value equal to 10. The kurtosis is then at least 10^4/1000 - 3 = 7.
For supergaussian variables, the statistical performance is not optimal even without outliers.
Other measures of nongaussianity should be considered.

Differential entropy as a nongaussianity measure
Generalization of ordinary discrete Shannon entropy:
H(x) = E{-log p(x)}.   (8)
For fixed variance, it is maximized by the gaussian distribution.
Often normalized to give the negentropy
J(x) = H(x_gauss) - H(x).   (9)
Good statistical properties, but computationally difficult.

Approximation of negentropy
Approximations of negentropy (Hyvärinen, 1998):
J_G(x) = (E{G(x)} - E{G(x_gauss)})^2,   (10)
where G is a nonquadratic function.
Generalization of the (square of) kurtosis (which corresponds to G(x) = x^4).
A good compromise?
- statistical properties not bad (for a suitable choice of G)
- computationally simple
Further possibility: if we know the data is sparse, the negative of the L1 norm, -E{|x|}, may be enough.
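A sketch of the approximation in Eq. (10) with the common choice G(u) = log cosh(u) (the choice of G and the sample sizes are illustrative): J_G is close to zero for gaussian data and clearly positive for sparse data.

import numpy as np

rng = np.random.default_rng(0)

def negentropy_approx(x, G=lambda u: np.log(np.cosh(u)), n_gauss=1_000_000):
    # J_G(x) = (E{G(x)} - E{G(x_gauss)})^2  (Eq. 10), for standardized x.
    x = (x - x.mean()) / x.std()
    gauss = rng.normal(size=n_gauss)
    return (np.mean(G(x)) - np.mean(G(gauss))) ** 2

sparse = rng.laplace(size=100_000)
gaussian = rng.normal(size=100_000)

print("J_G(sparse)  :", negentropy_approx(sparse))    # clearly > 0
print("J_G(gaussian):", negentropy_approx(gaussian))  # ~ 0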

Basic ICA estimation procedure
1. Whiten the data to give z.
2. Set the iteration count i = 1.
3. Take a random vector w_i.
4. Maximize the nongaussianity of w_i^T z, under the constraints ||w_i||^2 = 1 and w_i^T w_j = 0 for j < i.
5. Increment i by 1 and go back to step 3.
Alternatively: maximize over all the w_i in parallel, keeping them orthogonal.
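A minimal sketch of this deflationary procedure on toy data, using a FastICA-style fixed-point update with the tanh nonlinearity for step 4 (one possible choice; the iteration count and data are illustrative).

import numpy as np

rng = np.random.default_rng(0)

# Toy data: x = A s with sparse sources, then whitened to z (step 1).
s = rng.laplace(size=(2, 20000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ s
xc = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(xc @ xc.T / xc.shape[1])
V = np.diag(d ** -0.5) @ E.T        # whitening matrix
z = V @ xc

# Deflationary estimation: one component at a time (steps 2-5).
n_comp = 2
W = np.zeros((n_comp, n_comp))
for i in range(n_comp):
    w = rng.normal(size=n_comp)                     # step 3: random initial vector
    w /= np.linalg.norm(w)
    for _ in range(100):                            # step 4: maximize nongaussianity
        # FastICA-style fixed-point update with g = tanh.
        wz = w @ z
        w = (z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz) ** 2).mean() * w
        w -= W[:i].T @ (W[:i] @ w)                  # keep w orthogonal to earlier w_j
        w /= np.linalg.norm(w)
    W[i] = w

# W @ V @ A should be close to a signed permutation: sources recovered
# up to sign and ordering.
print(np.round(W @ V @ A, 2))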

Development of ICA algorithms
Nongaussianity measure: the essential ingredient.
- Kurtosis: global consistency, but nonrobust.
- Differential entropy: statistically justified, but difficult to compute. Essentially the same as likelihood (Pham et al., 1992/97) or infomax (Bell and Sejnowski, 1995).
- Rough approximations of entropy: a compromise.
Optimization methods:
- Gradient methods (e.g. the natural gradient; Amari et al., 1996)
- Fast fixed-point algorithm, FastICA (Hyvärinen, 1999)

Conclusion: Theory of ICA
- ICA is a non-gaussian linear generative model (factor analysis).
- Basic principle: maximize the non-gaussianity of the components.
- (Really very different from PCA, which maximizes the variance of the components.)
- Sparsity is a form of non-gaussianity prevalent in natural signals.
- Measures of non-gaussianity are crucial: kurtosis vs. differential entropy.

Part II: Natural images and ICA
- Natural images have statistical regularities
- Statistical models show optimal processing
- The basic model is independent component analysis
- Components are not really independent: need and opportunity for better models
- Instead of nongaussianity we could use temporal correlations
- A unifying framework: bubbles

Linear statistical models of images
(Figure: an image patch expressed as s_1 · a_1 + s_2 · a_2 + ... + s_k · a_k, a weighted sum of basis images a_i.)
Each image (patch) is a linear sum of basis vectors (features).
What are the best basis vectors for natural images?

The visual cortex of the brain
(Figure: visual pathway from the retina via the LGN to V1; receptive field of a simple cell in V1.)

Sparse coding
Sparse coding means: for a random vector x, find a linear representation
x = As   (11)
so that the components s_i are as sparse (= supergaussian) as possible.
Important property: a given data point is represented using only a limited number of active (clearly non-zero) components s_i.
In contrast to PCA, the set of active components changes from one image patch to another.
Cf. the vocabulary of a language, which can describe many different things by combining a small number of active words.
Maximizes non-gaussianity, therefore like ICA!

ICA / sparse coding of natural images (Olshausen and Field, 1996; Bell and Sejnowski, 1997)
Features similar to wavelets, Gabor functions, and simple cells.

Dependence of independent components
Components estimated from natural images are not really independent.
Next, we model some of the dependencies: independent subspaces and topographic ICA.

Correlation of squares
What kind of dependence remains between the components?
Answer: the squares s_i^2 and s_j^2 are correlated inside a subspace (dependence through variances).
Similar to the models by Simoncelli et al. on wavelet coefficients, and Valpola et al. on variance sources.
(Figure: two signals that are uncorrelated but whose squares are correlated.)

Grouping components (Cardoso, 1998; Hyvärinen and Hoyer, 2000)
Assumption: the s_i can be divided into groups (subspaces), such that
- the s_i in the same group are dependent on each other,
- dependencies between different groups are not allowed.
We also need to specify the distributions inside the groups, achieving correlations of squares.
Invariant features are given by the norms of the projections onto the subspaces (spherically symmetric inside subspaces): inside each subspace, compute
Σ_{i=1}^k (w_i^T x)^2.   (12)

Computation of invariant features
(Diagram: the input image I is projected onto filters w_1, ..., w_8; the projections <w_i, I> are squared and summed within each subspace, Σ_i <w_i, I>^2.)
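A sketch of the computation in the diagram (the filter values and the grouping into subspaces of size 4 are illustrative): project the input onto the filters, square, and sum within each subspace to get one invariant feature per subspace, as in Eq. (12).

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical filters w_1..w_8 for a 64-pixel image patch,
# grouped into two subspaces of 4 filters each.
n_filters, patch_dim, subspace_size = 8, 64, 4
W = rng.normal(size=(n_filters, patch_dim))

def invariant_features(I, W, subspace_size):
    # <w_i, I> for every filter, then sum of squares within each subspace (Eq. 12).
    proj = W @ I
    return (proj ** 2).reshape(-1, subspace_size).sum(axis=1)

I = rng.normal(size=patch_dim)                   # a stand-in for an image patch
print(invariant_features(I, W, subspace_size))   # one energy per subspace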

Independent subspaces of natural images
Emergence of phase invariance, as in complex cells in V1.

Topographic ICA (Hyvärinen, Hoyer and Inki, 2001)
Components are arranged on a two-dimensional lattice.
Statistical dependency follows the topography: the squares s_i^2 are correlated for nearby components.
Each local region is like a subspace.

Topographic ICA on natural images
Topography similar to what is found in the cortex.

Temporally coherent components (Hurri and Hyvärinen, 2003)
In image sequences (video) we can look at temporal correlations, an alternative to nongaussianity.
Linear correlations give only Fourier-like receptive fields, so we proposed temporal correlations of squares.
Similar to source separation using nonstationary variance (Matsuoka et al., 1995).

Temporally coherent features on natural image sequences
Features similar to those obtained by ICA.

Bubbles: a unifying framework
Correlation of squares both over time and over components: spatiotemporal modulating variance variables.
A simple objective can be obtained where we pool squares over space and time:
Σ_{t=1}^T Σ_{j=1}^n G( Σ_{i=1}^n Σ_τ h(i,j,τ) (w_i^T x(t-τ))^2 ),   (13)
where h(i,j,τ) is a neighbourhood function and G is a nonlinear function.
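A sketch of the pooled objective in Eq. (13) for data given as columns of a matrix; the neighbourhood function h, the nonlinearity G, and all sizes here are illustrative placeholders, not the choices used in the lecture.

import numpy as np

rng = np.random.default_rng(0)

def bubbles_objective(X, W, h, G=np.sqrt):
    # Eq. (13): sum_t sum_j G( sum_i sum_tau h[i, j, tau] * (w_i^T x(t - tau))^2 )
    # X : (dim, T) data, W : (n, dim) filters, h : (n, n, n_lags) neighbourhood.
    n_lags = h.shape[2]
    S2 = (W @ X) ** 2                              # squared filter outputs, shape (n, T)
    total = 0.0
    for t in range(n_lags - 1, X.shape[1]):        # skip the first lags for simplicity
        # Columns of `lagged` correspond to tau = 0, 1, ..., n_lags-1.
        lagged = S2[:, t - n_lags + 1: t + 1][:, ::-1]
        # Pooled energy for each output unit j: sum over filters i and lags tau.
        pooled = np.einsum('ijt,it->j', h, lagged)
        total += G(pooled).sum()
    return total

# Toy sizes: 4 filters, 16-dimensional data, 200 time points, 3 time lags.
n, dim, T, n_lags = 4, 16, 200, 3
W = rng.normal(size=(n, dim))
X = rng.laplace(size=(dim, T))
h = np.ones((n, n, n_lags)) / (n * n_lags)         # uniform pooling as a placeholder
print(bubbles_objective(X, W, h))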

Illustration of four types of representation
(Figure: four panels plotting filter activations as position of filter vs. time, for sparse, sparse topographic, sparse temporally coherent, and bubble representations.)

Conclusion: Natural images and ICA
- ICA is a non-gaussian factor analysis.
- Basic principle: maximize the non-gaussianity of the components.
- Measures of non-gaussianity are crucial: kurtosis vs. differential entropy.
- ICA and related models show optimal features for natural images.
- ICA models basic linear features; independent subspaces and topographic ICA model basic dependencies or nonlinearities.
- Temporal coherence is an alternative approach, leading to bubbles.