CIFAR Lectures: Non-Gaussian statistics and natural images

Dept of Computer Science, University of Helsinki, Finland

Outline
- Part I: Theory of ICA
  - Definition and difference to PCA
  - Importance of non-gaussianity
- Part II: Natural images and ICA
  - Application of ICA and sparse coding on natural images
  - Extensions of ICA with dependent components
- Part III: Estimation of unnormalized models
  - Motivation by extensions of ICA
  - Score matching
  - Noise-contrastive estimation
- Part IV: Recent extensions of ICA and natural image statistics
  - A three-layer model, towards deep learning

Part I: Theory of ICA
- Definition of ICA as non-gaussian generative model
- Importance of non-gaussianity
- Fundamental difference to PCA
- Estimation by maximization of non-gaussianity
- Measures of non-gaussianity

Problem of blind source separation
- There are a number of source signals.
- Due to some external circumstances, only linear mixtures of the source signals are observed.
- Estimate (separate) the original signals!

A solution is possible
- PCA does not recover the original signals.
- Use information on statistical independence to recover them.

Independent Component Analysis (Hérault and Jutten, 1984-1991)
- Observed random variables $x_i$ are modelled as linear sums of hidden variables:
  $$x_i = \sum_{j=1}^{m} a_{ij} s_j, \quad i = 1, \dots, n \tag{1}$$
- Mathematical formulation of the blind source separation problem
- Not unlike factor analysis
- The matrix of the $a_{ij}$ is the parameter matrix, called the mixing matrix.
- The $s_j$ are hidden random variables called independent components, or source signals.
- Problem: estimate both $a_{ij}$ and $s_j$, observing only $x_i$.
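The generative model in equation (1) can be simulated in a few lines. The following is a minimal sketch, assuming two uniformly distributed sources and an arbitrary mixing matrix `A`; both choices are illustrative and not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Two independent, nongaussian sources with unit variance
# (uniform on [-sqrt(3), sqrt(3)]); an illustrative choice.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n_samples))

# Hypothetical mixing matrix a_ij; any invertible matrix would do.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])

# Observed mixtures: x_i = sum_j a_ij s_j, i.e. x = A s.
x = A @ s

# The sources are (nearly) uncorrelated, the mixtures are not.
corr_s = np.corrcoef(s)[0, 1]
corr_x = np.corrcoef(x)[0, 1]
print(f"corr(s1, s2) = {corr_s:.3f}, corr(x1, x2) = {corr_x:.3f}")
```

In the blind source separation setting only `x` would be available; `A` and `s` are what has to be estimated.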

When can the ICA model be estimated?
- Must assume:
  - The $s_i$ are mutually statistically independent
  - The $s_i$ are nongaussian (non-normal)
  - (Optional:) Number of independent components is equal to number of observed variables
- Then: mixing matrix and components can be identified (Comon, 1994)
- A very surprising result!

Reminder: Principal component analysis
- Basic idea: find directions $\sum_i w_i x_i$ of maximum variance
- We must constrain the norm of $w$: $\sum_i w_i^2 = 1$, otherwise the variance can be made arbitrarily large by letting the $w_i$ grow without bound.
- For more than one component, find the direction of maximum variance orthogonal to the components previously found.
- Classic factor analysis has essentially the same idea as PCA: explain maximal variance with a limited number of components
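As a concrete illustration of the PCA principle, the unit-norm direction of maximum variance is the leading eigenvector of the covariance matrix. The sketch below assumes synthetic two-dimensional data; the data-generating matrix is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data (columns are observations), centred.
x = np.array([[2.0, 0.0], [1.0, 0.5]]) @ rng.normal(size=(2, 5000))
x = x - x.mean(axis=1, keepdims=True)

# Eigenvectors of the covariance matrix; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(np.cov(x))
w = eigvecs[:, -1]                   # unit-norm direction of maximum variance
var_pc1 = np.var(w @ x, ddof=1)      # variance of sum_i w_i x_i

# No other unit-norm direction should give a larger variance.
candidates = rng.normal(size=(1000, 2))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
var_candidates = np.var(candidates @ x, axis=1, ddof=1)
print(var_pc1, var_candidates.max())
```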

Comparison of ICA, factor analysis and principal component analysis
- ICA is nongaussian FA with no noise or specific factors: there are so many components that all variance is explained by them.
- No factor rotation is left unknown, because of the identifiability result.
- In contrast to FA and PCA, the components really give the original source signals or underlying hidden variables.
- Catch: only works when components are nongaussian
  - Many psychological hidden variables (e.g. "intelligence") may be (practically) gaussian because they are sums of many independent variables (central limit theorem).
  - But signals measured by sensors are usually quite nongaussian.

Some examples of nongaussianity
[Figure: three example signals plotted over time (top row) and their histograms (bottom row)]

Why classic methods cannot find original components or sources
- In PCA and FA: find components $y_i$ which are uncorrelated,
  $$\mathrm{cov}(y_i, y_j) = E\{y_i y_j\} - E\{y_i\}E\{y_j\} = 0 \tag{2}$$
  and maximize explained variance (or variance of components)
- Such methods need only the covariances, $\mathrm{cov}(x_i, x_j)$
- However, there are many different component sets that are uncorrelated, because
  - The number of covariances is $\approx n^2/2$ due to symmetry
  - So, we cannot solve the $n^2$ factor loadings, not enough information! ("More variables than equations")
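The ambiguity can be checked numerically: once the data are whitened, every rotation of them is again uncorrelated with unit variance, so covariances alone cannot single out one rotation. This is a minimal sketch with an assumed mixing matrix and uniform sources.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 50_000))
x = np.array([[1.0, 0.5], [0.3, 1.0]]) @ s      # hypothetical mixing matrix
x = x - x.mean(axis=1, keepdims=True)

# PCA whitening: project on the eigenvectors and rescale to unit variance.
d, E = np.linalg.eigh(np.cov(x))
z = np.diag(d ** -0.5) @ E.T @ x

def rotation(theta):
    """2-D rotation matrix by angle theta (radians)."""
    c, s_ = np.cos(theta), np.sin(theta)
    return np.array([[c, -s_], [s_, c]])

# Every rotated version of z has the same (identity) covariance matrix.
for theta in (0.3, 1.0, 2.0):
    print(theta, np.round(np.cov(rotation(theta) @ z), 6).tolist())
```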

Nongaussianity, with independence, gives more information
- For independent variables we have, for any functions $h_1$, $h_2$ for which the expectations exist,
  $$E\{h_1(y_1) h_2(y_2)\} - E\{h_1(y_1)\}E\{h_2(y_2)\} = 0 \tag{3}$$
- For nongaussian variables, nonlinear covariances give more information than just covariances.
- This is not true for the multivariate gaussian distribution:
  - The distribution is completely determined by covariances
  - Uncorrelated gaussian variables are independent, and their (standardized) distribution is the same in all directions (see below)
  - The ICA model cannot be estimated for gaussian data.
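A small numerical check of the nonlinear covariance in equation (3), taking $h_1(y) = h_2(y) = y^2$ as an illustrative choice. The sketch compares independent uniform variables, a 45-degree rotation of them (still uncorrelated, no longer independent), and the same rotation applied to gaussian variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def nonlinear_cov(y1, y2, h1=np.square, h2=np.square):
    """Sample estimate of E{h1(y1) h2(y2)} - E{h1(y1)} E{h2(y2)}."""
    return np.mean(h1(y1) * h2(y2)) - np.mean(h1(y1)) * np.mean(h2(y2))

# 45-degree rotation: keeps unit-variance variables uncorrelated.
c = np.sqrt(0.5)
R = np.array([[c, -c], [c, c]])

s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))   # independent, nongaussian
g = rng.normal(size=(2, n))                             # independent, gaussian

nc_independent = nonlinear_cov(*s)          # close to 0
nc_rotated_uniform = nonlinear_cov(*(R @ s))  # clearly non-zero
nc_rotated_gauss = nonlinear_cov(*(R @ g))    # close to 0 again
print(nc_independent, nc_rotated_uniform, nc_rotated_gauss)
```

For the uniform case the rotated value can be worked out by hand as $(1.8 - 2 + 1.8)/4 - 1 = -0.6$, using $E\{s^4\} = 1.8$ for a unit-variance uniform variable.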

Illustration
- Two components with uniform distributions:
  [Figure: original components, observed mixtures, PCA, ICA]
- PCA does not find the original coordinates, ICA does!

Illustration of problem with gaussian distributions
[Figure: original components, observed mixtures, PCA]
- Distribution after PCA is the same as distribution before mixing!
- This is the factor rotation problem in classic FA.

Basic intuitive principle of ICA estimation
- Inspired by the Central Limit Theorem:
  - Average of many independent random variables will have a distribution that is close(r) to gaussian
  - In the limit of an infinite number of random variables, the distribution tends to gaussian
- Consider a linear combination $\sum_i w_i x_i = \sum_i q_i s_i$
- Because of the theorem, $\sum_i q_i s_i$ should be more gaussian than $s_i$.
- Maximizing the nongaussianity of $\sum_i w_i x_i$, we can find $s_i$.
- Also known as projection pursuit.
- Cf. principal component analysis: maximize variance of $\sum_i w_i x_i$.
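The principle can be illustrated by scanning unit-norm combinations $q_1 s_1 + q_2 s_2$ of two uniform sources and measuring how nongaussian each one is. The sketch below uses the absolute value of kurtosis as the nongaussianity measure, which is an assumption made here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 100_000))

def kurt(y):
    """Sample kurtosis E{y^4} - 3 (E{y^2})^2 of the centred variable."""
    y = y - y.mean()
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

# Unit-norm combinations q = (cos a, sin a), a = 0, 1, ..., 90 degrees.
angles = np.linspace(0.0, np.pi / 2, 91)
abs_kurt = np.array(
    [abs(kurt(np.cos(a) * s[0] + np.sin(a) * s[1])) for a in angles]
)

# Nongaussianity is largest when the combination equals one source (0 or 90
# degrees) and smallest for the even mixture (about 45 degrees).
print("most gaussian at", np.degrees(angles[np.argmin(abs_kurt)]), "degrees")
print("|kurt| at 0, 45, 90 degrees:", abs_kurt[0], abs_kurt[45], abs_kurt[90])
```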

Illustration of changes in nongaussianity
[Figure: histogram and scatterplot, mixtures given by PCA]
[Figure: histogram and scatterplot, original uniform distributions]

Sparsity is the dominant form of non-gaussianity
- In natural signals, the fundamental non-gaussianity is sparsity
- Sparsity = probability density has heavy tails and a peak at zero:
  [Figure: comparison of a gaussian and a sparse distribution]
- (Another form of non-gaussianity is skewness or asymmetry)

Kurtosis as nongaussianity measure
- Problem: how to measure nongaussianity (sparsity)?
- Definition (for zero-mean $x$):
  $$\mathrm{kurt}(x) = E\{x^4\} - 3\,(E\{x^2\})^2 \tag{4}$$
- If variance is constrained to unity, this is essentially the 4th moment.
- Simple algebraic properties because it is a cumulant (for independent $s_1$, $s_2$):
  $$\mathrm{kurt}(s_1 + s_2) = \mathrm{kurt}(s_1) + \mathrm{kurt}(s_2) \tag{5}$$
  $$\mathrm{kurt}(\alpha s_1) = \alpha^4\, \mathrm{kurt}(s_1) \tag{6}$$
- Zero for a gaussian RV, non-zero for most nongaussian RVs.
- Positive vs. negative kurtosis have typical forms of pdf.
- Variance must be constrained to measure non-gaussianity
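The definition and the two algebraic properties of kurtosis can be checked on samples. This is a minimal sketch; the Laplacian, uniform and gaussian test variables and the value of `alpha` are illustrative choices.

```python
import numpy as np

def kurt(x):
    """Sample version of kurt(x) = E{x^4} - 3 (E{x^2})^2 after removing the mean."""
    x = x - np.mean(x)
    return np.mean(x**4) - 3.0 * np.mean(x**2) ** 2

rng = np.random.default_rng(0)
n = 1_000_000
s1 = rng.laplace(scale=1 / np.sqrt(2), size=n)      # unit variance, supergaussian
s2 = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)   # unit variance, subgaussian
g = rng.normal(size=n)                              # gaussian
alpha = 2.0

print("kurt(laplace) :", kurt(s1))   # theoretical value 3
print("kurt(uniform) :", kurt(s2))   # theoretical value -1.2
print("kurt(gauss)   :", kurt(g))    # theoretical value 0
print("additivity    :", kurt(s1 + s2), "vs", kurt(s1) + kurt(s2))
print("scaling       :", kurt(alpha * s1), "vs", alpha**4 * kurt(s1))
```

Additivity holds only up to sampling error because the two samples are independent only in the population sense; the scaling property holds exactly.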

Illustration of positive and negative kurtosis
[Figure] Left: Laplacian pdf, positive kurtosis ("supergaussian"). Right: uniform pdf, negative kurtosis ("subgaussian").

Why kurtosis is not optimal
- Sensitive to outliers: consider a sample of 1000 values with unit variance, and one value equal to 10. Kurtosis equals at least $10^4/1000 - 3 = 7$.
- For supergaussian variables, statistical performance is not optimal even without outliers.
- Other measures of nongaussianity should be considered.
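The outlier sensitivity is easy to reproduce. The sketch below draws 999 gaussian values, appends a single value of 10, and compares the kurtosis before and after. Unlike the idealized calculation above, the sample is rescaled to unit variance after the outlier is added, so the numbers differ slightly from 7.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_kurt(x):
    """Kurtosis of the sample after scaling it to zero mean and unit variance."""
    x = (x - x.mean()) / x.std()
    return np.mean(x**4) - 3.0

clean = rng.normal(size=999)            # gaussian sample, kurtosis near 0
contaminated = np.append(clean, 10.0)   # the same sample plus one outlier

kurt_clean = normalized_kurt(clean)
kurt_contaminated = normalized_kurt(contaminated)
print(f"without outlier: {kurt_clean:.2f}, with outlier: {kurt_contaminated:.2f}")
```

A single observation out of 1000 is enough to make gaussian data look strongly supergaussian by this measure.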

Differential entropy as nongaussianity measure
- Generalization of ordinary discrete Shannon entropy:
  $$H(x) = E\{-\log p(x)\} \tag{7}$$
- For fixed variance, maximized by the gaussian distribution.
- Often normalized to give negentropy
  $$J(x) = H(x_{\mathrm{gauss}}) - H(x) \tag{8}$$
- Good statistical properties, but computationally difficult.

Approximation of negentropy
- Approximations of negentropy (Hyvärinen, 1998):
  $$J_G(x) = \left(E\{G(x)\} - E\{G(x_{\mathrm{gauss}})\}\right)^2 \tag{9}$$
  where $G$ is a nonquadratic function.
- Generalization of (square of) kurtosis (which is $G(x) = x^4$).
- A good compromise?
  - statistical properties not bad (for suitable choice of $G$)
  - computationally simple
- Further possibility: skewness (for nonsymmetric ICs)
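Equation (9) can be turned into a sample estimate directly. The sketch below assumes $G(u) = \log\cosh u$, one common nonquadratic choice, and estimates $E\{G(x_{\mathrm{gauss}})\}$ from a large standard gaussian reference sample. The function name and its parameters are hypothetical.

```python
import numpy as np

def negentropy_approx(x, G=lambda u: np.log(np.cosh(u)), n_ref=1_000_000, seed=0):
    """Sample version of J_G(x) = (E{G(x)} - E{G(x_gauss)})^2 for standardized x."""
    x = (x - x.mean()) / x.std()                          # zero mean, unit variance
    ref = np.random.default_rng(seed).normal(size=n_ref)  # stands in for x_gauss
    return (np.mean(G(x)) - np.mean(G(ref))) ** 2

rng = np.random.default_rng(1)
n = 200_000
J_gauss = negentropy_approx(rng.normal(size=n))
J_laplace = negentropy_approx(rng.laplace(size=n))
print(f"J_G(gaussian) = {J_gauss:.2e}, J_G(laplacian) = {J_laplace:.2e}")
```

The gaussian sample gives a value close to zero, while the sparse Laplacian sample gives a clearly larger one.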

Basic ICA estimation procedure
1. Whiten the data to give $z$.
2. Set iteration count $i = 1$.
3. Take a random vector $w_i$.
4. Maximize nongaussianity of $w_i^T z$, under constraints $\|w_i\|^2 = 1$ and $w_i^T w_j = 0$, $j < i$.
5. Increment iteration count by 1, go back to 3.
Alternatively: maximize all the $w_i$ in parallel, keeping them orthogonal.
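The procedure above can be sketched as follows. The slide does not fix the nongaussianity measure or the optimizer, so this sketch assumes kurtosis as the measure and the kurtosis-based fixed-point update $w \leftarrow E\{z (w^T z)^3\} - 3w$ known from FastICA. The orthogonality constraint is enforced by Gram-Schmidt projection, and the function names are hypothetical.

```python
import numpy as np

def whiten(x):
    """Step 1: centre the data and transform it to identity covariance."""
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    return E @ np.diag(d ** -0.5) @ E.T @ x

def ica_deflation(z, n_iter=200, tol=1e-10, seed=0):
    """Steps 2-5: estimate the rows w_i of W one by one from whitened data z."""
    rng = np.random.default_rng(seed)
    n = z.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        w = rng.normal(size=n)                 # step 3: random start
        w /= np.linalg.norm(w)
        for _ in range(n_iter):                # step 4: maximize |kurtosis|
            y = w @ z
            w_new = (z * y**3).mean(axis=1) - 3.0 * w
            w_new -= W[:i].T @ (W[:i] @ w_new)  # keep w_i orthogonal to w_j, j < i
            w_new /= np.linalg.norm(w_new)      # keep unit norm
            converged = abs(abs(w_new @ w) - 1.0) < tol  # sign flips are harmless
            w = w_new
            if converged:
                break
        W[i] = w
    return W

# Usage on synthetic mixtures of two uniform sources (illustrative values).
rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20_000))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
z = whiten(A @ s)
W = ica_deflation(z)
y = W @ z

# Each estimated component should match one source up to sign and order.
C = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
print(np.round(C, 3))
```

The sign and order of the recovered components are arbitrary, which is why the check uses absolute correlations.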

Development of ICA algorithms
- Nongaussianity measure: essential ingredient
  - Kurtosis: global consistency, but nonrobust.
  - Differential entropy: statistically justified, but difficult to compute. Essentially same as likelihood (Pham et al, 1992/97) or infomax (Bell and Sejnowski, 1995)
  - Rough approximations of entropy: compromise
- Optimization methods
  - Gradient methods (e.g. natural gradient; Amari et al, 1996)
  - Fast fixed-point algorithm, FastICA (Hyvärinen, 1999)

Conclusion: Theory of ICA
- ICA is a non-gaussian factor analysis
- Basic principle: maximize non-gaussianity of components (really very different from PCA: maximize variance of components)
- Sparsity is a form of non-gaussianity prevalent in natural signals
- Measures of non-gaussianity crucial: kurtosis vs. differential entropy

Part II: Natural images and ICA
- Natural images have statistical regularities
- Statistical models show optimal processing
- Basic model is independent component analysis
- Components are not really independent
- Need and opportunit
- Instead of nongaussianity we could use temporal correlations
- A unifying framework: bubbles

Linear statistical models of images
[image patch] $= s_1 \cdot$ [basis image 1] $+\ s_2 \cdot$ [basis image 2] $+ \dots +\ s_k \cdot$ [basis image k]
- Each image (patch) is a linear sum of basis vectors (features)
- What are the best basis vectors for natural images?

The visual cortex of the brain
[Figure: visual pathway from the retina via the LGN to V1]
Receptive field of a simple cell in V1: [Figure]

Sparse coding
- Sparse coding means: for a random vector $x$, find a linear representation
  $$x = As \tag{10}$$
  so that the components $s_i$ are as sparse (= supergaussian) as possible.
- Important property: a given data point is represented using only a limited number of "active" (clearly non-zero) components $s_i$.
- In contrast to PCA, the active components change from one image patch to another.
- Cf. the vocabulary of a language, which can describe many different things by combining a small number of active words.
- Maximizes non-gaussianity, therefore like ICA!

ICA / sparse coding of natural images (Olshausen and Field, 1996; Bell and Sejnowski, 1997)
Features similar to wavelets, Gabor functions, simple cells.

Dependence of independent components
- Components estimated from natural images are not really independent
- Next, we model some of the dependencies
- Independent subspaces + Topographic ICA

Correlation of squares
- What kind of dependence remains between the components?
- Answer: squares $s_i^2$ and $s_j^2$ are correlated inside a subspace
- Dependence through variances
- Similar to the models by Simoncelli et al on wavelet coefficients; Valpola et al on variance sources
[Figure: two signals that are uncorrelated but whose squares are correlated.]
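Signals of this kind are simple to construct: multiply two independent gaussian signals by a shared, random variance variable. This is a minimal sketch; the distribution chosen for the common variance variable `sigma` is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Hypothetical common variance variable shared by both components.
sigma = np.abs(rng.normal(size=n))

# Each component is an independent gaussian signal scaled by the shared sigma.
s1 = sigma * rng.normal(size=n)
s2 = sigma * rng.normal(size=n)

corr = np.corrcoef(s1, s2)[0, 1]            # linear correlation: close to 0
corr_sq = np.corrcoef(s1**2, s2**2)[0, 1]   # correlation of squares: clearly positive
print(f"corr(s1, s2) = {corr:.3f}, corr(s1^2, s2^2) = {corr_sq:.3f}")
```

When `sigma` is large both components tend to be large in magnitude at the same time, although their signs remain unrelated.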

Grouping components (Cardoso, 1998; Hyvärinen and Hoyer, 2000)
- Assumption: the $s_i$ can be divided into groups (subspaces), such that
  - the $s_i$ in the same group are dependent on each other
  - dependencies between different groups are not allowed
- We also need to specify the distributions inside the groups: spherically symmetric inside subspaces
- Invariant features given by norms of projections on the subspaces

Independent subspaces of natural images
Emergence of phase-invariance, as in complex cells in V1.
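Phase invariance of a subspace norm can be illustrated with a toy one-dimensional example. The sketch below assumes a single two-dimensional subspace spanned by a cosine and a sine filter of the same frequency (a quadrature pair), which stands in for the features learned from natural images. The function name is hypothetical.

```python
import numpy as np

def subspace_features(W, x, groups):
    """Norm of the projection of x onto each subspace: sqrt(sum_{i in group} (w_i^T x)^2)."""
    s = W @ x
    return np.array([np.sqrt(np.sum(s[g] ** 2)) for g in groups])

t = np.arange(64)
freq = 2 * np.pi * 4 / 64                   # exactly 4 cycles over the window

# Toy filters: a quadrature pair forming one 2-D subspace.
W = np.vstack([np.cos(freq * t), np.sin(freq * t)])
groups = [[0, 1]]

# Present the same grating at different phases.
phases = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)
linear = np.array([W @ np.cos(freq * t + p) for p in phases])   # shape (8, 2)
energy = np.array(
    [subspace_features(W, np.cos(freq * t + p), groups)[0] for p in phases]
)

print("linear outputs of filter 0:", np.round(linear[:, 0], 2))  # vary with phase
print("subspace norm             :", np.round(energy, 2))        # constant
```

The individual linear outputs change with the phase of the input, while the norm over the subspace stays constant.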

Topographic ICA (Hyvärinen, Hoyer and Inki, 2001)
- Components are arranged on a two-dimensional lattice
- Statistical dependency follows topography: the squares $s_i^2$ are correlated for near-by components
- Each local region is like a subspace

Topographic ICA on natural images
Topography similar to what is found in the cortex.

Temporally coherent components (Hurri and Hyvärinen, 2003)
- In image sequences (video) we can look at the temporal correlations
- An alternative to nongaussianity
- Linear correlations give only Fourier-like receptive fields
- We proposed temporal correlations of squares
- Similar to source separation using nonstationary variance (Matsuoka et al, 1995)

Temporally coherent features on natural image sequences
Features similar to those obtained by ICA

Bubbles: a unifying framework
- Correlation of squares both over time and over components
- Spatiotemporal modulating variance variables
- A simple approximation of the likelihood can be obtained, like in topographic ICA:
  $$\sum_{t=1}^{T} \sum_{j=1}^{n} G\left(\sum_{i=1}^{n} \sum_{\tau} h(i,j,\tau)\,\bigl(w_i^T x(t-\tau)\bigr)^2\right) + T \log|\det W| \tag{11}$$
  where $h(i,j,\tau)$ is a neighbourhood function, and $G$ a nonlinear function.
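A sketch of how such an approximate log-likelihood could be evaluated, assuming the objective has the form $\sum_t \sum_j G\bigl(\sum_i \sum_\tau h(i,j,\tau)(w_i^T x(t-\tau))^2\bigr) + T \log|\det W|$. The function name, the array layout of `h`, the handling of the first time points and the choice $G(u) = -\sqrt{u + \epsilon}$ are all assumptions made for illustration.

```python
import numpy as np

def bubbles_objective(W, X, h, G):
    """Approximate log-likelihood with spatiotemporal pooling of squared outputs.

    W: (n, n) filter matrix with rows w_i; X: (n, T) data, columns x(t);
    h: (n, n, L) neighbourhood weights h[i, j, tau] for lags tau = 0..L-1.
    """
    T = X.shape[1]
    L = h.shape[2]
    S2 = (W @ X) ** 2                        # squared outputs (w_i^T x(t))^2
    total = 0.0
    # Assumption: only time points with a full lag window are summed.
    for t in range(L - 1, T):
        window = S2[:, t - np.arange(L)]     # column tau holds time t - tau
        pooled = np.einsum("ijk,ik->j", h, window)
        total += G(pooled).sum()
    n_terms = T - L + 1
    return total + n_terms * np.log(abs(np.linalg.det(W)))

# Usage with toy values: 3 components, 2 lags, uniform neighbourhood weights.
rng = np.random.default_rng(0)
n, T, L = 3, 100, 2
W = rng.normal(size=(n, n))
X = rng.normal(size=(n, T))
h = np.full((n, n, L), 1.0 / (n * L))
G = lambda u: -np.sqrt(u + 1e-6)             # assumed sparsity-inducing nonlinearity
value = bubbles_objective(W, X, h, G)
print(value)
```

With `h[i, j, 0]` equal to the identity in `(i, j)` and a single lag, the pooling disappears and the expression reduces to an ordinary ICA objective of the form $\sum_t \sum_j G((w_j^T x(t))^2) + T \log|\det W|$.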

Illustration of four types of representation
[Figure: four panels titled "sparse", "sparse topographic", "sparse temporally coherent" and "bubbles"; horizontal axis: time, vertical axis: position of filter]

Conclusion: Natural images and ICA
- ICA is a non-gaussian factor analysis
- Basic principle: maximize non-gaussianity of components
- Measures of non-gaussianity crucial: kurtosis vs. differential entropy
- ICA and related models show optimal features for natural images.
- ICA models basic linear features.
- Independent subspaces and topographic ICA model basic dependencies or nonlinearities.
- Temporal coherence is an alternative approach, leading to bubbles.