From independent component analysis to score matching
Aapo Hyvärinen
Dept of Computer Science & HIIT, Dept of Mathematics and Statistics
University of Helsinki, Finland
Abstract
- First, a short introduction to independent component analysis and non-gaussian Bayesian networks
- Main topic: estimation of non-normalized models
  - Problem: the parameterized density does not integrate to unity; the partition function (normalization constant) is difficult to compute
  - Solution: fit the gradient of the log-density ψ with respect to the data variable
    - Minimize the squared distance between the ψ of the data and the ψ of the model
    - ψ does not depend on the normalization constant
    - Using partial integration, the distance can be computed by a simple formula
    - The estimator is optimal for reducing gaussian (infinitesimal) noise
Problem of blind source separation
- There are a number of source signals
- Due to some external circumstances, only linear mixtures of the source signals are observed
- Estimate (separate) the original signals!
Principal component analysis does not recover the original signals
A solution is possible: use information on statistical independence to recover them
Independent Component Analysis (Hérault and Jutten, 1984-1991)
The observed random vector x is modelled by a linear latent variable model
  x_i = ∑_{j=1}^m a_ij s_j,  i = 1, ..., n   (1)
or in matrix form:
  x = As   (2)
where
- the mixing matrix A is constant (a parameter matrix)
- the s_i are latent random variables called the independent components
Estimate both A and s, observing only x.
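As an illustrative sketch (not part of the original slides), the following Python snippet generates data from the ICA model x = As with a square mixing matrix and nongaussian (Laplacian) components; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 10000                     # number of components and number of observations

# Nongaussian independent components (Laplacian), one row per component
S = rng.laplace(size=(n, T))

# Arbitrary square mixing matrix A (a constant parameter matrix)
A = rng.normal(size=(n, n))

# Observed mixtures: x = A s for every sample
X = A @ S

# Only X is observed; the goal of ICA is to recover A and S from X alone
print(X.shape)   # (3, 10000)
```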
Basic assumptions of the ICA model
Must assume:
- the s_i are mutually independent
- the s_i are nongaussian
For simplicity: the matrix A is square.
Then the mixing matrix and the components can be identified (Comon, 1994), a very surprising result!
The s_i are defined only up to a multiplicative constant, and they are not ordered.
ICA and decorrelation
First approach: decorrelate the variables, i.e. find W so that y = Wx has uncorrelated components:
  E{y_i y_j} − E{y_i} E{y_j} = 0   (3)
But decorrelation (e.g. PCA) uses only the correlation matrix: n²/2 equations, while the model has n² parameters. Not enough information!
Fortunately, for independent variables we have something stronger:
  E{h_1(y_1) h_2(y_2)} − E{h_1(y_1)} E{h_2(y_2)} = 0.   (4)
Gaussian data is determined by correlations alone, so the model cannot be estimated for gaussian data.
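A small numerical check of condition (4), not from the slides: assuming h_1 = h_2 = square, truly independent variables satisfy (4), while an uncorrelated-but-dependent pair (a 45-degree rotation of the same variables) violates it even though ordinary correlation (3) is zero.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500_000
h = np.square                      # any nonlinear h may be used in condition (4)

def nl_cov(a, b):
    """E{h(a)h(b)} - E{h(a)}E{h(b)}, the left-hand side of (4)."""
    return np.mean(h(a) * h(b)) - np.mean(h(a)) * np.mean(h(b))

# Truly independent nongaussian variables: (4) holds, value is close to 0
s1, s2 = rng.laplace(size=(2, T))
print(nl_cov(s1, s2))

# Uncorrelated but *dependent* variables: a 45-degree rotation of (s1, s2)
y1, y2 = (s1 + s2) / np.sqrt(2), (s1 - s2) / np.sqrt(2)
print(np.mean(y1 * y2))            # ~0: decorrelated, condition (3) satisfied
print(nl_cov(y1, y2))              # clearly nonzero: not independent
```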
Basic intuitive principle of ICA estimation
- (A very sloppy version of) the Central Limit Theorem: q_i s_i + q_j s_j is more gaussian than s_i alone
- Consider a linear combination w^T x = q^T s
- By maximizing the nongaussianity of q^T s, we can find the s_i
- Also known as projection pursuit
- Cf. principal component analysis: maximize the variance of w^T x
- A number of algorithms are available, e.g. FastICA (Hyvärinen, 1999)
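A minimal one-unit FastICA sketch, assuming whitened data, synthetic Laplacian sources and the tanh nonlinearity; it illustrates the fixed-point principle of maximizing nongaussianity and is not the exact algorithm from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 3, 20000
S = rng.laplace(size=(n, T))
X = rng.normal(size=(n, n)) @ S

# Whiten the mixtures (zero mean, identity covariance)
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# One-unit FastICA fixed-point iteration with g = tanh
w = rng.normal(size=n)
w /= np.linalg.norm(w)
for _ in range(200):
    y = w @ Z
    w_new = (Z * np.tanh(y)).mean(axis=1) - (1 - np.tanh(y) ** 2).mean() * w
    w_new /= np.linalg.norm(w_new)
    converged = abs(abs(w_new @ w) - 1) < 1e-8   # w defined only up to sign
    w = w_new
    if converged:
        break

# w^T Z should be strongly correlated with exactly one of the original components
print(np.round(np.corrcoef(w @ Z, S)[0, 1:], 2))
```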
Linear Non-Gaussian Acyclic Model (LiNGAM; Shimizu et al., 2006)
Instead of components, we can estimate a network: x = Bx + e
[Figure: an example estimated network over x1, ..., x7 with edge weights]
- Estimation is possible based on ICA
- Assume the e_i are independent and nongaussian
- We can rearrange to obtain ICA (almost): x = Bx + e, i.e. (I − B)x = e
- So ICA can be used to obtain I − B (almost)
- Problem: ICA does not determine the order of the components
- Solution: acyclicity defines the order uniquely
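A toy sketch of the LiNGAM setting, under the assumption of a known strictly lower-triangular (acyclic) B and Laplacian disturbances; it only illustrates the identity (I − B)x = e, not the full LiNGAM estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5000

# Acyclic network: strictly lower-triangular B in some causal order
B = np.array([[ 0.0, 0.0, 0.0],
              [ 0.8, 0.0, 0.0],
              [-0.3, 0.5, 0.0]])

# Independent nongaussian disturbances e_i
E = rng.laplace(size=(3, T))

# Generate data from x = Bx + e, i.e. x = (I - B)^{-1} e
X = np.linalg.solve(np.eye(3) - B, E)

# Sanity check: (I - B) x recovers the disturbances exactly
print(np.allclose((np.eye(3) - B) @ X, E))

# Running ICA on X would return rows proportional to the rows of (I - B),
# in an unknown order; acyclicity is what pins down a unique ordering.
```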
Generalization of ICA to many components
- In basic ICA, the number of components equals the dimension of the data
- We could consider many more projections and maximize their nongaussianity:
    ∑_{k=1}^m G(w_k^T x)   (5)
  for some function G measuring nongaussianity
- To estimate the w_k, we interpret this as a log-density
- However, it should be normalized to unit integral:
    log p(x) = ∑_{k=1}^m G(w_k^T x) − log ∫ exp( ∑_{k=1}^m G(w_k^T ξ) ) dξ   (6)
- We find a very difficult integral! This leads to the main topic.
Main talk topic: score matching
Abstract:
- How to estimate models which cannot be integrated analytically
- Maximum likelihood estimation is computationally difficult: one must compute the integral
- We propose a computationally efficient method which avoids integration
- It can be shown to be statistically consistent, and optimal according to a Bayesian denoising objective
General problem: estimation of non-normalized models
- We want to estimate a parametric model of a multivariate random vector x ∈ R^n
- The density function is known only up to a multiplicative constant:
    p(x;θ) = (1/Z(θ)) q(x;θ),   Z(θ) = ∫_{ξ ∈ R^n} q(ξ;θ) dξ
- The functional form of q is known (it can be easily computed)
- Z cannot be computed in reasonable computing time
Previous solutions
- Monte Carlo methods
  - Consistent estimators (convergence to the true parameter values as the sample size grows)
  - Computation is very slow
- Various approximations, e.g. variational methods
  - Computation is often fast
  - Consistency not known
- Pseudo-likelihood and contrastive divergence
  - Presumably consistent
  - Computations are slow with continuous-valued variables: they need a 1-D integration at every step, or sophisticated MCMC methods
Definition of the score function (in this talk)
Define the model score function R^n → R^n as
  ψ(ξ;θ) = ( ∂ log p(ξ;θ)/∂ξ_1, ..., ∂ log p(ξ;θ)/∂ξ_n )^T = ∇_ξ log p(ξ;θ)
Similarly, define the data score function as
  ψ_x(ξ) = ∇_ξ log p_x(ξ)
where the observed data is assumed to follow p_x(.).
In conventional terminology: the Fisher score with respect to a hypothetical location parameter, p(x − θ), evaluated at θ = 0.
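A small illustration (not from the slides) of why the score function is convenient for non-normalized models: for a hypothetical non-normalized Gaussian q(x; Λ) = exp(−x^T Λ x / 2), the model score is −Λx, and the normalization constant never enters because its gradient with respect to x is zero.

```python
import numpy as np

# Non-normalized model: q(x; Lam) = exp(-0.5 * x^T Lam x); log Z(Lam) is constant in x
def logq(x, Lam):
    return -0.5 * x @ Lam @ x

def model_score(x, Lam):
    """psi(x; Lam) = grad_x log q(x; Lam) = grad_x log p(x; Lam), no Z needed."""
    return -Lam @ x

Lam = np.array([[2.0, 0.5],
                [0.5, 1.0]])
x = np.array([0.3, -1.2])

# Finite-difference check of the score against the non-normalized log-density
eps = 1e-6
num = np.array([(logq(x + eps * np.eye(2)[i], Lam) - logq(x - eps * np.eye(2)[i], Lam)) / (2 * eps)
                for i in range(2)])
print(np.allclose(num, model_score(x, Lam)))   # True
```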
Score matching: definition of the objective function
Estimate by minimizing a distance between the model score function ψ(.;θ) and the score function of the observed data ψ_x(.):
  J(θ) = (1/2) ∫_{ξ ∈ R^n} p_x(ξ) ||ψ(ξ;θ) − ψ_x(ξ)||² dξ   (7)
  θ̂ = argmin_θ J(θ)
- This gives a consistent estimator almost by construction
- It does not depend on the normalization constant Z(θ), because
    ψ(ξ;θ) = ∇_ξ log q(ξ;θ) − ∇_ξ log Z(θ) = ∇_ξ log q(ξ;θ) + 0   (8)
- No need to compute the normalization constant Z; the non-normalized pdf q is enough
- Computation of J is quite simple due to the theorem below
A computational trick
In the objective function we have the score function of the data distribution ψ_x(.). How can we compute it? In fact, there is no need to compute it, because of the following.

Theorem 1. Assume some regularity conditions, and smooth densities. Then the score matching objective function J can be expressed as
  J(θ) = ∫_{ξ ∈ R^n} p_x(ξ) ∑_{i=1}^n [ ∂_i ψ_i(ξ;θ) + (1/2) ψ_i(ξ;θ)² ] dξ + const.   (9)
where the constant does not depend on θ, and
  ψ_i(ξ;θ) = ∂ log q(ξ;θ)/∂ξ_i,   ∂_i ψ_i(ξ;θ) = ∂² log q(ξ;θ)/∂ξ_i²
Simple explanation of the trick
Consider the objective function J(θ):
  J(θ) = (1/2) ∫ p_x(ξ) ||ψ_x(ξ)||² dξ + (1/2) ∫ p_x(ξ) ||ψ(ξ;θ)||² dξ − ∫ p_x(ξ) ψ_x(ξ)^T ψ(ξ;θ) dξ
The first term does not depend on θ. The second term is easy to compute. The trick is to use partial integration on the third term. In one dimension:
  ∫ p_x(x) (log p_x)'(x) ψ(x;θ) dx = ∫ p_x(x) (p_x'(x)/p_x(x)) ψ(x;θ) dx
    = ∫ p_x'(x) ψ(x;θ) dx = 0 − ∫ p_x(x) ψ'(x;θ) dx
This is why the score function of the data distribution p_x(x) disappears!
Final method of score matching
Replace the integration over the sample density p_x(.) by a sample average. Given T observations x(1), ..., x(T), minimize
  J(θ) = (1/T) ∑_{t=1}^T ∑_{i=1}^n [ ∂_i ψ_i(x(t);θ) + (1/2) ψ_i(x(t);θ)² ]   (10)
where ψ_i is a partial derivative of the non-normalized model log-density log q, and ∂_i ψ_i a second partial derivative.
- Only needs evaluation of some derivatives of the non-normalized (log-)density q, which are simple to compute (by assumption)
- Thus: a new, computationally simple and statistically consistent method
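A minimal numerical sketch of objective (10), assuming a toy non-normalized model q(x;θ) = exp(−θx²/2) in one dimension, so ψ(x;θ) = −θx and ∂ψ/∂x = −θ; minimizing J recovers the precision of the data without ever touching Z(θ).

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(scale=2.0, size=10000)       # data with true precision 1/4

def J(theta, x):
    """Sample objective (10) for the non-normalized model q(x) = exp(-theta * x**2 / 2)."""
    psi = -theta * x                         # psi(x; theta) = d/dx log q
    dpsi = -theta * np.ones_like(x)          # second derivative of log q
    return np.mean(dpsi + 0.5 * psi ** 2)

# Minimize over a grid (for this model the minimizer is 1 / mean(x^2) in closed form)
grid = np.linspace(0.01, 1.0, 1000)
theta_hat = grid[np.argmin([J(th, x) for th in grid])]
print(theta_hat, 1 / np.mean(x ** 2))        # both close to the true precision 0.25
```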
Interesting result: closed-form solution in the exponential family
Assume the pdf can be expressed in the form
  log p(ξ;θ) = ∑_{k=1}^m θ_k F_k(ξ) − log Z(θ)   (11)
Define matrices of partial derivatives:
  K_ki(ξ) = ∂F_k(ξ)/∂ξ_i,   H_ki(ξ) = ∂²F_k(ξ)/∂ξ_i²   (12)
Then the score matching estimator is given by
  θ̂ = −[ Ê{K(x)K(x)^T} ]^{-1} ( ∑_i Ê{h_i(x)} )   (13)
where Ê denotes the sample average, and the vector h_i is the i-th column of the matrix H.
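A sketch of the closed-form estimator (13), under the assumption of a one-dimensional Gaussian written in exponential-family form with F_1(x) = x and F_2(x) = −x²/2 (so θ_2 is the precision and θ_1 = precision × mean); names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=1.5, scale=2.0, size=50000)   # true mean 1.5, true precision 0.25
T = x.size

# Sufficient statistics F_1(x) = x, F_2(x) = -x^2/2; here n = 1, so i takes one value.
# Rows of K hold dF_k/dx per sample, rows of H hold d^2F_k/dx^2 per sample.
K = np.stack([np.ones(T), -x])
H = np.stack([np.zeros(T), -np.ones(T)])

# Equation (13): theta_hat = -[E{K K^T}]^{-1} (sum_i E{h_i})
EKK = (K @ K.T) / T                # sample average of K(x) K(x)^T
Eh = H.mean(axis=1)                # sample average of the (single) column of H
theta_hat = -np.linalg.solve(EKK, Eh)

print(theta_hat)                   # approximately [mean/var, 1/var] = [0.375, 0.25]
```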
Extensions of score matching
- Can be extended to non-negative data
  - Basic score matching cannot be used directly because the density is typically not smooth over R^n
- Can be extended to binary variables
  - However, the utility is questionable because pseudolikelihood is computationally efficient in that case
- Can be shown to be equivalent to a special case of contrastive divergence (equal in expectation when using a Langevin MCMC method and an infinitesimal step size)
Statistical optimality of score matching
Question: is score matching optimal in any statistical sense?
Consider an observed signal y which is a noisy version of an original signal x, which comes from a prior distribution with parameter vector θ. Assume
  p(y, x, θ) = c exp( −(1/(2σ²)) ||y − x||² ) p(x|θ)   (14)
We infer the original signal by MAP inference:
  x̂_MAP(θ̂, y) = argmax_x p(y|x) p(x|θ̂) = argmax_x [ log p(y|x) + log p(x|θ̂) ]
We estimate the parameters θ from a separate sample of noise-free signals x.
What is the optimal method of estimating θ? (A single point estimate)
Statistical optimality (2): difference to classical analysis
- Classical analysis of the optimality of estimators considers errors in the parameter values
- Here, we consider the error in the restored (denoised) signal (Euclidean distance between x and its MAP estimate)
- These errors need to be related, cf. collinearity in linear regression
- Also, to be computationally realistic, we don't use a full Bayesian restoration; instead we take a point estimate of θ and use the MAP estimate
- We also assume that we can observe noise-free signals from which to estimate the parameters
Statistical optimality (3): analysis of the estimation error
Assume the signal is corrupted by infinitely small gaussian noise as above.

Theorem 2. Assume that all the log-pdfs are differentiable, and that the estimation error Δx = x̂ − x in MAP estimation is small. Then a first-order approximation of the error is
  E{||Δx||²} = σ⁴ ( E{||e_1||²} + E{||e_2||²} ) + smaller terms   (15)
where e_1 = ψ_0(x) − ψ(x|θ̂) and e_2 = ψ_0(x) + ψ(y − x).
- Note that e_2 does not depend on θ
- Thus, optimal estimation of θ is by minimization of E_{p_x}{||e_1||²}: this is just score matching!
An information geometry
Considering p_x fixed, we define a Hilbertian structure in the space of score functions:
  ⟨p_1, p_2⟩ = ∫ p_x(ξ) [ ∑_{i=1}^n ψ_{1,i}(ξ) ψ_{2,i}(ξ) ] dξ = ∫ p_x(ξ) ψ_1(ξ)^T ψ_2(ξ) dξ   (16)
- The dot-product defines a norm and a distance
- Score matching is performed by minimizing the distance between p_x and p(.|θ) in this metric
An information geometry (2): Pythagorean decomposition
- The exponential family is a linear subspace
- Estimation is an orthogonal projection onto that subspace
- Pythagorean equality:
    ||p_x||² = dist²(p(.|θ̂), p_x) + ||p(.|θ̂)||²   (17)
- This can be interpreted in terms of the denoising capability of MAP estimation:
    variance of noise which can be removed by MAP denoising
      = noise variance not removed due to imperfect prior + noise variance removed by the prior
- Intuitively, denoising is possible because of structure in the signal, which leads to a more speculative interpretation:
    structure in data = structure not modelled + structure modelled
Experiment: overcomplete ICA basis of natural images
Likelihood:
  log p(x) = ∑_{k=1}^m α_k G(w_k^T x) + Z(w_1, ..., w_m, α_1, ..., α_m)
Objective function:
  J = ∑_{k=1}^m α_k (1/T) ∑_{t=1}^T g'(w_k^T x(t)) + (1/2) ∑_{j,k=1}^m α_j α_k w_j^T w_k (1/T) ∑_{t=1}^T g(w_k^T x(t)) g(w_j^T x(t))   (18)
120 basis vectors from 8×8 image patches (no dimension reduction)
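A vectorized sketch of objective (18), under the assumptions G(u) = −log cosh(u) (so g(u) = −tanh(u) and g'(u) = tanh(u)² − 1) and unit-norm basis vectors w_k; the data and dimensions below are synthetic stand-ins for the image-patch experiment.

```python
import numpy as np

def overcomplete_ica_sm_objective(W, alpha, X):
    """Score matching objective (18). W: (m, n) with rows w_k, alpha: (m,), X: (n, T)."""
    U = W @ X                          # (m, T) projections w_k^T x(t)
    g = -np.tanh(U)                    # g = G' with G(u) = -log cosh(u)
    gp = np.tanh(U) ** 2 - 1.0         # g' = derivative of g
    term1 = np.sum(alpha * gp.mean(axis=1))
    Gram = W @ W.T                     # w_j^T w_k
    C = (g @ g.T) / X.shape[1]         # (1/T) sum_t g(w_j^T x(t)) g(w_k^T x(t))
    term2 = 0.5 * alpha @ (Gram * C) @ alpha
    return term1 + term2

# Tiny synthetic example (a stand-in for 8x8 patches with 120 basis vectors)
rng = np.random.default_rng(6)
n, m, T = 16, 30, 1000
X = rng.laplace(size=(n, T))
W = rng.normal(size=(m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm rows, as assumed above
alpha = np.ones(m)
print(overcomplete_ica_sm_objective(W, alpha, X))
```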
Conclusion
- Nongaussianity is a very powerful property in multivariate statistics
  - It finds hidden factors (independent component analysis)
  - It estimates linear Bayesian networks
- However, it leads to computationally difficult models, even non-normalized models
- We propose to minimize the squared distance between the score functions (gradients of the log-density) of the model density and the data distribution
  - The first consistent and computationally simple method (?)
  - Closed-form solution in some exponential families
  - Statistically optimal in the sense of denoising