From independent component analysis to score matching


1 From independent component analysis to score matching
Aapo Hyvärinen
Dept of Computer Science & HIIT
Dept of Mathematics and Statistics
University of Helsinki, Finland

2 Abstract
First, a short introduction to
- independent component analysis
- non-gaussian Bayesian networks
Main topic: estimation of non-normalized models
Problem: the parameterized density does not integrate to unity
- The partition function (normalization constant) is difficult to compute
Solution: fit the gradient of the log-density, ψ, taken w.r.t. the data variable
- Minimize the squared distance between ψ of the data and ψ of the model
- ψ does not depend on the normalization constant
- Using partial integration, the distance can be computed by a simple formula
The estimator is optimal for reducing gaussian (infinitesimal) noise

3 Problem of blind source separation
There are a number of source signals.
[Figure: source signals]
Due to some external circumstances, only linear mixtures of the source signals are observed.
[Figure: observed mixtures]
Estimate (separate) the original signals!

4 Principal component analysis does not recover original signals

5 Principal component analysis does not recover original signals
A solution is possible: use information on statistical independence to recover the original signals.

6 Independent Component Analysis (Hérault and Jutten, )
The observed random vector $x$ is modelled by a linear latent variable model
$$x_i = \sum_{j=1}^{m} a_{ij} s_j, \quad i = 1, \ldots, n \qquad (1)$$
or in matrix form:
$$x = As \qquad (2)$$
where
- The mixing matrix $A$ is constant (a parameter matrix).
- The $s_i$ are latent random variables called the independent components.
Estimate both $A$ and $s$, observing only $x$.

7 Basic assumptions of the ICA model
Must assume:
- The $s_i$ are mutually independent.
- The $s_i$ are nongaussian.
- For simplicity: the matrix $A$ is square.
Then: the mixing matrix and the components can be identified (Comon, 1994). A very surprising result!
The $s_i$ are defined only up to a multiplicative constant, and are not ordered.

8 ICA and decorrelation
First approach: decorrelate the variables, i.e. find $W$ so that $y = Wx$ has uncorrelated components:
$$E\{y_i y_j\} - E\{y_i\}E\{y_j\} = 0 \qquad (3)$$
But: decorrelation (e.g. PCA) uses only the correlation matrix: $n^2/2$ equations, and the model has $n^2$ parameters. Not enough information!
Fortunately, for independent variables we have something stronger:
$$E\{h_1(y_1)h_2(y_2)\} - E\{h_1(y_1)\}E\{h_2(y_2)\} = 0 \qquad (4)$$
Gaussian data is determined by correlations alone, so the model cannot be estimated for gaussian data.
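The gap between equations (3) and (4) can be illustrated numerically. The following is a minimal sketch, not taken from the slides: the uniform sources, the 45-degree rotation and the choice $h_1 = h_2 = (\cdot)^2$ are all assumptions made for illustration. Rotated independent sources stay uncorrelated, so (3) cannot tell them apart from the true sources, while the nonlinear correlation in (4) can.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000

# Two independent, zero-mean, unit-variance uniform (nongaussian) sources.
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, T))

# Rotate by 45 degrees: still uncorrelated, but no longer independent.
c = 1.0 / np.sqrt(2.0)
R = np.array([[c, -c],
              [c,  c]])
y = R @ s


def linear_cov(u):
    """Sample version of equation (3): E{u1 u2} - E{u1} E{u2}."""
    return np.mean(u[0] * u[1]) - np.mean(u[0]) * np.mean(u[1])


def nonlinear_cov(u, h1=np.square, h2=np.square):
    """Sample version of equation (4) for the nonlinearities h1 and h2."""
    return np.mean(h1(u[0]) * h2(u[1])) - np.mean(h1(u[0])) * np.mean(h2(u[1]))


print("linear covariance,    sources:", linear_cov(s))
print("linear covariance,    rotated:", linear_cov(y))
print("nonlinear covariance, sources:", nonlinear_cov(s))
print("nonlinear covariance, rotated:", nonlinear_cov(y))
```

Both linear covariances come out near zero, whereas only the true sources also have a near-zero nonlinear covariance.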

9 Basic intuitive principle of ICA estimation
(A very sloppy version of) the Central Limit Theorem: consider a linear combination $w^T x = q^T s$. Then $q_i s_i + q_j s_j$ is more gaussian than $s_i$.
Maximizing the nongaussianity of $q^T s$, we can find $s_i$.
Also known as projection pursuit.
Cf. principal component analysis: maximize the variance of $w^T x$.
A number of algorithms are available, e.g. FastICA (Hyvärinen, 1999).
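The principle above can be sketched in two dimensions with a brute-force search. This is an illustrative toy, not FastICA: the mixing matrix, the uniform sources, the use of absolute kurtosis as the nongaussianity measure and the angle grid are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000

# Independent, unit-variance uniform sources and a hypothetical mixing matrix.
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, T))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
x = A @ s

# Whiten the observations: z has (sample) identity covariance.
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
z = (E / np.sqrt(d)) @ E.T @ x


def kurtosis(u):
    """Excess kurtosis; zero for a gaussian variable."""
    u = u - u.mean()
    return np.mean(u**4) / np.mean(u**2) ** 2 - 3.0


# Scan unit vectors w = (cos a, sin a) and keep the most nongaussian projection.
angles = np.linspace(0.0, np.pi, 360, endpoint=False)
kurt = np.array([kurtosis(np.cos(a) * z[0] + np.sin(a) * z[1]) for a in angles])
best = angles[np.argmax(np.abs(kurt))]
w = np.array([np.cos(best), np.sin(best)])
y = w @ z

# The recovered projection should match one source up to sign and scale.
corr = [abs(np.corrcoef(y, s[i])[0, 1]) for i in range(2)]
print("absolute correlations with the two sources:", corr)
```

An exhaustive angle scan only works in very low dimensions; it is used here purely to make the objective visible.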

10 Linear Non-Gaussian Acyclic Model (LiNGAM; Shimizu et al 2006)
Instead of components, we can estimate a network: $x = Bx + e$
[Figure: example network graph over the observed variables]
Estimation is possible based on ICA:
- Assume the $e_i$ are independent and nongaussian.
- We can rearrange to obtain ICA (almost):
$$x = Bx + e \;\Longleftrightarrow\; (I - B)x = e$$
- So, ICA can be used to obtain $I - B$ (almost).
Problem: ICA does not determine the order of the components.
Solution: acyclicity defines the order uniquely.
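The rearrangement on this slide can be checked on simulated data. The sketch below only generates data from the model and verifies the algebra; it does not estimate $B$. The connection strengths, the Laplacian disturbances and the dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 1000

# Hypothetical connection strengths. A strictly lower-triangular B means the
# variables are already listed in a causal order, i.e. the graph is acyclic.
B = np.tril(rng.normal(size=(n, n)), k=-1)

# Independent, nongaussian (Laplacian) disturbances e_i.
e = rng.laplace(size=(n, T))

# x = Bx + e  <=>  (I - B) x = e  <=>  x = (I - B)^{-1} e
I = np.eye(n)
x = np.linalg.solve(I - B, e)

# (I - B)^{-1} plays the role of the ICA mixing matrix.
A = np.linalg.inv(I - B)

print("max |(I - B)x - e| =", np.max(np.abs((I - B) @ x - e)))
print("mixing matrix is lower triangular:", np.allclose(A, np.tril(A)))
```

With the variables in causal order the mixing matrix is lower triangular with a unit diagonal, which is the structure that acyclicity lets one recover after ICA has left the order undetermined.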

11 Generalization of ICA to many components
In basic ICA, number of components = dimension of data.
We could consider many more projections and maximize their non-gaussianity:
$$\sum_{k=1}^{m} G(w_k^T x) \qquad (5)$$
for some function $G$ measuring nongaussianity.
To estimate the $w_k$, we interpret this as a log-density. However, it should be normalized to unit integral:
$$\log p(x) = \sum_{k=1}^{m} G(w_k^T x) - \log \int \exp\Big(\sum_{k=1}^{m} G(w_k^T \xi)\Big)\, d\xi \qquad (6)$$
We find a very difficult integral! This leads to the main topic.

12 Main talk topic: score matching
Abstract
- How to estimate models which cannot be integrated analytically
- Maximum likelihood estimation is computationally difficult: one must compute the integral
- We propose a computationally efficient method which avoids integration
- It can be shown to be statistically consistent, and optimal according to a Bayesian denoising objective

13 General problem: Non-normalized model estimation
We want to estimate a parametric model of a multivariate random vector $x \in \mathbb{R}^n$.
The density function is known only up to a multiplicative constant:
$$p(x;\theta) = \frac{1}{Z(\theta)}\, q(x;\theta), \qquad Z(\theta) = \int_{\xi \in \mathbb{R}^n} q(\xi;\theta)\, d\xi$$
The functional form of $q$ is known (can be easily computed).
$Z$ cannot be computed in reasonable computing time.
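To make the cost of $Z(\theta)$ concrete, here is a small sketch that computes it by brute-force grid summation. The model $q(x;\theta) = \exp(-\theta\|x\|^2/2)$ is an assumption chosen only because its exact $Z$ is known, so the result can be checked; the grid size and width are likewise illustrative.

```python
import numpy as np

theta = 2.0


def q(x, theta):
    """Non-normalized density exp(-theta * ||x||^2 / 2); x has shape (..., n)."""
    return np.exp(-0.5 * theta * np.sum(x**2, axis=-1))


def grid_partition_function(n, points_per_axis=101, half_width=6.0):
    """Approximate Z(theta) by a Riemann sum on a regular n-dimensional grid."""
    axis = np.linspace(-half_width, half_width, points_per_axis)
    h = axis[1] - axis[0]
    mesh = np.stack(np.meshgrid(*([axis] * n), indexing="ij"), axis=-1)
    return q(mesh, theta).sum() * h**n, points_per_axis**n


results = {}
for n in (1, 2, 3):
    Z_grid, evaluations = grid_partition_function(n)
    Z_exact = (2.0 * np.pi / theta) ** (n / 2.0)
    results[n] = (Z_grid, Z_exact, evaluations)
    print(f"n={n}: Z_grid={Z_grid:.6f}  Z_exact={Z_exact:.6f}  evaluations={evaluations}")
```

The number of evaluations of $q$ grows as (points per axis)$^n$, which is why this direct approach is hopeless for data such as image patches with dozens of dimensions.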

14 Previous solutions
Monte Carlo methods
- Consistent estimators (convergence to the real parameter values when sample size $\to \infty$)
- Computation very slow
Various approximations, e.g. variational methods
- Computation often fast
- Consistency not known
Pseudo-likelihood and contrastive divergence
- Presumably consistent
- Computations slow with continuous-valued variables: need 1-D integration at every step, or sophisticated MCMC methods

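15 Definition of score function (in this talk)
Define the model score function $\mathbb{R}^n \to \mathbb{R}^n$ as
$$\psi(\xi;\theta) = \begin{pmatrix} \dfrac{\partial \log p(\xi;\theta)}{\partial \xi_1} \\ \vdots \\ \dfrac{\partial \log p(\xi;\theta)}{\partial \xi_n} \end{pmatrix} = \nabla_\xi \log p(\xi;\theta)$$
Similarly, define the data score function as
$$\psi_x(\xi) = \nabla_\xi \log p_x(\xi)$$
where the observed data is assumed to follow $p_x(.)$.
In conventional terminology: Fisher score with respect to a hypothetical location parameter: $p(x - \theta)$, evaluated at $\theta = 0$.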

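16 Score matching: definition of objective function
Estimate by minimizing a distance between the model score function $\psi(.;\theta)$ and the score function of the observed data $\psi_x(.)$:
$$J(\theta) = \frac{1}{2} \int_{\xi \in \mathbb{R}^n} p_x(\xi)\, \|\psi(\xi;\theta) - \psi_x(\xi)\|^2\, d\xi \qquad (7)$$
$$\hat{\theta} = \arg\min_\theta J(\theta)$$
This gives a consistent estimator almost by construction.
The model score does not depend on the normalization constant $Z(\theta)$, because $\log p = \log q - \log Z$ and $Z(\theta)$ does not depend on $\xi$:
$$\psi(\xi;\theta) = \nabla_\xi \log q(\xi;\theta) - \nabla_\xi \log Z(\theta) = \nabla_\xi \log q(\xi;\theta) - 0 \qquad (8)$$
No need to compute the normalization constant $Z$; the non-normalized pdf $q$ is enough.
Computation of $J$ is quite simple due to the theorem below.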
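Equation (8) can be verified numerically with finite differences. In this sketch the non-normalized model $\log q(x;\theta) = -\theta \log\cosh(x)$, the grid used to approximate $Z$ and the evaluation points are all hypothetical choices for illustration.

```python
import numpy as np


def log_q(x, theta):
    """Hypothetical non-normalized log-density (not from the slides)."""
    return -theta * np.log(np.cosh(x))


def score(log_density, x, eps=1e-5):
    """Central-difference derivative of a log-density w.r.t. the data variable."""
    return (log_density(x + eps) - log_density(x - eps)) / (2.0 * eps)


theta = 1.5
x = np.linspace(-3.0, 3.0, 13)

# Normalization constant by 1-D numerical integration (feasible only in 1-D).
grid = np.linspace(-40.0, 40.0, 400001)
Z = np.sum(np.exp(log_q(grid, theta))) * (grid[1] - grid[0])


def log_p(u):
    """Normalized log-density: log q - log Z."""
    return log_q(u, theta) - np.log(Z)


score_q = score(lambda u: log_q(u, theta), x)
score_p = score(log_p, x)

print("max |score_p - score_q| =", np.max(np.abs(score_p - score_q)))
```

Both scores agree with the analytic value $-\theta\tanh(x)$, so $Z$ never has to be computed to evaluate $\psi$.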

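17 A computational trick
In the objective function we have the score function of the data distribution, $\psi_x(.)$. How to compute it? In fact, there is no need to compute it, because:
Theorem 1. Assume some regularity conditions, and smooth densities. Then the score matching objective function $J$ can be expressed as
$$J(\theta) = \int_{\xi \in \mathbb{R}^n} p_x(\xi) \sum_{i=1}^{n} \Big[ \partial_i \psi_i(\xi;\theta) + \frac{1}{2} \psi_i(\xi;\theta)^2 \Big]\, d\xi + \text{const.} \qquad (9)$$
where the constant does not depend on $\theta$, and
$$\psi_i(\xi;\theta) = \frac{\partial \log q(\xi;\theta)}{\partial \xi_i}, \qquad \partial_i \psi_i(\xi;\theta) = \frac{\partial^2 \log q(\xi;\theta)}{\partial \xi_i^2}$$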

18 Simple explanation of trick
Consider the objective function $J(\theta)$:
$$\frac{1}{2}\int p_x(\xi)\,\|\psi_x(\xi)\|^2\, d\xi + \frac{1}{2}\int p_x(\xi)\,\|\psi(\xi;\theta)\|^2\, d\xi - \int p_x(\xi)\,\psi_x(\xi)^T \psi(\xi;\theta)\, d\xi$$
The first term does not depend on $\theta$. The second term is easy to compute. The trick is to use partial integration on the third term. In one dimension:
$$\int p_x(x)\,(\log p_x)'(x)\,\psi(x;\theta)\, dx = \int p_x(x)\,\frac{p_x'(x)}{p_x(x)}\,\psi(x;\theta)\, dx = \int p_x'(x)\,\psi(x;\theta)\, dx = 0 - \int p_x(x)\,\psi'(x;\theta)\, dx$$
This is why the score function of the data distribution $p_x(x)$ disappears!
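The one-dimensional identity above can be checked by Monte Carlo. This is a sketch under stated assumptions: the data are standard gaussian (so the data score is known to be $-x$), and the model score $\psi(x;\theta) = -\tanh(\theta x)$ is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)  # samples from p_x = N(0, 1)

# Data score, known here only because p_x is N(0, 1): (log p_x)'(x) = -x.
psi_x = -x

# Hypothetical model score and its derivative w.r.t. x.
theta = 0.7
psi = -np.tanh(theta * x)
dpsi = -theta * (1.0 - np.tanh(theta * x) ** 2)

lhs = np.mean(psi_x * psi)  # cross term: requires the data score
rhs = -np.mean(dpsi)        # after partial integration: data score not needed

print("E{psi_x * psi} =", lhs)
print("-E{psi'}       =", rhs)
```

The two sample averages agree up to Monte Carlo error, and the second one uses only the samples and the model.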

19 Final method of score matching
Replace the integration over the sample density $p_x(.)$ by a sample average.
Given $T$ observations $x(1), \ldots, x(T)$, minimize
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \Big[ \partial_i \psi_i(x(t);\theta) + \frac{1}{2} \psi_i(x(t);\theta)^2 \Big] \qquad (10)$$
where $\psi_i$ is a partial derivative of the non-normalized model log-density $\log q$, and $\partial_i \psi_i$ a second partial derivative.
Only needs evaluation of some derivatives of the non-normalized (log-)density $q$, which are simple to compute (by assumption).
Thus: a new, computationally simple and statistically consistent method.
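As a minimal worked instance of the sample objective (10), consider a one-dimensional gaussian model written in non-normalized form, $q(x;\mu,\lambda) = \exp(-\lambda(x-\mu)^2/2)$. The model, the synthetic data, and the plain gradient descent with a fixed step size are all assumptions made for this sketch; any optimizer could be substituted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5000)  # synthetic observations


def J(mu, lam):
    """Sample score matching objective (10) for log q = -lam * (x - mu)^2 / 2."""
    psi = -lam * (x - mu)  # first derivative of log q w.r.t. x
    dpsi = -lam            # second derivative of log q w.r.t. x
    return np.mean(dpsi + 0.5 * psi**2)


# Plain gradient descent on (mu, lam), using the analytic gradient of J.
mu, lam = 0.0, 1.0
step = 0.1
for _ in range(5000):
    grad_mu = -lam**2 * np.mean(x - mu)
    grad_lam = -1.0 + lam * np.mean((x - mu) ** 2)
    mu, lam = mu - step * grad_mu, lam - step * grad_lam

print("estimated mean     :", mu, " sample mean    :", x.mean())
print("estimated precision:", lam, " 1 / sample var :", 1.0 / x.var())
print("objective at optimum:", J(mu, lam))
```

For this model the minimizer coincides with the sample mean and the inverse sample variance, which gives a simple sanity check that no normalization constant was needed.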

20 Interesting result: Closed-form solution in the exponential family
Assume the pdf can be expressed in the form
$$\log p(\xi;\theta) = \sum_{k=1}^{m} \theta_k F_k(\xi) - \log Z(\theta) \qquad (11)$$
Define matrices of partial derivatives:
$$K_{ki}(\xi) = \frac{\partial F_k}{\partial \xi_i}, \quad \text{and} \quad H_{ki}(\xi) = \frac{\partial^2 F_k}{\partial \xi_i^2} \qquad (12)$$
Then the score matching estimator is given by
$$\hat{\theta} = -\Big[ \hat{E}\{K(x)K(x)^T\} \Big]^{-1} \Big( \sum_i \hat{E}\{h_i(x)\} \Big) \qquad (13)$$
where $\hat{E}$ denotes the sample average, and the vector $h_i$ is the $i$-th column of the matrix $H$.
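The closed-form estimator can be tried on a one-dimensional gaussian written as an exponential family with $F_1(\xi) = -\xi^2/2$ and $F_2(\xi) = \xi$, so that $\theta_1$ is the precision and $\theta_2$ is precision times mean. This parameterization and the synthetic data are assumptions for illustration. The estimator is written with a leading minus sign, which is what setting the gradient of the quadratic objective $\theta^T \hat{E}\{\sum_i h_i\} + \frac{1}{2}\theta^T \hat{E}\{KK^T\}\theta$ to zero gives.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=(5000, 1))  # T x n data, with n = 1
T, n = x.shape

# Exponential family terms (an assumption): F_1 = -xi^2 / 2, F_2 = xi.
# K[t, k, i] = dF_k/dxi_i and H[t, k, i] = d^2 F_k/dxi_i^2, evaluated at x(t).
K = np.stack([-x, np.ones_like(x)], axis=1)                 # shape (T, 2, 1)
H = np.stack([-np.ones_like(x), np.zeros_like(x)], axis=1)  # shape (T, 2, 1)

KKt = np.einsum("tki,tli->kl", K, K) / T  # sample average of K K^T
h_sum = H.sum(axis=2).mean(axis=0)        # sum over i of sample averages of h_i

theta_hat = -np.linalg.solve(KKt, h_sum)

print("theta_hat        :", theta_hat)
print("1/var, mean/var  :", 1.0 / x.var(), x.mean() / x.var())
```

No iteration is required: the estimate is the solution of a small linear system built from sample averages.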

21 Extensions of score matching
Can be extended to non-negative data
- Basic score matching cannot be directly used because the density is typically not smooth over $\mathbb{R}^n$.
Can be extended to binary variables
- However, utility is questionable because pseudolikelihood is computationally efficient in that case.
Can be shown to be equivalent to a special case of contrastive divergence (equal in expectation when using a Langevin MCMC method and infinitesimal step size).

22 Statistical optimality of score matching
Question: is score matching optimal in any statistical sense?
Consider an observed signal $y$ which is a noisy version of an original signal $x$, which comes from a prior distribution with parameter vector $\theta$.
Assume:
$$p(y, x, \theta) = c \exp\Big( -\frac{1}{2\sigma^2} \|y - x\|^2 \Big)\, p(x|\theta) \qquad (14)$$
We infer the original signal by MAP inference:
$$\hat{x}_{MAP}(\hat{\theta}, y) = \arg\max_x\, p(y|x)\, p(x|\hat{\theta}) = \arg\max_x\, \log p(y|x) + \log p(x|\hat{\theta})$$
We estimate the parameters $\theta$ from a separate sample of noise-free signals $x$.
What is the optimal method of estimating $\theta$? (A single point estimate.)
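As a hedged aside, the link between MAP denoising and the score function can be made explicit. Assuming the log-prior is differentiable and the maximizer is an interior stationary point (assumptions not stated on this slide), setting the gradient of the log-posterior in the MAP formula above to zero gives the following, where $\psi(x|\hat{\theta}) = \nabla_x \log p(x|\hat{\theta})$ is the model score:

```latex
\nabla_x \Big[ -\frac{1}{2\sigma^2}\|y - x\|^2 + \log p(x|\hat{\theta}) \Big]_{x = \hat{x}_{MAP}}
  = \frac{y - \hat{x}_{MAP}}{\sigma^2} + \psi(\hat{x}_{MAP}|\hat{\theta}) = 0
\quad\Longrightarrow\quad
\hat{x}_{MAP} = y + \sigma^2\, \psi(\hat{x}_{MAP}|\hat{\theta})
  \approx y + \sigma^2\, \psi(y|\hat{\theta}) \quad \text{for small } \sigma .
```

So for small noise the denoised signal depends on the prior only through its score function, which suggests why an estimator that fits the score is a natural candidate for optimality here.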

23 Statistical optimality (2): Difference to classical analysis
Classical analysis of optimality of estimators considers errors in parameter values.
Here, we consider the error in the restored (denoised) signal (Euclidean distance between $x$ and its MAP estimate).
These errors need to be related, cf. collinearity in linear regression.
Also: to be computationally realistic, we don't use a full Bayesian restoration; instead we take a point estimate of $\theta$ and use the MAP estimate.
We also assume that we can observe noise-free signals from which to estimate the parameters.

24 Statistical optimality (3): Analysis of estimation error
Assume the signal is corrupted by infinitely small gaussian noise as above.
Theorem 2. Assume that all the log-pdf's are differentiable, and that the estimation error in MAP estimation, $\Delta x = \hat{x} - x$, is small. Then the first-order approximation of the error is
$$\|\Delta x\|^2 = \sigma^4 \Big[ E\{\|E_1\|^2\} + E\{\|E_2\|^2\} \Big] + \text{smaller terms} \qquad (15)$$
where $E_1 = \psi_0(x) - \psi(x|\hat{\theta})$ and $E_2 = \psi_0(x) + \psi(y|x)$.
Note that $E_2$ does not depend on $\theta$.
Thus, optimal estimation of $\theta$ is by minimization of $E_{p_x}\{\|E_1\|^2\}$: this is just score matching!

25 An information geometry
Considering $p_x$ fixed, we define a Hilbertian structure in the space of score functions:
$$\langle p_1, p_2 \rangle = \int p_x(\xi) \Big[ \sum_{i=1}^{n} \psi_{1,i}(\xi)\, \psi_{2,i}(\xi) \Big] d\xi = \int p_x(\xi)\, \psi_1(\xi)^T \psi_2(\xi)\, d\xi \qquad (16)$$
The dot-product defines a norm and a distance.
Score matching is performed by minimization of the distance between $p_x$ and $p(.|\theta)$ in this metric.

26 An information geometry (2): Pythagorean decomposition
The exponential family is a linear subspace.
Estimation is an orthogonal projection on that subspace.
Pythagorean equality:
$$\|p_x\|^2 = \mathrm{dist}^2(p(.|\hat{\theta}), p_x) + \|p(.|\hat{\theta})\|^2 \qquad (17)$$
Can be interpreted in terms of the denoising capability of MAP estimation:
var of noise which can be removed by MAP denoising = noise var not removed due to imperfect prior + noise var removed by prior
Intuitively, denoising is possible because of structure in the signal, which leads to a more speculative interpretation:
Structure in data = Structure not modelled + Structure modelled
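The Pythagorean equality can be checked by Monte Carlo in a case where the data score is known exactly. The setup below is an assumption for illustration: data from a standard logistic density (score $-\tanh(\xi/2)$) and a zero-mean gaussian model $\log q(\xi;\theta) = -\theta\xi^2/2$, for which the closed-form score matching estimate is $\hat{\theta} = 1/\hat{E}\{x^2\}$. Inner products are sample averages of products of scores under $p_x$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.logistic(size=200_000)  # samples from a standard logistic density p_x

# Exact data score of the standard logistic density.
psi_x = -np.tanh(x / 2.0)

# Zero-mean gaussian model log q = -theta * xi^2 / 2, with score -theta * xi.
theta_hat = 1.0 / np.mean(x**2)  # closed-form score matching estimate
psi_model = -theta_hat * x


def inner(a, b):
    """Sample estimate of the inner product of two score functions under p_x."""
    return np.mean(a * b)


norm_data = inner(psi_x, psi_x)                       # ||p_x||^2
dist2 = inner(psi_x - psi_model, psi_x - psi_model)   # dist^2(model, p_x)
norm_model = inner(psi_model, psi_model)              # ||model||^2

print("||p_x||^2            =", norm_data)
print("dist^2 + ||model||^2 =", dist2 + norm_model)
```

The two printed values agree up to sampling error, and the strictly positive distance term is the part of the structure in the data that the gaussian model does not capture.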

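27 Experiment: overcomplete ICA basis of natural images
Likelihood:
$$\log p(x) = \sum_{k=1}^{m} \alpha_k G(w_k^T x) + Z(w_1, \ldots, w_m, \alpha_1, \ldots, \alpha_m)$$
Objective function:
$$J = \sum_{k=1}^{m} \alpha_k \frac{1}{T} \sum_{t=1}^{T} g'(w_k^T x(t)) + \frac{1}{2} \sum_{j,k=1}^{m} \alpha_j \alpha_k\, w_j^T w_k\, \frac{1}{T} \sum_{t=1}^{T} g(w_k^T x(t))\, g(w_j^T x(t)) \qquad (18)$$
where $g = G'$ is the derivative of $G$.
120 basis vectors from 8 × 8 image patches (no dimension reduction).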
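A minimal sketch of how the objective (18) could be evaluated with matrix operations follows. Several things here are assumptions rather than details from the slide: the choice $G(u) = -\log\cosh(u)$, unit-norm vectors $w_k$ (needed for the first term of (18) to equal the summed second derivatives), the function name `objective`, and the random stand-in data used in place of image patches.

```python
import numpy as np


def g(u):
    """g = G' for the assumed choice G(u) = -log cosh(u)."""
    return -np.tanh(u)


def g_prime(u):
    """Derivative of g."""
    return np.tanh(u) ** 2 - 1.0


def objective(W, alpha, X):
    """Equation (18). W: (m, n) with unit-norm rows w_k; alpha: (m,); X: (T, n)."""
    T = X.shape[0]
    U = X @ W.T                              # U[t, k] = w_k^T x(t)
    term1 = alpha @ g_prime(U).mean(axis=0)  # sum_k alpha_k * mean_t g'(.)
    C = g(U).T @ g(U) / T                    # C[j, k] = mean_t g(u_j) g(u_k)
    gram = W @ W.T                           # gram[j, k] = w_j^T w_k
    term2 = 0.5 * alpha @ (gram * C) @ alpha
    return term1 + term2


# Random stand-in data (not natural images): m = 6 vectors in n = 4 dimensions.
rng = np.random.default_rng(0)
n, m, T = 4, 6, 500
X = rng.standard_normal((T, n))
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # enforce unit norm
alpha = rng.uniform(0.5, 1.5, size=m)

J_value = objective(W, alpha, X)
print("J =", J_value)
```

In practice one would minimize this value over `W` and `alpha` with a gradient-based optimizer, renormalizing the rows of `W` after each step; that part is omitted here.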

28 Conclusion
Non-gaussianity is a very powerful property in multivariate statistics
- Finds hidden factors (independent component analysis)
- Estimates linear Bayesian network
However, it leads to computationally difficult models
- Even non-normalized models
We propose to minimize the squared distance of the score functions (gradients of log-density) of model density and data distribution
- First consistent and computationally simple method (?)
- Closed-form solution in some exponential families
- Statistical optimality in the sense of denoising
