Estimation theory and information geometry based on denoising

Aapo Hyvärinen
Dept of Computer Science & HIIT
Dept of Mathematics and Statistics
University of Helsinki, Finland
Abstract

- What is the best prior to be used in denoising by Bayesian inference?
  - Consider infinitesimal gaussian noise
  - Assume we can estimate the prior parameters from noise-free data
- Solution: fit the gradient of the log-density $\psi$ with respect to the data variable
  - Minimize the squared distance between the $\psi$ of the data and the $\psi$ of the model
  - Using partial integration, the distance can be computed by a simple formula
- Related problem: estimation of non-normalized models
  - A computationally simple solution is provided by the same estimator
  - No need to compute the normalization constant (partition function)
- Leads to a new kind of information geometry
Starting point: Best prior for denoising

- Consider an observed signal $y$ which is a noisy version of an original signal $x$, which comes from a prior distribution with parameter vector $\theta$. Assume:
  $$p(y, x, \theta) = c \exp\left(-\frac{1}{2\sigma^2}\|y - x\|^2\right) p(x \mid \theta) \qquad (1)$$
- We infer the original signal by MAP inference (see the sketch below):
  $$\hat{x}_{\mathrm{MAP}}(\hat\theta, y) = \arg\max_x p(y \mid x)\, p(x \mid \hat\theta) = \arg\max_x \left[\log p(y \mid x) + \log p(x \mid \hat\theta)\right]$$
- We estimate the parameters $\theta$ from a separate sample of noise-free signals $x$.
- What is the optimal method of estimating $\theta$? (A single point estimate.)
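To make the MAP step concrete, here is a minimal sketch in Python. The Gaussian prior, its parameters, and the sample size are hypothetical choices of mine; with a Gaussian prior and Gaussian noise, the MAP estimate has the familiar closed form (shrinkage of $y$ towards the prior mean), which keeps the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: scalar signals x ~ N(mu, tau2), observed as y = x + noise,
# noise ~ N(0, sigma2).  The prior parameters are assumed estimated from clean data.
mu, tau2 = 0.0, 1.0      # prior parameters
sigma2 = 0.1             # noise variance

x = rng.normal(mu, np.sqrt(tau2), size=10_000)            # clean signals
y = x + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)    # noisy observations

# argmax_x [log p(y|x) + log p(x|theta)] is closed-form for Gaussian/Gaussian:
x_map = (tau2 * y + sigma2 * mu) / (tau2 + sigma2)

print("noisy MSE:", np.mean((y - x) ** 2))
print("MAP   MSE:", np.mean((x_map - x) ** 2))
```

For non-Gaussian priors the argmax generally has no closed form and would be found numerically; the Gaussian case is used here only to illustrate the inference step.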
Difference to classical optimality analysis

- Classical analysis of the optimality of estimators considers errors in parameter values.
- Here, we consider the error in the restored (denoised) signal (Euclidean distance between $x$ and its MAP estimate).
- These two kinds of error are not equivalent, cf. collinearity in linear regression, where parameter estimates can be far off while predictions remain good.
- Also, to be computationally realistic, we don't use full Bayesian restoration; instead we take a point estimate of $\theta$ and use the MAP estimate.
- We also assume that we can observe noise-free signals from which to estimate the parameters.
Analysis of estimation error

- Assume the signal is corrupted by infinitely small gaussian noise as above.

Theorem 1. Assume that all the log-pdf's are differentiable, and that the estimation error in MAP estimation, $\Delta x = \hat{x} - x$, is small. Then the first-order approximation of the error is
$$\|\Delta x\|^2 = \sigma^4 \left( \|\mathbf{e}_1\|^2 + \|\mathbf{e}_2\|^2 \right) + \text{smaller terms} \qquad (2)$$
where $\mathbf{e}_1 = \psi_0(x) - \psi(x \mid \hat\theta)$ and $\mathbf{e}_2 = \psi_0(x) + \psi(y \mid x)$, with $\psi_0$ the score function of the true prior.

- Note that $\mathbf{e}_2$ does not depend on $\theta$.
- Thus, optimal estimation of $\theta$ is by minimization of $E_{p_x}\{\|\mathbf{e}_1\|^2\}$.
Definition of score function (in this talk)

- Define the model score function $\psi(\cdot \mid \theta): \mathbb{R}^n \to \mathbb{R}^n$ as
  $$\psi(\xi \mid \theta) = \left(\frac{\partial \log p(\xi \mid \theta)}{\partial \xi_1}, \ldots, \frac{\partial \log p(\xi \mid \theta)}{\partial \xi_n}\right)^T = \nabla_\xi \log p(\xi \mid \theta)$$
- Similarly, define the data score function as $\psi_x(\xi) = \nabla_\xi \log p_x(\xi)$, where the observed data is assumed to follow $p_x(\cdot)$.
- The optimal estimator is obtained by minimizing a distance between the model score function $\psi(\cdot \mid \theta)$ and the score function of the observed data, $\psi_x(\cdot)$:
  $$J(\theta) = \frac{1}{2} \int_{\xi \in \mathbb{R}^n} p_x(\xi)\, \|\psi(\xi \mid \theta) - \psi_x(\xi)\|^2 \, d\xi \qquad (3)$$
- The estimator is consistent almost by construction (see the numerical sketch below).
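The following sketch evaluates objective (3) by numerical integration in one dimension. The standard normal data density and the Gaussian location model are my choices, picked so that both score functions are analytic; the minimum sits at the true location, illustrating the consistency claim.

```python
import numpy as np

# One-dimensional illustration of objective (3), assuming data ~ N(0,1) and a
# Gaussian location model N(theta, 1).  Both score functions are analytic:
# psi_x(xi) = -xi and psi(xi | theta) = -(xi - theta).
xi = np.linspace(-8.0, 8.0, 4001)
dxi = xi[1] - xi[0]
p_x = np.exp(-xi**2 / 2) / np.sqrt(2 * np.pi)

def J(theta):
    psi_model = -(xi - theta)
    psi_data = -xi
    return 0.5 * np.sum(p_x * (psi_model - psi_data) ** 2) * dxi

for theta in [0.0, 0.5, 1.0]:
    print(theta, J(theta))  # analytically J(theta) = theta^2 / 2, minimized at 0
```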
Related problem: Non-normalized model estimation

- We want to estimate a parametric model of a multivariate random vector $x \in \mathbb{R}^n$.
- The density function is known only up to a multiplicative constant:
  $$p(x \mid \theta) = \frac{1}{Z(\theta)}\, q(x \mid \theta), \qquad Z(\theta) = \int_{\xi \in \mathbb{R}^n} q(\xi \mid \theta)\, d\xi$$
- The functional form of $q$ is known (can be easily computed); $Z$ cannot be computed with reasonable computing time.
- Typical application: Markov random fields.
Previous solutions to estimation of non-normalized models

- Monte Carlo methods for estimating $Z$
  - Consistent estimators (convergence to the real parameter values when the sample size $\to \infty$)
  - Computation very slow (I think)
- Various approximations, e.g. variational methods
  - Computation often fast
  - Consistency not known, or proven inconsistent
- Pseudo-likelihood and contrastive divergence
  - Presumably consistent
  - Computation slow with continuous-valued variables: needs 1-D integration at every step, or sophisticated MCMC methods
Score matching can be used for non-normalized models

- No need to compute the normalization constant, because
  $$\psi(\xi \mid \theta) = \nabla_\xi \log q(\xi \mid \theta) - \nabla_\xi \log Z(\theta) = \nabla_\xi \log q(\xi \mid \theta) + 0 \qquad (4)$$
- In the objective function we have the score function of the data distribution, $\psi_x(\cdot)$. How to compute it? In fact, there is no need to compute it, because:

Theorem 2. Assume some regularity conditions, and smooth densities. Then the score matching objective function $J$ can be expressed as
$$J(\theta) = \int_{\xi \in \mathbb{R}^n} p_x(\xi) \sum_{i=1}^n \left[\partial_i \psi_i(\xi \mid \theta) + \frac{1}{2} \psi_i(\xi \mid \theta)^2\right] d\xi + \text{const.} \qquad (5)$$
where the constant does not depend on $\theta$, and
$$\psi_i(\xi \mid \theta) = \frac{\partial \log q(\xi \mid \theta)}{\partial \xi_i}, \qquad \partial_i \psi_i(\xi \mid \theta) = \frac{\partial^2 \log q(\xi \mid \theta)}{\partial \xi_i^2}$$
Simple explanation of trick

- Consider the objective function $J(\theta)$ expanded:
  $$\frac{1}{2}\int p_x(\xi)\, \|\psi_x(\xi)\|^2\, d\xi + \frac{1}{2}\int p_x(\xi)\, \|\psi(\xi \mid \theta)\|^2\, d\xi - \int p_x(\xi)\, \psi_x(\xi)^T \psi(\xi \mid \theta)\, d\xi$$
- The first term does not depend on $\theta$. The second term is easy to compute. The trick is to use partial integration on the third term. In one dimension:
  $$\int p_x(x)\, (\log p_x)'(x)\, \psi(x \mid \theta)\, dx = \int p_x(x)\, \frac{p_x'(x)}{p_x(x)}\, \psi(x \mid \theta)\, dx = \int p_x'(x)\, \psi(x \mid \theta)\, dx = 0 - \int p_x(x)\, \psi'(x \mid \theta)\, dx$$
- This is why the score function of the data distribution $p_x(x)$ disappears! (A numeric check follows below.)
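A quick numeric sanity check of the identity, again assuming standard normal data and a Gaussian location model (both my choices): the integration-by-parts form and the direct score distance should differ only by a constant that does not depend on $\theta$.

```python
import numpy as np

# Assuming data ~ N(0,1) and the model N(theta,1): the integration-by-parts form
# E[psi'(x) + 0.5 psi(x)^2] should match the direct score distance
# 0.5 E[(psi(x|theta) - psi_x(x))^2] up to a theta-independent constant.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)

def J_direct(theta):
    return 0.5 * np.mean(((-(x - theta)) - (-x)) ** 2)

def J_parts(theta):
    psi = -(x - theta)   # model score
    dpsi = -1.0          # its derivative w.r.t. x
    return np.mean(dpsi + 0.5 * psi ** 2)

for theta in [0.0, 0.5, 1.0]:
    print(theta, J_direct(theta), J_parts(theta), J_parts(theta) - J_direct(theta))
# the last column is (up to sampling noise) the same constant for every theta
```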
Final method of score matching

- Replace integration over the sample density $p_x(\cdot)$ by a sample average.
- Given $T$ observations $x(1), \ldots, x(T)$, minimize
  $$J(\theta) = \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^n \left[\partial_i \psi_i(x(t) \mid \theta) + \frac{1}{2} \psi_i(x(t) \mid \theta)^2\right] \qquad (6)$$
  where $\psi_i$ is a partial derivative of the non-normalized model log-density $\log q$, and $\partial_i \psi_i$ a second partial derivative. (A minimal instance follows below.)
- Only needs evaluation of some derivatives of the non-normalized (log-)density $q$, which are simple to compute (by assumption).
- Thus: statistical optimality in denoising and computational simplicity for non-normalized models are obtained with the same estimator.
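A minimal instance of the sample objective (6), for a model of my choosing: a zero-mean Gaussian with unknown precision, written in non-normalized form $q(x \mid \lambda) = \exp(-\lambda x^2/2)$ with the normalizer deliberately ignored.

```python
import numpy as np

# Non-normalized model q(x|lam) = exp(-lam * x^2 / 2); only the derivatives of
# log q enter the objective (6), so the normalizer is never needed.
rng = np.random.default_rng(0)
true_lam = 4.0
x = rng.normal(0.0, 1.0 / np.sqrt(true_lam), size=50_000)

def J(lam):
    psi = -lam * x   # d/dx log q(x|lam)
    dpsi = -lam      # d^2/dx^2 log q(x|lam)
    return np.mean(dpsi + 0.5 * psi ** 2)

# Here J can even be minimized in closed form: dJ/dlam = -1 + lam * mean(x^2) = 0
lam_hat = 1.0 / np.mean(x ** 2)
print("estimate:", lam_hat, "  J at estimate:", J(lam_hat))
```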
Interesting result: Closed-form solution in the exponential family

- Assume the pdf can be expressed in the form
  $$\log p(\xi \mid \theta) = \sum_{k=1}^m \theta_k F_k(\xi) - \log Z(\theta) \qquad (7)$$
- Define matrices of partial derivatives:
  $$K_{ki}(\xi) = \frac{\partial F_k}{\partial \xi_i}, \qquad H_{ki}(\xi) = \frac{\partial^2 F_k}{\partial \xi_i^2} \qquad (8)$$
- Then the score matching estimator is given by
  $$\hat\theta = \left[\hat{E}\{K(x)\, K(x)^T\}\right]^{-1} \left(-\sum_i \hat{E}\{h_i(x)\}\right) \qquad (9)$$
  where $\hat{E}$ denotes the sample average, and the vector $h_i$ is the $i$-th column of the matrix $H$.
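A sketch of formula (9) for a hypothetical one-dimensional exponential family of my choosing: $F_1(x) = x$ and $F_2(x) = -x^2/2$, i.e. a Gaussian parametrized by $\theta = (\mu/\sigma^2,\ 1/\sigma^2)$.

```python
import numpy as np

# Closed-form score matching estimator (9) in 1-D with F1(x) = x, F2(x) = -x^2/2.
rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5
x = rng.normal(mu, sigma, size=100_000)
T = x.shape[0]

# K(x) per sample: dF1/dx = 1, dF2/dx = -x
K = np.stack([np.ones(T), -x])          # shape (2, T)
# H(x) per sample: d2F1/dx2 = 0, d2F2/dx2 = -1 (constant here, so E{h} = h)
h = np.array([0.0, -1.0])

E_KKt = K @ K.T / T                     # sample average of K K^T
theta_hat = np.linalg.solve(E_KKt, -h)  # formula (9) with a single index i

print("estimate:", theta_hat)
print("true    :", np.array([mu / sigma**2, 1.0 / sigma**2]))
```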
Extensions of score matching

- Can be extended to non-negative data
  - Basic score matching cannot be directly used because the density is typically not smooth over $\mathbb{R}^n$.
- Can be extended to binary variables
  - However, the utility is questionable because pseudo-likelihood is computationally efficient in that case.
- Can be shown to be equivalent to a special case of contrastive divergence (equal in expectation when using the Langevin MCMC method and an infinitesimal step size)
An information geometry

- Considering $p_x$ fixed, we define a Hilbertian structure in the space of score functions:
  $$\langle p_1, p_2 \rangle = \int p_x(\xi) \left[\sum_{i=1}^n \psi_{1,i}(\xi)\, \psi_{2,i}(\xi)\right] d\xi = \int p_x(\xi)\, \psi_1(\xi)^T \psi_2(\xi)\, d\xi \qquad (10)$$
- The dot-product defines a norm and a distance.
- Score matching is performed by minimization of the distance between $p_x$ and $p(\cdot \mid \theta)$ in this metric.
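A Monte Carlo evaluation of the inner product (10) in one dimension, under assumptions of my choosing: standard normal data $p_x$ and two Gaussian location models. Their scores are $\psi_i(\xi) = -(\xi - \mu_i)$, so the inner product has the closed form $1 + \mu_1 \mu_2$ to check against.

```python
import numpy as np

# <p1, p2> = E_{p_x}[psi_1(x) psi_2(x)] for N(mu1,1) and N(mu2,1) under N(0,1) data.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

def inner(mu1, mu2):
    return np.mean((-(x - mu1)) * (-(x - mu2)))

print(inner(0.5, -1.0), "expected:", 1 + 0.5 * (-1.0))
```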
Pythagorean decomposition for exponential families

- An exponential family is a linear subspace; estimation is an orthogonal projection onto that subspace.
- Pythagorean equality:
  $$\|p_x\|^2 = \mathrm{dist}^2(p(\cdot \mid \hat\theta),\, p_x) + \|p(\cdot \mid \hat\theta)\|^2 \qquad (11)$$
- Can be interpreted in terms of the denoising capability of MAP estimation:
  variance of noise which can be removed by MAP denoising = noise variance not removed due to the imperfect prior + noise variance removed by the prior
- Intuitively, denoising is possible because of structure in the signal, which leads to a more speculative interpretation:
  Structure in data = Structure not modelled + Structure modelled
Interesting point: We can even use improper densities

- Nothing in the method requires the densities to be integrable at all.
- We can use all kinds of functional forms for the densities.
- For example, the density can stay constant at infinity:
  $$p(x; \mu, \sigma) = \left[1 + \exp\left(-\frac{x - \mu}{\sigma}\right)\right]^{-1} \qquad (12)$$
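A sketch of fitting the improper "density" (12) by score matching. The logistic-distributed data and the optimizer are my choices, and I make no claim about what the fitted values should be for this misspecified, non-integrable model; the point is only that the objective (6) is well-defined and finite, because only the score and its derivative enter it.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Score of (12): psi(x) = (1/sigma) * expit(-(x - mu)/sigma), and
# dpsi/dx = -(1/sigma^2) * s * (1 - s) with s = expit(-(x - mu)/sigma).
rng = np.random.default_rng(0)
x = rng.logistic(loc=1.0, scale=0.5, size=50_000)  # hypothetical data

def J(params):
    mu, log_sigma = params          # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    s = expit(-(x - mu) / sigma)
    psi = s / sigma
    dpsi = -s * (1 - s) / sigma**2
    return np.mean(dpsi + 0.5 * psi**2)

res = minimize(J, x0=np.array([0.0, 0.0]))
print("fitted mu, sigma:", res.x[0], np.exp(res.x[1]))
```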
Experiment: overcomplete basis of natural images

- Likelihood:
  $$\log p(x) = \sum_{k=1}^m \alpha_k G(w_k^T x) + Z(w_1, \ldots, w_n, \alpha_1, \ldots, \alpha_n)$$
- Objective function, with $g = G'$:
  $$J = \sum_{k=1}^m \alpha_k \frac{1}{T} \sum_{t=1}^T g'(w_k^T x(t)) + \frac{1}{2} \sum_{j,k=1}^m \alpha_j \alpha_k\, w_j^T w_k\, \frac{1}{T} \sum_{t=1}^T g(w_k^T x(t))\, g(w_j^T x(t)) \qquad (13)$$
- 120 basis vectors estimated from 8 × 8 image patches (no dimension reduction); a sketch of evaluating (13) follows below.
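A sketch of evaluating objective (13) on a batch of data. The nonlinearity $G = -\log\cosh$ (so $g = -\tanh$), the unit-norm constraint on the $w_k$, the random inputs, and the sizes are all assumptions of mine for illustration; the slide itself does not fix them.

```python
import numpy as np

# Objective (13) for log p(x) = sum_k alpha_k G(w_k^T x), assuming G = -log cosh,
# so g = -tanh and g' = tanh^2 - 1.  Shapes: W (m, n) with unit-norm rows,
# X (n, T), alpha (m,).
def J(W, alpha, X):
    S = W @ X                       # (m, T) matrix of w_k^T x(t)
    g = -np.tanh(S)
    gp = np.tanh(S) ** 2 - 1.0
    term1 = np.sum(alpha * gp.mean(axis=1))
    C = (g @ g.T) / X.shape[1]      # C[j, k] = (1/T) sum_t g(w_j^T x(t)) g(w_k^T x(t))
    term2 = 0.5 * np.sum(np.outer(alpha, alpha) * (W @ W.T) * C)
    return term1 + term2

rng = np.random.default_rng(0)
n, m, T = 16, 24, 1000              # toy sizes (the slide uses 8x8 patches, m = 120)
X = rng.standard_normal((n, T))
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
alpha = np.ones(m)
print("J =", J(W, alpha, X))
```

In practice this J would be passed to a gradient-based optimizer over $W$ and $\alpha$; only the evaluation of the objective is shown here.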
Experiment 2: denoising

- $p_0$: several 1-D densities of zero mean.
- Modelled (approximated) by a logistic distribution with a location parameter $\theta$:
  $$\log p(x \mid \theta) = -2 \log \cosh\left(\frac{\pi}{2\sqrt{3}}(x - \theta)\right) - \log \frac{4\sqrt{3}}{\pi}$$
- Gaussian noise added, the parameter estimated by different methods, and MAP inference done (see the sketch below).
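A sketch of the parameter-fitting step of this experiment: score matching estimation of the logistic location $\theta$ via the sample objective (6). The Laplacian "true" data is my stand-in for one of the misspecified densities; the subsequent MAP denoising step is omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Logistic model log p(x|theta) = -2 log cosh(a (x - theta)) + const,
# with a = pi / (2 sqrt(3)).  Only its first and second derivatives enter (6).
a = np.pi / (2 * np.sqrt(3))
rng = np.random.default_rng(0)
x = rng.laplace(loc=-0.2, scale=1.0, size=50_000)  # misspecified "true" data

def J(theta):
    t = np.tanh(a * (x - theta))
    psi = -2 * a * t                 # d/dx log q
    dpsi = -2 * a**2 * (1 - t**2)    # d^2/dx^2 log q
    return np.mean(dpsi + 0.5 * psi**2)

theta_sm = minimize_scalar(J, bounds=(-5, 5), method="bounded").x
print("score matching theta:", theta_sm)
```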
Denoise by MAP

                               gauss mixt 1   gauss mixt 2   chi-square    Laplacian
SM: value of θ̂                     -0.447         -0.961        -0.505       -0.027
ML: value of θ̂                     -0.335         -0.385        -0.225        0.002

noise variance = 0.05
SM: squared error in x̂           0.0451832      0.0437251     0.0475274    0.0458882
ML: squared error in x̂           0.0458262      0.0539467     0.0481536    0.0458603
PP: error in x̂                   0.0409871      0.0232552     0.0445455    0.0458615
p-value of difference                    0              0   1.41838e-08     0.983707
SM: performance index              53.4441        23.4623       45.3304      99.3535
ML: performance index              46.3097        -14.757       33.8515      100.027

noise variance = 0.1
SM: squared error in x̂           0.0872924      0.0862113     0.0905881    0.0884577
ML: squared error in x̂           0.0889856       0.109962     0.0928176    0.0884553
PP: error in x̂                   0.0765966      0.0491676     0.0840644    0.0884539
p-value of difference                    0              0   2.22045e-16     0.534107
SM: performance index               54.298        27.1258       59.0617       99.967
ML: performance index              47.0631       -19.5983        45.071      99.9879

noise variance = 0.2
SM: squared error in x̂            0.157888       0.175383      0.168148     0.163455
ML: squared error in x̂            0.162068       0.214278      0.171554     0.163322
PP: error in x̂                    0.141524       0.129202      0.153341     0.163327
p-value of difference                    0              0   2.00357e-09     0.989493
SM: performance index              72.0148        34.7715       68.2662      99.6507
ML: performance index              64.8668       -20.1669       60.9658      100.014

noise variance = 0.5
SM: squared error in x̂            0.359553       0.495427      0.353038     0.336505
ML: squared error in x̂            0.363577        0.48297      0.344779      0.33623
PP: error in x̂                    0.331788        0.45787      0.307364     0.336233
p-value of difference                    0       0.999999             1     0.984166
SM: performance index              83.4939        10.8552       76.2902      99.8336
Conclusion

- We propose to estimate a parametric model by minimizing the squared distance between the score functions (gradients of the log-density w.r.t. the data variable) of the model density and the data distribution.
- Statistically optimal prior for removing infinitesimal gaussian noise by MAP inference.
- Computationally simple (no integration) for non-normalized densities, yet consistent.
- Closed-form solution in some exponential families.
- Geometric interpretations possible.
- No need for the densities to be integrable at all.