Estimation theory and information geometry based on denoising

Estimation theory and information geometry based on denoising
Aapo Hyvärinen
Dept of Computer Science & HIIT
Dept of Mathematics and Statistics
University of Helsinki, Finland

Abstract
- What is the best prior to be used in denoising by Bayesian inference?
- Consider infinitesimal Gaussian noise
- Assume we can estimate the prior parameters from noise-free data
- Solution: fit the gradient ψ of the log-density with respect to the data variable
  - Minimize the squared distance between the ψ of the data and the ψ of the model
  - Using partial integration, the distance can be computed by a simple formula
- Related problem: estimation of non-normalized models
  - Computationally simple solution provided by the same estimator
  - No need to compute the normalization constant (partition function)
- Leads to a new kind of information geometry

Starting point: Best prior for denoising
- Consider an observed signal y which is a noisy version of an original signal x, which comes from a prior distribution with parameter vector θ. Assume:
  p(y, x | θ) = c exp( −‖y − x‖² / (2σ²) ) p(x|θ)   (1)
- We infer the original signal by MAP inference:
  x̂_MAP(θ̂, y) = argmax_x p(y|x) p(x|θ̂) = argmax_x [ log p(y|x) + log p(x|θ̂) ]
- We estimate the parameters θ from a separate sample of noise-free signals x.
- What is the optimal method of estimating θ? (A single point estimate)
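As a concrete illustration of this setup (an added sketch, not from the talk): with a Laplacian prior p(x|θ) ∝ exp(−|x|/θ) and Gaussian noise of variance σ², the MAP rule has a well-known closed form, soft-thresholding of y at σ²/θ. A minimal Python sketch with placeholder data:

    import numpy as np

    def map_denoise_laplacian(y, sigma2, theta):
        # MAP estimate under Gaussian noise of variance sigma2 and a Laplacian
        # prior p(x|theta) ~ exp(-|x|/theta): soft-thresholding of y at sigma2/theta
        lam = sigma2 / theta
        return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

    rng = np.random.default_rng(0)
    x = rng.laplace(scale=1.0, size=10_000)              # noise-free signals from the prior
    y = x + rng.normal(scale=np.sqrt(0.2), size=x.size)  # noisy observations, sigma^2 = 0.2
    x_hat = map_denoise_laplacian(y, sigma2=0.2, theta=1.0)
    print(np.mean((y - x) ** 2), np.mean((x_hat - x) ** 2))  # squared error before/after denoising

For priors without a closed-form MAP rule the argmax over x is found numerically; the question posed on this slide is how the point estimate θ̂ should be obtained in the first place.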

Difference to classical optimality analysis
- Classical analysis of the optimality of estimators considers errors in the parameter values.
- Here, we consider the error in the restored (denoised) signal (Euclidean distance between x and its MAP estimate).
- These two kinds of error need to be related to each other; cf. collinearity in linear regression.
- Also: to be computationally realistic, we don't use a full Bayesian restoration; instead we take a point estimate of θ and use the MAP estimate.
- We also assume that we can observe noise-free signals from which to estimate the parameters.

Analysis of estimation error
- Assume the signal is corrupted by infinitesimally small Gaussian noise, as above.
- Theorem 1: Assume that all the log-pdfs are differentiable, and that the estimation error Δx = x̂ − x in MAP estimation is small. Then the first-order approximation of the error is
  ‖Δx‖² = σ⁴ [ ‖e₁‖² + ‖e₂‖² ] + smaller terms   (2)
  where e₁ = ψ₀(x) − ψ(x|θ̂) and e₂ = ψ₀(x) + ψ(y|x).
- Note that e₂ does not depend on θ.
- Thus, optimal estimation of θ is by minimization of E_{p_x}{ ‖e₁‖² }.

Definition of the score function (in this talk)
- Define the model score function ψ(·|θ): Rⁿ → Rⁿ as
  ψ(ξ|θ) = ( ∂ log p(ξ|θ)/∂ξ₁, ..., ∂ log p(ξ|θ)/∂ξₙ )ᵀ = ∇_ξ log p(ξ|θ)
- Similarly, define the data score function as ψ_x(ξ) = ∇_ξ log p_x(ξ), where the observed data is assumed to follow p_x(·).
- The optimal estimator is obtained by minimizing a distance between the model score function ψ(·|θ) and the score function of the observed data ψ_x(·):
  J(θ) = ½ ∫_{ξ∈Rⁿ} p_x(ξ) ‖ψ(ξ|θ) − ψ_x(ξ)‖² dξ   (3)
- The estimator is consistent almost by construction.
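To make (3) concrete (an added sketch, not from the talk): if ψ_x happened to be known in closed form, J(θ) could be approximated by a sample average over p_x and minimized directly. Here a Gaussian model score is fitted to unit-Laplacian data, whose score is −sign(ξ); the following slides remove the need to know ψ_x at all.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.laplace(scale=1.0, size=50_000)   # samples from p_x (unit Laplacian)
    psi_x = -np.sign(x)                       # data score, known in closed form in this toy case

    def J(params):
        mu, log_var = params                  # Gaussian model N(mu, var)
        psi_model = -(x - mu) / np.exp(log_var)
        return 0.5 * np.mean((psi_model - psi_x) ** 2)   # sample version of (3)

    res = minimize(J, x0=np.array([0.0, 0.0]))
    print(res.x)   # Gaussian parameters closest to the Laplacian in this score distance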

Related problem: Non-normalized model estimation
- We want to estimate a parametric model of a multivariate random vector x ∈ Rⁿ.
- The density function is known only up to a multiplicative constant:
  p(x|θ) = (1/Z(θ)) q(x|θ),   Z(θ) = ∫_{ξ∈Rⁿ} q(ξ|θ) dξ
- The functional form of q is known (can be easily computed); Z cannot be computed in reasonable computing time.
- Typical application: Markov random fields.

Previous solutions to estimation of non-normalized models
- Monte Carlo methods for estimating Z
  - Consistent estimators (convergence to the real parameter values when sample size → ∞)
  - Computation very slow (I think)
- Various approximations, e.g. variational methods
  - Computation often fast
  - Consistency not known, or proven inconsistent
- Pseudo-likelihood and contrastive divergence
  - Presumably consistent
  - Computations slow with continuous-valued variables: needs 1-D integration at every step, or sophisticated MCMC methods

Score matching can be used for non-normalized models
- No need to compute the normalization constant, because
  ψ(ξ|θ) = ∇_ξ log q(ξ|θ) + ∇_ξ log(1/Z(θ)) = ∇_ξ log q(ξ|θ) + 0   (4)
- In the objective function we have the score function of the data distribution, ψ_x(·). How to compute it?
- In fact, there is no need to compute it, because of the following:
- Theorem 2: Assume some regularity conditions, and smooth densities. Then the score matching objective function J can be expressed as
  J(θ) = ∫_{ξ∈Rⁿ} p_x(ξ) Σ_{i=1}^n [ ∂_i ψ_i(ξ|θ) + ½ ψ_i(ξ|θ)² ] dξ + const.   (5)
  where the constant does not depend on θ, ψ_i(ξ|θ) = ∂ log q(ξ|θ)/∂ξ_i, and ∂_i ψ_i(ξ|θ) = ∂² log q(ξ|θ)/∂ξ_i².
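A worked one-dimensional instance of (5), added here for concreteness and assuming a Gaussian model N(μ, σ²): the score is ψ(ξ|μ,σ²) = −(ξ−μ)/σ² and ∂ψ/∂ξ = −1/σ², so
  J(μ, σ²) = E_{p_x}[ −1/σ² + (ξ−μ)²/(2σ⁴) ] + const.,
which is minimized by μ̂ = E_{p_x}[ξ] and σ̂² = E_{p_x}[(ξ−μ̂)²]. Score matching thus recovers the usual mean and variance without ever touching a normalization constant.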

Simple explanation of the trick
- Consider the objective function J(θ):
  ½ ∫ p_x(ξ) ‖ψ_x(ξ)‖² dξ + ½ ∫ p_x(ξ) ‖ψ(ξ|θ)‖² dξ − ∫ p_x(ξ) ψ_x(ξ)ᵀ ψ(ξ|θ) dξ
- The first term does not depend on θ. The second term is easy to compute.
- The trick is to use partial integration on the third term. In one dimension:
  ∫ p_x(x) (log p_x)′(x) ψ(x|θ) dx = ∫ p_x(x) [p_x′(x)/p_x(x)] ψ(x|θ) dx = ∫ p_x′(x) ψ(x|θ) dx = 0 − ∫ p_x(x) ψ′(x|θ) dx
- This is why the score function of the data distribution p_x(x) disappears!

Final method of score matching
- Replace the integration over the sample density p_x(·) by a sample average.
- Given T observations x(1), ..., x(T), minimize
  J(θ) = (1/T) Σ_{t=1}^T Σ_{i=1}^n [ ∂_i ψ_i(x(t)|θ) + ½ ψ_i(x(t)|θ)² ]   (6)
  where ψ_i is a partial derivative of the non-normalized model log-density log q, and ∂_i ψ_i a second partial derivative.
- Only requires evaluating some derivatives of the non-normalized (log-)density q, which are simple to compute (by assumption).
- Thus: statistical optimality in denoising and computational simplicity for non-normalized models are obtained with the same estimator.
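A minimal Python sketch of (6), added here (not from the talk), for a hypothetical one-dimensional non-normalized model log q(x; a, b) = a x² + b x⁴, so that ψ(x) = 2ax + 4bx³ and ∂ψ/∂x = 2a + 12bx²:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)   # data; the true density N(0,1) corresponds to a = -1/2, b = 0

    def J_sm(params):
        a, b = params
        psi = 2 * a * x + 4 * b * x ** 3       # psi(x) = d/dx log q(x; a, b)
        dpsi = 2 * a + 12 * b * x ** 2         # d psi / dx
        return np.mean(dpsi + 0.5 * psi ** 2)  # sample objective (6)

    res = minimize(J_sm, x0=np.array([-1.0, -0.1]))
    print(res.x)   # should be close to (-0.5, 0); no partition function is needed anywhere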

Interesting result: Closed-form solution in the exponential family
- Assume the pdf can be expressed in the form
  log p(ξ|θ) = Σ_{k=1}^m θ_k F_k(ξ) − log Z(θ)   (7)
- Define matrices of partial derivatives:
  K_{ki}(ξ) = ∂F_k(ξ)/∂ξ_i   and   H_{ki}(ξ) = ∂²F_k(ξ)/∂ξ_i²   (8)
- Then the score matching estimator is given by
  θ̂ = −[ Ê{K(x)K(x)ᵀ} ]⁻¹ ( Σ_i Ê{h_i(x)} )   (9)
  where Ê denotes the sample average, and the vector h_i is the i-th column of the matrix H.
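A small numerical check of (9), added here, for the one-dimensional Gaussian written in exponential-family form with F₁(x) = x and F₂(x) = x², whose natural parameters are (μ/σ², −1/(2σ²)):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=200_000)  # data from N(mu=1, sigma^2=4)

    K = np.stack([np.ones_like(x), 2.0 * x])                 # K_k = dF_k/dx,     shape (m, T)
    H = np.stack([np.zeros_like(x), 2.0 * np.ones_like(x)])  # H_k = d^2F_k/dx^2, shape (m, T)

    A = (K @ K.T) / x.size               # sample average of K K^T
    h = H.mean(axis=1)                   # sum over i of the sample averages of h_i (a single i here)
    theta_hat = -np.linalg.solve(A, h)   # formula (9)
    print(theta_hat)                     # should be close to (mu/sigma^2, -1/(2 sigma^2)) = (0.25, -0.125)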

Extensions of score matching
- Can be extended to non-negative data
  - Basic score matching cannot be used directly because the density is typically not smooth over Rⁿ.
- Can be extended to binary variables
  - However, the utility is questionable because pseudolikelihood is computationally efficient in that case.
- Can be shown to be equivalent to a special case of contrastive divergence (equal in expectation when using the Langevin MCMC method and an infinitesimal step size)

An information geometry
- Considering p_x fixed, we define a Hilbertian structure in the space of score functions:
  ⟨p₁, p₂⟩ = ∫ p_x(ξ) [ Σ_{i=1}^n ψ_{1,i}(ξ) ψ_{2,i}(ξ) ] dξ = ∫ p_x(ξ) ψ₁(ξ)ᵀ ψ₂(ξ) dξ   (10)
- The dot product defines a norm and a distance.
- Score matching is performed by minimization of the distance between p_x and p(·|θ) in this metric.
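An added note: given data from p_x, the inner product (10) is simply a sample average of ψ₁(x)ᵀψ₂(x). A tiny sketch with two hypothetical one-dimensional score functions:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)            # samples from p_x (standard normal here)

    psi1 = lambda xi: -xi                   # score of N(0, 1)
    psi2 = lambda xi: -np.tanh(xi)          # score of a density proportional to 1/cosh(x)

    inner = np.mean(psi1(x) * psi2(x))      # Monte Carlo estimate of <p1, p2> in (10)
    norm1 = np.sqrt(np.mean(psi1(x) ** 2))  # induced norm of p1
    print(inner, norm1)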

Pythagorean decomposition for exponential families
- The exponential family is a linear subspace.
- Estimation is an orthogonal projection onto that subspace.
- Pythagorean equality:
  ‖p_x‖² = dist²( p(·|θ̂), p_x ) + ‖p(·|θ̂)‖²   (11)
- Can be interpreted in terms of the denoising capability of MAP estimation:
  variance of noise which can be removed by MAP denoising
  = noise variance not removed due to imperfect prior + noise variance removed by the prior
- Intuitively, denoising is possible because of structure in the signal, which leads to a more speculative interpretation:
  Structure in data = Structure not modelled + Structure modelled

Interesting point: We can even use improper densities
- Nothing in the method requires the densities to be integrable at all.
- We can use all kinds of functional forms for the densities.
- For example, the density can stay constant at infinity:
  p(x; μ, σ) = [ 1 + exp( −(x − μ)/σ ) ]⁻¹   (12)
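For instance (an added detail), the score function of (12) is
  ψ(x; μ, σ) = ∂ log p(x; μ, σ)/∂x = (1/σ) [ 1 + exp( (x − μ)/σ ) ]⁻¹,
which is bounded and smooth (it tends to 1/σ as x → −∞ and to 0 as x → +∞), so it can be plugged into the objective (6) exactly like the score of a proper density.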

Experiment: overcomplete basis of natural images
- Likelihood:
  log p(x) = Σ_{k=1}^m α_k g(w_kᵀ x) + Z(w₁, ..., w_m, α₁, ..., α_m)
- Objective function:
  J = Σ_{k=1}^m α_k ‖w_k‖² (1/T) Σ_{t=1}^T g″(w_kᵀ x(t)) + ½ Σ_{j,k=1}^m α_j α_k w_jᵀ w_k (1/T) Σ_{t=1}^T g′(w_kᵀ x(t)) g′(w_jᵀ x(t))   (13)
- 120 basis vectors estimated from 8×8 image patches (no dimension reduction)
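A minimal Python sketch of (13), added here (not from the slides); the nonlinearity g(u) = −log cosh(u) and the random data and parameters are placeholder assumptions:

    import numpy as np

    def sm_objective(W, alpha, X):
        # Sample version of (13). W: (m, n) rows w_k, alpha: (m,), X: (n, T) data.
        # With g(u) = -log cosh(u): g'(u) = -tanh(u), g''(u) = tanh(u)^2 - 1.
        T = X.shape[1]
        U = W @ X                            # u_{kt} = w_k^T x(t)
        g1 = -np.tanh(U)                     # g'
        g2 = np.tanh(U) ** 2 - 1.0           # g''
        term1 = np.sum(alpha * (W ** 2).sum(axis=1) * g2.mean(axis=1))
        C = (g1 @ g1.T) / T                  # (1/T) sum_t g'(w_j^T x(t)) g'(w_k^T x(t))
        term2 = 0.5 * (alpha @ ((W @ W.T) * C) @ alpha)
        return term1 + term2

    rng = np.random.default_rng(0)
    n, m, T = 64, 120, 5000                  # 8x8 patches, 120 basis vectors (as on the slide)
    X = rng.normal(size=(n, T))              # placeholder for whitened image patches
    W = rng.normal(size=(m, n))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    alpha = np.ones(m)
    print(sm_objective(W, alpha, X))         # value to be minimized with respect to W and alpha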

Experiment 2: denoising
- p₀: several 1-D densities of zero mean
- Modelled (approximated) by a logistic distribution with a location parameter θ:
  log p(x|θ) = −2 log cosh( (π/(2√3)) (x − θ) ) − log(4√3/π)
- Gaussian noise is added, the parameter is estimated by the different methods, and MAP inference is performed.
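A minimal Python sketch of this estimation step, added here (not from the talk); the chi-square sample is only a placeholder for the noise-free training data:

    import numpy as np
    from scipy.optimize import minimize_scalar

    a = np.pi / (2.0 * np.sqrt(3.0))        # scale of the unit-variance logistic model

    def J_sm(theta, x):
        # sample objective (6) for the logistic location model
        t = np.tanh(a * (x - theta))
        psi = -2.0 * a * t                  # d/dx log p(x|theta)
        dpsi = -2.0 * a ** 2 * (1.0 - t ** 2)
        return np.mean(dpsi + 0.5 * psi ** 2)

    rng = np.random.default_rng(0)
    data = rng.chisquare(df=4, size=20_000)  # placeholder noise-free sample
    data = data - data.mean()                # zero mean, as on the slide
    theta_sm = minimize_scalar(lambda th: J_sm(th, data), bounds=(-5.0, 5.0), method="bounded").x
    print(theta_sm)                          # location parameter fitted by score matching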

Denoise by MAP

                             gauss mixt 1   gauss mixt 2   chi square    Laplacian
SM: value of θ̂                  -0.447         -0.961        -0.505        -0.027
ML: value of θ̂                  -0.335         -0.385        -0.225         0.002

noise variance = 0.05
SM: squared error in x̂        0.0451832      0.0437251     0.0475274     0.0458882
ML: squared error in x̂        0.0458262      0.0539467     0.0481536     0.0458603
PP: error in x                0.0409871      0.0232552     0.0445455     0.0458615
p-value of difference         0              0             1.41838e-08   0.983707
SM: performance index         53.4441        23.4623       45.3304       99.3535
ML: performance index         46.3097        -14.757       33.8515       100.027

noise variance = 0.1
SM: squared error in x̂        0.0872924      0.0862113     0.0905881     0.0884577
ML: squared error in x̂        0.0889856      0.109962      0.0928176     0.0884553
PP: error in x                0.0765966      0.0491676     0.0840644     0.0884539
p-value of difference         0              0             2.22045e-16   0.534107
SM: performance index         54.298         27.1258       59.0617       99.967
ML: performance index         47.0631        -19.5983      45.071        99.9879

noise variance = 0.2
SM: squared error in x̂        0.157888       0.175383      0.168148      0.163455
ML: squared error in x̂        0.162068       0.214278      0.171554      0.163322
PP: error in x                0.141524       0.129202      0.153341      0.163327
p-value of difference         0              0             2.00357e-09   0.989493
SM: performance index         72.0148        34.7715       68.2662       99.6507
ML: performance index         64.8668        -20.1669      60.9658       100.014

noise variance = 0.5
SM: squared error in x̂        0.359553       0.495427      0.353038      0.336505
ML: squared error in x̂        0.363577       0.48297       0.344779      0.33623
PP: error in x                0.331788       0.45787       0.307364      0.336233
p-value of difference         0              0.999999      1             0.984166
SM: performance index         83.4939        10.8552       76.2902       99.8336

Conclusion
- We propose to estimate a parametric model by minimizing the squared distance between the score functions (gradients of the log-density w.r.t. the data variable) of the model density and the data distribution.
- Statistically optimal prior for removing infinitesimal Gaussian noise by MAP inference.
- Computationally simple (no integration) for non-normalized densities, yet consistent.
- Closed-form solution in some exponential families.
- Geometric interpretations possible.
- No need for the densities to be integrable at all.