Estimating Unnormalised Models by Score Matching

Estimating Unnormalised Models by Score Matching Michael Gutmann Probabilistic Modelling and Reasoning (INFR11134) School of Informatics, University of Edinburgh Spring semester 2018

Program
1. Basics of score matching
2. Practical objective function for score matching

Program
1. Basics of score matching
   - Basic ideas of score matching
   - Objective function that captures the basic ideas but cannot be computed
2. Practical objective function for score matching

Problem formulation
We want to estimate the parameters θ of a parametric statistical model for a random vector x ∈ R^d.
Given: iid data x_1, ..., x_n, assumed to be observations of x, which has pdf p∗.
Further notation: p(ξ; θ) is the model pdf; ξ ∈ R^d is a dummy variable.
Assumptions:
- The model p(ξ; θ) is known only up to the partition function Z(θ):
  p(ξ; θ) = p̃(ξ; θ) / Z(θ),   Z(θ) = ∫ p̃(ξ; θ) dξ
- The functional form of p̃ is known (it can be easily computed).
- The partition function Z(θ) cannot be computed analytically in closed form, and numerical approximation is expensive.
Goal: estimate the model without approximating the partition function Z(θ).
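To make the setting concrete, here is a minimal, hypothetical Python sketch (not from the slides): a one-dimensional unnormalised model whose partition function has no elementary closed form, with a brute-force grid integration standing in for the expensive numerical approximation mentioned above.

```python
import numpy as np

# Hypothetical illustration (not from the slides): an unnormalised 1-d model
# p_tilde(xi; theta) = exp(-theta1*xi**2 - theta2*xi**4). Evaluating p_tilde
# is cheap, but Z(theta) has no elementary closed form.
def p_tilde(xi, theta1, theta2):
    return np.exp(-theta1 * xi**2 - theta2 * xi**4)

# Brute-force stand-in for Z(theta); only feasible here because d = 1.
xi = np.linspace(-10.0, 10.0, 200_001)
Z = np.sum(p_tilde(xi, 1.0, 0.5)) * (xi[1] - xi[0])
print(Z)  # p(xi; theta) = p_tilde(xi; theta) / Z
```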

Basic ideas of score matching
Maximum likelihood estimation can be viewed as finding parameter values θ̂ so that p(ξ; θ̂) ≈ p∗(ξ), or equivalently log p(ξ; θ̂) ≈ log p∗(ξ) (as measured by the Kullback-Leibler divergence, see Barber 8.7).
Instead of estimating the parameters θ by matching the (log) densities, score matching identifies parameter values θ̂ for which the derivatives (slopes) of the log densities match:
  ∇_ξ log p(ξ; θ̂) ≈ ∇_ξ log p∗(ξ)
The key point is that ∇_ξ log p(ξ; θ) does not depend on the partition function:
  ∇_ξ log p(ξ; θ) = ∇_ξ [log p̃(ξ; θ) − log Z(θ)] = ∇_ξ log p̃(ξ; θ)

The score function (in the context of score matching)
Define the model score function ψ(·; θ): R^d → R^d as
  ψ(ξ; θ) = ( ∂ log p(ξ; θ)/∂ξ_1, ..., ∂ log p(ξ; θ)/∂ξ_d )^T = ∇_ξ log p(ξ; θ)
While defined in terms of p(ξ; θ), we also have ψ(ξ; θ) = ∇_ξ log p̃(ξ; θ).
Similarly, define the data score function as
  ψ∗(ξ) = ∇_ξ log p∗(ξ)
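As a quick sanity check of the claim that ψ(ξ; θ) = ∇_ξ log p̃(ξ; θ), the following sketch (my own illustration, assuming a one-dimensional Gaussian-shaped model where Z(θ) happens to be known) compares a finite-difference gradient of the normalised log-density with the score computed from the unnormalised one.

```python
import numpy as np

# Sketch; assumes p_tilde(xi; theta) = exp(-theta * xi**2 / 2), theta > 0.
def log_p_tilde(xi, theta):
    return -0.5 * theta * xi**2                 # unnormalised log-density

def psi_model(xi, theta):
    return -theta * xi                          # model score = d/dxi log_p_tilde

theta, xi, eps = 2.0, 0.7, 1e-6
log_Z = 0.5 * np.log(2.0 * np.pi / theta)       # known here; intractable in general
log_p = lambda x: log_p_tilde(x, theta) - log_Z # normalised log-density

# Finite-difference derivative of the *normalised* log-density.
fd_score = (log_p(xi + eps) - log_p(xi - eps)) / (2 * eps)
print(np.isclose(fd_score, psi_model(xi, theta)))  # True: Z(theta) drops out
```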

Definition of the SM objective function
Estimate θ by minimising a distance between the model score function ψ(ξ; θ) and the score function of the observed data ψ∗(ξ):
  J_sm(θ) = (1/2) ∫_{ξ ∈ R^d} p∗(ξ) ‖ψ(ξ; θ) − ψ∗(ξ)‖² dξ = (1/2) E ‖ψ(x; θ) − ψ∗(x)‖²,   x ∼ p∗
Since ψ(ξ; θ) = ∇_ξ log p̃(ξ; θ) does not depend on Z(θ), there is no need to compute the partition function. Knowing the unnormalised model p̃(ξ; θ) is enough.
The expectation E with respect to p∗ can be approximated as a sample average over the observed data, but what about ψ∗?
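The definitional objective can only be evaluated when ψ∗ is known. The sketch below (my own illustration, not part of the slides) assumes Gaussian data with known precision so that ψ∗ is available in closed form, replaces the expectation by a sample average, and shows that J_sm is minimised near the true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0                                    # assumed true precision
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=10_000)

psi_model = lambda x, theta: -theta * x             # score of p_tilde = exp(-theta*x^2/2)
psi_star = lambda x: -theta_true * x                # data score, known only in this toy case

def J_sm(theta):
    # (1/2) E || psi(x; theta) - psi_*(x) ||^2, expectation -> sample average
    return 0.5 * np.mean((psi_model(x, theta) - psi_star(x)) ** 2)

thetas = np.linspace(0.5, 6.0, 500)
print(thetas[np.argmin([J_sm(t) for t in thetas])])  # close to theta_true = 3
```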

Program
1. Basics of score matching
   - Basic ideas of score matching
   - Objective function that captures the basic ideas but cannot be computed
2. Practical objective function for score matching

Program
1. Basics of score matching
2. Practical objective function for score matching
   - Integration by parts to obtain a computable objective function
   - Simple example

Reformulation of the SM objective function
In the objective function we have the score function of the data distribution, ψ∗. How can we compute it?
In fact, there is no need to compute it, because the score matching objective function J_sm can be expressed as
  J_sm(θ) = E Σ_{j=1}^d [ ∂_j ψ_j(x; θ) + (1/2) ψ_j(x; θ)² ] + const
where the constant does not depend on θ, and
  ψ_j(ξ; θ) = ∂ log p̃(ξ; θ) / ∂ξ_j,   ∂_j ψ_j(ξ; θ) = ∂² log p̃(ξ; θ) / ∂ξ_j²
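Continuing the toy Gaussian example (my own hedged illustration), the sketch below evaluates both the definitional objective and the reformulated, computable objective on the same sample; in the population limit they differ by a θ-independent constant, and with a large sample the gap is approximately constant in θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=100_000)

psi = lambda x, theta: -theta * x                   # psi_1(x; theta)
dpsi = lambda x, theta: -theta * np.ones_like(x)    # d psi_1 / d x

J_def = lambda theta: 0.5 * np.mean((psi(x, theta) - (-theta_true * x)) ** 2)
J_comp = lambda theta: np.mean(dpsi(x, theta) + 0.5 * psi(x, theta) ** 2)

# The gap should be roughly the same constant for every theta
# (exactly constant in the population limit; small Monte Carlo error here).
for t in (1.0, 2.0, 3.0, 4.0):
    print(t, J_def(t) - J_comp(t))
```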

Proof (general idea)
Use the Euclidean distance and expand the objective function J_sm:
  J_sm(θ) = (1/2) E ‖ψ(x; θ) − ψ∗(x)‖²
          = (1/2) E ‖ψ(x; θ)‖² − E[ψ(x; θ)^T ψ∗(x)] + (1/2) E ‖ψ∗(x)‖²
          = (1/2) E ‖ψ(x; θ)‖² − Σ_{j=1}^d E[ψ_j(x; θ) ψ∗,j(x)] + const
The first term does not depend on ψ∗. The ψ_j and ψ∗,j are the j-th elements of the vectors ψ and ψ∗, respectively. The constant does not depend on θ.
The trick is to use integration by parts on the second term to obtain an objective function that does not involve ψ∗.

Proof (not examinable)
  E[ψ_j(x; θ) ψ∗,j(x)] = ∫ p∗(ξ) ψ∗,j(ξ) ψ_j(ξ; θ) dξ
                       = ∫ p∗(ξ) (∂ log p∗(ξ)/∂ξ_j) ψ_j(ξ; θ) dξ
                       = ∫_{ξ_k, k≠j} ( ∫_{ξ_j} p∗(ξ) (∂ log p∗(ξ)/∂ξ_j) ψ_j(ξ; θ) dξ_j ) dξ_k
                       = ∫_{ξ_k, k≠j} ( ∫_{ξ_j} (∂p∗(ξ)/∂ξ_j) ψ_j(ξ; θ) dξ_j ) dξ_k
Use integration by parts for the inner integral:
  ∫_{ξ_j} (∂p∗(ξ)/∂ξ_j) ψ_j(ξ; θ) dξ_j = [p∗(ξ) ψ_j(ξ; θ)]_{a_j}^{b_j} − ∫_{ξ_j} p∗(ξ) (∂ψ_j(ξ; θ)/∂ξ_j) dξ_j
where a_j and b_j specify the boundaries of the data pdf p∗ along dimension j, and where we assume that [p∗(ξ) ψ_j(ξ; θ)]_{a_j}^{b_j} = 0.

Proof (not examinable)
If [p∗(ξ) ψ_j(ξ; θ)]_{a_j}^{b_j} = 0, then
  E[ψ_j(x; θ) ψ∗,j(x)] = − ∫_{ξ_k, k≠j} ( ∫_{ξ_j} p∗(ξ) (∂ψ_j(ξ; θ)/∂ξ_j) dξ_j ) dξ_k
                       = − ∫ p∗(ξ) (∂ψ_j(ξ; θ)/∂ξ_j) dξ
                       = − E[∂_j ψ_j(x; θ)]
so that
  J_sm(θ) = (1/2) E ‖ψ(x; θ)‖² + Σ_{j=1}^d E[∂_j ψ_j(x; θ)] + const
          = E Σ_{j=1}^d [ ∂_j ψ_j(x; θ) + (1/2) ψ_j(x; θ)² ] + const
Replacing the expectation (integration over the data density p∗) by a sample average over the observed data gives a computable objective function for score matching.
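The integration-by-parts identity can be checked numerically on a toy case. The sketch below (my own illustration, assuming Gaussian data so that ψ∗ is known) verifies by Monte Carlo that E[ψ_j(x; θ) ψ∗,j(x)] ≈ −E[∂_j ψ_j(x; θ)] for the one-dimensional model p̃(ξ; θ) = exp(−θξ²/2).

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, theta, n = 3.0, 1.7, 1_000_000
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=n)

psi = -theta * x                 # model score psi_1(x; theta)
dpsi = -theta * np.ones(n)       # d psi_1 / d x
psi_star = -theta_true * x       # data score (known here because p_* is Gaussian)

lhs = np.mean(psi * psi_star)    # E[psi_j(x; theta) * psi_{*,j}(x)]
rhs = -np.mean(dpsi)             # -E[d_j psi_j(x; theta)]
print(lhs, rhs)                  # both approximately equal to theta = 1.7
```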

Final method of score matching
Given iid data x_1, ..., x_n, the score matching estimate is
  θ̂ = argmin_θ J(θ),   J(θ) = (1/n) Σ_{i=1}^n Σ_{j=1}^d [ ∂_j ψ_j(x_i; θ) + (1/2) ψ_j(x_i; θ)² ]
ψ_j is the partial derivative of the log unnormalised model log p̃ with respect to the j-th coordinate (slope), and ∂_j ψ_j is its second partial derivative (curvature).
Parameter estimation with intractable partition functions, without approximating the partition function.
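Here is a minimal end-to-end sketch of the final method (my own illustration, not the lecture's code): the practical objective J(θ) is formed from the slopes and curvatures of the log unnormalised model at the observed data points and minimised by a simple grid search; the model family and true precision below are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 3.0
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=5_000)   # observed iid data

# Unnormalised model: p_tilde(xi; theta) = exp(-theta * xi**2 / 2), d = 1.
def psi(xi, theta):          # slope:  d/dxi log p_tilde
    return -theta * xi

def dpsi(xi, theta):         # curvature:  d^2/dxi^2 log p_tilde
    return -theta * np.ones_like(xi)

def J(theta):                # sample-average score matching objective
    return np.mean(dpsi(x, theta) + 0.5 * psi(x, theta) ** 2)

thetas = np.linspace(0.1, 10.0, 2_000)
theta_hat = thetas[np.argmin([J(t) for t in thetas])]
print(theta_hat)             # close to theta_true; Z(theta) never appears
```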

Requirements
  J(θ) = (1/n) Σ_{i=1}^n Σ_{j=1}^d [ ∂_j ψ_j(x_i; θ) + (1/2) ψ_j(x_i; θ)² ]
Requirements:
- technical (from the proof): [p∗(ξ) ψ_j(ξ; θ)]_{a_j}^{b_j} = 0, where a_j and b_j specify the boundaries of the data pdf p∗ along dimension j
- smoothness: the second derivatives of log p̃(ξ; θ) with respect to the ξ_j need to exist, and should be smooth with respect to θ so that J(θ) can be optimised with gradient-based methods

Simple example
Model: p̃(ξ; θ) = exp(−θξ²/2), where the parameter θ > 0 is the precision.
The slope and curvature of the log unnormalised model are
  ψ(ξ; θ) = ∂_ξ log p̃(ξ; θ) = −θξ,   ∂_ξ ψ(ξ; θ) = −θ
If p∗ is Gaussian, lim_{ξ→±∞} p∗(ξ) ψ(ξ; θ) = 0 for all θ, so the technical requirement holds.
Score matching objective and estimate:
  J(θ) = −θ + (1/2) θ² (1/n) Σ_{i=1}^n x_i²,   θ̂ = ( (1/n) Σ_{i=1}^n x_i² )^{−1}
For Gaussians, this is the same as the MLE.
[Figure: negative score matching objective as a function of the precision, shown together with the term involving the score function derivative and the term involving the squared score function.]
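A quick numerical check of the closed-form estimate in the example (my own sketch): θ̂ = ((1/n) Σ x_i²)^{−1} coincides with the maximum likelihood estimate of the precision of a zero-mean Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = 2.5
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=20_000)

theta_sm = 1.0 / np.mean(x**2)     # score matching estimate from the slide
theta_mle = len(x) / np.sum(x**2)  # MLE of the precision of a zero-mean Gaussian
print(theta_sm, theta_mle)         # identical values; both close to theta_true
```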

Extensions
Score matching as presented here only works for x ∈ R^d.
There are extensions for discrete and non-negative random variables (not examinable): https://www.cs.helsinki.fi/u/ahyvarin/papers/csda07.pdf
Score matching can be shown to be part of a general framework for estimating unnormalised models (not examinable): https://michaelgutmann.github.io/assets/papers/gutmann2011b.pdf
Overall message: in some situations, learning criteria other than the likelihood are preferable.

Program recap
1. Basics of score matching
   - Basic ideas of score matching
   - Objective function that captures the basic ideas but cannot be computed
2. Practical objective function for score matching
   - Integration by parts to obtain a computable objective function
   - Simple example