Estimating Unnormalised Models by Score Matching

Estimating Unnormalised Models by Score Matching Michael Gutmann Probabilistic Modelling and Reasoning (INFR11134) School of Informatics, University of Edinburgh Spring semester 2018

Program
1. Basics of score matching
2. Practical objective function for score matching

Program
1. Basics of score matching
   - Basic ideas of score matching
   - Objective function that captures the basic ideas but cannot be computed
2. Practical objective function for score matching

Problem formulation
We want to estimate the parameters θ of a parametric statistical model for a random vector x ∈ R^d.
Given: iid data x_1, ..., x_n, assumed to be observations of x, which has pdf p∗.
Further notation: p(ξ; θ) is the model pdf; ξ ∈ R^d is a dummy variable.
Assumptions:
- The model p(ξ; θ) is known only up to the partition function Z(θ):
  p(ξ; θ) = p̃(ξ; θ) / Z(θ),   Z(θ) = ∫ p̃(ξ; θ) dξ
- The functional form of p̃ is known (it can be easily computed).
- The partition function Z(θ) cannot be computed analytically in closed form, and numerical approximation is expensive.
Goal: estimate the model without approximating the partition function Z(θ).
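To make the setting concrete, here is a minimal, hypothetical Python sketch (not from the slides): a one-dimensional unnormalised model whose partition function has no elementary closed form, with a brute-force grid integration standing in for the expensive numerical approximation mentioned above.

```python
import numpy as np

# Hypothetical illustration (not from the slides): an unnormalised 1-d model
# p_tilde(xi; theta) = exp(-theta1*xi**2 - theta2*xi**4). Evaluating p_tilde
# is cheap, but Z(theta) has no elementary closed form.
def p_tilde(xi, theta1, theta2):
    return np.exp(-theta1 * xi**2 - theta2 * xi**4)

# Brute-force stand-in for Z(theta); only feasible here because d = 1.
xi = np.linspace(-10.0, 10.0, 200_001)
Z = np.sum(p_tilde(xi, 1.0, 0.5)) * (xi[1] - xi[0])
print(Z)  # p(xi; theta) = p_tilde(xi; theta) / Z
```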

Basic ideas of score matching
Maximum likelihood estimation can be viewed as finding parameter values θ̂ so that p(ξ; θ̂) ≈ p∗(ξ), or equivalently log p(ξ; θ̂) ≈ log p∗(ξ) (as measured by the Kullback-Leibler divergence, see Barber 8.7).
Instead of estimating the parameters θ by matching the (log) densities, score matching identifies parameter values θ̂ for which the derivatives (slopes) of the log densities match:
  ∇_ξ log p(ξ; θ̂) ≈ ∇_ξ log p∗(ξ)
The key point is that ∇_ξ log p(ξ; θ) does not depend on the partition function:
  ∇_ξ log p(ξ; θ) = ∇_ξ [log p̃(ξ; θ) − log Z(θ)] = ∇_ξ log p̃(ξ; θ)

The score function (in the context of score matching)
Define the model score function ψ(·; θ): R^d → R^d as
  ψ(ξ; θ) = ( ∂ log p(ξ; θ)/∂ξ_1, ..., ∂ log p(ξ; θ)/∂ξ_d )^T = ∇_ξ log p(ξ; θ)
While defined in terms of p(ξ; θ), we also have ψ(ξ; θ) = ∇_ξ log p̃(ξ; θ).
Similarly, define the data score function as
  ψ∗(ξ) = ∇_ξ log p∗(ξ)
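As a quick sanity check of the claim that ψ(ξ; θ) = ∇_ξ log p̃(ξ; θ), the following sketch (my own illustration, assuming a one-dimensional Gaussian-shaped model where Z(θ) happens to be known) compares a finite-difference gradient of the normalised log-density with the score computed from the unnormalised one.

```python
import numpy as np

# Sketch; assumes p_tilde(xi; theta) = exp(-theta * xi**2 / 2), theta > 0.
def log_p_tilde(xi, theta):
    return -0.5 * theta * xi**2                 # unnormalised log-density

def psi_model(xi, theta):
    return -theta * xi                          # model score = d/dxi log_p_tilde

theta, xi, eps = 2.0, 0.7, 1e-6
log_Z = 0.5 * np.log(2.0 * np.pi / theta)       # known here; intractable in general
log_p = lambda x: log_p_tilde(x, theta) - log_Z # normalised log-density

# Finite-difference derivative of the *normalised* log-density.
fd_score = (log_p(xi + eps) - log_p(xi - eps)) / (2 * eps)
print(np.isclose(fd_score, psi_model(xi, theta)))  # True: Z(theta) drops out
```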

Definition of the SM objective function
Estimate θ by minimising a distance between the model score function ψ(ξ; θ) and the score function of the observed data ψ∗(ξ):
  J_sm(θ) = (1/2) ∫_{ξ ∈ R^d} p∗(ξ) ‖ψ(ξ; θ) − ψ∗(ξ)‖² dξ = (1/2) E ‖ψ(x; θ) − ψ∗(x)‖²,   x ∼ p∗
Since ψ(ξ; θ) = ∇_ξ log p̃(ξ; θ) does not depend on Z(θ), there is no need to compute the partition function. Knowing the unnormalised model p̃(ξ; θ) is enough.
The expectation E with respect to p∗ can be approximated as a sample average over the observed data, but what about ψ∗?
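The definitional objective can only be evaluated when ψ∗ is known. The sketch below (my own illustration, not part of the slides) assumes Gaussian data with known precision so that ψ∗ is available in closed form, replaces the expectation by a sample average, and shows that J_sm is minimised near the true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0                                    # assumed true precision
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=10_000)

psi_model = lambda x, theta: -theta * x             # score of p_tilde = exp(-theta*x^2/2)
psi_star = lambda x: -theta_true * x                # data score, known only in this toy case

def J_sm(theta):
    # (1/2) E || psi(x; theta) - psi_*(x) ||^2, expectation -> sample average
    return 0.5 * np.mean((psi_model(x, theta) - psi_star(x)) ** 2)

thetas = np.linspace(0.5, 6.0, 500)
print(thetas[np.argmin([J_sm(t) for t in thetas])])  # close to theta_true = 3
```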

Program
1. Basics of score matching
   - Basic ideas of score matching
   - Objective function that captures the basic ideas but cannot be computed
2. Practical objective function for score matching

Program
1. Basics of score matching
2. Practical objective function for score matching
   - Integration by parts to obtain a computable objective function
   - Simple example

Reformulation of the SM objective function
In the objective function we have the score function of the data distribution, ψ∗. How can we compute it?
In fact, there is no need to compute it, because the score matching objective function J_sm can be expressed as
  J_sm(θ) = E Σ_{j=1}^d [ ∂_j ψ_j(x; θ) + (1/2) ψ_j(x; θ)² ] + const
where the constant does not depend on θ, and
  ψ_j(ξ; θ) = ∂ log p̃(ξ; θ) / ∂ξ_j,   ∂_j ψ_j(ξ; θ) = ∂² log p̃(ξ; θ) / ∂ξ_j²
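Continuing the toy Gaussian example (my own hedged illustration), the sketch below evaluates both the definitional objective and the reformulated, computable objective on the same sample; in the population limit they differ by a θ-independent constant, and with a large sample the gap is approximately constant in θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=100_000)

psi = lambda x, theta: -theta * x                   # psi_1(x; theta)
dpsi = lambda x, theta: -theta * np.ones_like(x)    # d psi_1 / d x

J_def = lambda theta: 0.5 * np.mean((psi(x, theta) - (-theta_true * x)) ** 2)
J_comp = lambda theta: np.mean(dpsi(x, theta) + 0.5 * psi(x, theta) ** 2)

# The gap should be roughly the same constant for every theta
# (exactly constant in the population limit; small Monte Carlo error here).
for t in (1.0, 2.0, 3.0, 4.0):
    print(t, J_def(t) - J_comp(t))
```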

Proof (general idea)
Use the Euclidean distance and expand the objective function J_sm:
  J_sm(θ) = (1/2) E ‖ψ(x; θ) − ψ∗(x)‖²
          = (1/2) E ‖ψ(x; θ)‖² − E[ψ(x; θ)^T ψ∗(x)] + (1/2) E ‖ψ∗(x)‖²
          = (1/2) E ‖ψ(x; θ)‖² − Σ_{j=1}^d E[ψ_j(x; θ) ψ∗,j(x)] + const
The first term does not depend on ψ∗. The ψ_j and ψ∗,j are the j-th elements of the vectors ψ and ψ∗, respectively. The constant does not depend on θ.
The trick is to use integration by parts on the second term to obtain an objective function that does not involve ψ∗.

Proof (not examinable)
  E[ψ_j(x; θ) ψ∗,j(x)] = ∫ p∗(ξ) ψ∗,j(ξ) ψ_j(ξ; θ) dξ
                       = ∫ p∗(ξ) (∂ log p∗(ξ)/∂ξ_j) ψ_j(ξ; θ) dξ
                       = ∫_{ξ_k, k≠j} ( ∫_{ξ_j} p∗(ξ) (∂ log p∗(ξ)/∂ξ_j) ψ_j(ξ; θ) dξ_j ) dξ_k
                       = ∫_{ξ_k, k≠j} ( ∫_{ξ_j} (∂p∗(ξ)/∂ξ_j) ψ_j(ξ; θ) dξ_j ) dξ_k
Use integration by parts for the inner integral:
  ∫_{ξ_j} (∂p∗(ξ)/∂ξ_j) ψ_j(ξ; θ) dξ_j = [p∗(ξ) ψ_j(ξ; θ)]_{a_j}^{b_j} − ∫_{ξ_j} p∗(ξ) (∂ψ_j(ξ; θ)/∂ξ_j) dξ_j
where a_j and b_j specify the boundaries of the data pdf p∗ along dimension j, and where we assume that [p∗(ξ) ψ_j(ξ; θ)]_{a_j}^{b_j} = 0.

Proof (not examinable)
If [p∗(ξ) ψ_j(ξ; θ)]_{a_j}^{b_j} = 0, then
  E[ψ_j(x; θ) ψ∗,j(x)] = − ∫_{ξ_k, k≠j} ( ∫_{ξ_j} p∗(ξ) (∂ψ_j(ξ; θ)/∂ξ_j) dξ_j ) dξ_k
                       = − ∫ p∗(ξ) (∂ψ_j(ξ; θ)/∂ξ_j) dξ
                       = − E[∂_j ψ_j(x; θ)]
so that
  J_sm(θ) = (1/2) E ‖ψ(x; θ)‖² + Σ_{j=1}^d E[∂_j ψ_j(x; θ)] + const
          = E Σ_{j=1}^d [ ∂_j ψ_j(x; θ) + (1/2) ψ_j(x; θ)² ] + const
Replacing the expectation (integration over the data density p∗) by a sample average over the observed data gives a computable objective function for score matching.
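The integration-by-parts identity can be checked numerically on a toy case. The sketch below (my own illustration, assuming Gaussian data so that ψ∗ is known) verifies by Monte Carlo that E[ψ_j(x; θ) ψ∗,j(x)] ≈ −E[∂_j ψ_j(x; θ)] for the one-dimensional model p̃(ξ; θ) = exp(−θξ²/2).

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, theta, n = 3.0, 1.7, 1_000_000
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=n)

psi = -theta * x                 # model score psi_1(x; theta)
dpsi = -theta * np.ones(n)       # d psi_1 / d x
psi_star = -theta_true * x       # data score (known here because p_* is Gaussian)

lhs = np.mean(psi * psi_star)    # E[psi_j(x; theta) * psi_{*,j}(x)]
rhs = -np.mean(dpsi)             # -E[d_j psi_j(x; theta)]
print(lhs, rhs)                  # both approximately equal to theta = 1.7
```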

Final method of score matching
Given iid data x_1, ..., x_n, the score matching estimate is
  θ̂ = argmin_θ J(θ),   J(θ) = (1/n) Σ_{i=1}^n Σ_{j=1}^d [ ∂_j ψ_j(x_i; θ) + (1/2) ψ_j(x_i; θ)² ]
ψ_j is the partial derivative of the log unnormalised model log p̃ with respect to the j-th coordinate (slope), and ∂_j ψ_j is its second partial derivative (curvature).
Parameter estimation with intractable partition functions, without approximating the partition function.
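Here is a minimal end-to-end sketch of the final method (my own illustration, not the lecture's code): the practical objective J(θ) is formed from the slopes and curvatures of the log unnormalised model at the observed data points and minimised by a simple grid search; the model family and true precision below are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 3.0
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=5_000)   # observed iid data

# Unnormalised model: p_tilde(xi; theta) = exp(-theta * xi**2 / 2), d = 1.
def psi(xi, theta):          # slope:  d/dxi log p_tilde
    return -theta * xi

def dpsi(xi, theta):         # curvature:  d^2/dxi^2 log p_tilde
    return -theta * np.ones_like(xi)

def J(theta):                # sample-average score matching objective
    return np.mean(dpsi(x, theta) + 0.5 * psi(x, theta) ** 2)

thetas = np.linspace(0.1, 10.0, 2_000)
theta_hat = thetas[np.argmin([J(t) for t in thetas])]
print(theta_hat)             # close to theta_true; Z(theta) never appears
```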

Requirements
  J(θ) = (1/n) Σ_{i=1}^n Σ_{j=1}^d [ ∂_j ψ_j(x_i; θ) + (1/2) ψ_j(x_i; θ)² ]
Requirements:
- technical (from the proof): [p∗(ξ) ψ_j(ξ; θ)]_{a_j}^{b_j} = 0, where a_j and b_j specify the boundaries of the data pdf p∗ along dimension j
- smoothness: the second derivatives of log p̃(ξ; θ) with respect to the ξ_j need to exist, and should be smooth with respect to θ so that J(θ) can be optimised with gradient-based methods

Simple example
Model: p̃(ξ; θ) = exp(−θξ²/2), where the parameter θ > 0 is the precision.
The slope and curvature of the log unnormalised model are
  ψ(ξ; θ) = ∂_ξ log p̃(ξ; θ) = −θξ,   ∂_ξ ψ(ξ; θ) = −θ
If p∗ is Gaussian, lim_{ξ→±∞} p∗(ξ) ψ(ξ; θ) = 0 for all θ, so the technical requirement holds.
Score matching objective and estimate:
  J(θ) = −θ + (1/2) θ² (1/n) Σ_{i=1}^n x_i²,   θ̂ = ( (1/n) Σ_{i=1}^n x_i² )^{−1}
For Gaussians, this is the same as the MLE.
[Figure: negative score matching objective as a function of the precision, shown together with the term involving the score function derivative and the term involving the squared score function.]
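A quick numerical check of the closed-form estimate in the example (my own sketch): θ̂ = ((1/n) Σ x_i²)^{−1} coincides with the maximum likelihood estimate of the precision of a zero-mean Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = 2.5
x = rng.normal(0.0, 1.0 / np.sqrt(theta_true), size=20_000)

theta_sm = 1.0 / np.mean(x**2)     # score matching estimate from the slide
theta_mle = len(x) / np.sum(x**2)  # MLE of the precision of a zero-mean Gaussian
print(theta_sm, theta_mle)         # identical values; both close to theta_true
```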

Extensions
Score matching as presented here only works for x ∈ R^d.
There are extensions for discrete and non-negative random variables (not examinable): https://www.cs.helsinki.fi/u/ahyvarin/papers/csda07.pdf
Score matching can be shown to be part of a general framework for estimating unnormalised models (not examinable): https://michaelgutmann.github.io/assets/papers/gutmann2011b.pdf
Overall message: in some situations, learning criteria other than the likelihood are preferable.

Program recap
1. Basics of score matching
   - Basic ideas of score matching
   - Objective function that captures the basic ideas but cannot be computed
2. Practical objective function for score matching
   - Integration by parts to obtain a computable objective function
   - Simple example