From independent component analysis to score matching

From independent component analysis to score matching
Aapo Hyvärinen
Dept of Computer Science & HIIT
Dept of Mathematics and Statistics
University of Helsinki, Finland

Abstract
First, a short introduction to independent component analysis and non-gaussian Bayesian networks.
Main topic: estimation of non-normalized models.
- Problem: the parameterized density does not integrate to unity, and the partition function (normalization constant) is difficult to compute.
- Solution: fit the gradient of the log-density ψ with respect to the data variable.
  - Minimize the squared distance between the ψ of the data and the ψ of the model.
  - ψ does not depend on the normalization constant.
  - Using partial integration, the distance can be computed by a simple formula.
  - The estimator is optimal for reducing gaussian (infinitesimal) noise.

Problem of blind source separation
There are a number of source signals. Due to some external circumstances, only linear mixtures of the source signals are observed.
Goal: estimate (separate) the original signals!

Principal component analysis does not recover the original signals.

Principal component analysis does not recover the original signals. A solution is possible: use information on statistical independence to recover them.

Independent Component Analysis (Hérault and Jutten, 1984-1991)
The observed random vector x is modelled by a linear latent variable model

$x_i = \sum_{j=1}^{m} a_{ij} s_j, \quad i = 1, \dots, n \qquad (1)$

or in matrix form:

$x = A s \qquad (2)$

where
- The mixing matrix A is constant (a parameter matrix).
- The s_i are latent random variables called the independent components.
Estimate both A and s, observing only x.

Basic assumptions of the ICA model
Must assume:
- The s_i are mutually independent.
- The s_i are nongaussian.
For simplicity: the matrix A is square.
Then the mixing matrix and the components can be identified (Comon, 1994). A very surprising result!
The s_i are defined only up to a multiplicative constant, and they are not ordered.

ICA and decorrelation
First approach: decorrelate the variables, i.e. find W so that y = Wx has uncorrelated components:

$E\{y_i y_j\} - E\{y_i\} E\{y_j\} = 0 \qquad (3)$

But decorrelation (e.g. PCA) uses only the correlation matrix: $n^2/2$ equations, while the model has $n^2$ parameters. Not enough information!
Fortunately, for independent variables we have something stronger:

$E\{h_1(y_1)\, h_2(y_2)\} - E\{h_1(y_1)\}\, E\{h_2(y_2)\} = 0. \qquad (4)$

Gaussian data is determined by correlations alone, so the model cannot be estimated for gaussian data.
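A minimal numpy sketch (my own illustration, not from the talk) of why decorrelation alone is not enough: whitened data stays white under any rotation, so second-order statistics cannot pin down the mixing matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    s = rng.laplace(size=(10_000, 2))                 # independent nongaussian sources
    x = s @ np.array([[1.0, 0.5], [0.7, 1.2]]).T      # linear mixtures

    # Whiten: decorrelate and scale to unit variance (PCA whitening)
    eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
    z = (x - x.mean(0)) @ eigvec @ np.diag(eigval**-0.5)

    # Any rotation of the whitened data is still white: correlations cannot tell them apart
    theta = 0.7
    R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    print(np.cov(z, rowvar=False).round(2))           # ≈ identity
    print(np.cov(z @ R.T, rowvar=False).round(2))     # also ≈ identity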

Basic intuitive principle of ICA estimation
(A very sloppy version of) the Central Limit Theorem: consider a linear combination $w^T x = q^T s$; a sum $q_i s_i + q_j s_j$ is more gaussian than $s_i$ alone.
So by maximizing the nongaussianity of $q^T s$, we can find $s_i$. This is also known as projection pursuit.
Cf. principal component analysis, which maximizes the variance of $w^T x$.
A number of algorithms are available, e.g. FastICA (Hyvärinen, 1999); see the sketch below.
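A minimal sketch of ICA in practice using scikit-learn's FastICA; the sources, the mixing matrix and the variable names are made up for illustration.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    T = 10_000
    # Two nongaussian sources: uniform and Laplacian
    s = np.column_stack([rng.uniform(-1, 1, T), rng.laplace(0.0, 1.0, T)])
    A = np.array([[1.0, 0.5],
                  [0.7, 1.2]])        # "unknown" mixing matrix
    x = s @ A.T                       # observed mixtures, x = A s

    ica = FastICA(n_components=2, random_state=0)
    s_hat = ica.fit_transform(x)      # estimated components (up to order, sign and scale)
    A_hat = ica.mixing_               # estimated mixing matrix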

Linear Non-Gaussian Acyclic Model (LiNGAM; Shimizu et al., 2006)
Instead of components, we can estimate a network:

$x = Bx + e$

Estimation is possible based on ICA, assuming the e_i are independent and nongaussian.
We can rearrange to obtain ICA (almost):

$x = Bx + e \iff (I - B)x = e$

So ICA can be used to obtain $I - B$ (almost).
Problem: ICA does not determine the order of the components.
Solution: acyclicity defines the order uniquely.
[Figure: example DAG over x1, ..., x7 with estimated edge weights]
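A rough sketch (my own, not from the talk) of the core ICA-LiNGAM steps, without the final causal-order search or pruning; it assumes a small dimension so the row permutation can be found by brute force, and the function name is made up.

    import numpy as np
    from itertools import permutations
    from sklearn.decomposition import FastICA

    def lingam_B(X):
        """Estimate B in x = Bx + e from data X (T x n) with independent nongaussian e."""
        n = X.shape[1]
        # The ICA unmixing matrix W satisfies W x ≈ e up to row permutation and scaling,
        # i.e. W ≈ P D (I - B) for some permutation P and diagonal D.
        W = FastICA(n_components=n, random_state=0).fit(X).components_
        # Undo the permutation: pick the row order keeping the diagonal far from zero
        order = min(permutations(range(n)),
                    key=lambda p: sum(1.0 / abs(W[p[i], i]) for i in range(n)))
        Wp = W[list(order), :]
        # Undo the scaling: divide each row by its diagonal entry, then B = I - W'
        Wp = Wp / np.diag(Wp)[:, None]
        return np.eye(n) - Wp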

Generalization of ICA to many components
In basic ICA, the number of components equals the dimension of the data. We could consider many more projections and maximize their nongaussianity:

$\sum_{k=1}^{m} G(w_k^T x) \qquad (5)$

for some function G measuring nongaussianity. To estimate the $w_k$, we interpret this as a log-density. However, it should be normalized to unit integral:

$\log p(x) = \sum_{k=1}^{m} G(w_k^T x) - \log \int \exp\left(\sum_{k=1}^{m} G(w_k^T \xi)\right) d\xi \qquad (6)$

We find a very difficult integral! This leads to the main topic.

Main talk topic: score matching
Abstract
How to estimate models which cannot be integrated analytically.
Maximum likelihood estimation is computationally difficult: one must compute the integral.
We propose a computationally efficient method which avoids integration.
It can be shown to be statistically consistent, and optimal according to a Bayesian denoising objective.

General problem: non-normalized model estimation
We want to estimate a parametric model of a multivariate random vector $x \in \mathbb{R}^n$.
The density function is known only up to a multiplicative constant:

$p(x;\theta) = \frac{1}{Z(\theta)}\, q(x;\theta), \qquad Z(\theta) = \int_{\xi \in \mathbb{R}^n} q(\xi;\theta)\, d\xi$

The functional form of q is known (it can be easily computed), but Z cannot be computed in reasonable computing time.

Previous solutions
- Monte Carlo methods: consistent estimators (convergence to the real parameter values as the sample size grows to infinity), but computation is very slow.
- Various approximations, e.g. variational methods: computation often fast, but consistency is not known.
- Pseudo-likelihood and contrastive divergence: presumably consistent, but computations are slow with continuous-valued variables, needing 1-D integration at every step or sophisticated MCMC methods.

Definition of score function (in this talk)
Define the model score function $\psi(\cdot;\theta): \mathbb{R}^n \to \mathbb{R}^n$ as

$\psi(\xi;\theta) = \begin{pmatrix} \frac{\partial \log p(\xi;\theta)}{\partial \xi_1} \\ \vdots \\ \frac{\partial \log p(\xi;\theta)}{\partial \xi_n} \end{pmatrix} = \nabla_\xi \log p(\xi;\theta)$

Similarly, define the data score function as

$\psi_x(\xi) = \nabla_\xi \log p_x(\xi)$

where the observed data is assumed to follow $p_x(\cdot)$.
In conventional terminology: this is the Fisher score with respect to a hypothetical location parameter, i.e. of $p(x - \theta)$, evaluated at $\theta = 0$.
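A tiny numerical illustration (my own example, not from the talk): the score of a 1-D Gaussian, checked by finite differences; note that the normalization constant contributes nothing to the gradient with respect to the data variable.

    import numpy as np

    # Score of a 1-D Gaussian N(mu, sigma^2): psi(x) = d/dx log p(x) = -(x - mu) / sigma^2
    mu, sigma = 1.0, 2.0
    log_p = lambda x: -0.5 * ((x - mu) / sigma)**2 - np.log(sigma * np.sqrt(2 * np.pi))

    x0, eps = 3.0, 1e-5
    numeric = (log_p(x0 + eps) - log_p(x0 - eps)) / (2 * eps)   # finite-difference derivative
    analytic = -(x0 - mu) / sigma**2
    print(numeric, analytic)    # both ≈ -0.5; the constant -log(sigma*sqrt(2*pi)) drops out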

Score matching: definition of objective function
Estimate by minimizing a distance between the model score function $\psi(\cdot;\theta)$ and the score function of the observed data $\psi_x(\cdot)$:

$J(\theta) = \frac{1}{2} \int_{\xi \in \mathbb{R}^n} p_x(\xi)\, \|\psi(\xi;\theta) - \psi_x(\xi)\|^2\, d\xi \qquad (7)$

$\hat\theta = \arg\min_\theta J(\theta)$

This gives a consistent estimator almost by construction.
It does not depend on the normalization constant $Z(\theta)$ because

$\psi(\xi;\theta) = \nabla_\xi \log q(\xi;\theta) - \nabla_\xi \log Z(\theta) = \nabla_\xi \log q(\xi;\theta) - 0 \qquad (8)$

There is no need to compute the normalization constant Z; the non-normalized pdf q is enough.
Computation of J is quite simple due to the theorem below.

A computational trick
In the objective function we have the score function of the data distribution $\psi_x(\cdot)$. How to compute it? In fact, there is no need to compute it, because of the following theorem.

Theorem 1. Assume some regularity conditions, and smooth densities. Then the score matching objective function J can be expressed as

$J(\theta) = \int_{\xi \in \mathbb{R}^n} p_x(\xi) \sum_{i=1}^{n} \left[ \partial_i \psi_i(\xi;\theta) + \frac{1}{2}\, \psi_i(\xi;\theta)^2 \right] d\xi + \text{const.} \qquad (9)$

where the constant does not depend on θ, and

$\psi_i(\xi;\theta) = \frac{\partial \log q(\xi;\theta)}{\partial \xi_i}, \qquad \partial_i \psi_i(\xi;\theta) = \frac{\partial^2 \log q(\xi;\theta)}{\partial \xi_i^2}$

Simple explanation of trick
Consider the objective function J(θ):

$J(\theta) = \frac{1}{2} \int p_x(\xi)\, \|\psi_x(\xi)\|^2\, d\xi + \frac{1}{2} \int p_x(\xi)\, \|\psi(\xi;\theta)\|^2\, d\xi - \int p_x(\xi)\, \psi_x(\xi)^T \psi(\xi;\theta)\, d\xi$

The first term does not depend on θ. The second term is easy to compute. The trick is to use partial integration on the third term. In one dimension:

$\int p_x(x)\, (\log p_x)'(x)\, \psi(x;\theta)\, dx = \int p_x(x)\, \frac{p_x'(x)}{p_x(x)}\, \psi(x;\theta)\, dx = \int p_x'(x)\, \psi(x;\theta)\, dx = 0 - \int p_x(x)\, \psi'(x;\theta)\, dx$

This is why the score function of the data distribution $p_x(x)$ disappears!

Final method of score matching
Replace the integration over the sample density $p_x(\cdot)$ by a sample average. Given T observations $x(1), \dots, x(T)$, minimize

$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} \left[ \partial_i \psi_i(x(t);\theta) + \frac{1}{2}\, \psi_i(x(t);\theta)^2 \right] \qquad (10)$

where $\psi_i$ is a partial derivative of the non-normalized model log-density $\log q$, and $\partial_i \psi_i$ a second partial derivative.
This only needs the evaluation of some derivatives of the non-normalized (log-)density q, which are simple to compute (by assumption).
Thus: a new computationally simple and statistically consistent method. A minimal worked example follows.
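A minimal sketch (my own toy example, not from the talk) of the final method for a one-parameter non-normalized model $\log q(x;\theta) = -\theta x^2/2$, whose normalization constant is never used. Here $\psi(x;\theta) = -\theta x$ and $\partial\psi/\partial x = -\theta$, so the sample objective (10) can be written down directly.

    import numpy as np

    rng = np.random.default_rng(1)
    lam_true = 2.5                                     # true precision of the data
    x = rng.normal(0.0, 1.0 / np.sqrt(lam_true), size=5000)

    # Non-normalized model: log q(x; theta) = -theta * x^2 / 2   (Z(theta) never computed)
    # Model score:          psi(x; theta)  = -theta * x
    # Its derivative:       d psi / dx     = -theta
    def J(theta, x):
        """Sample version of the score matching objective, Eq. (10)."""
        return np.mean(-theta + 0.5 * (theta * x)**2)

    # Minimize on a grid; for this model the exact minimizer is 1 / mean(x^2)
    grid = np.linspace(0.1, 5.0, 1000)
    theta_hat = grid[np.argmin([J(t, x) for t in grid])]
    print(theta_hat, 1.0 / np.mean(x**2), lam_true)    # all approximately equal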

Interesting result: closed-form solution in the exponential family
Assume the pdf can be expressed in the form

$\log p(\xi;\theta) = \sum_{k=1}^{m} \theta_k F_k(\xi) - \log Z(\theta) \qquad (11)$

Define matrices of partial derivatives:

$K_{ki}(\xi) = \frac{\partial F_k}{\partial \xi_i}, \qquad H_{ki}(\xi) = \frac{\partial^2 F_k}{\partial \xi_i^2} \qquad (12)$

Then the score matching estimator is given by

$\hat\theta = -\left[ \hat E\{K(x) K(x)^T\} \right]^{-1} \left( \sum_i \hat E\{h_i(x)\} \right) \qquad (13)$

where $\hat E$ denotes the sample average, and the vector $h_i$ is the i-th column of the matrix H.
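A small sketch (my own example, not from the talk) applying the closed-form estimator (13) to a 1-D exponential family with sufficient statistics $F_1(x) = -x^2/2$ and $F_2(x) = -x^4/4$; for standard Gaussian data the estimate should come out approximately $(\theta_1, \theta_2) = (1, 0)$.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=(20_000, 1))                 # 1-D data, true density ∝ exp(-x^2/2)

    # Sufficient statistics: F1(x) = -x^2/2, F2(x) = -x^4/4  (so m = 2, n = 1)
    def K(x):   # K_{ki} = dF_k / dx_i
        return np.stack([-x[:, 0], -x[:, 0]**3], axis=1)            # shape (T, m)
    def H(x):   # H_{ki} = d^2 F_k / dx_i^2
        return np.stack([-np.ones(len(x)), -3 * x[:, 0]**2], axis=1)

    Kx = K(x)
    KKt = (Kx[:, :, None] * Kx[:, None, :]).mean(axis=0)            # Ê{K K^T}, shape (m, m)
    h_sum = H(x).mean(axis=0)                                       # sum over i of Ê{h_i}; here n = 1
    theta_hat = -np.linalg.solve(KKt, h_sum)                        # Eq. (13)
    print(theta_hat)                                                # ≈ [1.0, 0.0]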

Extensions of score matching
- Can be extended to non-negative data. Basic score matching cannot be directly used there because the density is typically not smooth over $\mathbb{R}^n$.
- Can be extended to binary variables. However, the utility is questionable because pseudolikelihood is computationally efficient in that case.
- Can be shown to be equivalent to a special case of contrastive divergence (equal in expectation when using the Langevin MCMC method and an infinitesimal step size).

Statistical optimality of score matching
Question: is score matching optimal in any statistical sense?
Consider an observed signal y which is a noisy version of an original signal x, which comes from a prior distribution with parameter vector θ. Assume:

$p(y, x \mid \theta) = c \exp\left(-\frac{1}{2\sigma^2} \|y - x\|^2\right) p(x \mid \theta) \qquad (14)$

We infer the original signal by MAP inference:

$\hat x_{\mathrm{MAP}}(\hat\theta, y) = \arg\max_x p(y \mid x)\, p(x \mid \hat\theta) = \arg\max_x \log p(y \mid x) + \log p(x \mid \hat\theta)$

We estimate the parameters θ from a separate sample of noise-free signals x.
What is the optimal method of estimating θ? (A single point estimate.)
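As a concrete illustration (my own example, not from the talk), for a scalar Gaussian prior $p(x \mid \lambda) \propto \exp(-\lambda x^2/2)$ the MAP denoiser in (14) has a closed form:

$\hat x_{\mathrm{MAP}}(\lambda, y) = \arg\max_x \left[ -\frac{(y - x)^2}{2\sigma^2} - \frac{\lambda x^2}{2} \right] = \frac{y}{1 + \sigma^2 \lambda}$

so a misestimated λ shrinks the signal by the wrong amount; this is the kind of restoration error analysed on the following slides.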

Statistical optimality (2): difference to classical analysis
Classical analysis of the optimality of estimators considers errors in the parameter values. Here, we consider the error in the restored (denoised) signal (the Euclidean distance between x and its MAP estimate).
These errors need to be related, cf. collinearity in linear regression.
Also: to be computationally realistic, we don't use a full Bayesian restoration; instead we take a point estimate of θ and use the MAP estimate.
We also assume that we can observe noise-free signals from which to estimate the parameters.

Statistical optimality (3): analysis of estimation error
Assume the signal is corrupted by infinitely small gaussian noise as above.

Theorem 2. Assume that all the log-pdfs are differentiable, and that the estimation error in MAP estimation, $\Delta x = \hat x - x$, is small. Then the first-order approximation of the error is

$E\{\|\Delta x\|^2\} = \sigma^4 \left( E\{\|E_1\|^2\} + E\{\|E_2\|^2\} \right) + \text{smaller terms} \qquad (15)$

where $E_1 = \psi_0(x) - \psi(x \mid \hat\theta)$ and $E_2 = \psi_0(x) + \psi(y \mid x)$.

Note that $E_2$ does not depend on θ. Thus, the optimal estimation of θ is by minimization of $E_{p_x}\{\|E_1\|^2\}$: this is just score matching!

An information geometry
Considering $p_x$ fixed, we define a Hilbertian structure in the space of score functions:

$\langle p_1, p_2 \rangle = \int p_x(\xi) \left[ \sum_{i=1}^{n} \psi_{1,i}(\xi)\, \psi_{2,i}(\xi) \right] d\xi = \int p_x(\xi)\, \psi_1(\xi)^T \psi_2(\xi)\, d\xi \qquad (16)$

The dot-product defines a norm and a distance. Score matching is performed by minimization of the distance between $p_x$ and $p(\cdot \mid \theta)$ in this metric.

An information geometry (2): Pythagorean decomposition
The exponential family is a linear subspace, and estimation is an orthogonal projection on that subspace. Pythagorean equality:

$\|p_x\|^2 = \mathrm{dist}^2(p(\cdot \mid \hat\theta), p_x) + \|p(\cdot \mid \hat\theta)\|^2 \qquad (17)$

This can be interpreted in terms of the denoising capability of MAP estimation:

variance of noise which can be removed by MAP denoising = noise variance not removed due to imperfect prior + noise variance removed by prior

Intuitively, denoising is possible because of structure in the signal, which leads to a more speculative interpretation:

structure in data = structure not modelled + structure modelled

Experiment: overcomplete ICA basis of natural images
Likelihood:

$\log p(x) = \sum_{k=1}^{m} \alpha_k G(w_k^T x) + Z(w_1, \dots, w_n, \alpha_1, \dots, \alpha_n)$

Objective function:

$J = \sum_{k=1}^{m} \alpha_k \frac{1}{T} \sum_{t=1}^{T} g'(w_k^T x(t)) + \frac{1}{2} \sum_{j,k=1}^{m} \alpha_j \alpha_k\, w_j^T w_k\, \frac{1}{T} \sum_{t=1}^{T} g(w_k^T x(t))\, g(w_j^T x(t)) \qquad (18)$

120 basis vectors estimated from 8×8 image patches (no dimension reduction); see the sketch below.
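A sketch of how the sample objective (18) could be evaluated with numpy. The function name, the choice $G(u) = -\log\cosh(u)$ (so $g = G'$ and $g' = G''$), and the assumption that the rows $w_k$ have unit norm are mine, not from the talk.

    import numpy as np

    def g(u):        # g = G' for G(u) = -log cosh(u), a common smooth nongaussianity measure
        return -np.tanh(u)

    def g_prime(u):  # g' = G''
        return -(1.0 - np.tanh(u)**2)

    def sm_objective(W, alpha, X):
        """Sample score matching objective, Eq. (18).
        W: (m, n) matrix with rows w_k (assumed unit norm), alpha: (m,), X: (T, n) data."""
        U = X @ W.T                                   # U[t, k] = w_k^T x(t)
        term1 = np.sum(alpha * g_prime(U).mean(axis=0))
        GU = g(U)
        C = (GU.T @ GU) / X.shape[0]                  # C[j, k] = (1/T) sum_t g(u_tj) g(u_tk)
        term2 = 0.5 * np.sum(np.outer(alpha, alpha) * (W @ W.T) * C)
        return term1 + term2

This J can then be minimized over W and α with any gradient-based optimizer.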

Conclusion
Non-gaussianity is a very powerful property in multivariate statistics: it finds hidden factors (independent component analysis) and estimates linear Bayesian networks.
However, it leads to computationally difficult models, even non-normalized models.
We propose to minimize the squared distance between the score functions (gradients of the log-density) of the model density and the data distribution.
Perhaps the first consistent and computationally simple method (?).
Closed-form solution in some exponential families.
Statistical optimality in the sense of denoising.