Density Modeling and Clustering Using Dirichlet Diffusion Trees

Radford M. Neal, Bayesian Statistics 7, 2003, pp. 619-629.
Presenter: Ivo D. Shterev

Outline

- Motivation
- Data points generation
- Probability of generating Dirichlet diffusion trees (DDT)
- Exchangeability
- Data points generation from the DDT structure
- Examples
- Testing for absolute continuity
- Simple relationships to other processes
- Discussion

Motivation

- The Dirichlet process (DP) produces distributions that are discrete with probability one, hence unsuitable for density modeling.
- Convolving such a distribution with a continuous kernel can produce densities, i.e. using DP mixtures with a countably infinite number of components.
- The parameters of one DP mixture component are independent of the parameters of the other components, because the parameter priors are independent. DP mixtures are therefore inefficient when the data exhibit hierarchical structure.
- Polya trees (PT) are a generalization of the DP that can produce hierarchical distributions, but these distributions have discontinuities.
- An alternative is to use the Dirichlet diffusion tree (DDT).

Data Points Generation

Suppose we want to generate n data points X ∈ R^p from a DDT. X_1 is generated from a Gaussian diffusion process:

X_1(t + dt) = X_1(t) + N_1(dt),   0 ≤ t ≤ T

where N_1(dt) ~ N(0, σ²I dt), σ² is a parameter of the diffusion process, dt is infinitesimal, and T is the period. It can be seen that X_1(t) ~ N(0, σ²I t).

X_2(t) is generated by following a path from the origin, initially following the path of X_1(t) but diverging at some time. To describe divergence we introduce a "divergence function" a(t): X_2(t) diverges during (t, t + dt) with probability P_d = a(t) dt. After divergence, the remaining paths are independent.
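
As a quick illustration, here is a minimal sketch (assuming NumPy; the function name and all parameter values are illustrative, not from the paper) of simulating X_1 on a uniform time grid by the Euler rule above:

```python
import numpy as np

def simulate_first_path(sigma2=1.0, p=1, T=1.0, n_steps=1000, rng=None):
    """Euler simulation of X_1: X(t + dt) = X(t) + N(0, sigma2 * I * dt)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    increments = rng.normal(0.0, np.sqrt(sigma2 * dt), size=(n_steps, p))
    # cumulative sum of Gaussian increments, starting from the origin
    return np.vstack([np.zeros((1, p)), np.cumsum(increments, axis=0)])

x1 = simulate_first_path()  # x1[k] approximates X_1(k * dt)
```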

Data Points Generation

X_3(t) initially follows X_1(t) and X_2(t). X_3(t) can diverge before the time at which X_1(t) and X_2(t) diverged, with P_d = a(t) dt / 2; at the branch point it follows X_1(t) or X_2(t), each with probability P_f = 1/2. Once it follows only X_1(t) or only X_2(t), we again have P_d = a(t) dt.

In general, a path traversing a segment previously traversed by m paths diverges during (t, t + dt) with probability P_d = a(t) dt / m, and at a branch point it follows the i-th branch with probability P_f = m_i / m, where m_i is the number of previous paths that followed the i-th branch. Once divergence occurs, the new path moves independently of the previous paths.
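
Continuing the sketch above, a hypothetical helper for the two-path case (branch choices are not needed until a third path exists): the new path copies X_1 until it diverges, then diffuses independently. The divergence function a(t) = 1/(1 − t) is an illustrative choice borrowed from the examples later in the talk:

```python
def simulate_next_path(x1, a, sigma2=1.0, T=1.0, rng=None):
    """Follow an existing path, diverging in (t, t + dt) with probability a(t) dt."""
    rng = np.random.default_rng() if rng is None else rng
    n_steps, p = x1.shape[0] - 1, x1.shape[1]
    dt = T / n_steps
    x2 = x1.copy()
    for k in range(n_steps):
        if rng.random() < min(1.0, a(k * dt) * dt):
            # after divergence, the remaining path evolves independently
            inc = rng.normal(0.0, np.sqrt(sigma2 * dt), size=(n_steps - k, p))
            x2[k + 1:] = x2[k] + np.cumsum(inc, axis=0)
            break
    return x2

x2 = simulate_next_path(x1, a=lambda t: 1.0 / (1.0 - t))  # uses x1 from above
```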

Data Points Generation

[Figure: example trees generated with n = 4, p = 1, T = 1.]

Probability of Generating DDT

Probability of no divergence between times s and t (s < t) along a path previously traversed m times:

P_nd(s, t) = lim_{k→∞} ∏_{i=0}^{k−1} ( 1 − a(s + i(t−s)/k) (t−s)/(km) )
           = lim_{k→∞} exp( ∑_{i=0}^{k−1} log( 1 − a(s + i(t−s)/k) (t−s)/(km) ) )
           = exp( − ∫_s^t a(u)/m du )
           = exp( −( A(t) − A(s) ) / m )

where A(t) = ∫_0^t a(u) du is the "cumulative divergence function".
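
The closed form can be sanity-checked numerically: for finite k, the product of (1 − a·Δ/(km)) terms should already be close to exp(−(A(t) − A(s))/m). A small sketch, assuming a(t) = c/(1 − t) (introduced later in the talk) so that A has a closed form:

```python
import numpy as np

# illustrative divergence function a(t) = c / (1 - t), with A(t) = -c log(1 - t)
c, m, s, t, k = 1.0, 3, 0.2, 0.8, 100_000
a = lambda u: c / (1.0 - u)
A = lambda u: -c * np.log(1.0 - u)

u = s + (np.arange(k) / k) * (t - s)
finite_product = np.prod(1.0 - a(u) * (t - s) / (k * m))
closed_form = np.exp(-(A(t) - A(s)) / m)
print(finite_product, closed_form)  # the two agree to several decimal places
```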

Probability of Generating DDT

The divergence function a(t) plays a central role. A sufficient condition for divergence, i.e. Pr( X_i(T) = X_j(T) ) = 0 for any i ≠ j, is

∫_0^T a(t) dt = ∞.

Probability of Generating DDT

We don't need all the details of the paths; we need only the tree structure and the divergence times.

Probability of Generating DDT

Probability (density) of obtaining the tree structure and divergence times (the "tree factor"), for the four-point tree in which x_1 and x_2 diverge at node (t_a, x_a), x_3 diverges from them at (t_b, x_b) with t_b < t_a, and x_4 follows x_3 before diverging at (t_c, x_c):

P_t = exp( −A(t_a) ) a(t_a)                                          (second path)
      × exp( −A(t_b)/2 ) ( a(t_b)/2 )                                (third path)
      × exp( −A(t_b)/3 ) (1/3) exp( −( A(t_c) − A(t_b) ) ) a(t_c)    (fourth path)

The "data factor" is

DF = N( x_b; 0, σ²I t_b ) N( x_a; x_b, σ²I(t_a − t_b) )
     × N( x_1; x_a, σ²I(1 − t_a) ) N( x_2; x_a, σ²I(1 − t_a) )
     × N( x_c; x_b, σ²I(t_c − t_b) )
     × N( x_3; x_c, σ²I(1 − t_c) ) N( x_4; x_c, σ²I(1 − t_c) )

where N(x; µ, Σ) denotes the Gaussian density with mean µ and covariance Σ, evaluated at x.
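
A hedged sketch of evaluating the two factors (in log space) for this four-point tree, assuming a(t) = c/(1 − t) and illustrative divergence times and node locations; SciPy's multivariate_normal supplies the Gaussian densities, and the dictionary keys are hypothetical names for the tree's nodes:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def log_tree_factor(t_a, t_b, t_c, c=1.0):
    """Log tree factor for the four-point tree, assuming a(t) = c / (1 - t)."""
    A = lambda t: -c * np.log(1.0 - t)   # cumulative divergence function
    a = lambda t: c / (1.0 - t)
    lp = -A(t_a) + np.log(a(t_a))                       # second path
    lp += -A(t_b) / 2 + np.log(a(t_b) / 2)              # third path
    lp += -A(t_b) / 3 + np.log(1 / 3) \
          - (A(t_c) - A(t_b)) + np.log(a(t_c))          # fourth path
    return lp

def log_data_factor(x, t_a, t_b, t_c, sigma2=1.0, p=1):
    """Log data factor: product of Gaussian densities along the tree's segments."""
    N = lambda y, mu, v: mvn.logpdf(y, mean=mu, cov=sigma2 * v * np.eye(p))
    return (N(x['b'], np.zeros(p), t_b) + N(x['a'], x['b'], t_a - t_b)
            + N(x[1], x['a'], 1 - t_a) + N(x[2], x['a'], 1 - t_a)
            + N(x['c'], x['b'], t_c - t_b)
            + N(x[3], x['c'], 1 - t_c) + N(x[4], x['c'], 1 - t_c))

x = {'b': np.array([0.1]), 'a': np.array([0.3]), 'c': np.array([-0.2]),
     1: np.array([0.5]), 2: np.array([0.2]), 3: np.array([-0.6]), 4: np.array([0.0])}
print(log_tree_factor(0.4, 0.2, 0.5) + log_data_factor(x, 0.4, 0.2, 0.5))
```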

Exchangeability

We can rewrite the "tree factor", grouping terms by segment, as

P_t = exp( −A(t_b) ) exp( −A(t_b)/2 ) ( a(t_b)/2 ) exp( −A(t_b)/3 ) (1/3)
      × exp( −( A(t_a) − A(t_b) ) ) a(t_a)
      × exp( −( A(t_c) − A(t_b) ) ) a(t_c)

The terms on the first line are associated with the segment (0, 0) → (t_b, x_b). If the factor for each segment remains unchanged after any reordering of the data points (paths), then we have exchangeability.

Exchangeability

Consider a segment (t_u, x_u) → (t_v, x_v) that was traversed by m > 1 paths.

- The probability that the m − 1 paths after the first do not diverge within the segment is

  P_nd^{m−1} = ∏_{i=1}^{m−1} exp( −( A(t_v) − A(t_u) ) / i )

- If t_v = T, this is the whole factor for this segment; otherwise, there must be some divergence at t_v.
- Suppose the first i − 1 paths (i ≥ 2) do not diverge at t_v, but the i-th path diverges there, so P_d^i = a(t_v) / (i − 1).
- Let n_1 be the number of points following the first i − 1 paths and n_2 the number following the i-th path (n_1 ≥ i − 1, n_1 + n_2 = m).

Exchangeability

- The probability that path j (j > i) follows the first i − 1 paths is P_f = c_1/(j − 1), where c_1 is the number of previous paths on that branch (i − 1 ≤ c_1 ≤ n_1 − 1).
- The probability that path j (j > i) follows the i-th path is P_f = c_2/(j − 1), where 1 ≤ c_2 ≤ n_2 − 1.
- Their product over all c_1, all c_2, and all j > i gives

  P_f^{all j} = ( ∏_{c_1=i−1}^{n_1−1} c_1 ) ( ∏_{c_2=1}^{n_2−1} c_2 ) / ∏_{j=i+1}^{m} (j − 1)
              = ( (n_1 − 1)! / (i − 2)! ) (n_2 − 1)! ( (i − 1)! / (m − 1)! )
              = (i − 1) (n_1 − 1)! (n_2 − 1)! / (m − 1)!

Exchangeability

Taking the product of all probabilities associated with the segment under consideration, we obtain the "tree factor" for the segment:

P_t = P_d^i P_f^{all j} P_nd^{m−1}
    = ( a(t_v)/(i − 1) ) ( (i − 1)(n_1 − 1)!(n_2 − 1)! / (m − 1)! ) ∏_{i′=1}^{m−1} exp( −( A(t_v) − A(t_u) ) / i′ )
    = ( a(t_v)(n_1 − 1)!(n_2 − 1)! / (m − 1)! ) ∏_{i′=1}^{m−1} exp( −( A(t_v) − A(t_u) ) / i′ )

which is independent of i, i.e. we have exchangeability.

The above analysis does not cover the case where more than one path diverges at the same time, i.e. when a(t) has an infinite peak, but the proof can be modified to handle this.
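
The cancellation of i can be checked directly: for fixed m = n_1 + n_2, the product P_d^i P_f^{all j} takes the same value for every admissible i. A tiny sketch (the value 2.0 for a(t_v) is arbitrary):

```python
from math import factorial

def segment_factor(a_tv, i, n1, n2):
    """P_d^i times P_f^{all j} for one segment; the dependence on i cancels."""
    m = n1 + n2
    p_d = a_tv / (i - 1)                    # i-th path diverges at t_v
    p_f = (i - 1) * factorial(n1 - 1) * factorial(n2 - 1) / factorial(m - 1)
    return p_d * p_f

# the same value for every admissible i (2 <= i <= n1 + 1)
print({i: segment_factor(2.0, i, n1=4, n2=3) for i in range(2, 6)})
```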

Data Points Generation from the DDT Structure

The probability that path i has diverged by time t (its cumulative distribution function) is

C(t) = 1 − exp( −A(t)/(i − 1) )

under the assumption that path i initially follows the previous i − 1 paths. A divergence time can therefore be generated by inversion:

t_d = C^{−1}(U) = A^{−1}( −(i − 1) log(1 − U) )

where U ~ U(0, 1) (and T = 1 in the examples). It is more convenient to work with A^{−1}(·), since this avoids the infinite peaks of a(t).

Examples

a(t) = c/(1 − t), where c is a constant. The cumulative divergence function is

A(t) = ∫_0^t a(u) du = −c log(1 − t)

with inverse A^{−1}(e) = 1 − exp(−e/c). Distributions drawn from such a prior will be continuous, since ∫_0^1 a(t) dt = ∞.
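
A minimal sketch of the inverse-CDF sampler for this divergence function (assuming NumPy; the function name is hypothetical):

```python
import numpy as np

def sample_divergence_time(i, c=1.0, rng=None):
    """Divergence time for path i (following i - 1 previous paths), a(t) = c/(1 - t)."""
    rng = np.random.default_rng() if rng is None else rng
    A_inv = lambda e: 1.0 - np.exp(-e / c)   # inverse of A(t) = -c log(1 - t)
    return A_inv(-(i - 1) * np.log(1.0 - rng.random()))

# divergence times tend to be later for larger i, since more previous
# paths reduce the divergence rate a(t) / (i - 1)
print([sample_divergence_time(i) for i in (2, 3, 4)])
```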

Examples

[Figure slides: one-dimensional points (two slides) and two-dimensional points (four slides) sampled from the prior.]

Examples

a(t) = b + c/(1 − t)², where b and c are constants. The cumulative divergence function is

A(t) = bt − c + c/(1 − t)

with inverse

A^{−1}(e) = ( b + c + e − sqrt( (b + c + e)² − 4be ) ) / (2b)   if b ≠ 0
A^{−1}(e) = e/(e + c)                                           if b = 0

Good choices are b = 1/2 and c = 1/200, giving well-separated clusters with points smoothly distributed within each cluster.
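
A sketch of this A and its inverse, taking the quadratic-formula root that lies in [0, 1) (the defaults are the suggested b = 1/2, c = 1/200):

```python
import numpy as np

def A(t, b=0.5, c=1.0 / 200):
    return b * t - c + c / (1.0 - t)

def A_inv(e, b=0.5, c=1.0 / 200):
    if b == 0.0:
        return e / (e + c)
    s = b + c + e
    return (s - np.sqrt(s * s - 4.0 * b * e)) / (2.0 * b)  # root in [0, 1)

print(A_inv(A(0.9)))  # round trip recovers 0.9
```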

Examples

[Figure slide: two-dimensional points sampled with a(t) = b + c/(1 − t)².]

Testing for Absolute Continuity

Distributions produced from a DDT prior will be continuous (with probability one) if ∫_0^T a(t) dt = ∞. However, this does not imply absolute continuity. Absolute continuity is required for distributions drawn from a DDT prior to have densities.

Absolute continuity can be tested empirically by looking at distances to nearest neighbors in a sample from the distribution:

- for each point x, compute the Euclidean distances to its two nearest neighbors;
- compute their ratio r < 1;
- if the distribution is absolutely continuous, r^p should be (approximately) U(0, 1).

This is an empirical test, not a rigorous proof.
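
A sketch of the test, assuming NumPy and SciPy; a standard-normal sample stands in for a draw from an absolutely continuous distribution, so the empirical quantiles of r^p should be close to those of U(0, 1):

```python
import numpy as np
from scipy.spatial import cKDTree

def ratio_powers(sample):
    """r^p, where r is the ratio of each point's two nearest-neighbour distances."""
    n, p = sample.shape
    dist, _ = cKDTree(sample).query(sample, k=3)  # dist[:, 0] is the point itself
    return (dist[:, 1] / dist[:, 2]) ** p         # r < 1, so r**p lies in (0, 1)

vals = ratio_powers(np.random.default_rng(0).normal(size=(4000, 1)))
print(np.quantile(vals, [0.25, 0.5, 0.75]))       # roughly 0.25, 0.5, 0.75
```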

Testing for Absolute Continuity

[Figure: cdf of r^p, p = 1, a(t) = c/(1 − t), sample size 4000.] For c = 1/2, c = 5/8, and c = 1, there is absolute continuity.

Testing for Absolute Continuity

[Figure: cdf of r^p, p = 2, a(t) = c/(1 − t), sample size 1000.] For c = 3/2, c = 5/4, and c = 1, there is absolute continuity.

Testing for Absolute Continuity

[Figure: cdf of r^p, p = 3, a(t) = c/(1 − t), sample size 1000.] For c = 3/2 and c = 2, there is absolute continuity.

Testing for Absolute Continuity

[Figure: cdf of r^p, p = 2, a(t) = 1/2 + 1/(200(1 − t)²), sample size 1000.] There is absolute continuity.

Conjecture: for a(t) = c/(1 − t), there is absolute continuity iff c > p/2.

Testing for Absolute Continuity

Consider the distribution of the difference ν between two paths, conditioned on their divergence time t: f(ν | t) = N(ν; 0, 2σ²(1 − t)I). The density of the divergence time is a(t) exp(−A(t)) = c(1 − t)^{c−1}. Therefore

f(ν) = ∫_0^1 c(1 − t)^{c−1} exp( −‖ν‖²/(4σ²(1 − t)) ) / (4πσ²(1 − t))^{p/2} dt
     = ( c/(4πσ²)^{p/2} ) ∫_0^1 (1 − t)^{c−1−p/2} exp( −‖ν‖²/(4σ²(1 − t)) ) dt
     = ( c/(4πσ²)^{p/2} ) ∫_1^∞ u^{p/2−1−c} exp( −‖ν‖² u/(4σ²) ) du

substituting u = 1/(1 − t).

Testing for Absolute Continuity

Continuing, with q = ‖ν‖²/(4σ²):

f(ν) = ( c/(4πσ²)^{p/2} ) ∫_1^∞ u^{p/2−1−c} exp(−qu) du
     = ( c/(4πσ²)^{p/2} ) q^{c−p/2} Γ( p/2 − c, q )    if p/2 > c and ν ≠ 0
     = ∞ (undefined)                                   if p/2 > c and ν = 0
     = c / ( (4πσ²)^{p/2} (c − p/2) )                  if p/2 < c and ν = 0

where Γ(·, ·) is the upper incomplete gamma function. So for p/2 > c the density is unbounded at ν = 0; c = p/2 therefore seems to be the critical point with respect to absolute continuity.

Conjecture: in contrast, DDT priors using a(t) = b + c/(1 − t)² (with c > 0) always produce absolutely continuous distributions.
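
The closed form can be checked against direct quadrature (assuming SciPy; gammaincc(a, q) * gamma(a) gives the upper incomplete gamma function Γ(a, q), which requires a = p/2 − c > 0):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, gammaincc

def f_closed(nu, c, sigma2, p):
    """Closed form for p/2 > c and nu != 0."""
    q = nu ** 2 / (4.0 * sigma2)
    a = p / 2.0 - c
    return (c / (4 * np.pi * sigma2) ** (p / 2)
            * q ** (c - p / 2) * gammaincc(a, q) * gamma(a))

def f_numeric(nu, c, sigma2, p):
    """Direct quadrature of the mixture over divergence times."""
    integrand = lambda t: (c * (1 - t) ** (c - 1)
                           * np.exp(-nu ** 2 / (4 * sigma2 * (1 - t)))
                           / (4 * np.pi * sigma2 * (1 - t)) ** (p / 2))
    return quad(integrand, 0.0, 1.0)[0]

print(f_closed(0.7, c=0.25, sigma2=1.0, p=1),
      f_numeric(0.7, c=0.25, sigma2=1.0, p=1))  # should agree closely
```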

Simple Relationships to Other Processes

A DDT with variance σ², and a(t) = 0 except for an infinite peak of mass log(1 + α) at t = 0, is equivalent to DP( α, N(0, σ²I) ).

To generate a simple DP mixture with components N(µ, σ²_x I) and prior µ ~ N(0, σ²_µ I), generate a DDT with

A^{−1}(e) = 0            if e < log(1 + α)
A^{−1}(e) = σ²_µ / σ²    if e ≥ log(1 + α)

where σ² = σ²_µ + σ²_x.
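
A sketch of this step-function A^{−1} (assuming NumPy; the function name is hypothetical, and the interpretation in the comments follows the construction above):

```python
import numpy as np

def A_inv_dp_mixture(e, alpha, sigma2_mu, sigma2_x):
    """Step-function inverse cumulative divergence for the DP-mixture DDT.

    Combined with t_d = A_inv(-(i - 1) * log(1 - U)), every path diverges
    either at t = 0 (starting a new mixture component) or at
    t = sigma2_mu / sigma2 (adding a point to an existing component).
    """
    sigma2 = sigma2_mu + sigma2_x
    return 0.0 if e < np.log(1.0 + alpha) else sigma2_mu / sigma2

print([A_inv_dp_mixture(e, alpha=1.0, sigma2_mu=1.0, sigma2_x=0.5)
       for e in (0.1, 2.0)])
```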

Discussion

- The DDT can be used to produce a variety of distributions; their properties vary with the choice of divergence function.
- The parameters of the divergence function can themselves be given prior distributions, for greater flexibility. Alternatively, the divergence function can be replaced by a stochastic process.
- The DDT can model vectors of latent variables, if the observed data are perturbed by noise.
- Univariate data (even parameters of the DDT itself) can be modeled by multivariate DDTs, e.g. z = x exp(y), where (x, y) ~ DDT.
- σ² can be made time-varying, or given a DDT prior.
- Gaussian and Wishart priors can be used for the mean and covariance, respectively, of N(t).