Nonparametric Bayes Uncertainty Quantification

David Dunson, Department of Statistical Science, Duke University
Funded by NIH R01-ES017240, R01-ES017436 & ONR

- Review of Bayes
- Intro to nonparametric Bayes
- Density estimation
- More advanced applications

Bayesian modeling
- Provides a probabilistic framework for characterizing uncertainty
- Components of Bayesian inference:
- Prior distribution: a probability distribution characterizing uncertainty in the model parameters $\theta$
- Likelihood function: the sampling distribution characterizing uncertainty in the measurements conditionally on $\theta$
- Loss function: quantifies the price paid for errors in decisions

Components of Bayes
- The prior distribution $p(\theta)$ characterizes uncertainty in $\theta$ prior to observing the current data $y$
- $p(\theta)$ can be chosen as one's best guess based on previous experiments & knowledge in the field (subjective Bayes)
- Alternatively, use a default prior that is flat or non-informative in some sense (objective Bayes)
- The prior is then updated with the likelihood $p(y \mid \theta)$ via Bayes' theorem, $p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int_\Theta p(y \mid \tilde\theta)\,p(\tilde\theta)\,d\tilde\theta}$, with $\Theta$ the parameter space

Comments on the Posterior
- The posterior distribution $p(\theta \mid y)$ quantifies the current state of knowledge about $\theta$
- $p(\theta \mid y)$ is a full probability distribution - one can obtain a mean, covariance, intervals/regions characterizing uncertainty, etc.
- The process of updating the prior with the likelihood to obtain the posterior is known as Bayesian updating
- This calculation can be challenging due to the normalizing constant in the denominator of $p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int_\Theta p(y \mid \tilde\theta)\,p(\tilde\theta)\,d\tilde\theta}$
- In conjugate models we can calculate $p(\theta \mid y)$ analytically, but in more complex models Monte Carlo & other methods are used

Simple example - estimating a population proportion
- Suppose $\theta \in (0, 1)$ is the population proportion of individuals with diabetes in the US
- A prior distribution for $\theta$ is some distribution that spreads probability across $(0, 1)$
- A very precise prior, corresponding to abundant prior knowledge, would be concentrated tightly in a small sub-interval of $(0, 1)$
- A vague prior may be distributed widely across $(0, 1)$ - e.g., a uniform distribution would be one choice

Some possible prior densities
[Figure: beta densities $p(\theta)$ plotted against $\theta$ on (0, 1) for beta(1,1), beta(1,11) & beta(2,10).]

Collecting data & calculating likelihood
- To update your prior knowledge & learn more about $\theta$, one can collect data $y = (y_1, \ldots, y_n)$ that relate to $\theta$
- The likelihood of the data is denoted $L(y; \theta)$
- For example, suppose $\theta$ is the population proportion with type II diabetes among 20-25 year olds
- A random sample of $n$ individuals is surveyed & asked their type II diabetes status, with $y_i = 1$ for individuals reporting disease & $y_i = 0$ otherwise
- The likelihood is then $L(y; \theta) = \prod_{i=1}^n \theta^{y_i}(1 - \theta)^{1 - y_i}$
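A minimal sketch of how this likelihood combines with a prior through Bayes' theorem, using a grid to approximate the normalizing constant. The function name `grid_posterior`, the grid size and the ten data values are illustrative assumptions, not taken from the slides; the prior is the uniform beta(1,1).

```python
def grid_posterior(y, prior_pdf, grid_size=1000):
    """Approximate p(theta | y) on a midpoint grid over (0, 1) for Bernoulli data y."""
    width = 1.0 / grid_size
    thetas = [(j + 0.5) * width for j in range(grid_size)]
    s, n = sum(y), len(y)
    # Unnormalized posterior: Bernoulli likelihood times prior at each grid point.
    unnorm = [t ** s * (1.0 - t) ** (n - s) * prior_pdf(t) for t in thetas]
    # Riemann-sum approximation of the normalizing constant (the denominator).
    evidence = sum(u * width for u in unnorm)
    density = [u / evidence for u in unnorm]
    return thetas, density, width

# Hypothetical survey responses: 3 of 10 individuals report disease.
y = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
thetas, density, width = grid_posterior(y, prior_pdf=lambda t: 1.0)
post_mean = sum(t * d * width for t, d in zip(thetas, density))
```

For these made-up data the exact posterior is beta(4, 8), so `post_mean` should land close to 1/3. A grid like this is only practical for very low-dimensional $\theta$.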

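Beta-binomial example
- The prior is $\pi(\theta) = B(a, b)^{-1}\theta^{a-1}(1 - \theta)^{b-1}$
- The likelihood is $L(y; \theta) = \prod_{i=1}^n \theta^{y_i}(1 - \theta)^{1 - y_i}$
- The posterior is then $p(\theta \mid y) = \frac{B(a, b)^{-1}\theta^{a-1}(1 - \theta)^{b-1}\prod_{i=1}^n \theta^{y_i}(1 - \theta)^{1 - y_i}}{\int_0^1 B(a, b)^{-1}\theta^{a-1}(1 - \theta)^{b-1}\prod_{i=1}^n \theta^{y_i}(1 - \theta)^{1 - y_i}\,d\theta} = c(a, b, y)\,\theta^{a + \sum_{i=1}^n y_i - 1}(1 - \theta)^{b + n - \sum_{i=1}^n y_i - 1}$, which is the $\text{beta}\big(a + n\bar{y},\ b + n(1 - \bar{y})\big)$ density
- Here $c(\cdot)$ denotes a function of its arguments but not of $\theta$
- Updating a beta prior with a Bernoulli likelihood leads to a beta posterior - we have conjugacy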
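The conjugate update above amounts to adding counts to the prior parameters. A minimal sketch, assuming a beta(2,10) prior and ten hypothetical responses (the function name `beta_bernoulli_update` and the Monte Carlo interval are illustrative choices):

```python
import random

def beta_bernoulli_update(a, b, y):
    """Parameters of the beta posterior for Bernoulli data y under a beta(a, b) prior."""
    successes = sum(y)
    return a + successes, b + len(y) - successes

y = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]            # hypothetical data, not from the slides
a_post, b_post = beta_bernoulli_update(2, 10, y)
post_mean = a_post / (a_post + b_post)

# Monte Carlo 95% equal-tailed credible interval from posterior draws.
rng = random.Random(0)
draws = sorted(rng.betavariate(a_post, b_post) for _ in range(20000))
interval = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])
```

Because the posterior is available in closed form, the interval could also be read off beta quantiles; sampling is used here only to avoid a dependency on a statistics library.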

Generalizing to more interesting settings
- Although this is a super simple example, the same machinery of Bayes' rule can be used in much more complex settings
- In particular, the unknown parameters $\theta$ in the model can conceptually be essentially anything
- $\theta$ can be a vector, matrix, tensor, a function or surface or shape, an unknown density, missing data, etc.
- Generalizing Bayesian inference to such complex settings can be challenging, but there is a rich literature to leverage
- Today I'll be focusing on introducing nonparametric Bayes approaches & some corresponding tools

What is nonparametric Bayes?
- Nonparametric (np) Bayes is fundamentally different from classical nonparametric methods, particularly those based on ranks, etc.
- One defines a fully generative probabilistic model for the available data
- However, unlike in parametric models, some of the model unknowns are infinite-dimensional
- For example, there may be unknown functions or densities involved in the model
- Also, np Bayes is defined by a large support property guaranteeing flexibility

Simple example - Function estimation
- One of the canonical examples is the simple setting in which $y_i = \mu(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$
- $\mu : \mathcal{X} \to \mathbb{R}$ is an unknown function
- $x_i$ = one or more inputs for observation $i$
- $y_i$ = output for observation $i$
- $\epsilon_i$ = measurement error
- $\sigma^2$ = measurement error variance

Function estimation - continued
- In this model the unknowns are $\theta = \{\mu, \sigma^2\}$
- For the measurement error variance $\sigma^2$ the usual prior would be an inverse-gamma distribution
- However, $\mu(\cdot)$ is an infinite-dimensional unknown, being defined at every point in $\mathcal{X}$
- The likelihood function is simply $\prod_{i=1}^n (2\pi\sigma^2)^{-1/2}\exp\left[-\frac{1}{2\sigma^2}\{y_i - \mu(x_i)\}^2\right]$, which is parametric
- The nonparametric part is $\mu$ - how to choose an np Bayes prior?

What is np Bayes?
- From a nonparametric Bayes perspective, there are several key properties for the prior
- Large support: the prior can generate functions arbitrarily close to any function in a large class
- Interpretability: the prior shouldn't be entirely a black box; one should be able to put in prior beliefs about where & how much the prior is concentrated
- Computability: unless we can conduct posterior computation (at least approximately) without too much headache, it's not very useful
- There is also an increasing literature showing more involved properties - e.g., posterior consistency, minimax optimal rates, etc.

Some examples
- Motivated by these properties & general practical performance, there are two canonical priors that are most broadly used in Bayesian nonparametrics
- Gaussian processes (GPs): provide a broadly useful prior for unknown functions & surfaces (e.g., $\mu$ in the above example)
- Dirichlet processes (DPs): a prior for discrete random measures, providing a building block for priors for densities, clustering, etc.
- There is also a rich literature proposing generalizations such as beta processes, kernel stick-breaking processes, etc.

Bayes density estimation
- The Dirichlet process was introduced by Ferguson (1973) as a nonparametric prior for an unknown distribution that satisfies the three desirable properties listed above
- Suppose we have the simple setting in which $y_i \sim f$, $i = 1, \ldots, n$
- $f$ is an unknown density on the real line $\mathbb{R}$
- Taking a Bayesian nonparametric UQ approach, we'd like to choose a prior for $f$ - how to do this?

Random probability measures (RPMs)
- In defining priors for unknown distributions, it is particularly convenient to work with probability measures
- A given distribution (e.g., N(0,1)) has a corresponding probability measure & samples from that distribution obey that probability law
- To allow uncertainty in whether samples are N(0,1) or come from some other unknown distribution, we can choose a random probability measure (RPM)
- Each realization from the RPM gives a different probability measure & hence a different distribution of the samples
- A parametric model would always yield distributions in a particular parametric class (e.g., Gaussian), while a nonparametric model can generate PMs close to any PM in a broad class

Probability Measures
- Let $(\Omega, \mathcal{B}, P)$ denote a probability space, with $\Omega$ the sample space, $\mathcal{B}$ the Borel $\sigma$-algebra of subsets of $\Omega$, and $P$ a probability measure
- For example, we may have $\Omega = \mathbb{R}$, the real line, with $P$ a probability measure corresponding to a density $f$ with respect to a reference measure $\mu$
- For continuous univariate densities, $\mu$ is the Lebesgue measure

Random Probability Measures (RPMs)
- Let $\mathcal{P}$ denote the set of all probability measures $P$ on $(\Omega, \mathcal{B})$
- To allow $P$ to be unknown, we let $P \sim \pi_{\mathcal{P}}$, where $\pi_{\mathcal{P}}$ is a prior over $\mathcal{P}$
- $P$ is then a random probability measure
- By allowing $P$ to be unknown, we automatically allow the corresponding distribution of the data to be unknown
- How to choose $\pi_{\mathcal{P}}$?

A simple motivating model - Bayesian histograms
- The goal is to obtain a Bayes estimate of the density $f$ with $y_i \sim f$
- From a frequentist perspective, a very common strategy is to rely on a simple histogram
- Assume for simplicity we have pre-specified knots $\xi = (\xi_0, \xi_1, \ldots, \xi_k)$, with $\xi_0 < \xi_1 < \cdots < \xi_{k-1} < \xi_k$ and $y_i \in [\xi_0, \xi_k]$
- The model for the density is $f(y) = \sum_{h=1}^k \frac{\pi_h\,1(\xi_{h-1} < y \le \xi_h)}{\xi_h - \xi_{h-1}}$, $y \in \mathbb{R}$, with $\pi = (\pi_1, \ldots, \pi_k)$ an unknown probability vector

Choosing priors for probability vectors
- In the fixed knot model, the only unknown parameters are the probability weights $\pi$ on the bins
- By choosing a prior for $\pi$ we induce a prior on the density $f$
- Earlier we used the beta prior for a single probability, but now we have a vector of probabilities
- We need to choose a probability distribution on the simplex
- A simple choice that provides a multivariate generalization of the beta is the Dirichlet distribution

Dirichlet Prior
- Assume a Dirichlet$(a_1, \ldots, a_k)$ prior for $\pi$, with density $\frac{\Gamma\left(\sum_{h=1}^k a_h\right)}{\prod_{h=1}^k \Gamma(a_h)} \prod_{h=1}^k \pi_h^{a_h - 1}$
- The hyperparameter vector can be re-expressed as $a = \alpha\pi_0$, where $E(\pi) = \pi_0 = \{a_1/\sum_h a_h, \ldots, a_k/\sum_h a_h\}$ is the prior mean and $\alpha$ is a scale (the prior sample size)
- An appealing aspect of histograms is that we can easily incorporate prior data & knowledge to elicit $\alpha$ and $\pi_0$
- Very simple & interpretable, & previous data often come in the form of counts in bins
- After choosing the prior, we use Bayesian updating to obtain the posterior distribution for $\pi$

Posterior distribution for bin probabilities
- The posterior distribution of $\pi$ is calculated as $p(\pi \mid y^n) \propto \prod_{h=1}^k \pi_h^{a_h - 1} \prod_{h=1}^k \prod_{i: y_i \in (\xi_{h-1}, \xi_h]} \frac{\pi_h}{\xi_h - \xi_{h-1}} \propto \prod_{h=1}^k \pi_h^{a_h + n_h - 1} \overset{D}{=} \text{Diri}(a_1 + n_1, \ldots, a_k + n_k)$, where $n_h = \sum_i 1(\xi_{h-1} < y_i \le \xi_h)$
- Hence, we have conjugacy and the posterior for the bin probabilities has a simple form
- We can easily sample from the Dirichlet to obtain realizations from the induced posterior for the density $f$
- These samples can be used to quantify uncertainty through point & interval estimates

Simulation Experiment
- To evaluate the Bayes histogram method, I simulated data from a mixture of two betas, $f(y) = 0.75\,\text{beta}(y; 1, 5) + 0.25\,\text{beta}(y; 20, 2)$
- $n = 100$ samples were obtained from this density
- Assuming data in [0,1] and choosing 10 equally-spaced knots, I applied the Bayes histogram approach
- The true density and Bayes posterior mean are plotted on the next slide
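A rough sketch of an experiment along these lines, using only the Python standard library. Several details are assumptions rather than the settings used for the slides: 10 equal-width bins on [0, 1], a Dirichlet(1, ..., 1) prior (so $\alpha = 10$ with uniform $\pi_0$), the seeds, and the helper names `sample_dirichlet` and `bayes_histogram`.

```python
import random

def sample_dirichlet(rng, a):
    """Draw from Dirichlet(a) by normalizing independent gamma variates."""
    g = [rng.gammavariate(a_h, 1.0) for a_h in a]
    total = sum(g)
    return [x / total for x in g]

def bayes_histogram(y, knots, a, n_draws=2000, seed=0):
    """Bin counts and posterior draws of the density height on each bin."""
    rng = random.Random(seed)
    k = len(knots) - 1
    counts = [0] * k
    for yi in y:
        for h in range(k):
            if knots[h] < yi <= knots[h + 1]:
                counts[h] += 1
                break
    post = [a_h + n_h for a_h, n_h in zip(a, counts)]     # conjugate Dirichlet update
    widths = [knots[h + 1] - knots[h] for h in range(k)]
    draws = []
    for _ in range(n_draws):
        pi = sample_dirichlet(rng, post)
        draws.append([p / w for p, w in zip(pi, widths)])  # f = pi_h / bin width
    return counts, draws

# Simulate n = 100 points from 0.75 beta(1, 5) + 0.25 beta(20, 2).
rng = random.Random(1)
y = [rng.betavariate(1, 5) if rng.random() < 0.75 else rng.betavariate(20, 2)
     for _ in range(100)]
knots = [h / 10 for h in range(11)]
a = [1.0] * 10                                            # assumed prior
counts, draws = bayes_histogram(y, knots, a)
post_mean = [sum(d[h] for d in draws) / len(draws) for h in range(10)]
```

Pointwise intervals for each bin height can be obtained by sorting the draws for that bin and taking empirical quantiles.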

Bayes Histogram Estimate for Simulation Example
[Figure: density plotted against y on [0, 1].]

Some Comments
- The procedure is really easy in that we have conjugacy
- Results are very sensitive to the knots & allowing unknown knots is computationally demanding
- Allows prior information to be included in frequentist histogram estimates easily
- Can we eliminate sensitivity to the choice of bins by thinking of an RPM characterization & defining a prior for all possible bins?

Dirichlet processes (Ferguson, 1973; 1974)
- Let $B_1, \ldots, B_k$ denote a partition of the sample space $\Omega$ - e.g., histogram bins
- Let $P$ be a random probability measure (RPM) that assigns probability to any subset $B \subset \Omega$
- Then we could let $\{P(B_1), \ldots, P(B_k)\} \sim \text{Diri}\big(\alpha P_0(B_1), \ldots, \alpha P_0(B_k)\big)$. (1)
- $P_0$ is a base probability measure providing an initial guess at $P$ & $\alpha$ is a prior concentration parameter

Dirichlet to Dirichlet process
- However, we don't want to be sensitive to the bins $B_1, \ldots, B_k$ or to $k$
- It would be really cool if there was an RPM $P$ such that (1) were satisfied for all $B_1, \ldots, B_k$ and all $k$
- This RPM does indeed exist (as shown by Ferguson) & corresponds to a Dirichlet process (DP)
- The DP has many wonderful properties making it widely used in practice
- For example, $E\{P(B)\} = P_0(B)$ and $V\{P(B)\} = \frac{P_0(B)\{1 - P_0(B)\}}{1 + \alpha}$ for all $B \in \mathcal{B}$, with $\mathcal{B}$ the Borel $\sigma$-algebra of subsets of $\Omega$
- Hence, the prior is centered on $P_0$ and $\alpha$ controls the variance
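As a short worked step, these two moments follow from applying (1) to the two-set partition $\{B, B^c\}$, for which the Dirichlet reduces to a beta distribution with parameters summing to $\alpha$:

```latex
\begin{align*}
P(B) &\sim \mathrm{Beta}\big\{\alpha P_0(B),\ \alpha\{1 - P_0(B)\}\big\}, \\
E\{P(B)\} &= \frac{\alpha P_0(B)}{\alpha P_0(B) + \alpha\{1 - P_0(B)\}} = P_0(B), \\
V\{P(B)\} &= \frac{\alpha P_0(B)\cdot\alpha\{1 - P_0(B)\}}{\alpha^2(\alpha + 1)}
           = \frac{P_0(B)\{1 - P_0(B)\}}{1 + \alpha}.
\end{align*}
```

The variance line uses the standard beta result $V = ab/\{(a+b)^2(a+b+1)\}$ with $a + b = \alpha$.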

Some basic properties of the DP
- Using the notation $P \sim DP(\alpha P_0)$, the DP prior has large support - assigning positive probability to arbitrarily small balls around any PM $Q$ over $\Omega$
- Also, if we let $y_i \overset{iid}{\sim} P$ then we obtain conjugacy, so that $(P \mid y_1, \ldots, y_n) \sim DP\big(\alpha P_0 + \sum_i \delta_{y_i}\big)$
- The posterior is also a DP, but updated to add the empirical measure $\sum_i \delta_{y_i}$ to the base measure $\alpha P_0$
- The updated precision is $\alpha + n$, so $\alpha$ is a prior sample size
- The posterior expectation of $P$ is $E\{P(B) \mid y^n\} = \left(\frac{\alpha}{\alpha + n}\right) P_0(B) + \left(\frac{n}{\alpha + n}\right) \frac{1}{n}\sum_i \delta_{y_i}(B)$
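The posterior expectation is a weighted average of the prior guess and the empirical proportion. A small sketch for an interval $B = (lo, hi]$, assuming a standard normal base measure $P_0$ and four made-up observations (both are illustrative assumptions, as is the function name `dp_posterior_mean`):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dp_posterior_mean(lo, hi, y, alpha):
    """E{P((lo, hi]) | y} under P ~ DP(alpha * P0), with P0 = N(0, 1) assumed."""
    n = len(y)
    p0 = normal_cdf(hi) - normal_cdf(lo)                # prior guess P0(B)
    empirical = sum(lo < yi <= hi for yi in y) / n      # (1/n) sum_i delta_{y_i}(B)
    return alpha / (alpha + n) * p0 + n / (alpha + n) * empirical

y = [0.1, 0.2, 2.5, -0.3]                               # hypothetical data
est = dp_posterior_mean(0.0, 1.0, y, alpha=1.0)
```

With $\alpha = 1$ and $n = 4$ the prior guess gets weight 0.2 and the data weight 0.8; letting $\alpha$ grow pulls the estimate back toward $P_0(B)$.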

DP to DPMs
- Realizations from the DP are almost surely discrete, having masses at the observed data points
- If we convert the probability measure to a cumulative distribution function, it will have jumps at all data points
- Not ideal for characterizing continuous data, but really good as a prior for a mixing measure
- In particular, instead of using the DP directly as a prior for (the probability measure corresponding to) the distribution that generated the data, let $f(y) = \int K(y; \theta)\,dP(\theta)$, $P \sim DP(\alpha P_0)$, (2)
- Here $K(\cdot; \theta)$ is a parametric density on $\Omega$ parameterized by $\theta$ & $P$ is an unknown mixing measure

Samples from the Dirichlet process with precision α
[Figure: four panels of DP samples, for α = 0.5, α = 1, α = 5 & α = 10.]

Dirichlet process mixtures (DPMs)
- DP mixtures provide an incredibly powerful & useful framework for UQ in a very broad variety of models
- Although expression (2) focuses on a simple setting involving univariate density estimation, we can take any parametric model & develop a more realistic & flexible model by DP mixing the whole model or one or more components
- This can allow for unknown residual densities that change in shape and variance with predictors, flexible modeling of large tabular data & many other settings
- At this point I'm going to provide some basic details on computation & inference in simple DPMs & then I'll provide a more involved example illustrating what is possible

Density estimation via mixtures
- Considering the DPM in expression (2) above, it seems we have an intractable integral to deal with
- However, Sethuraman (1994) provided a constructive representation of the DP showing: $P \sim DP(\alpha P_0) \Rightarrow P = \sum_{h=1}^\infty \pi_h \delta_{\theta_h}$, $\theta_h \overset{iid}{\sim} P_0$, with $\pi_h = V_h \prod_{l<h}(1 - V_l)$, $V_h \sim \text{Be}(1, \alpha)$
- This is referred to as the stick-breaking representation & implies that (2) can be expressed as $f(y) = \sum_{h=1}^\infty \pi_h K(y; \theta_h)$, (3)
- which is a discrete mixture model
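The stick-breaking construction is easy to simulate. The sketch below draws the first $k$ weights and evaluates a truncated version of (3) with a Gaussian kernel; the base measure $P_0 = N(0, 1)$, the kernel precision and the function names are assumptions chosen for illustration.

```python
import math
import random

def stick_breaking(alpha, k, rng):
    """First k weights pi_h = V_h * prod_{l<h} (1 - V_l), with V_h ~ Beta(1, alpha)."""
    weights, remaining = [], 1.0
    for _ in range(k):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)     # break off a fraction v of what is left
        remaining *= 1.0 - v
    return weights, remaining             # remaining = stick length not yet assigned

def mixture_density(y, weights, atoms, tau):
    """Evaluate sum_h pi_h N(y; theta_h, 1/tau) over the retained components."""
    norm = math.sqrt(tau / (2.0 * math.pi))
    return sum(w * norm * math.exp(-0.5 * tau * (y - m) ** 2)
               for w, m in zip(weights, atoms))

rng = random.Random(0)
weights, leftover = stick_breaking(alpha=1.0, k=50, rng=rng)
atoms = [rng.gauss(0.0, 1.0) for _ in weights]   # theta_h ~ P0 = N(0, 1), assumed
f0 = mixture_density(0.0, weights, atoms, tau=4.0)
```

Each call with a fresh random state gives one realization of the random density; repeating it shows how $\alpha$ controls the number of components with non-negligible weight.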

Approximating as finite mixtures
- Although truncation is not necessary when using slice sampling & other recent(ish) samplers, (3) can be approximated accurately as a finite mixture
- This is because the weights $\{\pi_h\}$ have a prior that strongly favors stochastic decreases in the index $h$
- After the first small number of weights, the prior is concentrated near zero for the later ones
- Hence, we can truncate the DP using $k$ components as an upper bound & the extra components will be effectively deleted
- We consider a motivating application to make these ideas concrete
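To see how quickly the weights die off, the prior expected mass left over after the first $k$ sticks has a closed form, since the $V_l$ are independent $\text{Be}(1, \alpha)$ with $E(1 - V_l) = \alpha/(1 + \alpha)$:

```latex
\begin{align*}
E\Big(1 - \sum_{h=1}^{k} \pi_h\Big)
  = E\Big\{\prod_{l=1}^{k} (1 - V_l)\Big\}
  = \Big(\frac{\alpha}{1 + \alpha}\Big)^{k}.
\end{align*}
```

For example, with $\alpha = 1$ and $k = 20$ the expected leftover mass is $2^{-20} \approx 9.5 \times 10^{-7}$; larger $\alpha$ calls for a larger truncation level.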

Gestational Length vs DDE (mg/l)
[Figure: gestational age at delivery plotted against DDE (mg/l).]

Gestational Length Densities within DDE Categories
[Figure: five panels showing the density of gestational length for DDE < 15 mg/l, [15,30), [30,45), [45,60) & DDE > 60.]

Finite mixture application
- We illustrate finite mixtures using a simple location mixture of Gaussians, $f(y) = \sum_{h=1}^k \pi_h N(y; \mu_h, \tau^{-1})$
- We apply the model to data on length of gestation
- Preterm birth is defined as delivery occurring prior to 37 weeks of completed gestation
- The cutoff is somewhat arbitrary & the shorter the length of gestation, the more adverse the associated health effects
- It is appealing to model the distribution of gestational age at delivery as unknown & then allow predictors to impact this distribution

Comments on Gestational Length Data
- Data are non-Gaussian with a left skew
- Not straightforward to transform the data to approximate normality
- A different transformation would be needed within each DDE category
- First question: how to characterize the distribution of gestational age at delivery without considering predictors?

Mixture Models
- Initially ignoring DDE
- Letting $y_i$ = gestational age at delivery for woman $i$, $f(y_i) = \int N(y_i; \mu, \sigma^2)\,dG(\mu, \sigma^2)$, where $G$ = mixture distribution for $\theta = (\mu, \sigma^2)$
- Mixtures of normals can approximate any smooth density
- A location Gaussian mixture with $k$ components is one possibility: $f(y_i) = \sum_{h=1}^k p_h N(y_i; \mu_h, \sigma^2)$

Mixture components for gestational age at delivery
[Figure: three panels showing density against gestational week of delivery: Component 1 (86% of deliveries), Component 2 (12% of deliveries) & Component 3 (2% of deliveries).]

Mixture-based density of gestational age at delivery
[Figure: density plotted against gestational week at delivery.]

Approximate DPM computation
- The finite mixture of normals can be equivalently expressed as $y_i \sim N(\mu_{S_i}, \tau_{S_i}^{-1})$, $S_i \sim \sum_{h=1}^k \pi_h \delta_h$
- $\delta_h$ = probability measure concentrated at the integer $h$
- $S_i \in \{1, \ldots, k\}$ indexes the mixture component for subject $i$
- A Bayesian specification is completed with priors, $\pi = (\pi_1, \ldots, \pi_k) \sim \text{Dirichlet}(a_1, \ldots, a_k)$ and $(\mu_h, \tau_h) \sim N(\mu_h; \mu_0, \kappa\tau_h^{-1})\,\text{Ga}(\tau_h; a_\tau, b_\tau)$, $h = 1, \ldots, k$
- Taking $a_h = \alpha/k$ approximates the Dirichlet process (as $k$ increases)

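Posterior Computation via Gibbs Sampling
1. Update $S_i$ from its multinomial conditional posterior with $\Pr(S_i = h \mid -) = \frac{\pi_h N(y_i; \mu_h, \tau_h^{-1})}{\sum_{l=1}^k \pi_l N(y_i; \mu_l, \tau_l^{-1})}$, $h = 1, \ldots, k$.
2. Update $(\mu_h, \tau_h)$ from its conditional posterior $(\mu_h, \tau_h \mid -) \sim N(\mu_h; \hat\mu_h, \hat\kappa_h\tau_h^{-1})\,\text{Ga}(\tau_h; \hat a_{\tau h}, \hat b_{\tau h})$, with $\hat\kappa_h = (\kappa^{-1} + n_h)^{-1}$, $\hat\mu_h = \hat\kappa_h(\kappa^{-1}\mu_0 + n_h\bar y_h)$, $\hat a_{\tau h} = a_\tau + n_h/2$, and $\hat b_{\tau h} = b_\tau + \frac{1}{2}\left\{\sum_{i: S_i = h}(y_i - \bar y_h)^2 + \left(\frac{n_h}{1 + \kappa n_h}\right)(\bar y_h - \mu_0)^2\right\}$, where $n_h$ and $\bar y_h$ are the count and mean of the observations currently allocated to component $h$.
3. Update $(\pi \mid -) \sim \text{Dir}(a_1 + n_1, \ldots, a_k + n_k)$.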
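The three steps can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code behind the slides: Ga$(a, b)$ is read as shape $a$ and rate $b$, allocation probabilities are computed on the log scale for numerical stability, and the hyperparameter defaults, starting values and simulated data are made up.

```python
import math
import random

def gibbs_finite_mixture(y, k=10, alpha=1.0, mu0=0.0, kappa=1.0,
                         a_tau=1.0, b_tau=1.0, n_iter=500, seed=0):
    """Gibbs sampler for y_i ~ N(mu_{S_i}, 1/tau_{S_i}) with Dirichlet(alpha/k) weights."""
    rng = random.Random(seed)
    pi = [1.0 / k] * k
    mu = [rng.choice(y) for _ in range(k)]      # assumed starting values
    tau = [1.0] * k
    samples = []
    for _ in range(n_iter):
        # Step 1: allocate each observation to a component.
        S = []
        for yi in y:
            logp = [math.log(max(pi[h], 1e-300)) + 0.5 * math.log(tau[h])
                    - 0.5 * tau[h] * (yi - mu[h]) ** 2 for h in range(k)]
            m = max(logp)
            probs = [math.exp(lp - m) for lp in logp]
            S.append(rng.choices(range(k), weights=probs)[0])
        # Step 2: update (mu_h, tau_h) from the normal-gamma conditional.
        counts = [0] * k
        for h in range(k):
            yh = [yi for yi, s in zip(y, S) if s == h]
            n_h = counts[h] = len(yh)
            ybar = sum(yh) / n_h if n_h else 0.0
            kappa_hat = 1.0 / (1.0 / kappa + n_h)
            mu_hat = kappa_hat * (mu0 / kappa + n_h * ybar)
            a_hat = a_tau + n_h / 2.0
            b_hat = b_tau + 0.5 * (sum((yi - ybar) ** 2 for yi in yh)
                                   + n_h / (1.0 + kappa * n_h) * (ybar - mu0) ** 2)
            tau[h] = rng.gammavariate(a_hat, 1.0 / b_hat)   # scale = 1 / rate
            mu[h] = rng.gauss(mu_hat, math.sqrt(kappa_hat / tau[h]))
        # Step 3: update the weights via normalized gamma draws (a Dirichlet draw).
        g = [rng.gammavariate(alpha / k + counts[h], 1.0) for h in range(k)]
        total = sum(g)
        pi = [x / total for x in g]
        samples.append((list(pi), list(mu), list(tau)))
    return samples

# Hypothetical data loosely resembling gestational ages in weeks.
rng = random.Random(1)
y = [rng.gauss(39.5, 1.0) if rng.random() < 0.85 else rng.gauss(35.0, 2.0)
     for _ in range(150)]
samples = gibbs_finite_mixture(y, k=5, mu0=sum(y) / len(y), kappa=10.0, n_iter=200)
```

Empty components are simply refreshed from the prior in step 2, which is how the extra components of the truncation are "effectively deleted".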

Some Comments
- The Gibbs sampler is trivial to implement
- Discarding a burn-in, monitor $f(y) = \sum_{h=1}^k \pi_h N(y; \mu_h, \tau_h^{-1})$ for a large number of iterations & a dense grid of $y$ values
- The Bayes estimate of $f(y)$ under squared error loss averages the samples
- Can also obtain 95% pointwise intervals for the unknown density
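A sketch of this post-processing step: given saved draws of $(\pi, \mu, \tau)$, evaluate the mixture density on a grid and summarize pointwise. The draws below are fabricated stand-ins for sampler output, purely so the example runs on its own; the function name and burn-in length are likewise assumptions.

```python
import math
import random

def density_summary(samples, grid, burn_in=100):
    """Pointwise posterior mean and 95% interval of f(y) from saved (pi, mu, tau) draws."""
    kept = samples[burn_in:]
    summary = []
    for y0 in grid:
        vals = sorted(
            sum(p * math.sqrt(t / (2.0 * math.pi)) * math.exp(-0.5 * t * (y0 - m) ** 2)
                for p, m, t in zip(pi, mu, tau))
            for pi, mu, tau in kept)
        mean = sum(vals) / len(vals)                  # Bayes estimate under squared error loss
        lo = vals[int(0.025 * (len(vals) - 1))]       # empirical 2.5% quantile
        hi = vals[int(0.975 * (len(vals) - 1))]       # empirical 97.5% quantile
        summary.append((y0, mean, lo, hi))
    return summary

# Stand-in draws (NOT real MCMC output): two components with jittered means.
rng = random.Random(0)
samples = [([0.7, 0.3], [rng.gauss(0.0, 0.1), rng.gauss(3.0, 0.1)], [1.0, 1.0])
           for _ in range(400)]
grid = [-3.0 + 0.5 * j for j in range(19)]            # grid from -3 to 6
summary = density_summary(samples, grid)
```

The intervals are pointwise, so they describe uncertainty in $f(y)$ at each grid value separately rather than a simultaneous band.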

Nonparametric residual densities
- The above focus was on univariate densities, but the approach is very easy to generalize to other settings
- For example, suppose we have the nonparametric regression model $y_i = \mu(x_i) + \epsilon_i$, $\epsilon_i \sim f$
- Then we can use a DPM to allow the residual density to have an unknown form - allowing possible skewness & multimodality
- Due to the modular nature of MCMC, the computational steps for the DPM part of the model are essentially no different from those described above

Nonparametric hierarchical models
- In hierarchical probability models, random effects are often assigned Gaussian distributions
- This doesn't account for uncertainty in the random effect distributions
- We can use a DPM to allow unknown distributions of random effects
- Such a direction can be applied well beyond settings involving univariate continuous responses

Probabilistic tensor factorizations
- We instead propose a low rank & sparse factorization of $\text{pr}(y_{i1} = c_1, \ldots, y_{ip} = c_p) = \pi_{c_1 \cdots c_p}$, with these probabilities forming an array $\Pi = \{\pi_{c_1 \cdots c_p}\}$
- Express the array as a weighted average of rank-one arrays, $\pi_{c_1 \cdots c_p} = \sum_{h=1}^k \nu_h \prod_{j=1}^p \lambda^{(j)}_{h c_j}$, with appropriate priors on the components
- Leads to a simple & scalable approach - allowing uncertainty in the rank & other parameters in the factorization

High-dimensional contingency table analysis
- One application is nonparametric modeling of a high-dimensional pmf for categorical data $y_i = (y_{i1}, \ldots, y_{ip})$
- Many different categorical items are recorded for the same subjects
- We have unknown dependence in the occurrences of these different categorical items
- We could use a log-linear model, but flexible log-linear models aren't scalable beyond small $p$

Probabilistic Parafac
- Considering the Parafac factorization, $\pi_{c_1 \cdots c_p} = \sum_{h=1}^k \nu_h \prod_{j=1}^p \lambda^{(j)}_{h c_j}$, we can place a stick-breaking prior on the weights $\{\nu_h\}$
- Essentially a DPM of product multinomial distributions - automatically infers the rank of the tensor
- Leads to a simple Gibbs sampling algorithm
- Full probabilistic characterization of uncertainty in the elements of $\pi$
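To make the factorization concrete, the sketch below builds the full probability array from given weights $\nu_h$ and component pmfs $\lambda^{(j)}_h$ for a toy case with $p = 3$ binary items and $k = 2$ components. All numbers and the function name `parafac_pmf` are illustrative assumptions; in practice the array would never be enumerated for large $p$.

```python
import itertools

def parafac_pmf(nu, lam):
    """Build pi[c_1..c_p] = sum_h nu_h * prod_j lam[j][h][c_j] as a dict over cells."""
    p = len(lam)
    levels = [len(lam[j][0]) for j in range(p)]   # number of categories per item
    pmf = {}
    for cell in itertools.product(*(range(d) for d in levels)):
        total = 0.0
        for h, nu_h in enumerate(nu):
            term = nu_h
            for j, c in enumerate(cell):
                term *= lam[j][h][c]              # rank-one contribution of component h
            total += term
        pmf[cell] = total
    return pmf

nu = [0.6, 0.4]                                   # component weights (sum to 1)
lam = [                                           # lam[j][h] = pmf of item j in component h
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.5, 0.5], [0.1, 0.9]],
]
pmf = parafac_pmf(nu, lam)
```

Within a component the items are independent; dependence between items in `pmf` arises only from mixing over components.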

Beyond DPMs
- There are also rich classes of models that go well beyond DPMs
- For example, in the premature delivery application from above & many other settings we may want to model $f(y \mid x)$
- This conditional response density may not factorize as $\mu(x) + \epsilon$ with $\epsilon$ iid
- Instead we may need to characterize flexible changes in mean, variance & shape of the density with predictors

Gestational Length vs DDE (mg/l)
[Figure: gestational age at delivery plotted against DDE (mg/l).]

Predictor-dependent RPMs
- One highly flexible method to characterize conditional densities lets $f(y \mid x) = \int N\{y; \mu(x; \theta), \tau^{-1}\}\,dG_x(\theta, \tau)$
- $\mu(x; \theta)$ is a regression function parameterized by $\theta$
- $G_{\mathcal{X}} = \{G_x, x \in \mathcal{X}\}$ is a mixing measure varying with predictors
- Even if $\mu$ is linear, we obtain a fully flexible specification with a flexible enough prior for $G_{\mathcal{X}}$

Kernel stick-breaking
- The kernel stick-breaking process is one popular prior for $G_{\mathcal{X}}$
- Generalizes the stick-breaking representation of the DP to include predictor dependence in the weights
- Leads to consistent estimation of the conditional density function & computation is essentially as easy as for a DPM

DDE & Gestational Age at Delivery Application
- We focus on the mixture model $f(y_i \mid x_i) = \int N(y_i; x_i'\beta_i, \sigma_i^2)\,dG_{x_i}(\beta_i, \sigma_i^2)$
- $y_i$ = gestational age at delivery in days, $x_i$ = DDE (mg/l)
- The collection of mixture distributions, $G_{\mathcal{X}} = \{G_x : x \in \mathcal{X}\}$, is assigned an adaptive kernel mixture of DPs prior

Posterior Simulation Overview
- MCMC algorithm run for 10,000 iterations
- Convergence and mixing were good
- Fit was excellent based on a pivotal statistic-based approach (Johnson, 2006)
- Can estimate the gestational age density for any dose of DDE & dose response curves for any quantile

Results: DDE & Gestational Age at Delivery
[Figure: gestational age at delivery plotted against DDE (mg/l).]

Results: DDE & Gestational Age at Delivery
[Figure: five panels showing $f(y \mid x)$ against gestational length at the 10th (12.57), 30th (18.69), 60th (28.44), 90th (53.72) & 99th (105.48) percentiles of DDE.]

Results: DDE & Gestational Age at Delivery
[Figure: four panels showing Pr(Gestational length < 33), Pr(Gestational length < 35), Pr(Gestational length < 37) & Pr(Gestational length < 40) against DDE (mg/l).]

Notes
- Have provided a very brief overview of nonparametric Bayes with an eye towards UQ
- Focus on Dirichlet process mixtures, their applications & closely related models
- Gaussian processes are at least as useful in UQ, if not more so - they provide another np Bayes prior
- I assume O'Hagan will focus on the GP in his lectures, so I purposely avoided it

Some References
- Dunson & Park (2008) Kernel stick-breaking processes. Biometrika, 95, 307-323.
- Dunson & Xing (2009) Nonparametric Bayes modeling of multivariate categorical data. JASA, 104, 1042-1051.
- Gelman et al. (2013) Bayesian Data Analysis (BDA3). CRC Press. Contains chapters on np Bayes.