Bayesian Nonparametric Autoregressive Models via Latent Variable Representation


Bayesian Nonparametric Autoregressive Models via Latent Variable Representation
Maria De Iorio, Yale-NUS College and Dept of Statistical Science, University College London
Collaborators: Lifeng Ye (UCL, London, UK), Alessandra Guglielmi (Polimi, Milano, Italy), Stefano Favaro (Università degli Studi di Torino, Italy)
29th August 2018, Bayesian Computations for High-Dimensional Statistical Models

Motivation
[Figure: example of time-evolving data at T = 1, T = 2, T = 3]

Want a flexible model for time-evolving distributions:
data-driven clustering
allowing for covariates
prediction
feasible posterior inference
a framework for automatic information sharing across time
if possible, easily generalizable tools for efficient sharing of information across data dimensions (e.g. space)

Nonparametric Mixture
At time t, data y_{1t}, ..., y_{nt} are iid:
y_{it} ~ f_t(y) = ∫ K(y | θ) dG_t(θ)
where G_t is the mixing distribution at time t. Assign a flexible prior to G_t, e.g. DP, PT, Pitman-Yor, ... (Bayesian nonparametric mixture models).

Dirichlet Process (DP)
Probability model on distributions: G ~ DP(α, G_0), with base measure G_0 = E(G) and precision parameter α. G is a.s. discrete.

Sethuraman's stick-breaking representation
G = Σ_{h=1}^∞ w_h δ_{θ_h}
ξ_h ~ Beta(1, α),  w_h = ξ_h ∏_{i=1}^{h-1} (1 − ξ_i)   (scaled Beta distribution)
θ_h iid ~ G_0,  h = 1, 2, ...
where δ_x denotes a point mass at x, and the w_h are the weights of the point masses at locations θ_h.
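To make the stick-breaking construction concrete, here is a minimal Python sketch (mine, not from the talk) that draws one approximate realisation of G ~ DP(α, G_0) by truncating the infinite sum at H atoms; the standard normal base measure G_0 and the truncation level are illustrative choices.

```python
# Minimal sketch: one approximate draw of G ~ DP(alpha, G0) via truncated
# stick breaking, with G0 = N(0, 1) purely for illustration.
import numpy as np

def stick_breaking_dp(alpha, H=1000, rng=None):
    rng = np.random.default_rng(rng)
    xi = rng.beta(1.0, alpha, size=H)                 # xi_h ~ Beta(1, alpha)
    w = xi * np.cumprod(np.concatenate(([1.0], 1.0 - xi[:-1])))  # w_h = xi_h * prod_{i<h}(1 - xi_i)
    theta = rng.normal(0.0, 1.0, size=H)              # theta_h iid from G0
    return w, theta                                    # G ≈ sum_h w_h * delta_{theta_h}

w, theta = stick_breaking_dp(alpha=2.0)
print(w.sum())   # close to 1 for a large enough truncation level H
```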

Dirichlet Process Mixtures (DPM)
In many data analysis applications the discreteness is inappropriate. To remove discreteness, convolve with a continuous kernel:
f(y) = ∫ p(y | θ) dG(θ),  G ~ DP(α, G_0)

Dirichlet Process Mixtures (DPM)
... or, equivalently, with latent variables:
G ~ DP(α, G_0),  θ_i | G iid ~ G,  y_i | θ_i ~ p(y | θ_i)
Nice feature: the mixing distribution is discrete with probability one, and with small α there can be high probability of effectively a finite mixture. Often p(y | θ) = N(β, σ²), so that f(y) = Σ_{h=1}^∞ w_h N(β_h, σ²).

Comment
Under G: P(θ_i = θ_j) > 0 for i ≠ j.
Observations that share the same θ belong to the same cluster, so the DPM induces a random partition of the observations {1, ..., n}.
Note: in the previous model the clustering of observations depends only on the distribution of y.

Models for collections of distributions
Observations are associated with different temporal coordinates:
y_t ~ind f_t(y) = ∫ K(y | θ) dG_t(θ)
Recent research focuses on developing models for a collection of random distributions {G_t : t ∈ T}. Interest: T a discrete set. Goal: induce dependence. Reason: the properties of the distributions f_t are thought to be similar in some way, e.g. similar means, similar tail behaviour, small distance between them.

Dependent Dirichlet Process
If G ~ DP(α, G_0), then using the constructive definition of the DP (Sethuraman 1994)
G_t = Σ_{k=1}^∞ w_{tk} δ_{θ_{tk}}
where the (θ_{tk}) are iid from some G_0(θ) and the (w_{tk}) follow a stick-breaking process:
w_{tk} = z_{tk} ∏_{j<k} (1 − z_{tj}),  z_{tj} ~ Beta(1, α_t)
Dependence has been introduced mostly in the regression context, conditioning on the level of covariates x.

Dependence Structure
Introduce dependence through the base distributions G_{0t} of conditionally independent nonparametric priors G_t: simple but restrictive, with dependence only in the atoms of the G_t; efficient computations but not a very flexible approach.
Dependence structure in the weights: complex and inefficient computational algorithms, limiting the applicability of the models.
An alternative is to assume G_t = π_t G + (1 − π_t) G*_t, mixing a common component G with an idiosyncratic component G*_t.

Dependence through Weights
flexible strategy
random probability measures share the same atoms
under certain conditions we can approximate any density with any set of atoms
varying the weights can produce probability measures that are very close (similar weights) or far apart

Temporal Dependence
Literature: Griffin and Steel (2006), Caron, Davy and Doucet (2007), Rodriguez and ter Horst (2008), Rodriguez and Dunson (2009), Griffin and Steel (2009), Mena and Walker (2009), Nieto-Barajas et al. (2012), Di Lucca et al. (2013), Bassetti et al. (2014), Xiao et al. (2015), Gutiérrez, Mena and Ruggiero (2016), ...
There exist many related constructions for dependent distributions defined through the Poisson-Dirichlet process (e.g. Leisen and Lijoi (2011), Zhu and Leisen (2015)) or correlated normalized completely random measures (e.g. Griffin et al. 2013, Lijoi et al. 2014). A related approach: covariate-dependent random partition models.
Idea: a dynamic DP extension can be developed by introducing temporal dependence in the weights through a transformation of the Beta random variables, while specifying common atoms across the G_t.

Related Approaches
G_t = Σ_{k=1}^∞ w_{tk} δ_{θ_k},  w_{tk} = ξ_{tk} ∏_{i=1}^{k-1} (1 − ξ_{ti}),  ξ_{tk} ~ Beta(1, α)
BAR stick-breaking process (Taddy 2010): defines an evolution equation for the sticks,
ξ_t = (u_t v_t) ξ_{t−1} + (1 − u_t) Beta(1, α),  u_t ~ Beta(α, 1 − ρ),  v_t ~ Beta(ρ, 1 − ρ),  0 < ρ < 1, with u_t and v_t independent.
DP marginal; corr(ξ_t, ξ_{t−k}) = [ρα/(1 + α − ρ)]^k > 0.
Prior simulations show that the number of clusters and the number of singletons at time t = 1 differ from those at other times. Very different clusterings can correspond to relatively similar predictive distributions.

Latent Gaussian time series (DeYoreo and Kottas 2018):
ξ_t = exp{ −(ζ² + η_t²) / (2α) } ~ Beta(α, 1)
ζ ~ N(0, 1),  η_t | η_{t−1}, φ ~ N(φ η_{t−1}, 1 − φ²),  |φ| < 1
so η_t is an AR(1) process, and
w_{t1} = 1 − ξ_{t1},  w_{tk} = (1 − ξ_{tk}) ∏_{i=1}^{k−1} ξ_{ti}
For α ≥ 1 this implies a 0.5 lower bound on corr(ξ_t, ξ_{t−k}); as a result, corresponding sticks (for example, the first at time 1 and the first at time 2) always have correlation above 0.5. The AR component η is squared, so the correlation between different times is always positive. These problems can be overcome by assuming time-varying locations.
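As a quick numerical check of this construction (my own sketch, assuming the reconstruction of the formula above), one can simulate ζ and the AR(1) process η_t and verify that ξ_t = exp{−(ζ² + η_t²)/(2α)} has the Beta(α, 1) moments.

```python
# Minimal sketch: simulate the latent-Gaussian construction and compare the
# empirical moments of xi_t with those of a Beta(alpha, 1) distribution.
# alpha, phi, T and the sample size are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
alpha, phi, T, n = 2.0, 0.8, 50, 100000

zeta = rng.normal(size=n)                      # zeta ~ N(0, 1), shared across time
eta = rng.normal(size=n)                       # eta_1 ~ N(0, 1)
for t in range(1, T):                          # eta_t | eta_{t-1} ~ N(phi*eta_{t-1}, 1 - phi^2)
    eta = phi * eta + np.sqrt(1 - phi**2) * rng.normal(size=n)

xi = np.exp(-(zeta**2 + eta**2) / (2 * alpha))
print(xi.mean(), alpha / (alpha + 1))                       # Beta(alpha, 1) mean
print(xi.var(), alpha / ((alpha + 1) ** 2 * (alpha + 2)))   # Beta(alpha, 1) variance
```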

Latent Autoregressive Process
Want: an autoregressive model for a collection of random distributions G_t. Goal: to obtain flexible time-dependent clustering.
We borrow ideas from the copula literature, in particular from marginal Beta regression (Guolo & Varin 2014).
Recall the probability integral transformation: if ε ~ N(0, 1) and F is the cdf of a Beta distribution, then Y = F^{-1}(Φ(ε); a, b) is marginally distributed as a Beta with parameters (a, b).
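A one-line check of this transformation in Python (a sketch; the parameter values are illustrative):

```python
# Probability integral transform: eps ~ N(0,1) mapped through Phi and then the
# inverse Beta(a, b) cdf is marginally Beta(a, b). Parameter values are illustrative.
import numpy as np
from scipy.stats import norm, beta, kstest

rng = np.random.default_rng(1)
a, b = 1.0, 3.0
eps = rng.normal(size=50000)
y = beta(a, b).ppf(norm.cdf(eps))   # Y = F^{-1}(Phi(eps); a, b)
print(kstest(y, beta(a, b).cdf))    # large p-value: Y is compatible with Beta(a, b)
```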

AR process: ARDP
ε_1 ~ N(0, 1),  ε_t = ψ ε_{t−1} + η_t for t > 1,  η_t iid ~ N(0, 1 − ψ²)
Stick-breaking construction:
{ε_{tk}, t = 1, 2, ...} iid AR(1) processes, k = 1, 2, ...
ξ_{tk} = F_t^{-1}(Φ(ε_{tk}); a_t, b_t)
w_{tk} = ξ_{tk} ∏_{l<k} (1 − ξ_{tl})
θ_k iid ~ G_0
G_t = Σ_{k=1}^∞ w_{tk} δ_{θ_k}
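A short simulation sketch of this construction (mine, not the authors' code), under the DP marginal choice a_t = 1, b_t = α, with a truncation at H sticks and an illustrative G_0 = N(0, 1):

```python
# Sketch: simulate ARDP weights. One latent AR(1) Gaussian process per stick k,
# mapped to Beta(1, alpha) sticks by the probability integral transform; atoms
# are common across time. H, T and all parameter values are illustrative.
import numpy as np
from scipy.stats import norm, beta

def simulate_ardp(T=5, alpha=1.0, psi=0.8, H=200, rng=None):
    rng = np.random.default_rng(rng)
    eps = np.empty((T, H))
    eps[0] = rng.normal(size=H)                       # stationary N(0, 1) start
    for t in range(1, T):                             # eps_t = psi*eps_{t-1} + N(0, 1 - psi^2)
        eps[t] = psi * eps[t - 1] + np.sqrt(1 - psi**2) * rng.normal(size=H)
    xi = beta(1.0, alpha).ppf(norm.cdf(eps))          # xi_{tk} = F^{-1}(Phi(eps_{tk}); 1, alpha)
    w = xi * np.cumprod(np.hstack([np.ones((T, 1)), 1 - xi[:, :-1]]), axis=1)
    theta = rng.normal(0.0, 1.0, size=H)              # common atoms from G0
    return w, theta                                    # G_t ≈ sum_k w[t, k] * delta_{theta_k}

w, theta = simulate_ardp()
print(w.sum(axis=1))   # each row close to 1 for a large truncation level H
```

Since the atoms θ_k are shared across time and only the weights evolve, consecutive G_t place mass on the same support, with slowly varying weights when ψ is close to 1.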

Comments
Easy to generate {G_t, t = 1, ...}.
The marginal distribution of ε_t is N(0, 1), and therefore the marginal distribution of ξ_t is Beta with the desired parameters. If a_t = 1 and b_t = α, then marginally each G_t is DP(α, G_0).
The {ξ_t} inherit the Markov (AR(1)) structure of the {ε_t} process, and therefore {G_t, t = 1, 2, ...} is also AR(1).
Easy to derive the k-step-ahead predictive densities for G_{t+k}, since it is easy to derive the weights.

We can derive the evolution of the weights through time.
The conditional distribution of ξ_t given ξ_{t−1}:
L(ξ_t | ξ_{t−1}) =_d 1 − (1 − Φ(Z))^{1/α},  Z ~ N( ψ Φ^{-1}(1 − (1 − ξ_{t−1})^α), 1 − ψ² )
The conditional law of ξ_t given ε_{t−1}:
L{ξ_t | ε_{t−1}} = L{ F_t^{-1}(Φ(ε_t); a, b) | ε_{t−1} },  ε_t | ε_{t−1} ~ N(ψ ε_{t−1}, 1 − ψ²)
P(ξ_t ≤ x | ε_{t−1}) = Φ( [Φ^{-1}(1 − (1 − x)^α) − ψ ε_{t−1}] / √(1 − ψ²) )
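These closed forms can be checked by simulation; a small sketch (illustrative parameter values, with a_t = 1, b_t = α) comparing a Monte Carlo estimate of P(ξ_t ≤ x | ε_{t−1}) with the expression above:

```python
# Numerical sanity check of the conditional cdf: simulate eps_t | eps_{t-1},
# set xi_t = 1 - (1 - Phi(eps_t))^{1/alpha}, and compare with the closed form.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
alpha, psi, eps_prev, x = 1.5, 0.7, 0.3, 0.4

eps_t = psi * eps_prev + np.sqrt(1 - psi**2) * rng.normal(size=200000)
xi_t = 1.0 - (1.0 - norm.cdf(eps_t)) ** (1.0 / alpha)    # Beta(1, alpha) quantile of Phi(eps_t)
mc = (xi_t <= x).mean()
closed = norm.cdf((norm.ppf(1 - (1 - x) ** alpha) - psi * eps_prev) / np.sqrt(1 - psi**2))
print(mc, closed)    # the two numbers should agree up to Monte Carlo error
```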

[Figure: conditional density L(ξ_2 | ξ_1) for ψ = 0, 0.9, −0.5, with ξ_1 = 0.5 (left panel) and ξ_1 = 0.9 (right panel), α = 1.]
Ability to accommodate a variety of distributional shapes over the unit interval.

Posterior Inference
MCMC based on truncation of the DP to L components. The blocked Gibbs sampler (Ishwaran and James 2001) is extended to include a particle MCMC update (Andrieu et al. 2010) for the weights of the DP.
Let s_t and w_t be the allocation vector and the weight vector at time t, respectively. We need to sample from p(w_1, ..., w_T | s_1, ..., s_T, ψ).
Standard algorithms such as the FFBS are not applicable since the one-step predictive distributions p_ψ(w_t | w_{t−1}) cannot be derived analytically. We employ PMCMC, using the prior to generate samples of ε_t, and exploit SMC approximations:
L_ψ(w_t | w_{t−1}) = L_ψ(ε_t | ε_{t−1})
p_ψ(w_1, ..., w_T | s_1, ..., s_T) ≈ Σ_{r=1}^R ω_T^r δ_{ε^r_{1:T}}
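The following is a hedged sketch of the SMC step (my own illustration under simplifying assumptions, not the authors' implementation): particles for the latent Gaussians ε_t are propagated from the AR(1) prior, reweighted by the probability the implied stick-breaking weights assign to the observed allocations s_t, and resampled; the running estimate of the normalising constant is what a PMCMC acceptance step would use.

```python
# Sketch of an SMC approximation to p(w_{1:T} | s_{1:T}, psi) for a truncated ARDP.
# Allocations are treated as conditionally iid categorical given the weights.
import numpy as np
from scipy.stats import norm, beta

def sticks_to_weights(xi):
    # xi: (R, H) sticks -> (R, H) stick-breaking weights
    return xi * np.cumprod(np.hstack([np.ones((xi.shape[0], 1)), 1 - xi[:, :-1]]), axis=1)

def smc_weights(s, alpha=1.0, psi=0.8, H=20, R=500, rng=None):
    """s: list of length T with integer allocation vectors (values in 0..H-1)."""
    rng = np.random.default_rng(rng)
    eps = rng.normal(size=(R, H))                       # t = 1: prior N(0, 1)
    logZ = 0.0
    for t in range(len(s)):
        if t > 0:                                       # propagate from the AR(1) prior
            eps = psi * eps + np.sqrt(1 - psi**2) * rng.normal(size=(R, H))
        xi = beta(1.0, alpha).ppf(norm.cdf(eps))        # sticks for each particle
        w = sticks_to_weights(xi)
        logw = np.log(w[:, s[t]] + 1e-300).sum(axis=1)  # p(s_t | w_t) = prod_i w_{t, s_it}
        logZ += np.log(np.exp(logw - logw.max()).mean()) + logw.max()
        probs = np.exp(logw - logw.max()); probs /= probs.sum()
        idx = rng.choice(R, size=R, p=probs)            # multinomial resampling
        eps = eps[idx]
    return eps, logZ                                     # particles at time T, log-evidence estimate

# toy usage: 3 time points, 10 observations each, allocations among the first 3 components
s = [np.array([0, 0, 1, 1, 2, 0, 1, 2, 0, 1]) for _ in range(3)]
eps_T, logZ = smc_weights(s)
```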

Simulations
T = 4, n = 100, independent time points, y_it normal.
T = 4, n = 100. At t = 1 data are generated from 2 clusters of equal size. For t = 2, 3, 4, individuals remain in the same cluster and the distribution of each cluster remains the same over time.
T = 4, n = 100. At t = 1 data are generated from 2 clusters of equal size. For t = 2, 3, 4, each individual remains in the same cluster as at t − 1 with probability 50%, and switches to the other cluster with probability 50%. Note that the distribution of each cluster changes over time.

Disease Mapping
Disease incidence or mortality data are typically available as rates or counts for specified regions, collected over time.
Data: breast cancer incidence for 100 MSAs (Metropolitan Statistical Areas) from United States Cancer Statistics: 1999-2014 Incidence, WONDER Online Database. We use the data from year 2004 to year 2014. Population data for years 2000 and 2010 are obtained from the U.S. Census Bureau, Population Division; the population for the remaining years is estimated by linear regression.
Primary goals of the analysis: identification of spatial and spatio-temporal patterns of disease (disease mapping); spatial smoothing and temporal prediction (forecasting) of disease risk.

Space-Time Clustering
Y_it = breast cancer incidence count (number of cases) in MSA i at time t
N_it = number of individuals at risk
R_it = disease rate
Model:
Y_it | N_it, R_it ~ Poisson(N_it R_it),  i = 1, ..., 100;  t = 1, ..., 11
ln(R_it) = μ_it + φ_i
μ_it | G_t ~ G_t
{G_t, t ≥ 1} ~ ARDP(1)
G_0 = N(0, 10)
φ | C, τ², ρ ~ N(0, τ² [ρ C + (1 − ρ) I_n]^{-1})

Spatial Component
φ | C, τ², ρ ~ N(0, τ² [ρ C + (1 − ρ) I_n]^{-1})
the variance parameter τ² controls the amount of variation between the random effects
the weight parameter ρ controls the strength of the spatial correlation between the random effects
the elements of C are c_jk = n_k if j = k; c_jk = −1 if j ∼ k; c_jk = 0 otherwise, where j ∼ k denotes that areas j and k are neighbours and n_k denotes the number of neighbours of area k.
Prior specification: τ² ~ U(0, 10),  ρ ~ discrete uniform on (0, 0.05, 0.10, 0.15, ..., 0.95)
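An illustrative construction of this spatial prior in Python (a sketch: the adjacency matrix and parameter values are toy choices, and C is built as diag(n_k) − W, consistent with the parameterisation above):

```python
# Sketch: build the covariance tau^2 [rho*C + (1-rho)*I]^{-1} from a binary
# adjacency matrix W (C = diag(n_k) - W) and draw one spatial random effect phi.
import numpy as np

def car_covariance(W, tau2, rho):
    n = W.shape[0]
    C = np.diag(W.sum(axis=1)) - W                 # n_k on the diagonal, -1 for neighbours
    precision = (rho * C + (1 - rho) * np.eye(n)) / tau2
    return np.linalg.inv(precision)

def sample_phi(W, tau2, rho, rng=None):
    rng = np.random.default_rng(rng)
    cov = car_covariance(W, tau2, rho)
    return rng.multivariate_normal(np.zeros(W.shape[0]), cov)

# toy adjacency for 4 areas on a line: 1-2-3-4
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
phi = sample_phi(W, tau2=1.0, rho=0.5)
```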

Posterior Inference on Clustering

Co-Clustering Probability

Posterior Inference on Correlations

Dose-escalation Study
Data: white blood cell counts (WBC) over time for n = 52 patients receiving high doses of cancer chemotherapy.
CTX: anticancer agent, known to lower a person's WBC
GM-CSF: drug given to mitigate some of the side effects of chemotherapy
WBC profiles over time show an initial baseline, a sudden decline when chemotherapy starts, and a slow S-shaped recovery back to baseline after the end of treatment.
Interest lies in understanding the effect of dose on WBC in order to protect patients against severe toxicity.

Model
log(y_it) | μ_it, τ_it ~ N(μ_it, (λ τ_it)^{-1})
μ_it = m_it + x_it β_t
(m_it, τ_it) | G_t ~ G_t
{G_t, t ≥ 1} ~ ARDP(1)
G_0 = Normal-Gamma(μ_0, λ, α, γ)

Posterior Inference on ψ

Posterior Inference on Clustering

Posterior Density Estimation

Conclusions
Dependent process for time-evolving distributions, based on the DP; could be generalised to the GDP.
Temporal dependence is introduced through a latent stochastic process of normal variables. Advantage: a more general process on ε is possible, e.g. ARMA.
Flexible: can accommodate different correlation patterns.
Dependent clustering, with borrowing of information across time periods.
General applicability; allows for prediction of G_{t+1}.
Posterior computations through PMCMC.