Bayesian Nonparametric Regression through Mixture Models


Bayesian Nonparametric Regression through Mixture Models Sara Wade Bocconi University Advisor: Sonia Petrone October 7, 2013

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Introduction. Aim: flexible regression, where both the mean and the error distribution may evolve flexibly with $x$. Many approaches exist for flexible regression (e.g. splines, wavelets, Gaussian processes), but they typically assume i.i.d. Gaussian errors, $Y_i = m(x_i) + \epsilon_i$, $\epsilon_i \mid \sigma^2 \overset{iid}{\sim} N(0, \sigma^2)$. For i.i.d. data, mixtures are a useful tool for flexible density estimation. We extend mixture models to conditional density estimation.

Mixture Models. Given $f_P$, assume the $Y_i$ are i.i.d. with density $f_P(y) = \int K(y \mid \theta)\, dP(\theta)$, for some parametric density $K(y \mid \theta)$. In the Bayesian setting, define a prior on $P$, where typically $P$ is discrete a.s., $P = \sum_{j=1}^{J} w_j \delta_{\theta_j}$, so that $f_P(y) = \sum_{j=1}^{J} w_j K(y \mid \theta_j)$. How many components? Fixed $J$: overfitted mixtures (Rousseau and Mengersen (2011)) or post-processing techniques. Random $J$: reversible jump MCMC. $J = \infty$: Dirichlet process (DP) mixtures and developments in BNP.
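
As a concrete illustration of the finite-$J$ case, the following minimal Python sketch evaluates $f_P(y) = \sum_j w_j N(y \mid \mu_j, \sigma_j^2)$; the weights and component parameters below are hypothetical values chosen only for the example.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(y, weights, mus, sigmas):
    """Evaluate f_P(y) = sum_j w_j N(y | mu_j, sigma_j^2) for a finite mixture."""
    y = np.atleast_1d(y)
    # Sum each Gaussian kernel weighted by its mixture weight.
    return sum(w * norm.pdf(y, loc=m, scale=s)
               for w, m, s in zip(weights, mus, sigmas))

# Hypothetical two-component example (J = 2).
weights = [0.4, 0.6]
mus = [-1.0, 2.0]
sigmas = [0.5, 1.0]
print(mixture_density([0.0, 2.0], weights, mus, sigmas))
```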

Extending Mixture Models. Two main lines of research:

Joint approach: augment the observations to include $X$ and assume that, given $f_P$, the $(Y_i, X_i)$ are i.i.d. with density $f_P(y, x) = \int K(y, x \mid \theta)\, dP(\theta) = \sum_{j=1}^{\infty} w_j K(y, x \mid \theta_j)$. Conditional density estimates are obtained as a by-product through $f_P(y \mid x) = \frac{\sum_{j=1}^{\infty} w_j K(y, x \mid \theta_j)}{\sum_{j'=1}^{\infty} w_{j'} K(x \mid \theta_{j'})}$.

Conditional approach: allow the mixing measure to depend on $x$ and assign a prior so that $\{P_x\}_{x \in \mathcal{X}}$ are dependent. Assume that, given $f_{P_x}$ and $x$, the $Y_i$ are independent with density $f_{P_x}(y \mid x) = \int K(y \mid \theta)\, dP_x(\theta) = \sum_{j=1}^{\infty} w_j(x) K(y \mid \theta_j(x))$.
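
A minimal sketch of the joint approach with a product kernel $K(y, x \mid \theta_j) = N(y \mid \mu_{y,j}, \sigma^2_{y,j})\, N(x \mid \mu_{x,j}, \sigma^2_{x,j})$ (the product form and all parameter values are simplifying assumptions for illustration, not the kernels used in the thesis): the conditional density estimate falls out as the ratio of the joint mixture to the marginal mixture in $x$.

```python
import numpy as np
from scipy.stats import norm

def conditional_density(y, x, w, mu_y, sig_y, mu_x, sig_x):
    """f_P(y | x) = sum_j w_j K(y, x | theta_j) / sum_j w_j K(x | theta_j),
    with independent Gaussian kernels in y and x within each component."""
    joint = sum(wj * norm.pdf(y, my, sy) * norm.pdf(x, mx, sx)
                for wj, my, sy, mx, sx in zip(w, mu_y, sig_y, mu_x, sig_x))
    marg_x = sum(wj * norm.pdf(x, mx, sx)
                 for wj, mx, sx in zip(w, mu_x, sig_x))
    return joint / marg_x

# Hypothetical three-component mixture.
w = [0.5, 0.3, 0.2]
mu_y, sig_y = [0.0, 2.0, 4.0], [1.0, 0.5, 1.0]
mu_x, sig_x = [-1.0, 0.0, 1.0], [0.5, 0.5, 0.5]
print(conditional_density(y=1.0, x=0.2, w=w, mu_y=mu_y, sig_y=sig_y,
                          mu_x=mu_x, sig_x=sig_x))
```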

Mixture Models for Regression. Many proposals in the literature fall into these two general categories: 1. Joint approach: Müller, Erkanli, and West (1996), Shahbaba and Neal (2009), Hannah, Blei, and Powell (2011), Park and Dunson (2010), Müller and Quintana (2010), Bhattacharya, Page, and Dunson (2012), Quintana (2011), ... 2. Conditional approach: a. Covariate-dependent atoms: MacEachern (1999; 2000; 2001), De Iorio et al. (2004), Gelfand, Kottas, and MacEachern (2005), Rodriguez and Horst (2008), De La Cruz, Quintana, and Müller (2007), De Iorio et al. (2009), Jara et al. (2010), ... Example: $f_{P_x}(y \mid x) = \sum_{j=1}^{\infty} w_j N(y \mid \mu_j(x), \sigma_j^2)$. b. Covariate-dependent weights: Griffin and Steel (2006; 2010), Dunson and Park (2008), Ren et al. (2011), Chung and Dunson (2009), Rodriguez and Dunson (2011), ... (also models based on covariate-dependent urn schemes or partitions). Example: $f_{P_x}(y \mid x) = \sum_{j=1}^{\infty} w_j(x) N(y \mid x\beta_j, \sigma_j^2)$.

Mixture Models for Regression. How to choose among the large number of proposals for the application at hand? Computational reasons. Frequentist properties: consistency (Hannah, Blei, and Powell (2011); Norets and Pelenis (2012); Pati, Dunson, and Tokdar (2012)), but asymptotic results may hide what happens in finite samples. Convergence rates? Our aim is to answer this question from a Bayesian perspective. We do so through a detailed study of predictive performance based on finite samples, identifying potential sources of improvement and developing novel models to improve predictions.

Motivating Application. Motivating application: a study of Alzheimer's Disease based on neuroimaging data. A definite diagnosis is not known until autopsy, and current diagnosis is based on clinical information. Aim: show that neuroimages can be used to improve diagnostic accuracy and may be better outcome measures for clinical trials, particularly in early stages of the disease. Prior knowledge, the need for dimension reduction, and non-homogeneity motivate a BNP approach.

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

EDP (with S. Mongelluzzo and S. Petrone). In many applications, the DP is a prior for a multivariate random probability measure. The DP has many desirable properties (easy elicitation of the parameters $\alpha$, $P_0$; conjugacy; large support). However, when considering the set of probability measures on $\mathbb{R}^k$, a DP prior can be quite restrictive in the sense that the variability is determined by a single parameter, $\alpha$. Enriched conjugate priors (Consonni and Veronese (2001)) have been proposed to address an analogous lack of flexibility of standard conjugate priors in a parametric setting. Solution: construct an enriched DP that is more flexible, yet still conjugate, by extending the idea of enriched conjugate priors to a nonparametric setting.

EDP: Definition. Sample space: $\Theta \times \Psi$, where $\Theta, \Psi$ are complete separable metric spaces. Enriched Dirichlet Process: $P_0$ is a probability measure on $\Theta \times \Psi$, $\alpha_\theta > 0$, and $\alpha_\psi(\theta) > 0$ for all $\theta \in \Theta$. Assume: 1. $P_\theta \sim DP(\alpha_\theta P_{0\theta})$. 2. For all $\theta \in \Theta$, $P_{\psi \mid \theta}(\cdot \mid \theta) \sim DP(\alpha_\psi(\theta) P_{0\psi \mid \theta}(\cdot \mid \theta))$. 3. The $P_{\psi \mid \theta}(\cdot \mid \theta)$, $\theta \in \Theta$, are independent among themselves. 4. $P_\theta$ is independent of $\{P_{\psi \mid \theta}(\cdot \mid \theta)\}_{\theta \in \Theta}$. The law of the random joint measure is obtained from the joint law of the marginal and conditionals through the mapping $(P_\theta, P_{\psi \mid \theta}) \mapsto \int P_{\psi \mid \theta}(\cdot \mid \theta)\, dP_\theta(\theta)$. Many more precision parameters, yet conjugacy and many other desirable properties are maintained.
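
The nested structure can be visualised with a truncated stick-breaking simulation. This is only a sketch under simplifying assumptions not taken from the slides: a common precision $\alpha_\psi(\theta) \equiv \alpha_\psi$, standard-normal base measures for both $\theta$ and $\psi$, and truncation at finitely many atoms.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, trunc, rng):
    """Truncated stick-breaking weights for a DP with precision alpha."""
    v = rng.beta(1.0, alpha, size=trunc)
    v[-1] = 1.0  # force the truncated weights to sum to one
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return w

def draw_edp(alpha_theta, alpha_psi, trunc=20, rng=rng):
    """Draw a truncated EDP: a marginal DP over theta atoms, plus an
    independent conditional DP over psi attached to each theta atom."""
    w_theta = stick_breaking(alpha_theta, trunc, rng)
    theta = rng.normal(size=trunc)                 # atoms from P_{0 theta} = N(0, 1)
    psi_weights = [stick_breaking(alpha_psi, trunc, rng) for _ in range(trunc)]
    psi_atoms = [rng.normal(size=trunc) for _ in range(trunc)]  # P_{0 psi|theta} = N(0, 1)
    return w_theta, theta, psi_weights, psi_atoms

w_theta, theta, psi_w, psi_a = draw_edp(alpha_theta=1.0, alpha_psi=5.0)
print(w_theta[:5], theta[:5])
```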

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Summary. DP mixture models are popular tools in BNP, with known properties and an analytically computable urn scheme. However, as $p$ increases, $x$ tends to dominate the posterior of the random partition, negatively affecting prediction. Solution: replace the DP with the EDP, to allow local clustering while maintaining an analytically computable urn scheme.

Joint DP mixture model. Model (Müller et al. (1996)): $f_P(y, x) = \int N(y, x \mid \mu, \Sigma)\, dP(\mu, \Sigma)$, $P \sim DP(\alpha P_0)$, where $P_0$ is the conjugate normal-inverse Wishart. Problems: requires sampling the full $(p+1) \times (p+1)$ covariance matrix, and the inverse Wishart is poorly parametrized. Model (Shahbaba and Neal (2009)): $f_P(y, x) = \int N(y \mid x\beta, \sigma^2_{y \mid x}) \prod_{h=1}^{p} N(x_h \mid \mu_h, \sigma^2_{x,h})\, dP(\theta, \psi)$, $P \sim DP(\alpha\, P_{0\theta} \times P_{0\psi})$, where $\theta = (\beta, \sigma^2_{y \mid x})$, $\psi = (\mu, \sigma^2_x)$, and $P_{0\theta}$, $P_{0\psi}$ are the conjugate normal-inverse gamma priors. Improvements: 1) eases computations, 2) more flexible prior for the variances, 3) flexible local correlations between $Y$ and $X$, and 4) easy incorporation of other types of $Y$ and $X$.

Examining the Partition. $P$ is discrete a.s., which induces a random partition of the subjects. Notation: $\rho_n = (s_1, \ldots, s_n)$ (partition), $(\theta^*_j, \psi^*_j)$ (unique parameters), $(x^*_j, y^*_j)$ (cluster-specific observations). Partition: $p(\rho_n \mid x_{1:n}, y_{1:n}) \propto \alpha^k \prod_{j=1}^{k} \Gamma(n_j) \prod_{j=1}^{k} \left[ \prod_{h=1}^{p} g_{x,h}(x^*_{j,h}) \right] g_y(y^*_j \mid x^*_j)$, where, for example, $g_y(y^*_j \mid x^*_j) = \int \prod_{i: s_i = j} K(y_i \mid x_i, \theta)\, dP_{0\theta}(\theta)$. As $p$ increases, $x$ dominates the partition structure: $x$ exhibits increasing departures, so a large $k$ and small $n_1, \ldots, n_k$ have high posterior probability.

Examining the Prediction. How does this affect prediction?
$E[Y_{n+1} \mid y_{1:n}, x_{1:n+1}] = \int \Big[ w_{k+1}(x_{n+1})\, x_{n+1}\beta_0 + \sum_{j=1}^{k} w_j(x_{n+1})\, x_{n+1}\beta^*_j \Big]\, dP(\rho_n, \theta^*, \psi^* \mid y_{1:n}, x_{1:n+1})$,
where $w_j(x_{n+1}) = c^{-1}\, n_j \prod_{h=1}^{p} N(x_{n+1,h} \mid \mu^*_{j,h}, \sigma^{2}_{x,j,h})$ for $j = 1, \ldots, k$, and $w_{k+1}(x_{n+1}) = c^{-1} \alpha\, g_x(x_{n+1})$.
As $p$ increases: given the partition, predictive estimates are an average over a large number of components; within a component, predictive estimates are unreliable and variable due to small sample sizes; and the measure of similarity in the covariate space is rigid.
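
To make the weighting concrete, here is a sketch of the within-partition part of this predictive mean, conditional on a single fixed partition and on cluster-specific draws of $(\beta_j, \mu_j, \sigma_{x,j})$: each cluster's local linear prediction $x'\beta_j$ is weighted by $n_j$ times the Gaussian similarity of $x_{n+1}$ to that cluster in the covariate space. The new-cluster term (weight proportional to $\alpha\, g_x(x_{n+1})$) is omitted for brevity, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def predictive_mean(x_new, clusters):
    """E[Y_{n+1} | partition, cluster params, x_{n+1}] for a joint DP mixture
    with local linear regressions, ignoring the new-cluster term.
    clusters: list of dicts with keys 'n', 'beta', 'mu', 'sigma_x'."""
    weights, means = [], []
    for c in clusters:
        # similarity of x_new to the cluster's covariate kernel, times cluster size
        sim = np.prod(norm.pdf(x_new, loc=c["mu"], scale=c["sigma_x"]))
        weights.append(c["n"] * sim)
        means.append(float(np.dot(x_new, c["beta"])))
    weights = np.asarray(weights)
    weights = weights / weights.sum()
    return float(np.dot(weights, means))

# Hypothetical two-cluster state with p = 2 covariates.
clusters = [
    {"n": 30, "beta": np.array([1.0, -0.5]), "mu": np.array([0.0, 0.0]),
     "sigma_x": np.array([1.0, 1.0])},
    {"n": 10, "beta": np.array([0.2, 0.8]), "mu": np.array([2.0, 2.0]),
     "sigma_x": np.array([0.5, 0.5])},
]
print(predictive_mean(np.array([0.5, 0.1]), clusters))
```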

An EDP mixture model for regression (with S. Petrone). Model: $f_P(y, x) = \int N(y \mid x\beta, \sigma^2_{y \mid x}) \prod_{h=1}^{p} N(x_h \mid \mu_h, \sigma^2_{x,h})\, dP(\theta, \psi)$, $P \sim EDP(\alpha_\theta, \alpha_\psi(\cdot), P_{0\theta} \times P_{0\psi})$. This defines a nested partition structure $\rho_n = (\rho_{n,y}, (\rho_{n_{j+},x}))$, where $k$ represents the number of $y$-clusters, of sizes $n_{j+}$, and $k_j$ represents the number of $x$-clusters within the $j$th $y$-cluster, of sizes $n_{j,l}$. Additional notation: $(\theta^*_j, \psi^*_{j,l})$ (unique parameters), $(y^*_j, x^*_{j,l})$ (cluster-specific observations).

Examining the Partition. Under the assumption that $\alpha_\psi(\theta) = \alpha_\psi$ for all $\theta \in \Theta$,
$p(\rho_n \mid x_{1:n}, y_{1:n}) \propto \alpha_\theta^{k} \prod_{j=1}^{k} \left[ \alpha_\psi^{k_j}\, \frac{\Gamma(\alpha_\psi)\, \Gamma(n_{j+})}{\Gamma(\alpha_\psi + n_{j+})} \prod_{l=1}^{k_j} \Gamma(n_{j,l}) \right] \prod_{j=1}^{k} g_y(y^*_j \mid x^*_j) \prod_{l=1}^{k_j} \prod_{h=1}^{p} g_x(x^*_{j,l,h})$.
The likelihood of $y$ determines whether a coarser $y$-partition is present, so a smaller $k$ and larger $n_{1+}, \ldots, n_{k+}$ have high posterior probability.

Examining the Prediction. Prediction:
$E[Y_{n+1} \mid y_{1:n}, x_{1:n+1}] = \int \Big[ w_{k+1}(x_{n+1})\, x_{n+1}\beta_0 + \sum_{j=1}^{k} w_j(x_{n+1})\, x_{n+1}\beta^*_j \Big]\, dP(\rho_n, \theta^*, \psi^* \mid y_{1:n}, x_{1:n+1})$,
where $w_j(x_{n+1}) = c^{-1}\, n_{j+} \left( \frac{\alpha_x}{\alpha_x + n_{j+}}\, g_x(x_{n+1}) + \sum_{l=1}^{k_j} \frac{n_{j,l}}{\alpha_x + n_{j+}} \prod_{h=1}^{p} N(x_{n+1,h} \mid \mu^*_{l \mid j,h}, \sigma^{2}_{x,l \mid j,h}) \right)$ and $w_{k+1}(x_{n+1}) = c^{-1} \alpha_y\, g_x(x_{n+1})$.
Given the partition, predictive estimates are an average over a smaller number of components; within a component, predictive estimates are based on larger sample sizes; and the measure of similarity in the covariate space is flexible. Summary: more reliable and less variable estimates, while computations remain relatively simple due to the analytically computable urn scheme.

Application to AD data. Aim: diagnosis of AD. Data: $Y$ is a binary indicator of CN (1) or AD (0); $X$ contains summaries of structural Magnetic Resonance Images (sMRI) ($p = 15$). Model: $Y_i \mid x_i, \beta_i \overset{ind}{\sim} \text{Bern}(\Phi(x_i\beta_i / \sigma_{y,i}))$, $X_i \mid \mu_i, \sigma^2_{x,i} \overset{ind}{\sim} \prod_{h=1}^{p} N(\mu_{i,h}, \sigma^2_{x,i,h})$, $(\beta_i, \mu_i, \sigma^2_{x,i}) \mid P \overset{iid}{\sim} P$, $P \sim Q$, where $Q$ is a DP or EDP with the same base measure and gamma hyperpriors on the precision parameters.

Results for n = 185. Posterior mode of $k$: 16 (DP) vs. 3 (EDP). [Figure: scatter plots of the estimated clusters over the sMRI covariates (ICV, BV, LHV, RHV), panels (a) DP and (b) EDP.] Prediction for $m = 192$ subjects: number of subjects correctly classified: 159 (DP) vs. 168 (EDP), with tighter credible intervals.

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Introduction. Aim: analyse the role of the partition in prediction for BNP mixtures and highlight an understated problem: the huge dimension of the partition space. Assume both $Y$ and $X$ are univariate and continuous. Number of ways to partition the $n$ subjects: $S_{n,k} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^j \binom{k}{j} (k-j)^n$, $B_n = \sum_{k=1}^{n} S_{n,k}$. Many of these partitions are similar, so a large number of partitions will fit the data.

Introducing Covariate Information. Covariates typically provide information on the partition structure. Incorporating this covariate information (as in models with $w_j(x)$) will improve posterior inference on the partition. But the posterior is still diluted, so this information needs to be strictly enforced. If the aim is estimation of the regression function, desirable partitions satisfy an ordering constraint in the $x_i$, which reduces the total number of partitions to $2^{n-1}$. Note: if $n = 10$, $B_{10} = 115{,}975$ and $2^{10-1} = 512$, about 0.44% of the total partitions; and if $n = 100$, $2^{100-1}/B_{100} < 10^{-85}$. (A numerical check of these counts is sketched below.)
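
The counts above can be verified with a few lines of Python, computing Bell numbers via the Bell triangle and comparing with the $2^{n-1}$ order-respecting (contiguous) partitions.

```python
def bell(n):
    """Bell number B_n via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
    return row[0]

for n in (10, 100):
    total = bell(n)
    contiguous = 2 ** (n - 1)     # partitions respecting the ordering of x
    print(n, total, contiguous, contiguous / total)
# n = 10: 115,975 total partitions vs. 512 ordered ones (about 0.44%).
```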

Restricted DPM (with S.G. Walker and S. Petrone). Let $x_{(1)}, \ldots, x_{(n)}$ denote the ordered values of $x$, and $s_{(1)}, \ldots, s_{(n)}$ denote the correspondingly ordered values of $s_1, \ldots, s_n$. Prior for the random partition: $p(\rho_n \mid x) = \frac{\Gamma(\alpha)\,\Gamma(n+1)}{\Gamma(\alpha+n)}\, \frac{\alpha^k}{k!} \prod_{j=1}^{k} \frac{1}{n_j}\, \mathbb{1}(s_{(1)} \leq \cdots \leq s_{(n)})$. This introduces covariate dependency, greatly reduces the total number of partitions, improves prediction, and speeds up computations.
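
A minimal sketch evaluating the stated prior from the block sizes of a contiguous, order-respecting partition (an order-respecting partition is fully determined by its block sizes along the sorted $x$); the value of $\alpha$ and the block sizes below are hypothetical.

```python
from math import lgamma, exp, log, factorial

def log_restricted_dpm_prior(block_sizes, alpha):
    """log p(rho_n | x) for a partition respecting the ordering of x,
    specified by its contiguous block sizes n_1, ..., n_k."""
    n = sum(block_sizes)
    k = len(block_sizes)
    logp = lgamma(alpha) + lgamma(n + 1) - lgamma(alpha + n)
    logp += k * log(alpha) - log(factorial(k))
    logp -= sum(log(nj) for nj in block_sizes)
    return logp

# Hypothetical: n = 10 ordered observations split into blocks of sizes 4, 3, 3.
print(exp(log_restricted_dpm_prior([4, 3, 3], alpha=1.0)))
```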

Simulated Example. Data are simulated from $y_i \mid x_i = \begin{cases} x_i/8 + 5 & \text{if } x_i \leq 6, \\ 2x_i - 12 & \text{if } x_i > 6, \end{cases}$ with $x_i \in \{0, 0.25, 0.5, \ldots, 8.75, 9\}$.
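
The simulated covariates and responses can be generated directly from this specification. The sketch below follows the relation exactly as written on the slide (deterministic responses on the stated grid); if a noise term was added in the thesis, that detail is not shown here.

```python
import numpy as np

# Covariate grid from the slide: 0, 0.25, 0.5, ..., 8.75, 9.
x = np.arange(0.0, 9.25, 0.25)

# Piecewise relation: x/8 + 5 for x <= 6, and 2x - 12 for x > 6.
y = np.where(x <= 6, x / 8.0 + 5.0, 2.0 * x - 12.0)

print(x[:5], y[:5])
```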

Simulated Ex.: 3 partitions with highest posterior mass. [Figure: the three partitions with highest posterior mass under each model; posterior probabilities $p(\rho \mid y, x)$ of 0.3973, 0.0695, 0.0381 for (e) the DPM; 0.5317, 0.0493, 0.0418 for (f) the joint DPM; and 0.9031, 0.0302, 0.0129 for (g) the restricted DPM.] Number of partitions with positive posterior probability: 1,263, 918, and 43, respectively.

Simulated Example: Prediction. [Figure: predictive fits for (h) the DPM, (i) the joint DPM, and (j) the restricted DPM.] Figure: prediction for $x = (3.3, 5.9, 6.2, 6.3, 7.9, 8.1, 10)$. The empirical $L_2$ prediction error, $\big( \frac{1}{m} \sum_{j=1}^{m} (\hat{y}_{j,\text{est}} - \hat{y}_{j,\text{true}})^2 \big)^{1/2}$, for the three models, respectively, is 2.19, 1.66, and 1.09.

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Introduction. BNP mixture models with covariate-dependent weights: $f_{P_x}(y \mid x) = \int K(y \mid x, \theta)\, dP_x(\theta)$, $P_x = \sum_{j=1}^{\infty} w_j(x) \delta_{\theta_j}$. Construction of covariate-dependent weights (in the literature): $w_j(x) = v_j(x) \prod_{j' < j} (1 - v_{j'}(x))$, where the $v_j(\cdot)$ are independent stochastic processes. Drawbacks: non-interpretable weight functions, typically exhibiting peculiarities; the choice of functional shapes and hyper-parameters; and extensions for continuous and discrete covariates are not straightforward. Aim: construct interpretable weights that can handle continuous and discrete covariates, with feasible posterior inference.

Normalized weights (with S.G. Walker and I. Antoniano). Solution: since $w_j(x)$ represents the weight assigned to regression model $j$ at covariate $x$, an interpretable construction is obtained through Bayes' theorem by combining $P(A \mid j) = \int_A K(x \mid \psi_j)\, dx$, the probability that an observation from regression model $j$ has a covariate value in the region $A \subseteq \mathcal{X}$, and $P(j) = w_j$, the prior probability that an observation comes from model $j$: $w_j(x) = \frac{w_j K(x \mid \psi_j)}{\sum_{j'=1}^{\infty} w_{j'} K(x \mid \psi_{j'})}$. Inclusion of continuous and discrete covariates is straightforward via an appropriate definition of $K(x \mid \psi)$.
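
A minimal sketch of these weights for a truncated mixture with Gaussian covariate kernels $K(x \mid \psi_j) = N(x \mid \mu_j, \sigma_j^2)$; the kernel choice and all parameter values are hypothetical and only illustrate the normalization.

```python
import numpy as np
from scipy.stats import norm

def normalized_weights(x, w, mu, sigma):
    """w_j(x) = w_j K(x | psi_j) / sum_j' w_j' K(x | psi_j'),
    with Gaussian covariate kernels."""
    k = np.asarray(w) * norm.pdf(x, loc=mu, scale=sigma)
    return k / k.sum()

# Hypothetical three-component example.
w = np.array([0.5, 0.3, 0.2])
mu = np.array([-1.0, 0.0, 2.0])
sigma = np.array([0.5, 1.0, 0.7])
print(normalized_weights(0.3, w, mu, sigma))   # weights sum to one at each x
```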

Normalized weights: Computations. The likelihood contains an intractable normalizing constant: $f_P(y \mid x) = \sum_{j=1}^{\infty} w_j(x) K(y \mid x, \theta_j)$, where $w_j(x) = \frac{w_j K(x \mid \psi_j)}{\sum_{j'=1}^{\infty} w_{j'} K(x \mid \psi_{j'})}$. However, using the geometric series, for $|r| < 1$, $\sum_{k=0}^{\infty} r^k = \frac{1}{1-r}$, the likelihood can be written as $f_P(y \mid x) = \sum_{j=1}^{\infty} w_j K(x \mid \psi_j) K(y \mid x, \theta_j) \sum_{k=0}^{\infty} \Big( 1 - \sum_{j'=1}^{\infty} w_{j'} K(x \mid \psi_{j'}) \Big)^k$.

Normalized weights: Computations. The likelihood can be re-written as $f_P(y \mid x) = \sum_{j=1}^{\infty} w_j K(x \mid \psi_j) K(y \mid x, \theta_j) \sum_{k=0}^{\infty} \Big( \sum_{j'=1}^{\infty} w_{j'} (1 - K(x \mid \psi_{j'})) \Big)^k = \sum_{j=1}^{\infty} w_j K(x \mid \psi_j) K(y \mid x, \theta_j) \sum_{k=0}^{\infty} \prod_{l=1}^{k} \Big( \sum_{j_l=1}^{\infty} w_{j_l} (1 - K(x \mid \psi_{j_l})) \Big)$. Introducing latent variables $(k, d, D)$, the latent model is $f_P(y, k, d, D \mid x) = w_d K(x \mid \psi_d) K(y \mid x, \theta_d) \prod_{l=1}^{k} w_{D_l} (1 - K(x \mid \psi_{D_l}))$.
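
The expansion can be checked numerically on a small truncated example. The sketch below uses three components with hypothetical parameters and Gaussian kernels whose densities are bounded by one (so each term of the series is nonnegative and the series converges); the direct ratio form and the geometric-series form agree.

```python
import numpy as np
from scipy.stats import norm

# Truncated example with Gaussian covariate kernels bounded by one (sigma >= 1/sqrt(2*pi)).
w = np.array([0.5, 0.3, 0.2])
mu_x, sig_x = np.array([-1.0, 0.0, 2.0]), np.array([1.0, 1.0, 1.0])
beta, sig_y = np.array([1.0, -0.5, 2.0]), np.array([0.5, 0.5, 0.5])

def cond_density(y, x):
    """f(y | x) computed directly as a ratio (the intractable-constant form)."""
    kx = norm.pdf(x, mu_x, sig_x)
    return np.sum(w * kx * norm.pdf(y, x * beta, sig_y)) / np.sum(w * kx)

def cond_density_series(y, x, n_terms=200):
    """The same density via the geometric-series expansion behind the
    latent-variable augmentation."""
    kx = norm.pdf(x, mu_x, sig_x)
    num = np.sum(w * kx * norm.pdf(y, x * beta, sig_y))
    r = 1.0 - np.sum(w * kx)            # |r| < 1 here, so the series converges
    return num * np.sum(r ** np.arange(n_terms))

print(cond_density(0.7, 0.4), cond_density_series(0.7, 0.4))  # should agree
```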

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Discussion. BNP mixture models for regression are highly flexible and numerous. This thesis aimed to shed light on the advantages and drawbacks of the various models. 1. Joint models: + easy computations; + flexible and easy to extend to other types of $Y$ and $X$; − posterior inference is based on the joint; − problematic for large $p$. Can overcome with the EDP prior (which maintains the two advantages).

Discussion. 2. Conditional models with covariate-dependent atoms: + moderate computations (increasing with the flexibility of $\theta_j(x)$); + posterior inference is based on the conditional; − an appropriate definition of $\theta_j(x)$ for mixed $x$ can be challenging; − the choice of $\theta_j(x)$ has strong implications: if inflexible, extremely poor prediction; if overly flexible, prediction may also suffer. 3. Conditional models with covariate-dependent weights: + posterior inference is based on the conditional; + flexible and incorporates covariate proximity in the partition; − difficult computations; − lack of interpretable weights that can accommodate mixed $x$. Can overcome with normalized weights. − Huge dimension of the partition space: the posterior of $\rho_n$ is very spread out. Can overcome by enforcing a notion of covariate proximity.

Future Work. Examining models for large-$p$ problems. Formalizing model comparison among the various models: consistency, convergence rates, finite-sample bounds. AD study: diagnosis based on sMRI and other imaging techniques; investigating the dynamics of AD biomarkers (jointly, and incorporating longitudinal data).

References 1. Consonni, G. and Veronese, P. (2001). Conditionally reducible natural exponential families and enriched conjugate priors. Scandinavian Journal of Statistics, 28, 377–406. 2. Ramamoorthi, R.V. and Sangalli, L. (2005). On a characterization of the Dirichlet distribution. Proceedings of the International Conference on Bayesian Statistics and its Applications, Varanasi, India. 3. Hannah, L., Blei, D. and Powell, W. (2011). Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research, 12, 1923–1953. 4. Müller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika, 83, 67–79. 5. Shahbaba, B. and Neal, R.M. (2009). Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research, 10, 1829–1850.

References 6. MacEachern, S.N. (2000). Dependent nonparametric processes. Technical Report, Department of Statistics, Ohio State University. 7. Griffin, J.E. and Steel, M.F.J. (2006). Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101, 179–194. 8. Dunson, D.B. and Park, J.H. (2008). Kernel stick-breaking processes. Biometrika, 95, 307–323. 9. Rodriguez, A. and Dunson, D.B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6, 145–178. 10. Fuentes-Garcia, R., Mena, R.H. and Walker, S.G. (2010). A probability for classification based on the mixture of Dirichlet process model. Journal of Classification, 27, 389–403.