Bayesian Nonparametric Regression through Mixture Models


Bayesian Nonparametric Regression through Mixture Models Sara Wade Bocconi University Advisor: Sonia Petrone October 7, 2013

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Introduction. Aim: flexible regression, where both the mean and the error distribution may evolve flexibly with $x$. Many approaches exist for flexible regression (e.g. splines, wavelets, Gaussian processes), but they typically assume i.i.d. Gaussian errors, $Y_i = m(x_i) + \epsilon_i$, $\epsilon_i \mid \sigma^2 \overset{iid}{\sim} N(0, \sigma^2)$. For i.i.d. data, mixtures are a useful tool for flexible density estimation. We extend mixture models to conditional density estimation.

Mixture Models. Given $f_P$, assume the $Y_i$ are i.i.d. with density $f_P(y) = \int K(y \mid \theta)\, dP(\theta)$, for some parametric density $K(y \mid \theta)$. In the Bayesian setting, define a prior on $P$, where typically $P$ is discrete a.s., $P = \sum_{j=1}^{J} w_j \delta_{\theta_j}$, so that $f_P(y) = \sum_{j=1}^{J} w_j K(y \mid \theta_j)$. How many components? Fixed $J$: overfitted mixtures (Rousseau and Mengersen (2011)) or post-processing techniques. Random $J$: reversible jump MCMC. $J = \infty$: Dirichlet process (DP) mixtures and developments in BNP.
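
As a concrete illustration of the finite-$J$ case, the following minimal Python sketch evaluates $f_P(y) = \sum_j w_j N(y \mid \mu_j, \sigma_j^2)$; the weights and component parameters below are hypothetical values chosen only for the example.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(y, weights, mus, sigmas):
    """Evaluate f_P(y) = sum_j w_j N(y | mu_j, sigma_j^2) for a finite mixture."""
    y = np.atleast_1d(y)
    # Sum each Gaussian kernel weighted by its mixture weight.
    return sum(w * norm.pdf(y, loc=m, scale=s)
               for w, m, s in zip(weights, mus, sigmas))

# Hypothetical two-component example (J = 2).
weights = [0.4, 0.6]
mus = [-1.0, 2.0]
sigmas = [0.5, 1.0]
print(mixture_density([0.0, 2.0], weights, mus, sigmas))
```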

Extending Mixture Models. Two main lines of research:

Joint approach: augment the observations to include $X$ and assume that, given $f_P$, the $(Y_i, X_i)$ are i.i.d. with density $f_P(y, x) = \int K(y, x \mid \theta)\, dP(\theta) = \sum_{j=1}^{\infty} w_j K(y, x \mid \theta_j)$. Conditional density estimates are obtained as a by-product through $f_P(y \mid x) = \frac{\sum_{j=1}^{\infty} w_j K(y, x \mid \theta_j)}{\sum_{j'=1}^{\infty} w_{j'} K(x \mid \theta_{j'})}$.

Conditional approach: allow the mixing measure to depend on $x$ and assign a prior so that $\{P_x\}_{x \in \mathcal{X}}$ are dependent. Assume that, given $f_{P_x}$ and $x$, the $Y_i$ are independent with density $f_{P_x}(y \mid x) = \int K(y \mid \theta)\, dP_x(\theta) = \sum_{j=1}^{\infty} w_j(x) K(y \mid \theta_j(x))$.
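
A minimal sketch of the joint approach with a product kernel $K(y, x \mid \theta_j) = N(y \mid \mu_{y,j}, \sigma^2_{y,j})\, N(x \mid \mu_{x,j}, \sigma^2_{x,j})$ (the product form and all parameter values are simplifying assumptions for illustration, not the kernels used in the thesis): the conditional density estimate falls out as the ratio of the joint mixture to the marginal mixture in $x$.

```python
import numpy as np
from scipy.stats import norm

def conditional_density(y, x, w, mu_y, sig_y, mu_x, sig_x):
    """f_P(y | x) = sum_j w_j K(y, x | theta_j) / sum_j w_j K(x | theta_j),
    with independent Gaussian kernels in y and x within each component."""
    joint = sum(wj * norm.pdf(y, my, sy) * norm.pdf(x, mx, sx)
                for wj, my, sy, mx, sx in zip(w, mu_y, sig_y, mu_x, sig_x))
    marg_x = sum(wj * norm.pdf(x, mx, sx)
                 for wj, mx, sx in zip(w, mu_x, sig_x))
    return joint / marg_x

# Hypothetical three-component mixture.
w = [0.5, 0.3, 0.2]
mu_y, sig_y = [0.0, 2.0, 4.0], [1.0, 0.5, 1.0]
mu_x, sig_x = [-1.0, 0.0, 1.0], [0.5, 0.5, 0.5]
print(conditional_density(y=1.0, x=0.2, w=w, mu_y=mu_y, sig_y=sig_y,
                          mu_x=mu_x, sig_x=sig_x))
```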

Mixture Models for Regression. Many proposals in the literature fall into these two general categories: 1. Joint approach: Müller, Erkanli, and West (1996), Shahbaba and Neal (2009), Hannah, Blei, and Powell (2011), Park and Dunson (2010), Müller and Quintana (2010), Bhattacharya, Page, and Dunson (2012), Quintana (2011), ... 2. Conditional approach: a. Covariate-dependent atoms: MacEachern (1999; 2000; 2001), De Iorio et al. (2004), Gelfand, Kottas, and MacEachern (2005), Rodriguez and Horst (2008), De La Cruz, Quintana, and Müller (2007), De Iorio et al. (2009), Jara et al. (2010), ... Example: $f_{P_x}(y \mid x) = \sum_{j=1}^{\infty} w_j N(y \mid \mu_j(x), \sigma_j^2)$. b. Covariate-dependent weights: Griffin and Steel (2006; 2010), Dunson and Park (2008), Ren et al. (2011), Chung and Dunson (2009), Rodriguez and Dunson (2011), ... (also models based on covariate-dependent urn schemes or partitions). Example: $f_{P_x}(y \mid x) = \sum_{j=1}^{\infty} w_j(x) N(y \mid x\beta_j, \sigma_j^2)$.

Mixture Models for Regression. How to choose among the large number of proposals for the application at hand? Computational reasons. Frequentist properties: consistency (Hannah, Blei, and Powell (2011); Norets and Pelenis (2012); Pati, Dunson, and Tokdar (2012)), but asymptotic results may hide what happens in finite samples. Convergence rates? Our aim is to answer this question from a Bayesian perspective. We do so through a detailed study of predictive performance based on finite samples, identifying potential sources of improvement and developing novel models to improve predictions.

Motivating Application. Motivating application: a study of Alzheimer's Disease based on neuroimaging data. A definite diagnosis is not known until autopsy, and current diagnosis is based on clinical information. Aim: show that neuroimages can be used to improve diagnostic accuracy and may be better outcome measures for clinical trials, particularly in early stages of the disease. Prior knowledge, the need for dimension reduction, and non-homogeneity motivate a BNP approach.

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

EDP (with S. Mongelluzzo and S. Petrone). In many applications, the DP is a prior for a multivariate random probability measure. The DP has many desirable properties (easy elicitation of the parameters $\alpha$, $P_0$; conjugacy; large support). However, when considering the set of probability measures on $\mathbb{R}^k$, a DP prior can be quite restrictive in the sense that the variability is determined by a single parameter, $\alpha$. Enriched conjugate priors (Consonni and Veronese (2001)) have been proposed to address an analogous lack of flexibility of standard conjugate priors in a parametric setting. Solution: construct an enriched DP that is more flexible, yet still conjugate, by extending the idea of enriched conjugate priors to a nonparametric setting.

EDP: Definition. Sample space: $\Theta \times \Psi$, where $\Theta, \Psi$ are complete separable metric spaces. Enriched Dirichlet Process: $P_0$ is a probability measure on $\Theta \times \Psi$, $\alpha_\theta > 0$, and $\alpha_\psi(\theta) > 0$ for all $\theta \in \Theta$. Assume: 1. $P_\theta \sim DP(\alpha_\theta P_{0\theta})$. 2. For all $\theta \in \Theta$, $P_{\psi \mid \theta}(\cdot \mid \theta) \sim DP(\alpha_\psi(\theta) P_{0\psi \mid \theta}(\cdot \mid \theta))$. 3. The $P_{\psi \mid \theta}(\cdot \mid \theta)$, $\theta \in \Theta$, are independent among themselves. 4. $P_\theta$ is independent of $\{P_{\psi \mid \theta}(\cdot \mid \theta)\}_{\theta \in \Theta}$. The law of the random joint measure is obtained from the joint law of the marginal and conditionals through the mapping $(P_\theta, P_{\psi \mid \theta}) \mapsto \int P_{\psi \mid \theta}(\cdot \mid \theta)\, dP_\theta(\theta)$. Many more precision parameters, yet conjugacy and many other desirable properties are maintained.
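
The nested structure can be visualised with a truncated stick-breaking simulation. This is only a sketch under simplifying assumptions not taken from the slides: a common precision $\alpha_\psi(\theta) \equiv \alpha_\psi$, standard-normal base measures for both $\theta$ and $\psi$, and truncation at finitely many atoms.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, trunc, rng):
    """Truncated stick-breaking weights for a DP with precision alpha."""
    v = rng.beta(1.0, alpha, size=trunc)
    v[-1] = 1.0  # force the truncated weights to sum to one
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return w

def draw_edp(alpha_theta, alpha_psi, trunc=20, rng=rng):
    """Draw a truncated EDP: a marginal DP over theta atoms, plus an
    independent conditional DP over psi attached to each theta atom."""
    w_theta = stick_breaking(alpha_theta, trunc, rng)
    theta = rng.normal(size=trunc)                 # atoms from P_{0 theta} = N(0, 1)
    psi_weights = [stick_breaking(alpha_psi, trunc, rng) for _ in range(trunc)]
    psi_atoms = [rng.normal(size=trunc) for _ in range(trunc)]  # P_{0 psi|theta} = N(0, 1)
    return w_theta, theta, psi_weights, psi_atoms

w_theta, theta, psi_w, psi_a = draw_edp(alpha_theta=1.0, alpha_psi=5.0)
print(w_theta[:5], theta[:5])
```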

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Summary. DP mixture models are popular tools in BNP, with known properties and an analytically computable urn scheme. However, as $p$ increases, $x$ tends to dominate the posterior of the random partition, negatively affecting prediction. Solution: replace the DP with the EDP, to allow local clustering while maintaining an analytically computable urn scheme.

Joint DP mixture model. Model (Müller et al. (1996)): $f_P(y, x) = \int N(y, x \mid \mu, \Sigma)\, dP(\mu, \Sigma)$, $P \sim DP(\alpha P_0)$, where $P_0$ is the conjugate normal-inverse Wishart. Problems: requires sampling the full $(p+1) \times (p+1)$ covariance matrix, and the inverse Wishart is poorly parametrized. Model (Shahbaba and Neal (2009)): $f_P(y, x) = \int N(y \mid x\beta, \sigma^2_{y \mid x}) \prod_{h=1}^{p} N(x_h \mid \mu_h, \sigma^2_{x,h})\, dP(\theta, \psi)$, $P \sim DP(\alpha\, P_{0\theta} \times P_{0\psi})$, where $\theta = (\beta, \sigma^2_{y \mid x})$, $\psi = (\mu, \sigma^2_x)$, and $P_{0\theta}$, $P_{0\psi}$ are the conjugate normal-inverse gamma priors. Improvements: 1) eases computations, 2) more flexible prior for the variances, 3) flexible local correlations between $Y$ and $X$, and 4) easy incorporation of other types of $Y$ and $X$.

Examining the Partition. $P$ is discrete a.s., which induces a random partition of the subjects. Notation: $\rho_n = (s_1, \ldots, s_n)$ (partition), $(\theta^*_j, \psi^*_j)$ (unique parameters), $(x^*_j, y^*_j)$ (cluster-specific observations). Partition: $p(\rho_n \mid x_{1:n}, y_{1:n}) \propto \alpha^k \prod_{j=1}^{k} \Gamma(n_j) \prod_{j=1}^{k} \left[ \prod_{h=1}^{p} g_{x,h}(x^*_{j,h}) \right] g_y(y^*_j \mid x^*_j)$, where, for example, $g_y(y^*_j \mid x^*_j) = \int \prod_{i: s_i = j} K(y_i \mid x_i, \theta)\, dP_{0\theta}(\theta)$. As $p$ increases, $x$ dominates the partition structure: $x$ exhibits increasing departures, so a large $k$ and small $n_1, \ldots, n_k$ have high posterior probability.

Examining the Prediction. How does this affect prediction?
$E[Y_{n+1} \mid y_{1:n}, x_{1:n+1}] = \int \Big[ w_{k+1}(x_{n+1})\, x_{n+1}\beta_0 + \sum_{j=1}^{k} w_j(x_{n+1})\, x_{n+1}\beta^*_j \Big]\, dP(\rho_n, \theta^*, \psi^* \mid y_{1:n}, x_{1:n+1})$,
where $w_j(x_{n+1}) = c^{-1}\, n_j \prod_{h=1}^{p} N(x_{n+1,h} \mid \mu^*_{j,h}, \sigma^{2}_{x,j,h})$ for $j = 1, \ldots, k$, and $w_{k+1}(x_{n+1}) = c^{-1} \alpha\, g_x(x_{n+1})$.
As $p$ increases: given the partition, predictive estimates are an average over a large number of components; within a component, predictive estimates are unreliable and variable due to small sample sizes; and the measure of similarity in the covariate space is rigid.
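
To make the weighting concrete, here is a sketch of the within-partition part of this predictive mean, conditional on a single fixed partition and on cluster-specific draws of $(\beta_j, \mu_j, \sigma_{x,j})$: each cluster's local linear prediction $x'\beta_j$ is weighted by $n_j$ times the Gaussian similarity of $x_{n+1}$ to that cluster in the covariate space. The new-cluster term (weight proportional to $\alpha\, g_x(x_{n+1})$) is omitted for brevity, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def predictive_mean(x_new, clusters):
    """E[Y_{n+1} | partition, cluster params, x_{n+1}] for a joint DP mixture
    with local linear regressions, ignoring the new-cluster term.
    clusters: list of dicts with keys 'n', 'beta', 'mu', 'sigma_x'."""
    weights, means = [], []
    for c in clusters:
        # similarity of x_new to the cluster's covariate kernel, times cluster size
        sim = np.prod(norm.pdf(x_new, loc=c["mu"], scale=c["sigma_x"]))
        weights.append(c["n"] * sim)
        means.append(float(np.dot(x_new, c["beta"])))
    weights = np.asarray(weights)
    weights = weights / weights.sum()
    return float(np.dot(weights, means))

# Hypothetical two-cluster state with p = 2 covariates.
clusters = [
    {"n": 30, "beta": np.array([1.0, -0.5]), "mu": np.array([0.0, 0.0]),
     "sigma_x": np.array([1.0, 1.0])},
    {"n": 10, "beta": np.array([0.2, 0.8]), "mu": np.array([2.0, 2.0]),
     "sigma_x": np.array([0.5, 0.5])},
]
print(predictive_mean(np.array([0.5, 0.1]), clusters))
```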

An EDP mixture model for regression (with S. Petrone). Model: $f_P(y, x) = \int N(y \mid x\beta, \sigma^2_{y \mid x}) \prod_{h=1}^{p} N(x_h \mid \mu_h, \sigma^2_{x,h})\, dP(\theta, \psi)$, $P \sim EDP(\alpha_\theta, \alpha_\psi(\cdot), P_{0\theta} \times P_{0\psi})$. This defines a nested partition structure $\rho_n = (\rho_{n,y}, (\rho_{n_{j+},x}))$, where $k$ represents the number of $y$-clusters, of sizes $n_{j+}$, and $k_j$ represents the number of $x$-clusters within the $j$th $y$-cluster, of sizes $n_{j,l}$. Additional notation: $(\theta^*_j, \psi^*_{j,l})$ (unique parameters), $(y^*_j, x^*_{j,l})$ (cluster-specific observations).

Examining the Partition. Under the assumption that $\alpha_\psi(\theta) = \alpha_\psi$ for all $\theta \in \Theta$,
$p(\rho_n \mid x_{1:n}, y_{1:n}) \propto \alpha_\theta^{k} \prod_{j=1}^{k} \left[ \alpha_\psi^{k_j}\, \frac{\Gamma(\alpha_\psi)\, \Gamma(n_{j+})}{\Gamma(\alpha_\psi + n_{j+})} \prod_{l=1}^{k_j} \Gamma(n_{j,l}) \right] \prod_{j=1}^{k} g_y(y^*_j \mid x^*_j) \prod_{l=1}^{k_j} \prod_{h=1}^{p} g_x(x^*_{j,l,h})$.
The likelihood of $y$ determines whether a coarser $y$-partition is present, so a smaller $k$ and larger $n_{1+}, \ldots, n_{k+}$ have high posterior probability.

Examining the Prediction. Prediction:
$E[Y_{n+1} \mid y_{1:n}, x_{1:n+1}] = \int \Big[ w_{k+1}(x_{n+1})\, x_{n+1}\beta_0 + \sum_{j=1}^{k} w_j(x_{n+1})\, x_{n+1}\beta^*_j \Big]\, dP(\rho_n, \theta^*, \psi^* \mid y_{1:n}, x_{1:n+1})$,
where $w_j(x_{n+1}) = c^{-1}\, n_{j+} \left( \frac{\alpha_x}{\alpha_x + n_{j+}}\, g_x(x_{n+1}) + \sum_{l=1}^{k_j} \frac{n_{j,l}}{\alpha_x + n_{j+}} \prod_{h=1}^{p} N(x_{n+1,h} \mid \mu^*_{l \mid j,h}, \sigma^{2}_{x,l \mid j,h}) \right)$ and $w_{k+1}(x_{n+1}) = c^{-1} \alpha_y\, g_x(x_{n+1})$.
Given the partition, predictive estimates are an average over a smaller number of components; within a component, predictive estimates are based on larger sample sizes; and the measure of similarity in the covariate space is flexible. Summary: more reliable and less variable estimates, while computations remain relatively simple due to the analytically computable urn scheme.

Application to AD data. Aim: diagnosis of AD. Data: $Y$ is a binary indicator of CN (1) or AD (0); $X$ contains summaries of structural Magnetic Resonance Images (sMRI) ($p = 15$). Model: $Y_i \mid x_i, \beta_i \overset{ind}{\sim} \text{Bern}(\Phi(x_i\beta_i / \sigma_{y,i}))$, $X_i \mid \mu_i, \sigma^2_{x,i} \overset{ind}{\sim} \prod_{h=1}^{p} N(\mu_{i,h}, \sigma^2_{x,i,h})$, $(\beta_i, \mu_i, \sigma^2_{x,i}) \mid P \overset{iid}{\sim} P$, $P \sim Q$, where $Q$ is a DP or EDP with the same base measure and gamma hyperpriors on the precision parameters.

Results for n = 185. Posterior mode of $k$: 16 (DP) vs. 3 (EDP). [Figure: scatter plots of the estimated clusters over the sMRI covariates (ICV, BV, LHV, RHV), panels (a) DP and (b) EDP.] Prediction for $m = 192$ subjects: number of subjects correctly classified: 159 (DP) vs. 168 (EDP), with tighter credible intervals.

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Introduction. Aim: analyse the role of the partition in prediction for BNP mixtures and highlight an understated problem: the huge dimension of the partition space. Assume both $Y$ and $X$ are univariate and continuous. Number of ways to partition the $n$ subjects: $S_{n,k} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^j \binom{k}{j} (k-j)^n$, $B_n = \sum_{k=1}^{n} S_{n,k}$. Many of these partitions are similar, so a large number of partitions will fit the data.

Introducing Covariate Information. Covariates typically provide information on the partition structure. Incorporating this covariate information (as in models with $w_j(x)$) will improve posterior inference on the partition. But the posterior is still diluted, so this information needs to be strictly enforced. If the aim is estimation of the regression function, desirable partitions satisfy an ordering constraint in the $x_i$, which reduces the total number of partitions to $2^{n-1}$. Note: if $n = 10$, $B_{10} = 115{,}975$ and $2^{10-1} = 512$, about 0.44% of the total partitions; and if $n = 100$, $2^{100-1}/B_{100} < 10^{-85}$. (A numerical check of these counts is sketched below.)
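
The counts above can be verified with a few lines of Python, computing Bell numbers via the Bell triangle and comparing with the $2^{n-1}$ order-respecting (contiguous) partitions.

```python
def bell(n):
    """Bell number B_n via the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
    return row[0]

for n in (10, 100):
    total = bell(n)
    contiguous = 2 ** (n - 1)     # partitions respecting the ordering of x
    print(n, total, contiguous, contiguous / total)
# n = 10: 115,975 total partitions vs. 512 ordered ones (about 0.44%).
```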

Restricted DPM (with S.G. Walker and S. Petrone). Let $x_{(1)}, \ldots, x_{(n)}$ denote the ordered values of $x$, and $s_{(1)}, \ldots, s_{(n)}$ denote the correspondingly ordered values of $s_1, \ldots, s_n$. Prior for the random partition: $p(\rho_n \mid x) = \frac{\Gamma(\alpha)\,\Gamma(n+1)}{\Gamma(\alpha+n)}\, \frac{\alpha^k}{k!} \prod_{j=1}^{k} \frac{1}{n_j}\, \mathbb{1}(s_{(1)} \leq \cdots \leq s_{(n)})$. This introduces covariate dependency, greatly reduces the total number of partitions, improves prediction, and speeds up computations.
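
A minimal sketch evaluating the stated prior from the block sizes of a contiguous, order-respecting partition (an order-respecting partition is fully determined by its block sizes along the sorted $x$); the value of $\alpha$ and the block sizes below are hypothetical.

```python
from math import lgamma, exp, log, factorial

def log_restricted_dpm_prior(block_sizes, alpha):
    """log p(rho_n | x) for a partition respecting the ordering of x,
    specified by its contiguous block sizes n_1, ..., n_k."""
    n = sum(block_sizes)
    k = len(block_sizes)
    logp = lgamma(alpha) + lgamma(n + 1) - lgamma(alpha + n)
    logp += k * log(alpha) - log(factorial(k))
    logp -= sum(log(nj) for nj in block_sizes)
    return logp

# Hypothetical: n = 10 ordered observations split into blocks of sizes 4, 3, 3.
print(exp(log_restricted_dpm_prior([4, 3, 3], alpha=1.0)))
```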

Simulated Example. Data are simulated from $y_i \mid x_i = \begin{cases} x_i/8 + 5 & \text{if } x_i \leq 6, \\ 2x_i - 12 & \text{if } x_i > 6, \end{cases}$ with $x_i \in \{0, 0.25, 0.5, \ldots, 8.75, 9\}$.
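
The simulated covariates and responses can be generated directly from this specification. The sketch below follows the relation exactly as written on the slide (deterministic responses on the stated grid); if a noise term was added in the thesis, that detail is not shown here.

```python
import numpy as np

# Covariate grid from the slide: 0, 0.25, 0.5, ..., 8.75, 9.
x = np.arange(0.0, 9.25, 0.25)

# Piecewise relation: x/8 + 5 for x <= 6, and 2x - 12 for x > 6.
y = np.where(x <= 6, x / 8.0 + 5.0, 2.0 * x - 12.0)

print(x[:5], y[:5])
```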

Simulated Ex.: 3 partitions with highest posterior mass. [Figure: the three partitions with highest posterior mass under each model; posterior probabilities $p(\rho \mid y, x)$ of 0.3973, 0.0695, 0.0381 for (e) the DPM; 0.5317, 0.0493, 0.0418 for (f) the joint DPM; and 0.9031, 0.0302, 0.0129 for (g) the restricted DPM.] Number of partitions with positive posterior probability: 1,263, 918, and 43, respectively.

Simulated Example: Prediction. [Figure: predictive fits for (h) the DPM, (i) the joint DPM, and (j) the restricted DPM.] Figure: prediction for $x = (3.3, 5.9, 6.2, 6.3, 7.9, 8.1, 10)$. The empirical $L_2$ prediction error, $\big( \frac{1}{m} \sum_{j=1}^{m} (\hat{y}_{j,\text{est}} - \hat{y}_{j,\text{true}})^2 \big)^{1/2}$, for the three models, respectively, is 2.19, 1.66, and 1.09.

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Introduction. BNP mixture models with covariate-dependent weights: $f_{P_x}(y \mid x) = \int K(y \mid x, \theta)\, dP_x(\theta)$, $P_x = \sum_{j=1}^{\infty} w_j(x) \delta_{\theta_j}$. Construction of covariate-dependent weights (in the literature): $w_j(x) = v_j(x) \prod_{j' < j} (1 - v_{j'}(x))$, where the $v_j(\cdot)$ are independent stochastic processes. Drawbacks: non-interpretable weight functions, typically exhibiting peculiarities; the choice of functional shapes and hyper-parameters; and extensions for continuous and discrete covariates are not straightforward. Aim: construct interpretable weights that can handle continuous and discrete covariates, with feasible posterior inference.

Normalized weights (with S.G. Walker and I. Antoniano). Solution: since $w_j(x)$ represents the weight assigned to regression model $j$ at covariate $x$, an interpretable construction is obtained through Bayes' theorem by combining $P(A \mid j) = \int_A K(x \mid \psi_j)\, dx$, the probability that an observation from regression model $j$ has a covariate value in the region $A \subseteq \mathcal{X}$, and $P(j) = w_j$, the prior probability that an observation comes from model $j$: $w_j(x) = \frac{w_j K(x \mid \psi_j)}{\sum_{j'=1}^{\infty} w_{j'} K(x \mid \psi_{j'})}$. Inclusion of continuous and discrete covariates is straightforward via an appropriate definition of $K(x \mid \psi)$.
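
A minimal sketch of these weights for a truncated mixture with Gaussian covariate kernels $K(x \mid \psi_j) = N(x \mid \mu_j, \sigma_j^2)$; the kernel choice and all parameter values are hypothetical and only illustrate the normalization.

```python
import numpy as np
from scipy.stats import norm

def normalized_weights(x, w, mu, sigma):
    """w_j(x) = w_j K(x | psi_j) / sum_j' w_j' K(x | psi_j'),
    with Gaussian covariate kernels."""
    k = np.asarray(w) * norm.pdf(x, loc=mu, scale=sigma)
    return k / k.sum()

# Hypothetical three-component example.
w = np.array([0.5, 0.3, 0.2])
mu = np.array([-1.0, 0.0, 2.0])
sigma = np.array([0.5, 1.0, 0.7])
print(normalized_weights(0.3, w, mu, sigma))   # weights sum to one at each x
```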

Normalized weights: Computations. The likelihood contains an intractable normalizing constant: $f_P(y \mid x) = \sum_{j=1}^{\infty} w_j(x) K(y \mid x, \theta_j)$, where $w_j(x) = \frac{w_j K(x \mid \psi_j)}{\sum_{j'=1}^{\infty} w_{j'} K(x \mid \psi_{j'})}$. However, using the geometric series, for $|r| < 1$, $\sum_{k=0}^{\infty} r^k = \frac{1}{1-r}$, the likelihood can be written as $f_P(y \mid x) = \sum_{j=1}^{\infty} w_j K(x \mid \psi_j) K(y \mid x, \theta_j) \sum_{k=0}^{\infty} \Big( 1 - \sum_{j'=1}^{\infty} w_{j'} K(x \mid \psi_{j'}) \Big)^k$.

Normalized weights: Computations. The likelihood can be re-written as $f_P(y \mid x) = \sum_{j=1}^{\infty} w_j K(x \mid \psi_j) K(y \mid x, \theta_j) \sum_{k=0}^{\infty} \Big( \sum_{j'=1}^{\infty} w_{j'} (1 - K(x \mid \psi_{j'})) \Big)^k = \sum_{j=1}^{\infty} w_j K(x \mid \psi_j) K(y \mid x, \theta_j) \sum_{k=0}^{\infty} \prod_{l=1}^{k} \Big( \sum_{j_l=1}^{\infty} w_{j_l} (1 - K(x \mid \psi_{j_l})) \Big)$. Introducing latent variables $(k, d, D)$, the latent model is $f_P(y, k, d, D \mid x) = w_d K(x \mid \psi_d) K(y \mid x, \theta_d) \prod_{l=1}^{k} w_{D_l} (1 - K(x \mid \psi_{D_l}))$.
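
The expansion can be checked numerically on a small truncated example. The sketch below uses three components with hypothetical parameters and Gaussian kernels whose densities are bounded by one (so each term of the series is nonnegative and the series converges); the direct ratio form and the geometric-series form agree.

```python
import numpy as np
from scipy.stats import norm

# Truncated example with Gaussian covariate kernels bounded by one (sigma >= 1/sqrt(2*pi)).
w = np.array([0.5, 0.3, 0.2])
mu_x, sig_x = np.array([-1.0, 0.0, 2.0]), np.array([1.0, 1.0, 1.0])
beta, sig_y = np.array([1.0, -0.5, 2.0]), np.array([0.5, 0.5, 0.5])

def cond_density(y, x):
    """f(y | x) computed directly as a ratio (the intractable-constant form)."""
    kx = norm.pdf(x, mu_x, sig_x)
    return np.sum(w * kx * norm.pdf(y, x * beta, sig_y)) / np.sum(w * kx)

def cond_density_series(y, x, n_terms=200):
    """The same density via the geometric-series expansion behind the
    latent-variable augmentation."""
    kx = norm.pdf(x, mu_x, sig_x)
    num = np.sum(w * kx * norm.pdf(y, x * beta, sig_y))
    r = 1.0 - np.sum(w * kx)            # |r| < 1 here, so the series converges
    return num * np.sum(r ** np.arange(n_terms))

print(cond_density(0.7, 0.4), cond_density_series(0.7, 0.4))  # should agree
```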

Outline 1 Introduction 2 Enriched Dirichlet Process 3 EDP Mixtures for Regression 4 Restricted DP Mixtures for Regression 5 Normalized Covariate Dependent Weights 6 Discussion

Discussion. BNP mixture models for regression are highly flexible and numerous. This thesis aimed to shed light on the advantages and drawbacks of the various models. 1. Joint models: + easy computations; + flexible and easy to extend to other types of $Y$ and $X$; − posterior inference is based on the joint; − problematic for large $p$. Can overcome with the EDP prior (which maintains the two advantages).

Discussion. 2. Conditional models with covariate-dependent atoms: + moderate computations (increasing with the flexibility of $\theta_j(x)$); + posterior inference is based on the conditional; − an appropriate definition of $\theta_j(x)$ for mixed $x$ can be challenging; − the choice of $\theta_j(x)$ has strong implications: if inflexible, extremely poor prediction; if overly flexible, prediction may also suffer. 3. Conditional models with covariate-dependent weights: + posterior inference is based on the conditional; + flexible and incorporates covariate proximity in the partition; − difficult computations; − lack of interpretable weights that can accommodate mixed $x$. Can overcome with normalized weights. − Huge dimension of the partition space: the posterior of $\rho_n$ is very spread out. Can overcome by enforcing a notion of covariate proximity.

Future Work. Examining models for large-$p$ problems. Formalizing model comparison among the various models: consistency, convergence rates, finite-sample bounds. AD study: diagnosis based on sMRI and other imaging techniques; investigating the dynamics of AD biomarkers (jointly, and incorporating longitudinal data).

References 1. Consonni, G. and Veronese, P. (2001). Conditionally reducible natural exponential families and enriched conjugate priors. Scandinavian Journal of Statistics, 28, 377–406. 2. Ramamoorthi, R.V. and Sangalli, L. (2005). On a characterization of the Dirichlet distribution. Proceedings of the International Conference on Bayesian Statistics and its Applications, Varanasi, India. 3. Hannah, L., Blei, D. and Powell, W. (2011). Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research, 12, 1923–1953. 4. Müller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika, 83, 67–79. 5. Shahbaba, B. and Neal, R.M. (2009). Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research, 10, 1829–1850.

References 6. MacEachern, S.N. (2000). Dependent nonparametric processes. Technical Report, Department of Statistics, Ohio State University. 7. Griffin, J.E. and Steel, M.F.J. (2006). Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101, 179–194. 8. Dunson, D.B. and Park, J.H. (2008). Kernel stick-breaking processes. Biometrika, 95, 307–323. 9. Rodriguez, A. and Dunson, D.B. (2011). Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6, 145–178. 10. Fuentes-Garcia, R., Mena, R.H. and Walker, S.G. (2010). A probability for classification based on the mixture of Dirichlet process model. Journal of Classification, 27, 389–403.