Nonparametric Bayes Modeling Lecture 6: Advanced Applications of DPMs David Dunson Department of Statistical Science, Duke University Tuesday February 2, 2010

Motivation Functional data analysis Variable selection & shrinkage

Hierarchical Modeling. $\theta_i$ = random effects specific to subject $i$. Hierarchical models let $\theta_i \sim P$, where $P$ = random effects distribution. Choice of $P$ is critical in controlling borrowing of information

Some Classical Applications Meta Analysis: combine data from multiple studies to make overall conclusion (e.g., drug is effective) Multi-level Designs: subjects are nested in schools, regions or study centers Longitudinal Data: data collected for subject over time - important to accommodate within-subject dependence

Some Emerging Applications Joint modeling of data from different domains Images and captions Diagnostic images or functional predictors & health responses Multiple types of omics data (sequence & expression) Multi-task learning: borrow strength across tasks Multiple images, music pieces, security videos Compressive sensing User preferences in different domains (film, books, etc)

Application 1 - Multinational Bioassay. Increasing concern about adverse effects of environmental estrogens on human development. Rodent uterotrophic bioassay: system for identifying suspected agonists or antagonists of estrogen. OECD study: collected data from 19 laboratories to investigate consistency of effects of known agonist (EE) & antagonist (ZM). $y_{ij}$ = uterus weight for rat $j$ in lab $i$; $x_{ij}$ = protocol type, dose of EE, dose of ZM

Summary of 19 participating laboratories

Notation  Lab Name    Country      Protocols Conducted
F1        Citifrance  France       A
F2        Poulenc     France       A
K1        ChungKorea  Korea        A, B
K2        KoreaPark   Korea        B, C
G1        Berlin      Germany      A
G2        Basf        Germany      A
G3        Bayer       Germany      A
J1        Citijapan   Japan        A, B, C
J2        Hatano      Japan        A, B, C, D
J3        InEnvTox    Japan        A, B, C, D
J4        Mitsubishi  Japan        A, B, C, D
J5        Nihon       Japan        A, B, C, D
J6        Sumitomo    Japan        A, B, C
N1        Denmark     Netherlands  B
N2        TNO         Netherlands  A, B
B1        Huntingdon  UK           C
B2        Zeneca      UK           A, B, C
U1        Exxon       USA          A
U2        WIL         USA          A, B

Some Comments. Can potentially fit normal random effects model, $y_{ij} = x_{ij}'\theta_i + \epsilon_{ij}$, $\epsilon_{ij} \sim N(0, \sigma^2)$, $\theta_i \sim N_p(\theta, \Sigma)$. Normal distribution has light tails & does not allow outlying labs or clusters of labs. Conclusions may be sensitive to violations of normality. Appealing to have a more flexible approach available

Application 2 - Dependent Functions. Interest in estimating a collection of functions, $\{f_i\}_{i=1}^{n}$. Longitudinal trajectories for different individuals. Function relating features to probability $y_{ij} = 1$ for task $i$. We will focus on the following model: $y_{ij} = f_i(t_{ij}) + \epsilon_{ij}$, $\epsilon_{ij} \sim t_\nu(\sigma^2)$, $f_i(t) = \sum_{j=1}^{p} \theta_{ij} b_j(t) = b(t)'\theta_i$, $\theta_i \sim P$, where $b = \{b_j\}$ = basis functions, $\theta_i$ = basis coefficients
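The basis expansion in this model can be illustrated with a minimal Python sketch. It is not taken from the lecture: the polynomial basis, the coefficient values, and the function names (`polynomial_basis`, `evaluate_curve`, `scaled_t_noise`) are assumptions chosen only for illustration, and $t_\nu(\sigma^2)$ is read here as a Student t with $\nu$ degrees of freedom and scale $\sigma$.

```python
import math
import random

def polynomial_basis(t, p):
    """Hypothetical basis b(t) = (1, t, ..., t^(p-1)); splines would be used in practice."""
    return [t ** j for j in range(p)]

def evaluate_curve(theta, t):
    """Return f(t) = b(t)' theta for one subject's coefficient vector theta."""
    basis = polynomial_basis(t, len(theta))
    return sum(b_j * th_j for b_j, th_j in zip(basis, theta))

def scaled_t_noise(nu, sigma, rng):
    """Draw sigma * Z / sqrt(W / nu), with Z ~ N(0, 1) and W ~ chi-square(nu)."""
    chi_square = rng.gammavariate(nu / 2.0, 2.0)  # Gamma(shape nu/2, scale 2)
    return sigma * rng.gauss(0.0, 1.0) / math.sqrt(chi_square / nu)

rng = random.Random(0)
theta_i = [1.0, -2.0, 0.5]      # illustrative basis coefficients for subject i
times = [0.0, 0.5, 1.0, 2.0]    # illustrative observation times t_ij

# Noise-free curve values f_i(t_ij) and noisy observations y_ij.
curve = [evaluate_curve(theta_i, t) for t in times]
y_i = [f + scaled_t_noise(nu=4.0, sigma=0.1, rng=rng) for f in curve]
```

Swapping `polynomial_basis` for a spline basis changes only how $b(t)$ is evaluated; the curve remains a linear function of $\theta_i$, which is what lets the random effects distribution $P$ control heterogeneity among subjects.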

log PdG trajectories (Bigelow & Dunson, JASA, 08)

Comments on Functional Data Model Subject-specific basis coefficients, θ i, allow variability in the functional trajectories for different individuals Heterogeneity among subjects controlled by the random effects distribution, P Number of basis functions, p, is not small (p >= 20)

Multi-Task Compressive Sensing (Ji, Dunson, Carin 08). Compressive sensing (CS): limit # measurements needed to accurately reconstruct a signal, $u$. Let $u_i = \Psi\theta_i + \epsilon_i$, with $\Psi$ a wavelet basis, & assume that many of the elements of $\theta_i$ are 0. Instead of measuring $u_i$, CS measures $v_i = \Phi\Psi u_i$, with $\Phi$ a random projection matrix. Candès & Tao prove accuracy in reconstructing $u_i$ from $v_i$. We propose to let $\theta_i \sim P$ to borrow information from related signals

Application 3 - Multiple brain images [Figure: 15 panels. (a)-(e) Linear 1-5, N=4096; (f)-(j) ST 1-5, N=1636; (k)-(o) MT 1-5, N=1636]

Application 4 - Multiple videos [Figure: 15 panels. (a)-(e) Linear 1-5, N=4096; (f)-(j) ST 1-5, N=1717; (k)-(o) MT 1-5, N=1717]

Application 5 - Images and text [Figure: image labeled "Coast" with annotation words: sky, sea water, sand beach, tree, tree, mountain, mountain, person, person]

DPMs for Longitudinal & Multi-Level Data. A simple case corresponds to the linear mixed effects model $y_{ij} = x_{ij}'\beta + z_{ij}'b_i + \epsilon_{ij}$, $\epsilon_{ij} \sim N(0, \sigma^2)$, $b_i \sim P$, $P \sim DP(\alpha P_0)$: DP prior on $P$, the distribution of the random effects. Useful semiparametric model for longitudinal & correlated data. Bush & MacEachern (1996), Müller & Rosner (1997), Kleinman & Ibrahim (1998), Ishwaran & Takahara (2002), etc
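To see what a DP prior on the random effects distribution implies, the sketch below draws random effects from $DP(\alpha P_0)$ with the random distribution integrated out, using the standard Pólya urn (Blackwell-MacQueen) scheme. This is an illustration rather than the sampler used in the cited papers; the scalar random effect, the standard normal base measure, and the name `polya_urn_draws` are assumptions.

```python
import random

def polya_urn_draws(n, alpha, base_draw, rng):
    """Draw b_1, ..., b_n from P ~ DP(alpha * P0) with P integrated out."""
    values = []
    for i in range(n):
        # With probability alpha / (alpha + i) draw a new value from P0;
        # otherwise copy one of the i earlier values uniformly at random.
        if rng.random() < alpha / (alpha + i):
            values.append(base_draw(rng))
        else:
            values.append(rng.choice(values))
    return values

rng = random.Random(1)
# Assumed base measure P0 = N(0, 1) for a scalar random effect.
b = polya_urn_draws(50, alpha=1.0, base_draw=lambda r: r.gauss(0.0, 1.0), rng=rng)
n_clusters = len(set(b))  # number of distinct random-effect values among 50 subjects
```

Ties among the drawn values are what produce clustering of subjects: smaller `alpha` yields fewer distinct values, larger `alpha` yields draws closer to an i.i.d. sample from the base measure.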

Modeling Random Curves. Let $y_{ij} = f_i(t_{ij}) + \epsilon_{ij}$ = noisy observation of a smooth curve $f_i$ for subject $i$. For example, $f_i$ may represent child growth during development. Characterize variability in growth curves & cluster children having similar trajectories. Can be accomplished using DPM linear mixed model with $f_i(t_{ij}) = \sum_{l=1}^{p} \beta_{il} b_l(t_{ij}) = x_{ij}'\beta_i$, $\beta_i \sim P = \sum_{h=1}^{\infty} \pi_h \delta_{\beta_h^*}$, where $b = \{b_l\}_{l=1}^{p}$ = basis functions (e.g., cubic splines) and $\beta_h^*$ are the atoms of $P$

Comments - Functional Dirichlet Process. Recalling the DP stick-breaking property (Sethuraman, 1994): $\beta_i \sim P = \sum_{h=1}^{\infty} V_h \prod_{l<h} (1 - V_l)\, \delta_{\beta_h^*}$, $V_h \overset{iid}{\sim} \mathrm{beta}(1, \alpha)$, $\beta_h^* \overset{iid}{\sim} P_0$. Hence, the $n$ subjects are grouped into $k_n$ clusters. Subjects in cluster $l$ all have $\beta_i = \beta_l^*$. Provides a semiparametric Bayes version of latent trajectory class or growth mixture models. Avoids fixing the number of clusters in advance
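The stick-breaking construction above can be sketched in a few lines of Python. Two assumptions are made for illustration: the infinite sum is truncated at a fixed number of atoms (with the last stick-breaking fraction set to 1 so the weights sum to one), and the base measure is a scalar standard normal. The names `stick_breaking_weights` and `truncation` are hypothetical.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Return pi_h = V_h * prod_{l<h} (1 - V_l) for h = 1..truncation."""
    weights, remaining = [], 1.0
    for h in range(truncation):
        # Setting the final V_h to 1 makes the truncated weights sum to one.
        v = 1.0 if h == truncation - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(3)
truncation = 25
weights = stick_breaking_weights(alpha=1.0, truncation=truncation, rng=rng)
atoms = [rng.gauss(0.0, 1.0) for _ in range(truncation)]  # atoms drawn iid from assumed P0

# Each subject's coefficient is one of the atoms, chosen with probability pi_h.
n = 40
labels = rng.choices(range(truncation), weights=weights, k=n)
beta = [atoms[h] for h in labels]
k_n = len(set(labels))  # number of occupied clusters among the n subjects
```

Because the weights decay stochastically, most of the probability mass sits on the first few atoms, so `k_n` is typically much smaller than `n` without the number of clusters being fixed in advance.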

Comments continued. The curve in cluster $l$ is $f(t) = b(t)'\beta_l^*$, with $\beta_l^*$ the coefficient vector shared by cluster $l$. The number of functional clusters in $n$ growth curves is treated as unknown. Gibbs samplers of lecture 1 are straightforward to generalize. Number of clusters and configuration of subjects into clusters varies across the MCMC iterations. Problem: label switching!

Label Switching. Problem arises because the labels on the cluster-specific parameters are ambiguous, so vary in meaning across the iterations. Not meaningful to calculate posterior summaries of $\beta_h$ across the iterations. Strategies: (1) relabeling algorithms that align the clusters after running MCMC (Stephens, 00); (2) define clusters as individuals that are grouped together with high posterior probability; (3) estimate optimal clustering (Dahl, 06; Lau & Green, 97); (4) ignore problem & avoid cluster-specific inferences
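Pairwise co-clustering probabilities are invariant to relabeling, which is why they sidestep label switching. The sketch below computes them from saved MCMC label vectors and then picks the sampled partition closest to that matrix in squared error, in the spirit of the least-squares idea attributed to Dahl (06). The toy label vectors and function names are assumptions for illustration, not output from a real sampler.

```python
def coclustering_matrix(samples):
    """Estimate Pr(subjects i and j share a cluster) from sampled label vectors."""
    n = len(samples[0])
    probs = [[0.0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    probs[i][j] += 1.0 / len(samples)
    return probs

def least_squares_clustering(samples):
    """Return the sampled partition whose 0/1 co-membership matrix is closest to probs."""
    probs = coclustering_matrix(samples)
    n = len(samples[0])

    def loss(labels):
        return sum(((labels[i] == labels[j]) - probs[i][j]) ** 2
                   for i in range(n) for j in range(n))

    return min(samples, key=loss)

# Toy MCMC output for 4 subjects; rows 1, 2 and 4 are the same partition
# written with different (switched) labels.
samples = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [2, 2, 5, 5],
]
probs = coclustering_matrix(samples)
best = least_squares_clustering(samples)
```

Here subjects 1 and 2 are grouped together in every draw, so their co-clustering probability is 1, while the raw labels 0, 1, 2 and 5 carry no meaning across iterations.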

Joint Modeling One is often interested in joint modeling of data having different measurement scales Example: Joint modeling of a functional predictor with a health outcome Functional predictor may consist of a longitudinally recorded biomarker Predictor may also correspond to a diagnostic image Challenging to build flexible joint models for data having different scales

Application to Hormone Curves & Pregnancy Loss Progesterone is a female reproductive hormone - maintains pregnancy Urinary progesterone measured after ovulation through early pregnancy for 172 women Shape of the trajectory may predict impending early pregnancy loss Of interest to identify losses before they occur

log PdG trajectories (Bigelow & Dunson, JASA, 08)

Joint Modeling with Functional Predictors. Suppose interest focuses on joint modeling of a functional predictor $f_i$ & response $y_i$. Component model for functional predictor: $x_i(t_{ij}) = f_i(t_{ij}) + \epsilon_{ij}$, $\epsilon_{ij} \sim N(0, \sigma^2)$. Component model for response: $\mathrm{logit}\,\Pr(y_i = 1 \mid u_i, \mu_i) = \mu_i + u_i'\psi$, where $u_i$ = vector of covariates

Dirichlet Process Joint Modeling. How to specify joint model for functional predictor, $f_i$, & response $y_i$? Let $f_i(t) = \sum_{h=1}^{p} \beta_{ih} b_h(t)$ & $(\beta_i, \mu_i) \sim P$. $P$ can then be considered as unknown through a DP prior. This same strategy can be used broadly for data fusion & joint modeling

DP Joint Models - Comments. Approach automatically clusters subjects into groups. Group $l$ has functional predictor $f(t) = \sum_{h=1}^{p} \beta_{lh} b_h(t)$ & baseline response probability $1/\{1 + \exp(-\mu_l)\}$. Allows response probability to systematically shift between functional predictor clusters. Very flexible approach for characterizing nonlinear & complex relationships with a functional predictor

Estimated Hormone Clusters & Risk of EPL

High Dimensional Applications Enormous increase in the generation of high-dimensional data Large p, small n problems create challenges to classical methods Appealing to develop flexible Bayesian approaches for identifying sparse latent structure in high dimensional data

Application - Massive Numbers of Predictors Commonly a large number of predictors x i = (x i1,..., x ip ) are available Interest focuses on identifying important predictors of y i Many parametric approaches available (e.g., using variable selection mixture priors) Methods commonly rely on two component priors, with one component concentrated at zero & one more diffuse

Application - Single Nucleotide Polymorphism (SNP) Data. Single nucleotide polymorphisms (SNPs) - variants in the pair of nucleotides at a given locus. SNP: $g_{icl} \in \{1, 2, 3\}$ = genotype at locus $l$ within gene $c$ for individual $i$. Total number of loci can be very large (Affy set to launch million-SNP chip). One SNP = pair of alleles inherited from mother & father, with source (phase) unknown

SNPs and Health Outcomes. In genetic epidemiology, interest focuses on identifying genetic factors predictive of a disease outcome $y_i$. Epidemiologists tend to favor logistic regression models: $\mathrm{logit}\,\Pr(y_i = 1 \mid g_i, z_i) = z_i'\alpha + \sum_{c=1}^{C} \sum_{l=1}^{p_c} \sum_{h=1}^{3} 1(g_{icl} = h)\,\beta_{clh}$, where $z_i$ = environmental exposures, demographic variables, etc, $g_i$ = SNP data for $C$ genes, $\beta$ = very high-dimensional vector of coefficients. Ideally, we would characterize $\beta$ using a sparseness-favoring prior
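The triple sum over genes, loci and genotype levels simply selects one coefficient per locus. The sketch below makes that explicit for a single individual. The nested-list layout for genotypes, the dictionary layout for coefficients, and all numeric values are assumptions made for illustration.

```python
import math

def linear_predictor(z, alpha, genotypes, beta):
    """Compute z' alpha + sum_c sum_l sum_h 1(g_cl = h) * beta[c][l][h] for one individual."""
    eta = sum(z_k * a_k for z_k, a_k in zip(z, alpha))
    for c, gene in enumerate(genotypes):
        for l, g in enumerate(gene):
            # The indicator 1(g_cl = h) is nonzero for exactly one h, so it
            # picks out the single coefficient matching the observed genotype.
            eta += beta[c][l][g]
    return eta

def inverse_logit(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

z = [1.0, 0.5]                  # illustrative covariates (intercept, exposure)
alpha = [-1.0, 2.0]             # illustrative covariate coefficients
genotypes = [[1, 3], [2]]       # gene 1 has two loci, gene 2 has one locus
beta = [
    [{1: 0.0, 2: 0.0, 3: 0.0}, {1: 0.0, 2: 0.0, 3: 0.7}],
    [{1: 0.0, 2: -0.2, 3: 0.0}],
]

eta = linear_predictor(z, alpha, genotypes, beta)
prob = inverse_logit(eta)       # Pr(y = 1 | g, z) under the logistic model
```

Most entries of `beta` are zero in this toy example, which mirrors the motivation for a sparseness-favoring prior: with very many loci, only a few coefficients are expected to matter.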

DP Priors for Shrinkage & Variable Selection Assuming the elements of β are exchangeable draws from P, assign P a zero-inflated DP prior Prior is a mixture of a DP and a point mass at zero, allowing zero coefficients SNPs clustered into null & non-null groups according to impact on health response Dunson et al. (2008, JASA) implement this approach & generalizations to borrow information across functionally-related genes.
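A rough sketch of what draws from such a prior look like is given below. It assumes a fixed null probability `pi0`, a standard normal base measure, and a Pólya urn for the DP component; the exact specification in Dunson et al. (2008) may differ, and `zero_inflated_dp_draws` is a hypothetical name.

```python
import random

def zero_inflated_dp_draws(m, pi0, alpha, base_draw, rng):
    """Draw m coefficients from pi0 * (point mass at 0) + (1 - pi0) * DP component."""
    coefs, nonnull = [], []
    for _ in range(m):
        if rng.random() < pi0:
            coefs.append(0.0)          # null coefficient: exactly zero
            continue
        k = len(nonnull)
        # Polya urn for the DP part: new value from P0 with probability
        # alpha / (alpha + k), otherwise reuse an earlier non-null value.
        if rng.random() < alpha / (alpha + k):
            value = base_draw(rng)
        else:
            value = rng.choice(nonnull)
        nonnull.append(value)
        coefs.append(value)
    return coefs

rng = random.Random(4)
coefs = zero_inflated_dp_draws(200, pi0=0.9, alpha=1.0,
                               base_draw=lambda r: r.gauss(0.0, 1.0), rng=rng)
n_null = sum(1 for c in coefs if c == 0.0)
n_distinct_nonnull = len({c for c in coefs if c != 0.0})
```

The draws show both features named on the slide: many coefficients are exactly zero, and the non-zero ones share a small number of distinct values, which groups predictors by the size of their effect.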

Multiple Lasso Shrinkage. Popular Lasso procedure corresponds to MAP estimation under a double exponential (Laplace) prior. Tendency to over-shrink important predictors. MacLehose & Dunson (08) propose a DP mixture of Laplace priors for the coefficients in a high-dimensional regression model. Posterior computation uses retrospective MCMC (Papaspiliopoulos & Roberts, 08). Simulation studies show reduced MSE relative to Lasso
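The link between the Lasso and the Laplace prior can be written out in one step. The derivation below assumes a Gaussian linear model with known variance and a common rate $\lambda$ for all coefficients; the symbols $X$, $\lambda$ and $\sigma^2$ are introduced here for illustration and follow the standard textbook argument rather than the notation of MacLehose & Dunson.

```latex
\begin{align*}
  y \mid \beta &\sim N(X\beta, \sigma^2 I), \qquad
  p(\beta_j) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|), \quad j = 1, \dots, p, \\
  -\log p(\beta \mid y) &= \frac{1}{2\sigma^2} \lVert y - X\beta \rVert_2^2
      + \lambda \sum_{j=1}^{p} |\beta_j| + \text{const}, \\
  \hat{\beta}_{\mathrm{MAP}} &= \arg\min_{\beta}
      \Big\{ \lVert y - X\beta \rVert_2^2 + 2\sigma^2 \lambda \lVert \beta \rVert_1 \Big\}.
\end{align*}
```

A single $\lambda$ applies the same penalty to every coefficient, which is the source of the over-shrinkage noted above. Letting coefficients draw their Laplace scale from a DP mixture allows different groups of coefficients to be shrunk by different amounts.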

Illustration of Multiple Lasso Prior

Estimated Coefficients for Pima Indian Data

Parkinson's Disease Application. Parkinson's disease is a common neurologic disease - tremors, rigidity & slowness of movement. SNP data available for 540 individuals, with 270 having Parkinson's disease. Focus on 270 SNPs on chromosome 11: 540 SNP coefficients to estimate. MCMC was run for 50,000 iterations. Bayesian Lasso very sensitive to hyperparameters - either selects all SNPs or none. Two genotypes were selected by the multiple Lasso

Histograms of Posterior Probabilities of Genotype Effect

Summary. DPMs are useful in a very broad variety of application areas beyond density estimation. By sharing clustering across data from different scales, they provide a highly flexible approach for joint modeling. Also useful for generating more flexible sparse shrinkage priors - DP favors few components. Computation is feasible even in high dimensions using efficient MCMC & alternatives (variational approximations, fast sequential search, etc)