Nonparametric Bayes Modeling Lecture 6: Advanced Applications of DPMs David Dunson Department of Statistical Science, Duke University Tuesday February 2, 2010
Motivation Functional data analysis Variable selection & shrinkage
Hierarchical Modeling
- $\theta_i$ = random effects specific to subject $i$
- Hierarchical models let $\theta_i \sim P$, where $P$ = random effects distribution
- Choice of $P$ is critical in controlling borrowing of information
Some Classical Applications Meta Analysis: combine data from multiple studies to make overall conclusion (e.g., drug is effective) Multi-level Designs: subjects are nested in schools, regions or study centers Longitudinal Data: data collected for subject over time - important to accommodate within-subject dependence
Some Emerging Applications Joint modeling of data from different domains Images and captions Diagnostic images or functional predictors & health responses Multiple types of omics data (sequence & expression) Multi-task learning: borrow strength across tasks Multiple images, music pieces, security videos Compressive sensing User preferences in different domains (film, books, etc)
Application 1 - Multinational Bioassay
- Increasing concern about adverse effects of environmental estrogens on human development
- Rodent uterotrophic bioassay: system for identifying suspected agonists or antagonists of estrogen
- OECD study: collected data from 19 laboratories to investigate consistency of effects of a known agonist (EE) & antagonist (ZM)
- $y_{ij}$ = uterus weight for rat $j$ in lab $i$
- $x_{ij}$ = (protocol type, dose of EE, dose of ZM)
Summary of 19 participating laboratories

Notation  Lab Name     Country      Protocols Conducted
F1        Citifrance   France       A
F2        Poulenc      France       A
K1        ChungKorea   Korea        A, B
K2        KoreaPark    Korea        B, C
G1        Berlin       Germany      A
G2        Basf         Germany      A
G3        Bayer        Germany      A
J1        Citijapan    Japan        A, B, C
J2        Hatano       Japan        A, B, C, D
J3        InEnvTox     Japan        A, B, C, D
J4        Mitsubishi   Japan        A, B, C, D
J5        Nihon        Japan        A, B, C, D
J6        Sumitomo     Japan        A, B, C
N1        Denmark      Netherlands  B
N2        TNO          Netherlands  A, B
B1        Huntingdon   UK           C
B2        Zeneca       UK           A, B, C
U1        Exxon        USA          A
U2        WIL          USA          A, B
Some Comments
- Can potentially fit a normal random effects model:
  $y_{ij} = x_{ij}'\theta_i + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2), \quad \theta_i \sim N_p(\theta, \Sigma)$
- Normal distribution has light tails & does not allow outlying labs or clusters of labs
- Conclusions may be sensitive to violations of normality
- Appealing to have a more flexible approach available
Application 2 - Dependent Functions
- Interest in estimating a collection of functions, $\{f_i\}_{i=1}^n$
  - Longitudinal trajectories for different individuals
  - Function relating features to the probability that $y_{ij} = 1$ for task $i$
- We will focus on the following model:
  $y_{ij} = f_i(t_{ij}) + \epsilon_{ij}, \quad \epsilon_{ij} \sim t_\nu(\sigma^2)$
  $f_i(t) = \sum_{j=1}^p \theta_{ij} b_j(t) = b(t)'\theta_i, \quad \theta_i \sim P$
- $b = \{b_j\}$ = basis functions, $\theta_i$ = basis coefficients
log PdG trajectories (Bigelow & Dunson, JASA, 08)
Comments on Functional Data Model
- Subject-specific basis coefficients, $\theta_i$, allow variability in the functional trajectories for different individuals
- Heterogeneity among subjects is controlled by the random effects distribution, $P$
- Number of basis functions, $p$, is not small ($p \ge 20$)
Multi-Task Compressive Sensing (Ji, Dunson, Carin 08)
- Compressive sensing (CS): limit the number of measurements needed to accurately reconstruct a signal, $u$
- Let $u_i = \Psi\theta_i + \epsilon_i$, with $\Psi$ a wavelet basis, & assume that many of the elements of $\theta_i$ are 0
- Instead of measuring $u_i$, CS measures $v_i = \Phi\Psi u_i$, with $\Phi$ a random projection matrix
- Candès & Tao prove accuracy in reconstructing $u_i$ from $v_i$
- We propose to let $\theta_i \sim P$ to borrow information from related signals
Application 3 - Multiple brain images
[Figure: 15 panels in three groups of five: (a)-(e) Linear 1-5, N=4096; (f)-(j) ST 1-5, N=1636; (k)-(o) MT 1-5, N=1636]
Application 4 - Multiple videos
[Figure: 15 panels in three groups of five: (a)-(e) Linear 1-5, N=4096; (f)-(j) ST 1-5, N=1717; (k)-(o) MT 1-5, N=1717]
Application 5 - Images and text Coast sky, sea water, sand beach, tree, tree, mountain, mountain, person, person
DPMs for Longitudinal & Multi-Level Data
- A simple case corresponds to the linear mixed effects model
  $y_{ij} = x_{ij}'\beta + z_{ij}'b_i + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2)$
  $b_i \sim P, \quad P \sim DP(\alpha P_0)$
- DP prior on $P$, the distribution of the random effects
- Useful semiparametric model for longitudinal & correlated data
- Bush & MacEachern (1996), Müller & Rosner (1997), Kleinman & Ibrahim (1998), Ishwaran & Takahara (2002), etc
Modeling Random Curves
- Let $y_{ij} = f_i(t_{ij}) + \epsilon_{ij}$ = noisy observation of a smooth curve $f_i$ for subject $i$
- For example, $f_i$ may represent child growth during development
- Characterize variability in growth curves & cluster children having similar trajectories
- Can be accomplished using a DPM linear mixed model with
  $f_i(t_{ij}) = \sum_{l=1}^p \beta_{il} b_l(t_{ij}) = x_{ij}'\beta_i, \quad \beta_i \sim P = \sum_{h=1}^\infty \pi_h \delta_{\beta_h^*}$
- $b = \{b_l\}_{l=1}^p$ = basis functions (e.g., cubic splines); $\beta_h^*$ = coefficient vector for cluster $h$
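The basis expansion on this slide can be sketched numerically. The following is a minimal illustration, not the lecture's implementation: the truncated power form of the cubic spline basis, the knot locations, the two cluster-specific coefficient vectors, the cluster labels and the noise level are all assumptions made for the example.

```python
import numpy as np

def cubic_spline_basis(t, knots):
    """Truncated power cubic spline basis: 1, t, t^2, t^3, (t - k)_+^3 per knot."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.clip(t - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 25)          # common observation times (assumed)
knots = [0.25, 0.5, 0.75]              # hypothetical knot locations
B = cubic_spline_basis(t, knots)       # rows are b(t_j)', shape (25, p) with p = 7
p = B.shape[1]

# Two hypothetical cluster-specific coefficient vectors beta*_h
beta_clusters = rng.normal(size=(2, p))
labels = np.array([0, 0, 1, 0, 1])     # assumed cluster membership of 5 subjects
beta = beta_clusters[labels]           # beta_i = beta*_h for subject i in cluster h

curves = beta @ B.T                    # f_i(t_j) = b(t_j)' beta_i
y = curves + rng.normal(scale=0.1, size=curves.shape)  # noisy observations y_ij
```

Subjects sharing a label have identical underlying curves and differ only through the noise term, which is the clustering structure the DP prior induces.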
Comments - Functional Dirichlet Process
- Recalling the DP stick-breaking property (Sethuraman, 1994):
  $\beta_i \overset{iid}{\sim} P = \sum_{h=1}^\infty V_h \prod_{l<h} (1 - V_l)\, \delta_{\beta_h^*}, \quad V_h \overset{iid}{\sim} \text{beta}(1, \alpha), \quad \beta_h^* \overset{iid}{\sim} P_0$
- Hence, the $n$ subjects are grouped into $k_n$ clusters
- Subjects in cluster $l$ all have $\beta_i = \beta_l^*$
- Provides a semiparametric Bayes version of latent trajectory class or growth mixture models
- Avoids fixing the number of clusters in advance
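The stick-breaking construction can be simulated directly once the infinite sum is truncated. The sketch below assumes a finite truncation level (with the last stick set to 1 so the weights sum to one), a standard normal base measure $P_0$ on three coefficients, and $\alpha = 1$; none of these choices come from the lecture.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Weights pi_h = V_h * prod_{l<h} (1 - V_l) with V_h ~ beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=truncation)
    v[-1] = 1.0  # close the stick so the truncated weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def draw_from_truncated_dp(n, alpha, base_sampler, truncation, rng):
    """Draw n coefficient vectors from a truncated stick-breaking approximation to P."""
    weights = stick_breaking_weights(alpha, truncation, rng)
    atoms = base_sampler(truncation, rng)          # beta*_h ~ P0
    labels = rng.choice(truncation, size=n, p=weights)
    return atoms[labels], labels, weights

rng = np.random.default_rng(1)
base = lambda size, rng: rng.normal(size=(size, 3))  # assumed P0: N(0, I) on 3 coefficients
beta, labels, weights = draw_from_truncated_dp(50, 1.0, base, 25, rng)
k_n = np.unique(labels).size  # number of occupied clusters among the 50 subjects
```

Because the weights decay quickly for small $\alpha$, `k_n` is typically much smaller than the number of subjects, which is why no cluster count has to be fixed in advance.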
Comments continued
- The curve in cluster $l$ is $f(t) = b(t)'\beta_l^*$, with $\beta_l^*$ the cluster-specific coefficients
- The number of functional clusters in $n$ growth curves is treated as unknown
- Gibbs samplers of lecture 1 are straightforward to generalize
- Number of clusters and configuration of subjects into clusters vary across the MCMC iterations
- Problem: label switching!
Label Switching
- Problem arises because the labels on the cluster-specific parameters are ambiguous, so they vary in meaning across the iterations
- Not meaningful to calculate posterior summaries of the cluster-specific coefficients $\beta_h^*$ across the iterations
- Strategies:
  1. Relabeling algorithms that align the clusters after running MCMC (Stephens, 00)
  2. Define clusters as individuals that are grouped together with high posterior probability
  3. Estimate optimal clustering (Dahl, 06; Lau & Green, 97)
  4. Ignore problem & avoid cluster-specific inferences
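Strategies 2 and 3 both work with pairwise co-clustering, which is unaffected by how clusters are labeled. The sketch below computes the posterior probability that each pair of subjects is grouped together and then picks the sampled partition closest to that matrix in squared error, in the spirit of the least-squares approach attributed to Dahl. The three label draws are made up for illustration and the function names are hypothetical.

```python
import numpy as np

def coclustering(labels):
    """n x n 0/1 matrix: entry (i, j) is 1 when subjects i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def posterior_similarity(label_draws):
    """Average co-clustering matrix across MCMC draws."""
    return np.mean([coclustering(z) for z in label_draws], axis=0)

def least_squares_clustering(label_draws):
    """Return the sampled partition closest to the posterior similarity matrix."""
    psm = posterior_similarity(label_draws)
    losses = [np.sum((coclustering(z) - psm) ** 2) for z in label_draws]
    best = int(np.argmin(losses))
    return np.asarray(label_draws[best]), psm

# Hypothetical label draws for 4 subjects over 3 MCMC iterations
draws = np.array([
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same partition as the first draw, labels switched
    [0, 0, 0, 1],
])
best, psm = least_squares_clustering(draws)
```

The first two draws give identical co-clustering matrices even though their labels differ, so summaries based on `psm` do not suffer from label switching.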
Joint Modeling One is often interested in joint modeling of data having different measurement scales Example: Joint modeling of a functional predictor with a health outcome Functional predictor may consist of a longitudinally recorded biomarker Predictor may also correspond to a diagnostic image Challenging to build flexible joint models for data having different scales
Application to Hormone Curves & Pregnancy Loss Progesterone is a female reproductive hormone - maintains pregnancy Urinary progesterone measured after ovulation through early pregnancy for 172 women Shape of the trajectory may predict impending early pregnancy loss Of interest to identify losses before they occur
log PdG trajectories (Bigelow & Dunson, JASA, 08)
Joint Modeling with Functional Predictors
- Suppose interest focuses on joint modeling of a functional predictor $f_i$ & response $y_i$
- Component model for functional predictor:
  $x_i(t_{ij}) = f_i(t_{ij}) + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2)$
- Component model for response:
  $\text{logit}\, \Pr(y_i = 1 \mid u_i, \mu_i) = \mu_i + u_i'\psi$
- $u_i$ = vector of covariates
Dirichlet Process Joint Modeling
- How to specify a joint model for functional predictor, $f_i$, & response $y_i$?
- Let $f_i(t) = \sum_{h=1}^p \beta_{ih} b_h(t)$ & $(\beta_i, \mu_i) \sim P$
- $P$ can then be considered as unknown through a DP prior
- This same strategy can be used broadly for data fusion & joint modeling
DP Joint Models - Comments
- Approach automatically clusters subjects into groups
- Group $l$ has functional predictor $f(t) = \sum_{h=1}^p \beta_{lh}^* b_h(t)$ & baseline response probability $1/\{1 + \exp(-\mu_l^*)\}$, where $(\beta_l^*, \mu_l^*)$ are the cluster-specific values
- Allows response probability to systematically shift between functional predictor clusters
- Very flexible approach for characterizing nonlinear & complex relationships with a functional predictor
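The baseline response probability is the inverse logit of the cluster intercept, following from the response model $\text{logit}\, \Pr(y_i = 1 \mid u_i, \mu_i) = \mu_i + u_i'\psi$ evaluated at $u_i = 0$. A small sketch, with made-up intercept values and a hypothetical helper name:

```python
import numpy as np

def response_probability(mu, u=None, psi=None):
    """Pr(y = 1) under logit Pr(y = 1) = mu + u' psi; baseline when u is omitted."""
    eta = np.asarray(mu, dtype=float)
    if u is not None:
        eta = eta + np.asarray(u, dtype=float) @ np.asarray(psi, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical cluster-specific intercepts for three functional predictor clusters
mu_clusters = np.array([-1.0, 0.0, 2.0])
baseline = response_probability(mu_clusters)
```

Clusters with larger intercepts have higher baseline risk, which is how the response probability shifts between functional predictor clusters.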
Estimated Hormone Clusters & Risk of EPL
High Dimensional Applications Enormous increase in the generation of high-dimensional data Large p, small n problems create challenges to classical methods Appealing to develop flexible Bayesian approaches for identifying sparse latent structure in high dimensional data
Application - Massive Numbers of Predictors
- Commonly a large number of predictors $x_i = (x_{i1}, \ldots, x_{ip})$ are available
- Interest focuses on identifying important predictors of $y_i$
- Many parametric approaches available (e.g., using variable selection mixture priors)
- Methods commonly rely on two-component priors, with one component concentrated at zero & one more diffuse
Application - Single Nucleotide Polymorphism (SNP) Data
- Single nucleotide polymorphisms (SNPs) - variants in a pair of nucleotides at a given locus
- SNP: $g_{icl} \in \{1, 2, 3\}$ = genotype at locus $l$ within gene $c$ for individual $i$
- Total number of loci can be very large (Affy set to launch million-SNP chip)
- One SNP = pair of alleles inherited from mother & father, with source (phase) unknown
SNPs and Health Outcomes
- In genetic epidemiology, interest focuses on identifying genetic factors predictive of a disease outcome $y_i$
- Epidemiologists tend to favor logistic regression models:
  $\text{logit}\, \Pr(y_i = 1 \mid g_i, z_i) = z_i'\alpha + \sum_{c=1}^{C} \sum_{l=1}^{p_c} \sum_{h=1}^{3} 1(g_{icl} = h)\, \beta_{clh}$
- $z_i$ = environmental exposures, demographic variables, etc
- $g_i$ = SNP data for $C$ genes
- $\beta$ = very high-dimensional vector of coefficients
- Ideally, we would characterize $\beta$ using a sparseness-favoring prior
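The triple sum in the logistic model is a linear predictor on genotype indicator variables. The sketch below builds those indicators for a toy genotype matrix; for simplicity it assumes the loci of all genes are flattened into a single column index, and the genotype values and coefficients are invented for illustration.

```python
import numpy as np

def genotype_indicators(g, levels=(1, 2, 3)):
    """Expand genotypes g (n x L, values in {1, 2, 3}) into 0/1 indicators (n x 3L).

    Column 3*l + (h - 1) holds 1(g_il = h), so columns are grouped by locus.
    """
    g = np.asarray(g)
    blocks = [(g == h).astype(float) for h in levels]       # each block is n x L
    return np.stack(blocks, axis=2).reshape(g.shape[0], -1)

# Toy data: 2 individuals, 2 loci (all genes flattened into one locus index)
g = np.array([[1, 3],
              [2, 2]])
X = genotype_indicators(g)

# Hypothetical coefficients beta_{lh}, ordered the same way as the columns of X
beta = np.array([0.0, 0.5, 1.0, 0.0, -0.2, 0.0])
genetic_effect = X @ beta   # sum over loci and genotypes of 1(g_il = h) * beta_lh
```

Each row of `X` has exactly one nonzero entry per locus, so the number of coefficients grows as three times the number of loci, which is what motivates a sparseness-favoring prior.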
DP Priors for Shrinkage & Variable Selection Assuming the elements of β are exchangeable draws from P, assign P a zero-inflated DP prior Prior is a mixture of a DP and a point mass at zero, allowing zero coefficients SNPs clustered into null & non-null groups according to impact on health response Dunson et al. (2008, JASA) implement this approach & generalizations to borrow information across functionally-related genes.
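One way to picture a zero-inflated DP prior is to simulate coefficients from it. The sketch below is an illustration under stated assumptions, not the implementation of Dunson et al.: it fixes the null probability `pi0`, uses a standard normal base measure, and draws the non-null coefficients sequentially with the Pólya urn scheme obtained by integrating out the DP.

```python
import numpy as np

def draw_zero_inflated_dp(n_coef, pi0, alpha, rng):
    """Draw coefficients from pi0 * delta_0 + (1 - pi0) * P*, with P* ~ DP(alpha * N(0, 1))."""
    beta = np.zeros(n_coef)
    atoms, counts = [], []   # distinct non-null values and how often each was used
    for j in range(n_coef):
        if rng.random() < pi0:
            continue         # null coefficient, stays exactly zero
        m = sum(counts)
        # Polya urn: join existing value h with prob counts[h] / (m + alpha),
        # or draw a new value from the base measure with prob alpha / (m + alpha)
        probs = np.array(counts + [alpha], dtype=float) / (m + alpha)
        h = rng.choice(len(probs), p=probs)
        if h == len(atoms):
            atoms.append(rng.normal())
            counts.append(1)
        else:
            counts[h] += 1
        beta[j] = atoms[h]
    return beta

rng = np.random.default_rng(2)
beta = draw_zero_inflated_dp(500, pi0=0.8, alpha=1.0, rng=rng)
n_null = int(np.sum(beta == 0.0))
n_distinct_nonnull = np.unique(beta[beta != 0.0]).size
```

Most coefficients are exactly zero and the remaining ones share a small number of distinct values, which matches the null and non-null grouping of SNPs described on the slide.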
Multiple Lasso Shrinkage
- Popular Lasso procedure corresponds to MAP estimation under a double exponential (Laplace) prior
- Tendency to over-shrink important predictors
- MacLehose & Dunson (08) propose a DP mixture of Laplace priors for the coefficients in a high-dimensional regression model
- Posterior computation uses retrospective MCMC (Papaspiliopoulos & Roberts, 08)
- Simulation studies show reduced MSE relative to Lasso
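The link between the Lasso and the Laplace prior, and the over-shrinkage it causes, can be checked in the simplest setting. The sketch assumes a single observation $y \sim N(\beta, 1)$ with prior density proportional to $\exp(-\lambda|\beta|)$; the MAP estimate then minimizes $\tfrac{1}{2}(y - \beta)^2 + \lambda|\beta|$, whose closed-form solution is soft thresholding. The grid search is only a numerical check, and the values of `y` and `lam` are arbitrary.

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form minimizer of 0.5 * (y - b)^2 + lam * |b|."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def map_by_grid(y, lam, grid):
    """Numerical MAP: minimize the negative log posterior over a grid of b values."""
    neg_log_post = 0.5 * (y - grid) ** 2 + lam * np.abs(grid)
    return grid[np.argmin(neg_log_post)]

lam = 1.0
grid = np.linspace(-5.0, 5.0, 100001)
ys = np.array([-2.0, 0.3, 3.0])
closed_form = soft_threshold(ys, lam)
numerical = np.array([map_by_grid(y, lam, grid) for y in ys])
```

Small signals are set to zero, but large signals are also pulled toward zero by the full amount `lam`. A single Laplace prior applies the same shrinkage to every coefficient, which is the over-shrinkage that a mixture of Laplace priors is meant to relax.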
Illustration of Multiple Lasso Prior
Estimated Coefficients for Pima Indian Data
Parkinson's Disease Application
- Parkinson's disease is a common neurologic disease - tremors, rigidity & slowness of movement
- SNP data available for 540 individuals, with 270 having Parkinson's disease
- Focus on 270 SNPs on chromosome 11 - 540 SNP coefficients to estimate
- MCMC was run for 50,000 iterations
- Bayesian Lasso very sensitive to hyperparameters - either selects all SNPs or none
- Two genotypes were selected by the multiple Lasso
Histograms of Posterior Probabilities of Genotype Effect
Summary
- DPMs are useful in a very broad variety of application areas beyond density estimation
- By sharing clustering across data from different scales, they provide a highly flexible approach for joint modeling
- Also useful for generating more flexible sparse shrinkage priors - DP favors few components
- Computation is feasible even in high dimensions using efficient MCMC & alternatives (variational approximations, fast sequential search, etc)