Dirichlet process Bayesian clustering with the R package PReMiuM

Similar documents
Présentation des travaux en statistiques bayésiennes

Modelling collinear and spatially correlated data

Multiple linear regression S6

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

CS Lecture 19. Exponential Families & Expectation Propagation

Bayesian Hierarchical Models

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score

Nonparametric Bayes tensor factorizations for big data

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

Modelling geoadditive survival data

Bayesian Linear Regression

Joint longitudinal and time-to-event models via Stan

Faculty of Health Sciences. Regression models. Counts, Poisson regression, Lene Theil Skovgaard. Dept. of Biostatistics

Bayesian Nonparametrics: Dirichlet Process

Distribution-free ROC Analysis Using Binary Regression Techniques

Bayesian non-parametric model to longitudinally predict churn

Flexible modelling of the cumulative effects of time-varying exposures

Bayesian Statistics Part III: Building Bayes Theorem Part IV: Prior Specification

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Hierarchical Hurdle Models for Zero-In(De)flated Count Data of Complex Designs

Power and Sample Size Calculations with the Additive Hazards Model

More Statistics tutorial at Logistic Regression and the new:

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayes methods for categorical data. April 25, 2017

Bayesian Nonparametric Inference Methods for Mean Residual Life Functions

Missing Data Issues in the Studies of Neurodegenerative Disorders: the Methodology

Nonparametric Bayesian Methods: Models, Algorithms, and Applications (Day 5)

2018 PAA Short Course on Bayesian Small Area Estimation using Complex Data Introduction and Overview

A comparison of fully Bayesian and two-stage imputation strategies for missing covariate data

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1

Local Likelihood Bayesian Cluster Modeling for small area health data. Andrew Lawson Arnold School of Public Health University of South Carolina

Non-Parametric Bayes

BET: Bayesian Ensemble Trees for Clustering and Prediction in Heterogeneous Data

Variable selection for model-based clustering

Spatial Bayesian Nonparametrics for Natural Image Segmentation

General Regression Model

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

Bayesian Nonparametric Autoregressive Models via Latent Variable Representation

Flexible Regression Modeling using Bayesian Nonparametric Mixtures

Unbiased estimation of exposure odds ratios in complete records logistic regression

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

High-Throughput Sequencing Course

Causal Inference in Observational Studies with Non-Binary Treatments. David A. van Dyk

Mixed Membership Stochastic Blockmodels

Estimating the long-term health impact of air pollution using spatial ecological studies. Duncan Lee

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation

Hybrid Dirichlet processes for functional data

Bayesian Nonparametric Meta-Analysis Model George Karabatsos University of Illinois-Chicago (UIC)

Bayesian Hypothesis Testing in GLMs: One-Sided and Ordered Alternatives. 1(w i = h + 1)β h + ɛ i,

Measurement Error in Spatial Modeling of Environmental Exposures

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

Genetic Association Studies in the Presence of Population Structure and Admixture

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

The consequences of misspecifying the random effects distribution when fitting generalized linear mixed models

Bayesian Multivariate Logistic Regression

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles

Machine Learning for OR & FE

Test of Association between Two Ordinal Variables while Adjusting for Covariates

Contents. Part I: Fundamentals of Bayesian Inference 1

Factor Analytic Models of Clustered Multivariate Data with Informative Censoring (refer to Dunson and Perreault, 2001, Biometrics 57, )

CHOICE OF REFERENCE SUBCLASS IN REGRESSION MODELS

Introduction to Machine Learning

On the posterior structure of NRMI

Regularization in Cox Frailty Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

Turning a research question into a statistical question.

STA 216, GLM, Lecture 16. October 29, 2007

Fractional Imputation in Survey Sampling: A Comparative Review

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks.

A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles

Bayesian linear regression

Chapter 2: Describing Contingency Tables - I

Experimental Design and Data Analysis for Biologists

STAT Advanced Bayesian Inference

3 Joint Distributions 71

Lecture 2: Poisson and logistic regression

A general mixed model approach for spatio-temporal regression data

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

Department of Biostatistics University of Copenhagen

Analysing data: regression and correlation S6 and S7

IEOR165 Discussion Week 5

A SIGNAL THEORETIC INTRODUCTION TO RANDOM PROCESSES

Nonparametric Bayes Modeling

A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models

Disease mapping with Gaussian processes

Lecture 4 Multiple linear regression

Analysing geoadditive regression data: a mixed model approach

Exam Applied Statistical Regression. Good Luck!

Nonparametric Bayesian modeling for dynamic ordinal regression relationships

Comparison of multiple imputation methods for systematically and sporadically missing multilevel data

Constrained estimation for binary and survival data

Introduction to BART with binary outcomes. Rodney Sparapani Medical College of Wisconsin Copyright (c) 2017 Rodney Sparapani

Transcription:

Dirichlet process Bayesian clustering with the R package PReMiuM Dr Silvia Liverani Brunel University London July 2015 Silvia Liverani (Brunel University London) Profile Regression 1 / 18

Outline Motivation Method R package PReMiuM Examples Silvia Liverani (Brunel University London) Profile Regression 2 / 18

Many collaborators John Molitor (University of Oregon) Sylvia Richardson (Medical Research Centre Biostatistics Unit) Michail Papathomas (University of St Andrews) David Hastie Aurore Lavigne (University of Lille 3, France) Lucy Leigh (University of Newcastle, Australia)... Silvia Liverani (Brunel University London) Profile Regression 3 / 18

Motivation Multicollinearity Goal of epidemilogical studies is to investigate the joint effect of different covariates / risk factors on a phenotype...... but highly correlated risk factors create collinearity problems! Silvia Liverani (Brunel University London) Profile Regression 4 / 18

Motivation Multicollinearity Goal of epidemilogical studies is to investigate the joint effect of different covariates / risk factors on a phenotype...... but highly correlated risk factors create collinearity problems! Example Researchers are interested in determining if a relationship exists between blood pressure (y = BP, in mm Hg) and weight (x 1 = Weight, in kg) body surface area (x 2 = BSA, in sq m) duration of hypertension (x 3 = Dur, in years) basal pulse (x 4 = Pulse, in beats per minute) stress index (x 5 = Stress) Silvia Liverani (Brunel University London) Profile Regression 4 / 18

Motivation 86 88 90 92 94 96 98 100 0 1 2 3 Weight BSA BP=y, Weight = x 1, BSA = x 2 Silvia Liverani (Brunel University London) Profile Regression 5 / 18

Motivation Highly correlated risk factors create collinearity problems, causing instability in model estimation Model ˆβ 1 SE ˆβ 1 ˆβ 2 SE ˆβ 2 y x 1 2.64 0.30 y x 2 3.34 1.33 y x 1 + x 2 6.58 0.53-20.44 2.28 Silvia Liverani (Brunel University London) Profile Regression 6 / 18

Motivation Highly correlated risk factors create collinearity problems, causing instability in model estimation Model ˆβ 1 SE ˆβ 1 ˆβ 2 SE ˆβ 2 y x 1 2.64 0.30 y x 2 3.34 1.33 y x 1 + x 2 6.58 0.53-20.44 2.28 Effect 1: the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model. Effect 2: the precision of the estimated regression coefficients decreases as more predictor variables are added to the model. Silvia Liverani (Brunel University London) Profile Regression 6 / 18

Profile Regression Issues caused by correlated risk factors interacting risk factors Silvia Liverani (Brunel University London) Profile Regression 7 / 18

Profile Regression Issues caused by correlated risk factors interacting risk factors Profile regression partitions the multi-dimensional risk surface into groups having similar risks investigation of the joint effects of multiple risk factors jointly models the covariate patterns and health outcomes flexible but tractable Bayesian model Silvia Liverani (Brunel University London) Profile Regression 7 / 18

Profile Regression Notation For individual i y i x i = (x i1,..., x ip ) w i z i = c outcome of interest covariate profile fixed effects the allocation variable indicates the cluster to which individual i belongs Silvia Liverani (Brunel University London) Profile Regression 8 / 18

Statistical Framework Profile Regression Joint covariate and response model f (x i, y i φ, θ, ψ, β) = c ψ c f (x i z i = c, φ c )f (y i z i = c, θ c, β, w i ) Silvia Liverani (Brunel University London) Profile Regression 9 / 18

Statistical Framework Profile Regression Joint covariate and response model f (x i, y i φ, θ, ψ, β) = c ψ c f (x i z i = c, φ c )f (y i z i = c, θ c, β, w i ) For example for discrete covariates f (x i z i = c, φ c ) = J j=1 φ zi,j,x i,j For example, for Bernoulli response logit{p(y i = 1 θ c, β, w i )} = θ c + β T w i Silvia Liverani (Brunel University London) Profile Regression 9 / 18

Statistical Framework Profile Regression Joint covariate and response model f (x i, y i φ, θ, ψ, β) = c ψ c f (x i z i = c, φ c )f (y i z i = c, θ c, β, w i ) Prior model for the mixture weights ψ c stick-breaking priors (constructive definition of the Dirichlet Process) P(Z i = c ψ) = ψ c ψ 1 = V 1 ψ c = V c (1 V l ) V c Beta(1, α) l<c larger concentration parameter α the more evenly distributed is the resulting distribution. smaller concentration parameter α the more sparsely distributed is the resulting distribution, with all but a few parameters having a probability near zero Silvia Liverani (Brunel University London) Profile Regression 10 / 18

R package PReMiuM Implementation: R package PReMiuM We have implemented profile regression in C++ within the R package PReMiuM. Binary, binomial, categorical, Normal, Poisson and survival outcome Allows for spatial correlation Fixed effects (global parameters) including also spatial CAR term Normal and/or discrete covariates Dependent or independent slice sampling (Kalli et al., 2011) or truncated Dirichlet process model (Ishwaran and James, 2001) Fixed alpha or update alpha, or use the Pitman-Yor process prior Handles missing data Silvia Liverani (Brunel University London) Profile Regression 11 / 18

R package PReMiuM Implementation: R package PReMiuM We have implemented profile regression in C++ within the R package PReMiuM. Allows users to run predictive scenarios Performs post processing Contains plotting functions Currently working on: Quantile profile regression Enriched Dirichlet processes Silvia Liverani (Brunel University London) Profile Regression 12 / 18

Applications and features of the model Example: Simulated data The profiles are given by y : outcome, Bernoulli x : 5 covariates, all discrete with 3 levels w : 2 fixed effects, continuous or discrete Silvia Liverani (Brunel University London) Profile Regression 13 / 18

Applications and features of the model Survival response with censoring: sleep study Survival 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3 4 5 Time Silvia Liverani (Brunel University London) Profile Regression 15 / 18

Variable selection

Applications and features of the model Spatial correlated response: deprivation in London (7.88,31.1] (2.47,7.88] ( 2.04,2.47] ( 7.73, 2.04] [ 39.1, 7.73] Silvia Liverani (Brunel University London) Profile Regression 17 / 18

References Applications and features of the model Liverani, S., Hastie, D. I., Azizi, L., Papathomas, M. and Richardson, S. (2015) PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. Journal for Statistical Software, 64(7), 1-30. J. T. Molitor, M. Papathomas, M. Jerrett and S. Richardson (2010) Bayesian Profile Regression with an Application to the National Survey of Childrens Health, Biostatistics, 11, 484-498. Molitor, J., Brown, I. J., Papathomas, M., Molitor, N., Liverani, S., Chan, Q., Richardson, S., Van Horn, L., Daviglus, M. L., Stamler, J. and Elliott, P. (2014) Blood pressure differences associated with DASH-like lower sodium compared with typical American higher sodium nutrient profile: INTERMAP USA. Hypertension. Hastie, D. I., Liverani, S. and Richardson, S. (2015) Sampling from Dirichlet process mixture models with unknown concentration parameter: Mixing issues in large data implementations. To appear in Statistics and Computing. M. Papathomas, J. Molitor, S. Richardson, E. Riboli and P. Vineis (2011) Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in non smokers. Environmental Health Perspectives, 119, 84-91. Papathomas, M, Molitor, J, Hoggart, C, Hastie, D and Richardson, S (2012) Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology. Silvia Liverani (Brunel University London) Profile Regression 18 / 18