Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models


1 Simultaneous Inference for Multiple Testing and Clustering via Dirichlet Process Mixture Models
David B. Dahl, Department of Statistics, Texas A&M University
with Marina Vannucci, Michael Newton, & Qianxing Mo
3rd Lehmann Symposium, Rice University, May 16, 2007
David Dahl (Texas A&M), Multiple Testing & Clustering, 3rd Lehmann Symposium, 1 / 41

2 Michael Newton, Quincy Mo, Marina Vannucci
D. B. Dahl and M. A. Newton (200?), Multiple Hypothesis Testing by Clustering Treatment Effects of Correlated Objects, Journal of the American Statistical Association, accepted.
D. B. Dahl (2006), Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, Kim-Anh Do, Peter Müller, and Marina Vannucci (Eds.), Cambridge University Press.
D. B. Dahl, Q. Mo, and M. Vannucci (200?), Simultaneous Inference for Multiple Testing and Clustering via a Dirichlet Process Mixture Model, Statistical Modelling: An International Journal, accepted.

3 Outline
1 Motivation
2 Z-scores Demonstration
3 BEMMA for Differential Expression (Dahl & Newton, 200?)
4 BEMMA for Clustering (Dahl, 200?)
5 SIMTAC for DE and Clustering (Dahl, Mo, & Vannucci, 200?)

8 Two Statistical Tasks
Multiple hypothesis testing:
Goal: Detect a shift in the marginal distribution of gene expression.
Statistical dependence among genes is a nuisance parameter.
Clustering:
Goal: Group genes that are highly correlated.
Correlation may reflect underlying biological factors of interest.
We propose a hybrid methodology...
Main Idea: Simultaneously infer clustering & test for differential expression.
Other work: Storey (2007), Yuan & Kendziorski (2006), Tibshirani & Wasserman (2006)

9-11 [Figures: gene expression vs. treatment for Gene 1 and Gene 2]

12 Outline
1 Motivation
2 Z-scores Demonstration
3 BEMMA for Differential Expression (Dahl & Newton, 200?)
4 BEMMA for Clustering (Dahl, 200?)
5 SIMTAC for DE and Clustering (Dahl, Mo, & Vannucci, 200?)

13 Elementary Setup
Parameters θ_1, ..., θ_n for n observations.
Hypotheses: H_0i: θ_i = 0 vs. H_ai: θ_i > 0.
Test statistics Z_1, ..., Z_n are independent with Z_i ~ N(θ_i, 1).

14 Method I
Standard Z-test in which H_0i is rejected if Z_i > z, where z is chosen to achieve the desired size.
The test has power: 1 − Φ(z − θ_i), where Φ(x) is the standard normal distribution function evaluated at x.

15 Method II
Assumes a known clustering: c_ij = I{θ_i = θ_j}.
Test statistic: S_i = Z_i + Σ_{j ≠ i} c_ij Z_j.
The test has power: 1 − Φ(z − √n_(i) θ_i), where n_(i) = Σ_{j=1}^n c_ij.
Method II is never less powerful than Method I.

16 Method III
The clustering indicators c_ij are estimated:
ĉ_ij = c_ij with probability 1 − γ, and ĉ_ij = 1 − c_ij with probability γ,
where γ is the error rate of clustering.
Take Method II, but replace c_ij with ĉ_ij to form Ŝ_i.
Under an assumption about the distribution of θ_1, ..., θ_n, the test has power: 1 − Φ(z − kθ_i), where k is a constant involving γ, n_(i), etc.

17 [Figure: Power("Method 3") / Power("Method 1") as a function of the clustering error probability γ, for several cluster sizes including n_(i) = 30]
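The power gain from pooling z-scores within a known cluster can be checked numerically. A minimal sketch, not from the talk; the function names and the choice z = 1.645 (size 0.05) are illustrative:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_method1(theta, z=1.645):
    # Method I: reject H_0i when Z_i > z, with Z_i ~ N(theta_i, 1).
    return 1.0 - Phi(z - theta)

def power_method2(theta, n_i, z=1.645):
    # Method II: pool the n_(i) z-scores in gene i's cluster;
    # the standardized sum S_i / sqrt(n_(i)) is N(sqrt(n_(i)) * theta_i, 1).
    return 1.0 - Phi(z - sqrt(n_i) * theta)

print(power_method1(0.5))      # a single z-score
print(power_method2(0.5, 30))  # a known cluster of size 30
```

With n_(i) = 1 the two methods coincide; the gain grows with the cluster size.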

18 Outline
1 Motivation
2 Z-scores Demonstration
3 BEMMA for Differential Expression (Dahl & Newton, 200?)
4 BEMMA for Clustering (Dahl, 200?)
5 SIMTAC for DE and Clustering (Dahl, Mo, & Vannucci, 200?)

21 BEMMA for Diff. Expression (Dahl & Newton, 200?)
Bayesian Effects Model for Microarrays (BEMMA):
Conjugate Dirichlet process mixture (DPM) model.
Identifies differentially expressed genes by borrowing strength from genes likely to have the same parameters.
Averages over clustering uncertainty.
Sampling model: y_gtr | μ_g, τ_gt, λ_g ~ N(y_gtr | μ_g + τ_gt, λ_g), where r is the replicate, t is the treatment, and g is the gene.
Genes g and g′ come from the same cluster iff: (τ_g1, ..., τ_gT, λ_g) = (τ_g′1, ..., τ_g′T, λ_g′).

23 Nuisance Parameters
Gene-specific means μ_1, ..., μ_G are not related to differential expression or clustering.
[Figure: gene expression vs. treatment]
Let d_g be a vector whose elements are y_gtr − ȳ_g1 for t ≥ 2.

25 Model
Sampling distribution: d_g | τ_g, λ_g ~ N_N(d_g | Xτ_g, λ_g M), where τ_g = (τ_g2, ..., τ_gT), M = (I + R⁻¹ J)⁻¹, and X is a design matrix picking off the appropriate element of τ_g.
Clustering based on (τ, λ) via a Dirichlet process prior:
(τ_g, λ_g) | F ~ F, with F ~ DP(αF_0),
where F_0 is conjugate to the likelihood.
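The DP(αF_0) prior induces a distribution over clusterings of the genes, often described by the Chinese restaurant process: each gene joins an existing cluster with probability proportional to its size, or starts a new one with probability proportional to α. A hypothetical simulation of that marginal clustering prior (the function name, α, and the number of items are illustrative, not from the talk):

```python
import random

def crp_partition(n_items, alpha, rng):
    """Draw one clustering from the Chinese restaurant process:
    item i joins an existing cluster with probability proportional to
    the cluster's size, or opens a new cluster with probability
    proportional to alpha."""
    labels = []   # cluster label of each item
    sizes = []    # current cluster sizes
    for _ in range(n_items):
        weights = sizes + [alpha]        # last entry = new cluster
        u = rng.random() * sum(weights)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if u <= acc:
                break
        if k == len(sizes):
            sizes.append(1)              # open a new cluster
        else:
            sizes[k] += 1                # join cluster k
        labels.append(k)
    return labels

rng = random.Random(7)
print(crp_partition(20, 1.0, rng))
```

Larger α favors more, smaller clusters; α → 0 collapses everything into one cluster.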


33 Model Fitting
The τ's and λ's may be integrated away, leaving only the clustering of the G genes.
Sample from the posterior clustering distribution using MCMC:
Gibbs sampler of MacEachern (1994) and Neal (1992)
Merge-split sampler of Jain & Neal (2004)
Merge-split sampler of Dahl (2003)
After MCMC, it is easy to sample the τ's and λ's given the clustering.

34 Inference on Differential Expression
Define a univariate parameter q_g that encodes the hypothesis of interest. For example, the global F-test in the one-way ANOVA setting is analogous to:
q_g = Σ_{t=2}^T τ_gt²
Estimate q_g under squared-error loss by computing its expectation with respect to p(q_g | d_1, ..., d_G).
Rank genes for evidence of differential expression using the estimates q̂_1, ..., q̂_G.
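Under squared-error loss, q̂_g is just the Monte Carlo average of Σ_t τ_gt² over posterior draws. A toy sketch; the draws below are fabricated placeholders, not BEMMA output:

```python
def estimate_q(tau_draws):
    """tau_draws: posterior draws for one gene, each a tuple
    (tau_g2, ..., tau_gT). Returns the posterior mean of
    sum_t tau_gt^2, the squared-error-loss estimate of q_g."""
    q_vals = [sum(t * t for t in draw) for draw in tau_draws]
    return sum(q_vals) / len(q_vals)

# Two hypothetical genes: one with near-zero treatment effects,
# one with a clear shift at both time points.
null_gene = [(0.02, -0.05), (-0.01, 0.03), (0.04, 0.00)]
de_gene   = [(1.10,  0.95), ( 0.90, 1.05), (1.00, 1.00)]
print(estimate_q(null_gene), estimate_q(de_gene))
```

Ranking genes by q̂_g then orders them by evidence of differential expression.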

36 Simulation Study
Some other methods for differential expression:
EBarrays (Kendziorski, Newton, et al., 2003)
LIMMA (Smyth, 2004)
Comparison based on the proportion of false discoveries.
Simulated datasets:
Time-course experiment: three time points, two treatment conditions.
300 of 1,200 genes are differentially expressed.
Interest lies in genes that are differentially expressed at one or more time points.
Four levels of clustering:
Heavy clustering: 12 clusters of 100 genes per cluster.
Moderate clustering: 60 clusters of 20 genes per cluster.
Weak clustering: 240 clusters of 5 genes per cluster.
No clustering.

37-40 [Figures: false discovery rate vs. number of discoveries under heavy, moderate, weak, and no clustering]

41 Paraquat Experiment
Old and young mice treated with a paraquat injection.
Sacrificed at baseline or at 1, 3, 5, or 7 hours after injection.
Three replicates per treatment.
10,043 probe sets on Affymetrix MG-U74A arrays.
Background correction and normalization using RMA (Irizarry et al., 2003).
Biologists are interested in genes whose expression between old and young is similar at baseline and very different at one hour.

42 [Figure: estimated treatment effects for probe set 92885_at, deviation from reference for young and old mice at baseline and at 1, 3, 5, and 7 hours]

43 Outline
1 Motivation
2 Z-scores Demonstration
3 BEMMA for Differential Expression (Dahl & Newton, 200?)
4 BEMMA for Clustering (Dahl, 200?)
5 SIMTAC for DE and Clustering (Dahl, Mo, & Vannucci, 200?)

45 Inference on Clustering (Dahl, 200?)
The MCMC sampling algorithm produces B clusterings π(1), ..., π(B) from the posterior clustering distribution.
Point estimation methods:
Maximum a posteriori (MAP) clustering
Medvedovic & Sivaganesan (2002): hierarchical clustering using pairwise probabilities
Dahl (2006): stochastic search to minimize the posterior expected loss of Binder (1978)
Lau & Green (200?): heuristic to minimize the posterior expected loss of Binder (1978)
Least-squares clustering selects the observed clustering closest to the matrix of pairwise probabilities in terms of squared distances:
π_LS = arg min_{π ∈ {π(1), ..., π(B)}} Σ_{i=1}^G Σ_{j=1}^G (δ_ij(π) − p̂_ij)²   (1)
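Equation (1) can be computed directly from the sampled clusterings: form the pairwise co-clustering probability matrix p̂ from the B draws, then pick the sampled partition minimizing the squared distance to it. A small sketch, assuming each clustering is represented as a vector of cluster labels (the function name is illustrative):

```python
def least_squares_clustering(draws):
    """draws: list of B clusterings, each a list of G cluster labels.
    Returns the draw minimizing sum_ij (delta_ij(pi) - p_hat_ij)^2,
    where p_hat_ij is the co-clustering frequency across draws."""
    B, G = len(draws), len(draws[0])
    # Pairwise posterior co-clustering probabilities p_hat_ij.
    p = [[sum(d[i] == d[j] for d in draws) / B for j in range(G)]
         for i in range(G)]
    def loss(d):
        return sum((float(d[i] == d[j]) - p[i][j]) ** 2
                   for i in range(G) for j in range(G))
    return min(draws, key=loss)

draws = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
print(least_squares_clustering(draws))
```

Because the estimate is restricted to the sampled clusterings, it requires only the MCMC output and no extra model fitting.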

46 Heavy Clustering
Adjusted Rand index (95% C.I.) by clustering method:
MCLUST: (0.380, 0.447)
BEMMA (least-squares): (0.373, 0.431)
BEMMA (MAP): (0.362, 0.419)
HCLUST (effects, average): (0.247, 0.308)
HCLUST (effects, complete): (0.242, 0.279)
HCLUST (correlation, complete): (0.144, 0.180)
HCLUST (correlation, average): (0.141, 0.172)
Table: Adjusted Rand Index for BEMMA and Other Methods. Large values of the adjusted Rand index indicate better agreement between the estimated and true clustering.

47 Moderate Clustering
Adjusted Rand index (95% C.I.) by clustering method:
BEMMA (least-squares): (0.146, 0.163)
MCLUST: (0.136, 0.152)
BEMMA (MAP): (0.119, 0.135)
HCLUST (effects, complete): (0.111, 0.123)
HCLUST (effects, average): (0.095, 0.107)
HCLUST (correlation, average): (0.075, 0.083)
HCLUST (correlation, complete): (0.068, 0.078)
Table: Adjusted Rand Index for BEMMA and Other Methods. Large values of the adjusted Rand index indicate better agreement between the estimated and true clustering.

48 Weak Clustering
Adjusted Rand index (95% C.I.) by clustering method:
MCLUST: (0.048, 0.052)
HCLUST (effects, complete): (0.043, 0.048)
BEMMA (least-squares): (0.040, 0.043)
HCLUST (effects, average): (0.035, 0.038)
BEMMA (MAP): (0.030, 0.033)
HCLUST (correlation, average): (0.027, 0.030)
HCLUST (correlation, complete): (0.025, 0.029)
Table: Adjusted Rand Index for BEMMA and Other Methods. Large values of the adjusted Rand index indicate better agreement between the estimated and true clustering.
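The metric in these tables, the adjusted Rand index, can be computed from the contingency table of two partitions; it equals 1 for identical partitions and is near 0 for agreement expected by chance. A self-contained sketch (the function name is illustrative):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label vectors of equal length."""
    n = len(a)
    ab = Counter(zip(a, b))          # joint cluster counts
    ca, cb = Counter(a), Counter(b)  # marginal cluster counts
    sum_ab = sum(comb(c, 2) for c in ab.values())
    sum_a = sum(comb(c, 2) for c in ca.values())
    sum_b = sum(comb(c, 2) for c in cb.values())
    expected = sum_a * sum_b / comb(n, 2)   # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ab - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
```

Because the index is corrected for chance, it is comparable across the heavy, moderate, and weak clustering scenarios despite their very different cluster counts.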

49 [Figure: posterior distribution of the number of clusters]

50 [Figure: effects intensity plot of the cluster containing probe set 92885_at, genes grouped by inferred clusters, young vs. old at baseline and at 1, 3, 5, and 7 hours]

51 [Figure: effects intensity plot for all clusters, genes grouped by inferred clusters]

52 [Figure: effects intensity plot for the smallest clusters, genes grouped by inferred clusters]

53 Outline
1 Motivation
2 Z-scores Demonstration
3 BEMMA for Differential Expression (Dahl & Newton, 200?)
4 BEMMA for Clustering (Dahl, 200?)
5 SIMTAC for DE and Clustering (Dahl, Mo, & Vannucci, 200?)

54 SIMTAC (Dahl, Mo, & Vannucci, 200?)
Simultaneous Inference for Multiple Testing and Clustering (SIMTAC):
Extension of BEMMA.
Separates clustering of regression coefficients from accommodation of heteroscedasticity.
Wider class of experimental designs.
No need to specify an arbitrary reference treatment.
Nonconjugate Dirichlet process mixture (DPM) model.
Sampling distribution: d_g | μ_g, β_g, λ_g ~ N_K(d_g | μ_g j + Xβ_g, λ_g M), where j is a vector of ones.
Prior distribution:
μ_g ~ N(μ_g | m_μ, p_μ)
β_g | G_β ~ G_β, with G_β ~ DP(α_β G_β*)
λ_g | G_λ ~ G_λ, with G_λ ~ DP(α_λ G_λ*)

55 Simulation Study
Each synthetic dataset contains 216 clusters. Cluster configurations differ in the relationships among the regression coefficients, which encode equivalent or differential expression at each time point, and in the number and size of the clusters with each configuration:
Time point 1: β_g,3 = 0 (equivalent expression) or β_g,3 ≠ 0 (differential expression)
Time point 2: β_g,4 = β_g,1 or β_g,4 ≠ β_g,1
Time point 3: β_g,5 = β_g,2 or β_g,5 ≠ β_g,2
[Table: Clusters in a Synthetic Dataset, showing the relationship among the regression coefficients and the cluster sizes for the 216 clusters in each synthetic dataset]

56 [Figure: proportion of false discoveries vs. number of discoveries for ANOVA, BEMMA, and SIMTAC]

57 [Figure: difference in proportion of false discoveries vs. number of discoveries, ANOVA minus SIMTAC and BEMMA minus SIMTAC]

58 Summary
Dependence can be exploited to improve power in multiple testing.
Dirichlet process mixture (DPM) models provide powerful machinery for simultaneous inference on clustering and multiple hypothesis testing.
BEMMA:
Under weak clustering, BEMMA performs as well as its peers.
Under heavier clustering, BEMMA performs substantially better.
BEMMA has been successfully applied to a replicated microarray study with 10,000+ probe sets and 10 treatment conditions.
SIMTAC:
Improved implementation of the idea.
Simulation results are encouraging... now applying to local data.
Least-squares clustering:
A convenient and conceptually appealing procedure for point estimation of clustering.


More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

Estimation of a Two-component Mixture Model

Estimation of a Two-component Mixture Model Estimation of a Two-component Mixture Model Bodhisattva Sen 1,2 University of Cambridge, Cambridge, UK Columbia University, New York, USA Indian Statistical Institute, Kolkata, India 6 August, 2012 1 Joint

More information

The optimal discovery procedure: a new approach to simultaneous significance testing

The optimal discovery procedure: a new approach to simultaneous significance testing J. R. Statist. Soc. B (2007) 69, Part 3, pp. 347 368 The optimal discovery procedure: a new approach to simultaneous significance testing John D. Storey University of Washington, Seattle, USA [Received

More information

Controlling Bayes Directional False Discovery Rate in Random Effects Model 1

Controlling Bayes Directional False Discovery Rate in Random Effects Model 1 Controlling Bayes Directional False Discovery Rate in Random Effects Model 1 Sanat K. Sarkar a, Tianhui Zhou b a Temple University, Philadelphia, PA 19122, USA b Wyeth Pharmaceuticals, Collegeville, PA

More information

Hybrid Dirichlet processes for functional data

Hybrid Dirichlet processes for functional data Hybrid Dirichlet processes for functional data Sonia Petrone Università Bocconi, Milano Joint work with Michele Guindani - U.T. MD Anderson Cancer Center, Houston and Alan Gelfand - Duke University, USA

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

ABTEKNILLINEN KORKEAKOULU

ABTEKNILLINEN KORKEAKOULU Two-way analysis of high-dimensional collinear data 1 Tommi Suvitaival 1 Janne Nikkilä 1,2 Matej Orešič 3 Samuel Kaski 1 1 Department of Information and Computer Science, Helsinki University of Technology,

More information

Genetic Association Studies in the Presence of Population Structure and Admixture

Genetic Association Studies in the Presence of Population Structure and Admixture Genetic Association Studies in the Presence of Population Structure and Admixture Purushottam W. Laud and Nicholas M. Pajewski Division of Biostatistics Department of Population Health Medical College

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 1) Fall 2017 1 / 10 Lecture 7: Prior Types Subjective

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data Ståle Nygård Trial Lecture Dec 19, 2008 1 / 35 Lecture outline Motivation for not using

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Statistics Journal Club, 36-825 Beau Dabbs and Philipp Burckhardt 9-19-2014 1 Paper

More information

A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables

A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables Qi Tang (Joint work with Kam-Wah Tsui and Sijian Wang) Department of Statistics University of Wisconsin-Madison Feb. 8,

More information

Similarity Measures and Clustering In Genetics

Similarity Measures and Clustering In Genetics Similarity Measures and Clustering In Genetics Daniel Lawson Heilbronn Institute for Mathematical Research School of mathematics University of Bristol www.paintmychromosomes.com Talk outline Introduction

More information

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist

More information

DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING THE FALSE DISCOVERY RATE FOR MICROARRAY DATA

DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING THE FALSE DISCOVERY RATE FOR MICROARRAY DATA University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Dissertations and Theses in Statistics Statistics, Department of 2009 DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

Approximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA)

Approximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA) Approximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA) Willem van den Boom Department of Statistics and Applied Probability National University

More information

Bayes methods for categorical data. April 25, 2017

Bayes methods for categorical data. April 25, 2017 Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

VCMC: Variational Consensus Monte Carlo

VCMC: Variational Consensus Monte Carlo VCMC: Variational Consensus Monte Carlo Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015 probabilistic models! sky fog bridge water grass object

More information

Construction of Dependent Dirichlet Processes based on Poisson Processes

Construction of Dependent Dirichlet Processes based on Poisson Processes 1 / 31 Construction of Dependent Dirichlet Processes based on Poisson Processes Dahua Lin Eric Grimson John Fisher CSAIL MIT NIPS 2010 Outstanding Student Paper Award Presented by Shouyuan Chen Outline

More information

Part 7: Hierarchical Modeling

Part 7: Hierarchical Modeling Part 7: Hierarchical Modeling!1 Nested data It is common for data to be nested: i.e., observations on subjects are organized by a hierarchy Such data are often called hierarchical or multilevel For example,

More information

Bayesian Inference of Multiple Gaussian Graphical Models

Bayesian Inference of Multiple Gaussian Graphical Models Bayesian Inference of Multiple Gaussian Graphical Models Christine Peterson,, Francesco Stingo, and Marina Vannucci February 18, 2014 Abstract In this paper, we propose a Bayesian approach to inference

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models

Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Matthew S. Johnson New York ASA Chapter Workshop CUNY Graduate Center New York, NY hspace1in December 17, 2009 December

More information

The Calibrated Bayes Factor for Model Comparison

The Calibrated Bayes Factor for Model Comparison The Calibrated Bayes Factor for Model Comparison Steve MacEachern The Ohio State University Joint work with Xinyi Xu, Pingbo Lu and Ruoxi Xu Supported by the NSF and NSA Bayesian Nonparametrics Workshop

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley University of Minnesota Mar 30, 2004 Outline

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

A Unified Approach for Simultaneous Gene Clustering and Differential Expression Identification

A Unified Approach for Simultaneous Gene Clustering and Differential Expression Identification A Unified Approach for Simultaneous Gene Clustering and Differential Expression Identification Ming Yuan and Christina Kendziorski (March 17, 2005) Abstract Although both clustering and identification

More information

How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures

How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures Dongxiao Zhu June 13, 2018 1 Introduction OMICS data are increasingly available to biomedical researchers, and

More information

Hybrid Copula Bayesian Networks

Hybrid Copula Bayesian Networks Kiran Karra kiran.karra@vt.edu Hume Center Electrical and Computer Engineering Virginia Polytechnic Institute and State University September 7, 2016 Outline Introduction Prior Work Introduction to Copulas

More information

Extending the Robust Means Modeling Framework. Alyssa Counsell, Phil Chalmers, Matt Sigal, Rob Cribbie

Extending the Robust Means Modeling Framework. Alyssa Counsell, Phil Chalmers, Matt Sigal, Rob Cribbie Extending the Robust Means Modeling Framework Alyssa Counsell, Phil Chalmers, Matt Sigal, Rob Cribbie One-way Independent Subjects Design Model: Y ij = µ + τ j + ε ij, j = 1,, J Y ij = score of the ith

More information

Summary of Extending the Rank Likelihood for Semiparametric Copula Estimation, by Peter Hoff

Summary of Extending the Rank Likelihood for Semiparametric Copula Estimation, by Peter Hoff Summary of Extending the Rank Likelihood for Semiparametric Copula Estimation, by Peter Hoff David Gerard Department of Statistics University of Washington gerard2@uw.edu May 2, 2013 David Gerard (UW)

More information

Expression arrays, normalization, and error models

Expression arrays, normalization, and error models 1 Epression arrays, normalization, and error models There are a number of different array technologies available for measuring mrna transcript levels in cell populations, from spotted cdna arrays to in

More information

Data Preprocessing. Data Preprocessing

Data Preprocessing. Data Preprocessing Data Preprocessing 1 Data Preprocessing Normalization: the process of removing sampleto-sample variations in the measurements not due to differential gene expression. Bringing measurements from the different

More information

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Computer Vision Group Prof. Daniel Cremers. 14. Clustering Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it

More information

Nuisance parameters and their treatment

Nuisance parameters and their treatment BS2 Statistical Inference, Lecture 2, Hilary Term 2008 April 2, 2008 Ancillarity Inference principles Completeness A statistic A = a(x ) is said to be ancillary if (i) The distribution of A does not depend

More information

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Abhirup Datta 1 Sudipto Banerjee 1 Andrew O. Finley 2 Alan E. Gelfand 3 1 University of Minnesota, Minneapolis,

More information

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence

More information

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017 Bayesian Statistics Debdeep Pati Florida State University April 3, 2017 Finite mixture model The finite mixture of normals can be equivalently expressed as y i N(µ Si ; τ 1 S i ), S i k π h δ h h=1 δ h

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

Nonparametric Bayes tensor factorizations for big data

Nonparametric Bayes tensor factorizations for big data Nonparametric Bayes tensor factorizations for big data David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082 Motivation Conditional

More information

Nearest Neighbor Gaussian Processes for Large Spatial Data

Nearest Neighbor Gaussian Processes for Large Spatial Data Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification 10-810: Advanced Algorithms and Models for Computational Biology Optimal leaf ordering and classification Hierarchical clustering As we mentioned, its one of the most popular methods for clustering gene

More information

BIOLOGY 111. CHAPTER 1: An Introduction to the Science of Life

BIOLOGY 111. CHAPTER 1: An Introduction to the Science of Life BIOLOGY 111 CHAPTER 1: An Introduction to the Science of Life An Introduction to the Science of Life: Chapter Learning Outcomes 1.1) Describe the properties of life common to all living things. (Module

More information

Identifying Mixtures of Mixtures Using Bayesian Estimation

Identifying Mixtures of Mixtures Using Bayesian Estimation Identifying Mixtures of Mixtures Using Bayesian Estimation arxiv:1502.06449v3 [stat.me] 20 Jun 2016 Gertraud Malsiner-Walli Department of Applied Statistics, Johannes Kepler University Linz and Sylvia

More information