SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis

Similar documents
Bootstrap Method > # Purpose: understand how bootstrap method works > obs=c(11.96, 5.03, 67.40, 16.07, 31.50, 7.73, 11.10, 22.38) > n=length(obs) >

GaGa: a Parsimonious and Flexible Hierarchical Model for Microarray Data Analysis

Hypothesis Tests for One Population Mean

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA

Name: Block: Date: Science 10: The Great Geyser Experiment A controlled experiment

Lab 1 The Scientific Method

AP Statistics Notes Unit Two: The Normal Distributions

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa

, which yields. where z1. and z2

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression

Tree Structured Classifier

Simple Linear Regression (single variable)

We say that y is a linear function of x if. Chapter 13: The Correlation Coefficient and the Regression Line

How do scientists measure trees? What is DBH?

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank

Pipetting 101 Developed by BSU CityLab

Computational modeling techniques

Comparing Several Means: ANOVA. Group Means and Grand Mean

Chapter Summary. Mathematical Induction Strong Induction Recursive Definitions Structural Induction Recursive Algorithms

Sequential Allocation with Minimal Switching

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint

Kinematic transformation of mechanical behavior Neville Hogan

Resampling Methods. Chapter 5. Chapter 5 1 / 52

A Matrix Representation of Panel Data

CAUSAL INFERENCE. Technical Track Session I. Phillippe Leite. The World Bank

A NOTE ON BAYESIAN ANALYSIS OF THE. University of Oxford. and. A. C. Davison. Swiss Federal Institute of Technology. March 10, 1998.

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards:

ECEN 4872/5827 Lecture Notes

Differentiation Applications 1: Related Rates

UNIV1"'RSITY OF NORTH CAROLINA Department of Statistics Chapel Hill, N. C. CUMULATIVE SUM CONTROL CHARTS FOR THE FOLDED NORMAL DISTRIBUTION

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y=

MATCHING TECHNIQUES Technical Track Session VI Céline Ferré The World Bank

Verification of Quality Parameters of a Solar Panel and Modification in Formulae of its Series Resistance

On Huntsberger Type Shrinkage Estimator for the Mean of Normal Distribution ABSTRACT INTRODUCTION

Pattern Recognition 2014 Support Vector Machines

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data

What is Statistical Learning?

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Part 3 Introduction to statistical classification techniques

Sections 15.1 to 15.12, 16.1 and 16.2 of the textbook (Robbins-Miller) cover the materials required for this topic.

SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST. Mark C. Otto Statistics Research Division, Bureau of the Census Washington, D.C , U.S.A.

Lead/Lag Compensator Frequency Domain Properties and Design Methods

Physics 2B Chapter 23 Notes - Faraday s Law & Inductors Spring 2018

NUMBERS, MATHEMATICS AND EQUATIONS

READING STATECHART DIAGRAMS

Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Key Wrds: Autregressive, Mving Average, Runs Tests, Shewhart Cntrl Chart

Distributions, spatial statistics and a Bayesian perspective

Activity Guide Loops and Random Numbers

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation

Lecture 17: Free Energy of Multi-phase Solutions at Equilibrium

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

Lecture 13: Markov Chain Monte Carlo. Gibbs sampling

CHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came.

ENG2410 Digital Design Sequential Circuits: Part B

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

Chapter 3: Cluster Analysis

Inference in the Multiple-Regression

How T o Start A n Objective Evaluation O f Your Training Program

Introduction to Quantitative Genetics II: Resemblance Between Relatives

Physics 2010 Motion with Constant Acceleration Experiment 1

Lesson Plan. Recode: They will do a graphic organizer to sequence the steps of scientific method.

CESAR Science Case The differential rotation of the Sun and its Chromosphere. Introduction. Material that is necessary during the laboratory

LCAO APPROXIMATIONS OF ORGANIC Pi MO SYSTEMS The allyl system (cation, anion or radical).

5 th grade Common Core Standards

CHM112 Lab Graphing with Excel Grading Rubric

Experiment #3. Graphing with Excel

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax

If (IV) is (increased, decreased, changed), then (DV) will (increase, decrease, change) because (reason based on prior research).

Checking the resolved resonance region in EXFOR database

Particle Size Distributions from SANS Data Using the Maximum Entropy Method. By J. A. POTTON, G. J. DANIELL AND B. D. RAINFORD

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers

SAMPLING DYNAMICAL SYSTEMS

arxiv:hep-ph/ v1 2 Jun 1995

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION

CLASS. Fractions and Angles. Teacher Report. No. of test takers: 25. School Name: EI School. City: Ahmedabad CLASS 6 B 8709

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

Phys. 344 Ch 7 Lecture 8 Fri., April. 10 th,

Application of ILIUM to the estimation of the T eff [Fe/H] pair from BP/RP

Lecture 24: Flory-Huggins Theory

Unit Project Descriptio

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs

BOUNDED UNCERTAINTY AND CLIMATE CHANGE ECONOMICS. Christopher Costello, Andrew Solow, Michael Neubert, and Stephen Polasky

This section is primarily focused on tools to aid us in finding roots/zeros/ -intercepts of polynomials. Essentially, our focus turns to solving.

Revisiting the Socrates Example

CSE4334/5334 Data Mining Association Rule Mining. Chengkai Li University of Texas at Arlington Fall 2017

THE LIFE OF AN OBJECT IT SYSTEMS

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India

Performance Bounds for Detect and Avoid Signal Sensing

Inverse Document Frequency (IDF): A Measure of Deviations from Poisson

Subject description processes

BASD HIGH SCHOOL FORMAL LAB REPORT

Maximum A Posteriori (MAP) CS 109 Lecture 22 May 16th, 2016

Chapter 13: The Correlation Coefficient and the Regression Line. We begin with a some useful facts about straight lines.

ALE 21. Gibbs Free Energy. At what temperature does the spontaneity of a reaction change?

Equilibrium of Stress

Determining the Accuracy of Modal Parameter Estimation Methods

Transcription:

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical mdel fr micrarray data analysis David Rssell Department f Bistatistics M.D. Andersn Cancer Center, Hustn, TX 77030, USA rsselldavid@gmail.cm Cntents 1 Bayesian prcedure fr gene differential expressin analysis 2 2 Frequentist perating characteristics f the Bayesian prcedure 3 3 Frequentist FDR fr the Armstrng dataset 5 4 Gdness f fit in the Armstrng dataset 6 1

1 Bayesian prcedure fr gene differential expressin analysis Here we detail the Bayesian prcedure that minimizes the Bayesian FNR subject t the Bayesian FDR being belw a threshld α. Let δ 1,..., δ n dente the expressin pattern that each gene fllws, e.g. if there are nly tw grups t be cmpared δ i = 0 dentes the null hypthesis that bth grups are equal and δ i = 1 dentes that they are different. Dente as d i = d i (x) the pattern that gene i is assigned t, i.e. d i = 0 means that we declare the gene as equally expressed (EE) and d i 0 that we declare it differentially expressed (DE). The false negative (FNP) and false discvery prprtins (FDP) can be written as: FNP = FDP = i=1 I(d i = 0)I(δ i 0) i=1 I(d i = 0) i=1 I(d i = 1)I(δ i = 0) i=1 I(d, (1) i = 1) where I( ) is the indicatr functin. That is, the FNP is the prprtin f genes declared EE that are actually DE, and the FDP is the prprtin f genes declared DE that are actually EE. The Bayesian FNR and Bayesian FDR are defined as the expected FNP and FDP, respectively, where the expectatin is taken with respect t the psterir distributin f δ 1,..., δ n (1). Fr any fixed decisins d 1,..., d n, ne can evaluate the Bayesian FNR and FDR simply as BFNR = BFDR = i=1 I(d i = 0)(1 v i0 ) i=1 I(d i = 0) i=1 I(d i = 1)v i0 n i=1 I(d i = 1), (2) where v i0 = P (δ i = 0 x) is the psterir prbability that gene i is EE. (2) shwed that, in a setup with nly tw hyptheses, the ptimal rule t minimize BFNR subject t BFDR< α is t declare as DE all genes with v i0 belw a certain threshld t, i.e. d i = I(v i0 < t), where t is the minimum value such that BFDR α. Nte that BFNR and BFDR nly take int accunt whether a gene was classified int pattern 0 r nt. Therefre, when minimizing BFNR subject t BFDR α it makes n difference whether a particular gene is assigned t pattern 1 r pattern 2, say. We prpse the

bvius: given that a gene is declared DE, we assign it t the pattern with highest psterir prbability, i.e. δ i = I(v i0 < t) argmax k {1,...,H 1} (v ik ). It is straightfrward t see that, fr any fixed BFNR and BFDR, this rule maximizes the expected number f genes crrectly classified int their expressin pattern. 2 Frequentist perating characteristics f the Bayesian prcedure In this sectin we review the algrithm that (3) prpsed t estimate the frequentist FDR fr any given prcedure, i.e. the expected FDP in (1) when the prcedure is applied under repeated sampling. As a first step, ne btains btstrap samples frm the riginal data, in such a way that the sample mean and variance f each gene are rughly preserved and that it represents a sample under the cmplete null hypthesis that n genes are DE. Then ne repeatedly applies the prcedure t each btstrap dataset, btaining an estimate f the number f false psitives, and cmpares that t the number f genes fund in the riginal dataset. Mre specifically, the algrithm is as fllws: Algrithm 1 1. Apply the prcedure t the riginal dataset, and dente the number f genes declared t be DE as P. Dente as X i and S i the sample mean and standard deviatin f the gene expressin measurements fr gene i = 1... n. 2. Fr b = 1... B, d: Cmpute z ij = (x ij X i )/S i, i = 1... n, j = 1... J. Fr each gene, btain a sample f size J with replacement frm the cllectin f all z ij. Dente this sampled values as z (b) ij. Then cmpute x (b) ij = S i z (b) ij X i. Apply the prcedure t find differentially expressed genes t the btstrap dataset. Since all discveries are false psitives, dente the number f genes declared DE as FP b. P B b=1 FP 3. Estimate the frequentist FDR as FDR = ˆπ b 0, where ˆπ B P 0 is an estimate f the prprtin f EE genes.

Grup 1 Grup 2 2.2 2.8 2.8 3.2 3.1 2.0 2.5 2.1 2.7 2.1 2.0 2.5 2.1 2.9 2.3 10.5 2.9 2.2 2.4 2.4 Table 1: Hypthetical expressin values fr a gene, 10 arrays and 2 grups There is an imprtant remark t make here: there are ther ways t simulate data under the null hypthesis. Fr instance, ne culd simply permute the grup labels r btstrap data within each gene, but this culd be trublesme when the sample size is t small t prvide an accurate representatin f the null. Determining what sample size is large enugh is nt a simple matter. Fr example, suppse that we have a gene fr which mst expressin values are small, but fr ne f the grups it ccasinally presents large expressin values. Table 1 10 presents hypthetical values fr a single gene and 2 grups. Nte that there are ( ) 10 5 = 184, 756 ways t permute the grup labels, which at first sight may seem a large enugh number. Hwever, in grup 2 there is an utlying value. Fr any permutatin, whatever grup the utlier is assigned t will tend t be declared t have higher expressin levels than the ther grup, and it will be cunted as a false psitive. If ne btains a btstrap sample, there is sme prbability that the effect f the utlier will be mitigated (the utlier may nt be sampled at all, r be sampled the same number f times in bth grups), but it will still tend t increase t false psitive cunt. If ne btstraps residuals frm ther genes, such as Algrithm 1 des, the value f 10.5 may nt be utlying at all anymre, since ther genes may present values that are als far away frm the mean. Hence, we see that three reasnable strategies t sample under the null may result in three quite different sampling distributins and estimated FDR. The main issue is whether the utlying value shuld be cnsidered t be an errr r nt. In ur experience, in micrarrays it is nt unfrequent t encunter data such as that in Table 1. A pssible bilgical explanatin is that a small prprtin f individuals frm grup 2 experience sme kind f

GaGa mdel MiGaGa mdel (M=2) Estimated frequentist FDR 5 arrays 10 arrays 15 arrays All data Estimated frequentist FDR 5 arrays 10 arrays 15 arrays All data Bayesian FDR Bayesian FDR Figure 1: Armstrng dataset. Frequentist FDR fr the GaGa (a) and Mi- GaGa (b) mdels. mutatin, which causes the expressin f a particular gene t raise cnsiderably. If such a discvery is bilgically meaningful ne shuld nt cunt it as a false psitive, and hence methds based n permuting r btstrapping each gene separately wuld nt be apprpriate. We agree with (3) that mre research is needed regarding this tpic, but we feel that unless the number f arrays is quite large it may be beneficial t use a resampling scheme that uses data frm several genes at nce. 3 Frequentist FDR fr the Armstrng dataset We nw estimate the frequentist FDR f the Bayesian prcedure utlined in Sectin 1 by applying Algrithm 1 t the Armstrng dataset. Figure 1(a) displays the estimated frequentist FDR fr target Bayesian FDRs ranging frm 0 t 0.1, bth fr the GaGa and MiGaGa mdels and increasing amunts f data. Fr Bayesian FDR at the 0.05 level, the estimated frequentist FDR is always belw 0.05. The nly exceptin is the MiGaGa mdel applied t the full dataset, fr which the frequentist FDR is estimated t be 6.4%. Figure 2 prvides the analgus estimates fr the mntnically transfrmed data, which was analyzed with a GaGa mdel. Fr a Bayesian FDR f 0.05, the estimated frequentist FDR is belw 0.05, as desired.

(a) (b) GaGa mdel, transfrmed data Estimated frequentist FDR 5 arrays 10 arrays 15 arrays All data Density 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Observed data GaGa 0 5 10 15 Bayesian FDR Expressin levels (lg scale) Figure 2: Armstrng dataset after mntnic transfrmatin. (a): frequentist FDR ft the GaGa mdel; (b): marginal distributin f data vs. prir predictive f GaGa mdel with ω = ˆω. We cnclude that the Bayesian prcedure frm Sectin 1 has reasnably gd frequentist perating characteristics when applied t the Armstrng dataset. 4 Gdness f fit in the Armstrng dataset In the riginal paper, Sectin 6.2.1, we evaluated sme aspects f the verall gdness-f-fit. Figure 2(b) cmpares the marginal distributin f the mntnically transfrmed data with draws frm the prir-predictive GaGa mdel, setting the hyper-parameters t their psterir mean. The mntnic transfrmatin imprves the fit f the GaGa mdel substantially (cmpare with Figure 4(a) in the riginal paper). We nw assess the fit fr sme genes individually. First, we select the tw genes with the highest prbability f being DE accrding t the Ga mdel. Figure 3(a) cmpares their bserved expressin values with draws frm their psterir predictive distributin based n the Ga mdel. We see that, even thugh the mdel underestimates the variability fr the MLL grup, the tw genes d actually seem t be differentially expressed. Figure 3(b) shws hw draws frm the GaGa mdel psterir predictive mre apprpriately capture

(a) (b) Ga mdel GaGa mdel Prbe 37809_at ALL MLL Prbe 37809_at CI CII Prbe 1914_at (c) Ga mdel Prbe 1914_at (d) GaGa mdel Prbe 2087_s_at ALL MLL Prbe 2087_s_at CI CII Prbe 1369_s_at Prbe 1369_s_at Figure 3: Observed expressin values vs. predictive distributin. Large black symbls are actual bservatins, small gray symbls are draws frm the psterir predictive. (a),(b): the tw genes with highest prbability f being DE accrding t the Ga mdel; (c),(d): tw genes declared DE by the Ga mdel and declared EE by the GaGa mdel

the variability f the data. Fr these tw genes in particular, hwever, the result f the inference is the same: bth mdels declare prbes 1914 at and 37809 at t be differentially expressed. We nw select tw genes that are declared DE by the Ga mdel and EE by the GaGa mdel, and again cmpare the bserved values t their psterir predictive distributin. Figure 3(c) reveals that the Ga mdel underestimates the variability in the data, while in panel (d) we see that GaGa represents it mre satisfactrily. In this case the inference abut the prbes 1369 s at and 2087 s at frm bth mdels is radically different, fr the GaGa mdel assigns a psterir prbability < 0.01 that each gene is differentially expressed. The pr Ga fit t these tw genes and the fact that n strng differences between grups are bserved in Figure 3(c)-(d) suggest that the inference prvided by the GaGa mdel is mre realiable. Althugh nt presented here, we bserve a similar favrable behavir f the MiGaGa mdel with M = 2 cmpnents. We cnclude that the psterir distributin f the GaGa and MiGaGa mdels present an adequate fit t the data. This is in cntrast with the prir-predictive plt in Figure 3(b) in the riginal paper, which suggested the GaGa fit t be f limited quality. This shuld nt be t surprising, it merely reflects that the psterir distributin incrprates the infrmatin abut bimdality present in the data. References [1] P. Müller, G. Parmigiani, and K. Rice. FDR and Bayesian Multiple Cmparisns Rules. Oxfrd University Press, 2007. [2] P. Müller, G. Parmigiani, C. Rbert, and J. Russeau. Optimal sample size fr multiple testing: the case f gene expressin micrarrays. Jurnal f the American Statistical Assciatin, 99:990 1001, 2004. [3] J.D. Strey. The ptimal discvery prcedure: A new apprach t simultaneus significance testing. Jurnal f the Ryal Statistical Sciety B, 69:347 368, 2007.