Tools and topics for microarray analysis

1 Tools and topics for microarray analysis
USSES Conference, Blowing Rock, North Carolina, June 2005
Jason A. Osborne, Department of Statistics, North Carolina State University

2 Outline
- Introduction.
- An example from the NCSU Forest Biotechnology program, using SAS Scientific Discovery Solutions along with JMP for visualization.
- A discussion of the False Discovery Rate, including Bioconductor's qvalue() for its estimation.

3 Microarray experiments
Microarray experiments use emerging technology to observe expression intensity for thousands of genes at once across different experimental conditions. Analysis of data from these experiments presents many challenges:
- Terminology and communication.
- Information technology.
- Various platforms, continually developing.
- Many sources of error: effects of individual organisms, arrays, spots, dye, RNA extraction and amplification, and the dynamic range of expression.
- Many different things to investigate: the search for differentially expressed genes and the assessment of their significance; the clustering of genes with similar behavior across different experimental conditions; ...

4 A cartoon would be nice.

5 An example
The fungus Cronartium quercuum f. sp. fusiforme is the known causal agent of fusiform rust disease in southern pine trees. A microarray experiment was undertaken to study, at the molecular level, the effects of the agent, as well as differences in resistance among genotypes over various stages of development in loblolly pine (Pinus taeda L.). The experiment involved 56 spotted two-color slides and approximately 3000 genes, each spotted twice on each array.
Carried out by H. Myburg, L. van Zyl, et al., Forest Biotechnology Program, NCSU. Supported by a USDA Forest Service grant.

6 Background on the fungal agent
This image, taken from www.backyardnature. (URL truncated), comes with an explanation of how the fungus prospers. This 5-year-old, 20-foot loblolly pine, with the swollen item at about 10 feet, will probably die.

7 Design
The treatment factors form a complete crossed layout with two satellite design points:
- Genotype: heterozygous (Fr1/fr1) or homozygous (fr1/fr1), i.e. resistant (R) or susceptible (r).
- Inoculum: inoculated with fungus (I) or with water (C).
- Time: tissue harvested at seven time points.
RNA was extracted from harvested tissue from 700 seedlings:
- 30 seedlings for each of the 14 challenged time points
- 20 seedlings for each of the 14 control time points
One time point (28 days) was discarded (insufficient RNA). RNA was pooled over the 30 (or 20) seedlings, then sampled for comparisons.

8 Because of the pooling, it is difficult to claim that this experiment accounts for biological variation: the units are pools of RNA, not individual seedlings.

9 Processing and analysis of data from the experiment
Inputs: intensity files (one per array), a design file, and an annotation file.

10 Quality control: background signal
(Data from an experiment conducted by Katrin Wuennenburg-Stapleton, Ngai Lab at UC Berkeley, with four two-color slides to study zebrafish.)

11 Quality control: array group correlation plots
One matrix per treatment, one column per array channel.

12 Screenshot from SAS Scientific Discovery Solutions

13 Normalization
The idea is to correct for array and dye (and array × dye) effects across the whole genome.
- Centering: one could center all gene intensities about zero for each combination of array and dye.
- LOESS: nonparametric regression of the $\log_2$ intensity for a channel against a baseline, which may be taken as the average for that channel across all arrays. Each spot contributes one observation to the scatterplot.
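The centering option above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS/JMP processing used in the talk: the record layout `(array, dye, log2_intensity)`, the function name `center_by_array_dye`, the dye labels, and the toy numbers are all hypothetical.

```python
from collections import defaultdict

def center_by_array_dye(records):
    """Center log2 intensities about zero within each (array, dye) group.

    records: iterable of (array, dye, log2_intensity) tuples (assumed layout).
    Returns a list of (array, dye, centered_value) in the input order.
    """
    records = list(records)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for array, dye, y in records:
        sums[(array, dye)] += y
        counts[(array, dye)] += 1
    # Mean log2 intensity for every array-by-dye combination.
    means = {key: sums[key] / counts[key] for key in sums}
    return [(a, d, y - means[(a, d)]) for a, d, y in records]

# Toy data: made-up log2 intensities; dye labels are placeholders.
toy = [(1, "Cy3", 8.0), (1, "Cy3", 10.0), (1, "Cy5", 9.0), (1, "Cy5", 12.0),
       (2, "Cy3", 7.0), (2, "Cy3", 7.5)]
centered = center_by_array_dye(toy)
```

After centering, the values within each array-by-dye group sum to zero, so genome-wide array and dye shifts no longer contribute to differences between channels.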

14 LOESS normalization plots

15 Gene model: mixed model for signals from individual channels

    proc mixed;
      by new_cloneid;  /* new_cloneid is the gene id */
      class water host inoc age array spot_number;
      model y = host(water) inoc(water) age(water)
                host*inoc(water) host*age(water) inoc*age(water)
                host*inoc*age(water) water dye;
      random array spot_number(array);
    run;

16 Parameterizing the mean response
$Y^{(g)}_{ijkdtmn}$ is the normalized $\log_2$ intensity for gene $g$. The indices $i, j, k$ identify samples (treatment combinations):
- $i$ indexes host
- $j$ indexes inoculum
- $k$ indexes control/treatment ("satellite" design points)
- $d$ indexes dye
- $t$ indexes time
- $m$ denotes array, and $n$ denotes spot within array

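17 Mixed model

$$Y^{(g)}_{ijkdtmn} = \underbrace{\mu^{(g)}_{ijkt} + \delta^{(g)}_{d}}_{\text{fixed factorial effects}} + \underbrace{A^{(g)}_{m} + S^{(g)}_{n(m)} + E^{(g)}_{ijkdtmn}}_{\text{random effects}}$$

$$\mu_{ijkt} = \begin{cases} \mu + \alpha_{i(1)} + \beta_j + \tau_t + (\alpha\beta)_{ij} + (\alpha\tau)_{it} + (\beta\tau)_{jt} + (\alpha\beta\tau)_{ijt} & \text{when } k = 1 \\ \mu + \alpha_{i(2)} + \omega & \text{when } k = 2 \end{cases}$$

for $g = 1, \ldots$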

18 F-tests for factorial effects
Histograms of p-values from the F-tests.

19 Volcano plots
Plots of fold changes ($\log_2$ ratios) versus p-values.

20 Heat map: two-way clustering of genes and treatments

21 Parallel plots of mean response across treatments

23 $R^2$ histogram

$$R^2(g) = 1 - \frac{\mathrm{variance}(\mathrm{resid}(g))}{\mathrm{variance}(\mathrm{log2in}(g))}$$

24 Some conclusions
- Many genes exhibit significant differential expression across these treatments.
- The average estimated variance components were $\hat\sigma^2 = 0.033$, $\hat\sigma^2_{S(A)} = 0.037$, $\hat\sigma^2_A = \ldots$, but we don't have any assessment of variability among individuals, because RNA samples were pooled over seedlings.
- Clear time effects, with high fold changes occurring at the later time points, particularly for the susceptible seedlings inoculated with the fungus.

25 False Discovery Rates
Consider an experiment with many tests of significance. Let $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$ denote the ordered, unadjusted p-values. A volcano plot, with $-\log_{10}(p)$ on the vertical axis:

26 Truth table: outcome from multiple tests

Truth                 Declared significant   Declared not significant   Total
Null is true          F                      m_0 - F                    m_0
Alternative is true   S                      m_1 - S                    m_1
Total                 R = F + S              m - R                      m

Some quantifications of error: comparisonwise (CER), familywise (FWE) and false discovery (pFDR) rates:

$$\mathrm{CER} = E\left(\frac{F}{m_0}\right), \qquad \mathrm{FWE} = \Pr(F > 0), \qquad \mathrm{FDR} = E\left(\frac{F}{R} \,\middle|\, R > 0\right)$$

The FDR has an appealing, straightforward interpretation in a microarray study: if the declared genes are investigated further (e.g. by PCR), the FDR is the proportion expected to result in a dead end.

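27 How does the BH step-up procedure work?
To control the FDR at $\alpha$:
1. Order the raw p-values: $p_{(1)} \le \cdots \le p_{(m)}$.
2. Find $\hat k = \max\{k : p_{(k)} \le k\alpha/m\}$.
3. If $\hat k$ exists, reject the tests corresponding to $p_{(1)}, \ldots, p_{(\hat k)}$.
Equivalently, the BH-adjusted p-values are defined as

$$\tilde p_{(m)} = p_{(m)}$$
$$\tilde p_{(m-1)} = \min\left\{\tilde p_{(m)}, \frac{m}{m-1}\, p_{(m-1)}\right\}$$
$$\vdots$$
$$\tilde p_{(1)} = \min\left\{\tilde p_{(2)}, m\, p_{(1)}\right\}$$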
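The step-up rule and the adjusted p-values above can be sketched in plain Python. This is an illustrative sketch, not the PROC MULTTEST implementation; the function names `bh_reject` and `bh_adjust` and the toy p-values are made up for the example.

```python
def bh_reject(pvalues, alpha):
    """Return the indices (into pvalues) rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k_hat = 0  # stays 0 when no k satisfies p_(k) <= k * alpha / m
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k_hat = rank  # keep the largest qualifying rank
    return sorted(order[:k_hat])

def bh_adjust(pvalues):
    """Return BH-adjusted p-values, in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work down from the largest p-value, carrying the running minimum.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy p-values (made up for illustration).
p = [0.01, 0.045, 0.02, 0.005, 0.5]
rejected = bh_reject(p, alpha=0.05)
adjusted = bh_adjust(p)
```

Rejecting every test whose adjusted p-value is at most $\alpha$ selects the same tests as the step-up rule, which is why the two descriptions are equivalent.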

28 FDR option in PROC MULTTEST, with variable raw_p in the dataset (taken from Westfall et al. (1999)).
Output from The Multtest Procedure: a table of p-values with columns Test, Raw, Bonferroni, and False Discovery Rate.

29 A different approach to multiple testing
The step-up BH procedure estimates the rejection region, i.e. $\hat k$, so that on average $\mathrm{FDR} < \alpha$. Alternatively, Storey (2002) advocates fixing the critical region and then estimating the FDR. Information in the p-values about $\pi_0 = m_0/m$ may be used to obtain an estimator and to construct a more powerful procedure that may still be used to control the FDR.

30 Estimation of FDR
Consider fixing the critical region by rejecting hypotheses with p-values less than $t$. From the truth table,

$$\mathrm{FDR}(t) \approx \frac{E[F(t)]}{E[R(t)]} = \frac{t\, m_0}{E[\#\{p_i < t\}]}$$

$$\widehat{\mathrm{FDR}}(t) = \frac{t\, \hat m_0}{\#\{p_i < t\}} = \frac{t\, \hat\pi_0\, m}{\#\{p_i < t\}}$$

(This needs an estimator $\hat\pi_0$ of $m_0/m$.)
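A minimal Python sketch of the plug-in estimate for a fixed region (0, t) follows. The function name `estimate_fdr`, the guard `max(rejected, 1)` against an empty rejection set, and the cap at 1 are assumptions added for the example, not part of the slide. The toy p-values are constructed so that the counts match the figures quoted later in the talk (m = 1000, t = 0.01, 85 rejections, $\hat\pi_0$ = 0.807).

```python
def estimate_fdr(pvalues, t, pi0_hat=1.0):
    """Plug-in FDR estimate for the fixed rejection region (0, t).

    pi0_hat defaults to 1, the most conservative choice.
    """
    m = len(pvalues)
    rejected = sum(1 for p in pvalues if p < t)
    # Assumed conventions: avoid division by zero, and cap the estimate at 1.
    return min(1.0, t * pi0_hat * m / max(rejected, 1))

# Toy p-values: 85 below t = 0.01 and 915 above it (values are made up).
toy_p = [0.001] * 85 + [0.5] * 915
fdr_hat = estimate_fdr(toy_p, t=0.01, pi0_hat=0.807)  # 8.07 / 85
```

With `pi0_hat=1.0` the same call gives the more conservative estimate 10/85, which shows how information about $\pi_0$ sharpens the estimate.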

31 Estimation of $\pi_0$

$$\pi_0 = \frac{m_0}{m}$$

Introduce a tuning parameter, $0 < \lambda < 1$:

$$\hat\pi_0(\lambda) = \frac{\#\{p_i > \lambda\}}{m(1 - \lambda)}$$

Choose the best $\lambda$, then substitute $\hat\pi_0$ into the expression for $\widehat{\mathrm{FDR}}(t)$ for the fixed critical region $(0, t)$.
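The estimator $\hat\pi_0(\lambda)$ is a one-line count, sketched here in Python. The function name `pi0_hat` and the idealized toy p-values (800 null p-values spread evenly over (0, 1) and 200 small alternative p-values) are assumptions for illustration only.

```python
def pi0_hat(pvalues, lam):
    """Estimate pi_0 = m_0 / m from the p-values that exceed lam."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    m = len(pvalues)
    # Null p-values are uniform, so about m_0 * (1 - lam) of them exceed lam.
    return sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam))

# Idealized toy data: 800 evenly spaced "null" p-values, 200 small ones.
nulls = [(i + 0.5) / 800 for i in range(800)]
alternatives = [0.001] * 200
estimate = pi0_hat(nulls + alternatives, lam=0.5)
```

In this idealized case the estimate recovers the true proportion, 0.8, because no alternative p-value exceeds $\lambda$. With real data some alternative p-values do exceed $\lambda$, which is the source of the positive bias for small $\lambda$.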

32 Estimation of $\pi_0$
Storey (2002) considered a simulation with
- $m = 1000$ tests of $H_0: \mu = 0$ against $H_1: \mu > 0$
- two random samples, of size $\pi_0 m$ from $N(0, 1)$ and $(1 - \pi_0) m$ from $N(2, 1)$ (for a variety of $\pi_0$)
- $p_i = 1 - \Phi(y_i)$, $i = 1, \ldots, 1000$
For the case $\pi_0 = 0.8$, some plots on the next slide, ...
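The simulation setup can be reproduced approximately with the Python standard library. This is a sketch under assumptions: the function name `simulate_pvalues`, the seed, and the random number generator are arbitrary choices, so the draws will not match those behind the slides.

```python
import math
import random

def simulate_pvalues(m=1000, pi0=0.8, mu1=2.0, seed=0):
    """Simulate one-sided p-values from a two-component normal mixture.

    The first round(pi0 * m) statistics come from N(0, 1) (true nulls),
    the remainder from N(mu1, 1) (true alternatives).
    """
    rng = random.Random(seed)  # seed chosen arbitrarily for reproducibility
    m0 = round(pi0 * m)
    ys = [rng.gauss(0.0, 1.0) for _ in range(m0)]
    ys += [rng.gauss(mu1, 1.0) for _ in range(m - m0)]
    # p = 1 - Phi(y), written with erfc for accuracy in the upper tail.
    return [0.5 * math.erfc(y / math.sqrt(2.0)) for y in ys]

pvals = simulate_pvalues()
null_mean = sum(pvals[:800]) / 800   # should be near 0.5 (uniform p-values)
alt_mean = sum(pvals[800:]) / 200    # should be much smaller
```

The null p-values are approximately uniform on (0, 1), while the alternative p-values pile up near zero; that contrast is what the estimator of $\pi_0$ exploits.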

34 Estimation of $\pi_0$, continued

$$\hat\pi_0(\lambda) = \frac{\#\{p_i > \lambda\}}{m(1 - \lambda)} = \ldots = 0.92$$

(Positive bias of $\hat\pi_0$ for $\lambda$ near 0, high variance for $\lambda$ near 1.)
The qvalue() procedure in R fits a smooth function, $\hat\pi_0(\lambda)$, and considers the limit as $\lambda \to 1$. A bootstrap procedure is also available.

36 Estimation of FDR, continued
Consider a rejection region of $(0, 0.01)$ for the $m = 1000$ normal mixture.
- $\hat\pi_0 = 0.807$ (smoother estimate from software)
- $R(0.01) = 85$ (number of tests rejected)

$$\widehat{\mathrm{FDR}}(0.01) = \frac{\hat\pi_0\, m\, t}{\#\{p_i < t\}} = \frac{0.807(1000)(0.01)}{85} \approx 0.095$$

Interpretations:
- The proportion of the 85 rejected tests that are false leads is estimated to be about 10%.
- Bonferroni correction with $\alpha = 0.1$ leads to 6 rejected tests, and we're able to say that $\Pr(\ge 1 \text{ false lead}) \le 0.1$.
- If CER = 0.1 (no multiplicity adjustment), 229 tests are rejected, and the type I error rate among the $m = 1000$ tests is 10%.

37 q-values and their interpretation

$$q\text{-value}(p_i) = \min_{t \ge p_i} \widehat{\mathrm{FDR}}(t)$$

A measure of significance in terms of the FDR: the smallest FDR at which the statistic may be declared significant.
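The minimization over $t \ge p_i$ can be sketched in Python by evaluating the estimated FDR only at the observed p-values. This is an illustration, not the qvalue() implementation: the function name `qvalues` is hypothetical, the estimate $\hat\pi_0$ is passed in rather than computed, and the count $\#\{p_j \le t\}$ (with $\le$ rather than $<$) is assumed so that each p-value counts itself.

```python
def qvalues(pvalues, pi0_hat=1.0):
    """q-values: the smallest estimated FDR at which each test is rejected.

    Assumes FDR_hat(t) = pi0_hat * m * t / #{p_j <= t}, evaluated at the
    observed p-values only.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Scan from the largest p-value down, so the running minimum covers
    # every threshold t >= p_i.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        fdr_at_p = pi0_hat * m * pvalues[idx] / rank
        running_min = min(running_min, fdr_at_p)
        q[idx] = running_min
    return q

# Toy p-values and a made-up estimate of pi_0.
p = [0.01, 0.045, 0.02, 0.005, 0.5]
q = qvalues(p, pi0_hat=0.8)
```

With `pi0_hat=1.0` this calculation coincides with the BH-adjusted p-values, so an estimate of $\pi_0$ below 1 shrinks every q-value by the same factor.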

38 References
- Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSSB, 57.
- Storey, J.D. (2002) A direct approach to false discovery rates. JRSSB, 64.
- Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genome-wide studies. PNAS, 100.
- Storey, J.D. (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31.
- Storey, J.D., Taylor, J.E., and Siegmund, D. (2004) Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. JRSSB, 66.
