Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Size: px

Start display at page:

Download "Probe-Level Analysis of Affymetrix GeneChip Microarray Data"

Robert Jordan
5 years ago
Views:

1 Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad Biostatistics, University of California, Berkeley University of Minnesota Mar 30, 2004

2 Outline A brief introduction to the Affymetrix GeneChip Technology Pre-processing Background Correction Normalization Probe Level Modeling Models Applications

3 Brief Technology Overview High density oligonucleotide array technology as developed by Affymetrix Known as the GeneChip Overview images courtesy of Affymetrix unless otherwise specified

4 Probes and Probesets

5 Two Probe Types Reference Sequence TAGGTCTGTATGACAGACACAAAGAAGATG CAGACATAGTGTCTGTGTTTCTTCT CAGACATAGTGTGTGTGTTTCTTCT PM: the Perfect Match MM: the Mismatch

6 Constructing the Chip Source: Lipshutz et al (1999) Nature Genetics Supplement: The Chipping Forecast

7 Focusing on a Single GeneChip Cell Location

8 Sample Preparation

9 Hybridization to the Chip

10 The Chip is Scanned

11 Chip dat file checkered board oligo B2 Courtesy: F. Colin

12 Chip dat file checkered board close up w/ grid Courtesy: F. Colin

13 Chip dat file checkered board close up pixel selection

14 Chip cel file checkered board Courtesy: F. Colin

15 Probe-Level Analysis What is Probe-level analysis? Analysis and manipulation of probe intensity data Expression calculation: Background, Normalization, Summarization Fitting models to probe intensity data Quality control diagnostics Why do we do it? Hopefully it will allow us to produce better, more biologically meaningful gene expression values We want accurate (low bias) and precise (low variance) gene expression estimates Is there additional information at the probe-level that we might otherwise throw away?

16 High-Level Analysis Clustering/Classification Pathway Analysis Cell Cycle Gene function Anything where a more biological interpretation is desired Such matters will not be discussed further in today's talk.

17 Background/Signal Adjustment A method which does some or all of the following Corrects for background noise, processing effects Adjusts for cross hybridization Adjust estimated expression values to fall on proper scale Probe intensities are used in background adjustment to compute correction (unlike cdna arrays where area surrounding spot might be used)

18 RMA Background Approach Convolution Model = + Observed O Signal S ( ) Noise N Exp α ( 2 N µ, σ )

19 E Correction is given by ( ) SO= o = a+ b a= o, b= 2 µ σ α σ φ a o a φ b b a o a Φ +Φ 1 b b

20 Normalization Non-biological factors can contribute to the variability of data... In order to reliably compare data from multiple probe arrays, differences of non-biological origin must be minimized. 1 Normalization is a process of reducing unwanted variation across chips. It may use information from multiple chips 1 GeneChip 3.1 Expression Analysis Algorithm Tutorial, Affymetrix technical support

21 Non-Biological Variability 5 scanners for 6 dilution groups

22 Non-linear normalization needed Unnormalized Scaled A Non-linear Normalization

23 Quantile Normalization Normalize so that the quantiles of each chip are equal. Simple and fast algorithm. Goal is to give same distribution to each chip. Original Distribution Target Distribution

24 Sort Sort columns of of original matrix Take averages across rows Take averages across rows Set Set average as as value for for All All elements in in the the row row Unsort columns of of matrix to to original order

25 It Reduces Variability Expression Values Fold change Also no serious bias effects. For more see Bolstad et al (2003)

26 General Probe Level Model y ij = f( X) + ε ij Where f(x) is function of factor (and possibly covariate) variables (our interest will be in linear functions) y ij is a pre-processed probe intensity (usually log scale) Assume that E = 0 Var ε ij ε ij = σ 2

27 Parallel Behavior Suggests Multi-chip Model

28 Probe Pattern Suggests Including Probe-Effect

29 Also Want Robustness

30 Summarization Problem: Calculating gene expression values. How do we reduce the probe intensities for each probeset on to a gene expression value? Our Approach RMA a robust multi-chip linear model fit on the log scale Some Other Approaches Single chip AvDiff (Affymetrix) no longer recommended for use due to many flaws Mas 5.0 (Affymetrix) use a 1 step Tukey biweight to combine the probe intensities in log scale Multiple Chip MBEI (Li-Wong dchip) a multiplicative model on natural scale

31 The Three Steps of RMA 1. Convolution Background 2. Quantile Normalization 3. Linear model on the log2 scale fit robustly. Software for implementing RMA is in the Bioconductor affy package

32 RMA mostly does well in practice Detecting Differential Expression Not noisy in low intensities RMA MAS 5.0

33 One Drawback RMA MAS 5.0 Some fixes for this are being developed see GCRMA (Irizarry and Wu, JHU)

34 The RMA model y = m+ α + β + ε ij i j ij where y ij α i β j = log N B 2 ( ( PM )) ij is a probe-effect i= 1,,I is chip-effect ( m + β j is log2 gene expression on array j) j=1,,j

35 Median Polish Algorithm y11 y1 J 0 yi1 yij Imposes Constraints Sweep Columns Iterate medianα = median β = 0 i Sweep Rows median ε = median ε = 0 i ij j ij j ˆ ε ˆ ε αˆ 11 1J 1 ˆ ε ˆ ˆ I1 ε α IJ I ˆ ˆ β β mˆ 1 J

36 Advantages Fast Very robust Disadvantages Median Polish No algorithmic flexibility to fit alternative models No standard error estimates

37 An Alternative Method for Fitting a PLM Robust regression using M-estimation In this talk, we will use Huber s influence function ψ. The software handles many more. Fitting algorithm is IRLS with weights dependent on current residuals ψ r r ( ) Software for fitting such models is part of affyplm package of Bioconductor

38 Variance Covariance Estimates Suppose model is Y = Xβ + ε Huber (1981) gives three forms for estimating variance covariance matrix 2 κ 1/ ( n p) ψ ( r) 1/ n i i ψ ( r) i i 2 2 ( T X X ) 1 1/( n p) ψ ri i κ 1/ n ψ r i ( ) ( ) i 2 W 1 1 1/ ( ) ( ) ( T n p ψ r ) i W X X W κ i We will use this form T ' W = X Ψ X

39 We Will Focus on Two Particular PLM Array effect model Array Effect Pre-processed Log PM intensity Treatment effect model y = α + β + ε ij i j ij y = α + τ + ε ij i l ij j In both cases I i= 1 α i = 0 Probe Effect Treatment Effect

40 Detecting Differential Expression Problem: Given an experiment with two treatment groups correctly identify the differential genes without incorrectly choosing non differential genes. Question 1: How do different methods for DDE compare? Question 2: Can we do better using PLM s?

41 How Do We Know Which Genes are Differential? Spike-in datasets. Transcripts at known concentrations differing across arrays. Common background crna. Typically, Latin Square design Affymetrix HG-U95A GeneLogic AML bacterial spike-ins GeneLogic Tonsil bacterial spike-ins

42 Affymetrix Spike-in Data 59 chips. All but 1 of the rows are done as triplicates A B C D E F G H I J K L M N O P Q R S T

43 Testing for Differential Expression On a probeset by probeset basis compute a test statistic Should include something related to observed FC and some variability estimate A t statistic: something of the form t = X SE

44 Fold Change FC = X X l m Where X l = β j Ind( j group l) Ind( j group l)

45 Simple t-statistic t = X l s n m 2 s 2 l m l X + n m

46 Robust t-statistic t = X l s n m 2 s 2 l m l X + n m Use medians in place of means Use MAD in place of standard deviation

47 Simple Moderated t- statistic t = s n l 2 s 2 l m l X X m m + + n s med s 2 2 sl sm med is median + across all genes Analogous to SAM n l n m

48 Limma ebayes t- statistic Generalization of Bayesian method of Lonnstadt and Speed (2002) to the general linear model case An alternative and much more sophisticated moderated t-statistic. Variability is estimated using a posterior From the limma Bioconductor package.

49 Probe Level Model test statistics Σ Suppose that is component of the variancecovariance matrix related to β Let c be the contrast vector defined such that the j th element of c is 1 if array j is in group l n l 1 if array j is in group m n m 0 otherwise

50 Probe Level Model test statistics t PLM.1 = J j= 1 T c β c 2 j Σ jj PLM.2 T t = c β T c Σc

51 A First Comparison 8 chips from Affymetrix HG-U95A spike-in dataset 4 arrays for each of two concentration profiles Fit an array effect model to all 8 chips Compare the performance of the different methods by looking at all comparisons 1 vs 1 2 vs 2 3 vs 3 4 vs 4

56 What Happens as the Number of Arrays Increases? Expand comparison to all 24 Arrays with same concentration profiles from Affymetrix HG-U95A spike-in dataset Fit an array effect model to all 24 arrays Look at comparisons between equal number of chips

61 A Larger Comparison Look at the entire 59 chips for Affymetrix HG-U95A spike-in dataset Examine two cases. After standard preprocessing Fit models for each pairwise comparison (individual models) Fit a model to all 59 chips (single model) There are 91 pairwise comparisions

62 Results Method Individual Models Single Model 0% FP 5% FP AUC 0% FP 5% FP AUC FC Std Robust Mod PLM PLM Ebayes

63 More Spike-in Datasets Two GeneLogic Spike-in datasets Tonsil dataset (36 arrays) AML dataset (34 arrays) In each case use single models fitted to all arrays

66 What is Going On Here? Examine residuals stratified by concentration group Spike-ins Randomly chosen non-differential probesets at low, medium and high average expression

67 Affymetrix Spike-ins

68 Low Non-Differential

69 Middle Non-Differential

70 High Non-Differential

71 GeneLogic AML Spike-ins

72 GeneLogic Tonsil dataset

73 How About With More Real Data? GeneLogic Dilution/Mixture Dataset Learning Set Liver 30x VS CNS 30x 400 top genes truth 75:25 5x VS 25:75 5x Test Set Mixture Data Dilution Series Data

74 Mixture Data Results Method 3 vs 3 4 vs 4 5 vs 5 0% FP 5% FP AUC 0% FP 5% FP AUC 0% FP 5% FP AUC FC Std Robust Mod PLM PLM Ebayes

75 What About the Treatment Effect Model? With treatment effect model, have more observations contributing to estimation for each element of the variance covariance matrix Much quicker to fit

77 Moderation for the PLM test statistic Method 5 vs 5 0% FP 5% FP AUC PLM Ebayes PLM Moderated (Preliminary Results)

78 Quality Assessment using PLM PLM quantities useful for assessing chip quality Weights Residuals Standard Errors Expression values relative to median chip

79 Pseudo-chip images Weights Residuals Positive Residuals Negative Residuals

80 NUSE Normalized Unscaled Standard Errors

81 RLE Relative Log Expression

82 Ongoing Work in this Area Technology changes: What still works? What doesn t? Other probe-level models (eg Introduce sequence information, MM s)

83 Acknowledgements Terry Speed (UC Berkeley) Rafael Irizarry (Johns Hopkins) Francois Colin (UC Berkeley) Julia Brettschneider (UC Berkeley) Bioconductor Core

84 Additional Slides

85 From Chip To Data

86 Constructing a gene expression measure

87 Computing Expression Measures: A Three Step Procedure 1. Background/Signal adjustment (B) 2. Normalization (N) 3. Summarization (S) Let be cel file data from multiple arrays then X X Expression values = S(N(B( )))

88 Background Methods Affymetrix Location dependent Ideal mismatch RMA Convolution model Other Standard curve adjustment GCRMA (Wu, Irizzary et al 2003) + =

89 Background Signal Methods Affymetrix Location dependent background based on grids I will refer to this as the MAS 5 background Originally proposed subtracting MM from PM but this is problematic because as many as a third of MM s are greater than the respective PM No longer used Now uses what they refer to as the Ideal Mismatch which is MM when possible and something else when not possible (designed so that there is now no negatives) Call this IMM

90 Original RMA Background Convolution model is suggested by looking at density of observed empirical distributions

91 Convolution Model O = S + N O is observed PM, S is signal (assumed exponential), N is noise (assumed normal, truncated at zero) Correction is then a o a φ φ E ( ) b = = + b S O o a b a o a Φ Φ b b a = o, b = 2 µ σ α σ 1

92 A Standard Curve Adjustment Based on Spike-in Information Observes that there is a curve that relates observed expression and spike-in concentration. The ideal would be to have a linear relationship between concentration and computed expression. The curve gives us a concentration dependent adjustment

93 What About Non Spike-ins? We don t know a concentration for most probesets. If we did, or if we had a variable that related to concentration, the adjustment would be easy to perform Fit the following model y y ( k ) = α + ε ( k) ( k) 1i i i ( k) k ( k) = + + ( k) ( ) 2 i αi γ ε' i Where y y = log ( k) ( k) 1i 2 i = log PM MM ( k) ( k) 2i 2 i

94 γ Relates to Concentration

95 Establishing a Relationship γ Between and Concentration

96 The Two Curves Yield an Adjustment Curve

97 Normalization Methods Methods already compared in Bolstad et al (2003) Baseline (normalized using reference chip) Scaling (Affymetrix) Non linear (Li-Wong) Complete data (no reference chip, information from all arrays used) Quantile normalization (Bolstad et al 2003) Contrast (Åstrand, 2003) Cyclic Loess

98 Why Quantile Normalization? Quantile normalization found to perform acceptably in reducing variance without drastic bias effects Quantile normalization is fast

99 RMA Model To each probeset (k), with i being number of probes and j being number of chips, fit the model: y = α + β + ε ( k) ( k) ( k) ( k) ij i j ij ( k) ( k) where αi is a probe effect and β j is the ( k ) log gene expression. y ij is the log2 background adjusted and normalized PM intensity Different ways to fit this model Median polish quick Robust linear model yields some good quality diagnostic tools

100 Basic RMA model Let then y ij = log N B 2 ( ( PM )) ij y = m+ α + β + ε ij i j ij where α i β j is probe-effect is chip-effect ( m + β j is log2 gene expression on array j) Median-polish imposes constraints medianα = median β = 0 i median ε = median ε = 0 i ij j ij j

101 Advantages/Disadvantag es of RMA/Median polish Advantages Fast Gene expression measures perform favorably when compared with MAS 5.0, Li-Wong MBEI Robust against outliers Disadvantages No standard error estimates No algorithmic flexibility to fit alternative models

102 Probe Level Models are RMA method based on RMA Convolution Model Background Quantile Normalization Summarization using a robust multi-chip model on the log scale. Model is fitted using the median polish algorithm on a probeset by probeset basis

103 Comparing the background methods Using an Affymetrix spike-in experiment we shall examine Observed vs spike-in concentration Observed vs expected fold change Composite M vs A plots ROC curves In each case we will compute expression values use standard RMA methodology. (ie quantile normalization, median polish summarization)

104 Assessing Bias: Observed Expression vs Spike-in Concentration Slope None RMA MAS 5 IMM MAS5/ S.C.A. IMM All Mid Low High

105 Assessing Bias: Observed Fold-change versus Expected Fold-change Slope None RMA MAS 5 IMM MAS5/ IMM S.C.A. All

106 Assessing Variability: M vs A plots Vertical Axis is M: log2 fold-change. Horizontal Axis is A: average expression value (on log2 scale) aka geometric average. Ideally non differential genes tight about M=0

107

108

109

110

111

112

113 Detecting Differential Expression: ROC Curves

114 Summary of Trade-offs Background Method No Background RMA MAS 5.0 Ideal Mismatch MAS5.0/IdealMM Standard Curve Adjustment Detect Differential Genes Good Good Good Poor Poor Good Accurate estimates of Fold change Poor Poor Poor Good Good Good

115 Comparing the Normalization Methods Want to reduce variation but at the same time we do not want to introduce any bias First a quick examination of the expression values by array Using same spike-in experiment as before, this time no background correction, only normalization and median polish summarization.

116

117

118

119 Scaling is Not Sufficient

120 Variability of Non-Differential Genes is Reduced

121 Little effect on Spike-ins Method All Low Mid High FC No Normalization (0.845) (0.148) (0.733) (0.207) (0.952) Quantile (0.851) (0.153) (0.741) (0.224) (0.955) Scaling (0.852) (0.156) (0.742) 0.33 (0.225) (0.954)

122 ROC Curves

123 Comparing Established Expression Measures

124 Slope All Mid Low High Value

125 Slope All Mid Low High Value

126 Slope All Mid Low High Value

127 Slope All Mid Low High Value

128 Slope All Mid Low High Value

129 Slope All Mid Low High Value

130 Slope: 0.484

131 Slope: 0.624

132 Slope: 0.583

133 Slope: 0.683

134 Slope: 0.692

135 Slope: 0.847

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

Low-Level Analysis of High- Density Oligonucleotide Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley UC Berkeley Feb 23, 2004 Outline