Microarray Preprocessing

1 Microarray Preprocessing

2 Normalization Normalization is needed to ensure that differences in intensities are indeed due to differential expression, and not some printing, hybridization, or scanning artifact. Normalization is necessary before any analysis which involves within or between slides comparisons of intensities, e.g., clustering, testing. Somewhat different approaches are used in two-color and one-color technologies. Normalization is ubiquitous across high-throughput technologies.

3 Preprocessing Lingo Spatial artifacts: spatial refers to location on array. Problematic when the same feature is in the same physical location on arrays. Background correction: we get a measurement even when no DNA (RNA) is present. This is background. Usually estimated on a per-array basis. Normalization: techniques for making measurements comparable across (between) arrays. Sometimes within array normalization is used. Summarization: Mostly used for Affymetrix arrays. Used when multiple features measure the same target, to summarize the multiple measurements into a single measurement. Preprocessing: Do all (or some) of the above.

4 Methods We will discuss a number of issues and solutions. A method will usually combine a number of these and other solutions into a single pipeline. Example: RMA for preprocessing Affymetrix expression arrays. Combines: RMA background model; quantile normalization; median polish. Wildly successful ("no one ever got fired for using RMA").

5 Boxplot of raw intensities (figure: boxplots for Arrays 1 to 4)

6 Density plots

7 The MA (mean-difference) plot

8 Pairwise MA plots \( M = \log_2(\text{array}_i / \text{array}_j) \), \( A = \tfrac{1}{2} \log_2(\text{array}_i \cdot \text{array}_j) \) (figure: pairwise MA plots for Arrays 1 to 4)
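The M and A values behind a pairwise MA plot can be computed directly from the intensities of two arrays. A minimal Python sketch, assuming positive intensities given as two equal-length lists; the function name `ma_values` is illustrative and not from the slides:

```python
import math

def ma_values(array_i, array_j):
    """Return (M, A) lists for two arrays of positive probe intensities."""
    m_vals, a_vals = [], []
    for x, y in zip(array_i, array_j):
        lx, ly = math.log2(x), math.log2(y)
        m_vals.append(lx - ly)          # M = log2(x / y)
        a_vals.append(0.5 * (lx + ly))  # A = 1/2 * log2(x * y)
    return m_vals, a_vals

# Two probes measured on two arrays (made-up intensities).
m, a = ma_values([8.0, 4.0], [2.0, 4.0])
```

Plotting `m` against `a` gives the MA plot for that pair of arrays; a probe with equal intensity on both arrays has M = 0.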

10 Probe effects

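11 Why Adjust for Background? When the signals are large relative to the background, \( (E_1 + B)/(E_2 + B) \approx E_1/E_2 \); when they are small, \( (E_1 + B)/(E_2 + B) \approx 1 \). Notice the local slope decrease as the nominal concentration becomes small.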
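A small numeric illustration of how a shared additive background pulls observed ratios toward 1. The intensities and the background level below are made-up values chosen only to show the effect:

```python
def observed_ratio(e1, e2, background):
    """Ratio of two measured intensities that share an additive background."""
    return (e1 + background) / (e2 + background)

# The true ratio E1/E2 is 2 in both cases; B = 100 is a hypothetical background.
high = observed_ratio(20000.0, 10000.0, 100.0)  # signal far above background
low = observed_ratio(20.0, 10.0, 100.0)         # signal far below background
```

Here `high` stays close to the true ratio of 2, while `low` is close to 1, which is the compression that background adjustment tries to undo.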

12 RMA Background Adjustment The Basic Idea: PM = B + S. Observed: PM. Of interest: S. Pose a statistical model and use it to predict S from the observed PM.

13 Background Correction Observed Intensities = Signal + Background Noise. Use the data from all probes to estimate signal/noise distributions.

14 The Basic Idea \( PM = B + S \). A mathematically convenient, useful model: \( B \sim \mathrm{Normal}(\mu, \sigma) \), \( S \sim \mathrm{Exponential}(\lambda) \), \( \hat{S} = E[S \mid PM] \). Borrowing strength across probes.
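Under the model on this slide, the conditional expectation has a closed form: given PM, S follows a normal distribution with mean a = PM - mu - sigma^2 * lambda and standard deviation sigma, truncated to S >= 0. The sketch below evaluates that expectation. It assumes sigma is a standard deviation, lambda is a rate, and all three parameters are already known. It does not show how RMA estimates them, and it is not the code of any particular package.

```python
import math

def _phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_signal(pm, mu, sigma, rate):
    """E[S | PM = pm] for PM = B + S, B ~ Normal(mu, sigma), S ~ Exponential(rate).

    Uses the mean of a normal distribution truncated at zero.
    Not numerically safe when (pm - mu - sigma**2 * rate) / sigma is very negative.
    """
    a = pm - mu - sigma ** 2 * rate
    return a + sigma * _phi(a / sigma) / _Phi(a / sigma)

# Hypothetical parameters: background mean 100, sd 20, signal rate 0.01.
bright = expected_signal(1000.0, 100.0, 20.0, 0.01)
dim = expected_signal(100.0, 100.0, 20.0, 0.01)
```

For a bright probe the adjustment is essentially a subtraction (`bright` is about 896), while for a probe at background level the estimate stays positive instead of dropping to zero or below.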

15 Transforming the data Intensity is typically measured using a 16-bit scanner. This gives values from 0 to 65,535. We almost always transform data to the log2 scale, giving us values between 0 and 16.

16 The two-component model (figure: multiplicative noise and additive noise, shown on the raw scale and the log scale). B. Durbin, D. Rocke, JCB 2001

17 The two-component model
measured intensity = offset + gain × true abundance
\( y_{ik} = a_{ik} + b_{ik} x_k \)
\( a_{ik} = a_i + \varepsilon_{ik} \), with \( a_i \) the per-sample offset and \( \varepsilon_{ik} \sim N(0, b_i^2 s_1^2) \) the additive noise
\( b_{ik} = b_i b_k \exp(\eta_{ik}) \), with \( b_i \) the per-sample normalization factor, \( b_k \) the sequence-wise probe efficiency, and \( \eta_{ik} \sim N(0, s_2^2) \) the multiplicative noise

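18 variance stabilizing transformations Let \( X_u \) be a family of random variables with \( E X_u = u \), \( \mathrm{Var}\, X_u = v(u) \). Define \( f(x) = \int^{x} \frac{1}{\sqrt{v(u)}}\, du \). Then \( \mathrm{Var}\, f(X_u) \) is approximately independent of u. Derivation: linear approximation.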
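The linear approximation behind this definition can be written out as follows. This is a sketch, assuming f is differentiable and \( X_u \) is concentrated near its mean u:

```latex
f(X_u) \approx f(u) + f'(u)\,(X_u - u)
\quad\Longrightarrow\quad
\operatorname{Var} f(X_u) \approx f'(u)^2\, v(u).
```

Requiring the right-hand side to be constant (say, equal to 1) gives \( f'(u) = 1/\sqrt{v(u)} \), and integrating gives \( f(x) = \int^{x} du / \sqrt{v(u)} \).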

19 variance stabilizing transformations (figure: f(x) plotted against x on the raw scale)

20 variance stabilizing transformations \( f(x) = \int^{x} \frac{1}{\sqrt{v(u)}}\, du \)
1.) constant variance (additive): \( v(u) = s^2 \Rightarrow f \propto u \)
2.) constant CV (multiplicative): \( v(u) \propto u^2 \Rightarrow f \propto \log u \)
3.) offset: \( v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0) \)
4.) additive and multiplicative: \( v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh}\frac{u + u_0}{s} \)
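The arsinh transformation for the combined additive and multiplicative case behaves like a logarithm at high intensities and stays finite near zero. A minimal Python sketch; the function name `glog` and the parameter values are illustrative assumptions, and in practice s and u_0 would be estimated from the data:

```python
import math

def glog(u, s, u0=0.0):
    """arsinh-based generalized log: f(u) = arsinh((u + u0) / s)."""
    return math.asinh((u + u0) / s)

s = 50.0  # hypothetical additive-noise scale

# High intensities: the glog difference is close to the natural-log ratio.
high_diff = glog(20000.0, s) - glog(10000.0, s)

# At zero intensity the transform is finite, unlike the ordinary logarithm.
at_zero = glog(0.0, s)
```

With these values `high_diff` is within about 1e-5 of log(2), so differences of transformed values can be read as log-ratios wherever the multiplicative noise dominates.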

21 glog (figure: raw scale, log, and glog transformations; difference, log-ratio, and generalized log-ratio; variance split into a constant part and a proportional part)

22 evaluation: effects of different data transformations (figure labels: difference red-green, rank(average))

23 Comparing two samples; intensity dependent effect

24 Non-biological variability is a problem for single-channel arrays (figure: log2 PM intensity, 5 scanners for 6 dilution groups)

25 Some Solutions Proposed solutions:
Force distributions (not just medians) to be the same: Amaratunga and Cabrera (2001); Bolstad et al. (2003).
Use curve estimators, e.g. loess, to adjust for the effect: Li and Wong (2001) (note: they also use a rank-invariant set); Colantuoni et al. (2002); Dudoit et al. (2002).
Use adjustments based on an additive/multiplicative model: Rocke and Durbin (2003); Huber et al. (2002); Cui et al. (2003).

26 Quantile Normalization Normalize so that the quantiles of each chip are equal. Simple and fast algorithm. Goal is to give the same distribution to each chip. (figure: original distribution and target distribution)

27 Quantile normalization All these non-linear methods perform similarly. Quantiles is my favorite because it's fast and conceptually simple. Basic idea: order the values in each array; take the average across arrays at each rank; substitute each probe intensity with the average for its rank; put the values back in the original order.
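The four steps of the basic idea can be sketched in plain Python. This is an illustrative implementation, not the code of any particular package. It assumes every array has the same number of probes and breaks ties by position, which is a simplification:

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per array)."""
    n_probes = len(arrays[0])
    # Step 1: for each array, probe indices in increasing order of intensity.
    orders = [
        sorted(range(n_probes), key=lambda k, arr=arr: arr[k]) for arr in arrays
    ]
    # Step 2: average the r-th smallest values across arrays, for each rank r.
    rank_means = [
        sum(arr[order[r]] for arr, order in zip(arrays, orders)) / len(arrays)
        for r in range(n_probes)
    ]
    # Steps 3 and 4: give each probe the mean for its rank, in the original order.
    normalized = []
    for order in orders:
        out = [0.0] * n_probes
        for r, k in enumerate(order):
            out[k] = rank_means[r]
        normalized.append(out)
    return normalized

# Three arrays with four probes each (made-up intensities).
raw = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
norm = quantile_normalize(raw)
```

After normalization every array contains exactly the same set of values, and within each array the probes keep their original rank order.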

28 Example of quantile normalization (figure panels: Original, Ordered, Averaged, Re-ordered)

29 It works!! (figure panels: Unnormalized, Scaling, Quantile Normalization)

30 Before Quantile Normalization

31 After Quantile Normalization A worry is that it overcorrects.

32 Another look at QN Let X have distribution P, with quantile function p, so that \( p^{-1} \) is the distribution function of P. Given a target distribution Q with quantile function q, \( Y = q(p^{-1}(X)) \) has distribution Q.
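A short argument for this statement, assuming P is continuous so that \( U = p^{-1}(X) \) is uniform on (0, 1), and writing \( q^{-1} \) for the distribution function of Q:

```latex
\Pr(Y \le y) = \Pr\bigl(q(U) \le y\bigr) = \Pr\bigl(U \le q^{-1}(y)\bigr) = q^{-1}(y).
```

So Y has distribution function \( q^{-1} \), that is, distribution Q. Quantile normalization applies this map with the empirical distribution of each array in the role of P and the averaged quantiles in the role of q.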

33 Summarization (Median Polish) \( Y_{ij} = m_i + a_j + e_{ij} \), where \( Y_{ij} \) is the normalized probe value for the jth probe on the ith gene chip, \( m_i \) is the expression value on the ith gene chip, \( a_j \) is the probe affinity effect of the jth probe, and \( e_{ij} \) is random noise.
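The model above can be fitted robustly with Tukey's median polish, which alternately sweeps out chip medians and probe medians. The sketch below is a generic version of that procedure, not the implementation used by any specific RMA software. It assumes `y[i][j]` holds the value for probe j on chip i, and it centers the probe effects at median zero so that the chip values are identifiable:

```python
from statistics import median

def median_polish(y, max_iter=10, tol=1e-8):
    """Fit y[i][j] = m[i] + a[j] + e[i][j]; returns (m, a, residuals)."""
    n_chips, n_probes = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall = 0.0
    chip = [0.0] * n_chips
    probe = [0.0] * n_probes
    for _ in range(max_iter):
        # Sweep out the median of each chip (row).
        row_med = [median(r) for r in resid]
        for i in range(n_chips):
            chip[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(probe)
        overall += shift
        probe = [p - shift for p in probe]
        # Sweep out the median of each probe (column).
        col_med = [median(resid[i][j] for i in range(n_chips))
                   for j in range(n_probes)]
        for j in range(n_probes):
            probe[j] += col_med[j]
            for i in range(n_chips):
                resid[i][j] -= col_med[j]
        shift = median(chip)
        overall += shift
        chip = [c - shift for c in chip]
        if max(map(abs, row_med + col_med)) < tol:
            break
    m = [overall + c for c in chip]  # expression value per chip
    return m, probe, resid

# Noise-free example: three chips, five probes (made-up values).
m_true = [8.0, 9.0, 10.0]
a_true = [-1.0, 0.0, 1.0, 0.5, -0.5]
y = [[mi + aj for aj in a_true] for mi in m_true]
m_hat, a_hat, resid = median_polish(y)
```

On this noise-free input the fit recovers `m_true` and `a_true` and leaves zero residuals. With real data the residuals absorb outlying probes, which is why medians are used instead of means.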

34 Parallel Behavior Suggests Multi-chip Model (figure panels: Differentially expressing, Non Differential; PM probe intensity plotted against array)

35 Probe Pattern Suggests Including Probe-Effect (figure panels: Differentially expressing, Non Differential; PM probe intensity plotted against probe number)

36 Also Want Robustness (figure panels: Differentially expressing, Non Differential; PM probe intensity)

37 RMA MAS

38 Summarization (Median Polish)

39 Summarization (Median Polish)
