Seminar Microarray Data Analysis

1 Seminar Microarray Data Analysis: Normalization
Hans-Ulrich Klein, Christian Ruckert, Institut für Medizinische Informatik, WWU Münster, summer semester (SS) 2011

2 Organization
- Normalization
- Identifying differentially expressed genes, experimental design
- Dimension reduction, cluster analysis
- Classification, gene set analysis
- Analysis of survival times

4 Data acquisition

5 Array Types
- Applications: 3' gene expression, whole transcript, promoter, tiling, SNP
- Platforms: GeneChips, Illumina Bead Arrays, Agilent microarrays, spotted red/green microarrays

6 Normalization
Microarray measurements are subject to systematic and random variations. The correction for systematic effects is called normalization.
Relationship between measured intensities and abundances: $y_{ki} = a_{ki} + b_{ki} x_{ki}$
- measured intensity $y_{ki}$ of probe $i$ on array $k$
- true abundance $x_{ki}$
- gain factor $b_{ki}$ (number of cells, hybridization efficiency, label efficiency, detector gain, ...)
- additive term $a_{ki}$ (unspecific hybridization, background fluorescence, detector offset, ...)
- affine-linear dependence between $y_{ki}$ and $x_{ki}$

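7 Multiplicative error model
$a_{ki}$ and $b_{ki}$ cannot be estimated for all arrays and probes.
- $b_{ki} = b_k \beta_i (1 + \varepsilon_{ki})$ with $\varepsilon_{ki} \sim N(0, \sigma_\varepsilon^2)$
- $a_{ki} = 0$ (trust the background estimation of the image analysis software)
$Y_{ki} = a_{ki} + b_{ki} x_{ki} = b_k \beta_i x_{ki} (1 + \varepsilon_{ki}) = b_k m_{ki} (1 + \varepsilon_{ki})$
- $m_{ki} = \beta_i x_{ki}$ is the molecule abundance in a probe-specific unit
- interest in ratios: $\frac{m_{ki}}{m_{li}} = \frac{y_{ki}}{y_{li}} \cdot \frac{b_l}{b_k}$ (up to the error terms)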

8 Variance - expectation dependence (1/2)
In the multiplicative error model, $\mathrm{Var}\,Y_{ki}$ is a quadratic function of $\mathrm{E}\,Y_{ki}$:
$\mathrm{E}\,Y_{ki} = b_k m_{ki}$
$\mathrm{Var}\,Y_{ki} = \mathrm{Var}(b_k m_{ki} (1 + \varepsilon_{ki})) = (b_k m_{ki})^2 \sigma_\varepsilon^2 = (\mathrm{E}\,Y_{ki})^2 \sigma_\varepsilon^2$
Homoscedasticity is an assumption of many downstream analysis methods.
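The quadratic variance-mean relationship can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the function name `simulate_intensities` and all parameter values are hypothetical, chosen only for illustration. If the variance is quadratic in the mean, the coefficient of variation (standard deviation divided by mean) should be close to $\sigma_\varepsilon$ at every abundance level, and the standard deviation of the log intensities should be roughly the same at every level.

```python
import math
import random
from statistics import mean, stdev


def simulate_intensities(b_k, m_ki, sigma_eps, n, rng):
    """Draw n values of Y = b_k * m_ki * (1 + eps) with eps ~ N(0, sigma_eps^2)."""
    return [b_k * m_ki * (1.0 + rng.gauss(0.0, sigma_eps)) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the run is reproducible
sigma_eps = 0.1          # illustrative value, not taken from the slides
results = {}
for m_ki in (100.0, 10000.0):
    y = simulate_intensities(2.0, m_ki, sigma_eps, 20000, rng)
    results[m_ki] = {
        # sd / mean stays near sigma_eps, i.e. the sd grows linearly with the mean
        "cv": stdev(y) / mean(y),
        # after the log transform the spread no longer depends on the mean
        "sd_log": stdev([math.log(v) for v in y]),
    }

for m_ki, r in results.items():
    print(m_ki, round(r["cv"], 3), round(r["sd_log"], 3))
```

Both abundance levels should report values near 0.1 for the two statistics, although the raw standard deviations differ by a factor of about 100.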

9 Variance - expectation dependence (2/2)

10 Variance-stabilizing transformations
Random variables $Y_{ki}$ with $\mathrm{E}\,Y_{ki} = \mu_{ki}$ and $\mathrm{Var}\,Y_{ki} = v(\mu_{ki})$; $h$ is a differentiable function.
$\mathrm{Var}(h(Y_{ki})) \approx h'(\mu_{ki})^2 \, v(\mu_{ki})$
An approximately (first-order) variance-stabilizing transformation is any function $h$ for which the right-hand side is constant.
Search for a function $h$ satisfying $h'(\mu_{ki}) = \frac{1}{\sqrt{v(\mu_{ki})}} = \frac{1}{c\,\mu_{ki}}$, where $v(\mu_{ki}) = c^2 \mu_{ki}^2$ with $c = \sigma_\varepsilon$ in the multiplicative error model.
- the logarithm is a variance-stabilizing transformation in the multiplicative error model
- there are further reasons for log-transforming microarray data
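The step from the condition on $h'$ to the logarithm can be written out explicitly. The following is a short worked derivation, assuming the multiplicative error model of slide 8, where the variance function is $v(\mu) = \sigma_\varepsilon^2 \mu^2$; the shorthand $c = \sigma_\varepsilon$ and the dropped indices are used only for readability.

```latex
h'(\mu) = \frac{1}{\sqrt{v(\mu)}} = \frac{1}{c\,\mu}
\quad\Longrightarrow\quad
h(\mu) = \int \frac{d\mu}{c\,\mu} = \frac{1}{c}\,\log(\mu) + \text{const}

\mathrm{Var}(h(Y)) \approx h'(\mu)^2\, v(\mu)
  = \frac{1}{c^2 \mu^2}\cdot c^2 \mu^2 = 1
```

Constant factors and offsets do not affect the stabilizing property, so $h = \log$ can be used directly.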

11 Logarithmic transformation (1/2)
[Figure: density of the intensities and of the log intensities; intensities and log intensities plotted per array]

12 Logarithmic transformation (2/2)
- no absolute mRNA measurement, hence interest in ratios
- ratios are not symmetric around 1 (the average of 1/2 and 2 is 1.25)
- log ratios are symmetric around 0 (the average of log(1/2) and log(2) is 0)
- most simple models have additive effects
- functional equation of the logarithm: $\log(XY) = \log(X) + \log(Y)$ and $\log(X/Y) = \log(X) - \log(Y)$
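The asymmetry of raw ratios is easy to reproduce numerically. The snippet below is a small illustration using the two ratios from the slide (a two-fold decrease and a two-fold increase); the choice of base 2 for the logarithm is an assumption made here, the slide does not fix a base.

```python
import math
from statistics import mean

ratios = [0.5, 2.0]                       # two-fold down and two-fold up

mean_ratio = mean(ratios)                 # biased towards the up-regulation
mean_log_ratio = mean([math.log2(r) for r in ratios])

print(mean_ratio)        # 1.25
print(mean_log_ratio)    # 0.0
```

On the log scale the two opposite changes cancel, which is what an additive model of "no change on average" expects.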

13 Median normalization
1. select a reference array $r$
2. calculate $\dot{y}_{ki} = \log(y_{ki})$ for all arrays and probes
3. calculate $c_k = \mathrm{median}(\dot{y}_{k1} - \dot{y}_{r1}, \ldots, \dot{y}_{km} - \dot{y}_{rm})$ for all arrays
4. calculate $x_{ki} = \dot{y}_{ki} - c_k$ for all arrays and probes
$c_k$ is an estimate of $\log(b_k / b_r)$ in the multiplicative error model.
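The four steps of median normalization translate almost line by line into code. This is a minimal sketch using only the Python standard library; the function name `median_normalize` and the toy intensities are hypothetical and not part of the slides.

```python
import math
from statistics import median


def median_normalize(arrays, ref=0):
    """Median-normalize raw intensities against a reference array.

    arrays: list of arrays, each a list of positive intensities (same probe order).
    ref:    index of the reference array r.
    Returns (normalized log intensities, list of shifts c_k).
    """
    # Step 2: log-transform all arrays
    logs = [[math.log(v) for v in arr] for arr in arrays]
    ref_log = logs[ref]
    normalized, shifts = [], []
    for arr in logs:
        # Step 3: median of the probe-wise differences to the reference
        c_k = median([a - r for a, r in zip(arr, ref_log)])
        shifts.append(c_k)
        # Step 4: subtract the shift from every probe of this array
        normalized.append([a - c_k for a in arr])
    return normalized, shifts


# Toy data: array 1 is array 0 scanned with a three-fold higher gain.
raw = [[100.0, 200.0, 400.0, 800.0],
       [300.0, 600.0, 1200.0, 2400.0]]
norm, c = median_normalize(raw, ref=0)
print(c)   # shift of array 1 is approximately log(3)
```

In this toy case the shift $c_1$ recovers $\log(b_1/b_0) = \log 3$, and the two normalized arrays coincide.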

14 Limitations of the multiplicative error model
- often, the variance of log-transformed intensities increases as their mean decreases
- non-linearities: the scatterplot of the log-transformed intensities of two samples follows a curved line
- negative values due to background subtraction

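15 Additive and multiplicative error model
$Y_{ki} = a_{ki} + b_{ki} x_{ki}$
- $b_{ki} = b_k \beta_i e^{\eta_{ki}}$ with $\eta_{ki} \sim N(0, \sigma_\eta)$
- $a_{ki} = a_k + \nu_{ki}$ with $\nu_{ki} \sim N(0, \sigma_\nu)$
This leads to the model $Y_{ki} = a_k + b_k m_{ki} e^{\eta_{ki}} + \nu_{ki}$, or, with the additive error expressed in units of $b_k$:
$\underbrace{\frac{Y_{ki} - a_k}{b_k}}_{\text{normalization term}} = \underbrace{m_{ki} e^{\eta_{ki}} + \nu_{ki}}_{\text{mult. and add. error term}}$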

16 Variance-expectation dependence
Relationship between $\mathrm{Var}\,Y_{ki}$ and $\mathrm{E}\,Y_{ki}$ in the additive and multiplicative error model:
$\mathrm{Var}(Y_{ki}) = c^2 (\mathrm{E}(Y_{ki}) - a_k)^2 + b_k^2 \sigma_\nu^2$
Remember: search for a function $h$ satisfying $h'(\mu_{ki}) = \frac{1}{\sqrt{v(\mu_{ki})}}$. This leads to $h = \mathrm{arsinh}$.
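Why the condition on $h'$ leads to the arsinh can be seen by integrating it for the variance function stated on this slide. The following is a worked sketch with the array and probe indices dropped for readability; it uses only the symbols $c$, $a$, $b$ and $\sigma_\nu$ from the variance formula.

```latex
v(\mu) = c^2 (\mu - a)^2 + b^2 \sigma_\nu^2

h(\mu) = \int \frac{d\mu}{\sqrt{c^2 (\mu - a)^2 + b^2 \sigma_\nu^2}}
       = \frac{1}{c}\,\mathrm{arsinh}\!\left(\frac{c\,(\mu - a)}{b\,\sigma_\nu}\right) + \text{const}

\text{Check: }\quad
\frac{d}{d\mu}\,\frac{1}{c}\,\mathrm{arsinh}\!\left(\frac{c\,(\mu - a)}{b\,\sigma_\nu}\right)
 = \frac{1}{b\,\sigma_\nu}\cdot\frac{1}{\sqrt{\frac{c^2 (\mu - a)^2}{b^2 \sigma_\nu^2} + 1}}
 = \frac{1}{\sqrt{v(\mu)}}
```

Up to constant factors, the stabilizing transformation is therefore an arsinh of the shifted and scaled intensity.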

17 arsinh
$\mathrm{arsinh}(x) = \log(x + \sqrt{x^2 + 1})$
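The closed form can be compared with the library implementation, and it also shows why the arsinh behaves like a logarithm for large intensities while staying defined at zero and for negative values. This is a small numerical check; the helper name `arsinh_via_log` is hypothetical.

```python
import math


def arsinh_via_log(x):
    """arsinh computed from its closed form log(x + sqrt(x^2 + 1))."""
    return math.log(x + math.sqrt(x * x + 1.0))


for x in (-2.0, 0.0, 0.5, 3.0, 1000.0):
    # math.asinh is the standard-library inverse hyperbolic sine
    print(x, arsinh_via_log(x), math.asinh(x))

# For large x the difference to log(2x) vanishes
large_x_gap = math.asinh(1000.0) - math.log(2.0 * 1000.0)
```

Unlike the logarithm, the function is finite at $x = 0$ and for $x < 0$, which matters for background-subtracted intensities.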

18 Variance stabilizing normalization
The transformation $h(Y_{ki}) = \mathrm{arsinh}\left(\frac{Y_{ki} - a_k}{b_k}\right) = \mu_{ki} + \varepsilon_{ki}$ is called variance stabilizing normalization (in the additive and multiplicative error model).
- set $\mu_{ki} = \mu_i$
- estimate $a_k$ and $b_k$ using least trimmed sum of squares (LTS)

19 Variance stabilizing normalization

20 Quantile normalization
- rank-based method
- make the empirical distributions of all arrays equal
- no underlying model
- assumption: intensities on each chip originate from the same distribution

21-24 Quantile normalization - example: array 1, array 2 and their mean (values not preserved)

25 Quantile normalization

26 Quantile normalization
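The rank-based procedure of slide 20 can be sketched in a few lines: sort each array, average the values at each rank across arrays, and give every probe the mean that belongs to its rank. This is a minimal sketch with a hypothetical function name and made-up toy values (the numbers of the original example are not available); ties are not treated specially, which is a simplification.

```python
def quantile_normalize(arrays):
    """Quantile-normalize arrays of equal length (no special handling of ties)."""
    n_probes = len(arrays[0])
    # Mean of the r-th smallest value across all arrays, for every rank r
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n_probes)]
    normalized = []
    for a in arrays:
        # Probe indices ordered by increasing intensity on this array
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


array_1 = [5.0, 2.0, 3.0]
array_2 = [4.0, 1.0, 6.0]
qn = quantile_normalize([array_1, array_2])
print(qn)   # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both arrays contain exactly the same set of values, so their empirical distributions are identical, while the ranking of the probes within each array is unchanged.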

28 Custom cDNA arrays

29 Array normalization
1. Background correction
2. Within-array normalization, e.g. median, lowess (including log transformation)
   - process each array separately
   - calculate differences between red and green intensities: $M_{ki} = \log(R_{ki}) - \log(G_{ki})$, $A_{ki} = (\log(R_{ki}) + \log(G_{ki}))/2$
3. Between-array normalization (optional) of the M-values, e.g. quantile normalization
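The M and A values defined above can be computed directly from the red and green intensities. The snippet is a minimal sketch; the function name `ma_values`, the toy intensities, and the use of base-2 logarithms are assumptions made for illustration (the slide only writes "log").

```python
import math


def ma_values(red, green):
    """Return (M, A): log ratios and average log intensities per spot."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2.0 for r, g in zip(red, green)]
    return m, a


red = [400.0, 100.0, 50.0]
green = [100.0, 100.0, 200.0]
M, A = ma_values(red, green)
print(M)   # approximately [2, 0, -2]
print(A)
```

A four-fold excess of red gives $M = 2$, equal intensities give $M = 0$, and a four-fold excess of green gives $M = -2$, so opposite changes are symmetric around zero.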

30 Background estimation
- foreground and background intensities
- GenePix Standard
- GenePix Morph (since version 6.0)
- Normexp (+ offset)
- VSN

31 Lowess

32 Lowess normalization
- robust locally-weighted polynomial regression
- red and green intensities may not be related by a constant factor
- $h_k\left(\frac{R_{ki}}{G_{ki}}\right) = \log\left(\frac{R_{ki}}{G_{ki}}\right) - f_k\left(\log\sqrt{R_{ki} G_{ki}}\right)$, i.e. $M_{ki} - f_k(A_{ki})$
- $f_k$ is the lowess regression curve of the MA scatter plot
- degree of the polynomial
- bandwidth or smoothing parameter
- fit is done using robust weighted least squares
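The idea of subtracting a smooth curve $f_k(A)$ from the M-values can be illustrated with a simplified smoother. The sketch below is not the full lowess procedure: it fits a local linear regression with tricube weights and omits the robustness iterations, and all names (`tricube`, `local_linear_fit`, `lowess_like_normalize`) and the toy data are hypothetical.

```python
import math


def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 for |u| < 1, otherwise 0."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(a, m, frac=0.5):
    """Fitted values of a locally weighted linear regression of m on a."""
    n = len(a)
    k = min(n, max(3, math.ceil(frac * n)))   # points in each neighbourhood
    fitted = []
    for a0 in a:
        # Bandwidth: distance from a0 to its k-th nearest neighbour
        h = sorted(abs(x - a0) for x in a)[k - 1]
        w = [tricube((x - a0) / h) if h > 0 else 1.0 for x in a]
        sw = sum(w)
        mean_a = sum(wi * x for wi, x in zip(w, a)) / sw
        mean_m = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mean_a) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mean_a) * (y - mean_m)
                  for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted


def lowess_like_normalize(m, a, frac=0.5):
    """Subtract the fitted intensity-dependent trend f(A) from the M-values."""
    fit = local_linear_fit(a, m, frac)
    return [mi - fi for mi, fi in zip(m, fit)]


# Toy MA data with a purely intensity-dependent dye bias and no real signal
A_vals = [6.0 + 0.5 * i for i in range(12)]
M_vals = [0.3 * x - 2.0 for x in A_vals]
M_norm = lowess_like_normalize(M_vals, A_vals)
print(max(abs(v) for v in M_norm))   # close to 0
```

The smoothing parameter `frac` plays the role of the bandwidth on the slide: a larger value gives a stiffer curve, a smaller value follows the MA scatter more closely.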

34 Perfect match (PM) and mismatch (MM) probes; probes form a probe set

35 Normalization of GeneChip arrays consists of 3 steps:
1. background correction
2. normalization
3. summarization
Half of the probes on a GeneChip are MM probes; the new Gene ST and Exon arrays are PM only.

36 PM vs MM plot

37 RMA background correction
- $PM_{kij} = b_{kij} + s_{kij}$
- assume that $b_{kij} \sim N(\mu_k, \sigma_k)$ and $s_{kij} \sim \mathrm{Exp}(\alpha_k)$
- estimate $\mu_k$, $\sigma_k$ and $\alpha_k$ from all PM probes on the array
- use the transformation $B(PM_{kij}) = \mathrm{E}(s_{kij} \mid PM_{kij})$
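Under the model on this slide the conditional expectation has a closed form: given the observed PM value, the signal follows a normal distribution with mean $PM - \mu - \sigma^2 \alpha$ and standard deviation $\sigma$, truncated to positive values. The sketch below evaluates the mean of that truncated normal. It assumes that $\sigma$ is the standard deviation of the background and $\alpha$ the rate of the exponential, and it takes the three parameters as given; their estimation, and additional correction terms used in some published implementations, are not covered. The function names and parameter values are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def rma_background_correct(pm, mu, sigma, alpha):
    """E(s | PM = pm) for PM = b + s, b ~ N(mu, sigma^2), s ~ Exp(rate alpha)."""
    a = pm - mu - sigma ** 2 * alpha      # mean of the untruncated posterior
    z = a / sigma
    # Mean of a normal(a, sigma^2) truncated to s > 0
    return a + sigma * normal_pdf(z) / normal_cdf(z)


mu, sigma, alpha = 100.0, 20.0, 0.01      # illustrative values only
corrected = [rma_background_correct(pm, mu, sigma, alpha)
             for pm in (50.0, 100.0, 200.0, 1000.0)]
print(corrected)
```

The corrected value stays positive even when the PM intensity lies below the estimated background mean, and for bright probes it approaches $PM - \mu - \sigma^2 \alpha$.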

38 RMA quantile normalization
based on all PM probes

39 RMA summarization
- starting with background-adjusted, normalized (and log-transformed) PM intensities $Y_{kij}$
- linear additive model $Y_{kij} = \mu_{ki} + \alpha_{ij} + \varepsilon_{kij}$
- constraint: $\sum_j \alpha_{ij} = 0$
- fit the model with the median polish algorithm (more robust than standard ANOVA): $Y_{kij} = \mu_i + \alpha_{ij} + \beta_{ki} + \varepsilon_{kij}$, so that $\mu_{ki} = \mu_i + \beta_{ki}$
- use the estimated $\hat{\mu}_{ki}$ as scaled expression level of the $i$th probe set on array $k$
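Median polish for a single probe set can be sketched as alternating row and column median sweeps over the arrays-by-probes matrix of log intensities. The code below is a minimal illustration with a hypothetical function name and toy data; it runs a fixed number of sweeps instead of testing convergence, and it centres the probe effects by their median rather than enforcing the sum constraint exactly.

```python
from statistics import median


def median_polish(y, n_iter=10):
    """Decompose y[k][j] = overall + row_eff[k] + col_eff[j] + resid[k][j].

    Rows are arrays k, columns are the probes j of one probe set.
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [list(row) for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Row sweep: move each row median into the array effect
        for k in range(n_rows):
            med = median(resid[k])
            row_eff[k] += med
            resid[k] = [v - med for v in resid[k]]
        med = median(col_eff)
        overall += med
        col_eff = [c - med for c in col_eff]
        # Column sweep: move each column median into the probe effect
        for j in range(n_cols):
            med = median([resid[k][j] for k in range(n_rows)])
            col_eff[j] += med
            for k in range(n_rows):
                resid[k][j] -= med
        med = median(row_eff)
        overall += med
        row_eff = [r - med for r in row_eff]
    return overall, row_eff, col_eff, resid


# Toy probe set: 3 arrays x 3 probes, additive structure plus one outlier
probe_eff = [-1.0, 0.0, 1.0]
array_level = [5.0, 6.0, 7.0]
Y = [[lvl + p for p in probe_eff] for lvl in array_level]
Y[1][2] += 10.0                            # a single corrupted probe

overall, beta, alpha, resid = median_polish(Y)
expression = [overall + b for b in beta]   # estimated mu_ki per array
print(expression)                          # [5.0, 6.0, 7.0]
```

The corrupted probe ends up entirely in the residuals, so the summarized expression levels are unaffected, which illustrates the robustness mentioned on the slide.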

41 Normalization open question
What is the best pre-processing algorithm? [...] One method, robust multi-array average (RMA), corrects arrays for background using a transformation, normalizes them using a formula that is based on a normal distribution, and uses a linear model to estimate expression values on a log scale. RMA and a modification of this method, GCRMA, often perform as well or better than competitors, although there is some controversy about which method is best. It is also unclear whether there is an ideal way of defining which method produces the best results. (Allison et al., Nature Reviews Genetics, 2006)

42 Batch effects
Batch effects are experimental factors that add systematic biases to the measurements and vary between different subsets or stages of an experiment. Examples are:
- spotting
- PCR amplification
- sample preparation protocols
- array coating
- scanner and image analysis
Consider batch effects in the experiment design.

43 Batch effects consensus point
Avoiding confounding by extraneous factors is crucial. Microarray measurements can be greatly influenced by extraneous factors. If such factors covary with the independent variable (for example, with different treatments that are applied to two sets of samples), this might confound the study and yield erroneous conclusions. Therefore it is crucial that such factors are minimized or, ideally, eliminated. For example, arrays should be used from a single batch and processed by one technician on the same day. (Allison et al., Nature Reviews Genetics, 2006)
