Data Preprocessing


Normalization: the process of removing sample-to-sample variations in the measurements that are not due to differential gene expression, bringing measurements from the different hybridizations to a common and convenient scale.
Filtering: elimination of variables/genes whose expression variability across samples is below the instrument precision.

Sources of variation in the data
Interesting variation: e.g., differentially expressed genes between disease and normal tissues.
Obscuring variation:
Technical: sample preparation (RNA extraction), manufacture of the arrays, processing of the arrays (IVT), temperature in the lab, instrument (scanner) precision, different platforms.
Biological: different growth conditions, heterogeneity of samples, stochastic nature of biology.

Steps that introduce variation (noise)
[Figure: processing steps starting from tissues; source: N Engl J Med, 354: 2463]

Measuring variations
Analysis of duplicated/replicated experiments at different levels can be used to assess the different sources of variation.
Biological replicates: samples from the same biological state.
Technical replicates: a single sample split into several parts. The split can be done at different stages of the protocol.
Biological variation >> technical variation.

cDNA normalization
cDNA and oligonucleotide arrays have different normalization needs due to different sources of noise.
cDNA: since the two dyes (Cy3 and Cy5) are unbalanced (different efficiency), each channel is normalized separately and the channels are then combined.
Normalization methods for the two technologies share similar concepts, but tools are often dedicated to a single technology.

Input: raw signal matrix
CEL files (probe-level data) -> signal extraction from single experiments (e.g., using MAS5) -> matrix of genes (probesets) g_1, g_2, ..., g_m by experiments e_1, e_2, ..., e_n -> normalization.

Data acquisition and preprocessing are often united
Data acquisition: microarray processing. Data preprocessing: scaling/normalization/filtering.
Common signal extraction methods such as dChip, RMA (and GCRMA) combine signal extraction and preprocessing.
Pros: normalization can be done at the probe level; statistics on a set of samples can be used to identify outlier probes.
Cons: generates a dependency between the samples. Example: adding or removing samples requires rerunning the signal extraction.

Scaling
Common sources of variation yield readings at different scales. This can be caused by hybridizing different amounts of RNA, different efficiency of the labels, or different scanning conditions.
The distortion can be non-linear (e.g., due to saturation effects).
Note: cDNA chips often encounter stronger non-linear distortions.

Example
10 biological replicates: MCF7 cells grown in 0.1% DMSO, taken from the Connectivity Map experiment (Lamb, Science 2006), run on Affymetrix U133A chips. Signal extracted with MAS5.
[Figure: histogram of mean expression across samples, showing the fold difference between extremes]
[Figure: comparing samples using a scatter plot, sample #7 vs. sample #5]

Normalization/scaling methods
Linear scaling
Invariant set normalization
Quantile normalization
cDNA: non-linear scaling (loess, splines, etc.)

Normalization methods: linear scaling
Assumption: same overall chip intensity across samples.
Transformation: fitting a linear relationship with zero intercept.
Reference sample: typically the one with the median mean value.
$x'_i = x_i \frac{Y}{X}, \quad i = 1, 2, \ldots, p$, where $X = \mathrm{mean}(x)$ for the sample x being scaled and $Y = \mathrm{mean}(y)$ for the reference sample y.
[Figure: scatter plots before and after scaling, with the line y = x]

Normalization methods: invariant set normalization (used by dChip)
Assumption: many genes are unchanged and the transformation is linear.
Method: identify genes whose ranks are relatively constant (e.g., std. of rank < 10). Use the mean of these genes to linearly scale the samples. Repeat several times until convergence.
Used by dChip at the probe level.
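The linear scaling rule above can be sketched in a few lines. This is a minimal illustration, not the code used for the slides: the function name `linear_scale`, the toy values, and the tie-breaking rule for an even number of samples (pick the sample whose mean is closest to the median of means) are assumptions.

```python
from statistics import mean, median

def linear_scale(samples):
    """Scale each sample so that its mean equals the reference sample's mean.

    samples: list of lists, one list of expression values per array.
    Reference: the sample whose mean is the median of all sample means
    (assumption: with an even number of samples, the closest one is used).
    """
    means = [mean(s) for s in samples]
    target = median(means)
    ref = min(range(len(samples)), key=lambda k: abs(means[k] - target))
    ref_mean = means[ref]
    # x'_i = x_i * Y / X, with X = mean(x) and Y = mean(reference)
    return [[x * ref_mean / m for x in s] for s, m in zip(samples, means)]

# Toy data (made up): three arrays that differ only by a scale factor.
arrays = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [4.0, 8.0, 12.0]]
scaled = linear_scale(arrays)
```

After scaling, all three toy arrays share the mean of the middle one. The sketch assumes strictly positive sample means.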

Normalization methods: quantile normalization (used by RMA)
Assumption: measured expression grows monotonically with the true level of expression.
Method (RMA uses quantile normalization at the probe level): transform the data so that the quantile-quantile plot for any two arrays is the straight identity line. Take the mean quantile (across samples) and substitute it as the value of the data item in the original dataset.
Steps on the probes x samples matrix: sort each column; replace each row with the row average; rearrange each column in its original order.
[Figure: the three steps on an example matrix; color corresponds to rank]

MAS5: Std(log2) vs. Mean(log2)
Log2 transformation: noise is nearly constant at all levels.
[Figure: std of log vs. mean of log]
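The sort / average / rearrange steps of quantile normalization can be written out directly. The sketch below is illustrative only: `quantile_normalize` and the toy matrix are made up, ties are broken by row order (an assumption; real implementations often average tied ranks), and it is not the RMA implementation.

```python
def quantile_normalize(matrix):
    """matrix[i][j] = value of probe i in sample j. Returns a new matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # For each column, the row indices ordered from smallest to largest value.
    order = [sorted(range(n_rows), key=lambda i, j=j: matrix[i][j])
             for j in range(n_cols)]
    # Mean of the r-th smallest value across all samples (the "row average").
    rank_means = [sum(matrix[order[j][r]][j] for j in range(n_cols)) / n_cols
                  for r in range(n_rows)]
    # Put the averaged quantiles back in each column's original order.
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for r, i in enumerate(order[j]):
            out[i][j] = rank_means[r]
    return out

# Toy probes x samples matrix (made-up values, no ties within a column).
raw = [[5.0, 4.0, 3.0],
       [2.0, 1.0, 4.0],
       [3.0, 3.0, 6.0],
       [4.0, 2.0, 8.0]]
normalized = quantile_normalize(raw)
```

After the transformation every column contains exactly the same set of values, so the quantile-quantile plot of any two columns is the identity line.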

RMA: Std(log2) vs. Mean(log2)
RMA: less variation, more constant across values.
[Figure: std of log vs. mean of log]

Normalization assumptions summary
Same overall intensity (or same distribution) for different arrays.
Measured expression grows linearly (or at least monotonically) with the true level of expression.
Gene-specific noise is multiplicative (additive in the log scale). The log2 transform makes the noise independent of the mean.
We typically use RMA or MAS5.
Danger: make sure these assumptions hold in your experiment. For example, stem cells have higher overall expression than differentiated cells.

Questions

Why filtering?
Small N, large P implies vulnerability to overfitting (modeling noise).
Try to reduce the number of hypotheses tested (and therefore the number of false negatives).

Filtering methods
Variation filters based on a simple threshold: select only genes that vary more than a given minimum (e.g., genes with $s^2 > \tau_s$, or $\mathrm{MAD} > \tau_{MAD}$, or $\mathrm{CV} > \tau_{CV}$, etc.).
Variation filters based on a noise envelope: define a noise envelope based on replicates and select genes whose variation is larger than the envelope.
Gene selection based on reproducibility: requires duplicates.
[Figure: std of log vs. mean of log for RMA and MAS5; std vs. mean]
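A simple-threshold variation filter might look like the following sketch. The function names, the dictionary layout, and the thresholds are hypothetical; the slides do not prescribe particular values, and the CV computation assumes strictly positive mean expression.

```python
from statistics import mean, median, stdev

def mad(values):
    """Median absolute deviation from the median."""
    med = median(values)
    return median([abs(v - med) for v in values])

def variation_filter(expr, min_mad=None, min_cv=None):
    """Keep genes whose variation exceeds every threshold that is given.

    expr: dict mapping gene name -> list of expression values across samples.
    """
    kept = []
    for gene, values in expr.items():
        if min_mad is not None and mad(values) <= min_mad:
            continue
        if min_cv is not None and stdev(values) / mean(values) <= min_cv:
            continue
        kept.append(gene)
    return kept

# Toy data (made up): one nearly constant gene and one variable gene.
expr = {"flat": [100.0, 101.0, 99.0, 100.0],
        "variable": [50.0, 200.0, 80.0, 400.0]}
kept_by_mad = variation_filter(expr, min_mad=10.0)
kept_by_cv = variation_filter(expr, min_cv=0.5)
```

Both thresholds keep only the variable gene in this toy example.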

Filtering based on noise envelope
Here: a set of 14 Stratagene samples.
Estimate a noise envelope based on replicate data: $\log(\sigma_g) = \alpha + \beta \log(\mu_g) + \varepsilon, \quad \varepsilon \sim N(0, \sigma)$.
Compute the gene-specific mean and stdev on the data to be filtered.
Superimpose the envelope on the data to be filtered and ask, for each gene, whether $\log(\sigma_g^{(new)})$ lies above the 95% envelope around $\hat{\alpha} + \hat{\beta} \log(\mu_g^{(new)})$.
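One way to read the noise-envelope procedure is as an ordinary least-squares fit of log(stdev) on log(mean) over replicate data, followed by a one-sided 95% bound. The sketch below makes that reading explicit. It is an assumption about the details: the slide only states the model and "95%", so the one-sided normal quantile z = 1.645, the function names, and the toy numbers are all illustrative. It requires at least three genes with positive mean and positive stdev.

```python
import math
from statistics import mean, stdev

def fit_envelope(replicates):
    """Fit log(sigma_g) = alpha + beta * log(mu_g) + eps by least squares.

    replicates: list of per-gene value lists measured on replicate samples.
    Returns (alpha, beta, sigma), where sigma is the residual std of eps.
    """
    xs = [math.log(mean(g)) for g in replicates]
    ys = [math.log(stdev(g)) for g in replicates]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha = y_bar - beta * x_bar
    resid = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in resid) / (len(resid) - 2))
    return alpha, beta, sigma

def passes_envelope(values, alpha, beta, sigma, z=1.645):
    """True if the gene's log-stdev lies above the upper envelope (assumed one-sided 95%)."""
    bound = alpha + beta * math.log(mean(values)) + z * sigma
    return math.log(stdev(values)) > bound

# Toy replicate data (made up): stdev is roughly 10% of the mean.
replicates = [[90.0, 100.0, 110.0], [176.0, 200.0, 224.0],
              [364.0, 400.0, 436.0], [712.0, 800.0, 888.0],
              [1440.0, 1600.0, 1760.0]]
alpha, beta, sigma = fit_envelope(replicates)
noisy_only = passes_envelope([95.0, 100.0, 105.0], alpha, beta, sigma)
real_signal = passes_envelope([50.0, 100.0, 150.0], alpha, beta, sigma)
```

A gene whose variation matches the replicate noise is filtered out, while one that varies far more than the replicates at the same mean level is kept.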

Filtering based on duplicates
Look at duplicates (sample pairs).
Select genes whose expression across duplicates correlates best.

Filtering based on reproducibility
For a given gene i (duplicate pair), i = 1, ..., n:

                      experiment 1   experiment 2   ...   experiment N
gene i (duplicate 1)  g_11           g_12           ...   g_1N
gene i (duplicate 2)  g_21           g_22           ...   g_2N

[Figure: duplicate 2 plotted against duplicate 1 across experiments; panels contrast good and bad spread and good and bad correlation]

F statistic: maximizing correlation and spread
$F = \frac{B}{W} \overset{?}{>} 1$
Between-groups variation: $B = \frac{\sum_{i \in \#Groups} \sum_{j \in Group_i} (\bar{g}_i - \bar{g})^2}{\#Groups - 1}$
Within-group variation: $W = \frac{\sum_{i \in \#Groups} \sum_{j \in Group_i} (g_{ij} - \bar{g}_i)^2}{\#Samples - \#Groups}$
where $\bar{g}$ is the overall mean and $\bar{g}_i$ is the mean of group i.
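The F statistic above can be computed per gene as in the sketch below. It assumes that each group is one experiment and that its members are the duplicate measurements of the gene in that experiment; that reading, the function name, and the toy values are assumptions made for illustration. A gene with zero within-group variation would need special handling, which the sketch omits.

```python
from statistics import mean

def f_statistic(groups):
    """F = B / W for one gene.

    groups: list of lists; each inner list holds the duplicate measurements
    of the gene in one experiment (assumed grouping).
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Between-groups variation: each sample contributes (group mean - overall mean)^2.
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, group_means)) / (len(groups) - 1)
    # Within-group variation: spread of duplicates around their group mean.
    within = sum((v - m) ** 2
                 for g, m in zip(groups, group_means)
                 for v in g) / (len(all_values) - len(groups))
    return between / within

# Toy genes (made up): duplicates agree for the first, disagree for the second.
reproducible = [[1.0, 1.2], [5.0, 5.2], [9.0, 9.2]]
irreproducible = [[1.0, 9.0], [5.0, 1.2], [9.2, 5.2]]
f_good = f_statistic(reproducible)
f_bad = f_statistic(irreproducible)
```

The reproducible gene gets an F value far above 1, while the irreproducible one falls below 1 and would be filtered out.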

Best/worst markers
[Figure: best and worst markers on the DLBCL dataset; Blood, 105(5)]

Questions

Visualization
All steps can benefit from visualization.

Visualization
Heat map
PCA
SVD
MDS
NMF (discussed in the clustering session)

Visualization of E: heatmap
Raw data (RMA, log2 scale), $E = (e_{ij})$.
[Figure: heatmap of the 500 most variable genes (clustered) across ALL, MLL, and AML samples; leukemia data from Armstrong et al., Nat. Genet. (2002)]

AML/MLL/ALL heatmap
Row centered and normalized data: $X = (x_{ij})$ with $x_{ij} = \frac{e_{ij} - \mu_i}{\sigma_i}$.
[Figure: heatmap of the 500 most variable genes (clustered) across ALL, MLL, and AML samples]
Pro: clearly see differential expression.
Con: lose absolute value.

AML/MLL/ALL 3D heatmap
Raw data for both the z-axis and the color.
[Figure: 3D heatmap across ALL, MLL, and AML samples]
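Row centering and normalization, as used for the heatmap above, amounts to a per-gene z-score. A minimal sketch follows; the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the slide does not say which estimator is meant, and rows with zero variance are not handled.

```python
from statistics import mean, stdev

def standardize_rows(E):
    """x_ij = (e_ij - mu_i) / sigma_i for every row (gene) i of E."""
    X = []
    for row in E:
        mu, sigma = mean(row), stdev(row)
        X.append([(e - mu) / sigma for e in row])
    return X

# Toy matrix (made up): two genes at very different absolute levels.
E = [[1.0, 2.0, 3.0],
     [10.0, 20.0, 60.0]]
X = standardize_rows(E)
```

After the transformation both rows are on the same scale, which is why differential expression is easy to see while the absolute level is lost.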

AML/MLL/ALL 3D heatmap
Raw data for the z-axis; row centered and normalized data for the color.
[Figure: 3D heatmap across ALL, MLL, and AML samples]

Dimensionality reduction
Our brain is a good pattern recognition tool.
Problem: we are used to handling only 2 (or 3) dimensions.
Solution: dimensionality reduction.

Visualization of genes or samples
Aim: given $x_{1:n} \in R^d$, find $y_{1:n} \in R^k$ (typically $k \ll d$) that capture some properties of $x_{1:n}$.
Methods:
Principal Component Analysis (PCA): projection onto a low-dimensional hyperplane.
Singular Value Decomposition (SVD): approximate a matrix by a sum of simple matrices.
Multi-dimensional Scaling (MDS): map points such that distances are preserved.
Graph layout methods (e.g., springs and charges).
Independent Component Analysis (ICA).
Projection pursuit.
Caution: our brain also tends to find patterns in random data (over-fitting).

Principal Component Analysis (PCA)
Aim: find a low k-dimensional hyperplane on which the variation of the projected data is maximal.
$V = (v_1 \; v_2)$, $\quad y_i = V^T (x_i - \mu)$, $\quad \mu = \sum_i x_i / n$, $\quad J(V) = \sigma_1^2 + \sigma_2^2$
Matrix multiplication: $A_{n \times m} B_{m \times k} = C_{n \times k}$, with $c_{ij} = \sum_{\alpha=1}^{m} a_{i\alpha} b_{\alpha j}$.
Objective: find V that maximizes J(V).
Equivalent to: find the low k-dimensional hyperplane such that the projected data best approximate the original data.

Incremental building of the k principal components
Algorithm (not the one actually used):
Loop i from 1 to k:
  Find the direction $v_i$ along which the variance $\sigma_i^2$ is maximal.
  Remove from each point its projection on $v_i$.
Principal components $V = (v_1, \ldots, v_k)$, captured variances $\{\sigma_i^2\}_{1:k}$.
The projected data: $y_i = V^T (x_i - \mu)$.
The fraction of variance captured by the principal components, $c_k$, measures how well the projected data approximate the original data:
$s_k = \sum_{i=1,\ldots,k} \sigma_i^2, \quad c_k = s_k / s_d$
[Figure: captured variance c_k vs. k = # of PCs]

PCA of leukemia samples
[Figure: input is the genes x samples matrix (AML, MLL, ALL); output shows the samples along v_1, v_2, v_3]
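The incremental algorithm above can be sketched with plain lists. To find each maximal-variance direction the sketch uses power iteration, which is a choice made here for illustration and not something the slides specify; the iteration count, the random seed, the function names, and the toy data are likewise assumptions. Variances are divided by n, which does not affect the ratio c_k.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leading_direction(points, iters=200, seed=0):
    """Unit vector along which the (already centered) points vary most.

    Power iteration on X^T X; `iters` and `seed` are illustrative choices.
    """
    rng = random.Random(seed)
    d = len(points[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        proj = [dot(p, v) for p in points]
        w = [sum(c * p[j] for c, p in zip(proj, points)) for j in range(d)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left in the residual data
            break
        v = [x / norm for x in w]
    return v

def incremental_pca(data, k):
    n, d = len(data), len(data[0])
    mu = [sum(x[j] for x in data) / n for j in range(d)]
    resid = [[x[j] - mu[j] for j in range(d)] for x in data]
    s_d = sum(dot(r, r) for r in resid) / n  # total variance
    components, variances = [], []
    for _ in range(k):
        v = leading_direction(resid)                # direction of maximal variance
        proj = [dot(r, v) for r in resid]
        variances.append(sum(c * c for c in proj) / n)
        components.append(v)
        # Remove from each point its projection on v.
        resid = [[r[j] - c * v[j] for j in range(d)] for r, c in zip(resid, proj)]
    captured, s_k = [], 0.0
    for var in variances:
        s_k += var
        captured.append(s_k / s_d)                  # c_k = s_k / s_d
    return mu, components, variances, captured

# Toy data (made up): five points lying close to a line in 3-D.
data = [[1.0, 2.0, 0.1], [2.0, 4.1, -0.1], [3.0, 5.9, 0.1],
        [4.0, 8.1, -0.1], [5.0, 10.0, 0.0]]
mu, V, variances, captured = incremental_pca(data, k=2)
```

For this toy data the first component already captures almost all of the variance, and the two directions come out orthogonal because each projection is removed before the next direction is searched.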

Pitfalls of PCA
Largest variance is not necessarily most informative. Example: 2 pancakes, where the interesting direction differs from the direction with largest variance.
Structure in the low-dimensional space implies there is structure in the full space, but NOT the converse.

Singular Value Decomposition (SVD)
Aim: find the best approximation for E by a sum of K rank-1 matrices (meta-sample x meta-gene).
[Figure: the ALL/MLL/AML data matrix written as a sum of SVD components]

SVD of leukemia data
[Figure: the data matrix (AML, ALL, MLL samples) shown as the sum $s_1 v_1 u_1^T + s_2 v_2 u_2^T + s_3 v_3 u_3^T$]

Singular Value Decomposition (SVD)
Aim: find the best approximation for E by a sum of K rank-1 matrices (meta-sample x meta-gene):
$E \approx \sum_{i=1:K} s_i v_i u_i^T$, where $\{v_i\}$, $\{u_i\}$ are orthogonal unit vectors.
Objective function: $J(\{s_i\},\{v_i\},\{u_i\}) = \sum_{ij} \left( e_{ij} - \sum_{\alpha=1:K} s_\alpha v_{i\alpha} u_{j\alpha} \right)^2$, where $v_{i\alpha}$ is the i-th entry of $v_\alpha$ and $u_{j\alpha}$ is the j-th entry of $u_\alpha$.
Method: unique solution based on diagonalizing $E E^T$.
Note: $\{v_i\}$ are the same as in PCA of the samples if the genes are centered.
Clustering: identify elements with large absolute value as members of clusters.
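The sum-of-rank-1-matrices view of the SVD can be illustrated by extracting one term at a time and subtracting it from the matrix. The sketch uses alternating power iteration instead of diagonalizing $E E^T$, which is a substitution made here to keep the example dependency-free; the function names, iteration count, seed, and the 2 x 2 toy matrix are assumptions.

```python
import math
import random

def rank_one(E, iters=500, seed=0):
    """Leading term s * v u^T of E via alternating power iteration."""
    rng = random.Random(seed)
    m, n = len(E), len(E[0])
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s, v = 0.0, [0.0] * m
    for _ in range(iters):
        v = [sum(E[i][j] * u[j] for j in range(n)) for i in range(m)]  # v <- E u
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_v == 0.0:            # E is (numerically) zero
            return 0.0, [0.0] * m, [0.0] * n
        v = [x / norm_v for x in v]
        u = [sum(E[i][j] * v[i] for i in range(m)) for j in range(n)]  # u <- E^T v
        s = math.sqrt(sum(x * x for x in u))
        if s == 0.0:
            return 0.0, v, [0.0] * n
        u = [x / s for x in u]
    return s, v, u

def truncated_svd(E, K):
    """Return K terms (s_i, v_i, u_i) and the residual E - sum_i s_i v_i u_i^T."""
    resid = [row[:] for row in E]
    terms = []
    for _ in range(K):
        s, v, u = rank_one(resid)
        terms.append((s, v, u))
        resid = [[resid[i][j] - s * v[i] * u[j] for j in range(len(u))]
                 for i in range(len(v))]
    return terms, resid

# Toy matrix (made up); its singular values are sqrt(40) and sqrt(10).
E = [[4.0, 0.0], [3.0, -5.0]]
terms, resid = truncated_svd(E, K=2)
singular_values = [s for s, _, _ in terms]
```

With K equal to the rank of the toy matrix the residual vanishes; with a smaller K the residual is what the objective function J sums over.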

Multi-dimensional Scaling (MDS)
Aim: find a low k-dimensional representation of the data that best preserves the distance matrix of the original data.
$\delta_{ij} = \|x_i - x_j\|$, $\quad d_{ij} = \|y_i - y_j\|$, where each $x_i$ is mapped to $y_i$.
Objective: find $y_{1:n}$ that minimize $J(\delta_{ij}, d_{ij})$, which measures how well $d_{ij}$ approximates $\delta_{ij}$.
Method: gradient descent.

Objective functions
Different ways to measure similarity between distance matrices:
Emphasize large differences: $J_{ee} = \frac{\sum_{i<j} (d_{ij} - \delta_{ij})^2}{\sum_{i<j} \delta_{ij}^2}$
Emphasize fractional differences: $J_{ff} = \sum_{i<j} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right)^2$
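The two objective functions can be evaluated for any candidate layout as sketched below. The helper names and the three toy points are made up; the low-dimensional layout here is simply the original points with their last coordinate dropped, not an optimized MDS solution. `math.dist` requires Python 3.8 or later, and J_ff assumes no two original points coincide.

```python
import math

def pairwise_distances(points):
    """Euclidean distance for every pair i < j, keyed by (i, j)."""
    n = len(points)
    return {(i, j): math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)}

def j_ee(delta, d):
    """Emphasizes large differences."""
    num = sum((d[p] - delta[p]) ** 2 for p in delta)
    return num / sum(v ** 2 for v in delta.values())

def j_ff(delta, d):
    """Emphasizes fractional differences."""
    return sum(((d[p] - delta[p]) / delta[p]) ** 2 for p in delta)

# Toy data (made up): three points in 3-D and a 2-D layout of them.
x = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
y = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
delta = pairwise_distances(x)   # original distances: 5, 12, 13
d = pairwise_distances(y)       # layout distances:   5, 0, 5
stress_ee = j_ee(delta, d)
stress_ff = j_ff(delta, d)
```

Both measures are zero only when the layout reproduces every original distance exactly.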

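Gradient descent
Aim: find the minimum of J(a).
Method:
Init: $a^{(0)} \leftarrow$ a random position.
Iterate: $a^{(t+1)} \leftarrow a^{(t)} - \eta \nabla J(a^{(t)})$.
Stop: when $\|a^{(t+1)} - a^{(t)}\| < \varepsilon$ or $t > t_{max}$.
Gradient: $\nabla J(a) = \left( \frac{\partial J(x)}{\partial x_1}, \frac{\partial J(x)}{\partial x_2}, \ldots, \frac{\partial J(x)}{\partial x_d} \right)^T \Big|_{x=a}$
Problem: finds a local minimum that depends on the starting point (according to its basin of attraction).
See also: Newton's algorithm, conjugate gradient.

MDS for leukemia samples
Used $J_{ee}$ with Euclidean distance.
[Figure: input data matrix and the resulting MDS layout]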
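The init / iterate / stop loop of gradient descent can be sketched generically. The example minimizes a simple two-dimensional bowl instead of an MDS objective so that the answer is known in advance; for MDS, the vector a would hold the stacked coordinates $y_{1:n}$. The function names, the step size, the tolerance, the iteration cap, and the seed are illustrative assumptions.

```python
import math
import random

def gradient_descent(grad, dim, eta=0.1, eps=1e-8, t_max=10_000, seed=0):
    """Minimize J given its gradient function `grad`.

    Init:    a random position in [-1, 1]^dim
    Iterate: a <- a - eta * grad(a)
    Stop:    when the step length drops below eps or after t_max iterations
    """
    rng = random.Random(seed)
    a = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(t_max):
        g = grad(a)
        a_next = [ai - eta * gi for ai, gi in zip(a, g)]
        step = math.sqrt(sum((x - y) ** 2 for x, y in zip(a_next, a)))
        a = a_next
        if step < eps:
            break
    return a

def grad_bowl(a):
    """Gradient of the toy objective J(a) = (a1 - 1)^2 + 2 * (a2 + 3)^2."""
    return [2.0 * (a[0] - 1.0), 4.0 * (a[1] + 3.0)]

a_min = gradient_descent(grad_bowl, dim=2)
```

The toy objective has a single minimum at (1, -3), so the starting point does not matter here. For objectives with several basins of attraction, rerunning with different seeds is a common way to probe the dependence on the starting point.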

MDS vs. PCA

                                      PCA                   MDS
Linear                                Yes                   Distorts space
Unique                                Yes                   Depends on initial configuration
Optimal                               Yes                   Converges to local minima
Preserves distances                   Only projected part   Yes (attempts to)
Captures high-dimensional structure   Missing dimensions    Potentially better
