Biochip informatics-(i)


Biochip informatics - (I): biochip normalization & differential expression. Ju Han Kim, M.D., Ph.D. SNUBI: Seoul National University Biomedical Informatics, http://www.snubi.org/ Biochip Informatics - (I): Biochip basics; Preprocessing; Episodes 1 and 2; Global normalization; Intensity-dependent normalization; Controlling regional variation; Alternatives; Differential expression; Multiple hypothesis testing; Classification

Biochip basics. Bioinformatics pipeline. Terminology: Sample (target): RNA (cDNA) hybridized to the array; aka target, mobile substrate. Probe: DNA spotted on the array; aka spot, immobile substrate. Sector (block): rectangular matrix of spots printed using the same print-tip (or pin); aka print-tip-group. Slide, array: printed microarray. Batch: collection of microarrays with the same probe layout. Channel: data from one color (Cy3 = cyanine 3 = green; Cy5 = cyanine 5 = red).

A Biochip Informatics Strategy. Interesting patients / interesting animals / interesting cell lines: make biochip. Appropriate tissue, appropriate conditions: extract RNA, hybridize biochip. Pre-scan biochip, scan biochip, data preprocessing, assess significance, functional clustering, post-cluster analysis & integration; biological validation, informatical validation (?). Clinical relevance of biochip informatics: Dx, Discovery, Tx, Px.

Biochip informatics: challenges. Pre-processing: technology variation, noise & data filtering, missing / negative values / P & A calls, data scaling (can I assume normality?), chip quality, other artifacts. Functional clusters: clustering quality, consistency, & robustness. Statistical issues: study design / number of replicates / multiple testing. Integrative biochip informatics: can we get more out of it? Why an integrative approach? Classical biostatistics: variables (10s-100s), cases (1,000s-1,000,000s); biochips: variables (10,000-100,000), cases (10s-100s). High-dimensional systems with insufficient data are extremely underdetermined; largely unlabelled data; not tractable by standard biostatistical techniques.

Preprocessing: technology variation, noise & outlier detection, missing / negative values / P & A calls, data scaling (can I assume normality?), chip quality, other artifacts. Preprocessing: noise. Weighted average difference: θ = (1/N) Σ_{n=1}^{N} (PM_n - MM_n) φ_n

Preprocessing: missing values. Put in zeros; row-average values; weighted KNN; SVD (requires iteration). Preprocessing: filtering. Intensity-based filter: floors or cut-offs (min. expression level 2 s.d. above (local) background), percentage-based cut-offs; also consider saturation. Low-variance filter (across conditions). Statistical filter (with replicates). Low-entropy filter / jackknife clustering. Target ambiguity filter (databases).
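Two of the missing-value options above, row averages and weighted KNN, can be sketched in plain Python. This is an illustrative sketch only: the genes-as-rows layout, the Euclidean distance over shared columns, and the 1/distance weights are assumptions, not the published KNNimpute implementation.

```python
import math

def impute_row_average(matrix):
    """Replace None entries with the mean of the observed values in that row."""
    out = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)  # assumes each row has some data
        out.append([mean if v is None else v for v in row])
    return out

def impute_knn(matrix, i, j, k=2):
    """Weighted-KNN estimate for entry (i, j): average column j over the k
    rows nearest to row i (Euclidean distance on columns observed in both),
    weighted by 1/distance."""
    target = matrix[i]
    dists = []
    for r, row in enumerate(matrix):
        if r == i or row[j] is None:
            continue
        shared = [(a, b) for a, b in zip(target, row)
                  if a is not None and b is not None]
        d = math.sqrt(sum((a - b) ** 2 for a, b in shared))
        dists.append((d, row[j]))
    dists.sort()
    neighbours = dists[:k]
    w = [1.0 / (d + 1e-9) for d, _ in neighbours]  # closer rows weigh more
    return sum(wi * v for wi, (_, v) in zip(w, neighbours)) / sum(w)
```

For a gene whose two nearest neighbours bracket the missing value symmetrically, the weighted KNN estimate lands between them.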

Filtering: Low entropy filter * Jackknife clustering Filtering: Low entropy filter

Chip quality Debris Data structure Gene expression level of gene 10 on slide 4 = Log2(Red intensity / Green intensity)

Density plots of intensities. Intensity vs. intensity plot: for the same, equally-treated samples, I(R_i) = I(G_i).

Intensity vs. intensity plot. For different samples, what about the R/G ratios? What possibly are the problems? Log transformations. Pros: constant c.v. (i.e., s.d. proportional to the mean); a useful way of handling ratio values; symmetry for up- and down-regulation; level of expression & significance (low < high). Cons: cannot handle non-positive values; only linear calibration transformations. Alternatives.

Episode I: a frequency pattern. Is it typical or an artifact? Episode II: a frequency pattern. Is it typical or an artifact?

Episode - II a frequency pattern Is it typical or an artifact? Pin tips

Normalization: correcting systematic variation. Simple additive and multiplicative; linear vs. non-linear. Sources of error; kinds? Dye, spotting, experiment, slide (scale), scanning.

Sources of error: tissue contamination; mRNA preparation and RNA degradation; amplification efficiency; reverse transcription efficiency; hybridization efficiency and specificity; clone identification and mapping; PCR yield, contamination; spotting efficiency; pin geometry; DNA-support binding; dye labeling; slide variation; other array manufacturing-related issues; scanning; image segmentation; signal quantification; background correction. Why normalize? To correct systematic variations: to balance the fluorescence intensities of the two dyes (green Cy3 and red Cy5); to allow the comparison of expression levels across experiments (slides); to adjust the scale of the relative gene expression levels (as measured by log ratios) across replicate experiments.

Normalization issues, cDNA chips. Within-slide: which genes to use; location; scale. Paired slides (dye swap): self-normalization. Between slides. Which genes to use? All genes; constantly expressed genes; controls (spiked controls, genomic DNA titration series); other useful sets of genes.

Global normalization. Total amount of mRNA measured is constant: log2(R/G) → log2(R/G) - c = log2(R/(kG)). Set the mean or median to zero. Assumes R and G are related by a constant factor k, ignoring intensity- and space-dependent variations. Non-parametric smoothing: lowess. Spread is a function of intensity. Regression without parametric assumptions on an X-Y plot, using a kernel function and a bandwidth. M = log2(R/G), A = (1/2) log2(R·G).
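The median-centering step of global normalization can be sketched as follows (a minimal sketch; the function name and the choice of the median over the mean are illustrative):

```python
import math
import statistics

def global_normalize(red, green):
    """Global (median) normalization: shift log-ratios so their median is 0,
    i.e. log2(R/G) -> log2(R/(kG)) with log2(k) = median of the M values."""
    M = [math.log2(r / g) for r, g in zip(red, green)]
    c = statistics.median(M)
    return [m - c for m in M]
```

After the shift, the median log-ratio is exactly zero, which is the "set mean or median to zero" step of the slide.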

Geometric transformations: translation. x' = x + T_x, y' = y + T_y; in vector form P' = P + T, with P = (x, y) and T = (T_x, T_y). Geometric transformations: rotation. With x = r cos φ, y = r sin φ: x' = r cos(φ + θ) = r cos φ cos θ - r sin φ sin θ, y' = r sin(φ + θ) = r cos φ sin θ + r sin φ cos θ; that is, x' = x cos θ - y sin θ, y' = x sin θ + y cos θ, or P' = R P with R = [[cos θ, -sin θ], [sin θ, cos θ]].

Geometric transformations: scaling. x' = x s_x, y' = y s_y; scaling factors s_x (scaling along the x-axis) and s_y (scaling along the y-axis); P' = S P with S = [[s_x, 0], [0, s_y]]. Uniform scaling: s_x = s_y. The transformations compose: P' = T + P (translation), P' = S P (scaling), P' = R P (rotation), and in general P' = M P for a composite transformation M. Variance is a function of intensity.

The M vs. A plot. The (A, M) coordinates are a 45° rotation (with rescaling) of the (log2 G, log2 R) coordinates. M = log2(R/G): log intensity-ratio. A = (1/2) log2(R·G): mean log-intensity. Intensity-dependent normalization: dye bias is dependent on spot intensity! log2(R/G) → log2(R/G) - c(A) = log2(R/(k(A)G)). Apply a robust scatter-plot smoother, lowess (the lowess() function in R with f = 20%). Assumes roughly symmetric up- and down-regulation at all intensity levels.
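The idea of subtracting an intensity-dependent trend c(A) from M can be sketched as below. A simple running mean over an A-sorted window stands in for the robust lowess fit, so this illustrates the shape of the computation, not a lowess implementation.

```python
import math

def ma_values(red, green):
    """M = log2(R/G) and A = 0.5*log2(R*G) for each spot."""
    M = [math.log2(r / g) for r, g in zip(red, green)]
    A = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return M, A

def intensity_normalize(red, green, f=0.2):
    """Subtract an intensity-dependent trend c(A) from M. A running mean over
    a window covering a fraction f of A-sorted neighbours stands in for the
    lowess smoother (f plays the role of lowess's span)."""
    M, A = ma_values(red, green)
    order = sorted(range(len(A)), key=lambda i: A[i])  # spots by intensity
    span = max(1, int(f * len(A)))
    normalized = [0.0] * len(M)
    for rank, i in enumerate(order):
        lo = max(0, rank - span // 2)
        hi = min(len(order), lo + span)
        window = [M[order[k]] for k in range(lo, hi)]
        normalized[i] = M[i] - sum(window) / len(window)  # M - c(A)
    return normalized
```

For a constant dye bias (every M equal), the fitted trend equals that constant and the normalized log-ratios all become zero.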

Nonparametric smoothing. Consider an X-Y plot and draw a regression line that requires no parametric assumptions: the regression line is not linear and depends entirely on the data. Two components of smoothing: the kernel function, which calculates a weighted mean, and the bandwidth, a window span determining the smoothness of the regression line. Types of kernel functions: uniform, triangular, normal, others. Bandwidth: the wider, the smoother; it has a bigger impact than the kernel function.

Global vs. print-tip-group lowess. Roughly equal numbers of genes are up- or down-regulated at all intensity levels (or only a few genes are expected to change).

Global vs. print-tip-group lowess. For every print-tip group, changes are roughly symmetric at all intensity levels. Within-print-tip-group box plots for print-tip-group normalized M.

Taking scale into account. Assumption: all print-tip groups have the same spread. Location / scale calibration.

Location + scale normalization Effect of normalization

Location + scale normalization

Non-linear normalization (RECOMB 2002): robust non-linear normalization. A rank score R_k, computed for each gene k from pairwise comparisons of its expression values x_ki and x_kj across arrays i and j, gives a more robust identification of non-differentially expressed genes. Assumption: there are non-differentially expressed genes at all ranges of median expression level.

[Figure: log u and arsinh((u + u_0)/c) plotted against intensity u.] arsinh(x) = log(x + sqrt(x^2 + 1)); lim_{x→∞} [arsinh(x) - log x - log 2] = 0. Differential expression. Form a statistic from the central value and spread: deviation, sum of deviations, variance, standard deviation.
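Both properties of the arsinh transform, that it is defined for non-positive intensities and that it approaches log x + log 2 for large x, are easy to check numerically; a minimal sketch:

```python
import math

def arsinh(x):
    """arsinh(x) = log(x + sqrt(x^2 + 1)); unlike log, it is defined for all
    x, including the zero and negative intensities that arise after
    background correction."""
    return math.log(x + math.sqrt(x * x + 1))
```

For large x the difference from log x + log 2 vanishes, so the transform behaves like a shifted log at high intensity while staying linear near zero.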

Statistical testing. Form a statistic (such as T) for each gene from the data; calculate the null distribution(s) for the statistic; choose the rejection region; compare the statistic to its null distribution(s). Assigning a score. Let u_jk(i) be the log2(R/G) value of the j-th gene on the k-th array in group i (i = 1, 2). Log fold-change: ū_j(2) - ū_j(1). T: (ū_j(2) - ū_j(1))/s_j. Wilcoxon (rank sum): r_j = Σ_k r_jk. SAM (with fudge factor): (ū_j(2) - ū_j(1))/(s_j + s_0). Baldi's (Bayesian): (ū_j(2) - ū_j(1))/sqrt((1-w)s_j^2 + w s_0^2). The statistics differ in how the population variability is accounted for and in how strength is borrowed across genes.
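The fold-change, t-like, and SAM-like scores for a single gene can be sketched as below. The pooled-variance form of s_j and the fixed s_0 = 0.1 are simplifying assumptions; SAM derives s_0 from a percentile of the gene-wise s_j values.

```python
import math
import statistics

def gene_scores(group1, group2, s0=0.1):
    """Per-gene scores: log fold-change, a t-like statistic, and a SAM-like
    statistic with fudge factor s0 added to the denominator."""
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    n1, n2 = len(group1), len(group2)
    # pooled standard error of the difference in means
    sp2 = ((n1 - 1) * statistics.variance(group1) +
           (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    s = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    fold = m2 - m1  # difference of mean log-ratios = log fold-change
    return {"fold": fold, "t": fold / s, "sam": fold / (s + s0)}
```

Because s_0 inflates the denominator, the SAM-like score is always smaller in magnitude than the t-like score, which is exactly how it dampens genes with tiny s_j.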

Diagnostic test characteristics.

        D+        D-
T+      a         b (α)     a+b
T-      c (β)     d         c+d
        a+c       b+d       a+b+c+d

Sensitivity = a/(a+c) = p(T+|D+). Specificity = d/(b+d) = p(T-|D-). Positive predictive value = a/(a+b) = p(D+|T+) = 1 - FDR. Negative predictive value = d/(c+d) = p(D-|T-). (α): rejecting a true null; (β): failing to reject a false null; 1 - β = power.

Diagnostic test characteristics, an example: a new AIDS test with sensitivity = 99.99% and specificity = 99.9% has been developed. Cheolsu tests positive. What is the probability that Cheolsu is infected? (Assume the current AIDS prevalence in Korea is 0.0001.) 1. 99% 2. 95% 3. 80% 4. 50% 5. 10%

Diagnostic test characteristics. Sensitivity = 99.99%, specificity = 99.9%, prevalence = 0.0001:

          T+         T-            total
AIDS      9,999      1             10,000
no AIDS   100,000    99,900,000    100,000,000
total     109,999    99,900,001    100,010,000

Positive predictive value = 9,999 / 109,999 < 10%. Negative predictive value ≈ 1.0. [Figure: overlapping T+/T- score distributions showing the TP, FN, FP, and TN regions for two decision thresholds.]
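The worked example can be reproduced directly from sensitivity, specificity, and prevalence. The fixed total population of 100,000,000 is an assumption of this sketch, so the counts differ slightly from the slide's rounded table, but the conclusion (PPV below 10% despite a near-perfect test) is the same.

```python
def predictive_values(sensitivity, specificity, prevalence,
                      population=100_000_000):
    """Build the 2x2 table implied by the test characteristics and return
    (PPV, NPV)."""
    diseased = prevalence * population
    healthy = population - diseased
    tp = sensitivity * diseased      # true positives
    fn = diseased - tp               # false negatives
    tn = specificity * healthy       # true negatives
    fp = healthy - tn                # false positives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv
```

With sensitivity 99.99%, specificity 99.9%, and prevalence 0.0001, the false positives from the huge healthy population swamp the true positives, so PPV ≈ 9%.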

ROC curve. Multiple testing problem. Thousands of hypotheses are tested simultaneously: "this gene is not differentially expressed" vs. "there is no gene that is differentially expressed". Increased chance of false positives (Type I errors): you should adjust your p-values. FWER (family-wise error rate): the probability of at least one false positive, FWER = Pr(V > 0), where V is the number of false positives. FDR (false discovery rate): the expected proportion of false positives among the rejected hypotheses, FDR = E(V/R | R > 0) Pr(R > 0), where R is the number of rejections (Benjamini & Hochberg 1995). Positive FDR: pFDR = E(V/R | R > 0).
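Both error-rate controls can be sketched in a few lines. This is a minimal sketch: the step-up loop is the standard Benjamini-Hochberg procedure, and the p-values in the usage check are made up for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha/m; controls FWER = Pr(V > 0)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up procedure: find the largest rank k with p_(k) <= k*q/m and
    reject the k smallest p-values; controls the FDR at level q under
    independence or positive dependence."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject
```

On the same p-values, BH typically rejects at least as many hypotheses as Bonferroni, reflecting FDR control being less stringent than FWER control.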

Multiple hypothesis testing

Multiple testing in microarrays. Microarray experiments are large and exploratory. A 5% FDR says that, among 100 genes called significant, about 95 may be truly significant. What does a 5% FWER mean in microarray experiments, and how does it compare to an (arbitrary) top-100 list? What about symmetric vs. asymmetric rejection regions? Robustness against the kind of dependence? Control of the FWER. Bonferroni correction: complete null hypothesis that there is no gene differentially expressed (i.e., a+c = a+b+c+d); weak control of Type I error; FWER = Pr(V > 0) = Pr(at least one adjusted p_g < c | complete H_0); what about the dependence structure? Westfall/Young's minP: adjusted p-values by resampling; strong control of Type I error (i.e., regardless of a+c); a step-down procedure; similarly, maxT.

Control of the FWER: Golub's leukemia data. FDR and SAM. Statistic: d_j = (ū_j(2) - ū_j(1))/(s_j + s_0). s_0, the fudge factor, is a small positive number chosen as the percentile of the s_j values that makes the c.v. of d_j approximately constant as a function of s_j; it also dampens large values of d_j that arise from genes with low expression. Estimate the number m_0 of invariant genes. Taking B permutations, compute the mean number of significant genes across the B permutations: Ê(V) = #{d_j : gene j unchanged, and d_j ≥ t_2 or d_j ≤ t_1} · m̂_0/m. Estimated FDR = Ê(V)/R.

The SAM procedure. Compute the ordered statistics d_(1) ≤ d_(2) ≤ … ≤ d_(J). For b = 1, …, B: (randomly) permute the class labels and compute the test statistics d*_j(b) and the corresponding order statistics d*_(1)(b) ≤ d*_(2)(b) ≤ … ≤ d*_(J)(b). From the set of B permutations, estimate the expected order statistics by đ_(j) = (1/B) Σ_b d*_(j)(b) for j = 1, …, J. Plot the d_(j) values versus the đ_(j). For a fixed threshold Δ, determine the cut-offs t_1(Δ) and t_2(Δ). Estimate the FDR.
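The permutation step that produces the expected order statistics đ_(j) can be sketched as below. This is a simplified sketch: the test statistic is passed in as a function, and balanced permutations and the d-versus-đ plotting step are omitted.

```python
import random

def expected_order_stats(data, labels, stat, B=200, seed=0):
    """SAM-style null: permute the class labels B times, compute the ordered
    statistics for each permutation, and average them rank by rank to get
    the expected order statistics d-bar_(j)."""
    rng = random.Random(seed)
    J = len(data)  # number of genes
    sums = [0.0] * J
    perm = labels[:]
    for _ in range(B):
        rng.shuffle(perm)
        d = sorted(stat(gene, perm) for gene in data)  # order statistics
        for j in range(J):
            sums[j] += d[j]
    return [s / B for s in sums]
```

Because each permutation contributes a sorted vector, the rank-wise averages are themselves nondecreasing, which is what makes the d-versus-đ plot well defined.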

Molecular Classification of Cancer. Golub et al., Science 1999;286:531-537.

Machine Learning Approach in Bioinformatics. Supervised machine learning: linear discriminant analysis / logistic regression / PCA; classification trees; artificial neural networks; support vector machines; rough sets; reinforcement learning; hidden Markov models. Unsupervised machine learning: hierarchical tree clustering; partitional clustering; self-organizing feature maps; matrix incision algorithms. Supervised vs. unsupervised classification: when the class-conditional densities are known, Bayes decision theory yields optimal rules; when they are unknown, supervised learning is either parametric (plug-in rules) or nonparametric (density estimation, geometric decision-boundary building), and unsupervised learning is either parametric (mixture resolving) or nonparametric (cluster analysis) (Jain et al. 2000, IEEE Transactions).

Artificial Neural Network A Universal Function Approximator

Classification tree as an ex. of classification Sunburn at the Beach Classification tree as an ex. of classification

Classification tree as an ex. of classification Classification tree as an ex. of classification

Classification tree as an ex. of classification. Classification by the shrunken centroids. For the K-class problem, with genes i and slides j: n_k is the number of samples in class k; C_k is the set of indices of the n_k samples. The mean expression value of gene i in class k: x̄_ik = (1/n_k) Σ_{j∈C_k} x_ij.
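The centroid-shrinkage idea can be sketched as below. This is a bare-bones sketch: real nearest-shrunken-centroid classification also standardizes each centroid difference by a within-class s_i + s_0 term before soft-thresholding, which is omitted here.

```python
def shrunken_centroids(X, y, delta):
    """Per-gene class centroids soft-thresholded toward the overall centroid
    by delta; genes whose class centroids all collapse onto the overall
    centroid no longer contribute to classification."""
    genes = len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[i] for row in X) / len(X) for i in range(genes)]
    shrunk = {}
    for k in classes:
        rows = [row for row, lab in zip(X, y) if lab == k]
        cent = [sum(r[i] for r in rows) / len(rows) for i in range(genes)]
        out = []
        for i in range(genes):
            d = cent[i] - overall[i]
            d = max(abs(d) - delta, 0.0) * (1 if d > 0 else -1)  # soft threshold
            out.append(overall[i] + d)
        shrunk[k] = out
    return shrunk
```

A gene whose class means straddle the overall mean keeps a (smaller) difference after shrinkage, while a gene with no between-class difference is zeroed out, which is the built-in feature selection of the method.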

shrunken centroids classification exercise

Feature selection K-fold cross-validation
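K-fold cross-validation splits can be sketched with a small generator. The round-robin assignment of sample indices to folds is an arbitrary choice for this sketch; stratified splits that balance class labels across folds are common in practice.

```python
def k_fold_indices(n, k):
    """Split sample indices 0..n-1 into k round-robin folds and yield
    (train, test) index lists; each sample appears in exactly one test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Fitting the classifier (including its feature selection step) inside each training split, never on the held-out fold, is what keeps the cross-validated error estimate honest.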

Cross-validation probabilities Thank you!