Advanced Statistical Methods: Beyond Linear Regression


Advanced Statistical Methods: Beyond Linear Regression. John R. Stevens, Utah State University. Notes 3: Statistical Methods II. Mathematics Educators Workshop, 28 March 2009. http://www.stat.usu.edu/~jrstevens/pcmi

ALL Data. Preprocessed gene expression data: 12,625 genes (hgu95av2 Affymetrix GeneChip) and 128 samples (arrays), stored as a matrix of expression values (128 columns, 12,625 rows), plus phenotypic data on all 128 patients, including 95 B-cell cancer and 33 T-cell cancer cases. Motivating question: which genes are changing expression values systematically between the B-cell and T-cell groups? Needle(s) in a haystack.

Basic idea of differential expression (DE). Observe gene expression in different conditions (e.g., healthy vs. diseased), decide which genes' expression levels are changing significantly between conditions, and target those genes (e.g., to halt disease). Note: there are far too many ways to test for DE to present here; we will just look at the major themes of most of them, and focus on implementing one.

Miscellaneous statistical issues. Test each gene individually: the dependence structure among genes (co-regulation or co-expression) is not well understood, so ignore co-regulation at first and test genes one at a time. Scale of data: the magnitude of change depends on the scale; in general, the log scale is approximately right, and a variance-stabilizing transformation can help.

Simple / naïve test of DE. Observe gene expression levels under two conditions. Let $Y_{kij}$ = log-scale expression level of gene $k$ in replicate $j$ of "treatment" $i$. Calculate the average log fold change: with $\bar Y_{ki}$ = average log expression for gene $k$ in treatment $i$, let $LFC_k = \bar Y_{k2} - \bar Y_{k1}$ = average log fold change for gene $k$. Make a cut-off $R$: gene $k$ is "significant" if $|LFC_k| > R$.
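As a concrete illustration, here is a minimal R sketch of this naïve rule on a hypothetical log-scale expression matrix (all object names and the cut-off value are assumptions, not taken from the slides):

```r
set.seed(13)
m <- 1000; n <- 10
treat <- rep(c(1, 2), each = n / 2)        # two "treatment" conditions, 5 arrays each
Y <- matrix(rnorm(m * n), nrow = m)        # hypothetical log-scale expression (genes x arrays)

# Average log fold change per gene: mean in treatment 2 minus mean in treatment 1
LFC <- rowMeans(Y[, treat == 2]) - rowMeans(Y[, treat == 1])

R <- 1                                     # assumed cut-off
sum(abs(LFC) > R)                          # genes the naive rule would call "significant"
```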

What does the naïve test do? It estimates the degree of differential expression: $LFC_k > 0$ for up-regulated genes, $LFC_k < 0$ for down-regulated genes. It identifies the genes with the largest observed change, but a simple mean comparison ignores something and cannot really test for significance: what if genes with larger LFC also have larger variability? Then they are not necessarily significant.

How to take variability into account? Build a test statistic on a per-gene basis. How do we usually test for mean differences between two groups or samples? With a test statistic of the form $t_k = \dfrac{\bar Y_{k2} - \bar Y_{k1}}{s_k} = \dfrac{LFC_k}{s_k}$, where $s_k$ is the pooled SD for gene $k$.
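A sketch of the corresponding per-gene two-sample t-test in R, again on hypothetical data (the equal-variance option mirrors the pooled-SD statistic above):

```r
set.seed(14)
m <- 1000; n <- 10
treat <- rep(c(1, 2), each = n / 2)
Y <- matrix(rnorm(m * n), nrow = m)        # hypothetical log-scale expression

res <- t(apply(Y, 1, function(y) {
  tt <- t.test(y[treat == 2], y[treat == 1], var.equal = TRUE)  # pooled-SD t-test
  c(t = unname(tt$statistic), p = tt$p.value)
}))
head(res)                                  # per-gene t-statistic and P-value
```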

How to use this to test for DE? What is being tested? Null: no change for gene $k$. Under the null, $t_k \sim$ a $t$ distribution with $n_k$ d.f. (a parametric assumption). But what is needed to do this? A large sample size, and an estimate of $\sigma_k$ = population SD for gene $k$ (for example, $s_k$).

What if we don't have enough? We probably don't; even dozens of arrays may not suffice. Two main problems: 1. estimating $\sigma_k$ (especially for small sample sizes); 2. finding an appropriate sampling distribution for the test statistic. Basic solutions: 1. to estimate $\sigma_k$, pool information across genes; 2. for comparison against a sampling distribution, either use a parametric assumption on an improved test statistic, or use non-parametric methods (resampling / permuting).

Test for DE with limma / eBayes. For gene $k$ under treatment $j$ on array $i$: $Y_{kij} = \beta_{0k} + \beta_{1k} T_j + \varepsilon_{kij}$, with $\mathrm{Var}[\varepsilon_{kij}] = \sigma_k^2$; here $Y_{kij}$ is the expression level (log scale), $\beta_{1k}$ is the treatment effect (DE), and $T_j$ is the treatment level (here, 0 or 1). What if there are more covariates than just treatment? Use matrix notation for convenience: $E[Y_k] = X\beta_k$, where $Y_k$ is the log-scale expression vector, $X$ is the design matrix ($n \times p$), and $\beta_k$ holds the covariate effects.
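For the two-group case, the design matrix $X$ can be built in R with model.matrix(); a small sketch (the group sizes and labels are assumptions):

```r
group <- factor(rep(c("B", "T"), times = c(6, 4)))  # hypothetical array labels
X <- model.matrix(~ group)                          # column 1: intercept; column 2: treatment effect
X
```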

Assumptions in the linear model. Obtain estimates $\hat\beta_k$ and $\hat\sigma_k^2$. For covariate $w$: $\hat\beta_{kw} \sim N(\beta_{kw},\, v_{kw}\sigma_k^2)$ and $\hat\sigma_k^2 \sim \dfrac{\sigma_k^2}{d_k}\chi^2_{d_k}$, where $d_k = n - p$ is the residual d.f. Then $t_{kw} = \dfrac{\hat\beta_{kw}}{\hat\sigma_k\sqrt{v_{kw}}} \sim t_{d_k}$. This is something we've already done. Can we improve it?

Hierarchical model to borrow information across genes: eBayes. Assume the prior distribution $\dfrac{1}{\sigma_k^2} \sim \dfrac{1}{d_0 s_0^2}\chi^2_{d_0}$ (with $d_0$ and $s_0^2$ estimated from the data using empirical Bayes methods). Consider the posterior mean $\tilde s_k^2 = E[\sigma_k^2 \mid s_k^2] = \dfrac{d_0 s_0^2 + d_k s_k^2}{d_0 + d_k}$. Then the "moderated" $t$-statistic is $\tilde t_{kw} = \dfrac{\hat\beta_{kw}}{\tilde s_k\sqrt{v_{kw}}} \sim t_{d_0 + d_k}$. Here $d_0$ represents the added information from using all of the genes.
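This moderated t-statistic is what limma's lmFit() followed by eBayes() computes; a hedged sketch on simulated data (the Bioconductor limma package is assumed to be available, and the data are made up):

```r
library(limma)

set.seed(15)
m <- 1000; n <- 10
group <- factor(rep(c("B", "T"), each = n / 2))
Y <- matrix(rnorm(m * n), nrow = m)                 # hypothetical log-scale expression
Y[1:50, group == "T"] <- Y[1:50, group == "T"] + 2  # 50 truly DE genes, for illustration

X   <- model.matrix(~ group)                        # design matrix (n x p)
fit <- lmFit(Y, X)                                  # per-gene linear model fits
fit <- eBayes(fit)                                  # shrinks variances, gives moderated t
topTable(fit, coef = 2, number = 5)                 # top genes for the treatment effect
```

Here topTable() reports, for each gene, the estimated effect (logFC), the moderated t-statistic, and the corresponding P-value.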

Now how many genes will be called significant?

Significance and P-values. Usually, a small P-value leads to a claim of significance. The correct interpretation of a P-value from a test of significance: the probability of obtaining a difference at least as extreme as what was observed, just by chance, when the null hypothesis is true. Consider a t-test of $H_0: \mu_1 - \mu_2 = 0$, when in reality $\mu_1 - \mu_2 = c$ (and SD = 1 for both populations). What P-values are possible, and how likely are they?

For each value of $c$, 1000 data sets (think of them as 1000 genes) were simulated in which two populations are compared and the truth is $\mu_1 - \mu_2 = c$. For each data set, the t-test evaluates $H_0: \mu_1 - \mu_2 = 0$ (think of this as no change in expression level). The resulting P-values for all data sets are summarized in the histograms. What's going on here?
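A minimal R sketch of this kind of simulation (the per-group sample size, n = 5, and the grid of c values are assumptions; the slides do not specify them):

```r
set.seed(1)
n  <- 5                       # observations per group (assumed)
m  <- 1000                    # data sets ("genes") per value of c
cs <- c(0, 0.2, 1, 2)         # assumed values of the true difference c

pvals <- sapply(cs, function(cc) {
  replicate(m, {
    x <- rnorm(n, mean = 0,  sd = 1)
    y <- rnorm(n, mean = cc, sd = 1)
    t.test(y, x, var.equal = TRUE)$p.value   # always tests H0: mu1 - mu2 = 0
  })
})
colnames(pvals) <- paste0("c = ", cs)

par(mfrow = c(2, 2))          # one histogram of P-values per value of c
for (j in seq_along(cs)) hist(pvals[, j], breaks = 20,
                              main = colnames(pvals)[j], xlab = "P-value")
```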

Histograms smoothed and overlaid. Note: even when there is no difference ($c = 0$), very small P-values are possible; even for larger differences ($c = 0.2$), very large P-values are possible; and when we look at a histogram of P-values from our test of DE, we have a mixture of these distributions (because each gene has its own true value of $c$).

So this is a mixture of distributions. A flat histogram would suggest that there really aren't any DE genes. The peak near 0 indicates that some genes are DE. But which ones?

How to treat these P-values? Traditionally, consider some cut-off: reject the null if the P-value < α, for example (often α = 0.05). What does this mean? α is the acceptable level of Type I error: α = P(reject null | null is true).

Multiple testing. We do this with many (often thousands of) genes simultaneously, say m genes.

                        Null True    Null False    Total
  Fail to Reject Null       U            T         m - R
  Reject Null               V            S           R
  Total                    m_0        m - m_0        m

Number of Type I errors: V. Number of Type II errors: T. Number of correct decisions: U + S.

Error rates. Think of this as a family of m tests or comparisons. Per-comparison error rate: PCER = E[V/m]. Family-wise error rate: FWER = P(V ≥ 1). What does the α cut-off mean here? Testing each hypothesis (gene) at level α guarantees PCER ≤ α; let's look at why.

What are P-values, really? Suppose $T$ is the test statistic and $t$ is its observed value. Then $\mathrm{Pval} = P(T > t \mid H_0\ \mathrm{true})$. Let $F$ be the cdf and $f$ the pdf of $T$ when $H_0$ is true, so $\mathrm{Pval} = 1 - F(t)$. What is the distribution of $Y = \mathrm{Pval}$? Let $g$ be the pdf of $Y$. Since $y = 1 - F(t)$, we have $\dfrac{dy}{dt} = -f(t)$, so $g(y) = f(t)\left|\dfrac{dt}{dy}\right| = f(t)\cdot\dfrac{1}{f(t)} = 1$. So, when $H_0$ is true, $\mathrm{Pval} \sim \mathrm{Uniform}[0,1]$.

P-values and the α cut-off. Suppose the null is true for all m genes (so none of the genes are differentially expressed). Look at a histogram of m = 1000 P-values with the α = 0.05 cut-off: about 50 are significant just by chance, and these can be expensive errors. (Here, V/m ≈ 50/1000 = 0.05.)
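A quick sketch of this all-null situation in R (uniform P-values stand in for the per-gene tests, using the result derived above):

```r
set.seed(2)
m <- 1000
p <- runif(m)                 # m P-values with every null true (Uniform[0,1] under H0)
hist(p, breaks = 20, main = "m = 1000 null P-values", xlab = "P-value")
abline(v = 0.05, lty = 2)
sum(p < 0.05)                 # roughly 50 "discoveries", all of them false positives (V)
```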

How to control this error rate? Look at controlling the FWER: testing each hypothesis (gene) at level α/m instead of α guarantees FWER ≤ α. This is called the Bonferroni correction. It is far too conservative for large m.
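In R the same correction is available through p.adjust(); a short sketch, continuing with simulated null P-values:

```r
set.seed(3)
m <- 1000
p <- runif(m)                                 # all nulls true
p_bonf <- p.adjust(p, method = "bonferroni")  # equivalent to comparing raw p against alpha/m
c(raw = sum(p < 0.05), bonferroni = sum(p_bonf < 0.05))  # ~50 by chance vs. typically 0
```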

A more reasonable approach. Consider these corrections sequentially. Let $P_k$ be the P-value for testing gene $k$, with null $H_k$, for $k = 1, 2, \ldots, m$. Let $P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}$ be the ordered P-values. Let $k^*$ be the largest $i$ for which $P_{(i)} \le \dfrac{i}{m}\alpha$, and reject all $H_{(i)}$ for $i = 1, 2, \ldots, k^*$. Then, for independent test statistics and for any configuration of false null hypotheses, this procedure guarantees $E[V/R] \le \alpha$.

What does this mean? V = number of wrongly rejected nulls, R = total number of rejected nulls. Think of rejected nulls as discovered genes of significance. Then call E[V/R] the FDR, the False Discovery Rate. This is the Benjamini-Hochberg FDR correction, sometimes called the marginal FDR correction.

Benjamini-Hochberg adjusted P-values. Let $P_{(1)} \le P_{(2)} \le \cdots \le P_{(m)}$ be the ordered P-values, and set $P_{(i)}^{(adj)} = P_{(i)}\cdot\dfrac{m}{i}$. If any $P_{(i)}^{(adj)} > 1$, reset it to 1. If any $P_{(i)}^{(adj)} > P_{(i+1)}^{(adj)}$, reset it to $P_{(i+1)}^{(adj)}$ (starting at the end of the list and checking backwards). Then $P_{(1)}^{(adj)} \le P_{(2)}^{(adj)} \le \cdots \le P_{(m)}^{(adj)}$ are the ordered BH-FDR-adjusted P-values.
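A short R sketch of this adjustment, checking a hand-rolled version against base R's p.adjust (the simulated mixture of P-values is hypothetical):

```r
set.seed(4)
p <- c(runif(950), rbeta(50, 1, 50))        # mostly null P-values plus 50 small "DE" ones

bh_adjust <- function(p) {
  m <- length(p)
  o <- order(p)                             # positions of the ordered P-values
  adj <- pmin(1, p[o] * m / seq_len(m))     # P_(i) * m / i, capped at 1
  adj <- rev(cummin(rev(adj)))              # enforce monotonicity, checking backwards
  out <- numeric(m); out[o] <- adj          # return to the original gene order
  out
}

all.equal(bh_adjust(p), p.adjust(p, method = "BH"))  # should be TRUE
sum(p.adjust(p, method = "BH") < 0.05)               # genes significant at FDR 0.05
```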


An extension: the q-value. The p-value for a gene is the probability of observing a test statistic at least as extreme when the null is true. The q-value for a gene is the expected proportion of false positives incurred when calling that gene significant. Compare (with slight abuse of notation): pval = P(T > t | H_0 true), while qval = P(H_0 true | T > t).
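q-values can be computed with the Bioconductor qvalue package; a minimal sketch (package availability and the simulated P-values are assumptions):

```r
library(qvalue)   # BiocManager::install("qvalue") if not already installed

set.seed(5)
p <- c(runif(950), rbeta(50, 1, 50))   # hypothetical mix of null and DE P-values

qobj <- qvalue(p)
qobj$pi0                               # estimated proportion of true null hypotheses
sum(qobj$qvalues < 0.05)               # genes called significant at a q-value cut-off of 0.05
```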

A useful visualization technique: the volcano plot. It is good for visualizing the importance of variability. Is there a loss of information?
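A volcano plot is simply significance plotted against effect size; a self-contained sketch using per-gene t-tests as earlier (data and cut-offs are made up):

```r
set.seed(6)
m <- 2000; n <- 10
group <- rep(c(0, 1), each = n / 2)
Y <- matrix(rnorm(m * n), nrow = m)                        # hypothetical log-scale expression
Y[1:100, group == 1] <- Y[1:100, group == 1] + 1.5         # 100 up-regulated genes

lfc  <- rowMeans(Y[, group == 1]) - rowMeans(Y[, group == 0])
pval <- apply(Y, 1, function(y) t.test(y[group == 1], y[group == 0])$p.value)

plot(lfc, -log10(pval), pch = 20,
     xlab = "average log fold change (LFC)", ylab = "-log10(P-value)",
     main = "Volcano plot")
abline(h = -log10(0.05), v = c(-1, 1), lty = 2)            # illustrative cut-offs
```

Genes in the upper corners change a lot and do so consistently; for a moderated limma fit, the package's volcanoplot() function produces the same kind of display.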

Smoothed color density representation. Interpretation: the color represents the density of points around the corresponding location. How is this better? It lets us visualize overlaid points.
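In R, this is the kind of figure produced by graphics::smoothScatter(); a tiny sketch with heavily overplotted simulated points:

```r
set.seed(7)
x <- rnorm(5000)
y <- x + rnorm(5000, sd = 0.5)                # many overlapping points
smoothScatter(x, y, xlab = "x", ylab = "y")   # color encodes local point density
```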

Another visualization tool: the heatmap. Expression (the matrix) is shown on a color scale (here, dark red to dark blue), with rows and columns clustered and displayed with dendrograms.
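A base R sketch of such a heatmap on simulated expression data (gene/array names and the color ramp are arbitrary choices):

```r
set.seed(8)
expr <- matrix(rnorm(50 * 20), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("array", 1:20)))
expr[1:20, 1:10] <- expr[1:20, 1:10] + 2      # a block of genes elevated in half the arrays

# heatmap() clusters rows and columns hierarchically and draws the dendrograms in the margins
heatmap(expr, scale = "row",
        col = colorRampPalette(c("darkblue", "white", "darkred"))(64))
```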

Hierarchical clustering: agnes (Agglomerative Nesting). Start with singleton clusters: each vector is its own group. Find the two closest vectors (distance is usually Euclidean) and merge them to form a cluster. Then recalculate distances using a linkage rule for the distance between clusters: average linkage (average of the dissimilarities between clusters), single linkage (dissimilarity between nearest neighbors), or complete linkage (farthest neighbors).
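A sketch of agglomerative clustering in R with the cluster package's agnes() (the two simulated groups are hypothetical); base R's hclust() gives the same kind of result:

```r
library(cluster)                       # provides agnes()

set.seed(9)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
           matrix(rnorm(50, mean = 3), ncol = 5))          # two well-separated groups of vectors

ag <- agnes(x, metric = "euclidean", method = "average")   # average linkage
plot(ag, which.plots = 2)              # dendrogram

hc <- hclust(dist(x), method = "complete")                 # base R, complete linkage
plot(hc)
```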

Side note: color matters! www.vischeck.com/vischeck/vischeckimage.php

Gene Profiling / Selection. Observe gene expression in different conditions (e.g., healthy vs. diseased). Use the simultaneous expression profiles of thousands of genes (what the genes are doing across arrays). Look at which genes are important in separating the two conditions, i.e., what determines the conditions' signatures.

Machine Learning. Computational and statistical inference processes: observed data → reusable algorithms for prediction. Why "machine"? We want minimal human involvement. Why "learning"? We want to develop the ability to predict. Here we use supervised learning: we use knowledge of the condition type.

Machine Learning Methods: neural networks; SVM (Support Vector Machine); RPART (Recursive PArtitioning and Regression Trees); CART (Classification and Regression Trees); and ensemble learning (averaging many trees): boosting, bagging, random forests.

CART: Classification and Regression Trees. Each individual (array) has data on many predictors (genes) and one response (disease state). Think of a tree with splits based on levels of specific predictors: for example, a root node Diseased (18:21) is split at Gene1 < 8.3 into Healthy (15:10), and at Gene1 > 8.3 into Diseased (3:11). Choose predictors and split levels to maximize purity in the new groups, taking the best split at each node. Prediction is made by passing test cases down the tree.
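A minimal sketch of fitting such a classification tree in R with the rpart package (the data are simulated, not the ALL set, and all names are hypothetical):

```r
library(rpart)

set.seed(10)
n <- 60
genes <- as.data.frame(matrix(rnorm(n * 20), nrow = n,
                              dimnames = list(NULL, paste0("Gene", 1:20))))
status <- factor(ifelse(genes$Gene1 + rnorm(n, sd = 0.5) > 0, "Diseased", "Healthy"))

fit <- rpart(status ~ ., data = cbind(status, genes), method = "class")
print(fit)                                   # splits chosen to maximize purity at each node
predict(fit, genes[1:5, ], type = "class")   # prediction: pass new cases down the tree
```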

CART generalized: Random Forests. Rather than using all predictors and all individuals to make a single tree, make a forest of many (ntree) trees, each based on a random selection of predictors and individuals. Each tree is fit using a bootstrap sample of the data (drawn with replacement) and grown until each node is pure. Each node is split using the best among a subset (of size mtry) of predictors randomly chosen at that node (the default for classification is the square root of the number of predictors). The special case using all predictors is bagging. Prediction is made by aggregating across the forest (majority vote or average).
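A hedged sketch with the randomForest package on simulated data (ntree and mtry are shown explicitly to match the description above):

```r
library(randomForest)

set.seed(11)
n <- 60; p <- 200                              # hypothetical: 60 arrays, 200 genes
genes  <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("Gene", 1:p)))
status <- factor(ifelse(genes[, 1] - genes[, 2] + rnorm(n, sd = 0.5) > 0, "B", "T"))

rf <- randomForest(x = genes, y = status,
                   ntree = 500,                # number of bootstrap trees in the forest
                   mtry  = floor(sqrt(p)),     # predictors tried at each split
                   importance = TRUE)          # keep permutation-based importance
print(rf)                                      # includes the OOB estimate of the error rate
```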

How to measure goodness? Each tree is fit on a training set (the bootstrap sample), or "bag". The left-over cases ("out-of-bag", usually about 1/3 of the original data) can be used as a test set for that tree. The out-of-bag (OOB) error rate is the resulting % misclassification.

What does RF give us? It is kind of a black box, but we can look at variable importance. For each tree, look at the OOB data: permute the values of predictor j among all OOB cases, pass the OOB data down the tree, and save the predictions. For each tree and predictor j, compare the OOB error rate with variable j permuted to the OOB error rate before permutation. Average across the forest to get an overall variable importance for each predictor j.
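Continuing the same kind of fit, this permutation importance is what importance() and varImpPlot() report; a sketch (again on made-up data):

```r
library(randomForest)

set.seed(12)
n <- 60; p <- 200
genes  <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("Gene", 1:p)))
status <- factor(ifelse(genes[, 1] - genes[, 2] + rnorm(n, sd = 0.5) > 0, "B", "T"))

rf  <- randomForest(x = genes, y = status, ntree = 500, importance = TRUE)
imp <- importance(rf, type = 1)      # type 1 = mean decrease in accuracy (permutation-based)
head(imp[order(imp, decreasing = TRUE), , drop = FALSE], 3)   # top 3 most important genes
varImpPlot(rf, n.var = 10)           # quick visual summary of the top predictors
```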

Why variable importance? Recall: we are identifying the conditions' signatures. Sort genes by some criterion; we want the smallest set of genes that achieves good diagnostic ability.

ALL subset: 3,024 genes (FDR-significant at .05), 30 arrays (15 B, 15 T). Look at the top 3 most important genes from the Random Forest.

[Figure panels: results on the subset of 30 arrays, and over all 128 arrays.]

Then what? We have a list of candidate genes: the needles in the haystack. Further study: common pathways and functions, validation (qRT-PCR, etc.), and hypothesis generation.