Multiple Testing Issues & K-Means Clustering. Definitions related to the significance level (or type I error) of multiple tests

Size: px

Start display at page:

Download "Multiple Testing Issues & K-Means Clustering. Definitions related to the significance level (or type I error) of multiple tests"

Aleesha Sharp
6 years ago
Views:

1 StatsM254 Statistical Methods in Coputational Biology Lecture 3-04/08/204 Multiple Testing Issues & K-Means Clustering Lecturer: Jingyi Jessica Li Scribe: Arturo Rairez Multiple Testing Issues When trying to find differentially expressed genes, we re perforing a test on each gene. If we have genes, 2,..., we re perfoing tests. Table : Decision atrix Decisions Accept H 0 Reject H 0 Total True H 0 U V 0 False H 0 T S 0 Total R R R is the total nuber of differentially expressed genes we called through the tests. U is the nuber of true negatives. T is the nuber of false negatives ( type II error). V is the nuber of false positives (type I error). S is the nuber of true positives. Definitions related to the significance level (or type I error) of ultiple tests : PCER Per Coparison Error Rate (PCER) = E[V ] 2: FWER Faily Wise Error Rate (FWER) = P (V ) 3: FDR False Discovery Rate (FDR) FDR = { E[ V R ] 0, if R = V = 0 PCER Suppose we test each of the hypotheses at significance level α, then P (reject H (i) 0 H(i) 0 is true) = α, i =, 2... PCER = E[V ] = P (reject H(i) 0 and H (i) 0 is true)

2 We know P (H (i) 0 is true) [0, ], so P (reject H(i) 0 = H(i) 0 is true)p (H (i) 0 is true) () () Thus, PCER is bounded by α P (reject H(i) 0 H(i) 0 is true) = α = α FWER F W ER = P (V ) = P (Reject at least one H 0 ) = P ( {reject H (i) 0 }) P (reject H (i) 0 ) α α is a bad upper bound because it can take values that don t ake sense ( e.g. = 00, α = 0.05 ), but it can still give us useful inforation. If we reduce the significance level of each test to α the FWER α This is called the Bonferroni Correction e.g. Suppose you have,000 genes and you want the FWER to be Then you call the gene differentially expressed only if its p-value This correction is very stringent. You could iss true differentially expressed genes. May result in the discovery of too few genes. What do we do? Coent : If all H (i) 0 are true and have level α, V Bin(, α) FDR (Benjaini and Hochberg, 995) goal: control FDR = E[ V R ] α procedure: consider testing H (i) 0,..., H() 0 based on p-values p,..., p. Let s order the p-values as p () p (2)... p () and their corresponding H 0 s as [H (), H (2),..., H () ] Let k be the largest i such that p (i) i α. Then we reject all H (i) for which i k 2

3 p values ordered index i, i =,2,.., In this exaple α = 0.2 and = 7. Here the red line is at h = α and the black line increases at b = α Intuition: Reject all p-values below the black line Here you copare the i th p-value to i α to ake a decision loose In Bonferroni you copare every p-value to α to ake a decision stringent Why does this procedure work? If all H 0 are true, then E[V ] = p (k) actual significance level of each test Here R = k(we decide) E[ V R ] = p (k) k p i = i th p-value = P (reject H (i) 0 H(i) 0 is true) k k α = α Exaple: Account for correlation between genes Table 2: Expression Value atrix Gene Condition Condition 2 Test Statistic -c values -c 2 values S () c values -c 2 values S () We ve got expression values for each gene where condition has c replicates and condition 2 has c 2 replicates. These c + c 2 expression values are used to calculate a test statistic S i for each gene (e.g. a t-statistic - the larger the value, the ore likely gene i is differentially expressed). These S i s are then ordered fro greatest to sallest in the data atrix. We then perute the expression values for each gene across the conditions (i.e., perute the coluns of the Expression Value Matrix) for a total of N ties, each resulting in a peruted data atrix. 3

4 We want to define a c such that the genes with S i bellow c is called non differentially expressed and the genes with S i above c is called differentially expressed. Using a decision rule S i > c we find a total of R genes to be differentially expressed. Question: What is the FDR? In perutation j, define V j as the nuber of genes with S i statistic > c Estiate FDR as N N V j R Clustering Algoriths : K-Means n genes, each with an expression vector R p (p saples) X, X 2,..., X n R p assign the in to k clusters. The class label of X i is C(i) and k is the cluster center for the k th cluster. Objective Function (goal): We want to iniize the total within cluster distance Algorith: ) K k= C(i)=k X i k 2 ({ k} K k=, {C(i) } n ) ( { k } K k=, {C(i) } n ) = argin K { k } K k=,{c(i)}n k= C(i)=k X i k 2 For a given cluster assignent C, inize the total within cluster distance with respect to { k } K k= k = argin X i k 2 k C(i)=k If X i k 2 = (X i k ) 2 k = n k C(i)=k X i 4

5 2) Given the cluster centers { k } K k=, iniize the total within cluster distance with regard to the cluster assignent. Di Di e.g. 2-Diensions. Given cluster centers,2,3,4, these cluster assignents would iniize within cluster distance for each cluster. Procedure C(i) = argin X i k 2 k K ) Find the optial cluster centers given a cluster assignent; 2) Find the optial cluster assignent given the cluster centers; 3) Iterate ) and 2) until C converges. Question: Can we find the global in {k } K k=,{c(i)}n K k= C(i)=k X i k 2 of the total within cluster distance? This clustering algorith is sensitive to initial cluster assignent. Not every initial cluster assignent can lead to the global in. Solution: Use ultiple initial cluster assignents to check if there is a coon solution reached by a ajority of the initial assignent schees. 5

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are,

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are, Page of 8 Suppleentary Materials: A ultiple testing procedure for ulti-diensional pairwise coparisons with application to gene expression studies Anjana Grandhi, Wenge Guo, Shyaal D. Peddada S Notations