ALGEBRAIC STABILITY INDICATORS FOR RANKED LISTS IN MOLECULAR PROFILING

1 ALGEBRAIC STABILITY INDICATORS FOR RANKED LISTS IN MOLECULAR PROFILING. Giuseppe, MPBA, Fondazione Bruno Kessler, Trento, Italy. Sandpit on the Application of Machine Learning Techniques to the Analysis of Complex Biomedical Data, London, 6th July 2009.

2 THE PROBLEM The data are a matrix in M(n \times p, \mathbb{R}), with n \approx 10^3 samples and p \approx 10^5 genes / 10^6 SNPs. A result of a classification/ranking procedure is a list of features, ranked according to their relevance for the classifier. To ensure repeatability and to avoid overfitting due to information-leakage phenomena such as selection bias, the methodology has to be carefully designed.

3 THE PROBLEM Error curve for the Colon cancer dataset (left), with true labels (blue) and random labels (black).

4 THE PROBLEM Error curves for the two synthetic datasets (right): f with 1000 discriminant features (black) and f with no discriminant features (blue).

5 THE PROBLEM Result of a multi-institution analysis of papers on gene expression profiling in a leading journal: inability to reproduce the analysis in more than 50% of the cases, partial reproduction in about one third, perfect reproduction in 11%.

6 THE PROBLEM Solution: use a correctly planned Data Analysis Protocol (DAP) implementing a resampling scheme. At each run i = 1, ..., B an ordered list L_i is produced, so the output is a matrix in M(B \times d, \mathbb{N}), with B \approx 10^3. THE SET OF LISTS L = \{L_i\}_{i=1}^{B} is the set of all lists; for each list, L_i^k is its top-k list, i.e. the sublist consisting of the first k ranked elements of L_i. What information can be derived from the structure of the set of feature lists?
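
A minimal sketch of such a resampling DAP, assuming a linear SVM from scikit-learn as the ranker; the classifier, split sizes and B below are illustrative choices, not the protocol of the slides.

# Sketch of a resampling DAP: B train/test splits, each producing a ranked
# feature list from the absolute weights of a linear SVM.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import LinearSVC

def resampling_lists(X, y, B=400, random_state=0):
    """Return a (B, p) array: row i holds the feature indices of list L_i, best first."""
    lists = []
    splitter = StratifiedShuffleSplit(n_splits=B, test_size=0.3, random_state=random_state)
    for train_idx, _ in splitter.split(X, y):
        clf = LinearSVC(C=1.0, dual=False).fit(X[train_idx], y[train_idx])
        scores = np.abs(clf.coef_).ravel()      # relevance of each feature for the classifier
        lists.append(np.argsort(-scores))       # feature indices sorted by decreasing relevance
    return np.vstack(lists)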

7 PERMUTATION GROUP THEORY
GROUP A group is a set together with an associative operation which admits an identity element and such that every element has an inverse.
PERMUTATIONS A permutation on a set \Omega_n = \{s_1, s_2, \ldots, s_n\} of n elements is a bijection \tau : \Omega_n \to \Omega_n, s_i \mapsto \tau(s_i) = s_j. The number of different permutations on a set of n elements is n!.
SYMMETRIC GROUP The set of all permutations on a set \Omega_n (usually \Omega_n = \mathbb{N} \cap [1, n]), endowed with the composition \sigma\tau : \Omega_n \to \Omega_n, s_i \mapsto \sigma(\tau(s_i)) = s_k, becomes a group S_n, called the permutation group or symmetric group.

8 LIST REPRESENTATION
Example: the symmetric group S_3 consists of
\{1 \to 1, 2 \to 2, 3 \to 3\}, \{1 \to 2, 2 \to 1, 3 \to 3\}, \{1 \to 3, 2 \to 2, 3 \to 1\}, \{1 \to 1, 2 \to 3, 3 \to 2\}, \{1 \to 2, 2 \to 3, 3 \to 1\}, \{1 \to 3, 2 \to 1, 3 \to 2\}.
REPRESENTATION Labelling the n genes involved in a problem by \mathbb{N} \cap [1, n], every ranked list L_t can be identified with a permutation \tau_t \in S_n, the image \tau_t(i) of the i-th gene being its ranking inside the list L_t (e.g. gene AFFX-MurIL2_at, i = 1, has \tau_t(1) = 13347, while the last gene, i = n, has \tau_t(n) = 175). Thus \tau can be written as \tau = (\tau(1), \ldots, \tau(n)) = (13347, \ldots, 175).
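
A minimal sketch of this identification, assuming gene identifiers are strings and ranks are 1-based; the toy gene names below are illustrative, not taken from the slides.

# Sketch: turn a ranked list of gene identifiers (best first) into the
# permutation vector tau, where tau[i] is the rank of gene i.
def list_to_permutation(ranked_genes, all_genes):
    rank_of = {g: r for r, g in enumerate(ranked_genes, start=1)}
    return [rank_of[g] for g in all_genes]

# illustrative toy example with five genes labelled A..E
all_genes = ["A", "B", "C", "D", "E"]
print(list_to_permutation(["A", "C", "D", "B", "E"], all_genes))  # -> [1, 4, 2, 3, 5]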

9 METRICS ON PERMUTATION GROUPS
DEFINITION A map d defined on a permutation group S, d : S \times S \to \mathbb{R}, is a distance function or a metric if it satisfies the following axioms:
NON-NEGATIVITY d(\sigma, \tau) \ge 0 for all \sigma, \tau \in S
SEPARATION d(\sigma, \tau) = 0 \iff \sigma = \tau for all \sigma, \tau \in S
SYMMETRY d(\sigma, \tau) = d(\tau, \sigma) for all \sigma, \tau \in S
TRIANGLE INEQUALITY d(\sigma, \tau) + d(\tau, \rho) \ge d(\rho, \sigma) for all \sigma, \tau, \rho \in S
RIGHT INVARIANCE d(\sigma, \tau) = d(\sigma\rho, \tau\rho) for all \sigma, \tau, \rho \in S
Triangularity might sometimes be relaxed to less strict requirements (e.g. polygonal inequalities), which makes d a pseudometric.

10 RIGHT INVARIANCE
The fifth property, d(\sigma, \tau) = d(\sigma\rho, \tau\rho) for all \sigma, \tau, \rho \in S, is an essential requirement for distance functions on permutation group spaces.
AN EXAMPLE L_1 = \{A, C, D, B, E\} and L_2 = \{B, C, D, E, A\}, via the identification A \to 1, B \to 2, C \to 3, D \to 4, E \to 5, correspond to \tau_1 = (1, 4, 2, 3, 5) and \tau_2 = (5, 1, 2, 3, 4). Apply the permutation \rho corresponding to the relabeling A \to 2, B \to 5, C \to 4, D \to 1, E \to 3: then \tau_1\rho = (3, 1, 5, 2, 4) and \tau_2\rho = (3, 5, 4, 2, 1). Right invariance corresponds to relabeling independence.
The Euclidean distance d(\tau, \sigma) = \sqrt{\sum_{i=1}^{n} (\tau^{-1}(i) - \sigma^{-1}(i))^2} is not right-invariant.
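
A quick numerical check of this point (an illustrative sketch, with composition acting as relabeling on 0-based permutation tuples): the footrule on ranks is right-invariant, while the Euclidean distance on the inverse permutations generally is not.

# Sketch: compare a right-invariant distance (Spearman's footrule on ranks) with
# the non-invariant Euclidean distance on inverse permutations.
import math, random

def compose(a, b):                 # (a b)(i) = a(b(i)); permutations as 0-based tuples
    return tuple(a[b[i]] for i in range(len(a)))

def inverse(a):
    inv = [0] * len(a)
    for i, v in enumerate(a):
        inv[v] = i
    return tuple(inv)

def footrule(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclid_on_inverse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(inverse(a), inverse(b))))

n = 6
rand_perm = lambda: tuple(random.sample(range(n), n))
s, t = rand_perm(), rand_perm()
for _ in range(5):
    r = rand_perm()
    print(footrule(s, t) == footrule(compose(s, r), compose(t, r)),             # always True
          math.isclose(euclid_on_inverse(s, t),
                       euclid_on_inverse(compose(s, r), compose(t, r))))        # usually False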

11 STATISTICAL DISTANCES
SPEARMAN'S FOOTRULE The L^1 distance between permutations: F(\sigma, \tau) = \sum_{i=1}^{n} |\sigma(i) - \tau(i)|.
SPEARMAN'S RHO The L^2 distance between permutations: R(\sigma, \tau) = \sqrt{\sum_{i=1}^{n} (\sigma(i) - \tau(i))^2}.
KENDALL'S TAU The minimum number of pairwise adjacent transpositions (swaps) needed to transform \sigma into \tau: K(\sigma, \tau) = \mathrm{Card}\{(i, j) \in \Omega_n^2 : \sigma(i) < \sigma(j) \text{ and } \tau(i) > \tau(j)\}.
All these measures are metrics.
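
A small sketch computing the three distances on rank vectors (the quadratic pair count for Kendall's tau is fine at these sizes); the example vectors are \tau_1 and \tau_2 from the relabeling example above.

# Sketch: the three classical distances between rank vectors sigma and tau.
import math
from itertools import combinations

def footrule(sigma, tau):
    return sum(abs(s - t) for s, t in zip(sigma, tau))

def spearman_rho(sigma, tau):
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(sigma, tau)))

def kendall_tau(sigma, tau):
    # number of discordant pairs, i.e. pairwise adjacent transpositions needed
    return sum(1 for i, j in combinations(range(len(sigma)), 2)
               if (sigma[i] - sigma[j]) * (tau[i] - tau[j]) < 0)

sigma, tau = [1, 4, 2, 3, 5], [5, 1, 2, 3, 4]
print(footrule(sigma, tau), spearman_rho(sigma, tau), kendall_tau(sigma, tau))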

12 THE CANBERRA DISTANCE
A key fact about gene lists is that variations in the bottom part of the lists are much less relevant than differences in the top part, so a weight factor is needed to take care of this requirement. A natural candidate to substitute Spearman's footrule is its weighted version, the Canberra distance:
Ca(\tau, \sigma) = \sum_{i=1}^{n} \frac{|\tau(i) - \sigma(i)|}{\tau(i) + \sigma(i)}.
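
For positive rank vectors this is exactly the classical Canberra distance between numeric vectors; a minimal sketch, checked against scipy's implementation:

# Sketch: Canberra distance between two rank vectors; for positive ranks it
# coincides with scipy.spatial.distance.canberra.
from scipy.spatial.distance import canberra

def canberra_ranks(tau, sigma):
    return sum(abs(t - s) / (t + s) for t, s in zip(tau, sigma))

tau, sigma = [1, 4, 2, 3, 5], [5, 1, 2, 3, 4]
print(canberra_ranks(tau, sigma), canberra(tau, sigma))  # same value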

13 MEAN
Let
H_s = \sum_{j=1}^{s} \frac{1}{j} = \log(s) + \gamma + \frac{1}{2s} - \frac{1}{12 s^2} + \frac{\epsilon_s}{120 s^4}
be the s-th harmonic number, with a famous approximation, where 0 < \epsilon_s < 1 and \gamma is the Euler-Mascheroni constant. Many closed forms for sums and products involving H_s have been found only very recently.
THEOREM The expected value of the Canberra distance on S_p is
E\{Ca\}_{S_p} = \left(\frac{1}{2p} + 2p + 2\right) H_{2p} - \left(\frac{1}{4p} + 2p + 2\right) H_p - \left(p + \frac{3}{2}\right).
Its approximation is
\hat{E}\{Ca\}_{S_p} = (p + 1)\log(4) - (p + 2) + o(1).
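
A quick numerical sanity check of the expected value as reconstructed above (exact enumeration of S_p is feasible only for tiny p; for larger p a Monte Carlo estimate is used, exploiting right invariance to fix one list to the identity):

# Sketch: check the expected Canberra distance on S_p against brute force (tiny p)
# and against the (p+1)log(4) - (p+2) approximation (larger p, Monte Carlo).
import math, random
from itertools import permutations

def canberra(tau, sigma):
    return sum(abs(t - s) / (t + s) for t, s in zip(tau, sigma))

def H(n):
    return sum(1.0 / j for j in range(1, n + 1))

def expected_exact(p):
    return (1/(2*p) + 2*p + 2) * H(2*p) - (1/(4*p) + 2*p + 2) * H(p) - (p + 1.5)

p = 6
ident = tuple(range(1, p + 1))
brute = sum(canberra(ident, s) for s in permutations(ident)) / math.factorial(p)
print(brute, expected_exact(p))                  # should agree

p = 500
ident = tuple(range(1, p + 1))
mc = sum(canberra(ident, random.sample(ident, p)) for _ in range(200)) / 200
print(mc, expected_exact(p), (p + 1) * math.log(4) - (p + 2))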

14 VARIANCE
Let
c_p(i, j) = \frac{|i - j|}{i + j} and f_p(i) = \frac{1}{p}\sum_{j=1}^{p} c_p(i, j) = 1 + \frac{2i}{p}\left(2H_{2i} - H_i - H_{p+i} - 1\right).
THEOREM The variance cannot be written in an entirely closed form:
V\{Ca\}_{S_p} = \frac{1}{p-1}\sum_{i,j=1}^{p}\left(c_p^2(i, j) + \left(\frac{E\{Ca\}_{S_p}}{p}\right)^2 - 2\, c_p(i, j)\, f_p(j)\right)
= P(H_p, H_{2p}) plus terms involving the sums \sum_{i=1}^{p} i^2 H_{p+i} H_{2i} and \sum_{i=1}^{p} i^2 H_{p+i} H_i, which admit no closed form.
Its approximation is
\hat{V}\{Ca\}_{S_p} = \alpha p + \beta \log(p) + \delta + o(p^{-1}), with \alpha, \beta, \delta \in \mathbb{R}.

15 DISTRIBUTION
Let
d_p(i, j) = c_p(i, j) - \frac{1}{p}\sum_{l=1}^{p} c_p(i, l) - \frac{1}{p}\sum_{m=1}^{p} c_p(m, j) + \frac{1}{p^2}\sum_{l=1}^{p}\sum_{m=1}^{p} c_p(m, l).
Since
\max_{1 \le i, j \le p} d_p^2(i, j) = d_p^2(1, 1), with d_p(1, 1) = \frac{1}{p}\left(E\{Ca\}_{S_p} + 4H_{p+1} - 2p - 4\right),
then
\lim_{p \to \infty} \frac{\max_{1 \le i, j \le p} d_p^2(i, j)}{\frac{p-1}{p}\, V\{Ca\}_{S_p}} = 0,
thus by Hoeffding's theorem we have
THEOREM The distribution of the Canberra distance on S_p is asymptotically normal.

16 UPPER BOUND
The problem \rho = \arg\max_{S_p}(Ca_I) has one solution \rho_M for p even and two solutions for p odd, namely
\rho_M = \prod_{i=1}^{p/2}\left(i,\ \frac{p}{2} + i\right) for even p, and \theta together with \theta^{-1} for odd p, where \theta(i) = i + \frac{p-1}{2} for i \le \frac{p+1}{2} and \theta(i) = i - \frac{p+1}{2} otherwise,
while the corresponding maximum value is
\max(Ca) = Ca_I(\rho_M) =
2r(H_{3r} - H_r) if p = 4r
(2r+1)H_{6r+1} + rH_{3r+1} - \frac{2r+1}{2}H_{3r} - (2r+1)H_{2r+1} + \frac{1}{2}H_r if p = 4r + 1
(2r+1)(2H_{6r+3} - H_{3r+1} - 2H_{2r+1} + H_r) if p = 4r + 2
(2r+1)H_{6r+5} + \frac{1}{2}H_{3r+2} - (2r+1)H_{2r+1} - (r+1)H_{r+1} + \frac{2r+1}{2}H_r if p = 4r + 3

17 DISTANCES ON TOP-k LISTS
Since the most important genes are located in the upper part of the lists, it is also important to be able to compare those more important portions of the lists. Managing partial lists is much harder than managing global lists, since they do not necessarily involve the same elements.
TOP-k LIST A top-k list is the sublist including the elements ranking from position 1 to k of the original list, which corresponds to inducing a partial ordering on k out of n objects. This has a natural counterpart in the group-theoretical framework via the Hausdorff topology:
DISTANCE WITH LOCATION PARAMETER
Consider a global distance D and two top-k lists \tau_m, m = 1, 2, involving respectively the subsets D_{\tau_m} = \tau_m^{-1}(\mathbb{N} \cap [1, k]) of \Omega_n.
Define the global list \bar\tau_m by \bar\tau_m(i) = \tau_m(i) for i \in D_{\tau_m} and \bar\tau_m(i) = k + 1 otherwise.
Define D^{(k+1)}(\tau_1, \tau_2) = D(\bar\tau_1, \bar\tau_2).

18 CANBERRA DISTANCE WITH LOCATION PARAMETER
Ca^{(k+1)}(\tau, \sigma) = \sum_{i=1}^{p} \frac{|\min\{\tau(i), k+1\} - \min\{\sigma(i), k+1\}|}{\min\{\tau(i), k+1\} + \min\{\sigma(i), k+1\}}.
THEOREM The expected (average) value of the Canberra metric with location parameter on S_p is
E\{Ca^{(k+1)}\}_{S_p} = \frac{k}{p}\left[\left(\frac{1}{2k} + 2k + 2\right)H_{2k} - \left(\frac{1}{4k} + 2k + 2\right)H_k - \left(k + \frac{3}{2}\right)\right] + \frac{2(p-k)}{p}\left(2(k+1)(H_{2k+1} - H_{k+1}) - k\right),
which can be approximated up to terms o(1) as
\hat{E}\{Ca^{(k+1)}\}_{S_p} = \frac{(k+1)(2p-k)}{p}\log(4) - \frac{2kp + 3p - k^2 - k}{p}.
For p \approx 10^4, E\{Ca^{(k+1)}\}_{S_p} \approx \hat{E}\{Ca^{(k+1)}\}_{S_p}.
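
A sketch of the distance with location parameter, together with a Monte Carlo check of the expected value as reconstructed above (again fixing one list to the identity by right invariance):

# Sketch: Canberra distance with location parameter k between two full rank
# vectors, and a Monte Carlo check of its expected value on S_p.
import random

def ca_location(tau, sigma, k):
    t = [min(x, k + 1) for x in tau]
    s = [min(x, k + 1) for x in sigma]
    return sum(abs(a - b) / (a + b) for a, b in zip(t, s))

def H(n):
    return sum(1.0 / j for j in range(1, n + 1))

def expected_ca_location(p, k):
    full = (1/(2*k) + 2*k + 2) * H(2*k) - (1/(4*k) + 2*k + 2) * H(k) - (k + 1.5)
    tail = 2 * (k + 1) * (H(2*k + 1) - H(k + 1)) - k
    return k / p * full + 2 * (p - k) / p * tail

p, k = 1000, 50
ident = list(range(1, p + 1))
mc = sum(ca_location(ident, random.sample(ident, p), k) for _ in range(300)) / 300
print(mc, expected_ca_location(p, k))            # should be close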

19 INCLUDING MODULES
Moreover, the existence of feature modules, for instance highly correlated genes (e.g. clones) or genes belonging to the same pathway, must be taken into account. Classifiers such as SVM tend to swap the relative importance of correlated features across different ranking runs, and those swaps should be penalized less than movements between uncorrelated genes.
MODULE-AWARE DISTANCES Let M = \{g_1, \ldots, g_m\} \subseteq \Omega_p = \{1, \ldots, p\} be a module consisting of m features. For each permutation \tau \in L, let \eta_M \in S_m be the permutation ordering the module members by rank, i.e. \tau(g_{\eta_M(1)}) < \tau(g_{\eta_M(2)}) < \cdots < \tau(g_{\eta_M(m)}). Then define the new permutation \tau_M by \tau_M(g_i) = \tau(g_{\eta_M(i)}), leaving \tau_M = \tau outside M (\tau_M|_{\Omega_p \setminus M} = \tau|_{\Omega_p \setminus M}). The distance is then computed on L_M = \{\tau_M : \tau \in L\}.
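
A sketch of the module-aware transformation, taking the module members in increasing index order (0-based indices here; this is one way to realize the canonicalization described above, under which swaps inside the module no longer contribute to the distance):

# Sketch: module-aware canonicalization of a rank vector. The ranks held by the
# module's genes are re-assigned to those genes in fixed (index) order, so
# permutations differing only by swaps inside the module map to the same vector.
def module_canonical(tau, module):
    tau_m = list(tau)
    genes = sorted(module)                    # g_1 < g_2 < ... < g_m
    ranks = sorted(tau[g] for g in genes)     # the module's ranks, ascending
    for g, r in zip(genes, ranks):
        tau_m[g] = r
    return tau_m

print(module_canonical([3, 1, 5, 2, 4], {0, 2}))  # -> [3, 1, 5, 2, 4]
print(module_canonical([5, 1, 3, 2, 4], {0, 2}))  # -> [3, 1, 5, 2, 4]: within-module swap undone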

20 EXTENSIONS TO PARTIAL LISTS WITH DIFFERENT LENGTHS
Let L_1 and L_2 be two partial lists of lengths respectively l_1 \le l_2, whose elements belong to a feature set F, and let d be a distance on permutation groups. Let S_\tau denote the set of all the dual (completed) lists of a partial list L with partial permutation \tau, so that if \alpha \in S_\tau then \alpha(i) = \tau(i) for all indices i \in L. Define the distance between L_1 and L_2 as a function of the distances between the two quotient sets, i.e.
d(L_1, L_2) = f\left(\{d(\alpha, \beta) : \alpha \in S_{\tau_1}, \beta \in S_{\tau_2}\}\right) = f(d(S_{\tau_1}, S_{\tau_2})),
for f a function of the (p - l_1)!\,(p - l_2)! distances d(\alpha, \beta) such that on singletons f(\{t\}) = t.

21 EXTENSIONS TO PARTIAL LISTS WITH DIFFERENT LENGTHS
In particular, for d = Ca the Canberra distance and f the mean function:
Ca(L_1, L_2) = \frac{1}{|S_{\tau_1}|\,|S_{\tau_2}|} \sum_{\alpha \in S_{\tau_1}} \sum_{\beta \in S_{\tau_2}} Ca(\alpha, \beta)
= \Lambda \sum_{\alpha \in S_{\tau_1}} \sum_{\beta \in S_{\tau_2}} Ca(\alpha, \beta)
= \Lambda \sum_{\substack{\alpha, \beta \in S_p \\ \alpha(i) = \tau_1(i)\ \text{if}\ i \in L_1 \\ \beta(i) = \tau_2(i)\ \text{if}\ i \in L_2}} \sum_{i=1}^{p} \frac{|\alpha(i) - \beta(i)|}{\alpha(i) + \beta(i)},
where \Lambda = \frac{1}{(p - l_1)!\,(p - l_2)!}.

22 EXTENSIONS TO PARTIAL LISTS WITH DIFFERENT LENGTHS
Expanding the formula, Ca(L_1, L_2) splits into three kinds of contributions: an exact term for the features ranked in both lists,
\sum_{i \in L_1 \cap L_2} \frac{|\tau_1(i) - \tau_2(i)|}{\tau_1(i) + \tau_2(i)};
averaged terms \frac{\Gamma(l_2 + 1, p, \tau_1(i))}{p - l_2} and \frac{\Gamma(l_1 + 1, p, \tau_2(i))}{p - l_1} for the features ranked in only one of the two lists, whose sums can be written in closed form through the functions \varepsilon and \xi; and a final block, weighted by A, for the features ranked in neither list, where
\Gamma(a, b, c) = \sum_{i=a}^{b} \frac{|c - i|}{c + i}, \qquad \varepsilon_k(s) = \sum_{j=1}^{s} j\, H_{j+k}, \qquad A = \frac{|F \setminus (L_1 \cup L_2)|}{(p - l_1)(p - l_2)}.
Setting A = 0 we neglect the (preeminent) contribution of the features not included in either list. We call the formula thus obtained core, and the original one complete.
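
A sketch of the core variant under the reading above: exact terms for features ranked in both lists, position-averaged terms for features ranked in only one of them, and features absent from both ignored (A = 0). It follows the averaging definition of the previous slide and is an illustrative reconstruction, not the authors' code.

# Sketch: "core" Canberra distance between two partial lists of possibly
# different lengths over p features.
def gamma(a, b, c):
    # Gamma(a, b, c) = sum_{i=a..b} |c - i| / (c + i)
    return sum(abs(c - i) / (c + i) for i in range(a, b + 1))

def core_canberra(list1, list2, p):
    """list1, list2: feature identifiers ordered best first (ranks 1..l)."""
    r1 = {g: r for r, g in enumerate(list1, start=1)}
    r2 = {g: r for r, g in enumerate(list2, start=1)}
    l1, l2 = len(list1), len(list2)
    d = 0.0
    for g in set(r1) | set(r2):
        if g in r1 and g in r2:
            d += abs(r1[g] - r2[g]) / (r1[g] + r2[g])
        elif g in r1:                 # unranked in list2: average over ranks l2+1..p
            d += gamma(l2 + 1, p, r1[g]) / (p - l2)
        else:                         # unranked in list1: average over ranks l1+1..p
            d += gamma(l1 + 1, p, r2[g]) / (p - l1)
    return d

print(core_canberra(["a", "b", "c"], ["b", "d"], p=10))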

23 DISTANCE MATRIX AND DISTRIBUTION
Given a set of lists L = \{L_t\}_{t=1}^{b} of p genes and an integer k, the computation of all mutual top-k distances leads to the construction of a symmetric distance matrix M_k \in M(b \times b, \mathbb{R}^+). The corresponding histogram can then be built; often the histogram can be approximated by a Gaussian distribution. (Figure: histogram of mutual distances for n = 100, p = 500, b = 100, SVM-RFE, D = Ca.)
INDICATORS It is thus possible to use the mean (and the variance) of the matrix M_k as measures of the intrinsic distance of the top-k lists of the set L.

24 STABILITY
Intuitively, the more mutually different the lists, the more unstable the problem (either because of the data or because of the employed classification procedure).
STABILITY CURVE Given a set of lists L on p genes, choose a sequence K of relevant gene-subset dimensions. For each k \in K, compute the mean and the variance of M_k (possibly normalized by k). Plot those values versus the sequence K; this may be computationally heavy, depending on p and on the number of lists. (Figure: stability curves, mean versus k, for two values of p.)
The above procedure is independent of the source generating the lists and of the dimensions of the gene expression data.

25 DEFINITION
Let \mu_k = \frac{2}{B(B-1)} \sum_{1 \le i < j \le B} (M_k)_{ij} be the mean of all the B(B-1)/2 non-trivial values of the distance matrix M_k \in M(B \times B, \mathbb{R}^+).
THE MEAN LIST DISTANCE INDICATOR is the sequence I_{D\mu}(L) = \{(k, \mu_k) : 1 \le k \le p\}.
THE NO-INFORMATION CURVE is the sequence I_{D\mu}(S_p) = \{(k, E\{Ca^{(k+1)}\}_{S_p}) : 1 \le k \le p\} associated to the group S_p of all possible dual lists with p features.
An indication of the stability of a list set L is given by considering
\hat{I}_{D\mu} = \left\{\left(k, \frac{\mu_k}{E\{Ca^{(k+1)}\}_{S_p}}\right) : 1 \le k \le p\right\},
which is independent of p and k.
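
A sketch tying the pieces together: compute mu_k with the Canberra distance with location parameter over all list pairs and normalize by the expected value on S_p reconstructed earlier; values close to 1 indicate lists as mutually distant as random ones. Function and variable names are illustrative.

# Sketch: normalized stability indicator for a set of ranked lists.
# lists: (B, p) integer array, each row the feature indices of one list, best first.
import numpy as np

def H(n):
    return np.sum(1.0 / np.arange(1, n + 1))

def expected_ca_location(p, k):
    full = (1/(2*k) + 2*k + 2) * H(2*k) - (1/(4*k) + 2*k + 2) * H(k) - (k + 1.5)
    tail = 2 * (k + 1) * (H(2*k + 1) - H(k + 1)) - k
    return k / p * full + 2 * (p - k) / p * tail

def ranks_from_lists(lists):
    B, p = lists.shape
    ranks = np.empty_like(lists)
    ranks[np.arange(B)[:, None], lists] = np.arange(1, p + 1)   # rank of each feature per list
    return ranks

def stability_indicator(lists, k):
    ranks = np.minimum(ranks_from_lists(lists), k + 1).astype(float)
    B, p = ranks.shape
    mu = 0.0
    for i in range(B):
        diff = np.abs(ranks[i] - ranks[i + 1:])
        mu += np.sum(diff / (ranks[i] + ranks[i + 1:]))
    mu /= B * (B - 1) / 2
    return mu / expected_ca_location(p, k)    # ~1: as unstable as random lists

rng = np.random.default_rng(0)
random_lists = np.vstack([rng.permutation(200) for _ in range(50)])
print(stability_indicator(random_lists, k=20))   # close to 1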

26 PREDICTIVE PROFILING ON SYNTHETIC DATA
Synthetic datasets f_{X/Y}: binary labelled samples described by X discriminant features drawn from N(\mu, \sigma) (with \mu depending on the sample label) and by Y - X random features drawn from the uniform distribution U[a, b].
DAP: B = 400 LinSVM/RFE experiments, with correlation modules M1 (correlation > 0.8) and M2 (correlation > 0.7). On both datasets, fewer than five features are sufficient to reach perfect classification.
(Figure: I_{D\mu} stability indicators in profiling synthetic data, solid lines: f_{10/100}, dotted lines: f_{30/100}, and in presence of feature modules.)

27 CLASSIFIERS COMPARISON
Classifiers: Linear SVM / Terminated Ramp SVM. Feature ranking methods: (E-)RFE / 1-RFE. Datasets: microarray Breast Cancer (cases described by gene expression) and proteomic Ovarian Cancer (samples described by 123 mass-spec peaks).
(Figure: Average Test Error for Breast Cancer and Ovarian Cancer; LinearSVM/RFE (LR: dashed), LinearSVM/1RFE (L1: dotted), TRSVM/1RFE (T1: solid).)

28 CLASSIFIERS COMPARISON
Same classifiers, feature ranking methods and datasets as above.
(Figure: mean distance I_{D\mu} for Breast Cancer and Ovarian Cancer; LinearSVM/RFE (LR: dashed), LinearSVM/1RFE (L1: dotted), TRSVM/1RFE (T1: solid).)

29 CLASSIFIERS COMPARISON
Same classifiers, feature ranking methods and datasets as above.
(Figure: I_{D\mu} on the top-25 sublists for Breast Cancer and Ovarian Cancer; LinearSVM/RFE (LR: dashed), LinearSVM/1RFE (L1: dotted), TRSVM/1RFE (T1: solid).)

30 APPLICATION TO ENRICHMENT
If a library of functionally correlated gene sets (e.g. pathways) is available, the stability indicator can be used to compare the stability of the original lists with the stability of the ranked gene-set lists produced by their enrichment. An inhomogeneous set of gene lists may be shown to correspond to a much more stable list of pathways. Example on synthetic data with 100 lists of 100 genes and 124 pathways, where the lists are created by random permutations fixing the gene sets.

31 FILTER FEATURE SELECTIONS
Dataset Tib100: samples described by 100 normally distributed features with decreasing discriminant power.
Fix a number n of samples and a suitable threshold \theta for the employed filtering algorithm A.
Randomly extract from the Tib100 dataset n positive and n negative examples and compute the A statistic. The statistics considered are Fold Change (FC), Significance Analysis of Microarrays (SAM), the B statistic, the F statistic, t, moderated F and moderated t.
Consider the list of the features whose statistic value is above the threshold \theta, ranked by statistic value; repeat the above process B times.
Final output: the set of all lists L(n, A, \theta, i), where the number of samples n ranges between 5 and 45, an ad-hoc set of 100 values for \theta is chosen, and B = 100.

32 FILTER FEATURE SELECTIONS
Most researchers agree that SAM should outperform all the other methods because of its ability to control the false discovery rate.
(Figure: stability curves for the B statistic and the SAM statistic, computed with the Complete and the Core variants of the distance.)

33 ACCURACY AND STABILITY I
Although they share a common trend in many cases, accuracy and stability are independent measures. Thus we can analyze a dataset in the accuracy vs. stability space; the lower-left direction indicates better performance. This diagnostic plot allows the comparison of different datasets, different profiling methods (classifier / feature ranking algorithm) and different models.
(Figure: ATE vs. \hat{I}_{D\mu} for different profiling methods (LE, L1, T1) and cancer datasets (Breast, Ovarian); each point corresponds to a feature subset size, indicated for the extremal models.)

34 ACCURACY AND STABILITY II
US-FDA MAQC-II, on the sources of variability in microarray experiments: 6 datasets, 13 endpoints (A-M). Analysis of swapped training/validation experiments.
Performance-stability analysis for the candidate models at each endpoint (both Blind and Swap, Canberra complete distance). The MCC coordinate is the median of the MCCs for the endpoint and the same experiment, where
MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
For each point, the rectangles are defined by 95% bootstrap Student's t confidence intervals.
Stability consistently predicts the level of MCC performance, i.e. endpoints for which lists change less (between Blind and Swap) have higher MCC scores.
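
A minimal sketch of the MCC computation from confusion-matrix counts (returning 0 when the denominator vanishes is a common convention, not stated on the slide):

# Sketch: Matthews correlation coefficient from confusion-matrix counts.
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0   # 0 by convention if undefined

print(mcc(tp=40, tn=45, fp=5, fn=10))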

35 ACCURACY AND STABILITY II
US-FDA MAQC-II, on the sources of variability in microarray experiments: 6 datasets, 13 endpoints (A-M). Analysis of swapped training/validation experiments: inter-experiment list stability compared with MCC.
For each pair (edp, exp) the partial top-med(edp, exp) Borda list B(edp, exp) is used in the alternate experiment: the Swap experiment is run with the features belonging to B(edp, Blind), retrieving an MCC_Blind(Swap) value; analogously, we compute MCC_Swap(Blind) for the same endpoint. InterCC on the x axis ranks the endpoints by increasing list variability between experiments.
(Figure: predictive Borda MCC, MCC_Blind(Swap) and MCC_Swap(Blind), versus InterCC-ranked endpoints.)
Stability consistently predicts the level of MCC performance, i.e. endpoints for which lists change less (between Blind and Swap) have higher MCC scores.

36 REFERENCES
[Borda81] Borda, J. Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences, Année MDCCLXXXI.
[Critchlow85] Critchlow, D. Metric methods for analyzing partially ranked data. Lecture Notes in Statistics 34, Springer-Verlag, Berlin, 1985.
[Diaconis88] Diaconis, P. Group representations in probability and statistics. Institute of Mathematical Statistics Lecture Notes - Monograph Series 11, 1988.
[Dwork01] Dwork, C., Kumar, R., Naor, M. & Sivakumar, D. Rank aggregation revisited. In WWW10, 2001.
[Fagin04] Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D. & Vee, E. Comparing and Aggregating Rankings with Ties. In Proc. ACM Symposium on Principles of Database Systems (PODS'04), 47-58, 2004.
[Saari01] Saari, D. Chaotic Elections! A Mathematician Looks at Voting. American Mathematical Society, 2001.
[Simon06] Simon, R. Development and Evaluation of Therapeutically Relevant Predictive Classifiers Using Gene Expression Profiling. J Natl Cancer Inst., 98(17), 2006.
