clustq: Efficient Protein Decoy Clustering Using Superposition-free Weighted Internal Distance Comparisons
Debswapna, Auburn University
ACM-BCB, August 31, 2018
What is protein decoy clustering? Clustering groups similar items together. Applied to the decoys from a folding simulation, it identifies the most populated conformational states.
Clustering and protein folding landscape. [Figure: protein folding landscape; labels: Energy, Entropy, N.] There are more ways to be incorrect than to be correct.
Clustering-based nativeness score? The decoy with the highest average pairwise similarity points to a potential near-native basin. Computing the average pairwise similarity score for n decoys takes O(n^2) pairwise comparisons.
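The average pairwise similarity idea can be sketched as follows. This is a minimal illustration, not the clustq implementation: the function name `consensus_scores` and the `similarity` callback are hypothetical, and the similarity is assumed to be symmetric so each pair is evaluated once.

```python
def consensus_scores(decoys, similarity):
    """Score each decoy by its average similarity to every other decoy.

    `similarity(a, b)` is a hypothetical, symmetric pairwise score.
    The double loop visits n*(n-1)/2 pairs, hence O(n^2) comparisons.
    """
    n = len(decoys)
    if n < 2:
        raise ValueError("need at least two decoys")
    totals = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            s = similarity(decoys[i], decoys[j])
            totals[i] += s  # the pair contributes to both members
            totals[j] += s
    return [t / (n - 1) for t in totals]
```

The decoy with the highest returned value is the one sitting in the most populated region of the decoy set.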
Pairwise comparison via structural alignment. Finding the optimal structural alignment is itself an optimization problem, and the similarity score (e.g. TMscore, GDT) is computed from that alignment. O(n^2) alignment-based comparisons are computationally expensive.
Q-score: Alignment-free comparison (Sussman and coworkers, 2009)
$\{r_{ij}^{(1)}\}$: internal distance matrix of protein 1
$\{r_{ij}^{(2)}\}$: internal distance matrix of protein 2
$Q_{ij} = \exp\left[-\left(r_{ij}^{(1)} - r_{ij}^{(2)}\right)^{2}\right]$
$Q_{ij} \approx 1$ for very good similarity; $Q_{ij} \approx 0$ for poor similarity
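A minimal sketch of the per-pair Q term, assuming each protein is given as a list of 3D coordinates with one point per residue. The slide only defines Q_ij; averaging it over all residue pairs i < j to get a single number is an assumption made here, and the function names are hypothetical.

```python
import math

def internal_distances(coords):
    """Matrix of Euclidean distances between all residue pairs."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

def q_score(coords1, coords2):
    """Mean of Q_ij = exp(-(r_ij^(1) - r_ij^(2))^2) over pairs i < j.

    Averaging over pairs is an assumption of this sketch.
    """
    n = len(coords1)
    if n != len(coords2) or n < 2:
        raise ValueError("structures must have equal length >= 2")
    r1 = internal_distances(coords1)
    r2 = internal_distances(coords2)
    terms = [math.exp(-(r1[i][j] - r2[i][j]) ** 2)
             for i in range(n) for j in range(i + 1, n)]
    return sum(terms) / len(terms)
```

Because only internal distances enter the formula, translating or rotating one structure leaves the score unchanged, which is why no superposition is needed.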
WQ-score: Weighted internal distance comparison
Qnarrow: $|i - j| < 6$
Qshort: $6 \le |i - j| < 12$
Qmedium: $12 \le |i - j| < 24$
Qlong: $|i - j| \ge 24$
Sequence separations are inspired by protein contact map prediction.
$\text{WQ-score} = (1 \times Q_{narrow} + 2 \times Q_{short} + 4 \times Q_{medium} + 8 \times Q_{long}) / 15$
Long-range interactions carry more information about the protein fold.
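The weighting can be sketched as below, again assuming one 3D point per residue. Two choices here are assumptions rather than anything stated on the slide: each of the four Q values is taken as the mean of Q_ij over the pairs in its separation bin, and bins that contain no pairs (very short chains) are dropped with the weights renormalized. For chains of at least 25 residues all four bins are populated and the denominator is the slide's 15.

```python
import math

# (weight, min separation inclusive, max separation exclusive)
BINS = [(1, 1, 6), (2, 6, 12), (4, 12, 24), (8, 24, math.inf)]

def wq_score(coords1, coords2):
    """Weighted Q-score over narrow/short/medium/long separation bins."""
    n = len(coords1)
    if n != len(coords2) or n < 2:
        raise ValueError("structures must have equal length >= 2")
    weighted_sum = 0.0
    weight_total = 0.0
    for weight, lo, hi in BINS:
        terms = []
        for i in range(n):
            for j in range(i + 1, n):
                if lo <= j - i < hi:
                    d1 = math.dist(coords1[i], coords1[j])
                    d2 = math.dist(coords2[i], coords2[j])
                    terms.append(math.exp(-(d1 - d2) ** 2))
        if terms:  # assumption: skip empty bins and renormalize
            weighted_sum += weight * sum(terms) / len(terms)
            weight_total += weight
    return weighted_sum / weight_total
```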
clustq: What's conceptually new? Rapid clustering-based consensus scoring using the alignment-free average pairwise WQ-score.
Results 1/4: WQ-score vs. alignment-based scores. Comparisons with popular alignment-based scores. Datasets: Modeller set (20 proteins), Rosetta set (58 proteins). Scores: TMscore, GDT-TS. Measures: Pearson correlation, Spearman correlation.
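For reference, the two measures can be computed as in the sketch below. It is a plain-Python illustration with hypothetical function names, not the evaluation code behind the reported numbers; tied values receive the mean of their ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Pearson rewards a linear relationship between the two scores, while Spearman only asks that they rank the decoys in the same order.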
Modeller set: WQ-score is very well correlated (> 0.97) with alignment-based scores.
Rosetta set: WQ-score is well correlated (~0.8) with alignment-based scores.
Results 2/4: clustq vs. alignment-based consensus scoring. Clustering-based consensus scoring compared against alignment-based scores. "Stage 2" datasets: CASP11 (80 proteins), CASP12 (40 proteins). Scores: TMscore, GDT-TS. Measures: Pearson correlation, Spearman correlation.
CASP11 "stage 2" set: better than TMscore-based clustering, comparable to GDT-TS-based clustering.
CASP12 "stage 2" set: comparable to alignment-based consensus scoring.
Computational efficiency of clustq vs. TMscore. [Figure: panels for CASP11 and CASP12.]
clustq is 5.2 times faster than alignment-based consensus scoring.
Results 3/4: clustq vs. top consensus-based approaches. Full datasets: CASP11 (80 proteins), CASP12 (40 proteins). Top consensus scoring methods: Pcons, APOLLO. Measures: Pearson correlation, Spearman correlation.
Results 3/4: clustq vs. top consensus-based approaches. [Figure: Pearson and Spearman panels for CASP11 and CASP12; y-axis: average correlation w.r.t. GDT-TS; methods: clustq, Pcons, APOLLO.] Consistently better performance compared to top methods.
Can the clustq score estimate target difficulty?
Results 4/4: Computational Efficiency of clustq vs. TMscore. [Figure: panels for CASP11 and CASP12.]
if clustq_score > 0.4: easy target (homology-based)
else: hard target (homology-free)
clustq online: http://watson.cse.eng.auburn.edu/clustq/
Conclusions: WQ-score is an alignment-free weighted internal distance comparison metric that is well correlated with alignment-based metrics. clustq offers ultra-fast clustering-based consensus scoring with comparable or better performance. It could be employed for estimating target difficulty. It is freely available to the community.
Acknowledgements: Rahul Alapati, Auburn University