clustq: Efficient Protein Decoy Clustering Using Superposition-free Weighted Internal Distance Comparisons

Debswapna, Auburn University. ACM-BCB, August 31, 2018

What is protein decoy clustering? Clustering groups similar items together into clusters. Applied to the decoys generated by a folding simulation, it identifies the most populated conformational states.

Clustering and the protein folding landscape. [Figure: folding landscape with labels Energy, Entropy, and N.] There are more ways to be incorrect than to be correct.

Clustering-based nativeness score? The potential near-native basin is where decoys have the highest average pairwise similarity. Computing the average pairwise similarity score takes O(n^2) pairwise comparisons.
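The scoring idea on this slide can be sketched in a few lines. This is a minimal illustration, not the clustq implementation: the function name `consensus_scores` and the toy similarity used in the example are assumptions, and any symmetric pairwise similarity could be plugged in.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def consensus_scores(
    decoys: Sequence[T], similarity: Callable[[T, T], float]
) -> List[float]:
    """Average pairwise similarity of each decoy to every other decoy.

    Hypothetical helper for illustration; `similarity` is assumed symmetric.
    """
    n = len(decoys)
    if n < 2:
        raise ValueError("need at least two decoys")
    totals = [0.0] * n
    # Each unordered pair is scored once: n * (n - 1) / 2 comparisons, i.e. O(n^2).
    for i, j in combinations(range(n), 2):
        s = similarity(decoys[i], decoys[j])
        totals[i] += s
        totals[j] += s
    return [t / (n - 1) for t in totals]


# Toy example: "decoys" are plain numbers, similarity decays with their difference.
toy_decoys = [0.0, 1.0, 1.0, 5.0]
toy_scores = consensus_scores(toy_decoys, lambda a, b: 1.0 / (1.0 + abs(a - b)))
best = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print("highest consensus score at index", best)
```

The decoy with the highest score sits in the most populated region of the toy set, which is the property the slide uses as a proxy for nativeness.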

Pairwise comparison via structural alignment. Finding the optimal structural alignment is itself an optimization problem. Similarity scores: TMscore, GDT. Performing O(n^2) alignment-based comparisons is computationally expensive.

Q-score: Alignment-free comparison. $\{r_{ij}^{1}\}$: internal distance matrix of protein 1; $\{r_{ij}^{2}\}$: internal distance matrix of protein 2. $Q_{ij} = \exp[-(r_{ij}^{1} - r_{ij}^{2})^2]$, with $Q_{ij} \approx 1$ for very good similarity and $Q_{ij} \approx 0$ for poor similarity. (Sussman and coworkers, 2009)
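A small sketch of the per-pair formula above, using only the Python standard library. Two things are assumptions and not stated on the slide: the structure-level score is taken as the mean of Q_ij over all residue pairs i < j, and each residue is represented by a single 3D point. The names `internal_distances` and `q_score` are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def internal_distances(coords: Sequence[Coord]) -> List[List[float]]:
    """Internal distance matrix {r_ij}: distance between points i and j."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def q_score(coords1: Sequence[Coord], coords2: Sequence[Coord]) -> float:
    """Mean of Q_ij = exp(-(r_ij^1 - r_ij^2)^2) over pairs i < j.

    Averaging over pairs is an assumption made for this sketch.
    """
    n = len(coords1)
    if n != len(coords2):
        raise ValueError("both structures must have the same number of residues")
    if n < 2:
        raise ValueError("need at least two residues")
    r1 = internal_distances(coords1)
    r2 = internal_distances(coords2)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.exp(-((r1[i][j] - r2[i][j]) ** 2))
    return total / (n * (n - 1) / 2)


# Toy coordinates (not real protein data): a square and a translated copy.
square = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
moved = [(x + 10.0, y - 3.0, z + 2.0) for x, y, z in square]
print(q_score(square, moved))
```

Because only internal distances enter the formula, the translated copy needs no superposition before it is compared, which is the point of the alignment-free approach.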

WQ-score: Weighted internal distance comparison. $Q_{narrow}$: $|i - j| < 6$; $Q_{short}$: $|i - j| \geq 6$ and $|i - j| < 12$; $Q_{medium}$: $|i - j| \geq 12$ and $|i - j| < 24$; $Q_{long}$: $|i - j| \geq 24$. The sequence separation bins are inspired by protein contact map prediction. $\text{WQ-score} = (1 \times Q_{narrow} + 2 \times Q_{short} + 4 \times Q_{medium} + 8 \times Q_{long}) / 15$. Long-range interactions carry more information about the protein fold.
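The weighting scheme can be sketched as follows. This is an illustration under stated assumptions, not the clustq code: each per-bin Q is taken as the mean of Q_ij over the pairs in that bin, the inputs are two precomputed internal distance matrices, and proteins shorter than 25 residues are rejected so that every bin is non-empty. The names `separation_bin` and `wq_score` are hypothetical.

```python
import math
from typing import List, Sequence

# Weights from the slide, in the order narrow, short, medium, long.
WEIGHTS = (1.0, 2.0, 4.0, 8.0)


def separation_bin(sep: int) -> int:
    """Map a sequence separation |i - j| to its bin index (0..3)."""
    if sep < 6:
        return 0  # narrow
    if sep < 12:
        return 1  # short
    if sep < 24:
        return 2  # medium
    return 3  # long


def wq_score(r1: Sequence[Sequence[float]], r2: Sequence[Sequence[float]]) -> float:
    """Weighted combination of per-bin mean Q_ij values (sketch)."""
    n = len(r1)
    if len(r2) != n:
        raise ValueError("distance matrices must have the same size")
    if n < 25:
        # Assumption of this sketch: require a pair with |i - j| >= 24.
        raise ValueError("need at least 25 residues so that every bin is non-empty")
    sums: List[float] = [0.0] * 4
    counts: List[int] = [0] * 4
    for i in range(n):
        for j in range(i + 1, n):
            b = separation_bin(j - i)
            sums[b] += math.exp(-((r1[i][j] - r2[i][j]) ** 2))
            counts[b] += 1
    per_bin = [s / c for s, c in zip(sums, counts)]
    return sum(w * q for w, q in zip(WEIGHTS, per_bin)) / sum(WEIGHTS)


# Toy example: 30 points on a line, compared with themselves.
xs = [float(k) for k in range(30)]
line = [[abs(a - b) for b in xs] for a in xs]
print(wq_score(line, line))
```

With these weights, a disagreement confined to long-range pairs costs up to 8/15 of the score, while the same disagreement confined to narrow pairs costs at most 1/15.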

clustq: What's conceptually new? Rapid clustering-based consensus scoring using the alignment-free average pairwise WQ-score.

Results 1/4: WQ-score vs. alignment-based scores. Comparisons with popular alignment-based scores: TMscore, GDT-TS. Datasets: Modeller set (20 proteins), Rosetta set (58 proteins). Measures: Pearson's correlation, Spearman correlation.
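For reference, the two agreement measures named on this slide can be computed as below. This is a plain standard-library sketch with made-up numbers, not the evaluation code behind the results; the function names are illustrative.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with at least two values")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values (not results from the talk): a monotonic but non-linear relation.
score_a = [1.0, 2.0, 3.0, 4.0, 5.0]
score_b = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(score_a, score_b), spearman(score_a, score_b))
```

Pearson measures linear agreement between two scores, while Spearman only asks whether they rank the decoys in the same order, so reporting both separates calibration from ranking quality.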

Modeller set: WQ-score is very well correlated (> 0.97) with alignment-based scores.

Rosetta set: WQ-score is well correlated (~0.8) with alignment-based scores.

Results 2/4: clustq vs. alignment-based consensus scoring. Clustering-based consensus scoring with alignment-based scores: TMscore, GDT-TS. "Stage 2" datasets: CASP11 (80 proteins), CASP12 (40 proteins). Measures: Pearson's correlation, Spearman correlation.

CASP11 "stage 2" set: better than TMscore-based clustering, comparable to GDT-TS-based clustering.

CASP12 "stage 2" set: comparable to alignment-based consensus scoring.

Computational efficiency of clustq vs. TMscore. [Figure panels: CASP11, CASP12.]

clustq is 5.2 times faster than alignment-based consensus scoring.

Results 3/4: clustq vs. top consensus-based approaches. Top consensus scoring methods: Pcons, APOLLO. Full datasets: CASP11 (80 proteins), CASP12 (40 proteins). Measures: Pearson's correlation, Spearman correlation.

Results 3/4: clustq vs. top consensus-based approaches. [Figure: two panels, Pearson and Spearman; bars for clustq, Pcons, and APOLLO on CASP11 and CASP12; y-axis label: Avg. Pearson correlation w.r.t. GDT-TS.] Consistently better performance compared to top methods.

Can the clustq score estimate target difficulty?

Results 4/4: Computational efficiency of clustq vs. TMscore. [Figure panels: CASP11, CASP12.]

If clustq_score > 0.4, the target is easy (homology-based); otherwise it is a hard target (homology-free).
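The decision rule above, written out as a runnable sketch. The 0.4 cutoff and the two labels come from the slide; the function name `target_difficulty` and the choice to pass a single per-target score are assumptions, since the slide does not say how the per-target clustq score is aggregated.

```python
EASY_THRESHOLD = 0.4  # cutoff stated on the slide


def target_difficulty(clustq_score: float) -> str:
    """Label a target from its clustq score (hypothetical helper)."""
    if clustq_score > EASY_THRESHOLD:
        return "easy target (homology-based)"
    # Scores at or below the cutoff fall into the hard category.
    return "hard target (homology-free)"


# Example scores are made up for illustration.
for score in (0.55, 0.40, 0.20):
    print(score, "->", target_difficulty(score))
```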

clustq online: http://watson.cse.eng.auburn.edu/clustq/

Conclusions. WQ-score is an alignment-free weighted internal distance comparison metric that is well correlated with alignment-based metrics. clustq provides ultra-fast clustering-based consensus scoring with comparable or better performance. It could be employed for estimating target difficulty. It is freely available to the community.

Acknowledgements: Rahul Alapati, Auburn University
