RNAscClust clustering RNAs using structure conservation and graph-based motifs

1 RNAscClust clustering RNAs using structure conservation and graph-based motifs
Milad Miladi (1), Alexander Junge (2)
miladim@informatik.uni-freiburg.de, ajunge@rth.dk
(1) Bioinformatics Group, University of Freiburg
(2) Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen
31st TBI Winterseminar, Bled, Slovenia, February

4 Clustering ncRNA sequences using structure conservation
[Figure: alignment of human, chimp, pig and mouse sequences showing covariation and the derived consensus structure]
GraphClust [Heyne et al., Bioinformatics, 2012]:
- Clusters ncRNA sequences
- Can find paralogs belonging to the same ncRNA class
- Features based on local sequence and structure
RNAscClust:
- Clusters paralogous RNA sequences structurally aligned to their orthologs
- Extends the GraphClust approach: derives evolutionarily conserved sequence and secondary structure

6 Single sequence vs. alignment clustering
[Figure: structure prediction from a single human sequence yields a wrong structure, and single sequence clustering based on it gives a wrong clustering. Structure prediction based on a multiple alignment (human, chimp, pig, mouse) uses covariation to derive a consensus structure and a set of correct structures, and multiple sequence alignment clustering based on these gives the correct clustering.]

7 Identifying similarities of secondary structures
- Neighborhood Subgraph Pairwise Distance (NSPDK) Graph Kernel [Costa and De Grave, Proceedings ICML 10, 2010]
- Extract substructures; intuitively: structure k-mers
- NSPDK Kernel: ncRNAs are highly similar if they share many substructures
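The "structure k-mers" intuition can be illustrated with a deliberately simplified sketch. This is not the NSPDK graph kernel: it only slides a window over (nucleotide, structure symbol) pairs of a dot-bracket annotated sequence, counts the windows in a sparse feature vector, and compares two vectors by cosine similarity. The function names `structure_kmers` and `cosine_similarity` and the window size `k=3` are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def structure_kmers(seq, dotbracket, k=3):
    """Count k-length windows over (nucleotide, structure symbol) pairs.

    A toy stand-in for substructure features; the real NSPDK kernel
    works on neighborhood subgraphs of the structure graph instead.
    """
    if len(seq) != len(dotbracket):
        raise ValueError("sequence and structure must have equal length")
    labelled = [n + s for n, s in zip(seq, dotbracket)]
    return Counter(
        tuple(labelled[i:i + k]) for i in range(len(labelled) - k + 1)
    )


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


hairpin = structure_kmers("GGGAAACCC", "(((...)))")
unpaired = structure_kmers("AAAAAAAAA", ".........")
print(cosine_similarity(hairpin, hairpin))   # identical feature vectors
print(cosine_similarity(hairpin, unpaired))  # only the loop window is shared
```

Two RNAs that share many windows score close to 1, and RNAs with no shared window score 0, which mirrors the "highly similar if many shared substructures" statement on the slide.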

9 Measuring structure similarity of multiple alignments
- PETfold [Seemann et al., NAR, 2008] computes base pair reliabilities for an alignment (human, chimp, pig, mouse); base pairs passing a threshold t become constraints
- Constrained folding with RNAfold for each sequence [Lorenz et al., Alg for Mol Biol, 2011]
- Use the NSPDK Graph Kernel to compare alignments
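The thresholding step can be sketched as follows. This is a minimal illustration, not the authors' code: it does not call PETfold or RNAfold, the input dictionary of reliabilities is hypothetical, and the default `t=0.9`, the use of `>=`, and the assumption that reliable pairs are nested (non-crossing) are all choices made here for the example.

```python
def constraint_string(length, pair_reliabilities, t=0.9):
    """Turn reliable base pairs into a dot-bracket style constraint.

    pair_reliabilities maps 0-based (i, j) with i < j to a reliability
    in [0, 1]. Pairs with reliability >= t are kept (assumed cutoff
    rule); pairs are assumed to be nested, which is not checked here.
    """
    constraint = ["."] * length
    # Visit the most reliable pairs first so they win any conflict.
    ranked = sorted(pair_reliabilities.items(), key=lambda kv: -kv[1])
    for (i, j), reliability in ranked:
        if reliability < t:
            break
        if not 0 <= i < j < length:
            raise ValueError(f"invalid base pair ({i}, {j})")
        if constraint[i] == "." and constraint[j] == ".":
            constraint[i], constraint[j] = "(", ")"
    return "".join(constraint)


reliabilities = {(0, 8): 0.95, (1, 7): 0.92, (2, 6): 0.50}
print(constraint_string(9, reliabilities, t=0.9))
```

Unconstrained positions stay as dots, so a downstream folding step remains free to pair them per sequence.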

10 RNAscClust pipeline: from input alignments to clustering
1. Set of multiple alignments
2. Reliable base pairs
3. Constrained folding
4. NSPDK Graph Kernel
5. Pairwise similarities
6. GraphClust postprocessing steps
7. Clustering of multiple alignments

13 Constructing a benchmark data set
Split Rfam 12 family seed alignments into subalignments:
[Figure: an Rfam family seed alignment (human, chimp, mouse, pig) is split into family subalignments]
1. Each subalignment contains one human sequence
2. Similar sequences from different species form a subalignment; subalignments have max. sequence identity
Ideal clustering groups only subalignments from the same Rfam family

14 Comparing sequence to alignment clustering
- Subalignments in benchmark set: cluster with RNAscClust
- Extract human sequences; human sequences in benchmark set: cluster with GraphClust
- Compare clustering performance

16 Low covariation in the benchmark data set
- Benchmark set has high Average Pairwise Sequence Identity (APSI) in alignments
- High APSI set: 48 families, 234 alignments
- Limit APSI to study the effect of covariation on clustering performance
[Figure: histogram "High APSI", count vs. APSI per alignment (based on Rfam alignments), with the mean APSI marked]
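For orientation, APSI of one alignment can be computed as the mean identity over all sequence pairs. The slides do not state the exact identity definition, so the choices below are assumptions: columns where both sequences have a gap are skipped, and a gap opposite a nucleotide counts as a mismatch. The function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical columns between two aligned sequences.

    Assumed convention: gap/gap columns are ignored, gap/nucleotide
    columns count as mismatches.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def apsi(alignment):
    """Average pairwise sequence identity over all sequence pairs."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


print(apsi(["ACGU", "ACGA", "ACGU"]))
```

A high value means the sequences barely differ, which leaves little room for compensatory mutations and therefore little covariation signal.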

17 Benchmark sets with different degrees of covariation
Create 2 additional benchmark data sets with controlled APSI in alignments:
- Medium APSI: 26 families, 166 alignments
- Low APSI: 10 families, 92 alignments
[Figure: histograms "Medium APSI" and "Low APSI", count vs. APSI per alignment (based on Rfam alignments), each with the mean APSI marked]

18 Clustering performance metrics - V-measure
- Homogeneity H: each cluster contains only members of a single family
- Completeness C: all members of a given family are in the same cluster
- V-measure [Rosenberg and Hirschberg, 2007] is the harmonic mean of H and C:
$$V = \frac{2 \cdot H \cdot C}{H + C}$$

19 Clustering performance metrics - Adjusted Rand Index
- a = number of object pairs from the same family assigned to the same cluster
- b = number of object pairs from different families assigned to different clusters
- n = number of alignments
$$\text{Rand Index} = \frac{a + b}{\binom{n}{2}}$$
- Adjusted Rand Index [Hubert and Arabie, 1985] is the Rand Index [Rand, 1971] adjusted for chance
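Both indices can be computed directly from two label lists. The sketch below follows the pair-counting definition of a and b given on the slide and uses the standard contingency-table form of the Hubert and Arabie adjustment; the function names and the toy labels are illustrative assumptions, not taken from the talk.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(families, clusters):
    """(a + b) / C(n, 2) with a and b counted over all object pairs."""
    n = len(families)
    if n < 2:
        raise ValueError("need at least two objects")
    a = b = 0
    for i, j in combinations(range(n), 2):
        same_family = families[i] == families[j]
        same_cluster = clusters[i] == clusters[j]
        if same_family and same_cluster:
            a += 1
        elif not same_family and not same_cluster:
            b += 1
    return (a + b) / comb(n, 2)


def adjusted_rand_index(families, clusters):
    """Rand Index corrected for the agreement expected by chance."""
    n = len(families)
    if n < 2:
        raise ValueError("need at least two objects")
    sum_cells = sum(comb(v, 2) for v in Counter(zip(families, clusters)).values())
    sum_fam = sum(comb(v, 2) for v in Counter(families).values())
    sum_clu = sum(comb(v, 2) for v in Counter(clusters).values())
    expected = sum_fam * sum_clu / comb(n, 2)
    max_index = (sum_fam + sum_clu) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


families = ["tRNA", "tRNA", "miRNA", "miRNA"]
print(rand_index(families, [0, 0, 1, 1]), adjusted_rand_index(families, [0, 0, 1, 1]))
print(rand_index(families, [0, 1, 0, 1]), adjusted_rand_index(families, [0, 1, 0, 1]))
```

Unlike the plain Rand Index, the adjusted version can become negative when a clustering agrees with the families less than a random assignment would.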

20 More covariation improves RNAscClust performance
[Figure: two panels, V-measure and Adjusted Rand Index (axis up to 1.0), comparing RNAscClust and GraphClust on the Low APSI (0.49), Medium APSI (0.64) and High APSI (0.81) benchmark sets]

22 RNAscClust Conclusion
- Leverage conserved secondary structure derived from multiple alignments
- NSPDK Graph Kernel as similarity measure
- Improved clustering compared to GraphClust
- Next step: genome-scale clustering of potential ncRNAs

24 Acknowledgements
Bioinformatics Group, University of Freiburg: Milad Miladi, Fabrizio Costa, Rolf Backofen
RTH, University of Copenhagen: Stefan Seemann, Jakob Hull Havgaard, Jan Gorodkin
Funding: DFG, Danish Center for Scientific Computing, Innovation Fund Denmark, Danish Cancer Society
Thank you for your attention!

26 Identifying similarities of secondary structures
- Neighborhood Subgraph Pairwise Distance (NSPDK) Kernel, used in GraphClust [Heyne et al., Bioinformatics, 2012]
- Extract substructures (structure k-mers); the graph kernel maps them to a sparse feature vector
- ncRNAs are highly similar if they share many substructures
[Figure: secondary structure graph with extracted substructures and the resulting sparse feature vector]

27 RNAscClust full pipeline
The diagram distinguishes RNAscClust-specific input and methodology from GraphClust methodology. Steps in order:
1. Input multiple alignments
2. Predicting sets of secondary structures
3. Sparse feature vectorization
4. Local sensitivity hashing
5. Candidate clusters of multiple alignments
6. Candidate clusters of representatives
7. Cluster refinement and extension
8. Report clusters
Iterate until no candidate is left. Steps executed in parallel are shown as stacks.

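28 V-measure
Clusters $K = \{K_1, \dots, K_m\}$; true classes $C = \{C_1, \dots, C_n\}$.
Homogeneity $h$ is defined as:
$$h = \begin{cases} 1 & \text{if } H(C \mid K) = 0 \\ 1 - \dfrac{H(C \mid K)}{H(C)} & \text{else} \end{cases}$$
where $H(C \mid K)$ is the conditional entropy of the classes given the clustering and $H(C)$ is the entropy of the classes, i.e.,
$$H(C \mid K) = -\sum_{k=1}^{|K|} \sum_{c=1}^{|C|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{c'=1}^{|C|} a_{c'k}}$$
$$H(C) = -\sum_{c=1}^{|C|} \frac{\sum_{k=1}^{|K|} a_{ck}}{N} \log \frac{\sum_{k=1}^{|K|} a_{ck}}{N}$$
Here $a_{ck}$ is the number of objects of class $c$ assigned to cluster $k$, and $N$ is the total number of objects.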

29 V-measure
Clusters $K = \{K_1, \dots, K_m\}$; true classes $C = \{C_1, \dots, C_n\}$.
On the other hand, completeness $c$ is defined as:
$$c = \begin{cases} 1 & \text{if } H(K \mid C) = 0 \\ 1 - \dfrac{H(K \mid C)}{H(K)} & \text{else} \end{cases}$$
where $H(K \mid C)$ is the conditional entropy of the clustering given the classes and $H(K)$ is the entropy of the clustering, i.e.,
$$H(K \mid C) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{k'=1}^{|K|} a_{ck'}}$$
$$H(K) = -\sum_{k=1}^{|K|} \frac{\sum_{c=1}^{|C|} a_{ck}}{N} \log \frac{\sum_{c=1}^{|C|} a_{ck}}{N}$$
V-measure is the harmonic mean of homogeneity and completeness and is not normalized w.r.t. random labeling. 0.0 is as bad as it can be, 1.0 is perfect.
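The entropy-based definitions of homogeneity and completeness translate into a short computation over the class/cluster contingency counts. The sketch below is a plain-Python illustration under the assumption that N is the total number of clustered objects; the function names are hypothetical.

```python
from collections import Counter
from math import log


def _entropy(sizes, total):
    """Entropy of a partition given its group sizes."""
    return -sum(s / total * log(s / total) for s in sizes if s)


def _conditional_entropy(joint, given_sizes, total, given_index):
    """H(X | Y) from joint counts a_ck and the sizes of the Y groups."""
    return -sum(
        a / total * log(a / given_sizes[key[given_index]])
        for key, a in joint.items()
    )


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, V-measure) for two labelings."""
    total = len(classes)
    joint = Counter(zip(classes, clusters))  # a_ck
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    h_c_given_k = _conditional_entropy(joint, cluster_sizes, total, 1)
    h_k_given_c = _conditional_entropy(joint, class_sizes, total, 0)
    homogeneity = 1.0
    if h_c_given_k != 0:
        homogeneity = 1.0 - h_c_given_k / _entropy(class_sizes.values(), total)
    completeness = 1.0
    if h_k_given_c != 0:
        completeness = 1.0 - h_k_given_c / _entropy(cluster_sizes.values(), total)
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


classes = ["tRNA", "tRNA", "miRNA", "miRNA"]
print(v_measure(classes, [0, 0, 1, 1]))  # families recovered exactly
print(v_measure(classes, [0, 1, 2, 3]))  # every object in its own cluster
```

Putting every object into its own cluster keeps homogeneity at 1 but lowers completeness, which is why the harmonic mean of both is reported.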

30 Constructing a benchmark data set
Split each Rfam 12 family seed alignment into subalignments. Similar sequences from different species form a subalignment.
[Figure: an Rfam family seed alignment (human, chimp, mouse, pig) is split into family subalignments (cliques)]
1) Each sequence in the alignment is represented as a node in a graph
2) Remove sequences with pairwise sequence identity (PSI) > 0.95
3) Add edge between sequences from different species with PSI in (0.9, 0.95]
4) Search for cliques in the graph
5) Add the clique with max PSI as a subalignment to the benchmark data set
6) Add edge between sequences from different species with PSI in (0.8, 0.9]
7) Add clique as a subalignment to the benchmark data set
8) Add edge between sequences from different species with PSI in (0.7, 0.8]
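One round of the edge and clique steps above can be sketched as below. This is an illustration under several assumptions, not the authors' procedure: PSI values are supplied as a precomputed dictionary keyed by sequence pairs, "clique with max PSI" is read as the maximal clique with the highest mean pairwise PSI, cliques are enumerated with the basic Bron-Kerbosch recursion, and the removal of near-identical sequences and the iteration over lower PSI intervals are left out. All function and sequence names are hypothetical.

```python
from itertools import combinations


def build_edges(species, psi, low, high):
    """Connect sequences from different species with low < PSI <= high."""
    edges = {name: set() for name in species}
    for u, v in combinations(sorted(species), 2):
        if species[u] != species[v] and low < psi[frozenset((u, v))] <= high:
            edges[u].add(v)
            edges[v].add(u)
    return edges


def maximal_cliques(edges):
    """Enumerate maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & edges[v], excluded & edges[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(edges), set())
    return cliques


def best_clique(edges, psi):
    """Pick the clique (size >= 2) with the highest mean pairwise PSI."""
    def mean_psi(clique):
        pairs = list(combinations(sorted(clique), 2))
        return sum(psi[frozenset(p)] for p in pairs) / len(pairs)

    candidates = [c for c in maximal_cliques(edges) if len(c) >= 2]
    return max(candidates, key=mean_psi) if candidates else None


species = {"hs1": "human", "pt1": "chimp", "mm1": "mouse", "ss1": "pig"}
psi = {
    frozenset(("hs1", "pt1")): 0.93, frozenset(("hs1", "mm1")): 0.91,
    frozenset(("pt1", "mm1")): 0.92, frozenset(("hs1", "ss1")): 0.85,
    frozenset(("pt1", "ss1")): 0.84, frozenset(("mm1", "ss1")): 0.83,
}
print(best_clique(build_edges(species, psi, 0.9, 0.95), psi))
```

Requiring a clique rather than a connected component ensures that every pair of sequences in a subalignment falls into the same PSI interval.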
