A phylogenetic view on RNA structure evolution

3 2 9 4 7 3 24 23 22 8 phylogenetic view on RN structure evolution 9 26 6 52 7 5 6 37 57 45 5 84 63 86 77 65 3 74 7 79 8 33 9 97 96 89 47 87 62 32 34 42 73 43 44 4 76 58 75 78 93 39 54 82 99 28 95 52 46 49 94 85 67 68 69 35 38 83 88 64 92 36 98 48 55 72 6 56 59 53 66 29 2 27 4 8 0 25 Tanja esell, Bioinformatics Institute, Heinrich-Heine University Düsseldorf, ermany - February 06, Bled, Slovenia,

Modeling sequence evolution Sequence T T T d Sequence 2 T T

Modeling sequence evolution Sequence T T T d Sequence 2 T T Hamming distance H=3

Modeling sequence evolution Sequence T T Sequence 2 T T each sequence site does not evolve independently of the others

Example for site-specific interactions 2D-structure 3D-structure of RNs, e.g. trn, mrn... of proteins p codon positions

SIMULTION ESTIMTION Seq: UUUUUUUUU Seq2: UUUUUUUU Seq3: UUUUUU Seq: UUUUUUUUU Seq2: UUUUUUUU Seq3: UUUUUU

Model-based approaches T stationary and time homogeneous Markov model the probability that sequence x evolves to sequence y P xy (t) = exp(qt)

4x4 instantaneous rate matrix enerally Reversible (REV) 0 Q = B @ (aπ + bπ + cπ T ) aπ bπ cπ T aπ (aπ + dπ + eπ T ) dπ eπ T bπ dπ (bπ + dπ + f π T ) f π T cπ eπ f π (cπ + eπ + f π ) T T T J69-Modell K-Modell HKY-Modell α α α β α β βπ απ βπ T α α α β β α βπ βπ απ T α α α α β β απ βπ βπ T T α α α β α β βπ απ βπ TN93-Modell F8-Modell TR-Modell βπ α π βπ T π π π T aπ bπ cπ T βπ βπ α 2 π T π π π T aπ dπ eπ T α π βπ βπ T π π π T bπ dπ f π T T βπ α 2 π βπ π π π cπ eπ f π

ompensatory mutation 3 23 2 24 25 26 9 27 U U U U U Nucleotides in stem regions evolve in strong correlation with their pairing counterpart.

ompensatory mutation Schöniger and von Haeseler, 994 U U U U U U UU π π π U π - - - π - - - π U - - - π π π U - π - - - π - - - π U - - π π π U - - π - - - π - - - π U - U π π π - - - π U - - - π U - - - π UU π - - - π π π U π - - - π U - - - - π - - π π π U - π - - - π U - - - - π - π π π U - - π - - - π U - U - - - π U π π π - - - π U - - - π UU π - - - π - - - π π π U π U - - - - π - - - π - - π π π U - π U - - - - π - - - π - π π π U - - π U - U - - - π U - - - π U π π π - - - π UU U π - - - π - - - π - - - π U π U π UU U - π - - - π - - - π - - π U π U π UU U - - π - - - π - - - π - π U π U π UU UU - - - π U - - - π U - - - π U π U π U π U

How to simulate more complex interactions among nucleotide and other character based sequences? model, that represents a universal description of arbitrary complex dependencies among sites.

Neighbourhood system k =,, l sites in a (nucleotide) sequence x = (x,..., x l ) Sequence T T Sequence 2 T T Neighbourhood system N = (N k ) k=,2,,l :. N k {,..., l}, k / N k for each k 2. If i N k then k N i for each i, k. n k denotes the cardinality of N k.

Example: N (Pseudoknot) 22 2 9 23 24 25 26 27 8 7 6 7 6 2 2 5 5 4 4 24 3 3 4 3 2 9 8 7 6 5 4 3 2 9 8 23 24 25 26 27 N = {} N 2 = {6} N 3 = {5} N 4 = {4} N 5 = {} N 6 = {} N 7 = {} N 8 = {} N 9 = {27} N = {26} N = {25} N 2 = {24} N 3 = {23} N 4 = {4} N 5 = {3} N 6 = {2} N 7 = {} N 8 = {} N 9 = {} N = {} N 2 = {} N 22 = {} N 23 = {3} N 24 = {2} N 25 = {} N 26 = {} N 27 = {9}

Example: N (Stem including stacking) 22 23 24 25 26 27 28 4 4 22 3 3 23 2 9 8 2 24 25 26 9 27 8 28... N 8 = {9}, N 9 = {8, 27,, 26} N = {9, 27, 26, 25, } N = {, 26, 25, 24, 2} N 2 = {, 25, 24, 23, 3} N 3 = {2, 23, 24, 4} N 4 = {3}...... N 22 = {23} N 23 = {22, 3, 2, 24} N 24 = {23, 3, 2,, 25} N 25 = {24,,, 2, 26} N 26 = {25, 9,,, 27} N 27 = {26,, 9,, 28} N 28 = {27}...

Example: Ribozyme domain

Basic idea: Different substitution matrix for each site x 53 223 l nk Q k 3 2 2 0 6 x 6 256 x 256 64 x 64 64 x 64 4 x 4 q(x) =... q()... +... q(53)... +... q(223)... +... q()... +... q(l) Only one mutation is allowed at the current site

U U U U U U UU π π π U π - - - π - - - π U - - - π π π U - π - - - π - - - π U - - π π π U - - π - - - π - - - π U - U π π π - - - π U - - - π U - - - π UU π - - - π π π U π - - - π U - - - - π - - π π π U - π - - - π U - - - - π - π π π U - - π - - - π U - U - - - π U π π π - - - π U - - - π UU π - - - π - - - π π π U π U - - - - π - - - π - - π π π U - π U - - - - π - - - π - π π π U - - π U - U - - - π U - - - π U π π π - - - π UU U π - - - π - - - π - - - π U π U π UU U - π - - - π - - - π - - π U π U π UU U - - π - - - π - - - π - π U π U π UU UU - - - π U - - - π U - - - π U π U π U π U

U U U U U U UU - - - π - - - π - - - π U - - - - - - - π - - - π - - - π U - - - - - - - π - - - π - - - π U - U - - - - - - π U - - - π U - - - π UU π - - - - - - π - - - π U - - - - π - - - - - - π - - - π U - - - - π - - - - - - π - - - π U - U - - - π U - - - - - - π U - - - π UU π - - - π - - - - - - π U - - - - π - - - π - - - - - - π U - - - - π - - - π - - - - - - π U - U - - - π U - - - π U - - - - - - π UU U π - - - π - - - π - - - - - - U - π - - - π - - - π - - - - - U - - π - - - π - - - π - - - - UU - - - π U - - - π U - - - π U - - -

(k, i) U U U U U U UU π π π U π π π U π π π U U π π π π π π U π π π U π π π U U π π π π π π U π π π U π π π U U π π π U π U π U π UU U π U π U π UU U π U π U π UU UU π U π U π U

U π π π U π π π U π π π U U π π π U π π π U π π π U π π π U U π π π U π π π U π π π U π π π U U π π π U U U U U U π U π U π UU U π U π U π UU U π U π U π UU U U π U π U π U

y,..., y nk y,..., y nk y,..., y nk U y,..., y nk y,..., y nk π y,..., y nk π y,..., y nk π U y,..., y nk y,..., y nk π y,..., y nk π y,..., y nk π U y,..., y nk y,..., y nk π y,..., y nk π y,..., y nk π U y,..., y nk U y,..., y nk π y,..., y nk π y,..., y nk π y,..., y nk

Q = {Q k k =,..., l} π k (y) if H(s k, y) = and x k y 0 Q k (s k, z) if H(s k, y) = 0 Q k (s k, y) = z n k + z s k 0 otherwise with s k = (x k, x i,..., x ink ) n k +, where {i,..., i nk } = N k y = (y 0, y... y nk ) n k + Normalisation: d k = z n k + π k (z) Q k (z, z) =.

The total instantaneous substitution rate for x: q(x) = l Q k (s k, s k ) k= Relative mutability at site k: Probability to replace x k by y 0 : P(k) = Q k(s k, s k ) q(x) P(x k y 0 ) = Q k (s k, y) Q k (s k, s k )

N29 0. 0. 0.05 changes N27 N28 0.05 0.05 0.05 0.05 N23 N24 N25 N26 0.02 0.02 0.02 0.02 0.04 N9 N 0.05 0.05 0.05 N2 N22 0.02 0.02 0.03 0.03 0.03 0.03 0.03 0.03 0.0 N6 0.0 0.0 N7 0.0 0.0 N8 0.0 T T2 T3 T4 T5 T6 T7 T8 T9 T T T2 T3 T4 T5 SISSI: SImulating Sequence Evolution with Site-Specific Interactions (esell and von Haeseler, Bioinformatics in press, Epub. 05 Dec. 6) + + substitution model... }{{} 5 T T3 T2 T4 T5 T6 T7 T8 T9 T T T2 T3 UUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUUUUU UUUUUUUUUUUUUUUUUUUUUUU

Example: Bacillus subtilis RNase P database (Brown, 999) U U U U UU U U U U U U U U U U U U U U U U U P UU U P5 P5 U U U U U U U U U U U U U U P5. U UUUU U U U U U U U U UUUU U U U P3 UUU U U P4 U UU U U U U U U U U P2 U U U UUUUUU 5 P U 3 UU UU UU U P. P9 P8 P P7 P5. P2 P8 P9 Sequence M375, image created by Brown

Example: ounted frequencies from a RNase P sequence of Bacillus subtilis taken from the RNase P database: f nk = f nk =0 second nucleotide in doublet first nucleotide U 0.000423 0.004228 0.02685 0.6933 0.4223 0.004228 0.000423 0.26256 0.000423 0.55 0.02685 0.26256 0.000423 0.042283 0.2325 U 0.6933 0.000423 0.042283 0.0695 0.2325

Relationship between number of substitutions per site d and number of observed differences per site h: h: number of differences per site 0.0 0. 0.2 0.3 0.4 0.5 0.6 0.7 Fel8 Doublet 0 2 3 4 5 d: number of substitutions per site

pilot study of SISSI Does phylogeny matter? phylogenetic view on some existing structure prediction methods.

Influence of the tree topology 2 0 2 0 3 0 3 3 3 3 3 0 T3 T T5 0.0 T T T32 T2 T0 T29 0.04 T2 T34 T38 T28 T92 T6 T75 T T84 T67 T T26 0.0 T69 T76 T T59 T99 T96 T62 T2 T22 T23 T77 T5 T56 T87 T94 T T8 0.0 T6.0 T8 T79 T43 0.03 T4 T83 T36 T95 T T7 T35 0.009 T78 T3 T82 T4 T24 T T72 T53 T9 T8 T73 T5 T68 T97 0.0 T33 T57 0.008 T9 T7 0.0 T47 T6 T42 T55T89 T7 T85 T63 T7 T65 T25 T39 T45 T T4 T54 T93 T46 T9 T98 T27 T6 T64 T52 T44 T74 Examples with 5 bifurcating trees, with the same topology, but different mean branch length. + T58 T37 T48 T88 T49 T86 + substitutionmodel T 0.03, T 0.075, T 0., T 0.3, T 0.5

onstruct onstruction of RN consensus structures (Lück et al. 999), (Wilm,. & Steger,. 06, submitted) ombination of Sequence lignment, Thermodynamics and Mutual Information ontent. onsensus Structure Thermodynamic onsensus Dotplot: onsensus Dotplot using RNfold: Hofacker et al. (994) Mutual Information ontent: MI: (hiu & Kolodziejczak, 99, utell et al. 992) Prediction of Tertiary Interaction Maximum Weighted Matching: Tabaska et al. (998)

Mean branch length onsensus Structures based on Mutual Information ontent N 0 0 2 2 0 3 3 3 3 3 3 0 0.03 0 0 2 2 0 3 3 3 3 3 3 0 0.075 0 0 2 2 0 3 3 3 3 3 3 0 0. 0 0 2 2 0 3 3 3 3 3 3 0 0.3 0 0 2 2 0 3 3 3 3 3 3 0 0.5 0 0 2 2 0 3 3 3 3 3 3 0

Mean branch length Thermodynamic onsensus Structures N 0 0 2 2 0 3 3 3 3 3 3 0 0.03 0 0 2 2 0 3 3 3 3 3 3 0 0.075 0 0 2 2 0 3 3 3 3 3 3 0 0. 0 0 2 2 0 3 3 3 3 3 3 0 0.3 0 0 2 2 0 3 3 3 3 3 3 0 0.5 0 0 2 2 0 3 3 3 3 3 3 0

Prediction of tertiary interactions Using Maximum Weighted Matching: Tabaska et al. (998) N 0.03 0 0 2 2 0 3 3 3 3 3 3 0 0.075 0 0 2 2 0 3 3 3 3 3 3 0 0. 0 0 2 2 0 3 3 3 3 3 3 0 0.3 0 0 2 2 0 3 3 3 3 3 3 0 0.5 0 0 2 2 0 3 3 3 3 3 3 0

Prediction of tertiary interactions Using Maximum Weighted Matching and a threshold N 0.03 0 0 2 2 0 3 3 3 3 3 3 0 0.075 0 0 2 2 0 3 3 3 3 3 3 0 0. 0 0 2 2 0 3 3 3 3 3 3 0 0.3 0 0 2 2 0 3 3 3 3 3 3 0 0.5 0 0 2 2 0 3 3 3 3 3 3 0

Tree topology Fulltree: 0 sequences Subtree: sequences 2 2 3 9 26 6 7 52 37 84 5 57 45 86 63 79 7 65 3 74 77 33 9 97 8 96 62 89 47 87 73 43 42 32 34 76 4 44 58 75 78 93 39 54 82 99 46 95 52 28 94 49 67 68 69 85 88 64 92 35 38 83 98 36 59 56 6 48 55 72 53 66 29 4 27 2 8 9 4 24 7 22 25 8 0 3 6 23 5 3 0 8 27 9 2 5 7 2 9 26 6 3 68 47 7 4 24 23 5 6 25 8 22 4 Is maximisation of evolutionary divergence useful for structure predictions?

Evolve along the fulltree and the subtree 2 0 2 0 3 0 3 3 3 3 3 0 2 0 2 0 3 0 3 3 3 3 3 0 2 0 2 0 3 0 3 3 3 3 3 0 onsensus structures based on Mutual Information ontent Fulltree N Subtree

Evolve along the fulltree and the subtree, threshold 2 0 2 0 3 0 3 3 3 3 3 0 2 0 2 0 3 0 3 3 3 3 3 0 2 0 2 0 3 0 3 3 3 3 3 0 onsensus structures based on Mutual Information ontent Fulltree N Subtree

Reducing the alignment onsensus structures based on Mutual Information ontent Fullalignment 0 0 2 2 0 3 3 3 3 3 3 0 N 0 0 2 2 0 3 3 3 3 3 3 0 Subalignment 0 0 2 2 0 3 3 3 3 3 3 0

Reducing the alignment 2 0 2 0 3 0 3 3 3 3 3 0 2 0 2 0 3 0 3 3 3 3 3 0 onsensus structures based on Mutual Information ontent Fullalignment N Subalignment

Reducing the alignment Using Maximum Weighted Matching and a threshold Fullalignment N Subalignment

Influence of the tree topology Mutual Information: Long branches many true positives correlations Short branches few true positives correlations Many false positives omparative analysis ignores the phylogenetic information in the sequences, it tends to overestimate the amount of covariation between two positions.

ncestral correlations How long the branches need to be to avoid ancestral correlations? U U ancestral correlation UU

onclusion Including phylogenetic information in comparative analysis is potential useful Phylogeny can help to choose sequences for structure prediction methods: Statistical study is necessary. How looks the optimal tree for comparative structure prediction methods, with and without thermodynamics? Method for Reconstructing Dependencies with Phylogenetic Trees Self-consistent method, where no threshold is needed.

Extension of SISSI SISSI is not limited to F8 types of rate matrices E.g. inclusion of a transition-transversion parameter Inclusion of codon position-specific heterogeneity Studies with tertiary interactions Inclusion of mixture models Inclusion of energy values Inclusion of indels

cknowledgement rndt von Haeseler (enter for Integrative Bioinformatics Vienna) Thomas Schlegel (Bioinformatics Institute, HHU Düsseldorf) Minh Bui Quang (enter for Integrative Bioinformatics Vienna) Steffen Kläre (enter for Integrative Bioinformatics Vienna) erhard Steger (Institut für Physikalische Biologie, HHU Düsseldorf) ndreas Wilm (Institut für Physikalische Biologie, HHU Düsseldorf) Financial support from the DF grant SFB-TR is gratefully acknowledged.

Special thanks to the big communities of Phylogenetics and RN structures!