The Ensemble of RNA Structures

1 The Ensemble of RNA Structures

Example: some good structures of the RNA sequence

GGGGGUAUAGCUCAGGGGUAGAGCAUUUGACUGCAGAUCAAGAGGUCCCUGGUUCAAAUCCAGGUGCCCCCU

[each structure was shown with its free energy in kcal/mol; the energy values were lost in transcription]

(((((((..((((...))))...((((...))))(((((...))))))))))))
(((((((..((((...))))...((((.(...).))))(((((...))))))))))))
((((((((.((((...))))(((((((((..((((...))))..)))).)))))...))))))))
((((((((.((((...))))(((((((((..((((...))))..))).))))))...))))))))
(((((((..((((...))))...((((...))))(((((...))))))))))))
(((((((..((((...))))...(((..(...)..)))(((((...))))))))))))
((((((((.((((...)))).((((((((..((((...))))..)))).))))...))))))))
((((((((.((((...)))).((((((((..((((...))))..))).)))))...))))))))
((((((((.((((...))))...((((...)))).((((...))))))))))))
((((((...((((...))))...((((...))))(((((...))))).))))))
(((((((...(((...(((...(((...)))..)))..)))...(((((...))))))))))))
((((((((.((((...))))((((((((...((((...))))...))).)))))...))))))))
((((((((.((((...))))((((((((...((((...))))...)).))))))...))))))))
((((((((.((((...))))...((((.(...).)))).((((...))))))))))))
(((((((..((((...)))).((((((...).)))))...(((((...))))))))))))
(((((((..((((...))))...(((...)))(((((...))))))))))))
((((((...((((...))))...((((.(...).))))(((((...))))).))))))
((((((((.((((...))))(((((((((..(((...)))..)))).)))))...))))))))
((((((((.((((...))))(((((((((..(((...)))..))).))))))...))))))))
((((((((.((((...))))...((((...)))).((((...))))))))))))
(((((((..((((...)))).(((((...)))))...(((((...))))))))))))
((((((...((((...))))...((((...))))(((((...))))).))))))

The set of all non-crossing (nc) RNA structures of an RNA sequence S is called the (structure) ensemble P of S.

2 Probability of a Structure

How probable is an RNA structure P for an RNA sequence S?

GOAL: define the probability $\Pr[P \mid S]$.

IDEA: Think of RNA folding as a dynamic system of structures (= the states of the system). Given enough time, a sequence S will form every possible structure P. For each structure there is a probability of observing it at a given time. This means: we are looking for a probability distribution!

Requirements: the probability depends on the energy, and the lower the energy, the more probable the structure. No additional assumptions!

3 Distribution of States in a System

Definition (Boltzmann distribution)
Let $X = \{X_1, \dots, X_N\}$ denote a system of states, where state $X_i$ has energy $E_i$. The system is Boltzmann distributed with temperature T iff

  $\Pr[X_i] = \exp(-\beta E_i)/Z$  for  $Z := \sum_i \exp(-\beta E_i)$,  where $\beta = (k_B T)^{-1}$.

Remarks
We call $\exp(-\beta E_i)$ the Boltzmann weight of $X_i$.
Widely used in physics to describe systems of all kinds.
The Boltzmann distribution is usually assumed for the thermodynamic equilibrium (i.e., after sufficiently much time).
The transfer to RNA is easy to see: structures = states, with their energies.
Why temperature? At very high temperature, all states become equally probable; at very low temperature, only the best (lowest-energy) states occur.
$k_B \approx 1.381 \times 10^{-23}$ J/K is known as the Boltzmann constant; $\beta$ is called the inverse temperature.
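To make the definition concrete, here is a minimal Python sketch (not from the slides) that computes Boltzmann probabilities for a few hypothetical state energies. The gas constant R in kcal/(mol K) stands in for $k_B$ because the energies are per mole; the second call illustrates the temperature remark, since at very high T the distribution flattens.

```python
import math

def boltzmann(energies_kcal, T=310.15):
    """Boltzmann probabilities for states with the given energies (kcal/mol).

    Illustrative sketch: R in kcal/(mol*K) plays the role of k_B because
    the energies are per mole; T defaults to 37 C in Kelvin.
    """
    R = 0.0019872  # kcal/(mol*K)
    beta = 1.0 / (R * T)
    weights = [math.exp(-beta * E) for E in energies_kcal]
    Z = sum(weights)                 # partition function
    return [w / Z for w in weights]

# Three hypothetical structures, lowest energy first:
print(boltzmann([-24.0, -23.0, -20.0]))          # 37 C: the lowest state dominates
print(boltzmann([-24.0, -23.0, -20.0], T=5000))  # very high T: distribution flattens
```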

4 What next?

We assume that the structure ensemble of an RNA sequence is Boltzmann distributed.

What are the benefits? (More than just probabilities of structures...)
Why is it reasonable to assume the Boltzmann distribution? (Well, a physicist told me...)
How to calculate probabilities efficiently? (McCaskill's algorithm)

5 Benefits of Assuming Boltzmann

Definition
Probability of a structure P for S: $\Pr[P \mid S] := \exp(-\beta E(P))/Z$.

This allows a more profound weighting of structures in the ensemble. We need efficient computation of the partition function Z!

Even more interesting: probabilities of structural elements.

Definition
Probability of a base pair (i, j) for S: $\Pr[(i,j) \mid S] := \sum_{P \ni (i,j)} \Pr[P \mid S]$.

Again, we need Z (and some more). Base pair probabilities enable a new view of the structure ensemble (visually, but also algorithmically!).

Remark: For RNA we have a real temperature, e.g. T = 37 °C, which determines $\beta = (k_B T)^{-1}$. For calculations, pay attention to physical units!

6 An Immediate Use of Base Pair Probabilities

MFE structure and base pair probability dot plot of a tRNA (computed by RNAfold -p):

GGGGGUAUAGCUCAGGGGUAGAGCAUUUGACUGCAGAUCAAGAGGUCCCUGGUUCAAAUCCAGGUGCCCCCU

[figure: cloverleaf drawing of the MFE structure and the base pair probability dot plot (dot.ps)]

7 Why Do We Assume Boltzmann

We will give an argument from information theory. We will show: the Boltzmann distribution makes the fewest assumptions. Formally, the B.d. is the distribution with the lowest information content / maximal (Shannon) entropy.

As a consequence: without further information about our system, Boltzmann is our best choice.

[What could "further information" mean in a biological context?]
[low information in the distribution corresponds to high information in the event]

8 What is Shannon Entropy?

Information of an event = how much more you know after it happened.

Assume a loaded die that always yields the number 6: you know the outcome before you throw it. No information content (all the information is in the distribution).
Playing the lottery, it is very hard to predict the lottery numbers: high information content.

This is formalized as follows. For a random variable $X = \{X_1, \dots, X_N\}$, the amount of information $I[X_i]$ of an event $X_i \in X$ is defined as

  $I[X_i] = \log_b \frac{1}{\Pr[X_i]} = -\log_b \Pr[X_i]$

Entropy is the expected amount of information for X:

  $H[X] = \sum_{i=1}^{N} \Pr[X_i]\, I[X_i]$

9 Shannon Entropy (Example)

We toss a coin. For our coin, heads and tails show up with respective probabilities p and q = 1 - p (not necessarily fair). How uncertain are we about the result? The expected information is

  $H = p \log_b \frac{1}{p} + q \log_b \frac{1}{q}$

[plot of $p \log_2(1/p) + q \log_2(1/q)$ as a function of p]

p = 0.5, q = 0.5: H = 1, maximal uncertainty.
p = 1, q = 0: H = 0, no uncertainty.

In general, define the Shannon entropy of a probability distribution $\vec p$ over N states $X_1, \dots, X_N$ as

  $H(\vec p) := -\sum_{i=1}^{N} p_i \log_b p_i.$
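A small sketch of this computation; the function and the example probabilities are illustrative, not from the slides.

```python
import math

def entropy(p, b=2):
    """Shannon entropy H(p) = -sum_i p_i log_b p_i, with 0*log 0 := 0."""
    return -sum(pi * math.log(pi, b) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1 bit, maximal uncertainty
print(entropy([1.0, 0.0]))   # loaded coin: 0 bits, no uncertainty
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits
```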

10 Formalizing "Least Number of Assumptions"

Example: Assume we have N events. Without further assumptions, we will naturally assume the uniform distribution $p_i = \frac{1}{N}$.

This is the uniquely defined distribution maximizing the entropy $H(\vec p) = -\sum_i p_i \log_b p_i$. It is found by solving the following optimization problem: maximize the function $H(\vec p)$ under the side condition $\sum_i p_i = 1$.

11 Formalizing "Least Number of Assumptions"

Theorem: Given a system of states $X_1, \dots, X_N$ and energies $E_i$ for $X_i$. The Boltzmann distribution is the probability distribution $\vec p$ that maximizes the Shannon entropy $H(\vec p) = -\sum_{i=1}^{N} p_i \log_b p_i$ under the assumption that we have a probability distribution,

  $\sum_{i=1}^{N} p_i = 1,$

and under the assumption of a known average energy $\langle E \rangle$ of the system,

  $\langle E \rangle = \sum_{i=1}^{N} p_i E_i.$

12 Proof

We show that the Boltzmann distribution is uniquely obtained by solving:

  maximize the function $H(\vec p) = -\sum_{i=1}^{N} p_i \ln p_i$

under the side conditions

  $C_1(\vec p) = \sum_i p_i - 1 = 0$  and  $C_2(\vec p) = \sum_i p_i E_i - \langle E \rangle = 0$

by using the method of Lagrange multipliers. (Using ln instead of $\log_b$ is equivalent for the maximization.)

13 Proof Using Lagrange Multipliers

Following the trick of Lagrange, find the extreme values of

  $L(\vec p, \alpha, \beta) = H(\vec p) - \alpha C_1(\vec p) - \beta C_2(\vec p).$

By construction, $C_1(\vec p)$ and $C_2(\vec p)$ are partial derivatives:

  $\frac{\partial L(\vec p, \alpha, \beta)}{\partial \alpha} = -C_1(\vec p)$,  $\frac{\partial L(\vec p, \alpha, \beta)}{\partial \beta} = -C_2(\vec p)$

Thus the side conditions hold at the optimum, since there all partial derivatives are 0.

14 Proof (Ctd.) Partial Derivatives w.r.t. p_j

Furthermore, we need the partial derivatives with respect to $p_j$:

  $\frac{\partial L(\vec p, \alpha, \beta)}{\partial p_j} = \frac{\partial H(\vec p)}{\partial p_j} - \alpha \frac{\partial C_1(\vec p)}{\partial p_j} - \beta \frac{\partial C_2(\vec p)}{\partial p_j}$

  $= \frac{\partial}{\partial p_j}\Big({-\sum_{i=1}^{N} p_i \ln p_i}\Big) - \alpha \frac{\partial}{\partial p_j}\Big(\sum_i p_i - 1\Big) - \beta \frac{\partial}{\partial p_j}\Big(\sum_i p_i E_i - \langle E \rangle\Big)$

  $= -(\ln p_j + 1) - \alpha - \beta E_j$

15 Proof (Ctd.) Solve Equations

Finally, we need to solve the system

  $\sum_i p_i E_i - \langle E \rangle = 0$  (1)
  $\sum_i p_i - 1 = 0$  (2)
  for all j: $-(\ln p_j + 1) - \alpha - \beta E_j = 0$  (3)

Remarks
Solving (3) for $p_j$ and substituting into (2) yields a distribution of the same form as the Boltzmann distribution. We won't show how $\beta = (k_B T)^{-1}$ depends on $\langle E \rangle$.

16 Proof (Ctd.)

Equation (3) can be rewritten to: $\ln p_j = -\beta E_j - (\alpha + 1)$. Thus, by exponentiation on both sides,

  $p_j = \exp(-\beta E_j - \gamma) = \frac{\exp(-\beta E_j)}{\exp(\gamma)},$  (4)

where $\gamma = \alpha + 1$. By substituting (4) into (2), $\sum_i p_i - 1 = 0$, we get

  $1 = \sum_i \exp(-\beta E_i)/\exp(\gamma)$  and thus  $\exp(\gamma) = \sum_i \exp(-\beta E_i).$

We insert this in (4) and finally obtain

  $p_j = \frac{\exp(-\beta E_j)}{\sum_i \exp(-\beta E_i)}$  (5)
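The theorem can also be checked numerically: maximizing $H(\vec p)$ under the two side conditions with an off-the-shelf constrained optimizer recovers a distribution of Boltzmann form. A sketch with SciPy; the toy energies and the target $\langle E \rangle$ are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 4-state system with made-up energies and a target average energy <E>.
E = np.array([0.0, 1.0, 2.0, 4.0])
E_avg = 1.2

def neg_entropy(p):
    """Negative Shannon entropy sum_i p_i ln p_i (minimized => H maximized)."""
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0},      # sum p_i = 1
        {"type": "eq", "fun": lambda p: np.sum(p * E) - E_avg}]  # <E> fixed
res = minimize(neg_entropy, x0=np.full(4, 0.25),
               constraints=cons, bounds=[(0, 1)] * 4)
p_opt = res.x

# Read off beta from two states and compare with exp(-beta E)/Z:
beta = np.log(p_opt[0] / p_opt[1]) / (E[1] - E[0])
boltz = np.exp(-beta * E) / np.sum(np.exp(-beta * E))
print(np.round(p_opt, 4))
print(np.round(boltz, 4))  # agrees up to optimizer tolerance: Boltzmann form
```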

17 Partition Function

Recall: For probabilities, $\Pr[P \mid S] = \exp(-\beta E(P))/Z$, we need Z.

Definition
For an RNA sequence S, we call

  $Z := \sum_{P \text{ nc RNA structure for } S} \exp(-\beta E(P))$

the partition function (of the RNA ensemble P) of S.

Remark
Naive computation of Z is exponential, since the ensemble size is exponential in |S|.

18 Excursion: Counting of Structures

Computation of Z is similar to counting: the problem of computing the partition function is similar to counting the structures in the ensemble P. The partition function is a weighted sum; in counting, we weight every structure by 1.

How to count the nc RNA structures for S?
naive: enumerate (exponential)
efficient: DP with a decomposition a la Nussinov

Example: S = CGAGC for minimal loop length m = 0.

19 Counting of Structures: S=CGAGC

[empty DP matrix over the subsequences of $C_1 G_2 A_3 G_4 C_5$, to be filled on the next slide]

20 Counting of Structures: S=CGAGC

Subensembles $P_{ij}$ (entries as dot-bracket strings over positions i..j); row i lists $P_{ii}, \dots, P_{i5}$:

i=1 (C_1): {.} | {.., ()} | {..., ().} | {...., ().., (..)} | {....., ()..., (..)., .(..), ...(), ().()}
i=2 (G_2): {.} | {..} | {...} | {...., (..), ..()}
i=3 (A_3): {.} | {..} | {..., .()}
i=4 (G_4): {.} | {.., ()}
i=5 (C_5): {.}

21 Subensembles

Definition (Subensemble)
Define the ij-subensemble $P_{ij}$ of S (for $1 \le i \le j \le n$) as

  $P_{ij}$ := set of all nc RNA ij-substructures P of S,

where:

Definition (RNA Substructure)
An RNA structure P of S is called an ij-substructure of S iff $P \subseteq \{i, \dots, j\}^2$.

Remarks
Example (see last slide): $P_{14} = \{\{\}, \{(1,2)\}, \{(1,4)\}\}$, $P_{15} = \{\{\}, \{(1,2)\}, \{(1,4)\}, \{(2,5)\}, \{(4,5)\}, \{(1,2),(4,5)\}\}$.
Ensemble P of S: $P = P_{1n}$.
$P_{ij} = \{\{\}\}$ for $j < i + m$ (minimal loop size m).

22 Efficient Counting of Structures

Define: $C_{ij} := |P_{ij}|$ ("DP matrix C").

Computation of $C_{ij}$:

for $j - i \le m$: $C_{ij} = 1$, since $P_{ij} = \{\{\}\}$

for $j - i > m$: recurse! $P_{ij}$ consists of
  the structures in $P_{i,j-1}$ ("j unpaired"), and
  the structures in $P_{i,k-1} \otimes P_{k+1,j-1} \otimes \{\{(k,j)\}\}$ ("k, j paired"),

where $\otimes$ combines all structures in one set with all structures in a second set:

Define: $P \otimes Q := \{P \cup Q \mid P \in P,\, Q \in Q\}$.

23 Computation of C_ij

for $j - i > m$:

  $P_{ij} = P_{i,j-1} \;\cup \bigcup_{\substack{i \le k < j-m \\ S_k, S_j \text{ compl.}}} P_{i,k-1} \otimes P_{k+1,j-1} \otimes \{\{(k,j)\}\}$

This means for $C_{ij}$ (recall $C_{ij} = |P_{ij}|$):

  $C_{ij} = C_{i,j-1} + \sum_{\substack{i \le k < j-m \\ S_k, S_j \text{ compl.}}} C_{i,k-1} \cdot C_{k+1,j-1} \cdot 1$

Remarks
By DP: compute the ensemble size $C_{1n}$ in $O(n^3)$ time and $O(n^2)$ space.
Why does $\cup$ translate to $+$ and $\otimes$ to $\cdot$? All unions were disjoint! I.e.:
1.) the cases in "$P_{ij}$ consists of ..." are disjoint
2.) the structures combined by $\otimes$ are disjoint
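A direct Python transcription of the counting recursion (a sketch; the complementarity set and the 1-based indexing conventions are my own). On the running example S = CGAGC with m = 0 it reproduces $C_{15} = 6$.

```python
def count_structures(S, m=0):
    """Count the non-crossing structures of S via the C recursion above."""
    compl = {("A", "U"), ("U", "A"), ("C", "G"),
             ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(S)
    # C[i][j] = |P_ij|, 1-based; empty and short ranges hold 1 (open chain only)
    C = [[1] * (n + 2) for _ in range(n + 2)]
    for span in range(m + 1, n):            # increasing subsequence length
        for i in range(1, n - span + 1):
            j = i + span
            total = C[i][j - 1]             # case: j unpaired
            for k in range(i, j - m):       # case: (k, j) paired
                if (S[k - 1], S[j - 1]) in compl:
                    total += C[i][k - 1] * C[k + 1][j - 1]
            C[i][j] = total
    return C[1][n]

print(count_structures("CGAGC", m=0))  # 6, matching P_15 on the slides
```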

24 Example

Decompose the sequence $S_{15} = C_1 G_2 A_3 G_4 C_5$:

1.) subsequence $C_1 G_2 A_3 G_4$ and $C_5$ unpaired:
  $P_{15} \supseteq P_{14}$, which contributes $C_{14}$ to $C_{15}$

2a.) k = 2: $C_1$, $A_3 G_4$, base pair (2,5):
  $P_{15} \supseteq P_{11} \otimes P_{34} \otimes \{\{(2,5)\}\}$, contributing $C_{11} \cdot C_{34} \cdot 1$

2b.) k = 4: $C_1 G_2 A_3$, base pair (4,5):
  $P_{15} \supseteq P_{13} \otimes P_{54} \otimes \{\{(4,5)\}\}$, contributing $C_{13} \cdot C_{54} \cdot 1$

ad 2b.): $P_{13} \otimes P_{54} \otimes \{\{(4,5)\}\} = \{\{\}, \{(1,2)\}\} \otimes \{\{\}\} \otimes \{\{(4,5)\}\} = \{\{(4,5)\}, \{(1,2),(4,5)\}\}$


26 Counting vs. Structure Prediction

Counting
  init: $C_{ij} = 1$ ($j - i \le m$)
  recurse: $C_{ij} = C_{i,j-1} + \sum_{\substack{i \le k < j-m \\ S_k, S_j \text{ compl.}}} C_{i,k-1} \cdot C_{k+1,j-1} \cdot 1$

Prediction
  init: $N_{ij} = 0$ ($j - i \le m$)
  recurse: $N_{ij} = \max\Big\{N_{i,j-1},\ \max_{\substack{i \le k < j-m \\ S_k, S_j \text{ compl.}}} N_{i,k-1} + N_{k+1,j-1} + 1\Big\}$

Remarks
Translation Prediction to Counting: $\max \to +$, $+ \to \cdot$ (see the sketch below).
This is only possible since the sets are disjoint, i.e., disjoint cases (no "ambiguity") and a non-overlapping decomposition in each single case.
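For contrast, the prediction recursion in the same style as the counting sketch above; again an illustrative transcription, not an official Nussinov implementation.

```python
def max_pairs(S, m=0):
    """Nussinov-style maximum number of base pairs (the N recursion above)."""
    compl = {("A", "U"), ("U", "A"), ("C", "G"),
             ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(S)
    N = [[0] * (n + 2) for _ in range(n + 2)]
    for span in range(m + 1, n):
        for i in range(1, n - span + 1):
            j = i + span
            best = N[i][j - 1]                      # j unpaired
            for k in range(i, j - m):               # (k, j) paired
                if (S[k - 1], S[j - 1]) in compl:
                    best = max(best, N[i][k - 1] + N[k + 1][j - 1] + 1)
            N[i][j] = best
    return N[1][n]

print(max_pairs("CGAGC"))  # 2, e.g. realized by the structure ().()
```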

27 Back to Computing the Partition Function

Recall: For probabilities, $\Pr[P \mid S] = \exp(-\beta E(P))/Z$, we need Z.

We defined: $Z := \sum_{P \in P} \exp(-\beta E(P))$

We claimed: The problem of computing the partition function is similar to counting the structures in the ensemble P. The partition function is a weighted sum; in counting, we weight structures by 1.

Definition (Partition Function of a Set of Structures)
In analogy to $C_{ij} = |P_{ij}| = \sum_{P \in P_{ij}} 1$, define the partition function $Z_P$ for a set of RNA structures P of S by

  $Z_P := \sum_{P \in P} \exp(-\beta E(P)).$

Idea: compute the $Z_{P_{ij}}$ recursively, efficiently by DP.

28 Disjoint Decomposition: when to add?

Definition (Disjoint Sets)
Two sets of RNA structures $P_1$ and $P_2$ are (structurally) disjoint iff $P_1 \cap P_2 = \{\}$.

Proposition (Disjoint Decomposition)
Let $P$, $P_1$, and $P_2$ be sets of structures of an RNA sequence S. If $P_1$ and $P_2$ are structurally disjoint and $P = P_1 \cup P_2$, then

  $Z_P = Z_{P_1} + Z_{P_2}.$

29 Proof

Proof.
  $Z_P = \sum_{P \in P} \exp(-\beta E(P))$
  $\;= \sum_{P \in P_1 \cup P_2} \exp(-\beta E(P))$
  $\;\overset{\text{disjoint}}{=} \sum_{P \in P_1} \exp(-\beta E(P)) + \sum_{P \in P_2} \exp(-\beta E(P))$
  $\;= Z_{P_1} + Z_{P_2}$

30 Independent Decomposition: when to multiply?

Definition (Independent Sets)
Let S be an RNA sequence. Two sets of nc RNA structures $P_1$ and $P_2$ for S are structurally independent iff for all $P_1 \in P_1$ and $P_2 \in P_2$:
1. $P_1 \cap P_2 = \{\}$.
2. each loop/secondary structure element of the RNA structure $P = P_1 \cup P_2$ is either a loop of $P_1$ or one of $P_2$.

Proposition (Independent Decomposition)
Let $P_1$ and $P_2$ be structurally independent sets of nc RNA structures for an RNA sequence S, and $P = P_1 \otimes P_2$. Then:

  $Z_P = Z_{P_1} \cdot Z_{P_2}$

Remark: Condition (1) suffices for energy functions based on scoring base pairs (as in Nussinov). For loop-based energy models, we need (2), which implies $E(P_1 \cup P_2) = E(P_1) + E(P_2)$.

31 Proof

Proof.
  $Z_P = \sum_{P \in P} \exp(-\beta E(P))$
  $\;\overset{\text{indep.(1)}}{=} \sum_{P_1 \in P_1,\, P_2 \in P_2} \exp(-\beta E(P_1 \cup P_2))$
  $\;\overset{\text{indep.(2)}}{=} \sum_{P_1 \in P_1,\, P_2 \in P_2} \exp(-\beta (E(P_1) + E(P_2)))$
  $\;= \sum_{P_1 \in P_1} \sum_{P_2 \in P_2} \exp(-\beta E(P_1)) \exp(-\beta E(P_2))$
  $\;= \sum_{P_1 \in P_1} \exp(-\beta E(P_1)) \sum_{P_2 \in P_2} \exp(-\beta E(P_2))$
  $\;= \sum_{P_1 \in P_1} \exp(-\beta E(P_1))\, Z_{P_2}$
  $\;= Z_{P_1} \cdot Z_{P_2}$

32 Adding and Multiplying of Partition Functions

...in the same way as for counts!

Counting
  init: $C_{ij} = 1$ ($j - i \le m$)
  recurse: $C_{ij} = C_{i,j-1} + \sum_{\substack{i \le k < j-m \\ S_k, S_j \text{ compl.}}} C_{i,k-1} \cdot C_{k+1,j-1} \cdot 1$

Partition Function (Nussinov Style)
  init: $Z^N_{P_{ij}} = 1$ ($j - i \le m$)
  recurse: $Z^N_{P_{ij}} = Z^N_{P_{i,j-1}} + \sum_{\substack{i \le k < j-m \\ S_k, S_j \text{ compl.}}} Z^N_{P_{i,k-1}} \cdot Z^N_{P_{k+1,j-1}} \cdot \exp(-\beta\, E(\text{basepair}))$

Remarks
$E(\text{basepair})$: e.g. -1, or depending on $S_i$ and $S_j$ for base pair (i, j).
This partition function variant of the Nussinov algorithm cannot compute the partition function for the loop-based energy model(!)
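The same DP skeleton turned into the Nussinov-style partition function; a sketch assuming a constant base pair energy E_bp and RT at 37 °C. Setting E_bp = 0 makes every Boltzmann weight 1, which recovers the structure count and so provides a quick sanity check.

```python
import math

def partition_function(S, m=0, E_bp=-1.0, RT=0.61633):
    """Nussinov-style partition function Z^N under the base pair energy model.

    Assumptions: every base pair contributes E_bp kcal/mol; RT = k_B*T in
    kcal/mol at 37 C. The DP is identical to counting, except that the
    pair case is weighted by exp(-beta * E(basepair)).
    """
    compl = {("A", "U"), ("U", "A"), ("C", "G"),
             ("G", "C"), ("G", "U"), ("U", "G")}
    beta = 1.0 / RT
    n = len(S)
    Z = [[1.0] * (n + 2) for _ in range(n + 2)]
    for span in range(m + 1, n):
        for i in range(1, n - span + 1):
            j = i + span
            total = Z[i][j - 1]                     # j unpaired
            for k in range(i, j - m):               # (k, j) paired
                if (S[k - 1], S[j - 1]) in compl:
                    total += Z[i][k - 1] * Z[k + 1][j - 1] * math.exp(-beta * E_bp)
            Z[i][j] = total
    return Z[1][n]

print(partition_function("CGAGC"))           # Boltzmann-weighted ensemble size
print(partition_function("CGAGC", E_bp=0))   # 6.0 = plain structure count
```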

33 Way to the RNA Partition Function

Partition function: adding/multiplying like in counting. Attention: only for disjoint/independent sets!

Loop energy model (Zuker): how to decompose the structure space, and how to compute the energies (as a sum of loop energies).

What next? Develop recursions for the partition function using real RNA energies. Plan: rewrite the Zuker algorithm into its partition function variant.

What is missing? Is Zuker's decomposition of the structure space disjoint? independent?

34 Zuker to Partition Function Variant: Example

  $W_{ij} = \min \begin{cases} W_{i,j-1} \\ \min_{i \le k < j-m} W_{i,k-1} + V_{kj} \end{cases}$

  $Z_{P_{ij}} = Z_{P_{i,j-1}} + \sum_{i \le k < j-m} Z_{P_{i,k-1}} \cdot Z_{P^b_{kj}}$

Is this translation justified, i.e., is the decomposition disjoint and independent?

36 Decomposition by Zuker

  $W_{ij} = \min \begin{cases} W_{i,j-1} \\ \min_{i \le k < j-m} W_{i,k-1} + V_{kj} \end{cases}$

  $V_{ij} = \min \begin{cases} eh(i,j) \\ \min_{i < i' < j' < j} V_{i'j'} + esbi(i,j,i',j') \\ \min_{i < k < j} WM_{i+1,k-1} + WM_{k,j-1} + a \end{cases}$

  $WM_{ij} = \min \begin{cases} WM_{i,j-1} + c,\ WM_{i+1,j} + c,\ V_{ij} + b \\ \min_{i < k < j} WM_{i,k-1} + WM_{kj} \end{cases}$

Remark: Combine the energy functions for stacking and interior loops:

  $esbi(i,j,i',j') := \begin{cases} es(i,j) & i' = i+1 \wedge j' = j-1 \\ el(i,j,i',j') & \text{otherwise} \end{cases}$


40 Decomposition of Structure Space by Zuker

[diagrams, one per matrix:]
W: either position j is unpaired ($W_{i,j-1}$), or some k pairs with j, splitting into $W_{i,k-1}$ and the closed structure $V_{kj}$.
V: a closed structure (i,j) is a hairpin loop, or encloses a next inner pair (i',j') (stacking/bulge/interior loop), or closes a multiloop split into $WM_{i+1,k-1}$ and $WM_{k,j-1}$.
WM: a multiloop fragment leaves j or i unpaired, is a single closed structure $V_{ij}$, or splits into $WM_{i,k-1}$ and $WM_{kj}$.

41 Ambiguity Example

[animation: the WM case generates the same structure in several ways. Unpaired bases can be trimmed from the right (j) or from the left (i) in different orders, and a concatenation of closed structures can be split at different positions k. The cases overlap, so the decomposition is ambiguous.]

For minimization this is harmless; for counting and partition functions, every structure must be generated exactly once.

48 Discarding Ambiguity

[diagram: the ambiguous split of WM into two WM parts]

Idea for a unique choice: always cut at the rightmost base pair:

[diagram: WM splits into $WM_{i,k-1}$ and a fragment k..j whose rightmost base pair is (k,l), with l+1..j unpaired]

Efficient way to do it, with an additional matrix (WM1):

[diagrams: $WM_{ij}$ = an unpaired run i..k-1 followed by $WM1_{kj}$, or $WM_{i,k-1}$ followed by $WM1_{kj}$; $WM1_{ij}$ = $WM1_{i,j-1}$ with j unpaired, or the closed structure (i,j), so a WM1 fragment starts with a base pair at position i]

50 Discarding Ambiguity

Ambiguity in the case for closed structures (the multiloop case of V):

[diagram: $V_{ij}$ closing a multiloop split into $WM_{i+1,k-1}$ and $WM_{k,j-1}$; the split position k is not unique]

[resolution diagram: split into $WM_{i+1,k-1}$ and $WM1_{k,j-1}$, so the right part starts with a base pair at k; the cut is fixed at the start of the rightmost inner base pair]

52 Unambiguous Decomposition: Summary

[diagrams of the unambiguous recursions:]
W: j unpaired, or split into $W_{i,k-1}$ and the closed structure (k,j).
V (closed by (i,j)): hairpin loop; or inner pair (i',j') (stacking/bulge/interior loop); or multiloop split into $WM_{i+1,k-1}$ and $WM1_{k,j-1}$.
WM: unpaired run i..k-1 followed by $WM1_{kj}$, or $WM_{i,k-1}$ followed by $WM1_{kj}$.
WM1: j unpaired ($WM1_{i,j-1}$), or the closed structure (i,j).

53 McCaskill Algorithm: Matrices

Given an RNA sequence $S = S_1 \dots S_n$. As usual, $P_{ij}$ denotes the ij-subensemble of S.

Matrix $Q = (Q_{ij})_{1 \le i < j \le n}$: $Q_{ij} := Z_{P_{ij}}$

Matrix $Q^b = (Q^b_{ij})_{1 \le i < j \le n}$: $Q^b_{ij} := Z_{P^b_{ij}}$, where $P^b_{ij} := \{P \in P_{ij} \mid (i,j) \in P\}$

Matrix $Q^m = (Q^m_{ij})_{1 \le i < j \le n}$: $Q^m_{ij} := Z^m_{P^m_{ij}}$, where $P^m_{ij} := \{P \in P_{ij} \mid P \text{ non-empty}\}$ and $Z^m$ includes the multiloop contributions (for unpaired bases and inner base pairs), cf. the earlier modification $E^m$.

Matrix $Q^{m1} = (Q^{m1}_{ij})_{1 \le i < j \le n}$: $Q^{m1}_{ij} := Z^m_{P^{m1}_{ij}}$, where $P^{m1}_{ij} := \{P \in P_{ij} \mid \exists k: (i,k) \in P \text{ and } k+1, \dots, j \text{ unpaired in } P\}$.

55 Q-recursion

[diagram: j unpaired, or split at the rightmost closed structure (k,j)]

  $Q_{ij} = 1$ ($j - i \le m$)
  $Q_{ij} = Q_{i,j-1} + \sum_{i \le k < j-m} Q_{i,k-1} \cdot Q^b_{kj}$

56 Q^b-recursion

[diagram: hairpin loop; stacking/bulge/interior loop with inner pair (i',j'); multiloop]

  $Q^b_{ij} = 0$ ($j - i \le m$)
  $Q^b_{ij} = \exp(-\beta\, eh(i,j)) + \sum_{i < i' < j' < j} \exp(-\beta\, esbi(i,j,i',j'))\, Q^b_{i'j'} + \sum_{i < k < j-m-1} Q^m_{i+1,k-1}\, Q^{m1}_{k,j-1}\, \exp(-\beta a)$

57 Q^m-recursion and Q^m1-recursion

[diagram: an unpaired run or a Q^m part, followed by a Q^m1 part]

  $Q^m_{ij} = 0$ ($j - i \le m$)
  $Q^m_{ij} = \sum_{i \le k < j-m} \big(Q^m_{i,k-1} + \exp(-\beta (k-i) c)\big)\, Q^{m1}_{kj}$

[diagram: rightmost pair (i,k), positions k+1..j unpaired]

  $Q^{m1}_{ij} = 0$ ($j - i \le m$)
  $Q^{m1}_{ij} = \exp(-\beta b) \sum_{i+m < k \le j} \exp(-\beta (j-k) c)\, Q^b_{ik}$

58 McCaskill Partition Function: Summary

  $Q_{ij} = 1$ ($j - i \le m$)
  $Q_{ij} = Q_{i,j-1} + \sum_{i \le k < j-m} Q_{i,k-1}\, Q^b_{kj}$

  $Q^b_{ij} = 0$ ($j - i \le m$)
  $Q^b_{ij} = \exp(-\beta\, eh(i,j)) + \sum_{i < i' < j' < j} \exp(-\beta\, esbi(i,j,i',j'))\, Q^b_{i'j'} + \sum_{i < k < j-m-1} Q^m_{i+1,k-1}\, Q^{m1}_{k,j-1}\, \exp(-\beta a)$

  $Q^m_{ij} = 0$ ($j - i \le m$)
  $Q^m_{ij} = \sum_{i \le k < j-m} \big(\exp(-\beta (k-i) c) + Q^m_{i,k-1}\big)\, Q^{m1}_{kj}$

  $Q^{m1}_{ij} = 0$ ($j - i \le m$)
  $Q^{m1}_{ij} = \exp(-\beta b) \sum_{i+m < k \le j} \exp(-\beta (j-k) c)\, Q^b_{ik}$
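For illustration, the four recursions transcribed into Python. The loop energy parameters (eh, es, el, a, b, c) are toy stand-ins for real Turner parameters, and interior loop sizes are not bounded here, so this sketch runs in O(n^4) rather than O(n^3). With all parameters set to 0, every Boltzmann weight is 1 and the partition function reduces to the structure count, which gives a handy correctness check.

```python
import math

COMPL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def mccaskill(S, m=0, RT=0.61633, eh=0.0, es=0.0, el=0.0, a=0.0, b=0.0, c=0.0):
    """McCaskill partition function with the four recursions above.

    Toy energies (all illustrative): eh = hairpin, es = stack, el = other
    stacking/bulge/interior loops, a/b/c = multiloop penalties. 1-based
    indices; within each span we fill Q^b, then Q^m1, then Q^m, then Q.
    """
    beta = 1.0 / RT
    n = len(S)
    Q   = [[1.0] * (n + 2) for _ in range(n + 2)]   # Q_ij = 1 for j - i <= m
    Qb  = [[0.0] * (n + 2) for _ in range(n + 2)]
    Qm  = [[0.0] * (n + 2) for _ in range(n + 2)]
    Qm1 = [[0.0] * (n + 2) for _ in range(n + 2)]
    pair = lambda i, j: (S[i - 1], S[j - 1]) in COMPL
    for span in range(m + 1, n):
        for i in range(1, n - span + 1):
            j = i + span
            # Q^b: hairpin + stacking/bulge/interior + multiloop
            if pair(i, j):
                qb = math.exp(-beta * eh)
                for ip in range(i + 1, j):
                    for jp in range(ip + 1, j):
                        e = es if (ip == i + 1 and jp == j - 1) else el
                        qb += math.exp(-beta * e) * Qb[ip][jp]
                for k in range(i + 1, j - m - 1):
                    qb += Qm[i + 1][k - 1] * Qm1[k][j - 1] * math.exp(-beta * a)
                Qb[i][j] = qb
            # Q^m1: rightmost pair (i,k), positions k+1..j unpaired
            Qm1[i][j] = math.exp(-beta * b) * sum(
                math.exp(-beta * (j - k) * c) * Qb[i][k]
                for k in range(i + m + 1, j + 1))
            # Q^m: left part all unpaired or non-empty, right part Q^m1
            Qm[i][j] = sum(
                (math.exp(-beta * (k - i) * c) + Qm[i][k - 1]) * Qm1[k][j]
                for k in range(i, j - m))
            # Q: j unpaired, or rightmost closed structure (k,j)
            Q[i][j] = Q[i][j - 1] + sum(
                Q[i][k - 1] * Qb[k][j] for k in range(i, j - m))
    return Q[1][n]

print(mccaskill("CGAGC"))           # 6.0: all-zero energies reduce Z to counting
print(mccaskill("CGAGC", es=-2.0))  # stacking stabilizes structures: Z grows
```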

59 McCaskill: Remarks

Partition function of the ensemble of S: $Z = Q_{1n}$.
Correctness due to the disjoint (= unambiguous) and independent decomposition.
Complexity: $O(n^2)$ space, $O(n^3)$ time (after bounding the size of interior loops).

Probabilities
of a structure: $\Pr[P \mid S] = Z^{-1} \exp(-\beta E(P))$
of a structure set: $\Pr[P' \mid S] = Z^{-1} \sum_{P \in P'} \exp(-\beta E(P))$
of a base pair (i, j), efficient?: $\Pr[(i,j) \mid S] = Z^{-1} \sum_{P \ni (i,j)} \exp(-\beta E(P))$ (naively this depends on the whole ensemble P)

60 Base Pair Probabilities

[figure: base pair probability dot plot (dot.ps) of the tRNA GGGGGUAUAGCUCAGGGGUAGAGCAUUUGACUGCAGAUCAAGAGGUCCCUGGUUCAAAUCCAGGUGCCCCCU]

  $\Pr[(i,j) \mid S] := \sum_{P \ni (i,j)} \Pr[P \mid S]$
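Before turning to the efficient computation, the definition can be evaluated literally on tiny sequences: enumerate the ensemble via the unambiguous decomposition and sum the Boltzmann weights of all structures containing each pair. A brute-force sketch under the simple base pair energy model (not the loop-based model McCaskill uses); all names and parameters are illustrative.

```python
import math

COMPL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def enum(S, i, j, m=0):
    """All nc structures of S[i..j] (1-based) as frozensets of pairs,
    via the unambiguous decomposition (exponential; tiny inputs only)."""
    if j - i <= m:
        return [frozenset()]
    out = list(enum(S, i, j - 1, m))                 # j unpaired
    for k in range(i, j - m):                        # (k, j) paired
        if (S[k - 1], S[j - 1]) in COMPL:
            for L in enum(S, i, k - 1, m):
                for R in enum(S, k + 1, j - 1, m):
                    out.append(L | R | {(k, j)})
    return out

def bp_probs(S, E_bp=-1.0, RT=0.61633):
    """Pr[(k,l)|S] by brute force, with E(P) = E_bp * |P| (simple pair model)."""
    beta = 1.0 / RT
    ensemble = enum(S, 1, len(S))
    weights = [math.exp(-beta * E_bp * len(P)) for P in ensemble]
    Z = sum(weights)
    probs = {}
    for P, w in zip(ensemble, weights):
        for pair in P:
            probs[pair] = probs.get(pair, 0.0) + w / Z
    return probs

print(len(enum("CGAGC", 1, 5)))   # 6 structures, as counted before
print(bp_probs("CGAGC"))          # e.g. Pr[(1,2)|S], Pr[(4,5)|S], ...
```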

61 McCaskill: Efficient Base Pair Probabilities

Idea: Compute $p_{kl} := \Pr[(k,l) \mid S]$ recursively (DP!), recursing from long base pairs (outside) to short ones (inside).

1) simple case (external base pair)

Definition (Probability of an external base pair)
$p^E_{kl} := \Pr[P^E_{kl}]$, where $P^E_{kl} := \{P \in P \mid (k,l) \text{ is an external base pair in } P\}$, and (k,l) is an external base pair in P iff $(k,l) \in P$ and there is no $(i,j) \in P$ with $i < k < l < j$.

  $Z_{P^E_{kl}} = Q_{1,k-1}\, Q^b_{kl}\, Q_{l+1,n}$

  $p^E_{kl} = \frac{Z_{P^E_{kl}}}{Z} = \frac{Q_{1,k-1}\, Q^b_{kl}\, Q_{l+1,n}}{Q_{1n}}$

62 McCaskill: Efficient Base Pair Probabilities

2) general case
a) (k,l) is an external base pair
b) (k,l) limits a stacking, bulge, or interior loop closed by (i,j)
c) (k,l) is an inner base pair of a multiloop closed by (i,j)

63 McCaskill: Efficient Base Pair Probabilities

2) general case, b) (k,l) limits a stacking, bulge, or interior loop closed by (i,j):

  $p^{SBI}_{kl}(i,j) := p_{ij} \cdot \Pr[\text{loop } i,j,k,l \mid (i,j)]$
  $\;= p_{ij} \cdot \frac{Z_{\{P \in P_{ij} \mid P \text{ has loop } i,j,k,l\}}}{Z_{P^b_{ij}}}$
  $\;= p_{ij} \cdot \frac{\exp(-\beta\, esbi(i,j,k,l))\, Q^b_{kl}}{Q^b_{ij}}$

64 McC: Base Pair Probabilities, Multiloop Case

2) general case, c) (k,l) is an inner base pair of a multiloop closed by (i,j):

  $p^M_{kl}(i,j) := p_{ij} \cdot \Pr[\text{multiloop with inner base pair } (k,l) \text{ closed by } (i,j) \mid (i,j)]$

Three cases for the position of (k,l) in the multiloop:
(i) (k,l) leftmost base pair: $Q^b_{kl}\, Q^m_{l+1,j-1}\, \exp(-\beta(a + b + (k-i-1)c))$
(ii) (k,l) middle base pair: $Q^m_{i+1,k-1}\, Q^b_{kl}\, Q^m_{l+1,j-1}\, \exp(-\beta(a + b))$
(iii) (k,l) rightmost base pair: $Q^m_{i+1,k-1}\, Q^b_{kl}\, \exp(-\beta(a + b + (j-l-1)c))$

66 McC Multiloop Case (Ctd.)

Recall: $p^M_{kl}(i,j) := p_{ij} \cdot \Pr[\text{multiloop with inner base pair } (k,l) \text{ closed by } (i,j) \mid (i,j)]$.

Putting the three cases of the (k,l) position together:

  $p^M_{kl}(i,j) = p_{ij}\,\big[\, Q^b_{kl}\, Q^m_{l+1,j-1}\, \exp(-\beta(a+b+(k-i-1)c)) + Q^m_{i+1,k-1}\, Q^b_{kl}\, Q^m_{l+1,j-1}\, \exp(-\beta(a+b)) + Q^m_{i+1,k-1}\, Q^b_{kl}\, \exp(-\beta(a+b+(j-l-1)c)) \,\big]\, (Q^b_{ij})^{-1}$

67 McCaskill Base Pair Probabilities: Summary

  $p_{kl} = p^E_{kl} + \sum_{i < k,\, l < j} p^{SBI}_{kl}(i,j) + \sum_{i < k,\, l < j} p^M_{kl}(i,j)$

Remarks
The recursive formula for $p_{kl}$ furnishes a DP.
Efficient calculation of all $p_{kl}$ in $O(n^4)$ time / $O(n^2)$ space; a time reduction to $O(n^3)$ is possible (not shown, but you learned the trick).
The algorithm given by the $p_{kl}$ recursion is an "outside" algorithm; in contrast, the algorithm for computing Z and the $Q_{ij}$ is "inside". For getting the probabilities, we combined inside and outside.
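In practice these quantities come from McCaskill implementations such as the ViennaRNA package (the library behind RNAfold -p used earlier). A usage sketch, assuming the ViennaRNA Python bindings are installed; the printed pair index is just an example.

```python
# Requires the ViennaRNA package with Python bindings (e.g. pip install ViennaRNA).
import RNA

seq = "GGGGGUAUAGCUCAGGGGUAGAGCAUUUGACUGCAGAUCAAGAGGUCCCUGGUUCAAAUCCAGGUGCCCCCU"
fc = RNA.fold_compound(seq)
mfe_structure, mfe = fc.mfe()   # Zuker-style minimum free energy structure
structure, dG = fc.pf()         # McCaskill: ensemble free energy -RT ln Z
bpp = fc.bpp()                  # base pair probability matrix (1-based)
print(mfe_structure, mfe)
print(dG)
print(bpp[1][72])               # e.g. Pr[(1,72) | S]
```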

68 Summary Part I

Algorithms: Nussinov, Zuker, McCaskill

Common: $O(n^3)$ time, $O(n^2)$ space; non-crossing structures (= no "pseudoknots")

Differences:
realism of the energy model: base pairs (Nussinov) vs. loop-based free energy (Zuker, McCaskill)
objective: mfe structure (Nussinov, Zuker) vs. ensemble (McCaskill)

69 Next? Comparing RNAs
