Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space Pablo Gainza CPS 296: Topics in Computational Structural Biology Department of Computer Science Duke University 1
Outline 1) Problem definition 2) Formulation as an inference problem ) Graphical Models 4) tbmmf algorithm 5) Results 6) Conclusions 2
1. Problem Definition Rotamer Library Energy f(x) Positions to design & allowed Rotamers/Amino Acids Protein structure Protein design algorithm www.cs.duke.edu/donaldlab GMEC 2 S
1. Problem Definition (2) r! Rotamer assignment (RA) 1 r 1 2 r i! Rotamer at position i for RA r 4
1. Problem Definition () E i (r i )! Energy between rotamer r i and xed backbone E ij (r i ; r j )! Energy between rotamers r i and r j E i (r 1 ) E ij (r 1 ; r 2 ) 1 2 E(r)! Energy of rotamer assignment r E(r) = X i E i (r i ) + X i;j E ij (r i ; r j ) 5
1. Problem Definition (4) T(k)! returns amino acid type of rotamer k T(r)! returns sequence of rotamer assignment r r 1 2 T(r 1 ) = hexagon T(r 2 ) = cross T(r) = hexagon; cross 6
1. Problem Definition (5) Rotamer Library Energy f(x) Positions to design & allowed Rotamers/Amino Acids Protein structure Protein design algorithm www.cs.duke.edu/donaldlab GMEC 2 S S = T(arg min r E(r)) 7
1. Problem Definition (6) Related Work DEE / A* BroMAP BWM Exact Methods Global Minimum Energy Conformation Low energy conformation Probabilistic Methods SCMF MCSA 8
1. Problem Definition (7) Rotamer Library Energy f(x) Positions to design & allowed Rotamers/Amino Acids Protein structure Model Inaccurate! www.cs.duke.edu/donaldlab 9
1. Problem Definition (8) Protein design algorithm Algorithm Fast or provable S = T(arg min r E(r)) 10
1. Problem Definition (9) Too stable Not fold to target Low binding specificity Low energy conformation 11
1. Problem Definition (10) Provable Methods DEE/A* Ordered set of gap-free low energy conformations, including GMEC Set of low energy conformations Solution: Find a set of low energy sequences Probabilistic Methods tbmmf 12
Problem Definition: Summary Protein design algorithms search for the sequence with the Global Minimum Energy Conformation (GMEC). Our model is inaccurate: more than one low energy sequence is desirable. Fromer et al. Propose tbmmf to generate a set of low energy sequences. 1
2. Our problem as an inference problem Probabilistic factor for self-interactions à i (r i ) = e E i (r i ) T Probabilistic factor for pairwise interactions à ij (r i ; r j ) = e E ij (r i ;r j ) T 14
2. Inference problem (2) Partition function Z = X r e E(r) T Probability distribution for rotamer assignment P (r 1 ; :::; r N ) = 1 Z Y i à i (r i ) Y i;j à ij (r i ; r j ) = 1 Z e E(r) T r 15
2. Inference problem () Minimization goal (from definition) S = T(arg min r E(r)) Minimization goal for a graphical model problem S = T(arg max r P r(r)) 16
2. Inference problem (4) Position #2 E ij (r 1 ; r 2 ) Position #1 Allowed -4-2 r 0 1 Example: Inference problem 2 E i (r 1 ) E i (r 2 ) r 00-1 -5 - E(r 0 ) =? E(r 00 ) =? What is our GMEC?? 17
2. Inference problem (5) Position #2 E ij (r 1 ; r 2 ) Position #1 Allowed -4-2 r 0 1 2 E i (r 1 ) E i (r 2 ) r 00-1 -5 - E(r 0 ) = ( 1 + 2) + ( 5 + 2) = 10 E(r 00 ) = ( 1 + 4) + ( + 4) = 12 r 00 is our GMEC 18
2. Inference problem (6) Position #2 E ij (r 1 ; r 2 ) Position #1 Allowed -4-2 r 0 1 2 E i (r 1 ) E i (r 2 ) -1-5 - T = 1 (for our example) r 00 à i (r1 0 ) = e E i (r T 0 1 ) = e à i (r2) 0 = e E i (r 0 2 ) T = e 5 à i (r 00 1 E i(r ) = e T à i (r 00 2 ) = e 00 1 ) = e E 00 i(r2 ) T = e 19
2. Inference problem (7) Position #2 E ij (r 1 ; r 2 ) Position #1 Allowed -4-2 r 0 1 2 E i (r 1 ) E i (r 2 ) r 00-1 -5 - Ã ij (r 0 1 ; r0 2 ) = e E ij (r0 1 ;r0 2 ) T = e 2 Ã ij (r1 00 ; r2 00 ) = e E ij (r00 1 ;r00 2 ) T = e 4 T = 1 (for our example) Z = X r e E(r) T = e 10 + e 12 20
2. Inference problem (8) Position #2 E ij (r 1 ; r 2 ) Position #1 Allowed -4-2 r 0 1 2 E i (r 1 ) E i (r 2 ) r 00-1 -5 P (r 0 1; r 0 2) = 1 Z Y à i (r 0 i) Y i;j à ij (r 0 i; r 0 j) - = e 10 e 10 + e 12 i T = 1 (for our example) 21
2. Inference problem (9) Position #2 E ij (r 1 ; r 2 ) Position #1 Allowed -4-2 r 0 1 2 E i (r 1 ) E i (r 2 ) r 00-1 -5 - P (r 00 1 ; r 00 2 ) = 1 Z = e 12 e 10 + e 12 Y i à i (r 0 i) Y i;j à ij (r 00 i ; r 00 j ) T = 1 (for our example) 22
2. Inference problem (10) Position #2 E ij (r 1 ; r 2 ) Position #1 Allowed -4-2 r 0 1 2 E i (r 1 ) E i (r 2 ) r 00-1 -5 - S = T(arg max r P r(r)) S = T(r 00 ) T = 1 (for our example) 2
2. Inference problem (11) Minimization goal (from definition) S = T(arg min r E(r)) Minimization goal for a graphical model problem S = T(arg max r P r(r)) We still have a non-polynomial problem! But formulated as an inference problem Probabilistic methods 24
Summary: Inference problem We model our problem as an inference problem. We can use probabilistic methods to solve it. 25
. Graphical models for protein design and belief propagation (BP) 1. Model each design position as a random variable 2. Build interaction graph that shows conditional independence between variables Source: Fromer M, Yanover, C. Proteins (2008) SspB dimer interface: Inter-monomeric interactions (Cα) 26
. Graphical Models/BP (2) Example: Belief propagation 2 r 0 r 00 2 node in the graphical model: interacting residue in the structure. 1 1 node in the graphical model: random variable 27
. Graphical Models/BP () Example: Belief propagation b r 0 r 00 2 2 edge: energy interaction between two residues. edge: causal relationship between two nodes 1 1 4 4 If two residues are distant from each other, no edge between them. 28
. Graphical Models/BP (4) Example: Belief propagation r 0 r 0 2 2 r 0 2 r 0 2 r 00 r 0 1 1 r 00 1 r 0 1 Every random variable can be in one of several states: allowable rotamers for that position 29
. Graphical Models/BP (5) Example: Belief propagation r 0 r 0 2 2 r 0 2 r 0 2 r 00 r 0 1 1 r 00 1 r 0 1 The energy of each state depends on: - its singleton energy - its pairwise energies - the energies of the states of its parents 0
. Graphical Models/BP (6) Example: Belief propagation m 2! (r 0 ) m 2! (r 00 ) r 0 m 2! 2 r 0 2 Belief propagation: each node tells its neighbors nodes what it believes their state should be r 00 1 A message is sent from node i to node j r 0 1 The message is a vector where # of dimensions: allowed states/rotamers in recipient 1
. Graphical Models/BP (7) Example: Belief propagation r 0 2 Who sends the first message? r 0 2 r 00 1 r 0 1 2
. Graphical Models/BP (8) Example: Belief propagation r 0 2 Who sends the first message? r 0 2 In a tree: the leaves - Belief propagation is proven to be correct in a tree! r 00 1 r 0 1
. Graphical Models/BP (9) Example: Belief propagation r 0 2 Who sends the first message? r 0 2 In a graph with cycles: Set initial values Send in parallel r 00 1 r 0 1 No guarantees can be made! There might not be any convergence 4
. Graphical Models/BP (10) Example: Belief propagation m 2! (r 0 ) = 1 m 2! (r 00 ) = 1 m!2 (r 0 2) = 1 r 0 2 We iterate from there. r 0 r 00 m 1! m!1 m 2! m!2 m1!2 2 1 m2!1 m 1!2 (r 0 2) = 1 m 2!1 (r 0 1) = 1 m!1 (r 0 1) = 1 r 0 1 m 1! (r 0 ) = 1 m 1! (r 00 ) = 1 5
. Graphical Models/BP (11) Example: Belief propagation m 2! (r 0 ) = 1 m 2! (r 00 ) = 1 Node receives messages from nodes 1 and 2 r 0 m 2! 2 r 0 2 r 00 m 1! 1 r 0 1 m 1! (r 0 ) = 1 m 1! (r 00 ) = 1 6
. Graphical Models/BP (12) Example: Belief propagation What message does node send to node 1 on the next iteration? r 0 2 r 0 2 m!1 r 00 1 m!1 (r 0 1) =? r 0 1 7
Belief propagation: message passing N(i)! Neighbors of variable i Message that gets sent on each iteration m i!j (r j ) = max:e Ei(ri) Eij (ri;rj ) t r i Y k2n(i)nj m k!i (r i ); 8
Pairwise energies E ij (r 1 ; r 2 ) Position #1 E i (r 1 ) Example: Belief propagation Singleton energies E i (r 2 ) E i (r ) Position #2-4 -1-2 -6-2 r 0 r 00 Position #2 E ij (r 2 ; r ) Position # -1 Position # - r 0 r 00 E ij (r 1 ; r ) Iteration 0: m!1 (r1 0 ) = max :e Ei(r) Eij (r 0 ;r 1 ) t m 2! (r 0 ); r =? e E i (r ) E ij (r 00 ;r 1 ) t m 2! (r 00 ); Position #1-1 -4 r 0 r 00 9
. Graphical Models/BP (15) Example: Belief propagation Once it converges we can compute the belief each node has about itself r 0 m 2! 2 r 0 2 Belief about one's state: Multiply all incoming messages by singleton energy r 00 m 1! 1 r 0 1 40
Belief propagation: Max-marginals Belief about each rotamer MM i (r i ) = e E i (r i ) t Y k2n(i) m k!i (r i ) Pr 1 i (r i ) = max r 0 :r 0 i =r i Pr(r 0 ) Most likely rotamer for position i r i = arg max Pr 1 i (r i ) r i 2Rots i 41
. Graphical Models/BP (17) Fromer M, Yanover, C. Proteins (2008) Fromer M, Yanover, C. Proteins (2008) 42
. Graphical Models/BP (18) Fromer M, Yanover, C. Proteins (2008) Fromer M, Yanover, C. Proteins (2008) 4
. Graphical Models: Summary Formulate as an inference problem Model our design problem as a graphical model Establish edges between interacting residues Use Belief Propagation to find the beliefs for each position 44
4. tbmmf: type specific BMMF Paper's main contribution Builds on previous work by C. Yanover (2004) Uses Belief propagation to find lowest energy sequence and constrains space to find subsequent sequences 45
TBMMF (simplification) 1. Find the lowest energy sequence using BP 2. Find the next lowest energy sequence while excluding amino acids from the previous one. Partition into two subspaces using constraints according to the next lowest energy sequence 46
4. tbmmf () Example: tbmmf (1) Fromer M, Yanover, C. Proteins (2008) 47
4. tbmmf (4) Example: tbmmf (2) Fromer M, Yanover, C. Proteins (2008) 48
Results Fromer M, Yanover, C. Proteins (2008) 49
Results (2) Fromer M, Yanover, C. Proteins (2008) 50
Results() Algorithms tried: DEE / A* (Goldstein, 1-split, 2-split, Magic Bullet) tbmmf Ros: Rosetta SA: Simulated annealing over sequence space 51
Results (4): Assessment results Fromer M, Yanover, C. Proteins (2008) 52
Results(5) Fromer M, Yanover, C. Proteins (2008) 5
Results(6) Fromer M, Yanover, C. Proteins (2008) 54
Results(6) Fromer M, Yanover, C. Proteins (2008) 55
Results (7) DEE/A* was not feasible for any case except the prion SspB: A* could only output one sequence DEE also did not finish after 12 days BD/K* did not finish after 12 days 56
Results (8) Predicted sequences where highly similar between themselves. (high sequence identity) Very different from wild type sequence Solution: grouped tbmmf: apply constraints to whole groups of amino acids proof of concept only 57
Conclusions Fast and accurate algorithm Outperforms all other algorithms: A* is not feasible Better accuracy than other probabilistic algorithms 58
Conclusions (2) tbmmf produces a large set of very similar low energy results. This might be due to the many inaccuracies in the model Grouped tbmmf can produce a diverse set of low energy sequences 59
Conclusions () The results lack experimental data for validation. 60
Related Work: (Fromer et al. 2008) Fromer F, Yanover C. A computational framework to empower probabilistic protein design. ISMB 2008 Phage display: 10 9 10 10 randomized protein sequences Simultaneously tested for relevant biological function 61
Related Work: (Fromer et al. 2008) Fromer M, Yanover, C. Bioinformatics (2008) 62
Related Work: (Fromer et al. 2008) Uses sum-product instead of max-product Obtain per-position amino acid probabilities Tried until convergence or 100000 iterations; all structures converged 6
Related Work: (Fromer et al. 2008) Conclusions: Model results in probability distributions far from those observed experimentally. Limitations of the model: Imprecise energy function Decomposition into pairwise energy terms Assumption of a fixed backbone Discretization of side chain conformations 64
tbmmf algorithm 65