Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space

Pablo Gainza
CPS 296: Topics in Computational Structural Biology
Department of Computer Science, Duke University

Outline
1) Problem definition
2) Formulation as an inference problem
3) Graphical models
4) tBMMF algorithm
5) Results
6) Conclusions

1. Problem Definition
Inputs: rotamer library; energy f(x); positions to design and allowed rotamers/amino acids; protein structure
Protein design algorithm (www.cs.duke.edu/donaldlab)
Output: GMEC ∈ S

1. Problem Definition (2)
$r$ → rotamer assignment (RA)
$r_i$ → rotamer at position $i$ for RA $r$

1. Problem Definition (3)
$E_i(r_i)$ → energy between rotamer $r_i$ and the fixed backbone
$E_{ij}(r_i, r_j)$ → energy between rotamers $r_i$ and $r_j$
$E(r)$ → energy of rotamer assignment $r$
$E(r) = \sum_i E_i(r_i) + \sum_{i,j} E_{ij}(r_i, r_j)$
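The energy of a rotamer assignment can be sketched in a few lines of Python. The data layout (`E_single`, `E_pair`), the function name, and the toy energies are assumptions made for illustration; each interacting pair is stored once here.

```python
def total_energy(r, E_single, E_pair):
    """E(r): singleton terms plus pairwise terms for the pairs listed in E_pair."""
    singles = sum(E_single[i][k] for i, k in enumerate(r))
    pairs = sum(table[r[i]][r[j]] for (i, j), table in E_pair.items())
    return singles + pairs

# Hypothetical toy energies: two positions, two rotamers each.
E_single = [[0.5, -1.0], [-2.0, 0.0]]            # E_single[i][r_i]
E_pair = {(0, 1): [[-1.0, 2.0], [0.0, -0.5]]}    # E_pair[(i, j)][r_i][r_j]

print(total_energy((1, 0), E_single, E_pair))    # -3.0
```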

1. Problem Definition (4)
$T(k)$ → returns the amino acid type of rotamer $k$
$T(r)$ → returns the sequence of rotamer assignment $r$
Example: $T(r_1)$ = hexagon, $T(r_2)$ = cross, so $T(r)$ = (hexagon, cross)

1. Problem Definition (5)
Inputs: rotamer library; energy f(x); positions to design and allowed rotamers/amino acids; protein structure
Protein design algorithm (www.cs.duke.edu/donaldlab)
Output: GMEC ∈ S
$S = T(\arg\min_r E(r))$
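As a minimal sketch of $S = T(\arg\min_r E(r))$, the following Python enumerates every rotamer assignment of a tiny problem. Exhaustive search is only a stand-in for a real design algorithm, and the names `energy`, `design`, `aa_type` and all numbers are hypothetical.

```python
from itertools import product

# Hypothetical toy inputs: two positions, two rotamers each.
E_single = [[0.5, -1.0], [-2.0, 0.0]]            # E_single[i][r_i]
E_pair = {(0, 1): [[-1.0, 2.0], [0.0, -0.5]]}    # E_pair[(i, j)][r_i][r_j]
aa_type = [["Ala", "Leu"], ["Ser", "Thr"]]       # T(k) for each rotamer k

def energy(r):
    singles = sum(E_single[i][k] for i, k in enumerate(r))
    pairs = sum(table[r[i]][r[j]] for (i, j), table in E_pair.items())
    return singles + pairs

def design():
    """Return (GMEC rotamer assignment, its sequence S) by exhaustive search."""
    choices = [range(len(rotamers)) for rotamers in E_single]
    gmec = min(product(*choices), key=energy)
    return gmec, tuple(aa_type[i][k] for i, k in enumerate(gmec))

gmec, S = design()
print(gmec, S)  # (1, 0) ('Leu', 'Ser')
```

The search space grows as the product of the rotamer counts over all positions, which is why the exact and probabilistic methods on the following slides are needed.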

1. Problem Definition (6)
Related Work
Exact methods (DEE/A*, BroMAP, BWM) → Global Minimum Energy Conformation
Probabilistic methods (SCMF, MCSA) → low energy conformation

1. Problem Definition (7)
Model (rotamer library; energy f(x); positions to design and allowed rotamers/amino acids; protein structure): inaccurate!

1. Problem Definition (8)
Algorithm (protein design algorithm): fast or provable
$S = T(\arg\min_r E(r))$

1. Problem Definition (9)
Low energy conformation:
Too stable
Not fold to target
Low binding specificity

1. Problem Definition (10)
Solution: find a set of low energy sequences
Provable methods (DEE/A*) → ordered set of gap-free low energy conformations, including the GMEC
Probabilistic methods (tBMMF) → set of low energy conformations

Problem Definition: Summary
Protein design algorithms search for the sequence with the Global Minimum Energy Conformation (GMEC).
Our model is inaccurate: more than one low energy sequence is desirable.
Fromer et al. propose tBMMF to generate a set of low energy sequences.

2. Our problem as an inference problem
Probabilistic factor for self-interactions: $\psi_i(r_i) = e^{-E_i(r_i)/T}$
Probabilistic factor for pairwise interactions: $\psi_{ij}(r_i, r_j) = e^{-E_{ij}(r_i, r_j)/T}$

2. Inference problem (2)
Partition function: $Z = \sum_r e^{-E(r)/T}$
Probability distribution for rotamer assignment $r$: $P(r_1, \ldots, r_N) = \frac{1}{Z} \prod_i \psi_i(r_i) \prod_{i,j} \psi_{ij}(r_i, r_j) = \frac{1}{Z} e^{-E(r)/T}$

2. Inference problem (3)
Minimization goal (from definition): $S = T(\arg\min_r E(r))$
Equivalent maximization goal for the graphical model problem: $S = T(\arg\max_r \Pr(r))$
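The two goals select the same assignment. A short derivation from the factor definitions, assuming $T > 0$:

```latex
\begin{align*}
\prod_i \psi_i(r_i) \prod_{i,j} \psi_{ij}(r_i, r_j)
  &= \exp\!\Big(-\tfrac{1}{T}\Big[\sum_i E_i(r_i) + \sum_{i,j} E_{ij}(r_i, r_j)\Big]\Big)
   = e^{-E(r)/T}, \\
\arg\max_r \Pr(r)
  &= \arg\max_r \tfrac{1}{Z}\, e^{-E(r)/T}
   = \arg\min_r E(r).
\end{align*}
```

The last step holds because $Z$ does not depend on $r$ and $x \mapsto e^{-x/T}$ is strictly decreasing for $T > 0$.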

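2. Inference problem (4)
Example: Inference problem
Two positions; the allowed rotamer assignments are $r' = (r'_1, r'_2)$ and $r'' = (r''_1, r''_2)$
Singleton energies: $E_i(r'_1) = -1$, $E_i(r'_2) = -5$, $E_i(r''_1) = -1$, $E_i(r''_2) = -3$
Pairwise energies: $E_{ij}(r'_1, r'_2) = -2$, $E_{ij}(r''_1, r''_2) = -4$
$E(r') = ?$
$E(r'') = ?$
What is our GMEC?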

2. Inference problem (5)
$E(r') = (-1 - 2) + (-5 - 2) = -10$
$E(r'') = (-1 - 4) + (-3 - 4) = -12$
$r''$ is our GMEC

2. Inference problem (6)
$T = 1$ (for our example)
$\psi_i(r'_1) = e^{-E_i(r'_1)/T} = e^{1}$
$\psi_i(r'_2) = e^{-E_i(r'_2)/T} = e^{5}$
$\psi_i(r''_1) = e^{-E_i(r''_1)/T} = e^{1}$
$\psi_i(r''_2) = e^{-E_i(r''_2)/T} = e^{3}$

2. Inference problem (7)
$T = 1$ (for our example)
$\psi_{ij}(r'_1, r'_2) = e^{-E_{ij}(r'_1, r'_2)/T} = e^{2}$
$\psi_{ij}(r''_1, r''_2) = e^{-E_{ij}(r''_1, r''_2)/T} = e^{4}$
$Z = \sum_r e^{-E(r)/T} = e^{10} + e^{12}$

2. Inference problem (8)
$T = 1$ (for our example)
$P(r'_1, r'_2) = \frac{1}{Z} \prod_i \psi_i(r'_i) \prod_{i,j} \psi_{ij}(r'_i, r'_j) = \frac{e^{10}}{e^{10} + e^{12}}$

2. Inference problem (9)
$T = 1$ (for our example)
$P(r''_1, r''_2) = \frac{1}{Z} \prod_i \psi_i(r''_i) \prod_{i,j} \psi_{ij}(r''_i, r''_j) = \frac{e^{12}}{e^{10} + e^{12}}$

2. Inference problem (10)
$T = 1$ (for our example)
$S = T(\arg\max_r \Pr(r))$
$S = T(r'')$
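The arithmetic of this example can be checked with a short Python sketch. The energies are the ones used in the slides' calculation; the dictionary layout and names are assumptions, and the pairwise term is added once per position because that is how the slides sum it.

```python
import math

T = 1.0
# Energies from the worked example; only r' and r'' are allowed assignments.
single = {"r'": (-1.0, -5.0), "r''": (-1.0, -3.0)}   # (position 1, position 2)
pair = {"r'": -2.0, "r''": -4.0}

def energy(name):
    # Each position contributes its singleton energy plus the pairwise energy.
    return sum(e + pair[name] for e in single[name])

weights = {name: math.exp(-energy(name) / T) for name in single}
Z = sum(weights.values())
prob = {name: w / Z for name, w in weights.items()}
best = max(prob, key=prob.get)

print(energy("r'"), energy("r''"))   # -10.0 -12.0
print(best, round(prob[best], 4))    # r'' 0.8808
```

The most probable assignment is the lowest energy one, so the inference view and the energy view agree on $r''$.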

2. Inference problem (11)
Minimization goal (from definition): $S = T(\arg\min_r E(r))$
Equivalent maximization goal for the graphical model problem: $S = T(\arg\max_r \Pr(r))$
We still have a non-polynomial problem! But it is now formulated as an inference problem, so probabilistic methods can be applied.

Summary: Inference problem
We model our problem as an inference problem.
We can use probabilistic methods to solve it.

3. Graphical models for protein design and belief propagation (BP)
1. Model each design position as a random variable
2. Build an interaction graph that shows conditional independence between variables
[Figure: SspB dimer interface, inter-monomeric interactions (Cα). Source: Fromer M, Yanover C. Proteins (2008)]

3. Graphical Models/BP (2)
Example: Belief propagation
Node in the structure: an interacting residue
Node in the graphical model: a random variable

3. Graphical Models/BP (3)
Example: Belief propagation
Edge in the structure: an energy interaction between two residues
Edge in the graphical model: a direct dependence between two nodes
If two residues are distant from each other, there is no edge between them.

3. Graphical Models/BP (4)
Example: Belief propagation
Every random variable can be in one of several states: the allowable rotamers for that position.

3. Graphical Models/BP (5)
Example: Belief propagation
The energy of each state depends on:
- its singleton energy
- its pairwise energies
- the energies of the states of its neighbors

3. Graphical Models/BP (6)
Example: Belief propagation
Belief propagation: each node tells its neighboring nodes what it believes their state should be.
A message is sent from node $i$ to node $j$.
The message is a vector with one dimension per allowed state/rotamer of the recipient, e.g. $m_{2\to3}(r'_3)$ and $m_{2\to3}(r''_3)$.

3. Graphical Models/BP (7)
Example: Belief propagation
Who sends the first message?

3. Graphical Models/BP (8)
Example: Belief propagation
Who sends the first message?
In a tree: the leaves
- Belief propagation is proven to be correct in a tree!

3. Graphical Models/BP (9)
Example: Belief propagation
Who sends the first message?
In a graph with cycles: set initial values and send in parallel.
No guarantees can be made! It might not converge.

3. Graphical Models/BP (10)
Example: Belief propagation
All messages are initialized to 1:
$m_{1\to2}(r'_2) = 1$, $m_{2\to1}(r'_1) = 1$
$m_{1\to3}(r'_3) = 1$, $m_{1\to3}(r''_3) = 1$, $m_{3\to1}(r'_1) = 1$
$m_{2\to3}(r'_3) = 1$, $m_{2\to3}(r''_3) = 1$, $m_{3\to2}(r'_2) = 1$
We iterate from there.

3. Graphical Models/BP (11)
Example: Belief propagation
Node 3 receives messages from nodes 1 and 2:
$m_{1\to3}(r'_3) = 1$, $m_{1\to3}(r''_3) = 1$
$m_{2\to3}(r'_3) = 1$, $m_{2\to3}(r''_3) = 1$

3. Graphical Models/BP (12)
Example: Belief propagation
What message does node 3 send to node 1 on the next iteration?
$m_{3\to1}(r'_1) = ?$

Belief propagation: message passing
$N(i)$ → neighbors of variable $i$
Message that gets sent on each iteration:
$m_{i\to j}(r_j) = \max_{r_i} e^{\frac{-E_i(r_i) - E_{ij}(r_i, r_j)}{T}} \prod_{k \in N(i) \setminus j} m_{k\to i}(r_i)$
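A minimal Python sketch of one such message update follows. The function signature, the data layout, and the toy energies are assumptions for illustration; `incoming` holds the messages $m_{k\to i}$ from the neighbors of $i$ other than $j$.

```python
import math

def message(i, j, E_single, E_pair, incoming, T=1.0):
    """Max-product message m_{i->j}: one entry per rotamer of the recipient j."""
    out = []
    for rj in range(len(E_single[j])):
        best = 0.0
        for ri in range(len(E_single[i])):
            value = math.exp(-(E_single[i][ri] + E_pair[(i, j)][ri][rj]) / T)
            for m in incoming:          # product over k in N(i) \ j
                value *= m[ri]
            best = max(best, value)     # max over the sender's rotamers r_i
        out.append(best)
    return out

# Hypothetical energies: sender 0 and recipient 1, two rotamers each.
E_single = [[0.0, -1.0], [0.0, 0.0]]
E_pair = {(0, 1): [[-2.0, 0.0], [0.0, -0.5]]}   # E_pair[(i, j)][r_i][r_j]

m = message(0, 1, E_single, E_pair, incoming=[[1.0, 1.0]])
```

Here `m[0]` equals $e^{2}$ and `m[1]` equals $e^{1.5}$: for each rotamer of the recipient, the sender reports the best score it can reach.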

Example: Belief propagation
[Figure: tables of singleton energies $E_i(r_1)$, $E_i(r_2)$, $E_i(r_3)$ and pairwise energies $E_{ij}(r_1, r_2)$, $E_{ij}(r_2, r_3)$, $E_{ij}(r_1, r_3)$ for positions #1, #2 and #3]
Iteration 0:
$m_{3\to1}(r'_1) = \max\left\{ e^{\frac{-E_i(r'_3) - E_{ij}(r'_3, r'_1)}{T}} m_{2\to3}(r'_3),\; e^{\frac{-E_i(r''_3) - E_{ij}(r''_3, r'_1)}{T}} m_{2\to3}(r''_3) \right\} = ?$

3. Graphical Models/BP (15)
Example: Belief propagation
Once it converges, we can compute the belief each node has about itself.
Belief about one's state: multiply all incoming messages by the singleton factor $e^{-E_i(r_i)/T}$.

Belief propagation: Max-marginals
Belief about each rotamer: $MM_i(r_i) = e^{-E_i(r_i)/T} \prod_{k \in N(i)} m_{k\to i}(r_i)$
$\Pr^1_i(r_i) = \max_{r' : r'_i = r_i} \Pr(r')$
Most likely rotamer for position $i$: $r^*_i = \arg\max_{r_i \in Rots_i} \Pr^1_i(r_i)$
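The following Python sketch runs parallel max-product updates on a three-position chain (a tree, where BP is exact), computes the max-marginals, and compares the per-position choice with exhaustive search. The chain, its energies, and every name are hypothetical; each pairwise term is stored once.

```python
import math
from itertools import product

T = 1.0
# Hypothetical chain 0 - 1 - 2 with two rotamers per position.
E_single = [[0.0, -1.0], [-0.5, 0.0], [0.0, -2.0]]
E_pair = {(0, 1): [[-1.0, 0.0], [0.0, -3.0]],
          (1, 2): [[-2.0, 0.0], [0.5, -1.0]]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def pair_energy(i, ri, j, rj):
    if (i, j) in E_pair:
        return E_pair[(i, j)][ri][rj]
    return E_pair[(j, i)][rj][ri]

def run_max_product(n_iter=5):
    # All messages start at 1 and are updated in parallel.
    msgs = {(i, j): [1.0, 1.0] for i in neighbors for j in neighbors[i]}
    for _ in range(n_iter):
        msgs = {
            (i, j): [
                max(math.exp(-(E_single[i][ri] + pair_energy(i, ri, j, rj)) / T)
                    * math.prod(msgs[(k, i)][ri] for k in neighbors[i] if k != j)
                    for ri in range(2))
                for rj in range(2)]
            for (i, j) in msgs}
    return msgs

def max_marginals(msgs):
    return [[math.exp(-E_single[i][ri] / T)
             * math.prod(msgs[(k, i)][ri] for k in neighbors[i])
             for ri in range(2)]
            for i in neighbors]

mm = max_marginals(run_max_product())
bp_choice = tuple(max(range(2), key=lambda r: mm[i][r]) for i in range(3))

def energy(r):
    return (sum(E_single[i][r[i]] for i in range(3))
            + sum(tab[r[i]][r[j]] for (i, j), tab in E_pair.items()))

brute_choice = min(product(range(2), repeat=3), key=energy)
print(bp_choice, brute_choice)  # (1, 1, 1) (1, 1, 1)
```

The max-marginals here are unnormalized: `mm[i][r]` is $e^{-E/T}$ of the best assignment with rotamer `r` at position `i`. On a graph with cycles the same loop can be run, but the agreement with exhaustive search is no longer guaranteed.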

3. Graphical Models/BP (17)
[Figures from Fromer M, Yanover C. Proteins (2008)]

3. Graphical Models/BP (18)
[Figures from Fromer M, Yanover C. Proteins (2008)]

3. Graphical Models: Summary
Formulate as an inference problem
Model our design problem as a graphical model
Establish edges between interacting residues
Use belief propagation to find the beliefs for each position

4. tBMMF: type-specific BMMF
Paper's main contribution
Builds on previous work by C. Yanover (2004)
Uses belief propagation to find the lowest energy sequence, then constrains the space to find subsequent sequences

tBMMF (simplification)
1. Find the lowest energy sequence using BP
2. Find the next lowest energy sequence while excluding amino acids from the previous one
3. Partition into two subspaces using constraints according to the next lowest energy sequence
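To make the target output concrete, the sketch below ranks sequences (not rotamer assignments) by the energy of their best rotamer assignment. It uses exhaustive enumeration as a stand-in and does not implement tBMMF's BP-based search or its subspace partitioning; the function name, data layout, and energies are all hypothetical.

```python
from itertools import product

def ranked_sequences(E_single, E_pair, aa_type, k):
    """Return the k lowest energy sequences as (energy, sequence, rotamers)."""
    n = len(E_single)
    best = {}  # sequence -> (energy, rotamer assignment) of its best conformation
    for r in product(*(range(len(E_single[i])) for i in range(n))):
        e = (sum(E_single[i][r[i]] for i in range(n))
             + sum(tab[r[i]][r[j]] for (i, j), tab in E_pair.items()))
        seq = tuple(aa_type[i][r[i]] for i in range(n))
        if seq not in best or e < best[seq][0]:
            best[seq] = (e, r)
    return sorted((e, seq, r) for seq, (e, r) in best.items())[:k]

# Hypothetical toy problem: position 0 has two Leu rotamers and one Ala rotamer.
E_single = [[-1.0, -2.0, 0.0], [-0.5, -1.5]]
E_pair = {(0, 1): [[-3.5, -1.0], [-3.0, 0.0], [-0.25, -2.5]]}
aa_type = [["Leu", "Leu", "Ala"], ["Ser", "Thr"]]

top = ranked_sequences(E_single, E_pair, aa_type, k=3)
for e, seq, r in top:
    print(e, seq, r)
```

In this toy problem the two lowest energy rotamer assignments both have sequence (Leu, Ser), so a ranking over conformations would repeat that sequence, while the type-specific ranking moves on to (Ala, Thr).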

4. tBMMF (3)
Example: tBMMF (1)
[Figure from Fromer M, Yanover C. Proteins (2008)]

4. tBMMF (4)
Example: tBMMF (2)
[Figure from Fromer M, Yanover C. Proteins (2008)]

Results
[Figure from Fromer M, Yanover C. Proteins (2008)]

Results (2)
[Figure from Fromer M, Yanover C. Proteins (2008)]

Results (3)
Algorithms tried:
DEE / A* (Goldstein, 1-split, 2-split, Magic Bullet)
tBMMF
Ros: Rosetta
SA: simulated annealing over sequence space

Results (4): Assessment results
[Figure from Fromer M, Yanover C. Proteins (2008)]

Results (5)
[Figure from Fromer M, Yanover C. Proteins (2008)]

Results (6)
[Figure from Fromer M, Yanover C. Proteins (2008)]

Results (6)
[Figure from Fromer M, Yanover C. Proteins (2008)]

Results (7)
DEE/A* was not feasible for any case except the prion
SspB: A* could only output one sequence
DEE also did not finish after 12 days
BD/K* did not finish after 12 days

Results (8)
Predicted sequences were highly similar to each other (high sequence identity)
Very different from the wild-type sequence
Solution: grouped tBMMF, which applies constraints to whole groups of amino acids (proof of concept only)

Conclusions
Fast and accurate algorithm
Outperforms all other algorithms:
A* is not feasible
Better accuracy than other probabilistic algorithms

Conclusions (2)
tBMMF produces a large set of very similar low energy results. This might be due to the many inaccuracies in the model.
Grouped tBMMF can produce a diverse set of low energy sequences.

Conclusions (3)
The results lack experimental data for validation.

Related Work: (Fromer et al. 2008)
Fromer M, Yanover C. A computational framework to empower probabilistic protein design. ISMB 2008
Phage display: $10^9$ to $10^{10}$ randomized protein sequences, simultaneously tested for relevant biological function

Related Work: (Fromer et al. 2008)
[Figure from Fromer M, Yanover C. Bioinformatics (2008)]

Related Work: (Fromer et al. 2008)
Uses sum-product instead of max-product
Obtains per-position amino acid probabilities
Run until convergence or 100000 iterations; all structures converged
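Replacing the max in the message update with a sum yields marginals instead of max-marginals. The sketch below does this for a hypothetical two-position model (a tree, so the result is exact) and pools the rotamer marginals of position 0 into per-amino-acid probabilities; all names and energies are assumptions.

```python
import math
from collections import defaultdict

T = 1.0
# Hypothetical two-position model; position 0 has two Leu rotamers and one Ala.
E_single = [[-1.0, -2.0, 0.0], [-0.5, -1.5]]
E_pair = [[-3.5, -1.0], [-3.0, 0.0], [-0.25, -2.5]]   # E_pair[r0][r1]
aa_type = [["Leu", "Leu", "Ala"], ["Ser", "Thr"]]

# Sum-product message from position 1 to position 0 (a sum replaces the max).
m_1_to_0 = [
    sum(math.exp(-(E_single[1][r1] + E_pair[r0][r1]) / T) for r1 in range(2))
    for r0 in range(3)
]

# Unnormalized belief of each rotamer at position 0, then normalize.
belief = [math.exp(-E_single[0][r0] / T) * m_1_to_0[r0] for r0 in range(3)]
total = sum(belief)

# Pool rotamer marginals by amino acid type.
aa_prob = defaultdict(float)
for r0, b in enumerate(belief):
    aa_prob[aa_type[0][r0]] += b / total

print(dict(aa_prob))
```

Pooling by amino acid is what turns rotamer-level beliefs into the per-position amino acid probabilities that can be compared with experimental sequence profiles.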

Related Work: (Fromer et al. 2008)
Conclusions: the model results in probability distributions far from those observed experimentally.
Limitations of the model:
Imprecise energy function
Decomposition into pairwise energy terms
Assumption of a fixed backbone
Discretization of side chain conformations

tBMMF algorithm