De Novo Protein Structure Prediction

Poppy Russell
5 years ago

1 De Novo Protein Structure Prediction

2 Multiple Sequence Alignment Sequences: ACT, ATT, TGG

3 Multiple Sequence Alignment Sequences: ACT, ATT, TGG. One possible alignment: ACT--, ATT--, --TGG

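4 Multiple Sequence Alignment What about the simple extension from 2D? For three sequences $u$, $v$, $w$ there are seven possible endings for the last alignment column: each sequence contributes either its last character or a gap, and the all-gap column is excluded. In general there are $2^k - 1$ endings for $k$ sequences. Why?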
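The count of endings can be checked by enumeration. The following is a minimal sketch, assuming each sequence either contributes a character (True) or a gap (False) to the final column; the function name `column_endings` is hypothetical.

```python
from itertools import product


def column_endings(k):
    """List every way the last alignment column can end for k sequences.

    True means the sequence contributes its last character, False means a gap.
    The all-gap column is dropped because it aligns nothing.
    """
    return [c for c in product([True, False], repeat=k) if any(c)]


# Three sequences give 2**3 - 1 endings.
print(len(column_endings(3)))
```

Each of the $2^k$ character-or-gap patterns is a candidate, and removing the single all-gap pattern leaves $2^k - 1$.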

5 Multiple Sequence Alignment
$$s_{i,j,k} = \max \begin{cases}
s_{i-1,j-1,k-1} + \delta(u_i, v_j, w_k) \\
s_{i-1,j-1,k} + \delta(u_i, v_j, -) \\
s_{i-1,j,k-1} + \delta(u_i, -, w_k) \\
s_{i,j-1,k-1} + \delta(-, v_j, w_k) \\
s_{i-1,j,k} + \delta(u_i, -, -) \\
s_{i,j-1,k} + \delta(-, v_j, -) \\
s_{i,j,k-1} + \delta(-, -, w_k)
\end{cases}$$
$\delta(x, y, z)$ is an entry in the 3D scoring matrix. Time and space grow exponentially with the number of sequences.
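The seven-case recurrence can be sketched directly in code. This is only an illustration under stated assumptions: the column score `delta` is taken to be the sum of the three pairwise scores, and the match, mismatch, and gap values in `pair_score` are hypothetical choices, not values from the slides.

```python
from itertools import product

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Hypothetical pairwise score of two column entries; gap-gap scores 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def delta(x, y, z):
    """Assumed column score: the sum of the three pairwise scores."""
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)


def align3(u, v, w):
    """Optimal three-way alignment score using the seven-case recurrence."""
    moves = [m for m in product((0, 1), repeat=3) if any(m)]  # seven endings
    s = {(0, 0, 0): 0}
    sizes = (range(len(u) + 1), range(len(v) + 1), range(len(w) + 1))
    # Lexicographic order guarantees every predecessor cell is filled first.
    for i, j, k in product(*sizes):
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue
            column = (u[pi] if di else GAP, v[pj] if dj else GAP, w[pk] if dk else GAP)
            candidate = s[(pi, pj, pk)] + delta(*column)
            if best is None or candidate > best:
                best = candidate
        s[(i, j, k)] = best
    return s[(len(u), len(v), len(w))]
```

The table holds one cell per index triple, so three sequences of length $n$ already need on the order of $n^3$ cells, and each extra sequence multiplies that by another factor of $n$.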

6 Scoring Sum-of-Pairs Scoring (SP):
$$S(A) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(\bar{s}_i, \bar{s}_j)$$
where $A$ is a multiple alignment, $\bar{s}_i$ is the projection of sequence $i$ ($s_i$ with gaps), and $S(\bar{s}_i, \bar{s}_j)$ is the score of the pairwise alignment. Idea: A good multiple alignment should contain good pairwise alignments.
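A small sketch of sum-of-pairs scoring, applied to the alignment ACT--, ATT--, --TGG from the earlier slide. The match, mismatch, and gap values are hypothetical, and a gap aligned with a gap is assumed to score 0.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Hypothetical score for one pair of column entries; gap-gap scores 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def projected_pair_score(row_a, row_b):
    """Score of the pairwise alignment induced by two rows of the alignment."""
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))


def sp_score(alignment):
    """Sum-of-pairs score: add the induced score of every pair of rows."""
    return sum(projected_pair_score(a, b) for a, b in combinations(alignment, 2))


print(sp_score(["ACT--", "ATT--", "--TGG"]))
```

Under these assumed values the first two rows score 1 against each other, and each of them scores -3 against the third row.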

7 Pruning the DP Matrix The dynamic programming matrix is large, but we only want the best alignment, and most matrix elements are not on that path. Can we direct the search to avoid evaluating cells that are provably not on the best path? $S_v$: score of the best path from the start to $v$. $F_v$: bound on the best path from $v$ to the end. $K$: score of the best known alignment. What if $S_v + F_v < K$?

8 Pruning the DP Matrix Sequences: ARSTVK, ASVK, ARTR. Let $v = (3, 2, 2)$. $S_v$ is the score of the best alignment of ARS, AS, AR. $F_v$ is an upper bound on the score of aligning TVK, VK, TR. If $S_v + F_v < K$, then mark $v$ as dead-ending (that is, prune $v$).

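9 Pruning the DP Matrix We know the alignment score is bounded:
$$S(A) \le \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S(s_k, s_l)$$
Observation: $S(A) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(\bar{s}_i, \bar{s}_j)$ and $S(\bar{s}_i, \bar{s}_j) \le S(s_i, s_j)$, where $S(s_i, s_j)$ is the optimal pairwise alignment score. So our bound can be:
$$F_v = \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S\big(s_k[v_k{+}1 \ldots n_k],\; s_l[v_l{+}1 \ldots n_l]\big)$$
Runtime for computing $F$ (using dynamic programming): $O(n^2 m^2)$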
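The bound $F_v$ can be sketched as a sum of optimal pairwise scores over the remaining suffixes, using the sequences ARSTVK, ASVK, ARTR and $v = (3, 2, 2)$ from the previous slide. The scoring values are hypothetical, and the function names are illustrative only.

```python
from itertools import combinations


def pairwise_optimum(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with assumed scoring values."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def suffix_bound(seqs, v):
    """F_v: sum of optimal pairwise scores of the suffixes that start after v."""
    suffixes = [s[p:] for s, p in zip(seqs, v)]
    return sum(pairwise_optimum(a, b) for a, b in combinations(suffixes, 2))


def is_dead_end(s_v, f_v, k):
    """A cell is pruned when even the optimistic completion cannot reach K."""
    return s_v + f_v < k


print(suffix_bound(["ARSTVK", "ASVK", "ARTR"], (3, 2, 2)))
```

Because each induced pairwise alignment can score no better than the optimal pairwise alignment of the same two suffixes, the sum is an upper bound under sum-of-pairs scoring (with gap-gap columns scoring 0).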

10 Backbone Native State Each point on the energy landscape defines a conformation and associated energy. How many degrees of freedom should we have? How many do we want? A protein conformation can be represented by a vector of DOF choices, $\rho = (\ldots, r_i, \ldots)$, with energy:
$$E(\rho) = \sum_{i \ne j} E_{i,j} + \sum_i E_i$$

11 Backbone Native State Each point on the energy landscape defines a conformation and associated energy. How many degrees of freedom should we have? How many do we want? A protein conformation can be represented by a vector of DOF choices, $\rho = (\ldots, r_i, \ldots)$, and the conformation with minimum (potential) energy is:
$$\rho^* = \arg\min_{\rho} E(\rho)$$

12 Primary Sequence Energy Function Conformation Space In order to apply a discrete optimization technique, we need a discretized search space!

13 Primary Sequence Energy Function Algorithm Conformation Space

14 (Homologous) Backbone Energy Function Algorithm Conformation Space

15 (Homologous) Backbone X-ray Data Algorithm Conformation Space

16 (Homologous) Backbone X-ray Data Energy Function Algorithm Conformation Space

17 (Homologous) Backbone X-ray Data Energy Function PDB Statistics Algorithm Conformation Space

18 (Homologous) Backbone NMR Data Energy Function PDB Statistics Algorithm Conformation Space

19 Prior Knowledge and Observations (Sequence/Fold, Energy Function, Statistics, Experimental Data) Algorithm Conformation Space Best-Fit Model (3D Structure, Backbone, Sidechains, Docking, Design)

20 Prior Knowledge and Observations (Sequence/Fold, Energy Function, PDB Statistics, Experimental Data) Accurate (enough) model? Algorithm: Fast (enough)? Conformation Space Correct (enough) solution? Best-Fit Model (3D Structure, Backbone, Sidechains, Docking, Design)

21 Prior Knowledge and Observations (Sequence/Fold, Energy Function, Statistics, Experimental Data) Algorithm: Fast (enough)? We want $O(n^c)$ and not $O(c^n)$. Conformation Space Correct objective function? Guarantees on solution quality? Best-Fit Model (3D Structure, Backbone, Sidechains, Docking, Design)

22 Discretizing Sidechains
Table 1. Published rotamer libraries.
Authors | Year | Type of library | Number of proteins in library | Resolution (Å)
Chandrasekaran and Ramachandran [2] | 1970 | BBIND | 3 | NA
Janin et al. [4] | 1978 | BBIND, SSDEP | |
Bhat et al. [3] | 1979 | BBIND | 23 | NA
James and Sielecki [5] | 1983 | BBIND | 5 | 1.8, R-factor < 0.15
Benedetti et al. [6] | 1983 | BBIND | 238 peptides | R-factor < 0.10
Ponder and Richards [7] | 1987 | BBIND | 19 |
McGregor et al. [8] | 1987 | SSDEP | |
Tuffery et al. [9] | 1991 | BBIND | |
Dunbrack and Karplus [10] | 1993 | BBIND, BBDEP | |
Schrauber et al. [11] | 1993 | BBIND, SSDEP | |
Kono and Doi [12] | 1996 | BBIND | 103 | NA
De Maeyer et al. [13] | 1995 | BBIND | |
Dunbrack and Cohen [14] | | BBIND, BBDEP | 850* | 1.7
Lovell et al. [15] | 2000 | BBIND, SSDEP | |
*Latest update, May. NA, not available.
[Dunbrack, Rotamer Libraries in the 21st Century, 2002]

23 Energy Functions Standard approaches (e.g. Amber, CHARMM, GROMACS) model potential energy as:
$$E_{\text{total}} = E_{\text{bonded}} + E_{\text{nonbonded}}$$
where:
$$E_{\text{bonded}} = E_{\text{bond}} + E_{\text{angle}} + E_{\text{dihedral}}$$
$$E_{\text{nonbonded}} = E_{\text{electrostatic}} + E_{\text{vdW}}$$

25 (Homologous) Backbone Energy Function Algorithm Conformation Space

26 [Figure: phenylalanine, with side chain R] For a protein with $n$ amino acids, the protein backbone has $2n - 2$ degrees of freedom. Sidechain conformations are also defined by dihedral angles, but can be discretized by rotamers. [Shapovalov, Dunbrack 11]

27 Native State Backbone Each point on the energy landscape defines a conformation and associated energy. For sidechain placement, we have $n$ degrees of freedom. Each amino acid has a number of states equal to the number of rotamers for that type. A sidechain conformation can be represented by a vector of rotamer choices, $\rho = (\ldots, i_r, \ldots)$, and the conformation with minimum (potential) energy is:
$$\rho^* = \arg\min_{\rho} E(\rho)$$
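To make the rotamer-vector view concrete, here is a minimal brute-force sketch that enumerates every vector of rotamer choices and returns the one with minimum energy. The energy tables are made-up toy values, the names are hypothetical, and each unordered pair of positions is counted once (an assumption about how the pairwise sum is normalized).

```python
from itertools import product


def total_energy(conf, self_e, pair_e):
    """E(conf): self energies plus pairwise energies, each pair counted once."""
    energy = sum(self_e[i][r] for i, r in enumerate(conf))
    for i in range(len(conf)):
        for j in range(i + 1, len(conf)):
            energy += pair_e.get((i, j), {}).get((conf[i], conf[j]), 0.0)
    return energy


def brute_force_gmec(self_e, pair_e):
    """Exhaustively search all rotamer vectors for the minimum-energy one."""
    choices = [range(len(e)) for e in self_e]
    return min(product(*choices), key=lambda c: total_energy(c, self_e, pair_e))


# Toy problem: three positions with two rotamers each (values are invented).
SELF_E = [[0.0, 1.0], [0.5, 0.0], [0.0, 2.0]]
PAIR_E = {(0, 1): {(0, 0): 3.0, (1, 1): -2.0}, (1, 2): {(1, 1): -5.0}}

best = brute_force_gmec(SELF_E, PAIR_E)
print(best, total_energy(best, SELF_E, PAIR_E))
```

The search visits the product of the rotamer counts, which is why exhaustive enumeration stops being practical beyond a handful of positions.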

28 Dead End Elimination One of the only deterministic, non-trivial, and effective combinatorial optimization algorithms in Computational Structural Biology. Prunes rotamers that are provably NOT part of the global minimum energy conformation (GMEC). Used for: side-chain placement (tertiary structure prediction); protein design. Original DEE

29 Dead End Elimination [Figure: total energy across conformations for rotamers 1, 2, 3]

30-32 Dead End Elimination [Figure: total energy across conformations for rotamer $i_r$ compared with rotamer $i_t$]

33 Dead End Elimination Original DEE (Simplified) [Figure: energy profiles of $i_r$ and $i_t$ with their extremes marked "?"]

34 Dead End Elimination Original DEE (Simplified) [Figure: the minimum energy of $i_r$ compared with the maximum energy of $i_t$]

35 Dead End Elimination Original DEE (Simplified) Pierce, Spriet, Desmet, Mayo, JCC, 2000

36 Dead End Elimination Original DEE: Pierce, Spriet, Desmet, Mayo, JCC, 2000

37 Dead End Elimination Goldstein Criterion:
$$E(i_r) - E(i_t) + \sum_{j \ne i} \min_s \{E(i_r, j_s) - E(i_t, j_s)\} > 0$$
Pierce, Spriet, Desmet, Mayo, JCC, 2000

38 Dead End Elimination Goldstein Criterion:
$$E(i_r) - E(i_t) + \sum_{j \ne i} \min_s \{E(i_r, j_s) - E(i_t, j_s)\} > 0$$
Pierce, Spriet, Desmet, Mayo, JCC, 2000
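The Goldstein criterion can be sketched as a pruning loop that repeats until no rotamer can be eliminated. This is a minimal illustration, not the implementation from Pierce et al.: the energy tables are invented toy values, the dictionary layout and function names are hypothetical, and the minimum over $s$ is taken only over rotamers that are still alive.

```python
def pair(pair_e, i, r, j, s):
    """Pairwise energy E(i_r, j_s); entries missing from the table count as 0."""
    return pair_e.get(((i, r), (j, s)), pair_e.get(((j, s), (i, r)), 0.0))


def goldstein_dominates(i, r, t, self_e, pair_e, alive):
    """True when rotamer t proves rotamer r at position i is not in the GMEC."""
    total = self_e[i][r] - self_e[i][t]
    for j in range(len(self_e)):
        if j == i:
            continue
        total += min(
            pair(pair_e, i, r, j, s) - pair(pair_e, i, t, j, s) for s in alive[j]
        )
    return total > 0


def dee(self_e, pair_e):
    """Apply the Goldstein criterion repeatedly; return surviving rotamers."""
    alive = [set(range(len(e))) for e in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(len(alive)):
            for r in sorted(alive[i]):
                competitors = [t for t in alive[i] if t != r]
                if any(
                    goldstein_dominates(i, r, t, self_e, pair_e, alive)
                    for t in competitors
                ):
                    alive[i].discard(r)
                    changed = True
    return alive


# Toy problem: three positions with two rotamers each (values are invented).
SELF_E = [[0.0, 1.0], [0.5, 0.0], [0.0, 2.0]]
PAIR_E = {
    ((0, 0), (1, 0)): 3.0,
    ((0, 1), (1, 1)): -2.0,
    ((1, 1), (2, 1)): -5.0,
}

print(dee(SELF_E, PAIR_E))
```

On this toy input a single rotamer survives at every position, so pruning alone identifies the minimum; in general DEE only shrinks the search space and a further search over the survivors is still needed.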

39 Dead End Elimination Generalized Goldstein Criterion:
$$E(i_r) - \sum_{t=1}^{T} C_t E(i_t) + \sum_{j \ne i} \min_s \Big\{ E(i_r, j_s) - \sum_{t=1}^{T} C_t E(i_t, j_s) \Big\} > 0$$
Pierce, Spriet, Desmet, Mayo, JCC, 2000

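40 Conformation Space [Figure: conformation space partitioned by the rotamers $k_a, k_b, k_c, k_d, k_e$ of a residue $k$] The idea behind bottom line DEE is that the conformation space can be partitioned to improve pruning. If a particular rotamer can be eliminated in every partition (possibly by a different competing rotamer in each), then it is not in the GMEC.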

41 Dead End Elimination Simple Split DEE (for each partition):
$$E(i_r) - E(i_t) + \sum_{j,\; j \ne k \ne i} \min_s \{E(i_r, j_s) - E(i_t, j_s)\} + [E(i_r, k_v) - E(i_t, k_v)] > 0$$
Pierce, Spriet, Desmet, Mayo, JCC, 2000

42 TABLE II. CPU Minutes Consumed Using Goldstein (T = 1) DEE, Split (s = 1) DEE, and Split (s = 2 mb) DEE for Each of Three Test Cases.
Columns: Case | Method | (T = 1) time | (s = 1) time | (s = 2 mb) time | Doubles time | Total time
Methods per case: Goldstein (T = 1), Split (s = 1), Split (s = 2 mb). Entries marked (a): Goldstein (T = 1) in all three cases, and Split (s = 1) in case 3. [Timing values not recoverable from the extraction.]
(a) Failed to converge due to combinatorial explosion in the number of superrotamers created by unification.
FIGURE 5. Plastocyanin core design (the two split methods are indistinguishable).
FIGURE 6. Protein G core/boundary design.
FIGURE 7. Protein G surface design.

43 Extensions and Results Sidechain placement vs. design: is there a difference? DEE can be an extremely powerful pruning strategy, but what do we do in cases where the conformation space remains large? Can we do better than looking at conformations exhaustively?

44 Conformation Space 1 Far apart Explicit: (1,1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3) An explicit representation considers all possible conformations individually, leading to an exponentially sized conformation space.

45 Factorized Conformation Space 1 Far apart Implicit: (1, 2, 3), (1, 2, 3) Rather than defining the conformation space explicitly, by considering only local interactions we can obtain a compact, factored representation of the conformation space.

46 Factorized Conformation Space [Figure: residues $i$ and $j$, each with rotamers 1, 2, 3, labeled with the energy terms $E(i, j)$, $E(i, i')$, and $E(j, j')$] Implicit: (1, 2, 3), (1, 2, 3) Rather than defining the conformation space explicitly, by considering only local interactions we can obtain a compact, factored representation of the conformation space.

47 Factorized Conformation Space Rather than defining the conformation space explicitly, by considering only local interactions we can obtain a compact, factored representation of the conformation space.

48 Protein Interaction Graph We can construct an interaction graph of residues in which edges are defined for residues that are close enough. These define pairwise energy terms for any chosen pair of rotamers.

49 Protein Interaction Graph The sidechain placement problem is then to select rotamers at each position so as to minimize the sum over all edges of interaction energies.

50 Linear Programming [Diagram: LP solver with inputs $A, b, c$ and outputs $x_1, x_2, \ldots, x_n$] Minimize $\sum_i c_i x_i$ subject to $Ax \le b$. Linear programming is a general-purpose tool for optimizing a linear objective function under linear constraints. The general problem of Linear Programming is polynomial-time solvable.

51 Linear Programming [Diagram: ILP solver with inputs $A, b, c$ and outputs $x_1, x_2, \ldots, x_n$] Minimize $\sum_i c_i x_i$ subject to $Ax \le b$, $x_i \in \mathbb{Z}$. Linear programming is a general-purpose tool for optimizing a linear objective function under linear constraints. Integer Linear Programming is not known to be polynomial-time solvable.

52 Linear Programming [Diagram: ILP solver with inputs $A, b, c$ and outputs $x_1, x_2, \ldots, x_n$] Minimize $\sum_i c_i x_i$ subject to $Ax \le b$, $x_i \in \{0, 1\}$. Linear programming is a general-purpose tool for optimizing a linear objective function under linear constraints. Integer Linear Programming is not known to be polynomial-time solvable.

53 [Figure: feasible region of a linear program; Wikipedia] The feasible region of a set of constraints can be viewed as the set of all points that satisfy the constraints. All LP solvers search the space of solutions and try to find a point that optimizes the objective function.

54 [Figure: feasible region of a linear program; Wikipedia] Fact: Some vertex of the feasible region is optimal. Fact: A vertex is optimal if there is no better neighboring vertex. Dantzig (1947) came up with the simplex algorithm: Set $v$ = any vertex. While a neighboring vertex $v'$ has better cost: set $v = v'$.
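The first fact (some vertex is optimal) can be illustrated without a solver. The sketch below is a brute-force vertex enumeration for a two-variable LP, not the simplex algorithm: it intersects every pair of constraint boundaries, keeps the feasible intersection points, and returns the cheapest. It assumes a bounded, non-empty feasible region; the function name and the example constraints are hypothetical.

```python
from itertools import combinations


def solve_2d_lp(c, constraints, eps=1e-9):
    """Minimize c.x subject to a.x <= b for each (a, b), with x in the plane.

    Returns (cost, vertex) for the best feasible vertex, or None if no
    feasible vertex exists. Assumes the feasible region is bounded.
    """
    best = None
    for (a1, b1), (a2, b2) in combinations(constraints, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < eps:
            continue  # parallel boundaries never meet in a single point
        # Cramer's rule for the intersection of the two boundary lines.
        x = ((b1 * a2[1] - b2 * a1[1]) / det, (a1[0] * b2 - a2[0] * b1) / det)
        if all(a[0] * x[0] + a[1] * x[1] <= b + eps for a, b in constraints):
            cost = c[0] * x[0] + c[1] * x[1]
            if best is None or cost < best[0]:
                best = (cost, x)
    return best


# Example: minimize -x - 2y with x <= 2, y <= 3, x + y <= 4, x >= 0, y >= 0.
CONSTRAINTS = [((1, 0), 2), ((0, 1), 3), ((1, 1), 4), ((-1, 0), 0), ((0, -1), 0)]
print(solve_2d_lp((-1, -2), CONSTRAINTS))
```

Enumerating all vertices is exponential in the number of variables in general, which is why simplex walks from vertex to better neighboring vertex instead of listing them all.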

55 LP for Sidechain Placement
Minimize $E = \sum_{u \in V} E_{uu} x_{uu} + \sum_{\{u,v\} \in D} E_{uv} x_{uv}$
subject to
$\sum_{u \in V_j} x_{uu} = 1$ for $j = 1, \ldots, p$
$\sum_{u \in V_j} x_{uv} = x_{vv}$ for $j = 1, \ldots, p$ and $v \in V \setminus V_j$
$x_{uu}, x_{uv} \in \{0, 1\}$. (IP1) [Kingsford et al. 05]
Integer linear programming (ILP) gives linear constraints on a set of variables, and a linear cost function. The goal is to minimize cost (determined by variable choices) while satisfying the constraints. ILP does not care about the energy function, or about the fact that the interaction graph comes from a protein structure.

56 LP for Sidechain Placement
Minimize $E = \sum_{u \in V} E_{uu} x_{uu} + \sum_{\{u,v\} \in D} E_{uv} x_{uv}$
subject to
$\sum_{u \in V_j} x_{uu} = 1$ for $j = 1, \ldots, p$
$\sum_{u \in V_j} x_{uv} = x_{vv}$ for $j = 1, \ldots, p$ and $v \in N^+(V_j)$
$\sum_{u \in V_j : E_{uv} < 0} x_{uv} \le x_{vv}$ for $j = 1, \ldots, p$ and $v \in N^+(V_j)$
$x_{uu}, x_{uv} \in \{0, 1\}$ (IP2) [Kingsford et al. 05]
One simple optimization is to include only rotamer pairs that can interact with a non-zero pairwise energy. These pairs can be precomputed, and we can reduce the number of constraints.

57 LP for Sidechain Placement
Minimize $E = \sum_{u \in V} E_{uu} x_{uu} + \sum_{\{u,v\} \in D} E_{uv} x_{uv}$
subject to
$\sum_{u \in V_j} x_{uu} = 1$ for $j = 1, \ldots, p$
$\sum_{u \in V_j} x_{uv} = x_{vv}$ for $j = 1, \ldots, p$ and $v \in N^+(V_j)$
$\sum_{u \in V_j : E_{uv} < 0} x_{uv} \le x_{vv}$ for $j = 1, \ldots, p$ and $v \in N^+(V_j)$
$x_{uu}, x_{uv} \in \{0, 1\}$, relaxed to $x_{uu}, x_{uv} \le 1$ and $x_{uu}, x_{uv} \ge 0$ (IP2)
What does it mean for the integrality constraints to be relaxed?

58 LP for Sidechain Placement
Minimize $E = \sum_{u \in V} E_{uu} x_{uu} + \sum_{\{u,v\} \in D} E_{uv} x_{uv}$
subject to
$\sum_{u \in V_j} x_{uu} = 1$ for $j = 1, \ldots, p$
$\sum_{u \in V_j} x_{uv} = x_{vv}$ for $j = 1, \ldots, p$ and $v \in N^+(V_j)$
$\sum_{u \in V_j : E_{uv} < 0} x_{uv} \le x_{vv}$ for $j = 1, \ldots, p$ and $v \in N^+(V_j)$
$x_{uu}, x_{uv} \in \{0, 1\}$, relaxed to $x_{uu}, x_{uv} \le 1$ and $x_{uu}, x_{uv} \ge 0$ (IP2)
Relaxing the integrality constraints allows the application of a polynomial-time algorithm for finding an optimal solution for the given set of constraints and objective. What does it mean to have a fractional solution?

59 LP for Sidechain Placement
Minimize $E = \sum_{u \in V} E_{uu} x_{uu} + \sum_{\{u,v\} \in D} E_{uv} x_{uv}$
subject to
$\sum_{u \in V_j} x_{uu} = 1$ for $j = 1, \ldots, p$
$\sum_{u \in V_j} x_{uv} = x_{vv}$ for $j = 1, \ldots, p$ and $v \in N^+(V_j)$
$\sum_{u \in V_j : E_{uv} < 0} x_{uv} \le x_{vv}$ for $j = 1, \ldots, p$ and $v \in N^+(V_j)$
$x_{uu}, x_{uv} \in \{0, 1\}$, relaxed to $x_{uu}, x_{uv} \le 1$ and $x_{uu}, x_{uv} \ge 0$ (IP2)
$\sum_{u \in S_k} x_{uu} \le p - 1$ for $k = 1, \ldots, m - 1$
We can also run this method iteratively, excluding previously identified minimum-energy conformations from being selected.

60 Table 3. Prediction of side-chain conformations on native backbones, with a comparison of the LP/ILP prediction with those of other methods and the crystal structure.
Columns: Core residues | All residues
(a) LP/ILP χ1/χ1+2 | [first value lost]/62% | 80%/51%
(b) Scwrl χ1/χ1+2 | [first value lost]/60% | 80%/49%
(c) LP/ILP rmsd (Å) | [values lost]
(d) Scwrl rmsd (Å) | [values lost]
All values are averaged over the 25 proteins of Table 1. (a) The percentage of residues over all proteins for which the LP/ILP predicted conformation has the χ1 and χ1+2 dihedral angles within 20° of the native structure; (b) these values for Scwrl; (c) the rmsd of the predicted side-chain conformations from those of the native side chains using the LP/ILP method; and (d) these values for Scwrl.
Table 4. Prediction of side-chain conformations using homology modeling, with a comparison of the LP/ILP prediction with those of other methods and the crystal structure.
Columns: Core residues (Å) | All residues (Å)
(a) LP/ILP rmsd | [values lost]
(b) Scwrl rmsd | [values lost]
(c) Backbone rmsd | [values lost]
All values are averaged over the 33 problems of Table 2. (a) The rmsd between just side-chain atoms when comparing the LP/ILP predicted structure with the crystal structure; (b) this value when comparing the Scwrl predictions with the native structure; and (c) the rmsd between template and target structures when only considering backbone atoms.

61 Table 5. Proteins for which the core was redesigned. Column headers: Prot. Var len Rot Size Time (ILP) Rel gap N. [Table rows not recoverable from the extraction; several instances are marked "Integral" in the Rel gap column.]
Fig. 2. Relative gap between the optimal solution (with value OPT) and the nine next lowest-energy solutions (where the $i$-th solution has value $x_i$), shown for instances 1aho, 1aac, 1ctj, 1igd, and 1cex. Inset shows relative gaps for the 100 lowest-energy solutions for 1aac. Relative gap at each iteration $i$ is defined as $100\,(\mathrm{OPT} - x_i)/\mathrm{OPT}$.

62 Factorized Conformation Space [Figure: residues $i$ and $j$, each with rotamers 1, 2, 3, labeled with the energy terms $E(i, j)$, $E(i, i')$, and $E(j, j')$] Implicit: (1, 2, 3), (1, 2, 3) Rather than defining the conformation space explicitly, by considering only local interactions we can obtain a compact, factored representation of the conformation space.

63 Factorized Conformation Space Rather than defining the conformation space explicitly, by considering only local interactions we can obtain a compact, factored representation of the conformation space.

64 Protein Interaction Graph We can construct an interaction graph of residues in which edges are defined for residues that are close enough. These define pairwise energy terms for any chosen pair of rotamers.

65 Factor Graphs [Figure: chain-structured factor graph over variables $u, v, x, y, z$ with factors $f_1, \ldots, f_5$] [Loeliger et al. 01] [Pearl 88, Jordan...] Suppose we know that the likelihood function is: $g(u, v, x, y, z) = f_1(u, v)\, f_2(v, x)\, f_3(x, y)\, f_4(y, z)\, f_5(z)$. MAP Configuration: Find the configuration of variables that maximizes $g$. Marginalization: Find the marginal value of $g$ on a particular variable. For example:
$$g_z(z) = \sum_{u,v,x,y} f_1(u, v)\, f_2(v, x)\, f_3(x, y)\, f_4(y, z)\, f_5(z)$$
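The marginal on $z$ can be computed either by summing over all joint configurations or by pushing each sum inward along the chain. The sketch below compares the two on made-up binary factor tables; the table values and function names are hypothetical.

```python
from itertools import product

# Invented factor tables over binary variables; F1[u][v], F2[v][x], and so on.
F1 = [[1, 2], [3, 1]]
F2 = [[2, 1], [1, 3]]
F3 = [[1, 1], [2, 1]]
F4 = [[3, 1], [1, 2]]
F5 = [1, 2]


def marginal_z_brute_force():
    """Sum the full product over every joint assignment of u, v, x, y."""
    g_z = [0, 0]
    for u, v, x, y, z in product(range(2), repeat=5):
        g_z[z] += F1[u][v] * F2[v][x] * F3[x][y] * F4[y][z] * F5[z]
    return g_z


def marginal_z_elimination():
    """Eliminate u, v, x, y one at a time, passing a small table forward."""
    m_v = [sum(F1[u][v] for u in range(2)) for v in range(2)]
    m_x = [sum(m_v[v] * F2[v][x] for v in range(2)) for x in range(2)]
    m_y = [sum(m_x[x] * F3[x][y] for x in range(2)) for y in range(2)]
    return [F5[z] * sum(m_y[y] * F4[y][z] for y in range(2)) for z in range(2)]


print(marginal_z_brute_force(), marginal_z_elimination())
```

The brute-force version touches $d^5$ configurations for $d$ states per variable, while elimination does a constant number of $d \times d$ table operations, which is the saving that message passing exploits.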

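66 Factor Graphs [Figure: chain-structured factor graph over variables $u, v, x, y, z$ with factors $f_1, \ldots, f_5$] [Loeliger et al. 01] [Pearl 88, Jordan...] Suppose we know that the likelihood function is: $g(u, v, x, y, z) = f_1(u, v)\, f_2(v, x)\, f_3(x, y)\, f_4(y, z)\, f_5(z)$. MAP Configuration: Find the configuration of variables that maximizes $g$. Here, variables take on a fixed number of states, and factors define local interactions.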

67 Factor Graphs [Figure: chain-structured factor graph over variables $u, v, x, y, z$ with factors $f_1, \ldots, f_5$] [Loeliger et al. 01] [Pearl 88, Jordan...] Suppose we know that the likelihood function is: $g(u, v, x, y, z) = f_1(u, v)\, f_2(v, x)\, f_3(x, y)\, f_4(y, z)\, f_5(z)$. MAP Configuration: Find the configuration of variables that maximizes $g$:
$$\max_{\langle u,v,x,y,z \rangle} g(u, v, x, y, z) = \max_z f_5(z) \max_y f_4(y, z) \max_x f_3(x, y) \max_v f_2(v, x) \max_u f_1(u, v)$$
We can define likelihoods using the Boltzmann distribution: $\Pr[\rho] \propto e^{-E(\rho)}$

68 [Figure: factor graph for 1UBQ (ubiquitin)] To construct a protein factor graph, we take each amino acid in the primary sequence as a variable, and its sidechain rotamers as states. Univariate and bivariate factors are defined using self- and pairwise energies (i.e., probabilities). A MAP configuration corresponds to a minimum-energy conformation. Boltzmann distribution: $\Pr[\rho] \propto e^{-E(\rho)}$. The model (with appropriate parameters) can be used to analyze protein energetics [Yanover/Weiss 02, Xu 05, Kamisetty et al 07, 11].

69 Max-Product Algorithm [Figure: chain-structured factor graph over variables $u, v, x, y, z$ with factors $f_1, \ldots, f_5$] Maximization can be computed by message passing:
$$\mu_{f_j \to x_i}(x_i) = \max_{X_j \setminus x_i} f_j(X_j) \prod_{x \in X_j \setminus x_i} \mu_{x \to f_j}(x)$$
$$\max_z f_5(z) \max_y f_4(y, z) \max_x f_3(x, y) \max_v f_2(v, x) \max_u f_1(u, v)$$
Once all messages have been passed, we can assign a maximizing configuration starting at leaf factors. [Pearl 88]
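For a chain-structured factor graph, max-product reduces to a forward pass of max messages followed by a backward pass that reads off the maximizing states. The sketch below assumes every variable has the same number of states and uses invented factor tables; `map_chain` and `brute_force_map` are hypothetical names, and ties are broken in favor of the lowest state index.

```python
from itertools import product


def map_chain(pair_factors, last_unary):
    """Max-product on a chain x0 - x1 - ... - xn with a unary factor on xn.

    pair_factors[t][a][b] couples x_t = a with x_{t+1} = b.
    Returns (best value, maximizing configuration as a tuple).
    """
    d = len(last_unary)
    msg = [1] * d  # nothing lies to the left of x0
    back = []
    for f in pair_factors:
        new_msg, arg = [], []
        for b in range(d):
            scores = [msg[a] * f[a][b] for a in range(d)]
            best_a = max(range(d), key=scores.__getitem__)
            new_msg.append(scores[best_a])
            arg.append(best_a)
        msg = new_msg
        back.append(arg)
    final = [msg[b] * last_unary[b] for b in range(d)]
    state = max(range(d), key=final.__getitem__)
    best_value = final[state]
    config = [state]
    for arg in reversed(back):  # backtrack from the last variable to the first
        state = arg[state]
        config.append(state)
    return best_value, tuple(reversed(config))


def brute_force_map(pair_factors, last_unary):
    """Reference answer: the largest product over all joint configurations."""
    d, n = len(last_unary), len(pair_factors) + 1
    best = 0
    for cfg in product(range(d), repeat=n):
        value = last_unary[cfg[-1]]
        for t, f in enumerate(pair_factors):
            value *= f[cfg[t]][cfg[t + 1]]
        best = max(best, value)
    return best


# Invented tables for the chain u - v - x - y - z with a unary factor on z.
FACTORS = [[[1, 2], [3, 1]], [[2, 1], [1, 3]], [[1, 1], [2, 1]], [[3, 1], [1, 2]]]
UNARY = [1, 2]
print(map_chain(FACTORS, UNARY))
```

With energies in place of these tables, setting each factor to $e^{-E}$ turns the maximizing configuration into a minimum-energy conformation.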

70 Dealing with Cycles [Figure: factor graph containing a cycle] [Pearl 88, Yedidia et al] Computing marginals or MAP configurations exactly in a model with cycles is NP-hard. However, we can still use the sum-product algorithm in two ways: (1) collapse multiple variables into a single variable to eliminate cycles; (2) run sum-product as before, but until convergence. One method is exact, while the other is approximate.

71 Dealing with Cycles [Figure: the same factor graph with variables grouped to remove the cycle] [Pearl 88, Yedidia et al] Computing marginals or MAP configurations exactly in a model with cycles is NP-hard. However, we can still use the sum-product algorithm in two ways: (1) collapse multiple variables into a single variable to eliminate cycles; (2) run sum-product as before, but until convergence. Variable/factor grouping must be chosen carefully to avoid state-space explosion.

72 Unfortunately exact methods are prohibitively expensive if we consider longer-range interactions. We can approximate by stopping message passing near (or at) convergence.

73 Tree Decomposition Fig. 1. Example of a residue interaction graph. Fig. 2. Example of the biconnected-component decomposition of a graph. The width of this decomposition is 6. Given a factor graph, we can actually reorganize it as long as we don't lose any dependencies. But we don't want to add too many unnecessary ones either.

74 Tree Decomposition Fig. 3. Example of a tree decomposition of a graph with width 3 (bags: fh, abd, acd, cdem, defm, fg, clk, eij). Given a factor graph, we can actually reorganize it as long as we don't lose any dependencies. But we don't want to add too many unnecessary ones either. In general, which trees capture the original graph, and how can we measure how good a particular tree is?

75 Tree Decomposition Fig. 3. Example of a tree decomposition of a graph with width 3 (bags: fh, abd, acd, cdem, defm, fg, clk, eij). A tree decomposition is a tree on vertex subsets $X_i$ that satisfies the following:
1. The union of all vertex subsets equals the original vertex set.
2. For any edge $(u, v)$ in the original graph, there is some $X_i$ with $u, v \in X_i$.
3. If $v \in X_i$ and $v \in X_j$, then $v \in X_k$ for all $X_k$ on the path between $X_i$ and $X_j$.
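The three conditions can be checked mechanically. The following is a minimal sketch: `bags` maps a hypothetical bag label to its vertex subset, `tree_edges` lists the edges between bags, and the code assumes (without checking) that `tree_edges` really forms a tree. Condition 3 is tested in its equivalent form, namely that the bags containing any given vertex are connected in the tree.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the three tree-decomposition conditions for the given bags."""
    # 1. Every vertex appears in some bag, and bags contain nothing else.
    if set().union(*bags.values()) != set(vertices):
        return False
    # 2. Every graph edge lies inside at least one bag.
    for u, v in edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    # 3. The bags holding a given vertex form a connected part of the tree.
    adjacency = {label: set() for label in bags}
    for a, b in tree_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    for v in vertices:
        holding = {label for label, bag in bags.items() if v in bag}
        start = next(iter(holding))
        seen, stack = {start}, [start]
        while stack:
            current = stack.pop()
            for neighbor in adjacency[current]:
                if neighbor in holding and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if seen != holding:
            return False
    return True


def width(bags):
    """Width of a decomposition: size of the largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# A four-cycle a-b-c-d-a and two candidate decompositions.
CYCLE_V = ["a", "b", "c", "d"]
CYCLE_E = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
GOOD_BAGS = {1: {"a", "b", "c"}, 2: {"a", "c", "d"}}
BAD_BAGS = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}, 4: {"d", "a"}}

print(is_tree_decomposition(CYCLE_V, CYCLE_E, GOOD_BAGS, [(1, 2)]), width(GOOD_BAGS))
```

The second candidate covers every vertex and edge but places `a` in the two end bags of a path without placing it in the middle ones, so it fails the third condition.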

76 Standard Applications Stereo Vision [Sontag 10] Signal Processing Coding [Söding 05] [McEliece et al. 98]

77 General Graphs a b a b c a b d c d c e f d e x f y Tree Decomposition (NP-Hard) e x f y u v u x v y Loopy Graph Junction Tree To deal with graphs with cycles, we group variables such that the original likelihood function is unchanged but we obtain a tree-structured model. If this junction-tree has treewidth, sum-product requires O(n d ) time.

78 Sum-Product is Fragile [Figure: a tree-structured factor graph before and after an update] Updating a tree-structured factor graph can change messages in an execution of the sum-product algorithm. [Figure: a loopy graph and its junction tree before and after adding the edge $(u, v)$] Adding a cycle to the input graph can change nodes in the junction tree (and associated factor graph).

79 Clustering in Factor Graphs [Figure: a factor graph before and after one round of rake and compress, with the resulting cluster functions] In each round of clustering, we rake all leaves and compress a maximal independent set of degree-two nodes [Miller/Reif 84], while computing cluster functions.

80 Tree Contraction [Figure: successive rake and compress rounds contracting the factor tree until a final cluster remains, with a cluster function computed at each step; the last one shown is $\sum_{u,v,x} f_1(u, v)\, f_3(x, y)\, f_2(v, x, y)$] How long do intermediate cluster function computations take? How many rounds until everything is eliminated?

81 Cluster Tree [Figure: the cluster tree produced by contraction, annotated with $O(n d^3)$ time] We also keep track of the boundaries, defined as the set of edges leaving a cluster at the time of its creation during contraction.

82 Computing Marginals [Figure: marginalization functions propagated down the cluster tree] Any marginal can be computed in $O(d^2 \log n)$ time.

83 Dealing with Cycles [Figure: factor graph containing a cycle] [Pearl 88, Yedidia et al] Computing marginals or MAP configurations exactly in a model with cycles is NP-hard. However, we can still use the sum-product algorithm in two ways: (1) collapse multiple variables into a single variable to eliminate cycles; (2) run max-product as before, but until convergence. The focus of research in approximate methods is on improving convergence times.

84 Message Passing and Free Energy We have been trying to minimize the potential energy of a protein conformation. But given that proteins exist in an ensemble of conformations, what do we minimize? The free energy of a protein is defined as $G = H - TS$, where $H$ is the enthalpy of the system and $S$ is the entropy of the system. How does this relate to graphical models? For the Boltzmann distribution $p(\rho) = e^{-E(\rho)}/Z$ (units in which $T = 1$) we can define:
$$G = \sum_{\rho} p(\rho) E(\rho) + T \sum_{\rho} p(\rho) \ln p(\rho) = -\ln Z$$
Here, $Z$ is the normalizing constant, or partition function.

85 Approximate Inference Can we simplify the model in order to make it tractable? How do we do this? What can we say about the associated global likelihood? We'd like to relate our approximation $b(\rho)$ with the underlying global distribution $p(\rho)$. The Kullback-Leibler distance between $p(\rho)$ and $b(\rho)$ is defined as:
$$D(b; p) = \sum_{\rho} b(\rho) \ln \frac{b(\rho)}{p(\rho)}$$
Using that $p(\rho) = e^{-E(\rho)}/Z$ we get that:
$$D(b; p) = \sum_{\rho} b(\rho) \ln b(\rho) + \sum_{\rho} b(\rho) E(\rho) + \ln Z$$
which is minimized when $b = p$, and we get:
$$G = \sum_{\rho} p(\rho) E(\rho) + T \sum_{\rho} p(\rho) \ln p(\rho) = -\ln Z$$
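The relation between the Kullback-Leibler distance and the free energy can be checked numerically on a tiny state space. This is a sketch under stated assumptions: three states with invented energies, $T = 1$, and a trial distribution `b` chosen arbitrarily; the function names are hypothetical.

```python
import math


def boltzmann(energies):
    """Return the Boltzmann distribution p = exp(-E)/Z and the constant Z."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], z


def free_energy(b, energies):
    """G(b) = sum b*E + sum b*ln(b), taking T = 1."""
    average_energy = sum(p * e for p, e in zip(b, energies))
    negative_entropy = sum(p * math.log(p) for p in b if p > 0)
    return average_energy + negative_entropy


def kl_distance(b, p):
    """D(b; p) = sum b*ln(b/p)."""
    return sum(bi * math.log(bi / pi) for bi, pi in zip(b, p) if bi > 0)


ENERGIES = [0.0, 1.0, 2.0]      # invented energies for three conformations
TRIAL_B = [0.5, 0.3, 0.2]       # an arbitrary approximating distribution

p, z = boltzmann(ENERGIES)
print(kl_distance(TRIAL_B, p), free_energy(TRIAL_B, ENERGIES) + math.log(z))
```

The two printed numbers agree because $D(b; p) = G(b) + \ln Z$, and since the distance is never negative, $G(b)$ is smallest (equal to $-\ln Z$) exactly when $b = p$.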

86 Variational Inference Now, the fit of our estimated $b(\rho)$ can be measured using:
$$D(b; p) = \sum_{\rho} b(\rho) \ln b(\rho) + \sum_{\rho} b(\rho) E(\rho) + \ln Z$$
The variational approach to message-passing seeks to perform inference efficiently, while using bounds on $\ln Z$ to obtain a goodness of fit. Can you think of a lower bound for $\ln Z$? An upper bound? A key area of research is to develop bounds useful for performing inference. How does all this relate back to protein structure?

87 [Lange Lab, TU-München] How does the potential-energy based view of protein design differ from the free-energy based view?

88 Free Energy Is it easy to compute the free energy of a given protein sequence (with fixed backbone)? Can we minimize the free energy for a particular choice of sequence for protein design? How can we use graphical models? Are there other (more efficient/accurate) approaches?
