Novel Monte Carlo Methods for Protein Structure Modeling Jinfeng Zhang Department of Statistics Harvard University
Introduction: Machines of Life. Proteins play crucial roles in virtually all biological processes. Examples from the Protein Data Bank (PDB): myosin, rhodopsin, hemoglobin, pepsin.
Introduction: From Sequence to Structure to Function. The function of a protein is governed by its three-dimensional structure. Structure is determined by sequence. [Diagram labels: Genome Sequencing Projects; Function.]
Structural Genomics: the next step beyond the human genome project. X-ray, NMR: at least one structure in each protein family (~8000). Computational prediction: more than 3,000,000 proteins.
Outline. Markov Chain Monte Carlo (MCMC) method for protein structure prediction. Sequential Monte Carlo (SMC) method for characterizing ensembles of protein conformations. Summary & Discussions.
Protein Folding. $p(x) = \frac{e^{-E(x)/k_B T}}{Z(T)}$, where $Z(T) = \sum_i e^{-E(x_i)/k_B T}$.
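As a small numerical illustration of the Boltzmann distribution on this slide, the sketch below computes $p(x_i)$ for a finite list of energies. The function name `boltzmann_probabilities` and the example energies are assumptions made for illustration; they do not come from the talk.

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    """Return p(x_i) = exp(-E_i / kT) / Z for a finite list of energies."""
    # Shifting by the minimum energy avoids overflow; the shift cancels in the ratio.
    e_min = min(energies)
    factors = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(factors)  # partition function (up to the constant shift factor)
    return [f / z for f in factors]

# Hypothetical three-state example: lower energy means higher probability.
probs = boltzmann_probabilities([-2.0, -1.0, 0.0], kT=1.0)
```

Lowering `kT` concentrates the probability on the minimum-energy state, which is the behavior simulated annealing relies on.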
Simplified Protein Folding Model: HP Model. Protein sequence: hydrophobic (H) and polar (P) residues, e.g. HHHPHHPHHHHPHHP. Protein conformation: a self-avoiding chain on a 2D square or 3D cubic lattice. Energy: $E = -n$, where $n$ is the number of hydrophobic contacts (H-H pairs that are lattice neighbors but not adjacent in the chain). Assumption: the native conformation is the one with minimum energy. Finding the minimum-energy conformation is an NP-complete problem.
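The HP energy $E = -n$ can be sketched for the 2D square lattice as follows. This is a minimal illustration, not the code used in the talk; the function name `hp_energy` and the coordinate representation (a list of integer `(x, y)` tuples) are assumptions.

```python
def hp_energy(sequence, coords):
    """Return E = -n for an HP chain on the 2D square lattice.

    n counts H-H pairs that are lattice neighbors but not chain neighbors.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        if abs(x0 - x1) + abs(y0 - y1) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")

    index_at = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts

# A four-residue chain folded into a unit square: residues 0 and 3 touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
energy = hp_energy("HPPH", square)
```

The same idea extends to the 3D cubic lattice by using `(x, y, z)` tuples and three positive unit directions.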
A Sequence of 64 Residues: HHHHHHHHHHHHPHPHPPHHPPHHPPHPPHHPPHHPPHPPHHPPHHPPHPHPHHHHHHHHHHHH, E = -42.
HP Model and Lattice Polymers: A Brief Review. Lattice models: used in polymer science since the 1950s. HP model: introduced by Dill in 1985; simple, yet it keeps basic features of protein folding; used to study protein thermodynamics, folding principles, designability, and protein evolution. Studies on folding HP sequences: HZ, CHCC, CI, GA, SA, GTS, EMC, GMC, GSA, PERM, SISPER, ACO, EES, npermis, npermh. Folding remains an unsolved problem for sequences of medium length.
Exploring the Energy Landscape: Metropolis-Hastings Algorithm. Start with a conformation $x$. Draw $y$ from the proposal $q(x, y) = q(y \mid x)$. Accept the new conformation with probability $a(x, y) = \min\left\{1, \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)}\right\} = \min\left\{1, e^{[E(x) - E(y)]/k_B T}\right\}$, where the second equality holds for a symmetric proposal.
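The acceptance rule for a symmetric proposal, $\min\{1, e^{[E(x) - E(y)]/k_B T}\}$, can be sketched as below. The function names are illustrative assumptions, and `kT` stands for the product $k_B T$.

```python
import math
import random

def acceptance_probability(e_old, e_new, kT):
    """min{1, exp((E_old - E_new) / kT)} for a symmetric proposal."""
    if e_new <= e_old:
        return 1.0  # downhill or flat moves are always accepted
    return math.exp((e_old - e_new) / kT)

def metropolis_accept(e_old, e_new, kT, rng=random):
    """Return True if the proposed conformation should replace the current one."""
    return rng.random() < acceptance_probability(e_old, e_new, kT)

# Hypothetical energies: an uphill move from E = -5 to E = -3 at kT = 1.
p_uphill = acceptance_probability(-5, -3, kT=1.0)
```

Handling the downhill case separately also avoids evaluating a large positive exponent.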
Moves on Lattice. Pivot move. Local moves: end move, corner move, crankshaft move.
Fragment Re-growth via Energy-guided Sequential Sampling (FRESS)
1. Sample a fragment length $l$ from $L_{min}$ to $L_{max}$. For example, $l \sim U[L_{min} = 2, L_{max} = 12]$, and $l = 5$.
2. Select and delete a random fragment of length $l$.
3. Sample the fragment sequentially, with $p_j \propto \exp(-E_j / T)$.
4. Accept or reject the newly sampled conformation using the Metropolis criterion:
$p = \min\{1, e^{-(E_{t+1} - E_t)/T}\}$
$p = \min\left\{1, \frac{w_t}{w_{t+1}}\, e^{-(E_{t+1} - E_t)/T}\right\}$
5. Simulated annealing.
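Steps 1 and 3 of FRESS can be illustrated with a short sketch: drawing a fragment length uniformly and choosing one growth direction with probability $p_j \propto \exp(-E_j/T)$. This is not the author's implementation; the function names and the candidate energies are hypothetical, and the lattice bookkeeping (deleting the fragment, listing free neighbor sites, closing the chain) is left out.

```python
import math
import random

def sample_fragment_length(l_min, l_max, rng):
    """Step 1: draw l uniformly from the integers l_min..l_max (inclusive)."""
    return rng.randint(l_min, l_max)

def growth_probabilities(candidate_energies, temperature):
    """Step 3: p_j proportional to exp(-E_j / T) over the candidate placements."""
    e_min = min(candidate_energies)  # shift for numerical stability
    factors = [math.exp(-(e - e_min) / temperature) for e in candidate_energies]
    total = sum(factors)
    return [f / total for f in factors]

def sample_growth_step(candidate_energies, temperature, rng):
    """Pick one candidate and return (index, probability of that choice)."""
    probs = growth_probabilities(candidate_energies, temperature)
    j = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return j, probs[j]

rng = random.Random(0)
length = sample_fragment_length(2, 12, rng)
# Hypothetical energies of three free neighbor sites for the next residue.
choice, prob = sample_growth_step([-1.0, 0.0, 0.0], temperature=1.0, rng=rng)
```

Returning the probability of each choice is a design convenience: the product of these probabilities over the fragment is what a Metropolis-Hastings correction for the biased proposal would need.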
FRESS Movie 13
2D Sequences
Seq. code  L    Sequence
2D50       50   HHPHPHPHPHHHHPHPPPHPPPHPPPPHPPPHPPPHPHHHHPHPHPHPHH
2D60       60   PPHHHPHHHHHHHHPPPHHHHHHHHHHPHPPPHHHHHHHHHHHHPPPPHHHHHHPHHPHP
2D64       64   HHHHHHHHHHHHPHPHPPHHPPHHPPHPPHHPPHHPPHPPHHPPHHPPHPHPHHHHHHHHHHHH
2D85       85   HHHHPPPPHHHHHHHHHHHHPPPPPPHHHHHHHHHHHHPPPHHHHHHHHHHHHPPPHHHHHHHHHHHHPPPHPPHHPPHHPPHPH
2D100a     100  PPPPPPHPHHPPPPPHHHPHHHHHPHHPPPPHHPPHHPHHHHHPHHHHHHHHHHPHHPHHHHHHHPPPPPPPPPPPHHHHHHHPPHPHHHPPPPPPHPHH
2D100b     100  PPPHHPPHHHHPPHHHPHHPHHPHHHHPPPPPPPPHHHHHHPPHHHHHHPPPPPPPPPHPHHPHHHHHHHHHHHPPHHHPHHPHPPHPHHHPPPPPPHHH
Comparison on Folding Long HP Sequences
2D seq.  EMC  SISPER  GSA  EES  npermis  FRESS
2D50     -21  -21     NA   -21  NA       -21
2D60     -35  -36     -36  -36  -36      -36
2D64     -39  -39     -42  -42  -42      -42
2D85     NA   -52     -52  -53  -53      -53
2D100a   NA   -48     -48  -48  -48      -48
2D100b   NA   -49     -50  -49  -50      -50
Minimum Energy Conformations of 2D64 16
Minimum Energy Conformations of 2D100 17
3D Sequences
Seq. code  L    Sequence
3D58       58   PHPHHHPHHHPPHHPHPHHPHHHPHPHPHHPPHHHPPHPHPPPPHPPHPPHHPPHPPH
3D64       64   PHHPHHPHHHPPHPHPPHPHPPHHHPHHPHHPPHHPHHPHHHPPHPHPPHPHPPHHHPHHPHHP
3D67       67   PHPHHPHHPHPPHHHPPPHPHHPHHPHPPHHHPPPHPHHPHHPHPPHHHPPPHPHHPHHPHPPHHHP
3D88       88   PHPHHPHHPHPPHHPPHPPHPPHPPHPPHPPHHPPHHHPPHHHPPHHHPPHHHPPHPHHPHHPHPPHPPHPPHHPPHPPHPPHHPPHP
3D103      103  PPHHPPPPPHHPPHHPHPPHPPPPPPPHPPPHHPHHPPPPPPHPPHPHPPHPPPPPHHHPPPPHHPHHPPPPPHHPPPPHHHHPHPPPPPPPPHHHHHPPHPP
3D124      124  PPPHHHPHPPPPHPPPPPHHPPPPHHPPHHPPPPHPPPPHPPHPPHHPPPHHPHPHHHPPPPHHHPPPPPPHHPPHPPHPHPPHPPPPPPPHPPHHHPPPPHPPPHHHHHPPPPHHPHPHPHPH
3D136      136  HPPPPPHPPPPHPHHPHHPPPPHPHHHPPPPHPHPHHHHPPPPPPPPPPPHPPHPPPHPHHPPPHHPPHPPHPHPHPPPPPPPPHPPPHHHHHHPPPHHPPHHHPPPHHPHHHHHPPPPPPPPPHPPPPHPHPPPP
Comparison on 3D HP Sequences
3D seq.  CI (1996)  npermis (2003)  npermh (2005)  FRESS (2006)
3D58     -42        -44 (0.19*)     -44 (1.10)     -44 (0.09)
3D64     NA         -56 (0.45)      -56 (0.47)     -56 (0.53)
3D67     NA         -56 (1.10)      -56 (0.33)     -56 (1.41)
3D88     NA         -69 (NA)        -69 (0.45)     -72 (5.03)
3D103    -49        -54 (3.12)      -55 (0.25)     -57 (4.47)
3D124    -58        -71 (12.3)      -71 (1.19)     -75 (280)
3D136    -65        -80 (110)       NA             -83 (350)
* Times are in CPU hours. CPU: npermis 667 MHz, npermh 1.84 GHz, FRESS 1.4 GHz. For 3D124, -74 was found in 4.8 hours, and for 3D136, -82 was found in 6.4 hours.
Minimum Energy Conformations. Sequence 3D88: PHPHHPHHPHPPHHPPHPPHPPHPPHPPHPPHHPPHHHPPHHHPPHHHPPHHHPPHPHHPHHPHPPHPPHPPHHPPHPPHPPHHPPHP, E = -72; the previous minimum energy is -69.
Minimum Energy Conformations. Sequence 3D103: E = -57; the previous minimum energy is -55. Sequence 3D124: E = -75; the previous minimum energy is -71.
Minimum Energy Conformations. Sequence 3D136: HPPPPPHPPPPHPHHPHHPPPPHPHHHPPPPHPHPHHHHPPPPPPPPPPPHPPHPPPHPHHPPPHHPPHPPHPHPHPPPPPPPPHPPPHHHHHHPPPHHPPHHHPPPHHPHHHHHPPPPPPPPPHPPPPHPHPPPP, E = -83; the previous minimum energy is -80.
Characterizing Ensemble Protein Conformations by Sequential Monte Carlo Method 23
X-ray Structures 24
Proteins are Dynamic Molecules 25
YJ. Huang and GT. Montelione, Nature, 438, (2005), 36-37. 26
N Furnham, T Blundell, M DePristo, Nature Structural & Molecular Biology, (2006) 13:184-185. A more suitable representation of a macro-molecular crystal structure would be an ensemble of models. The range of structures in the ensemble would be considered by any user of the structural information.
Characterizing Ensemble Conformations. Backbone: ensemble of structures with different backbone conformations. J. Zhang et al. (2007), Proteins, 66: 61-68. Side-chain: ensemble of structures with the same backbone but different side-chain conformations. J. Zhang & JS Liu (2006), PLoS Comp Biol, 2(12): e168.
Ensemble of Side-chain Conformations 29
Ensemble of Side-chain Conformations. Number of side-chain conformations: $N_{sc}$. Side-chain conformational entropy: $S_{sc} = k_B \ln(N_{sc})$. Protein stability: $G = H - TS$. (http://wishart.biology.ualberta.ca/moviemaker)
Side-chain Modeling. All heavy atoms are explicitly modeled. Side-chain flexibility: rotamer library by D. Richardson. Excluded volume effect: a pair of atoms $i$ and $j$ is considered a hard clash if $r_{ij} < a\,(r_0(i) + r_0(j))$, where $r_{ij}$ is the distance between the atoms, $r_0(i)$ and $r_0(j)$ are the van der Waals radii of the two atoms, and $a$ is a scaling coefficient.
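The hard-clash criterion $r_{ij} < a\,(r_0(i) + r_0(j))$ is easy to express in code. The sketch below is illustrative only: the function name, the radius of 1.7 and the scaling values 0.6 and 0.5 are assumptions chosen for the example, not parameters confirmed by this slide.

```python
import math

def is_hard_clash(pos_i, pos_j, radius_i, radius_j, scale):
    """Return True if r_ij < scale * (r0(i) + r0(j))."""
    r_ij = math.dist(pos_i, pos_j)  # Euclidean distance between atom centers
    return r_ij < scale * (radius_i + radius_j)

# Two hypothetical atoms 2.0 apart, each with an assumed radius of 1.7.
atom_a = (0.0, 0.0, 0.0)
atom_b = (2.0, 0.0, 0.0)
clash_loose = is_hard_clash(atom_a, atom_b, 1.7, 1.7, scale=0.6)  # threshold 2.04
clash_tight = is_hard_clash(atom_a, atom_b, 1.7, 1.7, scale=0.5)  # threshold 1.70
```

A smaller scaling coefficient tolerates more overlap, so fewer rotamer combinations are rejected.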
Sequential Monte Carlo (SMC). $S_n = (r_1, \ldots, r_j, \ldots, r_n)$, $r_j \in R_j = \{1, \ldots, M_j\}$. SMC: sample one side chain ($r$) at a time and fix the residues that are already sampled. Target: $\sum_{S_n \in \Omega_n} h(S_n)$. For each sample $i$, there is an associated weight $w^{(i)}$. At step $t$, a residue $r_t$ is picked, and a rotamer $k$ is sampled from a given distribution with probability $p_k$. Update the weight by $w_{t+1}^{(i)} = w_t^{(i)} / p_k$. Estimate: $\sum_{S_n \in \Omega_n} h(S_n) \approx \frac{1}{m} \sum_{i=1}^{m} w_n^{(i)}\, h(S_n^{(i)})$.
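The weight update $w_{t+1} = w_t / p_k$ and the final average can be tried on a toy counting problem, where $h = 1$ and the target is the number of clash-free rotamer assignments. Everything specific here is an assumption for illustration: the clash rule between neighboring residues is invented, rotamers are drawn uniformly among the allowed ones, and residues are visited in chain order. Brute-force enumeration is included as a check.

```python
import itertools
import random

def clashes(a, b):
    """Toy rule (hypothetical): rotamer b of the next residue clashes with a."""
    return b == a or b == a + 1

def allowed_rotamers(placed, n_rotamers):
    """Rotamers of the next residue compatible with the ones already fixed."""
    if not placed:
        return list(range(n_rotamers))
    return [k for k in range(n_rotamers) if not clashes(placed[-1], k)]

def smc_count(n_residues, n_rotamers, n_samples, rng):
    """Estimate the number of clash-free assignments as the mean weight."""
    total = 0.0
    for _ in range(n_samples):
        placed, weight = [], 1.0
        for _step in range(n_residues):
            options = allowed_rotamers(placed, n_rotamers)
            if not options:  # dead end: this sample contributes zero
                weight = 0.0
                break
            placed.append(rng.choice(options))
            weight *= len(options)  # w <- w / p_k with p_k = 1 / len(options)
        total += weight
    return total / n_samples

def exact_count(n_residues, n_rotamers):
    """Brute-force enumeration, feasible only for tiny problems."""
    return sum(
        all(not clashes(a, b) for a, b in zip(conf, conf[1:]))
        for conf in itertools.product(range(n_rotamers), repeat=n_residues)
    )

estimate = smc_count(6, 3, n_samples=20000, rng=random.Random(1))
exact = exact_count(6, 3)
```

The entropy in units of $k_B$ would then be the logarithm of the estimated count, following $S_{sc} = k_B \ln(N_{sc})$.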
Performance. [Figure a: $S_{sc}/k_B$ versus fragment length for 2ovo and 3ebx, enumeration compared with SMC. Figure b: standard deviation versus sample size (100, 1000, 2000) for 2erl (40 aa, 48.1), 4rnt (104 aa, 109.3), 1fi2a (201 aa, 250.0), and 1uzba (516 aa, 672.5).] The total number of self-avoiding side-chain conformations for the fragment of 3ebx, residues 1-17, is 396,325,923,840 ≈ $3.96 \times 10^{11}$; the SMC estimate is $4.01 \times 10^{11}$ with a sample size of 1000.
Sequential Sampling of Side Chains 34
Effect of Side-chain Entropy (SCE) on Protein Stability. Native & decoy structures: Decoys 'R' Us database.
Native & Decoy Structures: 1ctf. [Figure: $S_{sc}/k_B$ versus contact number for the native structure and decoys of 1ctf.] $S_{sc}$ can differ by more than 20 in units of $k_B$, which corresponds to -11.9 kcal/mol of free energy at 300 K. The stability of a protein is around -5 to -20 kcal/mol.
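The conversion quoted on this slide can be checked with a few lines of arithmetic: an entropy difference of 20 $k_B$ contributes $-T \Delta S$ to the free energy, and per mole $k_B$ becomes the gas constant. The constant below is the standard gas constant in kcal/(mol K); the function name is an assumption.

```python
GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol K), i.e. k_B per mole

def entropy_to_free_energy(delta_s_in_kb, temperature_k):
    """Return -T * dS in kcal/mol for an entropy difference given in units of k_B."""
    return -GAS_CONSTANT_KCAL * temperature_k * delta_s_in_kb

# 20 k_B at 300 K, as on the slide.
delta_g = entropy_to_free_energy(20, 300)
```

Rounded to one decimal place, `delta_g` reproduces the -11.9 kcal/mol quoted above, which is comparable to the stated stability range of -5 to -20 kcal/mol.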
Incorporating SCE in Energy Function. $G = H$, where $H$ is the residue contact potential. $G = H - T S_{sc}$, where $H$ is the residue contact potential, $S_{sc}$ is the side-chain entropy, and $T = 1$.
ΔH vs. ΔH - ΔS_sc
Protein ID         ΔH   ΔH - ΔS_sc
1ctf (A, 630)*     6    1
1r69 (A, 675)      24   5
1sn3 (A, 660)      86   10
2cro (A, 674)      63   5
3icb (A, 653)      19   25
4pti (A, 687)      143  83
4rxn (A, 677)      14   7
1fc2 (B, 500)      7    5
1hdd-C (B, 500)    10   5
2cro (B, 500)      47   17
4icb (B, 500)      1    1
1bl0 (C, 971)      851  4
1eh2 (C, 2413)     995  3
1jwe (C, 1407)     288  1
smd3 (C, 1200)     266  1
1beo (D, 2000)     67   2
1ctf (D, 2000)     10   1
1dkt-A (D, 2000)   588  5
1fca (D, 2000)     136  10
1nkl (D, 2000)     217  3
1pgb (D, 1572)     12   1
1b0n-B (E, 497)    114  104
1ctf (E, 497)      13   4
1dtk (E, 215)      1    1
1fc2 (E, 500)      32   3
1igd (E, 500)      159  6
1shf-A (E, 437)    2    2
2cro (E, 500)      1    1
2ovo (E, 347)      19   2
4pti (E, 343)      1    1
* A: 4state_reduced, B: fisa, C: fisa_casp3, D: lattice_ssfit, E: lmds.
Protein Interactions and SCE http://wishart.biology.ualberta.ca/moviemaker 39
Native & Decoy Structures of Protein Complexes. [Figures: $S_{sc}/k_B$ versus contact number for the native structure and decoys of the complexes 1spb and 1brc.]
X-ray & NMR Structures Protein in crystal Protein in solution 41
SCE vs. $R_g$ of X-ray and NMR Structures. 23 proteins with both X-ray and NMR structures. [Figure: average $\Delta S_{sc}$ (in units of $k_B$) versus average $\Delta R_g$.] $\Delta S_{sc} = S_{sc,\text{X-ray}} - S_{sc,\text{NMR}}$; $\Delta R_g = R_{g,\text{X-ray}} - R_{g,\text{NMR}}$.
SCE in Protein Design. [Figures: $\Delta S_{sc}$ versus contact number for the complexes 1bth and 2ptc.] $\Delta S_{sc} = S_{sc,\text{complex}} - S_{sc,\text{protein 1}} - S_{sc,\text{protein 2}}$.
Summary. Sampling method: a new MC method for protein structure prediction; global optimization and sampling; simple and effective. Characterizing ensemble conformations: SMC for estimating entropy and free energy; SCE is important for protein folding and structure modeling.
Acknowledgement
Prof. Jun Liu, Department of Statistics, Harvard University
Prof. Jie Liang, Bioengineering Department, University of Illinois at Chicago
Prof. Rong Chen, Department of Information and Decision Science, University of Illinois at Chicago
Prof. Sam Kou, Department of Statistics, Harvard University
NIH, NSF for financial support!
Future Work Extend FRESS to real protein simulation. Apply FRESS to other optimization and sampling problems. Apply side-chain modeling to protein structure prediction, protein interaction, and protein design. SMC for other statistical and computational problems. 46
10 Benchmark Sequences of Length 48
Seq. code  Sequence                                          Energy
s48a       HPHHPPHHHHPHHHPPHHPPHPHHHPHPHHPPHHPPPHPPPPPPPPHH  -32
s48b       HHHHPHHPHHHHHPPHPPHHPPHPPPPPPHPPHPPPHPPHHPPHHHPH  -34
s48c       PHPHHPHHHHHHPPHPHPPHPHHPHPHPPPHPPHHPPHHPPHPHPPHP  -34
s48d       PHPHHPPHPHHHPPHHPHHPPPHHHHHPPHPHHPHPHPPPPHPPHPHP  -33
s48e       PPHPPPHPHHHHPPHHHHPHHPHHHPPHPHPHPPHPPPPPPHHPHHPH  -32
s48f       HHHPPPHHPHPHHPHHPHHPHPPPPPPPHPHPPHPPPHPPHHHHHHPH  -32
s48g       PHPPPPHPHHHPHPHHHHPHHPHHPPPHPHPPPHHHPPHHPPHHPPPH  -32
s48h       PHHPHHHPHHHHPPHHHPPPPPPHPHHPPHHPHPPPHHPHPHPHHPPP  -31
s48i       PHPHPPPPHPHPHPPHPHHHHHHPPHHHPHPPHPHHPPHPHHHPPPPH  -34
s48j       PHHPPPPPPHHPPPHHHPHPPHPHHPPHPPHPPHHPPHHHHHHHPPHH  -33
Yue, K., et al. Proc. Natl. Acad. Sci. U.S.A. 92:325, 1995.
Comparison on Benchmark Sequences
Method   # Seq  Avg. Time*
FRESS    10     0.77
LM       2      3.54
npermh   10     4.28
ACO      10     296
CG       9      73.6
MA       8      260
* CPU time in minutes. CPUs: FRESS 1.4 GHz, npermh 1.84 GHz, ACO 2.4 GHz.
The Key Ingredients of FRESS
Method       # Seq  Avg. Time
Optimal      10     0.77
NE           3      1.37
L max = 4    5      1.94
L = 12       8      1.42
Optimal condition: starting temperature $T_h = 3.5$, minimum temperature $T_l = 0.1$, temperature decreases geometrically by a factor of 0.98; $L_{min} = 2$, $L_{max} = 12$, with $5 \times 10^4$ moves at each temperature.
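The geometric cooling schedule described in the optimal condition can be generated as follows. This is a sketch under the stated parameters ($T_h = 3.5$, $T_l = 0.1$, factor 0.98); the function name and the choice to stop before the temperature drops below $T_l$ are assumptions.

```python
def annealing_schedule(t_high=3.5, t_low=0.1, factor=0.98):
    """Return the list of temperatures T_h, T_h * factor, ... that stay >= t_low."""
    temperatures = []
    t = t_high
    while t >= t_low:
        temperatures.append(t)
        t *= factor  # geometric decrease
    return temperatures

schedule = annealing_schedule()
moves_per_temperature = 5 * 10**4
total_moves = len(schedule) * moves_per_temperature
```

With a fixed number of moves per temperature, the total cost of a run scales with the number of temperature levels, so a factor closer to 1 gives slower cooling at a proportionally higher cost.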
The Strategy of FRESS 50
3D Sequences
3D seq.  CI    npermis       npermh       FRESS (2006)  E*
3D58     -42   -44 (0.19*)   -44 (1.10)   -44 (0.09)    -47
3D64     NA    -56 (0.45)    -56 (0.47)   -56 (0.53)    -59
3D67     NA    -56 (1.10)    -56 (0.33)   -56 (1.41)    -59
3D88     NA    -69 (NA)      -69 (0.45)   -72 (5.03)    -72
3D103    -49   -54 (3.12)    -55 (0.25)   -57 (4.47)    -59
3D124    -58   -71 (12.3)    -71 (1.19)   -75 (280)     -82
3D136    -65   -80 (110)     NA           -83 (350)     -85
Large Scale Study of SCE. [Figure a: $S_{sc}/k_B$ versus length, for α = 0.8 and α = 0.6. Figure b: $S_{sc,\text{buried}}/S_{sc}$ versus length, for α = 0.8 and α = 0.6.] J Zhang, JS Liu (2006) PLoS Comp Biol, 2(12): e168.