The Computational Core of Molecular Modeling (The What? Why? And How?)

CS Seminar. Feb 09. Alexey Onufriev, Dept. of Computer Science; Dept. of Physics and the GBCB program.

Thanks to:
Faculty co-authors: T.M. Murali, M. Prisant, L. Heath, C. Simmerling.
Student co-authors: J. Gordon, J. Myers, A. Fenley, J. Ruscio, R. Anandakrishnan, D. Kumar, M. Shukla, V. Sojia.
Sponsors: NIH, VT.
Support: System-X team, N. Polys (myoglobin graphics).

Will focus on the computational side of solving problems in the following areas:
a. Rational Drug Design.
b. From atomic motion to biological function.
c. The grand challenge of computational science: the Protein Folding Problem.

The emergence of in virtuo science: in vivo, in vitro, in virtuo.

Biological function = f(3D molecular structure).
DNA sequence (A, T, G, C) -> protein structure -> biological function.
How can we predict, understand, or modify function if we know the structure? Can we predict the structure?
Key challenges: biomolecular structures are complex (e.g. compared to crystalline solids). Biology works on many time scales. Experiments can only go so far.
A solution: computational methods. Model the movements of individual atoms.

The paradigm: "All things are made of atoms, and everything that living things do can be understood in terms of the wigglings and jigglings of atoms." (R. Feynman) This suggests the approach: model what nature does, i.e. let the molecule evolve with time according to the underlying laws of physics.

A protein on a surface. Atomic resolution

Theme 1. Rational (structure-based) design of new medicines. Picture: design of an HIV protease inhibitor. Hornak, V., Okur, A., Rizzo, R. and Simmerling, C., HIV-1 protease flaps spontaneously open and reclose in molecular dynamics simulations, Proc. Natl. Acad. Sci. USA, 103:915-920 (2006).

Example: rational drug design. If you block the enzyme's function, you kill the virus. E.g. the viral protease (chops up proteins), blocked by a drug agent.

Example of successful computer-aided (rational) drug design: one of the drugs that helped slow down the AIDS epidemic (part of the antiretroviral cocktail). The drug blocks the function of a key viral protein. To design the drug, one needs a precise 3D structure of that protein.

A computational challenge: high-quality, complete protein structures are needed, but the experiment (X-ray) does not "see" hydrogen atoms. These have to be placed computationally.

Combinatorial explosion problem: a molecule with N sites, each of which can be occupied by an H+ (or be empty), has $2^N$ possible states. [Figure: sites X_1, X_2, X_3, with the interaction W_23 between sites 2 and 3.] All of the possible charge (protonation) states must be taken into account. For a typical protein with 100 ionizable amino acids this means $2^{100} \sim 10^{30}$ possible variants to consider! And we need to find the minimum-energy state among them. (For the experts: what we really need is, of course, the partition function Z, see below, which is an even more computationally demanding job.)

Free energy of protonation state k (out of $2^N$):

$$\Delta G_k(\mathrm{pH}) = \sum_{i=1}^{N} x_i^k \left( k_B T \ln 10 \cdot \mathrm{pH} - \Delta E_i^{\mathrm{calc}} \right) + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} x_i^k x_j^k W_{ij}$$

Here $x_i^k$ = 1 or 0 (site i occupied or empty in state k), and $W_{ij}$ is the matrix of site-site interactions. The double sum over i and j is where the trouble comes from.

$$Z = \sum_{\text{all states } k} \exp\left(-\Delta G_k / k_B T\right)$$
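
A minimal sketch of the brute-force approach for a small N, to make the combinatorics concrete. The names `state_free_energy`, `enumerate_states` and `site_terms` are hypothetical, the numbers are made up, and each `site_terms[i]` is assumed to hold the precomputed single-site factor (kT ln10 pH - ΔE_i calc) in units of kT. This is an illustration only, not the method used by any particular program.

```python
import itertools
import math


def state_free_energy(x, site_terms, W):
    """Free energy of one protonation state x, a tuple of 0/1 occupancies.

    site_terms[i] stands for the single-site factor (kT ln10 pH - dE_i_calc);
    W[i][j] is the site-site interaction matrix (diagonal assumed to be zero).
    """
    n = len(x)
    single = sum(x[i] * site_terms[i] for i in range(n))
    pair = 0.5 * sum(x[i] * x[j] * W[i][j] for i in range(n) for j in range(n))
    return single + pair


def enumerate_states(site_terms, W, kT=1.0):
    """Brute force over all 2**N states: returns (lowest-energy state, Z)."""
    n = len(site_terms)
    energies = {
        x: state_free_energy(x, site_terms, W)
        for x in itertools.product((0, 1), repeat=n)
    }
    Z = sum(math.exp(-g / kT) for g in energies.values())
    best = min(energies, key=energies.get)
    return best, Z


# Hypothetical 3-site example, all energies in units of kT.
site_terms = [-1.0, 1.0, -2.0]
W = [[0.0, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
best_state, Z = enumerate_states(site_terms, W)
print(best_state, Z)
```

The loop visits every one of the 2**N states, so it is only usable for small N; at N = 100 it is hopeless, which is the point of the slide.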

Beating the combinatorics: the clustering approach. Example: a protein with N = 6 ionizable groups. Total number of protonation states = $2^N$ = 64. Every site interacts with every other site: a complete graph. Solution: neglect the "weak" edges and cluster the strong ones (here, cluster 1 with N = 3 and cluster 2 with N = 3). After clustering: total number of states = $2^3 + 2^3$ = 16 < 64.
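
One way to sketch the clustering idea in code: drop every edge whose interaction is below a cutoff and treat the connected components of what remains as clusters. The cutoff value, the function names and the use of connected components are assumptions made for illustration; the slide does not specify how the clusters are actually built.

```python
def cluster_sites(W, threshold):
    """Group sites into clusters: connected components of the graph that
    keeps only the 'strong' edges, |W[i][j]| > threshold."""
    n = len(W)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and abs(W[i][j]) > threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters


def count_states(clusters):
    """States to enumerate when each cluster is treated independently."""
    return sum(2 ** len(c) for c in clusters)


# Hypothetical 6-site example: strong coupling inside {0,1,2} and {3,4,5},
# weak coupling between the two groups.
strong, weak = 2.0, 0.01
W = [[0.0 if i == j else (strong if i // 3 == j // 3 else weak)
      for j in range(6)] for i in range(6)]
clusters = cluster_sites(W, threshold=0.1)
print(clusters, count_states(clusters), "instead of", 2 ** 6)
```

For this toy matrix the sketch finds two clusters of three sites each, so 16 states are enumerated instead of 64.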

http://biophysics.cs.vt.edu/h++ A web server that adds hydrogens to molecular structures. Launched by Onufriev's group in June 2005. ~1000 registered users since.

Intrigued? Suggested reading: The Many Roles of Computation in Drug Discovery, W. Jorgensen, Science 303, 1813 (2004) + references therein.

THEME II. The protein folding challenge. Nature does it all the time. Can we? Amino-acid sequence (translated genetic code): MET ALA ALA ASP GLU GLU--. How? Experiment: the amino-acid sequence uniquely determines a protein's 3D shape (ground state). Why bother: a protein's shape determines its biological function.

Protein Structure in 3 steps. Step 1: two amino acids together (a dipeptide). [Figure labels: amino acid #1, peptide bond, amino acid #2.]

Protein Structure in 3 steps. Step 2: Most flexible degrees of freedom:

A protein is simply a chain of amino acids: φ1, φ2, φ3, φ4, ... Each configuration {φ1, φ2, ..., φk} has some energy. The folded (biologically functional) protein has the lowest possible energy: the global minimum. So just find this conformation with some kind of minimization algorithm. What's the big deal?

The magnitude of the protein folding challenge: an enormous number of possible conformations of the polypeptide chain (φ1, φ2, φ3, φ4, ...). A typical protein is a chain of ~100 amino acids. Assume that each amino acid can take up only 10 conformations (a vast underestimate). Total number of possible conformations: $10^{100}$. Suppose each energy estimate is just 1 floating-point operation. Suppose you have a petaflop ($10^{15}$ operations per second) supercomputer. An exhaustive search for the global minimum would take $10^{85}$ seconds ~ $3 \times 10^{77}$ years. Age of the Universe ~ $1.4 \times 10^{10}$ years.
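
The estimate can be checked with a few lines of arithmetic. The sketch below uses only the assumptions stated on the slide (10 conformations per residue, 100 residues, one operation per energy estimate) and takes a petaflop machine to mean 10^15 operations per second.

```python
conformations = 10 ** 100            # 10 conformations per residue, 100 residues
ops_per_second = 10 ** 15            # petaflop machine, 1 operation per estimate
seconds = conformations // ops_per_second

seconds_per_year = 365.25 * 24 * 3600
years = seconds / seconds_per_year

print(f"{seconds:.0e} seconds, about {years:.1e} years")
```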

[Figure: free energy vs. folding coordinate, with starting points 1, 2 and 3. Adapted from Ken Dill's web site at UCSF.] Finding a global minimum in a multidimensional case is easy only when the landscape is smooth. No matter where you start (1, 2 or 3), you quickly end up at the bottom: the native (N), functional state of the protein.

Adapted from Ken Dill's web site at UCSF. Realistic landscapes are much more complex, with multiple local minima (folding traps). Proteins trapped in those minima may lead to disease, such as Alzheimer's.

Adapted from Dobson, Nature 426, 884 (2003).

Since minimization won't work, choose an alternative. Do what Nature does: just let it fold on its own, at normal temperature. Method: Molecular Dynamics.

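Principles of Molecular Dynamics (MD): each atom moves by Newton's 2nd law, $F = ma$, where the force comes from the system's energy, $F = -dE/dr$. [Figure: a bond drawn as a spring, with displacement x.] The system's energy:

$$E = K r^2 \ (\text{bond stretching}) + \frac{A}{r^{12}} - \frac{B}{r^{6}} \ (\text{VDW interaction}) + \frac{Q_1 Q_2}{r} \ (\text{electrostatic forces}) + \dots$$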
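
A small sketch of how the force follows from the energy for a single pair of atoms. The function names and parameter values are hypothetical, units are arbitrary (the Coulomb constant is set to 1), and only the non-bonded VDW and electrostatic terms are shown. The analytic force is checked against a numerical derivative of the energy.

```python
def pair_energy(r, A, B, q1, q2):
    """Non-bonded energy of two atoms at distance r: VDW plus electrostatics."""
    return A / r**12 - B / r**6 + q1 * q2 / r


def pair_force(r, A, B, q1, q2):
    """Radial force F = -dE/dr (positive values push the atoms apart)."""
    return 12 * A / r**13 - 6 * B / r**7 + q1 * q2 / r**2


def numerical_force(r, A, B, q1, q2, h=1e-6):
    """Central finite-difference estimate of -dE/dr, used as a cross-check."""
    return -(pair_energy(r + h, A, B, q1, q2)
             - pair_energy(r - h, A, B, q1, q2)) / (2 * h)


# Hypothetical parameters in arbitrary units.
params = dict(A=1.0, B=2.0, q1=0.5, q2=-0.5)
r = 1.3
print(pair_force(r, **params), numerical_force(r, **params))
```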

Now we have the positions of all atoms as a function of time. We can compute statistical averages and fluctuations, and analyze side-chain movements, cavity dynamics, domain motion, etc.

Molecular Dynamics: PRINCIPLE: given the position x(t) of each atom at time t, its position at the next time step t + Δt is given by $x(t + \Delta t) \approx x(t) + v(t)\,\Delta t + \frac{1}{2}\frac{F}{m}(\Delta t)^2$, where F is the force. Key parameter: the integration time step Δt. It controls the accuracy and speed of numerical integration routines. A smaller Δt is more accurate, but more steps are needed. How many are needed to simulate biology? How many can one afford?
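
A minimal sketch of such an integrator for one particle in one dimension. The position update is the formula from the slide; the velocity update (the velocity Verlet scheme) is an assumption added here so that the loop can be run, since the slide does not say how velocities are advanced. The harmonic "bond" force and all parameter values are made up for illustration.

```python
import math


def velocity_verlet(x, v, force, m, dt, n_steps):
    """Advance one particle by n_steps of size dt; returns final (x, v)."""
    f = force(x)
    for _ in range(n_steps):
        # Position update, as on the slide.
        x = x + v * dt + 0.5 * (f / m) * dt**2
        # Velocity update with the average of old and new forces (assumption).
        f_new = force(x)
        v = v + 0.5 * (f + f_new) / m * dt
        f = f_new
    return x, v


# Hypothetical test case: unit mass on a unit-stiffness spring, F = -x.
dt, n_steps = 0.01, 628
x_end, v_end = velocity_verlet(1.0, 0.0, lambda x: -x, m=1.0, dt=dt, n_steps=n_steps)
print(x_end, math.cos(dt * n_steps))   # numerical vs. exact position
```

Rerunning with a larger dt shows the trade-off named on the slide: fewer steps, but the numerical position drifts further from the exact one.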

As a result, we cannot quite get into the biological time scales. [Figure: axis of characteristic time scales in seconds, from $10^{-14}$ (H-C bond vibration) through $10^{-6}$ (protein folding) to $10^{0}$, marking the time step Δt, the currently accessible times, and biology.] For stability, Δt must be at least an order of magnitude smaller than the period of the fastest motion, i.e. Δt ~ $10^{-15}$ s. Example: to simulate folding of the fastest-folding protein, at least $10^{-6}/10^{-15} = 10^{9}$ steps will be needed.

The bottleneck of the methodology: computation of long-range interactions. Electrostatic interactions fall off as the inverse distance between atoms. Too strong to neglect. Need to account for all of them. Very expensive: up to 99% of the total cost for a protein.

Massively parallel machines help. Virginia Tech's supercomputer, System-X.

The worst problem for parallel computations: the force acting on each atom depends upon the positions of every other atom in the system. Computed coordinates have to be communicated between all processors at each step. [Figure: processors #1 to #4, e.g. processor #1 holding X_1 and F(on X_1), processor #2 holding X_2 and F(on X_2).]

Without approximations, computation of long-range electrostatic forces will cost you $O(N^2)$, where the number of atoms N may be as large as N ~ $10^6$. Too expensive for large systems. Every atom interacts with every other atom: a complete graph. A solution: combine charges (vertices) into groups, that is, use coarse-graining. [Figure: a group of charges replaced by a single +3 charge.] After coarse-graining, costs can be as low as $N \log N$. This works because macromolecules are naturally partitioned into hierarchical levels: atoms -> amino-acids -> proteins -> complexes...
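
A toy sketch of the coarse-graining idea for two well-separated groups of charges: the exact interaction needs one term per pair of atoms, while the coarse-grained version replaces each group by a single point charge. Placing that charge at the |q|-weighted center is an assumption of this sketch, as are the function names and the made-up coordinates. A single level of grouping is shown; the N log N cost quoted on the slide requires a hierarchy of such groups.

```python
import math


def direct_energy(group_a, group_b):
    """Exact pairwise Coulomb energy: len(group_a) * len(group_b) terms."""
    return sum(qa * qb / math.dist(ra, rb)
               for ra, qa in group_a for rb, qb in group_b)


def lump(group):
    """Replace a group by one point charge (total charge at |q|-weighted center)."""
    total = sum(q for _, q in group)
    weight = sum(abs(q) for _, q in group)
    center = tuple(sum(abs(q) * r[k] for r, q in group) / weight for k in range(3))
    return center, total


def coarse_energy(group_a, group_b):
    """Coarse-grained estimate: a single term for the whole pair of groups."""
    (ca, qa), (cb, qb) = lump(group_a), lump(group_b)
    return qa * qb / math.dist(ca, cb)


# Hypothetical example: two groups of three unit charges, 50 length units apart.
group_a = [((0.0, 0.0, 0.0), 1.0), ((0.5, 0.0, 0.0), 1.0), ((0.0, 0.5, 0.0), 1.0)]
group_b = [((x + 50.0, y, z), q) for (x, y, z), q in group_a]
print(direct_energy(group_a, group_b), coarse_energy(group_a, group_b))
```

The approximation is good only when the groups are far apart compared to their size, which is why nearby atoms still have to be treated pair by pair.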

Simulated refolding pathway of the 46-residue protein. Molecular dynamics based on AMBER-7. Movie available at: www.scripps.edu/~onufriev/research/in_virtuo.html NB: due to the absence of viscosity, folding occurs on a much shorter time scale than in an experiment.

Intrigued? Suggested articles: 1. Protein Folding and Misfolding, C. Dobson, Nature 426, 884 (2003). 2. Design of a Novel Globular Protein Fold with Atomic-level Accuracy, Kuhlman et al., Science, 302, 1364, (2003) + references therein.

Theme III: how does oxygen get inside the protein myoglobin? Trivia: it is myoglobin (with oxygen bound to it) that gives fresh meat its red color. The protein stores oxygen.

Myoglobin: the "hydrogen atom" of structural biology. Exactly what are the migration pathways of small non-polar ligands (O2, CO, NO) inside the protein? A 50-year-old question.

The heme group: the heart of myoglobin. [Figure label: O2.]

Myoglobin: the protein responsible for oxygen storage in all breathing organisms. How does oxygen, or another small non-polar ligand, get in to bind to the Fe atom in its core? [Figure labels: O2, Fe.]

After 50 years of research many questions still remain. Example: one or many pathways? Single dominant pathway (HIS gate), e.g. Olson et al. Multiple pathways, e.g. Elber and Karplus.

Two complementary approaches:
#1 Simulate explicit ligand (CO) trajectories via microseconds of room-temperature, explicit-solvent MD.
#2 Compute dynamic volume fluctuations in the protein.

Approach #1: Straightforward MD (explicit solvent, PME, NPT, 300 K, 7 µs total).
1. Simulation type #1. Model photodetachment of the carbon monoxide (CO) ligand. 20 trajectories, 90 ns each. The ligand starts at the docking site (Fe) and sometimes comes out.
2. Simulation type #2. Model diffusion of the ligand from the outside. The ligand starts in the solvent and sometimes diffuses all the way to the docking site inside. 48 trajectories, each 90 ns long.

Raw result #1: a sample of simulated trajectories that show ligand migration paths (multiple individual trajectories combined in one slide). [Figure panels: photodetachment experiment, CO initially at the binding site; CO initially in the water.]

Summary of the MD trajectories: Pathway #1 Pathway #2 Concern: Have we explored ALL the pathways? Challenge: can we get the same pathways cheaper?

Approach #2. PATHFINDER: dynamic free volume analysis. Run a separate, much shorter (15 ns) MD in which the ligand does not move. In each snapshot, identify cavities large enough to hold the ligand (a purely steric criterion). Combine the cavities over the entire trajectory and determine connectivity. [Figure: timeline of snapshots combined into a pathway.]
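
The three steps above can be sketched on a grid. This is not the PATHFINDER code: the grid representation, the hard-sphere test, the breadth-first connectivity search and every name below are assumptions chosen to keep the illustration short. Atoms are given as (center, radius) pairs, and a grid cell counts as free if a probe sphere of the ligand's radius fits there without touching any atom.

```python
import itertools
import math
from collections import deque


def free_cells(atoms, dims, spacing, probe_radius):
    """Cells of one snapshot where the probe fits (purely steric test)."""
    free = set()
    for cell in itertools.product(*(range(n) for n in dims)):
        point = tuple(c * spacing for c in cell)
        if all(math.dist(point, center) >= radius + probe_radius
               for center, radius in atoms):
            free.add(cell)
    return free


def combine_snapshots(snapshots, dims, spacing, probe_radius):
    """Union of the free cells over all snapshots of the trajectory."""
    combined = set()
    for atoms in snapshots:
        combined |= free_cells(atoms, dims, spacing, probe_radius)
    return combined


def connected(cells, start, goal):
    """Breadth-first search through face-adjacent free cells."""
    if start not in cells or goal not in cells:
        return False
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    queue, seen = deque([start]), {start}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            return True
        for step in steps:
            nxt = tuple(c + s for c, s in zip(cell, step))
            if nxt in cells and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical toy channel of 5 cells; one atom blocks a different cell
# in each of the two snapshots.
dims, spacing, probe = (5, 1, 1), 1.0, 0.5
snapshot_a = [((1.0, 0.0, 0.0), 0.3)]
snapshot_b = [((3.0, 0.0, 0.0), 0.3)]
start, goal = (0, 0, 0), (4, 0, 0)
combined = combine_snapshots([snapshot_a, snapshot_b], dims, spacing, probe)
print(connected(free_cells(snapshot_a, dims, spacing, probe), start, goal),
      connected(combined, start, goal))
```

In the toy channel neither snapshot alone connects the two ends, but the cavities combined over both snapshots do, which is the sense in which the free volume is "dynamic".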

The challenge: the protein shape is very complex, and the internal cavities hiding within it also have a very complex shape. Connectivity to the outside is a non-trivial issue. In this case, it is critical to have a SIMPLE and robust algorithm, so as to have just ONE complexity to deal with, not a product of the two: (problem complexity) x (algorithm complexity).

Free volume analysis (PATHFINDER)

Pathways from volume fluctuations (PATHFINDER):

Combining the two approaches: explicit trajectories (MD) and free volume fluctuations. J. Ruscio et al., Atomic level computational identification of ligand migration pathways between solvent and binding site in myoglobin, Proc. Natl. Acad. Sci. USA, 105, 9204 (2008).

The end result: oxygen diffusion pathways in myoglobin: The pathways (blue) occur inside the rigid scaffold structures (helices, white).

The law of (bio/phys/chem) computational science. [Figure: speed vs. accuracy plot, with regions labeled easy, hard and genius.]

Lunch time