Structural biomathematics: an overview of molecular simulations and protein structure prediction

Similar documents
Molecular Modelling. part of Bioinformatik von RNA- und Proteinstrukturen. Sonja Prohaska. Leipzig, SS Computational EvoDevo University Leipzig

CAP 5510 Lecture 3 Protein Structures

Quiz 2 Morphology of Complex Materials

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Statistical Physics of The Symmetric Group. Mobolaji Williams Harvard Physics Oral Qualifying Exam Dec. 12, 2016

UC Berkeley. Chem 130A. Spring nd Exam. March 10, 2004 Instructor: John Kuriyan

Kyle Reing University of Southern California April 18, 2018

STRUCTURAL BIOINFORMATICS I. Fall 2015

Short Announcements. 1 st Quiz today: 15 minutes. Homework 3: Due next Wednesday.

Protein Folding & Stability. Lecture 11: Margaret A. Daugherty. Fall How do we go from an unfolded polypeptide chain to a

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

What is the central dogma of biology?

COMP 598 Advanced Computational Biology Methods & Research. Introduction. Jérôme Waldispühl School of Computer Science McGill University

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years.

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

BME 5742 Biosystems Modeling and Control

Computational Biology 1

Rapid Introduction to Machine Learning/ Deep Learning

Quantitative Biology Lecture 3

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

THE TANGO ALGORITHM: SECONDARY STRUCTURE PROPENSITIES, STATISTICAL MECHANICS APPROXIMATION

Assignment 6: MCMC Simulation and Review Problems

Lecture 11: Protein Folding & Stability

Protein Folding & Stability. Lecture 11: Margaret A. Daugherty. Fall Protein Folding: What we know. Protein Folding

Basics of protein structure

The role of form in molecular biology. Conceptual parallelism with developmental and evolutionary biology

Data Mining in Bioinformatics HMM

Protein Secondary Structure Prediction

Contents. xiii. Preface v

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

ALL LECTURES IN SB Introduction

Lecture 18 Generalized Belief Propagation and Free Energy Approximations

Bi 8 Midterm Review. TAs: Sarah Cohen, Doo Young Lee, Erin Isaza, and Courtney Chen

Protein Folding experiments and theory

Protein Structure Analysis and Verification. Course S Basics for Biosystems of the Cell exercise work. Maija Nevala, BIO, 67485U 16.1.

MCB100A/Chem130 MidTerm Exam 2 April 4, 2013

From gene to protein. Premedical biology

The protein folding problem consists of two parts:

Protein Structures. 11/19/2002 Lecture 24 1

Protein Folding. I. Characteristics of proteins. C α

Introduction to Evolutionary Concepts

MCB100A/Chem130 MidTerm Exam 2 April 4, 2013

Molecular dynamics simulation. CS/CME/BioE/Biophys/BMI 279 Oct. 5 and 10, 2017 Ron Dror

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix)

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Biomolecules. Energetics in biology. Biomolecules inside the cell

BME Engineering Molecular Cell Biology. Structure and Dynamics of Cellular Molecules. Basics of Cell Biology Literature Reading

Algorithms in Bioinformatics

Computational Biology: Basics & Interesting Problems

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Principles of Physical Biochemistry

Why is Deep Learning so effective?

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Analysis and Prediction of Protein Structure (I)

Entropy and Free Energy in Biology

MAP Examples. Sargur Srihari

5 Mutual Information and Channel Capacity

Protein Structure. W. M. Grogan, Ph.D. OBJECTIVES

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Calculations of the Partition Function Zeros for the Ising Ferromagnet on 6 x 6 Square Lattice with Periodic Boundary Conditions

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

3. General properties of phase transitions and the Landau theory

Unit 1: Chemistry - Guided Notes

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Properties of amino acids in proteins

Dept. of Linguistics, Indiana University Fall 2015

Boolean models of gene regulatory networks. Matthew Macauley Math 4500: Mathematical Modeling Clemson University Spring 2016

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Bio nformatics. Lecture 23. Saad Mneimneh

Signal Processing - Lecture 7

Energetics and Thermodynamics

Biophysics II. Key points to be covered. Molecule and chemical bonding. Life: based on materials. Molecule and chemical bonding

Computational Systems Biology: Biology X

BIBC 100. Structural Biochemistry

Introduction solution NMR

Lecture 7: Two State Systems: From Ion Channels To Cooperative Binding

Many proteins spontaneously refold into native form in vitro with high fidelity and high speed.

Bioinformatics Chapter 1. Introduction

Protein Structure Basics

Translation Part 2 of Protein Synthesis

Introduction to" Protein Structure

Protein structure and folding

Motif Prediction in Amino Acid Interaction Networks

1.5 Sequence alignment

Protein Structure Determination

Laying down deep roots: Molecular models of plant hormone signaling towards a detailed understanding of plant biology

SECOND PUBLIC EXAMINATION. Honour School of Physics Part C: 4 Year Course. Honour School of Physics and Philosophy Part C C7: BIOLOGICAL PHYSICS

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Entropy and Free Energy in Biology

Proteins are not rigid structures: Protein dynamics, conformational variability, and thermodynamic stability

Introduction to Computational Structural Biology

Introduction to Information Theory

Solid-state NMR and proteins : basic concepts (a pictorial introduction) Barth van Rossum,

Molecular Modeling lecture 2

Information Content in Genetics:

QB LECTURE #4: Motif Finding

DNA/RNA Structure Prediction

BCMP 201 Protein biochemistry

Sugars, such as glucose or fructose are the basic building blocks of more complex carbohydrates. Which of the following

Transcription:

: an overview of molecular simulations and protein structure prediction

Figure: Parc de Recerca Biomèdica de Barcelona (PRBB).

Contents 1 A Glance at Structural Biology 2 3

1 A Glance at Structural Biology 2 3

All the biological information of the human body is encoded in our DNA. Human Genome Project: Sequentiation of the whole human genome completed on 2001, by Francis Collins (Public Project) & Craig Venter (Celera Genomics). About 3 billion base pairs (A, C, T and G). Estimation of 30000 genes (around 3000bp per gene). Less than 2% of the genome codes for proteins. Unknown function for over the half of the discovered genes!

All the biological information of the human body is encoded in our DNA. Human Genome Project: Sequentiation of the whole human genome completed on 2001, by Francis Collins (Public Project) & Craig Venter (Celera Genomics). About 3 billion base pairs (A, C, T and G). Estimation of 30000 genes (around 3000bp per gene). Less than 2% of the genome codes for proteins. Unknown function for over the half of the discovered genes!

All the biological information of the human body is encoded in our DNA. Human Genome Project: Sequentiation of the whole human genome completed on 2001, by Francis Collins (Public Project) & Craig Venter (Celera Genomics). About 3 billion base pairs (A, C, T and G). Estimation of 30000 genes (around 3000bp per gene). Less than 2% of the genome codes for proteins. Unknown function for over the half of the discovered genes!

All the biological information of the human body is encoded in our DNA. Human Genome Project: Sequentiation of the whole human genome completed on 2001, by Francis Collins (Public Project) & Craig Venter (Celera Genomics). About 3 billion base pairs (A, C, T and G). Estimation of 30000 genes (around 3000bp per gene). Less than 2% of the genome codes for proteins. Unknown function for over the half of the discovered genes!

DNA Transcription RNA Translation Protein 1 1 Table taken from Wikipedia.

Protein structure??? Protein function Gene function

2 Primary: Amino acid linear sequence. Secondary: α-helices and β-strands. Tertiary / Domains: Functionally independent part of the sequence. Quaternary: Multi-subunit complex of domains or proteins. 2 Figure taken from C.Branden & J.Tooze, Introduction to Protein Structure.

Main question: How can we find the structure of a given protein? Crystallography. Nuclear magnetic resonance spectroscopy. Molecular simulation. Prediction of structure (structural biology). NOT AN EASY TASK!

Main question: How can we find the structure of a given protein? Crystallography. Nuclear magnetic resonance spectroscopy. Molecular simulation. Prediction of structure (structural biology). NOT AN EASY TASK!

1 A Glance at Structural Biology 2 3

3 3 Both images were obtained using VMD software

Lysine Internal (mechanical) energy of the system And these are not the only forces and energies implied in a molecular simulation!

PLC-β2 simulation This simulation lasts around 20ns, with timesteps of 4fs 4, using the ACEMD software with the AMBER forcefield. The simulation has been visualized using VMD software. The protein has 708 amino acids, for a total of around 150000 atoms in the simulation (counting water and lipid molecules). In the simulation can be observed the folding of the X/Y linker in order to cover the hydrophobic active site of the protein. 4 1ns = 10 9 seconds, 1fs = 10 15 seconds

Afinsen s Dogma The native structure of a protein is unique and is determined only by it s amino acid sequence. The folding to its native state is almost spontaneous. Levinthal s Paradox Due to the huge number of degrees of freedom in an unfolded protein, the number of possible conformations is astronomically large. Then... how can proteins fold? Partially folded transition states. Funnel-like energy landscapes....?

Afinsen s Dogma The native structure of a protein is unique and is determined only by it s amino acid sequence. The folding to its native state is almost spontaneous. Levinthal s Paradox Due to the huge number of degrees of freedom in an unfolded protein, the number of possible conformations is astronomically large. Then... how can proteins fold? Partially folded transition states. Funnel-like energy landscapes....?

1 A Glance at Structural Biology 2 3

Let X, Y be two (discrete) random variables. The (self-)information of X is I(X) = log(p(x)). The entropy of X is the measure of uncertainty associated with X: S(X) = E(I(X)). The mutual information of X and Y (also called Kullback-Leibler divergence) is MI(X; Y ) = ( ) P(x, y) P(x, y)log P(x)P(y) x X y Y Maximum Entropy Principle Given a proposition that expresses testable information, the probability distribution that best represents the current state of knowledge is the one with largest entropy.

Figure: Multiple Sequence Alignment (MSA) for aathep1.

From the previous MSA let s define: f i (A) = frequency of apparitions of aa A in the position i of the MSA f i,j (A, B) = frequency of simultaneous apparitions of aa A and B in respective positions i and j of the MSA MI i,j = A,B ( ) fi,j (A, B) f i,j (A, B)ln f i (A)f j (B) Be careful! By definition, this mutual information of these frequencies is local in the amino acid chain, thus is noised by transitivity of correlations.

From the previous MSA let s define: f i (A) = frequency of apparitions of aa A in the position i of the MSA f i,j (A, B) = frequency of simultaneous apparitions of aa A and B in respective positions i and j of the MSA MI i,j = A,B ( ) fi,j (A, B) f i,j (A, B)ln f i (A)f j (B) Be careful! By definition, this mutual information of these frequencies is local in the amino acid chain, thus is noised by transitivity of correlations.

We want P(A 1,..., A L ) a general model for the probability of a particular amino acid sequence A 1... A L to be member of the iso-structural family under consideration, and such that P i (A) f i (A), P i,j (A, B) f i,j (A, B), where P i (A) = P(A 1,..., A L ), P i,j (A, B) := A k A P(A 1,..., A L ). A k A,B Many distributions satisfying this: Maximum Entropy Principle!!!!

We want P(A 1,..., A L ) a general model for the probability of a particular amino acid sequence A 1... A L to be member of the iso-structural family under consideration, and such that P i (A) f i (A), P i,j (A, B) f i,j (A, B), where P i (A) = P(A 1,..., A L ), P i,j (A, B) := A k A P(A 1,..., A L ). A k A,B Many distributions satisfying this: Maximum Entropy Principle!!!!

Optimization problem: maximize S = subject to A i i=1,...,l P(A 1,..., A L )lnp(a 1,..., P L ) P i,j (A, B) = f i,j (A, B) P i (A) = f i (A) Solution: disordered Q-state Potts model P(A 1,..., A L ) = 1 Z exp e i,j (A i, A j ) + h i (A i ) where: 1 i<j L 1 i L the parameters e i,j (A i, A j ), h i (A i ) are the Lagrange multipliers of the system, Z is the normalization constant (partition function).

Optimization problem: maximize S = subject to A i i=1,...,l P(A 1,..., A L )lnp(a 1,..., P L ) P i,j (A, B) = f i,j (A, B) P i (A) = f i (A) Solution: disordered Q-state Potts model P(A 1,..., A L ) = 1 Z exp e i,j (A i, A j ) + h i (A i ) where: 1 i<j L 1 i L the parameters e i,j (A i, A j ), h i (A i ) are the Lagrange multipliers of the system, Z is the normalization constant (partition function).

Geometrically, this probability distribution is given by the Boltzmann-Gibbs distribution: P(A 1,..., A L ) = 1 Z e H(A 1,...,A L ) Formally, the marginals of this distribution are obtained from lnz h i (A) 2 lnz h i (A) h j (B) = P i (A) = P i,j (A, B) + P i (A)P j (B) but the direct computation is computationally prohibitive.

Geometrically, this probability distribution is given by the Boltzmann-Gibbs distribution: P(A 1,..., A L ) = 1 Z e H(A 1,...,A L ) Formally, the marginals of this distribution are obtained from lnz h i (A) 2 lnz h i (A) h j (B) = P i (A) = P i,j (A, B) + P i (A)P j (B) but the direct computation is computationally prohibitive.

The Lagrange multipliers can be obtained using Mean Field Aproximation technique 5 : Introduce a new parameter α in the partition function (via the disturbed Hamiltonian): H(α) = exp α e i,j (A i, A j ) + h i (A i ) i=1,...,l 1 i<j L 1 i L Consider the Legendre transform of the Gibbs free energy F = lnz(α) (Gibbs potential): G(α) = lnz(α) h i (A)P i (A). i=1,...,l A 5 Plefka, T., Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model (1982)

The Lagrange multipliers can be obtained using Mean Field Aproximation technique 5 : Introduce a new parameter α in the partition function (via the disturbed Hamiltonian): H(α) = exp α e i,j (A i, A j ) + h i (A i ) i=1,...,l 1 i<j L 1 i L Consider the Legendre transform of the Gibbs free energy F = lnz(α) (Gibbs potential): G(α) = lnz(α) h i (A)P i (A). i=1,...,l A 5 Plefka, T., Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model (1982)

Considering the empirical connected correlation matrix: C i,j (A, B) = f i,j (A, B) f i (A)f j (B). As a consecuence of the functional form of the Legendre transform h i (A) = G(α) P i (A) (C 1 ) i, (A, B) = h i(a) P j (B) = 2 G(α) P i (A) P j (B) Expand the Gibbs potential up to first order Taylor expansion around α = 0: G(α) G(0) + α G(α) α α=0

A computation over the two terms of the Taylor expansion of G leads us to an expression which is easily derivable. First and second derivatives with respect the marginal distributions P i (A) provide self-consistent equations for the local fields, from which we obtain (C 1 ) i,j (A, B) α=0 = e i,j (A, B), for i j. Finally, the parameters h i can be estimated imposing empirical single-site frequency counts as marginal distributions and considering gauge conditions: f i (A) = P i,j (A, B). B

A computation over the two terms of the Taylor expansion of G leads us to an expression which is easily derivable. First and second derivatives with respect the marginal distributions P i (A) provide self-consistent equations for the local fields, from which we obtain (C 1 ) i,j (A, B) α=0 = e i,j (A, B), for i j. Finally, the parameters h i can be estimated imposing empirical single-site frequency counts as marginal distributions and considering gauge conditions: f i (A) = P i,j (A, B). B

This leads us to the effective pair probabilities Pi,j Dir (A, B) = 1 { exp e i,j (A, B) + Z h i (A) + h } j (B). i,j From which we can define its Kullback-Leibler divergence, that will be called Direct Information: DI i,j := ( ) P Dir Pi,j Dir i,j (A, B) (A, B)ln. f i (A)f j (B) A,B

For each pair of positions in the sequence, we "know" if they are spatially close. And now... what? Depending on the previous, not an unique folding for the protein is possible. We must remove knotted structures (Alexander polynomial or Heegaard Floer homology). A scoring method over the resulting foldings must be defined, in order to decide which one of the structures is better. A short simulation of the system may be run in order to optimize the energies of the folding.

For each pair of positions in the sequence, we "know" if they are spatially close. And now... what? Depending on the previous, not an unique folding for the protein is possible. We must remove knotted structures (Alexander polynomial or Heegaard Floer homology). A scoring method over the resulting foldings must be defined, in order to decide which one of the structures is better. A short simulation of the system may be run in order to optimize the energies of the folding.

For each pair of positions in the sequence, we "know" if they are spatially close. And now... what? Depending on the previous, not an unique folding for the protein is possible. We must remove knotted structures (Alexander polynomial or Heegaard Floer homology). A scoring method over the resulting foldings must be defined, in order to decide which one of the structures is better. A short simulation of the system may be run in order to optimize the energies of the folding.

For each pair of positions in the sequence, we "know" if they are spatially close. And now... what? Depending on the previous, not an unique folding for the protein is possible. We must remove knotted structures (Alexander polynomial or Heegaard Floer homology). A scoring method over the resulting foldings must be defined, in order to decide which one of the structures is better. A short simulation of the system may be run in order to optimize the energies of the folding.

Figure: Chewbacca mounted on a squirrel wants to thank you for your assistance!