Scoring functions. Talk Overview. Eran Eyal. Scoring functions what and why

Talk overview
- Scoring functions: what and why
- Force fields: based on approximation of molecular forces as we understand them
- Knowledge-based potentials: let the data speak

May 2011, Eran Eyal

Atoms are composed of smaller particles. Negatively charged electrons are distributed with some probability function around the nucleus.

To perform accurate calculations of the forces acting between atoms and molecules, sophisticated quantum-mechanical (QM) calculations are needed to calculate probabilities and the quantum energy states. QM calculations require CPU time that grows exponentially with the number of atoms. Today's hardware can handle calculations with up to a few hundred atoms, so QM is feasible only for small molecules or for restricted regions of macromolecules.

To perform calculations on larger molecules we need a simpler representation that approximates the detailed physical description of the system. In molecular mechanics we treat an atom as a ball with a defined radius. This radius doesn't represent a clear physical border. Instead, it represents the space where electrons are distributed most of the time.

Atom radii

Element                    H     C     N     O     F     P     S     Cl
van der Waals radius / Å   1.20  1.70  1.55  1.52  1.47  1.80  1.80  1.89

Force fields

A force field is the name given to an expression that empirically accounts for all forces acting on molecules. Molecular mechanics tools make use of force fields to evaluate forces and energies in molecular systems. Force fields are used predominantly to evaluate conformations during modeling procedures and to rank different solutions. There are various force fields, which differ in the energy components they consider and in the parameterization of equivalent terms.

Components of force fields can be divided into two types:
- Intramolecular
- Intermolecular

Intramolecular terms include potentials resulting from deviation from canonical geometry between atoms separated by up to 3 covalent bonds. Intermolecular terms include potentials resulting from all physical forces acting between atoms that are not covalently bonded but form electrostatic interactions in space.

The most popular force fields are CHARMM (implemented within the CHARMM package and the CHARMm commercial software), AMBER (implemented within the AMBER package) and GROMOS (implemented within GROMACS).

Intra-molecular potential

Intra-molecular potential =
  potential resulting from bond stretching/shortening
  + potential resulting from angle distortion
  + potential resulting from conformational distortion

Deviation from optimal bond length
The most common way to account for deviation from the optimal bond length is a parabolic function.

Deviation from optimal angles
Deviations from optimal angle values (e.g. about 109.5° for a tetrahedral carbon) are penalized.
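The parabolic penalty mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the functional form of any specific force field: the function names are hypothetical, the parameter values are made up for the example, and some force fields write the term as (1/2)k(x - x0)^2 instead of k(x - x0)^2.

```python
def harmonic_energy(value, optimum, force_constant):
    """Parabolic penalty E = k * (x - x0)^2 for a bond length or an angle."""
    return force_constant * (value - optimum) ** 2

# Illustrative parameters only (assumed, not taken from a real force field).
K_BOND = 300.0    # kcal/(mol*A^2)
R0_BOND = 1.53    # A, assumed optimal C-C bond length
K_ANGLE = 0.02    # kcal/(mol*deg^2)
THETA0 = 109.5    # degrees, tetrahedral carbon

def bond_energy(r):
    """Energy penalty for a bond of length r (in A)."""
    return harmonic_energy(r, R0_BOND, K_BOND)

def angle_energy(theta_deg):
    """Energy penalty for a bond angle theta (in degrees)."""
    return harmonic_energy(theta_deg, THETA0, K_ANGLE)

# Stretching and compressing by the same amount cost the same energy.
print(bond_energy(1.63), bond_energy(1.43))
```

The penalty is zero at the optimal value and grows quadratically on both sides, which is a reasonable approximation only for small deviations.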

Deviation from optimal dihedral angles
The potential resulting from deviation from optimal dihedral angles is expressed using a periodic function such as the cosine.

Energy of non-bonded atoms
Composed of forces acting between atoms which are not covalently attached.
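For the dihedral term described above, a commonly used periodic form is E = K(1 + cos(n*phi - delta)). The sketch below assumes that form; the barrier height, periodicity and phase are illustrative values, not parameters from the talk.

```python
import math

def dihedral_energy(phi_deg, barrier=1.4, periodicity=3, phase_deg=0.0):
    """Periodic torsion term E = K * (1 + cos(n*phi - delta)).

    barrier (K), periodicity (n) and phase_deg (delta) are assumed,
    illustrative parameters.
    """
    phi = math.radians(phi_deg)
    phase = math.radians(phase_deg)
    return barrier * (1.0 + math.cos(periodicity * phi - phase))

# With n = 3 the energy repeats every 120 degrees.
for angle in (0, 60, 120, 180):
    print(angle, round(dihedral_energy(angle), 3))
```

With these parameters the minima fall at 60, 180 and 300 degrees (staggered conformations) and the maxima at 0, 120 and 240 degrees (eclipsed conformations).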

The main forces are van der Waals attraction forces, van der Waals repulsion forces, and electrostatic forces.

Van der Waals forces are meaningful only at short distances. The van der Waals attraction and repulsion forces are frequently represented by the 6-12 equation (Lennard-Jones equation). Van der Waals attraction forces are also known as London forces and act between any two atoms, including neutral atoms, although they too are electrostatic in nature.

Johannes Diderik van der Waals (1837-1923): Dutch scientist with a distinguished contribution to physics and thermodynamics, author of the van der Waals equation, and winner of the 1910 Nobel Prize in Physics.

Electrostatic forces
Act between charged bodies. Like charges repel each other, while opposite charges attract each other. The potential is determined by Coulomb's law.
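The two non-bonded terms named above can be sketched as follows. The Lennard-Jones term is written in the form epsilon * ((r_min/r)^12 - 2 (r_min/r)^6), one of several equivalent conventions, and the Coulomb term uses the usual conversion factor of about 332 kcal*A/(mol*e^2). Function names and parameter values are assumptions made for the example.

```python
COULOMB_CONSTANT = 332.06  # kcal*A/(mol*e^2), standard conversion factor

def lennard_jones(r, epsilon, r_min):
    """6-12 potential: repulsive r^-12 term plus attractive r^-6 term.

    epsilon is the well depth and r_min the distance of the minimum.
    """
    ratio = r_min / r
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)

def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb potential between point charges q1 and q2 (in units of e)."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)

# Illustrative values: the van der Waals term fades quickly with distance,
# while the electrostatic term decays only as 1/r.
for r in (3.0, 4.0, 6.0, 10.0):
    print(r, round(lennard_jones(r, 0.1, 4.0), 4), round(coulomb(r, 1.0, -1.0), 2))
```

The output illustrates why van der Waals forces matter only at short distances: at 10 Å the 6-12 term is close to zero while the Coulomb term is still substantial.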

Hydrogen bonds

Formed between small electronegative atoms.

Element                      H    C    N    O    F    P    S    Cl
Electronegativity (Pauling)  2.1  2.5  3.0  3.5  4.0  2.1  2.5  3.0

Hydrogen bonds are responsible for the special properties of water, including its relatively high melting temperature. Hydrogen bonds are much stronger than attractive van der Waals forces but weaker than covalent bonds. Hydrogen bonds have an essential role in shaping the structure of macromolecules.

A major determinant of protein stability is the ability to satisfy as many hydrogen bonds as possible. When the protein is not folded, all hydrogen bonds are satisfied with water molecule partners. To reach a folded state that is more stable, the protein must have the vast majority of its potential hydrogen bonds satisfied within the molecule.

[Figure: hydrogen bond network in immunoglobulin]

Salt bridges
A combination of two noncovalent interactions: hydrogen bonding and electrostatic interactions. The salt bridge most often arises from the anionic carboxylate (RCOO-) of either aspartic acid or glutamic acid and the cationic ammonium (RNH3+) of lysine or the guanidinium (RNHC(NH2)2+) of arginine. These are short-range interactions (< 4 Å).

Free energy is composed of energy and entropy

ΔG = ΔH - TΔS
(ΔG: free energy; ΔH: potential energy; ΔS: entropy)

Usually we compare two possible states of the system and therefore talk about ΔG. What matters is the difference between the two states and not the absolute energy values.

ΔG and ΔΔG
The ΔG between two given conformations determines the stability of the molecule with respect to these conformations. From it we can calculate the ratio between the numbers of molecules in the two conformations. Often we want to compare the stability of two different molecules. For this purpose we compare the ΔG of one molecule to that of the other. The difference between the two is designated ΔΔG.

The Protherm database holds thermodynamic data on thousands of mutations in hundreds of different proteins and is the most comprehensive collection of such data:
http://gibk26.bse.kyutech.ac.jp/jouhou/protherm/protherm.html
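As a worked example of the statement that ΔG gives the ratio between the numbers of molecules in two conformations, the sketch below uses the Boltzmann relation N_B / N_A = exp(-ΔG / RT) with ΔG = G_B - G_A. The function names and the sign convention for ΔΔG (mutant minus wild type) are assumptions chosen for the illustration.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)

def population_ratio(delta_g, temperature=298.15):
    """Ratio N_B / N_A for delta_g = G_B - G_A given in kcal/mol."""
    return math.exp(-delta_g / (GAS_CONSTANT * temperature))

def delta_delta_g(delta_g_mutant, delta_g_wild_type):
    """Difference in stability between two molecules (assumed convention)."""
    return delta_g_mutant - delta_g_wild_type

# A state lower in free energy by 1 kcal/mol is several times more populated.
print(round(population_ratio(-1.0), 2))
print(delta_delta_g(-4.0, -6.0))
```

Because the ratio depends exponentially on ΔG, differences of only a few kcal/mol already shift the population almost entirely to one conformation.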

Protherm data statistics (5/2011): http://gibk26.bio.kyutech.ac.jp/cgi-bin/jouhou/protherm/pp_stat.pl

[Figure: ΔG1 and ΔG2 of two molecules, with ΔΔG = ΔG1 - ΔG2]

The hydrophobic effect is mainly an entropic effect
There is no such thing as a hydrophobic force. The presence of many hydrophobic groups in a hydrophilic environment creates a large contact surface area between water molecules and these groups. In these regions the water molecules are relatively ordered, and this reduces the overall entropy. When the hydrophobic groups of the protein contact each other, fewer water molecules are constrained and the system has a larger overall entropy.

[Figure: energy (E) as a function of conformations, showing local minima and the global minimum]

Knowledge-based potentials
- The persistent need for an accurate scoring function for the evaluation of structures
- The problem of our insufficient understanding of the forces that drive molecular interactions
- The rapid increase in the size of biological databases

Residue level / atom level
Choosing a highly detailed system composed of many types of elements, for example each atom type, might provide detailed information. The disadvantage is that a lot of data is required in order to obtain accurate potentials for the many pair potentials in the system. Choosing a lower level of representation (for example residue types) might lead to less informative potentials. The advantages are a system that is less sensitive to minor structural inaccuracies and more robust statistics for deriving the potentials.

Representation of residues
When working at the residue level, it is necessary to decide how to represent the residue. Usually the representation is simply a point. The most common positions are Cα and Cβ.

Another well-accepted position is the center of mass or center of geometry of the side chain atoms (Kocher et al., 1994; Zhang et al., 2003).

Representation of residues
Residues can also be represented by several atoms or by vectors, which additionally provide some information about the general orientation of the residue in space.

[Figure: residue pair represented by Cα(i), Cβ(i), Cβ(j), Cα(j)]

Database used to derive the potentials
- Reliable data
- Non-redundant data (representative data)
- Oriented database (toward a specific goal)
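The point representations mentioned above (center of geometry or center of mass of the side chain atoms) are simple to compute. The sketch below uses made-up coordinates and hypothetical function names; in practice the coordinates would come from a structure file.

```python
def center_of_geometry(coords):
    """Unweighted mean position of a list of (x, y, z) atom coordinates."""
    n = len(coords)
    return tuple(sum(c[axis] for c in coords) / n for axis in range(3))

def center_of_mass(coords, masses):
    """Mass-weighted mean position of the same atoms."""
    total = sum(masses)
    return tuple(
        sum(m * c[axis] for c, m in zip(coords, masses)) / total
        for axis in range(3)
    )

# Hypothetical side-chain coordinates in A, for illustration only.
side_chain = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
masses = [12.0, 12.0, 16.0]  # two carbons and one oxygen (rounded)

print(center_of_geometry(side_chain))
print(center_of_mass(side_chain, masses))
```

The two points coincide when all atoms have the same mass and differ slightly otherwise, so the choice rarely matters at the resolution of residue-level potentials.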

Boltzmann distribution: the relation between probabilities and energy

\[ p_i = \frac{e^{-E_i / kT}}{Z}, \qquad Z = \sum_{k,l} e^{-E_{k,l} / kT} \]

k is Boltzmann's constant, T is the absolute temperature, p is the probability density function, and Z is the partition function. In order to know the value of Z we would need to know the energy value of every state in the system.

[Figure: Boltzmann distribution dependence on temperature. Probabilities P_1, P_2, P_3 of three states with E_1 < E_2 < E_3, shown at two temperatures T_1 > T_2]

Inverse Boltzmann relation

Here f denotes the observed relative frequency of a state, used as an estimate of its probability, and f_ref the frequency of the reference state.

\[ f_i = \frac{e^{-E_i / kT}}{Z} \quad\Rightarrow\quad E_i = -kT \ln[f_i] - kT \ln[Z] \]

\[ E_{k,l} = -kT \ln[f_{k,l}] - kT \ln[Z] \]

\[ \Delta E_{k,l} = E_{k,l} - E_{ref} = -kT \ln[f_{k,l}] - kT \ln[Z] - \left( -kT \ln[f_{ref}] - kT \ln[Z] \right) \]

\[ \Delta E_{k,l} = -kT \ln[f_{k,l}] + kT \ln[f_{ref}] = -kT \left( \ln[f_{k,l}] - \ln[f_{ref}] \right) \]

Sometimes we work with pseudo-energetic scores that are linearly related to the energy, although it is not exactly clear (and not really important) how. In this case we can also ignore the constants, whose physical meaning is not clear in our case anyway:

\[ \Delta E_{k,l} \propto -\left( \ln[f_{k,l}] - \ln[f_{ref}] \right), \qquad S_{k,l} = \ln[f_{k,l}] - \ln[f_{ref}] \propto -\Delta E_{k,l} \]
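The forward and inverse Boltzmann relations above can be checked numerically. The sketch below builds probabilities from a small set of energies and then recovers an energy difference from the probability ratio; the energies and kT values are arbitrary illustrative numbers and the function names are hypothetical.

```python
import math

def boltzmann_probabilities(energies, kT):
    """p_i = exp(-E_i / kT) / Z, with Z the sum of all Boltzmann weights."""
    weights = [math.exp(-e / kT) for e in energies]
    z = sum(weights)  # partition function
    return [w / z for w in weights]

def pseudo_energy(f_observed, f_reference, kT=1.0):
    """Inverse Boltzmann: Delta E = -kT * ln(f_observed / f_reference)."""
    return -kT * math.log(f_observed / f_reference)

energies = [0.0, 1.0, 2.0]           # E1 < E2 < E3
probs = boltzmann_probabilities(energies, kT=0.6)
hot = boltzmann_probabilities(energies, kT=6.0)

# The partition function cancels, so only the ratio of frequencies is needed.
recovered = pseudo_energy(probs[2], probs[0], kT=0.6)
print([round(p, 3) for p in probs], [round(p, 3) for p in hot], round(recovered, 6))
```

At the higher temperature the three probabilities move closer together, and the recovered energy difference equals E3 - E1 without Z ever being known, which is what makes the inverse relation usable on database frequencies.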

What we have from known data are the relative frequencies of the different states, f_i with Σ_i f_i = 1, which we hope represent well the real-life probabilities p_i with Σ_i p_i = 1. It is therefore crucial that the set f be as close as possible to the real probabilities p.

The reference state
A crucial factor in building knowledge-based potentials is the reference state. The reference state is the probability of some event that we expect by chance alone. If a particular conformation is found to have the reference state probability, then it gives no information about the system. Every probability should be normalized with respect to the reference state probability. Another way to look at the reference state: it is the state whose energy is equal to zero.

Corrections for the problem of small sample size
For a very large sample size n, the real probability distribution and the observed one are similar:

\[ \lim_{n \to \infty} f_i = p_i \]

In real life this is often not the case, and the sample size is small. In such cases, for many states f_i is not a good approximation of p_i. If we cannot increase the number of observations (which depends on the database size), we are forced to work with an insufficient amount of data. The best we can do is to minimize the damage from possibly large deviations between f and p. Sippl was the first to introduce a method to account for this problem. The idea is to weight the information we take from the database according to the number of observations we have.

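\[ f'_{k,l} = \frac{1}{z} \left[ f_{ref} + \sigma n_{total} f_{k,l} \right] \]

f_ref is the reference probability, σ is a constant which represents the weight of each observation in the database, and z is the new sum of pseudo-frequencies, needed for normalization:

\[ z = \sum_{k,l} \left( f_{ref} + \sigma n_{total} f_{k,l} \right) = 1 + \sigma n_{total} \]

\[ f'_{k,l} = \frac{1}{1 + \sigma n_{total}} \left[ f_{ref} + \sigma n_{total} f_{k,l} \right] = \frac{1}{1 + \sigma n_{total}} f_{ref} + \frac{\sigma n_{total}}{1 + \sigma n_{total}} f_{k,l} \]

If there is plenty of data, we rely on the data to derive accurate potentials:

\[ \lim_{n_{total} \to \infty} \left[ \frac{1}{1 + \sigma n_{total}} f_{ref} + \frac{\sigma n_{total}}{1 + \sigma n_{total}} f_{k,l} \right] = f_{k,l} \]

If there is an insufficient amount of data, we prefer to use the mean value, namely the reference state probability:

\[ \lim_{n_{total} \to 0} \left[ \frac{1}{1 + \sigma n_{total}} f_{ref} + \frac{\sigma n_{total}}{1 + \sigma n_{total}} f_{k,l} \right] = f_{ref} \]

(Kocher et al., 1994)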
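The small-sample correction above is a weighted average between the reference frequency and the observed frequency. The sketch below implements that average directly; the default sigma of 0.02 is an assumed illustrative value and the function name is hypothetical.

```python
def corrected_frequency(f_observed, f_reference, n_total, sigma=0.02):
    """Blend the observed and reference frequencies.

    f' = f_ref / (1 + sigma*n) + sigma*n * f_obs / (1 + sigma*n)
    sigma is the weight given to each observation (assumed value here).
    """
    weight = sigma * n_total
    return (f_reference + weight * f_observed) / (1.0 + weight)

# With few observations the result stays near the reference value;
# with many observations it approaches the observed frequency.
for n in (0, 10, 50, 500, 50000):
    print(n, round(corrected_frequency(0.30, 0.10, n), 4))
```

With sigma = 0.02, fifty observations give the data and the reference state equal weight, so sigma effectively sets how many observations are needed before the database is trusted.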

Problems with applying the Boltzmann model to proteins?

The Boltzmann model was originally introduced for the gas state. Is it appropriate for proteins?
- The distribution of peptide bonds in proline is correctly predicted based on the Boltzmann distribution.
- On the other hand, interactions are not independent, and protein atoms are constrained by covalent bonds. The connectivity between atoms might introduce bias.
- Values in the potentials might be influenced by the dominance of other interactions.

Factors influencing the prediction

Examples of applications of knowledge-based potentials

S = S_geometry + S_contact + S_neighbors + S_clashing + S_backbone + S_rotamer + S_local

Knowledge-based potentials are applied to a variety of problems in proteins. One application is to determine the thermostability of proteins and of mutants. The problem is to decide which amino acid is better in a given position.

Geometry potential
[Equation for S_geometry: a sum over residue pairs (i, j) of logarithms of geometry frequencies. Figure labels: Cα(i), Centroid(i), Cα(j), Centroid(j)]
The geometry potential S_geometry determines how likely the interaction between a specific pair of residues is. This is done according to the probability of finding the pair of residues in that specific geometry in the database.

Contact potential
[Equation for S_contact: a sum over contacting residue pairs (i, j) of logarithms of the pair frequency for amino acid types x and y, normalized over all residue pairs]
The contact potential S_contact determines how likely the interaction between a given pair of residues is. This is determined by the probability of finding this pair of residues in close contact, relative to other residue pairs.
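To make the idea of a contact potential concrete, the sketch below derives log-odds scores from a list of observed contacts. It is not the formula used in the talk, which is not recoverable here: the reference state (random pairing of residues according to how often each type appears in contacts) and all names and data are assumptions chosen for the illustration.

```python
import math
from collections import Counter

def contact_scores(contacts):
    """Log-odds score ln(observed / expected) for each residue-type pair.

    contacts: iterable of (type_a, type_b) pairs observed in close contact.
    The expected frequency assumes random pairing (an assumed reference state).
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    total_pairs = sum(pair_counts.values())
    total_ends = 2 * total_pairs
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        fa = residue_counts[a] / total_ends
        fb = residue_counts[b] / total_ends
        # Unordered pairs of different types can arise in two ways.
        expected = fa * fb if a == b else 2.0 * fa * fb
        scores[(a, b)] = math.log(observed / expected)
    return scores

# Toy data, for illustration only.
toy_contacts = [("LEU", "LEU")] * 4 + [("LEU", "LYS")] * 2 + [("LYS", "GLU")] * 4
toy_scores = contact_scores(toy_contacts)
for pair, score in sorted(toy_scores.items()):
    print(pair, round(score, 3))
```

In this toy set the hydrophobic pair and the oppositely charged pair are seen more often than expected by chance and get positive scores, while the mixed pair gets a negative score. A real derivation would also apply a small-sample correction to rarely observed pairs.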

[Table: contact potential matrix of pairwise values between residue types]

Neighbors potential
[Equation for S_neighbours: a sum over positions i of logarithms of the frequency of the residue at position i within neighbor-count bins C_i]
The neighbors potential S_neighbors evaluates the environment of a given residue. This is determined by counting the number of neighbors of a given residue and taking the probability of finding the residue with this number of neighbors in the database.

Backbone potential
[Equation for S_backbone: a sum over positions i of logarithms of the frequency of amino acid aa(i) given the local backbone conformation bb(i), normalized over amino acid types x]
The backbone potential S_backbone reflects the probability of finding the residue given the local backbone conformation (which is fixed in our problem). For example, Pro in an α-helix will have a small S_backbone.
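The neighbor-counting step described above can be sketched as follows. The 8 Å cutoff, the one-point-per-residue representation and the form of the score (log of the conditional frequency over the background frequency) are assumptions for the illustration and are not taken from the talk.

```python
import math

def count_neighbors(positions, index, cutoff=8.0):
    """Number of other residues within cutoff (in A) of residue index.

    positions holds one representative (x, y, z) point per residue.
    """
    center = positions[index]
    return sum(
        1
        for j, other in enumerate(positions)
        if j != index and math.dist(center, other) <= cutoff
    )

def neighbor_score(count_in_bin, count_residue, count_bin_all, count_all):
    """ln of f(residue type | neighbor bin) / f(residue type)."""
    conditional = count_in_bin / count_bin_all
    background = count_residue / count_all
    return math.log(conditional / background)

# Hypothetical coordinates and database counts, for illustration only.
points = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 7.0, 0.0), (20.0, 0.0, 0.0)]
print(count_neighbors(points, 0), count_neighbors(points, 3))
print(round(neighbor_score(30, 100, 200, 2000), 3))
```

A residue type that is enriched in crowded environments relative to its overall abundance gets a positive score in the corresponding bin, which is how buried hydrophobic residues would be favored.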

Rotamer potential
[Equation for S_rotamer: a sum over positions i of logarithms of the probability P_rot of rotamer r(i) given the local backbone conformation bb(i)]
The rotamer potential S_rotamer reflects the probability of finding this rotamer in the protein database given the local backbone conformation. The probabilities of the rotamers were taken from a pre-calculated rotamer library.

Local potential
[Equation for S_local: a sum over positions i of logarithms of frequencies of residue types x, y given sequence separation s and distance d]
The local potential S_local reflects the probability of finding the residue in its local environment (±3 amino acids along the sequence) given both the sequence and the structural conformation in that region.

Weighted sum of the terms

S = S_geometry + K_c S_contact + K_n S_neighbors + K_w S_clashing + K_d S_dipeptide + K_b S_backbone + K_l S_local

Optimization of the K's was done on dataset471 using a Monte Carlo procedure. The K's were the parameters to be optimized, and the correlation coefficient (r) between the calculated scores and the experimental scores was the objective function.
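The weight optimization described above can be sketched as a simple Metropolis-style Monte Carlo search that maximizes the Pearson correlation between combined and experimental scores. This is a generic illustration, not the procedure actually used in the talk: the step size, temperature, number of steps and all function names are assumptions. The first weight is kept at 1, mirroring the unweighted S_geometry term, since the correlation is unchanged by an overall scaling of the weights.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def combined_scores(term_rows, weights):
    """Weighted sum of the score terms for every data point."""
    return [sum(w * t for w, t in zip(weights, row)) for row in term_rows]

def optimize_weights(term_rows, experimental, steps=2000, step_size=0.2,
                     temperature=0.01, seed=0):
    """Monte Carlo search over the weights; needs at least two score terms."""
    rng = random.Random(seed)
    weights = [1.0] * len(term_rows[0])
    current_r = pearson(combined_scores(term_rows, weights), experimental)
    best_weights, best_r = list(weights), current_r
    for _ in range(steps):
        trial = list(weights)
        k = rng.randrange(1, len(trial))  # weight 0 stays fixed at 1
        trial[k] += rng.uniform(-step_size, step_size)
        trial_r = pearson(combined_scores(term_rows, trial), experimental)
        # Always accept improvements; accept worse moves with small probability.
        accept = trial_r >= current_r or (
            rng.random() < math.exp((trial_r - current_r) / temperature)
        )
        if accept:
            weights, current_r = trial, trial_r
            if current_r > best_r:
                best_weights, best_r = list(weights), current_r
    return best_weights, best_r
```

A usage example would pass one row of score terms per mutant together with the measured values, for instance `optimize_weights(rows, measured_ddg)` where `rows` and `measured_ddg` are hypothetical names. Optimizing and reporting r on the same dataset overestimates performance, so a held-out set is advisable.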

[Figure: three correlation plots with r = 0.73, trend = 0.78; r = 0.54, trend = 0.77; r = 0.57, trend = 0.74]

Sequence design
Knowledge-based potentials can be used to design sequences that are compatible with a given structure. For each position the potentials help to determine which residue is the most appropriate (Ota et al., 1997).

Detection and evaluation of sequence-structure compatibility
By far the most common application of knowledge-based potentials, especially at the residue level, is the evaluation of protein structures. This includes:
- Ranking different models of a given protein
- Evaluating individual protein structures
- 1D-3D alignment
- The inverse folding problem: searching sequence databases with a given structure