Introduction to Computational Modelling and Functional Analysis of Proteins

AG Prof. Dr. Monika Fritz, Pure and Applied Biomineralisation, Institute for Biophysics
AG Prof. Dr. Manfred Radmacher, Institute for Biophysics

Aim and scope of the lab class

The aim of this lab class is to give an introduction to the concepts and fundamentals of protein structure modelling and the computational functional analysis of proteins. In the course of the class you are introduced to useful tools for protein structure modelling. These tools comprise protein sequence comparison techniques, sequence alignment techniques, comparative modelling algorithms, surface potential calculators, systematic docking programs and protein structure visualizers.

Introduction

With the advent of synchrotrons as high-quality X-ray sources (see e.g. EMBL at DESY) and of NMR facilities, one might assume that computational protein structure modelling is old-fashioned. Nevertheless, it is often very laborious and/or expensive (it might even be impossible) to obtain the protein of interest in a sufficient amount, to purify it to a high degree and to prepare a protein crystal for X-ray investigation. The last step may become obsolete in the future, once free-electron lasers (like XFEL) are operational. In contrast to experimental structure determination, which requires expensive equipment and methods, computational modelling requires only a modern workstation and the appropriate software. Today structure modelling is even possible on web servers. This discrepancy is reflected in the number of available experimental protein structures (Protein Data Bank: 83,627 structures) compared to the number of available models (Protein Model Portal: 3,853,448 proteins covered by at least one model). To date (July 2012) about 24 million proteins have been sequenced. Consequently, the structures of a large number of proteins still remain to be revealed.

Proteins play a major role in every organism.
They take part in nearly every cellular process, e.g. metabolism, transport, signalling, cell structure and immune defence. Important features can be deduced from the protein structure or from a reasonable protein model. The protein surface may reveal possible binding sites for ligands. Surface charge calculations can indicate how the protein behaves in different aqueous environments. Even the function of catalytic sites in enzymes can be determined.

Every modelling of an unknown protein structure requires the sequence of the protein of interest. The protein sequence is the sequence of amino acids that constitute the macromolecule and is called the primary structure. The amino acid sequence can be obtained experimentally or, if available, determined from the DNA fragment that encodes the protein. Once the protein sequence is available, a database search against the sequences of all known structures may reveal relationships to known proteins. If related proteins have been found, their structure can be used as a template in the modelling process. The central assumption of every template-based modelling is that the protein sequence is related to the protein structure: similar sequences should result in similar structures.

Further reading

Alberts, B. et al., Molecular Biology of the Cell, Garland Publishing Inc., 3rd edition, 1994 (chapter about protein structure)
Martí-Renom, M.A. et al., Comparative Protein Structure Modeling of Genes and Genomes, Annu. Rev. Biophys. Biomol. Struct. 29, 291-325 (2000)

Computational Protein Sequence Comparison

As outlined above, every modelling of an unknown structure starts with a search for protein sequences that are related or homologous to the sequence of the protein of interest (remember that the structures of these proteins must be available as well). During the evolution of species, parts of the DNA are modified and/or the transcription of the DNA is influenced. The first process is called mutation. Since the DNA is transcribed and translated into the amino acid sequence of proteins, these macromolecules are changed accordingly. Some mutations proved to be advantageous to some species, while others might have led to the decline of an organism.

There are two important purposes of computational protein sequence comparison. One is the construction of phylogenetic trees. Phylogenetic trees depict the evolutionary dependency of proteins of the same class in different clades. The determination of the evolutionary distance of two proteins requires a measure of similarity of two (or more) sequences. Therefore the determination of the evolutionary dependency of two or more proteins is closely related to the second purpose of computational protein sequence comparison: the alignment of two sequences and the determination of their similarity under certain constraints.

Consider the following example. Of the following two sequences, one is an N-acetyl-D-glucosamine kinase and the other is a human glucokinase.
>sp|Q9UJ70|NAGK_HUMAN N-acetyl-D-glucosamine kinase OS=Homo sapiens GN=NAGK PE=1 SV=4
MAAIYGGVEGGGTRSEVLLVSEDGKILAEADGLSTNHWLIGTDKCVERINEMVNRAKRKA
GVDPLVPLRSLGLSLSGGDQEDAGRILIEELRDRFPYLSESYLITTDAAGSIATATPDGG
VVLISGTGSNCRLINPDGSESGCGGWGHMMGDEGSAYWIAHQAVKIVFDSIDNLEAAPHD
IGYVKQAMFHYFQVPDRLGILTHLYRDFDKCRFAGFCRKIAEGAQQGDPLSRYIFRKAGE
MLGRHIVAVLPEIDPVLFQGKIGLPILCVGSVWKSWELLKEGFLLALTQGREIQAQNFFS
SFTLMKLRHSSALGGASLGARHIGHLLPMDYSANAIAFYSYTFS
>sp|P35557|HXK4_HUMAN Glucokinase OS=Homo sapiens GN=GCK PE=1 SV=1
MLDDRARMEAAKKEKVEQILAEFQLQEEDLKKVMRRMQKEMDRGLRLETHEEASVKMLPT
YVRSTPEGSEVGDFLSLDLGGTNFRVMLVKVGEGEEGQWSVKTKHQMYSIPEDAMTGTAE
MLFDYISECISDFLDKHQMKHKKLPLGFTFSFPVRHEDIDKGILLNWTKGFKASGAEGNN
VVGLLRDAIKRRGDFEMDVVAMVNDTVATMISCYYEDHQCEVGMIVGTGCNACYMEEMQN
VELVEGDEGRMCVNTEWGAFGDSGELDEFLLEYDRLVDESSANPGQQLYEKLIGGKYMGE
LVRLVLLRLVDENLLFHGEASEQLRTRGAFETRFVSQVESDTGDRKQIYNILSTLGLRPS
TTDCDIVRRACESVSTRAAHMCSAGLAGVINRMRESRSEDVMRITVGVDGSVYKLHPSFK
ERFHASVRRLTPSCEITFIESEEGSGRGAALVSAVACKKACMLGQ

Possible tasks could be to determine the relationship or the degree of similarity between these two proteins, or to perform an alignment of the two sequences in order to identify conserved protein motifs. The figure below shows an alignment of the two sequences above in the residue range (127-180:172-229).

P35557  EVGDFLSLDLGGTNFRVMLVKVGEGEEGQWSVKTKH--QMYSIPEDAMTGTAEMLFDYIS  127
Q9UJ70  --GGVVLISGTGSNCRLINPDGSESGCGGWGHMMGDEGSAYWIAHQA----VKIVFDSID  172

P35557  ECISDFLDKHQMKHKKLPLGFTFSFPVRH-------EDIDKGILLNWTKGFKASGAEGNN  180
Q9UJ70  NLEAA---PHDIGYVKQAMFHYFQVPDRLGILTHLYRDFDKCRFAGFCRKIAEGAQQGDP  229

If an alignment is real, i.e. if the sequences are evolutionarily related, identities (same amino acids) and conservative substitutions (amino acids with similar physico-chemical properties) can be expected to occur more often than in random alignments. Aligned amino acid pairs of this kind should be assigned a positive score. At the same time non-conservative changes should be observed less frequently than in random alignments. These changes should have a negative score.

Consider two sequences $x$ and $y$, each composed of letters $x_i, y_i \in A$ of an alphabet $A$ at position $i$ (here the 20 amino acids or the four nucleic acids). If an amino acid $a$ occurs with the frequency $q_a$, then the probability of a random alignment can be stated as

\[ P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j} \]

The probability that an alignment occurs according to some match model, in which the aligned pair $a, b$ occurs with the frequency $p_{ab}$, is

\[ P(x, y \mid M) = \prod_i p_{x_i y_i} \]

Taking the logarithm of the odds ratio gives

\[ S = \sum_i s(x_i, y_i), \qquad s(a, b) = \log \frac{p_{ab}}{q_a q_b} \]

$S$ is the total score of an alignment and is the sum of the scores of the individual amino acid pairs.
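The log-odds scoring described above can be tried out on a toy example. The following is only a minimal sketch: the two-letter alphabet and all frequencies are made up for illustration and are not real amino acid data, the function names are hypothetical, and the logarithm base 2 is an arbitrary choice.

```python
import math

def log_odds(p_ab, q_a, q_b):
    """Score s(a, b) = log2(p_ab / (q_a * q_b)) for one aligned pair."""
    return math.log2(p_ab / (q_a * q_b))

def alignment_score(x, y, pair_freq, background):
    """Total score S of an ungapped alignment: the sum of the pair scores."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    total = 0.0
    for a, b in zip(x, y):
        total += log_odds(pair_freq[(a, b)], background[a], background[b])
    return total

# Hypothetical two-letter alphabet with made-up frequencies (assumption).
background = {"A": 0.5, "B": 0.5}
pair_freq = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

print(log_odds(0.4, 0.5, 0.5))  # identity seen more often than by chance: positive
print(log_odds(0.1, 0.5, 0.5))  # substitution seen less often than by chance: negative
print(alignment_score("AAB", "ABB", pair_freq, background))
```

In this toy model identities score positively and the substitution scores negatively, which is the behaviour a useful scoring scheme is expected to have.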
While the frequencies $q_a$ of individual amino acids can be determined quite easily from a large database of protein sequences, the frequencies $p_{ab}$ of real amino acid substitutions are more difficult to determine. The following figure shows a (scaled) BLOSUM62 substitution matrix. This scheme assigns a score to each amino acid pair. The numbers on the diagonal are positive and relatively large. In contrast, the pair Trp-Lys has a score of -3. A small (negative) number means that the substitution was observed less often during evolution than expected by chance. The Lys-Arg pair has a score of 2. These two amino acids are more likely to be substituted for each other during evolution. Exercise 2 will shed more light on the properties of this substitution matrix.

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

The basic steps in the construction of the BLOSUM (block substitution) matrices by Henikoff & Henikoff were the following. The authors started with a set of aligned ungapped regions (the so-called blocks) of protein families. The sequences inside these blocks were clustered according to certain levels of percent identity. Sequences that share at least the given level of identity were treated as one sequence in the alignment, thereby reducing multiple contributions. In these reduced blocks the relative frequency of each amino acid ($q_a$) as well as the relative frequency of each aligned amino acid pair ($p_{ab}$) were determined.

Sometimes not a substitution but an insertion or a deletion has occurred during evolution. Since the inserted or deleted amino acid is not known, a gap is inserted.
Usually gaps are penalised either according to a linear scheme, with a constant negative score times the gap length, or with an affine penalty that uses different negative scores for opening and for extending a gap.

Several alignment algorithms are available. There are speed-optimized algorithms such as BLAST or FASTA and the more rigorous Needleman-Wunsch and Smith-Waterman algorithms. The Needleman-Wunsch algorithm searches for the optimal global alignment and the Smith-Waterman algorithm for the best local alignment. Exercise 3 clarifies the principles of the two algorithms.

Unfortunately the raw score of an optimal alignment gives no information

1. whether the sequences are evolutionarily related and
2. about the statistical significance of the alignment score.

Analytical solutions exist only for the problem of global and local ungapped alignments. It turns out that in these cases the probability that the score $X$ of a random alignment is greater than $S$ is given by

\[ P(X > S) = 1 - \exp(-E(S)) = 1 - \exp\left(-K m n \exp(-\lambda S)\right) \]
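The significance formula can be evaluated numerically. The sketch below names its arguments after the symbols of the formula; the values used for K and λ are placeholders chosen for illustration only (in practice they depend on the scoring scheme), and the function names are hypothetical.

```python
import math

def expected_random_hits(score, m, n, K, lam):
    """E(S) = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)

def p_value(score, m, n, K, lam):
    """P(X > S) = 1 - exp(-E(S)); expm1 keeps precision when E(S) is tiny."""
    return -math.expm1(-expected_random_hits(score, m, n, K, lam))

# Placeholder parameters (assumption, not fitted to any real scoring matrix).
K, lam = 0.1, 0.3
for s in (20, 40, 60):
    print(s, expected_random_hits(s, 300, 400, K, lam), p_value(s, 300, 400, K, lam))
```

For large scores E(S) becomes very small and the probability is then practically equal to E(S) itself.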

Here $m$ and $n$ are the lengths of the sequences that are aligned, $K$ corrects for multiple starting points of local alignments and $\lambda$ can be thought of as a scale for the substitution scores. If a large database is searched for related proteins, the product of the probability given above and the database length gives the number of random alignments to be expected above score $S$.

Further reading

Durbin, R., Eddy, S., Krogh, A., Mitchison, G., Biological sequence analysis: Probabilistic models of proteins and nucleic acids, Cambridge University Press, 12th edition (Chapters 1, 2.1-2.3, 2.7, 2.8)
Pearson, W.R., Protein sequence comparison and Protein Evolution, Tutorial ISMB2000, Pearson Group, Dept. of Biochemistry and Molecular Genetics, University of Virginia
Pearson, W.R., Guide to the FASTA program package, Pearson Group, Dept. of Biochemistry and Molecular Genetics, University of Virginia
Pearson, W.R., Empirical Statistical Estimates for Sequence Similarity Searches, J. Mol. Biol. 276, 71-84 (1998)
Henikoff, S. & Henikoff, J.G., Amino acid substitution matrices from protein blocks, PNAS 89, 10915-10919 (1992)

Protein structure modelling
Comparative modelling with satisfaction of spatial restraints

In principle there are three different classes of protein structure prediction approaches, which differ in their requirements and in the resolution/quality of the model [Baker, D. and Sali, A. 2001].

a.) de novo prediction (resolution approximately 4 to 8 Å)
   - deduce functional sites
   - derive structural similarities
   o for short sequences up to 80 amino acids
   o relies solely on force fields
   o requires large computational resources

b.) threading / fold recognition (resolution 3 to 4 Å)
   - determine secondary structures
   - determine conserved domains
   o no information on less structured regions

c.) comparative modelling (resolution 3 to 2 Å)
   - no limit on the sequence length
   - site-directed mutagenesis
   - docking studies
   - only modest computational resources required
   - template for X-ray or NMR structure determination
   o template structure required with at least approx. 30% sequence identity
   o alignment critical

In the following the program MODELLER (A. Sali, salilab.org) is introduced. MODELLER performs comparative modelling with satisfaction of spatial restraints. The question answered by MODELLER reads: What is the most probable structure for a sequence given its alignment with related structures? [Sali, A. & Blundell, T.L., J. Mol. Biol. 234 (1993)]

A protein structure has several characteristic features that can be classified. A feature can be defined as a quantity associated with a certain set of atoms of the protein structure. One feature class comprises stereochemical properties on the atomic level:

   - bond angles and bond lengths
   - dihedral angles
   - disulphide bonds and angles
   - Lennard-Jones interactions
   - Coulomb interactions

On the residue level the following features can be deduced from a structure:

   - distance between equivalent backbone atoms
   - main-chain atomic distances
   - main-chain and side-chain dihedral angles
   - neighbouring residues
   - solvent accessibility

For each feature a probability density function (pdf) is derived. These functions, their parameters and their dependence on more fundamental protein structure properties were calculated from many homologous protein structures or are known from experiments. Once a pdf $p_i$ is known for each feature of a protein structure, a pdf $p$ for the whole molecular structure can be constructed as the product

\[ p = \prod_i p_i \]

Here it is assumed that the features are independent of each other. Some feature restraints, such as bond lengths, are independent of the alignment of the target sequence with the templates. Other feature restraints depend on the alignment, for example the distance between equivalent C$_\alpha$ atoms. This molecular probability density function can be used to calculate the most probable structure of an amino acid sequence given its alignment with the sequence(s) of known template structures.
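Because the features are assumed to be independent, the logarithm of the molecular pdf is simply a sum over the features. The sketch below illustrates this with Gaussian feature pdfs, which is an assumption made here for simplicity and not necessarily the functional form MODELLER uses. The feature names and numbers are invented for the example.

```python
import math

def gaussian_pdf(value, mean, sigma):
    """Gaussian pdf, used here as a stand-in for a feature pdf (assumption)."""
    z = (value - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def molecular_pdf(features, restraints):
    """Product of the individual feature pdfs."""
    p = 1.0
    for name, value in features.items():
        mean, sigma = restraints[name]
        p *= gaussian_pdf(value, mean, sigma)
    return p

def negative_log_pdf(features, restraints):
    """-ln of the product, computed as a sum of -ln of each feature pdf."""
    total = 0.0
    for name, value in features.items():
        mean, sigma = restraints[name]
        total += -math.log(gaussian_pdf(value, mean, sigma))
    return total

# Hypothetical restraints: feature name -> (most probable value, width), in Angstrom.
restraints = {"CA_distance_12_48": (6.0, 0.5), "bond_N_CA_12": (1.46, 0.02)}
good_model = {"CA_distance_12_48": 6.1, "bond_N_CA_12": 1.46}
poor_model = {"CA_distance_12_48": 8.0, "bond_N_CA_12": 1.52}

print(negative_log_pdf(good_model, restraints))  # smaller: more probable
print(negative_log_pdf(poor_model, restraints))  # larger: less probable
```

Working with the sum of logarithms avoids multiplying many very small numbers, which is one practical reason to prefer it over the raw product.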

For computational reasons the objective function $F = -\ln p$ is minimized with respect to the Cartesian coordinates of the atoms during modelling, instead of maximizing the molecular pdf itself. A flow chart of the MODELLER comparative modelling process is given in the online manual (http://salilab.org/modeller/manual/).

Further reading

Baker, D. and Sali, A., Protein Structure Prediction and Structural Genomics, Science 294, 93 (2001)
Sali, A. and Blundell, T.L., Comparative Protein Modelling by Satisfaction of Spatial Restraints, J. Mol. Biol. 234, 779-815 (1993)
Shen, M. and Sali, A., Statistical potential for assessment and prediction of protein structures, Protein Science 15, 2507-2524 (2006)
MODELLER online manual (http://salilab.org/modeller/manual/): Introduction, Automated comparative modelling, Methods

pKa calculations
PROPKA software

The natural environment of proteins is an aqueous solution. The pH of such aqueous solutions inside an organism is regulated. Since some amino acids of a protein are titratable, the pH of the solution determines the net charge of the protein and the charge state of individual amino acid residues. The charge of the whole protein and/or of sites on the protein surface strongly affects the behaviour of the protein in solution. The charge on certain residues might control protein-ligand or protein-protein interactions. Therefore the pH is crucial to the function of the protein machinery in an organism.

The charge of a titratable group of a protein residue at a given pH can only be determined if the pKa of this particular group is known. Experimental pKa values of individual amino acids in aqueous solution are known. But if the amino acid is incorporated into a different environment, such as a protein structure, these pKa values can shift. There are several possible contributions:

   - Coulomb interactions between charged groups
   - H-bonds
   - desolvation effects

These shifts can be calculated from a given structure with the PROPKA software.
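Once the pKa of a titratable group is known, its mean charge at a given pH follows from the Henderson-Hasselbalch relation. The sketch below illustrates this relation only and does not reproduce what PROPKA computes. The pKa values in the table are rough textbook-style numbers used as placeholders, and the function names are hypothetical.

```python
def acid_charge(pH, pKa):
    """Mean charge of an acidic group: 0 when protonated, -1 when deprotonated."""
    return -1.0 / (1.0 + 10.0 ** (pKa - pH))

def base_charge(pH, pKa):
    """Mean charge of a basic group: +1 when protonated, 0 when deprotonated."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

# Placeholder side-chain pKa values (assumption, approximate free amino acid values).
MODEL_PKA = {
    "ASP": (3.9, acid_charge),
    "GLU": (4.3, acid_charge),
    "HIS": (6.0, base_charge),
    "LYS": (10.5, base_charge),
    "ARG": (12.5, base_charge),
}

def net_charge(residues, pH, pka_table=MODEL_PKA):
    """Sum of the mean charges of all titratable residues in the list."""
    total = 0.0
    for name in residues:
        if name in pka_table:
            pKa, charge_function = pka_table[name]
            total += charge_function(pH, pKa)
    return total

print(net_charge(["ASP", "LYS", "ALA"], 7.0))  # close to zero
print(acid_charge(7.0, 3.9))   # Asp with the unshifted placeholder pKa
print(acid_charge(7.0, 6.5))   # same group with a hypothetical shifted pKa
```

The last two lines show why the shifts matter: the same group at the same pH carries a noticeably different mean charge when its pKa is shifted by the protein environment.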

Further reading

Li, H. et al., Very Fast Empirical Prediction and Rationalization of Protein pKa Values, PROTEINS 61, 704-721 (2005)
Olsson, M.H.M. et al., PROPKA3: Consistent Treatment of Internal and Surface Residues in Empirical pKa Predictions, J. Chem. Theory Comput. 7, 525-537 (2011)

Surface potential calculations
Adaptive Poisson-Boltzmann Solver (APBS) software

The surface potential of proteins can reveal information about binding sites, aggregation behaviour and stability in solution. The Poisson equation for non-homogeneous materials reads

\[ -\nabla \cdot \left( \varepsilon(\mathbf{r}) \nabla \varphi(\mathbf{r}) \right) = \rho(\mathbf{r}) \]

with the position-dependent dielectric function $\varepsilon$, the charge distribution $\rho$ and the electrostatic potential $\varphi$. To account for ions in the aqueous environment of proteins the Debye-Hückel model is used. According to this model the distribution $n_\pm$ of an ion species in a solvent differs from the bulk density $n_0$:

\[ n_\pm = n_0 \exp\left( -\frac{U_\pm}{k_B T} \right) \]

where $U_\pm$ is the potential of mean force. It is approximated by the product of the ion charge and the electrostatic potential. This approximation results in the Poisson-Boltzmann equation

\[ -\nabla \cdot \left( \varepsilon(\mathbf{r}) \nabla \varphi(\mathbf{r}) \right) = \rho(\mathbf{r}) - \varepsilon(\mathbf{r}) \kappa^2(\mathbf{r}) \sinh \varphi(\mathbf{r}) \]

The electrostatic potential is now given in units of $k_B T / e$. The term $\kappa(\mathbf{r})$ is closely related to the Debye-Hückel screening parameter (up to a multiplicative constant):

\[ \kappa^2(\mathbf{r}) \sim \frac{I e^2}{k_B T \varepsilon(\mathbf{r})} \]

where $I$ is the ionic strength. The inverse of the screening parameter is called the Debye length and gives the length scale over which electrostatic interactions are screened in solutions containing ions. If the electrostatic potential is small compared to the thermal energy, the hyperbolic sine can be approximated by its argument ($\sinh \varphi \approx \varphi$), which results in the linearized Poisson-Boltzmann equation. The equations above can be solved numerically with different techniques. APBS uses a finite difference technique.
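To get a feeling for the length scale of the screening, the Debye length can be computed directly. The sketch below uses the standard SI expression for the Debye length of an electrolyte with ionic strength I, which fixes the multiplicative constant left open in the text. The relative permittivity of 78.5 for water and the temperature of 298.15 K are assumed default values, and the function name is hypothetical.

```python
import math

EPS0 = 8.8541878128e-12      # vacuum permittivity in F/m
K_B = 1.380649e-23           # Boltzmann constant in J/K
E_CHARGE = 1.602176634e-19   # elementary charge in C
N_A = 6.02214076e23          # Avogadro constant in 1/mol

def debye_length(ionic_strength_molar, temperature=298.15, eps_r=78.5):
    """Debye length in metres for an ionic strength given in mol/L."""
    ionic_strength_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    kappa_squared = (2.0 * N_A * E_CHARGE ** 2 * ionic_strength_si
                     / (EPS0 * eps_r * K_B * temperature))
    return 1.0 / math.sqrt(kappa_squared)

for ionic_strength in (0.001, 0.01, 0.1):
    print(ionic_strength, "mol/L:", debye_length(ionic_strength) * 1e9, "nm")
```

At an ionic strength of 0.1 mol/L the Debye length is roughly 1 nm, and it grows with the inverse square root of the ionic strength. This is worth keeping in mind when choosing the two ionic strengths for the surface potential calculations.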

Further reading

Honig, B. and Nicholls, A., Classical Electrostatics in Biology and Chemistry, Science 268, 1144-1149 (1995)
Davis, M.E. and McCammon, J.A., Electrostatics in Biomolecular Structure and Dynamics, Chem. Rev. 90, 509-521 (1990)
Baker, N.A. et al., Electrostatics of nanosystems: Application to microtubules and the ribosome, PNAS 98, 10037-10041 (2001)

Protein-Protein Docking
ATTRACT

Many biological processes rely on the formation or decay of protein complexes. As mentioned above, the preparation of crystals of single proteins for X-ray structure determination is challenging. The crystallization of proteins in a particular complex state is even more challenging. The docking program ATTRACT takes the following approach to propose reasonable protein complex geometries.

In a first step the complexity of the problem is reduced. The atoms of a high-resolution protein structure are replaced by pseudo atoms. The protein backbone atoms are substituted with two pseudo atoms: one at the position of the nitrogen and the other at the position of the carbonyl oxygen. The short side chains (Ala, Ser, Thr, Val, Leu, Ile, Asn, Asp) are represented by one pseudo atom positioned at the geometric centre of all side-chain heavy atoms. The large side chains (Arg, Lys, Glu, Gln, His, Met, Phe, Tyr, Trp) are built of two pseudo atoms. The first is positioned midway between the C$_\beta$ and C$_\gamma$ atoms of the side chain. The second is placed at the geometric centre of the remaining heavy atoms. Each pseudo atom is assigned a radius $R$ and a van der Waals interaction parameter $A$. Additionally, the acidic and basic residues receive a charge of -1 and +1, respectively. For example:

   Alanine: $R_{\mathrm{Ala}} = 1.878$ Å, $A_{\mathrm{Ala}} = 1\ (RT)^{1/2}$
   Tryptophan, pseudo atom 1: $R_{\mathrm{Trp1}} = 2.528$ Å, $A_{\mathrm{Trp1}} = 1.5\ (RT)^{1/2}$
   Tryptophan, pseudo atom 2: $R_{\mathrm{Trp2}} = 1.888$ Å, $A_{\mathrm{Trp2}} = 2.6\ (RT)^{1/2}$

On the one hand this simplification saves computational time, and on the other hand it leads to a smoothing of the protein surface.
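The reduction to pseudo atoms described above can be sketched in a few lines. This is an illustration only and not the ATTRACT implementation: it assumes that the two side-chain atoms bracketing the first pseudo atom of a large side chain are C-beta and C-gamma (PDB atom names CB and CG), that only heavy atoms are passed in, and that residues outside the two lists above are simply rejected. All names in the code are hypothetical.

```python
SMALL = {"ALA", "SER", "THR", "VAL", "LEU", "ILE", "ASN", "ASP"}
LARGE = {"ARG", "LYS", "GLU", "GLN", "HIS", "MET", "PHE", "TYR", "TRP"}
BACKBONE = {"N", "CA", "C", "O"}

def centroid(points):
    """Geometric centre of a list of (x, y, z) tuples."""
    count = len(points)
    return tuple(sum(p[i] for p in points) / count for i in range(3))

def reduce_residue(resname, atoms):
    """Map heavy atoms {name: (x, y, z)} of one residue to a list of pseudo atoms."""
    # Two backbone pseudo atoms: at the nitrogen and at the carbonyl oxygen.
    pseudo = [("N", atoms["N"]), ("O", atoms["O"])]
    side = {name: xyz for name, xyz in atoms.items() if name not in BACKBONE}
    if resname in SMALL:
        pseudo.append(("SC1", centroid(list(side.values()))))
    elif resname in LARGE:
        # Assumption: first pseudo atom midway between CB and CG.
        pseudo.append(("SC1", centroid([side["CB"], side["CG"]])))
        rest = [xyz for name, xyz in side.items() if name not in ("CB", "CG")]
        pseudo.append(("SC2", centroid(rest)))
    else:
        raise ValueError("residue type not covered by this sketch: " + resname)
    return pseudo

# Made-up coordinates, for illustration only.
lysine = {"N": (0, 0, 1), "CA": (0, 0, 0), "C": (0, 1, 0), "O": (0, 2, 0),
          "CB": (0, 0, 0), "CG": (2, 0, 0), "CD": (3, 0, 0),
          "CE": (4, 0, 0), "NZ": (5, 0, 0)}
print(reduce_residue("LYS", lysine))
```

In this example the nine heavy atoms of the lysine are replaced by four pseudo atoms.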
Since the most probable binding geometry of two proteins has to be determined, it is necessary to generate and evaluate many different contact orientations of the two proteins. In the next step the position of one protein (called the receptor) is therefore kept fixed, while the other protein (called the ligand) is placed around the receptor in many different orientations relative to it. This results in many (in the order of 10 ) different geometries of the problem. Each of these relative positions of the two proteins is used as a starting orientation for an energy minimization. The interaction energy between two pseudo atoms $i$ and $j$ of the two proteins at distance $r_{ij}$ is stated as

\[
V(r_{ij}) =
\begin{cases}
A_i A_j \left[ \left( \dfrac{R_i + R_j}{r_{ij}} \right)^{8} - \left( \dfrac{R_i + R_j}{r_{ij}} \right)^{6} \right] + \dfrac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} & \text{attractive residues} \\[2ex]
-A_i A_j \left[ \left( \dfrac{R_i + R_j}{r_{ij}} \right)^{8} - \left( \dfrac{R_i + R_j}{r_{ij}} \right)^{6} \right] + \dfrac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} & \text{repulsive residues, } r_{ij} > r_{\min} \\[2ex]
2 V(r_{\min}) + A_i A_j \left[ \left( \dfrac{R_i + R_j}{r_{ij}} \right)^{8} - \left( \dfrac{R_i + R_j}{r_{ij}} \right)^{6} \right] + \dfrac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} & \text{repulsive residues, } r_{ij} \le r_{\min}
\end{cases}
\]

The attractive interaction is described as the sum of a soft (6-8) Lennard-Jones potential and a Coulomb potential. The latter is screened with a distance-dependent dielectric function $\varepsilon(r_{ij}) = 15\, r_{ij}$. The repulsive interaction is split into two parts. Both parts lead to a repulsive interaction at any distance, but with a saddle point at $r_{\min}$.

After the potential has been calculated, the energy is minimized with respect to the translational and rotational degrees of freedom of the ligand (the receptor is kept fixed). This is usually done in several consecutive minimization procedures. In a rigid docking approach the surface side chains cannot react to the approach of the ligand, although it is known that surface side chains can rearrange during complex formation. To account for such behaviour, several copies of large surface side chains with different dihedral angles are evaluated during the energy minimization. The energetically most favourable side-chain rotamer is retained in the complex structure.

Finally all of the calculated complexes are binned. Two complexes are considered to belong to the same energy minimum if their ligand RMSd value is less than 0.2 Å. Then the complexes are ranked according to their energy.

Further reading

Zacharias, M., Protein-protein docking with a reduced protein model accounting for side-chain flexibility, Protein Science 12, 1271-1282 (2003)
Zacharias, M., EMBO Practical Course: Protein-protein docking with ATTRACT using a reduced protein model (2008)
Zacharias, M., ATTRACT: Protein-Protein Docking in CAPRI Using a Reduced Protein Model, PROTEINS 60, 252-256 (2005)

Tasks

1. Summarise the properties of the protein β-lactoglobulin (short: blg). See e.g. Kontopidis, G. et al., Invited Review: β-Lactoglobulin: Binding Properties, Structure and Function, Journal of Dairy Science 87, 785-796 (2004)

2. Consider the two following groups of amino acids: (D, E, K) and (V, I, L).
   a. What are their physico-chemical differences?
   b. What is the average BLOSUM62 score in the two groups?
   c. What is the average BLOSUM62 score between the two groups?
   d. What might be a cause for the results?

3. Perform an alignment of the sequences GNYLW and DDGRW manually using the Smith-Waterman algorithm and the Needleman-Wunsch algorithm as described in Durbin et al.

4. Search for structures homologous to the protein sequence of blg of the Eastern grey kangaroo (UniProt identifier: P11944) using the Smith-Waterman and Needleman-Wunsch algorithm implementations in the FASTA program package.
   a. Comment on the results.
   b. Choose a template for modelling of the target sequence.
   c. Make an alignment of the target and the template sequence using the Smith-Waterman algorithm and the Needleman-Wunsch algorithm. Compare the results.

5. Modelling of the protein
   a. Summarize the basic principle of MODELLER.
   b. Use the alignment from 4c) and the structure from 4b) to compute models for P11944.
   c. Analyse the models using
      i. restraint violations
      ii. DOPE potential
      iii. RMSd values
      iv. PROCHECK software (Laskowski, R.A., J. Appl. Cryst. 26, 283-291, 1993)

6. Use PROPKA to predict the pKa values of the amino acids of the protein model.
   a. Summarize the basic principle of PROPKA.
   b. Comment on the results.

7. Surface potential calculations with APBS
   a. Summarize the basic principle of APBS.
   b. Perform surface potential calculations for two appropriate pH values. Choose the pH values according to the dimerization behaviour of blg.
   c. Perform surface potential calculations for two ionic strengths.
   d. Comment on the results.

8. Systematic docking with ATTRACT
   a. Summarize the basic principle of ATTRACT.
   b. Calculate possible dimer structures of your protein model.
   c. Analyse the interface with VMD. Look at the amino acid composition and the area of the interface.
   d. Compare your results with the dimer of a blg crystal structure.

last revision 15/10/12 (ml)