Protein Secondary Structure Assignment and Prediction

Defining SS features
- Dihedral angles, alpha helix, beta strand (hydrogen bonds)

Assignment of SS
- Assigned manually by crystallographers, or
- Automatically: DSSP (Kabsch & Sander, 1983), STRIDE (Frishman & Argos, 1995)
- Continuum - PCE: Protein Continuum Electrostatics (Andersen et al.)

Structure prediction methods

Dihedral Angles
- ϕ (phi) - dihedral angle about the N-Cα bond
- ψ (psi) - dihedral angle about the Cα-C bond
- ω (omega) - dihedral angle about the C-N (peptide) bond
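The dihedral angles defined above can be computed from four consecutive backbone atom positions. The following is a minimal sketch, not taken from the original notes: the function name `dihedral` and the toy coordinates are illustrative assumptions, and the sign follows the usual convention in which a clockwise rotation of the far bond, viewed along the central bond, is positive.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points in 3D.

    For phi the four atoms would be C(i-1), N(i), CA(i), C(i);
    for psi they would be N(i), CA(i), C(i), N(i+1).
    """
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    # Three consecutive bond vectors; b2 is the central bond.
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    # Normals of the two planes spanned by (b1, b2) and (b2, b3).
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy coordinates (made up): central bond along z, outer bonds at right angles.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Using `atan2` rather than `acos` keeps the sign of the angle, which is what distinguishes, for example, right-handed from left-handed helical regions.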

α Helix

Helix type                  phi (deg)   psi (deg)   H-bond pattern
-------------------------------------------------------------------
right-handed alpha-helix      -57.8       -47.0       i+4
π-helix                       -57.1       -69.7       i+5
3-10 helix                    -74.0        -4.0       i+3
-------------------------------------------------------------------
(ω is 180 deg in all cases)

[Figure: π-helix, α-helix and 3/10 helix structures]

Beta Strand

               phi (deg)   psi (deg)   omega (deg)
---------------------------------------------------
beta strand      -120         120         180
---------------------------------------------------

[Figure: Hydrogen bond patterns in beta sheets. A four-stranded beta sheet is drawn schematically, containing three anti-parallel strands and one parallel strand. Hydrogen bonds are indicated with red lines (anti-parallel strands) and green lines (parallel strands) connecting the hydrogen and the acceptor oxygen.]

Secondary Structure Types
* H = alpha helix
* B = residue in isolated beta-bridge
* E = extended strand, participates in beta ladder
* G = 3-helix (3/10 helix)
* I = 5-helix (π helix)
* T = hydrogen bonded turn
* S = bend

Automatic assignment programs

- DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
- STRIDE ( http://www.hgmp.mrc.ac.uk/registered/option/stride.html )

Secondary Structure Prediction

What to predict?
- All the 8 types of structures

Approaches
- Simple alignments
- Heuristic methods (e.g., Chou-Fasman, 1974)
- Neural networks (different inputs)
  - Raw sequence (late 80s)
  - Blosum matrix (e.g., PHD, early 90s)
  - Position-specific alignment profiles (e.g., PSIPRED, late 90s)
  - Multiple networks balloting, probability conversion, output expansion

Improvement of accuracy

Year   Authors              Accuracy
1974   Chou & Fasman        ~50-53%
1978   Garnier              63%
1987   Zvelebil             66%
1988   Qian & Sejnowski     64.3%
1993   Levin                69%
1994   Rost & Sander        70.8-72.0%
1997   Frishman & Argos     <75%
1999   Cuff & Barton        72.9%
1999   Jones                76.5%
2000   Petersen et al.      77.9%

Simple Alignments
- Solved structures homologous to the query are needed

- Homologous proteins have ~88% identical (3-state) secondary structure
- If no homologue can be identified, alignment will give almost random results

Chou-Fasman - 1974
- Based on the propensities of amino acids to adopt secondary structures, derived from the observation of their location in 15 protein structures determined by X-ray diffraction.
- These statistics derive from the particular stereochemical and physicochemical properties of the amino acids; see, for example, glycine and proline.
- The statistics have been refined over the years by a number of authors (including Chou and Fasman themselves) using a larger set of proteins.
- Rather than a position-by-position analysis, the propensity of a position is calculated using an average over the 5 or 6 residues surrounding each position.
- On a larger set of 62 proteins the base method reports a success rate of 50%.

Propensity is defined by

$P_i^{\phi} = \dfrac{f_i^{\phi}}{f^{\phi}}$    [1]

where i runs over the 20 aa types and ϕ over the conformations (α helix, β strand and coil). $f^{\phi}$ denotes the fraction of all aa (of all 20 types) that belong to the ϕ-th type of SS, and $f_i^{\phi}$ denotes the fraction of aa of type i that belong to the ϕ-th type of SS, that is

$f_i^{\phi} = \dfrac{n_i^{\phi}}{N_i}$  and  $f^{\phi} = \dfrac{N_{\phi}}{N_T}$,

where $N_i$ is the number of aa of type i, $N_{\phi}$ is the number of aa in the ϕ-th type of SS, $n_i^{\phi}$ is the number of type-i aa belonging to the ϕ-th type of SS, and $N_T$ is the total number of aa.
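As a rough illustration of Eq. [1], the propensities can be computed directly from residue and state counts. The sketch below is only an assumption about how one might organize the calculation: the function name `propensities` and the toy sequence/state strings are made up and are not real protein statistics.

```python
from collections import Counter

def propensities(sequence, states):
    """Return {(aa, state): P} with P = (n_i^phi / N_i) / (N_phi / N_T), as in Eq. [1]."""
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    n_total = len(sequence)                    # N_T
    n_aa = Counter(sequence)                   # N_i
    n_state = Counter(states)                  # N_phi
    n_joint = Counter(zip(sequence, states))   # n_i^phi
    return {
        (aa, st): (n_joint[(aa, st)] / n_aa[aa]) / (n_state[st] / n_total)
        for aa in n_aa
        for st in n_state
    }

# Toy data (made up for illustration): H = helix, C = coil.
seq    = "AAAAGGGGAAGG"
states = "HHHHCCCCHCCC"
P = propensities(seq, states)
# In this toy set A is over-represented in helix (P > 1) and G never occurs in helix (P = 0).
```

A value above 1 means the residue occurs in that conformation more often than an average residue does; published parameter tables often list the same quantity multiplied by 100.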

Legend: H strong, h intermediate, I weak, i insensitive, b breaking, B strongly breaking
http://www.brc.dcs.gla.ac.uk/~drg/courses/bioinformatics/slides/slides7/sld015.htm

Chou-Fasman parameters for the 20 amino acids
- Each aa is assigned several conformational parameters, P(a), P(b) and P(turn), which represent the propensity of that aa to participate in alpha helices, beta sheets and turns.
- Each aa is also assigned 4 turn parameters, f(i), f(i+1), f(i+2) and f(i+3), which correspond to the frequency with which the aa was observed in the first, second, third and fourth position of a hairpin turn.
- The algorithm for assigning secondary structure proceeds as follows:
  - Identify alpha helices
  - Identify beta strands
  - Identify turns
    (a) For each residue i, calculate the turn propensity P(t) = f(i)*f(i+1)*f(i+2)*f(i+3), where each factor is taken for the residue found at that position of the four-residue window starting at i.
    (b) Predict a hairpin turn starting at each position i that satisfies all of the following criteria:
        (i) P(t) > 7.5e-5
        (ii) the average P(turn) value for the four aa at positions i through i+3 is > 100
        (iii) average P(a) < average P(turn) > average P(b) over the four aa at positions i through i+3

Example
Given the protein sequence CAENKLDHVADCC, use the Chou-Fasman method to predict the turns.

aa   P(t)                                 Average P(turn)   Average P(a)   Average P(b)   Is it a turn?
C    0.149*0.076*0.077*0.091 = 7.9e-5     103.75            107.5          82             No
A    0.06*0.06*0.191*0.095 = 6.5e-5       99.25             118.5          70.75          No
E
N
K
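The three turn criteria can be checked mechanically once the four positional frequencies and the window averages are known. The sketch below is a minimal illustration, assuming a hypothetical helper `is_turn`; the numbers are the ones given in the example rows for C and A above, and no values are supplied for the rows that are left blank.

```python
import math

def is_turn(f_window, avg_p_turn, avg_p_a, avg_p_b, p_t_cutoff=7.5e-5):
    """Apply the three Chou-Fasman turn criteria to one four-residue window.

    f_window   : [f(i), f(i+1), f(i+2), f(i+3)] for the four residues of the window
    avg_p_turn : average P(turn) over the window
    avg_p_a    : average P(a) over the window
    avg_p_b    : average P(b) over the window
    Returns (P(t), True/False).
    """
    p_t = math.prod(f_window)
    ok = (p_t > p_t_cutoff
          and avg_p_turn > 100
          and avg_p_a < avg_p_turn > avg_p_b)
    return p_t, ok

# Row "C" of the example (window C-A-E-N).
p_t_c, turn_c = is_turn([0.149, 0.076, 0.077, 0.091], 103.75, 107.5, 82)
print(f"{p_t_c:.1e}", turn_c)   # prints: 7.9e-05 False

# Row "A" of the example (window A-E-N-K).
p_t_a, turn_a = is_turn([0.06, 0.06, 0.191, 0.095], 99.25, 118.5, 70.75)
print(f"{p_t_a:.1e}", turn_a)   # prints: 6.5e-05 False
```

The window starting at C passes criteria (i) and (ii) but fails (iii), because the average P(a) exceeds the average P(turn); the window starting at A already fails (i).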

GOR method

In 1978 Garnier, Osguthorpe and Robson (GOR) introduced a more sophisticated method and defined an information measure,

$I(S;a) = \log\left[\dfrac{P(S|a)}{P(S)}\right]$    [2]

which, by Bayes' rule, can also be written as

$I(S;a) = \log\left[\dfrac{P(S,a)}{P(S)\,P(a)}\right]$    [3]

where S is the secondary structure, a is the amino acid, P(S|a) is the conditional probability of conformation S given residue a, P(S,a) is the joint probability of S and a, P(a) is the probability of a.a. a, and P(S) is the probability of conformation S. In terms of counts,

$I(S;a) = \log\left(\dfrac{n_i^{\phi}}{N_i} \Big/ \dfrac{N_{\phi}}{N_T}\right)$

so the argument of the logarithm is the same ratio as the propensity in Eq. [1].

GOR define the information gain, I(ΔS;a), in order to reduce the effects of data size and sampling variation:

$I(\Delta S;a) = I(S;a) - I(\text{not } S;a)$    [4]

Substituting Eq. [3] into Eq. [4] and using P(S|a) = P(S,a)/P(a), one obtains

$I(\Delta S;a) = \log\left[\dfrac{P(S|a)}{1 - P(S|a)}\right] + \log\left[\dfrac{1 - P(S)}{P(S)}\right]$    [5]

where the probability of finding a.a. a not in conformation S is 1 - P(S|a), and that of not finding any aa in conformation S is 1 - P(S).

The GOR method considers a sliding window of size 17 with a.a. $a_X$ at the center, and takes into account the effects of the eight a.a. on the left-hand and right-hand sides of $a_X$. Hence the value of $I(\Delta S_{\phi}; a_X)$ in Eq. [5] is calculated according to

$I(\Delta S_{\phi}; a_X) = \sum_{i=-8}^{8} I(\Delta S_{\phi}; a_{X+i})$

where ϕ runs over the conformations, and the state with the highest gain is predicted.

[Figure: amino acid preferences in α-helix]
[Figure: amino acid preferences in β-strand]
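The sliding-window rule described above can be sketched as a sum of tabulated information values over offsets -8..+8, followed by picking the state with the largest total. This is only an illustrative sketch: the function `gor_predict`, the layout of the `info` table (state, then window offset, then amino acid) and the toy values are assumptions, not the published GOR parameters.

```python
def gor_predict(seq, info, states="HEC", half_window=8):
    """Predict one state per residue by summing information values over a window.

    info[state][offset][aa] is the information that amino acid aa, found at
    `offset` from the central residue, carries about `state` at the center.
    Missing entries (and positions outside the sequence) contribute 0.
    """
    prediction = []
    for x in range(len(seq)):
        best_state, best_score = None, float("-inf")
        for s in states:
            score = 0.0
            for off in range(-half_window, half_window + 1):
                pos = x + off
                if 0 <= pos < len(seq):
                    score += info.get(s, {}).get(off, {}).get(seq[pos], 0.0)
            if score > best_score:          # ties keep the earlier state in `states`
                best_state, best_score = s, score
        prediction.append(best_state)
    return "".join(prediction)

# Toy information table (made-up numbers, only offsets -1, 0, +1 populated).
toy_info = {
    "H": {-1: {"A": 0.5}, 0: {"A": 1.0}, 1: {"A": 0.5}},
    "E": {-1: {"V": 0.5}, 0: {"V": 1.0}, 1: {"V": 0.5}},
    "C": {0: {"G": 1.0, "P": 1.0}},
}
predicted = gor_predict("AAAVVVGG", toy_info)
```

In a real application the table would hold one information value per state, offset and amino acid type (3 x 17 x 20 entries), estimated from proteins of known structure.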

[Figure: amino acid preferences in coil]

- 1978: Garnier improved the method by using statistically significant pair-wise interactions as a determinant of the statistical significance. This improved the success rate to 62%.
- 1993: Levin improved the prediction level by using multiple sequence alignments (MSA). The reasoning is as follows: conserved regions in an MSA provide a strong evolutionary indicator of a role in the function of the protein. Those regions are also likely to have conserved structure, including secondary structure, and they strengthen the prediction through their joint propensities. This improved the success rate to 69%.
- 1993, 1994: Rost and Sander combined neural networks with MSA. The idea of a neural net is to create a complex network of interconnected nodes, where progress from one node to the next depends on satisfying a weighted function that has been derived by training the net with data of known results, in this case protein sequences with known secondary structures. The success rate is 72%.

Neural Network Methods
- Methods such as PHD and Pred2ary
- PHD combines MSA with the optimization strength of the neural network formalism

Consensus Approach

Jpred at EMBL
- It runs prediction methods such as PHD, PREDATOR, DSC, NNSSP, ZPRED and MULPRED.
- If sufficient methods predict an identical SSE for a position, that structure is taken as the consensus prediction for the position.
- If no consensus is reached, the PHD prediction is taken.
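The consensus rule above amounts to a per-position vote with a fallback. The sketch below is a simplified illustration and not Jpred's actual implementation: the function `consensus`, the default threshold (a simple majority) and the toy prediction strings are all assumptions.

```python
from collections import Counter

def consensus(predictions, fallback="PHD", min_votes=None):
    """Combine per-residue predictions from several methods.

    predictions : {method name: predicted state string}, all of equal length
    fallback    : method whose prediction is used when no consensus is reached
    min_votes   : votes needed for a consensus (assumed default: simple majority)
    """
    length = len(predictions[fallback])
    if any(len(p) != length for p in predictions.values()):
        raise ValueError("all predictions must have the same length")
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    result = []
    for pos in range(length):
        votes = Counter(p[pos] for p in predictions.values())
        state, count = votes.most_common(1)[0]
        result.append(state if count >= min_votes else predictions[fallback][pos])
    return "".join(result)

# Toy predictions (made up) from three of the methods named above.
toy_predictions = {"PHD": "HHEC", "DSC": "HHCC", "NNSSP": "HECE"}
by_majority = consensus(toy_predictions)                 # 2 of 3 votes suffice
by_unanimity = consensus(toy_predictions, min_votes=3)   # otherwise fall back to PHD
```

Raising `min_votes` makes the consensus stricter, so more positions fall back to the single reference method.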

Summary and Current Status of SS Prediction
- Various different induction techniques over the same data give modest improvements:
  - Linear discriminant analysis (LDA)
  - Decision trees
  - Neural networks
- GOR method improves 8-9 percentage points (to about 64% correct, residue by residue).
- Similar improvement for NNs (to ~68%).
- Prediction quality has not improved much, even with huge growth of training data.
- Secondary structure is not completely determined by local forces:
  - Long-distance interactions do not appear in the sliding window.

Secondary Structure Prediction Tools
- ProfPHD - http://cubic.bioc.columbia.edu
- PSIPRED - http://bioinf.cs.ucl.ac.uk/psipred/
- BCM PSSP - http://dot.imgen.bcm.tmc.edu:9331/pssprediction/pssp.html
- PredictProtein - http://cubic.bioc.columbia.edu/predictprotein/
  o PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
- HNN - http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_nn.html
  o Hierarchical Neural Network method (Guermeur, 1997)
- Jpred - http://jura.ebi.ac.uk:8888/
  o A consensus method for protein secondary structure prediction at EBI
- GORIV - http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html
  o GOR Secondary Structure Prediction (Garnier et al., 1996)

Threading (sequence-structure alignment approach)
- Based on the observation that many protein structures in the PDB are very similar. For example, there are many 4-helical bundles, TIM barrels, globins, etc. in the set of solved structures.
- It is conjectured that there are only a limited number of unique protein folds in nature; estimates are that there are < 1000 different protein folds.

Methodology
- Instead of aligning a sequence to a sequence, we align a sequence to a string of descriptors that describe the 3D environment of the target structure.
- For each residue position in the structure, we determine:
  - how buried it is
  - the fraction of the surrounding environment that is polar
  - the local secondary structure (α-helix, β-sheet or other)
- For each position in the structure, we categorize it into one of 18 environment classes using these characteristics (see Figure 15.3).

[Figure 15.3: environment classes; axes: area buried and fraction polar; labelled regions: very buried, highly polar, exposed]

- The key observation is that different amino acids prefer different environments.

- For all proteins, we can tabulate the number of times we see a particular residue in a particular environment class, and use this to compute a score for each pair of environment class and amino acid. In particular, we compute a log-odds score of

$\text{score}(i, j) = \ln\left[\dfrac{P(\text{residue } j \text{ in environment } i)}{P(\text{residue } j \text{ in any environment})}\right]$

This gives us an 18x20 table of scores (18 environment classes by 20 amino acids).
- We build a 3D profile for a particular structure using this table. Namely, for each position in our structure, we determine its environment class, and the score of a particular amino acid in this position is read from the table built above. For example, if the first position in our structure has environment class B1β, the score of having a tyrosine (Y) in that position is 0.07. Thus, if there are n positions in our structure, we build a table with one row per position, holding its environment class, the 20 amino acid scores and the gap penalties (the gap penalties are chosen to discourage gaps in positions within α-helices and β-sheets).
- Then, to align a sequence s with a structure, we align the sequence with the descriptors of the 3D environment of the target structure. To find the best alignment, we use a 2D dynamic programming matrix, as for regular sequence alignment.
- Thus, to use the 3D profile method for fold recognition, for a particular sequence we calculate its score (using dynamic programming) against all structures. The significance of a score for a particular structure is given by scoring a large sequence database against the structure and calculating

$Z = \dfrac{\text{score} - \mu}{\sigma}$

where µ is the mean score for that structure and σ is the standard deviation of the scores.

The advantages of the 3D profile method over regular sequence alignment are that environmental tendencies may be more informative than simple amino acid similarity, and that structural information is actually used. Additionally, this is a fast method with reasonably good performance. The major disadvantage of this method is that it assumes independence between all positions in the structure.
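The alignment and significance steps described above can be sketched with a small dynamic programming routine. This is a simplified illustration under stated assumptions: a global alignment with a linear, position-specific gap penalty (real 3D profile implementations typically distinguish gap opening and extension), the population standard deviation for σ, and made-up names (`align_to_profile`, `z_score`) and toy numbers.

```python
import statistics

def align_to_profile(sequence, profile, gap_penalties):
    """Best global alignment score of a sequence against a 3D profile.

    profile[k][aa]   : score of amino acid aa at structure position k
    gap_penalties[k] : penalty for a gap at structure position k
    The profile must contain at least one position.
    """
    n, m = len(sequence), len(profile)
    # dp[i][k] = best score for aligning sequence[:i] with profile[:k]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for k in range(1, m + 1):
        dp[0][k] = dp[0][k - 1] - gap_penalties[k - 1]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] - gap_penalties[0]
        for k in range(1, m + 1):
            match = dp[i - 1][k - 1] + profile[k - 1].get(sequence[i - 1], 0.0)
            skip_position = dp[i][k - 1] - gap_penalties[k - 1]   # no residue at position k
            extra_residue = dp[i - 1][k] - gap_penalties[k - 1]   # residue not placed in the fold
            dp[i][k] = max(match, skip_position, extra_residue)
    return dp[n][m]

def z_score(score, background_scores):
    """Z = (score - mean) / standard deviation of the background scores."""
    mu = statistics.mean(background_scores)
    sigma = statistics.pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma

# Toy two-position profile (made-up scores) and uniform gap penalties.
toy_profile = [{"A": 2.0, "G": -1.0}, {"A": -1.0, "G": 2.0}]
toy_gaps = [1.0, 1.0]
query_score = align_to_profile("AG", toy_profile, toy_gaps)
background = [align_to_profile(s, toy_profile, toy_gaps) for s in ("GA", "AA", "GG", "A", "G")]
significance = z_score(query_score, background)
```

Larger penalties at positions inside α-helices and β-sheets would discourage gaps there, as noted above.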

My research proposal

Knowledge-based fold recognition methods rely on the extraction of statistical parameters from a database of experimentally determined protein structures, and they have demonstrated some successes.

Protein Structure Learning Data Set and Database Construction

A set of protein structures will be selected from the DSSP database, by referring to the SCOP classes, as our input data set. In the SCOP database, proteins are classified in a hierarchy according to their evolutionary origin and structural similarity. There are >25000 PDB entries and >55000 domains. According to SCOP release 1.65, the numbers of protein folds, superfamilies and families for the all-α, all-β, and alpha and beta proteins (α/β, α+β) are given by the table below.

We will extract the environment information, for example the residue solvent accessible surface areas (buried (B), partly buried (PB) and exposed (E)), the number of hydrogen bonds and the secondary structure (such as α-helix, β-sheet and coil), from the DSSP database. For instance, one can use the following nine environment classes: (B, PB, E) x (α, β, c), where c stands for the coil region. The DSSP database uses the DSSP program to define the secondary structure, geometrical features and solvent accessible area of proteins, given the atomic coordinates in the PDB. Figure 1 shows part of the secondary structure definition by the program DSSP for the deoxyhemoglobin protein (1A00).

Fig. 1 Part of the secondary structure definition by the program DSSP for the deoxyhemoglobin protein (1A00).

We use the server-side scripting language PHP to extract the environment information from DSSP. Embedding the PHP results in HTML, we have developed a web-based interface for data retrieval, which uses the molecular graphics tool RasMol to display the protein 3D structure (Figure 2).

Fig. 2 Part of the environment information for protein 1A00 from DSSP, and the 3D structure view of protein 1A00 using the molecular graphics tool RasMol.

Fig. 3 Environment data for protein 1A00, where res_sum, aa, rsas, sec_summary, three_turns, four_turns and five_turns represent the residue number, amino acid type, residue solvent accessible surface area, secondary structure summary, three-turn helix, four-turn helix and five-turn helix, respectively.

Structure profile method

(a) 3D profile method and residue environment

The 3D profile method uses structural information. Instead of aligning sequences, the 3D profile method aligns a sequence to a string of descriptors that describe the 3D environment of the target structure. That is, for each residue position in the structure we determine:
- the solvent accessible surface area (buried, partly buried or exposed),
- the local secondary structure (α-helix, β-sheet or coil), and
- the fraction of the surrounding environment that is polar (O and N).

The basic assumption of this method is that the environment of a particular residue is expected to be more conserved than the actual residue itself, and so the method is able to detect more distant sequence-structure relationships than purely sequence-based methods.

(b) Residue environment score matrix

The probability, P(i,m), associated with residue i in an environment m (in our study, the solvent accessible area A) is given by

P(i,m) = n(i,m) / N(i)

where n(i,m) is the number of residues of type i with solvent accessible area A, and N(i) is the total number of residues of type i. For instance, one can compute the probability of finding residue i in a buried, partly buried or exposed environment within one of the three secondary structures. The cutoffs used in defining buried (B), partly buried (PB) and exposed (E) are 0-10%, 11-40% and >40% of the maximal accessibility of the residue. Thus, each position in the 3D protein structure is assigned to one of nine environment classes. Our study will consider four classes of proteins that belong to the SCOP database: all-α proteins, all-β proteins, α+β proteins and α/β proteins (multi-domain proteins, membrane and cell surface proteins, and small proteins will also be considered if needed). The residue solvent accessible surface area and secondary structure data are retrieved from the DSSP database.

The matrix score element of the 3D structure profile, $M_{ij}$, for environment class i and residue j is given by

$M_{ij} = \ln\left[\dfrac{P(\text{residue } j \text{ in environment } i)}{P(\text{residue } j \text{ in any environment})}\right]$

The denominator is obtained from the residue frequencies in the DSSP database, and i is one of the nine environment classes, (B, PB, E) x (α, β, c).

Given the scoring matrix for a class of protein family, we can build a 3D profile for a particular structure using this matrix. That is, for each position in the known protein structure, one can determine its environment class, and the score of a particular residue in this position is given by the score matrix value. As an illustration, we have computed the 3D profile for the sperm whale myoglobin sequences, which belong to the globin protein family according to the SCOP classification scheme. We consider three environment classes in our preliminary study; more classes will be considered in order to model the residue environment more precisely. The result is tabulated in Table 1.

Table 1. Environment classes (B, PB, E) and score value, S_ij, of the sperm whale myoglobin sequences in comparison with the hydrophobicity KD scale.

Residue   Buried (B)   Partly buried (PB)   Exposed (E)   KD scale
I           -2.27          -1.99               4.27          4.5
V           -2.29          -1.95              -0.88          4.2
L           -1.68          -0.93               0.39          3.8
F           -2.7           -2.33               3.48          2.8
C           -1.81          -2.11               1.85          2.5
M           -3.84          -2.42              -0.73          1.9
A           -1.15          -0.43              -1.46          1.8
G           -1.63          -1.8               -1.05         -0.4
T           -1.82           0.6               -3.18         -0.7
S           -1.46          -2.47              -2.24         -0.8
W           -4.31           0.62               x            -0.9
Y           -2.85          -3.4               -1.2          -1.3
P            x             -2.25              -3.36         -1.6
H           -1.46          -1.04              -1.77         -3.2
N            x             -4.49              -1.82         -3.5
D            x             -2.46              -2.51         -3.5
E            x             -1.18              -2.04         -3.5
Q            0.7           -2.54              -2.85         -3.5
K            5.07          -0.87              -1.74         -3.9
R            1.97          -3.29              -2.39         -4.5
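The environment assignment and the score matrix element M_ij described above can be sketched from raw counts. The code is an illustration under assumptions: the helper names, the treatment of values between 10% and 11% (counted as partly buried) and the toy observations are made up; the formula is implemented literally as ln[P(residue j in environment i) / P(residue j in any environment)], and pairs that were never observed are simply omitted.

```python
import math
from collections import Counter

def accessibility_class(relative_accessibility_percent):
    """Map relative accessibility (% of the maximal value) to B, PB or E."""
    if relative_accessibility_percent <= 10:
        return "B"       # buried: 0-10%
    if relative_accessibility_percent <= 40:
        return "PB"      # partly buried: up to 40%
    return "E"           # exposed: >40%

def environment_class(relative_accessibility_percent, secondary_structure):
    """Combine burial (B, PB, E) with secondary structure ('a', 'b' or 'c')."""
    return accessibility_class(relative_accessibility_percent) + "/" + secondary_structure

def score_matrix(observations):
    """observations: iterable of (residue, environment) pairs.

    Returns {(environment, residue): ln[P(residue | environment) / P(residue)]}
    for every pair that was observed at least once.
    """
    pair_counts = Counter(observations)
    env_totals, res_totals = Counter(), Counter()
    for (res, env), n in pair_counts.items():
        env_totals[env] += n
        res_totals[res] += n
    total = sum(pair_counts.values())
    return {
        (env, res): math.log((n / env_totals[env]) / (res_totals[res] / total))
        for (res, env), n in pair_counts.items()
    }

# Toy observations (made up): L mostly buried, K mostly exposed.
toy_obs = [("L", "B")] * 3 + [("L", "E")] + [("K", "B")] + [("K", "E")] * 3
M = score_matrix(toy_obs)
```

With three burial classes and three secondary structure labels, `environment_class` yields the nine classes (B, PB, E) x (α, β, c) used in the text.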

A large negative score value indicates a strong preference for the particular environment, whereas a large positive score value indicates an aversion. It is evident from Table 1 that residues P, N, D and E were not found in the buried state. Furthermore, residues Q, K and R all have a positive score in the buried column, which is an indication of aversion; hence, these are the polar residues. Similarly, residues I, V, L, F and M are found to prefer the buried state, as seen from their score values in the exposed column of Table 1 (a large positive or small negative value); these are the hydrophobic residues. These two conclusions are consistent with the experimentally determined hydrophobicity KD scale given in the last column of Table 1.

Figure 4. Environment classes (B, PB, E) and score value, S_ij, of the sperm whale myoglobin sequences.

Given the score matrix (Table 1), one can build a 3D profile for a particular structure. For each position in the structure we determine its environment class, and the score value of a particular residue in this position is read from the score matrix we built. For instance, if the first position in our structure has the environment class buried, the score of having residue K in that position is 5.07. Thus, if there are n residues in the structure, we can build a profile for the known protein structure. For example, one can construct the following chart:

Position in fold   Environment class   ..   Q        K         R
1                  B                        0.7      (5.07)    1.97
2                  PB                       -2.54    -0.87     -3.29
..
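The chart above is a per-position lookup into the score matrix. The sketch below reproduces that lookup for the Q, K and R columns of Table 1 only; the names `TABLE1_EXCERPT`, `build_profile` and `ungapped_score` are illustrative assumptions, and the ungapped sum is a simplification of the full alignment.

```python
# Excerpt of Table 1: scores[environment class][residue], Q, K and R columns only.
TABLE1_EXCERPT = {
    "B":  {"Q": 0.7,   "K": 5.07,  "R": 1.97},
    "PB": {"Q": -2.54, "K": -0.87, "R": -3.29},
    "E":  {"Q": -2.85, "K": -1.74, "R": -2.39},
}

def build_profile(environment_classes, scores):
    """Return one row of residue scores per position of the structure."""
    return [dict(scores[env]) for env in environment_classes]

def ungapped_score(sequence, profile):
    """Sum the profile scores when residue k is placed at position k (no gaps)."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(row[aa] for aa, row in zip(sequence, profile))

# Positions 1 and 2 of the chart: environment classes B and PB.
profile = build_profile(["B", "PB"], TABLE1_EXCERPT)
score_k_at_1 = profile[0]["K"]               # 5.07, the value in parentheses in the chart
score_kq = ungapped_score("KQ", profile)     # 5.07 + (-2.54)
```

A full profile would carry all 20 residue columns for each of the n positions, plus the gap penalties used during alignment.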

To align a sequence with a structure, one aligns the sequence with the descriptors of the 3D environment of the known protein structure, using the dynamic programming algorithm to find the optimal alignment.

SSE prediction servers
- Network Protein Sequence Analysis http://npsa-pbil.ibcp.fr

Meta-servers
- Integrate predictions from several other servers
- Significantly better predictions than any individual approach
- http://bioinfo.pl/meta/ provides access to various fold recognition and local structure prediction methods

Fold Recognition and Structure Prediction Tools
- UCLA-DOE http://fold.doe-mbi.ucla.edu/
  o Protein 3D-Structure Prediction Server

- 3D-PSSM http://www.sbg.bio.ic.ac.uk/~3dpssm/
  o Protein fold recognition using 1D and 3D sequence profiles coupled with secondary structure information (Foldfit)
- 123D http://123d.ncifcrf.gov/123d+.html
  o Combines sequence profiles, secondary structure prediction, and contact capacity potentials to thread a protein sequence through the set of structures
- FFAS http://bioinformatics.burnham-inst.org/pages/servers/index.html
  o Fold assignment method based on the profile-profile matching algorithm

Online Hypertext Books
- A Guide to Structure Prediction (ICRF) http://speedy.embl-heidelberg.de/gtsp/
- Protein Secondary Structure Prediction with Neural Networks: A Tutorial (UCL) http://www.biochem.ucl.ac.uk/~shepherd/ssindex.html
- BioComputing Hypertext Coursebook (VSNS-BCD, Germany) http://www.techfak.unibielefeld.de/bcd/curric/welcome.html
- Coils (Lupas' method)
  o http://www.ch.embnet.org/software/coils_form.html
- Paircoil (Berger's method)
  o http://nightingale.lcs.mit.edu/cgi-bin/score
- Multicoil - Prediction of two- and three-stranded coiled coils
  o http://nightingale.lcs.mit.edu/cgi-bin/multicoil

References
- http://www.cbs.dtu.dk/phdcourse/cookbooks/presentation_course.ppt
- Mona Singh, Lecture 15, The Threading Approach to Tertiary Structure Prediction.

Transmembrane Segments
- Form a distinct topological class due to the presence of one or more transmembrane sequence segments
- The transmembrane segments are subject to severe restrictions imposed by the lipid bilayer of the cell membrane

Coiled-coil structures

Protein Secondary Structure Assignment and Prediction - Assignment

(1) Prove Equation [5].
(2) Given the protein sequence CAENKLDHVADCC, use the Chou-Fasman method to predict the turns. Fill in the boxes for the residues E, N and K.

aa   P(t)                                 Average P(turn)   Average P(a)   Average P(b)   Is it a turn?
C    0.149*0.076*0.077*0.091 = 7.9e-5     103.75            107.5          82             No
A    0.06*0.06*0.191*0.095 = 6.5e-5       99.25             118.5          70.75          No
E
N
K