8 Protein secondary structure


Grundlagen der Bioinformatik, SoSe 11, D. Huson, June 6, 2011

Sources for this chapter, which are all recommended reading:

- Introduction to Protein Structure, Branden & Tooze.
- V.V. Solovyev and I.N. Shindyalov. Properties and prediction of protein secondary structure. In Current Topics in Computational Molecular Biology, T. Jiang, Y. Xu and M.Q. Zhang (editors), MIT Press, 2002, chapter 15.
- D.W. Mount. Bioinformatics: Sequences and Genome Analysis, Cold Spring Harbor Press. Chapter 9: Protein classification and structure prediction.

Proteins

A protein is a chain of amino acids joined by peptide bonds. It is usually produced by a ribosome that moves along an mRNA and adds amino acids according to the codons that it encounters in the mRNA.

Here are the 20 standard amino acids:

Name            3-letter  1-letter
Alanine         Ala       A
Cysteine        Cys       C
Aspartic acid   Asp       D
Glutamic acid   Glu       E
Phenylalanine   Phe       F
Glycine         Gly       G
Histidine       His       H
Isoleucine      Ile       I
Lysine          Lys       K
Leucine         Leu       L
Methionine      Met       M
Asparagine      Asn       N
Proline         Pro       P
Glutamine       Gln       Q
Arginine        Arg       R
Serine          Ser       S
Threonine       Thr       T
Valine          Val       V
Tryptophan      Trp       W
Tyrosine        Tyr       Y

Here is a classification of these amino acids:

[Figure: classification of the amino acids]

Here are two amino acids within a polypeptide chain:

[Figure: two amino acid residues with side chains R, joined within a polypeptide chain]

Neighboring amino acids are joined by a peptide bond between the C=O and NH groups. The repeated N-Cα-C units make up the backbone of the protein.

In such a polypeptide chain, each amino acid has two rotational degrees of freedom:

- the rotational angle φ ("phi") of the bond between N and Cα, and
- the rotational angle ψ ("psi") of the bond between Cα and C.

Both bonds are free to rotate, subject to spatial constraints posed by adjacent R groups. The third angle Ω of the peptide bond between the C=O and NH groups is nearly always 180°, which implies the planarity of the peptide bond.

(Image source: wiki.cmbi.ru.nl)

Polypeptide chains of amino acids are called protein sequences. A short chain or fragment is also called a peptide sequence. A protein (sequence) starts with a free amino group (the N-terminus) and ends with a free COOH group (the C-terminus). Example:

[Figure: a polypeptide chain drawn from the N-terminus to the C-terminus]

The amino acid sequence here is read from the N-terminus to the C-terminus: R1 R2 ...

Hierarchy of protein structure

We distinguish between four levels of protein structure (Linderstrøm-Lang & Schellman 1959):

- Primary structure: The sequence of amino acid residues in a polypeptide chain.
- Secondary structure: Helices and β-sheets that are formed by hydrogen bonds between the C=O and NH groups of the backbone.
- Tertiary structure: The three-dimensional structure of a polypeptide chain, consisting of secondary structure elements linked by loops and stabilized (primarily) by side-chain interactions.
- Quaternary structure: The aggregation of different polypeptide chains into a functional protein.
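Returning to the torsion angles φ and ψ introduced above: both can be computed from atom coordinates as dihedral angles of four consecutive backbone atoms. The following is a minimal sketch and not part of the lecture notes. The function name `dihedral` and the example coordinates are illustrative assumptions, and it relies on the usual convention that φ of residue i is the dihedral of C(i-1), N(i), Cα(i), C(i), while ψ is the dihedral of N(i), Cα(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the bond p1-p2.

    For phi pass C(i-1), N(i), CA(i), C(i);
    for psi pass N(i), CA(i), C(i), N(i+1).
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)          # unit vector along the central bond
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: the central bond lies on the z-axis.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(round(angle, 1))
```

A planar peptide bond corresponds to a dihedral of magnitude close to 180° when the same function is applied to Cα(i), C(i), N(i+1), Cα(i+1).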


(Image source: Protein Structure and Function, G.A. Petsko and D. Ringe (2004))

The values of all pairs of rotation angles φ and ψ determine the tertiary structure of a protein. The tertiary structure of proteins is of great interest, as the shape of a protein determines much, if not all, of its function.

Here is the structure of myoglobin, the first experimentally derived structure:

[Figure: structure of myoglobin]

The experimental determination of protein structure via X-ray crystallography or NMR is difficult and time-consuming.

8.2 The Holy Grail of Bioinformatics

Central biochemical assumption: sequence specifies 3D structure. Hence, we would like to be able to determine the structure of a protein from its sequence.

The Holy Grail of Bioinformatics: Develop an algorithm that can reliably predict the structure (and thus function) of a protein from its amino acid sequence.

...
MATGDERFYAEHLMPTLQGLLDPESAHR
LAVRFTSLGLLPRARFQDSDMLEVRVLGH
KFRNPVGIAAGFDKHGEAVDGLYKMGFGF
VEIGSVTPKPQEGNPRPRVFRLPEDQAVIN
RYGFNSHGLSVVEHRLRARQQKQAKLTED
GLPLGVNLGKNKTSVDAAEDYAEGVRVLG
PLADYLVVNVSSPNTAGLRSLQGKAELRR
LLTKVLQERDGLRRVHRPAVLVKIAPDLTS
QDKEDIASVVKELGIDGLIVTNTTVSRPAGL
QGALRSETGGLSGKPLRDLSTQTIREMYAL
TQGRVPIIGVGGVSSGQDALEKIRAGASLVQ
LYTALTFWGPPVVGKVKRELEALLKEQGFG
GVTDAIGADHRR...
?

We will return to this problem in the next chapter.

8.3 Secondary structure of proteins

Regular features of the main chain of a protein give rise to the secondary structure. Determining the secondary structure is an important first step toward determining the three-dimensional structure. There are two main types of (repetitive) secondary structure elements, called α-helices and β-sheets (L. Pauling 1951), corresponding to specific choices of the φ and ψ angles along the chain.

8.3.1 Ramachandran plot

In a Ramachandran plot, pairs of torsion angles (φ, ψ) are plotted in a scatter plot. Certain torsion angle pairs are energetically particularly favorable. The following is a Ramachandran plot of observed pairs of angles in a collection of known protein structures:

[Figure: Ramachandran plot of observed (φ, ψ) pairs] (Image source: Wikimedia Commons)

The pairs near φ = −60° and ψ = −40° correspond to α-helices. The pairs near (−90°, 120°) correspond to β-strands.

8.3.2 α-helices

Helices arise when hydrogen bonds occur between (the C=O group of) the amino acid at position i and (the NH group of) the amino acid at position i + k (with k = 3, 4 or 5), for a run of consecutive values of i. Here is the bonding pattern of an α-helix:

[Figure: hydrogen bonds between C=O and NH groups along an α-helix]

- Usually, k = 4 and the resulting structure is called an α-helix. The (idealized) torsion angles are (φ, ψ) = (−58°, −47°) and there are 3.6 residues per turn.
- Seldom, k = 3 and then we have a 3_10-helix. The (idealized) torsion angles are (−74°, −4°) and there are 3.0 residues per turn.
- Very rarely, k = 5 and then we have a π-helix. The (idealized) torsion angles are (−57°, −70°) and there are 4.4 residues per turn.
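The i to i + k bonding rule and the residues-per-turn values listed above can be made concrete with a short sketch. It is illustrative only: the names `HELIX_TYPES`, `hbond_pairs` and `rotation_per_residue` are assumptions introduced here, and the table simply restates the idealized values from the list above.

```python
# k (H-bond offset) -> (name, idealized phi, idealized psi, residues per turn),
# restating the values given in the text.
HELIX_TYPES = {
    3: ("3_10-helix", -74.0, -4.0, 3.0),
    4: ("alpha-helix", -58.0, -47.0, 3.6),
    5: ("pi-helix", -57.0, -70.0, 4.4),
}


def hbond_pairs(start, length, k):
    """Pairs (i, i + k): C=O of residue i bonds to N-H of residue i + k.

    Only pairs with both residues inside start .. start + length - 1 are listed.
    """
    end = start + length - 1
    return [(i, i + k) for i in range(start, end - k + 1)]


def rotation_per_residue(k):
    """Rotation about the helix axis per residue, in degrees."""
    return 360.0 / HELIX_TYPES[k][3]


# An alpha-helix (k = 4) spanning residues 1..8:
print(hbond_pairs(1, 8, 4))
print(rotation_per_residue(4))
```

With 3.6 residues per turn, each residue of an α-helix is rotated by about 100° relative to its predecessor.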


8.3.3 β-sheets

So-called β-sheets consist of β-strands that are runs of 5-10 consecutive amino acids, which are held together by H bonds:

[Figure: hydrogen bonds between neighboring β-strands] (Image source: Wikimedia Commons)

There are two possible configurations of β-sheets. In a parallel β-sheet, all chains run in the same direction, while in an anti-parallel sheet, chains run in alternating directions:

[Figure: parallel and anti-parallel β-sheets] (Source: Wikimedia Commons)

Example of an anti-parallel β-sheet (variable light chain of an immunoglobulin):

[Figure: anti-parallel β-sheet in the variable light chain of an immunoglobulin]


8.3.4 Loops

All other (non-repetitive) structures are called loops. Loops are regions of a protein chain that lie between α-helices and β-sheets. The lengths and three-dimensional structure of loops can vary. Hairpin loops joining two anti-parallel β-strands may be as short as two amino acids. Loops lie on the surface of the structure. Turns are narrow 180° loops that contain at least 3 amino acids. A region of secondary structure that is not a helix, a sheet, or a recognizable turn is called a coil.

8.4 Classification of protein structures

Proteins are classified to reflect both structural and evolutionary relatedness. A typical classification scheme will employ different hierarchical levels, such as:

1. Folds: Based on major structural similarities.
2. Superfamilies: Based on probable evolutionary relationships.
3. Families: Based on clear evolutionary relationships.

Mount describes six principal classes of protein structures based on the three-dimensional arrangement of secondary structures, four taken from Levitt and Chothia (1976), and two additional ones taken from the SCOP database (Murzin et al., 1995). (The four-letter codes are PDB accession numbers.)

(1) A member of class α consists of a bundle of α-helices connected by loops on the surface of the protein, e.g.: Hemoglobin (3hhb)

(2) A member of class β consists of β-sheets, usually two sheets in close contact forming a sandwich. Examples are enzymes, transport proteins and antibodies, e.g.: T-cell receptor CD8 (1cd8)

(3) A member of class α/β consists mainly of β-sheets with intervening α-helices. This class contains many metabolic enzymes, e.g.: Tryptophan synthase β subunit (2tsy)

(4) A member of class α + β consists of segregated α-helices and β-sheets, e.g.: G-specific endonuclease (1rnb)

(5) This class consists of all multi-domain (α and β) proteins with domains from more than one of the above four classes.

(6) Membrane and cell-surface proteins and peptides, e.g.: Integral membrane light-harvesting complex (1kzu)

Databases

The databases SCOP and CATH (cathdb.info) both contain a hierarchical classification of protein domains by their structures.

- About 37,000 structures in the PDB
- 971 folds in SCOP (release 1.71)

Number of folds in the six classes in SCOP:

Class                                No. of folds
α                                    226
β                                    149
α/β                                  134
α + β                                286
Multi-domain proteins                48
Membrane and cell surface proteins

Computing the secondary structure of a known 3D structure

Given the positions of the main chain atoms of a protein, the DSSP (definition of secondary structure of proteins) program [1] determines the secondary structure of the protein. (It also computes geometrical features and solvent exposure.) Note that the program does not predict protein secondary structures from sequences, but rather it computes them from coordinates.

The DSSP algorithm proceeds as follows: First determine which C=O and NH groups in the main chain are joined by hydrogen bonds. This decision is based on an electrostatic model using the following energy calculation:

$$E = q_1 q_2 \left( \frac{1}{r(ON)} + \frac{1}{r(CH)} - \frac{1}{r(OH)} - \frac{1}{r(CN)} \right) f,$$

with $q_1 = 0.42e$ and $q_2 = 0.20e$, where $e$ is the unit electron charge (given in electrostatic units, esu), $r(AB)$ is the inter-atomic distance in Ångströms between atom A in the first amino acid and atom B in the second, $f = 332$ is a constant called the dimensional factor, and $E$ is the energy in kcal/mol.

Hydrogen bonds have a binding energy of about −3 kcal/mol; however, DSSP assigns an H-bond between C=O of residue i and NH of residue j if E < −0.5 kcal/mol.

Any H-bond detected in this way is called a k-turn if it connects the C=O group of amino acid i to the NH group of amino acid i + k, where k = 3, 4 or 5, and a bridge if it connects residues that are not close to each other in the sequence.
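The energy criterion above translates almost directly into code. The sketch below is illustrative and is not the DSSP program itself: the function names and the example coordinates (an idealized, linear C=O···H-N arrangement) are assumptions made here, with the charges expressed in units of e so that the result is in kcal/mol.

```python
import math

Q1, Q2 = 0.42, 0.20      # partial charges in units of e, as in the text
F = 332.0                # dimensional factor
HBOND_CUTOFF = -0.5      # kcal/mol; an H-bond is assigned if E is below this


def hbond_energy(c, o, n, h):
    """Electrostatic energy (kcal/mol) between a C=O group and an N-H group.

    c, o: coordinates of the carbonyl C and O of the first residue (Angstrom)
    n, h: coordinates of the amide N and H of the second residue (Angstrom)
    """
    return Q1 * Q2 * F * (1.0 / math.dist(o, n) + 1.0 / math.dist(c, h)
                          - 1.0 / math.dist(o, h) - 1.0 / math.dist(c, n))


def is_hbond(c, o, n, h):
    return hbond_energy(c, o, n, h) < HBOND_CUTOFF


def turn_type(i, j):
    """Return k if an H-bond from C=O(i) to N-H(j) is a k-turn (k = 3, 4, 5), else None."""
    k = j - i
    return k if k in (3, 4, 5) else None


# Hypothetical linear geometry: C=O bond 1.23 A, O...H 1.9 A, H-N 1.0 A.
c, o, h, n = (0.0, 0, 0), (1.23, 0, 0), (3.13, 0, 0), (4.13, 0, 0)
print(hbond_energy(c, o, n, h), is_hbond(c, o, n, h))
```

For this idealized geometry the energy comes out at roughly −2.9 kcal/mol, in line with the binding energy of about −3 kcal/mol quoted above. `math.dist` requires Python 3.8 or later.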
Here are some of the patterns that are used to identify secondary structure elements:

[Figure: backbone H-bond patterns (NH - Cα - C=O units) for a 3-turn, a parallel bridge and an anti-parallel bridge]

An α-helix is identified as a consecutive run of (at least two) 4-turns. Any two helices that are offset by two or three residues are concatenated into a single helix.

[1] Kabsch and Sander, Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 22, 1983.
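The helix rule above can be sketched as follows, given the set of positions at which a 4-turn starts. This is a simplified illustration rather than the DSSP implementation. The function name `helix_residues` is hypothetical, and the exact residue range marked as helical (residues i to i + 3 when 4-turns start at both i − 1 and i) follows the minimal-helix convention of Kabsch and Sander as the author of this sketch understands it, which the text above does not spell out.

```python
def helix_residues(turn_starts, n=4):
    """Residues assigned to a helix from consecutive n-turns.

    turn_starts: iterable of residue indices i such that C=O of residue i
                 is H-bonded to N-H of residue i + n (an n-turn at i).
    Two consecutive n-turns at i - 1 and i mark residues i .. i + n - 1
    as helical (assumed convention, see lead-in).
    """
    starts = set(turn_starts)
    helix = set()
    for i in starts:
        if i - 1 in starts:
            helix.update(range(i, i + n))
    return sorted(helix)


print(helix_residues({2, 3}))        # two consecutive 4-turns
print(helix_residues({2, 3, 4, 5}))  # a longer run extends the helix
print(helix_residues({2, 6}))        # isolated turns give no helix
```

Overlapping minimal helices merge automatically here because the result is a set of residue indices; the concatenation of helices offset by two or three residues is not modeled.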

A β-sheet corresponds to a sequence of bridges between consecutive residues in two different regions of the chain. More precisely, we need to introduce two types of patterns: a ladder is a set of one or more consecutive bridges of the same type, and a sheet is one or more ladders connected by shared residues. Detected sheets are then defined to be β-sheets.

To allow for irregularities, β-bulges are introduced, in which two perfect ladders or bridges can be connected through a gap of one residue on one side and four on the other.

This is how α-helices and β-sheets are defined, detected and annotated in practice.

8.6 Secondary structure prediction from sequences

Secondary structure prediction problem: Assume we are given a protein sequence, e.g.:

MATVAERCPICLEDPSNYSMALPCL
HAFCYVCITRWIRQNPTCPLCKVPV
ESVVHTIESDSEFGDQLI

The secondary structure prediction problem is to assign a secondary structure type to each amino acid in the sequence, e.g. S (for strand), H (for helical), C (for coils or loops):

MATVAERCPICLEDPSNYSMALPCL
SSSCCCC
HAFCYVCITRWIRQNPTCPLCKVPV
SSS-CCHHHHHHHH---CCCC----
ESVVHTIESDSEFGDQLI
--SS

SSP and discriminant analysis

The secondary structure prediction program (SSP) developed by Solovyev and Salamov (1991, 1994; see the second item in the literature list of this chapter) is aimed at getting the location of entire α-helices and β-strands correct rather than assigning each individual residue to the correct type of secondary structure.

The SSP algorithm is based on the assumption that secondary structures can be identified by statistical properties of five regions associated with an α-helix or β-strand, namely the N_l, N-terminal, internal, C-terminal and C_r regions, respectively, as indicated here:

[Figure: a helix or strand divided, from the N-terminal to the C-terminal side, into the regions N_l, N, internal, C and C_r]

The singleton characteristic

The singleton characteristic is an average of single-residue preferences. Using a database of known protein structures, for every amino acid a the preference of being in a specific segment of type k (e.g., an α-helix or a β-strand) is calculated as

$$S_k(a) = \frac{P_k(a)}{P(a)},$$

where $P(a)$ and $P_k(a)$ are the proportions of amino acids of type a that are contained in the whole database and in segments of type k, respectively (see P.Y. Chou and G.D. Fasman 1978).

Consider a sequence of amino acids $A = a_1 \ldots a_L$. Choose start and end positions p and q in the sequence, and a structure type k (e.g., α-helix). The singleton characteristic $S_k(p, q)$ is defined as:

$$S_k(p,q) = \frac{1}{(q-p+1)+2m} \left( \sum_{i=p-m}^{p-1} S_{N_l}(a_i) + \sum_{i=p}^{p+m-1} S_{N}(a_i) + \sum_{i=p+m}^{q-m} S_{\mathrm{internal}}(a_i) + \sum_{i=q-m+1}^{q} S_{C}(a_i) + \sum_{i=q+1}^{q+m} S_{C_r}(a_i) \right).$$

Here, m is a pre-chosen parameter that determines the size of the non-internal segments N_l, N, C and C_r. It is usually a small value such as 3. The denominator is the total number of residues covered by the five sums.

The doublet characteristic

The doublet characteristic is similar to the singleton characteristic. The hope is to obtain a better discrimination by considering pairs of amino acids separated by d = 0, 1, 2 or 3 other residues. The preference for a particular type of secondary segment k for a pair of amino acids of type a and b, separated by d other residues, is defined as:

$$D_k(a, b, d) = \frac{P_k(a, b, d)}{P(a, b, d)},$$

where $P(a, b, d)$ is the proportion of pairs of amino acids a and b that are separated by d other residues in a segment, and $P_k(a, b, d)$ is the same value restricted to those segments of type k, in the given training database.

The average preference of a segment $a_p a_{p+1} \ldots a_q$ to be in a particular secondary structure k is denoted by $D_k(p, q, d)$ and is obtained as the normalized sum of all the pair characteristics occurring in the N_l, N, internal, C and C_r segments.

The hydrophobic moment

Secondary structure prediction can be aided by examining the periodicity of amino acids with hydrophobic side chains in the protein chain. Tables assigning a hydrophobicity value h(a) (Kyte and Doolittle 1982) to each amino acid a are used to determine the hydrophobicity of different regions of a protein:

[Figure: bar chart of hydrophobicity (scale −5 to 5) for the amino acids A R N D C Q E G H I L K M F P S T W Y V]
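To illustrate how such a table is used to assess regions of a protein, the sketch below averages Kyte-Doolittle hydropathy values over a sliding window. The table holds the published Kyte-Doolittle values. The function name `hydropathy_profile` and the default window length of 7 are assumptions chosen for this example, not values taken from the notes.

```python
# Kyte-Doolittle hydropathy values h(a); positive = hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Mean hydropathy of every window of the given length along the sequence."""
    values = [KYTE_DOOLITTLE[a] for a in sequence]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# First 25 residues of the example sequence from Section 8.6.
profile = hydropathy_profile("MATVAERCPICLEDPSNYSMALPCL")
print([round(x, 2) for x in profile])
```

Windows with a clearly positive mean point to hydrophobic stretches, windows with a clearly negative mean to hydrophilic ones.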

Here, a positive value means hydrophobic, whereas a negative value means hydrophilic.

Observation: Helices often lie on the surface of a protein and there is a tendency for hydrophobic residues to face the core of the protein and for polar and charged amino acids to face the aqueous environment on the outside of the helix.

The hydrophobic moment is calculated for a segment and different angles of rotation per residue (from 0° to 180°) and measures how well the peptide separates hydrophobic and hydrophilic regions in a pattern that is typical for a helix or strand (Eisenberg et al., 1984).

For a given segment $a_p a_{p+1} \ldots a_q$ of sequence, the hydrophobic moment for an angle ω is defined as:

$$M_\omega(p, q) = \left[ \left( \sum_{i=p}^{q} h(a_i) \cos(i\omega) \right)^2 + \left( \sum_{i=p}^{q} h(a_i) \sin(i\omega) \right)^2 \right]^{1/2},$$

where $h(a)$ denotes the hydrophobicity of the amino acid a.

Here, hydrophobicity is treated as a vector, that is, a quantity with both a magnitude (positive or negative!) and a direction. The hydrophobic moment is the length of the sum of these individual hydrophobicity vectors.

In the context of predicting α-helices and β-sheets, the angles considered are ω = 100° and ω = 160°, respectively. We use ω(k) to denote the angle associated with the structure type k.

Combining the discriminant functions

The SSP method for secondary structure prediction uses a linear combination of all three described discriminant functions (LDF, linear discriminant function):

$$Z_k(p, q) = \alpha^k_1 S_k(p, q) + \alpha^k_2 D_k(p, q, d) + \alpha^k_3 M_{\omega(k)}(p, q).$$

Given a threshold $c_k$, this function classifies a segment of sequence $a_p a_{p+1} \ldots a_q$ into class 1 (i.e., is a structure of type k) if $Z_k(p, q) > c_k$, or class 2 (i.e., is not a structure of type k) if $Z_k(p, q) \le c_k$.

For each type of structure k, the method of linear discriminant analysis is used to determine the coefficients $(\alpha^k_1, \alpha^k_2, \alpha^k_3)$ and the threshold constant $c_k$.
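The hydrophobic moment defined above is simply the length of a sum of rotated vectors, which the following sketch computes. The function name `hydrophobic_moment` and the toy hydrophobicity values are assumptions for illustration; real input would be the h(a) values of a sequence segment.

```python
import math


def hydrophobic_moment(h_values, omega_deg, first_index=1):
    """M_omega for a segment with hydrophobicities h_values.

    h_values:    h(a_p), ..., h(a_q) for the residues of the segment
    omega_deg:   rotation angle per residue in degrees
    first_index: the position p of the first residue; the result does not
                 depend on it, because shifting p rotates all vectors equally.
    """
    omega = math.radians(omega_deg)
    cos_sum = sum(h * math.cos(i * omega)
                  for i, h in enumerate(h_values, start=first_index))
    sin_sum = sum(h * math.sin(i * omega)
                  for i, h in enumerate(h_values, start=first_index))
    return math.hypot(cos_sum, sin_sum)


# Toy segment in which hydrophobic and hydrophilic residues strictly alternate,
# i.e. a periodicity of two residues (180 degrees per residue).
segment = [1.0, -1.0, 1.0, -1.0]
for omega in (100, 160, 180):
    print(omega, round(hydrophobic_moment(segment, omega), 3))
```

The moment is largest when ω matches the periodicity of the hydrophobicity pattern, which is why ω = 100° is informative for α-helices (3.6 residues per turn) and ω = 160° for β-strands.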
For a given training set, the goal is to maximize the ratio of the between-class variation of $Z_k$ to the within-class variation. (We skip the details; see Fisher, 1936.)

The SSP algorithm

Given a protein sequence $A = a_1 a_2 \ldots a_L$, the SSP algorithm predicts secondary structures in the following way:

Algorithm (SSP for α-helices)

1. Determine a seed α-helix consisting of a segment $a_p a_{p+1} \ldots a_q$ of five residues with an average singleton characteristic higher than a pre-given threshold t.
2. Compute the value of $Z_k(p, q)$ for k = α-helix.
3. While $Z_k(p, q) > c_k$, extend the segment by one residue, up to a maximal extension of 15 residues in each direction.
4. The extended segment that gives rise to the highest LDF score is considered a potential α-helix.

A similar seed-and-extend strategy is used to determine potential β-strand segments. Here the length of the initial seed is 3.

The result of the two seed-and-extend phases is a set of potential α-helices and β-strands. To obtain a final prediction, overlapping pieces are assigned to the secondary structure type that has the higher LDF value. Non-overlapping remainders of such pieces with lower LDF values are retained as predictions if they are still long enough.

Measuring prediction accuracy

How can we determine the accuracy of computational methods that need to be trained on a database of solved structures? Here is a very simple cross-validation method:

Definition (Leave-one-out (LOO) cross-validation) Assume that we have a training set consisting of n datasets and want to evaluate the performance of some computational method M. In the leave-one-out procedure, for each dataset D repeat the following:

- Train the method M on all datasets except D ("leave one out").
- Run the method M on D.
- Determine whether the method M produced the correct answer on D.

Report the accuracy of the method M as the proportion of correct answers.

To evaluate the performance of a secondary structure prediction, one possibility is to assess the level of single-residue accuracy. However, this may be problematic: for example, a clearly wrong prediction such as αβαβα... in an α-helix region will still give rise to a score of 50% correct residue predictions.
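Both the leave-one-out procedure and the single-residue accuracy described above are easy to state in code. The sketch below is generic and makes several assumptions: the function names are hypothetical, the method M is represented by caller-supplied `train` and `predict` functions, and structure assignments are given as strings using H for helix and S for strand.

```python
def per_residue_accuracy(predicted, observed):
    """Fraction of positions at which the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def leave_one_out(datasets, train, predict, is_correct):
    """Leave-one-out cross-validation.

    train(list_of_datasets) -> model
    predict(model, dataset) -> answer
    is_correct(answer, dataset) -> bool
    Returns the proportion of held-out datasets answered correctly.
    """
    correct = 0
    for idx, held_out in enumerate(datasets):
        model = train(datasets[:idx] + datasets[idx + 1:])   # leave one out
        if is_correct(predict(model, held_out), held_out):
            correct += 1
    return correct / len(datasets)


# The alternating prediction from the text, scored against a pure helix region.
print(per_residue_accuracy("HSHSHSHSHS", "HHHHHHHHHH"))
```

The alternating prediction scores 0.5 even though it contains no helix at all, which is the weakness of single-residue accuracy pointed out above.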
Thus, in practice one also evaluates the number of correctly predicted α-helices and β-strands, considering a structure to be correctly predicted if it contains more than a pre-defined number of correctly predicted residues (often a small number).

Performance of different characteristics of SSP

An experimental evaluation of secondary structure predictions was performed on 126 non-homologous proteins with known three-dimensional structures (Rost and Sander 1993), the secondary structure of which was assigned using the DSSP program.

Different combinations of characteristics were compared with each other, giving rise to the following results (source: T. Jiang, Y. Xu and M.Q. Zhang (editors), 2002, page 383):

Characteristics used              Q (%)
Singleton characteristic (S)      58.5
S + hydrophobic moment (M)        61.4
Doublet characteristic (D)        62.2
D + M                             64.8
S + D + M

Neural networks

Neural networks are used for classification problems for which there exists a good supply of training data and little understanding of the structure of the problem at hand. They are inspired by the biology of the brain.

A neural network is a graph in which nodes represent neurons and edges represent connections between the neurons. Signals flow through the network and are processed by the neurons. Connections can be weak or strong depending on their weight. These weights are usually set by supervised training.

We distinguish between recurrent architectures that contain directed loops, and feed-forward architectures that do not contain directed loops. An architecture is called layered if elements are grouped in layers and connections between elements are defined through the layers. Input data is presented to an input layer and the output is read from an output layer. Other layers are called hidden:

[Figure: a layered network with input layer, hidden layer and output layer]

In bioinformatics, mostly layered feed-forward neural nets are used.

The neuron is the universal basic element of a neural network. One commonly used type is the perceptron:

[Figure: a perceptron with inputs x_1, x_2, ..., x_r, weights w_1, w_2, ..., w_r and output y = f(Σ w_i x_i)]

In a feed-forward neural net, a node y is fed from r nodes $x_1, \ldots, x_r$ by edges $(x_i, y)$ with weights $w_i$. It processes these inputs and fires a signal of strength $f(x)$, where $x = \sum_{i=1}^{r} w_i x_i$.

Here is a very simple example of a neural net whose task it is to determine whether $x_1 > \frac{1}{2} x_2$:

[Figure: a net with two input nodes x_1 and x_2 feeding a single output node y]

It takes two numbers $x_1$ and $x_2$ as input and produces a signal $y = 2x_1 + (-1)x_2$ as output that is positive if $2x_1 > x_2$, and negative if $2x_1 < x_2$.

To mimic the firing of a neuron, we would like the output of the node labeled y to be 1 if $2x_1 > x_2$, and 0 if $2x_1 < x_2$. This could be realized using a simple step function

$$y = \begin{cases} 1 & \text{if } 2x_1 > x_2, \\ 0 & \text{else.} \end{cases}$$

However, it is better to use a continuous function for this purpose, such as a step-like sigmoidal function of the form

$$f(x) = \mathrm{sgm}(x) = \frac{1}{1 + \exp(-x)},$$

which looks like this:

[Figure: plot of the sigmoid function, rising from 0 to 1]

Constructing a neural network

There are two steps to constructing a neural network. The first step is to design the topology of the network. This involves determining the number of input nodes and output nodes and how they are associated with external variables. Additionally, the number of internal or hidden (layers of) nodes must be determined. Finally, nodes have to be connected using edges.

The second step is called training. Supervised training requires a training set consisting of input data points for which the desired output is known. Each such data point is presented to the neural net and then the weights in the net are slightly modified using a gradient descent method so as to increase the performance of the network (as discussed below). The goal is to set the weights of the edges so that the number of correct results produced for a given training data set is maximized.

8.9 The PHD neural network

The PHD (PHD-sec) algorithm by Rost and Sander [4] uses a neural network to predict the secondary structure of a given residue. The model consists of three layers of processing units: the input layer, the output layer and a hidden layer.

The units of the input layer read a small segment (13-17 residues) of the amino acid sequence around the position of interest, obtained using a sliding window.
There are 21 input units per sequence position, namely one per amino acid and one for padding at the beginning and end of the sequence. Given a single sequence, the input unit corresponding to a given amino acid at a given position is set to 1, while the remaining units for that position are 0.

[4] B. Rost and C. Sander, Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232, 1993.
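The input encoding described above (21 units per window position, one of them reserved for padding) can be sketched as follows. This is an illustration, not the PHD code: the names `AMINO_ACIDS`, `PAD` and `encode_window`, the ordering of the amino acids, and the choice of a window of 13 are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed ordering of the 20 amino-acid units
PAD = len(AMINO_ACIDS)                 # index 20: the padding unit
UNITS_PER_POSITION = 21


def encode_window(sequence, center, window=13):
    """Binary input vector for the window centered on sequence[center].

    Positions before the start or after the end of the sequence
    activate the padding unit. The window length is assumed to be odd.
    """
    half = window // 2
    units = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * UNITS_PER_POSITION
        if 0 <= pos < len(sequence):
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            one_hot[PAD] = 1
        units.extend(one_hot)
    return units


x = encode_window("LSWTKCYAVSGAP", center=0)
print(len(x), sum(x))
```

For a window of 13 this gives 13 × 21 = 273 input units, exactly 13 of which are set to 1. Sliding `center` along the sequence produces one input vector per residue.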


Then signals are sent to units in the hidden layer, which process them and pass them on to the units of the output layer. The final output determines which of the three types of secondary structure is assigned to the central residue.

The PHD paper describes three successive neural networks: PHD-sec for secondary structure prediction, PHD-htm for predicting transmembrane helices and PHD-acc for solvent accessibility.

Here is a simplified depiction of PHD-sec:

[Figure: the input sequence L S W T K C Y A V S G A P is read through a window into the input layer, which feeds the hidden units H_j and the output units O_k; the predicted structure is one of α, β or coil] (Rost, 1996) (Adapted from Mount, 2001)

If the input to the neural net consists of a sequence profile, then each input unit is set to the frequency of the associated amino acid at the given position. Additionally, two input units are used to count insertions and deletions.

The predictions obtained for adjacent windows are then post-processed by applying rules or additional neural nets to obtain a final prediction.

Experimental studies show that the PHD method applied to sequences obtains a single-residue accuracy of 70.8%. Application to sequence profiles gives rise to an accuracy of 72% (Rost and Sander 1994).

The PHD algorithm uses sequences from the HSSP (homology-derived secondary structure of proteins) database for training (Sander and Schneider, 1991).

8.10 Training the PHD neural network

A method called back-propagation can be used to train such neural networks. For example, consider the output node $O_k$ shown in the network above and assume that it predicts whether the central residue lies in an α-helix. The output signal $O_k$ predicts an α-helix if it is close to 1, or not if it is close to 0.

Presented with a training data point, we know whether or not the central residue actually lies in an α-helix, and thus what the desired output $D_k$ of $O_k$ should be.

Consider one of the hidden units $H_j$ that is connected to $O_k$ and emits a signal $H_j$ that is modified by the weight $w_{jk}$. The signal arriving at $O_k$ is $w_{jk} H_j$:

[Figure: hidden unit H_j sends the signal w_jk H_j to output unit O_k]

When training the network, the main question is: how should we alter $w_{jk}$ so as to bring the value $O_k = \mathrm{sgm}(\sum_j w_{jk} H_j)$ of node $O_k$ closer to the desired value $D_k$?

Assume that the network has p input, q hidden and r output nodes. In this case, the output of a hidden node $H_j$ is given by:

$$H_j = \mathrm{sgm}\left( \sum_{i=1}^{p} w_{I_i H_j} I_i \right).$$

The output of an output node $O_k$ is given by:

$$O_k = \mathrm{sgm}\left( \sum_{j=1}^{q} w_{H_j O_k} H_j \right).$$

Hence,

$$O_k = \mathrm{sgm}\left( \sum_{j=1}^{q} w_{H_j O_k} \, \mathrm{sgm}\left( \sum_{i=1}^{p} w_{I_i H_j} I_i \right) \right), \quad \text{for } k = 1, 2, \ldots, r.$$

This allows us to calculate the output for a given input set.

A training set specifies t pairs of inputs and desired outputs, $(I^1_1, I^1_2, \ldots, I^1_p, D^1_1, \ldots, D^1_r), \ldots, (I^t_1, I^t_2, \ldots, I^t_p, D^t_1, \ldots, D^t_r)$. The mean square error is defined as:

$$E = \frac{1}{2} \sum_{s=1}^{t} \sum_{i=1}^{r} (D^s_i - O^s_i)^2,$$

which is straightforward to calculate using the previous equation. (The index s runs over the training pairs; the factor 1/2 is included so that the derivative given below carries no factor of 2.)

The gradient descent method specifies that we repeatedly do the following: Choose some weight $w_{ij}$ in the network and modify it by a small amount $\Delta w_{ij} = -n \, \partial E / \partial w_{ij}$, so as to decrease the error E. The factor n is the training rate (.3).

For example, in the case of an edge jk attaching a hidden node $H_j$ to an output node $O_k$, the partial derivative of the error E for a single training pair with respect to $w_{jk}$ is given by

$$\partial E / \partial w_{jk} = (O_k - D_k) \, O_k (1 - O_k) \, H_j,$$

which we will not show here. So in this case, the weight $w_{jk}$ is modified by this amount:

$$\Delta w_{jk} = -n \, (O_k - D_k) \, O_k (1 - O_k) \, H_j.$$
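The update rule for a weight between a hidden node and an output node can be tried out on a single output unit. The sketch below is a toy illustration, not the PHD training code: all names and the example numbers are assumptions, and the error for one training pair is taken as half the squared difference between desired and actual output, which is the form consistent with the derivative quoted above.

```python
import math


def sgm(x):
    """Sigmoid function sgm(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def output(weights, hidden):
    """O_k = sgm(sum_j w_jk * H_j) for one output unit."""
    return sgm(sum(w * h for w, h in zip(weights, hidden)))


def error(weights, hidden, desired):
    """Error for a single training pair: 0.5 * (D_k - O_k)^2 (assumed form)."""
    return 0.5 * (desired - output(weights, hidden)) ** 2


def gradient(weights, hidden, desired):
    """dE/dw_jk = (O_k - D_k) * O_k * (1 - O_k) * H_j for every j."""
    o = output(weights, hidden)
    return [(o - desired) * o * (1.0 - o) * h for h in hidden]


def update(weights, hidden, desired, rate):
    """One gradient-descent step: w_jk <- w_jk - rate * dE/dw_jk."""
    return [w - rate * g
            for w, g in zip(weights, gradient(weights, hidden, desired))]


# Hypothetical hidden-layer signals, initial weights and desired output.
hidden = [0.2, 0.9, 0.5]
weights = [0.1, -0.3, 0.8]
desired = 1.0
new_weights = update(weights, hidden, desired, rate=0.3)
print(error(weights, hidden, desired), error(new_weights, hidden, desired))
```

Repeating the update over all training pairs and all weights, including those between input and hidden nodes (whose derivatives are obtained by propagating the error back through the hidden layer), gives the back-propagation training loop.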

Summary

- The most important features of the secondary structure of a protein are its α-helices and β-strands.
- The DSSP program defines the secondary structure elements of a protein based on the 3D coordinates of the atoms in the protein.
- The SSP program uses a linear discriminant function to predict secondary structure from sequence.
- The program PHD addresses the same problem using a neural network.
