Protein Secondary Structure Prediction

Protein Secondary Structure Prediction. Doug Brutlag & Scott C. Schmidler

Overview. Goals and problem definition. Existing approaches: classic methods; recent successful approaches. Evaluating prediction algorithms. Shortcomings of existing approaches. Current research.

Protein structure prediction Sequence of 984 amino acids: PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQ HIV reverse transcriptase 3D coordinates of 7404 atoms:

Abstracting the problem. [Figure panels: 3D coords of all atoms; 3D coords of C-α backbone (C-α groups); 3D coords of secondary structure elements.]

Secondary structure prediction for protein folding. Sequence of amino acids: PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEK EGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQD FWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTA FTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFK KQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLLRWGLT TPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQ KLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAEL ELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQE PFKNLKTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTP KFKLPIQKETWETWWTEYWQATWIPEWEFVNTPPLVKLWYQLE KEPIVGAETFYVDGAANRETKLGKAGYVTNKGRQKVVPLTNTT NQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQPDKSESE LVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEK EGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQD FWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTA FTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFK KQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLLRWGLT TPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQ KLVGKLNWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAEL ELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQE PFKNLKTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTP KFKLPIQKETWETWWTEYWQATWIPEWEFVNTPPLVKLWYQ Predict structural segments. Goal: recover 3D coords.

The secondary structure prediction problem. Given a protein sequence: NWVLSTAADMQGVVTDGMASGLDKD... Predict a secondary structure sequence: LLEEEELLLLHHHHHHHHHHLHHHL... H = α-helix, E = extended β-strand, L = loop/coil.

Defining the secondary structure of a protein sequence. [Figure: α-helix and anti-parallel β sheet.] Residue sequence: NWVLSTAADMQGVVTDGMASFLDKD... Secondary structure: LLEEEELLLLHHHHHHHHHHLHHHL... Fig. 1: Syntactic formulation of secondary structure problem.

Abstracted version of protein structure prediction Sequence of 984 amino acids: 3D coords of 179 structural elements: PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQLCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQ

The Secondary Structure Prediction problem. Given a protein sequence: NWVLSTAADMQGVVTDGMASGLDKD... Predict a secondary structure sequence: LLEEEELLLLHHHHHHHHHHLHHHL... 3-state problem: $\{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V\}^n \to \{L,H,E\}^n$

Chou-Fasman method (Chou & Fasman, 1974). Calculate propensity for each amino acid to be in helix, strand, coil: $\dfrac{P(A \mid S)}{P(A)} = \dfrac{n_{A,S}/n_S}{n_A/n}$. Classify by propensity.
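The propensity ratio above can be estimated directly from residue counts. The following is a minimal sketch, assuming the training data is a list of sequences paired with equal-length strings of state labels; the function name `propensities` and the toy data are illustrative and not part of the original method description.

```python
from collections import Counter

def propensities(seqs, labels):
    """Estimate P(A|S)/P(A) = (n_AS / n_S) / (n_A / n) from labeled sequences."""
    n_a, n_s, n_as = Counter(), Counter(), Counter()
    n = 0
    for seq, lab in zip(seqs, labels):
        for a, s in zip(seq, lab):
            n_a[a] += 1          # occurrences of amino acid a
            n_s[s] += 1          # residues observed in state s
            n_as[(a, s)] += 1    # occurrences of a in state s
            n += 1               # total residues
    return {(a, s): (n_as[(a, s)] / n_s[s]) / (n_a[a] / n)
            for a in n_a for s in n_s}

# Toy data (hypothetical): A is seen only in helix, G only in loop.
table = propensities(["AAGG", "AGGG"], ["HHLL", "HLLL"])
```

A value above 1 means the amino acid is over-represented in that state relative to its overall frequency.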

Chou-Fasman prediction. Search for nucleation sites; helix nucleation: score > 4 in window of 6. Propagate until termination criteria met; helix termination: tetrapeptide with mean propensity < 1. Rules for conflict resolution, exceptions. Accuracy: 70-80% reported (< 55% per Nishikawa, 1983).
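The nucleate-and-propagate idea can be sketched as follows. This is a simplified illustration, not the published rule set: the nucleation score is assumed here to be the number of helix-favouring residues (propensity > 1) in the window, the propensity table holds commonly quoted helix values included only for illustration, and conflict resolution with strands is omitted.

```python
# Illustrative helix propensities (assumed values, not taken from the slides).
HELIX_P = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16, "F": 1.13,
    "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06, "D": 1.01, "H": 1.00,
    "R": 0.98, "T": 0.83, "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67,
    "P": 0.57, "G": 0.57,
}

def mean_propensity(peptide):
    return sum(HELIX_P[a] for a in peptide) / len(peptide)

def predict_helix(seq, window=6, min_score=4, stop_mean=1.0):
    """Mark residues as helix ('H') or not ('-') by nucleation and extension."""
    n = len(seq)
    helix = [False] * n
    for start in range(n - window + 1):
        # Assumed nucleation score: count of helix-favouring residues.
        score = sum(HELIX_P[a] > 1.0 for a in seq[start:start + window])
        if score <= min_score:
            continue
        lo, hi = start, start + window  # half-open nucleus [lo, hi)
        # Extend right while the tetrapeptide ending at the new residue
        # keeps a mean propensity of at least stop_mean.
        while hi < n and mean_propensity(seq[hi - 3:hi + 1]) >= stop_mean:
            hi += 1
        # Extend left by the same criterion.
        while lo > 0 and mean_propensity(seq[lo - 1:lo + 3]) >= stop_mean:
            lo -= 1
        for i in range(lo, hi):
            helix[i] = True
    return "".join("H" if h else "-" for h in helix)

prediction = predict_helix("GGGGEEEEEEGGGG")
```

Because extension stops only when a whole tetrapeptide falls below the threshold, a helix can run one or two residues past the last strongly helix-favouring residue.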

GOR (Garnier et al., 1978). Calculate information content for each amino acid: $I(S;R) = \log\left[P(S \mid R)/P(S)\right]$ (same as CF propensity). Information difference: $I(\Delta S;R) = I(S;R) - I(\bar{S};R) = \log\left[P(S,R)/P(\bar{S},R)\right] + \log\left[P(\bar{S})/P(S)\right]$, where the first term is a likelihood ratio. Predict max using window of size 17. Accuracy: 64% (< 55% per Nishikawa, 1983).
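As a quick numerical check of the information difference, the sketch below computes it in two ways: as a difference of two information values, and as a likelihood ratio plus a prior term. The function names and the probabilities used are illustrative assumptions.

```python
import math

def information(p_s_given_r, p_s):
    """I(S;R) = log[P(S|R) / P(S)]."""
    return math.log(p_s_given_r / p_s)

def information_difference(p_s_given_r, p_s):
    """I(S;R) - I(not S;R), with 'not S' taken as the complement of S."""
    return information(p_s_given_r, p_s) - information(1 - p_s_given_r, 1 - p_s)

def likelihood_ratio_form(p_s_given_r, p_s, p_r):
    """log[P(S,R) / P(not S,R)] + log[P(not S) / P(S)]."""
    joint_s = p_s_given_r * p_r            # P(S, R)
    joint_not_s = (1 - p_s_given_r) * p_r  # P(not S, R)
    return math.log(joint_s / joint_not_s) + math.log((1 - p_s) / p_s)

# Hypothetical numbers: P(S|R) = 0.5, P(S) = 0.3, P(R) = 0.08.
d1 = information_difference(0.5, 0.3)
d2 = likelihood_ratio_form(0.5, 0.3, 0.08)
```

The two forms agree because P(R) cancels in the likelihood ratio.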

Window-based prediction. For each position in a protein sequence: ...LSTAADMQGVVTDGMASGLDKD... Predict its secondary structure based on a local window around that position. Slide the window along the sequence, one residue at a time.
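A sliding window is straightforward to express as a generator. This is a minimal sketch: the padding character `-` for positions that fall off the ends, and the names `windows` and `predict_by_window`, are assumptions made for illustration.

```python
def windows(seq, half_width=8, pad="-"):
    """Yield (position, window) with the window centered on each residue."""
    padded = pad * half_width + seq + pad * half_width
    for i in range(len(seq)):
        yield i, padded[i:i + 2 * half_width + 1]

def predict_by_window(seq, classify, half_width=8):
    """Apply a per-window classifier (any callable returning H, E, or L)."""
    return "".join(classify(w) for _, w in windows(seq, half_width))

small = list(windows("ABCDE", half_width=1))
# Trivial stand-in classifier: call the center residue helix if it is 'A'.
demo = predict_by_window("AGA", lambda w: "H" if w[len(w) // 2] == "A" else "L",
                         half_width=1)
```

A half-width of 8 gives the window of size 17 used by GOR.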

Modeling structural correlations. Conditional independence models (naive-Bayes classifier): a class in {L, H, E} (loop, helix, strand) with residue distributions P(A | H), P(A | E), P(A | L) over the amino acids A R N D C Q E G H I L K M F P S T W Y V at window positions i-4 ... i+4. Pair-wise dependence: residue pairs over the same alphabet at positions i, i+1, ..., i+4 (Klingler: structural correlations).
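A conditional independence model of this kind can be sketched as a small naive-Bayes classifier over window positions. The add-one smoothing, the function names, and the toy training data are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

AMINO = "ARNDCQEGHILKMFPSTWYV"
STATES = "HEL"

def train_naive_bayes(seqs, labels, half=4, pseudo=1.0):
    """Count residues per (state of center residue, window offset)."""
    counts = defaultdict(Counter)
    prior = Counter()
    for seq, lab in zip(seqs, labels):
        for i, s in enumerate(lab):
            prior[s] += 1
            for off in range(-half, half + 1):
                if 0 <= i + off < len(seq):
                    counts[(s, off)][seq[i + off]] += 1
    return counts, prior, half, pseudo

def log_posterior(model, seq, i):
    """Unnormalized log P(state | window around i), assuming independence."""
    counts, prior, half, pseudo = model
    total = sum(prior.values())
    scores = {}
    for s in STATES:
        lp = math.log((prior[s] + pseudo) / (total + pseudo * len(STATES)))
        for off in range(-half, half + 1):
            if 0 <= i + off < len(seq):
                c = counts[(s, off)]
                lp += math.log((c[seq[i + off]] + pseudo)
                               / (sum(c.values()) + pseudo * len(AMINO)))
        scores[s] = lp
    return scores

def predict(model, seq):
    out = []
    for i in range(len(seq)):
        scores = log_posterior(model, seq, i)
        out.append(max(scores, key=scores.get))
    return "".join(out)

# Toy training set (hypothetical): one homopolymer per state.
model = train_naive_bayes(["AAAAAAAA", "GGGGGGGG", "VVVVVVVV"],
                          ["HHHHHHHH", "LLLLLLLL", "EEEEEEEE"])
```

Pair-wise dependence would replace the per-position factors with factors conditioned on a neighboring residue, at the cost of many more parameters.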

Hydrolase (β-lactamase) with amphipathic α-helix

Amphipathic α-helix: hydrophobic side chains

Amphipathic α-helix: side chain periodicity Sequence: NLAKMVVKTAEAILKD
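The periodicity can be made concrete by placing residues on a helical wheel, where successive residues of an α-helix are rotated by about 100° (3.6 residues per turn). The sketch below sums unit vectors for hydrophobic residues; a large resultant suggests that they cluster on one face. The hydrophobic residue set and the function name are assumptions for illustration.

```python
import math

# Assumed binary hydrophobicity classification (illustrative only).
HYDROPHOBIC = set("AILMFVWC")

def hydrophobic_moment(seq, degrees_per_residue=100.0):
    """Length of the summed wheel vectors of hydrophobic residues."""
    x = y = 0.0
    for i, aa in enumerate(seq):
        if aa in HYDROPHOBIC:
            angle = math.radians(i * degrees_per_residue)
            x += math.cos(angle)
            y += math.sin(angle)
    return math.hypot(x, y)

amphipathic = hydrophobic_moment("NLAKMVVKTAEAILKD")  # sequence from the slide
uniform = hydrophobic_moment("L" * 18)                # hydrophobic all around
```

For the uniformly hydrophobic 18-mer the vectors cover five full turns evenly and cancel, whereas the slide sequence leaves a clearly non-zero resultant.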

Structural correlations in β-strands. [Figure: axis values from 5 to -4; strand residues labeled E E E E.]

PHD (Rost & Sander, 1993). Neural network based (multi-layer perceptron, fully connected). 2 levels: Sequence -> Structure; Structure -> Structure. Uses multiple sequence alignment: amino acid frequencies; conservation weight. Post-processing by dynamic programming. [Figure: network taking window positions i-4 ... i+4 of the amino acid sequence as input and producing structure predictions H, L, E.]

Special case: helical transmembrane proteins. Membrane proteins are biologically important and difficult to determine experimentally, but easier to predict: constraints imposed by lipid bilayer; strong hydrophobicity signal; cytoplasmic residues positively charged. 2-state prediction; example: bacteriorhodopsin. Accuracy: 95% (multiple alignment).

Predator (Frishman & Argos, 1996). Nearest-neighbor classifier: represent subsequence as a vector; define a distance metric; find closest vectors in training set, vote. Adds non-local terms for hydrogen bonding propensity (helices, sheets). Accuracy: 68% single sequence; 75% multiple alignment.
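The nearest-neighbor step can be sketched generically as below. Hamming distance between equal-length windows stands in for the distance metric, which is an assumption; this is not Predator's actual metric, and the non-local hydrogen bonding terms are omitted. The training windows are made up.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length windows."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(window, training, k=3):
    """Vote among the k training windows closest to the query window."""
    nearest = sorted(training, key=lambda item: hamming(window, item[0]))[:k]
    votes = Counter(state for _, state in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set of (window, state of central residue).
TRAINING = [("AEELL", "H"), ("AEKLL", "H"), ("VTVTV", "E"),
            ("GPGSG", "L"), ("VYVTV", "E")]

helix_call = knn_predict("AEELK", TRAINING)
strand_call = knn_predict("VTVYV", TRAINING)
```

Replacing `hamming` with a substitution-matrix score turns this into the fixed-length, non-gapped local alignment view of nearest-neighbor prediction.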

SSPAL (Salamov & Solovyev, 1997). Nearest-neighbor can be viewed as fixed length, non-gapped local alignment. Find K (= 50) best non-overlapping local alignments with known structures. Predict each position by consensus of alignment structures, weighted by score. Accuracy: single sequence 71.2%; multiple sequence alignment 73.5%.

Evaluation of secondary structure prediction. Large database of protein sequences with known structures (X-ray crystallography, NMR); gold-standard assignment; non-homologous (< 25-30% identity); cross-validation.

What to measure? Q3 (3-state accuracy): percent residues correct. Matthews correlation coefficient: adjusts for prevalence. Segment-based measures: Rost et al. 1994, Taylor 1984, Presnell et al. 1992.
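Both per-residue measures are short to compute. The sketch below uses the standard definitions of Q3 and of the Matthews correlation coefficient for a single state; the function names and the example strings are illustrative.

```python
import math

def q3(true_ss, pred_ss):
    """Percentage of residues whose 3-state label is predicted correctly."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("sequences must have equal length")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return 100.0 * correct / len(true_ss)

def matthews(true_ss, pred_ss, state):
    """Matthews correlation coefficient for one state versus the rest."""
    tp = fp = fn = tn = 0
    for t, p in zip(true_ss, pred_ss):
        if p == state and t == state:
            tp += 1
        elif p == state:
            fp += 1
        elif t == state:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

TRUE = "HHHHEEEELLLL"
PRED = "HHHEEEEELLLH"
accuracy = q3(TRUE, PRED)
mcc_helix = matthews(TRUE, PRED, "H")
```

Unlike Q3, the correlation coefficient does not reward a predictor for always guessing the most common state.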

Bayesian Segmentation of Protein Secondary Structure. Scott C. Schmidler (1,3), Jun S. Liu (2), Douglas L. Brutlag (3). (1) Section on Medical Informatics, (2) Department of Statistics, (3) Department of Biochemistry, Stanford University

Goals. Improved secondary structure prediction: > 70% accuracy (75% MSA); accurate estimates of prediction variability. Combining structural data with scientific knowledge.

Bayesian structure prediction. Model-based structure prediction. Probabilistic modeling of segments: hydrophobicity patterns; side chain interactions; helical capping. Predict structure to maximize probability. Optimal segmentation of protein sequence: ...NW (L) | VLST (E) | AADM (L) | QGVVTDGMAS (H) | F (L) | LDK (H) | D (L)... (Doug Brutlag, 2000)

Model. Joint distribution: $P(R,S,T) = P(S,T)\prod_{j=1}^{m} P\left(R_{[S_{j-1}+1:S_j]} \mid S,T\right)$ (conditional independence model for inter-segment residues). Markovian dependence in $S,T$: $P(R,S,T) = \prod_{j=1}^{m} P(T_j \mid T_{j-1})\,P(S_j \mid S_{j-1},T_j)\,P\left(R_{[S_{j-1}+1:S_j]} \mid S_{j-1},S_j,T_j\right)$. Example: segment types L E L H L H L for R = NWVLSTAADMQGVVTDGMASFLDKD, SS = LLEEEELLLLHHHHHHHHHHLHHHL
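The factorization can be illustrated by converting a per-residue structure string into segment form and summing the three log terms per segment. This sketch assumes that $S_j$ is the (1-based) end position of segment $j$ and $T_j$ its type, and that $P(S_j \mid S_{j-1}, T_j)$ depends only on segment length; the three component models are passed in as callables, and the ones in the usage lines are placeholders, not fitted models.

```python
import math
from itertools import groupby

def to_segments(ss):
    """Convert a per-residue string into segment ends S and types T."""
    ends, types, pos = [], [], 0
    for state, run in groupby(ss):
        pos += len(list(run))
        ends.append(pos)
        types.append(state)
    return ends, types

def log_joint(residues, ends, types, log_trans, log_length, log_segment):
    """Sum over segments of log P(T_j|T_{j-1}) + log P(S_j|S_{j-1},T_j)
    + log P(R_segment | segment boundaries, T_j)."""
    total, prev_end, prev_type = 0.0, 0, None
    for end, t in zip(ends, types):
        segment = residues[prev_end:end]
        total += (log_trans(prev_type, t)
                  + log_length(end - prev_end, t)
                  + log_segment(segment, t))
        prev_end, prev_type = end, t
    return total

R = "NWVLSTAADMQGVVTDGMASFLDKD"
SS = "LL" + "EEEE" + "LLLL" + "H" * 10 + "L" + "HHH" + "L"
ends, types = to_segments(SS)
# Placeholder component models (assumptions, for illustration only).
lp = log_joint(R, ends, types,
               log_trans=lambda prev, cur: math.log(0.5),
               log_length=lambda length, t: 0.0,
               log_segment=lambda seg, t: len(seg) * math.log(0.05))
```

Prediction then amounts to searching over segmentations (ends, types) for the one that maximizes this quantity, or to averaging over them.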

Position-specific preferences. [Figure: four histograms of frequency (0 to 0.18) by amino acid (A R N D C Q E G H I L K M F P S T W Y V): helix N-cap model, positions 1&2; helix internal position model; strand internal position model; loop/coil N-cap model, positions 1&2.]

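Segment likelihood. Modeling correlations among properties: $P(R_{[k:j]} \mid S_{q-1}, S_q, \mathrm{Helix}) = \prod_{i=k}^{j} P_{\mathrm{Helix}}(R_{[i]} \mid H_i)\,P_{\mathrm{Helix}}(H_i \mid H_{i-2}, H_{i-3}, H_{i-4}, H_{i-7})$; $P(R_{[k:j]} \mid S_{q-1}, S_q, \mathrm{Strand}) = \prod_{i=k}^{j} P_{\mathrm{Strand}}(R_{[i]} \mid H_i)\,P_{\mathrm{Strand}}(H_i \mid H_{i-2}, H_{i-3})$; $P(R_{[k:j]} \mid S_{q-1}, S_q, \mathrm{Loop}) = \prod_{i=k}^{j} P_{\mathrm{Loop}}(R_{[i]} \mid H_i)\,P_{\mathrm{Loop}}(H_i \mid H_{i-2})$. [Figure: helix model with nodes H1 ... H6 and R1 ... R6.]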

Segment length distributions. [Figure: frequency (0 to 0.2) against segment length, for helix and strand.]

Probabilistic model: P(Sequence | Structure) P(Structure). Structure- and position-specific frequencies. Segment length priors [figure: frequency (0 to 0.2) against segment length, for helix and strand]. Conditional independence of inter-segment residues. Markovian dependence in segment types (e.g. L E L H L H L).

Benefits. Explicit probabilistic model: sound semantics; coherent treatment of uncertainty/noisy data; explicit, testable assumptions. Fully Bayesian prediction: averaged over all possible models; accurate estimates of uncertainty.

Example prediction: Cytochrome C5 (1cc5). [Figure: true vs. predicted secondary structure.]

Prediction confidence. [Figure: three panels plotted against prediction threshold (0.4 to 0.9), with series %H, %E, %L.]

Evaluation. 453 structures selected from the Brookhaven Protein Data Bank: < 2.5 Å resolution; < 25% sequence similarity. Cross-validation results (single sequence), Bayesian segmentation algorithm: marginal mode prediction 68.8%; MAP segmentation prediction 64.2%. Previous best published: 68% (Frishman & Argos, 1996); 71% (Salamov & Solovyev, 1997).

Beyond secondary structure: predicting 3D contacts. Model captures local dependencies: structure-specific residue propensities; intra-segment side chain correlations. Tertiary interactions: β-sheets; coiled-coils; disulfide bonds.

β-sheet side chain correlations. Odds ratio: $\dfrac{P(A_i, A_j \mid \mathrm{Struct})}{P(A_i \mid \mathrm{Struct})\,P(A_j \mid \mathrm{Struct})}$. [Figure panels: charged-pair interaction in glutaredoxin; disulfide bonds; hydrophobic side chains.] Stabilizing pairs from Smith & Regan (Science, 1995).
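The odds ratio compares how often two residues are seen paired with how often they would pair if chosen independently. A minimal sketch, assuming the input is a list of residue pairs observed at neighboring positions across strands within one structure class; the function name and the toy pairs are hypothetical.

```python
from collections import Counter

def pair_odds_ratios(pairs):
    """P(a, b) / (P(a) P(b)) for each observed ordered pair (a, b)."""
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)    # marginal for the first position
    second = Counter(b for _, b in pairs)   # marginal for the second position
    return {(a, b): (c / n) / ((first[a] / n) * (second[b] / n))
            for (a, b), c in joint.items()}

# Hypothetical observations: K-E and V-V pairs are over-represented.
PAIRS = [("K", "E")] * 3 + [("K", "V")] + [("V", "V")] * 3 + [("V", "E")]
odds = pair_odds_ratios(PAIRS)
```

Ratios above 1 indicate pairs that co-occur more often than independence would predict, such as charged pairs or pairs of hydrophobic side chains.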

Incorporating non-local information. Segment interaction models: replace terms $P(\mathrm{Segment}_j)$ and $P(\mathrm{Segment}_k)$ with $P(\mathrm{Segment}_j, \mathrm{Segment}_k)$ ... but computation... [Figure: parallel β-sheet in 1nzy, with segment-type labels.]

Prediction of β-strand contact map for 5pti. [Figure: predicted contacts vs. true contacts.] Pairing and register of β-hairpin correctly predicted.

Previous approaches: Hubbard (1995). [Figure: predicted contacts (single sequence); predicted contacts (multiple sequence alignment).]

Flavodoxin (5nul). [Figure: predicted contacts vs. true contacts.] β-strands well predicted, but poor specificity in strand pairing.

Future work. Models for subclasses of segments: environment (amphipathic/buried/exposed); structure (3-10 helices, β, γ turns). Model selection. Multiple sequence alignment information.

Conclusions. Probabilistic modeling of protein structure. Prediction by segmentation of sequence. Independent segment models perform comparably to existing approaches. General framework for modeling non-local interactions to predict 3D contacts.